
Downtime & Hardware Failure

Created By Christian Deacon

Hey everyone,

Since March 6th, we've been experiencing occasional OS crashes on our web server. I looked at the server logs and couldn't find any cause for this (e.g. no output from journalctl before the crashes). I made a ticket with our hosting provider asking them to look at the node our web server was hosted on and make sure the hardware was okay via disk/RAM tests. They came back stating the resource usage looked fine on the node and that the crashes were most likely related to our server specifically (e.g. running out of available RAM). I personally didn't believe this was the case since we always had a lot of available RAM and the old web server we were on had over a year of uptime while running the same services. Either way, I knew there was a chance it could have been the services or our OS, so I enabled kdump and planned to inspect any crash dumps generated if we experienced another crash (we never got to this point since the next crash wasn't just a crash, as explained below). Unfortunately, the hosting provider also didn't run the disk/RAM tests I had asked for to be safe.

We experienced a total of 3 crashes from March 6th until last night, when, around 11:30 PM EST, our server went down again and this time wouldn't come back up. The KVM in our hosting provider's portal wasn't working, so we couldn't see the server's console, and the server's status was stuck on rebooting in the panel. This indicated the node was completely down. I made another ticket and tried calling the hosting provider, but I didn't receive any response for a while. They did eventually post an event about this on their status page around an hour and a half later.

Around 10 AM EST this morning, they notified us that their node had experienced a hardware failure and that they were restoring our services from a backup on another node. While I had suspected the node's hardware was bad because of the earlier crashes, I was hoping it would be something like bad RAM. However, given they needed to restore from backups, I believe it was due to a drive failure (they never confirmed this with me, though).

The entire server was restored from a backup taken on February 29th. Deaconn was also restored to February 29th, even though we do have daily backups of the database. Since there haven't been any new articles or users since February 29th, I figured it was fine to keep the February 29th state.

I am going to see if I can get more information from our hosting provider on this incident. I'll also continue monitoring the server to ensure it doesn't crash again.

I apologize for the inconvenience and thank you for understanding!




Hi! I am the founder and CEO of Deaconn. I specialize in software and network engineering. I also love system administration and I'm a huge fan of Linux! I contribute to a few open source projects as well!