Post mortem of yesterday’s outage
Yesterday the Cobot website was down for a bit over one hour. Downtime is very rare at Cobot, so this was very annoying for you, our customers, and also for us, as we strive for 100% uptime.
As we tweeted yesterday…
… the hard drive our main database runs on ran out of space. Contrary to the tweet, we did monitor that hard drive’s capacity, but unfortunately in the wrong way.
We use Scout to monitor our servers and applications. This includes CPU load, hard disk capacity, memory usage, response times, etc. For every hard drive we had set an alarm to email us should the drive become more than 80% full. At least we thought we had. By accident we had set the alarm to 80GB, a threshold that could never be reached on a 20GB drive, so the alarm never fired. This is why the full hard disk went undetected until our database started throwing errors yesterday evening.
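To illustrate the 80GB versus 80% mix-up described above, here is a minimal Python sketch of a disk alarm. It is not how Scout works internally. The function names and the `GB` constant are made up for this example, and only `shutil.disk_usage` comes from the Python standard library. The sketch compares usage as a percentage of capacity and adds a sanity check that flags an absolute threshold larger than the drive itself.

```python
import shutil

GB = 10**9  # illustrative unit; real tools may count in GiB (2**30) instead


def percent_used(total_bytes: int, used_bytes: int) -> float:
    """Return disk usage as a percentage of total capacity."""
    return 100.0 * used_bytes / total_bytes


def should_alert(total_bytes: int, used_bytes: int,
                 threshold_percent: float = 80.0) -> bool:
    """Fire when usage reaches the threshold, expressed as a percentage."""
    return percent_used(total_bytes, used_bytes) >= threshold_percent


def absolute_threshold_can_fire(total_bytes: int, threshold_bytes: int) -> bool:
    """Sanity check: an absolute threshold above the drive's capacity
    can never be reached, so the alarm would stay silent forever."""
    return threshold_bytes <= total_bytes


def check_path(path: str, threshold_percent: float = 80.0) -> bool:
    """Check a mounted path using the standard library's disk statistics."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free
    return should_alert(usage.total, usage.used, threshold_percent)
```

With these definitions, `absolute_threshold_can_fire(20 * GB, 80 * GB)` returns `False`, which is the kind of check that might have caught our misconfiguration, while `check_path("/")` reports whether the root volume has passed 80%.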
After wasting some time trying to free up space by compacting data, we eventually added a new drive (actually a virtual EBS volume), shut down the site …
… and copied over the data. After that we spun up our database and the site was back up.
Conclusion
As usual, it was human error. Had we thoroughly checked our alarm settings, we would have been notified of the disk problems and could have solved them without any downtime. We have of course corrected the settings now, so this should not happen again. We are sincerely sorry for what happened and have learned our lesson. Off to the next months with 100% uptime, as it should be.