Today between midnight and 11am UTC, cobot experienced a major outage of all services. We are awfully sorry that this happened and are working at this very moment on several improvements to prevent future incidents.
Around midnight UTC on Jan 31 2012, our database crashed and was not automatically restarted as it should have been. We use CouchDB, which is built on Erlang/OTP, a platform whose supervision mechanism monitors processes and restarts them when they crash. After this mechanism failed, the cobot main application could no longer connect to the database, which resulted in most pages generating an error.
Next, our monitoring system (we use pingdom.com) sent an email notifying me of the errors. The only problem was that at that very moment I was boarding a transatlantic flight, which left me offline for 12 hours. As it was the middle of the night, the others on the team were already asleep and did not notice there was any problem. In the morning Thilo was first in the office and restarted the database, which brought the site back.
What we are doing to fix it
Immediately after coming to the office I bumped up our Pingdom account. Should there ever be another downtime, Pingdom will now send an email, SMS, Twitter direct message and push notification to both Thilo and me.
In addition, I am setting up another monitoring layer around the database using a tool called monit. Should the database crash or become unresponsive again, this tool will automatically kill and restart it, dramatically reducing the likelihood that we will have to intervene by hand.
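To illustrate the idea behind this extra layer, here is a minimal Python sketch of a check-and-restart loop. It is not monit and not our actual configuration: the names is_responsive and watch, the URL (CouchDB's default port 5984 on localhost), the failure threshold, and the polling interval are all assumptions made for the example.

```python
import http.client
import time
import urllib.request
from typing import Callable, Optional


def is_responsive(url: str = "http://127.0.0.1:5984/", timeout: float = 5.0) -> bool:
    """Return True if an HTTP GET to the database answers with status 200.

    The URL is an assumption: CouchDB listens on port 5984 by default.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeout, HTTP error or malformed reply.
        return False


def watch(
    check: Callable[[], bool],
    restart: Callable[[], None],
    max_failures: int = 3,
    interval: float = 30.0,
    cycles: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll check(); call restart() after max_failures consecutive failures.

    Runs forever when cycles is None, otherwise for that many polls.
    Returns the number of restarts that were triggered.
    """
    failures = 0
    restarts = 0
    polls = 0
    while cycles is None or polls < cycles:
        polls += 1
        if check():
            failures = 0  # a healthy answer resets the failure streak
        else:
            failures += 1
            if failures >= max_failures:
                restart()
                restarts += 1
                failures = 0  # give the restarted database a fresh start
        sleep(interval)
    return restarts


# Hypothetical usage; the restart command depends on how the database is installed:
#   import subprocess
#   watch(is_responsive, lambda: subprocess.run(["service", "couchdb", "restart"]))
```

Requiring several consecutive failures before restarting avoids killing the database over a single slow response, at the cost of a slightly longer reaction time.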
Again, I am very sorry this happened. Software and servers can and will always fail, but as a service provider it is our responsibility to make sure that our customers aren’t affected. We messed up. We can’t make it up, but we have learned our lessons, and will do everything we can to prevent anything like this from happening again.