Post Mortem: Payment Processing Incident
Between April 24 and 26, Cobot processed some payments twice. As a result, the same invoice was emailed twice for every affected payment. For coworkers with credit card processing enabled, this sometimes resulted in ‘duplicate transaction’ error emails; in other cases cards were charged twice.
We have notified all coworking spaces whose members are affected and provided instructions for refunding the payments through their Authorize.net account. In addition we waived the spaces’ fees for the current month.
We are deeply sorry that this happened and are working on measures to prevent it from happening again.
What happened
On April 24 I set up a new server in order to upgrade some software. This is a standard procedure at Cobot. We fire up a new server, move the data over, then kill the old one. Except this time I forgot to stop the old server.
All our payments are processed every day around midnight. The following night both the old and the new server processed the payments for the day. Authorize.net has a duplicate transaction detection system. This caught some of the transactions and resulted in the aforementioned ‘duplicate transaction’ errors which also landed in my error inbox.
First mistake: I ignored the errors. We get a constant stream of payment errors, most of them either because a credit card is invalid or there is a temporary problem at the payment provider’s gateway. Cobot automatically deals with these by sending out error emails and reprocessing failed transactions the following night.
Only when the same errors kept coming up the next day did I start looking into the problem, and I quickly found out we had one server too many running. I immediately stopped it, but it took me another day to send out the notification to the spaces.
How we will fix it
There are three parts to the problem:
- the human error of forgetting to shut down the server
- the system not detecting the duplicate transactions
- my failure to recognize and communicate the problem proactively
Human Error
There is no way to completely rule out human error. I’m pretty sure I’ll make my next mistake sooner rather than later, but we can still make some improvements. I have started to build checklists for critical operations such as moving a server. Using these should help avoid a lot of mistakes. At least they have helped others.
Duplicate transaction detection
The duplicate transaction detector at Authorize.net compares the amount and other details of each transaction to past transactions within a time window. If it finds a duplicate, the transaction is not processed and the service returns an error instead. The problem is that the default time window is only 2 minutes. As the two servers processed the payments at roughly the same time, it caught some duplicates but not all of them. We will increase that window to 12 hours, which should allow it to catch all duplicates.
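To illustrate why the window size matters, here is a minimal sketch in Python of window-based duplicate detection. It is not Authorize.net's implementation: the `Transaction` fields, the `DuplicateDetector` class, and the example timings are all made up for illustration, and the real service may compare different details.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Transaction:
    # Hypothetical fields; the real detector may compare other details.
    card_id: str
    invoice_id: str
    amount_cents: int
    submitted_at: datetime


class DuplicateDetector:
    """Rejects a transaction if an identical one was accepted within `window`."""

    def __init__(self, window: timedelta):
        self.window = window
        self._last_accepted = {}  # (card, invoice, amount) -> time of last accepted charge

    def accept(self, tx: Transaction) -> bool:
        key = (tx.card_id, tx.invoice_id, tx.amount_cents)
        last = self._last_accepted.get(key)
        if last is not None and tx.submitted_at - last <= self.window:
            return False  # would surface as a 'duplicate transaction' error
        self._last_accepted[key] = tx.submitted_at
        return True


def second_charge_goes_through(window: timedelta, gap: timedelta) -> bool:
    """Simulate two servers submitting the same payment `gap` apart."""
    detector = DuplicateDetector(window)
    midnight = datetime(2000, 1, 1, 0, 0)  # arbitrary example date
    first = Transaction("card-1", "invoice-1", 5000, midnight)
    second = Transaction("card-1", "invoice-1", 5000, midnight + gap)
    detector.accept(first)
    return detector.accept(second)
```

In this sketch, `second_charge_goes_through(timedelta(minutes=2), timedelta(minutes=5))` returns `True`: the second server's charge arrives after the window has closed, so the card is charged twice. With `timedelta(hours=12)` as the window, the same call returns `False` and the duplicate is rejected.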
Communication
Two nights of too little sleep and a public holiday spent putting out fires taught me that acting on and communicating problems proactively is a thousand times easier than dealing with them a day later. The next time I see something suspicious rest assured I’ll react a lot faster.
In addition we have set up an emergency email address that goes straight to our phones. Logged in users can find it on our support page. Customers with emergency problems can reach us there much faster than through the other channels.
Again I am very sorry this happened. Again we have learned our lessons and are taking measures to prevent any future mistakes. We will not always succeed but we’ll do everything we can.
Alex