Coinigy Outage: Post-Mortem

We’d like to give some insight into yesterday’s outage: its implications, what went wrong, and what systems we’re putting in place to mitigate this type of failure in the future.

Scheduled maintenance

Firstly, we apologize for the “scheduled maintenance” message.

[Image: the “scheduled maintenance” page. Not intended :(]

The hard decision was made (more on this below) to enter maintenance mode at 3:03 PM EST (8:03 PM UTC). The maintenance page said “scheduled maintenance” when the maintenance was clearly not scheduled in advance. This was not intentional, and we’ve made the necessary changes.

A Day In Photos

[Chart: mean CPU usage across Coinigy’s infrastructure, in UTC. I’m givin’ her all she’s got, captain.]

Around 1:00 PM CST (7:00 PM UTC), we began getting reports from users that various exchange sites were not working or were not responsive. We also began receiving intermittent reports of failed API requests through the Coinigy interface. Over the next hour, users began logging onto our platform at a rate of about 100 per minute. Signup rates also increased dramatically, and heuristics indicated that many of these accounts were potential duplicate trial accounts created by users desperate to monitor price action somewhere.
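For the curious, a duplicate-trial heuristic of this kind can be as simple as grouping recent trial signups by a shared attribute such as source IP. The sketch below is purely illustrative; the field names and threshold are assumptions made for the example, not our actual signup schema or detection logic.

```python
from collections import defaultdict
from datetime import datetime, timedelta

def flag_probable_duplicate_trials(signups, window_minutes=60, threshold=3):
    """Flag source IPs that created `threshold` or more trial accounts
    within the last `window_minutes`. Field names are illustrative."""
    cutoff = datetime.utcnow() - timedelta(minutes=window_minutes)
    by_ip = defaultdict(list)
    for s in signups:
        if s["plan"] == "trial" and s["created_at"] >= cutoff:
            by_ip[s["ip"]].append(s)
    return {ip: accounts for ip, accounts in by_ip.items() if len(accounts) >= threshold}

# Example with made-up data: one IP opening three trials within 20 minutes.
now = datetime.utcnow()
sample = [
    {"ip": "203.0.113.7", "plan": "trial", "created_at": now - timedelta(minutes=5)},
    {"ip": "203.0.113.7", "plan": "trial", "created_at": now - timedelta(minutes=9)},
    {"ip": "203.0.113.7", "plan": "trial", "created_at": now - timedelta(minutes=20)},
    {"ip": "198.51.100.2", "plan": "trial", "created_at": now - timedelta(minutes=3)},
]
print(flag_probable_duplicate_trials(sample).keys())  # dict_keys(['203.0.113.7'])
```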

[Chart: Nginx requests per second over the past 2 days; the spike represents a cascading failure.]

This is where things began to fail. As our auto-scaling solution added web servers, our primary database began locking up. After some troubleshooting, we determined that the unprecedented login rate was exposing authentication-related code that was not scaling well. This was locking up our primary database server, which in turn impacted data collection, so we made the hard choice to put the site into maintenance mode.
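We won’t reproduce the offending code here, but the general failure mode is worth illustrating: an authentication path that performs a synchronous write against the primary database on every login. Each web server the autoscaler adds multiplies that write pressure on the one host that also handles data collection. The sketch below is a generic example of the anti-pattern, using an in-memory SQLite database as a stand-in; it is not our actual implementation.

```python
import sqlite3

# Stand-in for the primary database; in production this is the single
# write master that also ingests exchange data.
primary = sqlite3.connect(":memory:")
primary.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, pw_hash TEXT, last_login TEXT)"
)
primary.execute("INSERT INTO users (email, pw_hash) VALUES ('user@example.com', 'hash')")
primary.commit()

def authenticate(email, pw_hash):
    # The lookup itself is read-only and could be served elsewhere...
    row = primary.execute(
        "SELECT id FROM users WHERE email = ? AND pw_hash = ?", (email, pw_hash)
    ).fetchone()
    if row is None:
        return None
    # ...but a synchronous write on every login must hit the primary and
    # takes locks there. At ~100 logins per minute, fanned out across an
    # autoscaled web tier, writes like this contend with data-collection writes.
    primary.execute(
        "UPDATE users SET last_login = datetime('now') WHERE id = ?", (row[0],)
    )
    primary.commit()
    return row[0]

print(authenticate("user@example.com", "hash"))  # 1
```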

[Chart: some y-axis values for good measure.]

We quickly migrated the authentication query to run on a database replica. We also implemented a solution to optionally allow only subscribers to access the site, because every time we tried to take it out of maintenance mode, a massive rush of logins followed soon after. We thought this would be a good stop-gap while we determined whether the authentication scaling issue was resolved. After these solutions were in place, we re-enabled the site and allowed subscribers to log in. After a couple of hours, with the authentication scaling issues resolved, we were able to open the site up to all users again.
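In rough outline, the two fixes looked like the sketch below: point the hot authentication read at a replica, and gate logins behind a subscribers-only flag while the fix is verified. The schema, names, and single-process SQLite stand-ins are assumptions made for the example, not our production code.

```python
import sqlite3

# Stand-ins for the primary and a read replica; in production these are
# separate database hosts and the flag lives in shared configuration.
primary = sqlite3.connect(":memory:")
primary.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, pw_hash TEXT, is_subscriber INTEGER)"
)
primary.execute(
    "INSERT INTO users (email, pw_hash, is_subscriber) VALUES ('user@example.com', 'hash', 1)"
)
primary.commit()
replica = primary  # a real replica would be a read-only copy of the primary

SUBSCRIBERS_ONLY = True  # stop-gap switch while auth scaling is confirmed fixed

def authenticate(email, pw_hash):
    # The hot lookup now reads from the replica, keeping the primary free
    # for data collection and other writes.
    row = replica.execute(
        "SELECT id, is_subscriber FROM users WHERE email = ? AND pw_hash = ?",
        (email, pw_hash),
    ).fetchone()
    if row is None:
        return None, "invalid credentials"
    user_id, is_subscriber = row
    if SUBSCRIBERS_ONLY and not is_subscriber:
        return None, "temporarily limited to subscribers"
    return user_id, "ok"

print(authenticate("user@example.com", "hash"))  # (1, 'ok')
```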

Future mitigation strategies and new procedures

The event yesterday exposed inefficiencies that unfortunately impacted subscribers. As a result, we have instituted a new “subscribers-only” mode that will automatically trigger during high-load events. We have also been compelled to re-evaluate our lenient policy on creating multiple free trial accounts.
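The automatic trigger is conceptually simple: watch a couple of load signals and flip the subscribers-only flag when they cross a threshold, with some hysteresis so it doesn’t flap. The metric names and thresholds below are invented for illustration and are not the actual values we use.

```python
# Illustrative thresholds only; real values are tuned to the infrastructure.
LOGIN_RATE_LIMIT = 100   # logins per minute
CPU_LIMIT = 0.85         # mean CPU utilization across the fleet

def should_enable_subscribers_only(logins_per_minute, mean_cpu, currently_on):
    """Turn subscribers-only mode on when load crosses the limits, and only
    turn it back off once load has dropped well below them (hysteresis)."""
    if logins_per_minute > LOGIN_RATE_LIMIT or mean_cpu > CPU_LIMIT:
        return True
    if currently_on and (
        logins_per_minute > 0.7 * LOGIN_RATE_LIMIT or mean_cpu > 0.7 * CPU_LIMIT
    ):
        return True
    return False

print(should_enable_subscribers_only(140, 0.60, currently_on=False))  # True
print(should_enable_subscribers_only(60, 0.50, currently_on=True))    # False
```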

DDOS much? Or just influx of new users due to exchanges being down or combination of both?

— Simon Dodd (@Simon_Doddy) November 29, 2017

From our engineering team and the founders, thank you for your continued understanding. Should you need any assistance with your account, please contact support; we’ll be happy to tend to your needs.