The Email Bomb Bug

The support team came running over. "I think we have a problem, check the link I sent you!" Oh boy, we were flooding a customer's inbox with several emails per second. They had posted a screenshot on Twitter. Yikes. We immediately paused the email queue and started to investigate.

A few months earlier, I had joined Coin as one of the first few software engineers. The startup was still fewer than 10 people and had just run an extremely successful preorder campaign, eventually totalling over 350,000 cards. The preorder form was minimal in order to maximize conversion, so we didn't have all of the information we needed to print and ship physical cards. Over the next few months, we were busy building the site to allow customers to complete registration (in addition to the APIs needed to support the mobile app features, such as identity verification and card ownership verification).

Manufacturing new hardware is usually a gradual process. As you iterate and work out the inevitable kinks, you slowly ramp up output. For many months, the plan was to email customers just before we were ready to engrave and ship. However, the hardware efforts were behind schedule, and a last-minute decision came down from the founder to email all customers at once, notifying them that their delivery would be delayed but that they could still complete registration. We argued for alternative plans, with solid reasoning, but were overruled. So we scrambled to get the software changes in place.

We were using Ruby on Rails at the time. The mass email script was fairly simple: select customers who hadn't received the email, queue the email in Sidekiq, send the email when it reaches the front of the queue, and update the customer record to note that the email had been sent. For 99.999% of customers, this worked great. What we didn't know was that 3 or 4 (0.001%) of the customer records had been manually edited in MySQL in the early days of the startup. As a result, those records were in what should have been an impossible state. A validation (completely unrelated to the email) was failing in the final save step, so those records were never marked as sent. Each pass selected them again, and we kept queueing emails to the same customers over and over. To make matters worse, the retry policy wasn't configured properly. Each failed send attempt was being repeated several times, amplifying the number of emails sent.
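
To make the failure mode concrete, here is a minimal plain-Ruby simulation of that loop. It is a sketch, not the code we actually ran: there is no Rails or Sidekiq involved, and `Customer`, `legacy_field_ok`, `mark_sent`, and `run_batch` are hypothetical names made up for illustration. It also assumes the failed save went unnoticed by the script, and it leaves out the retry amplification entirely.

```ruby
# Hypothetical stand-in for a customer row. legacy_field_ok models the
# unrelated validation that the hand-edited rows could not pass.
Customer = Struct.new(:email, :legacy_field_ok, :email_sent) do
  def valid?
    legacy_field_ok
  end

  # Behaves like a save that returns false and persists nothing
  # when validation fails.
  def mark_sent
    return false unless valid?
    self.email_sent = true
    true
  end
end

# One pass of the mass email script: select unsent, send, then mark sent.
def run_batch(customers, outbox)
  customers.reject(&:email_sent).each do |customer|
    outbox << customer.email # the email goes out first...
    customer.mark_sent       # ...and a failed save is silently ignored
  end
end

customers = [
  Customer.new("ok@example.com", true, false),
  Customer.new("edited@example.com", false, false) # hand-edited row
]
outbox = []

# Three passes of the script over the same data.
3.times { run_batch(customers, outbox) }

puts "ok@example.com received #{outbox.count('ok@example.com')} email(s)"
puts "edited@example.com received #{outbox.count('edited@example.com')} email(s)"
```

The valid customer gets one email no matter how many passes run. The customer with the broken row gets one per pass, because the "sent" flag never sticks.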

In the end, it didn't take long to find the few customers who were impacted and fix the rows that were causing the save validation to fail. We asked the support team to follow up, apologize profusely, and send the hardware for free (refunding the initial purchase).

There were many lessons to be learned and ways in which we could have prevented this bug or detected it much sooner. Save the customer record before queueing the email. Make sure new validations pass for all existing records, even when it seems impossible they wouldn't. Review configuration policies more closely. There was plenty to improve on the monitoring and alerting side as well. The list goes on. At a megacorp, something like this would be a travesty. At a startup, it's bad but in some ways an acceptable tradeoff for moving fast and making last-minute direction changes. If the product or service is successful, you'll eventually earn the time to correct earlier shortcuts (as long as they're non-fatal, which most are).
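
The first two lessons can be sketched in plain Ruby as well. Again, this is an illustration under assumptions rather than what we shipped: `Customer`, `legacy_field_ok`, `mark_sent`, `run_batch_safely`, and `invalid_records` are all hypothetical names, and a real Rails app would do the same thing against the database.

```ruby
# Hypothetical stand-in for a customer row, as in a typical model with
# a validation that some legacy rows cannot pass.
Customer = Struct.new(:email, :legacy_field_ok, :email_sent) do
  def valid?
    legacy_field_ok
  end

  # Returns false and changes nothing when validation fails.
  def mark_sent
    return false unless valid?
    self.email_sent = true
    true
  end
end

# Lesson 1: record the send first, and only queue the email if the
# record was actually saved. Rows that cannot be saved are collected
# for a human to look at instead of being emailed again and again.
def run_batch_safely(customers, outbox, skipped)
  customers.reject(&:email_sent).each do |customer|
    if customer.mark_sent
      outbox << customer.email
    else
      skipped << customer.email unless skipped.include?(customer.email)
    end
  end
end

# Lesson 2: before the run, check that every existing record passes
# the current validations, however impossible a failure seems.
def invalid_records(customers)
  customers.reject(&:valid?)
end

customers = [
  Customer.new("ok@example.com", true, false),
  Customer.new("edited@example.com", false, false) # hand-edited row
]
outbox = []
skipped = []

bad_rows = invalid_records(customers)
puts "Rows failing validation before the run: #{bad_rows.map(&:email).inspect}"

3.times { run_batch_safely(customers, outbox, skipped) }

puts "Emails sent: #{outbox.inspect}"
puts "Skipped for review: #{skipped.inspect}"
```

Marking the record before sending has its own cost: if the process dies between the save and the send, that customer gets no email at all. For a one-off announcement, a missed email that can be resent by hand seems like a better failure than an inbox flood.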

I was a bit hesitant to write this post. It's better to highlight your successes, but if you work in this industry for long enough, speed bumps are inevitable. In the end, the hardware never reached the reliability of a regular magnetic stripe card. The product and the company are no more. But I'm still proud of the many talented people I got to work with. The software and hardware we shipped in a short time with such a small team were impressive. And while there were hiccups like this email bug, most of the software worked flawlessly.

Hi, I'm Eddie Scholtz. These are my notes. You can reach me at eascholtz@gmail.com.