A myBalsamiq morning to forget: an apology
For the second time in the history of Balsamiq, I write you today to apologize for our mistakes.
This morning we started what was supposed to be a routine myBalsamiq update. We couldn't do a zero-downtime update because this one required a data migration in the database, so we announced 30 minutes of downtime. We thought it would really only take 10 minutes, but we said 30 just to be safe.
How wrong we were. myBalsamiq was in maintenance mode for about 3 hours today. Given that we would like to compete on reliability for myBalsamiq, this is clearly really, really bad.
A number of things went wrong during the downtime. It was a nightmare. We ran out of disk space in the database, a machine got rebooted while running the data migration, and even our own internet connection went down at some point. It was, simply, awful.
Some of these things were just bad luck, but we should have been prepared for most of the others. This was our fault, no two ways about it.
In the end we reverted to the old build, so the 3 hours of downtime were totally wasted on your side. We'll make sure they're not wasted on our side, though: we've learned a bunch of lessons and will take them to heart.
First of all, we're going to start doing updates on Saturdays instead of Tuesday mornings. I didn't want to do this because it means that a few of us will have to work during the weekend, both to do the update and to staff the support lines in case something goes wrong with the new version. As the CEO I hate to ask people to work weekends, but we all agree that your collective time is more important than our own. It's just the nature of the business we decided to get into, so we'll happily make the schedule change.
Other than that, we are improving our "things to check before a release" checklist with the lessons we've learned today, and we are going to change our database structure so that data migration won't take nearly as long (in case you're interested, we're going to move the bmml data from the database to S3).
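One item such a checklist could automate is confirming that there is enough free disk space before a migration starts. Here is a minimal sketch in Python. The function names, the size estimate, and the 2x headroom factor are all assumptions for illustration, not a description of the actual myBalsamiq tooling:

```python
import shutil


def has_enough_disk_space(free_bytes, estimated_migration_bytes, headroom=2.0):
    """Return True if free space covers the estimated migration size times headroom.

    `headroom` is a hypothetical safety factor; 2.0 means "require twice the estimate".
    """
    if estimated_migration_bytes < 0:
        raise ValueError("estimated_migration_bytes must be non-negative")
    if headroom < 1.0:
        raise ValueError("headroom must be at least 1.0")
    return free_bytes >= estimated_migration_bytes * headroom


def preflight_disk_check(path, estimated_migration_bytes, headroom=2.0):
    """Check the free space on the volume that contains `path`."""
    free_bytes = shutil.disk_usage(path).free
    return has_enough_disk_space(free_bytes, estimated_migration_bytes, headroom)
```

A release script could call `preflight_disk_check` on the database volume and refuse to enter maintenance mode if it returns `False`. The estimate itself would have to come from a trial run of the migration on a copy of the data.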
We also need to make sure our maintenance page embeds the @myBalsamiq Twitter feed, so that people can stay updated on our progress more easily. Plus I have ideas about automated backups emailed to you, desktop sync, Dropbox integration... all things that should reduce the impact of any downtime on you in case this happens again.
If you were affected by today's outage, please email firstname.lastname@example.org and we'll credit your myBalsamiq site for 3 months or extend your trial for 3 months. It's the least we can do, and we fully understand that it's not enough to regain your trust in our service.
We are committed to making myBalsamiq known for its uptime, but clearly we have a long way to go. We are learning, and I feel very sorry that our early adopters have to pay for our inexperience. 🙁
Alright, back to work for us. Again, I'm so sorry.