Recovering from an email outage

If I could do this week over I would.  Too bad I can’t.

Email today is vital.  Not having it makes your heart palpitate. 

Monday morning, during a swap of a failed hard drive (something we’ve done countless times), the storage array we use for email went offline.  The whole thing.  And for various reasons, the last known good backup was from a while ago.

I painfully remember thinking “oh shit” when I realized what this meant.

[This isn’t a post about all the things I should have done to make sure I was never in this spot.  Everything’s obvious now.]

I learned a couple things this week:

  1. Hire the absolute best people (and geezus, hire people smarter than you!). You never know when you’ll need them.  You never know who will have the answer to the problem.  Hire people who care about each other.  You never know when you’ll need them to look out for the one guy who, in 73 hours, forgot to sleep.  The same one guy who has to run point on The Next Big Step in 7 hours.
  2. Work somewhere where everyone realizes we’re all fighting the same fight. I’m surrounded by coders and when we needed coding, 1492 Python coders lined up to help.  Not a single one of them reports to me.
  3. Get upset, yell, demand results.  But realize when it’s the right time to yell and when it’s not.  During a firefight, I need you to be on the best fucking game of your entire life.  It is not the time to be berating you.  It’s the time to treat you like a hero, a magician.  It’s when I do what you tell me to do for you.
  4. Communicate the heck out of everything.  Throughout this outage we found other tools to use to let users know what was going on and what to expect.  I’d post updates even when the information I had was incomplete.  I’d say so.  I hated having folks in the dark.  
  5. Expect criticism.  Some of it will be searing.
  6. Realize that the people working under me on this are collectively smarter than I am.  Offer help whenever but let them work.  Take point at handling communication.  Make sure #5 doesn’t get to them. Remind yourself of #3.

It took nearly two days to get things back to an okay state, a state where we had new emails.  We’re still recovering data from backups and reconstructing state from a now-corrupt MySQL database.

I’ll probably never be able to express my gratitude to the team I manage for their efforts this week.  Sucks we got here, but without thinking twice, I’d go to battle with this team again.

We made mistakes that got us here, but we can talk about that later and make sure it doesn’t happen again.

  1. mzeier posted this