Last night, 108M+ people tuned in for Super Bowl XLVII. It looked to be a blowout, yet it ended in exciting fashion, perhaps due to one factor not related to football at all.
A power outage.
Early in the 3rd quarter, exactly half of the lights went out in the Superdome. A 34-minute delay ensued while a likely frantic army of people behind the scenes worked to get the lights back on. After the delay, the momentum of the game turned dramatically. Thankfully (in my opinion) the Ravens held on to win, or we’d be hearing nothing but “Power Gate” complaints for the next 6 months.
It got me thinking though – is there anything cloud consumers can learn from this power outage? They seem like unrelated events, but let me share a few brief thoughts.
- Outage pain is often more about time to recovery – would anyone have been upset if the lights had come back on 2 minutes later? Probably not. But just like with cloud outages, we’re dealing with highly complex, slow-to-recover systems. When something goes wrong in the cloud, don’t expect it to be fixed in minutes. By comparison, 34 minutes would be fantastic. Cloud customers should therefore prepare contingency, recovery, or triage plans that let them operate in the midst of a prolonged outage.
- Even highly resilient systems will fail – failure is inevitable. We all assume that something the size of the Superdome has multiple power paths, protections, circuits, breakers, and generators. Even with all that planning, something went wrong. To my knowledge, no cause has been determined with 100% certainty (more on that later). Similarly with cloud, no matter how many geographic zones or data centers you distribute your application across, there is always some event that can knock you down. We’ve seen outages due to control plane complexity, software bugs, and failures of resiliency-enabling components like load balancers. Customers should architect for resiliency, but also architect for many levels of failure if possible. Keep in mind that all of this comes at a cost too.
- Root cause analysis – as of yet, no one has taken responsibility for the Superdome power outage. Eventually, the truth will come out. We are not sure whether it will be an admitted mistake or a mistake uncovered by investigative journalism. But we will get the truth. Cloud providers improved dramatically in this regard during 2012. After an outage, customers are getting much better post mortems and root cause analysis documents from providers. Sometimes these take a few days or a week, but they come. If you, as a customer, are not getting a post mortem on a cloud issue, I’d encourage you to demand it or move to a different provider.
- Outages improve the market – whatever the cause of the power outage, you can bet it will be addressed or improved before any other major event at the Superdome. Furthermore, every major sporting arena will be tuning in to see what they can learn from the issue in New Orleans. Power designs in sporting stadiums will improve as a result. Similarly, after every major cloud outage, both the affected provider and its competitors learn and improve. Outages, while painful, are often beneficial.
- Employing the best staff is not foolproof – I’ve read that the Super Bowl and Superdome had some of the best technicians on hand both planning and running the event. Yet the power still went out. Cloud providers also tend to employ the best and brightest, but issues still happen. Humans, no matter how brilliant, are not perfect, nor can they prevent every issue.
- Outages are not only for nascent markets – I’ve heard many people blame cloud outages on the fact that many providers are young and services are immature/nascent. It’s a fair argument. But power distribution to major sporting events is a very mature market. And yet a problem still occurred. Similarly, as cloud providers mature, we should expect fewer outages, but they will not disappear. See prior bullets for justification.
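To put a rough number on the "even highly resilient systems will fail" bullet, here is a minimal back-of-the-envelope sketch in Python. It assumes each zone is up a fixed fraction of the time, that zones fail independently of one another, and that one shared component (a load balancer or control plane, say) sits in front of all of them. The function name and the availability figures are made up for illustration and do not come from any provider.

```python
def combined_availability(zone_availability: float, zones: int,
                          shared_availability: float = 1.0) -> float:
    """Availability of `zones` independent replicas behind one shared component.

    Simplifying assumption: the service is up when the shared component is up
    AND at least one zone is up. Real outages are often correlated, so treat
    the result as an optimistic upper bound.
    """
    if zones < 1:
        raise ValueError("need at least one zone")
    # Probability that every zone is down at the same time.
    all_zones_down = (1.0 - zone_availability) ** zones
    return shared_availability * (1.0 - all_zones_down)


MINUTES_PER_YEAR = 365 * 24 * 60

# Hypothetical figures: 99.9% per zone, 99.95% for the shared component.
for n in (1, 2, 3):
    a = combined_availability(0.999, n, shared_availability=0.9995)
    downtime = (1.0 - a) * MINUTES_PER_YEAR
    print(f"{n} zone(s): {a:.5%} available, ~{downtime:.0f} min/year down")
```

Under these assumptions, the second zone helps a lot and the third barely registers, because the shared component caps the total no matter how many zones you pay for. That is the cost trade-off in a nutshell: more zones buy fewer outages, not zero outages.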
These are just a few of the correlations I’ve come up with. What other connections do you see?