I was very pleased with the level of detail in Microsoft’s RCA. As we learned from the AWS EBS outage in 2011, an RCA or post mortem is one of the best windows into a cloud provider’s architecture, testing, recovery, and communication plans. Microsoft’s RCA was no exception.
I encourage all current and prospective Azure customers to read and digest the Azure RCA. It offers significant insight into how Azure is architected, much more so than customers have received in the past. It is also important for customers to gauge how a provider responds to an outage. We continually advise clients to pay close attention to how providers respond to issues, degradations, and outages of service.
I do not want to copy the RCA, but here are a few bullet points I’d like to highlight.
- It’s eerie how similar Azure’s leap day outage was to AWS’ EBS outage. Both involved software bugs and human errors. Both were cascading issues. Both went unnoticed longer than necessary. As a result, both companies have implemented safeguards in their services to catch errors like this sooner and to keep them from spreading.
- Microsoft deliberately suspended service management in order to stop or slow the spread of the issue, and it made this decision with very good reason. Customers would have appreciated knowing the rationale for the decision right away, and Microsoft has committed to improving real-time communication.
- The leap day bug itself was identified, and its fix tested and rolled out, within 12 hours. That is pretty fast. The other issues resulted from the unfortunate timing of a software upgrade under way when the bug hit, as well as a human error made while trying to resolve other software issues. Microsoft even admits, “in our eagerness to get the fix deployed, we had overlooked the fact that the update package we created….[was] incompatible.”
- Even though the human error affected only seven Azure clusters, those clusters happened to host the Access Control Service (ACS) and Service Bus, thereby taking those key services offline. In speaking with customers over the last two weeks, it became quite clear that without key services such as ACS and Service Bus, many other functions of Azure are unusable.
- Microsoft took steps to prevent the outage from worsening. Had these steps not been taken, we might have seen a much bigger issue.
- The issues with the Health Dashboard were a result of increased load and traffic. Microsoft will be addressing this problem.
- Microsoft understands that real-time communication must improve during an outage and is taking steps to do so.
- A 33% service credit is being applied to all customers of the affected services, regardless of whether they were affected. This 33% credit is quickly becoming a de facto standard for cloud outages. Customers appreciate the offer, as it spares both customers and providers the administrative overhead of filing and processing SLA claims.
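The RCA does not reproduce the faulty code, and the sketch below is my own illustration rather than Microsoft’s implementation, but leap day bugs of this kind typically come from naive date arithmetic: bumping the year on February 29 yields a date that does not exist. A minimal Python example of the pitfall and one safe alternative:

```python
from datetime import date

def naive_one_year_later(d: date) -> date:
    # Naively bump the year. This raises ValueError when d is
    # Feb 29, because Feb 29 does not exist in the next year.
    return d.replace(year=d.year + 1)

def safe_one_year_later(d: date) -> date:
    # Same year bump, but clamp Feb 29 to Feb 28 when the
    # target year is not a leap year.
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)

leap_day = date(2012, 2, 29)
print(safe_one_year_later(leap_day))   # 2013-02-28
try:
    naive_one_year_later(leap_day)
except ValueError as exc:
    print("naive version fails:", exc)
```

Whether a one-year validity period should clamp to February 28 or roll to March 1 is a policy choice; the real bug is failing to make that choice at all.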
As a final note, Microsoft stated many times in the RCA that it would be working to improve many different processes. I hope that, as time moves forward, Microsoft continues to use its blog to share specifics about those improvements and its progress toward those goals.
What did you think of the Azure RCA?