
Azure Root Cause Analysis: Transparency Wins

By Kyle Hilgendorf | March 12, 2012 | 3 Comments


Late Friday evening, Microsoft released their root cause analysis (RCA) for the Azure Leap Day Bug outage.  My last two blog posts chronicled what I heard from Azure customers regarding the outage.

I want to share that I was very pleased with the level of detail in Microsoft’s RCA.  As we learned with the AWS EBS outage in 2011, an RCA or post mortem is one of the best windows into a cloud provider’s architecture, testing, recovery, and communication plans.  Microsoft’s RCA was no exception.

I encourage all current and prospective Azure customers to read and digest the Azure RCA.  It offers significant insight into how Azure is architected, far more than customers have received in the past.  It is also important for customers to gauge how a provider responds to an outage.  We continuously advise clients to pay close attention to how providers respond to issues, degradations, and outages of service.

I do not want to copy the RCA, but here are a few bullet points I’d like to highlight.

  • It’s eerie how similar the leap day outage at Azure was to AWS’ EBS outage.  Both involved software bugs and human errors.  Both were cascading issues.  Both went unnoticed longer than necessary.  As a result, both companies have implemented traps in their services to catch errors like these sooner and to keep them from spreading.
  • Microsoft deliberately suspended service management in order to stop or slow the spread of the issue, and it made this decision for very good reason.  Customers would have appreciated knowing the rationale for the decision right away, and Microsoft is committing to improving real-time communication.
  • The actual leap day bug and its fix were identified, tested, and rolled out within 12 hours.  That is pretty fast.  (A sketch of this general class of date bug appears after this list.)  The other issues resulted from the unfortunate timing of a software upgrade that was in flight when the bug hit, as well as a human error made while trying to resolve other software issues.  Microsoft even admits, “in our eagerness to get the fix deployed, we had overlooked the fact that the update package we created….[was] incompatible.”
  • Even though the human error affected only seven Azure clusters, those clusters happened to host the Access Control Service (ACS) and Service Bus, thereby taking those key services offline.  As I spoke with customers over the last two weeks, it became quite clear that without key services such as ACS and Service Bus, many other functions of Azure are unusable.
  • Microsoft took steps to prevent the outage from worsening.  Had these steps not been taken, we might have seen a much bigger issue.
  • The issues with the Health Dashboard were a result of increased load and traffic.  Microsoft will be addressing this problem.
  • Microsoft understands that real-time communication must improve during an outage and is taking steps to do so.
  • A 33% service credit is being applied to all customers of the affected services, regardless of whether they were impacted.  This 33% credit is quickly becoming a de facto standard for cloud outages.  Customers appreciate the offer because it spares both customers and providers from dealing with SLA claims and the administrative overhead involved.
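
For readers curious about what the leap day bug actually looked like, the RCA explains that certificate validity dates were computed by simply incrementing the year, which on 29 February produces a date that does not exist.  The snippet below is a minimal sketch of that general class of bug, written in illustrative Python rather than anything taken from Azure’s code; the function name is purely hypothetical.

    from datetime import date

    def naive_one_year_validity(issued: date) -> date:
        """Compute a 'valid to' date by naively bumping the year."""
        # On 29 February this yields a calendar date that does not exist
        # (e.g. 2013-02-29), so Python raises ValueError.
        return issued.replace(year=issued.year + 1)

    try:
        naive_one_year_validity(date(2012, 2, 29))
    except ValueError as err:
        # Certificate creation would fail at this point, and anything
        # depending on that certificate fails with it.
        print("leap day bug:", err)

Robust date handling generally adds a fixed duration or clamps to the last valid day of the month rather than editing the year field directly.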

As a final note, Microsoft stated many times in the RCA that it would be working to improve many different processes.  I hope that as time moves forward, Microsoft continues to use its blog to share specifics about those improvements and the progress made toward those goals.

What did you think of the Azure RCA?



3 Comments

  • Aidan says:

    I have heard a lot of positive comments from Azure users and I am impressed with Microsoft’s cloud strategies.

    However, it is interesting to read the term ‘service bus’ in your summary of the key points. Service bus is such an “enterprise platform” term, indicative of tightly coupled processes with centralized resources (i.e. ACS), which can become single points of failure (as reported above).

    More progress required, to get to the loosely coupled distributed autonomy of natural systems, which is the hallmark of our evolutionary success.

  • Kyle Hilgendorf says:

    Aidan,
    Thanks for your comment. I was referring to the Azure Service Bus service: https://www.windowsazure.com/en-us/home/features/service-bus/

    It is an integral offering of the overall Azure platform, but I agree, “service bus” is a broad concept and overused.

  • Rob Addy says:

    Hi Kyle

    Thank you sooo much for providing the perfect case study for my own recent post on major incident response planning and communications – https://blogs.gartner.com/rob-addy/2012/02/10/communication-creates-calm-confounds-critics-and-cements-customer-confidence/

    To me it looks like MS have managed to turn the Azure outage from being a PR disaster into something that will help create confidence and demonstrate their commitment to continuous improvement of their services… I agree with you that they have done a good job on transparency 🙂

    Best regards

    Rob