On February 29, 2012 (leap day), Microsoft Windows Azure experienced a significant cloud service outage. Microsoft announced the outage and its resolution on its public blog (http://blogs.msdn.com/b/windowsazure/archive/2012/03/01/windows-azure-service-disruption-update.aspx and http://blogs.msdn.com/b/windowsazure/archive/2012/03/01/window-azure-service-disruption-resolved.aspx). After the outage, I interviewed Azure customers, who told me that the outage was very impactful. During the outage last week, I summarized on this blog some high-level points that customers had quickly sent me. However, now that the dust has settled and I’ve had an opportunity to personally interview more Azure customers, I want to provide deeper insight.
Every customer I spoke to agreed to do so under strict confidence. This is always of primary importance to Gartner. I am very thankful to be in the unique position where I get direct and specific details from customers, and I will always respect their confidentiality. Therefore, I have anonymized all the details. Readers can be certain, however, that the points below came directly from real customers using Windows Azure services. I will deliberately not repeat the insights from my previous blog post, but they still apply. While these insights are specific to Microsoft Windows Azure, they can be applied to any cloud service, and I encourage customers and providers alike to consider the learnings. Let’s look at the new insights.
- Communication from Microsoft should have been better – Every customer I spoke to mentioned this. Even 2-3 days after the outage, some customers had not received any formal communication. One customer informed me that they received some personal emails from support friends at Microsoft but nothing official. Those customers that did get formal communication received a very brief synopsis email stating that the issue started at 5:45pm PST on February 28th and that customers may have experienced issues with Access Control 2.0, Marketplace, Service Bus, and Access Control & Caching Portal. This list of services differs from the one posted in the public blog by VP Bill Laing; however, the services listed in the email align more closely with what the Azure Health Dashboard displayed at various points during the outage. The history of the Azure Health Dashboard also shows service interruptions or degradations with the following services on February 29 and/or March 1: SQL Azure Data Sync, Management Portal, Compute, and Service Management. Customers received conflicting information depending on which communication they reviewed.
- Customers are frustrated with the lack of transparency by Microsoft – The Azure blog announced that services were restored for most customers by 2:57am PST on the 29th. Yet every customer I spoke to informed me that they experienced widespread issues until late on the 29th. Multiple customers specifically told me that they did not see services come back online until approximately 8pm PST on the 29th, which essentially cost them the entire business day, especially those on the US east coast or in Europe. One customer told me, “we live in a transparent culture. Services go down, but the best practice is ultimate transparency.” Customer sentiment was that Microsoft was not honest immediately during the outage and continued to post conflicting information regarding the outage and its breadth.
- The service outage was far more impactful than advertised – Customers informed me of outages or issues with all of the following services: Access Control 2.0, Service Bus, SQL Azure Data Sync, SQL Azure Database, SQL Azure Management Portal, Windows Azure Compute, and Windows Azure Marketplace. Furthermore, many of these services were having issues in multiple Azure regions. Even though these services were offline at different times, more than one customer informed me that the integrated nature of Azure services means that even if only one service is offline, it can keep the other services from working properly. For example, when SQL Azure and Azure Compute were online, Azure Data Sync was not. When a customer relies on Data Sync to connect SQL and Compute services, all three services end up being unavailable. When Access Control was offline, users could not authenticate, rendering the backend application unusable. Furthermore, when Service Management capabilities are offline, customers cannot execute the administrative tasks that would help them redeploy in other regions or implement business continuity plans. The key learning here is that even if a single component of a cloud service is offline, the impact for an individual customer can reach far into the other services in the cloud.
- Customers are not leaving Azure, but they are brainstorming options – A consistent theme among Azure customers is that this outage by itself is not driving current business away from Azure. In fact, most customers have been pleased with the service over the past months and years. Most customers are willing to give Microsoft a “black swan” pass on the actual technical issue, but hope it causes Microsoft to improve upon the first few bullet points. With that said, some customers are considering options to protect themselves further from a future Azure outage. As mentioned in the previous bullet point, Azure by itself was not able to offer the resiliency and availability customers needed to ride out this outage. Because this was a wide-reaching software bug, most regions and services were affected at some point. Customers concluded after the outage that the only true protection against such a widespread software bug is to build a multi-provider or hybrid operating strategy. Therefore, customers are looking at possibilities to maintain some services locally on-premises or to enlist a secondary provider. The challenge with the latter is that very few legitimate .NET and SQL Server as-a-service alternatives exist. Microsoft may be contributing to this problem by building up Azure to such a large offering and cannibalizing its own channel of partners. One customer informed me that they would love to see Microsoft resell Azure to other providers. Other customers that are looking at an on-premises deployment are weighing the costs and risks of doing so against the business lost in a single business day. Building such an architecture can be quite costly.
- Customers were surprised at the lack of “press” – This is an interesting insight. More than one customer informed me that they were surprised by how little information was published regarding the Azure outage and how few customers were publicly complaining about it. Compared with the cloud outages of 2011, customers said that the news and Twitter traffic was much lower. One customer informed me that they were actually wondering whether this was an indication of how few customers are in production with Azure and whether they are one of the few in that situation. That did not make them feel very good. However, as I learned from other customers later, many customers deliberately refrained from commenting publicly or in venues such as Twitter because they did not want to publicize that they were having an outage as a result of the Azure outage. As an analyst, I have to wonder whether admitting use of public cloud services is a good PR move or a bad PR move.
- Customers are not bothering with SLA claims – Most customers, when asked about submitting an SLA claim, responded that they were not going to waste their time. To begin, many of the customers complained about the Azure standard SLA, concluding that it is open to interpretation and highly beneficial to Microsoft. One customer even informed me that Microsoft told them this outage did not violate the Azure SLAs. Regardless of whether the outage violated the SLA, customers commonly shared that submitting an SLA claim is not worth the time and effort. After all, these businesses lost nearly a day of service and are focusing their time and effort on making sure services are restored, working, and more resilient for the future. Customers did express that they would welcome Microsoft proactively offering them a credit for this outage as a sign of good will and to lessen any need to go through the hassle of submitting an SLA claim. AWS did this in April of 2011 for all customers and it was a popular move. One customer did tell me that Microsoft extended a compensation offer to them after the outage.
- Customers need better health status of Azure – As I mentioned in my blog last week, cloud providers need to host their health dashboard outside of their own service and be prepared for large amounts of traffic to the dashboard in the event of an outage. The Azure Health Dashboard was frequently unavailable during the outage, making it hard for customers to understand what was going on. Current health status is very important, especially for those customers that desperately want to leverage other regions or services to bring capabilities back online. Customers are therefore urging Microsoft to take this advice, and some customers are looking at third-party options that can monitor Azure health from the outside.
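The point above about integrated services (one offline component making others unusable) can be illustrated with a small sketch. This is only an illustration under stated assumptions: the component names and the dependency map below are hypothetical and describe an imagined customer application, not Azure's actual internal architecture. The function simply propagates an outage through whatever dependencies a customer has declared.

```python
from typing import Dict, Set

# Hypothetical dependency map for one customer's application.
# Each key depends on every component in its set. These names and
# relationships are illustrative assumptions, not Azure's real design.
DEPENDS_ON: Dict[str, Set[str]] = {
    "web_app": {"access_control", "compute", "data_sync"},
    "data_sync": {"sql_database", "compute"},
    "access_control": set(),
    "compute": set(),
    "sql_database": set(),
}


def effectively_unavailable(
    offline: Set[str], depends_on: Dict[str, Set[str]]
) -> Set[str]:
    """Return every component that is offline or that depends,
    directly or transitively, on an offline component."""
    down = set(offline)
    changed = True
    # Repeat until no new component is marked as down.
    while changed:
        changed = False
        for component, deps in depends_on.items():
            if component not in down and deps & down:
                down.add(component)
                changed = True
    return down


if __name__ == "__main__":
    # Only Data Sync is offline, yet the application is unusable too.
    print(sorted(effectively_unavailable({"data_sync"}, DEPENDS_ON)))
```

Running the same function against each service a provider's dashboard reports as degraded gives a quick view of which customer-facing capabilities are really at risk, which is the kind of analysis customers described doing by hand during the outage.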
We are near the 10-day commitment by Microsoft to deliver the Root Cause Analysis. Customers should pay close attention to the root cause analysis because such documentation often provides insights into not only the architecture of the cloud service, but also the commitment by the cloud service provider to customers. I hope the analysis will be transparent about what happened, what Microsoft has learned from it, how it will be prevented in the future, and what help Microsoft is offering to Azure customers to avoid impacts in the future.
Cloud outages are sad and unfortunate events. However, if we learn from them, build better services, increase transparency, and guide customers toward better application design, then we can make something great out of something bad.