Gartner Blog Network


Another Cloud Outage: Insight and Reactions

by Kyle Hilgendorf  |  February 29, 2012  |  9 Comments

Today, Microsoft Windows Azure had an advertised outage.  As of this writing, the outage is still in recovery mode.  I spent the morning talking to a handful of Azure customers via phone, email, and Twitter.  Here are some observations that are becoming quite evident, along with important lessons for cloud customers and cloud providers:

  1. Cloud providers continue to track cloud outages/issues based only on availability whereas it must also include performance and response metrics
  2. Service dashboards continue to rely on the underlying cloud service being online
  3. Customers can never get enough information during the outage from the provider
  4. We all know outages are a fact of life, but in the midst of one, pain is real
  5. Customer application design needs to continue to evolve

Let me dive into each of these points with my own commentary.

  1. Cloud providers continue to track cloud outages/issues based only on availability whereas it must also include performance and response metrics: Azure’s health dashboard and communications originally stated that only 3.8% of customers were affected by this outage.  There was no context around where the 3.8% came from or how it was measured, but I spoke to several customers this morning who suspect they were not included in the 3.8%.  Just recently, the percentages were increased on the dashboard.  Based upon region, the latest affected customer percentages are 6.7%, 37%, and 28% (and may still change).  Some customers informed me that various Azure roles (web, worker, VM) are up and online for them but that service performance is degraded to the point of being unusable.  Because most provider SLAs are based upon uptime and availability, and not performance or response, these customers may not be counted as affected.  You can follow some of my interactions via Twitter (@kylehilgendorf) from this morning to see a couple of examples.  Providers MUST start including performance and response SLAs in their standard service.  A degraded service is often as impactful as a down service. A great quote came in on Twitter this morning via @qthrul, “…a falling tower is ‘up’ until it is ‘down’.”  A falling tower is not very useful for most customers.
  2. Service dashboards continue to rely on the underlying cloud service being online: The Azure Service Dashboard (http://www.windowsazure.com/en-us/support/service-dashboard/) has been experiencing very intermittent availability.  Throughout this morning, I have had about a 25%-30% success rate of getting the dashboard to load.  I’ve been informing providers frequently that service health systems and dashboards must be hosted independently from the provider’s cloud service.  If the cloud service is down or degraded, customers had better be able to see the status at all times.  I recently finished a lengthy document on evaluation criteria for public IaaS providers that will publish in the near future, and one of those criteria specifically states this as a requirement.  If the service dashboard is the primary vessel by which cloud providers communicate outage updates, it must be up while the service is down.
  3. Customers can never get enough information during the outage from the provider: Looking back to 2011 and the AWS and Microsoft outages, it became very clear that frequent status updates are paramount during an outage.  AWS led the way with 30-45 min outage updates through their painful EBS outage and Ireland issues.  While updates don’t solve the problem, they do demonstrate customer advocacy and concern.  Some customers told me this morning they feel completely in the dark. There is no reason why a cloud provider should not have a dedicated communication team providing updates at least every 30 min throughout the entire outage.  Microsoft seems to be in a good cadence late this morning on more frequent updates, but there were large gaps in updates when the outage first occurred.  More important, in my opinion, is a thorough post-mortem on the outage once the service has been restored.  This should come within 3-4 days of the outage and must be very open and honest about the root cause, the fix, and the take-aways for the future.  Providers please note, the world is very smart.  If a provider even tries to mask or hide any of the details, it will come back to reflect negatively.  Honesty wins.
  4. We all know outages are inevitabilities, but in the midst of one, pain is real: I’ve heard from some customers who were heavily impacted and, as a result, are very frustrated and disappointed.  When a cloud service has a good track record, we all admit that an outage will happen at some point.  Yet, in the middle of an outage, emotion gets involved.  Therefore, see point #5.
  5. Customer application design needs to continue to evolve: Similar to previous cloud outages, customer application design must continue to evolve to account for possible (some would say probable) cloud outages and issues.  No cloud service is identical to another, and each has its own unique design and configuration options.  Most cloud services have the concept of zones and regions from a geographical or hosting location standpoint.  In most cloud outages, not every zone or region is affected.  Therefore, the best-prepared applications will be those designed cross-zone and cross-region to avoid an outage or degradation in any one area.  However, this comes with extreme complexity and increased cost, many times 3x-10x the cost advertised by providers.  If you will be running a critical application at a cloud provider, expect an outage, design for resiliency, and be prepared to pay for it.  This may also mean that you have to hire or retain some very skilled cloud staff.
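
To make point #1 concrete, here is a minimal sketch of how an uptime-only measurement and a performance-aware measurement can diverge for the same set of health probes.  Everything in it is an assumption for illustration: the `Probe` record, the function names, the sample values, and the 2,000 ms threshold are hypothetical and do not come from Azure or from any provider's SLA.

```python
from dataclasses import dataclass


@dataclass
class Probe:
    """One synthetic health check against a service (hypothetical record)."""
    ok: bool            # did the request return a successful response at all?
    latency_ms: float   # response time observed by the probe


def availability(probes):
    """Uptime-only view: fraction of probes that got any successful response."""
    if not probes:
        raise ValueError("need at least one probe")
    return sum(p.ok for p in probes) / len(probes)


def effective_availability(probes, max_latency_ms):
    """Performance-aware view: a probe counts only if it succeeded
    AND responded within the latency threshold."""
    if not probes:
        raise ValueError("need at least one probe")
    usable = sum(p.ok and p.latency_ms <= max_latency_ms for p in probes)
    return usable / len(probes)


# Illustrative data only: three probes "succeed", but two are far too slow.
probes = [
    Probe(ok=True, latency_ms=120.0),
    Probe(ok=True, latency_ms=9000.0),
    Probe(ok=True, latency_ms=15000.0),
    Probe(ok=False, latency_ms=0.0),
]

uptime = availability(probes)                                      # 0.75
effective = effective_availability(probes, max_latency_ms=2000.0)  # 0.25
print(f"uptime-only: {uptime:.0%}, performance-aware: {effective:.0%}")
```

With these made-up numbers, an uptime-only metric reports 75% while the performance-aware metric reports 25%.  That gap is the "falling tower" effect: roles that answer requests but are too slow to use still count as "up" unless response time is part of the measurement.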

It is always a sad day as a cloud analyst to see these outages.  However, it seems that significant change in the industry, at both a provider and customer level, only tends to come after an emergency.

I’d love your comments here.  Let’s engage in a conversation.

Category: cloud  microsoft  outage  providers  

Tags: azure  cloud  microsoft  outage  

Kyle Hilgendorf
Research Vice President
5 years with Gartner
15 years in IT industry

Kyle Hilgendorf works as a Research VP and Chief of Research in Gartner for Technology Professionals (GTP). He covers public cloud computing and hybrid cloud computing. Areas of focus include cloud computing technology, providers, IaaS, SaaS, managed hosting, and colocation. He brings 10 years of enterprise IT operations and architecture experience.


Thoughts on Another Cloud Outage: Insight and Reactions


  1. vijayan says:

    what could have averted this scenario from a service management perspective?

  2. Kyle Hilgendorf says:

    Are you referring to what could have been done at the provider level or at the customer level?

  3. […] Another Cloud Outage: Insight and Reactions […]

  4. As with most public cloud systems, Azure also experienced its day in the dark. Having spent the last 3+ years in this ecosystem, I can say that most Azure users are in the long tail: startups, upstart efforts from large enterprises, etc. This outage went under the radar for the most part due to the long-tail nature of the user base.

    To your point #5, as with most apps deploying to public clouds, a good monitoring and auto-healing system is critical to ensure timely notification and recovery.
    We have outlined that in our blog here http://www.opstera.com/blog/

  5. […] A gartner blog notes that outages, whether fact or otherwise, are a pain. The change in the industry is definitely going to take some getting used to. Read the article, here. […]

  6. […] http://blogs.gartner.com/kyle-hilgendorf/2012/02/29/another-cloud-outage-insight-and-reactions/ Right on the nail.  So true.  Why are we rushing to the public cloud?  Get it right the first time! […]

  7. […] Kyle Hilgendorf works as a principal research analyst in Gartner's IT Professionals service. He covers cloud computing (external and hybrid), as well as application, desktop and server virtualization. Read Full Bio Coverage Areas: ← Another Cloud Outage: Insight and Reactions […]

  8. […] expressed to me that the outage was very impactful.  During the outage last week, I summarized on this blog some high level points about the outage that customers had quickly sent me.  However, now that the […]

  9. […] with simple money equations or IT (I’m sure that his title as Gartner’s “Principal Research Analyst” focusing on IT, especially cloud computing, is mostly honorific). Kyle also graces us with […]



Comments are closed

Comments or opinions expressed on this blog are those of the individual contributors only, and do not necessarily represent the views of Gartner, Inc. or its management. Readers may copy and redistribute blog postings on other blogs, or otherwise for private, non-commercial or journalistic purposes, with attribution to Gartner. This content may not be used for any other purposes in any other formats or media. The content on this blog is provided on an "as-is" basis. Gartner shall not be liable for any damages whatsoever arising out of the content or use of this blog.