Today, Microsoft Windows Azure had an advertised outage. As of writing this blog, the outage is still in recovery mode. I spent the morning talking to a handful of Azure customers via phone, email, and Twitter. Here are some observations that are becoming quite evident, along with important lessons for cloud customers and cloud providers:
- Cloud providers continue to track cloud outages/issues based only on availability, whereas they must also include performance and response metrics
- Service dashboards continue to rely on the underlying cloud service being online
- Customers can never get enough information during the outage from the provider
- We all know outages are a fact of life, but in the midst of one, pain is real
- Customer application design needs to continue to evolve
Let me dive into each of these points with my own commentary.
- Cloud providers continue to track cloud outages/issues based only on availability, whereas they must also include performance and response metrics: Azure’s health dashboard and communications originally stated that only 3.8% of customers were affected by this outage. There was no context around where the 3.8% came from or how it was measured, but I spoke to several customers this morning who suspect they were not included in the 3.8%. Just recently, the percentages were increased on the dashboard. By region, the latest affected customer percentages are 6.7%, 37%, and 28% (and may still change). I was informed by some customers that various Azure roles (web, worker, VM) are up and online for many of these customers, but that service performance is degraded to such a point of being unusable. Because most provider SLAs are based upon uptime and availability, and not performance or response, these customers may not be counted as affected. You can follow some of my interactions via Twitter (@kylehilgendorf) from this morning to see a couple of examples. Providers MUST start including performance and response SLAs in their standard service. A degraded service is often as impactful as a down service. A great quote came in on Twitter this morning via @qthrul: “…a falling tower is ‘up’ until it is ‘down’.” A falling tower is not very useful for most customers.
- Service dashboards continue to rely on the underlying cloud service being online: The Azure Service Dashboard (http://www.windowsazure.com/en-us/support/service-dashboard/) has been experiencing very intermittent availability. Throughout this morning, I have had about a 25%-30% success rate of getting the dashboard to load. I’ve been informing providers frequently that service health systems and dashboards must be hosted independently from the provider’s cloud service. If the cloud service is down or degraded, customers had better be able to see the status at all times. I recently finished a lengthy document on evaluation criteria for public IaaS providers that will be published in the near future, and one of those criteria specifically states this as a requirement. If the service dashboard is the primary vessel by which cloud providers communicate outage updates, it must be up while the service is down.
- Customers can never get enough information during the outage from the provider: Looking back to the 2011 AWS and Microsoft outages, it became very clear that frequent status updates are paramount during an outage. AWS led the way with 30-45 minute outage updates throughout their painful EBS outage and Ireland issues. While updates don’t solve the problem, they do demonstrate customer advocacy and concern. Some customers told me this morning they feel completely in the dark. There is no reason why a cloud provider should not have a dedicated communication team providing updates at least every 30 minutes throughout the entire outage. Microsoft seems to be in a good cadence late this morning with more frequent updates, but there were large gaps in updates when the outage first occurred. More important, in my opinion, is a thorough post-mortem on the outage once the service has been restored. This should come within 3-4 days of the outage and must be very open and honest about the root cause, the fix, and the takeaways for the future. Providers please note, the world is very smart. If a provider even tries to mask or hide any of the details, it will come back to reflect negatively on them. Honesty wins.
- We all know outages are a fact of life, but in the midst of one, pain is real: I’ve heard from some customers who were heavily impacted and, as a result, very frustrated and disappointed. When a cloud service has a good track record, we all admit that an outage will happen at some point. Yet, in the middle of an outage, emotion gets involved. Therefore, see point #5.
- Customer application design needs to continue to evolve: Similar to previous cloud outages, customer application design must continue to evolve to account for possible (some would say probable) cloud outages and issues. No cloud service is identical to another, and each has its own unique design and configuration options. Most cloud services have the concept of zones and regions from a geographical or hosting location standpoint. In most cloud outages, not every zone or region is affected. Therefore, the best-prepared applications will be those designed cross-zone and cross-region to avoid an outage or degradation in any one area. However, this comes with significant added complexity and cost, often 3x-10x the baseline cost advertised by providers. If you will be running a critical application at a cloud provider, expect an outage, design for resiliency, and be prepared to pay for it. This may also mean that you have to hire or retain some very skilled cloud staff.
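The cross-region design in the last point can be sketched in a few lines. This is a minimal illustration, not any provider's actual API: the region endpoints and the health-check function are hypothetical placeholders that a real deployment would replace with its own probes and failover policy.

```python
# Hypothetical region endpoints -- illustrative names only, not real URLs.
REGIONS = [
    "https://app-us-east.example.com",
    "https://app-eu-west.example.com",
    "https://app-asia.example.com",
]

def first_healthy(regions, is_healthy):
    """Return the first region whose health check passes, or None if all fail.

    `is_healthy` is injected as a callable so callers can plug in a real
    probe (HTTP check, latency test, etc.) without changing this logic.
    """
    for region in regions:
        if is_healthy(region):
            return region
    return None
```

The point of the injected check is that "healthy" should mean whatever your application needs it to mean, including response time, which circles back to point #1.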
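Point #1 above argues that "up" is not the same as "usable." A customer-side probe can capture that distinction by recording both availability and response time. This is a rough sketch under assumed values: the endpoint URL and the 2-second degradation threshold are illustrative, not drawn from any provider's SLA.

```python
import time
import urllib.request

# Illustrative threshold: responses slower than this count as degraded.
DEGRADED_MS = 2000

def classify(ok: bool, elapsed_ms: float) -> str:
    """Map one probe result to 'up', 'degraded', or 'down' --
    availability alone is not enough to call a service healthy."""
    if not ok:
        return "down"
    return "degraded" if elapsed_ms > DEGRADED_MS else "up"

def probe(url: str, timeout: float = 10.0) -> str:
    """Fetch the URL and classify the service by availability AND latency."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            ok = 200 <= resp.status < 300
    except OSError:
        return "down"
    return classify(ok, (time.monotonic() - start) * 1000)
```

A dashboard fed by a probe like this would have reported today's "up but unusable" roles as degraded rather than healthy.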
It is always a sad day as a cloud analyst to see these outages. However, it seems that significant change in the industry, at both a provider and customer level, only tends to come after an emergency.
I’d love your comments here. Let’s engage in a conversation.