by Kyle Hilgendorf | March 29, 2013 | 2 Comments
I’m happy to share today that our newest Gartner for Technical Professionals research is out: Devising a Cloud Exit Strategy: Proper Planning Prevents Poor Performance.
There is very little published in the industry about how to create a cloud exit strategy or plan. If you search hard, you’ll find some blogs or magazine articles about why a cloud exit strategy is important, but almost nothing in terms of how you put the strategy and plan in place.
Proper risk management includes four pillars: Accept, Avoid, Transfer and Mitigate. For the last pillar, mitigate, customers have many options available: encrypt data, distribute availability, scale horizontally, enlist cloud brokers, implement hybrid architectures, monitor, alert, etc. However, perhaps the ultimate risk mitigation is to have a cloud exit strategy. While we all have faith in the future of cloud computing, many events might occur that would warrant an exit from a cloud service, including:
- Provider’s services less reliable than advertised in SLAs, contracts or expressed expectations
- Soured relationship with provider
- Change in service levels
- Change of provider ownership
- Change of price
- Change of terms and conditions
- Expiration of enterprise agreement or contract
- Lack of support
- Data, security or privacy breach
- Provider inability to stay competitive with industry features
- Repeated or prolonged outages
- Lack of remuneration for services lost
- Change in your organization’s internal leadership, strategy or corporate direction
And so the remainder of this research focuses on how to put that cloud exit strategy and plan in place. This is where we as GTP take pride: delivering practical, how-to advice for our clients. It is also a focus for us at this year’s Gartner Catalyst conference in our Cloud Track. Our Cloud Track theme is “Cloud Services: Moving Past Partial Commitment”. We’ve exhausted the ‘what’ and the ‘why’ of cloud computing, and to help organizations move past occasional uses of cloud services, we will be delivering how-to advice for:
- Building cloud strategies
- Mitigating cloud risks
- Building and delivering cloud native applications
- Managing and operating cloud applications and assets
- Forging and managing cloud provider relationships
- Building private cloud architectures
- Leveraging Hybrid Cloud solutions
Are you using cloud services? Are you prepared with a Cloud Exit Strategy? Are you looking for more “how-to” advice? We hope to see you at Gartner Catalyst 2013!
Category: Cloud Evaluation Gartner Outage Providers Tags: Cloud, Exit, Strategy
by Kyle Hilgendorf | March 14, 2013 | Comments Off
When I talk about cloud characteristics with clients, elasticity and scalability often come up in conversation. However, far too often clients jump to conclusions that auto scaling capabilities are out of scope or too complex for what they actually need. Many times they may be right, but a recent industry announcement provides a perfect case study as to why you should always consider auto scaling designs.
Yesterday, Google announced that it would be shutting down Google Reader on July 1, 2013. It was a painful announcement for me as my research is very dependent upon a meticulously designed news consumption strategy. Google Reader and RSS feeds are vital to my personal strategy for keeping up with the industry.
A lot has been published on Google Reader alternatives. One of the more popular options mentioned is NewsBlur. However, judging from today’s tweets and news, NewsBlur has been having major trouble keeping up with the incredible demand on its service from defecting Google Reader users looking for alternatives. Furthermore, I suspect the service issues at NewsBlur today may lead some prospective customers to conclude NewsBlur is not for them and move on to other alternatives. And that is too bad. Below is a tweet from NewsBlur today.
Auto scaling application designs should not only rely on your internal forecast for growth and possible demand, but should also consider external forces that might drive additional demand your way. What if your top competitor or leader in the market announced tomorrow that they were shutting doors? What if that competitor had a catastrophic corporate or technical event? Are you ready and able to take on all their customers? If not, they might go to the next option – opportunity lost!
These scenarios do not apply to every application, but they serve as an interesting case study for your business being ready to respond to demand not only when you do something good, but also when your competitors do something bad. And the latter will likely come at the most unexpected times.
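One way to make this concrete is to size an auto scaling group not just for the internal forecast but also for a "competitor shuts down tomorrow" surge. The sketch below is only an illustration: the function name, the surge multiplier, the per-instance capacity, and the instance limits are all assumed values, not figures from NewsBlur or any provider.

```python
import math


def desired_instances(requests_per_sec, capacity_per_instance,
                      surge_multiplier=1.0, min_instances=1, max_instances=100):
    """Return how many instances to run for a given load.

    surge_multiplier is a hypothetical planning factor for demand driven by
    external events (for example, a competitor closing its doors).
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    planned_load = requests_per_sec * surge_multiplier
    needed = math.ceil(planned_load / capacity_per_instance)
    # Clamp to the configured floor and ceiling of the scaling group.
    return max(min_instances, min(needed, max_instances))


# Internal forecast only, then the same forecast with a 5x external surge.
forecast_only = desired_instances(400, 100)
with_surge = desired_instances(400, 100, surge_multiplier=5.0)
print(forecast_only, with_surge)
```

The point worth noticing is the ceiling: if the surge case runs into `max_instances`, the design cannot absorb the extra demand no matter how good the scaling policy is, and that limit is worth revisiting before the surge arrives.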
Category: Cloud Uncategorized Tags: Cloud, Scale
by Kyle Hilgendorf | February 4, 2013 | Comments Off
Last night, 108M+ people tuned in for Super Bowl XLVII. It looked to be a blowout of a game, yet ended in exciting fashion, perhaps due to one factor not related at all to football.
A power outage.
Early in the 3rd quarter, exactly half of the lights went out in the Superdome. A 34-minute game delay ensued while a likely frantic army of individuals behind the scenes attempted to get the lights back on. After the delay, the game momentum turned dramatically. Thankfully (in my opinion) the Ravens held on to win or we’d be hearing nothing but “Power Gate” complaints for the next 6 months.
It got me thinking though – is there anything cloud consumers can learn from this power outage? They seem like unrelated events, but let me clarify some of my brief thoughts.
- Outage pain is often more about time to recovery – would anyone have been upset if the lights came back on 2 minutes later? Probably not. But just like with cloud outages, we’re dealing with highly complex and slow to recover systems. When something goes wrong in the cloud, don’t expect it fixed in minutes. By comparison, 34 minutes would be fantastic. Therefore, cloud customers should adequately plan contingency, recovery, or triage plans to operate in the midst of a prolonged outage.
- Even highly resilient systems will fail – failure is inevitable. We all assume that something the size of the Superdome has multiple power paths, protections, circuits, breakers, and generators. Even with all that planning, something went wrong. To my knowledge, no cause has 100% been determined (more on that later). Similarly with cloud, no matter how many geographic zones or data centers you distribute your application across, there is always some event that can knock you down. We’ve seen outages due to control plane complexity, software bugs, and outages of resiliency-enabling components like load balancers. Customers should architect for resiliency, but also architect for many levels of failure if possible. Keep in mind all of this comes at a cost too.
- Root cause analysis – as of yet, no one has taken responsibility for the Superdome power outage. Eventually, the truth will come out. We are not sure whether it will be an admitted mistake, or a mistake uncovered by investigative journalism. But we will get the truth. Cloud providers have dramatically improved in these regards during 2012. Customers are getting much better post mortems and root cause analysis documents after an outage from providers. Sometimes these take a few days or a week, but they come. If you, as a customer, are not getting a post mortem on a cloud issue, I’d encourage you to demand it or move to a different provider.
- Outages improve the market – whatever was the cause of the power outage, you can bet it will be addressed or improved prior to any other major event at the Superdome. Furthermore, every major sporting arena will be tuning in to see what they can learn from the issue in New Orleans. Power designs in sporting stadiums will improve as a result of this. Similarly, after every major cloud outage, both the provider affected and its competitors learn and improve. Outages, while painful, are often beneficial.
- Employing the best staff is not foolproof – I’ve read that the Super Bowl and Superdome had some of the best technicians on hand both planning and running the event. Yet the power still went out. Cloud providers also tend to employ the best and brightest, but issues still happen. Humans, no matter how brilliant, are not perfect, nor can they prevent every issue.
- Outages are not only for nascent markets – I’ve heard many people blame cloud outages on the fact that many providers are young and services are immature/nascent. It’s a fair argument. But power distribution to major sporting events is a very mature market. And yet a problem still occurred. Similarly, as cloud providers mature, we should expect fewer outages, but they will not disappear. See prior bullets for justification.
These are just a few of the correlations I’ve come up with. What other connections do you see?
Category: Cloud Outage Tags: Cloud, Outage, Super Bowl, Superdome
by Kyle Hilgendorf | October 23, 2012 | Comments Off
Yesterday was marked with another major cloud outage. Amazon Web Services experienced a single availability zone issue in the US-East-1 region. As with all major cloud provider outages, I get the opportunity to speak to customers affected by the outage or customers considering broader public cloud adoption.
One question was asked of me in multiple conversations: Does US-East-1 have systemic design and availability issues?
This question stems from the fact that most (but not all) AWS issues have occurred in the US-East-1 region. Unfortunately the answer is not a definitive yes or a no, but let me elaborate.
US-East-1 (also simply referred to as US-East) is the oldest (i.e., original) and substantially largest AWS region. It is unclear exactly how much larger US-East-1 is than other regions such as US-West or EU-West, but substantial is probably an understatement. In the July 2, 2012 power outage post mortem, AWS stated that US-East-1 is composed of more than 10 data centers.
US-East-1 is also the cheapest and default region for many deployments. Therefore, the scale and impact of US-East-1 is quantitatively larger than other AWS regions.
So while US-East-1 may not have systemic design and availability issues, it is fair to say that US-East-1 pushes the limits in terms of scale, capacity, stress on software logic, distribution, and complexity. While AWS does not deliberately use US-East-1 as a test bed or trial ground, the unfortunate result of it being so much larger than the other regions is that US-East-1 by default becomes that trial ground.
Which leads to the basis of the question in the title of this blog: Are customers better off by moving to other AWS Regions (e.g., US-West, EU-West, etc…)? Unfortunately the answer may be yes. It might be beneficial to not be in the biggest AWS pond where scale and complexity issues first occur. The advantage of this is that fixes and optimizations can be uncovered in US-East-1 and deployed to the region you reside in before that region gets to the same size/scale.
However, perhaps you must be in US-East-1 for location requirements, price constraints, or a number of other reasons. But if you don’t have an affinity to the east coast of the United States, and if you can tolerate slightly higher prices in another region, perhaps it is time to move west, AWS cloud customers.
I’d love to hear your comments and thoughts.
Category: AWS Cloud IaaS Outage Providers Tags: AWS, Cloud, Iaas, Outage
by Kyle Hilgendorf | April 20, 2012 | 2 Comments
I just returned from full days at the OpenStack conference and analyst day in San Francisco. For full transparency, I’ve been somewhat skeptical about the size of the OpenStack movement and the crippling effect of too many players and the competitive nature of those involved for moving the initiative forward. Perhaps some of this comes from me having little to no open source background. Perhaps I am right. Perhaps I am wrong.
I walk away from this week being convinced of one thing. Those involved in OpenStack are “all-in”. And only a few key industry players are absent (VMware, Microsoft, Amazon, Oracle, Citrix)…but it is pretty obvious why each of those players would not be involved. To be accurate, Citrix is still minimally involved.
But let’s look at some of those who are committed and were very vocal at the conference. Rackspace, HP, IBM, Red Hat, Intel, Dell, Cisco, AT&T, Canonical, SUSE, Nebula, and MANY others (165+), too many to name them all. My apologies.
Those listed above are some heavy pillars in the IT industry, and more importantly the open source world. We are talking about companies that have significant open source experience and wisdom. As mentors and industry luminaries have taught me, the organization of the open source foundation is more key than anything else in a movement like this. Citrix is attempting to make this exact point by moving CloudStack to the Apache Software Foundation (the most proven such foundation). Is it a concern that OpenStack did not go there first? Maybe. But the currently forming OpenStack Foundation is making key progress and apparently has looked extensively at both the ASF and Eclipse in order to build a foundation with the best aspects of other open source foundations. The emerging governance model, structure, leadership, and process are coming together. Q3 2012 is the goal to have the foundation in place. If OpenStack meets that objective and is successful, there may be no stopping OpenStack.
But are there still too many companies involved? Probably, but that may also shake itself out. Companies will come and go. However, I no longer consider the size as a threat to destroy the initiative. When you consider the complexity of a cloud stack (not to be confused with Citrix’s CloudStack), you realize how intricate and expansive it really is. It is not as though every company involved is participating in every OpenStack project. Nova, the compute core project, has perhaps the most involvement, but the love is being spread around to other projects. Thought-leading networking companies like Cisco, Brocade, Nicira, and Internap are focusing their efforts on Project Quantum (Network). I was told this week that even though there are really only a few major Linux distros in terms of market share, there are still over 700 Linux ecosystem players. OpenStack is not even close yet to that size, and one could argue the potential market for cloud stacks is significantly larger than server OS.
OpenStack does not get to waltz into the party though without a fight. Many others will have something to say, and they are pillars themselves. Chris Kemp, CEO of Nebula and former NASA CTO, delivered a keynote where he said OpenStack is not competing with VMware or Amazon Web Services. He said both have completely different use cases in the industry. Yet the theme throughout the entire conference, reiterated even by the keynote speakers immediately following Kemp, was that VMware and AWS are squarely in the crosshairs as the core competition. Don’t think VMware, AWS, Microsoft, Oracle, Citrix, and possibly others will stay idle.
One of the ways OpenStack is targeting VMware and AWS is by marketing choice and avoiding vendor lock-in as key benefits against them. I do have to issue one caution here. Does OpenStack desire to offer choice and interoperability? Yes. Are we anywhere near that reality? Not even close. The OpenStack platform is open. Public and private cloud providers building solutions on top of OpenStack, however, are doing many interesting (and closed) things to add value (for market differentiation) to the stack that will introduce lock-in. For instance, Rackspace, in its Next Generation Cloud, built its own customer management portal, choosing to deploy a more robust portal than the default from OpenStack. While this portal is not a technology lock-in per se, it surely will be a process, management, and support lock-in. Customers will find it difficult to lift and shift from Rackspace to HP or OpenStack Provider X because of the effort involved to learn, train, and deploy their solutions into a new management portal. Possibly even more difficult is the fact that OpenStack is hypervisor agnostic (to a degree). If one cloud provider is running KVM under OpenStack and another is running XenServer, the complexity to move workloads and convert cannot be overstated. I could go on with many more examples of value-added lock-ins that exist, and if you are a Gartner for Technical Professionals customer, give me a call, I would love to discuss.
I’m excited to track this market over the next several months and years. I’ll get a chance to have some similar discussions with Citrix and CloudStack in early May and I hope to bring more key insights back from there.
Category: AWS Citrix Cloud IaaS OpenStack Private Cloud Providers Tags: AWS, CloudStack, Iaas, OpenStack, Rackspace, VMware
by Kyle Hilgendorf | April 16, 2012 | Comments Off
Last week I wrote a short blog on some not-so-common differentiators among public cloud IaaS providers. This week, Gartner published a large corresponding research project that I wanted to highlight.
“Evaluation Criteria for Public Cloud IaaS Providers” (Gartner for Technical Professionals subscription required) is the result of a year’s worth of customer interactions and personal testing. The research document covers important pieces of criteria by which enterprise customers should evaluate public cloud IaaS providers.
The document has 163 criteria components broken down into eight categories:
- Cross Service
- Support & Communication
- Service Levels
- Price & Billing
Within each category, I assigned one of three category ratings to each criterion:
- Required – Criteria that Gartner considers essential for IaaS providers to be capable of hosting production applications and to be considered “enterprise-grade.” IaaS solutions meeting fewer than all of the required criteria may still be employed for less-critical workloads or for very specific use cases where there is some work-around for a missing piece.
- Preferred – Criteria that Gartner considers nice to have and which are often those features that separate or differentiate good services from the best services. When evaluating IaaS providers, customers should always ask to see road maps that specify when providers plan to meet missing preferred criteria.
- Optional – Criteria that may be unique to specific use cases, or emerging criteria that will be more important as time progresses.
The research document (68 pages in length) includes a downloadable spreadsheet that allows customers to cut and paste into RFIs/RFPs or to change the weighting themselves for individual business requirements.
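To show how re-weighting might work in practice, here is a minimal sketch of scoring a provider against a criteria list. It is not the Gartner spreadsheet: the numeric weights, the criterion names, and the function name are hypothetical and chosen only for illustration.

```python
# Hypothetical weights for the three ratings described above; adjust these
# to reflect your own business requirements.
RATING_WEIGHTS = {"required": 3, "preferred": 2, "optional": 1}


def score_provider(criteria, met):
    """Score one provider.

    criteria: dict mapping criterion name -> rating
              ("required", "preferred" or "optional").
    met:      set of criterion names the provider satisfies.
    Returns (weighted score between 0 and 1, sorted list of missing
    required criteria).
    """
    total = sum(RATING_WEIGHTS[rating] for rating in criteria.values())
    earned = sum(RATING_WEIGHTS[rating]
                 for name, rating in criteria.items() if name in met)
    missing_required = sorted(name for name, rating in criteria.items()
                              if rating == "required" and name not in met)
    score = earned / total if total else 0.0
    return score, missing_required


# Made-up criteria for illustration only.
criteria = {
    "published SLA": "required",
    "24x7 phone support": "required",
    "hourly billing": "preferred",
    "custom VM sizes": "optional",
}
score, gaps = score_provider(criteria, met={"published SLA", "hourly billing"})
print(round(score, 2), gaps)
```

Returning the missing required criteria separately from the score is deliberate: a provider can score well overall and still lack something essential for production workloads.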
I am very proud of this research and many customer discussions went into creating the criteria list. I am confident that it will help many organizations cut down the time it takes to build evaluation criteria and questions for IaaS providers.
Category: Cloud Evaluation Gartner IaaS Providers Tags: Cloud, Criteria, Evaluation, Iaas, Providers
by Kyle Hilgendorf | April 11, 2012 | 4 Comments
I recently took a customer call regarding key differentiators in a public cloud IaaS offering. The customer wanted to discuss differentiators among services that are not commonly considered. The customer had already considered differences among scalability, geographic offerings, VM catalogs, pricing, security controls, network options, and storage tiers. They were essentially asking about “periphery” differences or those things that might not be immediately obvious.
I kept a list of the things we discussed and thought that this blog would be valuable to many others. I will not take the time to describe each in detail here, but I welcome comments and debate below. As always, Gartner for Technical Professionals (GTP) customers can schedule a call with me at any time.
This is not an exhaustive list and is not in any particular order:
- Graphical User Interface / Management Console
- Provider Transparency
- Payment Models
- Billing (granularity)
- Enterprise management capabilities (asset, deploy, change, incident, problem, …)
- Ecosystem of partners & user community
- Support levels
- Simplicity of service vs. Feature set
- SLAs (breadth of, definition, and clarity)
- APIs (robustness & documentation)
Each of the above bullets can warrant a lengthy discussion in itself. But as you are considering public cloud IaaS offerings, do not forget these items.
Category: Cloud Evaluation IaaS Tags: Assessment, Cloud, Iaas, Providers, Transparency, Vendor Management
by Kyle Hilgendorf | March 12, 2012 | 6 Comments
Late Friday evening, Microsoft released their root cause analysis (RCA) for the Azure Leap Day Bug outage. My last two blog posts chronicled what I heard from Azure customers regarding the outage.
I want to share that I was very pleased with the level of detail in Microsoft’s RCA. As we learned with the AWS EBS outage in 2011, an RCA or Post Mortem is one of the best insights into architecture, testing, recovery, and communication plans in existence at a cloud provider. Microsoft’s RCA was no exception.
I encourage all current and prospective Azure customers to read and digest the Azure RCA. There is significant insight and knowledge around how Azure is architected, much more so than customers have received in the past. It is also important for customers to gauge how a provider responds to an outage. We continuously advise clients to pay close attention to how providers respond to issues, degradations, and outages of service.
I do not want to copy the RCA, but here are a few bullet points I’d like to highlight.
- It’s eerie how similar the leap day outage at Azure was to AWS’ EBS outage. Both involved software bugs and human errors. Both were cascading issues. Both issues went unnoticed longer than necessary. As a result, both companies have implemented key traps in their service to catch and prevent errors like this sooner and to prevent spreading.
- Microsoft deliberately suspended service management in order to stop or slow the spread of the issue. Microsoft made this decision with very good reason. Customers would have appreciated knowing the rationale around this decision right away and Microsoft is committing to improving real time communication.
- The actual leap day bug issue and resolution was identified, tested, and rolled out within 12 hours. That is pretty fast. The other issues resulted from unfortunate timing of upgrading software at the time the bug hit, as well as a human error in trying to resolve some other software issues. Microsoft even admits, “in our eagerness to get the fix deployed, we had overlooked the fact that the update package we created….[was] incompatible.”
- Even though the human error only affected seven Azure clusters, those clusters happened to contain Access Control (ACS) and Service Bus services, thereby taking those key services offline. As I spoke to customers the last two weeks, it became quite clear that without such key services as ACS and Service Bus, many other functions of Azure are unusable.
- Microsoft took steps to prevent the outage from worsening. Had these steps not been taken, we might have seen a much bigger issue.
- The issues with the Health Dashboard were a result of increased load and traffic. Microsoft will be addressing this problem.
- Microsoft understands that real-time communication must improve during an outage and is taking steps to improve.
- A 33% service credit is being applied to all customers of the affected services, regardless of whether they were affected. This 33% credit is quickly becoming a de facto standard for cloud outages. Customers appreciate this offer as it spares both customers and providers from having to deal with SLA claims and the administrative overhead involved.
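For readers wondering what a leap day bug can look like in code, here is a small illustration. It is not a reconstruction of Azure’s code, and the function names are hypothetical. It shows one common form of the defect: computing “one year from today” by simply incrementing the year, which produces a date that does not exist when today is February 29.

```python
from datetime import date


def naive_one_year_later(d):
    # Simply bumps the year. This raises ValueError for February 29,
    # because the following year has no February 29.
    return d.replace(year=d.year + 1)


def safe_one_year_later(d):
    # Same calculation, but falls back to February 28 when the target
    # date does not exist. (Falling back to February 28 is one possible
    # policy; March 1 would be another.)
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        return d.replace(year=d.year + 1, day=28)


leap_day = date(2012, 2, 29)
try:
    naive_one_year_later(leap_day)
except ValueError as err:
    print("naive calculation failed:", err)

print(safe_one_year_later(leap_day))
```

The naive version works on every other day of the four-year cycle, which is exactly why this kind of defect tends to survive testing.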
As a final note, Microsoft stated in the RCA many times that they would be working to improve many different processes. I hope that as time moves forward, Microsoft continues to use their blog to share more specifics about the improvements in those processes and the progress against achieving those goals.
What did you think of the Azure RCA?
Category: Cloud Microsoft Outage Providers Tags: Azure, Cloud, Microsoft, Outage, Transparency
by Kyle Hilgendorf | March 9, 2012 | 2 Comments
On February 29, 2012 (leap day), Microsoft Windows Azure experienced a significant cloud service outage. Microsoft announced the outage and resolution in their public blogs (http://blogs.msdn.com/b/windowsazure/archive/2012/03/01/windows-azure-service-disruption-update.aspx and http://blogs.msdn.com/b/windowsazure/archive/2012/03/01/window-azure-service-disruption-resolved.aspx). After the outage, I was able to interview customers of Azure that expressed to me that the outage was very impactful. During the outage last week, I summarized on this blog some high level points about the outage that customers had quickly sent me. However, now that the dust has settled and I’ve had an opportunity to personally interview more Azure customers, I wanted to take this opportunity and provide deeper insight.
Every customer I spoke to agreed to do so under strict confidence. This is always of primary importance to Gartner. I am very thankful to be in the unique position where I get direct and specific details from customers and will always respect their confidentiality. Therefore, I have anonymized all the details. Readers can be certain however, that the below points came directly from real customers using Windows Azure services. I will deliberately not replay the insights from my previous blog, but they still apply. While these insights are specific to Microsoft Windows Azure, they can be applied to any cloud service and I encourage customers and providers alike to consider the learnings. Let’s look at the new insights.
- Communication from Microsoft should have been better – Every customer I spoke to mentioned this. Even 2-3 days after the outage, some customers had not received any formal communication. One customer informed me that they received some personal emails from support friends at Microsoft but nothing official. Those customers that did get formal communication received a very brief synopsis email of the outage stating that the issue started at 5:45pm PST on February 28th and that customers may have experienced issues with Access Control 2.0, Marketplace, Service Bus, and Access Control & Caching Portal. This is a different list of services than those posted in the public blog by VP Bill Laing, however the services listed in the email more closely align to what the Azure Health Dashboard displayed at various points during the outage. The history of the Azure Health Dashboard also shows service interruptions or degradations with the following services on February 29 and/or March 1: SQL Azure Data Sync, Management Portal, Compute, and Service Management. Depending on which communication customers reviewed, there was conflicting information.
- Customers are frustrated with the lack of transparency by Microsoft – The Azure blog announced services were restored for most customers by 2:57am PST on the 29th. Yet, every customer I spoke to informed me that they experienced widespread issues until late on the 29th. I was specifically told by multiple customers that they did not see services come back online until approximately 8pm PST on the 29th, essentially canceling the entire day for them on the 29th, especially for those on the US east coast or in Europe. One customer told me, “we live in a transparent culture. Services go down, but the best practice is ultimate transparency.” Customer sentiment was that Microsoft was not honest immediately during the outage and continued to post conflicting information regarding the outage and its breadth.
- The service outage was far more impactful than advertised – Customers informed me of outages or issues with all of the following services: Access Control 2.0, Service Bus, SQL Azure Data Sync, SQL Azure Database, SQL Azure Management Portal, Windows Azure Compute, and Windows Azure Marketplace. Furthermore, many of these services were having issues in multiple Azure regions. Even though these services were offline at different times, more than one customer informed me that the integrated nature of Azure services means that even if one service is offline, it can keep the other services from working properly. For example, when SQL Azure and Azure Compute were online, Azure Data Sync was not. When a customer relies on Data Sync for connecting SQL and Compute services, all three services end up being unavailable. When Access Control was offline, users could not authenticate, rendering the backend application unusable. Furthermore, when Service Management capabilities are offline, it prevents customers from executing any administrative tasks that would assist the customer in redeploying in other regions or implementing business continuity plans. The key learning here is that even if a single component of a cloud service is offline, the impact for an individual customer could be far reaching throughout the other services in the cloud.
- Customers are not leaving Azure, but they are brainstorming options – A consistent theme among Azure customers is that this outage by itself is not driving away current business from Azure. In fact, most customers have been pleased with the service over the past months and years. Most customers are willing to give Microsoft a “black swan” pass on the actual technical issue, but hope it causes Microsoft to improve upon the first few bullet points. With that said, some customers are considering options to protect themselves further from an Azure outage in the future. As mentioned in the previous bullet point, Azure by itself was not able to offer the resiliency and availability to sustain this outage for customers. Because this was a wide-reaching software bug, most regions and services were affected at some point. Customers concluded after the outage that the only true protection against such a widespread software bug is to build a multi-provider or hybrid operating strategy. Therefore, customers are looking at possibilities to maintain some services locally on-premises or enlisting a secondary provider. The challenge in the latter is that very few legitimate .NET and SQL Server as-a-service alternatives exist. Microsoft may be contributing to this problem by building up Azure to such a large offering and cannibalizing its own channel of partners. One customer informed me that they would love to see Microsoft resell Azure to other providers. Other customers that are looking at an on-premises deployment are weighing the costs and risks to do so as compared to the business lost in a single business day. Building such architecture can be quite costly.
- Customers were surprised at the lack of “press” – This is an interesting insight. More than one customer informed me that they were surprised how little information was published regarding the Azure outage and how few customers were publicly complaining about the outage. In comparison to cloud outages in 2011, customers expressed that the news and twitter traffic was much lower. One customer informed me that they were actually wondering if this was an indication of how few customers are in production with Azure and whether they are one of the few in that situation. That did not make them feel very good. However, as I learned from other customers later, many customers deliberately refrained from commenting publicly or in venues such as Twitter because they did not want to elevate to the public that they were having an outage as a result of the Azure outage. As an analyst I have to wonder whether admitting use of public cloud services is a good PR move or a bad PR move.
- Customers are not bothering with SLA claims – Most customers when asked about submitting an SLA claim responded that they were not going to waste their time. To begin, many of the customers complained about the Azure standard SLA, concluding that it is open for interpretation and highly beneficial to Microsoft. One customer even informed me that Microsoft told them this outage did not violate the Azure SLAs. Regardless of whether the outage violated the SLA or not, customers commonly shared that submitting an SLA claim is not worth the time and effort. After all, these businesses lost nearly a day of service and are focusing their time and effort on making sure services are restored, working, and better resilient for the future. Customers did express that it would be welcomed if Microsoft proactively offered them a credit for this outage as a sign of good will and to lessen any need to go through the hassle of submitting an SLA claim. AWS did this in April of 2011 for all customers and it was a popular move. One customer did tell me that Microsoft extended a compensation offer to them after the outage.
- Customers need better health status of Azure – As I mentioned in my blog last week, cloud providers need to host their health dashboard outside of their own service and be prepared for large amounts of traffic to the dashboard in the event of an outage. The Azure Health Dashboard was frequently unavailable during the outage, making it hard for customers to understand what was going on. Current health status is very important, especially for those customers that desperately want to try to leverage other regions or services to bring capabilities back online. Customers are therefore urging Microsoft to take this advice, and some customers are looking at third-party options that can monitor Azure health from the outside.
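For customers weighing that kind of outside-in monitoring, here is a minimal Python sketch of a probe. It assumes that a plain HTTP GET against a status page or application endpoint is a meaningful signal. The names `probe`, `success_rate`, and `opener`, and the 5-second timeout, are my own illustrative choices and not part of any Azure or third-party tool. The only point is that the check runs from infrastructure that does not depend on the service being checked.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout_s=5.0, opener=urllib.request.urlopen):
    """Fetch url once from outside the provider; return (ok, seconds).

    `opener` is injectable so the probe can be tested without a network.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout_s) as resp:
            ok = 200 <= resp.status < 400
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, or HTTP error status.
        ok = False
    return ok, time.monotonic() - start


def success_rate(outcomes):
    """Fraction of probes that succeeded over some window."""
    outcomes = list(outcomes)
    if not outcomes:
        raise ValueError("no probe outcomes recorded")
    return sum(1 for ok in outcomes if ok) / len(outcomes)
```

Run on a schedule from a host at a different provider, a probe like this gives a customer an independent record of what was reachable and when, even if the provider's own dashboard is down.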
We are near the 10-day commitment by Microsoft to deliver the Root Cause Analysis. Customers should pay close attention to the root cause analysis because such documentation often provides insight into not only the architecture of the cloud service, but also the commitment of the cloud service provider to its customers. I hope the analysis will be transparent about what happened, what Microsoft has learned from it, how it will be prevented in the future, and what help Microsoft is offering to Azure customers to avoid impacts in the future.
Cloud outages are a sad and unfortunate event. However, if we learn from them, build better services, increase transparency, and guide towards better application design, then we can make something great out of something bad.
Category: Cloud Microsoft Outage Providers Tags: Azure, Cloud, Microsoft, Outage, Providers
by Kyle Hilgendorf | February 29, 2012 | 9 Comments
Today, Microsoft Windows Azure had an advertised outage. As of writing this blog, the outage is still in recovery mode. I spent the morning talking to a handful of Azure customers via phone, email, and Twitter. Here are some observations becoming quite evident and important learnings for cloud customers and cloud providers:
- Cloud providers continue to track cloud outages/issues based only on availability whereas it must also include performance and response metrics
- Service dashboards continue to rely on the underlying cloud service being online
- Customers can never get enough information during the outage from the provider
- We all know outages are a fact of life, but in the midst of one, pain is real
- Customer application design needs to continue to evolve
Let me dive into each of these points with my own commentary.
- Cloud providers continue to track cloud outages/issues based only on availability whereas it must also include performance and response metrics: Azure’s health dashboard and status updates originally stated that only 3.8% of customers were affected by this outage. There was no context around where the 3.8% came from or how it was measured, but I spoke to several customers this morning who suspect they were not included in the 3.8%. Just recently, the percentages were increased on the dashboard. Based upon region, the latest affected customer percentages are 6.7%, 37%, and 28% (and may still change). Some customers informed me that their Azure roles (web, worker, VM) are up and online but that service performance is degraded to the point of being unusable. Because most provider SLAs are based upon uptime and availability, and not performance or response, these customers may not be counted as affected. You can follow some of my interactions via Twitter (@kylehilgendorf) from this morning to see a couple of examples. Providers MUST start including performance and response SLAs in their standard service. A degraded service is often as impactful as a down service. A great quote came in on Twitter this morning via @qthrul, “…a falling tower is ‘up’ until it is ‘down’.” A falling tower is not very useful for most customers.
- Service dashboards continue to rely on the underlying cloud service being online: The Azure Service Dashboard (http://www.windowsazure.com/en-us/support/service-dashboard/) has been experiencing very intermittent availability. Throughout this morning, I have had about a 25%-30% success rate of getting the dashboard to load. I’ve been informing providers frequently that service health systems and dashboards must be hosted independently from the provider’s cloud service. If the cloud service is down or degraded, customers had better be able to see the status at all times. I recently finished a lengthy document on evaluation criteria for public IaaS providers that will publish in the near future, and one of those criteria specifically states this as a requirement. If the service dashboard is the primary vessel by which cloud providers communicate outage updates, it must be up while the service is down.
- Customers can never get enough information during the outage from the provider: Looking back to 2011 and the AWS and Microsoft outages, it became very clear that frequent status updates are paramount during an outage. AWS led the way with updates every 30-45 minutes through their painful EBS outage and Ireland issues. While updates don’t solve the problem, they do demonstrate customer advocacy and concern. Some customers told me this morning they feel completely in the dark. There is no reason why a cloud provider should not have a dedicated communication team providing updates at least every 30 minutes throughout the entire outage. Microsoft seems to be in a good cadence late this morning with more frequent updates, but there were large gaps in updates when the outage first occurred. More important, in my opinion, is a thorough post-mortem on the outage once the service has been restored. This should come within 3-4 days of the outage and must be very open and honest about the root cause, the fix, and the take-aways for the future. Providers please note, the world is very smart. If a provider even tries to mask or hide any of the details, it will come back to reflect negatively. Honesty wins.
- We all know outages are inevitabilities, but in the midst of one, pain is real: I’ve heard from some customers very impacted and as a result very frustrated and disappointed. When a cloud service has a good track record, we all admit that an outage will happen at some point. Yet, in the middle of an outage, emotion gets involved. Therefore, see point #5.
- Customer application design needs to continue to evolve: Similar to previous cloud outages, customer application design must continue to evolve to account for possible (some would say probable) cloud outages and issues. No cloud service is identical to another, and each has its own unique design and configuration options. Most cloud services have the concept of zones and regions from a geographical or hosting location standpoint. In most cloud outages, not every zone or region is affected. Therefore, the best-prepared applications will be those designed cross-zone and cross-region to avoid an outage or degradation in any one area. However, this comes with extreme complexity and an increase in cost, many times 3x-10x the cost advertised by providers. If you will be running a critical application at a cloud provider, expect an outage, design for resiliency, and be prepared to pay for it. This may also mean that you have to hire or retain some very skilled cloud staff.
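To make the first point above concrete, here is a minimal Python sketch of a health classification that treats performance as part of health rather than looking at availability alone. It is an illustration under stated assumptions, not how Azure or any provider measures affected customers: the `ProbeResult` type, the `classify` function, the 2000 ms latency ceiling, and the 95% ratio are all hypothetical values chosen for the example.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class ProbeResult:
    ok: bool                      # did the request succeed at all?
    latency_ms: Optional[float]   # None when the request failed


def classify(results: Sequence[ProbeResult],
             max_latency_ms: float = 2000.0,   # assumed usability ceiling
             min_good_ratio: float = 0.95) -> str:
    """Return 'up', 'degraded', or 'down' for a window of probe results."""
    if not results:
        raise ValueError("need at least one probe result")
    succeeded = [r for r in results if r.ok]
    if not succeeded:
        return "down"
    # A response only counts as good if it was also fast enough to be usable.
    good = [r for r in succeeded
            if r.latency_ms is not None and r.latency_ms <= max_latency_ms]
    if len(good) / len(results) >= min_good_ratio:
        return "up"
    return "degraded"
```

An availability-only check would report a window of successful but very slow responses as "up"; this sketch reports it as "degraded", which is the falling-tower case described above.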
It is always a sad day as a cloud analyst to see these outages. However, it seems that significant change in the industry, at both a provider and customer level, only tends to come after an emergency.
I’d love your comments here. Let’s engage in a conversation.
Category: Cloud Microsoft Outage Providers Tags: Azure, Cloud, Microsoft, Outage