Amazon outage and the auto-immune vulnerabilities of resiliency

By Lydia Leong | April 21, 2011 | 7 Comments


Today is Judgment Day, when Skynet becomes self-aware. It is, apparently, also a very, very bad day for Amazon Web Services.

Lots of people have raised questions about what Amazon’s difficulties today mean for the future of cloud IaaS. My belief is that this doesn’t do anything to the adoption curve, but I do believe that customers who rely upon Amazon to run their businesses will, and should, think hard about the resiliency of their architectures.

It’s important to understand what did and did not happen today. There’s been a popular impression that “EC2 is down”. It’s not. To understand what happened, though, some explanation of Amazon’s infrastructure is necessary.

Amazon divides its infrastructure into “regions”. You can think of a region as basically analogous to “a data center”. For instance, US-East-1 is Amazon’s Northern Virginia data center, while US-West-1 is Amazon’s Silicon Valley data center. Each region, in turn, is divided into multiple “availability zones” (AZs). You can think of an AZ as basically analogous to “a cluster”: a grouping of physical and logical resources. Each AZ is designated by a letter, for instance, US-East-1a, US-East-1b, and so on. However, each of these designations is customer-specific (which is why Amazon’s status information cannot easily specify which AZ is affected by a problem).

Amazon’s virtual machine offering is the Elastic Compute Cloud (EC2). When you provision an EC2 “instance” (Amazon’s term for a VM), you also get an allocation of “instance storage”. Instance storage is transient — it exists only as long as the VM exists. Consequently, it’s not useful for storing anything that you actually want to keep. To get persistent storage, you use Amazon’s Elastic Block Store (EBS), which is basically just network-attached storage. Many people run databases on EC2 that are backed by EBS, for instance. Because that’s such a common use case, Amazon offers the Relational Database Service (RDS), which is basically an EC2 instance running MySQL.

Amazon’s issues today are with EBS, and with RDS, both in the US-East-1 region. (My guess is that the issues are related, but Amazon has not specifically stated that they are.) Customers who aren’t in the US-East-1 region aren’t affected (customers always choose which region and specific AZs they run in). Customers who don’t use EBS or RDS are also unaffected. However, use of EBS is highly commonplace, and likely just about everyone using EC2 for a production application or Web site is reliant upon EBS. Consequently, even though EC2 itself has been running just fine, the issues have nevertheless had a major impact on customers. If you’re storing your data on EBS, the issues with EBS have made your data inaccessible, or they’ve made access to that data slow and unreliable. Ditto with RDS. Obviously, if you can’t get to your data, you’re not going to be doing much of anything.

In order to get Amazon’s SLA for EC2, you, as a customer, have to run your application in multiple AZs within the same region. Running in multiple AZs is supposed to isolate you from the failure of any single AZ. In practice, of course, this only provides you so much protection — since the AZs are typically all in the same physical data center, anything that affects that whole data center would probably affect all the AZs. Similarly, the AZs are not totally isolated from one another, either physically or logically.
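To make “running in multiple AZs” concrete, here is a minimal sketch of round-robin placement across zones (pure Python; the zone names and instance IDs are illustrative, not real AWS API calls):

```python
# Spread instances evenly across availability zones so the loss of any
# single AZ takes out only a fraction of capacity. Zone names and IDs
# are illustrative; this is not the AWS API.

from itertools import cycle

def place_instances(instance_ids, azs):
    """Round-robin instances across the given availability zones."""
    placement = {}
    for instance_id, az in zip(instance_ids, cycle(azs)):
        placement[instance_id] = az
    return placement

instances = ["i-001", "i-002", "i-003", "i-004"]
zones = ["us-east-1a", "us-east-1b"]
placement = place_instances(instances, zones)

# If us-east-1a fails, half the fleet survives.
survivors = [i for i, az in placement.items() if az != "us-east-1a"]
print(sorted(survivors))  # ['i-002', 'i-004']
```

The catch, as the next paragraph explains, is that spreading compute this way is the easy half of the problem; your data has to follow.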

However, when you create an EBS volume, you place it in a specific availability zone, and you can only attach that EBS volume to EC2 instances within that same availability zone. That complicates resiliency, since if you wanted to fail over into another AZ, you’d still need access to your data. That means if you’re going to run in multiple AZs, you have to replicate your data across multiple AZs.
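That placement rule can be illustrated with a small model (pure Python, hypothetical names; not the AWS API): an attach is valid only when the volume and the instance share an availability zone.

```python
# Hypothetical model of the EBS placement rule: a volume lives in exactly
# one availability zone and can attach only to instances in that same AZ.

from dataclasses import dataclass

@dataclass
class Instance:
    instance_id: str
    az: str  # e.g. "us-east-1a"

@dataclass
class Volume:
    volume_id: str
    az: str

def can_attach(volume: Volume, instance: Instance) -> bool:
    """EBS volumes attach only within their own availability zone."""
    return volume.az == instance.az

web = Instance("i-12345678", "us-east-1a")
data = Volume("vol-abcdef01", "us-east-1a")
standby = Instance("i-87654321", "us-east-1b")

print(can_attach(data, web))      # True: same AZ, attachable
print(can_attach(data, standby))  # False: must replicate the data instead
```

The failed `can_attach` is exactly the failover gap: the standby instance in the other AZ cannot simply mount the original volume, so the data itself must be replicated across zones.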

One of the ways you can achieve this is with the Multi-AZ option of RDS. If you’re running a MySQL database and can do so within the constraints of RDS, the multi-AZ option lets you gain the necessary resiliency for your database without having to replicate EBS volumes between AZs.

As one final caveat, data transfer within a region is free and fast (it’s basically over a local LAN, after all). By contrast, Amazon charges you for transfers between regions, which go over the Internet and carry the attendant cost and latency.

Consequently, there are lots of Amazon customers who are running in just a single region. A lot of those customers may be running in just a single AZ (because they didn’t architect their app to easily run in multiple AZs). And of the ones who are running in multiple AZs, a fair number are reliant upon the multi-AZ functionality of RDS.

That’s why today’s impacts were particularly severe. US-East-1 is Amazon’s most popular region. The EBS problems impacted the entire region, not just a single AZ, as did the RDS problems (with multi-AZ RDS particularly hard-hit). So if you were multi-AZ but not multi-region, the resiliency you were theoretically getting was of no help to you. Today, people learned that it’s not necessarily adequate to run in multiple AZs. (Justin Santa Barbara has a good post about this.)

My perspective on this is pretty much exactly what I would tell a traditional Web hosting customer who’s running only in one data center: If you want more resiliency, you need to run in more than one data center. And on Amazon, if you want more resiliency, you need to not only be multi-AZ but also multi-region.

Amazon’s SLA for EC2 is 99.95% for multi-AZ deployments. That means you should expect that you can have about 4.4 hours of total region downtime each year without Amazon violating their SLA. Note, by the way, that this outage does not actually violate their SLA. Their SLA defines unavailability as a lack of external connectivity to EC2 instances, coupled with the inability to provision working instances. By that definition, EC2 was just fine. It was EBS and RDS which weren’t, and neither of those services has an SLA.
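The arithmetic behind that downtime figure is straightforward; as a quick sketch:

```python
# Downtime permitted by a 99.95% annual availability SLA.
sla = 0.9995
hours_per_year = 365 * 24  # 8,760 hours in a non-leap year

allowed_downtime_hours = (1 - sla) * hours_per_year
print(round(allowed_downtime_hours, 2))  # ~4.38 hours per year
```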

So how did Amazon end up with a problem that affected all the AZs within the US-East-1 region? Well, according to their status dashboard, they had some sort of network problem last night in their east coast data center. That problem resulted in their automated resiliency mechanisms attempting to re-mirror a large number of EBS volumes. This impacted one of the AZs, but it also overloaded the control infrastructure for EBS in that region. My guess is that RDS also uses this same storage infrastructure, so the capacity shortages and whatnot created by all of this activity ended up also impacting RDS.

My colleague Jay Heiser, who follows, among other things, risk management, calls this “auto-immune disease” — i.e., resiliency mechanisms can sometimes end up causing you harm. (We’ve seen auto-immune problems happen before in a prior Amazon S3 outage, as well as a Google Gmail outage.) The way to limit auto-immune damage is isolation — ensuring limits to the propagation.
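One common way to bound that kind of propagation is to put an admission limit on the recovery mechanism itself, so a mass failure degrades gradually instead of saturating shared control-plane capacity. The sketch below is my illustration of that idea, not Amazon’s actual design; all names are hypothetical.

```python
# Sketch of a concurrency cap ("bulkhead") on a recovery operation: a
# storm of simultaneous re-mirror requests queues up instead of
# overwhelming the shared control plane. Names are illustrative.

from collections import deque

class RemirrorScheduler:
    def __init__(self, max_concurrent: int):
        self.max_concurrent = max_concurrent
        self.active = set()
        self.queue = deque()

    def request(self, volume_id: str):
        """Admit a re-mirror immediately if under the cap, else queue it."""
        if len(self.active) < self.max_concurrent:
            self.active.add(volume_id)
        else:
            self.queue.append(volume_id)

    def complete(self, volume_id: str):
        """Finish one re-mirror and admit the next queued request, if any."""
        self.active.discard(volume_id)
        if self.queue and len(self.active) < self.max_concurrent:
            self.active.add(self.queue.popleft())

sched = RemirrorScheduler(max_concurrent=2)
for vol in ["vol-1", "vol-2", "vol-3", "vol-4"]:
    sched.request(vol)  # a "storm" of four simultaneous failures

print(len(sched.active), len(sched.queue))  # 2 2: only two run at once
sched.complete("vol-1")
print(len(sched.active), len(sched.queue))  # 2 1: next one is admitted
```

The design choice here is the isolation boundary: the cap guarantees the recovery traffic can never consume more than a fixed share of capacity, no matter how many volumes fail at once.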

Will some Amazon customers pack up and leave? Will some of them swear off the cloud? Probably. But realistically, we’re talking about data centers and infrastructure here. They can and do fail. If your application can never go down, you have to architect it for continuous availability across multiple data centers. Whether you’re running your own data center, running in managed hosting, or running in the cloud, you’re going to face this issue. (Your problems might be different; your own little data center isn’t going to have the kind of complex problem that Amazon experienced today, but you’re still going to have downtime-causing issues.)

There are a lot of moving parts in cloud IaaS. Any one of them going wrong can bork your entire site/application. Your real problem is appropriate risk mitigation — the risk of downtime and its attendant losses, versus the complications and technical challenges and costs created by infrastructure redundancy.

The Gartner Blog Network provides an opportunity for Gartner analysts to test ideas and move research forward. Because the content posted by Gartner analysts on this site does not undergo our standard editorial review, all comments or opinions expressed hereunder are those of the individual contributors and do not represent the views of Gartner, Inc. or its management.



  • Very well written. Thanks for the information. Oddly, I posted on my blog yesterday that the cluster architecture is going to lead to unique data quality processes when the clusters fall out of sync and the conflict resolution measures are unable to settle the discrepancies.

  • I think your observation that “this doesn’t do anything to the adoption curve” of cloud computing is a very insightful one. Common wisdom says that this is a black eye for AWS and cloud computing as a whole. I think it is exactly the opposite. The events today highlighted just how pervasive usage of the cloud already is. It also highlighted, as you explain in your analysis, that there are sufficient tools and facilities available to achieve continuous availability, but most people do not take advantage of them. We use multi-availability-zone and multi-region deployments using the IBM DB2 HADR facility and fail-over automation.

  • Nathan Butler says:

    I’ve spoken with Amazon reps before, and the definition you have of AZs is not quite right. Availability zones are logical data centers within a region. Meaning, one AZ is a series of multiple physical buildings within a region. Each AZ is on a separate flood plain, has separate power, etc., from each other AZ. So each AZ is actually a data center, and in the us-east-1 region there are 4 AZs and hence 4 data centers. So when you say that each AZ is a cluster in one physical data center, that is incorrect. Amazon has always recommended that you go multi-region to get better response time to your end users, not necessarily for HA. You could do multi-region HA, and that does make sense from a catastrophe perspective (I’m thinking natural catastrophe), but if your architecture is multi-AZ you already are multi-data-center. For instance, I run the infrastructure for my company in the us-east-1 region and half of our instances are in one AZ and the other half in another, so if one AZ goes down, we can still serve from half our infrastructure.

  • Alan Berkson says:

    Great piece. Regardless of the accuracy of the specifics in terms of Amazon architecture, it highlights two important issues:

    1- cloud is NOT a redundancy solution, it’s a tool;
    2- know your vendors’ SLAs and be sure they align with your own SLAs.

    Also love the auto-immune analogy.

  • Very insightful article. I don’t think that anyone can argue against cloud being a wonderfully disruptive technology, both in terms of how it can impact the efficiency of IT operations and how it can shift the traditionally capital-intensive IT budget to something else. Further, I don’t believe that the issues Amazon experienced have tainted the market’s interest in leveraging cloud, but I do believe that it has opened some eyes to the proper placement of cloud within the landscape of IT infrastructure solutions, as well as proper due diligence of a provider’s SLAs, supporting architecture, and processes.

    There are vast differences between cloud environments (yes, there are many versions of cloud) and they deserve sufficient analysis before being labeled as the “right choice.” After all, the market offers enterprise cloud options and those that don’t fit that definition, internal clouds, external clouds (private, public and hybrid versions), and the SLA options and technologies associated with each can vary based on your ideal design. In the end, as hard a pill as it may be to swallow, “you get what you pay for.”

  • Mark Clifton says:

    Scott Huguenin’s “you get what you pay for” is where the heart of the issue lies. The fact is that redundancy, uptime, latency, traffic and data loss protection, all cost money, several times the amount that is required just to get “the ball rolling”.

    My opinion is that these ‘extra’ features need to cost less and be more streamlined, and eventually the market will take us there. Eventually the paths and options will become clearer as far as managing risk vs. the extra cost required to achieve the desired levels in each of the categories I listed above. But right now, it’s been a problem for us trying to decide what amount of money is “worth the potential benefit” and how much is required to sleep at night. I’m hoping some day we’ll see a comprehensive chart of options for companies in different economic situations so that we at least fully understand the risk levels we are at, with a nice “You Are Here” arrow.

    In our situation, we had 2 servers with no hot backups running in the affected zone. S3 backups of the important data are done every 48 hours. We were comfortable with the potential risk of being down for 1-2 days, because the cost of redundancy would increase our total upkeep costs by 200%, taking away from that precious startup revenue. This situation was tricky for us because of the lack of an ETA, so the gambling did not stop even after the initial outage. If Amazon had told us we’d lost everything, it would actually have been better than them telling us to ‘sit tight’. We would have been back up by Friday, then another day of re-entering lost data from emails, etc. However, technology these days is even more unpredictable than the weather, with the additional lack of transparency encouraged by the business community and PR philosophies. I would move to another cloud service in a heartbeat if I could actually see all the tools available to the employees managing the cloud, in read-only mode of course, even though I lack the knowledge to really understand them. I would trade any kind of support for that transparency. What are they afraid of? I can’t afford a data center, nor the education and time to spend on learning cloud management and scaling. This is what I pay for, not some hands-off magical black box. Showing all your ‘trade secrets’ in this case would actually be a major selling point for those control-crazed IT professionals and get you more clients, IMO. This would be your edge over the competition, and the expertise provided by the community with all this new information would be support that companies don’t even need to pay for.

  • Nolio customers were able to automatically deploy their applications to US West as well as other cloud providers. What’s your story?