During a recent inquiry, a client asked me how they could purchase “guaranteed capacity” at AWS in the event of a disaster. Frankly, I had never even considered such a scenario. After asking the client for clarification, I discovered that they were concerned about AWS’ ability to guarantee capacity when/if a large number of organizations tried to simultaneously provision or power-on instances. This is assuming, of course, that the disaster affected a large geographic area and, consequently, a large number of organizations. That immediately reminded me of the old way of doing DR, where organizations would pay to have physical servers reserved and, in the event of a disaster, would be guaranteed access to those servers to rebuild their environment.
So I began to answer the question: “Well, with AWS, you could purchase Reserved Instances, which would guarantee that your instances would power on, but there are design best practices that are intended to avoid those types of situations. For starters, you would need to deploy your instances into two Availability Zones within the same Region in order to adhere to AWS’ compute SLA.” Now, of course the client is thinking about DR and, as such, may not be willing to deploy instances to two Availability Zones. Nonetheless, that is the correct deployment method to avoid such an outage. The client then asked, “What if the disaster affected the entire Region?” I then explained that AWS Regions have at least two AZs, and some have more. Furthermore, the likelihood of a Region running out of capacity is extremely low. And of course, you can always architect your environment to work across multiple Regions; you sacrifice synchronous replication capabilities and some other things, but it is doable.
I then took the client through the importance of a Business Impact Analysis (BIA), which would not only consider the types of disasters the business needs to protect against, but would also identify the RTOs and RPOs needed to design an effective DR plan. I also explained to the client that at some level, they must consider the social aspects of a disaster, not just the technical aspects. If the disaster is that big, the last thing on anyone’s mind is how to bring back services – you are in survival mode, at that point. If you are trying to protect against more than a hurricane, a tornado, an earthquake of a certain magnitude, or even against terrorist attacks of a specific caliber, you face many more challenges than whether or not your instances will power on.
But that got me thinking – could there be a run on resources at large cloud service providers like AWS, Azure, or others in the event of a really large disaster? And if so, what would that look like?
Putting the social and survival instincts aside, and assuming a disaster were large enough that organizations would look to AWS and Azure to quickly rebuild their environments, what could these providers do to avoid running out of capacity? Sure, I know what you will say: “the cloud offers infinite capacity.” But we all know that isn’t true; at some point there is a limit to that capacity, even if it is very high.
If the disaster were large enough to affect, say, the East Coast of the United States, and therefore push customers to redeploy their systems on AWS, I would assume AWS could take the following measures, provided that organizations decide to deploy in a particular Region or two:
- I would assume that AWS would immediately inventory the instances and services that it consumes itself, identifying those that are not critical and can be shut down to allow incoming organizations to power on and use services.
- I would almost guarantee that AWS’ Spot Instances would immediately be powered off, and there are a lot of those Spot Instances available. This would free up a significant amount of extra capacity in short order.
- Depending on the severity of the outage or expected resource usage, Amazon.com (being one of the bigger consumers of AWS) would also power down certain services and instances that serve the affected area, considering that business would be affected. If not power down, it would at least reduce the number of instances, as their business would inevitably be affected and, therefore, they would not need that capacity.
- If need be, I would also suspect that AWS would contact clients on the platform, asking them to shut down non-critical instances or reduce instances, if possible.
- As a final resort, I would not be surprised if AWS, facing a capacity “drought,” would consider helping organizations deploy on another cloud provider’s infrastructure, similar to how airlines will lean on each other in times of need. Of course, that is a final, desperate measure, but if the disaster is that grave, I am sure the greater good would be put first.
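To make the ordering of those measures concrete, here is a toy sketch of how such a capacity “waterfall” might be reasoned about. Everything in it is an assumption for illustration only: the `Measure` class, the `plan_reclamation` function, and especially the capacity figures are hypothetical and do not reflect real AWS numbers or any actual AWS procedure.

```python
from dataclasses import dataclass


@dataclass
class Measure:
    name: str
    freed_units: int  # hypothetical capacity units this measure might release


def plan_reclamation(demand, measures):
    """Apply measures in order until demand is covered.

    Returns (names of measures applied, remaining shortfall).
    """
    applied = []
    remaining = demand
    for measure in measures:
        if remaining <= 0:
            break  # demand already covered; later measures are not needed
        applied.append(measure.name)
        remaining -= measure.freed_units
    return applied, max(remaining, 0)


# Ordered from least to most disruptive; all figures are made up.
MEASURES = [
    Measure("shut down the provider's own non-critical workloads", 100),
    Measure("reclaim Spot Instances", 300),
    Measure("scale down retail services for the affected area", 150),
    Measure("ask customers to stop non-critical instances", 200),
]

applied, shortfall = plan_reclamation(500, MEASURES)
big_applied, big_shortfall = plan_reclamation(1000, MEASURES)
```

In this made-up scenario, a surge of 500 units is absorbed by the first three measures, while a surge of 1,000 units exhausts all four and still leaves a shortfall. That leftover shortfall is the point at which the last-resort option, leaning on another provider, would come into play.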
What would Azure do? Well, for the sake of not being repetitive: Microsoft could adopt very similar measures, and it could also reduce the instances consumed by its other businesses (Xbox comes to mind immediately).
My analysis is based on a disaster we have not yet seen, and I am not considering the social factor. In the event of that perfect storm, I think that the cloud service providers still have better options at their disposal than what would be available to enterprises not leveraging the cloud today.