Blog post

Multicloud failover is almost always a terrible idea

By Lydia Leong | October 14, 2021 | 0 Comments

InfrastructureCloud Computing for Technical ProfessionalsCloud Computing

Most people — and notably, almost all regulators — are entirely wrong about addressing cloud resilience through the belief that they should do multicloud failover because, as I noted in a previous blog post,  the cloud is NOT just someone else’s computer. (I have been particularly aghast at a recent Reuters article about the Bank of England’s stance.)

Regulators, risk managers, and plenty of IT management largely think of AWS, Azure, etc. as monolithic entities, where “the cloud” can just break for them, and then kaboom, everything is dead everywhere worldwide. They imagine one gargantuan amorphous data center, subject to all the problems that can afflict single data centers, or single systems. But that’s not how it works, that’s not the most effective way to address risk, and testing the “resilience of the provider” (as a generic whole) is both impossible and meaningless.

I mean, yes, there’s the possibility of the catastrophic failure of practically any software technology. There could be, for instance, a bug in the control systems of airplanes from fill-in-the-blank manufacturer that could be simultaneously triggered at a particular time and cause all their airplanes to drop out of the sky simultaneously. But we don’t plan to make commercial airlines maintain backup planes from some other manufacturer in case it happens. Instead, we try to ensure that each plane is resilient in many ways — which importantly addresses the most probable forms of failure, which will be electrical or mechanical failures of particular components.

Hyperscale cloud providers are full of moving parts — lots of components, assembled together into something that looks and feels like a cohesive whole. Each of those components has its own form of resilience, and some of those components are more fragile than others. Some of those components are typically operating well within engineered tolerances. Some of those components might be operating at the edge of those tolerances in certain circumstances — likely due to unexpected pressures from scale — and might be extra-scary if the provider isn’t aware that they’re operating at that edge. In addition to fault-tolerance within each component, there are many mechanisms for fault-tolerance built into the interaction between those components. 

Every provider also has its own equivalent of “maintenance” (returning to the plane analogy). The quality of the “mechanics” and the operations will also impact how well the system as a whole operates.  (See my previous blog post, “The multi-headed hydra of cloud resilience” for the factors that go into provider resilience.)

It’s not impossible for a provider to have a worldwide outage that effectively impacts all services (rather than just a single service).  Such outages are all typically rooted in something that prevents components from communicating with each other, or customers from connecting to the services — global network issues, DNS, security certificates, or identity. The first major incident of this type was the 2012 Azure leap year outage. The 2019 Google “Chubby” outage had global network impact, including on GCP. There have been multiple Azure AD outages with broad impact across Microsoft’s cloud portfolio, most recently the 2021 Azure Active Directory outage. (But there are certainly other possibilities. As recently as yesterday, there was a global Azure Windows VM outage that impacted all Windows VM-dependent services.)

Provider architectural and operational differences do clearly make a difference. AWS, notably, has never had a full regional failure or a global outage. The unique nature of GCP’s global network has both benefits and drawbacks. Azure has been improving steadily in reliably over the years as Microsoft addresses both service architecture and deployment (and other operations) processes.

Note that while these outages can be multi-hour, they have generally been short enough that — given typical enterprise recovery-time objectives for disaster recovery, which are often lengthy — customers typically don’t activate a traditional DR plan. (Customers may take other mitigation actions, i.e. failover to another region, failover to an alternative application for a business process, and so forth.)

Multicloud failover requires that you maintain full portability between two providers, which is a massive burden on your application developers. The basic compute runtime (whether VMs or containers) is not the problem, so OpenShift, Anthos, or other “I can move my containers” solutions won’t really help you. The problem is all the differentiators — the different network architectures and features, the different storage capabilities, the proprietary PaaS capabilities, the wildly different security capabilities, etc. Sure, you can run all open source in VMs, but at that point, why are you bothering with the cloud at all? Plus, even in a DR situation, you need some operational capabilities on the other cloud (monitoring, logging, etc.), even if not your full toolset.

Moreover, the huge cost and complexity of a multicloud implementation is effectively a negative distraction from what you should actually be doing that would improve your uptime and reduce your risks, which is making your applications resilient to the types of failure that are actually probable. More on that in a future blog post.

Comments are closed