Randy Bias has blogged about Amazon mandating instance reboots for hundreds, perhaps thousands, of instances (Amazon’s term for VMs). Affected instances seem to be scheduled for reboots over the next couple of weeks. Speculation is that the reboots are to patch a recently-reported vulnerability in the Xen hypervisor, which is the virtualization technology that underlies Amazon’s EC2. The GigaOm story gives some links, and the CRN story discusses customer pain.
Maintenance reboots are not new on Amazon, and are detailed on Amazon’s documentation about scheduled maintenance. The required reboots this time are instance reboots, which are easily accomplished — just point-and-click to reboot on your own schedule rather than Amazon’s (although you cannot delay past the scheduled reboot). Importantly, instance reboots do not result in a change of IP address nor do they erase the data in instance storage (which is normally non-persistent).
For some customers, of course, a reboot represents a headache, and it results in several minutes of downtime for that instance. Also, since this is peak retail season, it is already a sensitive, heavy-traffic time for many businesses, so the timing of this widespread maintenance is problematic for many customers.
However, cloud IaaS isn’t magical. If these customers were using dedicated hosting, they would still be subject to mandated reboots for security patches — hosting providers generally offer some flexibility on scheduling such reboots, but not aa lot (and sometimes none at all if there’s an exploit in the wild). If these customers were using a provider that uses live migration technology (like VMotion on a VMware-virtualized cloud), they might be spared reboots for system reasons, but they might still be subject to reboots for mandated operating system patches.
Given that what’s underlying EC2 are ordinary physical servers running virtualization without a live migration technology in use, customers should reasonably expect that they will be subject to reboots — server-level (what Amazon calls a system reboot), as well as instance-level — and also anticipate that they may sometimes need to reboot for their own guest OS patches and the like (assuming that they don’t simply patch their AMIs and re-launch their instances, arguably a more “cloudy” way to approach this problem).
What makes this rolling scheduled maintenance remarkable is its sheer scale. Hosting providers typically have a few hundred customers and a few thousand servers. Mass-market VPS hosters have lots of VPS containers, but there’s a roughly 1:1 VPS:customer ratio and a small-business-centricity that doesn’t lead to this kind of hullabaloo. Amazon’s largest competitor is estimated to be around the 100,000 VM mark. Only the largest cloud IaaS providers have more than 2,000 VMs. Consequently, this involves a virtually unprecedented number of customers and mission-critical systems.
Amazon has actually been very good about not taking down its cloud customers for extended maintenance windows. (I can think of one major Amazon competitor that took down one whole data center for an eight-hour maintenence evidently involving a total outage this past weekend, and which regularly has long-downtime maintenance windows in general.) A reboot is an inconvenience, but if you are running production infrastructure, you should darn well think about how to handle the occasional reboot, including reboots that affect a significant percentage of your infrastructure, because reboots are not likely to go away in IaaS anytime soon.
To hammer on the point again: Cloud IaaS is not magical. It still requires management, and it still has some of the foibles of both physical servers and non-cloud virtualization. Being able to push a button and get infrastructure is nice, but the responsibility to manage that infrastructure doesn’t go away — it’s just that many cloud customers manage to delay the day of reckoning when the attention they haven’t paid to management comes back to bite them.
If you run infrastructure, regardless of whether it’s in your own data center, in hosting, or in cloud IaaS, you should have a plan for “what happens if I need to mass-reboot my servers?” because it is something that will happen. And add “what if I have to do that immediately?” to the list, because that is also something that will happen, because mass exploits and worms certainly have not gone away.
(Gartner clients only: Check out a note by my security colleagues, “Address Concentration Risk in Public Cloud Deployments and Shared-Service Delivery Models to Avoid Unacceptable Losses“.)