How long does it take to reboot a cloud?

By Jay Heiser | May 10, 2011

security, risk management, Cloud

Commercial cloud computing raises two significant disaster recovery issues:

  • What is the cloud provider’s ability to recover their own services?
  • What is the enterprise’s ability to obtain an alternative to a vendor that can’t recover itself?

To the extent that cloud computing actually exists, and actually is a new model, we have to consider that traditional forms of BCP/DR, and the ways we validate their existence and efficacy, may need to be reinterpreted, or that new practices and even vocabularies may be necessary. As a parallel, I think it’s fair to say that traditional IT security concepts and the related business requirements are still perfectly valid, but the degree to which traditional forms of risk assessment and testing can be applied has been reduced by this new model. The security domain doesn’t need to reinvent the wheel to respond to cloud computing, but there is every reason to think that current wheel designs (and tires, for that matter) are less than optimal for this new task. It is almost certainly the case that other IT risk domains also need to consciously consider how to apply their old concepts to this new computing style.

The practical implication of cloud ambiguity is a wilful lack of attention to architectural and build issues, with relatively greater attention paid to operational processes as a subconscious form of compensation. The light shines strongly on operations, so that’s where everyone looks. The problem is that our understanding of what constitutes an appropriate set of processes is based upon the requirements of a single host using a familiar operating environment. I can make a Unix or Windows box as secure as you want, and I can back it up out the whing whang and have a high degree of confidence that, come what may, I can restore service within an expected time frame.

In contrast, I have no basis for determining the propensity to fail, in either a confidentiality or a data availability sense, of a proprietary environment based on hundreds of thousands of servers in three dozen data centers, tenanted by millions of users of hundreds of applications and services.

There are clear fault-tolerance advantages to most commercial cloud services. A small to medium business, or a small business unit in a business of any size, can easily obtain a highly reliable level of service at a relatively low cost, and it can be done quickly and conveniently.

What is not the least bit clear is the relative ability of any Cloud Service Provider to restore your data into their services in cases in which their high-availability and fault-tolerance mechanisms do not protect your data. Indeed, in certain instances, the fault-tolerance mechanisms can cause an auto-immune failure that virtually ensures that every live copy of your data will be impacted.

The fact that it took Google 4 days to restore the data of 0.02% of the users of a single service is a sobering one to me. Likewise, Amazon required 4 days to recover from a limited outage, and they were never able to get all the data back. Should the buyers of such a service expect a linear relationship between the restoration time and the amount of data lost? Would it truly take 200 days to restore the data for just 1% of Gmail users? Forgive me if I’m committing BCP/DR sacrilege, but I fail to see the utility of the letters RTO in this situation.
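
For what it’s worth, the 200-day figure is nothing more than straight-line scaling of the numbers above. Here is a minimal Python sketch of that arithmetic, assuming restoration time grows in direct proportion to the share of users affected. The function name and its parameters are hypothetical, made up purely for illustration, and the linear model itself is exactly the assumption being questioned.

```python
def linear_restore_days(observed_days, observed_pct, target_pct):
    """Scale an observed restoration time linearly with the share of users affected.

    Assumption (not a claim about any real provider): restoration time is
    directly proportional to the percentage of users whose data was lost.
    """
    if observed_pct <= 0:
        raise ValueError("observed_pct must be positive")
    return observed_days * (target_pct / observed_pct)


# Figures from the post: 4 days to restore 0.02% of users; ask about 1%.
days = linear_restore_days(observed_days=4, observed_pct=0.02, target_pct=1.0)
print(f"{days:.0f} days")  # prints "200 days"
```

Nothing here says real recovery scales this way. It might be far better or far worse, which is precisely why a buyer cannot read a recovery time off a provider’s track record.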

I choose to continue to believe that the acronym RTO is a way to express a business requirement, and potential vendors shouldn’t be telling potential buyers what their business requirements are. It’s even worse when cloud service providers suggest that their levels of fault tolerance are so good that traditional forms of ‘recovery’ are no longer relevant. One BCP/DR concept that should never be lost in the cloud is the need for contingency planning.
