I managed to miss the excitement of yet another weather-related disaster by being in Japan during the the derecho incident that knocked out power to approximately 3 million customers, including my parents in Ohio and our house in Virginia, 400 miles away.
Another disaster that I just missed took place here in Japan, a major failure at a large and prominent cloud service provider, FirstServer. While this was extremely inconvenient for the approximately 5000 customers (1/10 of the total) who experienced unrecoverable data loss, it provided a great example to cite in my Disaster in the Cloud presentation, which I had been scheduled to present multiple times. Off the incident radar outside of the Japanese-speaking world, Gartner’s Tokyo-based security and consulting staff explained that this June 20th failure was a high-profile data loss event at a prominent national provider.
Like the gmail outage early last year, which required 4 days to recover service for what Google described as constituting only .02% of their user base, and an AWS incident about the same time that resulted in some permanent loss of data, the FirstServer incident was the result of a software upgrade.
It is hardly surprising that live upgrades of clouds sometimes result in failures. Replacing a code module within a running service is the equivalent of transplanting an organ without any anesthetic. Not only does the patient not have any anesthetic, the patient isn’t even lying down. The operation takes place while the patient is hard at work, performing heavy lifting on behalf of thousands of tenants simultaneously.
The economics of cloud computing drive service providers to make frequent software upgrades, and to do it without any downtime. Success in a highly-leveraged market is dependent upon providing as many forms of service to as many tenants as possible, virtually ensuring that most commercial clouds will be highly dynamic collections of interdependent software modules. The large number of customers makes it difficult or impossible to schedule downtime.
The good news is that a huge number of live upgrades take place without any negative impact. Failures have been relatively uncommon, and most have not yet resulted in unrecoverable data loss.
Cloud service buyers should ask current and potential providers about their software upgrade practices, but should not expect any agreement on the best practices for evaluating any particular provider’s practices. One thing I am confident about, however useful an ISO 27000 certification or a SOC2 or SOC3 assessment may be, it is unlikely to go into much depth on this increasingly critical topic.