I worked as a consultant in my university computer lab. We used UCSD Pascal, which had a very simple file system that relied upon logically contiguous blocks. Clumsy 101 students were capable of accidentally ‘deleting’ their data, and I figured out a way to restore it. It was dirt simple: you just created a new file of sufficient length, and when you opened it, it contained all the missing bits. This level of restoration was effective because the ASCII format of the data (Runoff formatted, at most) was highly resilient. As long as new data wasn’t stored on top of the old, recovery was almost always possible.
There are lots of downsides to a flat filesystem, which is why they are pretty rare these days. Unfortunately, the more complex the data file, the more complex the recovery. Windows and Unix had moved a long way from simple storage systems like UCSD’s. Files were virtualized, with a table pointing to all of their blocks. Recovery of deleted or corrupted data takes a lot more effort, but it is often still possible, although less reliably than on UCSD Pascal. So what are the dynamics of a large distributed cloud storage system? It almost certainly has to be the case that restoration becomes more complex and less likely to succeed as the size and complexity of the storage system grow. As it turns out, Amazon has again provided us with a recent test case in Ireland.
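To make the contrast concrete, here is a toy sketch in Python (not UCSD Pascal or any real filesystem; the disk layout, BLOCK_SIZE, and function names are all illustrative assumptions). It shows why a ‘deleted’ file is trivially recoverable when its blocks are logically contiguous, and why the same data is much harder to reassemble once it is scattered and reachable only through a lost block table.

```python
# Toy illustration (not any real filesystem): recovering a "deleted" file is
# easy when its blocks are logically contiguous, and harder when they are
# scattered and reachable only through a (possibly lost) block table.

BLOCK_SIZE = 4

# A tiny "disk" as a list of fixed-size blocks.
disk = [b"" for _ in range(16)]

def write_contiguous(data, start):
    """UCSD-style: store data in consecutive blocks beginning at `start`."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for offset, block in enumerate(blocks):
        disk[start + offset] = block
    return start, len(blocks)  # all you need to remember is start + length

def write_scattered(data, block_table):
    """Inode-style: store data in whichever blocks the table points to."""
    blocks = [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]
    for index, block in zip(block_table, blocks):
        disk[index] = block

# "Deleting" in both schemes just forgets the metadata; the blocks survive.
start, length = write_contiguous(b"hello world!", start=2)
write_scattered(b"HELLO WORLD!", block_table=[11, 5, 14])

# Contiguous recovery: read (or reallocate) a span of the right length and
# the old bytes reappear in order.
recovered = b"".join(disk[start:start + length])
print(recovered)  # b'hello world!'

# Scattered recovery: without the block table you only have fragments in
# unknown order; the data is there, but reassembly needs the lost metadata.
print(disk[11], disk[5], disk[14])
```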
The failure reads like For Want of A Nail:
- Lightning strikes
- Power from grid is lost (is it not possible to have redundant connections to the power grid?)
- Generators don’t function (see lightning above)
- Service stops (is it not possible to perform an orderly shutdown before the batteries are exhausted?)
- Amazon reports that a software flaw caused some of the snapshots to be inappropriately deleted (it remains uncertain how much data was lost and what degree of recovery will be possible)
Amazon’s status page is currently providing some news on the state of recovery that I think should be of interest to users of cloud services: “This process requires us to move and process large amounts of data, which is why it is taking a long time to complete, particularly for some of the larger volumes.” On August 7, the status site had explained: “Due to the scale of the power disruption, a large number of EBS servers lost power and require manual operations before volumes can be restored. Restoring these volumes requires that we make an extra copy of all data, which has consumed most spare capacity and slowed our recovery process.”
A common natural disaster strikes, the high availability mechanisms don’t work, a recovery mechanism turns out to be broken, and fixing it takes a long time… because it is a cloud.
It’s impossible to fully predict failure modes. The more complex a system, the more obscure the weaknesses, and the greater the potential for emergent negative behaviors. What can and should be predicted is that failures of all sizes will happen. Current and potential customers need to ask their provider just how long it would take to copy their data back onto live servers, and how long it would take to reconstruct and relink it so that it is returned to the desired recovery point.
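Since most of the delay Amazon describes is simply moving data, a first-order answer to that question is volume size divided by sustained copy throughput, plus a check that there is enough spare capacity to hold the extra copy the recovery process needs. A minimal back-of-envelope sketch in Python (the function names and the 1 TB / 40 MB/s figures are hypothetical, not anything Amazon has published):

```python
# Back-of-envelope restore estimate: how long to copy a volume back onto
# live servers at a given effective throughput, and whether there is even
# enough spare capacity to hold a full extra copy of the data.
# All numbers below are hypothetical inputs, not measured figures.

def restore_hours(volume_gb, effective_mb_per_s):
    """Hours to stream `volume_gb` back at a sustained `effective_mb_per_s`."""
    seconds = (volume_gb * 1024) / effective_mb_per_s
    return seconds / 3600

def can_make_extra_copy(data_to_copy_gb, spare_capacity_gb):
    """A recovery that duplicates all data needs room for a second copy."""
    return spare_capacity_gb >= data_to_copy_gb

if __name__ == "__main__":
    # Example: one 1 TB volume restored at an effective 40 MB/s takes ~7 hours;
    # a data center's worth of volumes multiplies that accordingly.
    print(f"{restore_hours(1024, 40):.1f} hours for one 1 TB volume")
    print("room for extra copy:", can_make_extra_copy(1024, 800))
```

The point of the arithmetic is not the particular numbers but that the answer scales linearly with how much data has to move, which is exactly the dynamic Amazon’s status updates describe.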
What can reliably be predicted is that the bigger something is, the harder the fallout. At some point, economies of scale no longer compensate for the concentration of risk. I don’t have any idea what that point is, but it should be a matter of burning importance to the sellers and buyers of services based around highly complex and scaled systems. Bigger may be better, but infinitely large can never be infinitely good.