Over the last few days, two major events, one of international relevance and one limited to my own country (Italy), have shown that the web is fallible and that those who claim the cloud will give us commodity, always-available, scalable and cheap IT resources will have to reassess their beliefs.
The first incident, which is well described in a post by my colleague Lydia Leong, affected Amazon cloud services. It originated from a network problem, which in turn triggered the re-mirroring of a large number of volumes and impacted one of the availability zones (AZs) into which Amazon segments its regions.
The second incident, which happened today, affected Aruba, a web hosting company headquartered in central Italy that also provides some government-sanctioned services (such as certified email). There was a fire in the room where the UPS units were located, and it apparently affected the whole server room.
In both cases, clients who relied on only one Amazon AZ, or only on Aruba for web hosting, were affected. Those who designed their applications with resiliency in mind, e.g. by using multiple data centers, AZs or external service providers, emerged from these incidents unscathed.
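To make the idea of spreading an application across several locations concrete, here is a minimal sketch of client-side failover. The endpoint names and the `simulated_probe` function are hypothetical placeholders of mine, not anything Amazon or Aruba provides; a real deployment would replace the probe with an actual health check and would usually rely on DNS or load-balancer failover as well.

```python
from typing import Callable, Iterable, Optional


def first_available(
    endpoints: Iterable[str], is_healthy: Callable[[str], bool]
) -> Optional[str]:
    """Return the first endpoint whose health check passes, or None if all fail."""
    for endpoint in endpoints:
        try:
            if is_healthy(endpoint):
                return endpoint
        except Exception:
            # A probe that raises is treated the same as an unhealthy endpoint.
            continue
    return None


# Hypothetical deployment: the same application in two zones plus a second provider.
ENDPOINTS = [
    "zone-a.example.com",
    "zone-b.example.com",
    "other-provider.example.net",
]


def simulated_probe(endpoint: str) -> bool:
    # Stand-in for a real health check; here zone A is pretended to be down.
    return endpoint != "zone-a.example.com"


chosen = first_available(ENDPOINTS, simulated_probe)
print(chosen)
```

The point of the sketch is only that the choice of where to send traffic is made by the client of the service, not by the provider: if every entry in the list lives in the same zone or with the same hosting company, the function has nothing to fall back on.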
While these incidents will probably raise a wave of concern about cloud computing and provide additional ammunition to those who are resisting this trend on the basis of security and reliability issues, they are very welcome. In fact, they show that the basic principles that drove application design for availability and reliability in the past do not change because of the cloud. Just as one would not run an application on a single server without any backup or recovery mechanism, one has to apply the same concepts when using cloud services. Sure, cloud services are likely to fail less often than one’s own data center, because they run on global-class infrastructure, but they can still fail and, as shown by the recent events, they will.
While it is important to maintain pressure on service providers to improve their reliability, the onus of developing or contracting reliable systems stays with their clients. There won’t be any miraculous cloud that provides 100% uptime or that never risks failing to meet its own SLAs.
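A back-of-the-envelope calculation illustrates why redundancy helps but never reaches 100%. The sketch below assumes a hypothetical 99.9% availability per location and, more importantly, assumes that locations fail independently of each other, which real incidents often violate. The figure is illustrative only and is not taken from any provider's SLA.

```python
def combined_availability(single: float, copies: int) -> float:
    """Availability of `copies` replicas, each up a fraction `single` of the time.

    Assumes failures are independent, so the system is down only when
    every replica is down at the same moment.
    """
    if not 0.0 <= single <= 1.0:
        raise ValueError("single must be between 0 and 1")
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single) ** copies


HOURS_PER_YEAR = 365 * 24

# Hypothetical per-location availability, chosen only for illustration.
PER_LOCATION = 0.999

for copies in (1, 2, 3):
    availability = combined_availability(PER_LOCATION, copies)
    downtime_hours = (1.0 - availability) * HOURS_PER_YEAR
    print(f"{copies} location(s): {availability:.9f} available, "
          f"about {downtime_hours:.5f} hours of downtime per year")
```

Each additional location shrinks the expected downtime, but the result stays below 1.0 for any finite number of copies, and correlated failures make the real figure worse than this idealized one.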
Is this any different from before? Is the cloud a good enough reason to put all one’s eggs in one basket? I do not think so, and the more we look at the past and at how we have coped with similar issues in our own data centers or with outsourcers, the better off we will be in using cloud services.