
Designing to fail

By Lydia Leong | December 03, 2010 | 2 Comments


Cloud-savvy application architects don’t do things the same way that they’re done in the traditional enterprise.

Cloud applications assume failure. That is, well-architected cloud applications assume that just about anything can fail. Servers fail. Storage fails. Networks fail. Other application components fail. Cloud applications are designed to be resilient to failure, and they are designed to be robust at the application level rather than at the infrastructure level.

Enterprises, for the most part, design for infrastructure robustness. They build expensive data centers with redundant components. They buy expensive servers with dual-everything in case a component fails. They buy expensive storage and mirror their disks. And then whatever hardware they buy, they need two of. All so the application never has to deal with the failure of the underlying infrastructure.

The cloud philosophy is generally that you buy dirt-cheap things and expect they’ll fail. Since you’re scaling out anyway, you expect to have a bunch of boxes, so that any box failing is not an issue. You protect against data center failure by being in multiple data centers.
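To make the "any box failing is not an issue" idea concrete, here is a minimal sketch of application-level failover across a pool of cheap replicas. The replica names, `call_with_failover`, and `fake_request` are hypothetical and exist only for illustration; a real system would also need health checks and load balancing.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


class AllReplicasFailed(Exception):
    """Raised when every replica in the pool has failed."""


def call_with_failover(replicas: Sequence[str], request: Callable[[str], T]) -> T:
    """Try each replica in order and return the first successful result."""
    errors = []
    for replica in replicas:
        try:
            return request(replica)
        except Exception as exc:  # a single box failing is expected, so move on
            errors.append((replica, exc))
    raise AllReplicasFailed(errors)


# Hypothetical stand-in for a real network call: "box-a" is down.
def fake_request(replica: str) -> str:
    if replica == "box-a":
        raise ConnectionError("box-a is unreachable")
    return f"served by {replica}"


result = call_with_failover(["box-a", "box-b", "box-c"], fake_request)
print(result)
```

The point of the sketch is that the robustness lives in the calling code, not in the hardware: losing one box costs a retry rather than an outage.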

Cloud applications assume variable performance. Well-architected cloud applications don’t assume that anything is going to complete in a certain amount of time. The application has to deal with network latencies that might be random, storage latencies that might be random, and compute latencies that might be random. The principle behind a distributed application of this sort is that just about anything that you’re talking to can mysteriously drop off the face of the Earth at any point in time, or at least not get back to you for a while.
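One common way to code for "might not get back to you for a while" is to put a deadline on every remote call and retry with backoff. The following is a rough sketch using Python's asyncio, not a prescription; `call_with_deadline`, `make_flaky_store`, and all the timing values are assumptions chosen for illustration.

```python
import asyncio
import random
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def call_with_deadline(
    op: Callable[[], Awaitable[T]],
    timeout: float = 0.05,
    attempts: int = 3,
    base_delay: float = 0.01,
) -> T:
    """Run op with a per-attempt timeout, retrying with jittered backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return await asyncio.wait_for(op(), timeout)
        except (asyncio.TimeoutError, ConnectionError):
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller decide what to do
            # Exponential backoff with jitter so retries do not synchronize.
            await asyncio.sleep(base_delay * (2 ** attempt) * random.random())
    raise RuntimeError("unreachable")


def make_flaky_store(slow_calls: int) -> Callable[[], Awaitable[str]]:
    """Hypothetical storage read that stalls on its first `slow_calls` calls."""
    calls = {"n": 0}

    async def read() -> str:
        calls["n"] += 1
        if calls["n"] <= slow_calls:
            await asyncio.sleep(1.0)  # far longer than the timeout
        return f"data after {calls['n']} calls"

    return read


value = asyncio.run(call_with_deadline(make_flaky_store(slow_calls=2)))
print(value)
```

Here the first two reads stall and are abandoned at the deadline, and the third succeeds. If every attempt times out, the error propagates so the caller can degrade or fail explicitly instead of hanging.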

Here’s where it gets funkier. Even most cloud-savvy architects don’t build applications this way today. This is why people howl about Amazon’s storage back-end for EBS, for instance. They’re used to consistent and reliable storage performance, and EBS isn’t built that way. Most applications are built with the assumption that seemingly local standard I/O is functionally local, and therefore totally reliable and high-performance. This is why people twitch about VM-to-VM latencies, although at least here there’s usually some application robustness (since people are more likely to architect with network issues in mind). This is the kind of problem things like Node.js were created to solve (don’t block on anything, and assume anything can fail), but it’s also a type of thinking that’s brand-new to most application architects.
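The "don’t block on anything, and assume anything can fail" style can be sketched roughly as follows. This uses Python's asyncio rather than Node.js, as an analogous event-loop model; the service names, latencies, and the `fetch` and `build_page` helpers are all hypothetical.

```python
import asyncio


async def fetch(name: str, delay: float, fail: bool = False) -> str:
    """Hypothetical dependency call with simulated latency and failure."""
    await asyncio.sleep(delay)
    if fail:
        raise ConnectionError(f"{name} is down")
    return f"{name}: ok"


async def build_page(timeout: float = 0.1) -> dict:
    calls = {
        "profile": fetch("profile", 0.01),
        "recommendations": fetch("recommendations", 1.0),  # too slow
        "ads": fetch("ads", 0.01, fail=True),  # fails outright
    }
    # Start everything at once; no call blocks the others.
    results = await asyncio.gather(
        *(asyncio.wait_for(c, timeout) for c in calls.values()),
        return_exceptions=True,
    )
    # Replace anything that failed or timed out with a fallback value.
    return {
        name: ("unavailable" if isinstance(r, Exception) else r)
        for name, r in zip(calls, results)
    }


page = asyncio.run(build_page())
print(page)
```

In this sketch the slow dependency and the failed dependency each degrade to a fallback, and the page is still assembled from whatever responded in time.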

Performance is actually where the real problems occur when moving applications to the cloud. Most businesses that are moving existing apps can deal with the infrastructure issues, and indeed many cloud providers (generally the VMware-based ones) use clustering, live migration, and so forth to present users with a reliable infrastructure layer. But most existing traditional enterprise apps don’t deal well with variable performance, and that’s a problem that will be much trickier to solve.

The Gartner Blog Network provides an opportunity for Gartner analysts to test ideas and move research forward. Because the content posted by Gartner analysts on this site does not undergo our standard editorial review, all comments or opinions expressed hereunder are those of the individual contributors and do not represent the views of Gartner, Inc. or its management.



  • Kurt Cagle says:


    A very perceptive article, and one that, having designed for both cloud applications and enterprise ones, I’d definitely concur with. I would argue that this particular phenomenon is one of the reasons that there has been a quiet but increasingly noticeable migration away from large scale SOA based transactional systems (which are notorious for issues such as bottlenecking, cumulative latency and mid-chain “breaks” that can be almost impossible to debug properly) and towards more resource-centric RESTful services and syndicated architectures with the cloud.

    Most such RESTful services are built on the idea that it is better to publish notifications that resource collections have changed, so that participants can assemble the relevant resources locally, do their own processing, and then publish indications that their own resources have changed, than it is to push such content from processing service to processing service. The architecture can make the processing code a little more complex for each participant, but it also scales quite naturally as more participants enter the network.

    The central difference between the two approaches is that the first (traditional enterprise SOA services) is ultimately a linear pipeline process where risk increases proportionately to the longest potential chain in the distributed pipeline, while in the latter case (RESTful services), the risk of failure is actually mitigated as the number of potential redundant routes for data caching (which is a function of the interconnectedness of the network and the number of participants) increases.

  • Hi Lydia,

    How exactly does Node.js solve the problem, other than by breaking up a particular sync request/job into a series of async tasks, when in fact for enterprise applications it’s the completion of the user task that dominates the perception of performance? There still needs to be some terminal join point in the operations, at least from the end-user perspective. I am also curious how applicable such a solution is when we consider that eventually the cloud will be built to fully support service supply chains (exchanges, brokerages,…)? Sorry, I don’t see this being anywhere near a silver bullet, not in the enterprise space, but maybe it will eventually be a viable alternative for those who would typically build some web front end with Rails.

    I agree wholeheartedly that performance monitoring & management is of great importance when moving to the cloud. In fact, I am betting my company on it, but I think we first need to focus more attention on standardization of metering information up and down the stack and across service interaction points & cloud metering/access points.

    We also need to make runtimes & services cost aware (cost=latency, financial liability, capacity leasing,…). This is CARS.

    We also need a truly scalable approach to this problem, not just at runtime but in terms of the many perspectives of the different parties involved in service/app performance delivery.

    We also need to look to active take control