How Systems Complexity Reduces Uptime

By Ephraim Baron | May 05, 2021 | 2 Comments

The pace of change in IT is unlike that of any other discipline. Even dog years are long by comparison. It’s an environment that makes us so primed for what comes next that we often don’t pause to ask why. This is very much the case with cloud computing, where cloud is the sine qua non of any CIO’s strategy and leads to vague approaches such as ‘Cloud First’.

This phenomenon also applies to application architecture, where microservices and composability are all the rage. The underlying rationale makes sense: break complex applications into function-specific components and assemble the pieces you need. After all, this is what software engineers do; they create and implement libraries of functionality that can be assembled in almost limitless ways. The result is an integrated collection of components that is more elegant than the monolithic application it replaces.

Technical elegance, though, isn’t always better. To illustrate why, we need to turn to probability – not the complex stuff like Bayesian decision theory or even Poisson distributions. I’m talking about availability, which in IT means the percentage of time a system is able to perform its function. A system that is 99.9% (‘three-nines’) available can perform its function all but 0.1% of the time. Pretty good, right? Of course, the answer to that question is: it depends. Some applications are fine with three-nines. For others, four-nines, five-nines, and even higher are more appropriate. How would you feel, for instance, about getting on an airplane if there were a 1-in-1000 chance of “downtime”?

From Component to Systems Availability

The aggregate availability of a system whose components all depend on one another is the product of the availabilities of its components, assuming they fail independently. For example, the availability of a system with 3 interdependent components, each having 99.9% availability, is 99.9% x 99.9% x 99.9% ≈ 99.7%. The following figure illustrates the availability of a system with multiple components, each with identical availability. The number of system components is shown on the x-axis, and the aggregate system availability is shown on the y-axis (plotted on a logarithmic scale). The maximum potential monthly downtime for a given availability level is included on the second y-axis.

Chart of system availability as a function of number of components
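
The relationship behind the chart can be sketched in a few lines of Python. This is a minimal illustration, not the method used to produce the figure: the function names are hypothetical, failures are assumed to be independent, and a month is assumed to have 30 days.

```python
# Illustrative sketch: availability of a system that needs ALL of its
# identical components to be up (failures assumed independent).

MINUTES_PER_MONTH = 30 * 24 * 60  # assumption: a 30-day month


def system_availability(component_availability: float, n_components: int) -> float:
    """Product of n identical component availabilities."""
    return component_availability ** n_components


def max_monthly_downtime_minutes(availability: float) -> float:
    """Maximum potential downtime per month implied by an availability level."""
    return (1.0 - availability) * MINUTES_PER_MONTH


if __name__ == "__main__":
    for n in (1, 3, 10, 30, 100):
        a = system_availability(0.999, n)
        print(f"{n:>3} components: {a:.3%} available, "
              f"up to {max_monthly_downtime_minutes(a):.0f} min/month of downtime")
```

Running it for a growing number of three-nines components shows how quickly the aggregate figure erodes.
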
For component availabilities greater than ~99%, an even simpler approximation applies:
Ac = component availability
Uc = component unavailability
As = system availability
Us = system unavailability
Nc = number of components
Uc = 1 – Ac                   [example: Uc = 1 – 99.9% = 0.1%]
Us ≈ Nc · Uc                  [example: Us ≈ 10 · 0.1% = 1.0%]
As = 1 – Us ≈ 1 – Nc · Uc     [example: As ≈ 1 – 10 · 0.1% = 99.0%]
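
To see how close the linear approximation comes to the exact product, the two can be compared directly. The sketch below is illustrative only; the function names are hypothetical, and it assumes identical, independently failing components.

```python
# Illustrative comparison of the exact product with the linear approximation
# (system availability ~= 1 - N * component unavailability).


def exact_availability(component_availability: float, n_components: int) -> float:
    """Exact series availability: the product of identical availabilities."""
    return component_availability ** n_components


def approx_availability(component_availability: float, n_components: int) -> float:
    """Linear approximation: 1 minus N times the component unavailability."""
    component_unavailability = 1.0 - component_availability
    return 1.0 - n_components * component_unavailability


if __name__ == "__main__":
    for n in (3, 10, 50):
        exact = exact_availability(0.999, n)
        approx = approx_availability(0.999, n)
        print(f"{n:>2} components: exact {exact:.4%}, approx {approx:.4%}, "
              f"difference {exact - approx:.6%}")
```

The approximation is slightly pessimistic (it never overstates availability), and the gap widens as the number of components or the component unavailability grows.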

Implications for product managers, solutions architects, and CTOs

The point of this article is not to condemn composable applications; it’s to encourage thoughtful, intentional design. Just because everyone seems to be doing something doesn’t mean you must. In the case of application and service architecture, this means viewing the system holistically. In support, I offer the following:

Occam’s razor: entities should not be multiplied unnecessarily

Albert Einstein: Everything should be made as simple as possible, but no simpler

KISS principle: Keep It Simple …

Key Takeaways

  1. Design for component failure and minimize its impact.
  2. Make components decoupled and asynchronous whenever possible so that loss of one component does not cascade to others.
  3. Make critical components redundant and automatically scalable.
  4. Avoid stateful components whenever possible.
  5. Understand service interdependencies and failure modes.
  6. Eliminate unnecessary complexity (KISS).
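
As a rough illustration of why takeaway 3 matters, the sketch below estimates the effect of running redundant copies of each component. It uses the standard parallel-availability formula, which is not from this article, and it rests on strong assumptions: replicas fail independently and failover is instantaneous and perfect. The function names are hypothetical.

```python
# Illustrative sketch: effect of redundancy on system availability.
# Assumptions: replicas fail independently; failover is instant and perfect.


def redundant_availability(component_availability: float, replicas: int) -> float:
    """A redundant group is down only when every replica is down at once."""
    return 1.0 - (1.0 - component_availability) ** replicas


def system_with_redundancy(component_availability: float,
                           n_components: int,
                           replicas: int) -> float:
    """Series system of n components, each deployed as a redundant group."""
    return redundant_availability(component_availability, replicas) ** n_components


if __name__ == "__main__":
    for replicas in (1, 2, 3):
        a = system_with_redundancy(0.999, 10, replicas)
        print(f"10 components, {replicas} replica(s) each: {a:.5%} available")
```

Real systems rarely meet these assumptions (shared dependencies and correlated failures are common), so the numbers are best read as an upper bound on what redundancy alone can deliver.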

The Gartner Blog Network provides an opportunity for Gartner analysts to test ideas and move research forward. Because the content posted by Gartner analysts on this site does not undergo our standard editorial review, all comments or opinions expressed hereunder are those of the individual contributors and do not represent the views of Gartner, Inc. or its management.

2 Comments

  • Steve Murphy says:

    Ephraim,

    This was a great article. Do you have any specifics on the Key Takeaways items?

    Thanks!

    • Ephraim Baron says:

      The main message of this article is to balance modularity with operability. I recently wrote a research note I called Planet of the Ops that focuses on systems engineering principles and all the non-functional requirements, or “ilities”, that are necessary for production-readiness. Designing for operability – across all functional areas – makes a system / product / service appealing to a wider audience.
      Happy to discuss further.
      Ephraim