Most enterprises strive for long periods of high network availability. On the surface, this makes sense because outages result in lost revenue (even bankruptcy), damaged brand reputation, and (sometimes) people getting fired.
However, the desire for long periods of high availability is fundamentally flawed. In a complex distributed system (such as an enterprise network), long periods of high availability mask underlying and inherent instability. Thus, the desire for long periods of uptime actually increases the likelihood of long-duration, high-impact outages. This happens because hidden problems accumulate untested, which makes cascading failures much more likely, and because we’re relatively unpracticed at dealing with the failures that do inevitably occur.
So what should we do? We have to fundamentally rethink the way we approach network downtime, from the ground up and from the top down. We need to apply antifragile practices to networking, in both network design and operations.
On the design side, for example, rather than buy two big chassis with 240 ports each, buy a bunch of fixed-form-factor switches. The “surface area” or “blast radius” of most issues and outages in that environment will consequently be smaller (think backplane issue, water leak, power failure, etc.).
On the operations side, we need to test much more. Almost all enterprise networks have HA firewalls, HA load balancers, multiple internal links, and redundant interfaces to external providers. But how often do we test them failing over (and back)? Never? Quarterly? Yearly? And when they do fail, how often does failover go exactly as planned? About 70% of the time, in my experience. We should be testing daily or weekly, not quarterly or yearly. Sure, you can’t just do all that testing at noon on a Tuesday, because we have to coordinate with the business. But it is worth it: the more we test, the more likely we are to find the 30% of failovers that don’t go as planned, which reduces the likelihood of a cascading failure and makes us better practiced at troubleshooting.
It’s all about planning for failure, failing small, and failing fast, versus striving to never fail. Outages are inevitable, and when they occur, we need blameless postmortems (not scapegoats).
This is just a small piece of how we need to rethink dealing with network downtime. It’s a topic I’m passionate about (too many middle-of-the-night calls), and here are some additional thoughts on network downtime: SDN doesn’t cure world hunger, Outages keep people up at night, and If you love 80s music (who doesn’t).
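To put rough numbers on the chassis-versus-fixed-switch point, here is a minimal Python sketch. The 48-port switch size and the `blast_radius` helper are assumptions for illustration only; the sketch simply counts the share of ports lost when one device fails and ignores uplinks, shared power, and other common dependencies.

```python
def blast_radius(total_ports: int, ports_per_device: int) -> float:
    """Fraction of all access ports lost when a single device fails."""
    # A device can never take down more ports than exist in the design.
    lost = min(ports_per_device, total_ports)
    return lost / total_ports


TOTAL_PORTS = 480  # two chassis x 240 ports each, as in the example

# Design A: two big chassis with 240 ports each.
chassis = blast_radius(TOTAL_PORTS, 240)

# Design B: fixed-form-factor switches (48 ports each is an assumption).
fixed = blast_radius(TOTAL_PORTS, 48)

print(f"Two chassis:    {chassis:.0%} of ports lost per device failure")
print(f"Fixed switches: {fixed:.0%} of ports lost per device failure")
```

Under these assumptions a single chassis failure takes out half the ports, while a single fixed switch failure takes out a tenth. The trade-off is more devices to manage, which is one more reason to practice failing them regularly.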
We cover this in more depth here: How to Reduce Network Downtime in the Era of Digital Business