Network Downtime

By Andrew Lerner | July 11, 2014 | 4 Comments

WAN, Networking, DDI

The most important aspect of nearly every network is availability. Performance, scalability, management, agility, etc. all require the network to actually be online. In conversations with Gartner clients, availability often comes up, and this is echoed in surveys we’ve done:

[Figure: survey results on drivers (image not available)]

I would argue that availability is higher than 20%, but clients don’t score it that way because a) it is assumed to be foundational across all vendors and hence is not perceived as a major differentiator, and/or b) all the hype around SDN has people focused on agility and orchestration. Availability is relatively boring compared to other “cool” stuff like SDN and disaggregation, unless you’re talking about the Netflix Chaos Monkey (more on that below).

When talking to clients, availability often comes up after an outage. In some cases, network outages drive significant investment from IT, particularly in the DDI market (“…we could never cost-justify DDI until we fat-fingered our public website A-record…”). The undisputed #1 cause of network outages is human error, with estimates as high as 32% according to Dimension Data’s 2014 Network Barometer report, not to mention a study from Avaya indicating that 82% of folks experienced network downtime due to human error. In my 16+ years running large corporate networks, there was no feeling worse than the post-mortem meeting after a big outage. I’ll never forget one particular meeting in which my CIO said “…well that was just plain stupid…”. Fortunately, this only happened to me a very small number of times. And, we have research that can help you avoid network outages, including:

Take A Four-Step NCCM Approach to Stem Disasters (Vivek Bhalla)

http://www.gartner.com/document/2251916

Summary: While businesses invest in their networks to gain a competitive edge, they often fail to ensure adequate steps are taken to reduce outages. Gartner’s four-step NCCM approach enables network staff to minimize infrastructure failure.

And, for those interested in designing WANs for availability:

Bandwidth Doesn’t Matter; Availability Drives Enterprise Network Costs (Neil Rickard, Danellie Young)

http://www.gartner.com/document/2549215

Summary: In the developed world, the marginal cost of bandwidth is so low that rightsizing capacity has little impact on WAN cost. However, the cost of improving availability remains high and downtime is less acceptable, making rightsizing network availability the key goal for enterprise network designers.

And a little more about antifragility (and the Netflix Chaos Monkey):

How Antifragile Practices Can Make Your I&O Stronger (Ian Head)

http://www.gartner.com/document/2690718

Summary: Antifragile systems turn stress and adversity into advantage. Certain practices of Web-scale IT enterprises may be emulated by other IT organizations to enhance their antifragility, especially as part of their continual improvement, DevOps and digital business initiatives.

Regards, Andrew

PS – If you have a really good and/or funny outage story, feel free to include it in the comments; a prize will be sent to the best one…

4 Comments

  • Andrew Lerner says:

    I will kick things off with outage stories here… My story begins as a wet-behind-the-ears 19-year-old intern in the mid 1990s at the headquarters (IT department) of a grocery chain.

    We ran Windows 3.1 off the network (yes, netboot), and users’ swap/temp files were getting really big and eating up network file space.

    One of the IT admins was working on a batch file that deleted these files upon login. She decided to have me (the intern) test it. So I walked over to the area where the semi-trucks check out (fleet maintenance) and logged in.

    Within 30 seconds all the computers were down. The outage persisted for a few hours. Lots of unhappy truckers and IT admins. Backups from tape were involved. Not fun.

    Their senior engineer finally figured it out. The script had been written to delete files from f:/temp, where tmp files were stored. However, two things combined: a) the script had a directory switch (cd) before the delete, and b) I had admin rights but no temp folder on that particular server.

    Thus, the following occurred:

    (default netboot drive is f:/)
    cd temp (which failed b/c I had no temp folder on that server)
    del *.* (which didn’t fail b/c I had admin rights)

    Since I did not have a temp folder on the server in the fleet maintenance department (and had admin rights on the root of f:/), all files on the server began deleting…
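
The failure mode in this story (a directory change that fails silently, followed by an unconditional delete) can be guarded against by checking that the target folder exists before removing anything. The following is only a minimal sketch in Python, not the original batch file; the function name clean_temp_dir and the folder name "temp" are hypothetical and chosen for illustration.

```python
import tempfile
from pathlib import Path


def clean_temp_dir(root: Path, subdir: str = "temp") -> int:
    """Delete regular files directly inside root/subdir and return the count.

    Hypothetical helper: it refuses to run if the folder is missing, instead
    of carrying on in whatever directory it happens to be in.
    """
    target = root / subdir
    if not target.is_dir():
        # This is the check the story's script lacked: stop, do not delete.
        raise FileNotFoundError(f"refusing to delete: {target} does not exist")

    removed = 0
    for entry in target.iterdir():
        if entry.is_file():
            entry.unlink()
            removed += 1
    return removed


if __name__ == "__main__":
    # Demo on a throwaway directory that has no "temp" folder.
    with tempfile.TemporaryDirectory() as scratch:
        root = Path(scratch)
        (root / "important.txt").write_text("do not delete")
        try:
            clean_temp_dir(root)
        except FileNotFoundError as err:
            print(err)
        print("important.txt still present:", (root / "important.txt").exists())
```

The design choice is to delete by explicit path (root/temp) rather than relying on the current directory, so a missing folder produces an error rather than a delete in the wrong place.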

  • I will say that management agility (ease of configuration), capacity, and performance all do have ties to availability. The number one cause of most issues, including network issues, is probably human error. If you decrease the complexity then you will have fewer human error problems, thus higher availability. In the same way, as DDoS attacks become more common, having more performance and capacity allows you to have better resiliency against such attacks, thus again, increasing availability.

    I guess what I’m saying is that availability can be a concern as a part of those other areas. If you want better agility so you can reduce human error, or increased performance/capacity so you can handle larger attacks, then you’re really doing it for availability, even if that isn’t what you’re saying.

    • Andrew Lerner says:

      Karl, thanks. Great comment, and I agree with you on both counts. Simplifying the network reduces human error, and increasing capacity/performance (surface area) helps as well. As a matter of fact, I’ve written about both of these specifically…

      1. How increasing Internet capacity can help to alleviate volumetric DDoS attacks:
      Leverage Your Network Design to Mitigate DDoS Attacks
      http://www.gartner.com/document/2551517

      2. How automated provisioning can improve agility and availability:
      Four Key Questions to Ask Your Data Center Networking Vendor
      http://www.gartner.com/document/2661318

  • Update: My colleagues Donna Scott and Dennis Smith just published additional research regarding downtime, including that 80% is caused by people and process…

    http://my.gartner.com/portal/server.pt?gr=dd&ref=g_portal_rss&resId=2856317