Gartner Blog Network


Network Downtime

by Andrew Lerner  |  July 11, 2014  |  12 Comments

The most important aspect of nearly every network is availability. Performance, scalability, management, agility, etc. all require the network to actually be online. In conversations with Gartner clients, availability often comes up, which is echoed in surveys we’ve done:

[Image: survey results chart ("drivers3")]

I would argue that availability is higher than 20%, but clients don’t score it that way because a) it is assumed as foundational to all vendors and hence is not perceived as a major differentiator and/or b) all the hype around SDN has people focused on agility and orchestration. Availability is relatively boring compared to other “cool” stuff like SDN and disaggregation, unless you’re talking about the Netflix Chaos Monkey (more on that below).

When talking to clients, availability often comes up after an outage. In some cases, network outages drive significant investment from IT, particularly in the DDI market (“…we could never cost-justify DDI until we fat-fingered our public website A-record…”). The undisputed #1 cause of network outages is human error, with estimates as high as 32% according to Dimension Data’s 2014 Network Barometer report, not to mention a study from Avaya indicating 82% of folks experienced network downtime due to human error. In my 16+ years running large corporate networks, there was no feeling worse than the post-mortem meeting after a big outage. I’ll never forget one particular meeting in which my CIO said “…well that was just plain stupid…”. Fortunately, this only happened to me a very small number of times. And we have research that can help you avoid networking outages, including:

Take A Four-Step NCCM Approach to Stem Disasters (Vivek Bhalla)

http://www.gartner.com/document/2251916

Summary: While businesses invest in their networks to gain a competitive edge, they often fail to ensure adequate steps are taken to reduce outages. Gartner’s four-step NCCM approach enables network staff to minimize infrastructure failure.

And, for those interested in designing WANs for availability:

Bandwidth Doesn’t Matter; Availability Drives Enterprise Network Costs (Neil Rickard, Danellie Young)

http://www.gartner.com/document/2549215

Summary: In the developed world, the marginal cost of bandwidth is so low that rightsizing capacity has little impact on WAN cost. However, the cost of improving availability remains high and downtime is less acceptable, making rightsizing network availability the key goal for enterprise network designers.

And a little more about antifragility (and the Netflix Chaos Monkey):

How Antifragile Practices Can Make Your I&O Stronger (Ian Head)

http://www.gartner.com/document/2690718

Summary: Antifragile systems turn stress and adversity into advantage. Certain practices of Web-scale IT enterprises may be emulated by other IT organizations to enhance their antifragility, especially as part of their continual improvement, DevOps and digital business initiatives.

Regards, Andrew

PS – If you have a really good and/or funny outage story, feel free to include it in the comments; a prize will be sent to the best one…


Andrew Lerner
Research Vice President
4 years at Gartner
19 years IT Industry

Andrew Lerner is a Vice President in Gartner Research. He covers enterprise networking, including data center, campus and WAN with a focus on emerging technologies (SDN, SD-WAN, and Intent-based networking).


Thoughts on Network Downtime


  1. Andrew Lerner says:

    I will kick things off with outage stories here… My story begins as a wet-behind-the-ears 19-year-old intern in the mid 1990s at the headquarters (IT department) of a grocery chain.

    We ran Windows 3.1 off the network (yes, netboot), and users’ swap/temp files were getting really big and eating up network file space.

    One of the IT admins was working on a batch file that upon login deleted these files. She decided to have me (the Intern) test it. So I walked over to the area where the semi-trucks check out (fleet maintenance) and logged in.

    Within 30 seconds all the computers were down. The outage persisted for a few hours. Lots of unhappy truckers and IT admins. Backups from tape were involved. Not fun.

    Their senior engineer finally figured it out. The script had been written to delete files from the f:/temp directory, where tmp files were stored. However, a) the script had a directory switch, and b) I had admin rights but no temp folder on that particular server.

    Thus, the following occurred:

    (default netboot drive is f:/)
    cd temp (which failed b/c I had no temp folder on that server)
    del *.* (which didn’t fail b/c I had admin rights)

    Since I did not have a temp folder on the server in the fleet maintenance department (and had admin rights on the root of f:/), all files on the server began deleting…

  2. I will say that management agility (ease of configuration), capacity, and performance all do have ties to availability. The number one cause of most issues, including network issues, is probably human error. If you decrease the complexity then you will have fewer human error problems, thus higher availability. In the same way, as DDoS attacks become more common, having more performance and capacity allows you to have better resiliency against such attacks, thus again, increasing availability.

    I guess what I’m saying is that availability can be a concern as a part of those other areas. If you want better agility so you can reduce human error, or increased performance/capacity so you can handle larger attacks, then you’re really doing it for availability, even if that isn’t what you’re saying.

    • Andrew Lerner says:

      Karl, thanks. Great comment and I agree with you on both counts. Simplifying the network reduces human error, and increasing capacity/performance (surface area) helps as well. As a matter of fact, I’ve written about both of these specifically…

      1. How increasing Internet capacity can help to alleviate volumetric DDoS attacks:
      Leverage Your Network Design to Mitigate DDoS Attacks
      http://www.gartner.com/document/2551517

      2. How automated provisioning can improve agility and availability:
      Four Key Questions to Ask Your Data Center Networking Vendor
      http://www.gartner.com/document/2661318

  3. […] ← Network Downtime […]

  4. Update: My colleagues Donna Scott and Dennis Smith just published additional research regarding downtime, including that 80% is caused by people and process…

    http://my.gartner.com/portal/server.pt?gr=dd&ref=g_portal_rss&resId=2856317

  5. […] Could’ve Caused The Downtime It isn’t really a great idea to pontificate here. United says it was a router problem. That […]

  6. […] most important aspect of nearly every network is availability,” Lerner wrote. “Performance, scalability, […]

  7. […] Network Outages: Sure we have beaches, retirees, and lots of vacationers, but we still take our network outages very seriously (side note; wasn’t my son, I swear). […]

  8. […] Network Downtime – Andrew Lerner – The most important aspect of nearly every network is availability. Performance, scalability, management, agility, etc. all require the network to actually be online. […]

  9. […] assume that most outages are out of their control, or malicious in nature, according to an article written by Gartner, “the undisputed #1 cause of network outages is human error, with estimates as high as 32% […]

  10. […] Kudos for getting beyond just “uptime” and focusing on one of the key root causes of network downtime, which is manual configuration error (often lead by undertrained staff). Also, I agree that […]

  11. […] In fact, studies show that majority of errors in a datacenter are human errors (see for example Network Downtime by Andrew Learner from Gartner). So do we really need professional services tweaking our […]
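
The batch-file outage described in the first comment above came down to a `cd temp` that failed, followed by a `del *.*` that ran anyway in the root of the drive. As a minimal sketch of the safer pattern (not the script that was actually used, and in Python rather than a DOS batch file), the cleanup below only deletes when the target directory is confirmed to exist. The names `clean_temp_dir`, `root` and `subdir` are hypothetical and chosen purely for illustration.

```python
import tempfile
from pathlib import Path


def clean_temp_dir(root: Path, subdir: str = "temp") -> int:
    """Delete regular files directly inside root/subdir; return how many.

    Unlike `cd temp` followed by `del *.*`, this never falls back to the
    root: if the target directory is missing, nothing is deleted.
    """
    target = (root / subdir).resolve()
    # Guard 1: the directory must exist (this is the failed `cd` in the story).
    if not target.is_dir():
        return 0
    # Guard 2: refuse to operate on the root itself (e.g. subdir == ".").
    if target == root.resolve():
        return 0
    deleted = 0
    for entry in target.iterdir():
        if entry.is_file():
            entry.unlink()
            deleted += 1
    return deleted


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        root = Path(d)
        (root / "important.dat").write_text("keep me")
        # No temp folder on this "server", so nothing is deleted.
        print(clean_temp_dir(root))  # prints 0
```

The design choice is simply to pass the path to the delete step explicitly instead of relying on the current working directory, so a failed directory change cannot silently redirect the deletion.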




Comments or opinions expressed on this blog are those of the individual contributors only, and do not necessarily represent the views of Gartner, Inc. or its management. Readers may copy and redistribute blog postings on other blogs, or otherwise for private, non-commercial or journalistic purposes, with attribution to Gartner. This content may not be used for any other purposes in any other formats or media. The content on this blog is provided on an "as-is" basis. Gartner shall not be liable for any damages whatsoever arising out of the content or use of this blog.