Network outages have been in the news quite a bit lately, including those at Delta and Southwest (to the tune of $54 million); last year it was NYSE, United and the WSJ. We have research in the 2017 pipeline on avoiding network outages, but in the interim, here is a lighter take on network outages described via 1980s songs…
- Most of the really bad network outages are the result of several cascading failures, because so often “One Thing Leads to Another”.
- Way too often, outages occur after the weekend maintenance window and its associated litany of “changes” (ok, ’71, but it fits), which always makes for a very “manic Monday” and makes everyone wish they could just “turn back the clock”.
- Troubleshooting during the outage can be challenging as people are running around like “It’s the End of the World as We Know It”.
- Sometimes, the problem is related to a single app, but getting developers to describe how their app is actually supposed to behave on the network (TCP ports, anyone?) often feels like you’re running “Against the Wind”, although there are lots of “promises, promises” that it is all written down somewhere.
- That said, network teams are generally equally poor at having updated network documentation, although we have “sweet dreams” of updating it.
- Many outages are due to power and weather, as fallen trees equate to broken WAN circuits, so you can often “blame it on the rain”. Thus, network folks do not generally “love a rainy night”. Side note: If the branch was really mission critical, the business should’ve paid for diverse WAN connections, but that does take “a whole lotta spending money” and when it comes to networking “there’s no money falling from the sky”.
- Sometimes outages are the result of rushed changes, like when the business came and “dropped the bomb on me” with 2 days’ notice before go-live of a new customer-facing app, which seems to happen “time after time”.
- Sadly, many outages are still the result of manual configuration, and while we recognize the need for increased automation, the CLI is a “hard habit to break”.
- The network engineer’s best friend during outages is often a network packet broker that can pull sniffer traces, although it requires lots of “patience” because going through them is tedious and often requires telling managers that “I Still Haven’t Found What I’m Looking For”.
- Sometimes we simply need to “Get Physical” and reboot the box to solve the problem. We’ve all had that time when magically, out of the blue, “abracadabra”, the problem just went away on its own (or, as my colleague Todd Ferguson used to say, “Step 2 a miracle occurred”).
- But then when you do finally get the issue resolved, it is cause for a “celebration” as you’re “Back in the High Life Again” and you can “party like it’s 1999” (side note: unless you had to work because of Y2K…).
- Unless of course, along the way, you accidentally shut down the mainframe (or DNS server), in which case you should be asking yourself “Should I stay or should I go” (courtesy Hank Barnes). Side note: You should definitely “beat it”, because at that point, “it’s much too late for goodbyes”.
- And if you’ve never experienced an outage, then either you’ve never run an operational network at scale, you’re a very “smooth (network) operator”, or it’s just that “some guys have all the luck”.