It’s time: the networking industry needs a Chaos Monkey. If you’re not familiar with the concept, Netflix created Chaos Monkey to intentionally introduce failures, such as disabling virtual machines (VMs) in its infrastructure, in order to proactively identify issues.
We’ve written about this a few times before, including here. To date, however, there is no readily available chaos monkey for networking (at least none that I’m aware of; comment if you know of a good one). I’ve been looking, asking and googling around for one for a few years. There has been some interest, including from Plexxi way back in 2013, along with some academic research. It should be relatively straightforward to create (i.e., code/script). Further, open sourcing it once written would likely lead to a substantial uptick in awareness and availability, and potentially even commercial support (looking at you, Red Hat, Forward, Veriflow, Apstra, Cisco, etc.).
So why would an enterprise actually want to run it? For starters, it can help to proactively identify issues in a controlled manner. There is so much resiliency/HA deployed in enterprise networks, but we rarely test it, and networks rarely fail 100% according to plan. Exercising those failure paths on purpose would add a ton of value for enterprise network teams. It clearly aligns with reducing network downtime and embracing the next-gen NetOps paradigm. There are several use cases: proactively identifying issues in a planned maintenance window, running it in a staging/lab environment that mimics portions of production, or testing new network rollouts. You could also use it to determine the value of emerging network assurance tools, including Veriflow, Forward and Cisco’s Network Assurance Engine.
The tool would need to do the following (high-level requirements) for a specific list of devices:
- The ability to randomly shut interfaces
- The ability to randomly remove lines of configuration
- The ability to randomly reboot a device (yes, scary)
- All of the above must be done with control parameters (devices in-scope, out-of-scope, time of day) to help control and mitigate the blast radius.
- There must be a specific log of what was done, and in what order.
- There must be a specific rollback mechanism.
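To make these requirements concrete, here is a minimal sketch in Python. It is an illustration only, not a working tool: `Device` is a hypothetical in-memory stand-in for a real switch or router, and the names `ChaosRun`, `inject` and `rollback` are invented for this example. A real version would push changes over whatever management interface your devices support. It also assumes the allowed time window does not cross midnight.

```python
import random
from dataclasses import dataclass, field
from datetime import time


@dataclass
class Device:
    """Hypothetical in-memory stand-in for a real switch or router."""
    name: str
    interfaces: dict  # interface name -> True if administratively up
    config: list      # running-config lines, in order
    reboots: int = 0


@dataclass
class ChaosRun:
    devices: list
    in_scope: set   # device names that may be touched; all others are off-limits
    window: tuple   # (start, end) as datetime.time; assumed not to cross midnight
    rng: random.Random = field(default_factory=random.Random)
    log: list = field(default_factory=list)         # ordered record of every action
    undo_stack: list = field(default_factory=list)  # callables that reverse actions

    def inject(self, now):
        """Apply one random fault; return its log entry, or None if nothing was done."""
        start, end = self.window
        targets = [d for d in self.devices if d.name in self.in_scope]
        if not (start <= now <= end) or not targets:
            return None  # outside the allowed window, or nothing in scope
        dev = self.rng.choice(targets)
        fault = self.rng.choice(["shut_interface", "remove_line", "reboot"])
        if fault == "shut_interface":
            up = [name for name, is_up in dev.interfaces.items() if is_up]
            if not up:
                return None
            detail = self.rng.choice(up)
            dev.interfaces[detail] = False

            def undo(d=dev, ifname=detail):
                d.interfaces[ifname] = True
            self.undo_stack.append(undo)
        elif fault == "remove_line":
            if not dev.config:
                return None
            idx = self.rng.randrange(len(dev.config))
            detail = dev.config.pop(idx)

            def undo(d=dev, i=idx, line=detail):
                d.config.insert(i, line)
            self.undo_stack.append(undo)
        else:
            # A reboot is logged but has nothing to restore afterwards.
            dev.reboots += 1
            detail = "reboot"
        entry = (len(self.log) + 1, dev.name, fault, detail)
        self.log.append(entry)
        return entry

    def rollback(self):
        """Reverse recorded changes in last-in, first-out order."""
        while self.undo_stack:
            self.undo_stack.pop()()


# Example: only edge1 is in scope, and faults are allowed between 01:00 and 04:00.
core = Device("core1", {"eth0": True, "eth1": True}, ["hostname core1"])
edge = Device("edge1", {"eth0": True, "eth1": True}, ["hostname edge1", "router ospf 1"])
run = ChaosRun([core, edge], in_scope={"edge1"}, window=(time(1, 0), time(4, 0)))
for _ in range(3):
    run.inject(now=time(2, 30))
for entry in run.log:
    print(entry)
run.rollback()
```

Undoing in reverse order matters for the configuration case: each removed line is reinserted at the index it held at the moment of removal, which is only correct if later removals are reversed first. The sequence number in each log entry gives the "what was done, and in what order" record.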
This isn’t just valuable for users. Vendors could use it to show the value of their management and assurance tools. This is an opportunity to deliver NetOps innovation to the market, which has been dramatically lacking. So let’s call it the Chaos Honey Badger and go…