I latched onto an idea early in my time as an automation architect. The idea was realized when we turned on our first automated incident response workflow. It wasn’t a very sophisticated workflow. It ran some triage on a filesystem filling up and enriched the ticket before dispatching it for action. What was unique is that it was triggered by the incident being created in our ITSM platform. The real value is that it demonstrated a general framework that allowed the organization to see what was possible: events driving automation.
We quickly enhanced it with additional automation to actually REMEDIATE the problem, and suddenly we were taking care of hundreds of tickets, with only a trickle making it to humans for action. There was a lot of work from some really talented people to make that happen, but it worked.
What value did it deliver? It meant that incidents weren’t waiting on humans; they were being handled at the speed of automation.
Working at the speed of automation > waiting for humans to look at tickets
What does this mean for automated incident response? Building the capability is the easy part. There are two thornier topics that I want to pull out here.
First: incident response takes human trust, so pay attention to that.
Second: an incident is an IT concept, but it applies in many different (business) contexts.
Incident response takes human trust, pay attention to that.
I’ve been in the IT biz long enough to realize that it’s about patterns, not the individual things.
Incident response is a well understood space:
- monitoring (sometimes human) detects an anomaly or error,
- an incident is created,
- the incident is dispatched to a fix agent,
- things happen,
- incident is resolved and closed
- life moves on.
That’s the pattern.
What’s missing from this picture? Learning.
How do we learn from these incidents? What was effective at resolving the incident? How often have we seen this incident before? Are there signals that are emitted before the incident becomes an interruption? How disruptive is this incident?
All of these are missing from the action of incident response, and I firmly believe that we miss the bigger picture because incident response remains rooted in humans taking actions. In order to remove the humans from the loop, the automation itself has to be trusted by those humans. You know, the ones who are held accountable for the reliability of a system.
That’s where trust surfaces as a vital ingredient. Trust is built by being open, transparent, and collaborative with those humans about what actually gets executed. On the flip side, when something goes wrong, those same humans need to be open, transparent, accountable, and responsive: not only owning the problem, but fixing it.
An incident is an IT concept, but it applies in differently named (business) contexts.
An incident is ultimately an event with rules applied to it. The rules that get applied differ based on where the event is sourced from, but the pattern is the same. So why do I care about this? Because I sense opportunity for the Infrastructure and Operations team here! What opportunity? To take what you know and do really well with IT-related incidents, and extend it to business events.
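Here is a minimal sketch of that pattern in Python, just to make "an event with rules applied to it" concrete. Everything in it is an assumption for illustration: the `Event` and `Rule` types, the `respond` function, the source labels ("monitoring", "email"), and the action strings are hypothetical and not taken from any particular ITSM or automation product.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Event:
    source: str   # where the event came from (hypothetical labels)
    kind: str     # what happened
    payload: dict = field(default_factory=dict)


@dataclass
class Rule:
    name: str
    matches: Callable[[Event], bool]   # does this rule apply to the event?
    action: Callable[[Event], str]     # what the automation would do


def respond(event: Event, rules_by_source: dict) -> str:
    """Apply the first matching rule for the event's source.

    If nothing matches, fall back to the only rule most events get
    today: dispatch to a human.
    """
    for rule in rules_by_source.get(event.source, []):
        if rule.matches(event):
            return rule.action(event)
    return f"dispatch to a human: {event.kind}"


# Different sources, different rules, same pattern (all hypothetical).
RULES_BY_SOURCE = {
    "monitoring": [
        Rule(
            "disk-full",
            lambda e: e.kind == "filesystem_full",
            lambda e: f"clean up {e.payload['mount']} and enrich the ticket",
        ),
    ],
    "email": [
        Rule(
            "signature-request",
            lambda e: e.kind == "signature_request",
            lambda e: f"start approval chain for {e.payload['document']}",
        ),
    ],
}
```

The IT incident and the business event flow through the same `respond` function; only the rule set changes with the source.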
Think about all the administrative work that we do as humans, and also think about the number of times you’ve heard “oh, it was buried in my email”. That’s an event that doesn’t have anything other than a dispatch rule applied to it. It gets sent to the right human, and then the human executes a workflow that’s in their heads to process it.
Consider this: what if you could apply rules to those events that help push them to the user for action? I have a document that needs to be routed for a set of signatures in a specific order (think approval chain). I can send individual emails to each of these people and be blocked waiting for each response. OR. What if I could define the approval chain in an automated workflow, and then not only PUSH to the first user in the chain for their action, but once that signature is obtained, immediately PUSH to the next person in line? Sure, the process remains wedded to the humans, but now we not only have insight into where the bottlenecks are (<cough>humans<cough>), we also have trend data.
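The approval chain above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a real workflow engine: the `ApprovalChain` class and its methods are hypothetical, a "push" is reduced to recording a timestamp, approver names are assumed to be unique, and timestamps are passed in by the caller (in hours, say) rather than read from a clock.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ApprovalChain:
    document: str
    approvers: list   # unique names, in required signature order
    pushed_at: dict = field(default_factory=dict)   # name -> time pushed
    signed_at: dict = field(default_factory=dict)   # name -> time signed

    def current(self) -> Optional[str]:
        """Return the next approver who has not signed, or None if done."""
        for person in self.approvers:
            if person not in self.signed_at:
                return person
        return None

    def _push(self, now: float) -> Optional[str]:
        # Stand-in for a real notification: just record when we pushed.
        person = self.current()
        if person is not None:
            self.pushed_at[person] = now
        return person

    def start(self, now: float) -> Optional[str]:
        """PUSH the document to the first person in the chain."""
        return self._push(now)

    def sign(self, person: str, now: float) -> Optional[str]:
        """Record a signature, then immediately PUSH to the next person."""
        if person != self.current() or person not in self.pushed_at:
            raise ValueError(f"{person} is not the current approver")
        self.signed_at[person] = now
        return self._push(now)

    def wait_times(self) -> dict:
        """How long the document sat with each approver who has signed."""
        return {p: self.signed_at[p] - self.pushed_at[p] for p in self.signed_at}

    def bottleneck(self) -> Optional[str]:
        """The approver the document waited on longest so far."""
        waits = self.wait_times()
        return max(waits, key=waits.get) if waits else None
```

Calling `start`, then `sign` for each approver in order, walks the document down the chain. The `wait_times` dictionary is the raw material for the bottleneck and trend data; collected across many documents, it shows where approvals routinely stall.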
This isn’t far-fetched: our business lives are full of repeated activities that are readily automatable, and give back precious time to the humans to work on the things we need them to work on. Not the things that consume their time.
This is an opportunity that I want teams to take. Being good at incident response is readily translatable to event response. Being good at event response and applying it broadly makes you even more valuable to your consumers. That means visibility, that means improvement, and it means trust. What it leads to is bringing solutions to the problems that your consumers have, and that means Value.
The Gartner Blog Network provides an opportunity for Gartner analysts to test ideas and move research forward. Because the content posted by Gartner analysts on this site does not undergo our standard editorial review, all comments or opinions expressed hereunder are those of the individual contributors and do not represent the views of Gartner, Inc. or its management.
Loved this take, Chris – “What’s missing from this picture? Learning.”
Great post Chris. We all need to rethink what autonomous IT really means, what it can do for us, and what those downstream impacts really are. Nicely done.
“Incident response takes human trust”, great point Chris!
Trust is the essence and indeed, if the aim is to remove the humans from the loop it does make absolute sense that the automation itself has to be trusted by those very humans!