Date: 2023-02-10

I like a good acronym. I wanted to share an acronym that serves me well when I talk about automation roadmaps and journeys.

So what does it mean?

DIAL: Data – Insight – Action – Learning.

These are the four dimensions that I use to help identify where an activity is on a spectrum. It works from the idea that in order to make a decision, there must be Data. If you can aggregate and synthesize data, you can generate Insights. Insights teach you something about the data that you didn’t know before. More importantly, they give you a way to prioritize progression towards the next step: Action. Action is what you do with the insight – do I intervene, do I monitor, do I accelerate activity? Last but not least: Learning. Am I able to incorporate the Insights or the results of the Action into the data that I am using to DIAL in what I do next?

Can you give a concrete example of this cycle?

Sure! Let’s say you’re analyzing ITSM incident or request data. The data by itself lends itself to a simple histogram of frequency – say the number of times an incident occurs – and this is useful as far as it goes. You get a gross measure of what is happening in your world.
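As a minimal sketch of that data-only view, assuming each incident is a simple record with a hypothetical `category` field (the records below are invented for illustration and are not real ITSM data):

```python
from collections import Counter

# Hypothetical incident records; field names and values are illustrative only.
incidents = [
    {"id": 1, "category": "password reset"},
    {"id": 2, "category": "vpn drop"},
    {"id": 3, "category": "password reset"},
    {"id": 4, "category": "printer offline"},
    {"id": 5, "category": "password reset"},
    {"id": 6, "category": "vpn drop"},
    {"id": 7, "category": "password reset"},
]


def incident_frequency(records):
    """Count occurrences of each incident category, most frequent first."""
    counts = Counter(record["category"] for record in records)
    return counts.most_common()


frequency = incident_frequency(incidents)

# Print a crude text histogram: one '#' per occurrence.
for category, count in frequency:
    print(f"{category:<16} {'#' * count} ({count})")
```

This tells you what happens most often, and nothing more. It says nothing yet about which incidents matter most.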

If you integrate a second source of data – business impact of an incident occurring – then you are not just changing the richness of the incident view you have. You’re generating an insight. While there may be a #1 incident by volume, the #4 incident by volume actually has the most end-user or business impact. That’s an insight that’s worth taking action on!
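To make that shift concrete, here is a rough sketch that ranks the same incidents two ways. The category names, volumes, and per-incident impact scores are all assumptions made up for illustration; in practice the impact figure might be affected users, lost hours, or whatever measure your business uses.

```python
from collections import Counter

# Hypothetical incident log: one entry per occurrence (illustrative volumes).
incident_log = (
    ["password reset"] * 40
    + ["vpn drop"] * 25
    + ["printer offline"] * 15
    + ["order system outage"] * 5
)

# Hypothetical second data source: assumed business impact per occurrence.
impact_per_incident = {
    "password reset": 1,
    "vpn drop": 3,
    "printer offline": 1,
    "order system outage": 50,
}


def rank_by_volume(log):
    """Return categories ordered by how often they occur."""
    return [category for category, _ in Counter(log).most_common()]


def rank_by_impact(log, impact):
    """Return categories ordered by total impact (occurrences x impact each)."""
    counts = Counter(log)
    totals = {cat: n * impact.get(cat, 0) for cat, n in counts.items()}
    return sorted(totals, key=totals.get, reverse=True)


by_volume = rank_by_volume(incident_log)
by_impact = rank_by_impact(incident_log, impact_per_incident)

print("By volume:", by_volume)
print("By impact:", by_impact)
```

With these made-up numbers, the order system outage is last by volume but first by total impact, which is exactly the kind of reordering that turns data into an insight.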

We then take some kind of action – resolving the root cause, monitoring the occurrence, building some automation to address the incident – whatever is appropriate. Based on the action taken, we evaluate the outcomes. That gets us to the Learning dimension. The following are examples of what learning looks like:

Was the root cause determined? If not, what data did we need to improve our analysis?

Did we resolve the incident? If so, what were the signals that were present before the incident became service impacting, and can we use them to improve our response? If we didn’t resolve it, did we at least restore service? If so, what telemetry are we using to ensure that we don’t suffer the same impact? Can we improve the monitoring to detect signals of impending failure earlier?

Were we able to reach the fix agents we needed? If so, were they provided with enough triage data or troubleshooting data to shorten their analysis time? How confident are they in the remediation actions they took based on the data that was available? Could the signals they use be used to trigger automated remediation?

You get the idea. The human reaction is to quickly move on from one incident to the next, but taking the time to learn from the work that is done is well worth the investment.

Final thoughts

The DIAL really does need to go to 11 when it comes to maturing the way we drive the learning from the data and effort we expend in addressing incidents. But it’s a cycle that can be applied to all the rest of the work that we do in Infrastructure & Operations.