On SIEM Tool and Operation Metrics

By Anton Chuvakin | June 17, 2014 | 17 Comments

While some people whine that “their SIEM deployment has failed”, how the hell do they know? I’ve met some folks who spent 8 digits (that’s EIGHT digits!) on SIEM and they are as happy as pigs in clover. They think that SIEM is the best security investment they’ve ever made, for realz.

Measuring SIEM health and operations is still an emerging art, and there is no set of accepted SIEM metrics. The core SIEM team has to define success criteria at the planning stage and periodically check progress against these criteria (source). However, collecting logged event numbers, correlated event numbers, the related rules enabled, the number of incidents handled and even the number of changes implemented as a result of SIEM monitoring has proven useful for many organizations. These do allow for basic measuring of SIEM tool and program performance. For example, if the volume of collected and correlated logs has decreased dramatically, maybe the tool usage is waning.

Measuring SIEM impact on incident recovery time (similar to the operational mean time to repair [MTTR] metric) and on incident severity also presents great evidence of more strategic SIEM success. Even better, a reduced incident discovery window, if observed, can provide a great boost to a SIEM program. The number of cases opened for investigation as a result of SIEM and the potential incidents resolved at early stages are useful metrics as well. The number of alerts handled by each analyst allows the organization to track team performance and not just tool performance.

Select SIEM tool metrics:

  1. Event collection rate, EPS (average, maximum – per log source, per type, etc.)
  2. Event processing/analysis rate, EPS (average, maximum)
  3. Total log storage, GB (in SIEM, log management)
  4. Log source count (by type, region, log volume, etc.)
  5. Alerts triggered count (per time unit, by target, by type, by rule, etc.)
  6. SIEM resource usage (CPU, RAM, disk)
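
To make the first metric concrete, here is a minimal sketch of computing average and maximum EPS per log source. It assumes events are available as (epoch second, source name) pairs, which is a simplification of mine; the function name `eps_by_source` and the sample data are hypothetical and not tied to any SIEM product.

```python
from collections import Counter, defaultdict


def eps_by_source(events):
    """Return {source: (average_eps, max_eps)} from (epoch_second, source) pairs.

    The average uses the full observed window for every source, so seconds
    in which a source was silent count as zero events.
    """
    events = list(events)
    if not events:
        return {}
    seconds = [int(ts) for ts, _ in events]
    window = max(seconds) - min(seconds) + 1  # window length in seconds
    per_second = defaultdict(Counter)
    for ts, source in events:
        per_second[source][int(ts)] += 1
    return {
        source: (sum(counts.values()) / window, max(counts.values()))
        for source, counts in per_second.items()
    }


# Hypothetical sample: four firewall events and two proxy events over 4 seconds.
sample = [
    (100, "firewall"), (100, "firewall"), (100, "firewall"),
    (101, "firewall"), (103, "proxy"), (103, "proxy"),
]
stats = eps_by_source(sample)
```

Tracking the maximum next to the average matters because short bursts, not the daily mean, are what tend to overrun collectors.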

Select SIEM operation metrics:

  1. Alerts handled (per analyst, per rule, per target, etc.)
  2. Alert response timing [such as time from triggering to review, then to first action, then to closure or escalation (by alert type, by target, by analyst, etc.)] <- some call this metric “the only one that matters”
  3. Incidents opened based on SIEM alerts (by time unit, by analyst, by target, etc.)
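
The alert response timing metric can be sketched roughly as follows. This assumes each alert record carries four timestamps (`triggered`, `reviewed`, `first_action`, `closed`) plus grouping fields such as `analyst` and `rule`; those field names are my own invention, and a real case management system will likely name and store them differently.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def response_times(alerts, group_key="analyst"):
    """Mean duration in seconds of each response stage, grouped by one field."""
    stages = defaultdict(lambda: defaultdict(list))
    for a in alerts:
        g = stages[a[group_key]]
        g["to_review"].append((a["reviewed"] - a["triggered"]).total_seconds())
        g["to_first_action"].append((a["first_action"] - a["reviewed"]).total_seconds())
        g["to_closure"].append((a["closed"] - a["first_action"]).total_seconds())
    return {
        group: {stage: mean(values) for stage, values in per_stage.items()}
        for group, per_stage in stages.items()
    }


def minutes(n):
    return timedelta(minutes=n)


# Hypothetical alert records, all triggered at the same moment for simplicity.
t0 = datetime(2014, 6, 17, 9, 0, 0)
alerts = [
    {"analyst": "alice", "rule": "brute_force", "triggered": t0,
     "reviewed": t0 + minutes(5), "first_action": t0 + minutes(15),
     "closed": t0 + minutes(45)},
    {"analyst": "alice", "rule": "malware", "triggered": t0,
     "reviewed": t0 + minutes(15), "first_action": t0 + minutes(25),
     "closed": t0 + minutes(35)},
    {"analyst": "bob", "rule": "malware", "triggered": t0,
     "reviewed": t0 + minutes(60), "first_action": t0 + minutes(90),
     "closed": t0 + minutes(120)},
]
by_analyst = response_times(alerts)
by_rule = response_times(alerts, group_key="rule")
```

Passing a different `group_key` gives the same breakdown by rule or by target without changing the calculation.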

[note the word usage above: these are “select”, not “top” metrics – I feel that I don’t know enough at this stage to proclaim knowledge of the best or top metrics!]

Care to suggest more? Which ones do you find the most useful?

The Gartner Blog Network provides an opportunity for Gartner analysts to test ideas and move research forward. Because the content posted by Gartner analysts on this site does not undergo our standard editorial review, all comments or opinions expressed hereunder are those of the individual contributors and do not represent the views of Gartner, Inc. or its management.

17 Comments

  • Byron Anderson says:

    Awesome article, I enjoy most of the things you write. The opening paragraph was the best part!

  • Mark says:

    I think that there are quite a few more that we can add to the list. For me one of the biggest failures I see is around what I call the Golden Rules of SIEM

    1) Make sure you’re getting the logs
    2) Make sure that the logs are parsed correctly
    3) Make sure that the logs are categorized correctly
    4) Make sure time stamps are correct

    Along those lines we could measure:

    Overall collection uptime
    Percentage of events cached
    Percentage of events parsed correctly
    Percentage of events categorized correctly
    Percentage of events collected vs processed (i.e. filtering)
    Percentage of events with time issues
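
As a rough illustration of the measures listed above, the sketch below turns raw counters into percentages. All counter names are hypothetical, and it assumes the counts for one reporting window have already been gathered from the collection layer. The "categorized correctly" count in particular would probably have to come from a hand-checked sample rather than from the SIEM itself.

```python
def collection_health(counts, window_seconds, downtime_seconds):
    """Turn raw counters for one reporting window into percentages."""
    received = counts["received"]
    if received <= 0 or window_seconds <= 0:
        raise ValueError("need a non-empty window with at least one event")

    def pct(n, total=received):
        return round(100.0 * n / total, 2)

    return {
        "collection_uptime_pct": pct(window_seconds - downtime_seconds, window_seconds),
        "cached_pct": pct(counts["cached"]),
        "parsed_ok_pct": pct(counts["parsed_ok"]),
        "categorized_ok_pct": pct(counts["categorized_ok"]),
        "processed_pct": pct(counts["processed"]),  # collected vs processed
        "time_issue_pct": pct(counts["time_issues"]),
    }


# Hypothetical counters for one day of collection.
day = {
    "received": 200_000,
    "cached": 5_000,
    "parsed_ok": 190_000,
    "categorized_ok": 180_000,
    "processed": 150_000,
    "time_issues": 1_000,
}
health = collection_health(day, window_seconds=86_400, downtime_seconds=864)
```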

  • @Byron Thanks for praise 🙂

    @Mark Thanks a lot for the comments – I like the golden rules, but would consider adding “make sure that somebody or something LOOKS at logs”

    >Percentage of events categorized correctly

    This one would be pretty difficult to collect, I think. Doesn’t it require a human looking at each log message?

  • Brook Watson says:

    Great topic for conversation.

    What we have found across our customer base is that while the SIEM itself can track the operational metrics (if the customer has an idea of what they want to measure), most SIEM products don’t provide access to the relevant data sets needed to ensure that the SIEM is running properly and that there are no underlying symptoms occurring that will cause major issues or outages down the road.

    We have taken a different approach and have implemented some NOC best practices as well as developed a suite of tools to harvest the various pieces of information from the SIEM that allows us to monitor for any situations that would cause performance or data integrity issues.

    In-line with Mark’s comments and your mention of “Tool Metrics”, we also measure and monitor the following for performance and stability issues from the SIEM platform:

    cpu, memory, IO, process counts, java heap, jmx threads, jmx actions, event broker times, EPS, table free space, incoming connector cache, load, network traffic, TCP inbound/outbound connections, disk space, disk read/write times, disk read/write operations, db read/write, db threads, db locks, db commands, db handler

    In addition to the platform metrics, we also monitor the SIEM logs and look for errors, warnings, and parsing issues. The interesting piece is that most of the information in the SIEM logs is critical to understanding the health and wellness of the SIEM, but none of the SIEM vendors actually parse their own logs or make them available in the product UI.

    I am interested in what others feel are key SIEM tool and operational metrics.

  • @Brook

    Wow, what an extensive set of tool metrics!

    Do you also do SIEM operation / SOC capability metrics?

    >none of the SIEM vendors actually parse their own logs or make
    >them available in the product UI

    Hmmm…I think you are selling the products a bit short; splunk definitely does it.

  • Brook Watson says:

    @Anton

    We have found that the operational metrics are only relevant / possible if an organization is mature enough to have standardized their operations and workflow. Unfortunately, many big budget security organizations still have no concept of workflow. Without the established workflow, there is no real way to provide tangible operational metrics.

    For the few organizations that have reached a mature capability, the metrics are typically trivial to create and track, but are not standard. Our best practice is to talk to the business owner and ask the question, “What story do you want to tell management?”. We do this because metrics can be presented in a multitude of ways and are only relevant if they are supporting business decisions.

    For example, say we need to know the proper analyst headcount for budget reasons. In this scenario we will typically build metrics to understand and present:

    1) Total count of Events of Interest (EOI) per day that MUST be looked at.
    2) Average time spent on each EOI by incident classification. (Only available if structured workflow is followed).
    3) Ratio of EOI evaluated vs those not evaluated each day. (Shows Gap in Monitoring capabilities)
    4) …you get the idea…
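
A rough sketch of how the three EOI metrics above might be computed from one day of records. The record layout (`classification`, `minutes_spent`, with `None` meaning the EOI was never evaluated) is a hypothetical stand-in for whatever the workflow tool exports. The final "analyst hours" figure is my own extrapolation, assuming unevaluated EOIs would take the same average time as evaluated ones.

```python
from collections import defaultdict
from statistics import mean


def eoi_summary(eois):
    """Summarize one day of Events of Interest (EOI) for headcount planning."""
    total = len(eois)
    evaluated = [e for e in eois if e["minutes_spent"] is not None]
    by_class = defaultdict(list)
    for e in evaluated:
        by_class[e["classification"]].append(e["minutes_spent"])
    # Assumption: unevaluated EOIs would take the overall average time.
    avg_all = mean(e["minutes_spent"] for e in evaluated) if evaluated else 0
    return {
        "total_eoi": total,
        "avg_minutes_by_class": {c: mean(m) for c, m in by_class.items()},
        "evaluated": len(evaluated),
        "not_evaluated": total - len(evaluated),
        "evaluated_fraction": len(evaluated) / total if total else 0.0,
        "analyst_hours_if_all_reviewed": total * avg_all / 60,
    }


# Hypothetical day: six EOIs, only three of which were looked at.
eois = [
    {"classification": "malware", "minutes_spent": 30},
    {"classification": "malware", "minutes_spent": 50},
    {"classification": "phishing", "minutes_spent": 10},
    {"classification": "phishing", "minutes_spent": None},
    {"classification": "malware", "minutes_spent": None},
    {"classification": "phishing", "minutes_spent": None},
]
summary = eoi_summary(eois)
```

The gap between `evaluated` and `total_eoi` is the monitoring gap from item 3, and the hours estimate is one way to turn it into a staffing number.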

    And because you brought it up, not because I want to talk about it :o) …

    It is our opinion that Splunk is NOT a SIEM and therefore not considered as part of my earlier comments regarding SIEM metrics. Before I get blasted for that comment, Splunk is a great log management tool and actually handles a few security use cases very well. However, Splunk lacks way too many of the core SIEM features and capabilities to be considered a true SIEM product at this point. We use and recommend Splunk at many of our customers, but ONLY for Log Management capabilities.

  • Ray says:

    Thank you and great post, Anton. Always enjoy your articles. I think you hit many critical components. I like what you have listed as they are most of the salient components, without inundating the report.

    In SIEM it seems one of the most difficult things is to narrow down all the possible metrics into a few key points that:
    1) Cover the majority of the highly diverse components
    2) Can represent an average or at least differentiate between good versus bad (or in some cases, good versus great/not bad versus horrible)
    3) Visually concise but make sense for varying levels of readers (this can be really difficult to do in product)

    It has seemed very fruitful to catch as much as possible in your trending/reporting, but to create multiple reports and limit their output depending on the audience.

    The typical “Today we collected 1 million logs from 10,000 devices, creating about 10 thousand correlations, which resulted in 100 incidents of which 5 were critical” – tells a good story which seems to be understandable by many audiences. Adding time metrics (time to review, time to categorization, resolution time per incident/category) is icing on the cake, if your SOC is mature and if your case metrics system can support this in an automated fashion (a few have integration issues with SIEM).

    Admins and Leadership directly responsible/accountable for the SIEM might be interested in most of the above numbers, but also would want to know things like: insertion rates, aggregation rates, caching, parsing, event timing, retrieval rates, critical engine/component functionality (rules, dashboards, trending/reports, emails, etc.), widget status (collection infrastructure working?), core system functionality (mem, cpu, database metrics, io, disk space, network stats)… Operational/Administrative reporting has always been a personal challenge, because it can become overwhelming very fast for anyone not familiar with what ‘normal’ is… I have found ways to generally make this work… Mostly, but it typically involves multi-page reports or multiple dashboard components…

    I think you and Brook made some great points about knowing YOUR SIEM and its capabilities. Some are verbose, some are not, but understanding what you can and cannot collect natively can help substantially.

  • Peter says:

    Anton,

    First-time comment, long-time reader. As others have said, I really enjoy your posts on SIEM. Focused and straight-to-the-point. And so, here I go: I’m not asking for a whole new post, just your 2-cents here – Who are the current top SIEM-as-a-service providers? Top 5 or 10? I need a good starting point for my research and figured who better to point me in the right direction(s) than the friendly neighborhood Mr. SIEM himself? Bol’shoye spasibo.

  • >Who are the current top SIEM-as-a-service providers?

    These do NOT exist at this point.

    Sorry, there are no SaaS SIEM providers on this planet.

    There are MSSPs, there is some cloud log mgt, but there is no SaaS SIEM 🙁

  • Rick says:

    We are looking at the recent offering of SUMOLOGIC’s SIEM capabilities. Off hand they appear to be more security device logs and alerts than true SIEM functions. We are looking for clarification on some log activities to see if there is room for a wedge piece while we explore other traditional SIEM offerings. Has anybody else looked at SUMOLOGIC as a SIEM?

  • Peter says:

    Hmm… Top 5-10 MSSP’s then 🙂 Thanks Anton!

  • @peter For MSSPs, see http://www.gartner.com/document/2671919

    I’d rather not comment on vendor capabilities on the blog. However, log management with some algorithms and search does not a SIEM make.

  • Peter says:

    @Anton – But of course.

    “… it’s what you do with those logs that defines you.” – Paul Bunyan

    ?curious why you didn’t link http://www.gartner.com/doc/2477018

    @Rick – Yes! On paper, Sumo definitely appears to be “legit” and, as you know, is actually openly challenging Splunk, one of the Gartner MQ for SIEM’s “leaders”.

  • @Peter Well, the 2014 version is coming in about a week 🙂
    Also, the SIEM MQ does not list any MSSPs or cloud log management players

  • Peter says:

    @Anton – Yup, I’ll get spam for that.

    > … cloud log management players

    Ooh got a “Top 10” for that? I’m just looking for good starting pt, who are the guys we should all be looking out for…

  • @peter I suspect there are about 10 or fewer cloud log mgt players; I suspect top 10 = only 10.

    Loggly, SumoLogic, Logentries [and splunk storm] appear more often than others — but there are maybe a few more

  • Peter says:

    I suspect you suspect correct. Many thanks again sir! And I’m off…

    Please don’t stop your SIEM blog til further notice 😉