I’ve been obsessed with stored/historical data analysis inside a SIEM for a while, long before the current craze about so-called “security analytics” was inflicted upon the community. Yes, real-time correlation of an event stream is great (and has been implemented in SIEM products since the late 1990s), but historical data can tell a story that a stream never can. Thus, we need to dig!
This post is an attempt to reflect on stored data analytics and maybe share some of the lessons and methods that have not been shared so far (in case you are curious, Chapter 11 “Log Data Mining” of our “Logging and Log Management: The Authoritative Guide to Understanding the Concepts Surrounding Logging and Log Management” book as well as these presentations reveal some of the same work).
So, roughly around 2003, my work on correlation rules and other SIEM content at a vendor that shall remain nameless led me into exploring the data stored inside our SIEM database (a traditional Oracle RDBMS at the time). Frankly, my first project was to re-express some of my correlation rules into SQL (or SQL with some nasty Perl) in order to try the same rules on stored data without re-streaming it from the database into a correlation engine (yuck!). My early analytic runs also included hunting for “unique counts” (such as count of different event types associated with each external source IP) and “NBS” (Never Before Seen event types, addresses, ports, protocols, etc or combinations thereof).
SELECT a.xxalarmid, b.description, a.devicetypeid, a.appalarmid, c.name, a.source, sum(eventcount)
FROM $view a, xxalarms b, devicetypealarms c
WHERE a.xxalarmid = b.xxalarmid
  AND a.appalarmid = c.alarmid
  AND a.devicetypeid = c.devicetypeid
  AND apptimestamp > '$dayago'
  AND apptimestamp < '$now'
  AND a.xxalarmid NOT IN (
    SELECT UNIQUE(xxalarmid) FROM $view
    WHERE apptimestamp > '$twodayago' AND apptimestamp < '$dayago'
  )
GROUP BY a.xxalarmid, b.description, a.devicetypeid, a.appalarmid, c.name, a.source
ORDER BY a.source
(xxalarmid is a field holding normalized event types, if you are curious)
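The query above is the NBS flavor. The “unique counts” idea mentioned earlier can be sketched just as briefly. This is a minimal Python illustration, not the original Perl/SQL: the function name, the row layout (source IP, normalized event type) and the sample rows are all assumptions made up for the example.

```python
from collections import defaultdict

def unique_event_type_counts(events):
    """Map each source IP to the number of distinct event types it produced."""
    types_by_source = defaultdict(set)
    for source_ip, event_type in events:
        types_by_source[source_ip].add(event_type)
    return {ip: len(types) for ip, types in types_by_source.items()}

# Hypothetical (source IP, normalized event type) rows, e.g. fetched from a SIEM database
sample_events = [
    ("203.0.113.7", "port_scan"),
    ("203.0.113.7", "login_failure"),
    ("203.0.113.7", "port_scan"),      # repeats do not raise the unique count
    ("198.51.100.2", "login_failure"),
]

counts = unique_event_type_counts(sample_events)
# Rank sources by how many different event types they triggered
ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
```

A source that shows up at the top of `ranked` is not necessarily hostile, but it is usually worth a look.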
In a few months, I was staring at a gargantuan 13,400 line (!) Perl script called dm1-daily.pl (I am looking at a copy as I am typing this) that performed all sorts of stored data analytics, including, but not limited to, several methods of profiling and deviation analysis (based on one or two parameters, including derived parameters such as count of unique events and addresses), NBS analysis, event/IP frequency analysis, several methods of event clustering (and then cluster assessment and scoring by interestingness level), stored rule matching, associative rule discovery (discover that “event type X is usually followed by event Y in ZZ% of cases”), ordered and unordered event sequence discovery, etc.
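To make “event type X is usually followed by event Y in ZZ% of cases” a bit more concrete, here is a rough Python sketch. It assumes a time-ordered list of normalized event types and only counts the immediately following event; the original script may well have used time windows or per-IP sequences, so treat this as a simplification with made-up names and sample data.

```python
from collections import Counter

def follow_rates(event_types):
    """For each adjacent pair (X, Y), return the percentage of X occurrences
    that were immediately followed by Y in the time-ordered input."""
    pair_counts = Counter(zip(event_types, event_types[1:]))
    # The last event has no successor, so it is excluded from the denominators
    first_counts = Counter(event_types[:-1])
    return {
        (x, y): 100.0 * n / first_counts[x]
        for (x, y), n in pair_counts.items()
    }

# Hypothetical time-ordered event types for one source
sequence = [
    "login_failure", "login_success",
    "login_failure", "login_success",
    "login_failure", "port_scan",
]
rates = follow_rates(sequence)
# Keep only the pairs that hold in at least 60% of cases (threshold is arbitrary)
frequent = {pair: pct for pair, pct in rates.items() if pct >= 60.0}
```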
The tool operated as a nightly batch run, taking 2-12 hours to complete (longer as the script grew and whenever bugs crept in). The main tool (there were a few others – to be mentioned in future posts) did not keep any state, but simply recreated it every time by querying the database (a bad design decision that made it slower but avoided the need to maintain a separate data store). It produced a mammoth HTML page (some screenshots can be seen here on slides 25, 27, 29, etc) where analysis results were organized in tables (Did you know that my favorite data visualization method is a table? Take that, Raff :-)). Given that 13K line Perl scripts are evil incarnate, the produced HTML output proudly stated “why do it right, if we can do it wrong” (after all, all this was research-grade tooling, not part of a commercial product).
(example screenshot of the output shown above)
In brief, the tool operated as follows:
- Run at 9PM every night
- Sequentially execute a few dozen analysis methods that looked at the last 24 hours of data by directly querying the SIEM database (lots of fancy SELECT lines) fed from production and honeypot systems/networks (all collection and normalization handled by a SIEM – which is great!)
- Compare the last 24 hours to the previous week of data (the week that ended when that 24-hour period began), observe deviations, trends and other changes compared to that reference week
- Assess the same 24-hour period for various types of anomalies and interesting events and event combinations
- Perform a few long-term queries such as frequency of appearance (of IP, port, protocol, event type, etc) over the last 30 days
- Format the results as HTML and email the link to an analyst (i.e. myself)
- The analyst would then manually investigate the findings by using the SIEM UI (no drilldowns to more details were implemented in the HTML version)
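The “compare the last 24 hours to the reference week” step can be sketched roughly as follows. This is an illustrative Python version under stated assumptions, not the original code: the baseline is taken as the plain daily average of the reference week, and the function names, metric names and thresholds are invented for the example. Note that with a percent-change definition a drop can never go below -100%, so the sketch uses a separate (arbitrary) threshold for downward deviations.

```python
def percent_deviation(today_count, reference_week_total, days=7):
    """Percent change of today's count against the reference week's daily average."""
    daily_baseline = reference_week_total / days
    if daily_baseline == 0:
        return None  # no baseline at all: closer to "never before seen" than a deviation
    return 100.0 * (today_count - daily_baseline) / daily_baseline

def large_deviations(today, reference_week, up_pct=1000.0, down_pct=-90.0):
    """Return metrics whose deviation is at or beyond either threshold."""
    flagged = {}
    for metric in set(today) | set(reference_week):
        pct = percent_deviation(today.get(metric, 0), reference_week.get(metric, 0))
        if pct is not None and (pct >= up_pct or pct <= down_pct):
            flagged[metric] = pct
    return flagged

# Hypothetical event counts: last 24 hours vs. totals for the preceding 7 days
today = {"login_failure": 240, "port_scan": 14, "vpn_login": 1}
reference_week = {"login_failure": 140, "port_scan": 98, "vpn_login": 70}

flagged = large_deviations(today, reference_week)
```

In this made-up data, `port_scan` matches its daily average and is ignored, while the other two metrics land in `flagged` for the analyst to review.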
Some of the more useful queries/methods were (some were always useful, some occasionally and some more of the “experimental” variety):
- “never seen before” event types, internal IPs, users, etc; same thing for a particular network (such as new event types seen in DMZ, etc)
- super large deviations (1000%+ usually) up or down on many metrics
- IPs with a high count of unique event types
- IP address pairs with a high count of unique event types
- IPs as source with a high number of unique destinations accessed (aka scan or sweep)
- “2D” versions of the above such as: “New device – event type pairs that appeared today from 08/25/2003 0:00:01 to 08/26/2003 0:00:01” or “New destination port – event type pairs that appeared today compared to last week”
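The “2D” NBS idea is the same set difference as plain NBS, just over pairs instead of single values. A small Python sketch, assuming events are available as (device, event type, destination port) rows; the row layout, names and sample data are hypothetical and only there to illustrate the technique.

```python
def never_before_seen(today_values, reference_values):
    """Values present in today's data but absent from the reference period."""
    return set(today_values) - set(reference_values)

# Hypothetical (device, event type, destination port) rows
reference_events = [
    ("fw01", "deny", 445),
    ("fw01", "accept", 443),
    ("ids01", "port_scan", 22),
]
today_events = [
    ("fw01", "deny", 445),
    ("ids01", "deny", 445),      # known event type, but new for this device
    ("fw01", "accept", 8080),    # known event type, but new for this port
]

# 1D: brand new event types (none in this sample)
new_event_types = never_before_seen(
    (etype for _, etype, _ in today_events),
    (etype for _, etype, _ in reference_events),
)

# 2D: new device / event type pairs
new_device_event_pairs = never_before_seen(
    ((dev, etype) for dev, etype, _ in today_events),
    ((dev, etype) for dev, etype, _ in reference_events),
)

# 2D: new destination port / event type pairs
new_port_event_pairs = never_before_seen(
    ((port, etype) for _, etype, port in today_events),
    ((port, etype) for _, etype, port in reference_events),
)
```

The point of the pairs is visible even in this toy data: nothing is new in 1D, yet two combinations have never been seen before.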
So, you want to do security data analytics now in 2014?
- First, forget Hadoop for a moment
- Think analytic and data exploration mindset
- Review the tools you already have (SIEM, log management, a syslog server, etc)
- Look at available data (in your SIEM, for example)
- Start simple [NBS is very simple – yet eternally useful!]
- Explore and expand from there!
- Eh…and don’t write 13000 line Perl scripts 🙂
My long-time readers may recall my blog post called “Pathetic Analytics Epiphany!” (also look at the comments there) where I lamented that many security tools do not pay enough attention to stored data analysis.
Now the tide is changing – but why wait until you have that “machine learning in a box”? You can do it now! In fact, you could have done it 10 years ago!! Stop reading – go do it!!! 🙂
The Gartner Blog Network provides an opportunity for Gartner analysts to test ideas and move research forward. Because the content posted by Gartner analysts on this site does not undergo our standard editorial review, all comments or opinions expressed hereunder are those of the individual contributors and do not represent the views of Gartner, Inc. or its management.
Appreciate your contribution to the SIEM World! Whether we like it or not, real-time log analysis/monitoring may not be sufficient to detect all the anomalies in the enterprise. But doesn’t it add overhead to the SIEM tools (including the major ones) when we perform this historical log data analysis day in, day out as part of regular operations? Sounds like moving the long term storage out of SIEM tools makes more sense in the long run…
Thanks for the comment.
>Sounds like moving the long term storage out of SIEM tools makes more sense in the long run…
Long term storage (say >30 days) cannot really be moved out of SIEM since … it never was IN SIEM 🙂
Over the years, I’ve seen very few orgs that had a SIEM with years of data in a database. Of course, they had archives and/or used a separate log management system to store data for the longer term, but SIEM is ill-equipped to be permanent storage for data.
Great post, thanks for sharing this.
I’d like to highlight your point about having a separate meta data store. Current on-prem log solutions still don’t offer sufficient search performance to enable interactive long term analytics, especially considering the volume and variety of data sources typically collected today. Keeping the aggregates and other stats in a dedicated analytics layer is key.
Another obstacle is the assumption that the log data will be structured. Which leaves application security a big challenge. Without better, machine learning powered approaches to universal log data interpretation we’re a bit out of luck.
>Current on-prem log solutions still don’t offer sufficient search performance to enable interactive long term analytics, especially considering the volume and variety of data sources typically collected today
Well, what do you mean by “long term”? 3 months is long term for some people, while 3 years is that for others. In general, I’d trust a log management tool much more than a SIEM for that; multi-month searches are pretty fast [as long as you have enough hardware and distribute the load well]
>Another obstacle is the assumption that the log data will be structured. Which leaves application security a big challenge.
Yes, this one is a nasty mess, for sure 🙁