
Why Your Security Data Lake Project Will FAIL!

By Anton Chuvakin | April 11, 2017 | 9 Comments


Beats me, but for some reason organizations think that they can build A SECURITY DATA LAKE and/or their own CUSTOM BIG DATA SECURITY ANALYTICS tools. Let me tell you what will happen – it will FAIL.

Cue the data swamp jokes. Mention data pond scum. Discuss pissing in the data pool. The result is the same – it likely won’t work.

OK, let me tone this down a bit – it will be successful (however that is defined) for 0.1% of those who try [the percentages are approximate and are meant to increase the dramatic impact of this post, not to share data].

Why am I so adamant about it? During our UEBA research we encountered several organizations that are migrating from DIY/custom security analytics to COTS (typically to UEBA, as it has matured). What truly shocked us was that some organizations reported that they had had a custom security analytics project running for a few years – but it is now being shut down due to the “huge effort, low value” combination. Even more shocking, some of the organizations were essentially in a “Fortune 50” class, presumably the global technology elite. It didn’t even work for [some of] them…. The QotD [modified to remove any possible relation to the client] was “we wish we’d never discovered Hadoop – we wasted years trying to build a security analytics capability out of it.”

Motivated by cheap hardware, reduced data redundancy (store one copy – wow!) and the promise of advanced analytics, they went for it … and mostly FAILED.

Some of the reasons for failure or relative lack of success included:

  1. Dirty data – you throw stuff in and then cannot use it; the #1 “fail-cause” (great story about it)
  2. Trouble with collecting data – SIEM vendors spent 10+ years debugging their collectors for a reason…
  3. Trouble with accessing data – data went in – plonk! – and now nobody knows how to get it out to do analysis (great story here)
  4. No value beyond collection – the data lake was created and filled with data, so it is there just in case, but all subsequent project phases stumbled
  5. No value beyond keyword search – the data lake was created to enable advanced analytics, but ultimately delivered only basic keyword search of logs
  6. No threat detection value – this happened when somebody hired a big data company to build a security data lake; they built all the plumbing and said “ah, security use cases? you do it!” and left
  7. Failure to conceptualize and define the security analytics use cases – OK, we will now detect threats… OK, how? Well, nobody knows, and there is no time to experiment. And see #1 – dirty data
  8. Security analytics use case design turned out to be much harder than expected
  9. A much higher bar for analytics and big data talent, and failure to acquire said talent

(note that some are overlapping and/or related)
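To make failure cause #1 concrete, here is a minimal, entirely hypothetical Python sketch of what “dirty data” looks like in practice: log lines in mixed, undocumented formats get dumped into the lake, and any later parsing step silently discards most of them. The log lines, field names, and regex are invented for illustration, not taken from any real deployment.

```python
import re
from datetime import datetime

# Hypothetical mixed-format events, as typically found in a "just throw it in" lake.
raw_events = [
    "2017-04-11T10:02:33Z sshd[412]: Failed password for root from 10.0.0.5",
    "Apr 11 10:02:35 host sshd[412]: Accepted password for alice",  # syslog-style
    "login failed user=bob",                                        # no timestamp at all
]

ISO_TS = re.compile(r"^(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2})Z\s+(.*)$")

def parse(line):
    """Return (timestamp, message), or None when the line cannot be parsed."""
    m = ISO_TS.match(line)
    if not m:
        return None  # in a real pipeline: route to a quarantine/repair queue
    return datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S"), m.group(2)

parsed = [p for p in (parse(e) for e in raw_events) if p]
dropped = len(raw_events) - len(parsed)
print(f"parsed={len(parsed)} dropped={dropped}")  # → parsed=1 dropped=2
```

The point of the sketch: if normalization is deferred until “analysis time,” most of the lake can turn out to be unusable, and nobody notices until the analytics phase stumbles.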

As we say here, “Given the simplicity of the technical characteristics of a data lake, it shouldn’t come as a surprise that getting value out of this concept is entirely dependent on the availability of advanced programming and analytics skills.” For security, you also need to add threat analysis skills to the mix.

In essence, the only successful project type (and this is not really security analytics, not by a long shot) was “install ELK, throw logs in, search for keywords.” This works well, but it is NOT what they aspired to – not even close. Not even in the same realm.
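For context on why that outcome is so far below the ambition: the whole “search for keywords” capability boils down to a single Elasticsearch query-DSL document. The sketch below builds one in Python; the index, field names, and function are hypothetical and purely illustrative.

```python
import json

# Hypothetical helper: the entirety of "keyword search of logs" as a
# request body for Elasticsearch's _search endpoint.
def keyword_search(term, field="message", size=50):
    """Build the request body for a basic keyword search over log events."""
    return {
        "size": size,
        "sort": [{"@timestamp": {"order": "desc"}}],  # newest events first
        "query": {"match": {field: term}},
    }

body = keyword_search("failed password")
print(json.dumps(body, indent=2))
# POST this to a hypothetical /logs-*/_search — useful, but it is search,
# not analytics: no baselining, no correlation, no detection logic.
```

That a multi-year "security data lake" effort often delivers only this is exactly the “no value beyond keyword search” failure mode above.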

To conclude, successful custom big data security analytics efforts remain rare outliers, like a flying car. My 2012 post was full of hope – and sadly it didn’t work out. At this point, it is very clear to me that DIY or open source is NOT the way to go for security analytics. Sure, we will continue watching both Spot and Metron, but frankly at this point I am a skeptic.

So, a short summary: open-source-based log aggregation – sure; custom security analytics – it only worked well for a very select few. If you still want to try, feel free to review this for some ideas (if you read it, provide feedback here!). It seems like this document will NOT be updated anytime soon…


The Gartner Blog Network provides an opportunity for Gartner analysts to test ideas and move research forward. Because the content posted by Gartner analysts on this site does not undergo our standard editorial review, all comments or opinions expressed hereunder are those of the individual contributors and do not represent the views of Gartner, Inc. or its management.



  • Andre Gironda says:

It’s not true. Yes, there are large orgs, including the Fortune 50 and Fortune 500, that have whole business units dedicated to custom security analytics. I have seen many of these first-hand.

However, you are missing both the smaller picture and the bigger picture. If you look closely enough, you’ll find a sub-basement under the SOC basement where the Cyber Threat staff sit. One or more of them may be a doctor, a lawyer, or even former law enforcement or military. The key thing is that while they are power users of all the analytical platforms their organization controls, they are also commanding their own tools, hidden away from prying eyes. This is only the smaller picture.

    The bigger picture is that 0.1 percent will turn into 10 percent and then 20 percent because tools such as the Threat Hunting Platform — — will continue to become available and power users all over will be running analytics and sharing their results via code, images, and stats in Jupyter and Databricks Notebooks (or whatever else comes next).

    In the cases where there was a business unit that outright replaced their custom platform with Securonix or Niara, it was probably because their old custom platform took 16 business days to generate a 400-page PDF using C++ libraries that hadn’t been worked on since before SourceForge was even around, let alone GitHub. These developers would have never made the cut at these COTS UEBA shops.

    You are right; and you are wrong. From a C-level-only view, you are absolutely right. However, if you want the ground truth from the ground up, you can find many professionals leveraging a variety of analytical tools and techniques that meet and exceed the capabilities of COTS UEBA, even when best integrated with SIEM and Security Operations Automation platforms. You’ll find them at FS-ISAC meetings, you’ll find them at the AV vendor conferences, and you’ll find them on the SIRA, and they’ll probably find you if you are one of them. Be one of us. We may be unique, but it’s worth it to do things right.

    • Thanks a lot for your super-insightful comments. Let’s continue the discussion.

      This (“The bigger picture is that 0.1 percent will turn into 10 percent and then 20 percent”) really caught my eye – since I’ve been waiting for this since 2012 (see that outlier post). Essentially, my opinion is that it is NOT happening. Will it happen? I don’t know. May it happen? It sure may.

      Now, you are 100.0% correct in this “In the cases where there was a business unit that outright replaced their custom platform with Securonix or Niara”, their custom tool was likely not that great AND required a lot of busywork to maintain on top of it.

      Finally, have I seen examples of EPIC WIN with DIY analytics?! For sure, I have! Some of them were very, very impressive – in fact, ARE very impressive. However, my other point was that many attempts to replicate those successes (however real!) have not been… well… successful. Many a security pro has been peeking into their shiny “security data lake,” scratching their heads 🙁 and probably thinking “but it worked so well for Bank Z and Company C, why not here”…

  • Tom Clare says:

    We see blended environments in large enterprises, with the following at different stages.

    1. Leverage ELK in front of your SIEM for simple queries, also to reduce expenses.
    2. Keep SIEM use rational for its strengths for operations, performance, availability, and compliance with an eye on expenses, including indexing fees. SIEMs are a key data source for analytics.
    3. Store data for long term value in big data, either vendor provided or your own deployment.
    4. Expect analytics vendors to compute and store on your data lake of choice or the one you selected as your primary vendor.
    5. Avoid storing data multiple times across analytics solutions.
    6. Request custom machine learning model tools to build your own use cases, avoid the black box.
    7. Demand bidirectional APIs from vendor key data sources that drive analytics, the democracy of data with APIs is key to success. Also strive for automated risk response in use cases via API.
    8. Expect analytics features within solution silos, but understand that they are restricted to the data within the silo itself.
    9. An understanding of identity as a perimeter and the context of big data as horizontal planes is important for success.
    10. We all have hybrid environments with cloud adoption at various degrees, factor in cloud analytics into your plan and clean up identity before migrating.

    We do meet the basement innovators at events, see you soon at FS-ISAC and NH-ISAC in May.

  • Andre Gironda says:

    When, ultimately, we have to front vendor SIEMs with Elastic and Kibana just to run simple queries, and we have the potential to use other plain-as-day cloud services (e.g., Databricks, DataStax, Rescale, Google Cloud ML, the AMIs in AWS with S3, the conda package manager from the Anaconda project at Continuum Analytics to standardize, et al.) plus what amounts to trivial code samples you’ll find in hundreds of books or on Stack Overflow – why wouldn’t DIY be encouraged over some expensive UEBA vendor with all of the hand-holding?

    Isn’t anyone who is in infosec for the long run a bit discouraged by vendors, especially newfangled, bleeding-edge ones? What you see is a company (not an org, not an industry leader), or a series of companies, converge on key phrases that impress people with a lot of money (because C-levels make a lot of money), then get some other people with a lot of money to invest in their companies so that they can do the VC thing and the quick sell – by which time the platform is completely useless, in about 2-3 years, when attackers find a way around the product. Rinse, repeat, recycle?

    Why not invest in people and their analytical skills with toolchains that grow with the size and complexity of each task or problem to solve?
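To illustrate the kind of small, homegrown analytic this argument favors, here is a hedged, stdlib-only Python sketch: flag users whose login volume deviates sharply from the population baseline. The data, field names, and threshold are all invented for illustration; a real analytic would baseline per-entity over time, not across a toy population.

```python
from statistics import mean, stdev

# Hypothetical daily login counts per user (invented data).
logins_per_user = {
    "alice": 12, "bob": 9, "carol": 11, "dave": 10,
    "mallory": 96,  # service account? compromised credential? worth a look
}

def outliers(data, z_threshold=1.5):
    """Return users whose count is more than z_threshold stdevs from the mean."""
    counts = list(data.values())
    mu, sigma = mean(counts), stdev(counts)
    return [u for u, c in data.items() if abs(c - mu) / sigma > z_threshold]

print(outliers(logins_per_user))  # → ['mallory']
```

This is roughly the level of code the comment means by “trivial code samples” – the hard part, as the post argues, is the data quality, the use case design, and the people to run it, not the math.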

    Next thing, we’ll see compliance standards mandate UEBA, and then we’ll be in a double pickle.

    It is nice to hear that basement innovators are appreciated by good folks like Tom Clare. Thank you!

    • Thanks for this comment as well. This is actually even more interesting.

      Here is the q you ask “Why not invest in people and their analytical skills with toolchains that grow with the size and complexity of each task or problem to solve?”

      Admittedly, there are very good arguments (see for example) for doing just that – invest in foundational tools (such as those you name) and skills.

      However, while this is definitely a good idea, for many org cultures it ended up not working. We can lament it, and we will [both of us, in fact :-)], but it is hard to recommend this approach to many.

      And yes, UEBA products also fail to deliver, once in a while (OK, more than once in a while :-)), but it seems like we are seeing relatively more success there compared to DIY…

  • Avkash K says:

    What you said may be correct in the current context, when security analytics and machine learning are more buzzwords than successful implementations. But if I recollect the baby-step days of SIEM, it was more a compliance tool than anything else. It caught people's eyes only when the intelligent community started adding value to it by giving it an effective security angle. And that’s exactly the case, I feel, with security analytics and the like: the technology has a lot of potential; all we need is the security community brainstorming its effectiveness and successful implementation. There are many areas in this analytics space that are still untouched, even as we talk about the failure of the concept. Personal opinion: every new technology is bound to fail if not combined with effective human intelligence, and this is a technology that will not work for all organisations in the same fashion. Implementation, risk coverage, and goals will vary from org to org.

    • Agree with your logic, for sure. BUT! The situation you describe re: analytics was already there in 2011-2012. FIVE YEARS have passed – and we are still in the same spot? To me this sounds like something more is broken. Think of your SIEM analogy – SIM/SEM from 1999 to 2004 made a lot of steps [some backwards due to compliance, sure :-)] and adoption grew…

  • Benoit Rostagni says:

    I am a big fan of this analysis; I have shared it for many years.
    A data lake does not magically solve problems.
    Data lakes are just a storage area that allows queries to be built on non-normalized data.
    But to solve security problems, you need, in the case of a security data lake, a little army of data scientists specialized in security – full time, not just during setup – because security is an evolving world, with attack strategies changing every day that need to be adapted into the ELK (or other) models.
    It fails because companies listen to Big Data vendors who stop their work after building the plumbing and the storage area, and minimize the real problem: attack models and strategy!

    We are at war against hackers, and having a history of all events in a data lake does not magically reveal the next move of your enemy.