Big Data for Security Realities: Case 1: Too Much Volume To Store aka “Big Data Collection”

By Anton Chuvakin | October 10, 2013 | 3 Comments

security analytics | Data and Analytics Strategies

If you fertilize the field of big data with enough marketing bullshit, something will grow. Well, keep waiting for it 🙂 Use of “big data analytics” approaches for security seems like THE most “bullshit-rich” area of the entire infosec realm (beating such worthy contenders as APT, DLP, BYOD and, of course, “cyber”). However, there ARE definitely end-user organizations doing it for real (and not just the illustrious crew at Zions Bancorporation).

Part of my research this quarter focuses on assessing the reality of big data for security and providing practical, GTP-style recommendations for enterprises. This post is the first in my “reality files” dedicated to use of big data approaches for information security.

One case that keeps popping up on my radar (which is programmed to only scan reality, not the realm of wishful thinking and obnoxious PowerPoint slides) is the case of “too much data volume to store,” or “big data collection.” Specifically, this scenario often goes like this:

  1. An organization buys a SIEM for, say, $1,000,000 (admittedly, not that much, as far as large enterprise SIEM pricing is concerned…) and likes it
  2. Quickly, they realize they can only store 14-30 days of data inside the SIEM operational data store (be it an RDBMS or a columnar backend)
  3. They reach out to vendors with a log management RFP, with a requirement to store 3 years of raw data (say, with total volume in high TBs or low PBs)
  4. Soon, they get quotes back – and it is *ANOTHER* $1,000,000!
  5. At this point, the team goes “Darn! We can build it ourselves for 10% of that”
  6. And their Hadoop cluster is born…
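The volumes in step 3 are easy to sanity-check with a back-of-the-envelope calculation. The event rate and average message size below are illustrative assumptions (not figures from the scenario), but they show how a large enterprise lands in the “high TBs or low PBs” range:

```python
# Rough log-volume estimate for a 3-year raw retention requirement.
# EVENTS_PER_SECOND and AVG_EVENT_BYTES are illustrative assumptions.
EVENTS_PER_SECOND = 50_000      # assumed sustained average event rate
AVG_EVENT_BYTES = 300           # assumed average raw log message size
RETENTION_DAYS = 3 * 365        # the 3-year retention requirement

bytes_per_day = EVENTS_PER_SECOND * AVG_EVENT_BYTES * 86_400
total_bytes = bytes_per_day * RETENTION_DAYS

print(f"per day:  {bytes_per_day / 1e12:.2f} TB")   # ~1.30 TB/day
print(f"3 years:  {total_bytes / 1e15:.2f} PB")     # ~1.42 PB total
```

With these assumed inputs, raw retention alone lands at roughly 1.4 PB before any indexing or replication overhead, which is exactly the scale where traditional log management pricing starts to hurt.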

Admittedly, they would later face challenges with streaming data from collectors to both the SIEM and Hadoop (or through the SIEM to Hadoop), linking the systems for seamless drilldown, running searches (Hadoop grep, anybody?), and selectively “structuring” the data from the cluster. Some of these are merely challenging, while others are extreme (try picking the right data to process from a huge pile, and then being sure that you picked all of what you needed). However, in several cases that I’ve seen, the organizations were happy with what emerged, since simply knowing that “the data is there” and “they have not paid a ton for it” was comforting to them. Note that in this case the organization uses the system for retention and occasional ad hoc queries (such as during an incident), not for any analytics (though that is usually on the roadmap for some remote future time…)
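The “Hadoop grep” quip above is not entirely a joke: ad hoc search over raw logs in HDFS is often just a Hadoop Streaming job with a trivial filter mapper and no reducers. A minimal sketch, with hypothetical paths and an illustrative search pattern (not anyone's production setup), might look like:

```python
#!/usr/bin/env python3
# grep_mapper.py -- a trivial Hadoop Streaming mapper that emits only the
# raw log lines matching a regex; "ad hoc query" in the crudest sense.
#
# Hypothetical invocation (paths and pattern are illustrative assumptions):
#   hadoop jar hadoop-streaming.jar \
#       -input /logs/raw/2013/ -output /tmp/grep-out \
#       -mapper 'grep_mapper.py "Failed password"' \
#       -numReduceTasks 0
import re
import sys

def grep(lines, pattern):
    """Yield only the lines that match the given regex pattern."""
    rx = re.compile(pattern)
    for line in lines:
        if rx.search(line):
            yield line

if __name__ == "__main__" and len(sys.argv) > 1:
    for match in grep(sys.stdin, sys.argv[1]):
        sys.stdout.write(match)
```

This gets you cheap retention-plus-search, but it also illustrates the gap the paragraph above describes: a full-cluster scan per question is a far cry from the indexed search and drilldown a SIEM or log management product provides.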

More “big data for security” reality files coming soon!

Related posts on the topic of big data for security:

The Gartner Blog Network provides an opportunity for Gartner analysts to test ideas and move research forward. Because the content posted by Gartner analysts on this site does not undergo our standard editorial review, all comments or opinions expressed hereunder are those of the individual contributors and do not represent the views of Gartner, Inc. or its management.


  • Good stuff! The bullshit level is indeed high these days, but so are the log volumes 🙂 Looking forward to more in this series.

  • @jeff Indeed, both log volumes and context/supplemental data volumes are going up. So, we will see how it goes.

  • Carl says:

    Thank you so much, Anton Chuvakin, for sharing this information about storing data. There is a lot of software, like virtual data rooms, where we can store large volumes of data.