Part of my research this quarter focuses on assessing the reality of using big data approaches for security and providing practical, GTP-style recommendations for enterprises. So, what else IS real in this segment overrun by waves of bull?
One more case that occasionally (not as often as Case 1) shows up is “massive indexed pile.”
This is not just about data collection at scale, but about indexing and text analysis of unstructured or barely structured text data. The fact is, indexed search over vast quantities of data is very different than simply having a pile of such data, even if stored in HDFS. A 1TB file stored on disk(s) is much less usable (and thus less useful) compared to the same file with well-designed indexed search logic on top of it. The difference is not in quantity (you can merely search faster), but in quality (here you can find it within the time you have, and there you cannot).
One may think ‘I can just grep it’, but do try running grep on a 10GB file one of these days. Sloooow! Now realize that if your daily log haul is 10TB, the situation is much more than 1000 times worse.
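To make the "pile vs. indexed pile" difference concrete, here is a minimal sketch in Python. It is a toy, not how Lucene or any real tool works internally: the log lines, the tokenizer and the function names are all made up for illustration. The scan touches every line for every query; the inverted index pays that cost once at build time and then answers a query with a dictionary lookup and a set intersection.

```python
import re
from collections import defaultdict

# Hypothetical sample log lines, invented for illustration only.
LOG_LINES = [
    "2013-09-01 10:00:01 sshd failed password for root from 10.0.0.5",
    "2013-09-01 10:00:02 httpd GET /index.html 200",
    "2013-09-01 10:00:03 sshd accepted password for alice from 10.0.0.7",
    "2013-09-01 10:00:04 sshd failed password for bob from 10.0.0.5",
]

# Assumed tokenizer: lowercase runs of letters, digits and a few separators.
TOKEN_RE = re.compile(r"[a-z0-9./:_-]+")


def tokenize(line):
    """Split a log line into lowercase tokens."""
    return TOKEN_RE.findall(line.lower())


def scan_search(lines, keyword):
    """The 'pile': read every line on every query (grep-like, token match)."""
    keyword = keyword.lower()
    return [i for i, line in enumerate(lines) if keyword in tokenize(line)]


def build_index(lines):
    """The 'indexed pile': map each token to the set of line numbers containing it."""
    index = defaultdict(set)
    for i, line in enumerate(lines):
        for token in tokenize(line):
            index[token].add(i)
    return index


def indexed_search(index, *keywords):
    """Return line numbers containing ALL keywords, without rereading the lines."""
    postings = [index.get(k.lower(), set()) for k in keywords]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


index = build_index(LOG_LINES)
hits = indexed_search(index, "failed", "10.0.0.5")
for i in hits:
    print(LOG_LINES[i])
```

At four lines nobody cares, of course. At terabytes a day, the full scan is the part you cannot afford to repeat for every question an analyst asks.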
Admittedly, 10TB of log-like data a day is not that common (but not that rare nowadays either, especially if all logs – not just security relevant ones – are collected), but once a way to handle such data appears, both the volume and the usage of the data tend to grow.
The scenario goes something like this:
- the organization has no SIEM or log management (or only basic log management)
- it wants to be able to collect, access, and ultimately see the data and there is a lot of it already piling up in various places
- sometimes multiple stakeholders get in the same room and compare notes on what data they need and what level of access/analysis
- given the volume, needs and available resources, the organization goes commercial or open source
- over time, some organizations move from searching to narrowly focused analysis (to be discussed in the Case 4 post in a week or so) of search result sets, and then sometimes to real analytics.
Now, is keyword searching of a huge pile of data really analytics? No, it really is not. However, after you find the data, you can structure it selectively, apply whatever schema du jour or place it in whatever structured data storage (be it RDBMS or MongoDB). Big data will not end with your indexed pile, but it can definitely start there.
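As a rough illustration of "find first, structure later," the Python sketch below takes a handful of raw search hits and loads only the fields of interest into a small relational table. Everything here is assumed for the example: the log format, the regular expression, the `failed_logins` table and the function names are hypothetical, and an in-memory SQLite database simply stands in for "whatever structured data storage" you prefer.

```python
import re
import sqlite3

# Hypothetical raw hits, as a keyword search for "failed password" might return them.
HITS = [
    "2013-09-01 10:00:01 sshd failed password for root from 10.0.0.5",
    "2013-09-01 10:00:04 sshd failed password for bob from 10.0.0.5",
    "2013-09-01 10:02:11 sshd failed password for root from 10.0.0.9",
]

# Assumed schema for this one question: timestamp, user name, source address.
PATTERN = re.compile(
    r"^(?P<ts>\S+ \S+) sshd failed password for (?P<username>\S+) from (?P<src>\S+)$"
)


def load_hits(hits):
    """Parse matching lines and store them in an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE failed_logins (ts TEXT, username TEXT, src TEXT)")
    rows = []
    for line in hits:
        m = PATTERN.match(line)
        if m:  # lines that do not fit the schema are simply skipped
            rows.append((m.group("ts"), m.group("username"), m.group("src")))
    conn.executemany("INSERT INTO failed_logins VALUES (?, ?, ?)", rows)
    return conn


def failures_by_source(conn):
    """Count failed logins per source address, busiest first."""
    query = (
        "SELECT src, COUNT(*) FROM failed_logins "
        "GROUP BY src ORDER BY COUNT(*) DESC, src"
    )
    return conn.execute(query).fetchall()


conn = load_hits(HITS)
for src, count in failures_by_source(conn):
    print(src, count)
```

The point of the design is that the schema is applied only to the search result set, not to the whole pile: the pile stays unstructured, and each question gets whatever structure it needs.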
Reported volumes from the setups I’ve seen so far go well into single-digit petabytes of total data volume. The implementation may include Hadoop with Lucene/Solr, ElasticSearch, logstash, ELSA or a commercial tool (that you all know pretty well…) – the implementation details are a separate story and go beyond the scope of this post.
By the way, the difference between this case and case 1 is in the fact that for case 1, the big data-style system is used as expanded storage for a SIEM, while in this case it is the primary system in use (no SIEM). Admittedly, a hybrid case is entirely possible and was seen a few times as well.
Related posts on the topic of big data for security:
- Big Data for Security Realities – Case 2 Variety Explosion
- Big Data for Security Realities: Case 1: Too Much Volume To Store aka “Big Data Collection”
- Big Data Analytics for Security: Having a Goal + Exploring
- More On Big Data Security Analytics Readiness
- Broadening Big Data Definition Leads to Security Idiotics!
- Next Research Project: From Big Data Analytics to … Patching
- 9 Reasons Why Building A Big Data Security Analytics Tool Is Like Building a Flying Car
- “Big Analytics” for Security: A Harbinger or An Outlier?
- All posts tagged big data
Comments or opinions expressed on this blog are those of the individual contributors only, and do not necessarily represent the views of Gartner, Inc. or its management. Readers may copy and redistribute blog postings on other blogs, or otherwise for private, non-commercial or journalistic purposes, with attribution to Gartner. This content may not be used for any other purposes in any other formats or media. The content on this blog is provided on an "as-is" basis. Gartner shall not be liable for any damages whatsoever arising out of the content or use of this blog.