There’s a quote near the top of Philipp Janert’s instant classic Data Analysis Using Open Source Tools:
[I]t seems that for many people in the tech field, ‘data’ has become nearly synonymous with ‘Big Data.’ That kind of development usually indicates a fad. The reality is that, in practice, many data sets are ‘small,’ and in particular many relevant data sets are small.
He was writing in 2010 — and, oh, what a fast five years it’s been. Janert goes on to say that classical statistics was built to perform inductive operations – start with a subset of a mess of information and draw conclusions about the mess. Big Data puts the mess in our midst, which is a mixed blessing.
As Janert says: “Big Data makes it easy to forget the basics.”
But there’s no avoiding it now: the fad is not fading. The rush of outrage greeting my recent summary post about AdWeek in NYC titled “’Big Data Is a Big Distraction’: Notes from #AdWeekXII” put me on notice. Never mind that I was quoting someone else (i.e., not myself) and was simply reporting the ad industry’s reaction against last year’s Big Data hysteria – a reaction against hype and not substance.
Let us admit to ourselves the obvious: We need to walk into the light, amigos. Big Data is a big reality.
So what do marketers need to know about it? What follows is a primer on the topic for the interested beginner. It’s based on a recent research report I published called “Understand Big Data Basics for Marketing” (Gartner subscribers enjoy here). It’s not – I mean not – for the white-coated, square-eyed crowd down there in the clean room.
So: Big Data Basics for Marketers.
What is Big Data? Let’s keep it simple. Big Data is data that is so big it won’t fit on a single machine. It has to be spread over many machines. And it can come from anywhere, so it might be in strange and exotic formats. And it’s coming fast. These ideas of size, speed and format are captured in the often-quoted concept of the “three V’s”: volume, velocity and variety.
Big Data Ecosystem
Big Data is not a single technology or a short list of vendors. Rather, it is a loose collection of evolving tools, techniques and talent. These fall into three key categories: (1) storage, (2) processing and (3) analytics. Storage aligns with the volume component of Big Data’s “three V’s,” processing aligns with velocity, and variety spans both. Analytics refers to the methods used to gain insights from all this stored and/or processed information.
Enterprise data is traditionally stored in relational databases and managed by a database management system. Relational means that the database is structured in tables that can reference one another in a carefully organized way. A fancy term for these structures is schema. Big Data storage differs from relational databases in that it often stores data that has not been mapped to a particular schema – rather, a schema can be imposed later (this is called schema-on-read). All this looseness means data is available more rapidly for use.
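To make schema-on-read concrete, here is a minimal Python sketch (the field names and records are invented for illustration): raw records are stored exactly as they arrive, and a structure is imposed only at the moment the data is read.

```python
import json

# A raw "data lake" of unparsed records: no schema was enforced at write time.
raw_records = [
    '{"user": "ann", "page": "/home", "ms": 120}',
    '{"user": "bob", "page": "/pricing"}',  # missing a field -- stored anyway
]

def read_with_schema(lines, schema):
    """Impose a schema only at read time (schema-on-read)."""
    for line in lines:
        doc = json.loads(line)
        # Fill gaps with defaults instead of rejecting the record up front.
        yield {field: doc.get(field, default) for field, default in schema.items()}

rows = list(read_with_schema(raw_records, {"user": None, "page": None, "ms": 0}))
print(rows[1]["ms"])  # -> 0: the missing field is filled in at read time
```

A relational database would have rejected the second record at write time; here it is kept, and the reader decides what to do with the gap.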
So what is Hadoop? No doubt you have heard of this thing, and we’ve described it at some length in the past. In fact, like many Big Data terms, Hadoop is an umbrella: it is applied to different technologies that have three characteristics in common:
- Distributed data – data is spread over a number of different hardware locations, called nodes, increasing storage space and potentially controlling cost
- Cluster computing – processing is handled by clusters of computers whose nodes are linked together by software, so that they act like a single system
- Massively parallel processing – data is processed simultaneously within the clusters, greatly increasing speed
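Those three characteristics are easiest to see in the Map/Reduce pattern itself. Below is a toy word-count sketch in plain Python; a thread pool stands in for a cluster of nodes, and the data and names are invented for illustration.

```python
from collections import Counter
from multiprocessing.dummy import Pool  # thread pool; a real cluster uses many machines

# Toy stand-in for distributed data: imagine each chunk lives on its own node.
chunks = [
    "big data makes it easy",
    "easy to forget the basics",
    "data data everywhere",
]

def map_chunk(chunk):
    """Map step: each node counts words in its own chunk, in parallel."""
    return Counter(chunk.split())

def reduce_counts(partials):
    """Reduce step: merge the per-node partial counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

with Pool(len(chunks)) as pool:
    word_counts = reduce_counts(pool.map(map_chunk, chunks))
print(word_counts["data"])  # -> 3
```

No single worker ever sees the whole data set, yet the merged answer is complete; that is the trick that lets the pattern scale to data too big for one machine.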
Hadoop shares two other characteristics with many of the energetic Big Data technologies. First, it was developed by engineers at digital media companies: in this case, Yahoo and Google (which built a critical precursor called Map/Reduce). Why? Because things like search engines and social networks have to handle more data than any companies ever have just to operate; and so, to stay in business, they had to build things that didn’t exist before.
Second, like most Big Data things, Hadoop was released into the open source world, curated by the Apache Software Foundation. Why? There are many reasons best explained by psychologists and sociologists (and lawyers), but ultimately open source technologies tend to get tested, improved, scrutinized and updated more rapidly, in more intensely practical ways, than many closed technologies. It’s hard for any but the largest companies to employ enough engineers and rigor to hone a closed piece of code; and meanwhile, there is plenty of money to be made selling products built on open source modules or wrapping the pieces together and making them look pretty.
I am willing to guess you – yes, you – would be shocked if you really understood to what extent that whizzy piece of expensive cloud software you’re using actually (deep, deep in its soul) was running on absolutely free, not-developed-here, open source technology that you – yes, you – could probably bang into something almost as useful if you only knew how to do it.
Hadoop is only part of the Big Data story. It usually exists within an ecosystem of other components that fill in its blanks and provide services we need: things like processing data on the move, speeding up the reading and writing of data, giving users ways to write queries to access the data, and so on.
Data cannot be used unless it is processed, which simply means collected and moved into storage or other systems in an organized way. Because it has to be distributed across a number of different nodes and is generally not in a predefined format, Big Data requires its own approach to processing. Processing itself comes in two types:
- Batch – processing data at rest (e.g., in a database)
- Streaming – ingesting and processing data in a continuous flow (e.g., from sensors or clickstreams)
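The difference is easy to see in miniature. In the Python sketch below (the numbers are made up for illustration), batch computes one answer over a complete data set, while streaming updates an answer as each event arrives.

```python
# Batch: process a complete data set at rest, all at once.
clicks_at_rest = [120, 98, 143, 110]              # e.g., rows already in a database
print(sum(clicks_at_rest) / len(clicks_at_rest))  # -> 117.75, one answer at the end

# Streaming: ingest events one at a time and update the result continuously.
def running_average(stream):
    total = count = 0
    for event in stream:          # e.g., a clickstream or sensor feed
        total += event
        count += 1
        yield total / count       # an up-to-date answer after every event

for avg in running_average([120, 98, 143, 110]):
    print(avg)                    # the last value matches the batch answer
```

Both arrive at the same final number; the streaming version simply never waits for the data set to be complete.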
Hadoop’s file system (HDFS) organizes the storage of data across many machines, and that thing called Map/Reduce (mentioned above) splits the processing work across the machines. Last year, the busy Apache team launched an open source project called Spark, spun out of the University of California, Berkeley. It is already the most common alternative to Map/Reduce and may replace it. Spark’s advantage is speed: it offers very fast in-memory processing and popular APIs for machine learning and other tasks hipsters like to do, and it can be 10 to 100 times faster than Map/Reduce.
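Part of Spark’s appeal is its fluent, chainable API. The toy class below mimics that style in plain Python purely for illustration; it is not Spark, which would distribute each step across a cluster and keep intermediate results in memory, which is where much of its speed comes from.

```python
from functools import reduce

class ToyRDD:
    """A toy, single-machine mimic of Spark's chained collection style."""
    def __init__(self, data):
        self.data = list(data)

    def map(self, fn):
        return ToyRDD(fn(x) for x in self.data)

    def filter(self, fn):
        return ToyRDD(x for x in self.data if fn(x))

    def reduce(self, fn):
        return reduce(fn, self.data)

# Word count again, expressed as a fluent chain of transformations:
lines = ToyRDD(["big data", "big ideas"])
count = (lines.map(lambda line: line.split())
              .map(len)
              .reduce(lambda a, b: a + b))
print(count)  # -> 4 words in total
```

The point of the style is that an analyst describes *what* to do as a pipeline of small steps; the engine decides *where* each step runs.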
Obviously, Big Data is not just about distributed storage and batch processing of data at rest; thus, we have things that do stream processing or real-time processing. Stream processing should be seen as a complement to other frameworks, such as Hadoop. It is best used for time-sensitive opportunities that don’t allow for armchair data exploration and insights. For example, taking session information and personalizing a website experience can make good use of streaming data.
Stream processing works by accessing data before it goes into a data store, preprocessing and enriching it. Intermediate results (called “rollups”) can be passed to distributed management systems such as Hadoop or Cassandra, or to faster in-memory databases, for further analytics, dashboards and monitoring. Two types of processing methods are associated with streaming:
- Complex Event Processing (CEP) – the more common method, it provides real-time analytics and monitoring for data in motion.
- Distributed Streaming Computation Platforms (DSCP) – these frameworks operate over networks of machines and execute user-defined functions on streams of data in parallel.
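As a concrete (and invented) example of a rollup, the Python sketch below pre-aggregates a simulated clickstream into fixed ten-second windows before anything is written to a store; the window counts, not the raw events, are what get passed downstream.

```python
from collections import defaultdict

# Simulated clickstream events: (seconds_since_start, page) -- invented data.
events = [(1, "/home"), (2, "/home"), (3, "/pricing"), (11, "/home"), (12, "/home")]

def rollup_by_window(stream, window_seconds=10):
    """Pre-aggregate raw events into fixed time windows (a simple 'rollup')."""
    windows = defaultdict(lambda: defaultdict(int))
    for ts, page in stream:
        windows[ts // window_seconds][page] += 1
    return dict(windows)

rollups = rollup_by_window(events)
print(rollups[0]["/home"])  # -> 2 hits in the first 10-second window
print(rollups[1]["/home"])  # -> 2 hits in the second window
```

Downstream systems then store or monitor a handful of window counts instead of every raw click, which is what keeps real-time dashboards fast.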
As I said in a previous, much (sniff) misunderstood blog post, Big Data is not intrinsically suited to analytics. All I mean by this outrageous statement is that widely dispersed, eccentrically structured information can be a bit more difficult to poke around in than, say, an Excel or CSV file. That’s all I meant. Breathe, people.
Many analytics techniques can and do make use of Big Data stores, generally by transforming the data into structured or semistructured formats first. These include:
- Data mining and predictive analytics
- Text and speech analytics
- Video analytics
- Social media and sentiment analysis
- Location and sensor analytics
- Machine learning
That’s enough for now. There’s more to be said, of course. I hope this was helpful. If you’re a seatholder with my team, Gartner for Marketing Leaders, you can read a more detailed version of this, with some tangible examples, right here. (If not, you might consider contacting our friends here.)
One last thought: Advanced analytics talent is in short supply everywhere, and the “data science gap” is particularly acute in marketing. Gartner’s March 2014 survey of data-driven marketers found that 54% of organizations got big data analysts from internal development, 32% relied on consultants, and only 13% were able to bring on outside talent. Gartner clients report real problems with both training and retention.
Now, the good news is that skills are becoming more common as analysts enroll in coursework or teach themselves how to fish. Big Data infrastructure and statistical languages are not easy to master, but they are relatively easy for an experienced analyst to start using. Online learning modules and communities such as Stack Overflow abound. As skills improve and traditional marketing analytics tools add more capabilities in familiar interfaces, we expect the pain to subside.
And remember, Big Data is no substitute for Big Ideas.