First, what is Hadoop?
Not so fast, amigos. You don’t just jump into something like Hadoop with half a parachute. You will need to pave a few dirt roads first on your way to the rainbow.
Let’s start with . . . well, how about this: DATA.
As you may have guessed, it’s all about data. Data is the most overused word in marketing and means both everything and nothing. What is it really? Let’s just agree that it is information that is interesting . . . and move on.
To this day, much of the data used by marketers sits in databases, and big companies invest a lot in big data warehouses. These are what we call relational databases, or RDBs, which have been around since the pre-Disco era. In fact, since exactly 1970, when they were first described in a paper by a guy at IBM named Edgar Codd. They store data in tables with a now-familiar format: each row is a record, and each column holds one of that record’s values.
The “relational” part refers to the use of many different tables that can be related to one another, which breaks the thing down into more practical pieces. For example, one table could store a list of digital ad campaigns, days they ran and product names; another table could store the specific offers in ads for each product. If you wanted to get a list of all offers that ran on a particular day, you could write a “query” that calls on both tables. This query is likely written in a language called SQL, which simply means “structured query language.” Its mission in life is to get answers out of RDBs.
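To make the two-table idea concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, campaigns, and offers are all made up for illustration; a real data warehouse would have its own schema and far more rows.

```python
import sqlite3

# In-memory database, so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")

# Two hypothetical tables: one for campaigns, one for offers per product.
conn.executescript("""
    CREATE TABLE campaigns (campaign TEXT, run_date TEXT, product TEXT);
    CREATE TABLE offers (product TEXT, offer TEXT);
    INSERT INTO campaigns VALUES
        ('Spring Launch', '2014-03-01', 'Sneakers'),
        ('Spring Launch', '2014-03-02', 'Sneakers'),
        ('Summer Sale',   '2014-03-01', 'Sandals');
    INSERT INTO offers VALUES
        ('Sneakers', '10% off'),
        ('Sneakers', 'Free shipping'),
        ('Sandals',  'Buy one, get one');
""")


def offers_on(day):
    """Return every offer that ran on the given day, sorted alphabetically."""
    rows = conn.execute(
        """
        SELECT DISTINCT o.offer
        FROM campaigns AS c
        JOIN offers AS o ON o.product = c.product  -- the "relational" part
        WHERE c.run_date = ?
        ORDER BY o.offer
        """,
        (day,),
    ).fetchall()
    return [offer for (offer,) in rows]


print(offers_on("2014-03-02"))
```

The JOIN line is where the two tables get related to one another: the query matches rows on the product name they share.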
Now, there’s nothing wrong with RDBs and data warehouses, and they are not going anywhere. But for marketers, in the past decade or so, they have started needing a little help. The problem — and I’m sure you’ve seen it coming from the county line — is a little thing called: BIG DATA.
Actually, it’s not just big data. It’s the Internet, which enables massive connectivity, making a world of information freely available that wasn’t in town before. There’s also the problem of formats, with data routinely popping up in videos, images, audio, social chatter, social profiles, machine and metadata — all sorts of things that don’t fit neatly into rows and columns, tend to take up a lot more space in memory than names and numbers, and yet contain information of great interest to marketers.
(You could substitute “people” for marketers here and not lose the plot, but, as I’ve said in the past and continue to insist, marketers are people.)
In fact, data warehouses can and do house all formats, including cat videos. The problem is not whether they can handle Big Data: they can. The problem is that they are too good, in a sense, for the problem at hand. They are secure, reliable, safe, compliant . . . and expensive. To scale them up to process deep insights from all the cat videos generated by your brand’s fans would no doubt be more expensive than those videos are worth. As anyone who has used their company’s internal workflow system knows, RDBs are slow and steady. Digital marketers, on the other hand, favor zippy and sloppy . . . and above all cheap.
There is also the problem of structure. Rows and columns are a beauty to behold, but what are we to do with a bunch of Tweets? What about a klatsch of captions? A list of friends of friends in a social network? Forget about those cat videos. Much of the information we care about these days just isn’t formatted correctly for the RDBs and never will be.
And let’s pause to admire scale here. It’s often underestimated. Real big data is really, really big. For example, we all know what a gigabyte is. We have a few of those on our phone. Imagine a gigabyte is equivalent to the population of your podunk hometown back in rural Iowa, the one you escaped last summer for that internship on Showtime’s most excellent House of Lies. So what’s a terabyte? That is equivalent to the population of the entire Los Angeles metro area. Your typical enterprise data warehouse for a big company can be a terabyte. (Ever hear of Teradata?) When you start talking in petabytes, you’re getting into big data indeed. That’s the equivalent of the entire population of the planet Earth — if it had a lot more people.
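If you want to check the analogy's arithmetic, here is a quick sketch. The hometown population of 13,000 is my own assumption, picked so that a thousand hometowns lands near the size of the Los Angeles metro area; the only hard fact in it is that each unit is a thousand times the one before.

```python
# Assumed population of the hypothetical hometown that stands in for 1 GB.
HOMETOWN = 13_000

# Decimal units: 1 TB = 1,000 GB and 1 PB = 1,000 TB.
STEP = 1_000

# How many thousand-fold steps each unit sits above a gigabyte.
STEPS_ABOVE_GIGABYTE = {"gigabyte": 0, "terabyte": 1, "petabyte": 2}


def people_equivalent(unit):
    """Return the population that matches the given unit in the analogy."""
    return HOMETOWN * STEP ** STEPS_ABOVE_GIGABYTE[unit]


for unit in STEPS_ABOVE_GIGABYTE:
    print(f"{unit}: {people_equivalent(unit):,} people")
```

Under that assumption a petabyte works out to 13 billion people, which is indeed a planet Earth with a lot more people on it.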
You get the idea. It’s not uncommon to encounter SaaS platforms for marketers these days that process petabytes of data. For example, although estimates vary, Google alone churns through more than 25 petabytes per day. Facebook’s daily user logs could be ten times bigger. Obviously, even at much smaller scales, big data is a big headache for your step-by-step, pricey, high-end data warehouse.
What’s a marketer person to do?
One answer is to try something called PARALLEL PROCESSING. Like many things in data and computer science, it’s a concept that sounds more abstract than it is. It doesn’t solve the problem of structure, but it does help with that of scale. Processing a lot of data, piece by piece, takes a long time. Enterprise data warehouse jobs are still sometimes left overnight to finish up. Parallel processing has been around a long time, and the idea here is to break the problem up into smaller pieces, which can be solved simultaneously.
An analogy I owe to United Health Care’s former V.P. of Informatics, Mark Pitts, is this: If you give one person a deck of cards and ask her to find the ace of spades, she has to go through the cards until she hits it. But if you give 52 people one card each and ask the crowd for the ace of spades, the girl with the ace sees it right away. That’s parallel processing.
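Here is the card trick as a minimal Python sketch. The function names are mine, and the thread pool is only there to show the shape of the idea (one worker per card); on a 52-card deck you won't see any real speedup, and Python threads are not how a big data platform spreads work across machines.

```python
from concurrent.futures import ThreadPoolExecutor
import random

SUITS = ["spades", "hearts", "diamonds", "clubs"]
RANKS = ["ace", "2", "3", "4", "5", "6", "7", "8", "9", "10",
         "jack", "queen", "king"]


def build_deck(seed=0):
    """Return a shuffled 52-card deck; the seed keeps the shuffle repeatable."""
    deck = [(rank, suit) for suit in SUITS for rank in RANKS]
    random.Random(seed).shuffle(deck)
    return deck


def find_serial(deck, target):
    """One person flips through the deck until she hits the card."""
    for position, card in enumerate(deck):
        if card == target:
            return position
    return None


def find_parallel(deck, target):
    """Each worker holds one card and reports back only if it is the target."""
    def holds_target(position):
        return position if deck[position] == target else None

    with ThreadPoolExecutor(max_workers=len(deck)) as pool:
        answers = pool.map(holds_target, range(len(deck)))
        return next((a for a in answers if a is not None), None)


deck = build_deck()
ace = ("ace", "spades")
print(find_serial(deck, ace), find_parallel(deck, ace))
```

Both functions return the same position. The difference is that the serial version may look at up to 52 cards in a row, while in the parallel version each worker looks at exactly one.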
Okay, so now we’re at Hadoop?! Right?!! Hashtag sigh. Not quite.
Our last step before we reach the depot is MAPREDUCE. Don’t give up. This is interesting. About ten years ago, Google faced a big data problem. (Because it’s Google, it faced this before the rest of us.) As you know, they need to crawl the Internet and index links and keywords and so on, and the Internet grows exponentially. It’s not a trivial problem. They needed a way to apply parallel processing without incurring massive hardware costs — which meant figuring out some way to process huge amounts of information quickly without having to copy it, bring it into a data warehouse, analyze it, report results, and so on.
In 2004, two researchers at Google released a now-famous paper called “MapReduce: Simplified Data Processing on Large Clusters.” (It actually isn’t that hard to read.) MapReduce was designed as a conceptual framework (not a Google product) to allow people to:
- Look at unstructured data all over the Internet
- Not be afraid of massive scales in terabytes and petabytes
- Use parallel processing to speed things up
- Run on commodity machines
- Handle breakdowns (which don’t happen much in data warehouses, but happen all the time online, particularly since we’re using those “commodity machines”)
Whew. What’s interesting is that Google’s computational problems weren’t all that difficult. The questions it needed to answer were things like, “How many other web pages link back to this URL?” or “How many times does the phrase ‘cat video’ appear in this blog post?” The problem wasn’t super-fancy questions; it was scale, structure, time . . . basically, the mess we see in our digital world.
Google’s paper, by Jeffrey Dean and Sanjay Ghemawat, is rather eloquent here:
“The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with those issues.”
MapReduce was created to “deal with those issues” by formalizing and hiding all those “large amounts of complex code” so that the programmer could focus on the (often simple) question they were trying to answer. It is plumbing, behind-the-scenes, a framework rather than a language or algorithm or product or song. Rather than doing all the RDB things (copying data, structuring it, and only then performing analysis), MapReduce proposed doing the analysis in little batches, wherever the data already lives, and then combining the results. It skips the step where the data itself is moved around and moves only code and results.
How does it work? Well, the name is helpful. It’s divided into a “map” phase and a “reduce” phase.
- Map applies a function written by the coder to nearby data (wherever it is) and stores the output in a temporary file; this function is a kind of filter or sort, and its output is an interim summary in the form of key/value pairs (a word and its count, say)
- The framework groups these interim values by key and passes them to the Reduce function, so they don’t sit in any particular database; Reduce takes the interim values and merges them into an answer
An example? Say you have War and Peace and you want a list of all the unique words in it and a count of how many times each appears. Imagine each page of the book is sitting on a different website somewhere. Your Map function will stroll out to each site simultaneously and sort the words and count them, storing this output in a temporary file. The Reduce function then takes all the temporary files and merges them into an answer.
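Here is that word count as a minimal Python sketch. To be clear about the assumptions: the three “pages” are placeholder strings I made up (not Tolstoy), a thread pool on one machine stands in for all those separate websites, and the function names are mine. This is the shape of the idea, not Google's code or Hadoop's.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce
import re


def map_page(page_text):
    """Map step: count the words on one page and return an interim summary."""
    words = re.findall(r"[a-z']+", page_text.lower())
    return Counter(words)


def reduce_counts(total, partial):
    """Reduce step: fold one page's interim counts into the running total."""
    total.update(partial)
    return total


def word_count(pages):
    """Run the map step on every page at once, then merge the results."""
    with ThreadPoolExecutor() as pool:
        partial_counts = list(pool.map(map_page, pages))
    return reduce(reduce_counts, partial_counts, Counter())


# Placeholder pages standing in for the pages of the book.
PAGES = ["war and peace", "peace and quiet", "war war war"]

counts = word_count(PAGES)
print(counts["war"], counts["peace"], counts["quiet"])
```

Notice that only the small per-page summaries travel to the reduce step. The pages themselves never get copied into one big pile, which is the whole point.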
So the value added by the MapReduce framework is:
- Ability to do the counts remotely, rather than requiring you to copy and import all the data (distribution)
- Simultaneous counts on all the pages (parallel processing)
- A lot of (invisible) error correction and checking
- Sending back the answer in summary form
As you can sense here, everything in MapReduce’s world is designed to facilitate parallel processing in messy environments where machines break down and connections are lost. In other words, it’s not baroque or luxury, but it gets the job done — pronto!
Now, whether MapReduce itself was revolutionary, or even all that original, is a topic for another day. Google didn’t invent either map or reduce, but Google’s approach certainly made its own life easier and has caught on like you wouldn’t believe.
And now, finally . . . HADOOP!
We’re here. MapReduce itself isn’t a programming language or a technology. To run it, you still need operating systems, web servers, connections, and so on. In order to implement it in the real world, there was a need for some kind of end-to-end solution, preferably open source so it could develop into a standard that could be used and improved by anyone.
Hadoop itself was developed by two engineers, one of whom joined Yahoo, and was inspired directly by Google’s paper. (Hadoop is supposedly the name of one of the engineers’ kids’ stuffed animals, an elephant.) It was adopted by Apache and is today the leading open source big data handling platform. Its development has continued in the past seven years, and we’ll cover some of these developments another time.
For now, it’s enough to say that Hadoop remains true to the MapReduce vision, containing:
- Storage for unstructured data
- Nuts and bolts around loads and tasks to manage data processing
- Distributed processing for computations
In addition to speed, it has the benefit of being a flexible platform that is relatively easy (for a programmer) to pick up. A whole barnload of products and services have arisen on the Hadoop platform, and its various limitations are being papered over.
But combined with cloud services, Hadoop has opened up a new world of rapid processing of reams of unstructured data, and that’s what you need to know, marketers. For now.
So on this Thanksgiving Day in 2014, let us give thanks for the many gifts Hadoop, and MapReduce, have given us. Happy holidays!