In February 2012, Gartner published How to Choose The Right Apache Hadoop Distribution (available to clients). At the time, the leading distributors were Cloudera, EMC (now Pivotal), Hortonworks (pre-GA), IBM, and MapR. These players all supported six Apache projects: HDFS, MapReduce, Pig, Hive, HBase, and Zookeeper. Things have changed.
[updated June 29] We included Datastax (a distributor of Apache Cassandra) then, but they did not, and still don’t, consider themselves part of the Hadoop ecosystem. And they are not alone in having a reductive view of the answer to the question What Is Hadoop? Doug Cutting, pioneer in creating it and Chief Architect at Cloudera and former president of the Apache Software Foundation, considers the Hadoop Project to be HDFS, MapReduce and some common utilities. He made that point clear during a panel of luminaries my colleague Nick Heudecker conducted recently – the video is linked to Nick’s blog here. Everything else is “related projects.” Arun Murthy of Hortonworks, who has driven the creation of YARN, prefers to say that HDFS and YARN are “kernel” now, likening the description to the way most of us think of Linux. The Apache page continues to use the older description, including HDFS, MapReduce and YARN. (June 29, 2014)
To users, and especially buyers, the definition is more expansive. Hadoop is what they use to compose a useful stack of software to execute a business process of some sort. And distributors agree: in a little over two years, the set of projects included in all commercial distributions has now reached fifteen – two and a half times as many in just over two years. The list now includes Accumulo, Avro, Cascading, Flume, Mahout, Oozie, Spark, Sqoop, and YARN.
Others are likely to join this stack long before the next two years are up: the candidates include Falcon, Knox, Giraph, Hue, Lucene, Storm, Tez, and others. Hadoop has moved from a coarse-grained blunt instrument for largely ETL-style workloads to an expanding stack for virtually any IT task big data professionals will want to undertake. What is Hadoop now? It’s a candidate to be the alternative universe for data processing, with over 20 components that span a wide array of functions. More money continues to flow into the ecosystems, more companies form, more programmers take up the challenges, and the big players are scrambling to get aboard the train.
What is Hadoop?
It’s what’s next.