Updated August 11, 2015
This perennial question resurfaced recently in a thoughtful blog post by Andreas Neumann, Chief Architect of Cask, called “What is Hadoop, anyway?” Ultimately, after a careful deconstruction of the terms in the question, Andreas concludes:
“Does it really matter to agree on the answer to that question? In the end, everybody who builds an application or solution on Hadoop must pick the technologies that are right for the use case.”
We’ve agreed from the beginning – that is the only answer that really matters. Still, the question continues to come up for end users of the stack and for vendors like Cask (it helps them think about what to support in their application development offering, Cask Data App Platform (CDAP)).
Analysts too: I’ve discussed it several times, including a post a year ago called What Is Hadoop….Now? tracking the path from 6 commonly supported projects in 2012 to 15 in June 2014, across a set of distributors that included Cloudera, Hortonworks, MapR and IBM.
This year, the expansion process has continued – and the definition does matter. Why? Because it shows how differentiation and positioning are shaping the evolution of the commercially supported stack – which will help mainstream buyers decide what direction to take in making choices. Commercially supported open source software is now chosen for production applications. And integration, cross-porting, backporting, and supporting an ever-increasing stack of projects that they do not “own” or exclusively develop is a cost for distributors – who are not charging more as they add more projects to the stack.
The problem is, the distribution vendors generally do a terrible job of publicly documenting what they support. You will find that you have to dig to find answers to the obvious question “If I pay for a support subscription, what will be supported?” At the bottom of this post is a table of projects and the vendors who support them, based on sources that are listed below it. “Support” in this analysis means that a paid subscription explicitly includes support for the named project. A subtlety – sometimes installing a project on Amazon is a separate effort you must undertake yourself, but Amazon still supports it. Those projects are indicated in the chart with asterisks.
[added August 11] Another subtlety: some projects are subsumed inside others. Storage formats like Parquet are used by other projects, such as Hive or Impala. Hortonworks is not listed as supporting Parquet – they tend to recommend Apache ORC (which I will add in my next revision) for columnar storage within the stack. Their answer to me was nuanced and helps explain how much goes into the word “support”: they help customers who are using it, but they
“don’t actively build, distribute and support Parquet like we do other components in HDP (ex. patches, updates, maintenance releases, hot fixes, etc.)”
It’s not easy being a distributor. There are a lot of things to do with every project after it gets to you from the Apache committee. So throughout this series of posts, it’s important to understand that those opting not to support one project or another are making choices – based on their customers’ needs and their own resources – about what they can do effectively, and what they want to recommend their customers use for best results, in general and with the components of their distribution. [end added text]
First, some disclaimers. This is still a work in progress but largely complete, based on multiple vendor conversations and/or their web documentation. I’ve removed some language from earlier versions of this post that offered expectations about likely changes. Where I have the data, I’ve listed the most recently released project version supported. Supporting prior versions too is important, since many customers are running multiple versions in test, dev and production, but that is not detailed here.
So: what is Hadoop? We’ve used the definition “Apache Hadoop is a set of open-source software projects that provide a framework for using massive amounts of data across a distributed network.” The Apache Hadoop web site still names only Hadoop Common, Hadoop Distributed File System (HDFS™), Hadoop YARN and Hadoop MapReduce. Nobody really pays much attention to Common, though it represents a great deal of work from some dedicated engineers – so call that 3 projects.
In the table below, you can see who supports what. The other projects supported by all the vendors are HBase, Hive, Pig, Spark, and ZooKeeper – which, added to the 3 core projects, makes a total of 8 projects supported by all.
(There are many pieces to Spark, including SQL, Streaming, and ML and graph libraries. Most of those are not yet supported. Ask your vendor. Also note that the Open Data Platform (ODP) adds Ambari to its “core” list, making 4 projects, but only Hortonworks and Pivotal actually list it. IBM’s published project list at http://ibm.co/1IHCNWo doesn’t name it, though they are in ODP too.)
Flume, Oozie, Parquet and Sqoop are each supported by 5 distributors. That gets us to 12 projects.
Hue, Mahout and Solr have 4 supporters. And now we’re up to 15 projects.
In the “3 supporters” category there are 7 – Accumulo, Ambari, Cascading, Ganglia, Impala, Knox, and Tez – and we’re up to 22 projects. (Note: IBM has at least two sites listing projects. The newest reference I was given has updated to include Ambari.)
In “2 supporters” we find 8 – Avro, DataFu, Kafka, Nagios, Ranger, Sentry, Slider, Storm – so today, no fewer than 30 projects are supported in at least two distributions.
Another 9 are named by only one distributor – Atlas, Crunch, Drill, Falcon, Kite, LLAMA, Lucene, Phoenix and Presto – for a total of 39 projects.
Nine more Apache projects (some Incubating) aren’t supported by any distributors yet and are not on the table at all but are being talked about, demonstrated and otherwise evangelized: Apache Calcite, Apache Flink, Apache Geode, Apache Giraph, Apache Ignite, Apache Kylin, Apache Myriad, Apache Samza and Apache Zeppelin. This brings our list to 48 Hadoop-related projects. Is it any wonder people ask: What is Hadoop?
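For readers who want to check the arithmetic, the running count above can be reproduced with a short Python sketch. The tier labels and the `TIERS` and `running_totals` names are my own illustrative choices, not anything the distributors publish; the project names are copied from the lists in this post, with Hadoop Common left out as it is in the count above.

```python
from itertools import accumulate

# Tier label -> projects, copied from the lists in this post.
# The labels are illustrative only; Hadoop Common is excluded, as in the text.
TIERS = {
    "all distributors": ["HDFS", "YARN", "MapReduce", "HBase", "Hive",
                         "Pig", "Spark", "ZooKeeper"],
    "5 supporters": ["Flume", "Oozie", "Parquet", "Sqoop"],
    "4 supporters": ["Hue", "Mahout", "Solr"],
    "3 supporters": ["Accumulo", "Ambari", "Cascading", "Ganglia",
                     "Impala", "Knox", "Tez"],
    "2 supporters": ["Avro", "DataFu", "Kafka", "Nagios", "Ranger",
                     "Sentry", "Slider", "Storm"],
    "1 supporter": ["Atlas", "Crunch", "Drill", "Falcon", "Kite",
                    "LLAMA", "Lucene", "Phoenix", "Presto"],
    "no distributor yet": ["Calcite", "Flink", "Geode", "Giraph", "Ignite",
                           "Kylin", "Myriad", "Samza", "Zeppelin"],
}


def running_totals(tiers):
    """Return the cumulative project count after each tier, in order."""
    return list(accumulate(len(projects) for projects in tiers.values()))


totals = running_totals(TIERS)
for label, total in zip(TIERS, totals):
    print(f"{label:>20}: {len(TIERS[label]):2d} projects, running total {total}")
```

Keeping the tiers as plain lists makes it easy to move a project between tiers when a distributor adds or drops support, and the totals update on their own.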
So, let the conversation begin. As I said, I expect to edit this post for a few days as comments come in. I’m eager to hear your thoughts.
Hadoop Project Support By Distributors
Cloudera http://bit.ly/1NtIfRe and emails
Hortonworks http://bit.ly/1NzKS3M and emails
IBM http://ibm.co/1IHCNWo and http://www-01.ibm.com/software/data/infosphere/hadoop/products.html
MapR http://bit.ly/1LT5e71 and emails