Updated August 11, 2015
This perennial question resurfaced recently in a thoughtful blog post by Andreas Neumann, Chief Architect of Cask, called What is Hadoop, anyway? Ultimately, after a careful deconstruction of the terms in the question, Andreas concludes with:
“Does it really matter to agree on the answer to that question? In the end, everybody who builds an application or solution on Hadoop must pick the technologies that are right for the use case.”
We’ve agreed from the beginning – that is the only answer that really matters. Still, the question continues to come up for end users of the stack and for vendors like Cask (it helps them think about what to support in their application development offering, Cask Data App Platform (CDAP)).
Analysts too: I’ve discussed it several times, including a post a year ago called What Is Hadoop….Now? tracking the path from 6 commonly supported projects in 2012 to 15 in June 2014, across a set of distributors that included Cloudera, Hortonworks, MapR and IBM.
This year, the expansion process has continued – and the definition does matter. Why? Because it shows how differentiation and positioning are shaping the evolution of the commercially supported stack – which will help mainstream buyers decide which direction to take. Commercially supported open source software is now chosen for production applications. And for distributors, integrating, cross-porting, backporting, and supporting an ever-increasing stack of projects that they do not “own” or exclusively develop is a cost – yet they are not charging more as they add more projects to the stack.
The problem is, the distribution vendors generally do a terrible job of publicly documenting what they support. You will find that you have to dig to find answers to the obvious question “If I pay for a support subscription, what will be supported?” At the bottom of this post is a table of projects and the vendors who support them, based on sources that are listed below it. “Support” in this analysis means that a paid subscription explicitly includes support for the named project. A subtlety – sometimes installing a project on Amazon is a separate effort you must undertake yourself, but Amazon still supports it. Those projects are indicated in the chart with asterisks.
[added August 11] Another subtlety: some projects are subsumed inside others. Storage formats like Parquet are used by other projects – such as Hive, or Impala. Hortonworks is not listed as supporting Parquet – they tend to recommend Apache ORC (which I will add in my next revision) for columnar storage within the stack. Their answer to me was nuanced and helps explain how much goes into the word “support”: they help customers who are using it, but they
don’t actively build, distribute and support Parquet like we do other components in HDP (ex. patches, updates, maintenance releases, hot fixes, etc.)
It’s not easy being a distributor. There are a lot of things for you to do with every project after it gets to you from the Apache committee. So throughout this series of posts, it’s important to understand that those opting not to support one project or another are making choices – based on their customers’ needs and their own resources – about what they can do effectively, and what they want to recommend their customers use for best results, in general and with the components of their distribution. [end added text]
First, some disclaimers. This is still a work in progress but largely complete, based on multiple vendor conversations and/or their web documentation. I’ve removed some language from earlier versions of this post that offered expectations about likely changes. Where I have the data, I’ve listed the most recently released project version supported. Supporting prior versions too is important, since many customers are running multiple versions across test, dev and production, but that is not detailed here.
So: what is Hadoop? We’ve used the definition “Apache Hadoop is a set of open-source software projects that provide a framework for using massive amounts of data across a distributed network.“ The Apache Hadoop web site still names only Hadoop Common, Hadoop Distributed File System (HDFS™), Hadoop YARN and Hadoop MapReduce. Nobody really pays much attention to Common, though it represents a great deal of work from some dedicated engineers – so call that 3 projects.
In the table below, you can see who supports what. Other projects supported by all the vendors include HBase, Hive, Pig, Spark, and Zookeeper – for a total of 8 projects supported by all.
(There are many pieces to Spark, including SQL, Streaming, and ML and graph libraries. Most of those are not yet supported. Ask your vendor. Also note that the Open Data Platform (ODP) adds Ambari to its “core” list, making 4 projects, but only Hortonworks and Pivotal actually list it. IBM’s published project list at http://ibm.co/1IHCNWo doesn’t name it, though they are in ODP too.)
Flume, Oozie, Parquet and Sqoop are supported by 5. That gets us to 12 projects.
Hue, Mahout and Solr have 4 supporters. And now we’re up to 15 projects.
In the “3 supporters” category there are 7 – Accumulo, Ambari, Cascading, Ganglia, Impala, Knox, and Tez – and we’re up to 22 projects. (Note: IBM has at least two sites listing projects. The newest reference I was given has updated to include Ambari.)
In “2 supporters” we find 8 – Avro, DataFu, Kafka, Nagios, Ranger, Sentry, Slider, Storm – so today, no fewer than 30 projects are supported in at least two distributions.
Another 9 are named by one distributor – Atlas, Crunch, Drill, Falcon, Kite, LLAMA, Lucene, Phoenix and Presto – for a total of 39 projects.
Nine more Apache projects (some Incubating) aren’t supported by any distributors yet and are not on the table at all but are being talked about, demonstrated and otherwise evangelized: Apache Calcite, Apache Flink, Apache Geode, Apache Giraph, Apache Ignite, Apache Kylin, Apache Myriad, Apache Samza and Apache Zeppelin. This brings our list to 48 Hadoop-related projects. Is it any wonder people ask: What is Hadoop?
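The running totals in the paragraphs above can be checked with a few lines of Python. This is only a sketch: the project names and groupings are copied from the text of this post, while the tier labels and the helper name running_totals are illustrative choices of mine, not anything the vendors publish.

```python
# Project names per tier are taken from the post; the labels are illustrative.
tiers = [
    ("Hadoop core (excluding Common)", ["HDFS", "YARN", "MapReduce"]),
    ("supported by all", ["HBase", "Hive", "Pig", "Spark", "ZooKeeper"]),
    ("5 supporters", ["Flume", "Oozie", "Parquet", "Sqoop"]),
    ("4 supporters", ["Hue", "Mahout", "Solr"]),
    ("3 supporters", ["Accumulo", "Ambari", "Cascading", "Ganglia",
                      "Impala", "Knox", "Tez"]),
    ("2 supporters", ["Avro", "DataFu", "Kafka", "Nagios", "Ranger",
                      "Sentry", "Slider", "Storm"]),
    ("1 supporter", ["Atlas", "Crunch", "Drill", "Falcon", "Kite",
                     "LLAMA", "Lucene", "Phoenix", "Presto"]),
    ("no distributor yet", ["Calcite", "Flink", "Geode", "Giraph", "Ignite",
                            "Kylin", "Myriad", "Samza", "Zeppelin"]),
]


def running_totals(tiers):
    """Return (label, tier size, cumulative count) for each tier in order."""
    totals = []
    count = 0
    for label, projects in tiers:
        count += len(projects)
        totals.append((label, len(projects), count))
    return totals


for label, size, total in running_totals(tiers):
    print(f"{label:32s} +{size} -> {total}")
```

The cumulative column should match the figures quoted in each paragraph, and it is easy to rerun when the table is revised.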
So, let the conversation begin. As I said, I expect to edit this post for a few days as comments come in. I’m eager to hear your thoughts.
Hadoop Project Support By Distributors
Sources:
Amazon http://amzn.to/1U9R5In
Cloudera http://bit.ly/1NtIfRe and emails
Hortonworks http://bit.ly/1NzKS3M and emails
IBM http://ibm.co/1IHCNWo and http://www-01.ibm.com/software/data/infosphere/hadoop/products.html
MapR http://bit.ly/1LT5e71 and emails
Pivotal email
The Gartner Blog Network provides an opportunity for Gartner analysts to test ideas and move research forward. Because the content posted by Gartner analysts on this site does not undergo our standard editorial review, all comments or opinions expressed hereunder are those of the individual contributors and do not represent the views of Gartner, Inc. or its management.
13 Comments
Nice piece Merv!
Very interesting. And it would also be fun to compare in a year’s time in a kind of pop chart: who is a new entrant, who moved up, did anyone get dropped, etc. Great stuff.
Merv, Hortonworks and Amazon have ‘yes’ marked for technologies, but others have versions – any specific reason?
Nitin, I’ve since added the version numbers for Hortonworks, and hope to hear from Amazon with specific info so I can do the same for them.
Hortonworks ships Hue – do they not support it?
I believe it is Cloudera who ship Hue by default, not Hortonworks.
https://github.com/hortonworks/hue-release
The subtlety here is in the word “support,” of course. Many things are shipped – or available – but support is something I am only acknowledging if the vendor has confirmed it.
FYI Hue’s upstream website is gethue.com and code at https://github.com/cloudera/hue.
The major Hadoop distributors ship Hue 2 or 3.
Things that make you go Hmmmmm…. Horton tells their prospects they support Hue, but won’t say it to analysts, apparently.
Hortonworks ships Ganglia/Nagios via Ambari. They’re deprecated as of Ambari 2.0 in favor of Ambari Metrics (running on its own instance of HBase), but they’re still shipped and supported.
Amazon doesn’t directly support ZooKeeper as a standalone service, but it’s executed as part of HBase.
You could add the versions of the platforms if available (e.g. HDP 2.2).
FYI
Hue 2.3 is still in HDP:
http://docs.hortonworks.com/HDPDocuments/HDP2/HDP-2.1.7/bk_installing_manually_book/content/rpm-chap-hue.html