Gartner Blog Network


Now, What Is Hadoop? And What’s Supported?

by Merv Adrian  |  July 2, 2015

Updated August 11, 2015

This perennial question resurfaced recently in a thoughtful blog post by Andreas Neumann, Chief Architect of Cask, called “What is Hadoop, anyway?” Ultimately, after a careful deconstruction of the terms in the question, Andreas concludes with:

“Does it really matter to agree on the answer to that question? In the end, everybody who builds an application or solution on Hadoop must pick the technologies that are right for the use case.”

We’ve agreed from the beginning – that is the only answer that really matters. Still, the question continues to come up for end users of the stack and for vendors like Cask (it helps them think about what to support in their application development offering, Cask Data App Platform (CDAP)).

Analysts too: I’ve discussed it several times, including a post a year ago called “What Is Hadoop….Now?” tracking the path from 6 commonly supported projects in 2012 to 15 in June 2014, across a set of distributors that included Cloudera, Hortonworks, MapR and IBM.

This year, the expansion process has continued – and the definition does matter. Why? Because it shows how differentiation and positioning are shaping the evolution of the commercially supported stack – which will help mainstream buyers decide what direction to take in making choices. Commercially supported open source software is now chosen for production applications. And integration, cross-porting, backporting, and supporting an ever-increasing stack of projects that they do not “own” or exclusively develop is a cost for distributors – who are not charging more as they add more projects to the stack.

The problem is, the distribution vendors generally do a terrible job of publicly documenting what they support. You have to dig to find answers to the obvious question: “If I pay for a support subscription, what will be supported?” At the bottom of this post is a table of projects and the vendors who support them, based on the sources listed below it. “Support” in this analysis means that a paid subscription explicitly includes support for the named project. A subtlety – sometimes installing a project on Amazon is a separate effort you must undertake yourself, but Amazon still supports it. Those projects are indicated in the chart with asterisks.

[added August 11] Another subtlety: some projects are subsumed inside others. Storage formats like Parquet are used by other projects – such as Hive, or Impala. Hortonworks is not listed as supporting Parquet – they tend to recommend Apache ORC (which I will add in my next revision) for columnar storage within the stack. Their answer to me was nuanced and helps explain how much goes into the word “support”: they help customers who are using it, but they

don’t actively build, distribute and support Parquet like we do other components in HDP (ex. patches, updates, maintenance releases, hot fixes, etc.)

It’s not easy being a distributor. There are a lot of things for you to do with every project after it gets to you from the Apache committee. So throughout this series of posts, it’s important to understand that those opting not to support one project or another are making choices – based on their customers’ needs and their own resources – about what they can do effectively, and what they want to recommend their customers use for best results, in general and with the components of their distribution. [end added text]

First, some disclaimers. This is still a work in progress but largely complete, based on multiple vendor conversations and/or their web documentation. I’ve removed some language from earlier versions of this post that offered expectations about likely changes. Where I have the data, I’ve listed the most recently released project version supported. Supporting prior versions too is important, since many customers are running multiple versions in test, dev and production, but that is not detailed here.

So: what is Hadoop? We’ve used the definition “Apache Hadoop is a set of open-source software projects that provide a framework for using massive amounts of data across a distributed network.” The Apache Hadoop web site still names only Hadoop Common, Hadoop Distributed File System (HDFS™), Hadoop YARN and Hadoop MapReduce. Nobody really pays much attention to Common, though it represents a great deal of work from some dedicated engineers – so call that 3 projects.

In the table below, you can see who supports what. Other projects supported by all the vendors include HBase, Hive, Pig, Spark, and Zookeeper – for a total of 8 projects supported by all.

(There are many pieces to Spark, including SQL, Streaming, and ML and graph libraries. Most of those are not yet supported. Ask your vendor. Also note that the Open Data Platform (ODP) adds Ambari to its “core” list, making 4 projects, but only Hortonworks and Pivotal actually list it. IBM’s published project list at http://ibm.co/1IHCNWo doesn’t name it, though they are in ODP too.)

Flume, Oozie, Parquet and Sqoop are supported by 5. That gets us to 12 projects.

Hue, Mahout and Solr have 4 supporters. And now we’re up to 15 projects.

In the “3 supporters” category there are 7 – Accumulo, Ambari, Cascading, Ganglia, Impala, Knox, and Tez – and we’re up to 22 projects. (Note: IBM has at least two sites listing projects. The newest reference I was given has updated to include Ambari.)

In “2 supporters” we find 8 – Avro, DataFu, Kafka, Nagios, Ranger, Sentry, Slider, Storm – so today, no fewer than 30 projects are supported in at least two distributions.

Another 9 are named by one distributor – Atlas, Crunch, Drill, Falcon, Kite, LLAMA, Lucene, Phoenix and Presto – for a total of 39 projects.

Nine more Apache projects (some Incubating) aren’t supported by any distributors yet and are not on the table at all but are being talked about, demonstrated and otherwise evangelized: Apache Calcite, Apache Flink, Apache Geode, Apache Giraph, Apache Ignite, Apache Kylin, Apache Myriad, Apache Samza and Apache Zeppelin. This brings our list to 48 Hadoop-related projects. Is it any wonder people ask: What is Hadoop?
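For readers who want to check the running count, here is a minimal sketch in Python that reproduces the arithmetic from the lists named in this post. The tier labels and the `running_totals` helper are illustrative names of my own, not anything a vendor publishes, and Hadoop Common is left out of the core tier to match the "call that 3 projects" convention used here.

```python
# Running tally of the project counts described in this post.
# Tier labels are illustrative; project names come from the text.
tiers = [
    ("core Apache Hadoop (Common not counted)", ["HDFS", "YARN", "MapReduce"]),
    ("also supported by all", ["HBase", "Hive", "Pig", "Spark", "Zookeeper"]),
    ("5 supporters", ["Flume", "Oozie", "Parquet", "Sqoop"]),
    ("4 supporters", ["Hue", "Mahout", "Solr"]),
    ("3 supporters", ["Accumulo", "Ambari", "Cascading", "Ganglia",
                      "Impala", "Knox", "Tez"]),
    ("2 supporters", ["Avro", "DataFu", "Kafka", "Nagios", "Ranger",
                      "Sentry", "Slider", "Storm"]),
    ("1 supporter", ["Atlas", "Crunch", "Drill", "Falcon", "Kite",
                     "LLAMA", "Lucene", "Phoenix", "Presto"]),
    ("no distributor yet", ["Calcite", "Flink", "Geode", "Giraph", "Ignite",
                            "Kylin", "Myriad", "Samza", "Zeppelin"]),
]


def running_totals(tiers):
    """Return (label, tier size, cumulative count) for each tier in order."""
    totals = []
    count = 0
    for label, projects in tiers:
        count += len(projects)
        totals.append((label, len(projects), count))
    return totals


for label, size, total in running_totals(tiers):
    print(f"{label}: {size} projects, running total {total}")
```

The cumulative counts should match the figures quoted in the paragraphs above, ending at 48. Keeping the tiers as plain lists makes it easy to move a project between tiers when a distributor changes what it supports.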

So, let the conversation begin. As I said, I expect to edit this post for a few days as comments come in. I’m eager to hear your thoughts.

Hadoop Project Support By Distributors

[Table image: projects and the distributors that support them]

Sources:

Amazon http://amzn.to/1U9R5In
Cloudera http://bit.ly/1NtIfRe and emails
Hortonworks http://bit.ly/1NzKS3M and emails
IBM http://ibm.co/1IHCNWo and http://www-01.ibm.com/software/data/infosphere/hadoop/products.html
MapR http://bit.ly/1LT5e71 and emails
Pivotal email


Merv Adrian
Research VP
5 years with Gartner
38 years in IT industry

Merv Adrian is an analyst following database and adjacent technologies as extreme data transforms assumptions about what to persist as well as when, where and how. He also watches the way the software/hardware boundary…


Thoughts on Now, What Is Hadoop? And What’s Supported?


  1. Bruno Aziza says:

    Nice piece Merv!

  2. Dan says:

    Very interesting. And it would also be fun to compare in a year’s time in a kind of pop chart: who is a new entrant, who moved up, did anyone get dropped, etc. Great stuff.

  3. Nitin says:

    Merv, Hortonworks and Amazon have ‘yes’ marked for technologies, but others have versions – any specific reason?

    • Merv Adrian says:

      Nitin, I’ve since added the version numbers for Hortonworks, and hope to hear from Amazon with specific info so I can do the same for them.

  4. Terry says:

    Hortonworks ships Hue – do they not support it?

  5. Chris says:

    Hortonworks ships Ganglia/Nagios via Ambari. They’re deprecated as of Ambari 2.0 in favor of Ambari Metrics (running on its own instance of HBase), but they’re still shipped and supported.

    Amazon doesn’t directly support ZooKeeper as a standalone service, but it’s executed as part of HBase.

  6. lenny says:

    You could add the versions of the platforms if available (e.g. HDP 2.2).

  7. […] 1. Now, What is Hadoop? Gartner.com- In the on-going discussion of “what is Hadoop?” Merv Adrian provides a comparison of which providers support which project in the Hadoop ecosystem. Read More […]

  8. […] to stop the Hadoop train. Despite seemingly “anemic interest,” complex setup, and a crazy quilt of different projects that ostensibly comprise the unified “thing” that is Hadoop, demand for Hadoop talent […]





Comments or opinions expressed on this blog are those of the individual contributors only, and do not necessarily represent the views of Gartner, Inc. or its management. Readers may copy and redistribute blog postings on other blogs, or otherwise for private, non-commercial or journalistic purposes, with attribution to Gartner. This content may not be used for any other purposes in any other formats or media. The content on this blog is provided on an "as-is" basis. Gartner shall not be liable for any damages whatsoever arising out of the content or use of this blog.