Gartner Blog Network


January 2020 Hadoop Apache Project Tracker

by Merv Adrian  |  January 14, 2020

Until 2018, I published comparisons of the supported versions of components in the multiple available Hadoop distributions. Those older posts showed that the ebb and flow of new additions to the typical stack supported by “most” distributors had slowed to a halt after everyone (mostly) added Kafka. The relative version currency was fairly stable, with Hortonworks typically first to market with many of the newest Apache versions and Cloudera often close behind, and leading on some projects where it was more dominant on the project committee.

But as mergers, acquisitions, and the rise of the cloud-platform-as-Hadoop-provider dynamic played out over the past 2 years, I got away from the regular tracking cycle. This seems like a good time to revisit the question “Who supports what?”, since so many people seem to think “Hadoop” is HDFS and therefore “going away, because cloud object stores.” The players I include in this visit are AWS, Cloudera, Google and HPE. All of them support 9 pieces: Apache HDFS, MapReduce, YARN, Hive, Pig, Spark, Sqoop, Tez and ZooKeeper.

All but Google also support Apache HBase, Mahout and Oozie, as well as Hue. There is a nuance here: Google Cloud Dataproc allows you to run initialization actions to add components (indicated in the table with “IA”) and offers scripts for dozens of installable components, but cautions that “the initialization actions provided in this repository are provided without support and you use them at your own risk.” The same caveat applies to the various projects themselves. Similar mechanisms exist on AWS and, of course, on Microsoft Azure, whose HDInsight begins with the Cloudera distribution, adds its own pieces and also permits you to add yours. Zeppelin is listed as “soon” for Cloudera, which would move it into the “3 supporters” category as well, but for now it appears below.
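
For readers who want to keep their own tracker, the two tiers described so far can be tallied with a few lines of Python. This is only a rough sketch: the function and variable names are hypothetical, the data is limited to what the paragraphs above state, and components that Google offers only through unsupported initialization actions are counted as not supported by Google.

```python
from collections import defaultdict

# The four players covered in this post.
VENDORS = ("AWS", "Cloudera", "Google", "HPE")

# Components the post says all four vendors support.
ALL_FOUR = ["HDFS", "MapReduce", "YARN", "Hive", "Pig",
            "Spark", "Sqoop", "Tez", "ZooKeeper"]

# Components the post says all but Google support.
# Assumption: "IA" (initialization action) availability is not counted as support.
ALL_BUT_GOOGLE = ["HBase", "Mahout", "Oozie", "Hue"]


def build_support_matrix():
    """Map each component to the set of vendors that support it."""
    matrix = {}
    for component in ALL_FOUR:
        matrix[component] = set(VENDORS)
    for component in ALL_BUT_GOOGLE:
        matrix[component] = set(VENDORS) - {"Google"}
    return matrix


def group_by_supporter_count(matrix):
    """Group component names by how many vendors support them."""
    tiers = defaultdict(list)
    for component, vendors in matrix.items():
        tiers[len(vendors)].append(component)
    return {count: sorted(names) for count, names in tiers.items()}


if __name__ == "__main__":
    tiers = group_by_supporter_count(build_support_matrix())
    for count in sorted(tiers, reverse=True):
        print(f"{count} vendors: {', '.join(tiers[count])}")
```

Extending the sketch to the lower tiers would require the per-vendor detail from the table, which the text alone does not give.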

Apache Flume, Impala, Kafka, Phoenix, Presto, Sentry, Storm and Zeppelin are supported by two vendors apiece. Note that some of these are used infrequently and some will be deprecated soon, but continue to get support because many users have them in their stacks.

There are many additional pieces supported by only one of these vendors: Apache Accumulo, Ambari, Atlas, Avro, Crunch, Drill, Druid, Flink, Knox, Kudu, Livy, Lucene, Myriad, NiFi, Ozone, Parquet, Ranger, Solr, TensorFlow and others. Many more are not directly supported by any of them; we’ll save those for the next post. Meanwhile, this is a blog post, not the result of a lengthy review process, and it is likely to have a few things that need updating or correcting. Please let me know what you spot.
