Gartner Blog Network


February 2020 Hadoop Distribution Apache Project Tracker

by Merv Adrian  |  January 14, 2020

Updated 2/15/20 – thank you to HPE for MapR details

Until 2018, I published comparisons of the supported versions of components in the multiple available Hadoop distributions. That older series of posts, beginning with the one linked here, showed that the ebb and flow of new additions to the typical stack supported by “most” distributors slowed to a halt after everyone (mostly) added Kafka. Through 2018 and 2019 I did not update the series. Relative version currency was fairly stable: Hortonworks was typically first to market with many of the newest Apache versions, and Cloudera was often close behind, leading on some projects where it was more dominant on the project committee. HPE MapR, then and now, went its own way, with a proprietary but API-compatible file system (MapR-FS today supports HDFS 2.7.0+ APIs), HBase variant (MapR-DB today supports HBase 1.1.13 APIs) and stream processing platform (MapR-Streams today supports Kafka 1.1 APIs).

So as mergers, acquisitions, and the rise of the cloud-platform-as-Hadoop-provider dynamic played out over the past 2 years, I got away from the regular tracking cycle. This seems like a good time to revisit the question “Who supports what?”, since so many people seem to think “Hadoop is HDFS and therefore going away, because cloud object stores.” The players I include in this visit are AWS, Cloudera, Google and HPE MapR. By request, I have added the current release of Cloudera HDP (the former Hortonworks distribution). All of them support 9 pieces: Apache HDFS, MapReduce, YARN, Hive, Pig, Spark, Sqoop, Tez and ZooKeeper. MapR’s support for POSIX files as well as the HDFS APIs remains a key differentiator.

All but Google also support Apache HBase, Mahout and Oozie, as well as Hue. There is a nuance here: Google Cloud Dataproc allows you to perform initialization actions to add components (indicated in the table with “IA”) and offers scripts for dozens of installable components, but cautions that “the initialization actions provided in this repository are provided without support and you use them at your own risk.” The same applies to the various projects themselves, and AWS offers similar mechanisms for projects it does not directly support. Microsoft Azure, whose HDInsight begins with the Cloudera distribution, similarly adds its own pieces and also permits you to add yours. Zeppelin is listed as “soon” for Cloudera, which would move it into the “all supporters” category as well, but for now it appears below.
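
For readers who want to track this themselves, the two tiers described so far can be held in a small support matrix. This is only a sketch: the vendor and component names come from this post, while the function names and the data layout are my own assumptions, and it deliberately omits the two-vendor and one-vendor tiers because the text does not say which vendor supports which project.

```python
# Vendors covered in this post (Cloudera HDP is the former Hortonworks distribution).
VENDORS = {"AWS", "Cloudera", "Cloudera HDP", "Google", "HPE MapR"}

# The 9 pieces the post says every vendor supports.
CORE = ["HDFS", "MapReduce", "YARN", "Hive", "Pig", "Spark", "Sqoop", "Tez", "ZooKeeper"]

# Pieces the post says everyone but Google supports directly.
ALL_BUT_GOOGLE = ["HBase", "Mahout", "Oozie", "Hue"]


def build_matrix():
    """Map each component to the set of vendors that directly support it."""
    matrix = {component: set(VENDORS) for component in CORE}
    for component in ALL_BUT_GOOGLE:
        matrix[component] = VENDORS - {"Google"}
    return matrix


def supported_by_all(matrix):
    """Components that every vendor in VENDORS supports."""
    return sorted(c for c, vendors in matrix.items() if vendors == VENDORS)


def missing_vendors(matrix, component):
    """Vendors that do not directly support the given component."""
    return sorted(VENDORS - matrix[component])


if __name__ == "__main__":
    matrix = build_matrix()
    print("Supported by all:", supported_by_all(matrix))
    print("HBase not directly supported by:", missing_vendors(matrix, "HBase"))
```

Keeping the data as component-to-vendor sets makes it easy to add a row when a vendor changes its support, and "not directly supported" here says nothing about whether a component can be added through an unsupported mechanism such as an initialization action.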

Apache Flume, Impala, Kafka, Phoenix, Presto, Sentry, and Storm are supported by two vendors apiece. Note that some of these are used infrequently and some will be deprecated soon, but continue to get support because many users have them in their stacks.

There are two distinct sets of APIs for Kafka with different version numbers. HPE MapR supports Kafka 1.1 APIs for the producer and consumer, and includes KSQL and KStreams version 4.1 for real-time analytics.

There are many additional pieces supported by only one of these vendors: Apache Accumulo, Ambari, Atlas, Avro, Crunch, Drill, Druid, Flink, Knox, Kudu, Livy, Lucene, Myriad, NiFi, Ozone, Parquet, Ranger, Solr, TensorFlow and others. Many more are not directly supported by any of them. We’ll save that for the next post, where we’ll talk about whether “the core stack” matters anymore. And I hope to talk about Kubernetes in an upcoming discussion.

And please, this is a blog post, not published research, and likely to have a few things that need updating or correcting – blogs are good for data that changes often. Please let me know what you spot.

Merv Adrian
Research VP
9 years with Gartner
40 years in IT industry

Merv Adrian is an analyst following database and adjacent technologies as extreme data transforms assumptions about what to persist as well as when, where and how. He also watches the way the software/hardware boundary…




Comments or opinions expressed on this blog are those of the individual contributors only, and do not necessarily represent the views of Gartner, Inc. or its management. Readers may copy and redistribute blog postings on other blogs, or otherwise for private, non-commercial or journalistic purposes, with attribution to Gartner. This content may not be used for any other purposes in any other formats or media. The content on this blog is provided on an "as-is" basis. Gartner shall not be liable for any damages whatsoever arising out of the content or use of this blog.