February 2020 Hadoop Distribution Apache Project Tracker

By Merv Adrian | January 14, 2020

Updated 2/15/20 – thank you to HPE for MapR details

Until 2018, I published comparisons of the supported versions of components in the multiple available Hadoop distributions. That older series of posts, beginning with the one linked here, showed that the ebb and flow of new additions to the typical stack supported by “most” distributors slowed to a halt after everyone (mostly) added Kafka. Through 2018 and 2019, I did not update the series. The relative version currency was fairly stable, with Hortonworks typically first to market with many of the newest Apache versions and Cloudera often close behind, leading on some projects where it was more dominant on the project committee. HPE MapR, then and now, went its own way, with a proprietary but API-compatible file system (MapR-FS today supports HDFS 2.7.0+ APIs), HBase variant (MapR-DB today supports HBase 1.1.13 APIs) and stream processing platform (MapR-Streams today supports Kafka 1.1 APIs).

So as mergers, acquisitions, and the rise of the cloud-platform-as-Hadoop-provider dynamic played out over the past 2 years, I got away from the regular tracking cycle. This seems like a good time to revisit the question “Who supports what?”, since so many people seem to think “Hadoop is HDFS and therefore going away, because cloud object stores.” The players I include in this visit are AWS, Cloudera, Google and HPE MapR. By request, I have added the current release of Cloudera HDP (the former Hortonworks distribution). All of them support 9 pieces: Apache HDFS, MapReduce, YARN, Hive, Pig, Spark, Sqoop, Tez and ZooKeeper. MapR’s support for POSIX files as well as the HDFS APIs remains a key differentiator.

All but Google also support Apache HBase, Mahout and Oozie, as well as Hue. There is a nuance here: Google Cloud Dataproc allows you to perform initialization actions to add components (indicated on the table with “IA”) and offers scripts for dozens of installable components, but cautions that “the initialization actions provided in this repository are provided without support and you use them at your own risk.” The same caution applies to the various projects themselves, and AWS offers similar mechanisms for projects it does not directly support. Microsoft Azure, whose HDInsight begins with the Cloudera HDP distribution, similarly adds its own pieces and also permits you to add yours. Zeppelin is listed as “soon” for Cloudera, which would move it into the “all supporters” category as well, but for now it appears below.
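For readers who want to play with the "who supports what" question themselves, here is a minimal sketch in Python. It encodes only the two statements made so far in this post (the 9 pieces everyone supports, and the 4 that all but Google support). The vendor labels, the function names and the choice not to count Dataproc initialization actions as support are my assumptions for illustration, not anything the vendors publish.

```python
# Support facts below come only from two statements in this post.
# Names and structure are illustrative assumptions, not a vendor API.

# The 9 pieces the post says every listed vendor supports.
CORE = {"HDFS", "MapReduce", "YARN", "Hive", "Pig",
        "Spark", "Sqoop", "Tez", "ZooKeeper"}

# Pieces the post says all vendors except Google support.
ALL_BUT_GOOGLE = {"HBase", "Mahout", "Oozie", "Hue"}

# Vendor labels are assumed shorthand for the columns discussed in the post.
VENDORS = ["AWS", "Cloudera", "Cloudera HDP", "Google", "HPE MapR"]


def build_support(vendors=VENDORS):
    """Return {vendor: set of supported components} from the facts above.

    Assumption: components added through Dataproc initialization
    actions ("IA") are not counted as supported.
    """
    support = {}
    for vendor in vendors:
        components = set(CORE)
        if vendor != "Google":
            components |= ALL_BUT_GOOGLE
        support[vendor] = components
    return support


def supported_by_all(support):
    """Components that appear in every vendor's set."""
    return set.intersection(*support.values())


def supporters(support, component):
    """Sorted list of vendors whose set contains the component."""
    return sorted(v for v, comps in support.items() if component in comps)


if __name__ == "__main__":
    matrix = build_support()
    print(sorted(supported_by_all(matrix)))
    print(supporters(matrix, "HBase"))
```

Adding the two-vendor and one-vendor projects would take per-vendor data from the table, which this sketch deliberately does not guess at.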

Apache Flume, Impala, Kafka, Phoenix, Presto, Sentry, and Storm are supported by two vendors apiece. Note that some of these are used infrequently and some will be deprecated soon, but continue to get support because many users have them in their stacks.

There are two distinct sets of APIs for Kafka with different version numbers. HPE MapR supports Kafka 1.1 APIs for the producer and consumer, and includes KSQL and KStreams version 4.1 for real-time analytics.

There are many additional pieces supported by only one of these vendors: Apache Accumulo, Ambari, Atlas, Avro, Crunch, Drill, Druid, Flink, Knox, Kudu, Livy, Lucene, Myriad, NiFi, Ozone, Parquet, Ranger, Solr, TensorFlow and others. Many more are not directly supported by any of them. We’ll save that for the next post, where we’ll talk about whether “the core stack” matters anymore. And I hope to talk about Kubernetes in an upcoming discussion.

And please, this is a blog post, not published research, and likely to have a few things that need updating or correcting – blogs are good for data that changes often. Please let me know what you spot.
