
August 2020 Hadoop Distribution Apache Project Tracker

By Merv Adrian | August 19, 2020 | 2 Comments


Welcome to my co-author, Gartner analyst Sanjeev Mohan

It’s been an eventful 6 months since Merv published the last of these trackers. Despite what many pundits predicted, the Hadoop ecosystem is far from dead. Cloudera Data Platform (CDP) has begun to ship in bare metal, public cloud and private cloud versions. MapR is now HPE Ezmeral Data Fabric. Microsoft has decided to support its own Hadoop distribution in the cloud for HDInsight. As the data below shows, most of the key components are actively being updated.

It is true that one doesn’t hear about Hadoop as much, because:

  • It is no longer fashionable.
  • Its core components have achieved a level of maturity and stability that doesn’t need major revamps.

However, at Gartner, we encounter many clients running workloads at extreme scale using one of the flavors of Hadoop, both on premises and in the cloud. It has become the workhorse that quietly delivers in the background and doesn’t attract much attention. Increasingly, it doesn’t even use HDFS and instead runs directly on cloud service providers’ (CSPs’) object stores, and Cloudera has just rolled out Apache Ozone, an open source object store. There are other interesting enhancements we are eagerly awaiting, such as broader integration of Kubernetes, dueling machine learning libraries, the competition among optimization layers like Arrow and Presto, and more.

Recognizing that dynamism, we’ve included the current releases of both Cloudera legacy offerings, CDH and HDP, here. They are far more widely deployed than CDP and will be for some time, so currency and support will be top-of-mind issues. CDP is represented by Runtime, which is common across multiple offerings in Cloudera’s new product architecture. And Amazon’s EMR, the first commercial Hadoop offering, is represented by two releases – Amazon EMR 5.31 and 6, which will have Hadoop 3.X and Spark 3.X series components. The community continues to include both those who must have the newest shiny objects and more conservative users. Amazon is maintaining currency on both and tells us “For now, we suggest using Amazon EMR 5.3. We expect Amazon EMR 6.1 to launch in September, at which point we suggest using Amazon EMR 6.1.”

Finally, note that Cloudera’s packaging means some newer projects it supports, like Apache Flink or Apache NiFi, don’t show up here because you don’t get them in Runtime, only in specific use case offerings for operational data or data in motion. We dealt with a similar question when at first we did not find Apache Knox in Google Cloud Dataproc. It’s listed elsewhere, as an optional component included in Component Gateway, which is available at no additional charge. So recognize that those Apache projects have somewhat broader support in the market than they may appear to have in this format. But even looking just at this set, component activity tells a story of a dynamic community that continues to birth new projects and incorporate them, even as some older ones fade. More on that in our next post.

As always, comments are invited. This is a huge, sprawling community and we can’t keep track of everything. Please send corrections, questions, experiences and opinions. Always happy to hear from you.

 


2 Comments

  • Cloudera has Apache Flink 1.10 support
    https://docs.cloudera.com/csa/1.2.0/release-notes/topics/csa-what-new.html

    No Apache NiFi listed?
    Apache Airflow? Apache Arrow?
    Apache Zeppelin? Apache Hue?

  • Merv Adrian says:

    I’ll say more in a subsequent post but quickly: We used the posted lists of components from all vendors. And we cut the chart off at 2 listings. What you see is what we saw.
    Zeppelin is on there; so is Hue. The others you mentioned were not listed by 2 or more vendors.