Gartner Blog Network


August 2020 Hadoop Distribution Apache Project Tracker

by Merv Adrian  |  August 19, 2020  |  2 Comments

Welcome to my co-author, Gartner analyst Sanjeev Mohan

It’s been an eventful 6 months since Merv published the last of these trackers. Contrary to what many pundits predicted, the Hadoop ecosystem is far from dead. Cloudera Data Platform (CDP) has begun to ship in bare metal, public cloud and private cloud versions. MapR is now HPE Ezmeral Data Fabric. Microsoft has decided to support its own Hadoop distribution in the cloud for HDInsight. As the data below shows, most of the key components are actively being updated.

It is true that one doesn’t hear about Hadoop as much, because:

  • It is no longer fashionable
  • Its core components have achieved a level of maturity and stability that doesn’t need major revamps.

However, at Gartner, we encounter many clients running workloads at extreme scale using one of the flavors of Hadoop – on premises and in the cloud. It has become the workhorse that quietly delivers in the background and doesn’t attract much attention. Increasingly, it doesn’t even use HDFS and runs directly on cloud service providers’ (CSPs’) object stores – and Cloudera has just rolled out Apache Ozone, an open source object store. And there are other interesting new enhancements we are eagerly awaiting, such as broader integration of Kubernetes, dueling machine learning libraries, the competition among optimization layers like Arrow and Presto, and more.

Recognizing that dynamism, we’ve included the current releases of both Cloudera legacy offerings, CDH and HDP, here. They are far more widely deployed than CDP and will be for some time, so their currency and support will be top-of-mind issues. CDP is represented by Runtime, which is common across multiple offerings in Cloudera’s new product architecture. And Amazon’s EMR, the first commercial Hadoop offering, is represented by two releases – Amazon EMR 5.31 and 6, which will have Hadoop 3.X and Spark 3.X series components. The community continues to include both those who have to have the newest shiny objects and more conservative users. Amazon is maintaining currency on both and tells us, “For now, we suggest using Amazon EMR 5.3. We expect Amazon EMR 6.1 to launch in September, at which point we suggest using Amazon EMR 6.1.”

Finally, note that Cloudera’s packaging means some newer projects it supports, like Apache Flink or Apache NiFi, don’t show up here because you don’t get them in Runtime, only in specific use case offerings for operational data or data in motion. We dealt with a similar question when at first we did not find Apache Knox in Google Cloud Dataproc. It’s listed elsewhere, as an optional component included in Component Gateway, which is available at no additional charge. So recognize that those Apache projects have somewhat broader support in the market than they may appear to have here in this format. But even looking just at this set, component activity tells a story of a dynamic community that continues to birth new projects and incorporate them, even as some older ones fade. More on that in our next post.

As always, comments are invited. This is a huge, sprawling community and we can’t keep track of everything. Please send corrections, questions, experiences and opinions. Always happy to hear from you.

 



Merv Adrian
Research VP
9 years with Gartner
40 years in IT industry

Merv Adrian is an analyst following database and adjacent technologies as extreme data transforms assumptions about what to persist as well as when, where and how. He also watches the way the software/hardware boundary…


Thoughts on August 2020 Hadoop Distribution Apache Project Tracker


  1. Cloudera has Apache Flink 1.10 support
    https://docs.cloudera.com/csa/1.2.0/release-notes/topics/csa-what-new.html

    No Apache NiFi listed?
    Apache Airflow? Apache Arrow?
    Apache Zeppelin? Apache Hue?

  2. Merv Adrian says:

    I’ll say more in a subsequent post but quickly: We used the posted lists of components from all vendors. And we cut the chart off at 2 listings. What you see is what we saw.
    Zeppelin is on there; so is Hue. The others you mentioned were not listed by 2 or more vendors.
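
The cutoff rule described in the reply above (a project appears only if two or more vendors list it) can be sketched roughly as follows. This is a minimal illustration, not the authors' actual process: the function name, the vendor labels, and the component lists are all hypothetical placeholders rather than data from the tracker.

```python
from collections import Counter


def projects_listed_by_at_least(vendor_components, min_vendors=2):
    """Return projects listed by at least `min_vendors` vendors, sorted by name.

    `vendor_components` maps a vendor label to the list of projects that
    vendor publishes for its distribution (hypothetical structure).
    """
    counts = Counter()
    for components in vendor_components.values():
        # Count each project once per vendor, even if a list repeats it.
        counts.update(set(components))
    return sorted(p for p, n in counts.items() if n >= min_vendors)


# Placeholder input: made-up vendors and illustrative lists only.
example = {
    "vendor_a": ["hadoop", "hive", "spark", "flink"],
    "vendor_b": ["hadoop", "hive", "spark", "zeppelin"],
    "vendor_c": ["hadoop", "spark", "zeppelin", "airflow"],
}

print(projects_listed_by_at_least(example))
```

In this made-up input, a project listed by only one vendor falls below the cutoff and drops out, which mirrors why some projects raised in the comment above do not appear in the chart.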



