
Hadoop is (still) alive and kicking

By Sanjeev Mohan | August 19, 2020


It has been a long time since I blogged and so I joined forces with my colleague and friend Merv Adrian. Here is the first one of the series.

It’s been an eventful six months since Merv published the last of these trackers. Contrary to what many pundits predicted, the Hadoop ecosystem is far from dead. Cloudera Data Platform (CDP) has begun to ship in bare metal, public cloud and private cloud versions. MapR is now HPE Ezmeral Data Fabric. Microsoft has decided to support its own Hadoop distribution in the cloud for HDInsight. As the data below shows, most of the key components are actively being updated.

It is true that one doesn’t hear about Hadoop as much, because:

  • It is no longer fashionable
  • Its core components have achieved a level of maturity and stability that doesn’t need major revamps

However, at Gartner, we encounter clients running workloads at extreme scale using one of the flavors of Hadoop – on premises and in the cloud. It has become the workhorse that quietly delivers in the background and doesn’t attract much attention. Increasingly, it doesn’t even use HDFS and runs directly on CSPs’ object stores – and Cloudera has just rolled out Apache Ozone, an open source object store. And there are other interesting new enhancements we are eagerly awaiting such as broader integration of Kubernetes, dueling machine learning libraries, the competition among optimization layers like Arrow and Presto, and more.

Recognizing that dynamism, we’ve included the current releases of both Cloudera legacy offerings, CDH and HDP, since they are far more widely deployed than CDP and will be for some time, and currency and support will be top-of-mind issues. CDP is represented by Runtime, which is common across multiple offerings in Cloudera’s new product architecture. And Amazon’s EMR, the first commercial Hadoop offering, is represented by two releases – Amazon EMR 5.31 and 6, the latter of which will have Hadoop 3.X and Spark 3.X series components. The community continues to be split between those who have to have the newest shiny objects and more conservative users. Amazon is maintaining currency on both and tells us “For now, we suggest using Amazon EMR 5.3. We expect Amazon EMR 6.1 to launch in September, at which point we suggest using Amazon EMR 6.1.”

Finally, note that Cloudera’s packaging means some newer projects it supports, like Apache Flink or Apache NiFi, don’t show up here because you don’t get them in Runtime, only in specific use case offerings for operational data or data in motion. We dealt with a similar question when at first we did not find Apache Knox in Google Cloud Dataproc. It’s listed elsewhere, as an optional component included in Component Gateway, which is available at no additional charge. So recognize that those Apache projects have somewhat broader support in the market than they may appear to have in this format. But even looking just at this set, component activity tells a story of a dynamic community that continues to birth new projects and incorporate them, even as some older ones fade. More on that in our next post.

As always, comments are invited. This is a huge, sprawling community and we can’t keep track of everything. Please send corrections, questions, experiences and opinions. Always happy to hear from you.
