
Evolving “Hadoop” Stack Manifests a Community in Motion

By Merv Adrian | September 19, 2020 | 9 Comments


Apache “Hadoop” component activity tells a story of a dynamic community that continues to birth new projects and incorporate them, even as some older ones fade. Of course, that extends far beyond the handful we watch as observers examining the elephant, with all the challenges that entails. But as we observed in the last Tracker post, new bits are popping up all over.

The projects supported by only one of the vendors are not listed in the chart: they include some fairly well-known ones like Accumulo, Ambari, Arrow, Atlas, Beam, Drill, Mahout, Ozone, Storm, and even Parquet. There is a story for every one but I won’t tell them all here, or the hidden ones like Avro and Calcite, widely used inside multiple marketplace offerings. Parquet has wide visible adoption, of course, but that is not the same as a commitment from your vendor to support it for you. That distinction is in fact why I started these blog posts a little over 5 years ago in this post. Some other observations:

  • The rise of Apache Ranger as a key security component driven by the Cloudera-Hortonworks merger is underway, with both Google and Microsoft now supporting it directly. (Google has added several components since my last update.)
  • Apache Phoenix is garnering support as well, bidding for more operational use cases as it and Apache HBase gain features and stability. This week Cloudera again promised that its Operational Database experience will soon join the other packages that provide support for various users of CDH; Phoenix will play a key role. HBase has had 4 releases in the past 9 months, responding to community interest and firming up what will be a significant effort to gain momentum and compete as other vendors up their commitments to nonrelational DBMSs.
  • Apache Airflow also stood out in Cloudera’s announcements this week; it figures prominently in the new Data Engineering offering. Airflow has a familiar parentage in commercial innovation being open sourced: it was first built at Airbnb to manage complex workflows. Like many other pieces noted here, it has found a receptive community eager to embrace it.
  • Apache Zeppelin is everywhere except HPE, as the notebook metaphor has become a widespread way of interacting for Hadoop players, and Python’s popularity continues to grow. However, Jupyter is supported by AWS and Google, and Zeppelin seems to have lost some of its luster in conversations we’ve had lately. Cloudera’s promised Data Visualization offering will enter another crowded field.
  • Apache Livy, still incubating, has achieved broad support for REST access to Spark.
  • Apache Pig, which has not had a version update in 3 years, still appears in almost all columns of the chart – perhaps because batch MapReduce programs that work tend to be left alone. But Hue, with its SQL focus, continues to gain traction, perhaps because of its support for ODBC and JDBC and thus somewhat wider usability. (It also can edit Pig…)
  • Apache Flume is dropping out of sight with support only in HPE and in Cloudera’s legacy CDH; Hadoop users seem to have moved on from it.
  • Apache Flink is also getting substantial adoption – both AWS and Google are supporting it. Cloudera does too, but it’s not listed in the Tracker chart because it’s part of CDF, not CDP. Another example of why I have noted that it’s increasingly less meaningful to focus on “Hadoop.”
  • Apache Impala, a project Cloudera has invested steadily in, has not picked up support from any other player beyond the existing HPE inclusion. Nor has Apache Kudu.
  • Apache Storm seems to be losing steam, with only legacy Cloudera HDP supporting it among the group shown here.
  • Apache Mahout does not seem to have much support. Nearly a year and a half since its last release, and focusing on the Spark community instead of its original MapReduce roots, it is competing with other ways to include or connect to libraries for machine learning. Amazon lists TensorFlow but Google (which of course supports it) does not – another example of “it’s in another place” challenges in packaging.
  • Apache Giraph got a new release in June. Anybody seeing it out there?

As always, comments are invited. This is a huge, sprawling community and I can’t keep track of everything. This post is just intended to alert you to things you might find interesting. Please send corrections, questions, experiences and opinions. Always happy to hear from you; it’s how I decide where to spend my time as an observer of the market beyond the questions I receive in Gartner client inquiries. What are you seeing, or not? What’s working, or not?


Comments are closed


  • Ranger, Phoenix, Arrow, Airflow, Zeppelin, Livy, Kudu, Parquet, Avro, Calcite, Druid and Impala are important and growing. Don’t count out Accumulo. Ozone will be growing.

    Perhaps a Big Data plumbing chart with Airflow, Arrow, Livy, Parquet, Kudu, Avro, Calcite, Ranger, Atlas.

    Ambari, Pig, Flume, Storm, Giraph and Mahout should be in a legacy graph.

    Flink, NiFi, Kafka should be in a separate graph for Big Data Streaming.

  • Merv Adrian says:

    Thanks, Tim – great input. Sounds like I have to do some chart building….

  • Larry McCay says:

    Apache Knox is still supported across Cloudera offerings as well as within various cloud vendor offerings. See dataproc and AWS articles.

  • Jim Dowling says:

    You forgot Apache Hive! Still growing – position 15 on db-rankings engine (SparkSQL at position 39, Redshift at 30, Impala at 37).
    And Presto – position 40.

    • Merv Adrian says:

      Thanks, Jim – I wasn’t trying to be exhaustive. But it’s good to have them pointed out. Hive won’t surprise anybody, and Presto is hot these days for sure. I may be talking about that in an upcoming blog post.

  • Larry McCay says:

    I think you are right about the AWS as a self install optional component.

  • Merv Adrian says:

    None of this means Knox won’t get more broad adoption, but other providers are pursuing their own strategies and some want to use their own components. Open source will be a better fit with some than with others.

  • Apache Knox is hugely important and definitely should be in a Big Data plumbing chart.