
Hadoop is in the Mind of the Beholder

By Merv Adrian | March 24, 2014 | 8 Comments


This post was jointly authored by Merv Adrian (@merv) and Nick Heudecker (@nheudecker) and appears on both blogs.

In the early days of Hadoop (versions up through 1.x), the project consisted of two primary components: HDFS and MapReduce. One to store the data in an append-only file model, distributed across an arbitrarily large number of inexpensive nodes with disk and processing power; the other to process it, in batch, with a relatively small number of available function calls. And some other stuff called Commons to handle bits of the plumbing. But early adopters demanded more functionality, so the Hadoop footprint grew. The result was an identity crisis that grows progressively more challenging for decision makers with almost every new announcement.
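To make that two-part split concrete, the classic word-count job can be sketched in the MapReduce model. The sketch below is plain Python rather than Hadoop's actual Java API, and the names (map_phase, reduce_phase, run_job) are illustrative only – but the shape it shows (emit key/value pairs, shuffle by key, reduce each group) is the small set of function calls the framework exposed.

```python
from collections import defaultdict

# Conceptual word count in the MapReduce model (plain Python, not Hadoop's API).

def map_phase(line):
    # Map step: emit a (word, 1) pair for every word in one input line.
    return [(word, 1) for word in line.split()]

def reduce_phase(word, counts):
    # Reduce step: sum all counts emitted for a single word.
    return (word, sum(counts))

def run_job(lines):
    # Shuffle step: group intermediate pairs by key, as the framework would
    # before handing each key's values to a reducer.
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_phase(line):
            grouped[word].append(count)
    return dict(reduce_phase(w, c) for w, c in grouped.items())

result = run_job(["hadoop stores data", "hadoop processes data in batch"])
print(result)
```

In real Hadoop 1.x the map and reduce functions ran as distributed tasks over HDFS blocks; everything else here (splitting, shuffling, scheduling) was the framework's job, which is why the programming surface stayed so small.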

This expanding footprint included a sizable group of “related projects,” mostly under the Apache Software Foundation. When Gartner published “How to Choose the Right Apache Hadoop Distribution” in early February 2012, the leading vendors we surveyed (Cloudera, MapR, IBM, Hortonworks, and EMC) all included Pig, Hive, HBase, and Zookeeper. Most were willing to support Flume, Mahout, Oozie, and Sqoop. Several other projects were supported by some, but not all. If you were asked at the time, “What is Hadoop?” this set of ten projects, the commercially supported ones, would have made a good response.

In 2013, Hadoop 2.0 arrived, and with it a radical redefinition. YARN muddied the clear definition of Hadoop by introducing a way for multiple applications to use the cluster resources. You have options. Instead of just MapReduce, you can run Storm (or S4 or Spark Streaming), Giraph, or HBase, among others. The list of projects with abstract names goes on. At least fewer of them are animals now.

During the intervening time, vendors have selected different projects and versions to package and support. To a greater or lesser degree, all of these vendors call their products Hadoop, though some are clearly attempting to move “beyond” that message, trying to break free from the Hadoop baggage by introducing new, but just as awful, names. We have data lakes, hubs, and no doubt more to come.

But you get the point. The vague names indicate the vendors don’t know what to call these things either. If they don’t know what they’re selling, do you know what you’re buying? If the popular definition of Hadoop has shifted from a small conglomeration of components to a larger, increasingly vendor-specific conglomeration, does the name “Hadoop” really mean anything anymore?

Today the list of projects supported by leading vendors (now Cloudera, Hortonworks, MapR, Pivotal and IBM) numbers 13: HDFS, YARN, MapReduce, Pig, Hive, HBase, Zookeeper, Flume, Mahout, Oozie, and Sqoop – plus Cascading and HCatalog. Coming up fast are Spark, Storm, Accumulo, Sentry, Falcon, Knox, Whirr… and maybe Lucene and Solr. Numerous others are only supported by their distributor and are likely to remain so, though perhaps MapR’s support for Cloudera Impala will not be the last time we see an Apache-licensed, but not Apache, project break the pattern. All distributions have their own unique value-add. The answer to the question, “What is Hadoop?” and the choice buyers must make will not get easier in the year ahead – it will only become more difficult.



  • Frank Blau says:

    Perhaps a good analogy is Xerox: what was first a corporation with an innovative product is now a verb describing the process of duplication.

    When people ask me what Hadoop is, I start with: it is, first and foremost, a platform built around the idea of an intelligent file system. I think there will naturally be supporting actors in the overall screenplay of Hadoop, but that fact remains a central point of reference for what their interactions will be with the leading players.

    And as an aside, what I am most looking forward to is the integration of usable IDEs for truly integrated development. Right now (even within one vendor’s stack) there are distinctly different tools for developing and deploying “hadoop applications”, and very little to hand off and orchestrate deployment between them.


    • Merv Adrian says:

      Response from Nick Heudecker

      Excellent point about the lack of an integrated IDE for the Hadoop platform. I think we’ll need to see more uniformity between components before a truly integrated development environment emerges. Spark starts to provide some of the essential framework with its streaming and batch/iterative processing in a single API, as does Spring XD. But if it’s early for Hadoop (or whatever it’s becoming), it is even earlier for a cohesive IDE experience.
      Thanks for your comments!

      and from Merv:
      There are several, as you say, and with differing coverage of the various pieces – like the distributions themselves. Yet another fragmentation of the picture. Will it drive a flight to megavendors, like IBM, Pivotal Spring or Microsoft, who already have a large part of the developer footprint? Or will specialists thrive here – and/or be acquired? An interesting landscape ahead…

  • Ofir Manor says:

    Hi Merv,
    This post has a surprising level of cynicism!
    The essence of Hadoop is a “next-gen” platform that can store and process dramatically more data, dramatically cheaper than existing enterprise platforms.
    Yes, the vendors try to come up with catchy names for that, with all the hubs and lakes and oceans and what have you… Their game is obvious – they have a web-scale-proven platform looking for an enterprise problem to solve… So they work hard trying to convince enterprises that they’ve actually got a web-scale problem 🙂
    And yes, the actual list of technologies and capabilities is growing and evolving fast… Plus, there are also dozens more from various companies as “add-ons” (like JethroData where I work). But, as technologies come and go, I believe the essence doesn’t change – do much more, with much more data, with much less money, on a (more-or-less) standard, shared platform.
    But as in other areas (public cloud, NoSQL, etc.), the question is whether enterprises can handle the transition or whether they are locked in forever by the big existing vendors.
    I hope it makes sense,

    • Merv Adrian says:

      Thanks for commenting, and I think you misunderstand my intent. I’m not cynical about this wave of software innovation at all. It’s transformative and enormously important. My point is just that the word no longer does justice to the enormous number of available options. Clarity matters a lot, and two people who both say they are using “Hadoop” may mean entirely different things. This makes it complicated for those trying to decide what to do next. I hope our research will help people wrestling with that problem.

  • Ofir Manor says:

    Thanks for the clarification, Merv – my bad; I guess I shouldn’t be reading or posting comments after midnight… I thought you were being a bit cynical about some of the vendor positioning and messaging (the paragraph before the last), not about the technology itself.
    I agree – “Hadoop” has stopped describing a specific offering and now describes a full category of various offerings (almost as wide now as “Big Data”).
    For example, I think that saying a customer uses Hadoop is a bit like saying he uses a relational database. It does give you some hints and common ground (tables, SQL, optimizer, JDBC/ODBC, likely many specific features), but it doesn’t describe the technical use case (OLTP / DW), the functionality used (partitioning? indexes? MVCC?), the products used (which products, which versions), or the main challenges that were handled (concurrency, ad hoc queries, extreme HA, massive batch computations, etc.).

  • Merv & Nick,
    A trip down memory lane… back in the day, what did “J2EE application server” mean? A minimum set of standards, and lots of differentiating tools and add-ons. As Hadoop transitions from a “data storage + data processing framework” into a real computing platform, the same is happening here. History keeps repeating…

  • Agapito Herrera Cuiza says:

    I would like to get in touch with you; I am interested. Our institution works with poor families who need support and who have entrepreneurial ideas for their activities.

  • Dilip Rane says:

    Thanks for the post. You have rightly identified the problem many are facing today with Hadoop: what are we buying? Why are we buying? What components make sense? Should I care about open source vs. proprietary additions? What do partnerships between Hadoop distributors and other database vendors mean? Is Hadoop a threat to traditional database players?
    Hopefully, through your research you can provide some answers and guidance. Looking forward to your April 24th webinar on this topic.