Merv Adrian

A member of the Gartner Blog Network

Research VP
4 years with Gartner
37 years in IT industry

Merv Adrian is an analyst following database and adjacent technologies as extreme data transforms assumptions about what to persist as well as when, where and how. He also watches the way the software/hardware boundary…


Hadoop is in the Mind of the Beholder

by Merv Adrian  |  March 24, 2014  |  11 Comments

This post was jointly authored by Merv Adrian (@merv) and Nick Heudecker (@nheudecker) and appears on both blogs.

In the early days of Hadoop (versions up through 1.x), the project consisted of two primary components: HDFS and MapReduce. One stored the data in an append-only file model, distributed across an arbitrarily large number of inexpensive nodes with disk and processing power; the other processed it, in batch, with a relatively small number of available function calls. Some other pieces, collected in Commons, handled bits of the plumbing. But early adopters demanded more functionality, so the Hadoop footprint grew. The result was an identity crisis that grows progressively more challenging for decision makers with almost every new announcement.
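The division of labor described above can be sketched, purely for illustration, as a word count in plain Python — no Hadoop cluster or real HDFS involved, just the map/shuffle/reduce shape of the programming model (all function names here are hypothetical, not Hadoop APIs):

```python
# A minimal sketch of the MapReduce programming model: the user supplies
# only a map function and a reduce function; the "framework" handles the
# shuffle in between. Simulated in-process, not on a cluster.
from itertools import groupby
from operator import itemgetter

def map_fn(line):
    # Emit a (word, 1) pair for every word in a line of input.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Sum all counts emitted for a single word.
    return (word, sum(counts))

def run_mapreduce(lines):
    # Shuffle phase: sort intermediate pairs so equal keys are adjacent,
    # then group them and hand each group to the reducer.
    intermediate = sorted(kv for line in lines for kv in map_fn(line))
    return [reduce_fn(key, (c for _, c in group))
            for key, group in groupby(intermediate, key=itemgetter(0))]

counts = dict(run_mapreduce(["the quick brown fox", "the lazy dog"]))
# counts["the"] == 2
```

The narrowness of this interface — two functions and a fixed batch pipeline — is exactly the "relatively small number of available function calls" that early adopters pushed against.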

This expanding footprint included a sizable group of “related projects,” mostly under the Apache Software Foundation. When Gartner published How to Choose the Right Apache Hadoop Distribution in early February 2012, the leading vendors we surveyed (Cloudera, MapR, IBM, Hortonworks, and EMC) all included Pig, Hive, HBase, and Zookeeper. Most were willing to support Flume, Mahout, Oozie, and Sqoop. Several other projects were supported by some, but not all. If you had been asked at the time, “What is Hadoop?” this set of ten projects – the commercially supported ones – would have made a good answer.

In 2013, Hadoop 2.0 arrived, and with it a radical redefinition. YARN muddied the clear definition of Hadoop by introducing a way for multiple applications to use the cluster resources. You have options. Instead of just MapReduce, you can run Storm (or S4 or Spark Streaming), Giraph, or HBase, among others. The list of projects with abstract names goes on. At least fewer of them are animals now.

During the intervening time, vendors have selected different projects and versions to package and support. To a greater or lesser degree, all of these vendors call their products Hadoop – though some are clearly attempting to move “beyond” that message. Some vendors are trying to break free from the Hadoop baggage by introducing new, but just as awful, names. We have data lakes, hubs, and no doubt more to come.

But you get the point. The vague names indicate the vendors don’t know what to call these things either. If they don’t know what they’re selling, do you know what you’re buying? If the popular definition of Hadoop has shifted from a small conglomeration of components to a larger, increasingly vendor-specific conglomeration, does the name “Hadoop” really mean anything anymore?

Today the list of projects supported by leading vendors (now Cloudera, Hortonworks, MapR, Pivotal and IBM) numbers 13. Today it’s HDFS, YARN, MapReduce, Pig, Hive, HBase, Zookeeper, Flume, Mahout, Oozie, and Sqoop – plus Cascading and HCatalog. Coming up fast are Spark, Storm, Accumulo, Sentry, Falcon, Knox, Whirr… and maybe Lucene and Solr. Numerous others are only supported by their distributor and are likely to remain so, though perhaps MapR’s support for Cloudera Impala will not be the last time we see an Apache-licensed, but not Apache, project break the pattern. All distributions have their own unique value-add. The answer to the question, “What is Hadoop?” and the choice buyers must make will not get easier in the year ahead – it will only become more difficult.


Category: Accumulo Ambari Apache Apache Drill Apache Yarn Big Data BigInsights Cloudera Elastic MapReduce Gartner Giraph Hadoop Hbase HCatalog HDFS Hive Hortonworks IBM Intel Lucene MapR MapReduce Oozie open source OSS Pig Solr Sqoop Storm YARN Zookeeper

11 responses so far ↓

  • 1 Hadoop is in the Mind of the Beholder | Merv Adrian's IT Market Strategy   March 24, 2014 at 6:05 pm

    [...] –more– [...]

  • 2 Frank Blau   March 24, 2014 at 6:28 pm

    Perhaps a good analogy is Xerox: what was first a corporation with an innovative product is now a verb describing the process of duplication.

    When people ask me what Hadoop is, I start with: it is, first and foremost, a platform built around the idea of an intelligent file system. I think there will naturally be supporting actors in the overall screenplay of Hadoop, but that fact remains a central point of reference for what their interactions will be with the leading players.

    And as an aside, what I am most looking forward to is the integration of usable IDEs for truly integrated development. Right now (even within one vendor’s stack) there are distinctly different tools for developing and deploying “hadoop applications”, and very little to hand off and orchestrate deployment between them.

    Frank

  • 3 Hadoop is in the Mind of the Beholder : 6config: Le blog   March 24, 2014 at 6:28 pm

    [...] By Merv Adrian [...]

  • 4 Hadoop is in the Mind of the Beholder | Euler Global Consulting   March 24, 2014 at 7:09 pm

    [...] By Merv Adrian [...]

  • 5 Merv Adrian   March 24, 2014 at 7:09 pm

    Response from Nick Heudecker

    Frank,
    Excellent point about the lack of an integrated IDE for the Hadoop platform. I think we’ll need to see more uniformity between components before a truly integrated development environment emerges. Spark starts to provide some of the essential framework with its streaming and batch/iterative processing in a single API, as does Spring XD. But if it’s early for Hadoop (or whatever it’s becoming), it is even earlier for a cohesive IDE experience.
    Thanks for your comments!
    -Nick

    and from Merv:
    There are several, as you say, and with differing coverage of the various pieces – like the distributions themselves. Yet another fragmentation of the picture. Will it drive a flight to megavendors, like IBM, Pivotal (Spring) or Microsoft, who already have a large part of the developer footprint? Or will specialists thrive here – and/or be acquired? An interesting landscape ahead….

  • 6 Ofir Manor   March 24, 2014 at 11:37 pm

    Hi Merv,
    This post has a surprising level of cynicism!
    The essence of Hadoop is a “next-gen” platform that can store and process dramatically more data, dramatically cheaper than existing enterprise platforms.
    Yes, the vendors try to come up with catchy names for that, with all the hubs and lakes and oceans and what have you… Their game is obvious – they have a web-scale proven platform looking for an enterprise problem to solve… So, they work hard trying to convince enterprises that they actually have a web-scale problem :)
    And yes, the actual list of technologies and capabilities is growing and evolving fast… Plus, there are also dozens more from various companies as “add-ons” (like JethroData where I work). But, as technologies come and go, I believe the essence doesn’t change – do much more, with much more data, with much less money, on a (more-or-less) standard, shared platform.
    But like in other areas (public cloud, NoSQL etc.), the question is whether enterprises can handle the transition or are locked in forever by the big existing vendors.
    I hope it makes sense,
    Ofir

  • 7 Merv Adrian   March 24, 2014 at 11:41 pm

    Thanks for commenting, and I think you misunderstand my intent. I’m not cynical about this wave of software innovation at all. It’s transformative and enormously important. My point is just that the word no longer does justice to the enormous number of available options. Clarity matters a lot, and two people who both say they are using “Hadoop” may mean entirely different things. This makes it complicated for those trying to decide what to do next. I hope our research will help people wrestling with that problem.

  • 8 Ofir Manor   March 26, 2014 at 5:22 am

    Thanks for the clarification Merv, my bad – I guess I shouldn’t be reading or posting comments after midnight… I thought you were a bit cynical about some of the vendor positioning and messaging (the paragraph before the last) – not about the technology itself.
    I agree – “Hadoop” has stopped describing a specific offering and now describes a full category of various offerings (almost as wide now as “Big Data”).
    For example, I think that saying a customer uses Hadoop is a bit like saying he uses a relational database. It does give you some hints and common ground (tables, SQL, optimizer, JDBC/ODBC, likely many specific features), but it doesn’t describe the technical use case (OLTP / DW), the functionality used (partitioning? indexes? MVCC?), the products used (which products, which versions), or the main challenges that were handled (concurrency, ad hoc queries, extreme HA, massive batch computations etc.).

  • 9 Yves de Montcheuil   March 26, 2014 at 12:13 pm

    Merv & Nick,
    A trip down memory lane… back in the day, what did “J2EE application server” mean? A minimum set of standards, and lots of differentiating tools and add-ons. As Hadoop transitions from a “data storage + data processing framework” into a real computing platform, the same is happening here. History keeps repeating…

  • 10 Agapito Herrera Cuiza   March 27, 2014 at 7:49 pm

    I want to get in touch with you; I am interested. Our institution works with poor families who need support and who have entrepreneurial ideas for their activities.

  • 11 Dilip Rane   April 4, 2014 at 3:04 pm

    Merv,

    Thanks for the post. You have rightly identified the problem many are facing today with Hadoop: What are we buying? Why are we buying? What components make sense? Should I care about open source vs. proprietary additions? What do partnerships between Hadoop distributors and other database vendors mean? Is Hadoop a threat to traditional database players?
    Hopefully, through your research you can provide some answers and guidance. Looking forward to your April 24th webinar on this topic.