Gartner Blog Network

Stack Up Hadoop to Find Its Place in Your Architecture

by Merv Adrian  |  January 30, 2013  |  8 Comments

2013 promises to be a banner year for Apache Hadoop, platform providers, related technologies – and analysts who try to sort it out. I’ve been wrestling with ways to make sense of it for Gartner clients bewildered by a new set of choices, and for them and myself, I’ve built a stack diagram that describes possible functional layers of a Hadoop-based model.

The model is not exhaustive, and it continually evolves. In my own continuous collection and update of market, usage and technical data, it serves as a scratchpad I use – every project/product name in the layers is a link to a separate slide in a large deck I use to follow developments. As you can see below, it contains many Apache and non-Apache pieces – projects, products, vendors – open and closed source. Some are quite low level – for example Trevni can be thought of as a format used inside Avro – but I include them at least in part because I keep track of “moving parts,” and in the world of open source, that means a lot of pieces that are independent of one another.


[Figure: Hadoop stack diagram, updated 1/30/13]

Part of the effort so far has been on relating this model to Gartner’s Information Capabilities Framework, an enormously useful view of the verbs we use to compose our semantic use cases in building business applications. My colleague Ted Friedman and I just used the two models to assess how Hadoop stacks up as a Data Integration solution. Not surprisingly, I suppose, we found it wanting. You can see our research here if you’re a Gartner client.

I expect further refinement of this stack in the weeks ahead, and more offerings at each layer as it evolves as well. I’m trying to keep it simple – at 6 layers it’s already getting heavy, and I’d hate to add more. But that may be unavoidable. Your feedback here will be helpful – please offer comments if you have any! As a guide to choice, simplicity is a much-desired, but often unobtainable, objective.

Merv Adrian
Research VP
4 years with Gartner
37 years in IT industry

Merv Adrian is an analyst following database and adjacent technologies as extreme data transforms assumptions about what to persist as well as when, where and how. He also watches the way the software/hardware boundary…

Thoughts on Stack Up Hadoop to Find Its Place in Your Architecture

  3. Log analysis says:

    While data stacking is a vital step, it can also be painstaking. Sometimes you have to put your precious data at risk in trial-and-error data integration attempts, which is a good reason to back up all the data you use as a precaution. With data volumes growing continuously, handling every detail by hand is too hard; an automated tool that complements Hadoop’s processes may be necessary. I would also suggest adding broader log capture, classification and filtering capabilities so that log files can be identified easily by their semantics, which should also help users determine which items are rubbish and which data are worth keeping. A big data company has all the more reason to use an advanced log analytics tool to deal with the mind-boggling structural complexity of its various computer systems.

  4. David Inbar says:

    Another interesting extension to this model might be to identify products which operate identically on both Hadoop and traditional server architectures – providing a counter-balance to the (expensive) trend towards data and compute islands. Hint – RushAnalytics :).

    • Merv Adrian says:

      Thanks, David. It’s certainly worth noting – and I’m not surprised you did.
      More seriously, I’m continuing to evolve the model, but I’m trying not to make it more complicated as I do so. As you can imagine, that is a challenge.

  5. John Held says:

    I had to chuckle at the title of the linked document, “Hadoop Is Not a Data Integration Solution.” That’s a title tailor-made for a concept headed right for the trough.

    I like the evolving stack. If I had any feedback, it would be that layers 1, 3, 4, 5 are distinct concepts. Layers 2 & 3 were less clear. Could they all be one layer? Or three? “Search” is pretty distinct, but I found “implement” & “develop” less clear.

    This might be the right organization, but perhaps a brief explanation of each layer might help!

    • Merv Adrian says:

      You’re absolutely right, and I appreciate the specific jab in the ribs to clean those layers up. I’ve struggled against making this a 7 or 8 layer stack, but it may not be possible to avoid it. Keep watching – I’ll likely have a new version in the next blog post. This is a place to try out ideas; that’s one of the ways it differs (one hopes) from published research.

  6. Nenshad Bardoliwalla says:

    One critical layer is missing in this stack: information management, or, to use your verbs, merge/cleanse/provision. It sits right underneath ingest / propagate. The typical Hadoop deployment pattern is to ingest raw data from relational, device, social media stream, and other sources. It is typically very difficult to work with the data in this form. The data must be parsed, cleansed, merged with other data, and provisioned in a manner fit for purpose for the analytics layers above. There are multiple early-stage companies building out this layer.
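
The parse, cleanse, merge and provision steps described in this comment can be sketched in a few lines of Python. This is only a minimal illustration, not any vendor's implementation: the function names, the `user_id` and `region` fields and the sample records are all assumptions made up for the example, and a real Hadoop deployment would run such steps as distributed jobs rather than in memory.

```python
import json
from collections import defaultdict


def parse(raw_lines):
    """Parse raw JSON lines, skipping any line that fails to decode."""
    records = []
    for line in raw_lines:
        try:
            records.append(json.loads(line))
        except json.JSONDecodeError:
            continue  # drop malformed input rather than fail the whole batch
    return records


def cleanse(records):
    """Keep dict records with a non-empty user_id, normalized to lowercase."""
    cleaned = []
    for rec in records:
        if not isinstance(rec, dict):
            continue
        user_id = str(rec.get("user_id") or "").strip().lower()
        if user_id:
            cleaned.append({**rec, "user_id": user_id})
    return cleaned


def merge(events, profiles):
    """Attach profile fields to each event by user_id (a simple in-memory join)."""
    by_id = {p["user_id"]: p for p in profiles}
    return [{**by_id.get(e["user_id"], {}), **e} for e in events]


def provision(records):
    """Shape merged records for an analytics layer: event counts per region."""
    counts = defaultdict(int)
    for rec in records:
        counts[rec.get("region", "unknown")] += 1
    return dict(counts)


# Hypothetical raw input: one untrimmed id, one malformed line, one empty id.
raw = [
    '{"user_id": " A1 ", "action": "click"}',
    'not json',
    '{"user_id": "", "action": "view"}',
    '{"user_id": "b2", "action": "view"}',
]
profiles = [{"user_id": "a1", "region": "west"}]

result = provision(merge(cleanse(parse(raw)), profiles))
print(result)
```

Keeping each step as a separate function mirrors the idea of a distinct layer: the ingest side only has to deliver raw lines, and the analytics side only sees the provisioned result.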

Comments or opinions expressed on this blog are those of the individual contributors only, and do not necessarily represent the views of Gartner, Inc. or its management. Readers may copy and redistribute blog postings on other blogs, or otherwise for private, non-commercial or journalistic purposes, with attribution to Gartner. This content may not be used for any other purposes in any other formats or media. The content on this blog is provided on an "as-is" basis. Gartner shall not be liable for any damages whatsoever arising out of the content or use of this blog.