
Stack Up Hadoop to Find Its Place in Your Architecture

By Merv Adrian | January 30, 2013 | 6 Comments

OSS, open source, Hortonworks, Data Integration, Cloudera, Apache Sqoop, Apache MapReduce, Apache HDFS, Apache HBase, Apache Hadoop, Apache, Data and Analytics Strategies

2013 promises to be a banner year for Apache Hadoop, platform providers, related technologies – and analysts who try to sort it out. I’ve been wrestling with ways to make sense of it for Gartner clients bewildered by a new set of choices, and for them and myself, I’ve built a stack diagram that describes possible functional layers of a Hadoop-based model.

The model is not exhaustive, and it continually evolves. In my own continuous collection and update of market, usage and technical data, it serves as a scratchpad I use – every project/product name in the layers is a link to a separate slide in a large deck I use to follow developments. As you can see below, it contains many Apache and non-Apache pieces – projects, products, vendors – open and closed source. Some are quite low level – for example Trevni can be thought of as a format used inside Avro – but I include them at least in part because I keep track of “moving parts,” and in the world of open source, that means a lot of pieces that are independent of one another.


(updated 1/30/13)

Part of the effort so far has been on relating this model to Gartner’s Information Capabilities Framework, an enormously useful view of the verbs we use to compose our semantic use cases in building business applications. My colleague Ted Friedman and I just used the two models to assess how Hadoop stacks up as a Data Integration solution. Not surprisingly, I suppose, we found it wanting. You can see our research here if you’re a Gartner client.

I expect further refinement of this stack in the weeks ahead, and more offerings at each layer as it evolves as well. I’m trying to keep it simple – at 6 layers it’s already getting heavy, and I’d hate to add more. But that may be unavoidable. Your feedback here will be helpful – please offer comments if you have any! As a guide to choice, simplicity is a much-desired, but often unobtainable, objective.



  • Log analysis says:

    While data stacking is a vital action to take, it can also be very painstaking. Sometimes, you’d have to put your precious data at risk for trial and error data integration attempts, which is a great reason why backing up all the data in use is an important precautionary measure to take. With the continuously growing quantities of data in your computer, manipulating every detail can be overly hard to accomplish. Perhaps an automated tool that would complement Hadoop’s processes would be necessary. Or I would suggest adding broader log capturing, classification and filtering capabilities to easily identify log files by their semantics, which would also help users determine which items are rubbish and which data are worth keeping. Also, when handling data at a big data company, there is all the more reason to use an advanced log analytics tool to deal with the mind-boggling structural complexity of various computer systems.

  • David Inbar says:

    Another interesting extension to this model might be to identify products which operate identically on both Hadoop and traditional server architectures – providing a counter-balance to the (expensive) trend towards data and compute islands. Hint – RushAnalytics :).

    • Merv Adrian says:

      Thanks, David. It’s certainly worth noting – and I’m not surprised you did.
      More seriously, I’m continuing to evolve the model, but I’m trying not to make it more complicated as I do so. As you can imagine, that is a challenge.

  • John Held says:

    I had to chuckle at the title of the linked document “Hadoop Is Not a Data Integration Solution,” that’s a document title tailor-made for a concept headed right for the trough.

    I like the evolving stack. If I had any feedback, it would be that layers 1, 3, 4, 5 are distinct concepts. Layers 2 & 3 were less clear. Could they all be one layer? Or three? “Search” is pretty distinct, but I found “implement” & “develop” less clear.

    This might be the right organization, but perhaps a brief explanation of each layer might help!

    • Merv Adrian says:

      You’re absolutely right, and I appreciate the specific jab in the ribs to clean those layers up. I’ve struggled against making this a 7 or 8 layer stack, but it may not be possible to avoid it. Keep watching – I’ll likely have a new version in the next blog post. This is a place to try out ideas; that’s one of the ways it differs (one hopes) from published research.

  • Nenshad Bardoliwalla says:

    One critical layer is missing in this stack: information management, or to use your verbs, merge/cleanse/provision. It sits right underneath ingest / propagate. The typical Hadoop deployment pattern is to ingest raw data from relational, device, social media stream, and other sources. It is typically very difficult to work with the data in this form. The data must be parsed, cleansed, merged with other data, and provisioned in a manner fit for purpose for the analytics layers above. There are multiple early stage companies that are pursuing building out this layer.