Hadoop and DI – A Platform Is Not A Solution

By Merv Adrian | February 10, 2013 | 12 Comments

“Hadoop people” and “RDBMS people” – including some DBAs who have contacted me recently – clearly have different ideas about what Data Integration is. And both may differ from what Ted Friedman (twitter: @ted_friedman) and I (@merv) were talking about in our Gartner research note Hadoop Is Not a Data Integration Solution, although I think the DBAs’ concept is far closer to ours.
We went to some lengths to precisely map Gartner criteria from the Magic Quadrant for Data Integration Tools (see below) to the capabilities of what most people would consider the Hadoop stack – Apache Projects that are supported in a commercial distribution. Many of those capabilities were simply absent, with nothing currently available to perform them.


Moreover, even to the degree that some pieces/projects might meet some of the needs, there is nothing that ties them together into a “solution,” which itself was a carefully chosen word. Today, with Hadoop projects in general, we very often see bespoke, self-integrated, “build it yourself and good luck operating it” structures. By contrast, solutions, including those for data integration, provide the relevant pieces coherently in a way that ties together design, operation, optimization and governance. Leaving aside the absence of data quality tools or profiling tools of any kind in today’s supported Hadoop project stack, we don’t see that yet. And Ted and I note in our piece that Hortonworks, for example, implicitly acknowledged that by bundling Talend into its distribution. Talend itself places rather well in the Gartner Magic Quadrant for DI tools.

Hadoop is very useful for a lot of things – including analytics of some kinds, and ETL of some kinds, and for low-cost exploitation of data that is unsuitable for persisting in RDBMSs for a variety of reasons. It’s maturing, steadily adding more capabilities, and driving an economic refactoring of data storage and processing which will result in some (increasing amounts of) data being kept there and some (increasing amounts of) processes being performed there. In Gartner’s Logical Data Warehouse model, it occupies the spot for Distributed Process use cases. The size of that part of the landscape relative to repositories and to virtualization is yet to be determined. It will take some years to sort out, and it won’t stand still.

But platforms are not solutions. Hadoop can very much be a platform on which a DI solution can be built. But a solution? Not yet. For that, talk to the folks in the MQ referenced above. [added 2/13] Thanks for your comments and tweets – and keep them coming!

The Gartner Blog Network provides an opportunity for Gartner analysts to test ideas and move research forward. Because the content posted by Gartner analysts on this site does not undergo our standard editorial review, all comments or opinions expressed hereunder are those of the individual contributors and do not represent the views of Gartner, Inc. or its management.



  • This is a great post! Our view is that Hadoop is a fantastic technology if it is used for the use cases that it’s good for, and within the right enterprise data architecture. Enterprise customers can be much more successful with Hadoop if they use it with complementary technologies to build an end-to-end stack that covers all their data management and analytics needs. No single technology can solve all of customers’ problems in the big data space.

    Teradata has worked with mainstream Hadoop vendors to come up with a concrete proposal for how Hadoop can be successful in the Enterprise. We call that the Unified Data Architecture.

    The Unified Data Architecture has significant similarities with Gartner’s Logical Data Warehouse concept, and we have gotten very enthusiastic feedback from customers who care about getting the most business value out of their data infrastructure.

    Tasso Argyros
    co-President, Teradata Aster

  • I agree that Hadoop by itself is not a data integration platform. However, I think there is greater potential today, as industry vendors fill the functionality gap by porting existing profiling, parsing, ETL, cleansing, and other capabilities to run directly on the Hadoop framework.

    So like Tasso, I think the future is brighter than you might expect. My more complete set of thoughts that expand on what Merv wrote above and with regard to Merv’s and Ted’s research note can be found here:

    • Merv Adrian says:

      Thanks, Tasso and Todd. We’re not far apart on this at all. Neither of us is pessimistic about the possibilities for the future – we’re just trying to be realistic about the present. Much work is indeed being done; for now, anyone needing a serious solution would be well advised to acquire one that is well-integrated and fully functional. “Building your own” will be difficult, inflexible, costly to maintain, and ultimately likely to require replacement. Vendors – like the ones each of you represents – have good solutions worth assessing if selections are in your future.


    I agree, Hadoop itself is not a DI solution, but it can be made into a data integration platform.

  • Merv & Ted, thanks for the post and the mention of Talend’s inclusion in the Hortonworks Data Platform. While Talend is not a Hadoop project in itself, it is available under an Apache license and can be deeply integrated within the Hadoop stack.

    The key to the growth and success of Hadoop is the expansion of the ecosystem with solutions that complement its “native” features – and leverage them as closely as possible. The same way ELT is a great way to leverage Teradata or Oracle’s power, MapReduce code generation allows Hadoop to be used for data integration. That does not make Hadoop a DI solution any more than it makes Teradata one, but it enables users to leverage the platform for DI.
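The code-generation pattern described here can be sketched roughly: a DI tool compiles a declarative mapping into HiveQL that runs natively on the cluster, rather than pulling data out into an external engine. A minimal, hypothetical illustration (the mapping spec and the generator are invented for this sketch, not Talend’s actual mechanism):

```python
# Hypothetical sketch: a DI tool generating HiveQL from a declarative
# mapping, so the transformation runs inside Hadoop (ELT-style) rather
# than in an external ETL engine. Names and spec format are invented.

def generate_hiveql(mapping):
    """Compile a simple source->target column mapping into a HiveQL
    INSERT...SELECT statement that Hadoop executes natively."""
    cols = ", ".join(
        f"{expr} AS {target}" for target, expr in mapping["columns"].items()
    )
    where = f" WHERE {mapping['filter']}" if mapping.get("filter") else ""
    return (
        f"INSERT OVERWRITE TABLE {mapping['target']} "
        f"SELECT {cols} FROM {mapping['source']}{where}"
    )

mapping = {
    "source": "raw_orders",
    "target": "clean_orders",
    "columns": {"order_id": "id", "amount_usd": "amount / 100"},
    "filter": "amount IS NOT NULL",
}
print(generate_hiveql(mapping))
```

The generated statement is what actually runs on the cluster, which is why the approach scales with Hadoop rather than with a separate ETL server.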

    Another quick point: Talend doesn’t only do Data Integration for Hadoop, it also provides Data Quality: profiling, matching, deduplication. All through native Hadoop code generation.

  • Keith Kohl says:

    Thanks for the blog post and the original research note; I also enjoyed reading the comments and discussing this with my fellow DI & Hadoop colleagues.

    As noted in the research note, and by others, ETL is becoming a de facto standard use case for Hadoop:

    • Hortonworks calls this “Refine”:
    • Starting on slide 20:

    Hadoop is being used as an ETL layer to augment or replace existing ETL/ELT approaches. Why is that? Because existing ETL/ELT approaches can’t handle the data requirements anymore – the volume and scale of the data at an economical cost structure. However, Hadoop is still immature and presents significant barriers to wider adoption within the enterprise. Key barriers include connectivity, lack of skill sets, the time and cost of custom coding (as noted in the research note), and disparate/immature tools.

    In order to really perform/execute ETL on Hadoop and get the economic value from Hadoop, users need an approach to ETL that is different from what is available today (hand coding).

    There are some problems that are difficult to solve in Hadoop. Take joins, for example. Joining data is a very common requirement, whether for ETL, analytics, or other workloads – for instance, identifying changes in data between yesterday and today, i.e., Change Data Capture (CDC). With Hadoop, data is distributed throughout the cluster, so performing a join, especially if both sides are large, is not easy and requires a lot of custom coding.
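To make the CDC-join problem concrete, here is a minimal sketch in the style of a Hadoop Streaming mapper/reducer pair (heavily simplified – real jobs read records from stdin and the framework performs the shuffle/sort between the two phases; the data and tags are invented for illustration):

```python
# Sketch of a reduce-side join for change data capture (CDC), in the
# style of a Hadoop Streaming job. Simplified: real jobs read stdin,
# and the framework does the shuffle/sort between map and reduce.

from itertools import groupby
from operator import itemgetter

def map_snapshot(tag, records):
    # Tag each record with its snapshot ("old" or "new"); the key is
    # the record id, which Hadoop uses to shuffle matching rows to
    # the same reducer.
    for rec_id, value in records:
        yield rec_id, (tag, value)

def reduce_changes(mapped):
    # After the shuffle, all values for one key arrive together;
    # compare snapshots to classify the change.
    for rec_id, group in groupby(sorted(mapped), key=itemgetter(0)):
        vals = dict(tag_val for _, tag_val in group)
        if "old" not in vals:
            yield rec_id, "inserted"
        elif "new" not in vals:
            yield rec_id, "deleted"
        elif vals["old"] != vals["new"]:
            yield rec_id, "updated"

yesterday = [("a", 1), ("b", 2)]
today = [("a", 1), ("b", 3), ("c", 4)]
mapped = list(map_snapshot("old", yesterday)) + list(map_snapshot("new", today))
print(dict(reduce_changes(mapped)))  # {'b': 'updated', 'c': 'inserted'}
```

Even in this toy form, the join requires tagging, shuffling, and reconciling both sides by hand – which is exactly the custom coding burden the comment describes.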

    Many users are hand coding MapReduce in Java, Pig, HiveQL, etc. There are even products available that generate MR code (Java, HiveQL, etc.). What’s needed is a lightweight approach that runs an ETL “engine” natively on each data node. The ETL engine could not only perform transformations, like aggregations and joins, but also natively connect to all sources required by the enterprise (including legacy sources such as the mainframe) and targets without the need for an external ETL server, use external data sources as lookup sources, and load analytical data stores outside of Hadoop using native connectors (not Sqoop), like the Teradata load utilities or the Vertica and Greenplum native loading capabilities, without landing the data to disk.

    • Merv Adrian says:

      Thanks for a thorough and very useful comment, Keith. Much appreciated. We’ve seen some cases where Hadoop is being used as a “pre-ETL” stage, where the output file from a MR job is then handed to an existing ETL incumbent tool. These cases seem to be about the use of skills in the business unit that has adopted Hadoop for analysis of some hitherto unused data, using skills they already have (sometimes consultants). And often these began “outside” IT and are now being brought into the fold for hardening and support.

  • @Keith: I absolutely agree that Hadoop is used more and more for ETL, but I would strongly object to your statement that deploying ETL engines on each data node is the way to go.

    It may work fine on smaller clusters with only a dozen or so nodes, but when you reach production-size environments with hundreds or thousands of nodes, or when you want to elastically provision more nodes in a Cloud, this becomes a management and maintenance nightmare. And we are not even talking of the overhead required to manage and coordinate the work of these engines – you’d essentially be duplicating the work done by MapReduce, but in a much less efficient way.

    OTOH, manually coding ETL has proven its limitations in the DBMS world. But as you said, “there are even products that generate code”, and I am convinced that this is the only viable architecture for ETL on Hadoop. The key, of course, is to generate the right code, and for it to be optimized to fully leverage Hadoop’s scale out capabilities.

  • aaronw says:

    Interesting article (and interesting and frustrating MQ.) I think it is important to discuss DI and DBMS in terms of semantics.

    DI is almost always done as a conduit to RDBMSs, which have fixed semantics based on normalization, though they allow flexible query and some flexible ontology. These capabilities have unique strengths, and are mostly sufficient where data can be effectively normalized.

    Contrast RDBMS to the failures of OODBMS and network DBMS, which both have fixed semantics and ontologies (OODBMS may also encapsulate behavior, even more restrictive.) These tend to lock in to a very restrictive understanding, a rigidity that defines limited analysis target results.

    The fun in new data is often the “unstructured data”, which really means deferred semantic analysis. Hadoop is obviously a mix of coordinated distributed data processing, but generally it is a mix of uninteresting Pig/Hive-style batch RDBMS semantics and MR programs that may choose to impose semantics on the data.

    MR and similar are the interesting play. Deferred semantics – where each program can impose a definition on the data – are so burdensome that the value often comes at too high a price. Creating a new semantic definition of the results is a huge effort in terms of development cost and validation. (Consider most “unstructured/semistructured DBMSs” part of this category.)
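The “deferred semantic analysis” point can be illustrated with a small, hypothetical example: the raw bytes carry no declared schema, and a map function imposes one at read time – a different program could impose an entirely different interpretation on the same file (the log format and field names here are assumptions for the sketch):

```python
# Illustration of schema-on-read: the raw lines carry no declared
# schema; this map function imposes one, and a different program
# could impose a different interpretation on the same bytes.

import re

LOG_PATTERN = re.compile(
    r"(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] \"(?P<method>\S+) (?P<path>\S+)"
)

def map_line(line):
    """Parse one raw web-log line into a structured record, or skip it.
    The 'schema' (ip, ts, method, path) exists only in this code."""
    m = LOG_PATTERN.match(line)
    if m:
        return m.groupdict()
    return None  # unparseable lines are simply dropped

raw = '203.0.113.9 - - [10/Feb/2013:06:25:14 +0000] "GET /index.html HTTP/1.1" 200 512'
print(map_line(raw))
```

The effort aaronw describes is precisely this: every new interpretation means writing, validating, and maintaining another such program.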

    Since the effort is so high, most MR is one of:
    – trivial programs in terms of semantics, or data exploration
    – rich, complex, programs that impose heuristic structure on the data
    Ignoring the former, the latter is substantial effort. If the results become valuable, the programs’ interpretation of the data can become systematized and justify extraction to other systems.
    The key point here is semantics binding is the critical issue with Hadoop and DI, and late binding semantics are hard.

    The result of this mix is that Hadoop has a few DI roles:
    a. It may be a target RDBMS (e.g., Hive or HBase)
    b. It may be a DI tool (e.g., using Pig loaders or Chukwa)
    c. It may be frontline app, with downstream DI to integrate into the rest of the stack
    d. It may be a part of a heterogeneous stack
    i. as a peer (likely with DI sucking data out)
    ii. as a second class citizen (e.g., operational BI collating logs or web data for analysis – often with vaguely DI products feeding it)

    • Merv Adrian says:

      Excellent comment, and thanks, aaronw – love to have your real name, by the way. Aaron Werman?
      Your point about the “file in, file out” model’s result of creating systematized data that will be reused leads to one of the trends in 2013: the competition among candidates for the new data store role. HDFS will continue to hold detail, and infrequently used, data in its role as the economic play of choice. But much successful MapReduce and newer processing will lead to added stores for ongoing use. Leading, of course, is HBase, but there are contenders like Cassandra, Accumulo, graph plays like Neo4j, DynamoDB for Amazon users, doc stores like MongoDB and CouchDB, and RDBMSs new and old from traditional players to newcomers like Hadapt. And others….but that’s another post.

  • aaronw says:

    My take is that Cassandra/Accumulo/DynamoDB, etc. will stay in memcached world – they provide fast cache, fast keyed access, very good scaling, but not a lot else in a DBMS context.

    Neo4J and other graph DBMS seem to be falling into a category with things like Lucene – specialized indexing vs. full featured DBMS. If RDF became less… specialized and uncommon, things could change.

    Hadoop (as HDFS) and MongoDB and CouchDB (which I’ve never seen in the wild) compete for *developers* attention as a DBMS-ish store. Selection seems mostly political (e.g., already in use) vs. capability.

    MR is not a DBMS, but to follow your thinking – it can be a stored procedure scaling business logic. The problem with MR is the content of the HDFS – is it homogeneous data or more of a heterogeneous dump? Often, it is a mix that requires much of the MR logic (including HCatalog and JSON interpreted on the fly) to disambiguate.

    This means that MR leads either to structured data out (DI to an RDBMS, etc.) or to combining the DI with the usage, so the MR program becomes an on-the-fly BI tool requiring hand-written logic.
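A tiny sketch of the “heterogeneous dump” problem: when a file mixes record formats, the MR logic must detect and normalize each record as it reads (the field layout for the delimited case is an assumption invented for this example):

```python
# Sketch of on-the-fly disambiguation of a heterogeneous HDFS file:
# each line may be JSON or comma-delimited, and the MR-style logic
# must detect and normalize the format as it reads. The field layout
# for the delimited case is an assumption for this example.

import json

def normalize(line):
    """Return a uniform dict regardless of the record's on-disk format."""
    line = line.strip()
    if line.startswith("{"):
        return json.loads(line)          # self-describing JSON record
    user, event = line.split(",", 1)     # assumed CSV layout: user,event
    return {"user": user, "event": event}

lines = ['{"user": "ann", "event": "login"}', "bob,logout"]
print([normalize(l) for l in lines])
# [{'user': 'ann', 'event': 'login'}, {'user': 'bob', 'event': 'logout'}]
```

Every format added to the dump adds another branch to this logic – which is why a heterogeneous store pushes so much disambiguation work into the programs that read it.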

  • Gordon says:

    This is a great conversation. I posted a comment about it on our blog.