Open Source “Purity,” Hadoop, and Market Realities

By Merv Adrian | March 09, 2013 | 13 Comments

Tags: open source, MapR, EMC, Apache Lucene, Apache Hadoop, Apache Cassandra, Apache, Data and Analytics Strategies

I don’t often do a pure opinion piece, but I feel compelled to weigh in on a question I’ve been asked several times since EMC released its Pivotal HD recently. The question is whether it is somehow inappropriate, even “evil,” for EMC to enter the market without having “enough” committers to open source Apache projects. More broadly, it’s about whether other people can use, incorporate, add to, and profit from Apache Hadoop.

The fact is, there is an entire industry building products atop Apache open source code – and that is the point of having Apache license its projects and provide the other services it does for the open source community. The license permits such use, and companies using the Apache web server, Lucene and SOLR, Cassandra and CouchDB, and many others are everywhere. Others are building BI tools or DI tools that integrate with Apache Hadoop, or selling consulting to incorporate it into solutions. Again – that is the point. Having some components of your solution stack provided by the open source community is a fact of life and a benefit for all. So are roads, but nobody accuses FedEx or your pizza delivery guy of being evil for using them without contributing some asphalt. Commercial entities (including software and IT services providers) provide needed products and services, employ people and pay taxes. We might want them to do more charitable work or make more open source contributions, and some do, but they are not morally obligated to do so. Some IT companies make huge commitments to charitable activities and some don’t – the same is true in all sectors of the economy.

I understand why open source advocates think they are defending their turf, and I know it’s a core belief that it matters how many committers you have. But I don’t believe the market will care as Hadoop moves into the mainstream. Buyers will choose the solutions that fit their needs, from suppliers who support them at a price they are comfortable with – and will do so whether the vendors have “enough” committers or not.
For clarity’s sake, this wasn’t a new market entry. EMC was already a purveyor of Hadoop-based solutions with its Greenplum HD and with a version based on MapR. That itself is a topic worth a sentence or two. EMC’s decision to offer a MapR-based distribution early on was very much a market choice – it did so for customers who demanded those features NOW (then) and couldn’t get them any other way. I don’t think EMC fooled those buyers, who asked for what EMC provided. Nor do I think EMC is morally reprehensible for building its own solution by leveraging something in its product portfolio (in this case, Isilon as a potential substitute for HDFS) and thus “abandoning” those customers.
Now, if EMC stops supporting those buyers, or forces them to move to a new product to keep their support – well, then we can talk. But just to be clear, virtually every software company has an end-of-life policy on support for versions of its products. And again, some are more “oppressive” about it than others – the topic is often very contentious, and I get inquiries on it all the time. That topic has not even come up with EMC and MapR yet.
So a few deep breaths, please.
Dial it back.
Support open source. It’s a good thing. In fact, it’s transformative – it changes your choices, and often for the better, especially economically.
If you sell, by all means appeal to people who value purity. But let’s not try to have our cake and eat it too: if you sell a product based only on open source, or services that help people implement and profit from it, you’re part of the same economy as those who blend it with other pieces. Let’s compete on the basis of satisfying our customers at a fair price. The rest – well, that’s marketing. And we all know how much some people like that, and how seriously they take it.

The Gartner Blog Network provides an opportunity for Gartner analysts to test ideas and move research forward. Because the content posted by Gartner analysts on this site does not undergo our standard editorial review, all comments or opinions expressed hereunder are those of the individual contributors and do not represent the views of Gartner, Inc. or its management.


  • Oliver Ratzesberger says:

    I agree. Nobody complains that virtually every commercial company runs some distro of Linux and commercial software on top of that. That is the beauty of open source melding with commercial offerings – or empowering them.

    • Merv Adrian says:

      Thanks, Oliver. Appreciate your thoughts – I know you constructed some VERY large systems that combined OSS and commercial software with internal development.

  • Merv, thanks a bunch for this write-up! I by and large agree.

    One aspect I’d like to highlight is the importance of ‘standard’ interfaces, defined through community consensus, and enforced by the Apaches and the likes. With my MapR hat on: I think it makes perfect sense to offer a commercial implementation that is superior to the implementation you get ‘for free’ – as long as you’re 100% compatible with the community-defined standard.

    • Merv Adrian says:

      Thanks, Michael. I agree: interfaces matter a great deal, and make much of the enhancement and extension possible. At every layer of a Hadoop “stack” substitutions are possible and emerging for one reason or another – MapR’s is a good example at the storage layer. If you or others don’t conform to the interface standards, it will be the market that decides whether it accepts that or not. You will be less compatible with the other emerging stack components and that will mean some buyers opt to go elsewhere. The proportion of anyone’s stack that is open source will vary, but if stacks cannot be constructed, your addressable market is limited. I don’t mean to sound too doctrinaire about “letting the market rule,” but it seems to work.

  • Hi Merv,

    As you point out, the beauty of open source is that it enables others to take the code and create derivative works. IBM is an example of a company that has done this quite successfully with Apache Software Foundation projects…and in a way that benefits both their customers and the community efforts that they leverage.

    I think the issue with EMC’s Pivotal HD announcement was NOT about the product (e.g., putting Greenplum SQL on top of HDFS) or some notion of open source “purity”; it was in the marketing illusion Scott Yara put forth when he followed “We’re all in on Hadoop, period.” with “We literally have over 300 engineers working on our Hadoop platform”.

    Peeling back the marketing illusion of that 300 number yields almost no engineers committing code directly to open source Apache Hadoop or related projects. While blurring the lines is good marketing by EMC, it opens them up to having the bull$h1t flag thrown on them, so to speak.

    Call me old fashioned, but *truth matters*.

    With that said, I definitely agree with you that buyers will choose solutions that fit their needs. In my recent post entitled “Separating Open Source Signal from Enterprise Hadoop Noise” I make the point that enterprise needs and open source are not mutually exclusive. I use Teradata and Microsoft as examples of use-value participants who are building value on top of open source enterprise Hadoop. Microsoft, in particular, has contributed nicely to various Hadoop-related Apache projects.

    Bottom-line: I don’t see this as a debate about open source “purity”. There will be LOTS of commercial solutions built on open source enterprise Hadoop as the foundation. And that is a GREAT thing!

    P.S. Since the Hadoop trademark is owned by the Apache Software Foundation, passions tend to flare when the term gets used in questionable ways. Similar to why Gartner enforces the use of the term “Magic Quadrant” (to avoid market confusion, etc.).

    • Merv Adrian says:

      Thanks, Shaun. Good to have you commenting here and articulating your PoV.

      And mostly I agree with the sentiments, but as I hope I made clear in the post, I think the concerns are overblown at this point. Scott Yara spoke at an industry launch event to press and analysts. He strongly affirmed EMC’s seriousness about being in what we all think of as the “Hadoop market.” “We’re all in” and “we have 300 people working on this” were expressions of that importance, as I read them at the time, and now. I don’t see any effort to subvert Apache, or steal its trademark.

      Perhaps we will see a deceptive EMC marketing campaign that pretends EMC is doing more contributions than they are. But I didn’t take what I heard that day as such a campaign. Just – as you called it – marketing – at a single event. I’m willing to reserve judgement on what happens next. It’s a good discussion, and we should all be vigilant.

  • Hi Merv,

    I appreciate your thoughts and while I agree we should all be diligent, I’m not convinced this is one-time overzealous marketing.

    In mid-2012, John Furrier interviewed Pat Gelsinger re: Hadoop (prior to Pat moving from EMC to VMware). The link below fast-forwards to 27 minute mark.

    Important snippets:
    “Hadoop isn’t a thing…it’s a bunch of stuff.”

    “There is no center of Hadoop.”

    “You’re talking about HBase, HDFS, Pig, and Hive…a bunch of components.”

    “Some [components] will just be pieces of Chorus.”

    “Other pieces are going to disappear; HDFS…you don’t want another file system, it’s a protocol; we’ve just melted it into Isilon.”

    “You won’t be able to tell where HBase stops and where Greenplum starts.”

    Net-out: EMC envisions a diaspora of Hadoop-related components that are mix-ins to EMC products and that never come together into an easy-to-use, easy-to-consume data platform outside of EMC’s products. Fragmentation is fundamental to the strategy.

    On the other hand, I believe open source enterprise Hadoop can provide a cohesive platform that includes the platform services, data services, and operational services that address enterprise needs. Communicating a vision for where enterprise Hadoop is going and working on the open source projects as well as introducing new projects is the best way to address those enterprise requirements.

    On top of that stable foundation thousands of flowers (both open source and commercial) can bloom.

    This is why you hear folks like Hortonworks talk about the importance of open source committers. It is difficult to drive a real enterprise-focused roadmap or fix/patch major issues if you don’t have engineers working to make that happen within the community projects. And if you’re doing your work off to the side of the community, then there’s no clear path for those changes to work their way into the upstream community efforts.

    Anyhow, enough said. Thanks for the thought-provoking blog post.

    • Merv Adrian says:

      Thanks, Shaun – keep it coming. As I said, I respect the PoV here. I don’t think EMC is or will be alone in seeking to “embrace and extend.” In fact that phrase first came to prominence years ago with your partner Microsoft back in the early ODBC days. It’s a familiar motion. And for the record, I hope we see more committers from many companies. But I’m not holding my breath.

  • aaronw says:

    A critical point for a customer to consider is the containers. EMC Greenplum HAWQ is no problem for Hadoop as FOSS world or as Apache projects, but is a problem for a customer not all-in in an EMC centric world. If you plan to have Hadoop in and out of the EMC umbrella, it gets awkward.

    How do you bridge between HDFS and HAWQ? What do you do if you want a Hadoop component that EMC doesn’t support? EMC does not have a complete Hadoop ecosystem, and Isilon and HDFS do not play nice.

    It is not a Hadoop issue, it is an EMC strategy. It seems unlikely that anyone would consider HAWQ outside the Greenplum world. Net-net – consider this an add on to Greenplum that will mostly be walled off from the bulk of the customers’ Hadoop farms.

    Expect that in a few years this stream gets smaller, not larger – becoming adaptors and glue around Greenplums, supporting MR and major subprojects. Hadoop meanwhile gets too big for EMC to manage support for, and it gets back in bed with another support vendor….

  • aaronw says:

    Finishing the point – FOSS when active is not threatened by attempts to create closed source flavors (though it gets problematic if the FOSS projects stall). Hadoop is active enough not to worry about attempts to create proprietary RDBMS Hadoop admixtures, replacements for HDFS, etc.

    The worst option in this mix is to have a non-critical mass proprietary flavor, and that is where EMC currently stands. The current strategy seems simply untenable – not because it is not open sourcing the code, but because it is incomplete and requires customer lock-in to an awkward blend.

    It is a compliment to FOSS that users don’t commit all code. The biggest risk of proprietary code is how awful it often is. FOSS’s strong suit is how well it works through unfulfilled needs.

    We as consumers really want 100 flavors of proprietary attempts to break out of the RDBMS big data dissonance. 90 will fail outright. Most of the others will fall into niches, with maybe a single financial success because of product value – which will be replicated in short order in FOSS.

    In summary, companies with products both FOSS and proprietary should create helpful products that work with Hadoop. In the rare case the capability is unique and valuable, it will likely appear in Hadoop (and in most cases it will be most helpful to the innovator to open source it and continue to drive it).

    A corollary to this is that there is no great incentive to get support from a committer. The committer rarely has proprietary knowledge. You want support where you need it, from vendors you trust, respect, and have confidence in.

    • Merv Adrian says:

      As always, Aaron, thanks. I think we have to assume that EMC’s strategy has hardly started yet – the organization is just forming, and these are only the earliest moves. And as Hadoop evolves, so will the strategies of vendors who distribute and support it. Witness Cloudera’s change of the free version of Manager from 50-node to unlimited node support, MapR’s embrace of Dremel – excuse me, Drill – Hortonworks’ new project proposals, etc. The pace of change here is extraordinary.

  • John strout says:

    The committer issue is no longer an indicator of how good the vendor is. It is being used to raise the barriers to entry for new committers by a few vendors who are hijacking the intent of what open source was supposed to be. The process of appointing new committers is set up in a way that makes it very hard for new committers to get appointed. Don’t believe me? Look it up.

    Need a proof point? Just look at Hortonworks’ nearest competitor – Cloudera. It had a commercial Hadoop distro way before Hortonworks. It has had more engineers on staff than Hortonworks, but still has fewer committers. Need another proof point? IBM, EMC, Microsoft, HP, Intel, Dell, Oracle, SAP, and others combined have a far higher number of engineers working on Hadoop than Hortonworks, but very few committers by comparison. This is not making much sense anymore.

    It is just a matter of time before customers see through this gimmick. I think Merv is right on this one, the number of committers will become a moot issue or a marketing gimmick at best over time.

  • Troof says:

    What makes EMC’s ecosystem incomplete?