Gartner Blog Network


What, Exactly, Is “Proprietary Hadoop”? Proposed: “distribution-specific.”

by Merv Adrian  |  September 6, 2013  |  12 Comments

Many things have changed in the software industry in an era when the use of open source software has pervaded the mainstream IT shop. One of them is the significance – and descriptive adequacy – of the word “proprietary.” Merriam-Webster defines it as “something that is used, produced, or marketed under exclusive legal right of the inventor or maker.” In the Hadoop marketplace, it has come to be used – even by me, I must admit – to mean “not Apache, even though it’s open source.”

But Hadoop is not a thing; it’s a collection of things. Apache says Hadoop is only four things: HDFS, MapReduce, YARN, and Common. Distributions are variable collections that may contain non-Apache things, and they may be called “proprietary” even when everything in them is open source – even marketed under Apache licenses – if they are not entirely Apache Projects.

To the most aggressive proponents of open source software, this use of “proprietary” equates to “impure,” even “evil.” And some marketing stresses the importance of “true openness” as an unmitigated good as opposed to the less-good “proprietary,” even when the described offering is in fact open source software.

Some vendors respond to market needs with distributions containing open source-licensed components that come from a community process outside the Apache model – where a committee of people from other companies, with their own agendas, might delay or seek to change the implementation or even the functionality of “their” software. Many Apache projects started in other communities – git-hosted ones, for example – and were proposed to Apache later. Some call these distributions or components “proprietary,” hence less pure, less good, less worthy of consideration.

To the consumer – the enterprise that acquires its Hadoop from a distributor that is the only provider offering a particular capability – it’s a dilemma, and a mixed bag. If they need the functionality, no other distributor supports an Apache project that does the same thing (if such a project even exists), and their distributor offers and supports it, going there seems obvious. Apache, after all, does not support its projects. Vendors (distributors) do.

On the other hand, if and when Apache gets there and someone supports that Project, it will potentially be more difficult to move from one distribution to another. Organizations are increasingly wary of switching costs, even as they embrace the rapid innovation – and resultant disruption – of the open source revolution.

“Purity” is not the question; timing and availability are. Consumers need to buy what will work, and want to buy what will be supported. And the use of “proprietary” here is both inaccurate and not to the point.

So I’m proposing to use a different expression in my future discussions: “distribution-specific.” It can apply today to an Apache project if only a specific distributor includes it. And it will apply to vendor enhancements, even if API-compatible and open source, where the same is true.

I’d love to hear your thoughts – and your nominations for the list. Some are obvious: Cloudera Impala, MapR’s file system, IBM’s JAQL. Or Apache Cassandra, listed on the Apache Hadoop page linked above as a “Hadoop-related project” along with HBase, Pig and others. Only one company commercializes a software distribution with it – DataStax. And they don’t even call themselves a Hadoop distribution. What do you think of all this? Please leave your comments.

Category: Apache, Apache YARN, big data, BigInsights, Cassandra, Cloudera, Hadoop, HBase, IBM, MapR, MapReduce, open source, OSS, Pig, YARN

Tags: Apache, big data, BigInsights, Cassandra, Cloudera, DataStax, Hadapt, Hadoop, HBase, HDFS, IBM, open source, OSS, Pig, YARN

Merv Adrian
Research VP
4 years with Gartner
37 years in IT industry

Merv Adrian is an analyst following database and adjacent technologies as extreme data transforms assumptions about what to persist as well as when, where and how. He also watches the way the software/hardware boundary…


Thoughts on What, Exactly, Is “Proprietary Hadoop”? Proposed: “distribution-specific.”


  1. Vijay says:

    I agree with you – terms matter. However, by now I have started interpreting “proprietary” along the lines of distribution-specific, and I am sure I am not alone. So I wonder whether the die has already been cast, and whether vendors will change their messaging now.

    The key issue here is how the companies that buy Hadoop protect themselves, given that each vendor does its own thing and this might force the customer into a port across distros. I think the answer is “apps” as the primary deployment model for big data, as opposed to “platforms.” Apps should ideally shield customers from the underlying technology.

  2. I won’t lay odds on whether the phrase “distribution-specific” will stick, but I think the point is valid all the same. Some customers care whether a component has multi-vendor or single-vendor support, and it’s fine and good to call this out.

    This should be distinct from “open” or “closed,” because pretty much every leading open source software product to date is what you’d call “distribution-specific” (e.g., MySQL, SpringSource, Talend, Pentaho, to name a few). As you point out, this is orthogonal to “Apache” and “non-Apache,” as there are numerous Apache projects that have single-vendor support or no vendor support. And there are non-Apache OSS projects that enjoy wide redistribution.

    Vijay brings up another good point: ultimately, whether a component is “distribution-specific” will matter less than whether it has strong support from applications. For example, Apache Pig is shipped by all the Hadoop distributions but has very little app/ISV support.

    • Merv Adrian says:

      Thanks for the comments. Charles, I’m not a betting man, either. I don’t expect to change the language used by the market, of course – I don’t claim that kind of clout for me or for Gartner. But in the interest of precision, I plan to avoid the use of “proprietary” where it’s not correct.

      Vijay, I think the move to apps, or solutions, is inevitable for a sizable part of the market. But the platform business continues to be a good one in the IT sector, and people will continue to buy tools to build things with. Portability in an open source platform will definitely be an issue as the proliferation of distribution-specific components continues. Interestingly, some already have adherents in the BYOH (bring your own Hadoop) space. Just as in the DBMS space, tool vendors will support the offerings their customers demand. And they will only be able to do as many as they have time and resources for. It’s an interesting secondary indicator of market adoption in early stage markets like this one.

  3. Tomer Shiran says:

    This is spot on. The term ‘proprietary’ represents ‘marketing FUD’ and I have seen it used primarily by vendors who have a harder time competing on parameters that matter to customers, such as product/technology advantages and support quality.

    In addition to looking at whether some innovation is “distribution-specific,” it’s important to consider whether that innovation introduces vendor lock-in. To the extent possible, Hadoop distribution providers should aim to deliver unique innovation *below* common/standard APIs. When that happens, the distribution provides more (i.e., unique) value to customers and at the same time does not introduce vendor lock-in.

    For example, at MapR we were able to innovate *below* the HDFS and HBase APIs, thus providing more value to customers while maintaining customers’ ability to easily migrate between MapR and other distributions (with no code changes or even recompilation). When we started the Apache Drill project to provide the next-generation SQL-in-Hadoop technology, it was clear that the only way to avoid vendor lock-in was to develop this technology as a community-driven, Apache project that could be used with any Hadoop distribution.

    • Merv Adrian says:

      Thanks, Tomer. Harkening back to the days of early SQL, Ted Codd was adamant that implementation was not the issue, as long as the standard was adhered to. We got a bit away from that over decades, but most commonly used SQL still works with most RDBMSs. However, “implementation-specific” features (in that case) have become a barrier vendors use to retain their base. Over time, I suspect the same will happen here.
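    A concrete way to see the “below the API” point raised above: a client program written only against the standard org.apache.hadoop.fs.FileSystem interface never names the underlying implementation, which is resolved from cluster configuration, so the same compiled code can run against HDFS or any API-compatible alternative. The sketch below is only an illustration of that idea (the ListRoot class name is made up, and this is not any vendor’s actual code):

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileStatus;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class ListRoot {
            public static void main(String[] args) throws Exception {
                // The concrete file system (HDFS, MapR-FS, or another
                // API-compatible implementation) is selected by the
                // cluster's configuration (fs.defaultFS), not by this code.
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);

                // List the root directory through the standard API.
                for (FileStatus status : fs.listStatus(new Path("/"))) {
                    System.out.println(status.getPath());
                }
                fs.close();
            }
        }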

  4. […] Exactly, Is “Proprietary Hadoop”? Proposed: “distribution-specific.” […]

  5. I’ve been lucky to be at JBoss, Red Hat, SpringSource, VMware, and now Hortonworks, so I can certainly appreciate the topic.

    In my mind, People, Ideas and Code are the 3 factors that make an open source project successful. Whether the project is vendor-driven or foundation-driven, those 3 factors are vital.

    With that said, there are two facets to “open source” that are also important to understand:
    1) the open source license, and
    2) the governance model (who participates, how, etc.).

    Linux, Apache, Eclipse, and OpenStack are successful foundations that have pretty clear licensing and governance models. It’s important to note that foundations, in and of themselves, do not guarantee project success (see my 3 factors above), but at Hortonworks we have a preference for foundation-driven projects (e.g., we do our work in the Apache and OpenStack communities). While there are certainly more rules to live by with foundation-driven versus vendor-driven projects, we feel foundations provide valuable stewardship and guide-rails for projects interested in encouraging the broadest community involvement possible (especially across many different vendors and end users).

    Vendor-driven projects can certainly attract vibrant involvement as well, but they typically have to refine their governance models over time if they are to attract involvement of potential competitors, for example. Look at the news related to IBM embracing Pivotal’s CloudFoundry for an example. IBM needed some governance-related enhancements before embracing CloudFoundry. Why? Because IBM knows they can’t be entirely beholden to their competitor when it comes to influencing code, roadmap, etc. in support of their enterprise customers.

    So there’s generally more “lock-in” when it comes to vendor-driven projects versus foundation-driven projects, because the controlling vendor, after all, gets final say.

    One final point re: “standards”:
    Having spent a ton of time in the Java EE world, I understand and appreciate “standard APIs” and multiple “implementations”.
    In the open source era, however, Code (aka the *actual* implementation) matters. Let me use an analogy to illustrate my point. Amazon Web Services is very successful and there’s a lot of discussion about competing cloud platforms supporting AWS-compatible APIs. I would like to think that nobody in their right mind would claim that their alternate implementation is “Amazon Web Services”. Implementation matters.

    • Merv Adrian says:

      Thanks, Shaun, for a typically well-reasoned and much appreciated contribution.
      I don’t see anything to disagree with in what you say here. My point was perhaps a little different, though – I was speaking specifically about the language I and others use to describe offerings when we discuss them. I seek precision about what Gartner clients tell me matters a great deal to them: whether and how they get support for what they are using.
      On that score, I think my suggestion helps clarify things.
      Some companies are willing to accept more lock-in, perceived or actual, rather than build critical systems out of parts that have no formal support associated with them. I want to describe their choices in a way that helps clarify that trade-off where possible, hence the notion of “distribution-specific.” The term makes it clear there is at least one place the buyer can be sure to get support – which is better than none at all, and less useful than multiple candidates.
      So, for example, I can get my MapReduce support from several providers. I can’t get Giraph support from anybody. Other pieces fall somewhere between those extremes, and I hope to make that clear when I describe them to clients.

  6. Hi Merv,

    The title of your article is: What, Exactly, Is “Proprietary Hadoop”? Proposed: “distribution-specific.”

    You’re blending two very different topics.

    If you are trying to clarify different types of open source projects versus traditional commercial software (i.e., tease out the meaning of “proprietary”), then I’ve provided some food for thought.

    If you are trying to clarify which open source projects have vendor support options behind them, then that’s clearly different, and I don’t see where the term “proprietary” plays a role. This topic is also likely a moving target, due to the fast-moving nature of open source projects and the timing of when some of them gain enough momentum to warrant inclusion and support in broader distributions or enterprise support alternatives.

  7. […] recently read Merv’s excellent post on proprietary vs. open-source Hadoop – suggesting that use of the term distribution-specific is […]



