What, Exactly, Is “Proprietary Hadoop”? Proposed: “distribution-specific.”

by Merv Adrian  |  September 6, 2013  |  12 Comments

Many things have changed in the software industry in an era when the use of open source software has pervaded the mainstream IT shop. One of them is the significance – and descriptive adequacy – of the word “proprietary.” Merriam-Webster defines it as “something that is used, produced, or marketed under exclusive legal right of the inventor or maker.” In the Hadoop marketplace, it has come to be used – even by me, I must admit – to mean “not Apache, even though it’s open source.”

But Hadoop is not a thing; it’s a collection of things. Apache says Hadoop is only 4 things – HDFS, MapReduce, YARN, and Common. Distributions are variable collections that may contain non-Apache things, and they may be called “proprietary” even when all the things in them are open source – even marketed under Apache licenses – if they are not entirely Apache Projects.

To the most aggressive proponents of open source software, this use of “proprietary” equates to “impure,” even “evil.” And some marketing stresses the importance of “true openness” as an unmitigated good as opposed to the less-good “proprietary,” even when the described offering is in fact open source software.

Some vendors respond to market needs with distributions containing open-source-licensed components that come from a community process outside the Apache model, where a committee of people from other companies, each with its own agenda, might delay or seek to change the implementation, or even the functionality, of “their” software. Many Apache projects started in other communities – git, for example – and were only later proposed to Apache. Some call these distributions or components “proprietary,” hence less pure, less good, less worthy of consideration.

To the consumer – the enterprise that acquires its Hadoop from the only distributor offering a particular capability – it’s a dilemma, and a mixed bag. If they need the functionality, and no other distributor supports an Apache project with similar capabilities (if one even exists), while their distributor offers and supports what they need, the choice seems obvious. Apache, after all, does not support its projects. Vendors (distributors) do.

On the other hand, if and when Apache gets there and someone supports that Project, it will potentially be more difficult to move from one distribution to another. Organizations are increasingly wary of switching costs, even as they embrace the rapid innovation – and resultant disruption – of the open source revolution.

“Purity” is not the question; timing and availability are. Consumers need to buy what will work, and want to buy what will be supported. And the use of “proprietary” here is both inaccurate and not to the point.

So I’m proposing to use a different expression in my future discussions: “distribution-specific.” It can apply today to an Apache project if only a specific distributor includes it. And it will apply to vendor enhancements, even if API-compatible and open source, where the same is true.

I’d love to hear your thoughts – and your nominations for the list. Some are obvious: Cloudera Impala, MapR’s file system, IBM’s JAQL. Or Apache Cassandra, listed on the Apache Hadoop page linked above as a “Hadoop-related project” along with HBase, Pig and others. Only one company – DataStax – commercializes a software distribution that includes it, and DataStax doesn’t even call its offering a Hadoop distribution. What do you think of all this? Please leave your comments.


Category: Apache, Apache YARN, Big Data, BigInsights, Cassandra, Cloudera, Hadoop, HBase, IBM, MapR, MapReduce, open source, OSS, Pig, YARN

12 responses so far ↓

  • 1 What, Exactly, Is “Proprietary Hadoop”? Proposed: “distribution-specific.” | Merv Adrian's IT Market Strategy   September 6, 2013 at 4:43 am

    [...] —more– [...]

  • 2 Vijay   September 6, 2013 at 5:00 am

    I agree with you – terms matter. However, by now I have started interpreting “proprietary” along the lines of distribution-specific, and I am sure I am not alone. So I wonder whether the die has already been cast and whether vendors will change their messaging now.

    The key issue here is how the companies that buy Hadoop protect themselves, given that each vendor does its own thing, which might force the customer to port across distros. I think the answer is in “Apps” as the primary deployment model for big data, as opposed to “platforms”. Apps should ideally shield customers from the underlying technology.

  • 3 Charles Zedlewski   September 6, 2013 at 11:43 am

    I won’t lay odds on whether the phrase “distribution specific” will stick, but I think the point is valid all the same. Some customers care whether a component has multi-vendor or single-vendor support, and it’s fine and good to call this out.

    This should be distinct from “open” or “closed” because pretty much every leading open source software product heretofore is what you’d call “distribution specific” (e.g. MySQL, SpringSource, Talend, Pentaho, to name a few). As you point out, this is orthogonal to “Apache” and “non-Apache”, as there are numerous Apache projects that have single-vendor support or no vendor support at all. And there are non-Apache OSS projects that enjoy wide redistribution.

    Vijay brings up another good point, which is that, ultimately, which component is “distribution specific” will matter less than whether it has strong support from applications. For example, Apache Pig is shipped by all the Hadoop distributions but has very little app / ISV support.

  • 4 Merv Adrian   September 6, 2013 at 4:45 pm

    Thanks for the comments. Charles, I’m not a betting man, either. I don’t expect to change the language used by the market, of course – I don’t claim that kind of clout for me or for Gartner. But in the interest of precision, I plan to avoid the use of “proprietary” where it’s not correct.

    Vijay, I think the move to apps, or solutions, is inevitable for a sizable part of the market. But the platform business continues to be a good one in the IT sector, and people will continue to buy tools to build things with. Portability in an open-source platform will definitely be an issue as the proliferation of distribution-specific components continues. Interestingly, some already have adherents in the BYOH (bring your own Hadoop) space. Just as in the DBMS space, tool vendors will support the offerings their customers demand. And they will only be able to do as many as they have time and resources for. It’s an interesting secondary indicator of market adoption in early-stage markets like this one.

  • 5 Tomer Shiran   September 7, 2013 at 11:46 pm

    This is spot on. The term ‘proprietary’ represents ‘marketing FUD’ and I have seen it used primarily by vendors who have a harder time competing on parameters that matter to customers, such as product/technology advantages and support quality.

    In addition to looking at whether some innovation is “distribution specific”, it’s important to consider whether that innovation introduces vendor lock-in. To the extent possible, Hadoop distribution providers should aim to deliver unique innovation *below* common/standard APIs. When that happens, the distribution provides more (i.e., unique) value to customers and at the same time does not introduce vendor lock-in.

    For example, at MapR we were able to innovate *below* the HDFS and HBase APIs, thus providing more value to customers while maintaining customers’ ability to easily migrate between MapR and other distributions (with no code changes or even recompilation). When we started the Apache Drill project to provide the next-generation SQL-in-Hadoop technology, it was clear that the only way to avoid vendor lock-in was to develop this technology as a community-driven, Apache project that could be used with any Hadoop distribution.
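
    As a minimal, hypothetical illustration of the “below the API” idea: an application written purely against the standard org.apache.hadoop.fs.FileSystem API does not know or care which implementation sits underneath, so the same jar can run on stock HDFS or on a distribution-specific file system that honors the same interface. The input path and cluster configuration below are assumptions supplied at run time.

        // Hypothetical portability sketch: only the common FileSystem API is used,
        // so this code does not know or care which implementation sits underneath.
        import java.io.BufferedReader;
        import java.io.InputStreamReader;
        import java.nio.charset.StandardCharsets;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.fs.FileSystem;
        import org.apache.hadoop.fs.Path;

        public class PortableFsClient {
            public static void main(String[] args) throws Exception {
                // fs.defaultFS (hdfs://..., maprfs://..., etc.) comes from the
                // distribution's configuration files, not from this code.
                Configuration conf = new Configuration();
                FileSystem fs = FileSystem.get(conf);

                Path input = new Path(args[0]);
                try (BufferedReader reader = new BufferedReader(
                        new InputStreamReader(fs.open(input), StandardCharsets.UTF_8))) {
                    System.out.println(input + ": " + reader.lines().count() + " lines");
                }
            }
        }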

  • 6 Merv Adrian   September 8, 2013 at 4:47 pm

    Thanks, Tomer. Harkening back to the days of early SQL, Ted Codd was adamant that implementation was not the issue, as long as the standard was adhered to. We got a bit away from that over decades, but most commonly used SQL still works with most RDBMSs. However, “implementation-specific” features (in that case) have become a barrier vendors use to retain their base. Over time, I suspect the same will happen here.
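
    As a minimal, hypothetical JDBC illustration of that SQL point: the first query below sticks to standard SQL and should run unchanged on most RDBMSs, while the second leans on an implementation-specific feature (PostgreSQL’s DISTINCT ON) and so ties the application to one engine. The connection URL and the orders table are assumptions.

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;

        public class SqlPortabilitySketch {
            // Portable: standard SQL, behaves the same on most engines.
            static final String PORTABLE =
                "SELECT customer_id, COUNT(*) AS order_count " +
                "FROM orders GROUP BY customer_id";

            // Implementation-specific: DISTINCT ON is PostgreSQL-only syntax,
            // so code relying on it is harder to move to another RDBMS.
            static final String ENGINE_SPECIFIC =
                "SELECT DISTINCT ON (customer_id) customer_id, order_date " +
                "FROM orders ORDER BY customer_id, order_date DESC";

            public static void main(String[] args) throws Exception {
                try (Connection conn = DriverManager.getConnection(args[0]);
                     Statement stmt = conn.createStatement();
                     ResultSet rs = stmt.executeQuery(PORTABLE)) {
                    while (rs.next()) {
                        System.out.println(rs.getString(1) + " -> " + rs.getLong(2));
                    }
                }
            }
        }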

  • 7 Apache Flume: Distributed Log Collection for Hadoop · WWW.INFOWEBHUB.NET   September 9, 2013 at 2:14 am

    [...] Exactly, Is “Proprietary Hadoop”? Proposed: “distribution-specific.” [...]

  • 8 Shaun Connolly   September 9, 2013 at 4:32 pm

    I’ve been lucky to be at JBoss, Red Hat, SpringSource, VMware, and now Hortonworks, so I can certainly appreciate the topic.

    In my mind, People, Ideas and Code are the 3 factors that make an open source project successful. Whether the project is vendor-driven or foundation-driven, those 3 factors are vital.

    With that said, there are two facets to “open source” that are also important to understand:
    1) the open source license, and
    2) the governance model (who participates, how, etc.).

    Linux, Apache, Eclipse, and OpenStack are successful foundations that have pretty clear licensing and governance models. It’s important to note that foundations, in and of themselves, do not guarantee project success – see my 3 factors above – but at Hortonworks we have a preference for foundation-driven projects (e.g., we do our work in the Apache and OpenStack communities). While there are certainly more rules to live by with foundation-driven versus vendor-driven projects, we feel foundations provide valuable stewardship and guide-rails for projects interested in encouraging the broadest possible community involvement (especially across many different vendors and end users).

    Vendor-driven projects can certainly attract vibrant involvement as well, but they typically have to refine their governance models over time if they are to attract involvement of potential competitors, for example. Look at the news related to IBM embracing Pivotal’s CloudFoundry for an example. IBM needed some governance-related enhancements before embracing CloudFoundry. Why? Because IBM knows they can’t be entirely beholden to their competitor when it comes to influencing code, roadmap, etc. in support of their enterprise customers.

    So there’s generally more “lock-in” when it comes to vendor-driven projects versus foundation-driven projects, because the controlling vendor, after all, gets final say.

    One final point re: “standards”:
    Having spent a ton of time in the Java EE world, I understand and appreciate “standard APIs” and multiple “implementations”.
    In the open source era, however, Code (aka the *actual* implementation) matters. Let me use an analogy to illustrate my point. Amazon Web Services is very successful and there’s a lot of discussion about competing cloud platforms supporting AWS-compatible APIs. I would like to think that nobody in their right mind would claim that their alternate implementation is “Amazon Web Services”. Implementation matters.

  • 9 Merv Adrian   September 9, 2013 at 5:56 pm

    Thanks, Shaun, for a typically well-reasoned and much appreciated contribution.
    I don’t see anything to disagree with in what you say here. My point was perhaps a little different, though – I was speaking specifically about the language I and others use to describe offerings when we discuss them. I seek precision about what Gartner clients tell me matters a great deal to them: whether and how they get support for what they are using.
    On that score, I think my suggestion helps clarify things.
    Some companies are willing to accept more lock-in, perceived or actual, instead of building critical systems out of parts that have no formal support associated with them. I want to describe their choices in a way that helps clarify that if possible, hence the notion “distribution-specific.” The term makes it clear there is at least one place the buyer can be sure to get support – which is better than none at all, and less useful than multiple candidates.
    So, for example, I can get my MapReduce support from several providers. I can’t get Giraph support from anybody. Other pieces fall somewhere between those extremes. I hope to clarify that when I describe those pieces to clients.

  • 10 Shaun Connolly   September 9, 2013 at 8:44 pm

    Hi Merv,

    The title of your article is: What, Exactly, Is “Proprietary Hadoop”? Proposed: “distribution-specific.”

    You’re blending two very different topics.

    If you are trying to clarify different types of open source projects versus traditional commercial software (i.e. tease out meaning of proprietary), then I’ve provided some food for thought.

    If you are trying to clarify which open source projects have vendor support options behind them, then that’s clearly different, and I don’t see where the term “proprietary” plays a role. This topic is also likely a moving target due to the fast-moving nature of open source projects and the timing of when some of those projects gain enough momentum to warrant inclusion and support in broader distributions or enterprise support alternatives.

  • 11 sudheer   September 14, 2013 at 10:48 am

    The information was very useful for Hadoop online training learners. Thank you for sharing valuable information; it is very useful for us, and we also provide Hadoop online training.

  • 12 Big Data products/projects types – from propriety to industry standard | Big Data, Small Font   September 17, 2013 at 1:35 pm

    [...] recently read Merv’s excellent post on propriety vs open-source Hadoop – suggesting that use the term distribution-specific is [...]