Many things have changed in the software industry in an era when the use of open source software has pervaded the mainstream IT shop. One of them is the significance – and descriptive adequacy – of the word “proprietary.” Merriam-Webster defines it as “something that is used, produced, or marketed under exclusive legal right of the inventor or maker.” In the Hadoop marketplace, it has come to be used – even by me, I must admit – to mean “not Apache, even though it’s open source.”
But Hadoop is not a thing, it’s a collection of things. Apache says Hadoop is only 4 things – HDFS, MapReduce, YARN, and Common. Distributions are variable collections that may contain non-Apache things, and may be called “proprietary” even when everything in them is open source – even released under Apache licenses – simply because not every component is an Apache Project.
To the most aggressive proponents of open source software, this use of “proprietary” equates to “impure,” even “evil.” And some marketing stresses the importance of “true openness” as an unmitigated good as opposed to the less-good “proprietary,” even when the described offering is in fact open source software.
Some vendors respond to market needs with distributions containing open source-licensed components that come from a community process outside the Apache model – where a committee of people from other companies with their own agendas might delay, or seek to change, the implementation or even functionality of “their” software. Many Apache projects started in other communities, like git, and were proposed to Apache later. Some call these distributions or components “proprietary,” hence less pure, less good, less worthy of consideration.
To the consumer – the enterprise that acquires its Hadoop from a distributor that is the only provider offering a particular capability – it’s a dilemma, and a mixed bag. If the enterprise needs the functionality, and no other distributor supports an Apache project with similar functionality (if one even exists), but this distributor offers and supports it, it seems obvious to go there. Apache, after all, does not support its projects. Vendors (distributors) do.
On the other hand, if and when Apache gets there and someone supports that Project, it will potentially be more difficult to move from one distribution to another. Organizations are increasingly wary of switching costs, even as they embrace the rapid innovation – and resultant disruption – of the open source revolution.
“Purity” is not the question; timing and availability are. Consumers need to buy what will work, and want to buy what will be supported. And the use of “proprietary” here is both inaccurate and not to the point.
So I’m proposing to use a different expression in my future discussions: “distribution-specific.” It can apply today to an Apache project if only a specific distributor includes it. And it will apply to vendor enhancements, even if API-compatible and open source, where the same is true.
I’d love to hear your thoughts – and your nominations for the list. Some are obvious: Cloudera Impala, MapR’s file system, IBM’s JAQL. Or Apache Cassandra, listed on the Apache Hadoop page linked above as a “Hadoop-related project” along with HBase, Pig and others. Only one company commercializes a software distribution with it – DataStax. And they don’t even call themselves a Hadoop distribution. What do you think of all this? Please leave your comments.