In early January 2012, the world of big data was treated to an interesting series of product releases, press announcements, and blog posts about Hadoop versions. To begin with, we had the announcement, in a press release, of Apache Hadoop version 1.0 at long last. Although there were grumblings here and there in the twittersphere that changes to release numbers are meaningless, my discussions with Gartner’s enterprise customers indicate otherwise. Products with release numbers like 0.20.2 make the hair on Procurement’s neck stand on end, and as Hadoop begins to get mainstream attention (Gartner clients, see Hype Cycle for Data Management 2011), IT architects and executives find such optics quite important. Hadoop is moving beyond pioneers like Amazon, Yahoo! and LinkedIn into shops like JP Morgan Chase, and they pay attention to such things.
So what, after 6 years of steady work on earlier versions, makes this one worthy of a 1.0 designation? As Cloudera noted in a blog post, “There has been an 18 month period where there has been no one Apache release that had all the committed features of Apache Hadoop.”
The last release that did have them all was version 0.20.2, from 2010. In that time, every major DBMS vendor has put a strategy in place for side-by-side operation with Hadoop, sometimes with direct execution and calls from “inside” (Teradata’s Aster and EMC’s Greenplum pioneered that). So you might think “1.0” is just labeling to provide a veneer of “enterprise-ness” for customers of those products. But give the tireless gang of committers some credit; in this release they have incorporated features from a couple of branches (what’s that mean, you ask? read on) and added strong authentication (Kerberos), a REST API for HDFS, and, for those who prefer the broader set of use cases enabled by HBase, several significant additions to improve working with it. Unfortunately, as useful as a 1.0 label is, it hardly eliminates the confusion, as we’ll see below.
Trunks and Branches
For those who don’t track Apache release matters, the Apache Software Foundation supports Projects, which are created by volunteers accepted into the organization on the basis of the code they write – they gain privileges as Committers who can put code into the main codeline, known as “trunk.” MapReduce is a project; so is HDFS. As new code is developed, the move from one numbered version to another is initiated by the creation of a branch, which is tested until the project members approve its readiness by vote. Branches may move “into trunk,” but other branches may also be started in the interim. And they may have different features.
Charles Zedlewski of Cloudera has described this in more detail in the excellent post linked above. In it he notes that release 0.20.2 in 2010 was the last release that had “all the usable features committed to Apache Hadoop.” Following the 0.20.2 release, work continued on that branch, resulting in 0.20.203 in 2011, which added security but not RAID support or append (which ensures that HBase, a separate project, doesn’t lose data). It was followed by 0.20.205, which did add append although still not RAID. It is 0.20.205 that became 1.0. Seems straightforward, right?
Unfortunately, it’s not. Branches 0.21, 0.22 and 0.23 were all introduced in that 18-month period, the last of them at the end of 2011. Version 0.23, notes Cloudera, has “all of the features of any past release.” This includes a fix for the name node single point of failure issue and other HA capabilities, both of which matter a great deal to enterprise users. Good news? Well, Release 1.0, as noted, is based on 0.20.205, not on 0.23, so it does not have these.
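To make the branch picture concrete, here is a minimal Python sketch that records the feature sets as this post describes them and reports what a given release lacks. The feature labels (`security`, `append`, `raid`, `namenode_ha`) and the function name `missing_features` are illustrative assumptions, not Apache terminology, and the table reflects only the claims made above, not the official release notes.

```python
# Feature sets per release line, as described in this post.
# Labels are illustrative shorthand, not official Apache feature names.
RELEASE_FEATURES = {
    "0.20.203": {"security"},
    # 0.20.205 carries security forward and adds append support for HBase.
    "0.20.205": {"security", "append"},
    # 1.0 is the renamed 0.20.205 line, so it has the same set here.
    "1.0": {"security", "append"},
    # Per Cloudera's claim: every feature of any past release, plus HA work.
    "0.23": {"security", "append", "raid", "namenode_ha"},
}


def missing_features(release, required):
    """Return the required features that the given release lacks, sorted."""
    available = RELEASE_FEATURES[release]
    return sorted(set(required) - available)


if __name__ == "__main__":
    wanted = {"security", "append", "namenode_ha"}
    for release in ("1.0", "0.23"):
        print(release, "is missing:", missing_features(release, wanted))
```

An enterprise that needs name node high availability would see it flagged as missing from 1.0 but not from 0.23, which is exactly the gap the higher release number hides.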
Distributions – Solving, Or Muddying the Waters?
So, does using a distribution help sort all this out? Consider: Cloudera’s CDH3 distribution was issued before 1.0. But Cloudera distributions get updates designated with a U, and not only updates to Hadoop; remember there are other projects in there too. So CDH3U0 (yes, they use zeros) uses HBase 0.90.0 whereas CDH3U2 uses HBase 0.90.4; CDH3U2 also added Mahout and expanded support for Avro’s file format.
That discussion reinforces the important point that “Hadoop” means more than MapReduce, HDFS and job/task management to people considering it: solutions (and hence, distributions) typically involve several other projects. HBase, for example, seems to appear in at least half of them – a study by Dave Menninger of Ventana Research last year found 61% of respondents including it in deployments. Other parts of the stack like Pig, Hive, and Sqoop are also found in many if not most initiatives I’ve had contact with. The complexity of keeping all their versions straight as new code is contributed is a key reason to use distributions, which track and integrate a dozen or more projects.
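The bookkeeping a distribution does can be sketched as a comparison of component manifests. The Python below is a minimal illustration using only the two data points given above (the HBase versions in CDH3U0 and CDH3U2, and the addition of Mahout); the manifest layout, the `UNSPECIFIED` marker and the function name `diff_manifests` are assumptions for illustration, not anything Cloudera publishes in this form.

```python
# Marker for a component this post names without giving its version.
UNSPECIFIED = "unspecified"

# Partial manifests: only the components mentioned in this post.
CDH3U0 = {"hbase": "0.90.0"}
CDH3U2 = {"hbase": "0.90.4", "mahout": UNSPECIFIED}


def diff_manifests(old, new):
    """Map each changed component to an (old_version, new_version) pair.

    A version of None means the component is absent from that manifest.
    """
    changes = {}
    for name in sorted(set(old) | set(new)):
        before, after = old.get(name), new.get(name)
        if before != after:
            changes[name] = (before, after)
    return changes


if __name__ == "__main__":
    for name, (before, after) in diff_manifests(CDH3U0, CDH3U2).items():
        print(f"{name}: {before} -> {after}")
```

Multiply this by a dozen or more projects and several updates per distribution, and the value of a published version chart becomes clear.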
How about Hortonworks, the other specialist with a large number of committers? They have announced a “public preview of the Hortonworks Data Platform (HDP) version 2.” That will be based on 0.23 – all those features, plus HCatalog, which Cloudera does not include in CDH (yet). There is also a “private technology preview” of HDP version 1; “a public technology preview will be made available later this quarter.” What do these preview terms mean? Hortonworks explains:
The Technology Preview Program begins with a Limited Preview phase that enables us to engage a manageable number of representative customers, partners, and community users on focused, hands-on testing and proof of concept deployments of the Hortonworks Data Platform…Public Preview …. opens the process to anyone interested in working closely with Hortonworks…culminates in the final release of the software and General Availability.
Other distributions have their own mix of projects, reflecting their own point of view. It can be hard to find out which versions of the various projects are supported in each one. Cloudera offers a chart of the version numbers of its included projects; neither Hortonworks’ nor MapR’s website, for example, shows comparable detail, either for the core projects or for the varied additional ones they both add. And other distributions, from IBM, DataStax, NetApp and others, are in the hunt, each with its own profile. For now, the continuing confusion and multiple conflicting “Hadoops” only serve to reinforce IT’s concerns about Hadoop’s readiness for a robust, governed environment, unless one distribution, from one trusted provider, is chosen.
In upcoming Gartner research I’ll talk about the distributions, how they vary, and how to track and choose among them.