
Hadoop 2013 – Part Two: Projects

By Merv Adrian | February 21, 2013


In Part One of this series, I pointed out that significant attention is being lavished on performance in 2013. In this installment, the topic is projects, which are proliferating precipitously. One of my most frequent client inquiries is “which of these pieces make Hadoop?” As recently as a year ago, the question was pretty simple for most people: MapReduce, HDFS, maybe Sqoop and even Flume, Hive, Pig, HBase, Lucene/Solr, Oozie, Zookeeper. When I published the Gartner piece How to Choose the Right Apache Hadoop Distribution, that was pretty much it.

Since then, more projects have matured. More have entered incubator status. And alternatives to Apache projects have gained more traction in distributions and in customer sites whose portfolio is more expansive. I’ve talked before about my ongoing stack model that attempts to sort this out – you may have seen it in an earlier blog post. I’ve updated it a little, and in this version, you can see that the “original core” projects are bolded. A few others are too, to be discussed in my planned Hadoop Tutorial presentation at the upcoming Gartner BI Summit, March 18-20 in Grapevine, Texas, where I’ll drill into the bolded ones in more detail.

Projects (and alternatives) for the Hadoop stack


In 2013, the list of projects, alternatives, and supporting technology to watch will change as commercial distributions continue to expand what they contain and support, as more and more use cases focus on issues like machine learning (Mahout) or text search and analytics (Lucene and Solr), and as new processing paradigms begin to compete with MapReduce under Apache 2.0. Metadata will matter, so HCatalog will turn a lot of heads. Graph processing may begin to show up if Giraph gets some traction. And there’s more:

Apache Avro – the interest in data serialization is expanding with sensor and other machine-generated data. Just ask Splunk.
Apache Accumulo – a secure datastore built by guys from the NSA, investigated by the Senate? Of course you’re interested.
Apache Ambari – covered in the last post. An open source management platform.
Apache Bigtop – packaging and testing a collection of your own? This is for you.
Apache Blur (incubating) for search in cloud environments – Doug Cutting is a committer on this one.
Apache Cassandra – an alternative, distributed datastore that has won POCs against pure Hadoop in some use cases I’ve seen.
Apache Chukwa – data collection on your system, for monitoring.
Apache Crunch (incubating) – a “quicker to implement than MapReduce programming” choice, for building, testing and running pipelines.
Apache Drill (incubating) – one of several entrants in the “real-time analytics” sweepstakes – and there will be others.
Apache Giraph (incubating) for graph processing uses – one of the first examples of the changes YARN will enable.
Apache Hama for Bulk Synchronous Parallel computing in scientific computations.
Apache Kafka – a publish and subscribe system.
Apache Mahout – already being supported by several distributions – machine learning is a key new use.
Apache Whirr – a library for running services in the cloud (including a Hadoop cluster, of course.)
Cascading – not really a project but a development platform, commercialized by Concurrent.
DataFu – also not an Apache project, but a collection of Pig UDFs developed at LinkedIn.
Dataguise DG for Hadoop – a security offering of great value in an insecure platform, which Hadoop certainly is today.
Hadapt – another “alternative datastore” contender, not open source, but offering a relational store right on your cluster.
HStreaming – along with IBM’s inclusion of InfoSphere Streams in its BigInsights distribution, Twitter’s Storm and the well established SQLstream, we’ll see more interest in realtime streaming operational processing as a counterpoint to the interest in realtime analytics that will be another key development this year.
Rainstor – again, not open source, but highly compressed Hadoop sounds pretty appealing. Check it out.
VMware Serengeti – aimed at creating virtualized, highly available, multi-tenant Hadoop. Big possibilities for this one.

I haven’t gone into the various analytics plays here. That’s a post for another time, and it’s arguably a “layer above.” (Or in the case of my diagram, below.) There’s only so much you can fit into a reasonable post and it’s time to end this one. Next time: platforms.


