Gartner Blog Network


Hadoop 2013 – Part Two: Projects

by Merv Adrian  |  February 21, 2013

In Part One of this series, I pointed out how much attention is being lavished on performance in 2013. In this installment, the topic is projects, which are proliferating precipitously. One of my most frequent client inquiries is “which of these pieces make Hadoop?” As recently as a year ago, the question was pretty simple for most people: MapReduce, HDFS, maybe Sqoop and even Flume, Hive, Pig, HBase, Lucene/Solr, Oozie, Zookeeper. When I published the Gartner piece How to Choose the Right Apache Hadoop Distribution, that was pretty much it.

Since then, more projects have matured. More have entered incubator status. And alternatives to Apache projects have gained more traction in distributions and in customer sites whose portfolio is more expansive. I’ve talked before about my ongoing stack model that attempts to sort this out – you may have seen it in an earlier blog post. I’ve updated it a little, and in this version, you can see that the “original core” projects are bolded. A few others are too, to be discussed in my planned Hadoop Tutorial presentation at the upcoming Gartner BI Summit, March 18-20 in Grapevine, Texas, where I’ll drill into the bolded ones in more detail.

Projects (and alternatives) for the Hadoop stack

 

In 2013, the list of projects, alternatives, and supporting technology to watch will change as commercial distributions continue to expand what they contain and support, as more and more use cases focus on issues like machine learning (Mahout) or text search and analytics (Lucene and Solr), and as new processing paradigms begin to compete with MapReduce under Apache Hadoop 2.0. Metadata will matter, so HCatalog will turn a lot of heads. Graph processing may begin to show up if Giraph gets some traction. And there’s more:

Apache Avro – the interest in data serialization is expanding with sensor and other machine-generated data. Just ask Splunk.
Apache Accumulo – a secure datastore built by guys from the NSA, investigated by the Senate? Of course you’re interested.
Apache Ambari – covered in the last post. An open source management platform.
Apache Bigtop – packaging and testing a collection of your own? This is for you.
Apache Blur (incubating) for search in cloud environments – Doug Cutting is a committer on this one.
Apache Cassandra – an alternative, distributed datastore that has won POCs against pure Hadoop in some use cases I’ve seen.
Apache Chukwa – data collection on your system, for monitoring.
Apache Crunch (incubating) – a “quicker to implement than MapReduce programming” choice, for building, testing and running pipelines.
Apache Drill (incubating) – one of several entrants in the “real-time analytics” sweepstakes – and there will be others.
Apache Giraph (incubating) for graph processing uses – one of the first examples of the changes YARN will enable.
Apache Hama for Bulk Synchronous Parallel computing in scientific computations.
Apache Kafka – a publish and subscribe system.
Apache Mahout – already being supported by several distributions – machine learning is a key new use.
Apache Whirr – a library for running services in the cloud (including a Hadoop cluster, of course).
Cascading – not really a project but a development platform, commercialized by Concurrent.
DataFu – also not an Apache project, but a collection of Pig UDFs developed at LinkedIn.
Dataguise DG for Hadoop – a security offering of great value in an insecure platform, which Hadoop certainly is today.
Hadapt – another “alternative datastore” contender, not open source, but offering a relational store right on your cluster.
HStreaming – one of several streaming options, alongside IBM’s InfoSphere Streams (included in its BigInsights distribution), Twitter’s Storm and the well-established SQLstream. We’ll see more interest in real-time streaming operational processing as a counterpoint to the interest in real-time analytics that will be another key development this year.
Rainstor – again, not open source, but highly compressed Hadoop sounds pretty appealing. Check it out.
VMware Serengeti – aimed at creating virtualized, highly available, multi-tenant Hadoop. Big possibilities for this one.

I haven’t gone into the various analytics plays here. That’s a post for another time, and it’s arguably a “layer above.” (Or in the case of my diagram, below.) There’s only so much you can fit into a reasonable post and it’s time to end this one. Next time: platforms.

 

 


Merv Adrian
Research VP
4 years with Gartner
37 years in IT industry

Merv Adrian is an analyst following database and adjacent technologies as extreme data transforms assumptions about what to persist as well as when, where and how. He also watches the way the software/hardware boundary…






Comments or opinions expressed on this blog are those of the individual contributors only, and do not necessarily represent the views of Gartner, Inc. or its management. Readers may copy and redistribute blog postings on other blogs, or otherwise for private, non-commercial or journalistic purposes, with attribution to Gartner. This content may not be used for any other purposes in any other formats or media. The content on this blog is provided on an "as-is" basis. Gartner shall not be liable for any damages whatsoever arising out of the content or use of this blog.