The Apache Software Foundation has succeeded admirably in becoming a place where new software ideas are developed: today there are over 350 projects underway. The challenges for the Hadoop user are twofold: deciding which projects might be useful in big data-related cases, and determining which are supported by commercial distributors. In Now, What is Hadoop? And What’s Supported? I list 10 projects in the open source community (though not all Apache projects) supported by only one distributor: Atlas, Calcite, Crunch, Drill, Falcon, Kite, LLAMA, Lucene, Phoenix and Presto. Let’s look at each of them a little more closely.
- Atlas is a governance framework project being developed by a Hortonworks-led consortium of users and vendors – Aetna, JPMC, Merck, SAS, Schlumberger, Target and others – targeting metadata (for classification), data audit, lifecycle management, search and lineage, with a security and policy engine building on Ranger (which is supported by two distributors: Hortonworks and Pivotal). It’s designed to be installed and managed by Ambari (which is supported by three: the ODP members Hortonworks, IBM and Pivotal), but manual installation is also possible. Atlas is at a very early stage now, and will compete with Cloudera’s Navigator (first announced in 2013, and available as a priced option) when it begins to ship. Today only Hortonworks lists it as supported – though there is really not much there to use yet.
- Calcite, formerly Optiq, is also driven by Hortonworks, which brought in its creator, Julian Hyde, to support his work. Calcite evaluates SQL and builds optimized, efficient plans, essentially deconstructing the traditional RDBMS model, and providing support for multiple back ends including Hive-on-Tez, Drill (both base their optimizers on it), MongoDB, Splunk, Spark, and JDBC data sources. This is a dramatic upending of current architectures, providing an implementation of relational algebra with transformation rules, a cost model, and metadata that other projects can send work to. If it accomplishes its goals, it’s likely to be supported by several distributors.
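To make the "relational algebra with transformation rules and a cost model" idea concrete, here is a toy sketch of the pattern in Python. This is not Calcite's API – all names here are hypothetical – it just shows how a rule can rewrite a plan tree when the rewrite lowers estimated cost, which is the kind of work Calcite does for the engines built on it.

```python
# Toy optimizer sketch (hypothetical names, NOT Calcite's actual API):
# a plan is a tree of relational operators, each with an estimated cost,
# and a rule rewrites one tree shape into a cheaper equivalent.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Scan:
    table: str
    filter: Optional[str] = None       # predicate pushed into the scan
    def cost(self):
        return 100 if self.filter is None else 10   # filtered scans read less

@dataclass
class Filter:
    predicate: str
    child: Scan
    def cost(self):
        return self.child.cost() + 50  # filtering after a full scan is costly

def push_filter_into_scan(plan):
    """Transformation rule: Filter(Scan) -> Scan with the predicate pushed down."""
    if isinstance(plan, Filter) and plan.child.filter is None:
        return Scan(plan.child.table, filter=plan.predicate)
    return plan

plan = Filter("age > 30", Scan("users"))
optimized = push_filter_into_scan(plan)
# optimized.cost() < plan.cost(), so an optimizer would keep the rewrite
```

A real optimizer applies many such rules repeatedly, using the cost model to choose among the equivalent plans it generates.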
- Crunch, supported by Cloudera, is a framework for writing, testing, and running MapReduce pipelines – including UDFs – as an alternative to Pig, and it also supports Spark. It’s considered good for tasks such as joins and aggregations over data types that are “not very relational” – HBase, time series, and serialized object formats like Avro. It has an API for Scala. I haven’t seen any evidence that it’s a significant differentiator driving distributor selection. But everyone has their own approach to this, and convergence (and adoption by others) does not appear to be on the horizon.
- Drill, based on Google’s Dremel (also a basis for Google BigQuery), is MapR’s entry in the SQL-on-Hadoop contest. It provides a SQL interface with interactive analysis across many storage systems – Amazon S3, Azure Blob Storage, MapR-FS, HDFS, Google Cloud Storage, Swift, NAS and local files – and many formats and stores, including Parquet, Avro, JSON, XML, HBase, MapR-DB and MongoDB. Drill uses a “shredded, in-memory, columnar data representation,” vectorization and pipelining, and its data integration capabilities differentiate it from some of its competitors. MapR is pushing the notion of “schema on read” with Drill – some observers (this one included) prefer to think of this as “SQL on first read,” because anything useful is likely to be persisted and its schema saved. SQL is one of the major competitive battlegrounds – although MapR, for example, also supports Cloudera’s Impala, don’t expect to see any of the others support multiple engines anytime soon.
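What "schema on read" means in practice: the fields of each record are discovered as it is read, rather than being declared up front in a catalog. This toy sketch (plain Python, not Drill) shows a query over JSON lines whose records don't even share the same fields.

```python
# Toy "schema on read" sketch (not Drill): the schema emerges per record
# at query time, so heterogeneous records can still be queried together.
import json

raw_lines = [
    '{"name": "ada", "age": 36}',
    '{"name": "alan", "city": "London"}',   # different fields per record
]

def query(lines, select, where=lambda rec: True):
    for line in lines:
        record = json.loads(line)           # schema discovered as we read
        if where(record):
            # missing columns simply come back as None
            yield {col: record.get(col) for col in select}

rows = list(query(raw_lines, select=["name", "age"]))
# rows == [{"name": "ada", "age": 36}, {"name": "alan", "age": None}]
```

The "SQL on first read" quip above follows directly: once you've paid to discover the schema, you'll usually want to persist it.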
- Falcon, a Hortonworks-driven project which started at InMobi, will operate in concert with Atlas – the two share many committers. Falcon is described as a “feed management and data processing platform” designed for both management and governance: data lifecycle management, classification, audit, lineage, scheduling, data motion (workflow including staged replication), coordination of data pipelines, data discovery, and process orchestration (late data handling, retries, etc.). This rather expansive set of XML-based features includes multi-cluster management to support local/global aggregations and rollups. The place where Falcon leaves off and Atlas begins will clearly be of interest as the two evolve, depending perhaps in part on who else besides Hortonworks decides to support it (at InMobi it ran on Cloudera in its early days). Governance will be a major competitive battleground among the distributions, so expect to wait a while to see broad adoption.
- Kite (not an Apache project) is a software development kit driven by Cloudera: libraries, references, tutorials, and code samples that can be used with Pig, Hive, Impala, Spark, MapReduce, Oozie, etc. Kite APIs are positioned as “the next level up the stack,” compatible with HDFS and HBase. Cloudera tells me that simplifying the experience of serializing and partitioning data in Avro and Parquet is one usage example. Again, adoption by other distributions doesn’t appear likely anytime soon.
- LLAMA, for Low Latency Application MAster, is a Cloudera project (also not an Apache project) designed originally to enable Impala to reserve, use and release YARN resource allocations without requiring Impala to use YARN-managed container processes. No other distributor supports it.
- Lucene is a very odd member of this group. It has a long history, its own commercializing firm in LucidWorks, and seems to be involved in every distribution’s search capabilities. But only IBM actually supports Lucene directly, in its BigInsights distribution, although Cloudera, Hortonworks and MapR all support its companion search project Solr. The two appear to get broad use despite the lack of commercial support – several distributors offer search add-ons, typically for added cost. Even Hortonworks charges extra for search support.
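For readers who haven't used Lucene: its core data structure is the inverted index, a map from each term to the documents containing it. A miniature version fits in a dozen lines of Python (illustrative only – Lucene's real index adds analysis, scoring, and on-disk segment files).

```python
# A miniature inverted index, the core idea behind Lucene (illustrative
# only; Lucene adds text analysis, relevance scoring, segments, etc.).
from collections import defaultdict

docs = {
    1: "hadoop stores big data",
    2: "lucene indexes text data",
}

index = defaultdict(set)
for doc_id, text in docs.items():
    for term in text.split():          # a trivial stand-in for an analyzer
        index[term].add(doc_id)

def search(term):
    """Return the ids of documents containing the term."""
    return sorted(index.get(term, set()))

# search("data") -> [1, 2]; search("lucene") -> [2]
```

Solr (and Elasticsearch) wrap this structure in a server, which is why the distributors' search add-ons all lead back to Lucene.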
- Phoenix, supported only by Hortonworks, became a top-level project in August 2014. Phoenix delivers a SQL-over-HBase layer, compiling queries into a series of HBase scans and leveraging coprocessors and custom filters. Versioned table metadata is itself stored in an HBase table, so queries over prior versions automatically use the correct schema. Positioned as a “data warehouse for HBase,” it’s one to watch because SQL over a DBMS has much more promise than SQL over a file system, which the other “SQL on Hadoop” interfaces tend to offer unless they have their own DBMS – which creates a non-Apache layer some users prefer to avoid.
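The "compiling queries into scans" step is the interesting part. The sketch below (toy Python, hypothetical names – Phoenix's real planner also pushes work into coprocessors and filters) shows the essence: a predicate on the row key becomes a bounded key-range scan instead of a full-table pass.

```python
# Toy sketch of SQL-to-scan compilation in the spirit of Phoenix
# (hypothetical names; the real planner is far more sophisticated).

# An HBase-style table: rows are sorted by row key.
table = {
    "other#9":  {"name": "x"},
    "user#001": {"name": "ada"},
    "user#002": {"name": "alan"},
}

def compile_prefix_scan(prefix):
    """Turn 'WHERE pk LIKE "user#%"' into a (start, stop) key range."""
    return prefix, prefix + "\xff"     # "\xff" bounds the prefix range

def scan(table, start, stop):
    """A range scan touches only keys in [start, stop), not the whole table."""
    return {k: v for k, v in sorted(table.items()) if start <= k < stop}

start, stop = compile_prefix_scan("user#")
rows = scan(table, start, stop)
# rows contains only the "user#..." keys; "other#9" is never read
```

Because HBase keeps rows sorted by key, the range scan skips everything outside the bounds – that locality is what gives "SQL over a DBMS" its edge over SQL over flat files.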
- Presto (not an Apache project) is a distributed SQL query engine developed at Facebook for running interactive analytic queries against data sources including Hive, Cassandra, relational databases and proprietary data stores. Presto is supported only by Amazon among Hadoop distributors, though it recently received a major commitment of resources from Teradata (which will offer its own support). Teradata has made it clear that Presto will remain outside Apache for a while, so broader support is unlikely anytime soon.
These projects all have value, and promise. If you wish to use them, you’ll be depending on the kindness of strangers (apologies to Blanche DuBois) unless you pay for their distributor’s support. Caveat coder.