Gartner Blog Network

Hadoop Projects Supported By Only One Distribution

by Merv Adrian  |  August 9, 2015  |  6 Comments

The Apache Software Foundation has succeeded admirably in becoming a place where new software ideas are developed: today there are over 350 projects underway. The challenges for the Hadoop user are twofold: deciding which projects might be useful in big data-related cases, and determining which are supported by commercial distributors. In Now, What is Hadoop? And What’s Supported? I list 10 projects in the open source community (though not all are Apache projects) that are supported by only one distributor: Atlas, Calcite, Crunch, Drill, Falcon, Kite, LLAMA, Lucene, Phoenix and Presto. Let’s look at them a little more closely.

  • Atlas is a governance framework project being developed by a Hortonworks-led consortium of users and vendors – Aetna, JPMC, Merck, SAS, Schlumberger, Target and others – targeting metadata (for classification), data audit, lifecycle management, search and lineage, with a security and policy engine building on Ranger (which is supported by two distributors: Hortonworks and Pivotal). It’s designed to be installed and managed by Ambari (which is supported by three: the ODP members Hortonworks, IBM and Pivotal), but manual installation is also possible. Atlas is at a very early stage now, and will compete with Cloudera’s Navigator (first announced in 2013, and available as a priced option) when it begins to ship. Today only Hortonworks lists it as supported – though there is really not much there to use yet.
  • Calcite, formerly Optiq, is also driven by Hortonworks, who brought in its creator, Julian Hyde, to support his work. Calcite evaluates SQL and builds optimized, efficient plans, essentially deconstructing the traditional RDBMS model, and providing support for multiple back ends including Hive-on-Tez, Drill (both base their optimizers on it), MongoDB, Splunk, Spark, and JDBC data sources. This is a dramatic upending of current architectures, providing an implementation of relational algebra with transformation rules, a cost model, and metadata, that other projects can send work to. If it accomplishes its goals, it’s likely to be supported by several distributors.
  • Crunch, supported by Cloudera, is a framework for writing, testing, and running MapReduce pipelines – including UDFs – as an alternative to Pig, and supporting Spark. It’s considered good for tasks such as joining and aggregation over data types that are “not very relational” such as HBase, time series, and serialized object formats like Avro. It has an API for Scala. I haven’t seen any evidence that it’s a significant differentiator driving distributor selection. But everyone has their own approach to this, and convergence (and adoption by others) does not appear to be on the horizon.
  • Drill, based on Google’s Dremel (also a basis for Google BigQuery), is MapR’s entry in the SQL-on-Hadoop contest. It provides a SQL interface for interactive analysis across many data sources and formats – Amazon S3, Azure Blob Storage, MapR-FS, NAS and local files, as well as other Hadoop and non-Hadoop sources and formats including Parquet, Avro, JSON, XML, HBase and MapR-DB, HDFS, MongoDB, Google Cloud Storage, and Swift. Drill uses a “shredded, in-memory, columnar data representation,” vectorization and pipelining, and its data integration capabilities differentiate it from some of its competitors. MapR is pushing the notion of “schema on read” with Drill – some observers (this one included) prefer to think of this as “SQL on first read,” because anything useful is likely to be persisted and its schema saved. SQL is one of the major competitive battlegrounds – although MapR, for example, also supports Cloudera’s Impala, don’t expect to see any of the others support multiple SQL engines anytime soon.
  • Falcon, a Hortonworks-driven project which started at InMobi, will operate in concert with Atlas – the two share many committers. Falcon is described as a “feed management and data processing platform” designed for both management and governance: data lifecycle management, classification, audit, lineage, scheduling, data motion (workflow including staged replication), coordination of data pipelines, data discovery, and process orchestration (late data handling, retries, etc.). This rather expansive set of XML-based features includes multi-cluster management to support local/global aggregations and rollups. The place where Falcon leaves off and Atlas begins will clearly be of interest as the two evolve, depending perhaps in part on who else besides Hortonworks decides to support it (at InMobi it ran over Cloudera in its early days). Governance will be a major competitive battleground among the distributions, so expect to wait a while to see broad adoption.
  • Kite (not an Apache project) is a software development kit driven by Cloudera: libraries, references, tutorials, and code samples that can be used by Pig, Hive, Impala, Spark, MapReduce, Oozie, etc. Kite APIs are positioned as “the next level up the stack,” compatible with HDFS and HBase. Cloudera tells me that simplifying the experience of serializing and partitioning data in Avro and Parquet is one usage example. Again, adoption by other distributions doesn’t appear likely anytime soon.
  • LLAMA, for Low Latency Application MAster, is a Cloudera project (also not an Apache project) designed originally to enable Impala to reserve, use and release YARN resource allocations without requiring Impala to use YARN-managed container processes. No other distributor supports it.
  • Lucene is a very odd member of this group. It has a long history, its own commercializing firm in LucidWorks, and seems to be involved in every distribution’s search capabilities. But only IBM actually supports Lucene directly, in its BigInsights distribution, although Cloudera, Hortonworks and MapR all support its companion indexing project Solr. The two appear to get broad use despite the lack of commercial support – several distributors offer search add-ons, typically for added cost. Even Hortonworks charges extra for search support.
  • Phoenix, supported only by Hortonworks, became a top-level project in August 2014. Phoenix delivers a SQL-over-HBase layer, compiling queries into a series of HBase scans, leveraging coprocessors and custom filters. Versioned table metadata is itself stored in an HBase table, so queries over prior versions automatically use the correct schema. Positioned as a “data warehouse for HBase,” it’s one to watch because SQL over a DBMS has much more promise than SQL over a file system, which the other “SQL on Hadoop” interfaces tend to offer unless they have their own DBMS – which creates a non-Apache layer some users prefer to avoid.
  • Presto (not an Apache project) is a distributed SQL query engine developed at Facebook for running interactive analytic queries against data sources including Hive, Cassandra, relational databases or proprietary data stores. Presto is supported only by Amazon among Hadoop distributors. It recently received a major commitment of resources from Teradata (which will offer its own support). Presto is not yet an Apache project, and Teradata has made it clear that it will stay that way for a while, so broader support is unlikely anytime soon.
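
For readers who want a feel for what “relational algebra with transformation rules, a cost model, and metadata” means in the Calcite entry above, here is a minimal sketch in Python. It is not Calcite’s API (Calcite is a Java library), and every name in it (Scan, Filter, Join, push_filter_below_join, optimize, the table names and the row estimates) is a hypothetical chosen for illustration. It assumes one rule, a crude row-count cost model, and a fixed join selectivity.

```python
from dataclasses import dataclass
from typing import Union

# Hypothetical, simplified plan nodes. These are NOT Calcite classes.
@dataclass(frozen=True)
class Scan:
    table: str
    rows: int            # estimated row count (the "metadata")

@dataclass(frozen=True)
class Filter:
    child: "Plan"
    table: str           # table that owns the filtered column
    column: str
    selectivity: float   # estimated fraction of rows kept

@dataclass(frozen=True)
class Join:
    left: "Plan"
    right: "Plan"

Plan = Union[Scan, Filter, Join]

JOIN_SELECTIVITY = 1e-4  # assumed fraction of row pairs that match

def tables(plan):
    """Set of base tables read by a plan."""
    if isinstance(plan, Scan):
        return {plan.table}
    if isinstance(plan, Filter):
        return tables(plan.child)
    return tables(plan.left) | tables(plan.right)

def rows(plan):
    """Estimated number of rows a plan produces."""
    if isinstance(plan, Scan):
        return float(plan.rows)
    if isinstance(plan, Filter):
        return rows(plan.child) * plan.selectivity
    return rows(plan.left) * rows(plan.right) * JOIN_SELECTIVITY

def cost(plan):
    """Crude cost model: total rows read by every operator in the plan."""
    if isinstance(plan, Scan):
        return rows(plan)
    if isinstance(plan, Filter):
        return cost(plan.child) + rows(plan.child)
    return (cost(plan.left) + cost(plan.right)
            + rows(plan.left) + rows(plan.right))

def push_filter_below_join(plan):
    """Transformation rule: move a filter under the join input that owns its table."""
    if not (isinstance(plan, Filter) and isinstance(plan.child, Join)):
        return plan
    join = plan.child
    if plan.table in tables(join.left):
        pushed = Filter(join.left, plan.table, plan.column, plan.selectivity)
        return Join(pushed, join.right)
    if plan.table in tables(join.right):
        pushed = Filter(join.right, plan.table, plan.column, plan.selectivity)
        return Join(join.left, pushed)
    return plan

def optimize(plan, rules):
    """Apply each rule once and keep a rewrite only if the cost model prefers it."""
    best = plan
    for rule in rules:
        candidate = rule(best)
        if cost(candidate) < cost(best):
            best = candidate
    return best

orders = Scan("orders", 1_000_000)
customers = Scan("customers", 10_000)
original = Filter(Join(orders, customers), "customers", "country", 0.1)
optimized = optimize(original, [push_filter_below_join])
print(optimized)
print(cost(original), cost(optimized))
```

A real optimizer applies many such rules repeatedly and searches among the alternatives; the sketch shows only the general shape of the idea, in which the rule proposes an equivalent plan and the cost model decides whether to keep it.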

These projects all have value, and promise. If you wish to use them, you’ll be depending on the kindness of strangers (apologies to Blanche DuBois) unless you pay for their distributor’s support. Caveat coder.


Merv Adrian
Research VP
5 years with Gartner
38 years in IT industry

Merv Adrian is an analyst following database and adjacent technologies as extreme data transforms assumptions about what to persist as well as when, where and how. He also watches the way the software/hardware boundary…

Thoughts on Hadoop Projects Supported By Only One Distribution

  1. […] Merv Adrian The Apache Software Foundation has succeeded admirably in becoming a place where new software ideas […]

  2. Ethan Jewett says:

    It strikes me that Calcite and Lucene on this list may be a bit of a misleading classification for those who aren’t familiar with the positioning of these 2 libraries. I’m not an expert, but both seem like libraries that are meant to add a capability to a larger application. Lucene provides full-text search and indexing capabilities, while Calcite provides an SQL engine and optimizer on top of any application.

    Neither is something you deploy on its own like one would with Solr or Hive, but if you deploy Solr you are deploying Lucene, and if you deploy Hive (as of 0.14 IIRC) you are deploying Calcite, to name 2 examples. Solr and Hive both have wide support in distributions, and I assume that that support extends to issues with Lucene and Calcite as used in those applications.

    In any case, I thought it was worth noting that there are many different types of Apache projects in the Hadoop ecosystem. In the case of some types of projects it makes sense to look for explicit distribution support as indicators of maturity and reliability, but I think perhaps less so in the case of projects like Lucene and Calcite.

    • Merv Adrian says:

      Great points, and thanks for making them. One of the interesting challenges of this research has been the difficulty of getting explicit data about who supports what, and how. But this is an issue that matters greatly to mainstream users of technology. Its complexity is certainly compounded by the interconnectedness, hierarchical and otherwise, of the projects in the stack. Where one is subsumed inside the other and not used by itself, it may be fair to assume support.
      In the case of Lucene and SOLR, I’m less sure – explicit support and lack of explicit support do seem to be an issue to me there. Search is separately supported by the vendors. Even Hortonworks does not include it in the standard package.

  3. […]  the Hadoop stack. In recent posts In Now, What is Hadoop? And What’s Supported? and Hadoop Projects Supported by Only One Distribution, I mapped those supported by commercial distributors. But there’s another group – those […]

Comments or opinions expressed on this blog are those of the individual contributors only, and do not necessarily represent the views of Gartner, Inc. or its management. Readers may copy and redistribute blog postings on other blogs, or otherwise for private, non-commercial or journalistic purposes, with attribution to Gartner. This content may not be used for any other purposes in any other formats or media. The content on this blog is provided on an "as-is" basis. Gartner shall not be liable for any damages whatsoever arising out of the content or use of this blog.