The Apache Software Foundation has over 350 projects underway, and others are being developed in the open source community for use with (and without) the Hadoop stack. In two recent posts, “Now, What is Hadoop? And What’s Supported?” and “Hadoop Projects Supported by Only One Distribution,” I mapped those supported by commercial distributors. But there’s another group – eleven (so far) – not supported by any of them. Those projects – Flink, Geode, Giraph, Ignite, Kylin, Lens, Myriad, NiFi, Samza, Twill and Zeppelin – include some that are getting a great deal of attention.
Being on this list does not necessarily mean they are not ready, or don’t work. There are many reasons a project may not be supported yet. If you want to experiment, by all means do so. Just be aware no commercial support is provided – yet. And that will change….
- Flink is a top-level Apache project; it’s hot, but currently not supported by any distributors. It offers scalable batch and stream data processing supporting Java, Scala, and Python. Flink offers a Table API with a “SQL-like” expression language (uh-oh; “-like” anything rarely ends well), and targets machine learning and graph processing (a popular target with no obvious leader yet).
- Geode is an incubating Apache project based on Pivotal’s donated GemFire in-memory data grid source code. It persists to disk storage and has configurable consistency, trading off between performance and ACID transactions. It fails over with automatic rebalancing and automatic self-healing. At this point, no distributor (including Pivotal) offers support for Geode as part of a Hadoop distribution.
- Giraph is an iterative graph processing system that runs over HDFS and on Amazon EC2, and is built and used at Facebook and LinkedIn. The Apache community has been expecting to see it emerge for some time – it entered incubation in February 2012 and released version 1.0 in May 2013 – but three and a half years later, no distribution yet offers support for it. So far, broader support for graph use cases seems to be found more with the NoSQL graph DBMSs.
- Ignite, like Geode, is a bid for the (brace yourself and get your buzzword bingo cards ready) real-time in-memory distributed event processing transactional Data Fabric space. Currently incubating at Apache, it describes itself as a combination of a data grid, compute (“in-memory MapReduce”) service grid, streaming, file system (“in-memory HDFS” and an implementation of Spark RDDs) and several other functional components, with SQL support. In an environment of disaggregated functions, each with its own project, it seems much more expansive; like Spark, it offers many layers within its portfolio. It is largely driven by GridGain, which offers a commercial in-memory Hadoop accelerator. Ignite is not yet supported by any distributors.
- Kylin, an Apache project since November 2014, is an “extreme OLAP engine for big data” being developed at eBay. It is ANSI SQL-based, and claims to scale to 10+ billion rows, with compression, incremental refresh of cubes and support for HBase Coprocessors. It offers security at the Cube or Project level via ACLs. Kylin supports Tableau; support for MicroStrategy and Excel is underway. No commercial distributions support it.
- Lens, incubating at Apache, provides a federated view of data across multiple tiered data stores using a single shared schema server based on the Hive Metastore. It has its own high-level “SQL-like” (uh-oh again) language, OLAP Cube QL, and drivers for systems like Hive, Redshift and other columnar data warehouses. Contributed by InMobi, the project has committers from Software AG and Hortonworks as well, but is not supported by any distributors as yet.
- Myriad is an Apache incubator project that provides a framework for dynamically scaling YARN clusters on the Apache Mesos distributed systems kernel (which I don’t discuss in this post). It leverages Mesos APIs for resource management and scheduling across datacenters and cloud environments, and offers connections to Docker and other innovative approaches. No distribution is offering support for Myriad yet.
- NiFi provides scalable multisource directed graphs of data routing, transformation, and system mediation, not unlike enterprise service bus (ESB) offerings. It is extensible, with secure, configurable delivery options and provenance tracking. NiFi originated at the National Security Agency (NSA) as Niagarafiles, was submitted to the Apache Incubator in November 2014 and became a top-level project in July 2015. No distributions support it yet.
- Samza is a top-level Apache project that provides a fault-tolerant distributed stream processing framework using Apache Kafka (which is supported by Cloudera and Hortonworks) for messaging. It relies on YARN for resource management and security, and no, it is not supported by any distributor.
- Twill (incubating) is a higher-level abstraction over YARN that uses a thread-like model to reduce the complexity of developing distributed applications. The intent is to let developers focus more on their application logic. Apache Twill abstracts YARN’s distributed capabilities, and provides logging of its multiple runnables, state recovery and elastic scaling. No distribution supports it.
- Zeppelin is a web-based notebook. It supports data ingestion and discovery for creating interactive, collaborative data analytic documents with Scala and Python (with Spark), SparkSQL, Hive, Markdown and Shell. Basic charts and forms are built in. No distributor is offering support for Zeppelin, which is an incubating Apache project.
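For readers wondering what “iterative graph processing” in the Giraph entry looks like in practice, here is a minimal sketch in plain Python of the vertex-centric, superstep-by-superstep style that systems of this kind follow. It is not Giraph’s Java API; the function name `shortest_paths`, the adjacency format and the sample graph are all assumptions made up for illustration, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict


def shortest_paths(edges, source, max_supersteps=50):
    """Single-source shortest paths in a vertex-centric, superstep style.

    edges: dict mapping a vertex to a list of (neighbor, weight) pairs.
    This is an illustrative single-process sketch, not a Giraph program.
    """
    # Collect every vertex that appears as a source or a target.
    vertices = set(edges)
    for targets in edges.values():
        for neighbor, _ in targets:
            vertices.add(neighbor)

    distance = {v: float("inf") for v in vertices}
    inbox = {source: [0]}  # messages waiting for each vertex

    for _ in range(max_supersteps):
        if not inbox:  # no messages left: every vertex has gone quiet
            break
        outbox = defaultdict(list)
        # One superstep: each vertex with mail reads it, updates its own
        # value, and sends messages that arrive in the NEXT superstep.
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < distance[vertex]:
                distance[vertex] = best
                for neighbor, weight in edges.get(vertex, []):
                    outbox[neighbor].append(best + weight)
        inbox = outbox

    return distance


# Hypothetical sample graph, for illustration only.
sample_graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2)],
    "c": [("d", 1)],
}
print(shortest_paths(sample_graph, "a"))
```

The point of the sketch is the shape of the loop: vertices only see messages sent in the previous round, which is what lets a real system spread the vertices over many machines and synchronize between rounds.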