New York’s Javits Center is a cavernous triumph of form over function. Giant empty spaces were everywhere at this year’s empty-though-sold-out Strata/Hadoop World, but the strangely-numbered, hard to find, typically inadequately-sized rooms were packed. Some redesign will be needed next year, because the event was huge in impact and demand will only grow. A few of those big tent pavilions you see at Oracle Open World or Dreamforce would drop into the giant halls without a trace – I’d expect to see some next year to make some usable space available.
So much happened, I’ll post a couple of pieces here. Last year’s news was all about promises: Hadoop 2.0 brought the promise of YARN enabling new kinds of processing, and there was promise in the multiple emerging SQL-on-HDFS plays. The Hadoop community was clearly ready to crown a new hype king for 2014.
This year, all that noise had jumped the Spark.
If you have not kept up, Apache Spark bids to
replace supplement MapReduce with a more general purpose engine, combining interactive processing and streaming along with MapReduce-like batch capabilities, leveraging YARN to enable a new, much broader set of use cases. (See Nick Heudecker’s blog for a recent assessment.) It has a commercializer in Databricks, which has shown great skill in assembling an ecosystem of support from a set of partners who are enabling it to work with multiple key Hadoop stack projects at an accelerating pace. That momentum was reflected in the rash of announcements at Hadoop World, across categories from Analytics to Wrangling (couldn’t come up with a Z.) There were more than I’ll list here – their vendors are welcome to add themselves via comments, and I’ll curate this post for a while to put them in.
Hadoop analytics pioneer Platfora announced its version 4.0 with enhanced visualizations, geo-analytics capabilities and collaboration features, and revealed it has “plans for integration” with Spark.
Tableau was a little more ready, delivering a beta version of its Spark Connector, claiming its in-memory offering delivered up to 100x the performance of Hadoop MapReduce. Tableau is also broadening its ecosystem reach, adding a beta version of its connector for Amazon EMR, and support for IBM BigSQL and MarkLogic.
Tresata extended the analytics wave to analytic applications, enhancing its customer intelligence management software for financial data by adding real-time execution of analytical processes using Spark. Tresata is an early mover, and believes one of its core advantages derives from having been architected to run entirely in Hadoop early on. It supports its own data wrangling with Automated Data Ontology Discovery and entity resolution – cleaning, de-duping, and parsing data.
(For developers, Tresata is also open sourcing Scalding-on-Spark – a library that adds support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark.)
Appliances were represented by Dell, who introduced a new In-memory box (one of many Hadoop appliances that represented another 2014 trend) that integrates Spark with Cloudera Enterprise. (Dell is all in on the new datastores – they have buit architectures with Datastax for Cassandra, and with MongoDB, as well.) And Cray, having completed its spinback of Yarc, unveiled its Urika-XA platform with Hadoop and Spark pre-installed, and leveraging its HPC expertise to exploit SSDs, parallel file systems, and high-speed interconnects for a test run to see if there is a high-end performance market yet.
Cloud was brought to the party by BlueData, packaging Spark with its EPIC™ private-cloud deployment platform. Standalone Spark clusters can run Spark-Scala, MLLib or SparkSQL jobs against data stored in HDFS, NFS and other storage. Note “standalone” – Spark can, and will, be used by shops that are not running Hadoop. Once it is actually running production jobs, that is.
Rackspace is in both games with its OnMetal – an appliance-based cloud you don’t have to own, with a high-performance design using 3.2 TB per data node. They provision the other services. Rackspace is partnering with Hortonworks to deliver HDP 2.1 or – you guessed it – Spark. This is all built on a thin virtualization layer on another emerging hot platform: Openstack.
The distributions were represented of course: Cloudera jumped in back in February accompanied by strong statements from Mike Olson that helped put it on the map. Hortonworks followed in May with a tech preview. It still is in preview – Hortonworks, for good reasons, is not quite prepared to call it production-ready yet. Pivotal support was announced in May – oddly, in the Databricks blog, reflecting its on-again, off-again marketing motions. In New York, MapR on the bandwagon since April as well, announced that Drill – itself barely out of the gate – will also run on Spark.
It was intriguing to note that many of the emerging data wrangling/munging/harmonizing/preparing/curating players started early. ClearStory CEO Sharmila Mulligan of was quick to note during her keynote appearance that her offering has been built on Spark from the outset. Paxata, another of the new players with a couple of dozen licensed customers already, has also built its in-memory, columnar, parallel enterprise platform on top of Apache Spark. It connects directly to HDFS, RDBMS, and web services like SalesForce.com and publishes to Apache Hive or Cloudera Impala. Trifacta, already onto its v2, has now officially named its language Wrangle , added native support for more complex data formats, including JSON, Avro, ORC and Parquet, and yes, is focusing on delivering scale for its data transformation through native use of both Spark and MapReduce.
Even the conference organizers got into the act. O’Reilly has made a big investment with Cloudera to make Strata a leading conference. It’s added a European conference, making Doug Cutting the new conference Chair. In New York, O’Reilly announced a partnership with Databricks for Spark developer certification, expanding the franchise before someone else jumps in.
There is far more to come from Spark – a memory-centric file system called Tachyon that will add new capabilities above today’s disk-oriented ones; the MLlib machine learning library that will leverage Spark’s superior iterative performance, GraphX for the long awaited graph performance that today is best served by commercial vendors like Teradata Aster, and of course, Spark Streaming. But much of that is simply not demonstrably production-ready just yet – much is still in beta. Or even alpha. We’ll be watching. For now, it’s the new hype king.