Strata Spark Tsunami – Hadoop World, Part One

By Merv Adrian | October 31, 2014 | 6 Comments

Teradata Aster, Microsoft, MapR, IBM BigInsights, IBM, Hortonworks, Gartner, Cray, Cloudera, Cascading, Apache YARN, Apache Spark, Apache MapReduce, Apache Hive, Apache HDFS, Apache Hadoop, Apache Cassandra, Apache Avro, Apache Accumulo, Apache, Amazon Elastic MapReduce, Amazon, Data and Analytics Strategies

New York’s Javits Center is a cavernous triumph of form over function. Giant empty spaces were everywhere at this year’s empty-though-sold-out Strata/Hadoop World, but the strangely-numbered, hard-to-find, typically inadequately-sized rooms were packed. Some redesign will be needed next year, because the event was huge in impact and demand will only grow. A few of those big tent pavilions you see at Oracle Open World or Dreamforce would drop into the giant halls without a trace – I’d expect to see some next year to make some usable space available.

So much happened, I’ll post a couple of pieces here. Last year’s news was all about promises: Hadoop 2.0 brought the promise of YARN enabling new kinds of processing, and there was promise in the multiple emerging SQL-on-HDFS plays. The Hadoop community was clearly ready to crown a new hype king for 2014.

This year, all that noise had jumped the Spark.

If you have not kept up, Apache Spark bids to supplement MapReduce with a more general-purpose engine, combining interactive processing and streaming along with MapReduce-like batch capabilities, leveraging YARN to enable a new, much broader set of use cases. (See Nick Heudecker’s blog for a recent assessment.) It has a commercializer in Databricks, which has shown great skill in assembling an ecosystem of support from a set of partners who are enabling it to work with multiple key Hadoop stack projects at an accelerating pace. That momentum was reflected in the rash of announcements at Hadoop World, across categories from Analytics to Wrangling (couldn’t come up with a Z). There were more than I’ll list here – their vendors are welcome to add themselves via comments, and I’ll curate this post for a while to put them in.

Hadoop analytics pioneer Platfora announced its version 4.0 with enhanced visualizations, geo-analytics capabilities and collaboration features, and revealed it has “plans for integration” with Spark.

Tableau was a little more ready, delivering a beta version of its Spark Connector, claiming its in-memory offering delivered up to 100x the performance of Hadoop MapReduce. Tableau is also broadening its ecosystem reach, adding a beta version of its connector for Amazon EMR, and support for IBM BigSQL and MarkLogic.

Tresata extended the analytics wave to analytic applications, enhancing its customer intelligence management software for financial data by adding real-time execution of analytical processes using Spark. Tresata is an early mover, and believes one of its core advantages derives from having been architected to run entirely in Hadoop early on. It supports its own data wrangling with Automated Data Ontology Discovery and entity resolution – cleaning, de-duping, and parsing data.

(For developers, Tresata is also open sourcing Scalding-on-Spark – a library that adds support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark.)

Appliances were represented by Dell, who introduced a new in-memory box (one of many Hadoop appliances that represented another 2014 trend) that integrates Spark with Cloudera Enterprise. (Dell is all in on the new datastores – they have built architectures with Datastax for Cassandra, and with MongoDB, as well.) And Cray, having completed its spinback of Yarc, unveiled its Urika-XA platform with Hadoop and Spark pre-installed, leveraging its HPC expertise to exploit SSDs, parallel file systems, and high-speed interconnects for a test run to see if there is a high-end performance market yet.

Cloud was brought to the party by BlueData, packaging Spark with its EPIC™ private-cloud deployment platform. Standalone Spark clusters can run Spark-Scala, MLlib or Spark SQL jobs against data stored in HDFS, NFS and other storage. Note “standalone” – Spark can, and will, be used by shops that are not running Hadoop. Once it is actually running production jobs, that is.

Rackspace is in both games with its OnMetal – an appliance-based cloud you don’t have to own, with a high-performance design using 3.2 TB per data node. They provision the other services. Rackspace is partnering with Hortonworks to deliver HDP 2.1 or – you guessed it – Spark. This is all built on a thin virtualization layer on another emerging hot platform: OpenStack.

The distributions were represented, of course: Cloudera jumped in back in February, accompanied by strong statements from Mike Olson that helped put it on the map. Hortonworks followed in May with a tech preview. It still is in preview – Hortonworks, for good reasons, is not quite prepared to call it production-ready yet. Pivotal support was announced in May – oddly, in the Databricks blog, reflecting its on-again, off-again marketing motions. In New York, MapR, on the bandwagon since April as well, announced that Drill – itself barely out of the gate – will also run on Spark.

It was intriguing to note that many of the emerging data wrangling/munging/harmonizing/preparing/curating players started early. ClearStory CEO Sharmila Mulligan was quick to note during her keynote appearance that her offering has been built on Spark from the outset. Paxata, another of the new players with a couple of dozen licensed customers already, has also built its in-memory, columnar, parallel enterprise platform on top of Apache Spark. It connects directly to HDFS, RDBMS, and web services like SalesForce.com and publishes to Apache Hive or Cloudera Impala. Trifacta, already onto its v2, has now officially named its language Wrangle, added native support for more complex data formats, including JSON, Avro, ORC and Parquet, and yes, is focusing on delivering scale for its data transformation through native use of both Spark and MapReduce.

Even the conference organizers got into the act. O’Reilly has made a big investment with Cloudera to make Strata a leading conference. It’s added a European conference, making Doug Cutting the new conference Chair. In New York, O’Reilly announced a partnership with Databricks for Spark developer certification, expanding the franchise before someone else jumps in.

There is far more to come from Spark – a memory-centric file system called Tachyon that will add new capabilities above today’s disk-oriented ones; the MLlib machine learning library that will leverage Spark’s superior iterative performance; GraphX for the long-awaited graph performance that today is best served by commercial vendors like Teradata Aster; and, of course, Spark Streaming. But much of that is simply not demonstrably production-ready just yet – much is still in beta. Or even alpha. We’ll be watching. For now, it’s the new hype king.

6 Comments

  • Merv,

    Great write-up, thanks a lot! There’s one sentence where I’d like to make a factual correction though. When you say ‘In New York, MapR got on the bandwagon as well’: in fact, we announced the Databricks partnership and that we will support and ship Spark in April 2014, before most of our esteemed competitors. For details, see:

    http://databricks.com/blog/2014/04/10/mapr-integrates-spark-stack.html

    Cheers,
    Michael

    • Merv Adrian says:

      Thanks, Michael. Have made the correction. You’re not the only one whose announcement was on the Databricks blog – that complicated finding it.

  • Kevin Leong says:

    Hi Merv,

    We had a chance to chat about Cray’s Urika-XA appliance last week, if you recall. We launched the Urika-XA platform at Strata, and it comes with Hadoop and Spark pre-installed. More information can be found here: http://www.cray.com/Products/BigData/Urika-XA.aspx

    Thanks,
    Kevin

    • Merv Adrian says:

      Thanks – I was working from two-week-old notes. Fixed.
      Maybe this Daylight Savings Time thing will help….

  • Hey Merv,
    Great writeup and observations. Love the term “jump the Spark”. It’s amazing to see not only the growth in the Hadoop World / Strata conference over the past 5 years, but also how quickly the market is buying into the vision and demonstrating the appetite for advanced big data solutions based on emerging technologies like Spark.

    At Sqrrl, we also see the power in these higher-level platforms like Spark. We too gave a talk at Strata on how one can use the GraphX graph-oriented processing engine component of Spark to rapidly derive new information about the data you care about. Our CTO, Adam Fuchs, gave a wonderful real-world example about how the Spark technology can be applied to investigating cyber crime with rapidly executing anomaly detection algorithms.

    More information can be found in our encore webinar on-demand: http://info.sqrrl.com/oct-2014-webinar

  • Merv Adrian says:

    Thanks, Joe. I did not get the chance to see your talk, but it’s good to see Sqrrl broadening its story from the early, mostly Accumulo-focused days.