Gartner Blog Network


Strata Spark Tsunami – Hadoop World, Part One

by Merv Adrian  |  October 31, 2014  |  10 Comments

New York’s Javits Center is a cavernous triumph of form over function. Giant empty spaces were everywhere at this year’s empty-though-sold-out Strata/Hadoop World, but the strangely-numbered, hard-to-find, typically inadequately-sized rooms were packed. Some redesign will be needed next year, because the event was huge in impact and demand will only grow. A few of those big tent pavilions you see at Oracle Open World or Dreamforce would drop into the giant halls without a trace – I’d expect to see some next year to make some usable space available.

So much happened, I’ll post a couple of pieces here. Last year’s news was all about promises: Hadoop 2.0 brought the promise of YARN enabling new kinds of processing, and there was promise in the multiple emerging SQL-on-HDFS plays. The Hadoop community was clearly ready to crown a new hype king for 2014.

This year, all that noise had jumped the Spark.

If you have not kept up, Apache Spark bids to supplement MapReduce with a more general purpose engine, combining interactive processing and streaming along with MapReduce-like batch capabilities, leveraging YARN to enable a new, much broader set of use cases. (See Nick Heudecker’s blog for a recent assessment.) It has a commercializer in Databricks, which has shown great skill in assembling an ecosystem of support from a set of partners who are enabling it to work with multiple key Hadoop stack projects at an accelerating pace. That momentum was reflected in the rash of announcements at Hadoop World, across categories from Analytics to Wrangling (couldn’t come up with a Z). There were more than I’ll list here – their vendors are welcome to add themselves via comments, and I’ll curate this post for a while to put them in.

Hadoop analytics pioneer Platfora announced its version 4.0 with enhanced visualizations, geo-analytics capabilities and collaboration features, and revealed it has “plans for integration” with Spark.

Tableau was a little more ready, delivering a beta version of its Spark Connector, claiming its in-memory offering delivered up to 100x the performance of Hadoop MapReduce. Tableau is also broadening its ecosystem reach, adding a beta version of its connector for Amazon EMR, and support for IBM BigSQL and MarkLogic.

Tresata extended the analytics wave to analytic applications, enhancing its customer intelligence management software for financial data by adding real-time execution of analytical processes using Spark. Tresata is an early mover, and believes one of its core advantages derives from having been architected to run entirely in Hadoop early on. It supports its own data wrangling with Automated Data Ontology Discovery and entity resolution – cleaning, de-duping, and parsing data.

(For developers, Tresata is also open sourcing Scalding-on-Spark – a library that adds support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark.)

Appliances were represented by Dell, who introduced a new in-memory box (one of many Hadoop appliances that represented another 2014 trend) that integrates Spark with Cloudera Enterprise. (Dell is all in on the new datastores – they have built architectures with Datastax for Cassandra, and with MongoDB, as well.) And Cray, having completed its spinback of Yarc, unveiled its Urika-XA platform with Hadoop and Spark pre-installed, leveraging its HPC expertise to exploit SSDs, parallel file systems, and high-speed interconnects for a test run to see if there is a high-end performance market yet.

Cloud was brought to the party by BlueData, packaging Spark with its EPIC™ private-cloud deployment platform. Standalone Spark clusters can run Spark-Scala, MLlib or Spark SQL jobs against data stored in HDFS, NFS and other storage. Note “standalone” – Spark can, and will, be used by shops that are not running Hadoop. Once it is actually running production jobs, that is.

Rackspace is in both games with its OnMetal – an appliance-based cloud you don’t have to own, with a high-performance design using 3.2 TB per data node. They provision the other services. Rackspace is partnering with Hortonworks to deliver HDP 2.1 or – you guessed it – Spark. This is all built on a thin virtualization layer on another emerging hot platform: OpenStack.

The distributions were represented, of course: Cloudera jumped in back in February, accompanied by strong statements from Mike Olson that helped put it on the map. Hortonworks followed in May with a tech preview. It still is in preview – Hortonworks, for good reasons, is not quite prepared to call it production-ready yet. Pivotal support was announced in May – oddly, in the Databricks blog, reflecting its on-again, off-again marketing motions. In New York, MapR, on the bandwagon since April as well, announced that Drill – itself barely out of the gate – will also run on Spark.

It was intriguing to note that many of the emerging data wrangling/munging/harmonizing/preparing/curating players started early. ClearStory CEO Sharmila Mulligan was quick to note during her keynote appearance that her offering has been built on Spark from the outset. Paxata, another of the new players with a couple of dozen licensed customers already, has also built its in-memory, columnar, parallel enterprise platform on top of Apache Spark. It connects directly to HDFS, RDBMS, and web services like Salesforce.com and publishes to Apache Hive or Cloudera Impala. Trifacta, already onto its v2, has now officially named its language Wrangle, added native support for more complex data formats, including JSON, Avro, ORC and Parquet, and yes, is focusing on delivering scale for its data transformation through native use of both Spark and MapReduce.

Even the conference organizers got into the act. O’Reilly has made a big investment with Cloudera to make Strata a leading conference. It’s added a European conference, making Doug Cutting the new conference Chair. In New York, O’Reilly announced a partnership with Databricks for Spark developer certification, expanding the franchise before someone else jumps in.

There is far more to come from Spark: a memory-centric file system called Tachyon that will add new capabilities above today’s disk-oriented ones; the MLlib machine learning library that will leverage Spark’s superior iterative performance; GraphX, for the long-awaited graph performance that today is best served by commercial vendors like Teradata Aster; and, of course, Spark Streaming. But much of that is simply not demonstrably production-ready just yet – much is still in beta. Or even alpha. We’ll be watching. For now, it’s the new hype king.


Merv Adrian
Research VP
5 years with Gartner
38 years in IT industry

Merv Adrian is an analyst following database and adjacent technologies as extreme data transforms assumptions about what to persist as well as when, where and how. He also watches the way the software/hardware boundary…


Thoughts on Strata Spark Tsunami – Hadoop World, Part One


  1. Merv,

    Great write-up, thanks a lot! There’s one sentence where I’d like to make a factual correction though. When you say ‘In New York, MapR got on the bandwagon as well ‘: in fact, we announced the Databricks partnership and that we will support and ship Spark in April 2014, before most of our esteemed competitors, details see:

    http://databricks.com/blog/2014/04/10/mapr-integrates-spark-stack.html

    Cheers,
    Michael

    • Merv Adrian says:

      Thanks, Michael. Have made the correction. You’re not the only one whose announcement was on the Databricks blog – that complicated finding it.

  2. Kevin Leong says:

    Hi Merv,

    We had a chance to chat about Cray’s Urika-XA appliance last week, if you recall. We launched the Urika-XA platform at Strata, and it comes with Hadoop and Spark pre-installed. More information can be found here: http://www.cray.com/Products/BigData/Urika-XA.aspx

    Thanks,
    Kevin

    • Merv Adrian says:

      Thanks – I was working from two-week-old notes. Fixed.
      Maybe this Daylight Savings Time thing will help….

  4. Hey Merv,
    Great writeup and observations. Love the term “jump the Spark”. It’s amazing to see not only the growth in the Hadoop World / Strata conference over the past 5 years, but also how quickly the market is buying into the vision and demonstrating the appetite for advanced big data solutions based on emerging technologies like Spark.

    At Sqrrl, we also see the power in these higher-level platforms like Spark. We too gave a talk at Strata on how one can use the GraphX graph-oriented processing engine component of Spark to rapidly derive new information about the data you care about. Our CTO, Adam Fuchs, gave a wonderful real-world example about how the Spark technology can be applied to investigating cyber crime with rapidly executing anomaly detection algorithms.

    More information can be found in our encore webinar on-demand: http://info.sqrrl.com/oct-2014-webinar

  5. Merv Adrian says:

    Thanks, Joe. I did not get the chance to see your talk, but it’s good to see Sqrrl broadening its story from the early, mostly Accumulo-focused days.





Comments or opinions expressed on this blog are those of the individual contributors only, and do not necessarily represent the views of Gartner, Inc. or its management. Readers may copy and redistribute blog postings on other blogs, or otherwise for private, non-commercial or journalistic purposes, with attribution to Gartner. This content may not be used for any other purposes in any other formats or media. The content on this blog is provided on an "as-is" basis. Gartner shall not be liable for any damages whatsoever arising out of the content or use of this blog.