Gartner Blog Network

Spark and Tez Highlight MapReduce Problems

by Nick Heudecker  |  February 4, 2014

On February 3rd, Cloudera announced support for Apache Spark as part of Cloudera Enterprise. I’ve blogged about Spark before, so I won’t go into substantial detail here, but the short version is that Spark improves upon MapReduce by removing the need to write data to disk between steps. Spark also takes advantage of in-memory processing and data sharing for further optimizations.

The other successor to MapReduce (of course there is more than one) is Apache Tez. Tez improves upon MapReduce by removing the need to write data to disk between steps (sound familiar?). It also has in-memory capabilities similar to Spark. Thus far, Hortonworks has thrown its weight behind Tez development as part of the Stinger project.
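
For readers who want to see the difference rather than take it on faith, here is a toy sketch in plain Python. It is not Spark, Tez, or MapReduce code; the function names are made up for illustration, and it assumes a simple two-step word count. The first runner hands off between steps through a file on disk, roughly the way chained MapReduce jobs do. The second passes the intermediate data along in memory, which is the basic idea both Spark and Tez build on.

```python
import json
import os
import tempfile
from collections import Counter


def tokenize(lines):
    """Step 1: split each line into lowercase words."""
    for line in lines:
        for word in line.lower().split():
            yield word


def count(words):
    """Step 2: tally how often each word appears."""
    return Counter(words)


def run_with_disk_handoff(lines):
    """MapReduce-style (illustrative): step 1's output is written to a
    file, and step 2 reads it back before doing any work."""
    fd, path = tempfile.mkstemp(suffix=".json")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(list(tokenize(lines)), f)  # intermediate write
        with open(path) as f:
            words = json.load(f)                 # intermediate read
        return count(words)
    finally:
        os.remove(path)


def run_in_memory(lines):
    """Spark/Tez-style (illustrative): step 2 consumes step 1's output
    directly, with no intermediate file."""
    return count(tokenize(lines))


lines = ["spark and tez", "tez and mapreduce"]
disk_result = run_with_disk_handoff(lines)
memory_result = run_in_memory(lines)
```

Both runners return the same counts. The only difference is the extra write and read between the steps, and that cost grows with every additional step in a real pipeline.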

Both Tez and Spark are described as supplementing MapReduce workloads. However, I don’t think this will be the case much longer. The world has changed since Google published the original MapReduce paper in 2004. Memory prices have plummeted while data volumes and sources have increased, making legacy MapReduce less appealing.

Vendors will likely begin distancing themselves from MapReduce in favor of more performant options once there are some high-profile customer references. It remains to be seen what this means for early adopters with legacy MapReduce applications.

Thanks to Josh Wills at Cloudera for helping clarify the advantage provided by Spark & Tez.

Category: data-and-analytics-strategies  

Tags: hadoop  mapreduce  spark  tez  

Nick Heudecker
Research Vice President
5 years at Gartner
19 years IT Industry

Nick Heudecker is an Analyst in Gartner's Research and Advisory Data Management group.

Thoughts on Spark and Tez Highlight MapReduce Problems

  2. Robert says:

    Good article!

    Regarding the legacy support for early adopters: Tez claims to support existing MapReduce jobs within its system.

    I want to add another system that is also tackling the problems of MapReduce with a similar approach: Stratosphere.
    We also avoid disk accesses as much as possible. Our system also shares JVMs for better performance. One of the unique features of Stratosphere is its optimizer, which, similar to a traditional RDBMS, decides which execution strategy to choose based on data properties and the like. Have a look at it!
    Ask me if you have any questions regarding Stratosphere.

  3. Worry more about the interface, less about the storage implementation

    I would strongly recommend focusing LESS on the details of a certain open source implementation, and more on the interfaces that developers want. Don’t just hopscotch from one project to the next, rewriting your code each time.

    I was happy to hear that Tez supports MapReduce, but we need a developer platform for real time analytics. One that my company, Aerospike, can start offering solutions for.

    Right now, there are too many possible interfaces for a programmer to use. Storm has a nice model, but it is fairly basic and doesn’t have a reliability abstraction (Trident turned out not to be the answer). Aerospike has a Storm plugin available on GitHub. Spark has some legs right now, and has a flexible interface for storage providers (“watch this space” for Aerospike’s interest). I haven’t looked at Tez yet.

    Guys – don’t let this market fragment. Then let the storage guys work their magic, just as there are innumerable HDFS / Hadoop optimizations.

    • Nick Heudecker says:

      Thanks for the insight, Brian. The vendors are certainly diverging, and I’m not sure that’s a bad thing from a customer perspective. How the divergence impacts the competitive landscape is another matter.

Comments or opinions expressed on this blog are those of the individual contributors only, and do not necessarily represent the views of Gartner, Inc. or its management. Readers may copy and redistribute blog postings on other blogs, or otherwise for private, non-commercial or journalistic purposes, with attribution to Gartner. This content may not be used for any other purposes in any other formats or media. The content on this blog is provided on an "as-is" basis. Gartner shall not be liable for any damages whatsoever arising out of the content or use of this blog.