Gartner Blog Network

Hadoop Summit Recap Part One – A Ripping YARN

by Merv Adrian  |  July 10, 2013

I had the privilege of keynoting this year’s Hadoop Summit, so I may be a bit prejudiced when I say the event confirmed my assertion that we have arrived at a turning point in Hadoop’s maturation. The large number of attendees (2,500, a solid increase – and more “suits”) and sponsors (70, also a significant uptick) made it clear that growth is continuing. Gartner’s data confirms this – my own inquiry rate continues to grow, and my colleagues covering big data and Hadoop are all seeing steady growth too. But it’s not all sweetness and light. There are issues. Here we’ll look at the centerpiece of the technical messaging: YARN. Much is expected – and we seem to be doomed to wait a while longer.

Here is a great summary of YARN, also known as Hadoop 2.0, posted after the Summit:

MapReduce is great for batch processing large volumes of distributed data, but it’s less than ideal for real-time data processing, graph processing and other non-batch methods. YARN is the open source community’s effort to overcome this limitation and transform Hadoop from a One Trick Pony to a truly comprehensive Big Data management and analytics platform.

Sounds great, doesn’t it? Problem is, this was posted by Jeff Kelly last August, after Hadoop Summit 2012. Now, YARN is being used – on Yahoo’s 30,000 nodes, for example – but Apache still calls it Alpha as of this writing (July 9, 2013). The next announcement, when it comes, will be beta. Some distributions, like Cloudera CDH 4.2, are already supporting it anyway. Hortonworks HDP 2.0, which includes YARN, is in Community Preview (what we enterprise guys like to call beta). MapR doesn’t list it yet – the search engine on their site comes up empty if you search for it. So we aren’t quite there yet.

One other note: confusion continues – I see it in my inquiries about “what IS Hadoop?” Two of what Apache lists as the 3 core Hadoop components will be substitutable now – you can already swap out HDFS for IBM’s GPFS, Intel’s Lustre, or MapR’s storage layer. As YARN comes to market, other engines will be swappable for MapReduce. Graph engines and “closer to real-time” processing are next on the horizon, as Storm is getting great traction and several Summit presenters of real-world case studies alluded to their use of it. Yahoo! has open sourced its Storm-YARN code, which it runs internally, so expect more productionization ahead. So the answer to “what is Hadoop, exactly?” will become even more complicated.

Will this confusion hurt the market, and slow adoption? Hard to say. The uncertain part of the market remains so; Gartner’s 2012 Research Circle found 31% of enterprises had no plans for Big Data investment. In 2013, the number was the same. YARN will broaden the set of possible use cases, and raise many questions. Let’s hope it’s ready to start answering them soon.

Merv Adrian
Research VP
9 years with Gartner
40 years in IT industry

Merv Adrian is an analyst following database and adjacent technologies as extreme data transforms assumptions about what to persist as well as when, where and how. He also watches the way the software/hardware boundary…

Thoughts on Hadoop Summit Recap Part One – A Ripping YARN

  1. Hans Willems says:

    Thanks for the blog post Merv. It was not easy to choose the database technology in 2009 for an enterprise customer engagement platform where real-time data processing was a core requirement. We went for Apache Cassandra and the platform has currently grown to 150 million individual customer records continuously updated with real-time behavioural, identifiable and internal information driven by online interactions. We are testing for 1 billion profiles where the overall execution time (listen, identify, write, retrieve, decide, deliver) should stay within 24ms. We are following the Hadoop developments but it seems for now that Cassandra wasn’t a bad choice.

    • Merv Adrian says:

      Thanks for telling this story, Hans. Are you using the open source version, or the commercially supported one? I talked to two clients in recent months who assessed Hadoop distributions and threw DataStax Enterprise in the mix almost by accident – and selected it.

  2. […] struggles to be complete and timely with its offerings. YARN’s lateness was discussed in my first post on the Summit. Two weeks before that, at HBaseCon, I watched a speaker asking his audience to volunteer for […]

  3. Hans Willems says:

    We are using the open source version and have built our own expertise over the years.

Comments or opinions expressed on this blog are those of the individual contributors only, and do not necessarily represent the views of Gartner, Inc. or its management. Readers may copy and redistribute blog postings on other blogs, or otherwise for private, non-commercial or journalistic purposes, with attribution to Gartner. This content may not be used for any other purposes in any other formats or media. The content on this blog is provided on an "as-is" basis. Gartner shall not be liable for any damages whatsoever arising out of the content or use of this blog.