Merv Adrian

A member of the Gartner Blog Network

Merv Adrian
Research VP
1 year with Gartner
30 years in IT industry

Merv Adrian is an analyst following database and adjacent technologies as extreme data transforms assumptions about what to persist as well as when, where and how. He also watches the way the software/hardware boundary…

Hadoop Summit Recap Part One – A Ripping YARN

by Merv Adrian  |  July 10, 2013  |  7 Comments

I had the privilege of keynoting this year’s Hadoop Summit, so I may be a bit prejudiced when I say the event confirmed my assertion that we have arrived at a turning point in Hadoop’s maturation. The large number of attendees (2500, a solid increase – and more “suits”) and sponsors (70, also a significant uptick) made it clear that growth is continuing. Gartner’s data confirms this – my own inquiry rate continues to grow, and my colleagues covering big data and Hadoop are all seeing steady growth too. But it’s not all sweetness and light. There are issues. Here we’ll look at the centerpiece of the technical messaging: YARN. Much is expected – and we seem to be doomed to wait a while longer.

Here is a great summary of YARN, also known as Hadoop 2.0, posted after the Summit:

MapReduce is great for batch processing large volumes of distributed data, but it’s less than ideal for real-time data processing, graph processing and other non-batch methods. YARN is the open source community’s effort to overcome this limitation and transform Hadoop from a One Trick Pony to a truly comprehensive Big Data management and analytics platform.

Sounds great, doesn’t it? Problem is, this was posted by Jeff Kelly last August, after Hadoop Summit 2012. Now, YARN is being used – on Yahoo’s 30,000 nodes, for example – but Apache still calls it Alpha as of this writing (July 9, 2013). The next announcement, when it comes, will be beta. Some distributions, like Cloudera CDH 4.2, are already supporting it anyway. Hortonworks HDP 2.0, which includes YARN, is in Community Preview (what we enterprise guys like to call beta). MapR doesn’t list it yet – the search engine on their site comes up empty if you search for it. So we aren’t quite there yet.

One other note: confusion continues – I see it in my inquiries about “what IS Hadoop?” Two of what Apache lists as the 3 core Hadoop components will be substitutable now – you can already swap out HDFS for IBM’s GPFS, Intel’s Lustre, or MapR’s storage layer. As YARN comes to market, other engines will be swappable for MapReduce. Graph engines and “closer to real-time” processing are next on the horizon, as Storm is getting great traction and several Summit presenters of real world case studies alluded to their use of it. Yahoo! has open sourced its Storm-YARN code, which it runs internally, so expect more productionization ahead. So the answer to “what is Hadoop, exactly?” will become even more complicated.

Will this confusion hurt the market, and slow adoption? Hard to say. The uncertain part of the market remains so; Gartner’s 2012 Research Circle found 31% of enterprises had no plans for Big Data investment. In 2013, the number was the same. YARN will broaden the set of possible use cases, and raise many questions. Let’s hope it’s ready to start answering them soon.

Category: Apache, Apache Yarn, Big Data, Cloudera, Gartner, graph databases, Hadoop, HDFS, Hortonworks, IBM, Intel, MapR, MapReduce, Storm, Yahoo!, YARN

7 responses so far ↓

  • 1 Hadoop Summit Recap Part One – A Ripping YARN | Merv Adrian's IT Market Strategy   July 10, 2013 at 3:56 am

    [...] — more —  [...]

  • 2 Hans Willems   July 12, 2013 at 10:23 am

    Thanks for the blog post Merv. It was not easy to choose the database technology in 2009 for an enterprise customer engagement platform where real-time data processing was a core requirement. We went for Apache Cassandra and the platform has currently grown to 150 million individual customer records continuously updated with real-time behavioural, identifiable and internal information driven by online interactions. We are testing for 1 billion profiles where the overall execution time (listen, identify, write, retrieve, decide, deliver) should stay within 24ms. We are following the Hadoop developments but it seems for now that Cassandra wasn’t a bad choice.

  • 3 Merv Adrian   July 13, 2013 at 1:40 am

    Thanks for telling this story, Hans. Are you using the open source version, or the commercially supported one? I talked to two clients in recent months who assessed Hadoop distributions and threw DataStax Enterprise in the mix almost by accident – and selected it.

  • 4 Hadoop Summit Recap Part Two – SELECT FROM hdfs WHERE bigdatavendor USING SQL   July 15, 2013 at 5:17 am

    [...] struggles to be complete and timely with its offerings. YARN’s lateness was discussed in my first post on the Summit. Two weeks before that, at HBaseCon I watched a speaker asking his audience to volunteer for [...]

  • 5 Hans Willems   July 15, 2013 at 7:58 am

    We are using the open source version and have built our own expertise over the years.

  • 6 Merv Adrian   July 15, 2013 at 9:19 pm

    Thanks for updating!

  • 7 Analysts in the Press 7/16/2013 | Thomas Ward Lynch   July 17, 2013 at 4:38 pm

    [...] Hadoop Summit Recap Part One – A Ripping YARN – Gartner Blog … by Merv Adrian | July 10, 2013 | 3 Comments … Gartner’s data confirms this – my own inquiry rate continues to grow, and my colleagues covering big data and … blogs.gartner.com/…/hadoop-summit-recap-part-one-a-ripping… [...]