I had the privilege of keynoting this year’s Hadoop Summit, so I may be a bit prejudiced when I say the event confirmed my assertion that we have arrived at a turning point in Hadoop’s maturation. The large number of attendees (2500, a solid increase – and more “suits”) and sponsors (70, also a significant uptick) made it clear that growth is continuing. Gartner’s data confirms this – my own inquiry rate continues to grow, and my colleagues covering big data and Hadoop are all seeing steady growth too. But it’s not all sweetness and light. There are issues. Here we’ll look at the centerpiece of the technical messaging: YARN. Much is expected – and we seem to be doomed to wait a while longer.
Here is a great summary of YARN, also known as Hadoop 2.0, posted after the Summit:
MapReduce is great for batch processing large volumes of distributed data, but it’s less than ideal for real-time data processing, graph processing and other non-batch methods. YARN is the open source community’s effort to overcome this limitation and transform Hadoop from a One Trick Pony to a truly comprehensive Big Data management and analytics platform.
Sounds great, doesn’t it? The problem is that Jeff Kelly posted this last August, after Hadoop Summit 2012. Now, YARN is being used – on Yahoo’s 30,000 nodes, for example – but Apache still calls it Alpha as of this writing (July 9, 2013). The next announcement, when it comes, will be beta. Some distributions, like Cloudera CDH 4.2, are already supporting it anyway. Hortonworks HDP 2.0, which includes YARN, is in Community Preview (what we enterprise guys like to call beta). MapR doesn’t list it yet – the search engine on their site comes up empty if you search for it. So we aren’t quite there yet.
One other note: confusion continues – I see it in my inquiries about “what IS Hadoop?” Two of what Apache lists as the 3 core Hadoop components will be substitutable now – you can already swap out HDFS for IBM’s GPFS, Intel’s Lustre, or MapR’s storage layer. As YARN comes to market, other engines will be swappable for MapReduce. Graph engines and “closer to real-time” processing are next on the horizon, as Storm is getting great traction and several Summit presenters of real world case studies alluded to their use of it. Yahoo! has open sourced its Storm-YARN code, which it runs internally, so expect more productionization ahead. So the answer to “what is Hadoop, exactly?” will become even more complicated.
Will this confusion hurt the market, and slow adoption? Hard to say. The uncertain part of the market remains so; Gartner’s 2012 Research Circle found 31% of enterprises had no plans for Big Data investment. In 2013, the number was the same. YARN will broaden the set of possible use cases, and raise many questions. Let’s hope it’s ready to start answering them soon.
The Gartner Blog Network provides an opportunity for Gartner analysts to test ideas and move research forward. Because the content posted by Gartner analysts on this site does not undergo our standard editorial review, all comments or opinions expressed hereunder are those of the individual contributors and do not represent the views of Gartner, Inc. or its management.
4 Comments
Thanks for the blog post Merv. It was not easy to choose the database technology in 2009 for an enterprise customer engagement platform where real-time data processing was a core requirement. We went for Apache Cassandra and the platform has currently grown to 150 million individual customer records continuously updated with real-time behavioural, identifiable and internal information driven by online interactions. We are testing for 1 billion profiles where the overall execution time (listen, identify, write, retrieve, decide, deliver) should stay within 24ms. We are following the Hadoop developments but it seems for now that Cassandra wasn’t a bad choice.
Thanks for telling this story, Hans. Are you using the open source version, or the commercially supported one? I talked to two clients in recent months who assessed Hadoop distributions and threw Datastax Enterprise in the mix almost by accident – and selected it.
We are using the open source version and have built our own expertise over the years.
Thanks for updating!