It’s no surprise that we’ve been treated to many year-end lists and predictions for Hadoop (and everything else IT) in 2013. I’ve never been much of a fan of those exercises, but I’ve been asked so much lately that I’ve succumbed. Herewith, the first of a series of posts on what I see as the 4 Ps of Hadoop in the year ahead: performance, projects, platforms and players.
Performance concerns are inevitable as technologies move from early adopters, who are already tweaking everything they build as a matter of course, to mainstream firms, where the value of the investment is always expected to be validated in part by measuring and demonstrating performance superiority. It also becomes an issue when the 3rd or 4th project comes along with a workload profile different from those that came before – and it doesn’t perform as well as those heady first experiments. Getting it right with Hadoop is as much art as science today – the tools are primitive or nonexistent, the skills are more scarce than the tools, and experience – and therefore comparative measurement – is hard to come by.
What’s coming: newly buffed-up versions of key management tools. It’s one way of differentiating distributions built on a largely common set of software – Hortonworks doubling down on open source Apache Ambari, Cloudera enhancing Cloudera Manager, MapR updating its Control System (as well as continuing to tout DIY favorites Nagios and Ganglia). EMC, HP, IBM and other megavendors are continuing to instrument their existing, and familiar, enterprise tools to reach this exploding market. It will be a busy bazaar.
Resources are proliferating to help: published works like Eric Sammer’s Hadoop Operations (somewhat Cloudera-centric, but very well organized and useful), and a plethora of Slideshare presentations designed to help navigate the arcana of cluster optimization, workload management, and configuration optimization.
Performance has figured in a number of proof of concept (POC) tests pitting distributions against one another that I’ve heard about from Gartner clients. Some have been inconclusive; some have had clear winners. As we’ve seen in DBMS POCs over the years, your data and your workloads matter, and your results may differ from others’. I’ve seen “first distributions” replaced by another as performance or differing functionality came to the fore. I’ve even seen a case where a Cassandra-based alternative won out over the Hadoop distributions.
Next time: projects proliferate.
7 Comments
Any feedback on another P… Price? We all know that open source is not free… but it may be price/performance that drives Hadoop adoption when the price for other options is in the tens of thousands of dollars per terabyte range… and there are lots of terabytes in big (I will not say the last word).
Rob
Thanks for the question, Rob. In general, I don’t blog about pricing. Gartner has an IT Asset Management team that helps our clients with those issues, and I provide the team with some insights on DBMS when I have them. In this space (Hadoop) we are planning a more substantial effort in 2013 as more questions come in.
Merv, the basic pitch for Hadoop includes performance, but the discussion has two aspects:
(1) first, the massively parallel processing framework that Hadoop offers and the performance you can achieve with it vis-a-vis what you cannot achieve with a DBMS and traditional batch/analytic tools;
(2) second, how one IT vendor’s Hadoop offering compares vis-a-vis others, and this is where, as you point out, the ‘bazaar’ offerings roll up to create a heady solution cocktail.
For both these points, customers should be advised to be aware that:
(1) as with any technology, each organization may have to map out its requirements and patiently consolidate its Hadoop stack with the tools, frameworks, accelerators, and dev and ops practices required to stabilize its applications;
(2) it is not so much Cloudera vs. Hortonworks vs. MapR, etc., as it is Hadoop vs. the traditional approach – so if you want a billion data records processed in a day rather than a month, you may want to go with Hadoop…
Thanks, Sachin. Maturing the stack will take a while, as it always has with newly introduced data management offerings – including new RDBMSs. So building a full solution, as I noted in my earlier post on data integration, is no simple matter. It should always be requirements first, technology second. But I disagree with your second conclusion. I’m seeing differences in the providers already, and so are Gartner clients who have tried them – often in POCs where they have competed.
Sachin – taking your point further, there are two reasons you use Hadoopy things: specific applications at massive scale, and future goals. Companies tend to implement now mostly for the app, but it’s moving to “for the infrastructure” (or should that be for the headlines?).
Most initial use has specific goals and specific components needed (and most use only the basics initially). Once you have the data and the cluster, it becomes tempting to experiment with additional capabilities for additional needs. So – either the initial project pushes you in a certain direction (e.g., “I need to do scoring”) or the add-ins themselves become the draw.
The vendors only become interesting if you need their capabilities, such as support, ops, NFS. In most cases I’ve seen, they don’t make much difference except to non-tech management.
And – forget about the *billion records*. You can store and process a billion records on an SSD (e.g., http://ark.intel.com/products/67009/Intel-SSD-910-Series-800GB-12-Height-PCIe-2_0-25nm-MLC) in 1.5 hours on a $5K desktop. A billion records is a boring RDBMS issue, and is really a desktop issue. Hadoop becomes interesting when you have petabytes and trillions of records and/or massive compute.
Aaron, I disagree with you on whether having a vendor matters – you seem to be saying they don’t. Perhaps that seems true in shops that have large, expert IT organizations, but the training, feature set integration, and most especially the support matter a lot for those who do not. And when it comes to Hadoop, that’s the large majority. In general, Gartner always recommends using vendor-supported open source software if you plan to go into production.
Hello Merv,
Thanks for this great post and drawing attention to the performance issues with Hadoop. We have been working on this problem for the past three years in the Starfish project at Duke University: http://www.cs.duke.edu/starfish
Converting Hadoop performance tuning from an art to a science, and developing automated tools for the same, has been a fun intellectual exercise. Would appreciate your thoughts and feedback. We have been expanding the ecosystem we support: from Hive, Pig, and Cascading, and now looking at Oozie, Storm, Shark/Spark, and YARN.
best,
Shivnath Babu