It’s no surprise that we’ve been treated to many year-end lists and predictions for Hadoop (and everything else IT) in 2013. I’ve never been that much of a fan of those exercises, but I’ve been asked so much lately that I’ve succumbed. Herewith, the first of a series of posts on what I see as the 4 Ps of Hadoop in the year ahead: performance, projects, platforms and players.
Performance concerns are inevitable as technologies move from early adopters, who are already tweaking everything they build as a matter of course, to mainstream firms, where the value of the investment is always expected to be validated in part by measuring and demonstrating performance superiority. It also becomes an issue when the 3rd or 4th project comes along with a workload profile different from those that came before – and it doesn’t perform as well as those heady first experiments. Getting it right with Hadoop is as much art as science today – the tools are primitive or nonexistent, the skills are more scarce than the tools, and experience – and therefore comparative measurement – is hard to come by.
What’s coming: newly buffed-up versions of key management tools. It’s one method of differentiating distributions in a largely common set of software – Hortonworks doubling down on open source Apache Ambari, Cloudera enhancing Cloudera Manager, MapR updating its Control System (as well as continuing to tout DIY favorites Nagios and Ganglia). EMC, HP, IBM and other megavendors are continuing to instrument their existing, and familiar, enterprise tools to reach this exploding market. It will be a busy bazaar.
Resources are proliferating to help. Published work includes Eric Sammer’s Hadoop Operations (somewhat Cloudera-centric but very well organized and useful), and a plethora of SlideShare presentations is appearing to help navigate the arcana of cluster optimization, workload management and configuration optimization.
Performance has figured in a number of proof-of-concept (POC) tests pitting distributions against one another that I’ve heard about from Gartner clients. Some have been inconclusive; some have had clear winners. As we’ve seen in DBMS POCs over the years, your data and your workloads matter, and your results may differ from others’. I’ve seen “first distributions” replaced by another as performance or differing functionality comes to the fore. I’ve even seen a case where a Cassandra-based alternative won out over the Hadoop distributions.
Next time: projects proliferate.