In the first two posts in this series, I talked about performance and projects as key themes in Hadoop’s watershed year. As Hadoop moves squarely into the mainstream, organizations making their first move to experiment will have to choose a platform. And – arguably for the first time in the early mainstreaming of an information technology wave – that choice is about more than who made the box where the software will run, and the spinning metal platters the bits will be stored on. There are three options, and choosing among them will have dramatically different implications for the budget, for the available capabilities, and for the fortunes of some vendors seeking to carve out a place in the IT landscape with their offerings.
First up is the cloud. It’s extraordinarily attractive to first timers, because there is no capital expenditure (read: no procurement process, no IT Standards Committees, minimal budget impact, etc.). It’s easy. Maybe too easy; using it outside IT can undermine years of careful work on governance and dramatically increase risk. But that’s another post; for more detail, clients can consult the Gartner report “‘Big Data’ Is Only the Beginning of Extreme Information Management.” My point here is that it’s no accident that Amazon has reported that it started 2 million Elastic MapReduce (EMR) clusters – in a single year. And if that many are already on Amazon, think of the other platforms – bet there are more than a few there? You’d be right, but I won’t belabor that here – they aren’t hard to find. The growth of cloud platforms for big data is not likely to slow. The cloud will remain a great choice for early uses, for speculative projects that may need to spin down as quickly as they spun up if they don’t prove to be useful, and for projects whose economics just plain work. That’s why some of this year’s key announcements will focus there – stay tuned.
Second is our default choice so far: buy some nodes. Then buy some more. Rinse and repeat. Early adopters, even the mammoth new web-based firms that got all this started, did this, and still do. Some sites literally have people who spend the day going up and down rows of racks pulling failed inexpensive disks and replacing them. It’s the source of the HDFS “3 copies, one on a separate rack” default, and it works. You can buy a quad-core Supermicro data node (or several) with twelve 500 GB hard disks in it – 6 TB – for $7K or so, and a name node for $4K with more memory and less disk (why? that’s another post too) and you’ll be working with a Cloudera-certified platform.
You can spend less – and you can spend more – but the numbers are still as compelling as ever. Buy some racks, fill ‘em up with a few dozen nodes and you’re into a couple of hundred terabytes for a couple of hundred thousand (insert currency of your choice here). More expensive than the cloud, but nothing like the big server/storage combination bucks your brethren in the data center are spending for those big RDBMS platforms. Be warned: if you don’t know how to deploy, operate and optimize a cluster, you have a lot to learn. And there is a good chance your data center folks, if you have some, will need new skills even if they are already good at operating what is in there today.
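For anyone who wants to check that back-of-the-envelope math, here is a minimal sketch using the node prices quoted above. The node count of 36 (my stand-in for “a few dozen”), the single name node, and the `estimate_cluster` helper itself are assumptions for illustration; racks, networking, power, and people are deliberately left out, and the replication factor of 3 reflects the HDFS default mentioned earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterEstimate:
    raw_tb: float              # total disk capacity across data nodes
    usable_tb: float           # capacity left after HDFS-style replication
    hardware_cost: int         # data nodes plus one name node, nothing else
    cost_per_usable_tb: float


def estimate_cluster(
    data_nodes: int,
    disks_per_node: int = 12,       # from the post: twelve disks per data node
    disk_gb: int = 500,             # from the post: 500 GB each
    data_node_cost: int = 7_000,    # from the post: roughly $7K per data node
    name_node_cost: int = 4_000,    # from the post: roughly $4K, one assumed
    replication: int = 3,           # HDFS default of three copies
) -> ClusterEstimate:
    """Hypothetical helper: rough capacity and hardware cost for a cluster."""
    raw_tb = data_nodes * disks_per_node * disk_gb / 1000  # decimal terabytes
    usable_tb = raw_tb / replication
    hardware_cost = data_nodes * data_node_cost + name_node_cost
    return ClusterEstimate(
        raw_tb=raw_tb,
        usable_tb=usable_tb,
        hardware_cost=hardware_cost,
        cost_per_usable_tb=hardware_cost / usable_tb,
    )


# Assumption: "a few dozen nodes" taken as 36 data nodes.
estimate = estimate_cluster(data_nodes=36)
print(f"Raw capacity:    {estimate.raw_tb:.0f} TB")
print(f"Usable capacity: {estimate.usable_tb:.0f} TB at 3x replication")
print(f"Hardware cost:   ${estimate.hardware_cost:,}")
print(f"Cost per usable TB: ${estimate.cost_per_usable_tb:,.0f}")
```

Under those assumptions the raw figures land in the “couple of hundred terabytes for a couple of hundred thousand” range, while the usable number after three-way replication is about a third of the raw one. That gap is worth keeping in mind when comparing against cloud or appliance pricing.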
Finally, and of most financial, vendor leadership and internal standards import, is the newest choice: appliances. At least 6 plays are making their way to market for Hadoop users: EMC Greenplum’s Data Computing Appliance, the Cisco Platform for NetApp Open Solution for Hadoop, HP’s AppSystem for Apache Hadoop, IBM’s PureData System for Analytics, Oracle’s Big Data Appliance, and Teradata’s Aster Analytic Appliance. (I’m sure there are others I’ve left out here, and there are data warehouse appliances and specialty plays like Yarc’s Urika for graph data applications [not Hadoop], but this is a good start.)
The big questions to ask here are: whose software are you running, what else do you get in the package besides metal and a Hadoop distribution, how much easier will it be to operate than buying your own nodes, what support will you get – and the big one: how much does it cost? In 2013, the market will begin to decide whether the value proposition of appliances will play here – is the premium (and make no mistake, there is one) you pay worth the quicker time to deployment, operational and management help, and agility you get? That discussion is deep and detailed, and beyond our scope here, but I’m looking forward to continued conversations with Gartner clients who are making these choices as the market develops.
Next time: players. And there are some new ones. Don’t miss it.