Merv Adrian
Research VP
4 years with Gartner
37 years in IT industry

Merv Adrian is an analyst following database and adjacent technologies as extreme data transforms assumptions about what to persist, as well as when, where and how. He also watches the way the software/hardware boundary…


Hadoop 2013 – Part Three: Platforms

by Merv Adrian  |  February 23, 2013  |  4 Comments

In the first two posts in this series, I talked about performance and projects as key themes in Hadoop’s watershed year. As Hadoop moves squarely into the mainstream, organizations making their first move to experiment will have to choose a platform. And, arguably for the first time in the early mainstreaming of an information technology wave, that choice is about more than who made the box the software will run on and the spinning metal platters the bits will be stored on. There are three options, and choosing among them will have dramatically different implications for the budget, for the available capabilities, and for the fortunes of some vendors seeking to carve out a place in the IT landscape with their offerings.

First up is the cloud. It’s extraordinarily attractive to first-timers, because there is no capital expenditure (read: no procurement process, no IT standards committees, minimal budget impact, etc.). It’s easy. Maybe too easy; using it outside IT can undermine years of careful work on governance and dramatically increase risk. But that’s another post; for more detail, clients can consult the Gartner report ‘Big Data’ Is Only the Beginning of Extreme Information Management. My point here is that it’s no accident that Amazon has reported it started 2 million Elastic MapReduce (EMR) clusters in a single year. And if that many are already running on Amazon, think about the other cloud platforms. Bet there are more than a few there? You’d be right, but I won’t belabor that here; they aren’t hard to find. The growth of cloud platforms for big data is not likely to slow, and the cloud will remain a great choice for early uses, for speculative projects that may need to spin down as quickly as they spun up if they don’t prove useful, and for ones whose economics just plain work. For many projects, the cloud will remain the most reasonable economic choice. That’s why some of this year’s key announcements will focus there. Stay tuned.
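To make the spin-up-and-spin-down economics concrete, here is a minimal sketch of starting and then terminating a small EMR cluster programmatically. It uses the current boto3 Python SDK purely as an illustration (the tooling has changed since this post was written); the region, instance types, release label and IAM role names are placeholder assumptions, not recommendations.

```python
# Illustrative sketch only: spin up a small Hadoop cluster on EMR, then tear it down.
# Assumes AWS credentials are configured and the default EMR roles already exist.
import boto3

emr = boto3.client("emr", region_name="us-east-1")  # region is a placeholder

# Request a 3-node cluster running Hadoop. All sizing values are illustrative.
response = emr.run_job_flow(
    Name="experimental-hadoop-cluster",
    ReleaseLabel="emr-6.10.0",                # assumption: any current EMR release
    Applications=[{"Name": "Hadoop"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,  # keep it alive for interactive use
    },
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
cluster_id = response["JobFlowId"]
print("Started cluster", cluster_id)

# When the experiment is over, spin it down and the meter stops.
emr.terminate_job_flows(JobFlowIds=[cluster_id])
```

The point is less the specific calls than the shape of the workflow: a cluster is an API call, and so is making it go away the moment the experiment ends.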

Second is our default choice so far: buy some nodes. Then buy some more. Rinse and repeat. Early adopters, even the mammoth new web-based firms that got all this started, did this, and still do. Some sites literally have people who spend the day going up and down rows of racks pulling failed inexpensive disks and replacing them. It’s the source of the HDFS default of three copies, one on a separate rack, and it works. You can buy a quad-core Supermicro data node (or several) with twelve 500 GB hard disks in it – 6 TB in all – for $7K or so, and a name node for $4K with more memory and less disk (why? that’s another post too), and you’ll be working with a Cloudera-certified platform.
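A quick aside on that “one copy on a separate rack” default: HDFS only knows about racks if an administrator supplies a topology script that maps nodes to rack locations. The sketch below is a minimal, hypothetical Python version; the hostnames and rack paths are invented for illustration, and real sites usually generate the mapping from their inventory system.

```python
#!/usr/bin/env python
# Hypothetical rack-topology script. Hadoop invokes it with one or more
# hostnames/IPs as arguments and expects one rack path per argument on stdout,
# which lets HDFS spread block replicas across more than one rack.
import sys

# Assumed mapping for illustration only.
RACKS = {
    "datanode01": "/dc1/rack1",
    "datanode02": "/dc1/rack1",
    "datanode03": "/dc1/rack2",
    "datanode04": "/dc1/rack2",
}

for host in sys.argv[1:]:
    # Unknown hosts fall back to the default rack.
    print(RACKS.get(host, "/default-rack"))
```

The cluster is pointed at a script like this through the topology script setting in its core-site configuration; without one, HDFS treats every node as sitting on a single default rack and the separate-rack placement quietly never happens.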

You can spend less, and you can spend more, but the numbers are still as compelling as ever. Buy some racks, fill ‘em up with a few dozen nodes, and you’re into a couple of hundred terabytes for a couple of hundred thousand (insert currency of your choice here). More expensive than the cloud, but nothing like the big bucks your brethren in the data center are spending on server/storage combinations for those big RDBMS platforms. Be warned: if you don’t know how to deploy, operate and optimize a cluster, you have a lot to learn. And there is a good chance your data center folks, if you have any, will need new skills even if they are already good at operating what is in there today.
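To show where “a couple of hundred terabytes for a couple of hundred thousand” comes from, here is a back-of-the-envelope sizing sketch using the node prices and disk counts quoted above. The node counts and replication factor are assumptions for illustration; real clusters also give up capacity to overhead, intermediate job output and headroom.

```python
# Back-of-the-envelope cluster sizing, using the figures from the post.
DATA_NODE_PRICE = 7_000        # $ per data node (quad-core, 12 x 500 GB)
NAME_NODE_PRICE = 4_000        # $ per name node
RAW_TB_PER_NODE = 12 * 0.5     # 12 disks x 500 GB = 6 TB raw
REPLICATION = 3                # HDFS default: three copies of every block

data_nodes = 36                # "a few dozen nodes" -- an assumption
name_nodes = 2                 # primary plus a standby, also an assumption

raw_tb = data_nodes * RAW_TB_PER_NODE
usable_tb = raw_tb / REPLICATION
cost = data_nodes * DATA_NODE_PRICE + name_nodes * NAME_NODE_PRICE

print(f"Raw capacity:    {raw_tb:.0f} TB")      # ~216 TB raw
print(f"Usable at 3x:    {usable_tb:.0f} TB")   # ~72 TB of unique data
print(f"Hardware cost:   ${cost:,}")            # ~$260,000
print(f"$ per usable TB: ${cost / usable_tb:,.0f}")
```

Note that the “couple of hundred terabytes” is raw disk; with three-way replication, the unique data you can actually store is roughly a third of that.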

Finally, and of the greatest import for finances, vendor leadership and internal standards, is the newest choice: appliances. At least six plays are making their way to market for Hadoop users: EMC Greenplum’s Data Computing Appliance, the Cisco Platform for NetApp Open Solution for Hadoop, HP’s AppSystem for Apache Hadoop, IBM’s PureData System for Analytics, Oracle’s Big Data Appliance, and Teradata’s Aster Analytic Appliance. (I’m sure there are others I’ve left out here, and there are data warehouse appliances and specialty plays like Yarc’s Urika for graph data applications [not Hadoop], but this is a good start.)

The big questions to ask here are: whose software are you running, what else do you get in the package besides metal and a Hadoop distribution, how much easier will it be to operate than buying your own nodes, what support will you get, and the big one: how much does it cost? In 2013, the market will begin to decide whether the value proposition of appliances will play here. Is the premium you pay (and make no mistake, there is one) worth the quicker time to deployment, the operational and management help, and the agility you get? That discussion is deep and detailed, and beyond our scope here, but I’m looking forward to continued conversations with Gartner clients who are making these choices as the market develops.

Next time: players. And there are some new ones.  Don’t miss it.


Category: Amazon Apache Aster Big Data BigInsights Cisco Cloudera data warehouse appliance Elastic MapReduce EMC Gartner graph databases Hadoop HP IBM MapReduce NetApp Oracle Teradata Yarc

4 responses so far ↓

  • 1 aaronw   February 25, 2013 at 10:03 pm

    I think it’s worth looking at Hadoop:
    – distros
    – support
    – appliances
    all as spectrums. Distros and support differ in features, versions, add-ins, and integration. Hadoop appliances and platforms likewise range from scavenged hardware, through roll-your-own, through vendor-preconfigured servers, to things sold as appliances.

    This becomes interesting because of which consumers each of these serves. Hadoop is both easy to configure and pretty flexible if you have soft requirements.

    Most users testing the waters have soft requirements, and the choice among these things is mostly either a non-technical decision or positioning for a future state. All options work well; the features needed are basic ones, the differentiation among distros is generally not relevant for early use, and the value-added features in appliances are not helpful in use, support, or performance.

    Appliances probably make life harder for new or small users. Appliances are an odd trade-off, though, for more sophisticated users. On the one hand, they are actually hard to adopt: they conflict with large IT standards in most cases, requiring more engineering to implement than white-box servers. On the other hand, they could help with specialized problems (map/map/reduce with giant shuffles, for example, may benefit from additional network bandwidth).

    So – why appliances?
    – for vendors, they are really competing for mindshare more than selling products. (Large vendors are also attempting to brand commodities.) Any real sales are gravy.
    – for consumers, they are likely one of:
    — future-focused managers looking for an endgame
    — departmental end-runs around central IT (I’ve only seen a few of these; Hadoop still seems to be geeky)
    — checkmark items
    — (in the case of IBM) some Chinese menu bucket-of-goods sales

  • 2 Dan Graham   February 26, 2013 at 12:24 am

    Merv,
    Good advice on all three blog posts. Yes, clusters are hard, very hard to do — unless you don’t care about consistent performance, uptime, or recovery. Then they get fairly easy. I like your balanced approach: yep, Hadoop can be cheap and the hardware can be cheap, and there is a place for that. Enterprise-class applications supporting dozens or hundreds of users are not a safe place for Hadoop (i.e., your cloud discussion).

    This is why vendors build appliances: to make it easier to reach upwards towards enterprise class. If customers weren’t buying appliances, the vendors would not sell them. Duh! I know many customers who cannot bring any old cheap node into the data center and rack it up — the operations manager doesn’t want the headache of reliability and support. Which is why many Hadoop clusters start life outside the data center.

    I grow weary of Hadoopies claiming they discovered parallelism and that it destroys RDBMSs with its performance. Maybe it destroys MySQL. But Teradata, DB2, Greenplum, and others have had two or three kinds of parallelism for many years, refining and perfecting it. Hadoop has only one coarse-grained form of parallelism, with years of refinement to come. It should be appreciated for what it can do, not compared to RDBMSs that are 10-20 years ahead.

    These are good blogs. I’m always refreshed when I read your work.
    Cheers

  • 3 Merv Adrian   February 26, 2013 at 12:43 am

    Thanks, Dan – very much appreciate the kind words. It’s interesting to see how management as a value-add (per the first of the three posts) is such an obvious play, and why I believe it will be more important as a topic in 2013. I don’t need to tell you, as a vendor, that the things you charge people for are priced rather than free because they are worth paying for. And the more mainstream Hadoop gets, the more these issues will be part of the conversation.

  • 4 Thomas Ward Lynch   February 28, 2013 at 8:29 pm

    [...] Hadoop 2013 – Part Three: Platforms – Gartner Blog Network by Merv Adrian | February 23, 2013 | Submit a Comment. In the first two posts in this series, I talked about performance and projects as key themes in Hadoop’s … blogs.gartner.com/merv…/hadoop-2013-part-three-platforms/ [...]