Hadoop Distributions And Kids’ Soccer

By Merv Adrian | July 19, 2011 | 2 Comments

Yahoo!, OSS, Microsoft, IBM, Apache MapReduce, Apache Hadoop, Data and Analytics Strategies

The big players are moving in for a piece of the Big Data action. IBM, EMC, and NetApp have stepped up their messaging, in part to prevent startup upstarts like Cloudera from cornering the Apache Hadoop distribution market. They are all elbowing one another to get closest to “pure Apache” while still “adding value.” Numerous other startups have emerged, with greater or lesser reliance on, and extensions or substitutions for, the core Apache distribution. Yahoo! has found a funding partner and spun its team out, forming a new firm called Hortonworks, whose claim to fame begins with an impressive roster responsible for much of the code in the core Hadoop projects. Think of the Dr. Seuss children’s book featuring that famous elephant, and you’ll understand the name.

While we’re talking about kids – ever watch young kids play soccer? Everyone surrounds the ball. It takes them years to learn their positions on the field and play accordingly. There are emerging alphas, a few stragglers on the sidelines hoping for a chance to play, community participants – and a clear need for governance. Tech markets can be like that, and with 1600 attendees packing late June’s Hadoop Summit event, all of those scenarios were playing out: leaders, new entrants, and the big silents, like the absent Oracle and Microsoft.

The ball is indeed in play; the open source Apache Hadoop stack today boasts “customers” among numerous Fortune 500 companies, running critical business workloads on Hadoop clusters constructed for data scientists and business sponsors – and very often with little or no participation by IT and the corporate data governance and enterprise architecture teams. Thousands of servers, multiple petabytes of data, and growing numbers of users are increasingly common.

Community Growth – Chasing the Ball

At Hadoop Summit 2011, the community itself was in many ways the star – big, diverse, attentive, engaged, tweeting up a storm. Sessions were packed; content, dense and technical, was well received and understood; deep inquiry was the order of the day. Sessions on YouTube are here.

Sponsorship was substantial – 26 sponsoring companies did continuous, deeply engaged business throughout. Several told me, when I asked, “Yes, our inquiries here are from people with money to spend.” Many were hiring, and inboxes (physical ones, on their display tables – how quaint!) were visibly bulging. Eyes were on the ball, and players were swarming.

Before the conference began, Cloudera, founded in 2009 by its own team of early Hadoop contributors including early author Doug Cutting and Amr Awadallah, a former VP of engineering at Yahoo!, announced a new release, version 3.5, of its distribution. This is not to be confused with Cloudera Enterprise, a subscription service that adds a management suite and production support. Cloudera, which can take a great deal of pride in its evangelism, training and nurturing of the community to date, has maintained its OSS street cred, contributing significantly to Apache Hadoop, and has garnered more than 100 paid clients to add to all the free download users.

Cloudera’s focus with this release is on more than just the ball (the core distribution components and other surrounding “projects” – more on that below); it adds some important new capabilities that reflect continuing maturation of its game plan:

  • Service and Configuration Manager (SCM) Express: a free, GUI-based offering for creating a cluster. It permits you to configure and start HDFS, MapReduce, and HBase without logging into each server individually.
  • Cloudera Management Suite: a collection of software tools including a full-featured version of SCM, a resource manager and an activity monitor that allows operators to watch jobs, compare performance with prior runs, and identify bottlenecks.
  • An Authorization Manager featuring “One-Click Security” – a capability that unifies what are sometimes differing security models in different components, including management of rights for users and groups.

Cloudera reaffirmed that its distribution remains 100% Apache software-licensed. In this, it joins IBM, whose endorsement of Apache was a key element of its discussion several weeks ago at a launch event for its own InfoSphere BigInsights.

Whose Team Is This, Anyway?

The Hadoop Summit event itself started with a bang. Jay Rossiter, SVP of the Cloud Platform Group at Yahoo!, detailed company-wide uses such as fraud and spam detection, ad targeting, geotagging and local indexing, search assist, aggregation and categorization of news stories, and predictive analytics. He introduced Eric Baldeschwieler, formerly VP of software engineering for the Hadoop team, who talked about where Hadoop came from and where it’s going. Then Baldeschwieler took his free kick: he has been named CEO of Hortonworks, funded with investments from Yahoo! and Benchmark Capital. Hortonworks is described as “an independent company consisting of key architects and core contributors;” Baldeschwieler reminded the crowd that Yahoo! is the primary contributor to, and one of the leading users of, Apache Hadoop.

Hortonworks’ strong relationship with one of the largest working – thriving – installations gives it a great opportunity to test its continuing contributions. This is a critical advantage. As an offering specifically pointed at large-scale deployments, any distribution of Hadoop, with whatever API-compatible but not yet committed additions, changes and associated projects such as HBase, Hive, Zookeeper, Pig, Flume, Sqoop, Oozie and others, will require substantial integration testing. Baldeschwieler noted in conversation with me later that there are more than 1,000 active, sophisticated users of Apache Hadoop on the Yahoo! grid; testing large workloads will be much more easily accomplished there.

In general, Hortonworks scored a PR coup here – despite the “we’re all happy members of the Apache Hadoop community” messaging, the floor was theirs until the track sessions began, and they were well-represented there too. Nobody loses at kids’ soccer either – but somebody wins. The most prominent messaging to a dedicated community of 1600 advocates for the entire morning came from the new team on the block.

That’s not to say other voices were not heard. IBM’s Anant Jhingran entertained the crowd with some “Watson on Jeopardy” clips (Hadoop was a sizable component of Watson’s software stack), then waxed philosophical about how the community would develop as the “birthers” (present at the creation) began to increasingly interact with the “adopters” – those who come along later. His blog gives a quick summary. Key point: IBM’s ringing pitch for having as few distributions as possible, recommending that everyone focus on supporting and contributing to Apache’s. I had heard this from them a few weeks earlier at their analyst event, and in response to a question from the floor about IBM’s distribution, he could not have been more clear in reiterating it: “We’re not in the distribution business.”

NetApp Wants to Play Too, And Others Tried Out

NetApp, not to be outdone by EMC, which announced its Hadoop offerings in May at EMC World, took its own run at the ball. NetApp wanted everyone to know that it has been a strong supporter of open source, involved in FreeBSD, Linux, and NFS – and that it wants to play in the Apache Hadoop community too. Fair enough; they’re not alone, and they have a shiny new partnership with Hortonworks to show off.

Does NetApp have a play? Of course. Every soccer team needs a goalie; every Hadoop project needs storage – lots of storage. But goalies don’t score, so they don’t win games. They are necessary, but not sufficient. NetApp hopes to prove they can qualify for more roles, but they will need more offerings, and they will have competitors, new and old. Credit to them for trying out for the team. Other hopefuls:

  • HStreaming – Announced the launch of “the most scalable real-time data processing platform powered by Apache Hadoop.”
  • Karmasphere – Announced an “all-in-one virtual appliance for building Hadoop applications.” Also, a partnership with Think Big Analytics, a consultancy that shows the health of the emerging ecosystem.
  • MapR Technologies – EMC’s partner unveiled two software editions: MapR M3 (free for an unlimited number of nodes) and MapR M5. MapR adds direct NFS connection and some lofty performance goals. While the Hadoop File System (HDFS) can handle 10-50PB, MapR can handle 1010 exabytes. While HDFS can handle some 4000 nodes [edited 7/19 from 2000] in a cluster, MapR can handle 10,000 or more. It also stresses its elimination of the namenode single point of failure problem, a real advantage that is likely to be relatively short-lived.

Much of the action is around swapping in pieces to supplement stack components from the Apache Hadoop core. Most typically, we see replacements or supplements for HDFS. DataStax’ Brisk distribution brings Apache Cassandra into the mix; MapR takes its own approach, and others including HBase and Hadapt are in the mix as well. But another player or two are tackling things at the upper end of the stack:

  • Platform Computing – a mature, well-established firm in the financial markets, where risk applications require leading-edge speed and reliability, announced Platform MapReduce, the “first enterprise-class distributed runtime engine for MapReduce applications.” Targeting the high end of the stack – the Hadoop job and task scheduler itself – Platform is hoping to parlay its $70M presence into helping its clients as they examine the new game in town.
  • Pervasive DataRush for Hadoop seeks to extend Pervasive’s successful franchise in high-performance parallel computing into the Hadoop ecosystem. Also in the game for a long time, Pervasive, a $50M firm, recently received patents it filed for over 6 years ago for its dataflow architecture that exploits hardware parallelism with relatively low memory usage.

Who’s On The Team?

Overall, “who’s on the team” seemed to be a key issue for the presenters, who targeted what IBM’s Jhingran called the “birthers” by stressing their pedigree. MapR has a significant relationship with EMC, but the latter lacks the street cred of Cloudera. EMC’s contributions to the Apache Hadoop projects are negligible, and it talks of using Facebook’s version; it has few if any contributing engineers. EMC’s large sales and marketing presence and its Isilon storage line, purpose-built for large, unstructured storage, could trump that.

Hortonworks clearly wins the battle of engineering contributors, and has the added advantage of a testbed environment in its part owner Yahoo!, but it’s new, and has little differentiating product – yet. In my meeting with the executive team, Baldeschwieler was clear about the early days. The first offerings will be training and support. As their roadmap plays out, they have a lot of funded talent to work on advancing Apache Hadoop, including many of the committers on the projects that compose the core stack. Certainly a commercial business provides a focus for buyer demands. Still, I believe that having several competing vendors in the space may make resolving competing priorities a challenge moving forward.

I had the opportunity to exchange messages with Hadapt’s Chief Scientist Daniel Abadi during the event. He was unable to attend, being occupied with bringing Hadapt’s own more database-like relational engine and query optimization to market. In Twitter exchanges during the conference, we had both pointed out that the kids’ soccer-like “everybody here is a winner” message was disingenuous, and that of course these players are all competitors. Abadi noted that “vendors who incorporate Hadoop into their solution stack (such as Hadapt, Datameer, or Karmasphere) will breathe easier because Hortonworks is going to make the Apache distribution of Hadoop much better.” And Julian Hyde, of Mondrian and SQLStream fame, who has considerable experience with developer community dynamics from his work on Eigenbase, was very supportive of the Apache governance model in a conversation with me later.

That governance will be critical for the future. Other Apache and non-Apache projects, like HBase, Hive, Zookeeper, Pig, Flume, Sqoop, Oozie, et al., all have their own agendas. In Apache locution, each has its own “committers” – owners of the code lines – and the task of integrating disparate pieces – each on its own timeline – will fall to somebody. Will your distribution owner test the combination of the particular ones you’re using? If not, that will be up to you. One of the biggest barriers to open source adoption so far has been precisely that degree of required self-integration. Gartner’s second half 2010 open source survey showed that more than half of the 547 surveyed organizations have adopted OSS solutions as part of their IT strategy. Data management and integration is the top initiative they name; 46% of surveyed companies named it. This is where the game is.


  • Jeff Kibler says:

    Merv –

    You correctly use the analogy of a child’s game of soccer. There are a few key elements that you omitted, and I’m curious on your take.

    1) Who’s kidding themselves by being in this game? There’s really no reason for some being even on the field.

    2) What other games are being played? Hadoop is probably the most well-known and accepted Open-Source offering for very large data storage and retrieval. If all players aren’t careful, I speculate that other games (even from different sports) will most likely emerge to rival Hadoop.

    I look forward to your thoughts on these two questions.

    • Merv Adrian says:

      Good questions, Jeff. Plenty of vendors will get a slice of the pie (sorry to mix metaphors). At the event there were software vendors, hardware vendors, services players – all of whom can generate some revenue helping people who want to work with Big Data. I’d imagine Infobright will find that its customers want to know how you play here too – especially with data they want to leave in file systems and combine with data you’re managing for them.

      What alternatives are out there? That’s a broader topic – other very nascent approaches are already on the horizon, but that was not the topic here. We’re watching, and as they bubble up enough, we’ll have things to say about them. The point of this post was that the Hadoop phenomenon continues apace; you’ll see a little more on that in the upcoming Hype Cycle for Enterprise Information Management.