Merv Adrian

A member of the Gartner Blog Network

Merv Adrian
Research VP
1 year with Gartner
30 years in IT industry

Merv Adrian is an analyst following database and adjacent technologies as extreme data transforms assumptions about what to persist as well as when, where and how. He also watches the way the software/hardware boundary…

Strata Spark Tsunami – Hadoop World, Part One

by Merv Adrian  |  October 31, 2014  |  5 Comments

New York’s Javits Center is a cavernous triumph of form over function. Giant empty spaces were everywhere at this year’s empty-though-sold-out Strata/Hadoop World, but the strangely-numbered, hard to find, typically inadequately-sized rooms were packed. Some redesign will be needed next year, because the event was huge in impact and demand will only grow. A few of those big tent pavilions you see at Oracle Open World or Dreamforce would drop into the giant halls without a trace – I’d expect to see some next year to make some usable space available.

So much happened, I’ll post a couple of pieces here. Last year’s news was all about promises: Hadoop 2.0 brought the promise of YARN enabling new kinds of processing, and there was promise in the multiple emerging SQL-on-HDFS plays. The Hadoop community was clearly ready to crown a new hype king for 2014.

This year, all that noise had jumped the Spark.

If you have not kept up, Apache Spark bids to supplement MapReduce with a more general purpose engine, combining interactive processing and streaming along with MapReduce-like batch capabilities, leveraging YARN to enable a new, much broader set of use cases. (See Nick Heudecker’s blog for a recent assessment.) It has a commercializer in Databricks, which has shown great skill in assembling an ecosystem of support from a set of partners who are enabling it to work with multiple key Hadoop stack projects at an accelerating pace. That momentum was reflected in the rash of announcements at Hadoop World, across categories from Analytics to Wrangling (couldn’t come up with a Z). There were more than I’ll list here – their vendors are welcome to add themselves via comments, and I’ll curate this post for a while to put them in.

Hadoop analytics pioneer Platfora announced its version 4.0 with enhanced visualizations, geo-analytics capabilities and collaboration features, and revealed it has “plans for integration” with Spark.

Tableau was a little more ready, delivering a beta version of its Spark Connector, claiming its in-memory offering delivered up to 100x the performance of Hadoop MapReduce. Tableau is also broadening its ecosystem reach, adding a beta version of its connector for Amazon EMR, and support for IBM BigSQL and MarkLogic.

Tresata extended the analytics wave to analytic applications, enhancing its customer intelligence management software for financial data by adding real-time execution of analytical processes using Spark. Tresata is an early mover, and believes one of its core advantages derives from having been architected to run entirely in Hadoop early on. It supports its own data wrangling with Automated Data Ontology Discovery and entity resolution – cleaning, de-duping, and parsing data.

(For developers, Tresata is also open sourcing Scalding-on-Spark – a library that adds support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark.)

Appliances were represented by Dell, which introduced a new in-memory box (one of many Hadoop appliances that represented another 2014 trend) that integrates Spark with Cloudera Enterprise. (Dell is all in on the new datastores – they have built architectures with Datastax for Cassandra, and with MongoDB, as well.)

Cloud was brought to the party by BlueData, packaging Spark with its EPIC™ private-cloud deployment platform. Standalone Spark clusters can run Spark-Scala, MLLib or SparkSQL jobs against data stored in HDFS, NFS and other storage. Note “standalone” – Spark can, and will, be used by shops that are not running Hadoop. Once it is actually running production jobs, that is.

Rackspace is in both games with its OnMetal – an appliance-based cloud you don’t have to own, with a high-performance design using 3.2 TB per data node. They provision the other services. Rackspace is partnering with Hortonworks to deliver HDP 2.1 or – you guessed it – Spark. This is all built on a thin virtualization layer on another emerging hot platform: Openstack.

The distributions were represented, of course: Cloudera jumped in back in February, accompanied by strong statements from Mike Olson that helped put it on the map. Hortonworks followed in May with a tech preview. It still is in preview – Hortonworks, for good reasons, is not quite prepared to call it production-ready yet. Pivotal support was announced in May – oddly, in the Databricks blog, reflecting its on-again, off-again marketing motions. In New York, MapR, on the bandwagon since April as well, announced that Drill – itself barely out of the gate – will also run on Spark.

It was intriguing to note that many of the emerging data wrangling/munging/harmonizing/preparing/curating players started early. ClearStory CEO Sharmila Mulligan was quick to note during her keynote appearance that her offering has been built on Spark from the outset. Paxata, another of the new players with a couple of dozen licensed customers already, has also built its in-memory, columnar, parallel enterprise platform on top of Apache Spark. It connects directly to HDFS, RDBMS, and web services like SalesForce.com and publishes to Apache Hive or Cloudera Impala. Trifacta, already onto its v2, has now officially named its language Wrangle, added native support for more complex data formats, including JSON, Avro, ORC and Parquet, and yes, is focusing on delivering scale for its data transformation through native use of both Spark and MapReduce.

Even the conference organizers got into the act. O’Reilly has made a big investment with Cloudera to make Strata a leading conference. It’s added a European conference, making Doug Cutting the new conference Chair. In New York, O’Reilly announced a partnership with Databricks for Spark developer certification, expanding the franchise before someone else jumps in.

There is far more to come from Spark: a memory-centric file system called Tachyon that will add new capabilities above today’s disk-oriented ones; the MLlib machine learning library that will leverage Spark’s superior iterative performance; GraphX for the long-awaited graph performance that today is best served by commercial vendors like Teradata Aster; and of course, Spark Streaming. But much of that is simply not demonstrably production-ready just yet – much is still in beta. Or even alpha. We’ll be watching. For now, it’s the new hype king.


Hadoop Is A Recursive Acronym

by Merv Adrian  |  October 13, 2014  |  3 Comments

Hopefully, that title got your attention. A recursive acronym – the term first appeared in the book Gödel, Escher, Bach: An Eternal Golden Braid and is likely more familiar to tech folks who know GNU – is self-referential (as in “GNU’s Not Unix”). So how did I conclude Hadoop, whose name origin we know, fits the definition? Easy – like everyone else, I’m redefining Hadoop to suit my own purposes.

Let’s start with the obvious one. Of course, Doug Cutting named Hadoop after his child’s toy elephant, seen here.

Photo: Merv Adrian


And in its early days, as I discussed in my post about the changing composition of distributions a few months back, the story was simpler. Hadoop was HDFS, MapReduce and some utilities. As those utilities got formalized and became projects themselves and were supported by commercial distributors, the list grew: Pig, Hive, HBase, and Zookeeper were Hadoop too. And a few months ago, as I noticed, Accumulo, Avro, Cascading, Flume, Mahout, Oozie, Spark, Sqoop, and YARN had joined the list.

YARN is the one that really matters here, not just because it means the list of components will change, but because in its wake the list of components will change Hadoop’s meaning. YARN enables Hadoop to be more than a brute force, batch blunt instrument for analytics and ETL jobs. It can be an interactive analytic tool, an event processor, a transactional system, a governed, secure system for complex, mixed workloads. At Strata this week, we’ll talk about its integration with Red Hat’s middleware, its cautious alliance with Spark for MapReduce replacement, its alliance with data wrangling tools from startups and Teradata, its connection, via Sentry, to security stacks… and more.

So yes, many of us are redefining Hadoop as we add new pieces – new use cases, new projects that change its very nature. My answer to “What is Hadoop?”

Hadoop
And
Diverse
Other
Operating
Platforms

OK – it’s a bit cute. But hopefully, it got your attention. Hadoop’s journey is just beginning, and there is much more change ahead.


Guest Post: Are You Pouring Pollution into the Data Lake?

by Merv Adrian  |  October 10, 2014  |  Comments Off

From my esteemed colleague Mark Beyer

“Unstructured data” is a misnomer; everyone finally agrees on that much, at least. It is a term that is often applied to information assets that are not relational. Sometimes it is applied to machine data generated by operational technologies. Sometimes it is applied to content, like documents, or even less specific text such as email or Twitter feeds. The term unstructured creates fear, loathing and desire across the IT landscape. If it generates meaningful discussions about innovative processing or challenges the re-use of existing infrastructure without simply defaulting to a mindset of replacing components, then that is a good discussion. But that is about all the term is good for.

But I’ve been growing ever more fond of Semi-structured, though not for the reason that many who are enamored of the term might assume. Semi actually means half, but that doesn’t mean it is halfway between structured and unstructured. That would be halfway between myth and reality, which is a “place” that doesn’t actually exist. Consider the vernacular American use of the term “semi-truck”, which would literally mean “half truck”. But, of course, that isn’t what semi-truck means. A truck vehicle, referred to properly as the “tractor”, is “half” of the truck and the trailer is the other half of the truck. Separately, neither can move cargo. When you put them together, you have two halves that make a whole and create a complete, useful delivery vehicle. So, a semi-truck is actually made up of a semi-tractor and a semi-trailer, which make, simply, a whole truck.

Now consider semi-structured data. What it actually means is that “half” of the governance and schema instructions (defined as physical and logical definition plus applied governance) are in the data and the other half are in the using application. Semi-structured data means that you MUST write an application to complete the schema. That also means that with each different application, you can impose different governance and schema instructions (also the processing rules), theoretically making the data more flexible.

However, with the introduction of this capability comes the danger of having disparate governance rules regarding the same data that may be so significant, it becomes different data. As data moves further away from containing its own schema, then flexibility increases and more and more of the entire schema must be imposed by the using application—and ever greater diversity is introduced. Of course application developers think this is a grand idea, but users of that data in a second environment are not equally fond of it.
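To make the “two halves” idea concrete, here is a minimal Python sketch. Everything in it is an assumption made up for illustration: the record, its field names, and the two reader functions do not come from any real system. The data carries its field names, but each application has to supply the rest of the schema (types, units, date convention), and two applications that supply it differently end up with what is effectively different data.

```python
import json

# A semi-structured record: the data carries field names (half the schema),
# but not the types, units, or rules for interpreting the values.
# The record and its field names are hypothetical.
RAW = '{"customer": "ACME", "amount": "1200", "date": "03/04/2014"}'


def billing_view(raw):
    """One application completes the schema: amount is in cents, date is MM/DD/YYYY."""
    rec = json.loads(raw)
    month, day, year = rec["date"].split("/")
    return {
        "customer": rec["customer"],
        "amount_usd": int(rec["amount"]) / 100,
        "date": f"{year}-{month}-{day}",
    }


def reporting_view(raw):
    """A second application completes it differently: dollars, DD/MM/YYYY."""
    rec = json.loads(raw)
    day, month, year = rec["date"].split("/")
    return {
        "customer": rec["customer"],
        "amount_usd": float(rec["amount"]),
        "date": f"{year}-{month}-{day}",
    }


if __name__ == "__main__":
    # Same bytes in the lake, two incompatible answers out of it.
    print(billing_view(RAW))
    print(reporting_view(RAW))
```

Neither reader is wrong on its own terms; the disagreement only shows up when someone in a second environment tries to reconcile the two results.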

In a data lake, we drop all of these various degrees of structure into a common pool or body. Each asset is deposited with a different level of dilution regarding the schema instructions, thus requiring an oceanographer’s level of expertise to determine when the data in a data lake has “high alkalinity” or “high acidity” or too much “copper” and so on. A data scientist who understands the “alchemy” that exists in the fetid waters of a polluted lake will have no trouble cleaning the data up and discerning pollutants from di-hydrogen oxide (that’s water). But a novice or dilettante may find that they are drinking polluted data water from the lake long after they are infected with the equivalent of data E. coli and are making continuous visits to the data latrine.

To be completely fair, comparing the waters of a lake to the Data Lake requires some departure from the metaphor, but there are further extensions of it that also work (other than those obnoxious points I raise above). For example, a lake that is crystal clear somewhere in the mountains of upstate New York would be a dead lake. There would be no fish, no microscopic organisms, no snakes, no mosquitos flying over its surface. It would be crystal clear and completely dead. So, if you desire a completely clean lake, you will find a sanitized data environment that defeats the entire purpose of storing information in something near its native form. You want a vibrant, living data lake. But you must keep the trucking metaphor in mind and remember that unstructured data does not exist. Instead, semi-structured data, which intentionally has only SOME of the schema embedded in the information asset, requires a knowledge of how to swim. And boats (tools for exploring the data lake) don’t really work either: anyone who cannot swim must be very careful to NOT fall out of the boat. But anyone who can swim can safely use a boat and be confident that if they fall over the side they can make it back to shore.

Data lakes are for data scientists to conduct science; they are not for casual analytics users or even advanced business analysts who generally scuff their code-writing and script-writing toes on ignorance. But those data miners who understand business process and systems analysis, or those data scientists who understand model theory and statistical primitives? Well, they can swim all day. So, buy a boat (a tool), explore away, and be comfortable in the knowledge that some of the structure is waiting for you in the lake, but not all of it. Your job is to figure out the rest, including sometimes reaching over the side of the boat and getting a little wet, because you never know when you will fall in.

Merv’s comment: this has been a topic of some discussion on our team. My own view is very close to Mark’s and is expressed in my closing slide from the Hadoop Summit 


Satya Nadella on Mobility: It’s Personal

by Merv Adrian  |  October 9, 2014  |  1 Comment

At Gartner Symposium, Drue Reeves and I had the opportunity to interview Microsoft CEO Satya Nadella. Here’s a brief clip from the closing. I’m summarizing and Satya, passionate as he was throughout the conversation, lays out his vision about mobility that crosses the personal and professional: mobility of the individual and the app experiences. “Have my work and life wherever – that’s the true form of mobility.”


Satya Nadella at Symposium – What to Look For

by Merv Adrian  |  October 2, 2014  |  Comments Off

It’s rare that one gets the chance to talk to a new megavendor CEO in his first year on the job – especially in front of 10,000 senior IT professionals. But that is the opportunity Drue Reeves and I have on Tuesday, October 7 in Gartner’s Mastermind interview.

What have we got in mind? Enterprise IT questions. We won’t talk much about Xbox or Bing. But we do plan to ask Satya:

  • How he is driving a culture of innovation, and in what direction
  • What Windows 10 means to IT, and what they should do about it – and when
  • How mobility is changing end user experiences – and what Microsoft is doing to get us there
  • How the cloud impacts usage, data centers, and architects’ budgets
  • What impact the Internet of Things will have on Microsoft – and us

We’ll have a lightning round featuring questions from Twitter to liven things up – though I expect we won’t struggle with liveliness. If you want to suggest questions, send them to us with the Twitter hashtag #GartnerSymAskSatya, email us at Gartner, or leave a comment here. We’ll do our best to keep up with them all….

Hope to see you there.


Microsoft’s Portfolio – A Formidable Mix

by Merv Adrian  |  September 20, 2014  |  2 Comments

For the past few months, I’ve been Gartner’s Vendor Lead for Microsoft. For some 30 vendors, we assign a single analyst to act as a focal point for coordinating across the 1000 analysts we have when research covers that vendor.

In Microsoft’s case, that has proven to be fascinating – we have some 3 dozen Magic Quadrants alone that have been published about their offerings in the last 15 months or so. As Vendor Lead, I’m a mandatory peer reviewer for those and other documents. For my own edification, I decided to map the Magic Quadrants that feature Microsoft onto a quadrant that shows where Microsoft appears in that piece of research. The results are intriguing.

Microsoft has a sizable number of Leader offerings, but many in the Challenger quadrant, and a few that appear in the Niche Player quadrant as well. It’s a bit rarer for them to appear as Visionaries – if they have it figured out, their ability to execute tends to drive them up into the Leader space fairly quickly.

The chart shows places where Microsoft clearly needs to focus, and makes it clear that they play in numerous markets of interest to Gartner’s enterprise IT-focused audience. Many categories do not appear, and a half dozen MQs are currently in process. I’ll keep this up to date for myself, and occasionally will share it here.

[Chart: Magic Quadrants featuring Microsoft, mapped by the quadrant in which Microsoft appears]


Hadoop Investments Continue: Teradata, HP Jockey For Position

by Merv Adrian  |  July 24, 2014  |  5 Comments

Interest from the leading players continues to drive investment in the Hadoop marketplace. This week Teradata made two acquisitions – Revelytix and Hadapt – that enrich its already sophisticated big data portfolio, while HP made a $50M investment in, and joined the board of, Hortonworks. These moves continue the ongoing effort by leading players. 4 of the top 5 DBMS players (Oracle, Microsoft, IBM, SAP and Teradata) and 3 of the top 7 IT companies (Samsung, Apple, Foxconn, HP, IBM, Hitachi, Microsoft) have now made direct moves into the Hadoop space. Oracle’s recent Big Data Appliance and Big Data SQL, and Microsoft’s HDInsight represent substantial moves to target Hadoop opportunities, and these Teradata and HP moves mean they don’t want to be left behind.

Teradata begins its moves with Revelytix. Andrew White noted in Gartner’s 2013 Cool Vendors in Information Infrastructure and Big Data that Revelytix’ “Loom, which runs in Hadoop, classifies objects in the Hadoop Distributed File System and applies a predefined transformation so that objects become structured and more usable for data scientists.” In our discussions of the Logical Data Warehouse, Gartner has targeted the capabilities Revelytix was designed to provide as being on the critical path to creating a coherent, optimized metadata architecture that will incorporate both traditional Enterprise DWs and Hadoop – a direction our research shows the advanced users are heading in.

In the 2012 edition of Cool Vendors, I described Teradata’s other acquisition, Hadapt, defining its vision as a Postgres-based “RDBMS instance on every node in the cluster in order to improve performance of queries over the structured part of the data, and … data partitioning techniques to eliminate unnecessary data movement.” Admirable as it was, this vision had not generated much business, and the window for additional SQL-on-Hadoop offerings may be closing – but Teradata has acquired technology and engineering talent that it will put to use supplementing its continuing optimization of Teradata SQL and SQL-H across complex logical data fabrics. The Hadapt team joins Teradata, though the brand will disappear.

HP chose to make a direct investment in Hortonworks, which extended its last funding round, closed months ago, to accept an additional $50M. The oddity of these mechanics aside, HP gets significant impact for its money: Martin Fink, its CTO, joins the board. HP will integrate the Hortonworks Data Platform (HDP) into its HAVEn offering, invest resources to certify its Vertica column-store analytic DBMS with HDP, and provide first-line support. Hortonworks gets access to the global HP channel, which could provide a major boost to its sales capabilities. HP was already a reseller, but it has also been partnering with MapR for some time, and this relationship does not end that one. HP gets access to a leader in the continuing development of Apache Hadoop, and it’s likely that the relationship will expand as the two decide what their roadmap will be.

Increasingly, the players are marshaling their forces for global competition, global sales and support, and increased integration with enterprise-class architectures. These moves will hardly close this round of the maneuvering – it will be interesting to see what comes next.


Microsoft Turns the ‘Scope on SaaS

by Merv Adrian  |  July 16, 2014  |  4 Comments

One of the more interesting conversations I had at the Microsoft Worldwide Partners Conference this week concerned an initiative they have launched to help IT understand – and get under control – proliferating ungoverned SaaS applications. Brad Anderson, Corporate VP for Cloud and Mobility, told the 16,000 attendees that enterprises need help. “We ask them how many SaaS apps they have in their environment and they usually tell us 30-40. We audit with the Cloud App Discovery tool and find, on average, over 300.” And are these managed? One can only imagine…

The tool is in preview now, and a link to try it out for free is provided in Microsoft’s blog. It offers more than discovery – it will permit managers to monitor usage, identify users, integrate apps into Azure Active Directory, and more.

This is part of a larger story about governance and optimization in a hybrid cloud- and on-premises world that enterprises will live in for this decade and the next. Anderson also pointed out that 3.1M smartphones were stolen and another 1.4M lost. How many of these had corporate data on them? Would you know if it happened to one of your users? Can you govern access to corporate data in the apps there, and prevent it from being pasted into emails by someone who gets that phone and uses the saved logins to get at it? Some of these challenges can be handled by policy-based tools.

Getting the apps your users want into Azure, managing them there, and linking the on-premises Active Directory used by the overwhelming majority of enterprises to Azure Active Directory offers the possibility of getting corporate data security under better control before you find out how you look in orange. One of my favorite scenarios Microsoft showed its Enterprise Mobility Suite detecting is “impossible logins” – an hour ago you logged in from Australia and now you’re apparently in Chicago. Software can stop that? Yes.
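For the curious, the “impossible login” idea can be sketched in a few lines of Python. This is not how Enterprise Mobility Suite does it, which I have not seen; it is only a minimal illustration of the general technique, assuming each login carries a latitude, longitude and timestamp. The `Login` type, the 1,000 km/h speed ceiling and the sample coordinates are all my own assumptions.

```python
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KMH = 1000.0  # assumed ceiling, roughly airliner speed


@dataclass
class Login:
    user: str
    lat: float   # degrees
    lon: float   # degrees
    hour: float  # hours since an arbitrary reference point


def distance_km(a, b):
    """Great-circle distance between two logins (haversine formula)."""
    lat1, lon1, lat2, lon2 = map(radians, (a.lat, a.lon, b.lat, b.lon))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def is_impossible(prev, curr, max_kmh=MAX_PLAUSIBLE_KMH):
    """True if travelling from prev to curr would require exceeding max_kmh."""
    elapsed = curr.hour - prev.hour
    if elapsed <= 0:
        # Simultaneous or out-of-order logins from different places are suspect.
        return distance_km(prev, curr) > 0
    return distance_km(prev, curr) / elapsed > max_kmh


if __name__ == "__main__":
    sydney = Login("user1", -33.87, 151.21, 0.0)
    chicago = Login("user1", 41.88, -87.63, 1.0)  # one hour later
    print(is_impossible(sydney, chicago))
```

A real product would presumably also have to cope with VPNs, imprecise IP geolocation and shared accounts, none of which this sketch attempts.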

The context here was Microsoft telling its partners about the opportunities for them to sell these capabilities to customers – and it’s hard to imagine them not wanting to, especially with the incentives, certification, training and co-marketing efforts Microsoft is launching. Expect this to be a major theme, leveraging the power of the crown jewel that Active Directory is in the portfolio in many additional ways to come.


What Is Hadoop….Now?

by Merv Adrian  |  June 28, 2014  |  4 Comments

In February 2012, Gartner published How to Choose The Right Apache Hadoop Distribution (available to clients). At the time, the leading distributors were Cloudera, EMC (now Pivotal), Hortonworks (pre-GA), IBM, and MapR. These players all supported six Apache projects: HDFS, MapReduce, Pig, Hive, HBase, and Zookeeper. Things have changed.

[updated June 29] We included Datastax (a distributor of Apache Cassandra) then, but they did not, and still don’t, consider themselves part of the Hadoop ecosystem. And they are not alone in having a reductive view of the answer to the question What Is Hadoop? Doug Cutting, pioneer in creating it and Chief Architect at Cloudera and former president of the Apache Software Foundation, considers the Hadoop Project to be HDFS, MapReduce and some common utilities. He made that point clear during a panel of luminaries my colleague Nick Heudecker conducted recently – the video is linked to Nick’s blog here. Everything else is “related projects.” Arun Murthy of Hortonworks, who has driven the creation of YARN, prefers to say that HDFS and YARN are “kernel” now, likening the description to the way most of us think of Linux. The Apache page continues to use the older description, including HDFS, MapReduce and YARN. (June 29, 2014)

To users, and especially buyers, the definition is more expansive. Hadoop is what they use to compose a useful stack of software to execute a business process of some sort. And distributors agree: in a little over two years, the set of projects included in all commercial distributions has grown from six to fifteen – two and a half times as many. The list now also includes Accumulo, Avro, Cascading, Flume, Mahout, Oozie, Spark, Sqoop, and YARN.

Others are likely to join this stack long before the next two years are up: the candidates include Falcon, Knox, Giraph, Hue, Lucene, Storm, Tez, and others. Hadoop has moved from a coarse-grained blunt instrument for largely ETL-style workloads to an expanding stack for virtually any IT task big data professionals will want to undertake. What is Hadoop now? It’s a candidate to be the alternative universe for data processing, with over 20 components that span a wide array of functions. More money continues to flow into the ecosystems, more companies form, more programmers take up the challenges, and the big players are scrambling to get aboard the train.

What is Hadoop?

It’s what’s next.


Hadoop is in the Mind of the Beholder

by Merv Adrian  |  March 24, 2014  |  11 Comments

This post was jointly authored by Merv Adrian (@merv) and Nick Heudecker (@nheudecker) and appears on both blogs.

In the early days of Hadoop (versions up through 1.x), the project consisted of two primary components: HDFS and MapReduce. One thing to store the data in an append-only file model, distributed across an arbitrarily large number of inexpensive nodes with disk and processing power; another to process it, in batch, with a relatively small number of available function calls. And some other stuff called Commons to handle bits of the plumbing. But early adopters demanded more functionality, so the Hadoop footprint grew. The result was an identity crisis that grows progressively more challenging for decisionmakers with almost every new announcement.

This expanding footprint included a sizable group of “related projects,” mostly under the Apache Software Foundation. When Gartner published How to Choose the Right Apache Hadoop Distribution in early February 2012 the leading vendors we surveyed (Cloudera, MapR, IBM, Hortonworks, and EMC) all included Pig, Hive, HBase, and Zookeeper. Most were willing to support Flume, Mahout, Oozie, and Sqoop. Several other projects were supported by some, but not all. If you were asked at the time, “What is Hadoop?” this set of ten projects, the commercially supported ones, would have made a good response.

In 2013, Hadoop 2.0 arrived, and with it a radical redefinition. YARN muddied the clear definition of Hadoop by introducing a way for multiple applications to use the cluster resources. You have options. Instead of just MapReduce, you can run Storm (or S4 or Spark Streaming), Giraph, or HBase, among others. The list of projects with abstract names goes on. At least fewer of them are animals now.

During the intervening time, vendors have selected different projects and versions to package and support. To a greater or lesser degree all of these vendors call their products Hadoop – some are clearly attempting to move “beyond” that message. Some vendors are trying to break free from the Hadoop baggage by introducing new, but just as awful, names. We have data lakes, hubs, and no doubt more to come.

But you get the point. The vague names indicate the vendors don’t know what to call these things either. If they don’t know what they’re selling, do you know what you’re buying? If the popular definition of Hadoop has shifted from a small conglomeration of components to a larger, increasingly vendor-specific conglomeration, does the name “Hadoop” really mean anything anymore?

Today the list of projects supported by leading vendors (now Cloudera, Hortonworks, MapR, Pivotal and IBM) numbers 13. Today it’s HDFS, YARN, MapReduce, Pig, Hive, HBase, Zookeeper, Flume, Mahout, Oozie, Sqoop – and Cascading and HCatalog. Coming up fast are Spark, Storm, Accumulo, Sentry, Falcon, Knox, Whirr… and maybe Lucene and Solr. Numerous others are only supported by their distributor and are likely to remain so, though perhaps MapR’s support for Cloudera Impala will not be the last time we see an Apache-licensed, but not Apache, project break the pattern. All distributions have their own unique value-add. The answer to the question, “What is Hadoop?” and the choice buyers must make will not get easier in the year ahead – it will only become more difficult.
