by Merv Adrian | October 13, 2014 | 2 Comments
Hopefully, that title got your attention. A recursive acronym – the term first appeared in the book Gödel, Escher, Bach: An Eternal Golden Braid and is likely more familiar to tech folks who know GNU – is self-referential (as in “GNU’s Not Unix”). So how did I conclude Hadoop, whose name origin we know, fits the definition? Easy – like everyone else, I’m redefining Hadoop to suit my own purposes.
Let’s start with the obvious one. Of course, Doug Cutting named Hadoop after his child’s toy elephant, seen here.
Photo: Merv Adrian
And in its early days, as I discussed in my post about the changing composition of distributions a few months back, the story was simpler. Hadoop was HDFS, MapReduce and some utilities. As those utilities got formalized and became projects themselves and were supported by commercial distributors, the list grew: Pig, Hive, HBase, and Zookeeper were Hadoop too. And a few months ago, as I noticed, Accumulo, Avro, Cascading, Flume, Mahout, Oozie, Spark, Sqoop, and YARN had joined the list.
YARN is the one that really matters here: it doesn’t just mean the list of components will change; in its wake, that changing list will change Hadoop’s meaning. YARN enables Hadoop to be more than a brute-force, batch-oriented blunt instrument for analytics and ETL jobs. It can be an interactive analytic tool, an event processor, a transactional system, a governed, secure system for complex, mixed workloads. At Strata this week, we’ll talk about its integration with Red Hat’s middleware, its cautious alliance with Spark for MapReduce replacement, its alliance with data wrangling tools from startups and Teradata, its connection, via Sentry, to security stacks… and more.
So yes, many of us are redefining Hadoop as we add new pieces – new use cases, new projects that change its very nature. My answer to “What is Hadoop”?
OK – it’s a bit cute. But hopefully, it got your attention. Hadoop’s journey is just beginning, and there is much more change ahead.
Category: Accumulo Apache Apache Yarn Big Data Cascading Flume Gartner Hadoop Hbase HDFS Hive Mahout MapReduce Oozie Pig Spark Sqoop Teradata YARN Zookeeper Tags: Apache, Flume, Gartner, Hadoop, Hbase, HDFS, Hive, MapReduce, Oozie, Pig, Sqoop, Teradata, zookeeper
by Merv Adrian | October 10, 2014 | Submit a Comment
From my esteemed colleague Mark Beyer
“Unstructured data” is a misnomer – everyone finally agrees on that much, at least. It is a term that is often applied to information assets that are not relational. Sometimes it is applied to machine data generated by operational technologies. Sometimes it is applied to content, like documents, or even less specific text such as email or Twitter feeds. The term unstructured creates fear, loathing and desire across the IT landscape. If it generates meaningful discussions about innovative processing or challenges the re-use of existing infrastructure without simply defaulting to a mindset of replacing components, then that is a good discussion. But that is about all the term is good for.
But I’ve been growing ever more fond of “semi-structured,” though not for the reason that many who are enamored of the term might assume. Semi actually means half, but that doesn’t mean it is halfway between structured and unstructured, because that would be halfway between myth and reality, which is a “place” that doesn’t actually exist. Consider the vernacular American term “semi-truck,” which taken literally would mean “half truck.” But, of course, that isn’t what semi-truck means. The truck vehicle – referred to properly as the “tractor” – is “half” of the truck, and the trailer is the other half. Separately, neither can move cargo. When you put them together, you have two halves that make a whole and create a complete, useful delivery vehicle. So, a semi-truck is actually made up of a semi-tractor and a semi-trailer, which together make, simply, a whole truck.
Now consider semi-structured data. What it actually means is that “half” of the governance and schema instructions (defined as physical and logical definition plus applied governance) are in the data and the other half are in the using application. Semi-structured data means that you MUST write an application to complete the schema. That also means that with each different application, you can impose different governance and schema instructions (also the processing rules), theoretically making the data more flexible.
However, with the introduction of this capability comes the danger of having disparate governance rules for the same data, with differences so significant that it effectively becomes different data. As data moves further away from containing its own schema, flexibility increases, more and more of the entire schema must be imposed by the using application, and ever greater diversity is introduced. Of course application developers think this is a grand idea, but users of that data in a second environment are not equally fond of it.
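To make the “half in the data, half in the application” idea concrete, here is a minimal Python sketch. The record, the field names, and both “applications” are hypothetical and invented purely for illustration; they do not come from any real system. The record carries its own field names, but each application supplies the rest of the schema (types, units, date format) and ends up with different data.

```python
import json
from datetime import datetime

# One raw record as it might sit in a data lake. The field names travel
# with the data; the types, units, and date convention do not.
RAW_RECORD = '{"customer": "ACME", "amount": "1200", "date": "03/04/2014"}'


def billing_view(raw):
    """Hypothetical application A: amount is dollars, date is month/day/year."""
    rec = json.loads(raw)
    return {
        "customer": rec["customer"],
        "amount_usd": float(rec["amount"]),
        "date": datetime.strptime(rec["date"], "%m/%d/%Y").date(),
    }


def logistics_view(raw):
    """Hypothetical application B: amount is cents, date is day/month/year."""
    rec = json.loads(raw)
    return {
        "customer": rec["customer"],
        "amount_usd": int(rec["amount"]) / 100,
        "date": datetime.strptime(rec["date"], "%d/%m/%Y").date(),
    }


a = billing_view(RAW_RECORD)
b = logistics_view(RAW_RECORD)
print(a["amount_usd"], a["date"])  # 1200.0 2014-03-04
print(b["amount_usd"], b["date"])  # 12.0 2014-04-03
```

Both readings are internally consistent, and neither is wrong according to the data itself, which is exactly the governance problem: the bytes are identical, but the two applications are no longer talking about the same thing.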
In a data lake, we drop all of these various degrees of structure into a common pool or body. Each asset is deposited with a different level of dilution regarding the schema instructions, thus requiring an oceanographer’s level of expertise to determine when the data in a data lake has “high alkalinity” or “high acidity” or too much “copper” and so on. A data scientist who understands the “alchemy” that exists in the fetid waters of a polluted lake will have no trouble cleaning the data up and discerning pollutants from di-hydrogen oxide (that’s water). But a novice or dilettante may find that they are drinking polluted data water from the lake long after they are infected with the equivalent of data E. coli and are making continuous visits to the data latrine.
To be completely fair, comparing the waters of a lake to the Data Lake requires some departure from the metaphor, but there are further extensions of the metaphor that also work (other than those obnoxious points I raise above). For example, a lake that is crystal clear somewhere in the mountains of upstate New York would be a dead lake. There would be no fish, no microscopic organisms, no snakes, no mosquitos flying over its surface. It would be crystal clear and completely dead. So, if you desire a completely clean lake, you will find a sanitized data environment that defeats the entire purpose of storing information in something near its native form. You want a vibrant, living data lake. But you must keep the trucking metaphor in mind and remember that unstructured data does not exist. Instead, semi-structured data, which intentionally has only SOME of the schema embedded in the information asset, requires a knowledge of how to swim. And boats (tools for exploring the data lake) don’t really work either – anyone who cannot swim must be very careful to NOT fall out of the boat. But anyone who can swim can safely use a boat and be confident that if they fall over the side they can make it back to shore.
Data lakes are for data scientists to conduct science; they are not for casual analytics users or even advanced business analysts who generally scuff their code-writing and script-writing toes on ignorance. But those data miners that understand business process and systems analysis, or those data scientists who understand model theory and statistical primitives? Well, they can swim all day. So, buy a boat (a tool), explore away, and be comfortable in the knowledge that some of the structure is waiting for you in the lake, but not all of it – your job is to figure out the rest, including sometimes reaching over the side of the boat and getting a little wet, because you never know when you will fall in.
Merv’s comment: this has been a topic of some discussion on our team. My own view is very close to Mark’s and is expressed in my closing slide from the Hadoop Summit
Category: Big Data data lake data warehouse Gartner Hadoop metadata Tags: data lake, data warehouse, Hadoop, metadata
by Merv Adrian | October 9, 2014 | 1 Comment
At Gartner Symposium, Drue Reeves and I had the opportunity to interview Microsoft CEO Satya Nadella. Here’s a brief clip from the closing. I’m summarizing and Satya, passionate as he was throughout the conversation, lays out his vision about mobility that crosses the personal and professional: mobility of the individual and the app experiences. “Have my work and life wherever – that’s the true form of mobility.”
by Merv Adrian | October 2, 2014 | Comments Off
It’s rare that one gets the chance to talk to a new megavendor CEO in his first year on the job – especially in front of 10,000 senior IT professionals. But that is the opportunity Drue Reeves and I have on Tuesday, October 7 in Gartner’s Mastermind interview.
What have we got in mind? Enterprise IT questions. We won’t talk much about Xbox or Bing. But we do plan to ask Satya:
- How he is driving a culture of innovation, and in what direction
- What Windows 10 means to IT, and what they should do about it – and when
- How mobility is changing end user experiences – and what Microsoft is doing to get us there
- How the cloud impacts usage, data centers, and architects’ budgets
- What impact the Internet of Things will have on Microsoft – and us
We’ll have a lightning round featuring questions from Twitter to liven things up – though I expect we won’t struggle with liveliness. If you want to suggest questions, send them to us with the Twitter hashtag #GartnerSymAskSatya, email us at Gartner, or leave a comment here. We’ll do our best to keep up with them all….
Hope to see you there.
Category: Gartner Microsoft Tags: Gartner, Microsoft
by Merv Adrian | September 20, 2014 | 2 Comments
For the past few months, I’ve been Gartner’s Vendor Lead for Microsoft. For some 30 vendors, we assign a single analyst to act as a focal point for coordinating across the 1000 analysts we have when research covers that vendor.
In Microsoft’s case, that has proven to be fascinating – we have some three dozen Magic Quadrants alone that have been published about their offerings in the last 15 months or so. As Vendor Lead, I’m a mandatory peer reviewer for those and other documents. For my own edification, I decided to map the Magic Quadrants that feature Microsoft onto a quadrant that shows where Microsoft appears in that piece of research. The results are intriguing.
Microsoft has a sizable number of Leader offerings, but many in the Challenger quadrant, and a few that appear in the Niche Player quadrant as well. It’s a bit rarer for them to appear as Visionaries – if they have it figured out, their ability to execute tends to drive them up into the Leader space fairly quickly.
The chart shows places where Microsoft clearly needs to focus, and makes it clear that they play in numerous markets of interest to Gartner’s enterprise IT-focused audience. Many categories do not appear, and a half dozen MQs are currently in process. I’ll keep this up to date for myself, and occasionally will share it here.
Category: Gartner Microsoft Tags: Gartner, Microsoft
by Merv Adrian | July 24, 2014 | 5 Comments
Interest from the leading players continues to drive investment in the Hadoop marketplace. This week Teradata made two acquisitions – Revelytix and Hadapt – that enrich its already sophisticated big data portfolio, while HP made a $50M investment in, and joined the board of, Hortonworks. These moves continue the ongoing effort by leading players. Four of the top five DBMS players (Oracle, Microsoft, IBM, SAP and Teradata) and three of the top seven IT companies (Samsung, Apple, Foxconn, HP, IBM, Hitachi, Microsoft) have now made direct moves into the Hadoop space. Oracle’s recent Big Data Appliance and Big Data SQL, and Microsoft’s HDInsight represent substantial moves to target Hadoop opportunities, and these Teradata and HP moves mean they don’t want to be left behind.
Teradata begins its moves with Revelytix. Andrew White noted in Gartner’s 2013 Cool Vendors in Information Infrastructure and Big Data that Revelytix’ “Loom, which runs in Hadoop, classifies objects in the Hadoop Distributed File System and applies a predefined transformation so that objects become structured and more usable for data scientists.” In our discussions of the Logical Data Warehouse, Gartner has targeted the capabilities Revelytix was designed to provide as being on the critical path to creating a coherent, optimized metadata architecture that will incorporate both traditional Enterprise DWs and Hadoop – a direction our research shows the advanced users are heading in.
In the 2012 edition of Cool Vendors, I described Teradata’s other acquisition, Hadapt, defining its vision as a Postgres-based “RDBMS instance on every node in the cluster in order to improve performance of queries over the structured part of the data, and … data partitioning techniques to eliminate unnecessary data movement.” Admirable as it was, this vision had not generated much business, and the window for additional SQL-on-Hadoop offerings may be closing – but Teradata has acquired technology and engineering talent that it will put to use supplementing its continuing optimization of Teradata SQL and SQL-H across complex logical data fabrics. The Hadapt team joins Teradata, though the brand will disappear.
HP chose to make a direct investment in Hortonworks, which extended its last funding round, closed months ago, to accept an additional $50M. The oddity of these mechanics aside, HP gets significant impact for its money: Martin Fink, its CTO, joins the board. HP will integrate the Hortonworks Data Platform (HDP) into its HAVEn offering, invest resources to certify its Vertica column-store analytic DBMS with HDP, and provide first-line support. Hortonworks gets access to the global HP channel, which could provide a major boost to its sales capabilities. HP was already a reseller, but it has also been partnering with MapR for some time, and this relationship does not end that one. HP gets access to a leader in the continuing development of Apache Hadoop, and it’s likely that the relationship will expand as the two decide what their roadmap will be.
Increasingly, the players are marshaling their forces for global competition, global sales and support, and increased integration with enterprise-class architectures. These moves will hardly close this round of the maneuvering – it will be interesting to see what comes next.
Category: Apache Big Data data warehouse DBMS Gartner Hadapt Hadoop Hortonworks HP IBM MapR Microsoft Oracle RDBMS Revelytix Teradata Uncategorized Tags: Apache, big data, CDH, Cloudera, data warehouse, Hadapt, Hadoop, Hortonworks, HP, IBM, MapR, Microsoft, Oracle, Revelytix, Teradata
by Merv Adrian | July 16, 2014 | 4 Comments
One of the more interesting conversations I had at the Microsoft Worldwide Partners Conference this week concerned an initiative they have launched to help IT understand – and get under control – proliferating ungoverned SaaS applications. Brad Anderson, Corporate VP for Cloud and Mobility, told the 16,000 attendees that enterprises need help. “We ask them how many SaaS apps they have in their environment and they usually tell us 30-40. We audit with the Cloud App Discovery tool and find, on average, over 300.” And are these managed? One can only imagine…
The tool is in preview now, and a link to try it out for free is provided in Microsoft’s blog. It offers more than discovery – it will permit managers to monitor usage, identify users, integrate apps into Azure Active Directory, and more.
This is part of a larger story about governance and optimization in a hybrid cloud- and on-premises world that enterprises will live in for this decade and the next. Anderson also pointed out that 3.1M smartphones were stolen and another 1.4M lost. How many of these had corporate data on them? Would you know if it happened to one of your users? Can you govern access to corporate data in the apps there, and prevent it from being pasted into emails by someone who gets that phone and uses the saved logins to get at it? Some of these challenges can be handled by policy-based tools.
Getting the apps your users want into Azure, managing them there, and linking the on-premises Active Directory used by the overwhelming majority of enterprises to Azure Active Directory offers the possibility of getting corporate data security under better control before you find out how you look in orange. One of my favorite scenarios Microsoft showed its Enterprise Mobility Suite detecting is “impossible logins” – an hour ago you logged in from Australia and now you’re apparently in Chicago. Software can stop that? Yes.
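For the curious, the “impossible login” check is easy to sketch. What follows is a minimal illustration in Python, not a description of how Enterprise Mobility Suite actually works: the function names, the tuple layout, and the 1,000 km/h threshold (roughly airliner speed) are all my own assumptions. The idea is simply to compute the great-circle distance between two consecutive login locations and flag the pair if the implied travel speed is not plausible.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
MAX_PLAUSIBLE_KMH = 1000.0  # assumed threshold, roughly airliner speed


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two latitude/longitude points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    h = sin(dlat / 2) ** 2 + cos(lat1) * cos(lat2) * sin(dlon / 2) ** 2
    # Clamp to 1.0 to guard against floating-point overshoot.
    return 2 * EARTH_RADIUS_KM * asin(sqrt(min(1.0, h)))


def is_impossible_login(prev, curr, max_kmh=MAX_PLAUSIBLE_KMH):
    """prev and curr are hypothetical (lat, lon, time_in_hours) tuples."""
    hours = curr[2] - prev[2]
    distance = haversine_km(prev[0], prev[1], curr[0], curr[1])
    if hours <= 0:
        # Same instant (or out-of-order events): any distance is suspect.
        return distance > 0
    return distance / hours > max_kmh


sydney = (-33.87, 151.21, 0.0)
chicago = (41.88, -87.63, 1.0)    # one hour later
melbourne = (-37.81, 144.96, 2.0)  # two hours after the Sydney login

print(is_impossible_login(sydney, chicago))    # True
print(is_impossible_login(sydney, melbourne))  # False
```

A real product would have to cope with VPNs, imprecise IP geolocation, and shared accounts, so treat the single speed threshold here as a starting point for a policy rather than the policy itself.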
The context here was Microsoft telling its partners about the opportunities for them to sell these capabilities to customers – and it’s hard to imagine them not wanting to, especially with the incentives, certification, training and co-marketing efforts Microsoft is launching. Expect this to be a major theme, leveraging the power of the crown jewel that Active Directory is in the portfolio in many additional ways to come.
Category: Active Directory Industry trends Microsoft mobility SaaS Security Tags: Microsoft, mobility, SaaS, Security
by Merv Adrian | June 28, 2014 | 4 Comments
In February 2012, Gartner published How to Choose The Right Apache Hadoop Distribution (available to clients). At the time, the leading distributors were Cloudera, EMC (now Pivotal), Hortonworks (pre-GA), IBM, and MapR. These players all supported six Apache projects: HDFS, MapReduce, Pig, Hive, HBase, and Zookeeper. Things have changed.
[updated June 29] We included Datastax (a distributor of Apache Cassandra) then, but they did not, and still don’t, consider themselves part of the Hadoop ecosystem. And they are not alone in having a reductive view of the answer to the question What Is Hadoop? Doug Cutting, pioneer in creating it and Chief Architect at Cloudera and former president of the Apache Software Foundation, considers the Hadoop Project to be HDFS, MapReduce and some common utilities. He made that point clear during a panel of luminaries my colleague Nick Heudecker conducted recently – the video is linked to Nick’s blog here. Everything else is “related projects.” Arun Murthy of Hortonworks, who has driven the creation of YARN, prefers to say that HDFS and YARN are “kernel” now, likening the description to the way most of us think of Linux. The Apache page continues to use the older description, including HDFS, MapReduce and YARN. (June 29, 2014)
To users, and especially buyers, the definition is more expansive. Hadoop is what they use to compose a useful stack of software to execute a business process of some sort. And distributors agree: in a little over two years, the set of projects included in all commercial distributions has grown from six to fifteen – two and a half times as many. The list now includes Accumulo, Avro, Cascading, Flume, Mahout, Oozie, Spark, Sqoop, and YARN.
Others are likely to join this stack long before the next two years are up: the candidates include Falcon, Knox, Giraph, Hue, Lucene, Storm, Tez, and others. Hadoop has moved from a coarse-grained blunt instrument for largely ETL-style workloads to an expanding stack for virtually any IT task big data professionals will want to undertake. What is Hadoop now? It’s a candidate to be the alternative universe for data processing, with over 20 components that span a wide array of functions. More money continues to flow into the ecosystems, more companies form, more programmers take up the challenges, and the big players are scrambling to get aboard the train.
What is Hadoop?
It’s what’s next.
Category: Accumulo Apache Apache Yarn Avro Cascading Cloudera Falcon Flume Gartner Giraph Hadoop Hbase HDFS Hive Hortonworks Hue IBM Knox Lucene Mahout MapR MapReduce Oozie Pig Pivotal Spark Sqoop Storm Tez YARN Zookeeper Tags:
by Merv Adrian | March 24, 2014 | 11 Comments
This post was jointly authored by Merv Adrian (@merv) and Nick Heudecker (@nheudecker) and appears on both blogs.
In the early days of Hadoop (versions up through 1.x), the project consisted of two primary components: HDFS and MapReduce. One thing to store the data in an append-only file model, distributed across an arbitrarily large number of inexpensive nodes with disk and processing power; another to process it, in batch, with a relatively small number of available function calls. And some other stuff called Common to handle bits of the plumbing. But early adopters demanded more functionality, so the Hadoop footprint grew. The result was an identity crisis that grows progressively more challenging for decision makers with almost every new announcement.
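For readers who have not seen the MapReduce model up close, the “relatively small number of available function calls” boils down to a map step, a grouping (shuffle) step, and a reduce step. Here is a minimal single-process sketch in plain Python. It is an illustration of the programming model only, not the Hadoop API: the function names are made up for this example, and nothing here is distributed.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce over a batch of input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = run_job(["Hadoop stores data", "Hadoop processes data in batch"])
print(counts["hadoop"], counts["data"], counts["batch"])  # 2 2 1
```

In a real cluster the map and reduce functions run on many nodes close to the data in HDFS, and the framework handles the shuffle, scheduling, and retries; the developer mostly supplies the two functions.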
This expanding footprint included a sizable group of “related projects,” mostly under the Apache Software Foundation. When Gartner published How to Choose the Right Apache Hadoop Distribution in early February 2012 the leading vendors we surveyed (Cloudera, MapR, IBM, Hortonworks, and EMC) all included Pig, Hive, HBase, and Zookeeper. Most were willing to support Flume, Mahout, Oozie, and Sqoop. Several other projects were supported by some, but not all. If you were asked at the time, “What is Hadoop?” this set of ten projects, the commercially supported ones, would have made a good response.
In 2013, Hadoop 2.0 arrived, and with it a radical redefinition. YARN muddied the clear definition of Hadoop by introducing a way for multiple applications to use the cluster resources. You have options. Instead of just MapReduce, you can run Storm (or S4 or Spark Streaming), Giraph, or HBase, among others. The list of projects with abstract names goes on. At least fewer of them are animals now.
During the intervening time, vendors have selected different projects and versions to package and support. To a greater or lesser degree all of these vendors call their products Hadoop – some are clearly attempting to move “beyond” that message. Some vendors are trying to break free from the Hadoop baggage by introducing new, but just as awful, names. We have data lakes, hubs, and no doubt more to come.
But you get the point. The vague names indicate the vendors don’t know what to call these things either. If they don’t know what they’re selling, do you know what you’re buying? If the popular definition of Hadoop has shifted from a small conglomeration of components to a larger, increasingly vendor-specific conglomeration, does the name “Hadoop” really mean anything anymore?
Today the list of projects supported by leading vendors (now Cloudera, Hortonworks, MapR, Pivotal and IBM) numbers 13: HDFS, YARN, MapReduce, Pig, Hive, HBase, Zookeeper, Flume, Mahout, Oozie, and Sqoop – plus Cascading and HCatalog. Coming up fast are Spark, Storm, Accumulo, Sentry, Falcon, Knox, Whirr… and maybe Lucene and Solr. Numerous others are only supported by their distributor and are likely to remain so, though perhaps MapR’s support for Cloudera Impala will not be the last time we see an Apache-licensed project that is not an Apache project break the pattern. All distributions have their own unique value-add. The answer to the question, “What is Hadoop?” and the choice buyers must make will not get easier in the year ahead – it will only become more difficult.
Category: Accumulo Ambari Apache Apache Drill Apache Yarn Big Data BigInsights Cloudera Elastic MapReduce Gartner Giraph Hadoop Hbase HCatalog HDFS Hive Hortonworks IBM Intel Lucene MapR MapReduce Oozie open source OSS Pig Solr Sqoop Storm YARN Zookeeper Tags: Apache, big data, BigInsights, CDH, Cloudera, Datastax, EMC, Flume, Hadoop, Hbase, HDFS, Hive, Hortonworks, IBM, InfoSphere, Isilon, MapR, MapReduce, Oozie, open source, OSS, Pig, Pivotal, Sqoop, zookeeper
by Merv Adrian | February 23, 2014 | 5 Comments
In my post about the BYOH market last October, I noted that increasing numbers of existing players are connecting their offerings to Apache Hadoop, even as upstarts enter their markets with a singular focus. And last month, I pointed out that Nick Heudecker and I detected a surprising lack of concern about security in a recent Hadoop webinar. Clearly, these two topics have an important intersection – both Hadoop specialists (including distribution vendors) and existing security vendors will need to expand their efforts to drive awareness if they are to capture an opportunity that is clearly going begging today. Security for big data will be a key issue in 2014 and beyond.
Other analysts at Gartner have tracked many of these products, and in my own followup I’ve been catching up on the work of Joseph Feiman and Brian Lowans, among others. Their Magic Quadrant for Data Masking, published in December, offers useful discussion of that capability (both static and dynamic) and which existing players have already added Hadoop support. Axis Technology’s DMSuite, Dataguise (who partners with Compuware), IBM InfoSphere Optim Data Privacy and InfoSphere Guardium Data Activity Monitor, Informatica Dynamic Data Masking and Persistent Data Masking, and Voltage SecureData Enterprise are all mentioned in the MQ.
There are other offerings, of course – for example, Feiman and Lowans note that masking of big data is available for the Oracle Big Data Appliance with its installed Cloudera distribution, but add that it requires the use of Oracle consulting services, or the services of Oracle’s numerous service partners. Similarly, there are several emerging Hadoop-focused firms I’ve mentioned elsewhere and will cover in an upcoming piece of Gartner research I’m doing with Neil MacDonald. With RSA coming up this week (unfortunately, I can’t attend), I expect to see more heat – and perhaps light as well – on the issue ahead.
Category: Apache Big Data Cloudera Dataguise Gartner Hadoop IBM Magic Quadrant Oracle Security Tags: Apache, Axis Technology, big data, Cloudera, Compuware, Dataguise, Hadoop, IBM, Informatica, open source, Oracle, OSS, Security, Voltage