by Merv Adrian | November 17, 2014 | 9 Comments
Last week, many observers were surprised when Hortonworks’ S-1 for an initial public offering (IPO) was filed. And there are good reasons to be surprised. Why now? CEO Rob Bearden told VentureWire not long ago that he expected to exit 2014 “at a strong $100 million run rate” in preparation for a 2015 IPO. What changed? Perhaps that question is best answered by asking another: for whom?
Is the filing for Hortonworks itself, to help with cash? That is not obvious. The filing is listed as being for an offering of $100M. In the context of other fundraising activities by Hadoop vendors – with Cloudera’s $900M or so of a few months ago at the top of the list – it will hardly create a war chest suitable for out-expanding its competitors aggressively. And there are cheaper and easier ways to raise $100M in Silicon Valley than an IPO.
In fact, a look at the numbers – now public for the first time because of the filing – makes it all the more puzzling. Hortonworks’ $33.4 million in revenue for the nine months ending Sept. 30 was up sharply from last year, only its second full year since HDP went GA in June 2012. Revenue for the last quarter was $12M. It was barely up over the prior quarter (also $12M), so things are actually a bit flat. But expenses are several times that – $29M and $41M respectively, so the gap is widening. Put another way, losses are growing faster than revenues are, at an accelerating rate. That $100M, plus the $111M the company has in cash now, gives it a year or two’s worth of runway to improve matters. Presumably, that’s the bet. But why only $100M if it seems possible that more could be available?
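The "year or two" of runway can be sanity-checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it uses the figures quoted above, treats expenses minus revenue as cash burn, and assumes the latest quarter's numbers stay flat, which the post itself suggests they will not. The function name `runway_quarters` is made up for illustration.

```python
# Figures quoted in the post, in millions of US dollars.
OFFERING = 100.0            # size of the offering listed in the S-1
CASH_ON_HAND = 111.0        # cash the company has now
QUARTERLY_REVENUE = 12.0    # latest quarter
QUARTERLY_EXPENSES = 41.0   # latest quarter


def runway_quarters(cash, revenue, expenses):
    """Quarters until cash runs out, ASSUMING revenue and expenses stay flat."""
    burn = expenses - revenue  # simplification: treats the operating gap as cash burn
    if burn <= 0:
        return float("inf")    # no burn means no deadline
    return cash / burn


quarters = runway_quarters(OFFERING + CASH_ON_HAND, QUARTERLY_REVENUE, QUARTERLY_EXPENSES)
years = quarters / 4
print(f"{quarters:.1f} quarters, about {years:.1f} years")
```

With a $29M quarterly gap, $211M lasts a little over seven quarters. If the gap keeps widening the way it did between the last two quarters, the real number would be shorter.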
Is it for Hortonworks’ investors? Let’s see who they are. Here’s a table of their stakes (a table stakes stakes table, if you will):
Benchmark and Index are successful funds, and it’s unlikely they are in a hurry to cash in on their investment. Yahoo might care, but urgency seems unlikely, particularly after the Alibaba windfall. There is little reason to think HP is driving this. Teradata? OK, if they were betting on Hortonworks as the key element in their big data strategy, maybe they have decided to hedge, but it’s hard to imagine they really feel the need to worry about this – and they’ve already hedged by announcing a new partnership with Cloudera. They have a fair number of joint customers with MapR as well, so one can’t rule them out as a future partner too. Teradata’s role here is not likely to be the motivating factor.
Personal gain? Doubtful. The stakes owned by Hortonworks’ CEO and President are nice and will certainly help them – but there is no obvious reason for them to have accelerated this for their own gain, even if they could do so.
Is it to help build the market for Hadoop? This seems to have been the party line on general motivation till now. But they are one vendor among several, some truly megavendors and some similar in size – and evidently in prospects – for the near term. They are major contributors to the open source code in the Apache stack and driving substantial innovation. Being able to keep paying engineers (R&D is 28% of expenses, and has doubled over the past year) is a good use of funds – and $100M will fund a couple of years at the current run rate, which one might expect to level off a bit. But it won’t be the only use of funds: sales and marketing is 48%, and more is better. Still, let’s face it, because of Hortonworks’ business model, everything they build is Apache open source code. Their R&D spend enables their competitors too. It won’t separate them quickly and dramatically from the pack any better to have much more spending on either or both.
It’s been 10 years since the first Google paper on MapReduce. Hortonworks will be the first new public company descending from that, and they want HDP as their symbol. They were formed 3 years after Cloudera, so they can at least grab the Hadoop label for themselves. But with an open source stack, value is likely to be determined by how well the company is seen to run, how many customers it has, how likely the revenue of the company is to track growth in Hadoop usage at those and at new customer sites, etc. Hortonworks’ business is services and support. Neither is particularly high margin. Nor is it clear how customer spending on either or both will scale with their Hadoop usage.
Hortonworks’ 3 largest customers (Yahoo, Teradata, and Microsoft) account for 37.4% of its revenue – and two are investors. The biggest is Microsoft, at 22.4% now – it was 55.3% for the year ended April 30, 2013. That sort of concentration never makes investors too happy, and though it is declining it’s still sizable. The Microsoft deal, like all others, is renewable – it expires in July 2015. And like Teradata, Microsoft has added other partnerships to what was an exclusive with Hortonworks till recently. Is the possible “window” closing a reason to accelerate the IPO? According to Fortune magazine, to actually list in the 2014 calendar year, this was basically the last week for Hortonworks to make the S-1 public (due to a combination of holidays and regulatory waiting periods).
Ultimately, it’s unlikely that Hortonworks will be alone as a public company for long. MapR told the Wall Street Journal they want to IPO next year, and they claim to have more customers, high margins and “efficient cash management.” Cloudera says they “are not ready yet,” though they have a lower rate of losses, and also claim more customers. At the end of the day, the answer may be rather simple. And again, answering a question with a question: if not now, when? There may not be a better time.
Category: Apache Big Data Cloudera Gartner Hadoop Hortonworks HP Industry trends IPO MapR Microsoft Teradata Yahoo! Tags: Apache, big data, Cloudera, Gartner, Hadoop, Hortonworks, initial public offering, IPO, MapR, Microsoft, Teradata, Yahoo!
by Merv Adrian | October 31, 2014 | 10 Comments
New York’s Javits Center is a cavernous triumph of form over function. Giant empty spaces were everywhere at this year’s empty-though-sold-out Strata/Hadoop World, but the strangely-numbered, hard to find, typically inadequately-sized rooms were packed. Some redesign will be needed next year, because the event was huge in impact and demand will only grow. A few of those big tent pavilions you see at Oracle Open World or Dreamforce would drop into the giant halls without a trace – I’d expect to see some next year to make some usable space available.
So much happened, I’ll post a couple of pieces here. Last year’s news was all about promises: Hadoop 2.0 brought the promise of YARN enabling new kinds of processing, and there was promise in the multiple emerging SQL-on-HDFS plays. The Hadoop community was clearly ready to crown a new hype king for 2014.
This year, all that noise had jumped the Spark.
If you have not kept up, Apache Spark bids to replace, or at least supplement, MapReduce with a more general purpose engine, combining interactive processing and streaming along with MapReduce-like batch capabilities, leveraging YARN to enable a new, much broader set of use cases. (See Nick Heudecker’s blog for a recent assessment.) It has a commercializer in Databricks, which has shown great skill in assembling an ecosystem of support from a set of partners who are enabling it to work with multiple key Hadoop stack projects at an accelerating pace. That momentum was reflected in the rash of announcements at Hadoop World, across categories from Analytics to Wrangling (couldn’t come up with a Z). There were more than I’ll list here – their vendors are welcome to add themselves via comments, and I’ll curate this post for a while to put them in.
Hadoop analytics pioneer Platfora announced its version 4.0 with enhanced visualizations, geo-analytics capabilities and collaboration features, and revealed it has “plans for integration” with Spark.
Tableau was a little more ready, delivering a beta version of its Spark Connector, claiming its in-memory offering delivered up to 100x the performance of Hadoop MapReduce. Tableau is also broadening its ecosystem reach, adding a beta version of its connector for Amazon EMR, and support for IBM BigSQL and MarkLogic.
Tresata extended the analytics wave to analytic applications, enhancing its customer intelligence management software for financial data by adding real-time execution of analytical processes using Spark. Tresata is an early mover, and believes one of its core advantages derives from having been architected to run entirely in Hadoop early on. It supports its own data wrangling with Automated Data Ontology Discovery and entity resolution – cleaning, de-duping, and parsing data.
(For developers, Tresata is also open sourcing Scalding-on-Spark – a library that adds support for Cascading Taps, Scalding Sources and the Scalding Fields API in Spark.)
Appliances were represented by Dell, who introduced a new In-memory box (one of many Hadoop appliances that represented another 2014 trend) that integrates Spark with Cloudera Enterprise. (Dell is all in on the new datastores – they have built architectures with Datastax for Cassandra, and with MongoDB, as well.) And Cray, having completed its spinback of Yarc, unveiled its Urika-XA platform with Hadoop and Spark pre-installed, leveraging its HPC expertise to exploit SSDs, parallel file systems, and high-speed interconnects for a test run to see if there is a high-end performance market yet.
Cloud was brought to the party by BlueData, packaging Spark with its EPIC™ private-cloud deployment platform. Standalone Spark clusters can run Spark-Scala, MLLib or SparkSQL jobs against data stored in HDFS, NFS and other storage. Note “standalone” – Spark can, and will, be used by shops that are not running Hadoop. Once it is actually running production jobs, that is.
Rackspace is in both games with its OnMetal – an appliance-based cloud you don’t have to own, with a high-performance design using 3.2 TB per data node. They provision the other services. Rackspace is partnering with Hortonworks to deliver HDP 2.1 or – you guessed it – Spark. This is all built on a thin virtualization layer on another emerging hot platform: Openstack.
The distributions were represented of course: Cloudera jumped in back in February accompanied by strong statements from Mike Olson that helped put it on the map. Hortonworks followed in May with a tech preview. It still is in preview – Hortonworks, for good reasons, is not quite prepared to call it production-ready yet. Pivotal support was announced in May – oddly, in the Databricks blog, reflecting its on-again, off-again marketing motions. In New York, MapR, on the bandwagon since April as well, announced that Drill – itself barely out of the gate – will also run on Spark.
It was intriguing to note that many of the emerging data wrangling/munging/harmonizing/preparing/curating players started early. ClearStory CEO Sharmila Mulligan was quick to note during her keynote appearance that her offering has been built on Spark from the outset. Paxata, another of the new players with a couple of dozen licensed customers already, has also built its in-memory, columnar, parallel enterprise platform on top of Apache Spark. It connects directly to HDFS, RDBMS, and web services like SalesForce.com and publishes to Apache Hive or Cloudera Impala. Trifacta, already onto its v2, has now officially named its language Wrangle, added native support for more complex data formats, including JSON, Avro, ORC and Parquet, and yes, is focusing on delivering scale for its data transformation through native use of both Spark and MapReduce.
Even the conference organizers got into the act. O’Reilly has made a big investment with Cloudera to make Strata a leading conference. It’s added a European conference, making Doug Cutting the new conference Chair. In New York, O’Reilly announced a partnership with Databricks for Spark developer certification, expanding the franchise before someone else jumps in.
There is far more to come from Spark – a memory-centric file system called Tachyon that will add new capabilities above today’s disk-oriented ones; the MLlib machine learning library that will leverage Spark’s superior iterative performance, GraphX for the long awaited graph performance that today is best served by commercial vendors like Teradata Aster, and of course, Spark Streaming. But much of that is simply not demonstrably production-ready just yet – much is still in beta. Or even alpha. We’ll be watching. For now, it’s the new hype king.
Category: Accumulo Amazon Apache Apache Yarn Aster Avro Big Data BigInsights Cascading Cassandra Cloudera Cray Elastic MapReduce Gartner Hadoop HDFS Hive Hortonworks IBM MapR MapReduce Microsoft Spark Uncategorized YARN Tags: Apache, Aster, Avro, big data, BigInsights, BigSQL, BlueData, Cassandra, CDH, Cloudera, Databricks, Datastax, EMR, Gartner, Hadoop, Hbase, HDFS, Hive, Hortonworks, IBM, Impala, JSON, MapR, MapReduce, MarkLogic, Microsoft, MLlib, MongoDB, Openstack, ORC, Parquet, Paxata, Platfora, Rackspace, Scalding, Spark, SQL, Tableau, Tachyon, Tresata, Trifacta, Yarn
by Merv Adrian | October 13, 2014 | 3 Comments
Hopefully, that title got your attention. A recursive acronym – the term first appeared in the book Gödel, Escher, Bach: An Eternal Golden Braid and is likely more familiar to tech folks who know Gnu – is self-referential (as in “Gnu’s not Unix.”) So how did I conclude Hadoop, whose name origin we know, fits the definition? Easy – like everyone else, I’m redefining Hadoop to suit my own purposes.
Let’s start with the obvious one. Of course, Doug Cutting named Hadoop after his child’s toy elephant, seen here.
Photo: Merv Adrian
And in its early days, as I discussed in my post about the changing composition of distributions a few months back, the story was simpler. Hadoop was HDFS, MapReduce and some utilities. As those utilities got formalized and became projects themselves and were supported by commercial distributors, the list grew: Pig, Hive, HBase, and Zookeeper were Hadoop too. And a few months ago, as I noticed, Accumulo, Avro, Cascading, Flume, Mahout, Oozie, Spark, Sqoop, and YARN had joined the list.
YARN is the one that really matters here, not just because it means the list of components will change, but because in its wake that list will change Hadoop’s meaning. YARN enables Hadoop to be more than a brute force, batch blunt instrument for analytics and ETL jobs. It can be an interactive analytic tool, an event processor, a transactional system, a governed, secure system for complex, mixed workloads. At Strata this week, we’ll talk about its integration with Red Hat’s middleware, its cautious alliance with Spark for MapReduce replacement, its alliance with data wrangling tools from startups and Teradata, its connection, via Sentry, to security stacks… and more.
So yes, many of us are redefining Hadoop as we add new pieces – new use cases, new projects that change its very nature. My answer to “What is Hadoop”?
OK – it’s a bit cute. But hopefully, it got your attention. Hadoop’s journey is just beginning, and there is much more change ahead.
Category: Accumulo Apache Apache Yarn Big Data Cascading Flume Gartner Hadoop Hbase HDFS Hive Mahout MapReduce Oozie Pig Spark Sqoop Teradata YARN Zookeeper Tags: Apache, Flume, Gartner, Hadoop, Hbase, HDFS, Hive, MapReduce, Oozie, Pig, Sqoop, Teradata, zookeeper
by Merv Adrian | October 10, 2014 | Comments Off
From my esteemed colleague Mark Beyer
“Unstructured data” is a misnomer—everyone finally agrees on that much, at least. It is a term that is often applied to information assets that are not relational. Sometimes it is applied to machine data generated by operational technologies. Sometimes it is applied to content, like documents, or even less specific text such as email or twitter feeds. The term unstructured creates fear, loathing and desire across the IT landscape. If it generates meaningful discussions about innovative processing or challenges the re-use of existing infrastructure without simply defaulting to a mindset of replacing components—then that is a good discussion. But that is about all the term is good for.
I’ve been growing ever more fond of “semi-structured,” though not for the reason that many who are enamored of the term might assume. Semi actually means half, but that doesn’t mean it is halfway between structured and unstructured, because that would be halfway between myth and reality, which is a “place” that doesn’t actually exist. Consider the vernacular American use of the term “semi-truck,” which would seem to mean “half truck.” But, of course, that isn’t what semi-truck means. The truck vehicle, referred to properly as the “tractor,” is “half” of the truck and the trailer is the other half. Separately, neither can move cargo. When you put them together, you have two halves that make a whole and create a complete, useful delivery vehicle. So, a semi-truck is actually made up of a semi-tractor and a semi-trailer, which make, simply, a whole truck.
Now consider semi-structured data. What it actually means is that “half” of the governance and schema instructions (defined as physical and logical definition plus applied governance) are in the data and the other half are in the using application. Semi-structured data means that you MUST write an application to complete the schema. That also means that with each different application, you can impose different governance and schema instructions (also the processing rules), theoretically making the data more flexible.
However, with the introduction of this capability comes the danger of having disparate governance rules regarding the same data that may be so significant, it becomes different data. As data moves further away from containing its own schema, then flexibility increases and more and more of the entire schema must be imposed by the using application—and ever greater diversity is introduced. Of course application developers think this is a grand idea, but users of that data in a second environment are not equally fond of it.
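A small sketch may make the point concrete. Everything here is hypothetical: the record, the field names, and the two "applications" are invented for illustration. The record carries half the schema (its field names), and each reader supplies the other half (types, units, date conventions).

```python
import json
from datetime import date

# One raw record as it might sit in the lake. The field names travel with the
# data; types, units and date conventions do not. (Hypothetical example.)
RAW = '{"customer": "ACME", "amount": "1250", "date": "03/04/2014"}'


def billing_view(raw):
    """Hypothetical app A: assumes amounts are in cents and dates are MM/DD/YYYY."""
    rec = json.loads(raw)
    month, day, year = (int(part) for part in rec["date"].split("/"))
    return {
        "customer": rec["customer"],
        "amount_usd": int(rec["amount"]) / 100,
        "date": date(year, month, day),
    }


def sales_view(raw):
    """Hypothetical app B: assumes amounts are in dollars and dates are DD/MM/YYYY."""
    rec = json.loads(raw)
    day, month, year = (int(part) for part in rec["date"].split("/"))
    return {
        "customer": rec["customer"],
        "amount_usd": float(rec["amount"]),
        "date": date(year, month, day),
    }


print(billing_view(RAW))  # 12.50 dollars on 4 March 2014
print(sales_view(RAW))    # 1250.00 dollars on 3 April 2014
```

Both readers are internally consistent, and both are "right" about the half of the schema they own. The same bytes have nonetheless become two different facts, which is the governance risk described above.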
In a data lake, we drop all of these various degrees of structure into a common pool or body. Each asset is deposited with a different level of dilution regarding the schema instructions, thus requiring an oceanographer’s level of expertise to determine when the data in a data lake has “high alkalinity” or “high acidity” or too much “copper” and so on. A data scientist who understands the “alchemy” that exists in the fetid waters of a polluted lake will have no trouble cleaning the data up and discerning pollutants from di-hydrogen oxide (that’s water). But a novice or dilettante may find that they are drinking polluted data water from the lake long after they are infected with the equivalent of data E.coli and are making continuous visits to the data latrine.
To be completely fair, the analogy of the waters of a lake when compared to the Data Lake, needs some departure from the metaphor, but there are also further extensions of the metaphor that also work (other than those obnoxious points I raise above). For example, a lake that is crystal clear somewhere in the mountains of upstate New York would be a dead lake. There would be no fish, no microscopic organisms, no snakes, no mosquitos flying over its surface. It would be crystal clear and completely dead. So, if you desire a completely clean lake, you will find a sanitized data environment that defeats the entire purpose of storing information in something near its native form. You want a vibrant, living data lake. But you must keep the trucking metaphor in mind and remember that unstructured data does not exist. Instead, semi-structured data which intentionally only has SOME of the schema embedded in the information asset, requires a knowledge of how to swim. And boats (tools for exploring the data lake) don’t really work either—anyone who cannot swim must be very careful to NOT fall out of the boat. But anyone who can swim, can safely use a boat and be confident that if they fall over the side they can make it back to shore.
Data lakes are for data scientists to conduct science; they are not for casual analytics users or even advanced business analysts who generally scuff their code-writing and script-writing toes on ignorance. But those data miners that understand business process and systems analysis, or those data scientists who understand model theory and statistical primitives? Well, they can swim all day. So, buy a boat (a tool), explore away, and be comfortable in the knowledge that some of the structure is waiting for you in the lake, but not all of it – your job is to figure out the rest, including sometimes reaching over the side of the boat and getting a little wet, because you never know when you will fall in.
Merv’s comment: this has been a topic of some discussion on our team. My own view is very close to Mark’s and is expressed in my closing slide from the Hadoop Summit
Category: Big Data data lake data warehouse Gartner Hadoop metadata Tags: data lake, data warehouse, Hadoop, metadata
by Merv Adrian | October 9, 2014 | 1 Comment
At Gartner Symposium, Drue Reeves and I had the opportunity to interview Microsoft CEO Satya Nadella. Here’s a brief clip from the closing. I’m summarizing and Satya, passionate as he was throughout the conversation, lays out his vision about mobility that crosses the personal and professional: mobility of the individual and the app experiences. “Have my work and life wherever – that’s the true form of mobility.”
Category: Uncategorized Tags:
by Merv Adrian | October 2, 2014 | Comments Off
It’s rare that one gets the chance to talk to a new megavendor CEO in his first year on the job – especially in front of 10,000 senior IT professionals. But that is the opportunity Drue Reeves and I have on Tuesday, October 7 in Gartner’s Mastermind interview.
What have we got in mind? Enterprise IT questions. We won’t talk much about Xbox or Bing. But we do plan to ask Satya:
- How he is driving a culture of innovation, and in what direction
- What Windows 10 means to IT, and what they should do about it – and when
- How mobility is changing end user experiences – and what Microsoft is doing to get us there
- How the cloud impacts usage, data centers, and architects’ budgets
- What impact will the Internet of Things have on Microsoft – and us?
We’ll have a lightning round featuring questions from Twitter to liven things up – though I expect we won’t struggle with liveliness. If you want to suggest questions, send them to us with the twitter hashtag #GartnerSymAskSatya or email us at Gartner, or by comment here. We’ll do our best to keep up with them all….
Hope to see you there.
Category: Gartner Microsoft Tags: Gartner, Microsoft
by Merv Adrian | September 20, 2014 | 2 Comments
For the past few months, I’ve been Gartner’s Vendor Lead for Microsoft. For some 30 vendors, we assign a single analyst to act as a focal point for coordinating across the 1000 analysts we have when research covers that vendor.
In Microsoft’s case, that has proven to be fascinating – we have some 3 dozen Magic Quadrants alone that have been published about their offerings in the last 15 months or so. As Vendor Lead, I’m a mandatory peer reviewer for those and other documents. For my own edification, I decided to map the Magic Quadrants that feature Microsoft onto a quadrant that shows where Microsoft appears in that piece of research. The results are intriguing.
Microsoft has a sizable number of Leader offerings, but many in the Challenger quadrant, and a few that appear in the Niche Player quadrant as well. It’s a bit rarer for them to appear as Visionaries – if they have it figured out, their ability to execute tends to drive them up into the Leader space fairly quickly.
The chart shows places where Microsoft clearly needs to focus, and makes it clear that they play in numerous markets of interest to Gartner’s enterprise IT-focused audience. Many categories do not appear, and a half dozen MQs are currently in process. I’ll keep this up to date for myself, and occasionally will share it here.
Category: Gartner Microsoft Tags: Gartner, Microsoft
by Merv Adrian | July 24, 2014 | 5 Comments
Interest from the leading players continues to drive investment in the Hadoop marketplace. This week Teradata made two acquisitions – Revelytix and Hadapt – that enrich its already sophisticated big data portfolio, while HP made a $50M investment in, and joined the board of, Hortonworks. These moves continue the ongoing effort by leading players. 4 of the top 5 DBMS players (Oracle, Microsoft, IBM, SAP and Teradata) and 3 of the top 7 IT companies (Samsung, Apple, Foxconn, HP, IBM, Hitachi, Microsoft) have now made direct moves into the Hadoop space. Oracle’s recent Big Data Appliance and Big Data SQL, and Microsoft’s HDInsight represent substantial moves to target Hadoop opportunities, and these Teradata and HP moves mean they don’t want to be left behind.
Teradata begins its moves with Revelytix. Andrew White noted in Gartner’s 2013 Cool Vendors in Information Infrastructure and Big Data that Revelytix’ “Loom, which runs in Hadoop, classifies objects in the Hadoop Distributed File System and applies a predefined transformation so that objects become structured and more usable for data scientists.” In our discussions of the Logical Data Warehouse, Gartner has targeted the capabilities Revelytix was designed to provide as being on the critical path to creating a coherent, optimized metadata architecture that will incorporate both traditional Enterprise DWs and Hadoop – a direction our research shows the advanced users are heading in.
In the 2012 edition of Cool Vendors, I described Teradata’s other acquisition, Hadapt, defining its vision as a Postgres-based “RDBMS instance on every node in the cluster in order to improve performance of queries over the structured part of the data, and … data partitioning techniques to eliminate unnecessary data movement.” Admirable as it was, this vision had not generated much business, and the window for additional SQL-on-Hadoop offerings may be closing – but Teradata has acquired technology and engineering talent that it will put to use supplementing its continuing optimization of Teradata SQL and SQL-H across complex logical data fabrics. The Hadapt team joins Teradata, though the brand will disappear.
HP chose to make a direct investment in Hortonworks, which extended its last funding round, closed months ago, to accept an additional $50M. The oddity of these mechanics aside, HP gets significant impact for its money: Martin Fink, its CTO, joins the board. HP will integrate the Hortonworks Data Platform (HDP) into its HAVEn offering, invest resources to certify its Vertica column-store analytic DBMS with HDP, and provide 1st line support. Hortonworks gets access to the global HP channel, which could provide a major boost to its sales capabilities. HP was already a reseller, but it has been partnering with MapR as well for some time, and this relationship does not end that one. HP gets access to a leader in the continuing development of Apache Hadoop, and it’s likely that the relationship will expand as the two decide what their roadmap will be.
Increasingly, the players are marshaling their forces for global competition, global sales and support, and increased integration with enterprise-class architectures. These moves will hardly close this round of the maneuvering – it will be interesting to see what comes next.
Category: Apache Big Data data warehouse DBMS Gartner Hadapt Hadoop Hortonworks HP IBM MapR Microsoft Oracle RDBMS Revelytix Teradata Uncategorized Tags: Apache, big data, CDH, Cloudera, data warehouse, Hadapt, Hadoop, Hortonworks, HP, IBM, MapR, Microsoft, Oracle, Revelytix, Teradata
by Merv Adrian | July 16, 2014 | 4 Comments
One of the more interesting conversations I had at the Microsoft Worldwide Partners Conference this week concerned an initiative they have launched to help IT understand – and get under control – proliferating ungoverned SaaS applications. Brad Anderson, Corporate VP for Cloud and Mobility, told the 16,000 attendees that enterprises need help. “We ask them how many SaaS apps they have in their environment and they usually tell us 30-40. We audit with the Cloud App Discovery tool and find, on average, over 300.” And are these managed? One can only imagine…
The tool is in preview now, and a link to try it out for free is provided in Microsoft’s blog. It offers more than discovery – it will permit managers to monitor usage, identify users, integrate apps into Azure Active Directory, and more.
This is part of a larger story about governance and optimization in a hybrid cloud- and on-premises world that enterprises will live in for this decade and the next. Anderson also pointed out that 3.1M smartphones were stolen and another 1.4M lost. How many of these had corporate data on them? Would you know if it happened to one of your users? Can you govern access to corporate data in the apps there, and prevent it from being pasted into emails by someone who gets that phone and uses the saved logins to get at it? Some of these challenges can be handled by policy-based tools.
Getting the apps your users want into Azure, managing them there, and linking the on-premises Active Directory used by the overwhelming majority of enterprises to Azure Active Directory offers the possibility of getting corporate data security under better control before you find out how you look in orange. One of my favorite scenarios Microsoft showed its Enterprise Mobility Suite detecting is “impossible logins” – an hour ago you logged in from Australia and now you’re apparently in Chicago. Software can stop that? Yes.
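For the curious, the basic idea behind an "impossible login" check is simple enough to sketch. To be clear, this is not how Microsoft's Enterprise Mobility Suite does it, which was not shown in any detail; it is a minimal illustration that assumes each login comes with a timestamp and a rough location, and that anything implying travel faster than an airliner (an assumed 1,000 km/h threshold) is suspect. The names `Login`, `distance_km` and `is_impossible` are invented here.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_KM = 6371.0
MAX_SPEED_KMH = 1000.0  # assumed threshold, roughly airliner cruising speed


@dataclass
class Login:
    user: str
    when: datetime
    lat: float  # degrees
    lon: float  # degrees


def distance_km(a, b):
    """Great-circle distance between two logins (haversine formula)."""
    dlat = radians(b.lat - a.lat)
    dlon = radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(min(1.0, sqrt(h)))


def is_impossible(prev, curr, max_speed_kmh=MAX_SPEED_KMH):
    """True if getting from prev to curr would require exceeding max_speed_kmh."""
    elapsed_hours = abs((curr.when - prev.when).total_seconds()) / 3600
    elapsed_hours = max(elapsed_hours, 1e-9)  # avoid dividing by zero
    return distance_km(prev, curr) / elapsed_hours > max_speed_kmh


start = datetime(2014, 7, 16, 9, 0)
sydney = Login("merv", start, -33.87, 151.21)
chicago_1h = Login("merv", start + timedelta(hours=1), 41.88, -87.63)
chicago_24h = Login("merv", start + timedelta(hours=24), 41.88, -87.63)

print(is_impossible(sydney, chicago_1h))   # flagged: an hour is not enough
print(is_impossible(sydney, chicago_24h))  # not flagged: a day is plausible
```

A real product would have to cope with VPNs, imprecise IP geolocation and shared accounts, so a check like this would more plausibly feed a risk score than an outright block.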
The context here was Microsoft telling its partners about the opportunities for them to sell these capabilities to customers – and it’s hard to imagine them not wanting to, especially with the incentives, certification, training and co-marketing efforts Microsoft is launching. Expect this to be a major theme, leveraging the power of the crown jewel that Active Directory is in the portfolio in many additional ways to come.
Category: Active Directory Industry trends Microsoft mobility SaaS Security Tags: Microsoft, mobility, SaaS, Security
by Merv Adrian | June 28, 2014 | 4 Comments
In February 2012, Gartner published How to Choose The Right Apache Hadoop Distribution (available to clients). At the time, the leading distributors were Cloudera, EMC (now Pivotal), Hortonworks (pre-GA), IBM, and MapR. These players all supported six Apache projects: HDFS, MapReduce, Pig, Hive, HBase, and Zookeeper. Things have changed.
[updated June 29] We included Datastax (a distributor of Apache Cassandra) then, but they did not, and still don’t, consider themselves part of the Hadoop ecosystem. And they are not alone in having a reductive view of the answer to the question What Is Hadoop? Doug Cutting, pioneer in creating it and Chief Architect at Cloudera and former president of the Apache Software Foundation, considers the Hadoop Project to be HDFS, MapReduce and some common utilities. He made that point clear during a panel of luminaries my colleague Nick Heudecker conducted recently – the video is linked to Nick’s blog here. Everything else is “related projects.” Arun Murthy of Hortonworks, who has driven the creation of YARN, prefers to say that HDFS and YARN are “kernel” now, likening the description to the way most of us think of Linux. The Apache page continues to use the older description, including HDFS, MapReduce and YARN. (June 29, 2014)
To users, and especially buyers, the definition is more expansive. Hadoop is what they use to compose a useful stack of software to execute a business process of some sort. And distributors agree: in a little over two years, the set of projects included in all commercial distributions has grown from six to fifteen, two and a half times as many. The list now includes Accumulo, Avro, Cascading, Flume, Mahout, Oozie, Spark, Sqoop, and YARN.
Others are likely to join this stack long before the next two years are up: the candidates include Falcon, Knox, Giraph, Hue, Lucene, Storm, Tez, and others. Hadoop has moved from a coarse-grained blunt instrument for largely ETL-style workloads to an expanding stack for virtually any IT task big data professionals will want to undertake. What is Hadoop now? It’s a candidate to be the alternative universe for data processing, with over 20 components that span a wide array of functions. More money continues to flow into the ecosystems, more companies form, more programmers take up the challenges, and the big players are scrambling to get aboard the train.
What is Hadoop?
It’s what’s next.
Category: Accumulo Apache Apache Yarn Avro Cascading Cloudera Falcon Flume Gartner Giraph Hadoop Hbase HDFS Hive Hortonworks Hue IBM Knox Lucene Mahout MapR MapReduce Oozie Pig Pivotal Spark Sqoop Storm Tez YARN Zookeeper Tags: