by Merv Adrian | February 19, 2014 | 3 Comments
In the most profound change of leadership in Microsoft’s history, Satya Nadella, who was head of the Cloud and Enterprise division, has taken the helm, succeeding Steve Ballmer. Nadella’s “insider” understanding of Microsoft’s culture and his effectiveness in cross-team communication and collaboration could help him reshape Microsoft for the digital era — which will be key for the company to attain the visionary technical leadership to which it aspires.
Nadella’s main challenge consists of evolving Microsoft’s existing businesses (including its enterprise offerings, which represent half of its current revenue) while reinventing Microsoft to make it relevant in mobile and cloud-centric markets.
Unlike his sales-focused predecessor, Nadella has an engineering background; thus his selection reinstates the model of a technically minded CEO driving the company’s technical vision. But Nadella must also overcome several challenges.
He lacks direct experience in the mobile market. His insider status raises the risk of his being overly respectful of existing businesses, and hanging back from tough decisions that potentially threaten them but are critical to generating innovation. He will also need to shake up what is widely viewed as a culturally dysfunctional management structure.
Nadella must quickly demonstrate that he is not backing a “business as usual” strategy, and that he recognizes that design is front and center in client computing for both consumers and enterprise users, and that a mobilized environment has replaced the desktop. The next six months will show how well Nadella and Bill Gates, now serving as technology advisor, collaborate to determine Microsoft’s technical direction.
We expect Microsoft’s trajectory will be clear by year-end 2014. Do not expect radical changes in the company’s overall “devices and services” strategy; instead, watch for organizational shifts, product design changes and updated product road maps to address a mobile- and cloud-dominant world. To get there, Microsoft must:
- Establish a vision of itself as an innovative, disruptive force in IT. Concentrating on mobile technology and leveraging lessons learned from gaming can help Microsoft appeal to the next generation.
- Emphasize design to enhance ease of use for consumers, and apply these lessons to its considerable assets in IT infrastructure to change its image of a legacy enterprise vendor competing in a consumerized market.
- Enable entrepreneurs and developers to develop new business value atop a common Windows client environment with unified, cross-platform services. Microsoft must enable a complete, compelling set of apps that attracts developers and can compete with and within iOS and Android environments.
- Acknowledge its customers’ heterogeneity by supporting Google and Apple client environments, the Linux/Java environment on servers, and cloud-based services in general.
- Deliver compelling experiences and solutions to both IT and to non-IT buyers.
For clients, my research note “Nadella Must Disrupt Microsoft Models to Establish Market Leadership,” co-authored with David Cearley, can be found on the Gartner website at http://www.gartner.com/doc/2664432.
Category: Gartner Microsoft Tags: Microsoft
by Merv Adrian | January 21, 2014 | 19 Comments
“Not looking” at security and privacy seems to be the posture of people implementing Hadoop, based on recent data Gartner has collected. This is troubling, and paradoxical. In an era when the privacy of data, from government surveillance to medical record-keeping to “creepy” marketing initiatives and password breaches, has been in the news regularly, it is hard to understand why professionals implementing Hadoop are not paying attention.
The data here comes from a recent webinar I conducted with my colleague Nick Heudecker. We had over 600 attendees, and during the discussion we offered several polling questions. One had to do with barriers to Hadoop adoption. We had 213 responses to that question.
You can see the results below, and two things leap out: only 2% of respondents see lack of robust security as a barrier, and half feel that they do not have a sufficiently defined value proposition. More on the latter in another post.
For me, the nearly non-existent response to the security issue is shocking. Can it be that people believe Hadoop is secure? Because it certainly is not. At every layer of the stack, vulnerabilities exist, and at the level of the data itself there are numerous concerns. These include the use of external, unvetted data and of data in file systems that lack any protection, and the separation of Hadoop initiatives in most organizations from IT governance. Add to that the kinds of use cases Hadoop is being pointed at: sensitive health care information; personal data in retail systems; telephone usage; social media connection and sentiment analytics – all of them give us pause.
I’ve pointed to security as a key issue facing the Hadoop community in 2014 for some time now. The fact that awareness of the problem is not getting attention only reinforces my belief that we will see major problems as Hadoop goes mainstream.
Category: Big Data Gartner Hadoop Security Tags: big data, data security, Hadoop, Security
by Merv Adrian | January 17, 2014 | 3 Comments
In the Hadoop community there is a great deal of talk of late about its positioning as an Enterprise Data Hub. My description of this is “aspirational marketing;” it addresses the ambition its advocates have for how Hadoop will be used, when it realizes the vision of capabilities currently in early development. There’s nothing wrong with this, but it does need to be kept in perspective. It’s a long way off.
Start here: A Gartner Research Circle survey revealed that in 2013, big data projects went into production in less than 8% of enterprises – and Hadoop was by no means the only technology included in their count. In 2014, many more enterprises will go into production with their first Hadoop project or two, perhaps even pushing deployed Hadoop shops into low double digit percentages.
In those same shops, there are thousands of significant database instances, and tens of thousands of applications – and those are conservative numbers. So the first few Hadoop applications will represent a toehold in their information infrastructure. It will be a significant beachhead, and it will grow as long as the community of vendors and open source committers deliver on the exciting promise of added functionality we see described in the budding Hadoop 2.0 era, adding to its early successes in some analytics and data integration workloads.
So “Enterprise Data Hub?” Not yet. At best in 2014, Hadoop will begin to build a role as part of an Enterprise Data Spoke in some shops. Aspirations are good, but perspective helps. Don’t confuse vision with strategy. The enterprise data warehouse has a long life ahead of it; the synergistic addition of Hadoop to its evolution into the logical data warehouse is just beginning, and Hadoop’s role in operational workloads and event processing has yet to launch.
Category: Apache Big Data data warehouse DBMS Gartner Hadoop RDBMS Tags: Apache, big data, data warehouse, Hadoop
by Merv Adrian | January 13, 2014 | 11 Comments
Talk to security folks, especially network ones, and AAA will likely come up. It stands for authentication, authorization and accounting (sometimes audit). There are even protocols such as Radius (Remote Authentication Dial In User Service, much evolved from its first uses) and Diameter, its significantly expanded (and punnily named) newer cousin, implemented in commercial and open source versions, included in hardware for networks and storage. AAA is and will remain a key foundation of security in the big data era, but as a longtime information management person, I believe it’s time to acknowledge that it’s not enough, and we need a new A – anonymization.
I realize I’m speaking out of turn here. I’m not a security guy myself, and I don’t pretend to be deep in the disciplines that decide whether you are who you claim to be (authentication) and govern whether you can get to the network. Nor do I know the detailed nuances, spread across many different resources, that grant me permission to do what I will be allowed to do with those resources when I get there (authorization). I don’t understand the various protections that assure breaches do not/have not occurred, which depend on the audit capabilities (the latter, as accounting, also provides the mechanism to report on all of the above).
What I do spend some time on is what happens within the resource that holds the data, when an authenticated, authorized person who is appropriately audited gets to it. For example, we need to distinguish what DBAs can see from what an analyst can – financial types call that “separation of duties,” and it’s typically managed by a DBMS, which has mechanisms to interact with authorization capabilities to implement policy. It can be coarse- or quite fine-grained, and it’s one of the reasons we analysts always like to remind people that we talk about database management systems, not just databases.
But here’s the problem: in the big data era, much of the data we work with is not in DBMSs – and more and more of it will not be, as file-based systems like Hadoop gain broader and broader use. File systems don’t provide that granular control, so intervening layers will be required. They too can be coarse – we can encrypt/decrypt everything, for example. Or they too can be fine-grained, offering selective, policy-based decryption – in memory, after the bits come off the disk, before handing to the requester.
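To make the idea of a fine-grained intervening layer concrete, here is a minimal sketch. Everything in it is my own illustration – the toy cipher, the policy table and the function names are not any vendor’s API, and the XOR keystream merely stands in for real encryption such as AES. It decrypts, in memory, only the fields a requester’s role is entitled to:

```python
import hashlib

def _keystream_xor(data: bytes, key: bytes) -> bytes:
    # Toy symmetric stream cipher: XOR with a SHA-256-derived keystream.
    # A placeholder for real encryption; do not use in production.
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        block = hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        out.extend(block)
        counter += 1
    return bytes(x ^ y for x, y in zip(data, out))

KEY = b"demo-key"

# Role-based field policy: which roles may see which fields in the clear.
POLICY = {
    "analyst": {"region", "amount"},
    "dba": set(),          # DBAs manage storage but see no decrypted business data
    "auditor": {"region", "amount", "ssn"},
}

def encrypt_record(record: dict) -> dict:
    # Everything is encrypted at rest, field by field.
    return {k: _keystream_xor(str(v).encode(), KEY) for k, v in record.items()}

def read_record(stored: dict, role: str) -> dict:
    # Decrypt in memory, after the bits come off the disk,
    # only the fields the requester's role permits.
    allowed = POLICY.get(role, set())
    return {k: _keystream_xor(v, KEY).decode() if k in allowed else "<encrypted>"
            for k, v in stored.items()}

stored = encrypt_record({"ssn": "123-45-6789", "region": "West", "amount": "250"})
print(read_record(stored, "analyst"))   # ssn stays opaque to the analyst
print(read_record(stored, "auditor"))   # full view for the auditor
```

The point of the sketch is the shape of the layer, not the cryptography: policy decides per field and per role, so the same stored bytes yield different in-memory views.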
Personally, I hope people who model disease vectors, or even purchase behaviors, can build effective predictive models that describe what happens to people with certain characteristics. I just don’t want that process to result in my name being “on their list.” If they can intuit and classify what I am by my behavior and assign me to a category in some separate process, that is a different issue.
One approach that matters a great deal is obfuscation, which replaces a field like name or SSN with valid characters, but not the original data. Its value is that, properly implemented, it maintains mathematical cohesion and permits statistical analysis, aggregation, model building, etc. to proceed without individuating the records they are performed over – which addresses the privacy concern directly. Redaction – the familiar “blacking out” of content – is also used, but in some policy scenarios, being able to peer into the redacted data might subsequently be of value, and redaction typically doesn’t permit this.
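A simplified sketch of the deterministic flavor of this idea (hypothetical names throughout, and producing hex tokens rather than the format-preserving output a real product would): replacing an identifier with a keyed token keeps records linkable and countable per individual while removing the identity itself.

```python
import hmac
import hashlib
from collections import Counter

SECRET = b"rotate-me"  # hypothetical tokenization key, kept out of the data store

def pseudonymize(value: str) -> str:
    # Deterministic token: the same input always maps to the same token,
    # so grouping and joining still work, but the original value is not
    # recoverable without the key.
    return hmac.new(SECRET, value.encode(), hashlib.sha256).hexdigest()[:12]

purchases = [
    {"ssn": "123-45-6789", "item": "statin"},
    {"ssn": "123-45-6789", "item": "insulin"},
    {"ssn": "987-65-4321", "item": "statin"},
]

anon = [{"ssn": pseudonymize(r["ssn"]), "item": r["item"]} for r in purchases]

# Aggregation still works per (tokenized) individual ...
per_person = Counter(r["ssn"] for r in anon)
print(sorted(per_person.values()))  # → [1, 2]

# ... but no record carries a real SSN any longer.
assert all(r["ssn"] != "123-45-6789" for r in anon)
```

The model builder can still say "this individual bought two items"; what they cannot do is put my name on their list.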
Both approaches, however, can be classified as anonymization (or data deidentification, but I prefer to add another A to AAA for consistency!), and in an era where big data will increasingly be used to track human behaviors for medical, commercial and security reasons, I believe it’s time for anonymization to join the other 3 As. Perhaps it’s time to talk of authentication, authorization, anonymization and audit as the true foundation for data security.
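Pulled together, the four A’s form an ordered pipeline. A toy sketch follows – every name in it is invented for illustration, and a real deployment would of course lean on Kerberos, LDAP, an audit subsystem and proper cryptography rather than these stand-ins:

```python
import hashlib
import hmac

# Toy pipeline illustrating the four A's in order:
# authentication, authorization, anonymization, audit.

USERS = {"merv": hashlib.sha256(b"s3cret").hexdigest()}   # credential store
GRANTS = {"merv": {"read_aggregates"}}                    # entitlements
AUDIT_LOG = []

def authenticate(user: str, password: str) -> bool:
    # Are you who you claim to be?
    return USERS.get(user) == hashlib.sha256(password.encode()).hexdigest()

def authorize(user: str, action: str) -> bool:
    # May you do what you are asking to do?
    return action in GRANTS.get(user, set())

def anonymize(record: dict) -> dict:
    # Deterministic pseudonym in place of the identifier.
    token = hmac.new(b"demo-key", record["name"].encode(),
                     hashlib.sha256).hexdigest()[:8]
    return {**record, "name": token}

def audit(user: str, action: str, ok: bool) -> None:
    AUDIT_LOG.append((user, action, ok))

def fetch(user, password, action, record):
    ok = authenticate(user, password) and authorize(user, action)
    audit(user, action, ok)            # every attempt is recorded
    return anonymize(record) if ok else None

row = fetch("merv", "s3cret", "read_aggregates",
            {"name": "Jane Doe", "spend": 42})
print(row["spend"], row["name"] != "Jane Doe")  # → 42 True
```

The authenticated, authorized, audited requester still never sees the raw identifier – which is exactly the role the fourth A would play.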
Thanks to my esteemed Gartner colleague Neil MacDonald for commenting on and improving this post.
Category: Big Data data integration data warehouse DBMS Gartner Hadoop HDFS Industry trends Security Tags: big data, data integration, data security, data warehouse, Hadoop, HDFS, Security
by Merv Adrian | October 9, 2013 | 9 Comments
When is a technology offering a platform? Arguably, when people build products assuming it will be there. Or extend their existing products to support it, or add versions designed to run on it. Hadoop is there. The age of Bring Your Own Hadoop (BYOH) is clearly upon us. Specific support for components such as Pig and Hive varies, as do capabilities and levels of partnership in development, integration and co-marketing. Some vendors are in many categories – for example, Pentaho and IBM, at opposite ends of the size spectrum, interact with Hadoop in development tools, data integration, BI, and other ways. A few category examples, by no means exhaustive:
Analytic Platforms: Kognitio – an analytics-focused in-memory database with SQL:2011 support – offers significant Hadoop support. Jethrodata adds indexes for SQL and stores them in HDFS to accelerate BI tools, while Splunk’s offering called – gotta love it – Hunk has a similar approach. SAS recently added a partnership with Hortonworks, extending existing capabilities. For integrated marketing, RedPoint Global offers a true YARN-enabled engine with a rich array of capabilities that many tool vendors would envy.
Application Performance Management – longtime stalwart Compuware is bringing its portfolio to both Hadoop and leading NoSQL offerings.
BI Tools: specialists like Alpine Data Labs, Datameer, Karmasphere and Platfora position themselves as targeted for Hadoop environments. Traditional players like SAP BusinessObjects may represent themselves as connecting via Hive, and increasingly we will see some, like Alteryx, Qlikview and Tableau, partnering with emerging distribution-specific stack components like Cloudera’s Impala.
Database: some vendors, like IBM and Teradata, offer their own distributions, and even appliances. Others, like Actian, Calpont, Oracle and Microsoft, partner with pureplay vendors. All provide connectors, interfaces to their own management tools, etc. MarkLogic adds an “enterprise NoSQL” flavor; Rainstor adds an archiving solution for a highly compressed Hadoop environment.
Data Integration: Informatica and Talend both support HDFS and even have specific offerings for ETL, data quality, etc. Revelytix Loom offers data prep and metadata creation capabilities to shorten time-to-use cycles.
Development platforms: Continuuity is out to an early lead here, but is hardly alone and won’t be the last. For example, SQL Server development tool player Red Gate will enter the market soon.
Hadoop as a Service: Altiscale, Amazon, Qubole, Rackspace, Savvis, and Xplenty (who mask Hadoop development complexity) offer varying degrees of control and surrounding capabilities – and marketing, as the links demonstrate.
In-memory data grid (IMDG) engine: Gridgain offers GGFS, one of several HDFS substitutes, and like ScaleOut hServer offers an in-memory grid for execution of MapReduce code. Longtime IMDG player Terracotta has added a Hadoop connector.
Lifecycle Management: WANdisco is offering ALM and support for highly available distributed network deployments, and has recently partnered with Hortonworks.
Platform Performance Management: Appfluent, already providing visibility and performance optimization for Oracle, Teradata and IBM DB2 and PureData for Analytics (aka Netezza) platforms, has now added a Hadoop offering as well.
Search: numerous approaches and players here. One of the more interesting is LucidWorks, leveraging Lucene and Solr for search based use cases on a Hadoop infrastructure.
Security: Dataguise, Gazzang, Protegrity, and Zettaset offer varying components of a full-stack security hardening for a Hadoop deployment.
Stream Processing: DataTorrent, Tibco Streambase, Vitria and Zoomdata are among the early players here.
Workload Automation: BMC’s Control-M is already in place for a need that will become more significant as adoption rises and efficiency becomes more of an issue.
This is just a bare smattering of the evolving ecosystem, and I’d be delighted to have your additions, recommendations and comments. I will update this post to include them. Please jump in.
Category: Big Data BigInsights Cloudera DBMS Hadoop HDFS Hive Hortonworks IBM Lucene MapR MapReduce Microsoft Oracle Pig Rainstor RDBMS Security Solr SQL Server SQLstream Talend Teradata YARN Tags: Apache, big data, BigInsights, CDH, Cloudera, Compuware, Hadoop, Hbase, HDFS, Hive, Hortonworks, IBM, Jethrodata, Karmasphere, Kognitio, metadata, Oracle, Pentaho, Pig, Qubole, Revelytix, SAS, Splunk, Teradata
by Merv Adrian | October 3, 2013 | 1 Comment
My Gartner APAC colleague Darryl Carlton and I recently discussed the obscene ratio between CEO pay and average worker pay in the US. And this IS about the US – we are supporting an astonishing gap compared to the rest of the world, and high tech vendors like Oracle are not the only ones at the top of the list – Larry Ellison comes in at only number 4 on this Bloomberg list, pulling down 1,287 times what an average Oracle worker (not impoverished at nearly $75K per year) collects.
Personally, I was surprised that the banking community was not represented “better” here – but perhaps that’s because there seem to be dozens of executives, not just the CEOs, in every firm currently hauling cash home in wheelbarrows as mortgages remain underwater and pension funds and healthcare plans get de-funded. At least they share with each other.
In my conversation with Darryl, we discussed the link to company performance. Pick your measure – stock price, revenue improvement, etc. – in general those gorging at the trough are not dramatically outperforming their peers – and they certainly are not doing 10 times, or a hundred times, better than their counterparts in other countries. But look at how the US plutocrats compare to their brethren in other countries – Japan at 11:1, Britain at 22:1 – and the US at 475:1.
It’s all in the table linked below.
CEO Pay Ratio
Darryl notes that
Drucker argues that CEO pay which exceeds 20-25 times the average paid in that company is bad for business. The new law in Australia is that if 25% of the stockholders vote down executive pay twice (two strike rule) at successive AGM’s then the Board can be replaced. The CEO of Lenovo in China has once again distributed his bonus to the lowest paid factory workers.
Not here, I’m afraid. And I want to say so much more, but only one word is needed, as far as I’m concerned:
Category: Industry trends Tags: Industry trends
by Merv Adrian | September 6, 2013 | 12 Comments
Many things have changed in the software industry in an era when the use of open source software has pervaded the mainstream IT shop. One of them is the significance – and descriptive adequacy – of the word “proprietary.” Merriam-Webster defines it as “something that is used, produced, or marketed under exclusive legal right of the inventor or maker.” In the Hadoop marketplace, it has come to be used – even by me, I must admit – to mean “not Apache, even though it’s open source.”
But Hadoop is not a thing, it’s a collection of things. Apache says Hadoop is only 4 things – HDFS, MapReduce, YARN, and Common. Distributions are variable collections that may contain non-Apache things, and may be called “proprietary” even when all the things in them are open source – even marketed under Apache licenses, if they are not entirely Apache Projects.
To the most aggressive proponents of open source software, this use of “proprietary” equates to “impure,” even “evil.” And some marketing stresses the importance of “true openness” as an unmitigated good as opposed to the less-good “proprietary,” even when the described offering is in fact open source software.
Some vendors respond to market needs with distributions containing open source-licensed components that come from a community process outside the Apache model – where a committee of people from other companies with their own agendas might delay, or seek to change, the implementation or even functionality of “their” software. Many Apache projects started in other communities – often git-based ones – and were proposed to Apache later. Some call these distributions or components “proprietary,” hence less pure, less good, less worthy of consideration.
To the consumer – the enterprise that acquires its Hadoop from a distributor that is the only provider offering that particular capability – it’s a dilemma, and a mixed bag. If they need the functionality, and no Apache project with similar functionality (if one exists) is supported by some other distributor, but the distributor offers and supports it, it seems obvious to go there. Apache, after all, does not support its projects. Vendors (distributors) do.
On the other hand, if and when Apache gets there and someone supports that Project, it will potentially be more difficult to move from one distribution to another. Organizations are increasingly wary of switching costs, even as they embrace the rapid innovation – and resultant disruption – of the open source revolution.
“Purity” is not the question; timing and availability are. Consumers need to buy what will work, and want to buy what will be supported. And the use of “proprietary” here is both inaccurate and not to the point.
So I’m proposing to use a different expression in my future discussions: “distribution-specific.” It can apply today to an Apache project if only a specific distributor includes it. And it will apply to vendor enhancements, even if API-compatible and open source, where the same is true.
I’d love to hear your thoughts – and your nominations for the list. Some are obvious: Cloudera Impala, MapR’s file system, IBM’s JAQL. Or Apache Cassandra, listed on the Apache Hadoop page linked above as a “Hadoop-related project” along with HBase, Pig and others. Only one company commercializes a software distribution with it – Datastax. And they don’t even call themselves a Hadoop distribution. What do you think of all this? Please leave your comments.
Category: Apache Apache Yarn Big Data BigInsights Cassandra Cloudera Hadoop Hbase IBM MapR MapReduce open source OSS Pig YARN Tags: Apache, big data, BigInsights, Cassandra, Cloudera, Datastax, Hadapt, Hadoop, Hbase, HDFS, IBM, open source, OSS, Pig, Yarn
by Merv Adrian | July 15, 2013 | 10 Comments
Probably the most widespread, and commercially imminent, theme at the Summit was “SQL on Hadoop.” Since last year, many offerings have been touted, debated, and some have even shipped. In this post, I offer a brief look at where things stood at the Summit and how we got there. To net it out: offerings today range from the not-even-submitted to GA – if you’re interested, a bit of familiarity will help. Even more useful: patience.
The earliest uses of Hadoop were – and most still are – ETL-style batch workloads of Java MapReduce code for extracting information from hitherto unused data sets, sometimes new, and sometimes simply unutilized – what Gartner has called “dark data.” But Hadapt had already begun talking in 2011 about putting SQL tables inside HDFS files – what my friend Tony Baer has called “shoehorn[ing] tables into a file system with arbitrary data block sizes.” We highlighted them in early 2012. But this thread, though it continues, was about pre-creating relational tables – suitable for the known problems, like the existing data warehouses and data marts, and thus not well suited to the new, exploratory and unpredictable jobs Hadoop advocates envisioned.
Pre-defined or not, SQL access would need metadata, and the indispensable DDL it would provide at runtime if not in advance. The HCatalog project was incubating – first outside, then inside SQL interface project Apache Hive (itself first developed at Facebook in 2009), getting significant support from Teradata and Microsoft partnering with Hortonworks. RDBMS vendors were doing what they always do with new wrinkles at the innovative market edges – co-opting them with “embrace and extend” strategies. Thus, HP Vertica, IBM DB2 and Netezza, Kognitio, EMC Greenplum, Microsoft SQL Server, Oracle, Paraccel, Rainstor and Teradata all offered SQL-oriented ways to call out to the Hadoop stack, using external table functions, import pipes, etc. But actual SQL, for use directly against Hive (or HCatalog) metadata from inside their engine was so far a “not exactly, but…” kind of proposition.
At Strata/Hadoop World in October 2012, Cloudera announced Impala, an already visible SQL engine bypassing MapReduce and executing directly against HDFS and HBase; it is now available in the extra-cost RTQ (real time query) add-on option to CDH Enterprise. Although Apache Drill (based on Google’s Dremel) had been announced a few months before, it was and is still listed as Incubating by Apache. That is a pre-Project, arguably pre-alpha, state, though MapR’s plan to support it commercially is imminent. Microsoft publicly described its plans for Polybase, which is included in SQL Server 2012 Parallel Data Warehouse V2. Platfora, a BYOH (bring your own Hadoop) BI player, announced its interactive tool with a memory-cached embedded store, opening another front in the battles for customers to muddle over – a database or a tool? Which is a better fit for me?
In February, Hortonworks had thrown the “open, Apache, Hadoop” hat into the ring by announcing the Stinger initiative, promising a 100x speedup – and a few months later the labors of 55 developers from several companies progressed Hive to release – wait for it – 0.11. This is a platform for further development; in its upcoming stages, Stinger will make use of yet another Incubator project, in this case Tez.
Shortly after that, EMC Greenplum (by then using Pivotal as a product, not corporate, name) announced its HAWQ (“Hadoop With Query”). It added some wrinkles by grafting the commercially developed and market-hardened Greenplum MPP Postgres-based DBMS directly to HDFS (or EMC’s Isilon OneFS) – and promising in interviews to be half the price of leading alternatives.
In April, IBM announced its BigSQL, and Teradata unveiled SQL-H, both making real “SQL against Hadoop” part of their portfolios (Teradata already had Aster’s SQL-MR, first introduced in 2008.) Oracle announced an updated Big Data Appliance and some software component upgrades. Oracle again left its MySQL unit out of the conversation, relegated to making its own announcement of a MySQL Applier for Hadoop, which replicates events to HDFS via Hive, extending its existing Sqoop capability. MySQL continues to be remarkably invisible in Oracle big data messaging, despite the widespread presence of the LAMP stack in big data circles.
There you have it – the state of SQL on Hadoop at the Summit. This discussion doesn’t even include things that have not even been entered into the races yet, such as Facebook’s yet-to-be-submitted-to-Apache Presto interface. A closing thought or two:
- None of this is real-time, no matter what product branding is applied. On the continuum between batch and true real-time, these offerings fall in between – they are interactive. And the future of Hadoop in the enterprise includes both interactive and real-time. As Hadoop will be increasingly used for operational purposes, critical real-time applications will require continuous availability for successful deployment in large global organizations. There are stirrings on the horizon, but nonstop operation is still aspirational, not easily available.
- There continues to be much hype about the advantages of open source community innovation. In this case, it’s often innovating what has been in SQL since standards like SQL-92 were agreed upon and deployed by those slowpoke commercial vendors.
- The reality is, the open source community struggles to be complete and timely with its offerings. YARN’s lateness was discussed in my first post on the Summit. Two weeks before that, at HBaseCon I watched a speaker asking his audience to volunteer for features identified as needed for the roadmap. Sometimes, a commercial product manager is a blessing indeed.
P.S. Yes, I know the syntax of the title is not correct. Call it literary license.
Category: Apache Apache Drill Apache Yarn Aster Big Data Cloudera data warehouse DBMS Gartner Hadapt Hadoop HCatalog HDFS Hive Hortonworks IBM MapR MapReduce Microsoft Netezza Oozie Oracle Rainstor RDBMS Real-time SQL Server Sqoop Teradata YARN Tags: Apache, Aster, big data, BigSQL, CDH, Cloudera, data warehouse, DB2, Drill, EMC, ETL, Greenplum, Hadapt, Hadoop, HAWQ, Hbase, HCatalog, HDFS, Hive, Hortonworks, HP, Impala, Isilon, Kognitio, MapR, MapReduce, MPP, MySQL, OneFS, Oracle, Paraccel, Platfora, Polybase, Postgres, Rainstor, SQL, Sqoop, Stinger, Teradata, Tez, Vertica
by Merv Adrian | July 13, 2013 | 3 Comments
A brief rant here: I am asked with great frequency how this RDBMS will hold off that big data play, how data warehouses will survive in a world where Hadoop exists, or whether Apple is done now that Android is doing well. There is a fundamental fallacy implicit in these questions.
Comparing what someone new and shiny may be claiming they will do a year from now with what someone established is already doing today is foolish. The established vendor being compared is not likely to stand still. In fact, it may well have got where it is precisely because it has learned to sustain innovation. In the big data world, to acknowledge that, say, the uniqueness of MapR’s current storage solution compared to HDFS will likely erode over time is accurate. But to assume MapR will stand still while that happens is not; they are several releases, and several different innovations, in. They still may fall behind – but not because they stood still.
How do I handle these questions as an analyst? By sticking with what is shipping, in production, with referenceable customers. To advise someone who has a need for technology that they should wait until some uncertain point in time when an open source provider may have some technology ready that will compete with today’s enterprise-ready, supported product strikes me as very poor advice. If they don’t need it now, they should wait anyway, and evaluate the options when they do.
This ties closely to my often-offered comment that it is the Silly Con Valley (thanks to Paul Kent at SAS for that one) disease to believe that once we write it on the whiteboard, it’s ready. It’s bad enough to compare to what we know will go GA at a relatively predictable time (like a SQL Server release), but to compare to something whose feature list is on a request for volunteers at an open source meetup is entirely different.
Category: Big Data data warehouse Gartner Hadoop HDFS MapR RDBMS SQL Server Tags: data warehouse, Hadoop, HDFS, MapR, R, RDBMS, SAS, SQL Server
by Merv Adrian | July 10, 2013 | 7 Comments
I had the privilege of keynoting this year’s Hadoop Summit, so I may be a bit prejudiced when I say the event confirmed my assertion that we have arrived at a turning point in Hadoop’s maturation. The large number of attendees (2,500, a solid increase – and more “suits”) and sponsors (70, also a significant uptick) made it clear that growth is continuing. Gartner’s data confirms this – my own inquiry rate continues to grow, and my colleagues covering big data and Hadoop are all seeing steady growth too. But it’s not all sweetness and light. There are issues. Here we’ll look at the centerpiece of the technical messaging: YARN. Much is expected – and we seem to be doomed to wait a while longer.
Here is a great summary of YARN, also known as Hadoop 2.0, posted after the Summit:
MapReduce is great for batch processing large volumes of distributed data, but it’s less than ideal for real-time data processing, graph processing and other non-batch methods. YARN is the open source community’s effort to overcome this limitation and transform Hadoop from a One Trick Pony to a truly comprehensive Big Data management and analytics platform.
Sounds great, doesn’t it? The problem is, this was posted by Jeff Kelly last August, after Hadoop Summit 2012. Now, YARN is being used – on Yahoo’s 30,000 nodes, for example – but Apache still calls it Alpha as of this writing (July 9, 2013). The next announcement, when it comes, will be beta. Some distributions, like Cloudera CDH 4.2, are already supporting it anyway. Hortonworks HDP 2.0, which includes YARN, is in Community Preview (what we enterprise guys like to call beta). MapR doesn’t list it yet – the search engine on their site comes up empty if you search for it. So we aren’t quite there yet.
One other note: confusion continues – I see it in my inquiries about “what IS Hadoop?” Two of what Apache lists as the core Hadoop components will be substitutable now – you can already swap out HDFS for IBM’s GPFS, Intel’s Lustre, or MapR’s storage layer. As YARN comes to market, other engines will be swappable for MapReduce. Graph engines and “closer to real-time” processing are next on the horizon, as Storm is getting great traction and several Summit presenters of real-world case studies alluded to their use of it. Yahoo! has open sourced its Storm-YARN code, which it runs internally, so expect more productionization ahead. So the answer to “what is Hadoop, exactly?” will become even more complicated.
Will this confusion hurt the market and slow adoption? Hard to say. The uncertain part of the market remains so; Gartner’s 2012 Research Circle survey found 31% of enterprises had no plans for big data investment. In 2013, the number was the same. YARN will broaden the set of possible use cases, and raise many questions. Let’s hope it’s ready to start answering them soon.
Category: Apache Apache Yarn Big Data Cloudera Gartner graph databases Hadoop HDFS Hortonworks IBM Intel MapR MapReduce Storm Yahoo! YARN Tags: Apache, big data, Cloudera, Hadoop, HDFS, Hortonworks, IBM, MapR, MapReduce, Yahoo!