by Merv Adrian | February 16, 2013 | 11 Comments
It’s no surprise that we’ve been treated to many year-end lists and predictions for Hadoop (and everything else IT) in 2013. I’ve never been that much of a fan of those exercises, but I’ve been asked so much lately that I’ve succumbed. Herewith, the first of a series of posts on what I see as the 4 Ps of Hadoop in the year ahead: performance, projects, platforms and players.
Performance concerns are inevitable as technologies move from early adopters, who are already tweaking everything they build as a matter of course, to mainstream firms, where the value of the investment is always expected to be validated in part by measuring and demonstrating performance superiority. It also becomes an issue when the 3rd or 4th project comes along with a workload profile different from those that came before – and it doesn’t perform as well as those heady first experiments. Getting it right with Hadoop is as much art as science today – the tools are primitive or nonexistent, the skills are more scarce than the tools, and experience – and therefore comparative measurement – is hard to come by.
What’s coming: newly buffed up versions of key management tools. It’s one method of differentiating distributions in a largely common set of software – Hortonworks doubling down on open source Apache Ambari, Cloudera enhancing Cloudera Manager, MapR’s updated Control System (as well as their continued touting of DIY favorites Nagios and Ganglia.) EMC, HP, IBM and other megavendors are continuing to instrument their existing, and familiar, enterprise tools to reach this exploding market. It will be a busy bazaar.
Resources are proliferating to help: published work like Eric Sammer’s Hadoop Operations (somewhat Cloudera-centric but very well organized and useful), and a plethora of Slideshare presentations designed to help navigate the arcana of cluster optimization, workload management and configuration tuning.
Performance has figured in a number of proof of concept (POC) tests pitting distributions against one another that I’ve heard about from Gartner clients. Some have been inconclusive; some have had clear winners. As we’ve seen in DBMS POCs over the years, your data and your workloads matter, and your results may differ from others’. I’ve seen replacements of “first distributions” by another, as performance or differing functionality comes to the fore. I’ve even seen a case where a Cassandra-based alternative won out over the Hadoop distributions.
Next time: projects proliferate.
Category: Big Data BigInsights Cloudera EMC Hadoop Hbase HDFS Hortonworks IBM MapReduce Sqoop Tags: Apache, BigInsights, Cloudera, EMC, Flume, Hadoop, Hbase, HDFS, Hive, Hortonworks, IBM, MapR, MapReduce, Pig, Sqoop, zookeeper
by Merv Adrian | February 10, 2013 | 15 Comments
“Hadoop people” and “RDBMS people” – including some DBAs who have contacted me recently – clearly have different ideas about what Data Integration is. And both may differ from what Ted Friedman (twitter: @ted_friedman) and I (@merv) were talking about in our Gartner research note Hadoop Is Not a Data Integration Solution, although I think the DBAs’ concept is far closer to ours.
We went to some lengths to precisely map Gartner criteria from the Magic Quadrant for Data Integration Tools (see below) to the capabilities of what most people would consider the Hadoop stack – Apache Projects that are supported in a commercial distribution. Many of those capabilities were simply absent, with nothing currently available to perform them.
Moreover, even to the degree that some pieces/projects might meet some of the needs, there is nothing that ties them together into a “solution,” which itself was a carefully chosen word. Today, with Hadoop projects in general, we very often see bespoke, self-integrated, “build it yourself and good luck operating it” structures. By contrast, solutions, including those for data integration, provide the relevant pieces coherently in a way that ties together design, operation, optimization and governance. Leaving aside the absence of data quality tools or profiling tools of any kind in today’s supported Hadoop project stack, we don’t see that yet. And Ted and I note in our piece that Hortonworks, for example, implicitly acknowledged that by bundling Talend into its distribution. Talend itself places rather well in the Gartner Magic Quadrant for DI tools.
Hadoop is very useful for a lot of things – including analytics of some kinds, and ETL of some kinds, and for low-cost exploitation of data that is unsuitable for persisting in RDBMSs for a variety of reasons. It’s maturing, and steadily adding more capabilities, and is driving an economic refactoring of data storage and processing which will result in some (increasing amounts of) data being kept there and some (increasing amounts of) processes being performed there. In Gartner’s Logical Data Warehouse model, it occupies the spot for Distributed Process use cases. The size of that part of the landscape relative to repositories and to virtualization is yet to be determined. It will take some years to sort out, and it won’t stand still.
But platforms are not solutions. Hadoop can very much be a platform on which a DI solution can be built. But a solution? Not yet. For that, talk to the folks in the MQ referenced above. [added 2/13] Thanks for your comments and tweets – and keep them coming!
Category: Big Data data integration Hadoop Hortonworks Magic Quadrant Talend Uncategorized Tags: Apache, data integration, Hadoop, Hortonworks, Magic Quadrant
by Merv Adrian | January 30, 2013 | 8 Comments
2013 promises to be a banner year for Apache Hadoop, platform providers, related technologies – and analysts who try to sort it out. I’ve been wrestling with ways to make sense of it for Gartner clients bewildered by a new set of choices, and for them and myself, I’ve built a stack diagram that describes possible functional layers of a Hadoop-based model.
The model is not exhaustive, and it continually evolves. In my own continuous collection and update of market, usage and technical data, it serves as a scratchpad I use – every project/product name in the layers is a link to a separate slide in a large deck I use to follow developments. As you can see below, it contains many Apache and non-Apache pieces – projects, products, vendors – open and closed source. Some are quite low level – for example Trevni can be thought of as a format used inside Avro – but I include them at least in part because I keep track of “moving parts,” and in the world of open source, that means a lot of pieces that are independent of one another.
Part of the effort so far has been on relating this model to Gartner’s Information Capabilities Framework, an enormously useful view of the verbs we use to compose our semantic use cases in building business applications. My colleague Ted Friedman and I just used the two models to assess how Hadoop stacks up as a Data Integration solution. Not surprisingly, I suppose, we found it wanting. You can see our research here if you’re a Gartner client.
I expect further refinement of this stack in the weeks ahead, and more offerings at each layer as it evolves as well. I’m trying to keep it simple – at 6 layers it’s already getting heavy, and I’d hate to add more. But that may be unavoidable. Your feedback here will be helpful – please offer comments if you have any! As a guide to choice, simplicity is a much-desired, but often unobtainable, objective.
Category: Apache Big Data Cloudera data integration Hadoop Hbase HDFS Hortonworks MapReduce open source OSS Sqoop Tags: Apache, Cassandra, Cloudera, Datastax, Flume, Hadapt, Hadoop, Hbase, HDFS, Hive, Hortonworks, Hstreaming, Karmasphere, MapR, MapReduce, Oozie, open source, OSS, Pig, Sqoop, zookeeper
by Merv Adrian | December 27, 2012 | 6 Comments
I had an inquiry today from a client using packaged software for a business system that is built on a proprietary, non-relational datastore (in this case an object-oriented DBMS.) They have an older version of the product – having “failed” with a recent upgrade attempt.
The client contacted me to ask about ways to integrate this OODBMS-based system with others in their environment. They said the vendor-provided utilities were not very good and hard to use, and the vendor has not given them any confidence it will improve. The few staff programmers who have learned enough internals have already built a number of one-off connections using multiple methods, and were looking for a more generalizable way to create a layer for other systems to use when they need data from the underlying database. They expect more such requests, and foresee chaos, challenges hiring and retaining people with the right skills, and cycles of increasing cost and operational complexity.
My reply: “you’re absolutely right.”
Their only recourse, absent utilities that can extract data when needed (via federation, virtualization or distributed processing models as identified in Gartner’s Information Capabilities Framework), is to build and materialize an intermediate datastore – ODS- or DW-style. That means an exhaustive effort to anticipate all likely future requests for data. It’s unlikely they will be able to do so, and any change to that system will be as difficult as the one-offs they build now. They could literally recreate the entire database in an alternative architecture that their other systems can access via conventional methods, but the cost of building and maintaining such a redundant “layer” might rival replacement cost.
Thus, my conclusion: their vendor has demonstrated itself to be an architectural cul-de-sac they might want to consider exiting. If you have any similar systems in your portfolio, here’s a New Year’s resolution for you: back out of that dead end. As soon as you can.
Category: data integration data warehouse DBMS Tags: data integration, data warehouse, federation, virtualization
by Merv Adrian | December 8, 2012 | 7 Comments
At its first re:Invent conference in late November, Amazon announced Redshift, a new managed service for data warehousing. Amazon also offered details and customer examples that made AWS’ steady inroads toward enterprise, mainstream application acceptance very visible.
Redshift is made available via MPP nodes of 2TB (XL) or 16TB (8XL), running the Paraccel PADB high-performance columnar, compressed DBMS, scaling to 100 8XL nodes, or 1.6PB of compressed data. XL nodes have 2 virtual cores, with 15GB of memory, while 8XL nodes have 16 virtual cores and 120 GB of memory and operate on 10Gigabit Ethernet.
Reserved pricing (the more likely scenario, involving a commitment of 1 year or 3 years) is set at “under $1000 per TB per year” for a 3 year commitment, combining upfront and hourly charges. Continuous, automated backup for up to 100% of the provisioned storage is free. Amazon does not charge for data transfer into or out of the data clusters. Network connections, of course, are not free - see Doug Henschen’s Information Week story for details.
This is a dramatic thrust in pricing, but it does not come without giving up some things. For example, Amazon has not licensed Paraccel’s high-speed data import utilities; it is far more focused at this point on enabling movement between its own Elastic MapReduce, DynamoDB and S3 storage and Redshift. Thus the early focus, and likely early adoption, is data Amazon’s customers already have in the cloud. Movement from existing data warehouses will come later. Today, that would require exporting data into S3 and then moving it into a (designed) Redshift data warehouse using Amazon’s data movement utilities, which were not shown in detail. Design doesn’t disappear, and it’s not free. As my colleague Mark Beyer said in an email discussion:
Data warehouse and analytics expertise is harder to come by than many believe. With Amazon Redshift providing services to initiate and operate the data warehouse in lieu of Paraccel’s management interface and tools, it is left up to the Redshift implementer to “provide the data warehouse chops.” While I’m sure that any good Cloud application jockey knows their stuff, any data warehouse veteran on the planet knows that letting the apps guys write analytics is like asking your doctor to be the striker on your football team (what we call Soccer here). It is entirely likely that an entire cottage industry of “expert implementers for analytics in the Cloud” will appear on the near horizon.
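Mechanically, the S3-to-Redshift load path described above comes down to pointing a COPY command at a bucket. Here is a minimal, hedged sketch over a standard PostgreSQL connection; the cluster endpoint, database, table, bucket and credentials are all hypothetical placeholders, and psycopg2 is assumed to be installed – this illustrates the idea, it is not Amazon’s own tooling.

```python
# Hedged sketch: load exported files from S3 into a Redshift table via COPY.
# Endpoint, database, table, bucket and credentials are placeholders.
import psycopg2

conn = psycopg2.connect(
    host="mycluster.example.us-east-1.redshift.amazonaws.com",  # hypothetical cluster endpoint
    port=5439,  # Redshift's default port
    dbname="dev",
    user="masteruser",
    password="...",
)

copy_sql = """
    COPY page_views
    FROM 's3://my-bucket/exports/page_views/'
    CREDENTIALS 'aws_access_key_id=<KEY>;aws_secret_access_key=<SECRET>'
    DELIMITER '|' GZIP;
"""

with conn, conn.cursor() as cur:
    cur.execute(copy_sql)  # Redshift pulls and loads the S3 files in parallel
```

The database side of the move is the easy part; as Mark’s comment suggests, getting the data out of an existing warehouse, staged into S3, and into a well-designed target schema is where the effort – and the cost – remains.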
It’s also not clear how much database (not deployment and operating) control will be made available. Paraccel offers plenty of knobs and buttons. Tweaking performance by configuring memory, pinning tables there, looking at how data is packed inside the “slices” – it does not appear any of that will be exposed in the Redshift version. Nor is it obvious how to build ongoing update for a Redshift data warehouse yet.
Another missing “feature” is the support model one gets from a software firm like Paraccel – the level and nature of support in an Amazon environment today is quite different. Still, this is a work in progress. It was evident at re:Invent that Amazon is building up and enhancing its enterprise-facing team, and I had an interesting conversation with them about how the engagement model for an enterprise that has had several individuals “unofficially” contracting for projects on their own transitions to a corporate model. They have seen this play out a number of times now, and it’s becoming a better understood play for them.
One final comment about the vendors’ relationship: it is not as close as I suspect Paraccel would have hoped. After a multi-million dollar investment and over a year of joint work, it was surprising not to see Paraccel’s CEO on stage for the announcement, or even a synchronized press release. This reflects the relative arm’s-length nature of this arrangement. In my subsequent conversations with them, it became clear that Amazon expects their offering to diverge from Paraccel’s over time as they add their own pieces around the part they have licensed for use in Redshift. And there was no publicized joint marketing or sales initiative.
It remains to be seen if the whole elasticity value proposition (scale up, scale down) proves as relevant to data marts and data warehouses as it does to the apps that Amazon is more accustomed to hosting, or how quickly enterprises will move their data to a public cloud. Warehouses don’t scale down. But analytic platforms used for experimenting will, and this may create a great opportunity for Amazon. Gartner clients can see our position on other dimensions of this announcement in a First Take Mark Beyer and I just published.
Category: Amazon Big Data data warehouse data warehouse appliance DBMS Hadoop MapReduce Tags: Amazon, big data, data warehouse, DynamoDB, MapReduce, Paraccel, Redshift
by Merv Adrian | February 24, 2012 | 1 Comment
From my colleague Mark Beyer, who speculates about how leadership in moving toward the logical data warehouse (LDW) will be received:
The logical data warehouse is already creating a stir in the traditional data warehouse market space. Less than 5% of clients with implemented warehouses that we speak with are pursuing three or more of the six aspects of a logical warehouse:
- data virtualization
- distributed processes
- active auditing and optimization
- service level negotiation
- ontological and taxonomic metadata
That means we are in a very early stage regarding the adoption trend, and vendors who are aggressively moving toward it are ahead of their customers. We have spoken with at least one end-user organization that has implemented the entire solution from scratch in their own best of breed approach. Life teaches us that one aspect of humility is to recognize that if a good idea is really good, more than one person has thought of it. With a far greater number of organizations pursuing less than a complete LDW, but multiple parts, it’s only a matter of time. You can count on a trend for adoption of the LDW coming to fruition; timing is the only thing in question at this point.
[Added 3/22/12 - MA: For those not familiar with the LDW model, here is a working definition taken from Does the 21st-Century "Big Data" Warehouse Mean the End of the Enterprise Data Warehouse? (G00213081):
This new type of warehouse — the LDW — is an information management and access engine that takes an architectural approach which de-emphasizes repositories in favor of new guidelines:
- The LDW follows a semantic directive to orchestrate the consolidation and sharing of information assets, as opposed to one that focuses exclusively on storing integrated datasets.
- The semantics are described by governance rules from data creation and use case business processes in a data management layer, instead of via a negotiated, static transformation process located within individual tools or platforms.
- Integration leverages both steady-state data assets in repositories and services in a flexible, audited model via the best optimization and comprehension solution available.
Hope this helps clarify for those not familiar with the research note above.]
Gartner clients with currently successful warehouses already have plans to retrofit and adapt the infrastructure as demand arises. Some of these are a bit long in the tooth with seven years or more of success, while others are as little as eighteen months old. A major portion of the data warehouse population will resist shifting to the LDW simply because it contradicts the traditional notion of IT success. IT projects are supposed to have beginnings and endings, but data warehouses have never had endings—in other words, the data warehouse is supposed to constantly evolve, and resistance to the LDW almost guarantees the eventual retirement of your existing warehouse.
So, when Gartner comments that pursuing the LDW is ahead of the rest of the market, that is not a caution against the strategy – quite the contrary. It is a caution that you will need ample justification, education and training to present the case in a comprehensible and compelling fashion, and a recognition of a tough but worthwhile struggle ahead. For vendors and user clients alike, it’s a signal that leadership does not come without its challenges.
Think of it this way. When the party is already started and everyone else is invited but you, you cannot just show up late and uninvited to declare the party is beginning because “now” you have arrived. If you decided to wait for the party to start then come late, you sure know that you better bring some snacks. But you better hope some of the guests are still hungry. Leading a market is a dangerous and risky business—and it attracts customers just like you.
Gartner clients: for more detail on the LDW, see Does the 21st-Century “Big Data” Warehouse Mean the End of the Enterprise Data Warehouse?
Category: data warehouse Tags: data warehouse
by Merv Adrian | January 23, 2012 | Comments Off
In early January 2012, the world of big data was treated to an interesting series of product releases, press announcements, and blog posts about Hadoop versions. To begin with, we had the announcement of Apache Hadoop version 1.0 at long last, in a press release. Although there were grumblings here and there in the twittersphere that changes to release numbers are meaningless, my discussions with Gartner’s enterprise customers indicate otherwise. Products with release numbers like 0.20.2 make the hair on Procurement’s neck stand on end, and as Hadoop begins to get mainstream attention (Gartner clients, see Hype Cycle for Data Management 2011), IT architects and executives find such optics quite important. Hadoop is moving beyond pioneers like Amazon, Yahoo! and LinkedIn into shops like JP Morgan Chase, and they pay attention to such things.
So what, after 6 years of steady work on earlier versions, makes this one worthy of a 1.0 designation? As Cloudera noted in a blog post, “There has been an 18 month period where there has been no one Apache release that had all the committed features of Apache Hadoop.”
That would be version 0.20.2 from 2010. In that time, every major DBMS vendor has put a strategy in place for side-by-side operation with Hadoop, sometimes direct execution and calls from “inside” (Teradata’s Aster and EMC’s Greenplum pioneering that.) So you might think “1.0” is just labeling to provide a veneer of “enterprise-ness” for customers of those products. But give the tireless gang of committers some credit; in this release they have incorporated features from a couple of branches (what’s that mean, you ask? read on) and added strong authentication (Kerberos), a REST API for HDFS, and for those who prefer the broader set of use cases enabled by HBase, several significant additions to improve working with it. But unfortunately, as useful as a 1.0 label is, it hardly eliminates the confusion, as we’ll see below.
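As an aside on one of those 1.0 additions: the REST API for HDFS (WebHDFS) exposes the filesystem over plain HTTP, so no Hadoop client install is needed to poke at it. A minimal sketch, assuming a NameNode at a hypothetical host on the default HTTP port (50070) and simple user.name authentication:

```python
# Hedged sketch: list an HDFS directory via the WebHDFS REST API added in Hadoop 1.0.
# The NameNode host and user below are illustrative placeholders.
import json
from urllib.request import urlopen

NAMENODE = "http://namenode.example.com:50070"

def list_status(path, user="hdfs"):
    """Call the LISTSTATUS operation and return (name, type, length) tuples."""
    url = "{0}/webhdfs/v1{1}?op=LISTSTATUS&user.name={2}".format(NAMENODE, path, user)
    with urlopen(url) as resp:
        statuses = json.load(resp)["FileStatuses"]["FileStatus"]
    return [(s["pathSuffix"], s["type"], s["length"]) for s in statuses]

if __name__ == "__main__":
    for name, ftype, length in list_status("/user"):
        print(name, ftype, length)
```

That kind of plain-HTTP access lowers the bar for wiring Hadoop into existing enterprise tooling.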
Trunks and Branches
For those who don’t track Apache release matters, the Apache Software Foundation supports Projects, which are created by volunteers accepted into the organization on the basis of the code they write – they gain privileges as Committers who can put code into the main codeline, known as “trunk.” MapReduce is a project; so is HDFS. As new code is developed, the move from one numbered version to another is initiated by the creation of a branch, which is tested until the project members approve its readiness by vote. Branches may move “into trunk,” but other branches may also be started in the interim. And they may have different features.
Charles Zedlewski of Cloudera has described this in more detail in the excellent post linked above. In it he notes that release 0.20.2 in 2010 was the last release that had “all the usable features committed to Apache Hadoop.” Following the 0.20.2 release, work continued on that branch, resulting in 0.20.203 in 2011, which added security but not RAID support or append (which ensures that HBase, a separate project, doesn’t lose data.) It was followed by 0.20.205, which did add security although still not RAID. It is 0.20.205 that became 1.0. Seems straightforward, right?
Unfortunately, it’s not. Branches 0.21, 0.22 and 0.23 were all introduced in that 18 month time period, the latter at the end of 2011. Version 0.23, notes Cloudera, has “all of the features of any past release.” This includes a fix for the name node single point of failure issue and other HA capabilities, both of which matter a great deal to enterprise users. Good news? Well, Release 1.0, as noted, is not based on 0.23, so it does not have these.
Distributions – Solving, Or Muddying the Waters?
So, does using a distribution help sort all this out? Consider: Cloudera’s CDH3 distribution was issued before 1.0. But Cloudera distributions get updates designated with a U. Not only updates to Hadoop; remember there are other projects in there too. So CDH3U0 (yes, they use zeros) uses HBase 0.90.0 whereas CDH3U2 uses HBase 0.90.4; it also added Mahout and expanded support for Avro’s file format.
The latter discussion reinforces the important point that “Hadoop” means more than MapReduce, HDFS and job/task management to people considering it: solutions (and hence, distributions) typically involve several other projects. HBase, for example, seems to be occurring in at least half of them – a study by Dave Menninger of Ventana Research last year found 61% of respondents including it in deployments. Other parts of the stack like Pig, Hive, and Sqoop are also found in many if not most initiatives I’ve had contact with. The complexities of keeping all their versions straight as new code is contributed are a key reason to use distributions, which track and integrate a dozen or more projects.
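One low-tech way to stay on top of what is actually deployed, whichever distribution you pick, is simply to record what each component’s own version command reports on every node. A minimal sketch; the component list is illustrative and should be extended (Pig, Hive, Sqoop and so on) to match your own stack:

```python
# Hedged sketch: record the versions the installed Hadoop-stack components report.
# COMPONENTS is illustrative; extend it to match the projects in your deployment.
import subprocess

COMPONENTS = {
    "hadoop": ["hadoop", "version"],
    "hbase": ["hbase", "version"],
}

def installed_versions():
    versions = {}
    for name, cmd in COMPONENTS.items():
        try:
            out = subprocess.check_output(cmd, stderr=subprocess.STDOUT)
            versions[name] = out.decode().splitlines()[0]  # first line carries the version string
        except (OSError, subprocess.CalledProcessError):
            versions[name] = "not installed or not on PATH"
    return versions

if __name__ == "__main__":
    for component, version in sorted(installed_versions().items()):
        print(component, "->", version)
```

It is no substitute for a distribution’s tested, integrated component matrix, but it makes version drift visible when clusters are assembled by hand.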
How about Hortonworks, the other specialist with a large number of committers? They have announced a “public preview of the Hortonworks Data Platform (HDP) version 2.” That will be based on 0.23 – all those features, plus HCatalog, which Cloudera does not include in CDH (yet.) There is also a “private technology preview” of HDP version 1; “a public technology preview will be made available later this quarter.” What do these preview terms mean? Hortonworks explains:
The Technology Preview Program begins with a Limited Preview phase that enables us to engage a manageable number of representative customers, partners, and community users on focused, hands-on testing and proof of concept deployments of the Hortonworks Data Platform… Public Preview… opens the process to anyone interested in working closely with Hortonworks… culminates in the final release of the software and General Availability.
Other distributions have their own mix of projects, reflecting their own point of view. It can be hard to find out which versions of the various projects are supported in each one. Neither Hortonworks’ nor MapR’s website, for example, shows the version numbers of included Projects, including the varied additional ones they both add, in the same way as the chart Cloudera offers does. And other distributions from IBM, Datastax, Netapp and others, are in the hunt, each with its own profile. For now, the continuing confusion and multiple conflicting “Hadoops” only serve to reinforce IT’s concerns about its readiness for a robust, governed environment – unless one distribution, from one trusted provider, is chosen.
In upcoming Gartner research I’ll talk about the distributions, how they vary, and how to track and choose among them.
Category: Apache Big Data Cloudera Hadoop Hbase HDFS Hortonworks IBM MapReduce NetApp open source Sqoop Tags: Apache Software Foundation, ASF, Aster, Avro, CDH, Cloudera, Datastax, EMC, Greenplum, Hadoop, Hbase, Hive, Hortonworks, IBM, Mahout, MapReduce, NetApp, open source, Pig, Sqoop, Teradata
by Merv Adrian | November 3, 2011 | 6 Comments
Another guest post, this time from my colleague and friend Mark Beyer.
My name is Mark Beyer, and I am the “father of the logical data warehouse”. So, what does that mean? First, like any father, if you are not willing to address your ancestry with full candor, you will lose your place in the universe and wither away without making a meaningful contribution. As an implementer in the field, I was a student and practitioner of both Inmon and Kimball. I learned as much or more from my clients and my colleagues during multiple implementations as I did from studying any methodology. My Gartner colleagues challenged my concepts and helped hammer them into a comprehensive and complete concept. Simply put, I was willing to consider DNA contributions from anyone and anywhere, but through a form of unnatural selection, persisted in choosing to include the good genes and actively removing the undesirable elements.
I first used the term “logical data warehouse” while delivering a market analysis and guidance session for one of our Gartner clients at Sybase headquarters two and one-half years ago. I was outlining that Search added to the data warehouse was not “finished” and was an incomplete evolution. We were discussing technical architectures of the future. I kept pushing because the warehouse was always “supposed to” include all information in the enterprise, but it did not. As a field practitioner, it always seemed there was much more to be done and left even those projects perceived as highly successful as “not being done here”. We relayed how leading Gartner clients were making attempts to put every sort of content into their warehouse beyond structured data and outlined the missing components of the architecture. As we completed the discussion, I was asked “So, what do you call this trend?” At which point I was temporarily at a loss. I did not want to call it a federated warehouse. I did not want to call it a virtual warehouse. Both of those terms fell short and both had negative baggage in their history. I did not adhere to Warehouse 2.0—because it focused too heavily on adding Search (that was not the intent, but that was what the market had decided). So, in a flash of brilliance inspired by frustration, I muttered “the best name is probably a Logical Data Warehouse,…because, it focuses on the logic of information and not the mechanics that are used.” At this time, my Gartner colleague, Donald Feinberg responded “That’s a stupid name. I’m not going to use it.” Three months later, Donald was using it and like a good uncle in your own family tree, he was touting the accomplishments of the family golden child. Now, in fairness to Donald, he expressed his misgivings to me about the term as well and there was some “convincing” in between.
After on-going vetting of both the architectural approach and the methodology with clients and vendors, and studiously NOT using the term in public forums, we had completed our research and validated its conclusions. The logical data warehouse was part of Gartner’s Big Data, Extreme Information and Information Capabilities Framework research in 2011. It was published in the first week of August, and at the October Symposium in Orlando, Peter Sondergaard mentioned it as evidence of how the use and access to information is becoming a market changing force. The logical warehouse leveraged all of Gartner’s information management themes and brought them together in, well, a “logical information delivery platform”—hence, the logical data warehouse. The logical data warehouse inherits the genetics of best practices for 30+ years as well as the formative concepts of old warehouses and diverse data marts. As leading organizations begin to deploy this new generation warehouse, it will demonstrate its ability to meet the long-desired mission of the enterprise data warehouse of giving integrated access to all forms of information assets. The logical data warehouse will finally provide the information services platform for the applications of the highly competitive companies and organizations in the early 21st Century.
We are already seeing early inputs to the reference architecture and it is becoming clear that the “title fight” Gartner predicted in the 2009 Magic Quadrant for Data Warehouse Database Management Systems is underway. The DBMS vendors, the data integration and service bus vendors and even some of the application vendors have laced up their gloves and entered the ring with their mouth guards in place and their fists up. The logical data warehouse is the next significant evolution of information integration because it includes ALL of its progenitors and demands that each piece of previously proven engineering in the architecture should be used in its best and most appropriate place. This architecture will include and even expand the enterprise data warehouse, but will add semantic data abstraction and distributed processing. It will be fed by content and data mining to document data assets in metadata. And, it will monitor its own performance and provide that information first to manual administration, but then grow toward dynamic provisioning and evaluation of performance against service level expectations. This is important. This is big. This is NOT buzz. This is real.
Call me. I have a picture and different forms of the reference architecture are already filling up.
Category: Big Data data warehouse DBMS Tags: data warehouse
by Merv Adrian | September 26, 2011 | 5 Comments
Having just seen the movie myself, I was delighted to receive this guest post from my colleague Rita Sallam, a Research Director here who focuses on Analytics, BI, and Performance Management. It’s a good read.
As demonstrated in the movie “Moneyball”, starring Brad Pitt and opening in theaters today (http://www.moneyball-movie.com/), professional sports teams are increasingly using data mining and statistical analysis to find the players that best correlate to success.
This approach has resulted in the displacement of many long-held, but less relevant, performance statistics and “gut feel” recruiting approaches. Many successful teams are building on – and supplementing – this fact-based approach to winning by using collaborative decision making (CDM) platforms that enable key team decision makers to assess, weight and optimize a combination of quantitative and qualitative measures used to select the best players at any one time to meet their specific team needs.
CDM platforms combine business intelligence (BI) and other sources of information used for decision making, with social networking and collaboration capabilities, decision support tools, and analytic processes such as data mining and statistics, methodologies and models — to improve and capture the decision process.
Of course, professional sports teams are not new to BI; they have one of the longest histories of any industry for using player and game statistics to report on, assess and value players and team performance.
Advanced analytical and statistical approaches, used by teams such as the Oakland Athletics, Green Bay Packers, and the Calgary Flames, shift the odds of winning and have forced coaches and management to rethink the statistics and player attributes that matter most in player selection and to remix the formula for winning. These new analytical techniques are surfacing correlations between previously ignored and undervalued statistics and both player performance and winning.
This approach doesn’t just stop with movies and sports teams. Statistical analysis, combined with CDM, can help any organization rethink and then optimize the business processes that drive competing and succeeding. CDM can also highlight and resolve differences of opinion in the decision-making process in any organization relating to judgment and weighting of decision drivers.
Taking the view that player selection is similar to the key decision processes – such as vendor selection and portfolio optimization – which most companies must leverage, lessons can be learned from early adopters of CDM in the professional sports world to find opportunities to improve the quality of decision making in any organization.
Like professional sports teams, traditional companies must rethink how new measures that drive outcomes can also drive changes in business processes and look for ways to modify and optimize processes based on new CDM insights. Given that the potential competitive advantage from finding new statistical correlations for success can be short lived once other teams learn of the advantage, what is needed is a new way to assess performance and identify the ‘best fit’ players: one that could combine a range of quantitative and subjective measures, and that is customized to a particular team and their specific needs at any time, given the full picture of team dynamics, strengths and weaknesses. CDM can play this role by directly linking analytics to the decision-making process, and literally puts all decision makers on the same page.
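To make the weighting idea concrete, here is a toy sketch of the mechanism: each decision maker applies their own weights to the same mix of quantitative and qualitative measures, and the group’s scores are averaged into a single ranked list. The players, measures and weights are all invented for illustration; a real CDM platform wraps collaboration, audit and process capture around this arithmetic.

```python
# Toy sketch of group-weighted scoring; players, measures and weights are invented.
players = {
    "Player A": {"on_base_pct": 0.38, "salary_fit": 0.9, "scout_rating": 0.6},
    "Player B": {"on_base_pct": 0.33, "salary_fit": 0.7, "scout_rating": 0.9},
}

# Each stakeholder weights the same measures differently (weights sum to 1.0).
stakeholder_weights = {
    "analyst": {"on_base_pct": 0.6, "salary_fit": 0.3, "scout_rating": 0.1},
    "scout":   {"on_base_pct": 0.2, "salary_fit": 0.2, "scout_rating": 0.6},
}

def group_score(measures):
    """Average each stakeholder's weighted score into one group score."""
    per_person = [sum(weights[m] * measures[m] for m in measures)
                  for weights in stakeholder_weights.values()]
    return sum(per_person) / len(per_person)

for name in sorted(players, key=lambda p: group_score(players[p]), reverse=True):
    print(name, round(group_score(players[name]), 3))
```

The interesting part, as the paragraph above suggests, is not the arithmetic but agreeing on the weights; that is where differences of opinion surface and get resolved.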
If you’re interested in additional information, I have published a report titled “Beyond Moneyball: How Professional Sports Teams Are Using Collaborative Decision Making to Win” at http://www.gartner.com/resId=1800819 (a client subscription is required).
Category: Uncategorized Tags:
by Merv Adrian | July 19, 2011 | 4 Comments
The big players are moving in for a piece of the Big Data action. IBM, EMC, and NetApp have stepped up their messaging, in part to prevent startup upstarts like Cloudera from cornering the Apache Hadoop distribution market. They are all elbowing one another to get closest to “pure Apache” while still “adding value.” Numerous other startups have emerged, with greater or lesser reliance on, and extensions or substitutions for, the core Apache distribution. Yahoo! has found a funding partner and spun its team out, forming a new firm called Hortonworks, whose claim to fame begins with an impressive roster responsible for much of the code in the core Hadoop projects. Think of the Dr. Seuss children’s book featuring that famous elephant, and you’ll understand the name.
While we’re talking about kids – ever watch young kids play soccer? Everyone surrounds the ball. It takes years to learn their position on the field and play accordingly. There are emerging alphas, a few stragglers on the sidelines hoping for a chance to play, community participants – and a clear need for governance. Tech markets can be like that, and with 1600 attendees packing late June’s Hadoop Summit event, all of those scenarios were playing out. Leaders, new entrants, and the big silents, like the absent Oracle and Microsoft.
The ball is indeed in play; the open source Apache Hadoop stack today boasts “customers” among numerous Fortune 500 companies, running critical business workloads on Hadoop clusters constructed for data scientists and business sponsors – and very often with little or no participation by IT and the corporate data governance and enterprise architecture teams. Thousands of servers, multiple petabytes of data, and growing numbers of users are increasingly to be seen.
Community Growth – Chasing the Ball
At Hadoop Summit 2011, the community itself was in many ways the star – big, diverse, attentive, engaged, tweeting up a storm. Sessions were packed; content, dense and technical, was well received and understood; deep inquiry was the order of the day. Sessions on YouTube are here.
Sponsorship was substantial – 26 sponsoring companies did continuous, deeply engaged business throughout. Several told me when I asked “Yes, our inquiries here are from people with money to spend.” Many were hiring, and inboxes (physical ones, on their display tables – how quaint!) were visibly bulging. Eyes were on the ball, and players were swarming.
Before the conference began, Cloudera, founded in 2009 by its own team of early Hadoop contributors including early author Doug Cutting and Amr Awadallah, a former VP of engineering at Yahoo!, announced a new release, version 3.5, of its distribution. This is not to be confused with Cloudera Enterprise, a subscription service that adds a management suite and production support. Cloudera, which can take a great deal of pride in its evangelism, training and nurturing of the community to date, has maintained its OSS street cred, contributing significantly to Apache Hadoop, and has garnered more than 100 paid clients to add to all the free download users.
Cloudera’s focus with this release is on more than just the ball (the core distribution components and other surrounding “projects” – more on that below); it adds some important new capabilities that reflect continuing maturation of its game plan:
- Service and Configuration Manager (SCM) Express: a free, GUI-based offering for creating a cluster. It permits you to configure and start HDFS, MapReduce, HBase – without logging into each server individually.
- Cloudera Management Suite: a collection of software tools including a full-featured version of SCM, a resource manager and an activity monitor that allows operators to watch jobs, compare performance with prior runs, and identify bottlenecks.
- An Authorization Manager featuring “One-Click Security” – a capability that unifies what are sometimes differing security models in different components, including management of rights for users and groups.
Cloudera reaffirmed that its distribution remains 100% Apache software-licensed. In this, it joins IBM, whose endorsement of Apache was a key element of its discussion several weeks ago at a launch event for its own InfoSphere BigInsights.
Whose Team Is This, Anyway?
The Hadoop Summit event itself started with a bang. Jay Rossiter, SVP of the Cloud Platform Group at Yahoo!, detailed company-wide uses such as fraud and spam detection, ad targeting, geotagging and local indexing, search assist, aggregation and categorization of news stories, and predictive analytics. He introduced Eric Baldeschwieler, formerly VP of software engineering for the Hadoop team, who talked about where Hadoop came from and where it’s going. Then he took his free kick: he has been named CEO of Hortonworks, funded with investments from Yahoo! and Benchmark Capital. Hortonworks is described as “an independent company consisting of key architects and core contributors;” Baldeschwieler reminded the crowd that Yahoo! is the primary contributor to, and one of the leading users of Apache Hadoop.
Hortonworks’ strong relationship with one of the largest working – thriving – installations gives it a great opportunity to test its continuing contributions. This is a critical advantage. As an offering specifically pointed at large-scale deployments, any distribution of Hadoop, with whatever API-compatible but not yet committed additions, changes and associated projects such as HBase, Hive, Zookeeper, Pig, Flume, Sqoop, Oozie and others, will require substantial integration testing. Baldeschwieler noted in conversation with me later that there are more than 1,000 active, sophisticated Hadoop users of Apache on the Yahoo! grid; testing large workloads will be much more easily accomplished.
In general, Hortonworks scored a PR coup here – despite the “we’re all happy members of the Apache Hadoop community” messaging, the floor was theirs until the track sessions began, and they were well-represented there too. Nobody loses at kids’ soccer either – but somebody wins. The most prominent messaging to a dedicated community of 1600 advocates for the entire morning came from the new team on the block.
That’s not to say other voices were not heard. IBM’s Anant Jhingran entertained the crowd with some “Watson on Jeopardy” clips (Hadoop was a sizable component of Watson’s software stack), then waxed philosophical about how the community would develop as the “birthers” (present at the creation) began to increasingly interact with the “adopters” – those who come along later. His blog gives a quick summary. Key point: IBM’s ringing pitch for having as few distributions as possible, recommending that everyone focus on supporting and contributing to Apache’s. I had heard this from them a few weeks earlier at their analyst event, and in response to a question from the floor about IBM’s distribution, he could not have been more clear reiterating it: “We’re not in the distribution business.”
NetApp Wants to Play Too, And Others Tried Out
NetApp, not to be outdone by EMC, who announced their Hadoop offerings in May at EMC World, took its own run at the ball. NetApp has been a strong supporter of open source, they wanted everyone to know, involved in FreeBSD, Linux, and NFS – and say they want to play in the Apache Hadoop community too. Fair enough; they’re not alone, and they have a shiny new partnership with Hortonworks to show off.
Does NetApp have a play? Of course. Every soccer team needs a goalie; every Hadoop project needs storage – lots of storage. But goalies don’t score, so they don’t win games. They are necessary, but not sufficient. NetApp hopes to prove they can qualify for more roles, but they will need more offerings, and they will have competitors, new and old. Credit to them for trying out for the team. Other hopefuls:
- HStreaming – Announced the launch of “the most scalable real-time data processing platform powered by Apache Hadoop.”
- Karmasphere – Announced an “all-in-one virtual appliance for building Hadoop applications.” Also, a partnership with Think Big Analytics, a consultancy that shows the health of the emerging ecosystem.
- MapR Technologies – EMC’s partner unveiled two software editions: MapR M3 (free for an unlimited number of nodes) and MapR M5. MapR adds direct NFS connection and some lofty performance goals. While the Hadoop Distributed File System (HDFS) can handle 10-50PB, MapR can handle 1010 exabytes. While HDFS can handle some 4000 nodes [edited 7/19 from 2000] in a cluster, MapR can handle 10,000 or more. It also stresses its elimination of the namenode single point of failure problem, a real advantage that is likely to be relatively short-lived.
Much of the action is around swapping in pieces to supplement stack components from the Apache Hadoop core. Most typically, we see replacements or supplements for HDFS. DataStax’ Brisk distribution brings Apache Cassandra into the mix; MapR takes its own approach, and others including HBase and Hadapt are in the mix as well. But another player or two are tackling things at the upper end of the stack:
- Platform Computing – a mature, well-established firm in the financial markets, where risk applications require leading-edge speed and reliability, announced Platform MapReduce, the “first enterprise-class distributed runtime engine for MapReduce applications.” Targeting the high end of the stack – the Hadoop job and task scheduler itself – Platform is hoping to parlay its $70M presence into helping its clients as they examine the new game in town.
- Pervasive DataRush for Hadoop seeks to extend Pervasive’s successful franchise in high-performance parallel computing into the Hadoop ecosystem. Also in the game for a long time, Pervasive, a $50M firm, recently received patents it filed for over 6 years ago for its dataflow architecture that exploits hardware parallelism with relatively low memory usage.
Who’s On The Team?
Overall, “who’s on the team” seemed to be a key issue for the presenters, who targeted what IBM’s Jhingran called the “birthers” by stressing their pedigree. MapR has a significant relationship with EMC, but the latter lacks the street cred of Cloudera. EMC’s contributions to the Apache Hadoop projects are negligible, and it talks of using Facebook’s version; it has few if any contributing engineers. EMC’s large sales and marketing presence and its Isilon storage line, purpose-built for large, unstructured storage, could trump that.
Hortonworks clearly wins the battle of engineering contributors, and has the added advantage of a testbed environment in its part owner Yahoo!, but it’s new, and has little differentiating product – yet. In my meeting with the executive team, Baldeschwieler was clear about the early days. The first offerings will be training and support. As their roadmap plays out, they have a lot of funded talent to work on advancing Apache Hadoop, and many of the committers on the projects that compose the core stack. Certainly the value of having a commercial business affords a focus for buyer demands. Still, I believe that having several competing vendors in the space may make resolving competing priorities a challenge moving forward.
I had the opportunity to exchange messages with Hadapt’s Chief Scientist Daniel Abadi during the event. He was unable to attend, being occupied with bringing Hadapt’s own more database-like relational engine and query optimization to market. In twitter exchanges during the conference, we had both pointed out that the kids’ soccer-like “everybody here is a winner” message was disingenuous, and that of course these players are all competitors. Abadi noted that “vendors who incorporate Hadoop into their solution stack (such as Hadapt, Datameer, or Karmasphere) will breathe easier because Hortonworks is going to make the Apache distribution of Hadoop much better.” And Julian Hyde, of Mondrian and SQLStream fame, who has had much developer community dynamics experience with Eigenbase, was very supportive of the Apache governance model in a conversation with me later.
That governance will be critical for the future. Other Apache and non-Apache projects, like HBase, Hive, Zookeeper, Pig, Flume, Sqoop, Oozie, et al all have their own agendas. In Apache locution, each has its own “committers” – owners of the code lines – and the task of integrating disparate pieces – each on its own time line – will fall to somebody. Will your distribution owner test the combination of the particular ones you’re using? If not, that will be up to you. One of the biggest barriers to open source adoption so far has been precisely that degree of required self-integration. Gartner’s second half 2010 open source survey showed that more than half of the 547 surveyed organizations have adopted OSS solutions as part of their IT strategy. Data management and integration is the top initiative they name; 46% of surveyed companies named it. This is where the game is.
Category: Big Data Hadoop IBM MapReduce Microsoft OSS Yahoo! Tags: Apache, BigInsights, Brisk, Cassandra, Cloudera, Datarush, Datastax, Eigenbase, EMC, Facebook, Flume, Hadapt, Hadoop, Hbase, HDFS, Hive, Hortonworks, Hstreaming, IBM, InfoSphere, Isilon, Karmasphere, Linux, MapR, MapReduce, Microsoft, Mondrian, NetApp, NFS, Oozie, open source, Oracle, OSS, Pervasive, Pig, Platform Computing, SQLStream, Sqoop, Watson, Yahoo!, zookeeper