by Merv Adrian | January 23, 2012 | Comments Off
In early January 2012, the world of big data was treated to an interesting series of product releases, press announcements, and blog posts about Hadoop versions. To begin with, we had the long-awaited announcement, in a press release, of Apache Hadoop version 1.0. Although there were grumblings here and there in the Twittersphere that changes to release numbers are meaningless, my discussions with Gartner's enterprise customers indicate otherwise. Products with release numbers like 0.20.2 make the hair on Procurement's neck stand on end, and as Hadoop begins to get mainstream attention (Gartner clients: see the Hype Cycle for Data Management, 2011), IT architects and executives find such optics quite important. Hadoop is moving beyond pioneers like Amazon, Yahoo! and LinkedIn into shops like JP Morgan Chase, and they pay attention to such things.
So what, after 6 years of steady work on earlier versions, makes this one worthy of a 1.0 designation? As Cloudera noted in a blog post, “There has been an 18 month period where there has been no one Apache release that had all the committed features of Apache Hadoop.”
That would be version 0.20.2 from 2010. In that time, every major DBMS vendor has put a strategy in place for side-by-side operation with Hadoop, sometimes with direct execution and calls from “inside” (Teradata's Aster and EMC's Greenplum pioneered that). So you might think “1.0” is just labeling to provide a veneer of “enterprise-ness” for customers of those products. But give the tireless gang of committers some credit; in this release they have incorporated features from a couple of branches (what does that mean, you ask? read on), added strong authentication (Kerberos) and a REST API for HDFS, and, for those who prefer the broader set of use cases enabled by HBase, made several significant additions to improve working with it. But unfortunately, as useful as a 1.0 label is, it hardly eliminates the confusion, as we'll see below.
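For readers who want to see what that REST interface looks like in practice, here is a minimal sketch using the WebHDFS conventions that shipped with the 1.0 line. It assumes WebHDFS is enabled on the cluster and that the namenode answers HTTP on its default port; the host name and directory path below are hypothetical.

```python
# Minimal sketch: list an HDFS directory over the WebHDFS REST API.
# Assumes WebHDFS is enabled; host, port and path below are hypothetical.
import json
from urllib.request import urlopen

NAMENODE = "http://namenode.example.com:50070"   # hypothetical namenode, default HTTP port
HDFS_PATH = "/user/analyst/logs"                 # hypothetical HDFS directory

# LISTSTATUS returns a JSON document describing the directory's contents.
with urlopen(f"{NAMENODE}/webhdfs/v1{HDFS_PATH}?op=LISTSTATUS") as resp:
    listing = json.load(resp)["FileStatuses"]["FileStatus"]

for entry in listing:
    print(entry["pathSuffix"], entry["type"], entry["length"])
```

The point is less the specific call than the fact that HDFS can now be reached with nothing more exotic than an HTTP client.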
Trunks and Branches
For those who don't track Apache release matters, the Apache Software Foundation supports Projects, which are created by volunteers accepted into the organization on the basis of the code they write; they gain privileges as Committers who can put code into the main codeline, known as “trunk.” MapReduce is a project; so is HDFS. As new code is developed, the move from one numbered version to another is initiated by the creation of a branch, which is tested until the project members approve its readiness by vote. Branches may move “into trunk,” but other branches may also be started in the interim – and they may have different features.
Charles Zedlewski of Cloudera has described this in more detail in the excellent post linked above. In it he notes that release 0.20.2 in 2010 was the last release that had “all the usable features committed to Apache Hadoop.” Following the 0.20.2 release, work continued on that branch, resulting in 0.20.203 in 2011, which added security but not RAID support or append (the feature that ensures HBase, a separate project, doesn't lose data). It was followed by 0.20.205, which did add append, although still not RAID. It is 0.20.205 that became 1.0. Seems straightforward, right?
Unfortunately, it's not. Branches 0.21, 0.22 and 0.23 were all introduced in that 18-month period, the latter at the end of 2011. Version 0.23, notes Cloudera, has “all of the features of any past release.” This includes a fix for the name node single point of failure issue and other HA capabilities, both of which matter a great deal to enterprise users. Good news? Well, Release 1.0, as noted, is not based on 0.23, so it does not have these.
Distributions – Solving, Or Muddying the Waters?
So, does using a distribution help sort all this out? Consider: Cloudera’s CDH3 distribution was issued before 1.0. But Cloudera distributions get updates designated with a U. Not only updates to Hadoop; remember there are other projects in there too. So CDH3U0 (yes, they use zeros) uses HBase 0.90.0 whereas CDH3U2 uses HBase 0.90.4; it also added Mahout and expanded support for Avro’s file format.
The latter discussion reinforces the important point that “Hadoop” means more than MapReduce, HDFS and job/task management to people considering it: solutions (and hence, distributions) typically involve several other projects. HBase, for example, seems to appear in at least half of them – a study by Dave Menninger of Ventana Research last year found that 61% of respondents include it in deployments. Other parts of the stack, like Pig, Hive, and Sqoop, are also found in many if not most of the initiatives I've had contact with. The complexity of keeping all their versions straight as new code is contributed is a key reason to use distributions, which track and integrate a dozen or more projects.
How about Hortonworks, the other specialist with a large number of committers? They have announced a “public preview of the Hortonworks Data Platform (HDP) version 2.” That will be based on 0.23 – all those features, plus HCatalog, which Cloudera does not include in CDH (yet). There is also a “private technology preview” of HDP version 1; “a public technology preview will be made available later this quarter.” What do these preview terms mean? Hortonworks explains:
The Technology Preview Program begins with a Limited Preview phase that enables us to engage a manageable number of representative customers, partners, and community users on focused, hands-on testing and proof of concept deployments of the Hortonworks Data Platform… Public Preview… opens the process to anyone interested in working closely with Hortonworks… culminates in the final release of the software and General Availability.
Other distributions have their own mix of projects, reflecting their own point of view. It can be hard to find out which versions of the various projects are supported in each one. Neither Hortonworks' nor MapR's website, for example, shows the version numbers of the included projects – including the varied additional ones they each add – the way Cloudera's chart does. And distributions from IBM, DataStax, NetApp and others are in the hunt as well, each with its own profile. For now, the continuing confusion and multiple conflicting “Hadoops” only serve to reinforce IT's concerns about Hadoop's readiness for a robust, governed environment – unless one distribution, from one trusted provider, is chosen.
In upcoming Gartner research I’ll talk about the distributions, how they vary, and how to track and choose among them.
by Merv Adrian | November 3, 2011 | 6 Comments
Another guest post, this time from my colleague and friend Mark Beyer.
My name is Mark Beyer, and I am the “father of the logical data warehouse.” So, what does that mean? First, like any father, if you are not willing to address your ancestry with full candor, you will lose your place in the universe and wither away without making a meaningful contribution. As an implementer in the field, I was a student and practitioner of both Inmon and Kimball. I learned as much or more from my clients and my colleagues during multiple implementations as I did from studying any methodology. My Gartner colleagues challenged my concepts and helped hammer them into something comprehensive and complete. Simply put, I was willing to consider DNA contributions from anyone and anywhere but, through a form of unnatural selection, persisted in including the good genes and actively removing the undesirable ones.
I first used the term “logical data warehouse” while delivering a market analysis and guidance session for one of our Gartner clients at Sybase headquarters two and one-half years ago. We were discussing technical architectures of the future, and I was arguing that adding Search to the data warehouse did not “finish” it – it was an incomplete evolution. I kept pushing because the warehouse was always “supposed to” include all information in the enterprise, but it did not. As a field practitioner, it always seemed to me there was much more to be done, and even those projects perceived as highly successful felt not quite done. We relayed how leading Gartner clients were making attempts to put every sort of content into their warehouse beyond structured data, and outlined the missing components of the architecture. As we completed the discussion, I was asked, “So, what do you call this trend?” At which point I was temporarily at a loss. I did not want to call it a federated warehouse. I did not want to call it a virtual warehouse. Both of those terms fell short, and both had negative baggage in their history. I did not adhere to Warehouse 2.0, because it focused too heavily on adding Search (that was not the intent, but that was what the market had decided). So, in a flash of brilliance inspired by frustration, I muttered, “The best name is probably a Logical Data Warehouse… because it focuses on the logic of information and not the mechanics that are used.” At that point my Gartner colleague Donald Feinberg responded, “That's a stupid name. I'm not going to use it.” Three months later, Donald was using it and, like a good uncle in your own family tree, he was touting the accomplishments of the family golden child. Now, in fairness to Donald, he expressed his misgivings about the term to me as well, and there was some “convincing” in between.
After ongoing vetting of both the architectural approach and the methodology with clients and vendors, and studiously NOT using the term in public forums, we completed our research and validated its conclusions. The logical data warehouse was part of Gartner's Big Data, Extreme Information and Information Capabilities Framework research in 2011. It was published in the first week of August, and at the October Symposium in Orlando, Peter Sondergaard mentioned it as evidence of how the use of, and access to, information is becoming a market-changing force. The logical warehouse leveraged all of Gartner's information management themes and brought them together in, well, a “logical information delivery platform” – hence, the logical data warehouse. It inherits the genetics of 30+ years of best practices as well as the formative concepts of old warehouses and diverse data marts. As leading organizations begin to deploy this new generation of warehouse, it will demonstrate its ability to meet the long-desired mission of the enterprise data warehouse: integrated access to all forms of information assets. The logical data warehouse will finally provide the information services platform for the applications of highly competitive companies and organizations in the early 21st century.
We are already seeing early inputs to the reference architecture, and it is becoming clear that the “title fight” Gartner predicted in the 2009 Magic Quadrant for Data Warehouse Database Management Systems is underway. The DBMS vendors, the data integration and service bus vendors and even some of the application vendors have laced up their gloves and entered the ring with their mouth guards in place and their fists up. The logical data warehouse is the next significant evolution of information integration because it includes ALL of its progenitors and demands that each piece of previously proven engineering in the architecture be used in its best and most appropriate place. This architecture will include and even expand the enterprise data warehouse, but will add semantic data abstraction and distributed processing. It will be fed by content and data mining to document data assets in metadata. And it will monitor its own performance, providing that information first for manual administration, then growing toward dynamic provisioning and evaluation of performance against service-level expectations. This is important. This is big. This is NOT buzz. This is real.
Call me. I have a picture and different forms of the reference architecture are already filling up.
by Merv Adrian | September 26, 2011 | 5 Comments
Having just seen the movie myself, I was delighted to receive this guest post from my colleague Rita Sallam, a Research Director here who focuses on Analytics, BI, and Performance Management. It’s a good read.
As demonstrated in the movie “Moneyball,” starring Brad Pitt and opening in theaters today (http://www.moneyball-movie.com/), professional sports teams are increasingly using data mining and statistical analysis to find the players who best correlate with success.
This approach has resulted in the displacement of many long-held, but less relevant, performance statistics and “gut feel” recruiting approaches. Many successful teams are building on – and supplementing – this fact-based approach to winning by using collaborative decision making (CDM) platforms that enable key team decision makers to assess, weight and optimize a combination of quantitative and qualitative measures used to select the best players at any one time to meet their specific team needs.
CDM platforms combine business intelligence (BI) and other sources of information used for decision making with social networking and collaboration capabilities, decision-support tools, and analytic processes such as data mining and statistics – along with methodologies and models – to improve and capture the decision process.
Of course, professional sports teams are not new to BI; they have one of the longest histories of any industry for using player and game statistics to report on, assess and value players and team performance.
Advanced analytical and statistical approaches, used by teams such as the Oakland Athletics, Green Bay Packers, and Calgary Flames, shift the odds of winning and have forced coaches and management to rethink the statistics and player attributes that matter most in player selection, and to remix the formula for winning. These new analytical techniques are revealing that previously ignored and undervalued statistics correlate with player performance and winning.
This approach doesn't stop with movies and sports teams. Statistical analysis, combined with CDM, can help any organization rethink and then optimize the business processes through which it competes and succeeds. CDM can also highlight and resolve differences of opinion in any organization's decision-making process relating to the judgment and weighting of decision drivers.
If you take the view that player selection is similar to the key decision processes most companies face – such as vendor selection and portfolio optimization – then lessons can be learned from early adopters of CDM in the professional sports world and used to improve the quality of decision making in any organization.
Like professional sports teams, traditional companies must rethink how new measures that drive outcomes can also drive changes in business processes, and look for ways to modify and optimize processes based on new CDM insights. Given that the competitive advantage from finding new statistical correlations for success can be short-lived once other teams learn of it, what is needed is a new way to assess performance and identify the “best fit” players: one that combines a range of quantitative and subjective measures, and that is customized to a particular team and its specific needs at any time, given the full picture of team dynamics, strengths and weaknesses. CDM can play this role by directly linking analytics to the decision-making process, literally putting all decision makers on the same page.
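To make the weighting idea concrete, here is a toy sketch of the kind of composite scoring a CDM exercise might settle on. The measures, weights and scores are invented for illustration; the same arithmetic applies whether the candidates are players or vendors.

```python
# Toy sketch: combine quantitative statistics with subjective, collaboratively
# agreed judgments into a single weighted score. All numbers are invented.
weights = {
    "on_base_pct": 0.40,       # quantitative measure
    "defensive_rating": 0.20,  # quantitative measure
    "scout_score": 0.25,       # qualitative, averaged across decision makers
    "team_fit": 0.15,          # qualitative, agreed in the CDM process
}

candidates = {
    "Player A": {"on_base_pct": 0.82, "defensive_rating": 0.55, "scout_score": 0.70, "team_fit": 0.60},
    "Player B": {"on_base_pct": 0.64, "defensive_rating": 0.80, "scout_score": 0.85, "team_fit": 0.75},
}

def weighted_score(measures):
    """Collapse normalized (0-1) measures into one composite score."""
    return sum(weights[name] * value for name, value in measures.items())

for name, measures in sorted(candidates.items(), key=lambda kv: -weighted_score(kv[1])):
    print(f"{name}: {weighted_score(measures):.3f}")
```

Changing the weights – the part the decision makers argue about – reorders the list, which is exactly the conversation a CDM platform is meant to capture.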
If you’re interested in additional information, I have published a report titled “Beyond Moneyball: How Professional Sports Teams Are Using Collaborative Decision Making to Win” at http://www.gartner.com/resId=1800819 (a client subscription is required).
by Merv Adrian | July 19, 2011 | 4 Comments
The big players are moving in for a piece of the Big Data action. IBM, EMC, and NetApp have stepped up their messaging, in part to prevent upstarts like Cloudera from cornering the Apache Hadoop distribution market. They are all elbowing one another to get closest to “pure Apache” while still “adding value.” Numerous other startups have emerged, with greater or lesser reliance on, and extensions or substitutions for, the core Apache distribution. Yahoo! has found a funding partner and spun its team out, forming a new firm called Hortonworks, whose claim to fame begins with an impressive roster responsible for much of the code in the core Hadoop projects. Think of the Dr. Seuss children's book featuring that famous elephant, and you'll understand the name.
While we're talking about kids – ever watch young kids play soccer? Everyone surrounds the ball. It takes years for them to learn their positions on the field and play accordingly. There are emerging alphas, a few stragglers on the sidelines hoping for a chance to play, community participants – and a clear need for governance. Tech markets can be like that, and with 1600 attendees packing late June's Hadoop Summit event, all of those scenarios were playing out: leaders, new entrants, and the big silents, like the absent Oracle and Microsoft.
The ball is indeed in play; the open source Apache Hadoop stack today boasts “customers” among numerous Fortune 500 companies, running critical business workloads on Hadoop clusters constructed for data scientists and business sponsors – and very often with little or no participation by IT and the corporate data governance and enterprise architecture teams. Thousands of servers, multiple petabytes of data, and growing numbers of users are increasingly common.
Community Growth – Chasing the Ball
At Hadoop Summit 2011, the community itself was in many ways the star – big, diverse, attentive, engaged, tweeting up a storm. Sessions were packed; content, dense and technical, was well received and understood; deep inquiry was the order of the day. Sessions on YouTube are here.
Sponsorship was substantial – 26 sponsoring companies did continuous, deeply engaged business throughout. Several told me when I asked, “Yes, our inquiries here are from people with money to spend.” Many were hiring, and inboxes (physical ones, on their display tables – how quaint!) were visibly bulging. Eyes were on the ball, and players were swarming.
Before the conference began, Cloudera, founded in 2009 by its own team of early Hadoop contributors including early author Doug Cutting and Amr Awadallah, a former VP of engineering at Yahoo!, announced a new release, version 3.5, of its distribution. This is not to be confused with Cloudera Enterprise, a subscription service that adds a management suite and production support. Cloudera, which can take a great deal of pride in its evangelism, training and nurturing of the community to date, has maintained its OSS street cred, contributing significantly to Apache Hadoop, and has garnered more than 100 paid clients to add to all the free download users.
Cloudera’s focus with this release is on more than just the ball (the core distribution components and other surrounding “projects” – more on that below); it adds some important new capabilities that reflect continuing maturation of its game plan:
- Service and Configuration Manager (SCM) Express: a free, GUI-based offering for creating a cluster. It permits you to configure and start HDFS, MapReduce, and HBase without logging into each server individually.
- Cloudera Management Suite: a collection of software tools including a full-featured version of SCM, a resource manager and an activity monitor that allows operators to watch jobs, compare performance with prior runs, and identify bottlenecks.
- An Authorization Manager featuring “One-Click Security” – a capability that unifies what are sometimes differing security models in different components, including management of rights for users and groups.
Cloudera reaffirmed that its distribution remains 100% Apache software-licensed. In this, it joins IBM, whose endorsement of Apache was a key element of its discussion several weeks ago at a launch event for its own InfoSphere BigInsights.
Whose Team Is This, Anyway?
The Hadoop Summit event itself started with a bang. Jay Rossiter, SVP of the Cloud Platform Group at Yahoo!, detailed company-wide uses such as fraud and spam detection, ad targeting, geotagging and local indexing, search assist, aggregation and categorization of news stories, and predictive analytics. He introduced Eric Baldeschwieler, formerly VP of software engineering for the Hadoop team, who talked about where Hadoop came from and where it's going. Then he took his free kick: he has been named CEO of Hortonworks, funded with investments from Yahoo! and Benchmark Capital. Hortonworks is described as “an independent company consisting of key architects and core contributors”; Baldeschwieler reminded the crowd that Yahoo! is the primary contributor to, and one of the leading users of, Apache Hadoop.
Hortonworks' strong relationship with one of the largest working – thriving – installations gives it a great opportunity to test its continuing contributions. This is a critical advantage. As an offering specifically pointed at large-scale deployments, any distribution of Hadoop, with whatever API-compatible but not yet committed additions, changes and associated projects such as HBase, Hive, Zookeeper, Pig, Flume, Sqoop, Oozie and others, will require substantial integration testing. Baldeschwieler noted in conversation with me later that there are more than 1,000 active, sophisticated users of Apache Hadoop on the Yahoo! grid; testing large workloads will be much more easily accomplished there.
In general, Hortonworks scored a PR coup here – despite the “we’re all happy members of the Apache Hadoop community” messaging, the floor was theirs until the track sessions began, and they were well-represented there too. Nobody loses at kids’ soccer either – but somebody wins. The most prominent messaging to a dedicated community of 1600 advocates for the entire morning came from the new team on the block.
That's not to say other voices were not heard. IBM's Anant Jhingran entertained the crowd with some “Watson on Jeopardy” clips (Hadoop was a sizable component of Watson's software stack), then waxed philosophical about how the community would develop as the “birthers” (present at the creation) began to increasingly interact with the “adopters” – those who come along later. His blog gives a quick summary. Key point: IBM's ringing pitch for having as few distributions as possible, recommending that everyone focus on supporting and contributing to Apache's. I had heard this from them a few weeks earlier at their analyst event, and in response to a question from the floor about IBM's distribution, he could not have been clearer in reiterating it: “We're not in the distribution business.”
NetApp Wants to Play Too, And Others Tried Out
NetApp, not to be outdone by EMC, which announced its Hadoop offerings in May at EMC World, took its own run at the ball. NetApp has been a strong supporter of open source, it wanted everyone to know, involved in FreeBSD, Linux, and NFS – and says it wants to play in the Apache Hadoop community too. Fair enough; it's not alone, and it has a shiny new partnership with Hortonworks to show off.
Does NetApp have a play? Of course. Every soccer team needs a goalie; every Hadoop project needs storage – lots of storage. But goalies don’t score, so they don’t win games. They are necessary, but not sufficient. NetApp hopes to prove they can qualify for more roles, but they will need more offerings, and they will have competitors, new and old. Credit to them for trying out for the team. Other hopefuls:
- HStreaming – Announced the launch of “the most scalable real-time data processing platform powered by Apache Hadoop.”
- Karmasphere – Announced an “all-in-one virtual appliance for building Hadoop applications,” along with a partnership with Think Big Analytics, a consultancy whose presence is a sign of the health of the emerging ecosystem.
- MapR Technologies – EMC's partner unveiled two software editions: MapR M3 (free for an unlimited number of nodes) and MapR M5. MapR adds direct NFS connection and some lofty performance goals. While the Hadoop Distributed File System (HDFS) can handle 10-50PB, MapR can handle 1010 exabytes. While HDFS can handle some 4000 nodes [edited 7/19 from 2000] in a cluster, MapR can handle 10,000 or more. It also stresses its elimination of the namenode single point of failure problem, a real advantage that is likely to be relatively short-lived.
Much of the action is around swapping in pieces to supplement stack components from the Apache Hadoop core. Most typically, we see replacements or supplements for HDFS. DataStax’ Brisk distribution brings Apache Cassandra into the mix; MapR takes its own approach, and others including HBase and Hadapt are in the mix as well. But another player or two are tackling things at the upper end of the stack:
- Platform Computing – a mature, well-established firm in the financial markets, where risk applications require leading-edge speed and reliability, announced Platform MapReduce, the “first enterprise-class distributed runtime engine for MapReduce applications.” Targeting the high end of the stack – the Hadoop job and task scheduler itself – Platform is hoping to parlay its $70M presence into helping its clients as they examine the new game in town.
- Pervasive DataRush for Hadoop seeks to extend Pervasive’s successful franchise in high-performance parallel computing into the Hadoop ecosystem. Also in the game for a long time, Pervasive, a $50M firm, recently received patents it filed for over 6 years ago for its dataflow architecture that exploits hardware parallelism with relatively low memory usage.
Who’s On The Team?
Overall, “who's on the team” seemed to be a key issue for the presenters, who targeted what IBM's Jhingran called the “birthers” by stressing their pedigree. MapR has a significant relationship with EMC, but the latter lacks the street cred of Cloudera. EMC's contributions to the Apache Hadoop projects are negligible, and it talks of using Facebook's version; it has few if any contributing engineers. EMC's large sales and marketing presence and its Isilon storage line, purpose-built for large, unstructured storage, could trump that.
Hortonworks clearly wins the battle of engineering contributors, and has the added advantage of a testbed environment in its part-owner Yahoo!, but it's new, and has little differentiating product – yet. In my meeting with the executive team, Baldeschwieler was clear about the early days: the first offerings will be training and support. As their roadmap plays out, they have a lot of funded talent to work on advancing Apache Hadoop, and many of the committers on the projects that compose the core stack. Certainly having a commercial business affords a focal point for buyer demands. Still, I believe that having several competing vendors in the space may make resolving competing priorities difficult moving forward.
I had the opportunity to exchange messages with Hadapt's Chief Scientist Daniel Abadi during the event. He was unable to attend, being occupied with bringing Hadapt's own more database-like relational engine and query optimization to market. In Twitter exchanges during the conference, we had both pointed out that the kids'-soccer-like “everybody here is a winner” message was disingenuous, and that of course these players are all competitors. Abadi noted that “vendors who incorporate Hadoop into their solution stack (such as Hadapt, Datameer, or Karmasphere) will breathe easier because Hortonworks is going to make the Apache distribution of Hadoop much better.” And Julian Hyde, of Mondrian and SQLStream fame, who gained much experience with developer community dynamics through Eigenbase, was very supportive of the Apache governance model in a conversation with me later.
That governance will be critical for the future. Other Apache and non-Apache projects – HBase, Hive, Zookeeper, Pig, Flume, Sqoop, Oozie, et al. – all have their own agendas. In Apache locution, each has its own “committers” – owners of the code lines – and the task of integrating the disparate pieces, each on its own timeline, will fall to somebody. Will your distribution owner test the combination of the particular ones you're using? If not, that will be up to you. One of the biggest barriers to open source adoption so far has been precisely that degree of required self-integration. Gartner's second-half 2010 open source survey showed that more than half of the 547 surveyed organizations have adopted OSS solutions as part of their IT strategy. Data management and integration is the top initiative they name; 46% of surveyed companies cited it. This is where the game is.
by Merv Adrian | June 23, 2011 | 2 Comments
In the months since IBM closed its Netezza acquisition, the data warehouse appliance pioneer has been busy, if the announcements at this week's Enzee are any indication. An enthusiastic crowd – 1000 strong – heard CEO Jim Baum deliver the news: new hardware, software and partnerships. The biggest news was The Appliance Formerly Known As Cruiser, now known as the Netezza High Capacity Appliance (HCA). A wag made up some t-shirts bearing the acronym TAFKAC and did quite well. IBM is aiming to push the size perception for Netezza higher. How high? Half a PB in a rack. You can scale it to 10PB.
Baum, keynoting to kick things off, was clearly jazzed to be bringing a new product to market so soon after the acquisition, with Arvind Krishna, GM of Information Management for IBM, there to provide his blessing – not without a sly question about why it wasn’t blue. Baum returned the jibe, noting that in the acquisition, Netezza had grown to over 400,000 employees.
Humor aside, IBM support was everywhere to be seen – I ran into IBMers I knew at every turn, and on Day 2 Steve Mills, Senior Vice President and Group Executive, Software & Systems, kicked the day off – and took a humorous shot of his own at the swirly color panel on the front of the Netezza boxes. His style and attitude manifested the cultural embrace Mills described the next day in a Q&A at the end of his session.
Netezza employees, customers and prospects have reason to be sanguine about the possibilities – suddenly Netezza is selling in dozens of countries where it had no presence before, with a value proposition far simpler and quicker than many other IBM offerings and at a price point that ought to land it on many a short list, where it wins its share. The HCA claims 5.5 TB/hour load rates and Razi Raziuddin, Netezza Senior Director of Product Marketing, told attendees of his session that the standby node in every rack will be pressed into service later this year to substantially boost that – doubling is not likely, but they may get close. Time will tell. Replication is on the roadmap too – and the impression that IBM resources are helping get more things done was reinforced by the announcement that SPSS now runs models natively inside Netezza. In another imaginative bit of nomenclature, Netezza iClass analytics, the sizable portfolio of native “inside” executables, have been renamed IBM Netezza Analytics.
A two-way Hadoop connector, developed with Cloudera, was also announced by Krishnan Parasuraman, CTO/Chief Architect, Digital Media for Netezza. Audience awareness of Hadoop in the Netezza-Hadoop session, which was mostly just an intro to Hadoop itself, was mixed – some were quite knowledgeable, but the first question asked at the end of the talk was “What is MapReduce?” This reaffirms Gartner's positioning of several of these technologies on the Hype Cycle – and a belief of mine that neither vendors nor big research firms like ours are always talking to the people actually doing these newer things. The early adopters are often drawn from other parts of the organization – parts that don't talk to research firms, or to their own procurement teams. Sometimes, they are bypassing their internal IT governance processes to such a degree that their activities are completely under the radar. As the use of cloud-based resources like Amazon Elastic MapReduce proliferates, this issue is exacerbated – like the spreadsheets that go home on thumb drives, but even more worrisome.
I was happy to be able to attend and speak at this event, which may be the last one of its kind as Netezza is absorbed into other IBM events going forward. In a private meeting with Steve Mills, I was struck by his enthusiasm for what he called Netezza's “premier example of the use of FPGAs.” He says he's used it as a catalyst to challenge other IBM teams, and as the head of both software and hardware he has extraordinary scope to drive his vision of workload-optimized systems – one that Ambuj Goyal, General Manager of the Systems and Technology Group's Development and Manufacturing, told me recently encompasses a flexible, multi-hardware fabric under a single management and governance framework. Mills also reiterated his increasingly sharp critique of Oracle, especially its stewardship since acquiring Sun.
IBM's other data warehouse platforms, the Smart Analytics Systems (ISAS), were, unsurprisingly, not much in evidence. The positioning seems to be emerging: Netezza as the purer appliance story, with simplicity and time to value as key themes; ISAS as the more configurable, general-purpose offering for mixed workloads. Time will tell how well this story will do – Netezza's messaging about numbers was clearer pre-acquisition than IBM's has been about ISAS, so it will take a while for us to assess progress in the market.
by Merv Adrian | March 23, 2011 | 6 Comments
This is my first post on the Gartner blog network. It feels a little odd to be posting after nearly 3 months away – my old blog at www.itmarketstrategy.com was a frequently updated place. I hope to make this one the same. Why the lag?
It turns out that there is a great deal of prior art to deal with when you’re a Gartner analyst. This is no small matter – with over 700 other analysts publishing regularly, I want to be sure that I’m aware of “Gartner positions” on key issues. Not because I won’t be willing to take a contrary view – but if so, I need to be respectful and explain why I differ. And as a courtesy, it’s appropriate to have a dialogue with the author of that position. Not to do so would not only be disrespectful, but it would confuse our readers, Gartner clients or not.
Another issue to consider is priorities. Managing my time begins with client inquiry response. I've been stunned by the volume of inquiry I've had since I “went live” on internal systems here. It helps to be in such a hot space – data, big, small, batch or interactive, structured or not, appliance or cloud… there are so many questions our clients have, and I'm routinely handling several inquiries every day that I'm available. It's a fantastic source of input for me to understand what people are concerned about, investing in, or deploying. My perspective is shifting steadily based on that input – the last few years were very vendor-centric for me, and that skews your point of view. Processing those changes, and doing the grunt work of digging in to answer questions about issues I haven't been close to for a while, has taken some time.
Finally, there is another question of priorities: publishing Gartner research is at the top of the list. As I’ve gotten ramped up, trained, handled inquiries, and kept my usual ear to the ground via events and briefings, I’ve been slow to start the writing process. All of the above things not only felt prerequisite, they competed for my time. But now, I feel up to speed, and I’ve implemented a scheduling model that will let me carve out the time I need to do the writing well.
For those wondering where I’ve been, I’m back. I’ve been learning the ropes. But I’m not tied up anymore. Watch this space!