by Merv Adrian | March 24, 2014 | 11 Comments
This post was jointly authored by Merv Adrian (@merv) and Nick Heudecker (@nheudecker) and appears on both blogs.
In the early days of Hadoop (versions up through 1.x), the project consisted of two primary components: HDFS and MapReduce. One thing to store the data in an append-only file model, distributed across an arbitrarily large number of inexpensive nodes with disk and processing power; another to process it, in batch, with a relatively small number of available function calls. And some other stuff called Common to handle bits of the plumbing. But early adopters demanded more functionality, so the Hadoop footprint grew. The result was an identity crisis that grows progressively more challenging for decision makers with almost every new announcement.
This expanding footprint included a sizable group of “related projects,” mostly under the Apache Software Foundation. When Gartner published How to Choose the Right Apache Hadoop Distribution in early February 2012, the leading vendors we surveyed (Cloudera, MapR, IBM, Hortonworks, and EMC) all included Pig, Hive, HBase, and Zookeeper. Most were willing to support Flume, Mahout, Oozie, and Sqoop. Several other projects were supported by some, but not all. If you were asked at the time, “What is Hadoop?” this set of ten projects, the commercially supported ones, would have made a good response.
In 2013, Hadoop 2.0 arrived, and with it a radical redefinition. YARN muddied the clear definition of Hadoop by introducing a way for multiple applications to use the cluster resources. You have options. Instead of just MapReduce, you can run Storm (or S4 or Spark Streaming), Giraph, or HBase, among others. The list of projects with abstract names goes on. At least fewer of them are animals now.
During the intervening time, vendors have selected different projects and versions to package and support. To a greater or lesser degree all of these vendors call their products Hadoop – some are clearly attempting to move “beyond” that message. Some vendors are trying to break free from the Hadoop baggage by introducing new, but just as awful, names. We have data lakes, hubs, and no doubt more to come.
But you get the point. The vague names indicate the vendors don’t know what to call these things either. If they don’t know what they’re selling, do you know what you’re buying? If the popular definition of Hadoop has shifted from a small conglomeration of components to a larger, increasingly vendor-specific conglomeration, does the name “Hadoop” really mean anything anymore?
Today the list of projects supported by leading vendors (now Cloudera, Hortonworks, MapR, Pivotal and IBM) numbers 13: HDFS, YARN, MapReduce, Pig, Hive, HBase, Zookeeper, Flume, Mahout, Oozie, Sqoop – and Cascading and HCatalog. Coming up fast are Spark, Storm, Accumulo, Sentry, Falcon, Knox, Whirr… and maybe Lucene and Solr. Numerous others are only supported by their distributor and are likely to remain so, though perhaps MapR’s support for Cloudera Impala will not be the last time we see an Apache-licensed, but not Apache project, break the pattern. All distributions have their own unique value-add. The answer to the question, “What is Hadoop?” and the choice buyers must make will not get easier in the year ahead – it will only become more difficult.
Category: Accumulo Ambari Apache Apache Drill Apache Yarn Big Data BigInsights Cloudera Elastic MapReduce Gartner Giraph Hadoop Hbase HCatalog HDFS Hive Hortonworks IBM Intel Lucene MapR MapReduce Oozie open source OSS Pig Solr Sqoop Storm YARN Zookeeper Tags: Apache, big data, BigInsights, CDH, Cloudera, Datastax, EMC, Flume, Hadoop, Hbase, HDFS, Hive, Hortonworks, IBM, InfoSphere, Isilon, MapR, MapReduce, Oozie, open source, OSS, Pig, Pivotal, Sqoop, zookeeper
by Merv Adrian | February 23, 2014 | 5 Comments
In my post about the BYOH market last October, I noted that increasing numbers of existing players are connecting their offerings to Apache Hadoop, even as upstarts enter their markets with a singular focus. And last month, I pointed out that Nick Heudecker and I detected a surprising lack of concern about security in a recent Hadoop webinar. Clearly, these two topics have an important intersection – both Hadoop specialists (including distribution vendors) and existing security vendors will need to expand their efforts to drive awareness if they are to capture an opportunity that is clearly going begging today. Security for big data will be a key issue in 2014 and beyond.
Other analysts at Gartner have tracked many of these products, and in my own followup I’ve been catching up on the work of Joseph Feiman and Brian Lowans, among others. Their Magic Quadrant for Data Masking, published in December, offers useful discussion of that capability (both static and dynamic) and which existing players have already added Hadoop support. Axis Technology’s DMSuite, Dataguise (who partners with Compuware), IBM InfoSphere Optim Data Privacy and InfoSphere Guardium Data Activity Monitor, Informatica Dynamic Data Masking and Persistent Data Masking, and Voltage SecureData Enterprise are all mentioned in the MQ.
There are other offerings, of course – for example Feiman and Lowans note that masking of big data is available for the Oracle Big Data Appliance with its installed Cloudera distribution, but add that it requires the use of Oracle consulting services, or the services of Oracle’s numerous service partners. Similarly, there are several emerging Hadoop-focused firms I’ve mentioned elsewhere and will cover in an upcoming piece of Gartner research I’m doing with Neil MacDonald. With RSA coming up this week (unfortunately, I can’t attend), I expect to see more heat – and perhaps light as well – on the issue ahead.
Category: Apache Big Data Cloudera Dataguise Gartner Hadoop IBM Magic Quadrant Oracle Security Tags: Apache, Axis Technology, big data, Cloudera, Compuware, Dataguise, Hadoop, IBM, Informatica, open source, Oracle, OSS, Security, Voltage
by Merv Adrian | February 19, 2014 | 3 Comments
In the most profound change of leadership in Microsoft’s history, Satya Nadella, who was head of the Cloud and Enterprise division, has taken the helm, succeeding Steve Ballmer. Nadella’s “insider” understanding of Microsoft’s culture and his effectiveness in cross-team communication and collaboration could help him reshape Microsoft for the digital era — which will be key for the company to attain the visionary technical leadership to which it aspires.
Nadella’s main challenge consists of evolving Microsoft’s existing businesses (including its enterprise offerings, which represent half of its current revenue) while reinventing Microsoft to make it relevant in mobile and cloud-centric markets.
Unlike his sales-focused predecessor, Steve Ballmer, Nadella has an engineering background; thus, his selection has reinstated the model of a technically minded CEO driving the company’s technical vision. But Nadella also must overcome several challenges.
He lacks direct experience in the mobile market. His insider status raises the risk of his being overly respectful of existing businesses, and hanging back from tough decisions that potentially threaten them but are critical to generating innovation. He will also need to shake up what is widely viewed as a culturally dysfunctional management structure.
Nadella must quickly demonstrate that he is not backing a “business as usual” strategy, and that he recognizes that design is front and center in client computing for both consumers and enterprise users and that a mobilized environment has replaced the desktop. The next six months will show how well Nadella and Gates collaborate to determine Microsoft’s technical direction.
We expect Microsoft’s trajectory will be clear by year-end 2014. Do not expect radical changes in the company’s overall “devices and services” strategy; instead watch for organizational shifts, product design changes and updated product road maps to address a mobile- and cloud-dominant world.
- Establish a vision of itself as an innovative, disruptive force in IT. Concentrating on mobile technology and leveraging lessons learned from gaming can help Microsoft appeal to the next generation.
- Emphasize design to enhance ease of use for consumers, and apply these lessons to its considerable assets in IT infrastructure to change its image of a legacy enterprise vendor competing in a consumerized market.
- Enable entrepreneurs and developers to develop new business value atop a common Windows client environment with unified, cross-platform services. Microsoft must enable a complete, compelling set of apps that attracts developers and can compete with and within iOS and Android environments.
- Acknowledge its customers’ heterogeneity by supporting Google and Apple client environments, the Linux/Java environment on servers, and cloud-based services in general.
- Deliver compelling experiences and solutions to both IT and to non-IT buyers.
For clients, my research note “Nadella Must Disrupt Microsoft Models to Establish Market Leadership,” co-authored with David Cearley, can be found on the Gartner website at http://www.gartner.com/doc/2664432.
Category: Gartner Microsoft Tags: Microsoft
by Merv Adrian | January 21, 2014 | 19 Comments
“Not looking” at security and privacy seems to be the posture of people implementing Hadoop, based on recent data Gartner has collected. This is troubling, and paradoxical. In an era when the privacy of data, from government surveillance to medical record-keeping to “creepy” marketing initiatives and password breaches, has been in the news regularly, it is hard to understand why professionals implementing Hadoop are not paying attention.
The data here comes from a recent webinar I conducted with my colleague Nick Heudecker. We had over 600 attendees, and during the discussion we offered several polling questions. One had to do with barriers to Hadoop adoption. We had 213 responses to that question.
You can see the results below and two things leap out: only 2% of the respondents see lack of robust security as a barrier, and half of the respondents feel that they do not have a sufficiently defined value proposition. More on the latter in another post.
For me, the nearly non-existent response to the security issue is shocking. Can it be that people believe Hadoop is secure? Because it certainly is not. At every layer of the stack, vulnerabilities exist, and at the level of the data itself there are numerous concerns. These include the use of external unveiled data and of data in file systems that lack any protection, and the separation of Hadoop initiatives in most organizations from IT governance. Add to that the kinds of use cases Hadoop is being pointed at: sensitive health care information; personal data in retail systems; telephone usage; social media connection and sentiment analytics – all of them give us pause.
I’ve pointed to security as a key issue facing the Hadoop community in 2014 for some time now. The fact that awareness of the problem is not getting attention only reinforces my belief that we will see major problems as Hadoop goes mainstream.
Category: Big Data Gartner Hadoop Security Tags: big data, data security, Hadoop, Security
by Merv Adrian | January 17, 2014 | 3 Comments
In the Hadoop community there is a great deal of talk of late about its positioning as an Enterprise Data Hub. My description of this is “aspirational marketing;” it addresses the ambition its advocates have for how Hadoop will be used, when it realizes the vision of capabilities currently in early development. There’s nothing wrong with this, but it does need to be kept in perspective. It’s a long way off.
Start here: A Gartner Research Circle survey revealed that in 2013, big data projects went into production in less than 8% of enterprises – and Hadoop was by no means the only technology included in their count. In 2014, many more enterprises will go into production with their first Hadoop project or two, perhaps even pushing deployed Hadoop shops into low double digit percentages.
In those same shops, there are thousands of significant database instances, and tens of thousands of applications – and those are conservative numbers. So the first few Hadoop applications will represent a toehold in their information infrastructure. It will be a significant beachhead, and it will grow as long as the community of vendors and open source committers deliver on the exciting promise of added functionality we see described in the budding Hadoop 2.0 era, adding to its early successes in some analytics and data integration workloads.
So “Enterprise Data Hub?” Not yet. At best in 2014, Hadoop will begin to build a role as part of an Enterprise Data Spoke in some shops. Aspirations are good, but perspective helps. Don’t confuse vision with strategy. The enterprise data warehouse has a long life ahead of it; the synergistic addition of Hadoop to its evolution into the logical data warehouse is just beginning, and Hadoop’s role in operational workloads and event processing has yet to launch.
Category: Apache Big Data data warehouse DBMS Gartner Hadoop RDBMS Tags: Apache, big data, data warehouse, Hadoop
by Merv Adrian | January 13, 2014 | 11 Comments
Talk to security folks, especially network ones, and AAA will likely come up. It stands for authentication, authorization and accounting (sometimes audit). There are even protocols such as Radius (Remote Authentication Dial In User Service, much evolved from its first uses) and Diameter, its significantly expanded (and punnily named) newer cousin, implemented in commercial and open source versions, included in hardware for networks and storage. AAA is and will remain a key foundation of security in the big data era, but as a longtime information management person, I believe it’s time to acknowledge that it’s not enough, and we need a new A – anonymization.
I realize I’m speaking out of turn here. I’m not a security guy myself, and I don’t pretend to be deep in the disciplines that decide whether you are who you claim to be (authentication) and govern whether you can get to the network. Nor do I know the detailed nuances, spread across many different resources, that grant me permission to do what I will be allowed to do with those resources when I get there (authorization). I don’t understand the various protections that assure breaches do not occur or have not occurred, which depend on the audit capabilities (the latter, as accounting, also provides the mechanism to report on all of the above).
What I do spend some time on is what happens within the resource that holds the data, when an authenticated, authorized person who is appropriately audited gets to it. For example, we need to distinguish what DBAs can see from what an analyst can – financial types call that “separation of concerns,” and it’s typically managed by a DBMS, which has mechanisms to interact with authorization capabilities to implement policy. It can be coarse- or quite fine-grained, and it’s one of the reasons we analysts always like to remind people that we talk about database management systems, not just databases.
But here’s the problem: in the big data era, much of the data we work with is not in DBMSs – and more and more of it will not be, as file-based systems like Hadoop gain broader and broader use. File systems don’t provide that granular control, so intervening layers will be required. They too can be coarse – we can encrypt/decrypt everything, for example. Or they too can be fine-grained, offering selective, policy-based decryption – in memory, after the bits come off the disk, before handing to the requester.
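To make the fine-grained case concrete, here is a minimal Python sketch of selective, policy-based decryption in an intervening layer. Everything in it is an assumption for illustration: the role names, the `POLICY` table, the field names, and the `toy_encrypt`/`toy_decrypt` helpers, which use a simple XOR keystream as a stand-in and are not a secure cipher. A real deployment would use a vetted encryption library and proper key management.

```python
import hashlib

def _keystream(key: bytes, length: int) -> bytes:
    """Derive a repeatable byte stream from the key (illustration only)."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def toy_encrypt(plaintext: str, key: bytes) -> bytes:
    """Stand-in cipher: XOR with the keystream. NOT secure."""
    data = plaintext.encode("utf-8")
    return bytes(a ^ b for a, b in zip(data, _keystream(key, len(data))))

def toy_decrypt(ciphertext: bytes, key: bytes) -> str:
    """Reverse of toy_encrypt (XOR is its own inverse)."""
    stream = _keystream(key, len(ciphertext))
    return bytes(a ^ b for a, b in zip(ciphertext, stream)).decode("utf-8")

KEY = b"demo-key"  # hypothetical; real keys belong in a key manager

# Hypothetical policy: which fields each role may see in the clear.
POLICY = {
    "analyst": {"diagnosis", "zip"},
    "auditor": {"name", "diagnosis", "zip"},
}

def store(record: dict) -> dict:
    """Encrypt every field before it is written to the file system."""
    return {field: toy_encrypt(value, KEY) for field, value in record.items()}

def read(stored: dict, role: str) -> dict:
    """Decrypt in memory only the fields the role's policy allows."""
    allowed = POLICY.get(role, set())
    return {
        field: toy_decrypt(blob, KEY) if field in allowed else None
        for field, blob in stored.items()
    }

stored = store({"name": "Pat Doe", "diagnosis": "flu", "zip": "02139"})
print(read(stored, "analyst"))  # name withheld, other fields decrypted
print(read(stored, "intern"))   # unknown role: nothing is decrypted
```

The coarse alternative would simply decrypt the whole record for anyone who gets past authorization; the sketch instead applies the policy field by field before handing anything to the requester.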
Personally, I hope people who model disease vectors, or even purchase behaviors, can build effective predictive models that describe what happens to people with certain characteristics. I just don’t want that process to result in my name being “on their list.” If they can intuit and classify what I am by my behavior and assign me to a category in some separate process, that is a different issue.
One approach that matters a great deal is obfuscation, which replaces a field like name or SSN with valid characters, but not the original data. Its value is that if properly implemented, it maintains mathematical cohesion and permits statistical analysis, aggregation, model building, etc. to proceed without individuation of the records they are performed over. This is a privacy concern. Redaction – the familiar “blacking out” of content – is also used, but in some policy scenarios, being able to peer into the redacted data might subsequently be of value, and redaction typically doesn’t permit this.
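The difference between the two is easy to see in a small Python sketch. It assumes one possible way of implementing obfuscation, a keyed hash (HMAC) that maps each SSN to a stable, validly formatted stand-in; the key, the sample records, and the function names are all hypothetical, and commercial masking products may work quite differently.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key-not-for-production"  # hypothetical key

def obfuscate_ssn(ssn: str, key: bytes = SECRET_KEY) -> str:
    """Replace an SSN with a stable stand-in in the same NNN-NN-NNNN format."""
    digest = hmac.new(key, ssn.encode("utf-8"), hashlib.sha256).hexdigest()
    digits = f"{int(digest, 16) % 10**9:09d}"
    return f"{digits[:3]}-{digits[3:5]}-{digits[5:]}"

def redact(value: str) -> str:
    """Black out every character; nothing about the original survives."""
    return "X" * len(value)

def totals_by(records: list, transform) -> dict:
    """Sum purchases per person, keyed by the transformed SSN."""
    totals = {}
    for rec in records:
        key = transform(rec["ssn"])
        totals[key] = totals.get(key, 0.0) + rec["purchase"]
    return totals

# Made-up sample data: two purchases by one person, one by another.
records = [
    {"ssn": "123-45-6789", "purchase": 40.0},
    {"ssn": "123-45-6789", "purchase": 60.0},
    {"ssn": "987-65-4321", "purchase": 25.0},
]

obfuscated_totals = totals_by(records, obfuscate_ssn)
redacted_totals = totals_by(records, redact)

print(obfuscated_totals)  # two distinct stand-in keys, per-person totals intact
print(redacted_totals)    # one blacked-out key, everyone lumped together
```

Because the same input always yields the same stand-in, per-person aggregation still works on the obfuscated data, while redaction collapses every record into a single indistinguishable value.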
Both approaches, however, can be classified as anonymization (or data deidentification, but I prefer to add another A to AAA for consistency!), and in an era where big data will increasingly be used to track human behaviors for medical, commercial and security reasons, I believe it’s time for anonymization to join the other 3 As. Perhaps it’s time to talk of authentication, authorization, anonymization and audit as the true foundation for data security.
Thanks to my esteemed Gartner colleague Neil MacDonald for commenting on and improving this post.
Category: Big Data data integration data warehouse DBMS Gartner Hadoop HDFS Industry trends Security Tags: big data, data integration, data security, data warehouse, Hadoop, HDFS, Security
by Merv Adrian | October 9, 2013 | 9 Comments
When is a technology offering a platform? Arguably, when people build products assuming it will be there. Or extend their existing products to support it, or add versions designed to run on it. Hadoop is there. The age of Bring Your Own Hadoop (BYOH) is clearly upon us. Specific support for components such as Pig and Hive varies, as do capabilities and levels of partnership in development, integration and co-marketing. Some vendors are in many categories – for example, Pentaho and IBM at opposite ends of the size spectrum interact with Hadoop in development tools, data integration, BI, and other ways. A few category examples, by no means exhaustive:
Analytic Platforms: Kognitio – an analytics-focused in-memory database with SQL 2011 support – offers significant Hadoop support. Jethrodata adds indexes for SQL and stores them in HDFS to accelerate BI tools, while Splunk’s offering called – gotta love it – Hunk has a similar approach. SAS recently added a partnership with Hortonworks, extending existing capabilities. For integrated marketing, RedPoint Global offers a true YARN-enabled engine with a rich array of capabilities that many tool vendors would envy.
Application Performance Management – longtime stalwart Compuware is bringing its portfolio to both Hadoop and leading NoSQL offerings.
BI Tools: specialists like Alpine Data Labs, Datameer, Karmasphere and Platfora position themselves as targeted for Hadoop environments. Traditional players like SAP Business Objects may represent themselves as connecting via Hive, and increasingly we will see some, like Alteryx, Qlikview and Tableau, partnering with emerging distribution-specific stack components like Cloudera’s Impala.
Database: some vendors, like IBM and Teradata offer their own distributions, and even appliances. Others like Actian, Calpont, Oracle and Microsoft partner with pureplay vendors. All provide connectors, management interfaces to their own management tools, etc. MarkLogic adds an “enterprise NoSQL” flavor; Rainstor adds an archiving solution for a highly compressed Hadoop environment.
Data Integration: Informatica and Talend both support HDFS and even have specific offerings for ETL, data quality, etc. Revelytix Loom offers data prep and metadata creation capabilities to shorten time to use cycles.
Development platforms: Continuuity is out to an early lead here, but is hardly alone and won’t be the last. For example, SQL Server development tool player Red Gate will enter the market soon.
Hadoop as a Service: Altiscale, Amazon, Qubole, Rackspace, Savvis, and Xplenty (who mask Hadoop development complexity) offer varying degrees of control and surrounding capabilities – and marketing, as the links demonstrate.
In-memory data grid (IMDG) engine: Gridgain offers GGFS, one of several HDFS substitutes, and like ScaleOut hServer offers an in-memory grid for execution of MapReduce code. Longtime IMDG player Terracotta has added a Hadoop connector.
Lifecycle Management: WANdisco is offering ALM and support for highly available distributed network deployments, and has recently partnered with Hortonworks.
Platform Performance Management: Appfluent, already providing visibility and performance optimization for Oracle, Teradata and IBM DB2 and PureData for Analytics (aka Netezza) platforms, has now added a Hadoop offering as well.
Search: numerous approaches and players here. One of the more interesting is LucidWorks, leveraging Lucene and Solr for search based use cases on a Hadoop infrastructure.
Security: Dataguise, Gazzang, Protegrity, and Zettaset offer varying components of a full-stack security hardening for a Hadoop deployment.
Stream Processing: DataTorrent, Tibco Streambase, Vitria and Zoomdata are among the early players here.
Workload Automation: BMC’s Control-M is already in place for a need that will become more significant as adoption rises and efficiency becomes more of an issue.
This is just a bare smattering of the evolving ecosystem, and I’d be delighted to have your additions, recommendations and comments. I will update this post to include them. Please jump in.
Category: Big Data BigInsights Cloudera DBMS Hadoop HDFS Hive Hortonworks IBM Lucene MapR MapReduce Microsoft Oracle Pig Rainstor RDBMS Security Solr SQL Server SQLstream Talend Teradata YARN Tags: Apache, big data, BigInsights, CDH, Cloudera, Compuware, Hadoop, Hbase, HDFS, Hive, Hortonworks, IBM, Jethrodata, Karmasphere, Kognitio, metadata, Oracle, Pentaho, Pig, Qubole, Revelytix, SAS, Splunk, Teradata
by Merv Adrian | October 3, 2013 | 1 Comment
My Gartner APAC colleague Darryl Carlton and I recently discussed the obscene ratio between CEO pay and average worker pay in the US. And this IS about the US – we are supporting an astonishing gap compared to the rest of the world, and high tech vendors like Oracle are not the only ones at the top of the list – Larry Ellison comes in only number 4 on this Bloomberg list, pulling down 1,287 times what an average Oracle worker (not impoverished at nearly $75K per year) collects.
Personally, I was surprised that the banking community was not represented “better” here – but perhaps that’s because there seem to be dozens of executives, not just the CEOs, in every firm currently hauling cash home in wheelbarrows as mortgages remain underwater and pension funds and healthcare plans get de-funded. At least they share with each other.
In my conversation with Darryl, we discussed the link to company performance. Pick your measure – stock price, revenue improvement, etc. – in general those gorging at the trough are not dramatically outperforming their peers – and they certainly are not doing 10 times, or a hundred times, better than their counterparts in other countries. But look at how the US plutocrats compare to their brethren in other countries – Japan at 11:1, Britain at 22:1 – and the US at 475:1.
It’s all in the table linked below.
CEO Pay Ratio
Darryl notes that
Drucker argues that CEO pay which exceeds 20-25 times the average paid in that company is bad for business. The new law in Australia is that if 25% of the stockholders vote down executive pay twice (two strike rule) at successive AGM’s then the Board can be replaced. The CEO of Lenovo in China has once again distributed his bonus to the lowest paid factory workers.
Not here, I’m afraid. And I want to say so much more, but only one word is needed, as far as I’m concerned:
Category: Industry trends Tags: Industry trends
by Merv Adrian | September 6, 2013 | 12 Comments
Many things have changed in the software industry in an era when the use of open source software has pervaded the mainstream IT shop. One of them is the significance – and descriptive adequacy – of the word “proprietary.” Merriam-Webster defines it as “something that is used, produced, or marketed under exclusive legal right of the inventor or maker.” In the Hadoop marketplace, it has come to be used – even by me, I must admit – to mean “not Apache, even though it’s open source.”
But Hadoop is not a thing, it’s a collection of things. Apache says Hadoop is only 4 things – HDFS, MapReduce, YARN, and Common. Distributions are variable collections that may contain non-Apache things, and may be called “proprietary” even when all the things in them are open source – even marketed under Apache licenses, if they are not entirely Apache Projects.
To the most aggressive proponents of open source software, this use of “proprietary” equates to “impure,” even “evil.” And some marketing stresses the importance of “true openness” as an unmitigated good as opposed to the less-good “proprietary,” even when the described offering is in fact open source software.
Some vendors respond to market needs with distributions containing open source-licensed components that come from a community process outside the Apache model – where a committee of people from other companies with their own agendas might delay, or seek to change, the implementation or even functionality of “their” software. Many Apache projects have started in other communities, like git, and were proposed later to Apache. Some call these distributions or components “proprietary,” hence less pure, less good, less worthy of consideration.
To the consumer – the enterprise that acquires its Hadoop from a distributor that is the only provider offering that particular capability – it’s a dilemma, and a mixed bag. If they need the functionality, and no other distributor supports an Apache project with similar functionality (if one even exists), while their distributor offers and supports it, it seems obvious to go there. Apache, after all, does not support its projects. Vendors (distributors) do.
On the other hand, if and when Apache gets there and someone supports that Project, it will potentially be more difficult to move from one distribution to another. Organizations are increasingly wary of switching costs, even as they embrace the rapid innovation – and resultant disruption – of the open source revolution.
“Purity” is not the question; timing and availability are. Consumers need to buy what will work, and want to buy what will be supported. And the use of “proprietary” here is both inaccurate and not to the point.
So I’m proposing to use a different expression in my future discussions: “distribution-specific.” It can apply today to an Apache project if only a specific distributor includes it. And it will apply to vendor enhancements, even if API-compatible and open source, where the same is true.
I’d love to hear your thoughts – and your nominations for the list. Some are obvious: Cloudera Impala, MapR’s file system, IBM’s JAQL. Or Apache Cassandra, listed on the Apache Hadoop page linked above as a “Hadoop-related project” along with HBase, Pig and others. Only one company commercializes a software distribution with it – Datastax. And they don’t even call themselves a Hadoop distribution. What do you think of all this? Please leave your comments.
Category: Apache Apache Yarn Big Data BigInsights Cassandra Cloudera Hadoop Hbase IBM MapR MapReduce open source OSS Pig YARN Tags: Apache, big data, BigInsights, Cassandra, Cloudera, Datastax, Hadapt, Hadoop, Hbase, HDFS, IBM, open source, OSS, Pig, Yarn
by Merv Adrian | July 15, 2013 | 10 Comments
Probably the most widespread, and commercially imminent, theme at the Summit was “SQL on Hadoop.” Since last year, many offerings have been touted, debated, and some have even shipped. In this post, I offer a brief look at where things stood at the Summit and how we got there. To net it out: offerings today range from the not-even-submitted to GA – if you’re interested, a bit of familiarity will help. Even more useful: patience.
The earliest uses of Hadoop were – and most still are – ETL-style batch workloads of Java MapReduce code for extracting information from hitherto unused data sets, sometimes new, and sometimes simply unutilized – what Gartner has called “dark data.” But Hadapt had already begun talking in 2011 about putting SQL tables inside HDFS files – what my friend Tony Baer has called “shoehorn[ing] tables into a file system with arbitrary data block sizes.” We highlighted them in early 2012. But this thread, though it continues, was about pre-creating relational tables – suitable for the known problems, like the existing data warehouses and data marts, and thus not complete over the new, exploratory and unpredictable jobs Hadoop advocates envisioned.
Pre-defined or not, SQL access would need metadata, and the indispensable DDL it would provide at runtime if not in advance. The HCatalog project was incubating – first outside, then inside SQL interface project Apache Hive (itself first developed at Facebook in 2009), getting significant support from Teradata and Microsoft partnering with Hortonworks. RDBMS vendors were doing what they always do with new wrinkles at the innovative market edges – co-opting them with “embrace and extend” strategies. Thus, HP Vertica, IBM DB2 and Netezza, Kognitio, EMC Greenplum, Microsoft SQL Server, Oracle, Paraccel, Rainstor and Teradata all offered SQL-oriented ways to call out to the Hadoop stack, using external table functions, import pipes, etc. But actual SQL, for use directly against Hive (or HCatalog) metadata from inside their engine was so far a “not exactly, but…” kind of proposition.
At Strata/Hadoop World in October 2012, Cloudera announced its Impala, an already visible SQL engine bypassing MapReduce and executing directly against HDFS and HBase; it is now available in the extra cost add-on RTQ (real time query) option to CDH Enterprise. Although Apache Drill (based on Google’s Dremel) had been announced a few months before, it was and is still listed as Incubating by Apache. That is a pre-Project, arguably pre-alpha state, though MapR’s plan to support it commercially is imminent. Microsoft publicly described its plans for Polybase, which is included in SQL Server 2012 Parallel Data Warehouse V2. Platfora, a BYOH (bring your own Hadoop) BI player, announced its interactive tool with a memory-cached embedded store, opening another front in the battles for customers to muddle over – a database or a tool? Which is a better fit for me?
In February, Hortonworks had thrown the “open, Apache, Hadoop” hat into the ring by announcing the Stinger initiative, promising a 100x speedup – and a few months later the labors of 55 developers from several companies progressed Hive to release – wait for it – 0.11. This is a platform for further development; in its upcoming stages, Stinger will make use of yet another Incubator project, in this case Tez.
Shortly after that, EMC Greenplum (by then using Pivotal as a product, not corporate, name) announced its HAWQ (“Hadoop With Query”). It added some wrinkles by grafting the commercially developed and market-hardened Greenplum MPP Postgres-based DBMS directly to HDFS (or EMC’s Isilon OneFS) – and promising in interviews to be half the price of leading alternatives.
In April, IBM announced its BigSQL, and Teradata unveiled SQL-H, both making real “SQL against Hadoop” part of their portfolios (Teradata already had Aster’s SQL-MR, first introduced in 2008). Oracle announced an updated Big Data Appliance and some software component upgrades. Oracle again left its MySQL unit out of the conversation, relegated to making its own announcement of a MySQL Applier for Hadoop, which replicates events to HDFS via Hive, extending its existing Sqoop capability. MySQL continues to be remarkably invisible in Oracle big data messaging, despite the widespread presence of the LAMP stack in big data circles.
There you have it – the state of SQL on Hadoop at the Summit. This discussion doesn’t even include things that have not even been entered into the races yet, such as Facebook’s yet-to-be-submitted-to-Apache Presto interface. A closing thought or two:
- None of this is real-time, no matter what product branding is applied. On the continuum between batch and true real-time, these offerings fall in between – they are interactive. And the future of Hadoop in the enterprise includes both interactive and real-time. As Hadoop will be increasingly used for operational purposes, critical real-time applications will require continuous availability for successful deployment in large global organizations. There are stirrings on the horizon, but nonstop operation is still aspirational, not easily available.
- There continues to be much hype about the advantages of open source community innovation. In this case, it’s often innovating what has been in SQL since standards like SQL92 were agreed upon and deployed by those slowpoke commercial vendors.
- The reality is, the open source community struggles to be complete and timely with its offerings. YARN’s lateness was discussed in my first post on the Summit. Two weeks before that, at HBaseCon I watched a speaker asking his audience to volunteer for features identified as needed for the roadmap. Sometimes, a commercial product manager is a blessing indeed.
P.S. Yes, I know the syntax of the title is not correct. Call it literary license.
Category: Apache Apache Drill Apache Yarn Aster Big Data Cloudera data warehouse DBMS Gartner Hadapt Hadoop HCatalog HDFS Hive Hortonworks IBM MapR MapReduce Microsoft Netezza Oozie Oracle Rainstor RDBMS Real-time SQL Server Sqoop Teradata YARN Tags: Apache, Aster, big data, BigSQL, CDH, Cloudera, data warehouse, DB2, Drill, EMC, ETL, Greenplum, Hadapt, Hadoop, HAWQ, Hbase, HCatalog, HDFS, Hive, Hortonworks, HP, Impala, Isilon, Kognitio, MapR, MapReduce, MPP, MySQL, OneFS, Oracle, Paraccel, Platfora, Polybase, Postgres, Rainstor, SQL, Sqoop, Stinger, Teradata, Tez, Vertica