Gartner Blog Network


Hadoop Summit Recap Part Two – SELECT FROM hdfs WHERE bigdatavendor USING SQL

by Merv Adrian  |  July 15, 2013  |  10 Comments

Probably the most widespread, and commercially imminent, theme at the Summit was “SQL on Hadoop.” Since last year, many offerings have been touted and debated, and some have even shipped. In this post, I offer a brief look at where things stood at the Summit and how we got there. To net it out: offerings today range from the not-even-submitted to GA – if you’re interested, a bit of familiarity will help. Even more useful: patience.

The earliest uses of Hadoop were – and most still are – ETL-style batch workloads of Java MapReduce code for extracting information from hitherto unused data sets, sometimes new, and sometimes simply unutilized – what Gartner has called “dark data.” But Hadapt had already begun talking in 2011 about putting SQL tables inside HDFS files – what my friend Tony Baer has called “shoehorn[ing] tables into a file system with arbitrary data block sizes.” We highlighted them in early 2012. But this thread, though it continues, was about pre-creating relational tables – suitable for known problems, like the existing data warehouses and data marts, and thus not a complete answer for the new, exploratory and unpredictable jobs Hadoop advocates envisioned.

Pre-defined or not, SQL access would need metadata, and the indispensable DDL that metadata would provide at runtime if not in advance. The HCatalog project was incubating – first outside, then inside, the SQL interface project Apache Hive (itself first developed at Facebook in 2009) – and getting significant support from Teradata and from Microsoft partnering with Hortonworks. RDBMS vendors were doing what they always do with new wrinkles at the innovative market edges – co-opting them with “embrace and extend” strategies. Thus, HP Vertica, IBM DB2 and Netezza, Kognitio, EMC Greenplum, Microsoft SQL Server, Oracle, ParAccel, RainStor and Teradata all offered SQL-oriented ways to call out to the Hadoop stack, using external table functions, import pipes, etc. But actual SQL, for use directly against Hive (or HCatalog) metadata from inside their engines, was so far a “not exactly, but…” kind of proposition.
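To make the metadata point concrete, here is a minimal sketch of the kind of Hive DDL that HCatalog makes shareable across engines: an external table simply overlays a schema on files already sitting in HDFS, so SQL can be run against them in place. The table, columns and path below are hypothetical examples, not any vendor’s actual schema.

    -- Minimal Hive DDL sketch: overlay a schema on raw files already in HDFS.
    -- Table name, columns and path are hypothetical examples.
    CREATE EXTERNAL TABLE web_clicks (
      user_id    BIGINT,
      url        STRING,
      click_time STRING
    )
    ROW FORMAT DELIMITED
      FIELDS TERMINATED BY '\t'
    STORED AS TEXTFILE
    LOCATION '/data/raw/web_clicks';

    -- Once that definition lives in the metastore (and thus in HCatalog),
    -- any engine reading the same metadata can issue ordinary SQL:
    SELECT url, COUNT(*) AS clicks
    FROM web_clicks
    GROUP BY url;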

At Strata/Hadoop World in October 2012, Cloudera announced its Impala, an already visible SQL engine bypassing MapReduce and executing directly against HDFS and HBase; it is now available in the extra cost add-on RTQ (real time query) option to CDH Enterprise. Although Apache Drill (based on Google’s Dremel) had been announced a few months before, it was, and still is, listed as Incubating by Apache. That is a pre-Project, arguably pre-alpha, state, though MapR’s plan to support it commercially is imminent. Microsoft publicly described its plans for Polybase, which is included in SQL Server 2012 Parallel Data Warehouse V2. Platfora, a BYOH (bring your own Hadoop) BI player, announced its interactive tool with a memory-cached embedded store, opening another front in the battles for customers to muddle over – a database or a tool? Which is a better fit for me?
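For a sense of what “interactive SQL, no MapReduce” looks like in practice: given a table already defined in the shared metastore (such as the hypothetical web_clicks sketched above), Impala’s long-running daemons read the HDFS blocks directly and return results interactively, launching no batch jobs. The query below is illustrative only.

    -- Illustrative interactive query; Impala executes it through its own
    -- long-running daemons against HDFS data, with no MapReduce jobs launched.
    SELECT url, COUNT(*) AS clicks
    FROM web_clicks
    WHERE click_time >= '2013-06-01'
    GROUP BY url
    ORDER BY clicks DESC
    LIMIT 10;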

In February, Hortonworks threw the “open, Apache, Hadoop” hat into the ring by announcing the Stinger initiative, promising a 100x speedup – and a few months later the labors of 55 developers from several companies progressed Hive to release – wait for it – 0.11. This is a platform for further development; in its upcoming stages, Stinger will make use of yet another Incubator project, in this case Tez.
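One tangible piece of Stinger’s first phase did land in Hive 0.11: the ORC columnar file format. A minimal sketch, reusing the hypothetical table from above, of converting data to ORC for its compression and scan-speed benefits:

    -- Sketch: rewrite the hypothetical web_clicks table into ORC, the
    -- columnar format that shipped with Hive 0.11 in Stinger phase one.
    CREATE TABLE web_clicks_orc
    STORED AS ORC
    AS SELECT * FROM web_clicks;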

Shortly after that, EMC Greenplum (by then using Pivotal as a product, not corporate, name) announced its HAWQ (“Hadoop With Query”). It added some wrinkles by grafting the commercially developed, market-hardened, Postgres-based Greenplum MPP DBMS directly onto HDFS (or EMC’s Isilon OneFS) – and promised in interviews to be half the price of leading alternatives.

In April, IBM announced its BigSQL, and Teradata unveiled SQL-H, both making real “SQL against Hadoop” part of their portfolios. (Teradata already had Aster’s SQL-MR, first introduced in 2008.) Oracle announced an updated Big Data Appliance and some software component upgrades. Oracle again left its MySQL unit out of the conversation; the unit was relegated to making its own announcement of a MySQL Applier for Hadoop, which replicates events to HDFS via Hive, extending the existing Sqoop capability. MySQL continues to be remarkably invisible in Oracle’s big data messaging, despite the widespread presence of the LAMP stack in big data circles.

There you have it – the state of SQL on Hadoop at the Summit. And this discussion doesn’t include contenders that haven’t even entered the race yet, such as Facebook’s yet-to-be-submitted-to-Apache Presto interface. A closing thought or two:

  • None of this is real-time, no matter what product branding is applied. On the continuum between batch and true real-time, these offerings fall in between – they are interactive. And the future of Hadoop in the enterprise includes both interactive and real-time. As Hadoop is increasingly used for operational purposes, critical real-time applications will require continuous availability for successful deployment in large global organizations. There are stirrings on the horizon, but nonstop operation is still aspirational, not easily available.
  • There continues to be much hype about the advantages of open source community innovation. In this case, the “innovation” is often reinventing what has been in SQL since standards like SQL92 were agreed upon and deployed by those slowpoke commercial vendors.
  • The reality is, the open source community struggles to be complete and timely with its offerings. YARN’s lateness was discussed in my first post on the Summit. Two weeks before that, at HBaseCon, I watched a speaker ask his audience to volunteer for features identified as needed for the roadmap. Sometimes, a commercial product manager is a blessing indeed.

P.S. Yes, I know the syntax of the title is not correct. Call it literary license.
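For the pedants, one hypothetical way to render the title’s idea in legal SQL – with a table and columns invented purely for illustration – might be:

    -- A hypothetical, syntactically legal rendering of the title's idea;
    -- the table and columns are invented for illustration.
    SELECT vendor
    FROM sql_on_hadoop_offerings
    WHERE status = 'GA';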

Category: apache  apache-drill  apache-yarn  aster  big-data  cloudera  data-warehouse  dbms  gartner  hadapt  hadoop  hcatalog  hdfs  hive  hortonworks  ibm  mapr  mapreduce  microsoft  netezza  oozie  oracle  rainstor  rdbms  real-time  sql-server  sqoop  teradata  yarn  

Tags: apache  aster  big-data-2  bigsql  cdh  cloudera  data-warehouse  db2  drill  emc  etl  greenplum  hadapt  hadoop  hawq  hbase  hcatalog  hdfs  hive  hortonworks  hp  impala  isilon  kognitio  mapr  mapreduce  mpp  mysql  onefs  oracle  paraccel  platfora  polybase  postgres  rainstor  sql  sqoop  stinger  teradata  tez  vertica  


Thoughts on Hadoop Summit Recap Part Two – SELECT FROM hdfs WHERE bigdatavendor USING SQL


  1. “At Strata/Hadoop World in October 2012, Cloudera announced its Impala, an already visible SQL engine bypassing MapReduce and executing directly against HDFS and HBase; it is now available in the extra cost add-on RTQ (real time query) option to CDH Enterprise.”

    To briefly clarify, Cloudera Impala is an Apache-licensed open source project, available today for free download and use from cloudera.com/impala.

    The RTQ (Real-time Query) subscription is an add-on to Cloudera Enterprise and includes technical support, indemnity and open source advocacy for Cloudera Impala. More information here: http://www.cloudera.com/content/cloudera/en/products/cloudera-enterprise/RTQ-subscription.html

    • Merv Adrian says:

      Thanks, Matt. I can see that the language might imply to some that there is no “free” version. I’ll tweak the text a bit.

  2. Guy says:

    Also related:

    Phoenix (salesforce)
    JethroData
    Citus Data
    Shark/Spark
    Tajo

  3. Merv,

    Good post. I’d like to comment on two of your points:

    1. Re: HBase: “asking his audience to volunteer for features identified as needed for the roadmap”

    Successful community-driven projects are constantly recruiting people to get involved and help implement new features. Quality code doesn’t just fall from the skies…it takes talented and passionate engineers involved in the projects to make it happen. At Hortonworks, we have product managers focused on identifying key enterprise requirements for technologies like HBase. We also have engineers who work with their community counterparts on implementing those requirements as well as other itches the community feels need scratching.

    2. Re: YARN: “the open source community struggles to be complete and timely with its offerings”

    The beauty of community-driven open source is that the entire sausage making process is open and transparent, so it’s easy to see when new components or key features are proposed and how the code progresses over time.

    Since this same level of visibility does not exist for proprietary software, there’s no accurate way to measure “timely” in that domain, now is there?

    As an analyst, you’re privileged [ or cursed ;-) ] to have more visibility into open source projects than you have with proprietary solutions. Neither changes the fact that complex software systems take time, effort, testing and hardening; open source or proprietary.

    Bottom-line: The desire to gather ALL data in one place and interact with that data in MULTIPLE ways (batch, interactive, online, streaming, etc.) with predictable performance has really emerged over the past year or so. The nice thing is that YARN has been under development for a few years in anticipation of this need. YARN’s availability is timely in that the commercial ecosystem and other open source projects now have the ability to plug natively into Hadoop in a manner that addresses this desire.

    • Merv Adrian says:

Shaun, as always, thanks for rising to the issue – I hoped we would get a conversation going about this.
1. Admittedly, I was being a bit provocative, but I do believe the differences are real. The Product Manager role in a commercial software company drives directed engineering on schedule, informed by market requirements channeled, in the best firms, through the Product Marketing function. The connection can be less direct, and less deadline-driven, in community process models. That’s why commercializers like Hortonworks play a valuable role – you have competition to drive you. It’s a measure of how much effort and structure this takes that your company needed a year to go GA with HDP after you were founded. Building that stuff takes time, and your subsequent roadmap has been well executed.
      2. Again, my comment about YARN was edgy, but I’ve been equally critical of commercial SW providers with multi-year “early release, ramp up, community preview” and similar programs. See Look Before you Leap Into Extended Software Beta Programs (G00214570). I spend a lot of time reminding enterprise clients that they need to be cautious with unsupported, pre-GA software for production use. This was another reminder of that.

  4. Merv, you’ll agree that one of the key differences between proprietary development and community/open source development is that the former stays under wraps until the vendor decides to start talking about it, while the open source development process is open and visible from day one.

While the latter adds welcome transparency, it also sometimes creates frustration about the time it takes to build things properly. Which is probably no longer than proprietary – but it’s the foreplay that is much longer…

    Proprietary vendors have the luxury of controlling the news cycle, which, when properly done, can be very powerful marketing. Open source vendors, even commercial open source, can’t do this.

    • Merv Adrian says:

      Hard to dispute what you’re saying, of course, Yves. That said, it’s not the news cycle that drives what we do at Gartner, though we do want to know what is coming and when. So we track incoming news from both communities and attempt to give our clients the best guidance based on what we know and can talk about.

  5. […] Blogs & Guest Articles Authored by Analysts Hadoop Summit Recap Part Two – Gartner Blog Network by Merv Adrian | July 15, 2013 | 7 Comments. Probably the most widespread, and commercially […]



