Probably the most widespread, and commercially imminent, theme at the Summit was “SQL on Hadoop.” Since last year, many offerings have been touted, debated, and some have even shipped. In this post, I offer a brief look at where things stood at the Summit and how we got there. To net it out: offerings today range from the not-even-submitted to GA – if you’re interested, a bit of familiarity will help. Even more useful: patience.
The earliest uses of Hadoop were – and most still are – ETL-style batch workloads of Java MapReduce code for extracting information from hitherto unused data sets, sometimes new, and sometimes simply unutilized – what Gartner has called “dark data.” But Hadapt had already begun talking in 2011 about putting SQL tables inside HDFS files – what my friend Tony Baer has called “shoehorn[ing] tables into a file system with arbitrary data block sizes.” We highlighted them in early 2012. But this thread, though it continues, was about pre-creating relational tables – suitable for known problems, like those handled by existing data warehouses and data marts, and thus not a complete answer for the new, exploratory and unpredictable jobs Hadoop advocates envisioned.
Pre-defined or not, SQL access would need metadata, and the indispensable DDL it would provide at runtime if not in advance. The HCatalog project was incubating – first outside, then inside SQL interface project Apache Hive (itself first developed at Facebook in 2009), getting significant support from Teradata and Microsoft partnering with Hortonworks. RDBMS vendors were doing what they always do with new wrinkles at the innovative market edges – co-opting them with “embrace and extend” strategies. Thus, HP Vertica, IBM DB2 and Netezza, Kognitio, EMC Greenplum, Microsoft SQL Server, Oracle, Paraccel, Rainstor and Teradata all offered SQL-oriented ways to call out to the Hadoop stack, using external table functions, import pipes, etc. But actual SQL, for use directly against Hive (or HCatalog) metadata from inside their engine was so far a “not exactly, but…” kind of proposition.
At Strata/Hadoop World in October 2012, Cloudera announced its Impala, an already visible SQL engine bypassing MapReduce and executing directly against HDFS and HBase; it is now available in the extra-cost add-on RTQ (real-time query) option to CDH Enterprise. Although Apache Drill (based on Google’s Dremel) had been announced a few months before, it was and is still listed as Incubating by Apache. That is a pre-Project, arguably pre-alpha state, though MapR’s plan to support it commercially is imminent. Microsoft publicly described its plans for Polybase, which is included in SQL Server 2012 Parallel Data Warehouse V2. Platfora, a BYOH (bring your own Hadoop) BI player, announced its interactive tool with a memory-cached embedded store, opening another front in the battles for customers to muddle over – a database or a tool? Which is a better fit for me?
In February, Hortonworks had thrown the “open, Apache, Hadoop” hat into the ring by announcing the Stinger initiative, promising a 100x speedup – and a few months later the labors of 55 developers from several companies progressed Hive to release – wait for it – 0.11. This is a platform for further development; in its upcoming stages, Stinger will make use of yet another Incubator project, in this case Tez.
Shortly after that, EMC Greenplum (by then using Pivotal as a product, not corporate, name) announced its HAWQ (“Hadoop With Query”). It added some wrinkles by grafting the commercially developed and market-hardened Greenplum MPP Postgres-based DBMS directly to HDFS (or EMC’s Isilon OneFS) – and promising in interviews to be half the price of leading alternatives.
In April, IBM announced its BigSQL, and Teradata unveiled SQL-H, both making real “SQL against Hadoop” part of their portfolios. (Teradata already had Aster’s SQL-MR, first introduced in 2008.) Oracle announced an updated Big Data Appliance and some software component upgrades. Oracle again left its MySQL unit out of the conversation, relegated to making its own announcement of a MySQL Applier for Hadoop, which replicates events to HDFS via Hive, extending its existing Sqoop capability. MySQL continues to be remarkably invisible in Oracle big data messaging, despite the widespread presence of the LAMP stack in big data circles.
There you have it – the state of SQL on Hadoop at the Summit. This discussion doesn’t include things that have not even been entered into the race yet, such as Facebook’s yet-to-be-submitted-to-Apache Presto interface. A closing thought or two:
- None of this is real-time, no matter what product branding is applied. On the continuum between batch and true real-time, these offerings fall in between – they are interactive. And the future of Hadoop in the enterprise includes both interactive and real-time. As Hadoop is increasingly used for operational purposes, critical real-time applications will require continuous availability for successful deployment in large global organizations. There are stirrings on the horizon, but nonstop operation is still aspirational, not easily available.
- There continues to be much hype about the advantages of open source community innovation. In this case, it’s often innovating what has been in SQL since standards like SQL-92 were agreed upon and deployed by those slowpoke commercial vendors.
- The reality is, the open source community struggles to be complete and timely with its offerings. YARN’s lateness was discussed in my first post on the Summit. Two weeks before that, at HBaseCon, I watched a speaker asking his audience to volunteer for features identified as needed for the roadmap. Sometimes, a commercial product manager is a blessing indeed.
P.S. Yes, I know the syntax of the title is not correct. Call it literary license.