Has HDFS joined MapReduce in the emerging “legacy Hadoop project” category, continuing the swap-out of components that formerly answered the question “what is Hadoop?” Stores for data were certainly a focus at Strata/Hadoop World in NY, O’Reilly and Cloudera’s well-run, well-attended, and always impactful fall event. The limitations of HDFS, including its append-only nature, have become inconvenient enough to push the community to “invent” something DBMS vendors like Oracle did decades ago: a bypass. After some pre-event leaks about its arrival, Cloudera chose its Strata keynote to announce Kudu, a new column store written in C++ that bypasses HDFS entirely. Kudu will use an Apache license and will be submitted to the Apache process at some undetermined future time.
Replacing file systems to optimize storage for dominant use cases is not new, and there is ample reason to consider it. Recall what the Hadoop pioneers needed: a place for files, with optimizations for certain block sizes and data types that are no longer majority use cases as the ambition of the stack expands. Store your videos? Great. Log files and clickstreams that are read but never updated? You bet. Columnar stores that can be manipulated fast and efficiently, and updated, maybe even for HadOLAP processing? Not so much.
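The tradeoff behind that "not so much" can be made concrete. Below is a minimal illustrative sketch in plain Python — not Kudu's implementation or any HDFS API — contrasting the append-only pattern HDFS serves well with the in-place columnar updates it cannot:

```python
# Illustrative sketch only -- not Kudu's design or an HDFS API.
# It contrasts the append-only pattern (logs, clickstreams) with a
# mutable columnar layout (fast scans, in-place updates).

# Append-only log: new events are simply appended; existing records
# are never modified -- the pattern HDFS was built for.
log = []
log.append({"user": "a", "clicks": 3})
log.append({"user": "b", "clicks": 5})

# "Updating" in an append-only world means appending a new version
# and reconciling versions at read time.
log.append({"user": "a", "clicks": 4})
latest = {}
for rec in log:
    latest[rec["user"]] = rec["clicks"]

# Columnar layout: one array per column. Scanning a single column is
# efficient, and a mutable store can change a cell in place -- the
# kind of operation an append-only file system cannot offer.
users = ["a", "b"]
clicks = [3, 5]
clicks[users.index("a")] = 4  # in-place update

assert latest == {"a": 4, "b": 5}
assert clicks == [4, 5]
```

The read-time reconciliation loop is the hidden cost of forcing update-heavy workloads onto an append-only store, and it is the cost a mutable columnar engine is meant to remove.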
What about HBase, which is deployed on HDFS? Cloudera’s Amr Awadallah told Gartner that “for low latency lookups, HBase is still a good answer… maybe leveraging Apache Phoenix, if it needs to do a scan with a filter – great.” All the distributors support HBase – it’s also one of the earliest of the core stack pieces. For what Gartner calls HTAP (hybrid transactional analytic processing), where end-to-end business events include read, write, and compute in various combinations, developers know that structured data, schema on write, and other traditional values make a big difference. And Cloudera thinks the Hadoop stack can capture some of that work with the right foundation. Hence Kudu.
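The two access patterns in Awadallah’s quote are worth separating. A minimal sketch in plain Python — hypothetical data, not the HBase or Phoenix APIs — shows the difference between a point lookup by row key and a scan with a filter:

```python
# Illustrative sketch only -- plain Python, not the HBase or Phoenix
# APIs. Rows are keyed by row key; columns use a hypothetical
# "columnfamily:qualifier" naming convention as in HBase.
table = {
    "row1": {"cf:status": "open",   "cf:owner": "ann"},
    "row2": {"cf:status": "closed", "cf:owner": "bob"},
    "row3": {"cf:status": "open",   "cf:owner": "cai"},
}

# Point lookup: a single hop to one row by key -- the low-latency
# case where HBase remains a good answer.
row = table["row2"]

# Scan with a filter: touch every row, keep those matching a
# predicate -- the case where a SQL layer like Phoenix helps
# (conceptually, WHERE status = 'open').
open_rows = [key for key, cols in table.items()
             if cols["cf:status"] == "open"]

assert row["cf:owner"] == "bob"
assert open_rows == ["row1", "row3"]
```

The lookup cost is independent of table size; the filtered scan is not — which is why a store optimized for fast columnar scans becomes attractive as analytic predicates dominate.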
It’s not Cloudera’s only play: so far, Apache Parquet, which Cloudera has driven, has done pretty well despite the limitations; Impala over Parquet files has delivered satisfactory results. But Cloudera wants more. If Kudu succeeds in its goals, Parquet may eventually be just as marginalized as MapReduce is beginning to be – suitable for a subset of use cases, but superseded elsewhere by newer stores. Hortonworks is not supporting Kudu, putting its weight instead behind Apache ORC, which became a top level project in April 2015. Every vendor is attempting to build its own differentiated component collection – just like closed-source vendors. The Open Data Platform Initiative (ODPi), which recently has gained more vendors along with another initial in its name, will have the interesting challenge of determining whether a standard reference architecture will play when participants all want their own pieces at each layer.
Cloudera was not alone at Strata in announcing store-level efforts. MapR enhanced MapR-DB to go after another use case: document DBMS-style storage within the Hadoop stack. MapR was early to rewrite at the storage layer, taking the non-Java road from its beginnings for better performance. Now, promising granular sub-document access, with multi-master, global deployment at scale on a Hadoop infrastructure with strong consistency and Python and Node.js language bindings available at launch, MapR seeks to tap JSON patterns a huge developer community already knows, reaching a new audience. It will offer OJAI (Open JSON Application Interface), like other pieces of its stack, with an Apache license, but not via the Apache Software Foundation – a community strategy similar to Cloudera’s. And it will be competing with Apache Cassandra for that business – opening a front outside the Hadoop theater of operations.
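“Granular sub-document access” is the key phrase in MapR’s pitch: reading or mutating one nested field of a JSON document without rewriting the whole document. A minimal sketch with plain Python dicts — hypothetical helpers, not the OJAI API — illustrates the idea:

```python
# Illustrative sketch only -- plain dicts, not the OJAI API. The
# dotted-path helpers (get_path, set_path) are hypothetical, written
# here just to show what field-level access to a JSON document means.
import copy

doc = {
    "_id": "user42",
    "name": "Ada",
    "address": {"city": "Paris", "zip": "75001"},
    "orders": [{"sku": "x1", "qty": 2}],
}

def get_path(document, path):
    """Resolve a dotted path like 'address.city' within a document."""
    node = document
    for part in path.split("."):
        node = node[part]
    return node

def set_path(document, path, value):
    """Mutate only the addressed sub-field, leaving siblings untouched."""
    parts = path.split(".")
    node = document
    for part in parts[:-1]:
        node = node[part]
    node[parts[-1]] = value

before = copy.deepcopy(doc["orders"])
set_path(doc, "address.city", "Lyon")

assert get_path(doc, "address.city") == "Lyon"
assert doc["orders"] == before  # sibling fields untouched
```

In a document store this granularity matters for both network traffic and write amplification: only the changed field travels and is rewritten, not the whole document.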
WANdisco featured the storage layer too, announcing that its Fusion will now be available for Hadoop clusters running on EMC’s Elastic Cloud Storage (ECS) and Isilon storage systems, as well as other HDFS-compatible storage in both public and private cloud environments. The aim is to provide a consistent view of an organization’s data across all clusters and locations, without downtime or data loss. WANdisco claims customers can scale up their Hadoop deployments to double their current size by converting formerly read-only standby backup and recovery servers and clusters to full production use.
Stores in the cloud got their share of attention elsewhere as well. Microsoft announced its Azure Data Lake, using HDFS in the cloud, featuring autosharding, an understanding of locality, and a promise of future support for other types of data stores. Amazon was on the Strata floor too, perhaps feeling the heat as other players start to target its so-far relatively unchallenged cloud leadership. Amazon has partnered with MapR at the storage level for quite some time, but that was not a focus at its booth.
What is clear is that innovation in storage is not finished yet. More ideas – some new, some Re:Invented (Amazon chose a good name for its annual event) – will continue to emerge and disrupt information fabrics everywhere. Different stores for different chores.
There were other themes too at Strata this year: security and simplicity for developers and deployers got plenty of attention – and, of course, streaming. Security and streaming share the theme of competing, proliferating standards, with Hortonworks creating a new business line, DataTorrent pushing its leadership into Apache Apex, Kafka and Spark surging into leadership on the hype front, and the entire Hadoop ecosystem seeming ready to pivot from data at rest to data in motion. More to come on those topics in upcoming blog posts.