The first three posts in this series talked about performance, projects and platforms as key themes in what is beginning to feel like a watershed year for Hadoop. All three are reflected in the surprising emergence of a number of new players on the scene, as well as some new offerings from additional ones, which I’ll cover in another post. Intel, WANdisco, and Data Direct Networks recently entered the distribution game, making it clear that capitalizing on potential differentiators (real or perceived) in a hot market is still a powerful magnet. And in a space where much of the IP in the stack is open source, why not go for it? These introductions could all fall into the performance theme as well – they are all driven by innovations intended to improve Hadoop speed.
Intel is by far the biggest of the new entrants. I discuss them along with my colleague Arun Chandrasekaran in a recent Gartner First Take: Hadoop Distribution Seeks to Leverage Intel’s Microprocessor Strengths. Net: processor-level exploitation and expertise in memory and IO architectures can drive great improvements. There’s more: Intel made several key partnership deals:
- An agreement with SAP around the HANA in-memory DBMS to collaborate on both a technology roadmap and go to market plans, including both streaming and bulk data movement between the two environments. The two firms plan to have a single-install deliverable later this year that will provide direct bidirectional queries and integrate management as well, building atop the already demonstrable support for SAP Data Services to pull data from Intel’s Hadoop distribution, and to deliver both on the same hardware platform. (Note that appliances that combine DBMS and Hadoop on a single rack are already available from EMC, HP and IBM – but that is another post.)
- A deal with MarkLogic Enterprise NoSQL to incorporate and support Intel’s distribution, seeing Intel’s chip-based encryption as a good complement to MarkLogic’s role-based NIAP and CCEVS-compliant security system. And of course to expand the reach of the MarkLogic engine to HDFS.
- An OEM agreement with Pentaho – the latter will be in the box. Its data mining, reporting, data discovery/visualizations, predictive analytics, and data integration will help round out the offering and make it easier to build without deep Java MapReduce skills – making an interesting foil to Hortonworks’ similar arrangement with Talend.
- Numerous other partnerships – over 20 – that include hardware, network and systems integrator deals.
WANdisco may be an unfamiliar name to some data management folks, but not to those using Apache Subversion, the open source version control system. As a leader in Wide Area Network Distributed Computing (there is an acronym-based name in there if you look) with patent-protected active-active peer-to-peer replication, WANdisco sees a performance-driven opening too. With a leadership team that made foundational contributions to Apache HDFS and Apache BigTop and helped build out Yahoo’s infrastructure, WANdisco comes to the table with the WDD distribution, based on Apache Hadoop 2.0 with support for WANs across data centers, including mirroring and auto-recovery. (It joins MapR in its ability to provide the latter, but without requiring the use of its own filesystem.) It supports Amazon S3 storage as well as HBase, and provides a console for wizard-based deployment, monitoring and management on both virtualized (VMware) and dedicated physical infrastructure. It also provides the usual support and consulting services and plans aggressive moves to ramp up following last year’s IPO on the London Stock Exchange and acquisition of AltoStor.
Data Direct Networks (DDN), whose presence in the high performance computing (HPC) market may also be less well known to the typical Hadoop prospect, is targeting the mid- to upper end of the market with its hScaler appliance. Above 100 nodes, where significant enterprise production workloads run, and in the multi-thousand node space where government and data center customers operate, DDN is often already familiar for its Lustre-based ExaScaler and GridScaler filesystem plays. hScaler targets the fact that, by some estimates, 30% or more of a job on large clusters takes place on data that is not local to the node, despite the value proposition of Hadoop putting processing “next to the data.” This is the “shuffle” phase, which takes place between Map and Reduce – multiple times in a multi-step job workflow – and most workflows are multi-step. This is a performance play that gets more attractive with larger size and more complexity. DDN touts its pipelining of Hadoop and its ability to scale compute and storage independently as key differentiators. [edited] hScaler includes, and DDN supports, Hortonworks’ HDP, and has Pentaho for its “ETL graphical designer” and the DirectMon management console (like the ones I discussed in Part One), compatible with ExaScaler and GridScaler.
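For readers who want to see why the shuffle works against data locality, here is a minimal sketch in plain Python. It is not Hadoop code and not anything from DDN; the function names, the CRC-based partitioner and the toy three-node input are all assumptions made for illustration. It runs a word count in which each "node" maps only its own local lines, then routes every intermediate pair to the node that owns its key, and counts how many pairs had to leave the node where they were produced.

```python
import zlib
from collections import defaultdict


def partition(key, num_nodes):
    """Hypothetical stand-in for a hash partitioner: picks the reducer node for a key."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def run_word_count(node_inputs):
    """node_inputs[i] is the list of text lines stored locally on node i.

    Returns (word counts, pairs moved between nodes, total pairs shuffled).
    """
    num_nodes = len(node_inputs)

    # Map: every node emits (word, 1) pairs from its own local data only.
    mapped = [
        [(word, 1) for line in lines for word in line.split()]
        for lines in node_inputs
    ]

    # Shuffle: each pair is routed to the node that owns its key.
    reducer_inputs = [defaultdict(list) for _ in range(num_nodes)]
    moved = 0
    total = 0
    for src, pairs in enumerate(mapped):
        for word, one in pairs:
            dst = partition(word, num_nodes)
            reducer_inputs[dst][word].append(one)
            total += 1
            if dst != src:
                moved += 1  # this pair crossed from one node to another

    # Reduce: each node sums the values for the keys it owns.
    counts = {}
    for groups in reducer_inputs:
        for word, ones in groups.items():
            counts[word] = sum(ones)
    return counts, moved, total


if __name__ == "__main__":
    counts, moved, total = run_word_count([["a b a"], ["b c"], ["a c c"]])
    print(counts)
    print(f"{moved} of {total} intermediate pairs crossed nodes")
```

The map step is fully local, but the destination of each pair depends on its key rather than on where it was produced, so some share of the intermediate data has to move. Real clusters soften this with combiners and compression, which the sketch leaves out.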
There you have it: 3 new players, all of whom are focusing on the performance dimension as their reason for entering the market. This is a maturation step; in the earliest days of a new technology, it’s enough to be able to do it at all. Some of the tire-kickers are now moving on to evaluate alternatives against one another not just on what functions they have, but on how well they do them. POCs are appearing in my Gartner inquiries, and the results, not surprisingly, vary by workloads, by the data types, and by the skills of the field staff involved in the tests. We’ll see lots of benchmarketing in the months ahead. Just remember that your mileage may vary, and ALWAYS test with a POC bakeoff.