Gartner inquiries about “big data,” as much as I have always hated the term, continue to be frequent, but Hadoop ones are not. In the past 24 months, our data management analysts have advised clients on hundreds of “big data” inquiries – a couple of dozen a month, still a substantial volume. For perspective, in the same period, there were over twice as many about data lakes. (There is some overlap, and I have not attempted to tease them apart.) If you’re wondering about “lakehouse,” it’s less than a tenth of that, and mostly in the past few months, but it’s rising rapidly.
Tracking these things is a good way for Gartner to hear what the market is saying – it informs our coverage, directs our research, and focuses our attention. The “Hadoop” story is instructive. It has come up only half as often as “big data” over the same period, and the trend is very clear, as the figure below shows.
This should not surprise anyone. After all, Cloudera announced its Enterprise Data Hub 8 years ago in 2014. Its messaging rapidly moved away from marketing based on “Hadoop” and embraced Spark, Kafka, Flink, NiFi and other key open source developments over the next few years, and its competitors did the same. In fact, Cloudera, Databricks, Dremio, Ververica and other software vendors continue to watch developing open source components and use them to enhance their commercial offerings – and return the favor by contributing code to those projects.
The statistics don’t tell you what the “Hadoop” inquiries are about, but that has undergone a change as well. Two years ago, in the Hype Cycle for Data Management, 2020, Gartner discussed the emerging interest in SQL Access to Hadoop and SQL Access to Object Stores – based in large part on inquiries we were receiving. That trend was so pronounced we began covering a new category called Analytics Query Accelerators, products promising to improve performance against unoptimized data lake data, with acceleration technology often built on open source offerings. Many of the data teams’ “Hadoop” inquiries have moved over there. I posted my last Project Tracker updating which open source project versions were in commercial Hadoop offerings a few months later. The continuing inquiry traffic is increasingly about where to put the data first assembled for use with the original Hadoop toolset, and what new tools to use.
For example, clearly Spark has taken a lot of the compute workloads. Gartner’s inquiries about Spark in the same time period have soared to 1120, but over three quarters of them go to analysts outside the data management team. The separation of compute and storage is making its presence felt in who asks, and who answers, the questions that used to be about “Hadoop.” The issues are separate now, and often the teams are too. Compute-oriented folks are expecting the stores provided to them by their data teams to be accessible with open APIs, and the marketplace offerings at the storage layer will increasingly accommodate those expectations. Another refactoring looms.
Apache Hadoop is far from dead – it’s still quite active, with the version 3.3 line seeing its first release in July 2020, and several updates have been published since in the same stream. It’s still defined on its Apache site by MapReduce, HDFS and YARN, which have continuing value and significant installed bases. But the next steps are in sight. MapReduce is not a preferred tool anymore. HDFS is seeing numerous competitors at the storage layer. YARN is not really found elsewhere, while other open source tools for resource management compete in a dwindling on-premises landscape. It’s not about Hadoop anymore – it’s about what’s next.
The Gartner Blog Network provides an opportunity for Gartner analysts to test ideas and move research forward. Because the content posted by Gartner analysts on this site does not undergo our standard editorial review, all comments or opinions expressed hereunder are those of the individual contributors and do not represent the views of Gartner, Inc. or its management.
Thanks for that. Good to get a broader view of data management that only someone taking all those calls sees.
That motion toward a Hadoop alternative seems to have been a big movement for a while. Wrote a Medium post about it a few years back: https://medium.com/@paigeonthewing/what-is-the-best-hadoop-alternative-18013a4980be
Dask was one that hadn’t been on my radar till I worked on the data architecture O’Reilly book with a young ML developer, Ben Epstein, last year. With the popularity of Python, I can see that having a future if they can work out the kinks.
I also read your Analytics Query Accelerator paper. I think one of the solutions to a big chunk of the problems with Hadoop and data lakes in general is the query accelerator technology. It’s one of the reasons Vertica added that capability years ago, although it used to be “SQL on Hadoop” and isn’t anymore for obvious reasons.
Thanks, Paige – we have been on the same – ahem – page about this for years. My number one line back in those days was “Hadoop is not a thing – it’s a set of things. And the things are changing all the time.”