The Hadoop ecosystem is like a kaleidoscope, in which particles keep colliding and tumbling, forming mesmerizing patterns through their reflections. My research note “What Matters When Comparing Hadoop Distributions” is finally out. I’ve been writing it for four months. Whenever I felt it was ready, there were principal points that I still had to resolve, because the Hadoop kaleidoscope kept turning. Hadoop distribution vendors were changing their stances, and clients were seeking guidance on more and more Hadoop-related subjects. What’s even more interesting, over this time a whole wave of Hadoop ecosystem products became more visible in the kaleidoscope: Databricks / Apache Spark, 0xdata H2O and Adatao are examples.
I’d like to offer the main points from my research, which can help enterprises get a snapshot of the Hadoop kaleidoscope. Be aware that many new announcements will come from Strata / Hadoop World this week, keeping the beautiful and evanescent pictures in motion.
- Commercial Hadoop distributions eliminate the complexity of building a Hadoop stack on your own. They provide indemnification and support for open-source software.
- There are more similarities than differences among commercial Hadoop distributions: All Hadoop distributions include the core open-source Apache Hadoop projects, many other open-source projects and a smaller set of distribution-specific components. Most distribution-specific components deliver functionality comparable to that of other distributions. This makes vendor lock-in concerns unfounded for the majority of use cases.
- Hadoop distributions will improve and will deviate from their current state. Gartner expects new technologies in the Hadoop ecosystem in the near future.
- Hadoop’s value is not only in its features and capabilities: given the growing maturity of YARN resource management, it is also becoming the de facto cluster management standard.
- Given that Hadoop is engineering-driven, certain gaps important to the business could get low priority or may be overlooked.
- For many organizations, big data initiatives are the cutting edge of their innovation. Talented and experienced distribution vendors are often not just service providers but innovation partners and the source of new ideas in the enterprise.
- Cost should not be a key factor in deciding to implement Hadoop on your own. Acquire a commercial Hadoop distribution for your on-premises implementation to address unavoidable technology challenges.
- Partnerships between Hadoop distribution vendors and your key software or hardware suppliers are a key factor in selecting a Hadoop distribution. Determine how a Hadoop distribution fits into your overall architecture.
- The majority of your time would be better spent on determining the value of Hadoop to your enterprise, rather than on choosing among Hadoop distributions.
- Your long-term architectures will evolve along with Hadoop. In light of rapid changes and upcoming Hadoop improvements, focus your architecture on your immediate use cases.
Follow Svetlana on Twitter @Sve_Sic