Recently I had an opportunity to learn a little more about Apache Spark, a new in-memory cluster computing system originally developed at the UC Berkeley AMPLab. By moving data into memory, Spark improves performance for tasks like interactive data analysis and iterative machine learning. These improvements are especially pronounced when compared with a batch-oriented, disk-bound system like Apache Hadoop. Spark has seen rapid adoption at a number of companies; I learned specifically how Yahoo! has started integrating it into its analytics.
With over 800 million users, Yahoo! has to conduct data science at a massive scale, with results timely enough to be meaningful. To give a feel for the necessary scale: Yahoo! has over 150 petabytes stored on a 35,000-node Hadoop cluster. This data is used for machine learning, as well as BI and analytics. With that quantity of data under management, efficient access is key, and most projects require the processing power of the entire cluster. Yahoo! looked to Spark to improve the performance of its iterative model training.
Spark is appealing for a few reasons beyond its efficiency improvements. It offers a rich API in several programming languages, provides resilient in-memory storage options, and is compatible with Hadoop through YARN (see “Hadoop Evolves to Face New Challenges”) and the Spark-YARN project.
Yahoo! tested Spark’s performance relative to Hadoop using an e-commerce pilot project. The project had a few simple but resource-intensive use cases: viewed-also-viewed, bought-also-bought and bought-also-viewed. The pilot was successfully implemented in 30 lines of code (using Spark’s Scala API) and executed in 14 minutes on just 10 servers. The equivalent Hadoop implementation took 106 minutes (its line count wasn’t provided).
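To make the use cases a little more concrete, here is a minimal sketch of what a viewed-also-viewed computation boils down to: counting how often two items show up in the same session. This is plain Python on a toy data set, not Yahoo!'s code and not the Spark Scala API. The function names (`also_viewed`, `recommend`) and the sample sessions are hypothetical and exist only for illustration.

```python
from collections import Counter
from itertools import combinations


def also_viewed(sessions):
    """Count how often each pair of items appears in the same session."""
    pair_counts = Counter()
    for items in sessions:
        # Deduplicate and sort so (a, b) and (b, a) collapse into one key.
        for a, b in combinations(sorted(set(items)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def recommend(pair_counts, item, top_n=3):
    """Return the items most often seen alongside `item`."""
    scores = Counter()
    for (a, b), n in pair_counts.items():
        if a == item:
            scores[b] += n
        elif b == item:
            scores[a] += n
    return [other for other, _ in scores.most_common(top_n)]


# Hypothetical toy data: each inner list is one user's viewing session.
sessions = [
    ["camera", "tripod", "sd_card"],
    ["camera", "sd_card"],
    ["tripod", "camera_bag"],
    ["camera", "camera_bag", "sd_card"],
]

counts = also_viewed(sessions)
top = recommend(counts, "camera")
print(top)
```

The same pattern covers bought-also-bought and bought-also-viewed by swapping in purchase events. The pair-generation step is what makes these jobs resource-intensive at scale, since the number of candidate pairs grows quadratically with session length.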
While these improvements are impressive, Yahoo! isn’t abandoning its Hadoop cluster for Spark. There is a clear need for both types of workloads. Spark will be the preferred technology for iterative processing, while Hadoop continues to fulfill its niche for batch data processing tasks. What’s interesting is that both types of tasks run on the same Hadoop cluster through YARN.