This year’s West Coast edition of Spark Summit continued the transition from a data science and data engineering event to an event focused on machine learning and, to a lesser extent, artificial intelligence. The summit content was accompanied by product announcements from Databricks (Serverless was the most notable) and updates to the Apache Spark project, such as Structured Streaming.
Like any event where machine learning is a primary topic, the keynote hype was thick. As one keynote speaker stated, “Being right is good, but being fast is better.” I get the sense that data science groups are relieved that machine learning is the new hot topic. With a new windmill to tilt at, data science groups can avoid questions about the lack of value created during the ‘big data’ phase of the hype.
Tempering that pessimism are successful uses of ML from companies like Hotels.com and EveryMundo. These deployments took time to build and refine, and they haven’t been perfected yet. Applying machine learning to areas like image recognition is a process requiring constant refinement and optimization.
Takeaway: If you’re looking to experiment with machine learning, Spark is as good a platform as any to start with. It continues to attract the most interest from academia and open source developers. But don’t let that new windmill on the horizon distract you from simple methods. Sometimes a linear regression is just as effective and faster to implement.
The general sessions were more pragmatic. Several companies explained how they’re using machine learning in areas like genomics, video streaming and data quality. Others, like Baidu, offered a detailed look at the amount of work they had to do to support self-service analytics over advertising and search data. (Spoiler – it was a lot of work.)
Takeaway: Scaling your Spark-based analytics and data processing for an enterprise-wide audience is non-trivial. You must understand the different dimensions of your various workloads (complexity, frequency, skills) and optimize accordingly. While there are tools and products that help, the clear message from the engineers presenting at Spark Summit was that getting to production required an iterative, phased approach that, in many cases, took years.