You remember it: the highly publicized competition in which the world at large was challenged to beat Netflix's incumbent recommendation engine by, well, squeezing more insight out of Netflix's own data on its members' film viewing and rating habits than the company itself could.
And you remember the winners, perhaps: a team of brainiacs from a couple of different, no doubt impressive, universities, who managed to beat a successful system that offered up recos on what to watch next. This insight — and of course the attendant publicity — were worth $1 million to Netflix. (The company’s 2014 revenue was $5.5 billion.)
Contestants were given real data on 100 million ratings linked to (anonymized) households. (This household thing is important — file it away for the moment.) They were also given the specific challenge to improve the performance of Netflix's in-house system, called Cinematch, by 10%, building a model that would more accurately predict how Netflix users would rate a film they hadn't seen. How was this "improvement" measured? Technically, the winners had to reduce something called root mean squared error from 0.9525 to 0.8572 or less. That's roughly $100K for every 0.01 shaved off.
Now, what is this super-valuable root mean squared error? Basically, it is a measure of the gap between the ratings a model (Netflix's or the competitors') predicted a bunch of people would give and the ratings they actually gave. It's an example of something common in data science projects, where the modelers are frantically trying to minimize some number that measures their mistakes.
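To make the metric concrete, here's a minimal sketch of RMSE in plain Python. (The ratings below are made up for illustration.)

```python
import math

def rmse(predicted, actual):
    """Root mean squared error: square each miss, average the
    squares, then take the square root of that average."""
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy example: predicted vs. actual star ratings for five films
predicted = [3.5, 4.0, 2.5, 5.0, 3.0]
actual    = [4.0, 4.0, 2.0, 4.0, 3.5]
print(round(rmse(predicted, actual), 4))  # → 0.5916
```

The competition, in effect, paid $1 million to push this one number from 0.9525 down to 0.8572.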
The challenge appears to have closed down (for now). The winning solution was never used. Looking back at this pioneering data science competition — forerunner of the modern cage match-scape in which teams of competitive modelers throw away hours of their lives like demented ultramarathoners on Kaggle and so on — is very telling, I think, in two dimensions. It:
- Reminds us that there is always a balance between precision and practicality — that the perfect solution is something best left to academics with infinite leisure
- Shines a light on how very smart people have recently been thinking about recommendation systems, which are some of the most common models used in marketing
Actually, there were three Netflix Prizes awarded after the competition was announced in 2007. At the time, Netflix had yet to launch its streaming service and was a red-envelope-mailing company with generous terms for broken DVDs. The first two cash prizes went to related teams called BellKor and Big Chaos that didn't quite eke out the full 10% (their models were 8.4% and then 9.5% better), but it cost them more than 2,000 hours of code pounding, hammering together no fewer than 107 different algorithms, artfully combined into something called an "ensemble." Ensemble is shorthand for "a bag of algorithms whose outputs are combined and weighted."
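A weighted average of model outputs is the simplest way to picture what an ensemble does. In this sketch, the three model outputs and their weights are invented for illustration:

```python
def ensemble_predict(predictions, weights):
    """Blend several models' predictions into one via a weighted
    average -- the simplest form of an ensemble."""
    total = sum(weights)
    return sum(p * w for p, w in zip(predictions, weights)) / total

# Hypothetical: three models' predicted ratings for one film,
# weighted by how much we trust each model
model_outputs = [3.8, 4.2, 3.5]
weights = [0.5, 0.3, 0.2]
print(ensemble_predict(model_outputs, weights))  # a blended rating near 3.86
```

The winning teams did something far more elaborate than a fixed weighted average, of course, but the core idea is the same: many imperfect models, combined, beat any one of them alone.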
Netflix actually used the BellKor solution, or some of it. According to one Netflix engineer, the team's model had to be adapted to handle 50X greater volume than the 100 million-record competition data set and to deal with all the new ratings screaming through the walls. But lesson learned: $50K of partial-prize money well spent.
Two years later, a combined team including BellKor kept pounding away and their slightly better solution — also, as you no doubt guessed, a mighty ensemble — won the $1 million Grand Prize. It was never used. Why? The Netflix engineer admitted:
“… the additional accuracy gains that we measured did not seem to justify the engineering effort needed to bring them into a production environment.”
In other words: it wasn't worth the trouble. Good enough was good enough. And Netflix's business was changing rapidly into a streaming-delivery service that puts business value on many factors other than infinitely accurate recommendations. These other factors (as Netflix has described them) include diverse recommendations (catching all the different snowflakes in your modern family household), pushing Netflix's own original content and partners' programming, promoting new movies, and so on.
The winners received some adulation and were reportedly snapped up by Silicon Valley hedge funds. In September 2009, Wired ran a breathless sports-style recap … that turned into a misty retrospective three years later titled “Netflix Never Used Its $1 Million Algorithm Due to Engineering Costs.”
The mood turned dark. People who misunderstand the game-playing and career-making aspects of these competitions, which pay off winners in spiritual and material terms, began to feel a bit tweaked. One of Forbes’ self-appointed pundits was quite pointed: “The failure of the Netflix prize is a timely reminder that basically all the books and blogs on social media are bulls#%t.” And really, this “failure” is a story of how Netflix “stops throwing good money after bad.”
Really, what did the Netflix Prize experience tell us?
- Good Enough Usually Is: You can get to an 80% solution quickly, a 90% solution slowly, and a 95% solution after person-years of absolute agony . . . and usually, the 80% is good enough. I think this is true in many marketing disciplines, not just data science.
- Right Here, Right Now: As Netflix became a streaming service, it realized that in-session data is more valuable than anything it might compute offline. What do I mean? If a person is browsing the Sandra Bullock movies (guilty pleasure) right now, that's a very strong signal that they might want to watch a Sandra Bullock movie right now. You could spend months building predictive recommenders that are less likely to be accurate than this simple piece of near real-time information.
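As a toy illustration of that point (the titles, scores, and boost value are all hypothetical, not Netflix's actual logic), here's how a crude real-time browsing signal could outrank an offline model's carefully computed scores:

```python
def score_with_session_boost(base_score, in_browsed_category, boost=0.5):
    """Bump a title's recommendation score when the viewer is
    browsing its category right now -- a crude stand-in for
    favoring in-session signals over offline predictions."""
    return base_score + (boost if in_browsed_category else 0.0)

# Hypothetical scores from a months-in-the-making offline model
catalog = {"Speed": 0.62, "Gravity": 0.70, "The Proposal": 0.55}
# Titles in the category the viewer is browsing at this moment
browsing_now = {"Speed", "The Proposal"}

ranked = sorted(
    catalog,
    key=lambda t: score_with_session_boost(catalog[t], t in browsing_now),
    reverse=True,
)
print(ranked)  # → ['Speed', 'The Proposal', 'Gravity']
```

The offline model preferred "Gravity," but the in-session signal promotes what the viewer is actually looking at right now.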
- Keep It Real: Marketing is rarely about life and death, amigos. It’s only movies. Leisure time activities need some serendipity and we aren’t always rational in our moments of freedom. What I mean is that we can become so focused on squeezing information out of incomplete and frustrating data sets that we forget our customers want to have fun. They can tell when they’ve been modeled. It’s a human system.
Next time, we’ll take a look at the actual approaches used by the winning team and by recommender systems in general.
Thanks to Michael Driscoll at Metamarkets for reminding me about the Netflix Prize and lessons learned. Let me know your lessons @martykihn