In the absence of best practices for big data quality, individual companies are coming up with their own solutions. Of course, these organizations have problems first. Let’s look at an example from Paytronix, a cool company that provides loyalty management for restaurant chains, including my favorite, Panera Bread. Paytronix is converging social, mobile, cloud and big data for its business (aha! The Nexus of Forces!). And, by the way, cutting-edge technologies help a lot in attracting top talent to the company. But first things first: Paytronix had a big data quality problem. Here is the description:
- Over a quarter of their clients (restaurants) do not ask for age
- Where restaurants do ask for age, 18% of customers leave it blank
- Of those who answer, approximately 10% are blatant liars
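As a rough back-of-the-envelope sketch, the three gaps above can be multiplied together to see how little trustworthy age data is left. This is only an illustration, not Paytronix's own calculation: it assumes the three rates are independent, that customers are spread evenly across clients, and that "over a quarter" means exactly 25% (so the result is an upper bound). The function name `usable_age_share` is made up for this example.

```python
# Rates taken from the list above; everything else is an assumption.
ASK_RATE = 0.75     # "over a quarter" of clients do not ask, so at most 75% ask
ANSWER_RATE = 0.82  # 18% leave the age field blank
HONEST_RATE = 0.90  # approximately 10% give a blatantly false age


def usable_age_share(ask_rate: float, answer_rate: float, honest_rate: float) -> float:
    """Share of customers with a plausibly correct age on file.

    Assumes the three rates are independent of each other and that
    customers are evenly distributed across clients (hypothetical).
    """
    return ask_rate * answer_rate * honest_rate


share = usable_age_share(ASK_RATE, ANSWER_RATE, HONEST_RATE)
print(f"Usable age data: at most about {share:.0%} of customers")
```

Under these assumptions, only a little more than half of the customers have an age that can be taken at face value.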
All of the above means that identifying families with kids is a huge challenge (spoiler: Paytronix successfully met the challenge). People with kids are younger. They tend to fill the restaurants earlier in the evening. The average check is higher when orders include a kids meal (I can confirm this for our orders at Panera). That’s why restaurants often want to market to people with children: when they offer a kids meal coupon, they get 25% more redemptions. But! What customers say is different from what they do. (Aren’t we all customers?) In other words, here is the picture, instructive for parents:
Big data quality is new and different: traditional models do not work, familiar standards do not apply, typical metrics miss the mark. Most importantly, people’s mentality has to change when they assure the quality of big data. My colleague Martin Reynolds likes to cite, “most people are woefully muddled information processors who often stumble along ill-chosen shortcuts to reach bad conclusions.” This quote appeared in Newsweek in 1987, BC (before Cloudera, the first commercial Hadoop distribution vendor). I.e., the problem is eternal, although it wasn’t so widespread in data management because there was not much data management in 1987. The Newsweek issue with the quote still advertised typewriters, the best in the world. Wikipedia gives a daunting list of cognitive biases, and each bias is a big data quality factor because quality applies to the resulting analysis, to intermediate results, and to iterative data science. In the case of Paytronix, segmentation was biased. Biases also apply to data mashups: to evaluating the granularity, trustworthiness and dependencies of participating data sets. And sometimes, biases matter even to the absence or presence of particular data sources. Martin Reynolds shared with me the most astonishing example of cognitive bias.
Paytronix solved its big data quality problem by deciding not to change how people think. They validated the data by giving it, as cubes in a familiar BI tool, to good old people. By the way, crowdsourcing is another excellent big data quality method that relies on people. But that is the subject of my next post: I will tell you what vendors are doing about big data quality, and maybe even about big data governance. As DBAs like to say, stay tuned.
Follow Svetlana on Twitter @Sve_Sic