Increasing adoption of big data technologies brings about the big data dilemmas:
- Quality vs. quantity
- Truth vs. trust
- Correction vs. curation
- Ontology vs. anthology
Data profiling, cleansing and matching, in Hadoop or elsewhere, are all good, but they don’t resolve these dilemmas. My favorite semantic site, Twinword, pictures what a dilemma is.
You get the picture. New technologies promote sloppiness. People do stupid things because now they can.
Why store all data? — Because we can.
What’s in this data? — Who knows?
Remember the current state of big data analytics? — “It’s not just about finding the needle, but getting the hay in the stack.”
Big data technologies are developing fast. Silicon Valley is excited about new capabilities (which very few are using). In my mind, the best thing to do right now is to enable the vast and vague data sources that are commingling in the new, immature data stores and sitting confined in the mature ones. Companies store more data than they can process or even fathom. My imagination fails at a quintillion rows (ask Cloudera). Instead, it paints a continuous loop: data enables analytics, and analytics boosts the value of data. How to do this? It is starting to dawn on the market: through information quality and information governance!
My subject today is just the information quality piece. It continues my previous blog post BYO Big Data Quality. (I explained the whole information loop on this picture in Big Data Analytics Will Drive the Visible Impact of the Nexus of Forces.)
Data liberation means more people accessing and changing data. Innovative information quality approaches (visualization, exception handling, data enrichment) are needed to transform raw data into a trusted source suitable for analysis. Some companies use crowdsourcing for data enrichment and validation. Social platforms provide a crowdsourced approach to cleaning up data and make it easier to find armies of workers with diverse backgrounds. Consequently, ensuring the quality of the crowdsourced work becomes another new task.
Big data is a way to preserve context that is missing in the refined structured data stores. This means striking a balance between intentionally “dirty” data and data cleaned of unnecessary digital exhaust, and between sampling and no sampling. A capability to combine multiple data sources creates new expectations for consistent quality; for example, to accurately account for differences in granularity, velocity of changes, life span, perishability and dependencies of participating datasets. Convergence of social, mobile, cloud and big data technologies presents new requirements: getting the right information to the consumer quickly, ensuring reliability of external data you don’t have control over, validating the relationships among data elements, looking for data synergies and gaps, creating provenance of the data you provide to others, and spotting skewed and biased data.
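To make the granularity point concrete, here is a minimal Python sketch of one way to combine a daily source with a monthly one. Everything in it is an assumption for illustration: the function names `roll_up_to_month` and `combine`, the sample sales and budget figures, and the choice to roll the finer source up to the coarser one rather than the other way round.

```python
from collections import defaultdict
from datetime import date


def roll_up_to_month(daily_rows):
    """Sum daily (date, value) rows into {(year, month): total}."""
    totals = defaultdict(float)
    for day, value in daily_rows:
        totals[(day.year, day.month)] += value
    return dict(totals)


def combine(daily_rows, monthly_rows):
    """Join a daily source with a monthly one at monthly granularity.

    Returns (joined, gaps):
      joined maps month -> (rolled-up daily total, monthly value)
      gaps lists the months that appear in only one of the two sources
    """
    rolled = roll_up_to_month(daily_rows)
    monthly = dict(monthly_rows)
    joined = {m: (rolled[m], monthly[m]) for m in rolled.keys() & monthly.keys()}
    gaps = sorted(rolled.keys() ^ monthly.keys())
    return joined, gaps


# Hypothetical sample data: daily sales and a monthly budget.
daily_sales = [
    (date(2014, 1, 5), 10.0),
    (date(2014, 1, 20), 15.0),
    (date(2014, 2, 3), 7.0),
]
monthly_budget = [((2014, 1), 30.0), ((2014, 3), 40.0)]

joined, gaps = combine(daily_sales, monthly_budget)
print(joined)  # months present in both sources
print(gaps)    # months missing from one source
```

The sketch only compares values once both sources sit at the same grain, and it reports the months that one source lacks instead of silently dropping them. That is the "synergies and gaps" check in miniature; velocity, life span and perishability would need their own rules.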
In reality, a data scientist’s job is 80% data quality engineer, and just 20% researcher, dreamer and scientist. A data scientist spends an enormous amount of time on data curation and exploration to determine whether s/he can get value out of the data. The immediate practical answer: work with dark data confined in relational data stores. Well, it’s structured; therefore, it is not really new. But at least you get enough untapped sources of reasonable quality, and you can extract enough value right away, while new technologies are being developed. Are they?
While Silicon Valley is excited about Drill, Spark and Shark, I am watching a nascent trend: big data quality and data-enabling. Coincidentally, I got two briefings last week: Peaxy (which I liked a lot for its strength in the Industrial Internet) and Viking-FS in Europe, with a product called K-Pax, mainly for the financial industry. A briefing with Paxata is on Friday, and a briefing with Trifacta is scheduled too. Earlier this month, I got a briefing from Waterline Data Science, a stealth startup with lots of cool ideas on enabling big data. Earlier, I had encounters with Data Tamer, Ataccama and Cambridge Semantics, among others. Finally, have you heard about G2 and sensemaking? Take a look at this intriguing video. All these solutions are very different. The only quality they have in common is immaturity. For now, you are on your own, but hold on, help is coming!
Follow Svetlana on Twitter @Sve_Sic