Today the Gartner Information Management and Analytics Community held its weekly Twitter chat (Tweetchat, Tweetjam, TweetUp, whichever you prefer) to discuss concepts around big data, the role of the data scientist, and data quality. More than half a dozen Gartner analysts shared their ideas and research. (Where else can you get access to that many Gartner analysts in one place at the same time?) Dozens more individuals from other organizations also shared their perspectives and questions.
Big Data—Hey What’s the Big Idea?
First we discussed whether “Big Data” is an animal, vegetable or mineral, concluding that it has become very much a marketing term. Gartner analyst Andy Bitterer (@bitterer) jabbed, “Is Big Data nothing but a marketing play, since many organizations had ‘big data’ for a long time?” Tim Elliott (@timoelliott) concurred, stating that “new terms arise because of new technology, not new business problems.” Esteban Kolsky (@ekolsky) thought the term was a more specific “marketing word used to describe the incredible volume coming out of social [networks].”
Yves de Montcheuil (@ydemontcheuil) suggested that organizations “have had Big Data all along but couldn’t get value out of it, except with lots of $$$,” and Gartner analyst Doug Laney (@doug_laney) agreed with a quip about Big Data being relative: “Big Data is merely data that’s an order of magnitude greater than data you’re accustomed to…Grasshopper.”
Hadoop was mentioned more than a few times as both an enabler and a driver of big data, with Mark Troester (@mtroester) summing it up by saying that the “hype of Hadoop is driving pressure on people to keep everything.” Some suggested archiving or even unloading data that is unused, but John Haddad (@JohnM_Haddad) and Martin Schneider (@mschneider718) both reminded everyone that data retention may depend on industry regulations and government mandates.
Some inquired about how to find value in data, so Doug Laney offered that there are two sides to that equation: 1) “looking beyond basic BI to advance analytics” and 2) “quantifying data’s potential and actual value.” Doug also summarized one of Gartner’s strategic planning assumptions for 2012: “Through 2015, >90% of business leaders say info is a strategic asset, yet <10% will quantify its economic value.” Gartner analyst Merv Adrian (@merv) admittedly had some fun with the notion of hidden value in data, asking, “Would it be a bad thing for organizations to say ‘Maybe there is value in the dark fiber of our information fabric?’”
The Art of Data Science
This led into a discussion about data science and the realization of data value. Gartner analyst Ted Friedman (@ted_friedman) wrote that it’s “good that analytics roles are becoming key, but ‘data scientist’ is a little bit elitist IMO.” Esteban disagreed, contending that the term “scientist is not elitist, it defines a specific role.” Gartner analyst Carol Rozwell (@CRozwell) responded by suggesting, “But shouldn’t the average person be able to derive value from data?…[even though] some people refuse to see the truth in data.”
Nenshad Bardoliwalla (@nenshad) contended that the need for data scientists may be overblown. He believes that “Purpose-built apps can democratize making sense of Big Data for business folks without the need for data scientists (in some domains).” @Brett2point0 agreed, offering that “ideally end users should be empowered to explore their own data, seek their own insights through self-service.”
Gartner’s Doug Laney shared his analysis of current job descriptions for “data scientist” versus those for “BI analyst”. Key words in “data scientist” job descriptions include: design, knowledge, research, complex, learning, machine, models, problems, and performance; whereas top words used in “BI analyst” job descriptions are reporting/reports, company, technical, industry, user, sql, applications, and metrics. Tony Baer (@TonyBaer) and Doug agreed that communication is the skill that differentiates theoretical from applied science.
Mark Troester argued that someone needs to have “real intelligence to identify relevance and rationalize data,” and Jill Hulme (@jill_hulme) chimed in that “a data scientist needs skills in math, engineering, writing, and a healthy dose of skepticism.” Adrian Bowles (@ajbowles) philosophized that a data scientist is like “a sculptor, finding a figure in material,” and that “Science is discovery, but not all who discover are scientists.”
Mopping Up with Data Quality
Finally we wrapped up with some thoughts on data quality in a Big Data context. Esteban claimed that “Big Data has compounded the [data quality] problem” and that 40% of the data he sees now is bad. Seth Grimes (@SethGrimes) similarly lamented that “questionable data is the rule rather than the exception in my specialization areas: text and sentiment analysis.”
Yves thinks that “data volumes make it hard for traditional data quality architectures to keep up with big data.” However, Gartner’s Ted Friedman offered up another perspective that “data quality problems can be eased by big volumes in that individual flaws may have less impact when the data set is bigger.”
Mark Troester turned the idea of analytics on its head, recommending, “We shouldn’t just apply data quality for analytics, we should use analytics to help with quality.” He said he’s also “seen people so aggressive about cleansing that they cleanse away insight.”
When some participants suggested that data should ideally be cleansed at the source or when received, Doug Laney cautioned that “you can’t always cleanse data before storing it because of performance and the need to integrate and analyze it first.” Ted Friedman added that data quality is a “harder problem when organizations wish to use data they didn’t produce or don’t own. The greater competency is assessing data quality…but that depending upon the usage and type of data, some you will still have to get nearly perfect.”
Thanks again to the following individuals and organizations for their participation:
@ajbowles @arbeiza @berkson0 @bgassman @bikespoke @bitterer @Brett2point0 @briellenikaido @chirag_mehta @cpreston64 @cpydimuk @CRozwell @datachick @DataIntegrate @DavideCamera @decisionmgt @DivineParty @donloden @doug_laney @eIQnetworks @ekolsky @erao @EventCloudPro @furukama @howarddresner @iam_joshd @infanteAL @InformaticaCorp @jamet123 @JayMOza @jessewilkins @jill_hulme @johndavidstutts @johnlmyers44 @JohnM_Haddad @JSussin @juliebhunt @loranstefani @marciamarcia @merv @mschneider718 @mtroester @Natasha_D_G @NeilRaden @NekkidTech @nenshad @OhThisBloodyPC @pishabh @RobertsPaige @RomanStanek @rqtaylor @ryanprociuk @s_pritchard @seamuswalsh @SethGrimes @SocialMediaJeff @StacyLeidwinger @stevesarsfield @Tanvi_MR @techguerilla @ted_friedman @timoelliott @TonyBaer @userevents @ValaAfshar @Vivisimo_Inc @wiseanalytics @XeroxDocuShare
Please join or follow Gartner’s BI, analytics and information management analysts each Friday at 12:00pm ET on Twitter at #GartnerChat.
Note: Some tweets have been edited slightly in this blog to make them easier to understand and/or to add context.