Last week I headed to sunny Richmond, Virginia to attend the International Data Quality Summit, co-hosted by the International Association for Information and Data Quality (IAIDQ) and the Electronic Commerce Code Management Association (ECCMA).
A practitioner-led event, this four-day conference provided an opportunity for approximately 150 data professionals to share experiences, catch up on the latest trends, and develop the body of practice on all aspects of Information Quality, Master Data Management and Data Governance. The whole event was wonderfully chaired by the charming, knowledgeable and horribly well-organised data management professional Joy Medved. I was also delighted to win a copy of Danette McGilvray’s book “Executing Data Quality Projects” in the conference raffle. (As a true Scotsman, I’m canny with my bawbees and so getting something for nothing is always good, especially if it’s something so useful!)
There was time for a social side too, and the American Civil War themed gala dinner on the Wednesday night was a particular highlight!
As well as giving my own presentation session on techniques for information requirements gathering (building upon my previous blog post “The One Question You Must Never Ask!”) and participating in an expert panel discussion on data and ethics, I also attended a number of other workshops, tutorials and discussions with peers and colleagues.
Peter Benson of ECCMA laid out the philosophy and purpose of the ISO 8000 Standards for Data Quality, as well as outlining the relationship between ISO 8000 and the ISO 22745 Metadata standards. Simply put, the aim of the ISO standards for data is to ensure that there is true portability of data in terms of context, syntax, semantic encoding, and stated user requirements – and that these are maintained in synchronisation. The core lesson was about precision – the more precise the statement of requirement, the more likely it is that you will get data that meets that requirement! (A theme which tied in rather nicely with my own session, given that getting business users to state their information requirements can be pretty challenging…) Applying the principles of ISO 8000 should ensure that the data set is warrantable for a bounded set of user circumstances.
Bill Inmon of Forest Rim Technology (well-known to most in the data world as “The Father of Data Warehousing”) examined the issues of getting business value from “Big Data”, and textual data in particular. In short – it’s hard! Bill shares a general scepticism about data lakes as a concept, and was at pains to point out that there is no point in doing a “Big Data” project (or any data project, for that matter) unless there is identifiable business value in doing so. Vanity projects, or panic initiatives motivated by a “keeping up with the Joneses” attitude, are likely to fail. He also identified the different processing needs when treating repetitive data sets (such as click-stream, log data, call records etc) and non-repetitive data (email, call centre notes, documents, contracts etc). The former is usually best processed using more traditional “schema on write” methods, whereas the latter would normally require a “schema on read” approach. The big challenge with “schema on read” is semantic disambiguation and contextualisation of text – language is complex, subtle, and nuanced. A range of functions is necessary to disambiguate text, including qualified vocabularies, known taxonomies and ontologies, acronym and homograph resolution, textual proximity analysis, and in-line position-based resolution.
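To make the “schema on write” versus “schema on read” distinction a little more concrete, here is a minimal Python sketch of my own (not something shown in Bill’s session). The record layout, the acronym table and the function names are all hypothetical, and real textual disambiguation needs far more than a lookup table.

```python
import re
from dataclasses import dataclass


# Hypothetical layout for a repetitive data set (call records).
@dataclass(frozen=True)
class CallRecord:
    caller: str
    callee: str
    duration_seconds: int


def write_call_record(raw: str) -> CallRecord:
    """Schema on write: the structure is enforced before the row is stored."""
    caller, callee, duration = raw.split(",")
    # int() raises ValueError if the duration is not numeric, so bad rows
    # are rejected at load time rather than discovered at query time.
    return CallRecord(caller.strip(), callee.strip(), int(duration))


# Hypothetical acronym table: the same token means different things
# depending on the context in which the text was written.
ACRONYMS = {
    "cardiology": {"HA": "heart attack"},
    "general practice": {"HA": "headache"},
}


def read_note(raw_text: str, context: str) -> str:
    """Schema on read: the text is stored as-is and interpreted when read."""
    expansions = ACRONYMS.get(context, {})

    def expand(match: re.Match) -> str:
        token = match.group(0)
        # Unknown acronyms are left untouched rather than guessed at.
        return expansions.get(token, token)

    # Treat any run of two or more capital letters as a candidate acronym.
    return re.sub(r"\b[A-Z]{2,}\b", expand, raw_text)
```

Calling `read_note("Patient reports HA since Monday", "cardiology")` and then again with `"general practice"` gives two different readings of the same stored note, which is the crux of the contextualisation problem.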
Piyush Malik of IBM also looked at the value of “Big Data” projects, and offered a four-step model for deriving business value from data: 1) FIND (discovery & profiling), 2) PREPARE (understand context, cleanse, format, enrich), 3) DEFEND (defensibility, trust, lineage, confidence), 4) SECURE/COMPLY (including privacy & retention).
Alexander Borek of IBM noted the rising trend of the Chief Data Officer, and the implications for both data quality and data analytics, with contextualisation being the key link. There is a shift from Data Quality as a “control” to Data Quality as an “enabler”, although the current industry approaches to Data Governance are not yet addressing the governance of the analytics themselves. Transparency of, and trust in, algorithms need a lot more work.
Both David Marco of EWSolutions and Kelle O’Neal of First San Francisco Partners addressed issues of stakeholder management in data quality and analytic scenarios. Simply put, where data is concerned, people are the problem! Questions of purpose, accountability and communication need to be addressed. Data professionals need to be much more aware of, and involved in, raising awareness, engagement and adoption at the human level. David suggested that we need to get better at “eating our own dog food” when it comes to being organised in our collation and dissemination of information, while Kelle noted that the Data Governance Operating Model and Organisational Structures are not the same thing!
My discussion with Daryl Crockett of Validus focussed on the application of the Scientific Method to data problems – do “Data Scientists” really do data, and/or science?! Unfortunately, we still see too much data analysis being (mistakenly) used to justify and reinforce an already-established position, rather than truly using the Scientific Method to test a hypothesis, conclude, and adjust thinking accordingly. The expectation of immediate Return-On-Investment is also an innovation killer, as it removes any incentive for experimentation and learning.
Ethics of data governance was another key topic throughout the week, both in a workshop led by Anne Marie Smith of EWSolutions and in the panel discussion in which I participated alongside Nicola Askham, the data governance coach, and Sue Geuens, President of DAMA International. The need for a data code of conduct was discussed, as well as key themes of ethical uses of data including: the continuum of whether data use is ‘Legal’ vs ‘Acceptable’ vs ‘Right’ vs ‘Reputational’ and issues of relationship and trust; explicit statement of the purpose(s) for which data is collected; data aggregation, anonymisation and re-construction of personal identity; bias in the analytical use of data (whether deliberate or unintended); right-to-know and balancing security with access; balance of ethical vs reputational impacts; implied (and often unrecognised) fiduciary responsibilities of data quality and governance professionals; the responsibilities of the data professional to provide leadership in a world where the population are often unaware of the implications and repercussions of digital life.
As you might expect in this Information Age, the whole event was documented live as it happened on Twitter via the hashtag #IDQS14. There were too many tweet highlights to mention, though Ronald Damhof made lively contributions throughout (both during the formal sessions and in the bar afterwards) and probably deserves a mention in despatches as “Twit of the Week”!
In summary, the two most consistent themes throughout the summit were the importance of context and the impact on/of people. The ultimate message is that data professionals of all kinds need to focus more on the latter, in order to increase the integrity of the former.
All in all an excellent event – informative, thought-provoking and fun.