
A Data Lake without any information governance is a data cesspool

By Andrew White | September 09, 2015 | 3 Comments

Data Lake | Business Intelligence | Business Applications | Analytics | Advanced Analytics | Dark Data | Data and Analytics Strategies

I had the good fortune of speaking at two briefings last week – one in Dallas and the other in Houston.  The sun was out, the heat was on, and the conversations with clients were awesome.  I was presenting on sustaining and updating a business-relevant digital information strategy.

As part of the briefing, the analysts attending also participate in 1-1s with attendees.  So even though I spoke as part of the event, I also took part in a number of 1-1s in both locations.  All told I had about 20 1-1s, or individual conversations, with attendees.  I was amazed to note that in about 75% of the conversations, “data lakes” were mentioned.  Sometimes this was the main focus of the conversation; other times it was part of the wider need to manage, govern and/or exploit information as well as analytics.

What was even more interesting was the emerging uniformity in the questions about, and the understanding of, the data lake and its lack of information governance.  It seems the idea that a data lake is just a place to collect data – all kinds of data in any state – is becoming quite widespread.  For those firms that have played with a data lake, a couple of notable discoveries pop up quite quickly:

  • A data lake does not, on its own, bring with it any capability to support the broad topic of information governance; and with no information governance, re-using someone else’s work and exploiting it for further insight cannot take place.
  • Vendors in and around the data lake space are all now talking about “governance” or “information governance” – even if it means nothing more than a PowerPoint slide update.  In some cases it might go as far as talking about metadata management and even, if you are lucky, “data lineage”.

It seems we still have quite a bit of discovery to do:

  • What exactly is information governance?
  • What range of technology capability is needed to sustain information governance?
  • How does information governance change in a data lake/big data situation versus a traditional data warehouse, or even operational business application environment?

This will keep us all quite busy for a while, I would think.


3 Comments

  • One important aspect of governance is to have clarity over which user has access to which data. Using technology that provides the greatest level of granularity to match every business need is critical. This also feeds directly into better security – what we refer to as data-centric security at BlueTalon.

  • Andrew White says:

    Hi Isabelle,
    Thanks for dropping by and leaving a comment. I agree with you – user access is certainly part of an overall information governance framework.
    Thanks again,
    Andrew

  • Nice piece. Elsewhere on the web, Martin Fowler of Thoughtworks discusses the important distinction between the Data Lake and the Lake Shore, where lakes handle raw data and shores handle curated information.

    The Data Lake/Lake Shore approach is the evolution of the traditional data warehouse model: the Operational Data Store is the data lake (Hadoop), and the data warehouse is morphing into lake shores thanks to the enhanced capabilities of NoSQL databases such as graph and search. Lake shores will power systems of insight and machine learning capabilities because they represent contextual, governed data.