I visited a client the other day and they wanted to talk about data lakes. Someone at the client, not at the meeting, had been promoting the concept of a data lake as an answer to the question we explored. Before I tell you what happened, let me update you on my “opening” position.
A few weeks ago my colleague Nick Heudecker and I published a note on data lakes (see “The Data Lake Fallacy: All Water and No Substance”). The note called out what appeared to be missing from the vendor hype around data lakes: any sustaining practice (or technology) that helps the value gained from re-use of the data in the lake to persist. There IS value in mining information in a lake. But to assume that the IP and structure used to expose that insight and value persists in the data lake is wrong. A data lake does not persist that. In the jargon, “no information governance, no sustainable or repeatable value”. It seems to be good advice.
Not everyone agrees. Another colleague of mine brought this InfoWorld “review” by a “strategic developer” to my attention – see “Gartner gets the ‘data lake’ concept all wrong”. It seems we said that data lakes are not useful, and that somehow a large-scale, enterprise-wide, wall-to-wall governance effort is required. Apparently we were also touting proprietary technology. Since we don’t support either perspective (devoid of context, and a data lake is not sufficient in either case) I don’t even feel the need to respond. If there had been a response to the main fallacy we call out, I would have. Truth is, if you don’t maintain any structure in the data you use, how on earth can someone who follows you get a leg up, and avoid repeating your effort? Either way the hype around data lakes continues apace.
So let’s go back to the meeting this week with the client.
This client has several established data warehouses, each with some successful, if local, information governance supporting analytics. The client has 17 or so data centers, each supporting one of these data warehouses. The business uses these 17 systems a lot and gets value from the data; they rely on what they get from them.
There was one question: can we use a data lake? However, we had to drill down to the REAL questions behind what was being asked. There were two real questions/desires:
- Can we reduce IT costs by reducing the number of data centers, and
- Can we increase synergy by supporting shared governance across the silos, as if we had a single, unified layer?
In truth this client wants to consolidate data centers, and quite separately adopt a focused information governance program to sustain common data spanning and connecting the local insights for additional value. As far as I can tell, a data lake plays no role in either question. Yet it was being pushed by a vendor to one of the end users at this client.
The end-user even spotted the fallacy themselves. They asked, “If we used a data lake, don’t we actually take steps backwards, in that we ‘lose’ all those currently siloed yet effective IP and governance frameworks?”
YES! A data lake by definition has a zero barrier to entry and so supports zero information governance. Any and all data is accepted because it has no need to conform or relate to the rest of the data that exists in the lake already. If there IS a cost to enter, it is not a data lake. In contrast, a data warehouse or EDW has a higher barrier to entry. So why not go for a balance? In this case the user was right. A data lake would be a step backward.
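To make the barrier-to-entry point concrete, here is a minimal Python sketch. Everything in it is made up for illustration: the class names (`DataLake`, `GovernedStore`), the two-field schema, and the sample records are my assumptions, not a description of the client's systems or of any vendor product.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class DataLake:
    """Hypothetical store with zero barrier to entry: every record is accepted."""
    records: list[dict[str, Any]] = field(default_factory=list)

    def ingest(self, record: dict[str, Any]) -> bool:
        # No check against a schema or against the data already in the lake.
        self.records.append(record)
        return True


@dataclass
class GovernedStore:
    """Hypothetical store with a barrier to entry: records must match a schema."""
    schema: dict[str, type]
    records: list[dict[str, Any]] = field(default_factory=list)

    def ingest(self, record: dict[str, Any]) -> bool:
        # A record conforms only if it has exactly the agreed fields
        # and each value has the agreed type.
        conforms = record.keys() == self.schema.keys() and all(
            isinstance(record[name], expected)
            for name, expected in self.schema.items()
        )
        if conforms:
            self.records.append(record)
        return conforms


lake = DataLake()
store = GovernedStore(schema={"customer_id": int, "region": str})

conforming = {"customer_id": 42, "region": "EMEA"}
nonconforming = {"cust": "forty-two", "notes": None}

lake_results = [lake.ingest(conforming), lake.ingest(nonconforming)]
store_results = [store.ingest(conforming), store.ingest(nonconforming)]
```

The lake takes both records; the governed store rejects the second. That rejection is the "cost to enter", and it is also the only place where the structure someone worked out earlier gets carried forward to the next person.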
So why was a data lake being referenced? Perhaps this vendor is selling a form of data warehouse but wants to use the new silvery bullet-like name. My final recommendation to the client: forget the new names. Identify the real requirement (data center consolidation, and multi-warehouse information governance) and design the target architecture. If you really want a name for it, let’s chat again. But don’t use “data lake” since it does not seem to fit.