I was party to a large, widespread email exchange among Gartner analysts last week. The topic was data lakes. To be more specific, the conversation explored what a data lake is, and in what form its value will accrue to the business. It was one of those threads that generated a significant number of replies from a range of quarters.
In a nutshell, there does seem to be confusion in the marketplace. And vendors, some vendors, are (predictably) taking advantage of that confusion. The confusion matters because the hype around data lakes has caught on really fast. It is certainly looking like a brand new silver bullet.
I won’t regurgitate the whole content of the conversation, though much of it relied on analogies such as marsh, toilet water, swamp, rivers, pools and so on. Put simply, a data lake is an attempt to physically bring together a number (or all?) of the available data stores in such a way that they are easily accessible to users (data scientists). There is no attempt to manage the data in the lake; there is no need to qualify the data, its format, its quality, its consistency, or anything that would allow one piece of data to relate to another. Thus, by definition, the data in the lake is ungoverned. It might be that some of the sources of data were governed before they were added to the lake. But once in there, there is no attempt to maintain that data, or reconcile it to the other views of the same (or different) data in the data lake.
That is the part that is winding me up: a data lake does not include any intrinsic, persistent, data lake-wide information governance.
- A data lake is meant to be a set of data sources. It might be the set of all available data sources, and it is meant to focus on the physical assembly of data. It may include structured transaction data or content (with its attendant metadata).
- By definition, it is an argument against a compromised single, homogeneous data model that would otherwise be needed to build an EDW. That explains a key difference between an EDW and a data lake.
- A data lake has no predefined model to which all the data is aligned, or matched to as part of a qualification to enter the lake. All data is let in, whatever its state.
- The data lake is wide open, wild west, such that any end user could use cool tools** to find stuff. In that finding, the resulting perspective or view of the truth is fleeting. It does not persist; another user could override what had gone before.
- There is no universal lens through which the next user can perceive the same version of the truth. That is a key difference between a data lake and a Logical Data Warehouse.
** It is the vendors and their “cool tools” that are adding to the hype related to data lakes. These tools do not, as a rule, substitute for information governance; they bypass that effort, but they may still offer value to the end user. However, that value may also be fleeting since the structure developed to support the perspective/analytic/view is not enforced data lake-wide, nor can it be. If it were, it would then be part of a formal information governance effort, would need to reconcile to other views, and thus it would no longer be a data lake. It might even then become another EDW – even though you only ever wanted one of those!
Thus a data lake does not support a single version or view of the truth. It supports “any number of views of any number of truths”. So it might be holistic in terms of data collection; but in terms of information and insight, it is fleeting and is more of a swamp. There is value in a data lake, but it is not the same value you can get from an agreed, shared view of the truth.
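To make the "any number of views of any number of truths" point concrete, here is a minimal Python sketch. Everything in it is hypothetical and exists only for illustration: the `DataLake` and `GovernedStore` classes, the field names, and the two analyst "lenses" are assumptions, not a description of any real product. The lake accepts every record as-is, so two users asking the same question of the same data can get different answers, whereas a store with an agreed model rejects records that do not conform.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

Record = dict[str, Any]


@dataclass
class DataLake:
    """Hypothetical lake: every record is let in, whatever its state."""
    records: list[Record] = field(default_factory=list)

    def ingest(self, record: Record) -> None:
        # No model, no qualification step: the record is simply stored.
        self.records.append(record)

    def view(self, lens: Callable[[Record], Any]) -> list[Any]:
        # Each user brings their own lens; the result is not persisted,
        # so the next user does not inherit this interpretation.
        return [lens(r) for r in self.records]


@dataclass
class GovernedStore:
    """Hypothetical governed store: records must match an agreed model."""
    required_fields: frozenset[str]
    records: list[Record] = field(default_factory=list)

    def ingest(self, record: Record) -> None:
        missing = self.required_fields - set(record)
        if missing:
            raise ValueError(f"record rejected, missing fields: {sorted(missing)}")
        self.records.append(record)


# Two source systems describing revenue with different, unreconciled field names.
SOURCES: list[Record] = [
    {"customer": "ACME", "revenue": 100},
    {"cust_name": "Acme Corp", "rev_usd": 100},
]


def analyst_a_lens(record: Record) -> int:
    # Analyst A only knows about the "revenue" field.
    return record.get("revenue", 0)


def analyst_b_lens(record: Record) -> int:
    # Analyst B also treats "rev_usd" as revenue.
    return record.get("revenue", record.get("rev_usd", 0))


lake = DataLake()
for rec in SOURCES:
    lake.ingest(rec)  # both records are accepted

total_a = sum(lake.view(analyst_a_lens))
total_b = sum(lake.view(analyst_b_lens))
# total_a and total_b differ: same lake, same question, two "truths".
```

Feeding the same two records to a `GovernedStore` built with `frozenset({"customer", "revenue"})` would accept the first and raise `ValueError` on the second. That rejection is the reconciliation work the lake skips: someone has to decide that `rev_usd` means `revenue` before the data gets in.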
My colleague Nick Heudecker and I have a research note on this that explains the challenges in this area in more detail, but it would first help to have a common understanding of what a data lake is, and isn’t. So watch this space – and please, avoid the hype! It is rampant.