After just a few years of excitement and enthusiasm for data lakes, it is quite clear that there are a few areas where many organizations are struggling to make progress. One of my recent exasperations was the tendency of vendors in this space (whatever this ‘space’ is) to reinvent what they knew and loved, and what clients hated and wanted to ignore, related to building a classic data warehouse. See my blog: When is a Data Lake an Enterprise Data Lake?
Bypassing the hype and the obvious relationship to the proverbial silver bullet, here are three areas where new next-practices seem to be emerging. I am not sure these are baked, or fully thought through, but we would love to get your feedback. In no particular order of priority….
Discovery versus Delivery
We have noted a pattern that seems reasonable enough to capture the challenges with data lakes and their attendant design and technology, while at the same time providing a means to compartmentalize those challenges so that they can be overcome. Maybe this is a useful way to think about them:
- Design one set of capabilities around broad-based, flexible discovery methods (mode 2, using Bimodal terminology)
- Design one set of capabilities around efficient and highly optimized delivery (mode 1, using Bimodal terminology)
The former equates to, but should not be perfectly correlated with, a data lake or Hadoop; the latter equates to, but should not be perfectly correlated with, a relational data warehouse. The actual technology used by either mode is a different topic.
What kind of conceptual models are you setting up? How are you getting along?
Information Stewardship: operational versus analytical
This is a relatively new development.
Only about five years ago did we collectively figure out the real need for the role of an information steward. It is best served and instantiated as a business role, as in a line-function role. Most application ‘power users’ are great candidates. They are business-centric chief problem solvers. They are not all the users of your applications; nor are they IT roles. They are not easy to set up or sell, and many firms are struggling, having first set these roles up in IT as an expedient first step.
But things are changing again. Many clients now report that their ‘BI centers of excellence’ are being supplanted by a ‘data science lab’ with wider powers and more tools, yet with the same governance headache, only worse. So the question is: what role related to information governance should persist in, or be supported by, the data science lab?
How are you getting along with this challenge? Do tell us.
(Big) Data catalogs – enterprise-wide or organic?
This is another classic from yesteryear, only with a big data spin.
We all tried to build enterprise-wide data catalogs and data models. Remember the fun and games? Less than three years ago I even sat in a vendor’s customer case study presentation where I heard the following consecutive responses from architects at a very large bank:
- “It took us about 6 years to design and develop our ideal, future-state data model for our business,” and
- “It is too early to tell if this has generated any business value for our company yet. We have not yet (figured out how to) use it.”
But it seems that big data catalogs are all the rage again. It’s as if a new crop of technology vendors has just rediscovered semantic discovery and classification tools. Yes, they are faster now. Yes, maybe they even learn a little bit.
What sort of uses for (big) data catalogs has your organization developed recently?