It seems that introducing a data hub strategy a couple of years ago was both good and bad. For some clients we put a name to something that makes total sense and something they had been working on for years. For other clients we have just opened up Pandora’s Box and introduced more complexity. For another set of clients we have just birthed a new toy that might indeed solve all their problems – whatever they are. For some vendors, we even gave a breath of fresh air to some pretty tired old marketing messages. Welcome to the life of an analyst.
In truth, a data hub strategy was not some big idea, nor was it something that came out of the blue. It was, I think, more a natural evolution of a number of previously discrete, even disparate, trends or observations:
- Master Data Management (MDM) programs tended to be too big, too slow, too overweight and too expensive, and so were not successful.
- More and more data sources were being connected to for a wide range of uses, spanning operational and analytical.
- Data lakes were exposing the mess that results when you collect a lot of data and drop it into a wide open space (the lake).
- Data Governance itself was going through its own renaissance, and the emerging best practices, when looked at together, seemed to be providing a real, workable pathway forward for implementation. See Effective Data and Analytics Governance – Finally! and the associated webinar.
- Data integration techniques were shifting from assuming a need to collect all the data for a range of purposes to connecting data. This was leading to new ideas for how to do that.
- Too much focus was shifting toward APIs or microservices as if this was going to solve the problem.
- So many organizations remained half drowned in a spaghetti of integration wires, applications and data stores.
- And more!
So Ted Friedman and I looked at the tea leaves and just connected the dots. We did not invent anything new. We just said, “Here – there is a pattern here. Let’s see what happens if we connect these dots. Does it hang together? Could a client, starting from the mess of spaghetti, use this idea to gently unravel the mess and start to improve the semantic consistency, quality, security, privacy and ethics of the least amount of data to drive the biggest business outcome?”
“Yes” was the answer.
So what is a hub exactly?
We have published a few notes already and we have more to come:
- Examples of data hubs: Data Hubs: Understanding the Types, Characteristics and Use Cases
- Comparing data hubs to data warehouses/lakes: Data Hubs, Data Lakes and Data Warehouses: Choosing the Core of Your Digital Platform
- One form of data hub for application integration: Innovation Insight: The Digital Integration Hub Turbocharges Your API Strategy
- More on the third element of the data hub strategy – integration and implementation: Infuse Your Data Hub Strategy With Data and Application Integration
- The first two notes we published on the data hub strategy concept:
But during an inquiry with a client today it suddenly hit me – a new metaphor. And that metaphor is Piccadilly Circus.
On any underground rail map (or any transportation map, really), there are a number of stations and lines. The lines are the means by which passengers move between the stations. Stations are places where passengers enter or leave the system.
- Stations are applications – any place where data is created or processed.
- Lines between stations are all the forms of integration, such as ETL processes, flat files and FTP; or they can be service-based. Lines are what move data between applications.
King’s Cross St. Pancras is a MAJOR data hub. It acts as a clearing house for a lot of data moving between a large number of applications. Piccadilly Circus is also a hub, a little smaller than King’s Cross but better known by name.
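To make the map concrete, here is a minimal sketch of the metaphor in Python. The application and integration names are invented purely for illustration; they do not come from any real landscape. It simply counts how many lines serve each station, which surfaces the interchange stations as candidates worth a closer look.

```python
from collections import defaultdict

# Hypothetical integration map (all names are illustrative assumptions).
# Each "line" is one integration route; its value lists the
# applications ("stations") that the route touches.
LINES = {
    "etl_nightly": ["crm", "billing", "warehouse_loader"],
    "file_transfer": ["billing", "erp", "payroll"],
    "customer_api": ["crm", "web_shop", "erp"],
}


def lines_per_station(lines):
    """Return a mapping of station -> set of lines that serve it."""
    served = defaultdict(set)
    for line, stations in lines.items():
        for station in stations:
            served[station].add(line)
    return dict(served)


def interchange_stations(lines, min_lines=2):
    """Return, sorted by name, the stations served by at least min_lines lines."""
    served = lines_per_station(lines)
    return sorted(s for s, ls in served.items() if len(ls) >= min_lines)


print(interchange_stations(LINES))  # ['billing', 'crm', 'erp']
```

A count like this only finds where lines cross. Which of those crossings should actually be nominated as a data hub is a governance decision, not something the map can decide by itself.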
If you wanted a view of the entire system, or business, you would need to collect (or connect to) all the stations (and maybe lines) and sum what you see. You might put what you see into yet another thing – let’s call it a data warehouse or data lake. It’s just a place where you collect all the data (or as much of it as you can get): transaction data, content, documents, relationship data, reference data, master data, application data, the lot.
So what is a data hub?????
Most stations are single-line stations. They collect data, process it and pass it on. Some stations have several lines entering and leaving. Fewer still provide interchange between lines. It is these fewest of places that provide the greatest value to the overall system. But here the analogy breaks down: data hubs are NOT the largest set of data. They should be relatively small compared to the set of data in the system, and they may not store data at all but just policies. Let’s look at this in detail.
Data hubs, in the data hub strategy concept, are just a small number of nominated places on the whole transportation map that are identified to play key roles:
- Store, host or process a set of business rules that help execute the goals related to data and analytics governance (keeping in mind this spans data security, data privacy, data quality, data ethics and so on)
- Store, host or process another set of rules that describe who or what else needs access to the data (but is not involved in the governance side of the conversation)
With those two roles defined, I can then make smarter, more rational decisions about how to simplify persistence of that least amount of data that is being governed and shared. In other words, across the whole railway landscape we determine where Piccadilly Circus and a few other major nodes are, and make them the focus.
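The two roles can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reference design: the `DataHub` class, its method names, the example rules and the consumer names are all hypothetical. The point it tries to show is that the hub here holds only policies (governance rules and sharing rules) and no data of its own.

```python
from dataclasses import dataclass, field
from typing import Callable

Record = dict  # a record is just a plain dict in this sketch


@dataclass
class DataHub:
    """Hypothetical hub that stores rules only; records pass through it."""
    name: str
    # Role 1: governance rules, rule name -> check that returns True when the record passes.
    governance_rules: dict[str, Callable[[Record], bool]] = field(default_factory=dict)
    # Role 2: sharing rules, consumer name -> fields that consumer may see.
    sharing_rules: dict[str, set[str]] = field(default_factory=dict)

    def violations(self, record: Record) -> list[str]:
        """Return the names of the governance rules the record fails."""
        return [n for n, rule in self.governance_rules.items() if not rule(record)]

    def view_for(self, consumer: str, record: Record) -> Record:
        """Return only the fields the consumer is allowed to access."""
        allowed = self.sharing_rules.get(consumer, set())
        return {k: v for k, v in record.items() if k in allowed}


hub = DataHub(
    name="customer_hub",
    governance_rules={
        "has_customer_id": lambda r: bool(r.get("customer_id")),
        "country_is_two_letters": lambda r: len(r.get("country", "")) == 2,
    },
    sharing_rules={
        "billing": {"customer_id", "country"},
        "marketing": {"customer_id"},
    },
)

record = {"customer_id": "C-1", "country": "GBR", "email": "a@example.com"}
print(hub.violations(record))             # ['country_is_two_letters']
print(hub.view_for("marketing", record))  # {'customer_id': 'C-1'}
```

Whether the governed records are then persisted in the hub, copied elsewhere or only virtualized is a separate decision that this sketch deliberately leaves open.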
Like Piccadilly Circus and King’s Cross, a data hub can also be a normal station. A data hub can exist physically and stand-alone, or it can be virtualized within another thing (such as an application, ODS, EDW or data lake). But don’t confuse them – a data hub within the confines of an application, ODS, EDW or data lake does not convert that application, ODS, EDW or data lake into a data hub. The role each of these things plays is quite different.
Data could be created in the data hub. Data can be copied or virtualized from or to it. These are all possible, but not that important. What is important is that there are only a few hubs in the whole network, and that the hubs are not the same as the total sum or collection of all the data (such as a data warehouse or data lake). Thus a hub is nothing like a data warehouse or lake. It would be like comparing apples to doctors.
So there you have it. Is that as clear as mud now?