As this new year started I was reminded of a suggestion by my friend and colleague Erik to blog a diagram I was using that illustrates the size and shape of Logical Data Warehouse (LDW) projects. He’d remarked that many people would likely find it useful.
The LDW is a multi-server / multi-engine architecture. This has a number of implications. With the original single-server physical data warehouse we had a binary choice, an all or nothing decision. Either we implemented one or we didn’t.
However, the LDW can have multiple engines, the main ones being the data warehouse (DW), the operational data store (ODS), data marts and the data lake – and therefore we also have multiple potential entry points into the project.
Moreover, we can be simultaneously implementing different parts of the architecture, developing different components in parallel and in complementary ways – I mentioned this in a blog last year.
The diagram is shown below and provides an outline of LDW development. We see three parallel swim lanes, or development streams. They are color coded: green for traditional DW, amber for agile development and blue for the data lake. Grey denotes activities common to all three streams. The intermittent gold colored squares show the timings of the delivery “drops” of analytic results. Each time we drop an analytical deliverable we provide benefit.
The diagram above is not intended to be prescriptive, so “your mileage may vary”. However, it should provide a useful starting point to focus planning of these large complex projects. Here are some observations that often come up in discussions about building the LDW.
Using the High Level Plan to Communicate Expectations to Stakeholders
The diagram allows you to see, and explain, the overall size and shape of a project. The different styles of development proceed in parallel; they are not seen as mutually exclusive alternatives. The order and typical length of tasks can be seen and checked.
Allowing ourselves to plan for parallel activities allows us to systematically check that all the right activities are in each development stream / swim lane. Equally importantly, we can methodically consider what governance should be added (or relaxed) as development products are moved across the boundaries between streams. What did we leave out of the stream we are moving from that we should add back in the stream we are moving to? What are we carrying across from the stream we are coming from that we should abandon in the stream we are moving to?
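One way to make those boundary questions systematic is to keep a simple checklist per stream transition. Here is a minimal sketch in Python; the stream names and the example governance items are hypothetical illustrations, not part of any prescribed method:

```python
# Hypothetical governance checklist for moving a deliverable between
# LDW development streams. All stream names and checks are illustrative.
TRANSITION_CHECKS = {
    ("agile", "dw"): {
        "add": ["full ETL lineage", "data quality rules", "SLA monitoring"],
        "drop": ["throwaway prototype scripts", "manual data loads"],
    },
    ("lake", "dw"): {
        "add": ["conformed dimension mapping", "schema governance"],
        "drop": ["ad hoc file formats"],
    },
    ("dw", "lake"): {
        "add": ["access controls on copied data"],
        "drop": ["rigid change control for exploratory copies"],
    },
}

def checks_for(source: str, target: str) -> dict:
    """Return the governance items to add and to abandon for a move."""
    return TRANSITION_CHECKS.get((source, target), {"add": [], "drop": []})
```

A checklist like this can be reviewed at each gold "drop" so that nothing crosses a stream boundary with governance that is either missing or needlessly carried over.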
The DW stream typically takes longer to release its deliverables, but when it does there tend to be many of them. For example, it may take a while to obtain, cleanse and model our cost and revenue data. However, once this is done, many different cost, revenue and profitability reports can be generated from it. In the diagram the DW stream, for the sake of illustration, has three sub-phases, each of which will release a set of analytics and reports. These sub-phases might be defined by business subject area, or some other sensible phasing.
Balancing Agile and Classic Waterfall Development
We might be concerned that the classic data warehouse component of the LDW looks like a waterfall. This is not surprising since a good deal of work needs to be done up front, in preparation. However, we might vary this by having, say, four quarterly DW sub-projects rather than one year-long one. We worry less about trying to rush this – if we want quicker, more agile results, they can be delivered elsewhere, in the other swim lanes.
The agile and lake swim lanes (amber and blue) potentially provide their benefits early and often. For the right requirements we might use data virtualization, marts or the lake to quickly assemble data and do some analysis. These are usually special project-based efforts aiming to produce a specific new analytic – but to develop that analytic quickly. Thus we can meet immediate and urgent needs. However, we don’t need to worry about whether this tactical activity will result in a big mess. We can be comfortable that we will end up with a coherent strategic system; the necessary, more structured work is being taken care of elsewhere.
Aside from this useful separation of responsibility, we can plan for the different swim lanes to complement each other. Requirements can be prototyped in the agile or lake streams. If they need to be moved to regular production, with all the attendant ETL and data quality checks, they can be passed across to the DW stream.
Equally, we may make data from the DW available (by views, data virtualization or copying) in the agile or lake streams to enable experimentation there. Simply by planning for some element of consistency between the swim lanes, each can support the others, and overall we can save effort and speed results.
For example, we may want to do predictive analytics using IoT sensor data from machines on our manufacturing production line. Much of our analysis will be on the sensor data itself (in the data lake). But we’ll also potentially want to add identifying information about the machines themselves: what they are, when they were installed, who supplied them and where they are located. We should expect to be able to pick up most of that directly, quickly and easily from the register of company assets. This can include predefined metadata definitions for existing data objects, such as ASSET and SUPPLIER – it would be pointless and wasteful making these up from scratch every time. Equally, we need items like PRODUCTION_SCHEDULE, PRODUCTION_BATCH, PRODUCT and PART, so that we can understand the manufacturing process context: what products the machines (the subjects of the analysis) are manufacturing during the times they are being analyzed. These data items should be readily to hand in our DW.
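To make the shape of that enrichment concrete, here is a minimal Python sketch that joins lake-resident sensor readings to asset attributes held in a curated register. The record layouts and field names (machine_id, supplier, site and so on) are hypothetical illustrations, not a real schema:

```python
# Hypothetical IoT readings, as they might land in the data lake.
readings = [
    {"machine_id": "M-101", "ts": "2024-01-05T09:00", "vibration": 0.42},
    {"machine_id": "M-102", "ts": "2024-01-05T09:00", "vibration": 0.87},
]

# Hypothetical ASSET register entries, as curated in the DW.
assets = {
    "M-101": {"name": "CNC Lathe", "supplier": "Acme", "site": "Plant A"},
    "M-102": {"name": "Press", "supplier": "Bolt Co", "site": "Plant B"},
}

def enrich(readings, assets):
    """Join raw readings to DW-resident asset attributes by machine_id."""
    for r in readings:
        # Unknown machines still pass through, just without asset detail.
        yield {**r, **assets.get(r["machine_id"], {})}

enriched = list(enrich(readings, assets))
```

The point of the sketch is the division of labor: the high-volume readings stay in the lake, while the slowly changing, governed ASSET attributes come from the DW rather than being re-invented by each project.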
Integrated Collaborating Engines Not Competing Solutions
The modern logical data warehouse is an integration of multiple analytic engines. However, rather than seeing the LDW components – the physical enterprise data warehouse, the data lake, data marts and other stores – as competing solutions, it helps to see them instead as collaborating engines.
We should not be trying to choose a single engine for all our requirements. Instead, we should be sensibly distributing all our requirements across the various components we have to choose from. With this perspective it becomes easier to see where in the architecture each of our requirements should be met, and what is the most effective way to develop the analytics to meet them.
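The "right engine for the right requirement" idea can be sketched as a simple routing rule. The requirement attributes and engine names below are illustrative assumptions, a sketch of the decision rather than a formal model:

```python
def route_requirement(req: dict) -> str:
    """Pick an LDW engine for a requirement (illustrative heuristics only)."""
    if req.get("raw_unstructured"):      # e.g. sensor logs, clickstreams
        return "data lake"
    if req.get("exploratory"):           # one-off, urgent analysis
        return "agile mart / virtualization"
    if req.get("governed_history"):      # audited, conformed reporting
        return "enterprise DW"
    return "enterprise DW"               # conservative default

# Example: an urgent exploratory request lands outside the DW stream.
print(route_requirement({"exploratory": True}))
```

Even a crude rule like this makes the planning conversation easier: each requirement gets a natural home, and the swim lane that owns that home gets the work.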
LDW Planning, Architecture and RoI
By allowing ourselves to see the LDW system as an integration of collaborating engines, we find that many otherwise difficult design decisions become much simpler to resolve. We will find there is a natural home in the architecture for pretty much anything we might be asked to do.
The observations above illustrate how we would expect to implement our requirements more easily, at less cost, and do more of them. The more requirements we meet, the more benefit we accrue. Additionally, by choosing the right engine(s) for the right job we would expect to run our processing more efficiently, that is, at less cost!
It follows that there is a direct connection between our plan (based on our architecture), increased revenue, lowered cost and, in turn, a better return on investment.
Put simply, having an effective LDW should mean never having to say we’re sorry – for unmet requirements, or for missed service level agreements.