Gartner Blog Network


Logical Data Warehouse Project Planning

by Henry Cook  |  January 18, 2018  |  2 Comments

As this new year started I was reminded of a suggestion by my friend and colleague Erik to blog a diagram I was using that illustrates the size and shape of Logical Data Warehouse (LDW) projects. He’d remarked that many people would likely find it useful.

The LDW is a multi-server / multi-engine architecture. This has a number of implications. With the original single-server physical data warehouse we had a binary choice, an all or nothing decision. Either we implemented one or we didn’t.

However, the LDW can have multiple engines, the main ones being the data warehouse (DW), the operational data store (ODS), data marts and the data lake – and therefore we also have multiple potential entry points into the project.

Moreover, we can be simultaneously implementing different parts of the architecture, developing different components in parallel and in complementary ways – I mentioned this in a blog last year.

The diagram is shown below and provides an outline of LDW development. We see three parallel swim lanes or development streams. They are color coded: green for traditional DW, amber for agile development and blue for the data lake. The additional grey coded boxes denote activities common to all three streams. The intermittent gold colored squares show the timings of the delivery “drops” of analytic results. Each time we do a drop of an analytical deliverable we provide benefit.

[Diagram: Example Logical Data Warehouse Plan (LDW Project Plan)]

The diagram above is not intended to be prescriptive, so “your mileage may vary”. However, it should provide a useful starting point to focus planning of these large complex projects. Here are some observations that often come up in discussions about building the LDW.

Using the High Level Plan to Communicate Expectations to Stakeholders

The diagram allows you to see, and explain, the overall size and shape of a project. The different styles of development proceed in parallel. They are not seen as mutually exclusive alternatives. The order and typical length of tasks can be seen and checked.

Planning for parallel activities lets us systematically check that all the right activities are in each development stream / swim lane. Equally important, we can methodically consider what governance should be added (or relaxed) as development products are moved across the boundaries between streams. What did we leave out in the stream we are moving from that we should add back in the stream we are moving to? What are we carrying across from the stream we are coming from that we should abandon in the stream we are moving to?
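The two boundary questions above can be sketched as simple set comparisons. This is only an illustration: the activity names and the `STREAM_ACTIVITIES` checklist are hypothetical, not taken from the diagram, and a real plan would hold far more detail per stream.

```python
# Hypothetical activity checklists per swim lane (illustrative names only).
STREAM_ACTIVITIES = {
    "dw": {"data modeling", "etl", "data quality checks", "regression testing"},
    "agile": {"prototyping", "data virtualization", "ad hoc sampling"},
    "lake": {"raw ingestion", "prototyping", "schema-on-read exploration"},
}


def boundary_review(source, target):
    """Compare two streams when a deliverable moves from source to target."""
    src = STREAM_ACTIVITIES[source]
    tgt = STREAM_ACTIVITIES[target]
    return {
        # Work the source stream skipped that the target stream expects.
        "add": sorted(tgt - src),
        # Work done in the source stream that the target stream does not need.
        "abandon": sorted(src - tgt),
        # Work common to both streams, which carries across unchanged.
        "carry": sorted(src & tgt),
    }


review = boundary_review("agile", "dw")
print(review["add"])
```

Moving a prototype from the agile stream to the DW stream would, under these assumed checklists, flag the modeling, ETL, data quality and testing work as governance to add back in.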

The DW stream typically takes longer to release its deliverables, but when it does there tend to be many of them. For example, it may take a while to obtain, cleanse and model our cost and revenue data. However, once this is done, many different cost, revenue and profitability reports can be generated from that data. In the diagram the DW stream, for the sake of illustration, has three sub-phases, each of which will release a set of analytics and reports. These sub-phases might be defined by business subject area, or some other sensible phasing.

Balancing Agile and Classic Waterfall Development

We might be concerned that the classic data warehouse component of the LDW looks like a waterfall. This is not surprising since a good deal of work needs to be done up front, in preparation. However, we might vary this by having, say, four quarterly DW sub-projects rather than one year-long project. We worry less about trying to rush this – if we want quicker, more agile results they can be delivered elsewhere, in the other swim lanes.

The agile and lake swim lanes (amber and blue) potentially provide their benefits early and often. For the right requirements we might use data virtualization, marts or the lake to quickly assemble data and do some analysis. These are usually special project-based efforts aiming to produce a specific new analytic – but to develop that analytic quickly. Thus we can meet immediate and urgent needs. However, we don’t need to worry about whether this tactical activity will result in a big mess. We can be comfortable that we will end up with a coherent strategic system; the necessary and more structured work is being taken care of elsewhere.

Aside from this useful separation of responsibility we can plan for the different swim lanes to complement each other. Requirements can be prototyped in the agile or lake streams. If they need to be moved to regular production, with all the attendant ETL and data quality checks, they can be passed across to the DW stream.

Equally, we may make copies of data from the DW available (by views, data virtualization or copying) into the agile or lake streams to enable experimentation there. Simply by planning for some element of consistency between the swim lanes each can support the others, and overall we can save effort, and speed results.

For example, we may want to do predictive analytics using IoT sensor data from machines on our manufacturing production line. Much of our analysis will be on the sensor data itself (in the data lake). But we’ll also potentially want to add identifying information about the machines themselves: what they are, when they were installed, who supplied them and where they are located. We should expect to be able to pick up most of that directly, quickly and easily from the register of company assets. This can include predefined metadata definitions for existing data objects, such as ASSET and SUPPLIER – it would be pointless and wasteful to make these up from scratch every time. Equally we need to know things like PRODUCTION_SCHEDULE, PRODUCTION_BATCH, PRODUCT and PART. This is so we can understand the manufacturing process context, such as what products the machines (the subjects of the analysis) are manufacturing during the times they are being analyzed. These data items should be ready to hand in our DW.
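As a rough sketch of that enrichment, the snippet below adds asset, supplier and production context to a single sensor reading. Only the entity names ASSET, SUPPLIER and PRODUCTION_SCHEDULE come from the text; every field name, identifier and value is a made-up assumption, and in practice the reference data would be read from the DW rather than held in memory.

```python
from datetime import datetime

# Hypothetical reference data as it might be read from the DW.
ASSET = {
    "M-01": {"asset_type": "press", "supplier_id": "S-9", "location": "Line 2"},
}
SUPPLIER = {"S-9": {"name": "Acme Tooling"}}
PRODUCTION_SCHEDULE = [
    {
        "machine_id": "M-01",
        "start": datetime(2018, 1, 18, 8, 0),
        "end": datetime(2018, 1, 18, 12, 0),
        "product": "P-100",
        "batch": "B-7",
    },
]


def enrich(reading):
    """Attach DW context to one sensor reading taken from the lake."""
    machine_id = reading["machine_id"]
    asset = ASSET.get(machine_id, {})
    supplier = SUPPLIER.get(asset.get("supplier_id"), {})
    # Find the production run (if any) that was active when the reading was taken.
    run = next(
        (
            r
            for r in PRODUCTION_SCHEDULE
            if r["machine_id"] == machine_id and r["start"] <= reading["ts"] < r["end"]
        ),
        None,
    )
    return {
        **reading,
        "asset_type": asset.get("asset_type"),
        "location": asset.get("location"),
        "supplier": supplier.get("name"),
        "product": run["product"] if run else None,
        "batch": run["batch"] if run else None,
    }


reading = {"machine_id": "M-01", "ts": datetime(2018, 1, 18, 9, 30), "vibration": 0.42}
print(enrich(reading))
```

Readings that fall outside any scheduled run simply come back with no product or batch, which is itself useful context for the analysis.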

Integrated Collaborating Engines Not Competing Solutions

The modern logical data warehouse is an integration of multiple analytic engines. Rather than seeing the LDW components (the physical enterprise data warehouse, the data lake, data marts and other stores) as competing solutions, it helps to see them as collaborating engines.

We should not be trying to choose a single engine for all our requirements. Instead, we should be sensibly distributing all our requirements across the various components we have to choose from. With this perspective it becomes easier to see where in the architecture each of our requirements should be met, and what is the most effective way to develop the analytics to meet them.
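One way to picture this distribution is a simple rule-based assignment of requirements to engines. The rules, flag names and sample requirements below are purely illustrative assumptions; in a real project the criteria would be agreed with the architects and the business.

```python
from collections import defaultdict


def choose_engine(req):
    """Pick an LDW component for one requirement (illustrative rules only)."""
    if req.get("raw_or_unstructured"):
        return "data lake"
    if req.get("operational_near_real_time"):
        return "operational data store"
    if req.get("quick_tactical"):
        return "data mart"
    # Default: governed, repeatable production reporting.
    return "data warehouse"


def distribute(requirements):
    """Group requirement names by the engine chosen for each."""
    plan = defaultdict(list)
    for req in requirements:
        plan[choose_engine(req)].append(req["name"])
    return dict(plan)


requirements = [
    {"name": "sensor anomaly detection", "raw_or_unstructured": True},
    {"name": "profitability reports"},
    {"name": "campaign what-if analysis", "quick_tactical": True},
]
print(distribute(requirements))
```

The point of the sketch is the shape of the decision, not the rules themselves: every requirement lands somewhere, and no single engine is asked to do everything.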

LDW Planning, Architecture and RoI

By allowing ourselves to see the LDW system as an integration of collaborating engines we find that many otherwise difficult design decisions become much simpler to satisfy. We will find there is a natural home in the architecture for pretty much anything we might be asked to do.

The observations above illustrate how we would expect to implement our requirements more easily, at less cost, and to meet more of them. The more requirements we meet, the more benefit we accrue. Additionally, by choosing the right engine(s) for the right job we would expect to run our processing more efficiently, that is, at less cost!

Therefore it follows that there is a direct connection between our plan (based on our architecture), increased revenue, lowered cost and, in turn, a better return on investment.

Put simply, having an effective LDW should mean never having to say we’re sorry – for unmet requirements, or for missed service level agreements.

I’ll be at our Gartner Catalyst conferences in San Diego and London later in the year and would be delighted to discuss this.

 


Henry Cook
Research Director
.2 years at Gartner
40 years IT Industry, 28 in data warehousing

Henry Cook is a Research Director in Gartner for Technical Professionals with particular specialization in data warehousing and related database, big data, in-memory database and analytics topics.


Thoughts on Logical Data Warehouse Project Planning


  1. Dan Graham says:

    Great blog Henry. We need more details and guidance like this, especially in the public domain.

    Regarding the EDW being a waterfall, there are some things, like logical and physical data modeling, that should not be done with agile techniques. But there are many, like BI reports, that should be agile. Agile methods become vital once the foundation is in place.

    Your work at Gartner this year has been outstanding. ‘Please, sir, I want some more.’

    • Henry Cook says:

      Dan, thank you, you old flatterer, you are too kind. I’m hoping to do some research around this topic (agile DW) at some point. It’s interesting that it’s often forgotten that one of the principal aims of the original DW was to enable agility of analysis. Sure, work had to be done up front to model and load the data, but once done you could do analysis across the business, and through time, in a way no other system would let you – as fast as you could craft a query. This remains valid today, but there are now other types of agility we need to enable. The logical integration of multiple platforms in the LDW and the automatic discovery of data and metadata upon loading would be examples. As ever, it’s a great market to be in as there is always something new going on – new technology, data, analysis and business models.




Comments or opinions expressed on this blog are those of the individual contributors only, and do not necessarily represent the views of Gartner, Inc. or its management. Readers may copy and redistribute blog postings on other blogs, or otherwise for private, non-commercial or journalistic purposes, with attribution to Gartner. This content may not be used for any other purposes in any other formats or media. The content on this blog is provided on an "as-is" basis. Gartner shall not be liable for any damages whatsoever arising out of the content or use of this blog.