I remarked in a previous blog that there are certain themes that regularly pop up in my client conversations, because they are useful tools in thinking about analytics and the Logical Data Warehouse (LDW) in particular. Here’s another favorite:
In many situations when trying to understand how to move forward, a Return on Investment (RoI) point of view makes things clearer.
I appreciate that this may seem to be just motherhood and apple pie – but stay with me and I think you’ll see what I mean. Technical architecture, and the way it supports both development and run-time efficiency, has a huge effect on costs and benefits, and thus on the financial return obtained.
In any analytic system there is always a list of potential business requirements. These may be formal and explicit and neatly compiled in a list, or they may be implicit, existing in the heads of business users. Either way we obtain benefit when we fulfill any of these requirements. If we have an explicit list then it just makes it easier to track our progress.
Each requirement can be thought of as always having two estimates associated with it:
1) The cost and/or effort
2) The benefit to be obtained.
Regardless of whether either is actually measured, a requirement will always have a certain cost to implement and will deliver a certain amount of benefit. We can choose to ignore this, but it will always be true. The more explicitly we recognize these measures, the firmer the ground we will be standing on when justifying our systems.
I go into this in more detail in one of my pieces of research, see Solution Path for Planning and Implementing the Logical Data Warehouse.
This does not need to be complicated since it is self evident that:
- If we can do twice the number of requirements then, on average, we’ll expect to get twice the benefits.
- If we can meet requirements with half the effort, we can do twice as many and get twice the benefit.
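To make the arithmetic concrete, here is a minimal sketch in Python. It assumes a fixed budget and that every requirement has the same average cost and average benefit; the function names and all the figures are hypothetical, chosen only for illustration.

```python
def requirements_delivered(budget: float, cost_per_requirement: float) -> int:
    """Number of whole requirements a fixed budget can fund."""
    return int(budget // cost_per_requirement)


def total_benefit(budget: float, cost_per_requirement: float,
                  benefit_per_requirement: float) -> float:
    """Total benefit, assuming every requirement yields the same average benefit."""
    return requirements_delivered(budget, cost_per_requirement) * benefit_per_requirement


# Hypothetical figures: a budget of 100 units, 10 units of effort per
# requirement, and an average benefit of 25 units per requirement.
baseline = total_benefit(100, 10, 25)     # funds 10 requirements
half_effort = total_benefit(100, 5, 25)   # funds 20 requirements

print(baseline, half_effort)
```

With the budget held constant, halving the effort per requirement doubles the number of requirements funded and, on average, the benefit obtained.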
As an aside, there are some nuances to this. If you are creating new reports on existing data, this can be done very fast – as fast as you can create the queries. You don’t have to re-analyze all the data sources from scratch; that has already been done and they are now being reused. To be sure, if you have many new reports, new users and a greater frequency of reporting, this will put additional demand on the underlying platform. However, if you have used platforms that scale up, it is simple to add capacity. The run-time costs go up linearly, and if you are getting additional benefits they do too. Thus you are in the happy circumstance of having variable costs – you only need to add capacity and cost when you can justify it by the incremental benefits to be obtained.
If you do need to add a data source then this only needs to be done once. If you have a new requirement that needs ten sources, one of which is new, you only need to put in the effort to introduce the one new source. Thereafter it is available to be combined with other sources in any way you choose. The more sources you add, the lower the proportion of effort needed to add a new source compared to the total effort. Put another way, as time progresses the effort you need to put into adding sources steadily reduces. In the early days each new requirement probably needs new sources, but the probability of this diminishes over time and you find that eventually most of the sources are already loaded. Even for unstructured or semi-structured data where we use “schema on read”, we just create a new schema rather than having to physically re-obtain the data from scratch.
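The diminishing effort of adding sources can be sketched in a few lines of Python. This is an illustration under simple assumptions: every source costs the same fixed effort to onboard, and the source names and the list of requirements are invented.

```python
def new_source_effort(requirements, effort_per_source=1.0):
    """Effort spent onboarding sources for each requirement, in order.

    A source is only paid for the first time a requirement needs it;
    after that it is already loaded and can be reused for free.
    """
    loaded = set()
    efforts = []
    for needed in requirements:
        new_sources = set(needed) - loaded
        efforts.append(len(new_sources) * effort_per_source)
        loaded |= new_sources
    return efforts


# Hypothetical requirements, each listing the sources it needs.
requirements = [
    {"crm", "billing", "web_logs"},
    {"crm", "billing", "inventory"},
    {"crm", "web_logs", "inventory"},
    {"billing", "inventory", "suppliers"},
    {"crm", "billing", "suppliers"},
]

efforts = new_source_effort(requirements)
print(efforts)  # [3.0, 1.0, 0.0, 1.0, 0.0]
```

The first requirement pays for three sources; later ones pay for one or none, because most of what they need is already loaded.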
Our organizations have a fixed amount of resource available (computer equipment, people, money) and we want to ensure we get the maximum return. This is where technical architects can really help. A well-architected system can handle many more analytical requests for a given amount of money. This was stated succinctly in the following quote, variants of which are usually attributed to Henry Ford.
“An engineer can do for a nickel what any fool can do for a dollar”
The message here is not to always go for the cheapest solution, but rather that good engineering will make the best use of the resources and thus obtain the most benefit for the least cost.
A common question is whether to put data into the data warehouse or into the data lake. Which is right? Typically we could put our data in either. Which platform we choose will become clear by considering:
- Cost of storage
- Cost of processing
- Amount of analysis that can be performed – the number of analyses / queries per day, week or month. That is, the number of new analyses we can attack, or alternatively the number of iterations we can go through in refining an analysis. This has a direct effect on the number of requirements that can be met and therefore the amount of benefit that is derived. Simple measures here are queries per hour and $/query.
- The variety of analyses that can be performed: SQL queries, Data Science, Map-Reduce etc. The wider the variety, the greater the range of analyses that can be tackled, and thus the overall number of requirements that can be met. Alternatively, requirements can be tackled more cost-effectively by using the right tool for the right job.
- Speed of development. Similar to the previous point, the kind of tools available, and their ease of use, determine whether we can do more or fewer analyses. How many analyses could we develop in a given time period? This might also be a limit on benefit, just like the number of analyses we could execute at run time.
- Ability to meet Service Level Agreements (SLAs) – and thus the ability to avoid (quantifiable) penalties
- Ability to meet regulatory / compliance requirements, and thus avoid fines
- Concurrency, the ability to support large populations of users. This is usually one of the SLAs. It also determines how much benefit the end users can derive – especially if they regularly give up because of lack of systems performance.
When assessing the return on investment it’s also important to choose an appropriate time period over which to assess it. Somewhere between one and five years is usually what is aimed for.
By doing a quick back-of-envelope calculation the right method of implementation normally becomes clear. This can vary across time, as technology changes. A strength of the LDW is that it allows us to move data from one part of the architecture to another to optimize the RoI over time.
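A back-of-envelope calculation of this kind might look like the following Python sketch. It covers only a few of the factors listed above (storage cost, processing cost, query volume and an assumed benefit per query). The `Platform` class and every figure are hypothetical; they are not claims about what a warehouse or a lake actually costs.

```python
from dataclasses import dataclass


@dataclass
class Platform:
    name: str
    storage_cost_per_year: float
    processing_cost_per_query: float
    queries_per_year: int  # volume of analysis the platform sustains

    def total_cost(self, years: int) -> float:
        yearly = (self.storage_cost_per_year
                  + self.processing_cost_per_query * self.queries_per_year)
        return years * yearly

    def cost_per_query(self, years: int) -> float:
        return self.total_cost(years) / (years * self.queries_per_year)

    def roi(self, years: int, benefit_per_query: float) -> float:
        """Simple RoI: total benefit divided by total cost over the period."""
        benefit = years * self.queries_per_year * benefit_per_query
        return benefit / self.total_cost(years)


# Invented numbers for two candidate homes for the same data.
option_a = Platform("option A", 50_000, 0.50, 200_000)
option_b = Platform("option B", 10_000, 2.00, 40_000)

for p in (option_a, option_b):
    print(p.name, p.cost_per_query(years=3), p.roi(years=3, benefit_per_query=3.0))
```

With these made-up inputs, the option with cheaper storage turns out to have the higher $/query and the lower return. That is the kind of result that only shows up once the numbers are written down, and it will change as the inputs change over time.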
Another similar scenario is balancing short-term and longer-term requirements. This is illustrated by the eternal tension between BI power tool users and the developers of the data warehouse.
We may overhear educational snippets of conversation:
IT: “Please don’t keep generating marts and copies of the data everywhere, they make a big tangled mess. We don’t know what’s in them and it costs a huge amount to maintain them all, especially in the longer term.”
Business Analyst: “I don’t want to and you can’t make me”
CEO: “We need reports incorporating the new profitability measures by the end of the week, it is potentially worth millions of dollars in our negotiations with suppliers”
Data Architect: “Well, these things need to be done carefully, first we need to agree the measures across the whole company so they can be reliably shared, then the necessary changes put through change control into the warehouse, I can probably do it in three months”
CEO: “You’re fired”
These demands are only irreconcilable if we restrict ourselves to a single method of development and a single platform. When we allow ourselves the freedom of an architecture with multiple processing engines and modes of development, it becomes obvious how to proceed. Ideally we’d like to develop fast and run our analyses in the most efficient way possible. Sometimes we can do both; sometimes we have to choose one or the other.
This is illustrated at the left. Who is to say that an analysis has to stay on the platform it was originally developed on? If the analysis is only run once, because it is a one-off project, it may not be important that it runs very efficiently – in fact making it run efficiently is a waste of valuable effort, and thus reduces RoI. Contrariwise, if the analysis does have to be run repeatedly over many years then run-time efficiency has a big impact on RoI, so the extra effort of tuning gives a good payback.
If a piece of analysis is being developed or run on the wrong part of the architecture, that has to pop out in the numbers somewhere.
An obvious way we can square this circle is by developing in one way, then moving the analysis to another platform to run on a regular basis over time. The people who develop the analysis, and those that consolidate it onto another platform for regular running might not be the same people – they may have fundamentally different motivations and skills. But by mixing platforms, we can develop quickly, but run efficiently over the longer term – just by allowing ourselves to move the workload. Looking at the components of the LDW not as rival solutions, but integrated components of the same solution frees up our thinking here.
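The decision to move a workload comes down to a break-even count of runs. Here is a minimal Python sketch, assuming a one-time migration (or tuning) cost and a constant cost per run on each platform; the function names and figures are hypothetical.

```python
import math


def break_even_runs(migration_cost, cost_per_run_before, cost_per_run_after):
    """Smallest number of runs at which migrating is strictly cheaper.

    Returns None when the target platform saves nothing per run, in which
    case migration never pays back.
    """
    saving_per_run = cost_per_run_before - cost_per_run_after
    if saving_per_run <= 0:
        return None
    return math.floor(migration_cost / saving_per_run) + 1


def should_migrate(expected_runs, migration_cost,
                   cost_per_run_before, cost_per_run_after):
    threshold = break_even_runs(migration_cost, cost_per_run_before,
                                cost_per_run_after)
    return threshold is not None and expected_runs >= threshold


# Invented figures: migration costs 20 units, and each run costs 5 units
# where the analysis was developed versus 1 unit on the tuned platform.
print(break_even_runs(20, 5, 1))      # 6
print(should_migrate(1, 20, 5, 1))    # False: a one-off stays where it is
print(should_migrate(500, 20, 5, 1))  # True: years of regular running
```

A one-off analysis never recovers the migration effort, while one that runs for years recovers it many times over.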
Another good basic question to ask is:
“How often do I want to do this job?”
Generally the more often I need to do a job, the more efficient I need to be. There is no fixed answer to this. If I am doing a one-off analysis for a special project, there is no point doing more than is absolutely necessary to model the data, do quality assurance, meet regulatory requirements etc. (Note: I may still need to do these things; it’s the amount of effort expended that is in question here.) Equally, if we are doing something and expect to reuse it and share it many times in the future, then it is well worth putting in the extra effort up front. There will be different answers for different kinds of analyses, and different answers depending on which kind of server, or LDW component, is chosen. However, it is clear that not having a choice, trying to do every requirement on a single type of server, condemns us to sub-optimal returns overall.
Thus, we can see that the old argument about data marts vs data warehouses vs data lakes is sterile. Each represents a different tool in the toolbox, and each may be the optimal solution for a particular stage of the life cycle of a piece of analysis. We may prototype it in a mart, move it to the warehouse for multi-year production, then ultimately move it to the lake as the right technology emerges. Or we may prototype it in the lake, move it to the warehouse and then, if demand for the results becomes highly operational, move it to a dedicated mart or even an operational data store.
The architectural principle here is that the server and storage mode chosen depends on the business requirement not the other way around. That is, we don’t decide on the server we’re going to use, then force fit all requirements into it. It makes much more sense to look at the requirement and then choose our server (or combination of servers) accordingly.
In this blog I’ve deliberately kept to a high-level description of RoI: the benefits divided by the cost. Many readers will be aware that there are a number of more sophisticated RoI techniques that can be applied. These include Internal Rate of Return (IRR), Hurdle Rates, break-even analysis and so forth – maybe that can be the subject of a future blog – but I have deliberately avoided them because I think that the 80/20 rule applies here. You can go a long way just by considering the benefits obtained by each requirement, together with the basic costs in people, time and money.
You can use this basic practical approach to make lots of decisions regarding new areas of benefit that can be accessed, or where we could simply do more for less:
- What platform(s) should be used?
- How should we prioritize our time / requirements?
- Are there any parts of my LDW missing?
- What skills need to be added to the team?
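For the prioritization question in particular, the simple benefit-divided-by-cost view can be sketched in Python as below. The requirement names, costs and benefits are invented. Ranking by ratio and filling the budget greedily is just one reasonable heuristic; it does not always give the best possible selection.

```python
def prioritize(requirements, budget):
    """Pick requirements in descending benefit/cost order until the budget runs out.

    `requirements` is a list of (name, cost, benefit) tuples. Requirements
    that no longer fit in the remaining budget are skipped.
    """
    ranked = sorted(requirements, key=lambda r: r[2] / r[1], reverse=True)
    chosen, remaining = [], budget
    for name, cost, benefit in ranked:
        if cost <= remaining:
            chosen.append(name)
            remaining -= cost
    return chosen


# Hypothetical backlog: (name, estimated cost, estimated benefit).
backlog = [
    ("supplier profitability report", 10, 100),
    ("churn model", 40, 120),
    ("regulatory extract", 20, 90),
    ("clickstream dashboard", 30, 45),
]

chosen = prioritize(backlog, budget=60)
print(chosen)
```

Even rough estimates are enough to produce an ordering that can be discussed with the business.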
So, try this out if you’re not already doing it as a matter of habit. Whenever you have a decision to make around your analytical systems / LDW, look at it from the point of view of RoI. I think you’ll find that often the answer will just jump off the page at you.
I’ll be at the Data and Analytics Summit in London – one of the many D&A events we’ll be hosting around the world during 2018 – see gartner.com/events/bi for details. If you’re there I’ll be delighted to chat about this.