When developing analytic system system sizing is one of things that it is beneficial to do ‘early and often’. However, getting a reasonable estimate early on in a project can seem difficult. However there is a simple trick that can be used for most analytical big data systems – data warehouses, data marts and data lakes.
I mention this in my latest paper on the logical data warehouse, and whilst discussing it was suggested that this simple sizing tip would be a good thing to pass on in a blog. This little trick hinges on two insights.
- For most large data stores a large proportion of the data resides in the ten or twenty largest tables or data sets. A system may have tens or hundreds of tables, but typically the majority of the data is held in just a small subset of the tables.
- These large tables are typically easily linked to readily available business metrics.
For example, for Telecommunications the biggest tables are typically call detail records, and network events. For Retail it will be point of sale basket items and stock movements, for finance it will be financial transactions and the events leading up to them. If we take these main tables, and the next eight or nine largest then we probably have 95% or more of the overall data. This is because these tables typically store the several years of history we’ll dip into for our analysis.
If we know they are 95% of the data, then we can make a good estimate of the size of 100% of the data – just divide by 0.95! The 95% figure may vary from industry to industry but it’ll be typically somewhere between 80% and 98%. For example insurance and manufacturing tend to have a wider variety of more complex data, so maybe its the top 15 tables and 80% as the estimate for the proportion of total size represented by those biggest tables. You’ll need to experiment and relate it to example systems already in place to do a sanity check to calibrate it for your situation.
These big tables typically relate to readily observable things in the business, or can be gleaned from an annual report. For Telco we know how many calls are made each year, the average number of cell towers traversed in a call, the average number of calls for personal and for business customers, and we have a good idea as to how big a call detail record is. By multiplying these together we can estimate the size of one of our biggest tables. Likewise if we are in retail and know our turnover, and average basket item price, we know how many items we sell, if we know the average basket size we know the number of baskets.
The point here is that whilst exact sizing is difficult by using this trick doing a reasonable sizing can be done early on. We might do one for all the major data stores we are building, the data warehouse, the data lake and any special purpose stores. This avoids nasty surprises later on in the project.
The picture above shows an example spreadsheet. An actual spreadsheet can be downloaded with the paper I mentioned above, or you can just use the above as a model.
Different target systems will have different rules of thumb for data expansion (replication and redundancy for availability), compression, and the need to add performance structures like aggregates and indexes, but these are usually pretty well know and can be plugged in.
Clearly this is very simple, but that is the point too, by keeping it simple we can get a heads up of the size of system we’re building, and therefore the kind of scalability techniques we need to adopt, early in the project. We can extrapolate these simple estimates into capacity plans. Then we can firm them up into configurations for procurement, working out the kinds of expansions and upgrades we need for each project phase. Even if we are using cloud based systems we still need to have an idea of capacity so we can estimate budgets and get them approved. The sooner, and the simpler we do this the better.
I hope you find this simple tip useful.
Comments or opinions expressed on this blog are those of the individual contributors only, and do not necessarily represent the views of Gartner, Inc. or its management. Readers may copy and redistribute blog postings on other blogs, or otherwise for private, non-commercial or journalistic purposes, with attribution to Gartner. This content may not be used for any other purposes in any other formats or media. The content on this blog is provided on an "as-is" basis. Gartner shall not be liable for any damages whatsoever arising out of the content or use of this blog.