
How Much Master Data is there in the World?

By Andrew White | May 01, 2017

MDM | Master Data Management

A vendor sent me an email a few weeks ago and asked, “How much master data does an organization have?” Well, that’s a pretty big question for a pretty big topic that ends up being a whole lot of small data. I thought I could answer the question, so I jotted down my ideas and then realized it was harder than I first thought. So I fired off an email to a number of analysts who focus on MDM, information architecture and governance to see if we could agree on a response.

Logically and simply there is not a lot of it about.  If you think about what master data is meant to be, rather than the precise scientific definition you might find in a book about metadata, you come to the following conclusions:

  • The minimalist set of attributes to uniquely identify one object from another
  • That needs to be shared (or needs to be consistent) between core business processes, systems, apps, business units or uses
  • Where the objects in mind are those that define what an organization does

The third item is key – since it leads to a different understanding between business and IT roles. Business users will likely conclude that there are only a few (real, worthy) master data objects: the ones they know conceptually that describe what their organization does. IT, and specifically architecture roles, will determine that there are many more master data objects. Both views are valid, but the one that is more useful and valuable is the prioritized, smaller set of objects that the business can describe. If the business does not seek to describe it, it is not likely to be that important. Not always, but that’s a fair view for most situations.

See MDM ‘Primer’: How to Define Master Data and Related Data in Your Organization for more information.

The derivation of the (master data) objects in question should therefore be a business-led conversation, so it tends to start off with things like ‘customers’, or ‘clients’, or ‘prospects’, or ‘citizens’, and then moves to ‘products’, ‘things’, ‘assets’ or ‘services’.  Then it moves to ‘locations’ and ‘hierarchies’.

Other users might add on a “360 degree view”, which tends to mean a whole lot more data about the object in mind. There will be many variations and different objects, and some industries start out with very different perspectives, such as people data (for example, workforce management) over product data.

But once you have the list of important objects, you then have to come up with the attributes, rules and metadata about those objects and attributes. Then we can add relationship data between one object and another (e.g. big data), if it is not already implied, or streaming data (e.g. think IoT or machine data).

Finally, you then take into account the actual data itself – not just the reference data that describes it. So you then develop a list of the actual observations of those objects – such as the customer list, or product catalog etc. So with the reference data, the actual data, the rules and policies and the metadata associated with same, you have your master data.

So it sounds pretty simple, right?

What about other reference data such as units of measure, or country codes? This looks, smells and behaves like master data – but it hardly “defines what our business does”. So it is not master data, but it can often be used as part of master data or at least in relation to master data. So you need to track both kinds of data – possibly as two separate data models.

See How to Manage Your Master and Metadata Data Models for More Effective Program Management.

So, we are done, yes?  No.

Let’s try this.  What do you call the data that needs to be consistent between sales, marketing and services processes if all three are housed in one CRM system?  Or what about shared data between billing, inventory and order management, in an ERP system?  Are the common data shared by both applications and suites not master data?

It turns out that while we might agree on what master data is from a classification perspective, there is also a temporal use that changes the scope and so the definition. One person’s master data is another person’s metadata (don’t tell Michael Blechar I said that!). And one application’s master data is another application’s application data. It just depends on perspective.

See Designing Your Pace-Layered Information Strategy.

But then there is another wrinkle.  So what if we agree on a couple of objects, and a number of attributes thus:

  • Customer data comprising 8 attributes
  • Product data comprising 15 attributes

Is the copy of this data that is stored in each application not master data, just a copy of it? What about the version that is forever stored (at a moment in time) in a transaction, or order? What about the version of this same data that is in a data warehouse? Is this all master data, or just a copy and so not master data per se?

And then you have that other classic: we all know that the vast majority of “data” in a business is actually “content”, that is, unstructured data that is not, without further work, machine readable. Don’t forget, as Mark Beyer would correct me, “There is no such thing as unstructured content”. It is pretty much useless data until it has at least some notional structure (i.e. metadata) so that a user or machine can process the content for some purpose.

See Use Gartner’s Three Rings of Information Governance to Prioritize and Classify Records.

So now perhaps we can come up with some outlines or maybe principles:

  1. There is likely to be a finite set of master data (object classes and records of each) that a firm can agree on that needs, to a degree, formal and centralized governance. This will likely be structured data that can be processed by machines and computers.
  2. There will be many copies of this data, embedded with other data or alone, that are purpose-copied. Examples exist in a data warehouse, a business or analytic application, or a business transaction. This is not “master data” per se but an observation of what the master data was at the time or for that use. It may or may not be actively governed or stewarded.
  3. There will be a whole lot more structured data in business systems. Perhaps less than 1% of the structured data will be master data. Perhaps a whole lot less than 1%.
  4. Given that the quantity of content, with structure or without, dwarfs the typical structured data in business systems, the likelihood is that master data is very, very much smaller than 1% of a firm’s overall data and content.

So let me repeat this in another form:

  1. Of all the typical relational (as in structured and machine readable) data in business systems, master data is most likely to be less than 1% of such data.  The rest of such data is currently being referred to as application data.
  2. This “<1%” is the master data that should be actively and centrally governed (but permitting many different management approaches) for wide re-use.
  3. The remaining application data is made up of:
    1. Data used by one application (which requires local governance)
    2. Data used by two or more applications (which requires regional governance)
  4. The above percentages refer to the master and application data that requires active governance, and ignore the copies of that same data stored in transactions, sensors and streams (though some of this may also be actively governed, it will make little difference to the “less than 1%”).
  5. The sum of master and application data in an enterprise remains a very small part of the overall content in a firm, which extends to all relational and non-relational stores, documents, images, files, emails, data warehouses and apps etc. As such, master data is typically <<1% of an organization’s (all) data*.

* AllData sounds rather like the AllSpark from the Transformers.
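
The three governance tiers in the list above (central for master data, regional for data shared by two or more applications, local for data used by one application) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not a definitive classification method: the names `DataObject`, `Governance` and `governance_tier` are hypothetical, the `defines_the_business` flag stands in for the business-led conversation described earlier, and “shared” is simplified to mean “used by two or more applications”.

```python
from dataclasses import dataclass
from enum import Enum


class Governance(Enum):
    CENTRAL = "central"    # master data: actively and centrally governed
    REGIONAL = "regional"  # application data used by two or more applications
    LOCAL = "local"        # application data used by one application


@dataclass(frozen=True)
class DataObject:
    name: str
    used_by: frozenset          # names of the applications that use this data
    defines_the_business: bool  # hypothetical flag, decided by the business


def governance_tier(obj: DataObject) -> Governance:
    """Assign a governance tier using the simplified rules assumed above."""
    shared = len(obj.used_by) >= 2
    if shared and obj.defines_the_business:
        return Governance.CENTRAL
    if shared:
        return Governance.REGIONAL
    return Governance.LOCAL


# Illustrative objects only; the application names are made up.
examples = [
    DataObject("customer", frozenset({"crm", "erp", "billing"}), True),
    DataObject("shipping_status", frozenset({"erp", "inventory"}), False),
    DataObject("draft_quote", frozenset({"crm"}), False),
]

for obj in examples:
    print(obj.name, "->", governance_tier(obj).value)
```

Note that the sketch deliberately ignores the perspective problem raised earlier: data shared between processes inside a single CRM or ERP suite would count here as “one application”, which is exactly where the definition gets slippery.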

See Pursue a Pace-Layered Information Strategy to Support Your Business Applications.

So if you read this far you might assume that this was a useful or even thought-provoking blog. However, after I sent the question to a number of analysts I was snowed under with numerous responses and questions and answers. We exchanged emails in the first week all of a flutter. By the second week the number of emails started to fall off, and by week three there were only one or two. As such I am not convinced that what I have amalgamated above is finished. Our team seemed to lose interest once the harder thinking was done and the main boundaries were defined. I suspect, though, that we might change our mind; might not understand what we meant; and someone else “out there” may have a better idea. After all, one person’s system of record is another person’s system of reference! Until then, here you go.


