In recent posts I’ve introduced the notion of Big Content as shorthand for incorporating unstructured content into the Big Data world in a systematic and strategic way. Big Data changes the way we think about content and how we manage it. One of the most important areas requiring a fresh look is metadata. Big Content expands the definition of metadata beyond the traditional tombstone information normally associated with documents (title, author, creation date, archive date, etc.). While these elements are necessary and remain foundational to both effective content management and Big Data, more is required. Big Content metadata encompasses any additional illuminating information that can be extracted from or applied to the source content to facilitate its integration and analysis with other information from across the enterprise. This expanded definition results in a three-tiered metadata architecture for Big Content.
At the bottom level of the architecture, a core enterprise metadata framework provides a small set of metadata elements that are applicable to the majority of enterprise information assets under management. These elements are often drawn from a well known standard set of elements such as the Dublin Core but can include whatever common elements are useful to the enterprise. This common framework provides the unifying thread that will facilitate locating content from across the enterprise, making an initial assessment of its relevance and of submitting it to the content ingestion pipeline.
The second layer of the Big Content metadata architecture consists of domain specific elements that are not necessarily applicable to all enterprise content, but are useful to a particular area such as a brand, product or department. At this level, common metadata often exists under different labels depending on where it is created and which department owns it. This increases its value and utility to that department but makes it more difficult to leverage for content integration and analysis. To reconcile domain metadata it is often necessary to create a metadata map that resolves naming and semantic conflicts.
The top layer of the metadata architecture consists of application specific metadata. This is additional information about content that is only relevant to the use-case at hand and the application facilitating its execution. As such it is not created or stored in the content management systems hosting the source content. It is created solely for the purpose of structuring and augmenting the content to be utilized within a vertical application in the Big Content environment.
Throughout the entire Big Content lifecycle ensuring metadata quality and integrity is of the highest importance. Quality measures must go beyond simply reconciling field names. It is important that the steps taken to enrich and refine content are applied consistently. If some dates are not normalized, entity extraction is incomplete, or terminology is not reconciled, the accuracy of the data behind the insights comes into doubt. As a result any analysis and its findings become questionable. Metadata represents a significant upfront investment and ongoing requirement when large amounts of content are involved. Never the less, it is a critical factor in effective content management and the key enabler of the Big Content ecosystem.