Introducing the notion of Big Content has been an interesting study in reactions. For many, it has resonated, and the possibility of more fully exploiting documents, social content and other unstructured resources has clicked. For others, it has been like fingernails on a chalkboard. A lot of us never liked the words “Big Data” in the first place, but like it or not, we are stuck with the name. One of the reasons we didn’t like it (well…one of the reasons I’ve never liked it) is that “data” tends to emphasize structured resources like databases and logs to the neglect of more textual and free-form assets. We need them both. Big Content simply shines a spotlight on the shadowy corners of the enterprise information ecosphere. Despite the idiosyncrasies of unstructured content and the unique demands of its management and analysis, it remains fully a part of the Big Data world.
Nearly every step in preparing unstructured content for Big Data consumption has an analogue in the structured data world. Data must be cleaned, reconciled and modeled, just as documents must be processed and prepared. To a certain degree, this is a case of performing the same task with different tools. While both types of information may be enriched, the nature of that enrichment will differ. A set of data points from a large array of sensors might be submitted to an inference engine to fill out a sparse data set, whereas an equally large number of “Tweets” might be analyzed for sentiment. Both sets of information could be geotagged.
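To make the parallel concrete, here is a minimal Python sketch of the two enrichment paths sharing one downstream step. Everything in it is an assumption made for illustration: the `Record` type, the function names and the tiny word lists are hypothetical, a simple mean stands in for a real inference engine, and a word-count score stands in for a real sentiment model.

```python
from dataclasses import dataclass, field
from statistics import mean

# Hypothetical toy lexicon; a real system would use a trained sentiment model.
POSITIVE = {"love", "great", "good"}
NEGATIVE = {"hate", "awful", "bad"}


@dataclass
class Record:
    payload: object  # a sensor reading (float or None) or a text string
    lat: float
    lon: float
    enrichments: dict = field(default_factory=dict)


def fill_gaps(records):
    """Structured enrichment: fill missing readings with the mean of known ones.

    Mean imputation is a stand-in here for a proper inference engine.
    """
    known = [r.payload for r in records if r.payload is not None]
    fallback = mean(known) if known else None
    for r in records:
        r.enrichments["value"] = r.payload if r.payload is not None else fallback
    return records


def score_sentiment(records):
    """Unstructured enrichment: positive word count minus negative word count."""
    for r in records:
        words = [w.strip(".,!?").lower() for w in r.payload.split()]
        positives = sum(w in POSITIVE for w in words)
        negatives = sum(w in NEGATIVE for w in words)
        r.enrichments["sentiment"] = positives - negatives
    return records


def geotag(records):
    """Shared enrichment: attach a coarse location to either kind of record."""
    for r in records:
        r.enrichments["geo"] = (round(r.lat, 2), round(r.lon, 2))
    return records


sensors = [
    Record(20.0, 45.5231, -122.6765),
    Record(None, 45.5231, -122.6765),  # sparse reading to be filled in
    Record(22.0, 45.5231, -122.6765),
]
tweets = [
    Record("I love this great phone!", 51.5074, -0.1278),
    Record("Awful service, bad day.", 51.5074, -0.1278),
]

# Different tools for the type-specific step, the same tool for geotagging.
sensors = geotag(fill_gaps(sensors))
tweets = geotag(score_sentiment(tweets))
```

The design point is only that the type-specific step differs while the record shape and the shared step do not, which is what lets both kinds of content flow through the same pipeline.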
Structured information resources have played a more prominent role in Big Data than unstructured resources, primarily because the enterprise is more comfortable managing databases than it is managing documents. Data hygiene and information quality are de rigueur for the data warehouse, but are rarely considered in the ECM environment. Likewise, structured resources are more likely to be “hard-wired” into the Big Data pipeline with well-established connectors and regularly scheduled ETL windows. Unstructured content is often included almost as an afterthought, with extraction and enrichment applied on the fly and from scratch, on a case-by-case basis. This undermines the potential of Big Data in several ways. It raises the cost of incorporating unstructured content, and it creates more opportunities for inconsistencies and errors, which reduce the quality of the final product. Most importantly, the ad hoc approach obscures the extent of the available raw materials.
If unstructured content is difficult to find, reconcile and include in the analytical environment, it is unlikely that novel applications will even be conceived, much less acted upon. The idea of Big Content is simply to encapsulate and enumerate the steps necessary to avoid this unfortunate situation by extending the Big Data environment and infrastructure to incorporate unstructured content in a strategic and systematic manner.
I will be speaking and leading workshops on Big Content at several upcoming Gartner events, including Catalyst, the London Portals, Content and Collaboration Summit, and Symposium Orlando. I hope to see you there and to continue the conversation.
Follow Darin on Twitter: @darinlstewart