In recent years, large, data-centric vendors have acquired smaller enterprise search companies at an astonishing rate. Oracle purchased Endeca. IBM purchased Vivisimo. Hewlett-Packard purchased Autonomy. Microsoft purchased FAST. There is a reason for this feeding frenzy of corporate acquisitions. Big Data is the current killer application, and search provides a ready entry into the Big Data space for both vendors and practitioners. This is particularly true when it comes to incorporating unstructured content into the Big Data world (or, as I called it in a previous post, Big Content). Search is one area of technology where the order of precedence between structured and unstructured content was reversed. Historically, structured data and its management were the primary focus and top priority of industry. As a result, database, data warehouse and business intelligence technologies initially focused on structured data, leaving unstructured content for later consideration and thus causing those capabilities to mature much more slowly. Enterprise search, by contrast, began with unstructured content and only recently brought structured data sources into the fold. From the outset, search has focused on discovery, access and exploration rather than reporting or transaction processing. This gives search several distinct advantages when dealing with unstructured content.
First and foremost, most enterprise search platforms have mature and, in some cases, quite sophisticated content ingestion and indexing pipelines. At the most basic level, this surfaces content and makes it both visible and accessible regardless of where it is stored. At a deeper level, the indexing process facilitates the content processing and enrichment that underpins Big Content capabilities. The importance of content preprocessing to Big Data and Big Content capabilities cannot be overstated. Content enrichment of this sort can bring structure and consistency to otherwise freeform content. Even with this preprocessing, unstructured content will always present itself inconsistently. Records will be only partially complete, documents will vary in length, and so forth. A search engine is very comfortable with such jagged records and can incorporate them into its indexes without difficulty.
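To make the point about jagged records concrete, here is a minimal sketch of an inverted index that accepts records with missing or empty fields. It is an illustration only, not how any particular search platform works; the function names (tokenize, build_index, search), the "*" catch-all field and the sample records are all assumptions made up for this example.

```python
from collections import defaultdict


def tokenize(text):
    """Lowercase a string and split it into alphanumeric tokens."""
    cleaned = "".join(c.lower() if c.isalnum() else " " for c in text)
    return cleaned.split()


def build_index(records):
    """Map each (field, token) pair to the set of record ids containing it.

    Records are plain dicts. Any field may be missing, empty or None.
    """
    index = defaultdict(set)
    for rec_id, record in records.items():
        for field, value in record.items():
            if not value:  # skip empty or None fields instead of failing
                continue
            for token in tokenize(str(value)):
                index[(field, token)].add(rec_id)
                # "*" is a hypothetical catch-all field for plain keyword search
                index[("*", token)].add(rec_id)
    return index


def search(index, token, field="*"):
    """Return the sorted ids of records containing the token in the given field."""
    return sorted(index.get((field, token.lower()), set()))


# Hypothetical jagged records: no two share the same set of fields.
records = {
    "doc1": {"title": "Quarterly sales report", "author": "Lee",
             "body": "Sales rose in the third quarter."},
    "doc2": {"title": "Meeting notes"},
    "doc3": {"body": "Notes on quarterly sales targets", "author": None},
}
index = build_index(records)

print(search(index, "quarterly"))             # matches in any field
print(search(index, "notes", field="title"))  # matches in the title only
```

Nothing in the index cares which fields a record has. A record contributes whatever it contains, and a partially complete record is simply found by fewer queries.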
Business users have an intuitive grasp of what search does and how to make it work. If the enterprise search platform has been implemented well, finding relevant enterprise information assets is not much different from, or more difficult than, using Google. A few well-chosen keywords will usually at least put the user in the general vicinity of the information they are looking for. Search is also good at finding things that are “close enough” by handling spelling variants, synonyms, related content and other forms of fuzzy matching. This is extremely useful when attempting to uncover nuggets of information scattered across and hidden within large amounts of heterogeneous content.
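As a rough sketch of what “close enough” matching can look like, the example below expands a query term with near spellings and synonyms before it is looked up. It uses difflib from the Python standard library for the spelling variants; the synonym table, the vocabulary, the expand_term name and the 0.8 similarity cutoff are assumptions chosen for illustration, and production search engines use far more elaborate techniques.

```python
import difflib

# Hypothetical synonym table and index vocabulary, made up for this example.
SYNONYMS = {"invoice": {"bill"}, "bill": {"invoice"}}
VOCABULARY = ["bill", "customer", "invoice", "shipment"]


def expand_term(term, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the indexed terms that a user's query term should match."""
    term = term.lower()
    matches = set()
    if term in vocabulary:
        matches.add(term)
    # Spelling variants: indexed terms whose similarity ratio meets the cutoff.
    matches.update(difflib.get_close_matches(term, vocabulary, n=3, cutoff=cutoff))
    # Synonyms of anything matched so far, if they occur in the index at all.
    for matched in list(matches):
        matches.update(s for s in SYNONYMS.get(matched, ()) if s in vocabulary)
    return matches


print(sorted(expand_term("invocie")))  # a misspelling of "invoice"
```

The misspelled query still reaches documents indexed under "invoice", and through the synonym table those indexed under "bill" as well.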
The concept of search is simple and straightforward. Everything that goes into making search effective, especially as a foundation for Big Content analytics, can get very complicated. Modern search has the ability to go far beyond simple retrieval. The indexes created by content ingestion pipelines can combine diverse information in novel ways that uncover relationships, trends and patterns that would not otherwise be apparent. This is accomplished primarily by determining the relevance of each indexed item to the query or question at hand. Big Content search takes a broader view of relevance than the one-size-fits-all approach of a public web search engine. In the world of Google’s PageRank and its peer algorithms, a single index is developed and replicated that attempts to support all queries for all people. For basic information location and retrieval, this approach has been remarkably successful, but Big Content cannot take such a universal approach to indexing.
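To show what “determining the relevance of each indexed item to the query” can mean in practice, here is a small sketch that ranks documents with TF-IDF, a classic term-weighting scheme. I am using it only as a stand-in: it is not PageRank and not the scoring used by any particular product. The rank_by_tf_idf name and the sample documents are assumptions for this example.

```python
import math
from collections import Counter


def rank_by_tf_idf(query, documents):
    """Return (doc_id, score) pairs for matching documents, best match first."""
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter()
    for tokens in tokenized.values():
        doc_freq.update(set(tokens))

    scores = {}
    for doc_id, tokens in tokenized.items():
        counts = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if counts[term] == 0:
                continue
            tf = counts[term] / len(tokens)                   # frequent in this document
            idf = math.log(n_docs / doc_freq[term]) + 1.0     # rare across the collection
            score += tf * idf
        if score > 0:
            scores[doc_id] = score
    # Highest score first; ties are broken by document id.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical mini-collection.
documents = {
    "a": "pump failure report pump vibration",
    "b": "annual report",
    "c": "cafeteria menu",
}
for doc_id, score in rank_by_tf_idf("pump report", documents):
    print(doc_id, round(score, 3))
```

Because the weights depend on the collection being scored, the same query ranks differently against a different pool of content, which is one way to see why a single universal index is a poor fit for Big Content.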
Search operates at two levels in a Big Content environment: Discovery and Analysis. At the discovery level, search functions much as it does on the web or in traditional enterprise search. It provides a single, comprehensive index of available information assets against which queries are matched and relevant content is retrieved. Beyond simple information retrieval, this sort of discovery facilitates the deeper analysis at the heart of Big Content. Using search-based discovery, a user can identify a pool of information resources, both structured and unstructured, that may contain a desired insight or answer a particular question in a way that simply reading a document will not reveal. This pool of resources can be gathered from across the enterprise and processed through the indexing pipeline to be enriched, organized and indexed in an iterative manner specifically tailored to the situation at hand. Rather than attempting to envision all possible uses of content when designing the repository, the search index approach allows the user to model what they need when it is needed and only for the content involved.
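The two levels can be sketched in a few lines. In this simplified illustration, a discovery step selects a pool of documents by keyword, and an analysis step enriches only that pool and builds a small index tailored to one question (here, which years are mentioned). The function names, the year-extraction enricher and the sample corpus are all assumptions invented for the example.

```python
import re


def discover(corpus, keywords):
    """Discovery level: ids of documents that mention any of the keywords."""
    wanted = {k.lower() for k in keywords}
    return [doc_id for doc_id, text in corpus.items()
            if wanted & set(text.lower().split())]


def build_focused_index(corpus, pool, enrich):
    """Analysis level: enrich only the pooled documents and index the results."""
    index = {}
    for doc_id in pool:
        for key in enrich(corpus[doc_id]):
            index.setdefault(key, []).append(doc_id)
    return index


def extract_years(text):
    """Hypothetical enrichment step: pull four-digit years out of free text."""
    return sorted(set(re.findall(r"\b(?:19|20)\d{2}\b", text)))


# Hypothetical corpus of unstructured notes.
corpus = {
    "m1": "Pump failure at plant in 2011",
    "m2": "Pump replaced 2012 after failure 2011",
    "m3": "Holiday schedule 2012",
}

pool = discover(corpus, ["pump"])                     # only the pump-related notes
by_year = build_focused_index(corpus, pool, extract_years)
print(by_year)
```

Swapping in a different enrichment function gives a different index over the same pool, without redesigning the repository the documents came from.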