Regardless of an organization's industry, sector or size, the promise of unlocking the full potential of enterprise information with big data technologies and big content techniques is tantalizing. Unfortunately, for most organizations, realizing the promise of big data remains out of reach. The perception is that there is just too much information to get under control, things change too quickly, and there are too many moving parts. The volume, velocity and variety conundrum stymies many potentially transformative undertakings before they ever make it off the whiteboard. The expense and disruption involved in expanding and retooling on-premises infrastructure remain insurmountable obstacles for many organizations that want to undertake a big data initiative.
Providing big data functionality without overhauling the data center is achievable with a two-pronged approach that combines augmented enterprise search and distributed computing. Search is very good at finding interesting things that are buried under mounds of information. Enterprise search can provide near-real-time access across a wide variety of content types that are managed in many otherwise siloed systems. It can also provide a flexible and intuitive interface for exploring that information. The weakness of a search-driven approach is that a search application is only as good as the indexes upon which it is built. A traditional global search index containing only the raw source content is not sufficient to facilitate big data use cases. Multiple, purpose-built indexes that are derived from enriched content are necessary. Creating such indexes requires significant computational firepower and tremendous amounts of storage.
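To make the idea of purpose-built indexes concrete, here is a minimal sketch. Everything in it is illustrative rather than taken from the research note: the documents, the function names, and especially the enrichment step, which in a real pipeline would be entity extraction, classification or other NLP rather than the toy capitalization heuristic used here. The point is only that one enrichment pass over the source content can feed several distinct indexes, each tuned to a different kind of question.

```python
from collections import defaultdict

def enrich(doc):
    """Toy enrichment step: tag capitalized tokens as 'entities'.
    A real pipeline would apply genuine NLP or classification here."""
    tokens = [t.strip(".,") for t in doc["text"].split()]
    return {**doc, "entities": [t for t in tokens if t[:1].isupper()]}

def build_indexes(docs):
    """Derive two purpose-built indexes from the same enriched content:
    a full-text inverted index and an entity-to-document index."""
    inverted = defaultdict(set)
    entity_index = defaultdict(set)
    for doc in map(enrich, docs):
        for term in doc["text"].lower().split():
            inverted[term.strip(".,")].add(doc["id"])
        for ent in doc["entities"]:
            entity_index[ent].add(doc["id"])
    return inverted, entity_index

# Hypothetical source records standing in for siloed enterprise content.
docs = [
    {"id": 1, "text": "Acme acquires Globex for two billion dollars"},
    {"id": 2, "text": "Globex revenue falls sharply"},
]
inverted, entity_index = build_indexes(docs)
```

A search application backed by both indexes can answer "which documents mention the term revenue" and "which documents concern the entity Globex" without rescanning the raw content; the raw-content-only index supports only the first kind of query.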
Distributed computing frameworks provide the environment necessary to create these indexes. They are particularly well-suited to efficiently collect extremely large volumes of unprocessed, individually low-value pieces of information and apply the complex analytics and operations that are necessary to transform them into a coherent and cohesive high-value collection. The ability to process numerous source records and apply multiple transformations in parallel dramatically reduces the time that is required to produce augmented indexes across large pools of information. Additionally, these operations and the infrastructure necessary to support them are open-source-oriented and very cloud-friendly. It is possible to establish a robust search-driven big data ecosystem without massive upfront investments in infrastructure.
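The parallel pattern described above can be sketched in miniature. This is not code from any particular framework: the record set, the transform and the worker pool are all stand-ins, with Python threads playing the role that a cluster of nodes would play in Hadoop or a similar distributed system. The shape, though, is the same: a map phase applies the expensive per-record transformation in parallel, and a merge phase folds the results into a single augmented index.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

def transform(record):
    """Map phase: turn one low-value raw record into (term, doc_id) pairs.
    In a real pipeline this step would carry the costly analytics."""
    doc_id, text = record
    return [(term.lower(), doc_id) for term in text.split()]

def build_augmented_index(records, workers=4):
    """Run transforms in parallel, then merge the results into one index.
    A thread pool stands in for the nodes of a distributed cluster."""
    index = defaultdict(set)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for pairs in pool.map(transform, records):
            for term, doc_id in pairs:
                index[term].add(doc_id)
    return index

# Hypothetical raw records, individually low-value.
records = [(1, "sensor reading nominal"), (2, "sensor fault detected")]
index = build_augmented_index(records)
```

Because each record is transformed independently, adding workers (or nodes) shortens the map phase almost linearly, which is why the approach scales to very large pools of information.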
Distributed computing and augmented enterprise search are two sides of the big data coin. Both are necessary, but neither is sufficient to facilitate many knowledge-intensive applications. Hadoop and its cousins are purely batch-oriented and so cannot provide the near-real-time access that is facilitated by search. Enterprise search provides rapid, easy access to information but cannot perform the complex analytics necessary to build the indexes supporting that access. Combining distributed computing and enterprise search provides a flexible, scalable and responsive architecture for many big data scenarios.
I explore this approach in depth in the Gartner document "The New NoSQL: How Enterprise Search and Distributed Computing Bring Big Data Within Reach" and will be speaking on it in London at the Catalyst Europe conference. I hope to see you there.