In my last post I discussed how enterprise search can bring Big Data within reach. To achieve this, however, crawling and indexing must move beyond traditional vertical scaling to a truly distributed model à la Hadoop and its cousins. The end product of integrating enterprise search and distributed computing is a scalable, flexible, responsive environment for information discovery and analytics. The system scales easily and efficiently in both content size and query-handling capacity, and it can support both batch-oriented content processing and near-real-time information access. The key-value orientation of MapReduce, along with the flexible schema and dynamic field support of the search engine, allows any form of content, whether structured, unstructured or semi-structured, to be fully leveraged. In short, the enterprise search infrastructure becomes a powerful NoSQL database.
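To make the pairing of key-value processing and dynamic fields a little more concrete, here is a minimal sketch in plain Python. It is not tied to Hadoop or to any particular search engine: the function names, the sample records and the type-suffix convention (_t, _i, _f, _b) are all assumptions made up for illustration, loosely in the spirit of the suffix-based dynamic fields some search engines offer.

```python
from collections import defaultdict


def dynamic_field(name, value):
    """Derive a field name from the value's type (hypothetical suffix convention)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return f"{name}_b"
    if isinstance(value, int):
        return f"{name}_i"
    if isinstance(value, float):
        return f"{name}_f"
    return f"{name}_t"


def map_record(record):
    """Map step: emit (document id, (field, value)) pairs for one raw record."""
    doc_id = record["id"]
    for name, value in record.items():
        if name != "id":
            yield doc_id, (dynamic_field(name, value), value)


def reduce_document(doc_id, pairs):
    """Reduce step: merge every pair seen for one id into one indexable document."""
    doc = {"id": doc_id}
    for field, value in pairs:
        doc[field] = value
    return doc


def build_documents(records):
    grouped = defaultdict(list)  # stands in for the shuffle phase of a real cluster
    for record in records:
        for doc_id, pair in map_record(record):
            grouped[doc_id].append(pair)
    return [reduce_document(doc_id, pairs) for doc_id, pairs in sorted(grouped.items())]


# Records of different shapes, including two fragments of the same document.
records = [
    {"id": "a1", "title": "Quarterly report", "pages": 12},
    {"id": "a1", "body": "Revenue grew in the third quarter."},
    {"id": "b2", "price": 9.99, "in_stock": True},
]
docs = build_documents(records)
```

No record had to declare its shape in advance. The two fragments keyed a1 are merged into one document, and the b2 record brings fields that no other record uses.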
There is currently no canonical definition of what constitutes a NoSQL database. According to Wikipedia, “NoSQL (Not only SQL) is a movement promoting a loosely defined class of non-relational data stores that break with a long history of relational databases. These data stores may not require fixed table schemas, usually avoid join operations and typically scale horizontally.” Even so, there are certain characteristics that most NoSQL databases share. First and foremost, NoSQL supports schema-on-read, which means that any form of information can be written to the database, with a schema applied only when that information is retrieved for use. This reverses the schema-on-write approach of traditional relational databases, in which information must be made to conform to a particular schema before it can be stored. In addition, NoSQL databases generally favor eventual consistency over ACID compliance and strict consistency. A search-driven approach to Big Data can facilitate each of the four models for NoSQL information management described in the Gartner document The New NoSQL: key-value stores, document databases, table-style databases and graph databases.
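The difference between schema-on-write and schema-on-read is easier to see side by side. The sketch below uses only the Python standard library. The column list, function names and sample records are hypothetical, and the in-memory lists merely stand in for a relational table and a NoSQL store.

```python
import json

# Schema-on-write: a record must match the fixed columns before it is stored.
COLUMNS = ("id", "name", "email")  # hypothetical table definition


def write_with_schema(table, record):
    if set(record) != set(COLUMNS):
        raise ValueError(f"record does not match schema {COLUMNS}")
    table.append(tuple(record[column] for column in COLUMNS))


# Schema-on-read: store whatever arrives and interpret it at retrieval time.
def write_raw(store, record):
    store.append(json.dumps(record))


def read_with_schema(store, fields, default=None):
    """Project each stored record onto the fields the reader cares about."""
    for raw in store:
        record = json.loads(raw)
        yield {field: record.get(field, default) for field in fields}


store = []
write_raw(store, {"id": 1, "name": "Ada", "email": "ada@example.com"})
write_raw(store, {"id": 2, "name": "Grace", "tags": ["navy", "compiler"]})

# Two readers apply two different schemas to the same stored data.
rows = list(read_with_schema(store, ("id", "name")))
tagged = list(read_with_schema(store, ("id", "tags")))
```

Passing the second record to write_with_schema would raise a ValueError, because it has no email and carries an unexpected tags field. The schema-on-read store accepts it as is and leaves each reader to decide which fields matter.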
The search-driven approach I describe in the Gartner document The New NoSQL: How Enterprise Search and Distributed Computing Bring Big Data Within Reach offers several additional advantages that are unlikely to be a standard part of pure-play NoSQL solutions. An enterprise search platform offers text-oriented features that simplify the index generation process. Standard features typically include free text search, faceting, spell checking, vocabulary management, similar item search, hit highlighting, recommendation engines, visualizations, content rating and many other capabilities that augment and enhance an analytical index. The combination of NoSQL capabilities with the near-real-time information access afforded by enterprise search, together with the ability to deliver both in the cloud, has the potential to unlock the full value of enterprise information assets and finally bring Big Data within reach.