Another post from the DBMS Curmudgeon
On December 22, 2014, I published a blog post about the terms Database vs. Database Management System (DBMS) (A Database By Any Other Name…) in an effort to correctly define these terms, which are widely misused by practitioners and vendors alike. There seems to be a need in the IT industry (computers) to create new terms and concepts, regardless of the validity or usefulness of the term. At times, we do find value in it. Take, for example, “Cloud Computing”: in the ’60s and ’70s, we had a term, “Time-Sharing”, that was very much like “cloud computing”. Instead of a new term, why not call it “Time-Sharing 2.0”? Because it does not have the marketing power desired by vendors in the 21st century. I predict that the term “Cloud Computing” will actually disappear over time as it becomes so pervasive in the industry that it becomes simply computing infrastructure. We also saw this happen with Client-Server: today it is simply application architecture, pervasive throughout.
Today, I was reminded of my post and another called “Big Data – One of the Worst Terms Ever in the IT Industry”, which I did not publish. In a really good article on B2C by Telmo Silva, “Forget about Big Data. Think Smart Data”, Silva states: “Big Data, Data Lakes and NoSQL have a place in the technology stack and in certain sectors or departments such as scientific research, high volume small transactions such as telecommunication and gaming metrics and web and log activity”. All three terms, Big Data, Data Lake and NoSQL, really bother me, for different reasons. Further, Big Data and NoSQL are ill-defined, meaningless terms that, in my opinion, never should have been used and no longer have value to the IT industry. Data Lake is the exception. I believe it is a term that has value when defined properly and understood; but more about that in a minute. Let’s examine each of these terms, what meaning they really have and what they should be called today.
As far back as 2001, a new term, “Big Data”, was beginning to emerge. Doug Laney, a Gartner, Inc. analyst, defined the “3V’s” of Big Data as Volume, Velocity and Variety. In his report “3D Data Management: Controlling Data Volume, Velocity, and Variety”, written while at Meta Group (later acquired by Gartner, Inc.), he discusses the challenges of managing data with these characteristics. Note: this was almost 15 years ago! And guess what? Doug never called this “Big Data” in his report. Why? Because he was discussing different types of data, but still just data!
So, why the term Big Data? In short: vendor marketing hype. If everyone in the industry believes that this is something new and important, everyone will want to buy it! We see this phenomenon every day in our inquiries at Gartner. Questions such as: What is Big Data? Why should I care about Big Data? What is the business value behind Big Data? And my favorite: How big is the Big Data market? The vendors have succeeded in putting Big Data at the forefront for IT management.
Today, the term Big Data is losing steam, fast, simply because every vendor has “Big Data products” and there is therefore less differentiation among the vendors. The term is dying out, becoming just Data, and we must let this happen. We must pay attention to managing data properly, with an information management strategy that is flexible for the future. This will certainly include some of the new tools used to manage varying types and sizes of data, such as Apache Hadoop and new DBMS products.
Now we wait for Big Data 2.0 – oh silly me, it is already there (>47k hits on Google).
Big Data is Dead, Long Live Data!
The term NoSQL DBMS started out as truly meaning no SQL and then was redefined to mean not only SQL. Most important is that SQL stands for “Structured Query Language” and is a language used to access data in a database. Originally used to access data in a relational database, it is now used to access data in many different types of databases and with many different types of data. A DBMS is built upon a file system or storage engine, of which there are many. MySQL, an open-source relational DBMS, has supported multiple storage engines since its inception in 1994, when it used ISAM as the storage engine. I would suggest that the storage engine used for a DBMS has nothing to do with the language used to access data. Finally, why would anyone define a term that implies it does not use the language that is used by most developers? The term NoSQL, in my opinion, is meaningless and useless. You will even find a website with a list of “NoSQL” products (> 255); many use the full SQL language or some form of SQL, and the list even includes several relational DBMS products.
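To make the point about storage engines concrete, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration: the `StorageEngine` interface, the two toy engines and the `select_names_older_than` function (a stand-in for a SQL statement) do not correspond to any real product. It only shows that one access layer can sit unchanged on top of two very different storage layouts.

```python
from typing import Dict, Iterable, List, Protocol, Tuple

Row = Tuple[str, int]  # (name, age): a hypothetical two-column record


class StorageEngine(Protocol):
    """What the access layer needs from any engine (hypothetical interface)."""

    def insert(self, row: Row) -> None: ...
    def scan(self) -> Iterable[Row]: ...


class HeapEngine:
    """Toy engine that keeps rows in arrival order in a plain list."""

    def __init__(self) -> None:
        self._rows: List[Row] = []

    def insert(self, row: Row) -> None:
        self._rows.append(row)

    def scan(self) -> Iterable[Row]:
        return iter(self._rows)


class KeyedEngine:
    """Toy engine that keeps rows in a dict keyed by name, a different layout."""

    def __init__(self) -> None:
        self._by_name: Dict[str, int] = {}

    def insert(self, row: Row) -> None:
        name, age = row
        self._by_name[name] = age

    def scan(self) -> Iterable[Row]:
        return iter(self._by_name.items())


def select_names_older_than(engine: StorageEngine, min_age: int) -> List[str]:
    """Stand-in for: SELECT name FROM people WHERE age > :min_age ORDER BY name."""
    return sorted(name for name, age in engine.scan() if age > min_age)


def load(engine: StorageEngine) -> StorageEngine:
    """Insert the same sample rows into whichever engine is passed in."""
    for row in [("Ada", 36), ("Edgar", 45), ("Grace", 29)]:
        engine.insert(row)
    return engine


# The same "query" runs against both engines without modification.
heap_result = select_names_older_than(load(HeapEngine()), 30)
keyed_result = select_names_older_than(load(KeyedEngine()), 30)
print(heap_result == keyed_result)
```

The query function never learns which engine it is talking to, which is the sense in which the access language and the storage engine are separate concerns.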
So what is the alternative terminology we should use? There are many different database models. We have long used the term pre-relational to refer to DBMS products that existed before relational, such as the mainframe products Adabas, CA-IDMS, IMS and Model 204. The relational model was first defined in 1969. As the term NoSQL refers to a language, the better term for these DBMSs would be nonrelational. Many of these support multiple access languages, SQL being just one of them. This would consolidate all the new database models, such as Wide-column, Key-value, Document and Graph, into one: nonrelational. It then yields only three terms for the DBMS products of today: pre-relational, relational and nonrelational. These terms are actually descriptive of the type of DBMS engine, regardless of the access language. We explain this further in “State of the Operational DBMS Market, 2017”.
The term Data Lake was originally coined by James Dixon of Pentaho to distinguish it from a Data Mart. Today, Gartner defines a Data Lake as a concept that includes a collection of storage instances of various data assets (see “Hype Cycle for Data Management, 2017”). These assets are stored in a near-exact, or even exact, copy of the source format, and are in addition to the originating data stores. Additionally, Data Lake has emerged as an alternative to traditional analytics data management concepts, such as data marts and data warehouses. Many practitioners have come to believe that Data Lakes must be implemented with Apache Hadoop. This is simply not true. A Data Lake can be used to store many different types of data, both curated data (governed with a high level of quality) and raw, un-curated data that may or may not have future value to the organization. The underlying DBMS can be either relational or nonrelational.
The term Data Lake is, at best, OK. I do not feel it really describes what the data lake is used for today and will be used for in the future. In fact, I believe that, in the future, the Data Lake will replace the Data Warehouse. Unfortunately, we do not have a better term. Until such time as we do, I am willing to let this one go and will continue to use the term Data Lake.
In summary, Big Data is just Data, NoSQL is Nonrelational, and Data Lake remains.