by Donald Feinberg | October 9, 2014 | 2 Comments
For the past (many) years, I have waged the battle over the use of the terms “structured” versus “unstructured” in data management. I have tried every logical argument and many other terms to describe unstructured data, as have many of my colleagues at Gartner and throughout the industry. I have even used the phrase “the ‘U’ word” for unstructured data to imply that it is similar to the Seven Dirty Words, a routine from one of my favorite comedians, George Carlin. Regardless of how often some of us have tried, the term “unstructured” continues to be widely used to describe all the data that cannot be simply described as relational data. For some, it is XML or text. For others, it covers the whole spectrum from XML to voice and video, including e-mail and SMS (sometimes referred to as noise data). In simple terms, it is everything we refer to as “the other stuff” that we would store in files or a database.
According to Wikipedia, “unstructured data (or unstructured information) refers to (usually) computerized information that either does not have a data model or has one that is not easily usable by a computer program. The term distinguishes such information from data stored in fielded form in databases or annotated (semantically tagged) in documents.” Where I have a problem is that XML does have a data model (see XML Schema). In addition, a JPEG (Joint Photographic Experts Group), TIFF (Tagged Image File Format) or other image file is easily usable in a computer program – for example, in Adobe Photoshop. Wikipedia even says, “The term [unstructured] is imprecise for several reasons …” This has always been my basis for not using it – it is imprecise and has no formal definition of the type of data to which it refers.
So why do we use “unstructured” to describe all the data that does not fit nicely into a data model? Because it has become generally accepted throughout the industry. When someone uses the phrase “unstructured data”, everyone understands that we are describing data that is not a column of numbers, characters or dates. In reality, data fits in a continuum that runs from structured to unstructured, from relational numbers, dates and characters through XML to unstructured data such as voice, video and e-mail. Some data is more structured than other data.
Therefore, I give up. Some battles are simply not worth fighting. I am finished fighting this battle. The old proverb “If you can't beat them, join them” wins. I will now use “unstructured” to describe all the “other stuff” that is not structured. Of course, now we call this Big Data – Oops, let’s not go there (at least today).
Many thanks to my friend and Gartner colleague, Cássio Dreyfuss, for his help with my Portuguese.
Category: Analyst Banco de Dados Big Data Data Management DBMS Tags: Banco de Dados, Database Management System, Structured Data, Unstructured Data, XML
by Donald Feinberg | October 4, 2014 | 11 Comments
Another post from the DBMS Curmudgeon
For the past (many) years, I have waged the battle over the use of Structured vs. Unstructured in data management. I have tried every logical argument and many other terms to describe Unstructured Data, as have many of my colleagues at Gartner and throughout the industry. I have even used the phrase “The ‘U’ Word” for Unstructured to imply it is similar to the Seven Dirty Words, a routine from one of my favorite comedians, George Carlin. Regardless of how often some of us have tried, the word Unstructured continues to be used widely to describe all the data that cannot be simply described as Relational Data. For some it is XML or text data. For others, it covers the spectrum from XML to Voice and Video, including e-mail and SMS (sometimes referred to as noise data). In simple terms, it is all the “other stuff” we would store in files or a database.
According to Wikipedia, “Unstructured Data (or unstructured information) refers to (usually) computerized information that either does not have a data model or has one that is not easily usable by a computer program. The term distinguishes such information from data stored in fielded form in databases or annotated (semantically tagged) in documents.” Where I have a problem is that XML does have a data model (see XML Schema). In addition, a JPEG (Joint Photographic Experts Group), TIFF (Tagged Image File Format) or other image file is easily usable in a computer program – for example, in Adobe Photoshop. Wikipedia even says, “The term is imprecise for several reasons…” This has always been the underlying basis for my argument against using it – it is imprecise, with no formal definition of the type of data to which it refers.
So why do we use Unstructured to describe all of the data that does not fit nicely into a data model? Because it has become generally accepted throughout the industry. When one uses the words Unstructured Data, everyone understands that we are describing data that is not a column of numbers, characters or dates. In reality, data fits in a continuum from Structured to Unstructured, from relational numbers, dates and characters through XML to unstructured data such as voice, video and e-mail. Some data is more structured than other data.
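The continuum described above can be made concrete with a small Python sketch. The sample order, its field names and the e-mail text are invented for illustration; the point is only that a relational-style row and an XML fragment can both be read by field name, while free text has to be interpreted first.

```python
import xml.etree.ElementTree as ET

# Structured: fixed, named columns (hypothetical sample row).
row = {"customer_id": 42, "order_date": "2014-10-04", "amount": 19.99}

# In the middle of the continuum: XML carries its own labelled hierarchy.
doc = ET.fromstring(
    "<order id='42'><date>2014-10-04</date><amount>19.99</amount></order>"
)

# Far end of the continuum: free text with no labelled fields at all.
email_body = "Hi, order 42 arrived on Oct 4 and cost me about twenty dollars."


def xml_amount(order):
    """Read a field by element name; no text interpretation is needed."""
    return float(order.findtext("amount"))


print(row["amount"])           # looked up by column name
print(xml_amount(doc))         # looked up by element name
print("amount" in email_body)  # the free text has no such label
```

The XML fragment here is not validated against an XML Schema; that would need a schema-aware library, which this sketch deliberately leaves out.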
Therefore, I give up. Some battles are simply not worth having. I am finished fighting this battle. The age-old proverb “If you can’t beat ‘em, join ‘em” wins. I will now use Unstructured to describe all the “other stuff” that is not Structured. Of course, now we call this Big Data – Oops, let’s not go there (at least today).
Category: Data Management DBMS General Tags: Data Management, Database Management System, RDBMS, Relational DBMS, Structured Data, Unstructured Data, XML
by Donald Feinberg | September 28, 2014 | 3 Comments
Recently, we published a Market Guide for In-Memory Computing. The document covers all forms of IMC, including Database Management Systems (DBMS). Gartner defines In-Memory Computing (IMC) as a computing style where applications assume all the data required for processing is located in the main memory of their computing environment. Although we define many styles of IMC (Application Servers, Data Grids, Messaging and Complex Event Processing), I want to concentrate specifically on DBMS technology in-memory. Why? There appears to be some level of misconception about what does and does not qualify as an In-Memory DBMS (IMDBMS).
Our definition of IMDBMS requires the database structure to be in-memory, specifically the main memory of the server. Data in the database is accessed through memory-access instructions, not I/O instructions. This should not be confused with products that buffer data in a disk-block cache. Disk-block caching has been used in the industry for many years, pre-dating relational technology. For example, IBM’s IMS DBMS was, from its introduction in 1968, able to cache data in memory, also referred to as pre-fetch or read-ahead; however, it is not an IMDBMS. While we agree that caching improves performance over accessing disk or flash, it is not IMC.
One major difference between traditional disk-based DBMS engines and IMDBMS is the implementation of the consistency model. IMDBMS covers all DBMS consistency models from ACID consistency to eventually consistent models, the latter found in many of the NoSQL DBMS engines. However, regardless of the consistency model, a commit operation will be performed. Disk-based systems, even if all the data is cached in memory buffers, require the transaction to be written to disk or flash. Regardless of the length of time taken to perform this operation, it is greater than zero. With IMDBMS products, the commit operation takes place in memory. Because memory is volatile, this requires unique methods for assuring the persistence of the data, such as synchronous writing of data to a second server using Remote Direct Memory Access (RDMA); even so, the latency is less than that of writing to external media. This illustrates why the performance of IMDBMS is higher, even compared with using a disk-block buffer.
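The difference between the two commit paths can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any product works: the class names are hypothetical, a plain dictionary stands in for the second server's memory, and no RDMA or networking is involved.

```python
import os
import tempfile


class DiskCommitStore:
    """Toy store: data is cached in memory, but commit forces a log write."""

    def __init__(self, log_path):
        self.cache = {}
        self.log = open(log_path, "ab")

    def commit(self, key, value):
        self.log.write(f"{key}={value}\n".encode())
        self.log.flush()
        os.fsync(self.log.fileno())  # I/O path: wait for the storage device
        self.cache[key] = value

    def close(self):
        self.log.close()


class MemoryCommitStore:
    """Toy store: commit touches only memory, locally and on a stand-in replica."""

    def __init__(self):
        self.data = {}
        self.replica = {}  # stands in for a second server's memory

    def commit(self, key, value):
        self.replica[key] = value  # a real system would copy this over the network
        self.data[key] = value


log_path = os.path.join(tempfile.mkdtemp(), "commit.log")
disk_store = DiskCommitStore(log_path)
disk_store.commit("a", 1)
disk_store.close()

mem_store = MemoryCommitStore()
mem_store.commit("a", 1)
```

Even with every row already in `cache`, the first store cannot finish a commit until `os.fsync` returns, which is the non-zero cost described above; the second store never leaves memory.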
With our precise definition of true IMDBMS, we seek to dissipate the hype in the market over IMDBMS and claims made by some vendors that their technology is IMDBMS when, in fact, it is not.
Category: Analyst Data Management DBMS General In-Memory Computing In-Memory DBMS Operational DBMS Tags: ACID, Database Management System, IMDBMS, In-Memory, Online Transaction Processing, Operational DBMS, RDBMS, Relational DBMS
by Donald Feinberg | June 15, 2009 | 1 Comment
On June 9, Google Labs announced Google Fusion Tables, a new system for managing data in the Google cloud. I want to be clear about one point – this is an experiment from Google Research that is not yet ready for production systems (Google is also clear about this). The issue I have is how the press exaggerates the announcement by warning the Database Management System (DBMS) vendors to watch out as they are being blindsided by Google. You must be kidding!
First, what is Fusion Tables? It is a system for managing data in the cloud for collaboration with data from disparate sources in a simple way, including the ability to “drill-down” to the sources of the data. It allows the user to “join” (in a loose definition) data without the constraints of the data model, normally found in a relational DBMS. What it is not is a DBMS to manage data for an On-Line Transaction Processing (OLTP) system or a Data Warehouse. Fusion Tables is based on Data Spaces, defined in Wikipedia as “a container for domain specific data” and further “A Data Space system is a multi-model data management system that manages data sourced from a variety of local or external sources”. Data Spaces were originally defined in the early 1990’s during the Object Oriented DBMS (OODBMS) era.
As with many new ideas, there are elements of the technology that may have value. When this happens, we find that the original relational model evolves to incorporate the new technology or model. We saw this occur with OODBMS – the modern DBMS does use inheritance and user-defined classes. We saw this happen with XML – now the modern DBMS has full native XML as a data type, as robust as the original pure-play XML DBMSs. Today we are seeing this happen with MapReduce, as several DBMS vendors have incorporated it into their DBMS engines. We will also see this happen with the column-store construct, which we believe will be incorporated into many modern DBMS engines as an indexing technique for optimization. As to the validity of Fusion Tables and the ability to mix disparate data sources and types, there is little question as to the usefulness of this. Oracle has already put a capability in its current release (11g) as SecureFiles, and Microsoft has a feature in SQL Server 2008 called FILESTREAM. These are not experimental or beta test features but are implemented in full production.
Is Fusion Tables worth watching? Of course! The concept of easily combining disparate sources of data for analysis and collaboration is important and has been around since the inception of IT. Mashups and other Web 2.0 constructs have made some of this available today (see The Rise of Collaborative Decision Making). Google has a good start on this with the ability to use data from Google Apps and other spreadsheet-style data with the initial version of Fusion Tables. Organizations must take care, or these types of applications will cause additional turmoil in the governance and security space (see Developing a Strategy for Dealing With Desktop Database Management System Proliferation). Will this technology replace your DBMS for OLTP and DW systems? Not soon, and not in the foreseeable future. Many have tried (e.g., OODBMS). There are other new techniques and systems being researched today that have promise (e.g., Akiba); however, the relational model continues to demonstrate flexibility and resiliency (over 30 years) and you can expect that to continue. Products like DB2, Informix, Ingres, MySQL, Oracle, PostgreSQL, SQL Server and Sybase ASE will be used in new IT systems for many years to come.
Category: DBMS Tags: Akiba, Collaborative Decision Making, Data Spaces, Data Warehouse, Database Management System, DBMS, Desktop DBMS, DW, FILESTREAM, Fusion Tables, Google, Google Labs, MapReduce, Object Oriented Database Management System, Online Transaction Processing, OODBMS, OPTP, Oracle, RDBMS, Relational DBMS, SecureFiles, SQL Server, SQL Server 2008, XML
by Donald Feinberg | April 17, 2009 | Submit a Comment
The Merriam-Webster dictionary defines curmudgeon as “Archaic”. That’s me – sometimes. Many of the open source bloggers might agree. In years past, they all thought I was a curmudgeon. Go ahead, Tony – laugh. But I can come around – although I still believe that companies like to make money and developers like a paycheck. When it comes to blogging – that is not me – or so I thought. I do not read blogs and this is my first attempt at writing one. Funny that two years ago, I won an award in our group for being mentioned more than anyone else in blogs that year! And this, when I never read blogs or comment on them. So the curmudgeon changes again and here I am with my own blog.
Category: Analyst General Tags: Curmudgeon