Gartner Blog Network

Databases and Machine Learning Coalesce

by Sanjeev Mohan  |  September 14, 2020

The boundaries between data management and advanced analytics are blurring fast. Databases are adding the ability to build, train, and validate machine learning models right where the data sits: inside the database or data warehouse. When ML operations and data preparation live in separate environments, the round trip for investigative analytics is long and ponderous. The traditional approach incurs high latency at big-data scale because it relies on moving (and reformatting) data from the DBMS to the ML/analytics environment.

An ever-growing list of databases embeds ML functions, and notebooks backed by built-in Apache Spark, directly in the database. The notebooks let SQL queries coexist with training code written in languages such as R and Python. Teradata Vantage, the venerable data warehouse, offers over 200 prepackaged ML functions and algorithms, spanning data preparation, statistics (linear regression, k-means, random forest, naïve Bayes, SVM, XGBoost, etc.), sentiment analysis, and graph and time-series analysis.
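In-database functions of this kind are typically invoked straight from SQL, with the training data passed in as a table. As an illustrative sketch only — the table, column, and parameter names here are hypothetical, and the exact operator syntax varies by product and version — a Vantage-style k-means call might look like:

```sql
-- Hypothetical sketch of an in-database clustering call.
-- Names are illustrative, not a verbatim vendor API.
SELECT *
FROM TD_KMeans (
  ON customer_features AS InputTable
  USING
    IdColumn ('customer_id')
    TargetColumns ('recency', 'frequency', 'monetary')
    NumClusters (4)
) AS clusters;
```

The point is that the data never leaves the warehouse: the function runs where the rows live, and the result is just another relation that downstream SQL can join or filter.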

Oracle Autonomous Database with AutoML, Google Cloud BigQuery ML, Microsoft SQL Server and Cosmos DB with Spark integration, and Pivotal Greenplum with the GPU-accelerated open-source machine learning library Apache MADlib are only a few examples. Amazon Aurora's integration with Amazon SageMaker and Amazon Comprehend lets a single-line SQL query run NLP, such as sentiment analysis, over free-form text data like tweets.
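To make the SQL-only workflow concrete, here is a hedged sketch in the shape of BigQuery ML, where both training and scoring are plain SQL statements (dataset, table, and column names are hypothetical):

```sql
-- Sketch: train a model without leaving the warehouse (BigQuery ML style).
-- Dataset/table/column names are hypothetical.
CREATE OR REPLACE MODEL mydataset.churn_model
OPTIONS (model_type = 'logistic_reg', input_label_cols = ['churned']) AS
SELECT tenure_months, monthly_charges, churned
FROM mydataset.customers;

-- Score new rows with the same SQL surface.
SELECT customer_id, predicted_churned
FROM ML.PREDICT(MODEL mydataset.churn_model,
                (SELECT customer_id, tenure_months, monthly_charges
                 FROM mydataset.new_customers));
```

The Aurora case is similar in spirit: a sentiment call such as `SELECT aws_comprehend.detect_sentiment(tweet_text, 'en') FROM tweets;` pushes each row's text to Comprehend from inside a single SQL statement (column and table names again hypothetical).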

Azure’s Synapse Analytics platform creates models in Python in the Open Neural Network Exchange (ONNX) format. The model is stored in a table as a binary object and can be invoked from SQL. For example, a model can be trained on the New York City taxi dataset using features such as passenger count and trip distance, then applied to the full taxi table to predict, say, the tip amount. The model is continually retrained as the table is updated every few minutes with new trip data.
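Scoring a stored ONNX model from T-SQL uses the `PREDICT` function, roughly as below. This is a sketch under assumed names — the `dbo.models` table, its columns, and the taxi schema are hypothetical, though the `PREDICT(... RUNTIME = ONNX)` shape follows the documented T-SQL surface:

```sql
-- Sketch: apply an ONNX model, stored as a binary object in a table,
-- to rows of the taxi table. Table and column names are hypothetical.
SELECT d.passenger_count, d.trip_distance, p.predicted_tip
FROM PREDICT(
       MODEL = (SELECT model FROM dbo.models
                WHERE model_name = 'taxi_tip_onnx'),
       DATA = dbo.nyc_taxi AS d,
       RUNTIME = ONNX)
WITH (predicted_tip FLOAT) AS p;
```

Because the model lives in a table, retraining amounts to overwriting that binary row; the scoring query itself does not change.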

Splice Machine, a scale-out HTAP (row- and column-store) database built on Apache Spark over a transactional key-value store, takes this a step further. Because it is natively built on Spark, it integrates readily with MLflow, Jupyter, H2O, TensorFlow, Keras, and scikit-learn, and it provides complete ML lifecycle capabilities, including some MLOps features. For example, through the MLflow UI it lets multiple versions of a model be compared simply by changing the algorithm or the features; each model is stored as a binary object with a unique run ID. When the time comes to deploy, a single line of code has Splice Machine automatically create the necessary table (with feature and prediction columns) and the trigger that runs the model every time new data arrives. Splice Machine can also ‘time travel’ to a particular run ID and recover the state of the data used to train the model at that time, which aids reproducibility.
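To illustrate what that one-line deployment generates, here is a hypothetical sketch of the resulting artifacts: a prediction table mirroring the feature columns, plus a trigger that scores each new row on arrival. The names and the `run_model` scoring function are invented for illustration, not Splice Machine's actual generated DDL:

```sql
-- Hypothetical sketch of auto-generated deployment artifacts.
CREATE TABLE taxi_predictions (
  trip_id         BIGINT PRIMARY KEY,
  passenger_count INT,
  trip_distance   DOUBLE,
  predicted_tip   DOUBLE   -- prediction column added by deployment
);

-- Trigger so that every insert into the source table is scored immediately.
CREATE TRIGGER score_new_trips
AFTER INSERT ON taxi_trips
REFERENCING NEW AS n
FOR EACH ROW
  INSERT INTO taxi_predictions
  SELECT n.trip_id, n.passenger_count, n.trip_distance,
         run_model('tip_model', n.passenger_count, n.trip_distance);
```

The trigger is what removes the round trip: prediction becomes a side effect of the transactional write rather than a separate batch job.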

One frequently cited drawback is overloading operational databases with resource-intensive ML training. Splice Machine addresses this by giving data scientists on-demand, dedicated CPUs and memory for their Jupyter notebook instances, along with dedicated Spark executors managed by Kubernetes.

The ability to execute ML functions in-database using SQL is not new, but it will become table stakes in the near future. We also expect some MLOps functionality to move into more databases over time. This will reduce integration cost and complexity and, by reducing the number of copies of data, help with data governance.



Tags: ai  aurora  aws  azure  bigquery  bqml  data-science  gcp  h2o  kubernetes  microsoft  ml  mlflow  nlp  onnx  oracle  sagemaker  splice-machine  sql-server  synapse  teradata

Sanjeev Mohan
Research Vice President
4 years at Gartner
31 years IT Industry

Sanjeev Mohan researches data management strategies within the Gartner for Technical Professionals (GTP) group. Mr. Mohan covers big data frameworks and related technologies. He provides insights on cloud data management, AI and ML, data storage including Hadoop, NoSQL, data integration, data governance and data transformation strategies. Read Full Bio


Comments or opinions expressed on this blog are those of the individual contributors only, and do not necessarily represent the views of Gartner, Inc. or its management. Readers may copy and redistribute blog postings on other blogs, or otherwise for private, non-commercial or journalistic purposes, with attribution to Gartner. This content may not be used for any other purposes in any other formats or media. The content on this blog is provided on an "as-is" basis. Gartner shall not be liable for any damages whatsoever arising out of the content or use of this blog.