Gartner Blog Network

Robust Machine Learning needs Data and a Lot More

by Sumit Agarwal  |  May 27, 2020

“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”

Sir Arthur Conan Doyle, Sherlock Holmes

My aha moment came when I was working on a proof of concept using the MNIST dataset. The dataset includes a curated set of handwritten digits from 0 to 9. The trained model had a 99% accuracy score. I wanted to see if it would recognize my handwriting. I have decent handwriting, at least when it comes to numbers. I wrote a number on a Post-it note, took a picture, resized it to 28 x 28 pixels, and passed it to the model to predict. It was a pretty decent “3”, but the model didn’t recognize it. I tried a different number with the same result. A little bit of troubleshooting made me realize that MNIST digits are black and white, while my data was green and red. No wonder it didn’t work. I also realized that I had a gap between my training data and real production data.
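To make the mismatch concrete, here is a minimal sketch of the kind of preprocessing that could bring a colored photo closer to the MNIST format (one grayscale channel, light digit on a dark background, values between 0 and 1). It is not the code used in the proof of concept: the function name `to_mnist_style` is hypothetical, the image is assumed to be already resized and supplied as rows of (r, g, b) tuples, and the background is assumed to cover most of the picture.

```python
from statistics import median


def to_mnist_style(rgb_rows):
    """Convert rows of (r, g, b) pixels (0-255) to MNIST-style floats in [0, 1].

    Hypothetical helper: assumes the image is already resized and that the
    background covers more than half of the pixels.
    """
    # Collapse the three color channels into one luminance value per pixel
    # (standard BT.601 weights).
    gray = [[0.299 * r + 0.587 * g + 0.114 * b for r, g, b in row]
            for row in rgb_rows]

    flat = [v for row in gray for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A uniform image contains no digit; return an all-dark canvas.
        return [[0.0 for _ in row] for row in gray]

    # The median approximates the background. If it sits nearer the bright
    # end, the digit is dark on light and must be inverted to match MNIST.
    background = median(flat)
    invert = (hi - background) < (background - lo)

    def scale(v):
        s = (v - lo) / (hi - lo)  # stretch contrast to the full 0..1 range
        return 1.0 - s if invert else s

    return [[scale(v) for v in row] for row in gray]


# Tiny example: a red "ink" pixel on a green note.
GREEN, RED = (0, 255, 0), (255, 0, 0)
note = [[GREEN, GREEN, GREEN],
        [GREEN, RED, GREEN],
        [GREEN, GREEN, GREEN]]
print(to_mnist_style(note)[1])  # background becomes 0.0, ink becomes 1.0
```

A real pipeline would also need to handle uneven lighting, centering and stroke thickness, which this sketch ignores.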

I wasn’t the only one who made such a mistake. There have been several instances in the recent past where the gap between training data and production model execution caused a failure (see “Google’s medical AI was super accurate in a lab. Real life was a different story” and “It’s 2020. Where are our self-driving cars?”).
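One simple way to notice such a gap before it causes a failure is to compare summary statistics of a feature in the training data with the same feature in production. The sketch below is only an illustration under stated assumptions: the function name `mean_shift_in_std` and the sample values are made up, and a production system would typically use a more thorough drift test across many features.

```python
from statistics import mean, pstdev


def mean_shift_in_std(train_values, prod_values):
    """Return how far the production mean has moved from the training mean,
    measured in training standard deviations (hypothetical helper)."""
    spread = pstdev(train_values)
    gap = abs(mean(prod_values) - mean(train_values))
    if spread == 0:
        # Training data was constant: any movement at all is a shift.
        return 0.0 if gap == 0 else float("inf")
    return gap / spread


# Illustrative values only: average pixel brightness per image.
train_brightness = [0.0, 0.0, 1.0, 1.0]
prod_brightness = [1.0, 1.0, 2.0, 2.0]
shift = mean_shift_in_std(train_brightness, prod_brightness)
print(f"production mean is {shift:.1f} training std devs away")
```

What counts as a worrying shift depends on the feature and the model, so any alert threshold is a judgment call rather than a fixed rule.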

Figure: Conceptual Model for ML Model Training

Often, machine learning discussions focus on algorithm selection.

  • Is it better to use XGBoost or RandomForest?
  • Would a deep learning model provide better accuracy?

There are additional questions, all extremely important, for a successful model development process:

  • I have 10 years of historical data. Should I use all the data for the ML model?
  • How much historical data is required for model training?
  • How do I reduce or remove bias from my model?
  • How frequently should I retrain the ML model?
  • How do I prevent overfitting or underfitting of the model?
  • How do I ensure that the model is using appropriate features?
  • How do I explain the model results, both to business users and auditors?

While algorithms provide the tools, it’s the data that is fundamental to the model. A successful machine learning model training process requires data, clear business objectives and often more than one data science experiment. The process output, a trained model, needs to be evaluated using several metrics and supported by transparency (fairness, explainability and privacy). A robust machine learning model training process is a synthesis of all the inputs and a thorough evaluation of the output.
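To illustrate why a single metric is rarely enough, the following sketch computes accuracy, precision, recall and F1 for a binary classifier. The function name `classification_metrics` and the example labels are assumptions made for illustration; they do not come from the Gartner framework.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Compute accuracy, precision, recall and F1 for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    total = len(pairs)
    accuracy = (tp + tn) / total if total else 0.0
    # Guard against division by zero when a class is never predicted or absent.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Made-up imbalanced example: 95 negatives, 5 positives, and a model that
# always predicts the negative class.
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100
print(classification_metrics(y_true, y_pred))
```

In this made-up case the accuracy is 0.95 even though the model never finds a single positive example (recall is 0.0), which is why evaluating with several metrics matters.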

The Gartner research document “Machine Learning Training Essentials and Best Practices” provides a six-step framework, with data selection as the first step, to ensure a robust, production-quality machine learning solution. The framework steps through data selection, model learning, model testing, model tuning, versioning and retraining, with transparency, fairness and privacy as key factors.



Sumit Agarwal
Sr Director Analyst
1 year at Gartner
24 years IT Industry

Sumit Agarwal provides guidance on Artificial Intelligence (AI), Machine Learning (ML), Data Science Architectures, Data Management and Data Integration architecture and strategies, based on upcoming ideas, current trends, and past project implementations.


Comments or opinions expressed on this blog are those of the individual contributors only, and do not necessarily represent the views of Gartner, Inc. or its management. Readers may copy and redistribute blog postings on other blogs, or otherwise for private, non-commercial or journalistic purposes, with attribution to Gartner. This content may not be used for any other purposes in any other formats or media. The content on this blog is provided on an "as-is" basis. Gartner shall not be liable for any damages whatsoever arising out of the content or use of this blog.