“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”
Sir Arthur Conan Doyle, Sherlock Holmes
My aha moment came when I was working on a proof of concept using the MNIST dataset. The dataset includes a curated set of handwritten digits from 0 to 9. The trained model had a 99% accuracy score. I wanted to see if it would recognize my handwriting. I have decent handwriting, at least when it comes to numbers. I wrote a number on a Post-it note, took a picture, resized it to 28 x 28 pixels, and passed it to the model to predict. It was a pretty decent “3”, but the model didn’t recognize it. I tried a different number with the same result. A little troubleshooting made me realize that MNIST digits are black and white, while my data was green and red. No wonder it didn’t work. I also realized that I had a gap between my training data and real production data.
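To make that gap concrete, here is a minimal sketch of the kind of preprocessing that could bring a color photo closer to what an MNIST-trained model expects. It is not the code from my proof of concept: the function name `to_mnist_style`, the synthetic green-and-red image, and the "invert if mostly bright" rule are all assumptions for illustration, and it uses plain Python lists so it runs without any libraries. Real MNIST digits are also centered and size-normalized, which this sketch does not attempt.

```python
def to_mnist_style(rgb_image, size=28):
    """Convert an RGB image (rows of (r, g, b) tuples, values 0-255) into a
    size x size grid of floats in [0, 1] with a bright digit on a dark
    background. Hypothetical helper, for illustration only."""
    height, width = len(rgb_image), len(rgb_image[0])

    # 1. Collapse color to one luminance channel (ITU-R BT.601 weights).
    gray = [[0.299 * r + 0.587 * g + 0.114 * b for (r, g, b) in row]
            for row in rgb_image]

    # 2. Nearest-neighbor resize to size x size.
    resized = [[gray[y * height // size][x * width // size]
                for x in range(size)]
               for y in range(size)]

    # 3. Scale pixel values from 0-255 to 0-1.
    scaled = [[v / 255.0 for v in row] for row in resized]

    # 4. MNIST strokes are bright on a dark background. If the picture is
    #    mostly bright (ink on light paper), invert it. This rule is an
    #    assumption, not a general-purpose test.
    mean = sum(sum(row) for row in scaled) / (size * size)
    if mean > 0.5:
        scaled = [[1.0 - v for v in row] for row in scaled]
    return scaled


# Synthetic 56 x 56 "photo": green paper with a red vertical stroke.
GREEN, RED = (120, 200, 120), (200, 30, 30)
photo = [[RED if 24 <= x < 32 else GREEN for x in range(56)]
         for y in range(56)]

digit = to_mnist_style(photo)
stroke_pixel = digit[14][13]      # falls inside the red stroke
background_pixel = digit[14][0]   # falls on the green paper
```

After the conversion, the stroke is brighter than the background, which matches the MNIST convention. Whether this is enough for a given model depends on how that model's training data was prepared.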
I wasn’t the only one to make such a mistake. There have been several instances in the recent past where a gap between training data and production model execution caused a failure (see “Google’s medical AI was super accurate in a lab. Real life was a different story” and “It’s 2020. Where are our self-driving cars?”).
Often, machine learning discussions focus on algorithm selection.
- Is it better to use XGBoost or RandomForest?
- Would a deep learning model provide better accuracy?
There are additional questions, all extremely important, for a successful model development process:
- I have 10 years of historical data. Should I use all the data for the ML model?
- How much historical data is required for model training?
- How do I reduce or remove bias from my model?
- How frequently should I retrain the ML model?
- How do I prevent overfitting or underfitting of the model?
- How do I ensure that the model is using appropriate features?
- How do I explain the model results, both to business users and auditors?
While algorithms provide the tools, it’s the data that is fundamental to the model. A successful machine learning model training process requires data, clear business objectives, and often more than one data science experiment. The process output, a trained model, needs to be evaluated using several metrics and supported by transparency (fairness, explainability, and privacy). A robust machine learning model training process is a synthesis of all the inputs and a thorough evaluation of the output.
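As a small illustration of why a single metric is rarely enough, the sketch below computes accuracy, precision, recall, and F1 for a binary classifier. The function name `classification_metrics` and the toy labels are assumptions made up for this example. They are not taken from any Gartner framework or real project.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return accuracy, precision, recall and F1 for binary labels.
    Ratios with a zero denominator are reported as 0.0 (a convention
    chosen for this sketch)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Toy imbalanced data: nine negatives, one positive.
y_true = [0] * 9 + [1]
# A "model" that always predicts the majority class.
y_pred = [0] * 10

metrics = classification_metrics(y_true, y_pred)
```

Here the always-negative model scores 90% accuracy but has a recall of zero, because it never finds the one positive case. Looking at accuracy alone would hide that.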
The Gartner research document Machine Learning Training Essentials and Best Practices provides a six-step framework, with data selection as the first step, to ensure a robust, production-quality machine learning solution. The framework steps through data selection, model learning, model testing, model tuning, versioning, and retraining, with transparency, fairness, and privacy as key factors.