“It is a capital mistake to theorize before one has data. Insensibly one begins to twist facts to suit theories, instead of theories to suit facts.”
Sir Arthur Conan Doyle, Sherlock Holmes
My aha moment came when I was working on a proof of concept using the MNIST dataset. The dataset includes a curated set of handwritten digits from 0 to 9. The trained model had a 99% accuracy score. I wanted to see if it would recognize my handwriting. I have decent handwriting. At least, when it comes to numbers. I wrote a number on a Post-it note, took a picture, resized it to 28 x 28 pixels, and passed it to the model to predict. It was a pretty decent “3”, but the model didn’t recognize it. I tried a different number with the same result. A little troubleshooting made me realize that MNIST digits are black and white, while my data was green and red. No wonder it didn’t work. I also realized that I had a gap between my training data and real production data.
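The color mismatch in this story can be illustrated with a small preprocessing step. The sketch below is not the code from the original experiment. It assumes the photo has already been resized to 28 x 28 and is held as nested lists of (r, g, b) tuples. The function names `to_grayscale` and `to_mnist_like` are hypothetical, and so is the synthetic "photo", which stands in for a red stroke on a green note. Plain Python lists keep the example dependency-free; an image library would normally do this work.

```python
def to_grayscale(rgb_image):
    """Collapse each (r, g, b) pixel to one luminance value in [0, 1]."""
    # Standard ITU-R BT.601 luma weights for red, green and blue.
    return [
        [(0.299 * r + 0.587 * g + 0.114 * b) / 255.0 for (r, g, b) in row]
        for row in rgb_image
    ]


def to_mnist_like(rgb_image):
    """Grayscale the image and make it a bright digit on a dark background."""
    gray = to_grayscale(rgb_image)
    pixels = [p for row in gray for p in row]
    # Assumption: the background covers most of the image, so a high mean
    # brightness means a light background that has to be inverted.
    if sum(pixels) / len(pixels) > 0.5:
        gray = [[1.0 - p for p in row] for row in gray]
    return gray


# Synthetic stand-in for the photo: a red vertical stroke on a light green note.
GREEN, RED = (170, 230, 170), (200, 30, 30)
photo = [[RED if 12 <= x <= 15 else GREEN for x in range(28)] for _ in range(28)]

digit = to_mnist_like(photo)
```

After this step the stroke is brighter than the background, which matches the bright-digit-on-dark-background layout of MNIST. The inversion rule is a heuristic and could fail on a photo where the digit fills most of the frame.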
I wasn’t the only one who made such a mistake. There have been several instances in the recent past where the gap between training data and production model execution caused a failure (“Google’s medical AI was super accurate in a lab. Real life was a different story”; “It’s 2020. Where are our self-driving cars?”).
Often, machine learning discussions focus on algorithm selection.
- Is it better to use XGBoost or RandomForest?
- Would a deep learning model provide better accuracy?
Other questions are just as important for a successful model development process:
- I have 10 years of historical data. Should I use all the data for the ML model?
- How much historical data is required for model training?
- How do I reduce or remove bias from my model?
- How frequently should I retrain the ML model?
- How do I prevent overfitting or underfitting of the model?
- How do I ensure that the model is using appropriate features?
- How do I explain the model results, both to business users and auditors?
While algorithms provide the tools, it’s the data that is fundamental to the model. A successful machine learning model training process requires data, clear business objectives, and often more than one data science experiment. The process output, a trained model, needs to be evaluated using several metrics and supported by transparency (fairness, explainability, and privacy). A robust machine learning model training process is a synthesis of all the inputs and a thorough evaluation of the output.
The Gartner research document “Machine Learning Training Essentials and Best Practices” provides a six-step framework, with data selection as the first step, to ensure a robust, production-quality machine learning solution. The framework steps through data selection, model learning, model testing, model tuning, versioning, and retraining, with transparency, fairness, and privacy as key factors.