Human Interviewer: “How many data points do you need to solve this problem?”
Renee Robot: “Just the good ones.”
With AI, nothing matters more than the data
And your data is not ready for AI. This is the case for almost all data. Why? Because the data was never collected with AI in mind.
And yet, data is the lifeblood of AI. The right amount of the right data is required for success. This is often the biggest challenge in finding AI-based solutions to big problems. AI will directly reflect the characteristics of the data. Unreliable data creates unreliable AI. Bad data creates bad AI. Biased data creates biased AI.
Every business leader must understand the characteristics of the data underneath the AI to make judgements on AI quality. Here is where business leaders can add significant value. In theory, business leaders should know the strengths, weaknesses and value of their business data. Never underestimate the value of your business data. And never overestimate its quality. In my experience, most business leaders do the opposite. As I presented in the parent post to this one, the availability of quality data will endure as the main inhibitor to AI progress. Here are a few of my personal positions that some may find controversial.
“Data engineering is more important to AI solutions than data science.”
“The data about completing a task is more valuable than the task itself.”
“Synthetic data will displace real data as the primary fuel for AI.”
Here are some critical data questions that business leaders must ask before greenlighting any AI initiative.
Do we have enough data?
When COVID-19 reached global pandemic status in 2020, people placed high expectations on AI as the path to a solution. So why were solutions slow in coming and ineffective? One big reason, if not the biggest: we just didn’t have the data. COVID-19 was caused by a novel, never-before-identified coronavirus. We needed to collect, process and analyze new data. Data that didn’t exist. Many countries implemented AI almost immediately for contact tracing and tracking the disease, but we didn’t have enough data to pursue a treatment, even with the power of AI.
Is it the right data?
In some cases we have a lot of data but it isn’t the right data. Remember that the right data holds the answers you seek. This was another weakness in Princeton’s “Fragile Families Challenge” effort. Yes, there was a very large, robust data set. But the data set was designed for social scientists to study families formed by unmarried parents and the lives of children born into these families. It was not designed for AI to predict the six particular life outcomes of the children included in the observations. There was little chance that the answer sought was anywhere in the data. In reality, you can’t design a data set for a goal that broad with any guarantee that the answer will be in the data.
Can we get the right data?
One of the best ways to ensure you have the right data is to design the data set specifically for applying AI to the problem. This was the case with the 50,000 chest X-rays collected for the Stanford CheXNeXt radiology AI study. The data scientists knew that each of a specific set of cardiac maladies was sufficiently represented in the X-ray data, so there was a good likelihood that they could use the data to build an AI model able to detect those maladies. For those targeted maladies at least, they knew the answer was in the data. They had no expectation that the AI algorithm would recognize any other maladies.
Sometimes “the right data” doesn’t exist and is too expensive to collect. This is where synthetic data comes in. With today’s tech you can create a large data set to spec. However, there is always a risk that the data will not reflect the real world. In some cases, organizations don’t want their AI to reflect the real world. Instead they train AI algorithms to reflect the world they want. Then they look for their desired scenario in the real world. This is one way companies try to combat real-world bias, because even the right real-world data still may not address the next question.
Does the data hold the answer you want?
As if it were not hard enough to ensure an answer is in the data, you must ensure that the answer you want is in the data. Good AI data not only has the answer in it but it also reflects the scenario you wish to model. And this scenario may not be the way of the world. All data is biased, period. Accurate “real world” data will reflect actual bias in the real world. So if we are examining home lending practices or real estate sales practices or K-12 teaching systems, any inherent biases within those people, practices and systems will be in the data. And those biases will be reflected in the AI algorithms trained with that data.
In the mid-2010s, Amazon was building an AI-based recruiting system. The ultimate goal was to have a system that could look through thousands upon thousands of resumes and weed the pile down to a handful of highly qualified candidates, whom Amazon managers could then interview to choose the most qualified. It became apparent relatively quickly that the results were heavily biased toward men. Why was that? The data they trained the AI model against was a repository of resumes submitted to Amazon over a 10-year period. And who was submitting those resumes? Mostly men. And so the “answer” in the data was “the men” who are qualified for the position, not “the men and women” who are qualified for the position.
There was an answer in the data. But it was a biased answer and not necessarily the answer Amazon sought. Eventually, Amazon abandoned the AI-based recruiting project essentially because, though they had a lot of “the right” data, they did not have data that gave them an acceptable answer. If Amazon can make this mistake, anyone can.
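A simple representation check can surface this kind of skew before any model is trained. The following is a minimal sketch, not Amazon's method: the `group_shares` helper, the field names and the toy records are all hypothetical and invented purely for illustration.

```python
from collections import Counter


def group_shares(records, key):
    """Return each group's share of the records as a fraction of the total."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    return {group: n / total for group, n in counts.items()}


# Hypothetical toy data, invented for illustration only (not real resumes).
resumes = (
    [{"gender": "male", "hired": True}] * 6
    + [{"gender": "male", "hired": False}] * 2
    + [{"gender": "female", "hired": True}] * 1
    + [{"gender": "female", "hired": False}] * 1
)

# Share of each group among all training examples.
shares = group_shares(resumes, "gender")

# Share of each group among the "positive" examples the model learns from.
hired_shares = group_shares([r for r in resumes if r["hired"]], "gender")

print("All resumes:", shares)
print("Hired only: ", hired_shares)
```

If one group dominates both the full set and the positive examples, as it does in this toy data, a model trained on it is likely to reproduce that pattern. What counts as an acceptable balance is a business judgment, not something the code can decide.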
Bias is only good or bad depending on your desired outcome.
Another now-famous example of bias arose from groundbreaking research by Joy Buolamwini, Deb Raji, and Timnit Gebru. This research showed that facial recognition systems classified white men far more accurately than Black women. This launched a significant effort by numerous companies to further explore bias in facial recognition algorithms.
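The basic idea behind this kind of audit is to measure accuracy separately for each group rather than as one overall number. Here is a minimal sketch under stated assumptions: the `accuracy_by_group` helper, the group names and the toy predictions are hypothetical, and the figures are invented, not taken from the research above.

```python
from collections import defaultdict


def accuracy_by_group(examples):
    """Compute accuracy per group.

    examples: iterable of (group, true_label, predicted_label) tuples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, truth, predicted in examples:
        total[group] += 1
        correct[group] += int(truth == predicted)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical toy predictions, invented for illustration only.
examples = [
    ("group_a", 1, 1), ("group_a", 0, 0), ("group_a", 1, 1), ("group_a", 0, 0),
    ("group_b", 1, 0), ("group_b", 0, 0), ("group_b", 1, 1), ("group_b", 0, 1),
]

per_group = accuracy_by_group(examples)

# The gap between the best-served and worst-served group.
gap = max(per_group.values()) - min(per_group.values())

print(per_group)
print("Accuracy gap:", gap)
```

In this toy data the overall accuracy is 75%, which hides the fact that one group is served perfectly and the other only half the time. That is why a single headline accuracy figure is not enough to judge an AI system.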
Since all real-world data is biased, it is critical to understand how that bias will affect the “answers” AI will find in the data. With this knowledge, business leaders can either ensure the data is adjusted or factor the bias into the business decisions that follow the AI. Bias and transparency are important aspects of AI. A whole field around ethical AI is evolving rapidly. A big part of ensuring ethical AI is for business leaders to develop an awareness of the inherent biases in data (and therefore AI) and, if needed, adjust business decisions and practices to counteract those biases. Another big part is making sure the data holds the right answer to the business problem.
Getting the right amount of the right data will be a formidable AI challenge for the foreseeable future. It often makes AI cost-prohibitive for all but the largest companies. The cost of acquiring, preparing and processing data can reach millions of dollars depending on the type of AI needed. There are several ways to gain access to data, including:
- Accumulating, managing and processing internal business data
- Acquiring, managing and processing external data
- Collecting data via trial and error experience (reinforcement learning)
- Synthesizing data for AI training
- Acquiring algorithms trained by other organizations on their managed data
Each of these approaches, and others, comes with cost/benefit trade-offs.
It is critical that business leaders understand the fundamentals of the data behind the AI. The quality and cost of the data are foundational to any AI business case. A poor decision here puts the entire AI project, and perhaps the business, at risk.
So, the key data questions for business leaders to ask their AI team are:
- Do we have the right amount of the right data to give us the results we want?
- What are the main challenges with the data and how will we overcome them?
- How much will it cost to collect, prepare and manage the right data?
- Can we effectively understand and manage the biases in the data?
The Gartner Blog Network provides an opportunity for Gartner analysts to test ideas and move research forward. Because the content posted by Gartner analysts on this site does not undergo our standard editorial review, all comments or opinions expressed hereunder are those of the individual contributors and do not represent the views of Gartner, Inc. or its management.