When it comes to an AI (Artificial Intelligence) project, there is usually lots of excitement. The focus is often on using newfangled algorithms – such as deep learning neural networks – to unlock insights that will transform the business.
But in the process, something often gets lost: the importance of establishing the right plan for the data. Keep in mind that as much as 80% of an AI project's time can be spent on identifying, storing, processing and cleansing data.
“The big gotcha is having bad data fed into your AI systems,” said David Linthicum, who is the Chief Cloud Strategy Officer at Deloitte Consulting LLP. “It’s only as smart as the data that it’s allowed to cull through. The quality of data is of utmost importance. The use of cloud computing allows for massive amounts of data to be stored for very low costs, which means that you can afford to provide all the data that your AI systems need.”
The data process can certainly be dicey. Even subtle changes can have a major impact on the outcomes.
So what can be done to avoid these problems? Well, here are some strategies to consider:
Clear-Cut Focus: A majority of AI projects at traditional companies are about reducing costs, increasing revenues or keeping up with the competition. Yet the goals can easily get muddled.
According to Stuart Dobbie, who is the Product Owner at Callsign: “Fundamentally, the core recurring problem remains simple: many businesses fail to clearly articulate their business problem prior to choosing the technologies and skill-sets required to solve it.”
The temptation is to overcomplicate things. But of course, this can send an AI project off the rails and waste significant resources.
Overfitting: It seems like the more variables an AI model has, the better, right? Not really. A model with too many variables relative to its data will fit the noise and quirks of the training set rather than the real-world pattern, so it performs well on the data it has seen but poorly on anything new. This is known as overfitting, and it's a common issue with data.
“Overfitting, for example, is not solely a data problem,” said Dan Olley, who is the Global EVP and CTO of Elsevier, “but also a model training problem. This all comes back to designing the training and testing of models carefully and incorporating a varied group of inputs to validate the training and testing.”
Noise: This is the result of mislabeled examples (class noise) or errors in the values of attributes (attribute noise). The good news is that class noise can often be identified and excluded fairly easily. Attribute noise is another matter: it usually does not show up as an outlier.
“In machine learning algorithms, most good ones have the outlier identification/elimination embedded in the algorithm logic,” said Prasad Vuyyuru, who is a partner for the Enterprise Insights Practice at Infosys Consulting. “The data scientist or SME will still need to apply additional filters or decision trees during the learning stage to exclude certain data that may skew from the sample.”
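As one illustration of the kind of additional filter Vuyyuru mentions, here is a hedged sketch (NumPy, with invented sensor readings). It drops values whose modified z-score is extreme, using the median and MAD because those statistics are themselves robust to the very outliers being hunted:

```python
import numpy as np

def filter_attribute_noise(values, threshold=3.5):
    """Drop values whose modified z-score (based on the median and
    the median absolute deviation) exceeds the threshold."""
    values = np.asarray(values, dtype=float)
    median = np.median(values)
    mad = np.median(np.abs(values - median))
    if mad == 0:
        # All values cluster at the median; nothing to flag
        return values
    modified_z = 0.6745 * np.abs(values - median) / mad
    return values[modified_z <= threshold]

# A sensor column where one reading is clearly corrupted
readings = [9.8, 10.1, 10.0, 9.9, 10.2, 55.0]
print(filter_attribute_noise(readings))  # the 55.0 reading is dropped
```

A mean-based z-score would struggle here: in a small sample, a single large outlier inflates the mean and standard deviation enough to hide itself, which is why the robust version is the usual choice.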
One safeguard is cross-validation: divide the data into, say, ten similarly sized folds, train the algorithm on nine of them and evaluate it on the remaining fold, then rotate so that each fold serves as the test set exactly once.
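That ten-fold procedure can be sketched roughly as follows (NumPy only; the data and the `fit` and `score` callables are made-up stand-ins for a real model and metric):

```python
import numpy as np

def cross_validate(x, y, fit, score, k=10):
    """K-fold cross-validation: train on k-1 folds, score on the
    held-out fold, and rotate until every fold has been the test set."""
    indices = np.arange(len(x))
    rng = np.random.default_rng(0)
    rng.shuffle(indices)                 # shuffle before splitting
    folds = np.array_split(indices, k)   # k similarly sized folds
    scores = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        model = fit(x[train_idx], y[train_idx])
        scores.append(score(model, x[test_idx], y[test_idx]))
    return np.mean(scores)

# Example: score a straight-line fit on noisy linear data
x = np.linspace(0, 1, 100)
y = 2 * x + np.random.default_rng(1).normal(0, 0.1, 100)

fit = lambda xs, ys: np.polyfit(xs, ys, 1)
score = lambda m, xs, ys: np.mean((np.polyval(m, xs) - ys) ** 2)
print(f"mean 10-fold MSE: {cross_validate(x, y, fit, score):.4f}")
```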
“We should always follow Ockham’s Razor which states that the best Machine Learning models are simple models that fit the data well,” said Vuyyuru.
Maintenance: AI models are not static. They can get better over time. Then again, they can also decay because the data is not adequately updated. In other words, the data needs ongoing maintenance.
“AI systems are not like other pieces of software,” said Kurt Muehmel, who is the VP of Sales Engineering at Dataiku. “They can’t be released once and then forgotten. They take a lot of maintenance because people change, data changes, and models can drift over time. As more and more businesses develop AI systems, the issue of maintenance as a gotcha will quickly come to the forefront.”
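One very simple way to watch for the drift Muehmel describes is to compare a feature's live distribution against what the model saw at training time. This sketch (NumPy, synthetic data, with the threshold and all names chosen arbitrarily for illustration) flags a mean shift as a cue to investigate or retrain:

```python
import numpy as np

def drift_alert(train_feature, live_feature, threshold=0.25):
    """Very rough drift check: flag a feature whose live mean has moved
    more than `threshold` training standard deviations away from the
    training mean."""
    train = np.asarray(train_feature, dtype=float)
    live = np.asarray(live_feature, dtype=float)
    shift = abs(live.mean() - train.mean()) / train.std()
    return shift > threshold

rng = np.random.default_rng(42)
train_ages = rng.normal(35, 10, 1000)  # population seen at training time
live_ages = rng.normal(42, 10, 1000)   # the live population has shifted

print(drift_alert(train_ages, live_ages))  # True -> time to retrain
```

Production systems typically use richer distribution-level tests (population stability index, Kolmogorov-Smirnov, and the like), but even a crude check like this, run on a schedule, turns "models can drift over time" from a surprise into a monitored condition.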