
Prepare your data for AI: why bad data is bad for AI

This blog post is an extension of the Salesforce Trailhead Trail: Prepare Your Data for AI, specifically the module Data Quality.

In this article we’re going to take a look at data quality in relation to AI – why bad data affects AI and where to begin when it comes to preparing your data for using AI in the future.

Salesforce Trailhead’s module lays out the landscape clearly: “Data quality plays a major role in shaping the outcomes and reliability of AI systems.” An article in Forbes from February 2023 puts this even more seriously: “Getting highly accurate AI model outputs relies on one thing—good quality data as the input.” 

So, data quality is not simply beneficial for the successful use of AI, it’s fundamental.

Why does bad data affect AI?

Let’s take a look at three factors contributing to poor data quality, and how these examples of bad data affect AI. At ProvenWorks, we’re experts in address data, so we’ll keep our case studies address related! Then we’ll spend the rest of this article diving into the third factor – inconsistent data.

Inaccurate data 

The most obvious point first: inaccurate data degrades the accuracy of AI. AI and machine learning models depend on high-quality data if they are to perform well. If poor data is used to train AI, inaccurate analysis and unreliable decisions will inevitably follow.

Case study: how inaccurate address data affects AI 

Let’s consider a practical example in the world of deliveries. For a package delivery company relying heavily on AI for optimising routes and delivery time estimates, the impact of inaccurate data is evident. Outdated or poor-quality address information within their database can lead to highly inefficient delivery routes, resulting in wasted time and fuel for drivers. Customers may experience delays or even non-delivery due to incorrect addresses, affecting the company’s reputation and potentially causing the loss of future business.

On top of this, inaccurate data also skews estimated delivery times, leading to customer dissatisfaction and damaging the delivery company’s own trust in AI technologies.

This case underscores the critical importance of prioritising data quality in AI projects, as it not only affects performance but also trust and adoption of these valuable technologies.

Biased data 

Consider the breadth of data from studies and research that is available to an AI system. What happens when the research used to train that system is not representative of the real world? If that sounds dramatic, consider recruitment firms that train AI on historical data skewed towards a particular sex, race, religion or sexual orientation. Data that is based on, or comes from, one majority group will inevitably bias what the model learns.

Case study: how biased address data affects AI

Imagine an e-commerce company leveraging AI to make decisions on marketing strategies and product recommendations. If the company’s address data is collected predominantly from affluent neighbourhoods, the AI system might recommend specific luxury products to all its customers, assuming that everyone has similar buying habits to those in affluent areas. This biased data produces an inaccurate, out-of-touch picture of customer preferences.

Again, we see how poor data quality not only results in the business losing possible revenue but also how it damages trust in the company by alienating entire segments of its customer base.

Inconsistent data 

In general, consistency is crucial for data quality. With consistency comes patterns, and with patterns come predictions. When an AI model is trained using inconsistent data, the model might be unable to identify patterns or generate precise predictions. O’Reilly’s The State of Data Quality in 2020 survey found that the most common reason AI and ML projects fail in the marketing sector is too many data sources and inconsistent data. The importance of creating greater consistency across data values, taxonomies, data structures and meta tags is clear if AI is to succeed.

Case study: how inconsistent address data affects AI

Helpfully, Salesforce lays out its own example of inconsistent address data in the Trailhead Trail: Prepare Your Data for AI, specifically the module Data Quality.

[Image: Salesforce’s “Accounts By State” chart from the Data Quality module]

In this scenario, we see a chart of Accounts By State but when we scrutinise the bars, we can see that the data is very inconsistent. California is represented by a number of different values including “Surfin’, USA”, “Calif”, “California” and “Cali”. 

How are humans, let alone AI, able to determine the true record count, and therefore business value, of California accounts when the data is so inconsistent?! 
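To see why this matters, here is a minimal sketch of the problem in Python. The account names and state values are hypothetical, chosen to mirror the variants in Salesforce’s chart; a naive count treats each spelling as a separate state:

```python
from collections import Counter

# Hypothetical account records with inconsistent state values,
# mirroring the California variants in Salesforce's chart.
accounts = [
    {"name": "Acme", "state": "California"},
    {"name": "Globex", "state": "Calif"},
    {"name": "Initech", "state": "Cali"},
    {"name": "Hooli", "state": "Surfin', USA"},
    {"name": "Umbrella", "state": "NY"},
]

# A naive "Accounts By State" report counts each variant as its own
# state, so no single bar reflects the true number of CA accounts.
counts = Counter(account["state"] for account in accounts)
print(counts["California"])  # 1 -- the other three CA variants are missed
```

Four of the five accounts are really in California, but a report grouped on the raw value shows four different “states” of one account each.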

Let’s dive deeper into this idea of data consistency. 

Consistency is key, but what actually is it?

Data consistency is a crucial attribute of data quality so it’s important we understand this dimension before we go about fixing it. 

Salesforce introduces the idea of consistent data in its Data Quality module: 

[Image: Salesforce’s description of consistent data from the Data Quality module]

Source: https://trailhead.salesforce.com/content/learn/modules/data_quality/data_quality_assess_your_data

As Salesforce implies above, consistency covers a range of ideas, from formatting and spelling to language and taxonomy. For the rest of this article, we’re going to examine data consistency through the process of standardisation – specifically, the standardisation of address data.

What is standardisation?

Standardisation is the process of making something conform to a standard. Address standardisation, therefore, is the process of converting multiple known values to a single predetermined format. For example, “United States”, “USA”, “US”, and “United States of America” can each be standardised to “US”. As a result of standardisation, our data becomes consistent!
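The idea above can be sketched in a few lines of Python. This is a simplified illustration, not how any particular product implements it: the mapping table is hand-written and covers only the variants mentioned, whereas a real solution maintains comprehensive reference data for you.

```python
# A minimal standardisation sketch: map known variants of a country
# value to one predetermined format. The variant list is illustrative.
COUNTRY_STANDARDS = {
    "united states": "US",
    "usa": "US",
    "us": "US",
    "united states of america": "US",
}

def standardise_country(raw: str) -> str:
    """Map a known variant to its standard code; pass unknown values through."""
    return COUNTRY_STANDARDS.get(raw.strip().lower(), raw.strip())

print(standardise_country("United States of America"))  # US
print(standardise_country("USA"))                       # US
print(standardise_country("France"))                    # France (unknown, unchanged)
```

Note that unrecognised values are passed through unchanged rather than guessed at, so nothing is silently corrupted; the same pattern extends to state values like “Calif” and “Cali” mapping to “CA”.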

So, standardisation is a crucial piece of functionality for organisations that want to use AI with their country and state data. After all, we know by now that data consistency is key!

But how do we standardise this data? 

Standardise your address data for AI 

[Image: Accounts By State report showing the inconsistent California values]

How do we turn this report into something clean, consistent and useful for our business and for AI? 

There’s no doubt that consistent data is key for successfully leveraging AI. In our next article, we’ll take Salesforce’s own case study of inconsistent state data and walk through step-by-step how you can use AddressTools to solve the problem of these inconsistent California values.

Follow through our step-by-step guide for standardising your address data in the next article: How to standardise your address data.

Resources

When it comes to AI, it all starts with the data. If you need a place to start your AI journey, we recommend the Salesforce Trailhead Trail that the above example was taken from: https://trailhead.salesforce.com/content/learn/trails/prepare-your-data-for-ai  

If you’re looking to skill up, Salesforce offers a certification called AI Associate which is designed for individuals already familiar with Salesforce CRM. You can learn more about it here: https://trailhead.salesforce.com/en/credentials/associate  

Now you know that bad data affects AI, it’s time to consider your own data. If you like the look of how easily we might standardise our address data, you can check out AddressTools Premium on the Salesforce AppExchange and get started with a free 2-week trial: https://appexchange.salesforce.com/appxListingDetail?listingId=a0N30000002zt9uEAA

PS. This post was not written by AI 😉
