AI’s Achilles’ Heel: The Data Quality Dilemma

As AI has gained prominence, all the data quality issues we’ve faced historically remain relevant. However, additional complexities arise when dealing with the nontraditional data that AI often makes use of.

AI Data Has Different Quality Needs

When AI makes use of traditional structured data, all the same data cleansing processes and protocols that have been developed over the years can be used as-is. To the extent an organization already has confidence in its traditional data sources, the use of AI shouldn’t require any special data quality work.

The catch, however, is that AI often makes use of nontraditional data that can’t be cleansed in the same way as traditional structured data. Think of images, text, video, and audio. When using AI models with this type of data, quality is as important as ever. But unfortunately, the traditional methods utilized for cleansing structured data simply don’t apply. New approaches are required.

AI’s Different Needs: Input And Training

First, let’s consider image data quality from the input and model training perspective. Typically, each image has been given tags summarizing what it contains: for example, “hot dog,” “sports car,” or “cat.” This tagging, typically done by humans, can contain outright errors as well as cases where different people interpret the same image differently. How can we identify and handle such situations?

It isn’t easy! With numerical data, it is possible to identify bad data via mathematical formulas or business rules. For example, if the price of a candy bar is shown as $125, we can be confident it isn’t right because it is so far above expectation. Similarly, a person shown as age 200 clearly doesn’t make sense. There is no comparably effective way today to mathematically check whether the tags applied to an image are accurate. The best way to validate a tag is to have a second person assess the image.
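
To make the contrast concrete, here is a minimal sketch of the kind of rule-based validation that works for structured data, alongside an inter-annotator agreement check (Cohen’s kappa) for human-applied image tags. The field names, thresholds, and sample values are all hypothetical illustrations, not a prescribed standard.

```python
# Rule-based checks work for structured data; for image tags, the best we can
# do without a model is measure how well two human annotators agree.
# All field names, thresholds, and sample values below are hypothetical.
from sklearn.metrics import cohen_kappa_score

# Business-rule bounds for structured fields (illustrative values).
RULES = {
    "candy_bar_price": lambda v: 0.25 <= v <= 10.00,  # $125 would fail here
    "customer_age": lambda v: 0 <= v <= 120,          # age 200 would fail here
}

def validate_record(record: dict) -> list[str]:
    """Return the names of fields that violate their business rule."""
    return [field for field, ok in RULES.items()
            if field in record and not ok(record[field])]

print(validate_record({"candy_bar_price": 125.00, "customer_age": 200}))
# -> ['candy_bar_price', 'customer_age']

# No such formula exists for image tags, so we fall back on a second
# annotator and quantify agreement. Kappa near 1.0 means strong agreement;
# low kappa flags images (or annotators) that need review.
annotator_1 = ["hot dog", "sports car", "cat", "cat", "hot dog"]
annotator_2 = ["hot dog", "sports car", "dog", "cat", "hot dog"]
print(cohen_kappa_score(annotator_1, annotator_2))
```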

An alternative is to develop a process that uses other AI models to scan the image and check whether the tags applied appear to be correct. In other words, we can use existing image models to help validate the data being fed into future models. While there is potential for circular logic in doing this, models are becoming strong enough that, pragmatically, it shouldn’t be a problem.
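
As one possible shape for such a process, the sketch below runs a pretrained torchvision ResNet-50 over an image and flags the human tag if it doesn’t appear among the model’s top-k predictions. It assumes the tag vocabulary roughly overlaps with ImageNet’s class names, which real tagging schemes usually won’t; treat this as an illustration of the pattern rather than a drop-in tool.

```python
# Sketch: use a pretrained classifier to sanity-check human-applied tags.
# Assumes tags roughly match ImageNet class names, which is rarely true in
# practice -- a real pipeline would map its own taxonomy onto the model's.
import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights).eval()
preprocess = weights.transforms()
classes = weights.meta["categories"]

def tag_looks_plausible(image_path: str, human_tag: str, k: int = 5) -> bool:
    """Return True if the human tag appears among the model's top-k labels."""
    img = Image.open(image_path).convert("RGB")
    batch = preprocess(img).unsqueeze(0)
    with torch.no_grad():
        probs = model(batch).softmax(dim=1)
    top_idx = probs.topk(k).indices[0].tolist()
    top_labels = {classes[i].lower() for i in top_idx}
    return human_tag.lower() in top_labels

# Hypothetical usage: route disagreements to a human reviewer.
# if not tag_looks_plausible("img_0042.jpg", "hot dog"):
#     send_to_review_queue("img_0042.jpg")   # hypothetical helper
```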

AI’s Different Needs: Output And Scoring

Next, let’s consider image data quality from the model output and scoring perspective. Once we have an image model that we have confidence in, we feed it new images to assess. For instance, does the image contain a hot dog, a sports car, or a cat? How can we tell whether an image provided for assessment is “clean enough” for the model? What if the image is blurry, pixelated, or otherwise unclear? Is there a way to “clean” the image?


[Image: a blurry, low-resolution photo whose subject is ambiguous; it could be trees or something else entirely]

The confidence we can have in what an AI model tells us is in an image depends directly on how clean the image is. In a case like the image above, how do we know whether it is a blurred view of trees or something else entirely? Even for humans, this assessment is subjective, and there is no clear path to an automated, algorithmic way of declaring the image “clean enough” or not. Here, manual review might be best. In the absence of that, we can again have an algorithm that scores the clarity of the input image, along with processes that rate the confidence in the descriptions the model generates. Many AI applications do this today, but there is surely room for improvement.
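
One common, lightweight way to score image clarity is the variance of the Laplacian: blurry images have few sharp edges, so the variance comes out low. The sketch below uses OpenCV; the threshold of 100 is a frequently cited rule of thumb, not a universal constant, and would need tuning per image source.

```python
# Score image sharpness via the variance of the Laplacian (edge response).
# Low variance suggests blur. The threshold is an assumption to calibrate
# on your own images, not a standard value.
import cv2

BLUR_THRESHOLD = 100.0  # hypothetical cutoff

def clarity_score(image_path: str) -> float:
    """Higher scores mean sharper images."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise ValueError(f"Could not read image: {image_path}")
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def clean_enough(image_path: str) -> bool:
    return clarity_score(image_path) >= BLUR_THRESHOLD

# Images that fail the check can be routed to manual review before scoring.
```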

Rising To The Challenge

The examples provided illustrate that classic data quality approaches like missing value imputation and outlier detection can’t be applied directly to data such as images or audio. These new data types, on which AI is heavily dependent, will require novel methodologies for assessing quality on both the input and output ends of the models. Given that it took us many years to develop our approaches for traditional data, it should come as no surprise that we have not yet achieved similar standards for the unstructured data that AI uses.

Until those standards arise, it is necessary to:

  1. Constantly scan industry blogs, papers, and code repositories to keep tabs on newly developed approaches
  2. Make your data quality processes modular so that it is easy to alter or add procedures as new advances emerge (see the sketch after this list)
  3. Be diligent in studying identified errors so that you can spot patterns in where your cleansing processes and models perform better and worse
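
As a minimal illustration of point 2, the registry pattern below keeps each quality check as an independent, swappable unit, so a new tag validator or clarity scorer can be added without touching the rest of the pipeline. All check names and record fields here are hypothetical.

```python
# Minimal modular quality pipeline: each check is registered independently,
# so new techniques can be swapped in without rewriting the pipeline.
# All check names and record fields are hypothetical.
from typing import Callable, Dict

QualityCheck = Callable[[dict], bool]  # returns True if the record passes
CHECKS: Dict[str, QualityCheck] = {}

def register(name: str) -> Callable[[QualityCheck], QualityCheck]:
    def decorator(fn: QualityCheck) -> QualityCheck:
        CHECKS[name] = fn
        return fn
    return decorator

@register("has_tag")
def has_tag(record: dict) -> bool:
    return bool(record.get("tag"))

@register("clarity")
def clarity(record: dict) -> bool:
    # Plug a real scorer (e.g., the Laplacian check above) in here.
    return record.get("clarity_score", 0.0) >= 100.0

def run_checks(record: dict) -> Dict[str, bool]:
    """Run every registered check; failures can be logged for pattern study."""
    return {name: check(record) for name, check in CHECKS.items()}

print(run_checks({"tag": "cat", "clarity_score": 42.0}))
# -> {'has_tag': True, 'clarity': False}
```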

Data quality has always been a thorn in the side of data and analytics practitioners. Not only do the traditional issues remain as AI is deployed, but the different data that AI uses introduces all sorts of novel and difficult data quality challenges to address. Those working in the data quality realm should have job security for some time to come!

Bronson Lee, MPEI, CPC, SPC

I have most recently been frustrated by mapping data - which should be so straightforward! ChatGPT is consistently off when using maps and pinning locations ... and can't seem to fix it even with extra coaching. How do you tackle that one?

Julia Bardmesser

Bill, beyond the technical challenges, I’ve seen the cultural hurdle of making data quality a real priority become even tougher with unstructured data. Many teams still struggle to embed it in decision-making. How are you seeing organizations align leadership around this without slowing AI initiatives?

Michael McIntire

Bill - We also have to think about what is *implied* in our data. We assume quality = accuracy, but that's not always the case. The implied part is how we structured the data: how we modeled it, relationally or otherwise. Even with log data, how the developer chose to write the logs implies relationships between entities & attributes we may not have intended. The related topic I've been stewing about of late (no answers, not even good questions!) is what effect the non-determinism of AI has on all the deterministic systems we have built around computing over the last 75 years. Testing, coding, pipelines, even data design & analytics are all largely implemented under the presumption of deterministic outcomes. Data quality is another one.

Ali Khalid

Testing different modalities is something I have not spent too much time on either; that's a tricky one.

