AI’s Achilles’ Heel: The Data Quality Dilemma

As AI has gained prominence, all the data quality issues we’ve faced historically remain relevant. However, additional complexities arise when dealing with the nontraditional data that AI often makes use of.

AI Data Has Different Quality Needs

When AI makes use of traditional structured data, all the same data cleansing processes and protocols that have been developed over the years can be used as-is. To the extent an organization already has confidence in its traditional data sources, the use of AI shouldn’t require any special data quality work.

The catch, however, is that AI often makes use of nontraditional data that can’t be cleansed in the same way as traditional structured data. Think of images, text, video, and audio. When using AI models with this type of data, quality is as important as ever. But unfortunately, the traditional methods utilized for cleansing structured data simply don’t apply. New approaches are required.

AI’s Different Needs: Input And Training

First, let’s consider image data quality from the input and model training perspective. Typically, each image has been given tags summarizing what it contains: for example, “hot dog,” “sports car,” or “cat.” This tagging, typically done by humans, can contain outright errors as well as cases where different people interpret the same image differently. How can we identify and handle such situations?

It isn’t easy! With numerical data, it is possible to identify bad data via mathematical formulas or business rules. For example, if the price of a candy bar is shown as $125, we can be confident it isn’t right because it is so far above expectation. Similarly, a person shown as age 200 clearly doesn’t make sense. There is no comparably effective way today to mathematically check whether the tags applied to an image are accurate. The best way to validate a tag is to have a second person assess the image.
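
To make the contrast concrete, here is a minimal sketch of the kind of rule-based validation that works for structured data, alongside an inter-annotator agreement check (Cohen’s kappa) for human-applied image tags. The field names, thresholds, and sample values are all hypothetical illustrations, not a prescribed standard.

```python
# Rule-based checks work for structured data; for image tags, the best we can
# do without a model is measure how well two human annotators agree.
# All field names, thresholds, and sample values below are hypothetical.
from sklearn.metrics import cohen_kappa_score

# Business-rule bounds for structured fields (illustrative values).
RULES = {
    "candy_bar_price": lambda v: 0.25 <= v <= 10.00,  # $125 would fail here
    "customer_age": lambda v: 0 <= v <= 120,          # age 200 would fail here
}

def validate_record(record: dict) -> list[str]:
    """Return the names of fields that violate their business rule."""
    return [field for field, ok in RULES.items()
            if field in record and not ok(record[field])]

print(validate_record({"candy_bar_price": 125.00, "customer_age": 200}))
# -> ['candy_bar_price', 'customer_age']

# No such formula exists for image tags, so we fall back on a second
# annotator and quantify agreement. Kappa near 1.0 means strong agreement;
# low kappa flags images (or annotators) that need review.
annotator_1 = ["hot dog", "sports car", "cat", "cat", "hot dog"]
annotator_2 = ["hot dog", "sports car", "dog", "cat", "hot dog"]
print(cohen_kappa_score(annotator_1, annotator_2))
```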

An alternative is to develop a process that uses other AI models to scan the image and check whether the tags applied appear to be correct. In other words, we can use existing image models to help validate the data being fed into future models. While there is potential for circular logic in doing this, models are becoming strong enough that, pragmatically, it shouldn’t be a problem.
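
As one possible shape for such a process, the sketch below runs a pretrained torchvision ResNet-50 over an image and flags the human tag if it doesn’t appear among the model’s top-k predictions. It assumes the tag vocabulary roughly overlaps with ImageNet’s class names, which real tagging schemes usually won’t; treat this as an illustration of the pattern rather than a drop-in tool.

```python
# Sketch: use a pretrained classifier to sanity-check human-applied tags.
# Assumes tags roughly match ImageNet class names, which is rarely true in
# practice -- a real pipeline would map its own taxonomy onto the model's.
import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights).eval()
preprocess = weights.transforms()
classes = weights.meta["categories"]

def tag_looks_plausible(image_path: str, human_tag: str, k: int = 5) -> bool:
    """Return True if the human tag appears among the model's top-k labels."""
    img = Image.open(image_path).convert("RGB")
    batch = preprocess(img).unsqueeze(0)
    with torch.no_grad():
        probs = model(batch).softmax(dim=1)
    top_idx = probs.topk(k).indices[0].tolist()
    top_labels = {classes[i].lower() for i in top_idx}
    return human_tag.lower() in top_labels

# Hypothetical usage: route disagreements to a human reviewer.
# if not tag_looks_plausible("img_0042.jpg", "hot dog"):
#     send_to_review_queue("img_0042.jpg")   # hypothetical helper
```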

AI’s Different Needs: Output And Scoring

Next, let’s consider image data quality from the model output and scoring perspective. Once we have an image model that we have confidence in, we feed it new images to assess. For instance, does the image contain a hot dog, a sports car, or a cat? How can we tell whether an image provided for assessment is “clean enough” for the model? What if the image is blurry, pixelated, or otherwise unclear? Is there a way to “clean” the image?


[Image: a blurry, low-resolution photo whose subject is ambiguous; it could be trees or something else entirely]

The confidence we can have in what an AI model tells us is in an image depends directly on how clean the image is. In a case like the image above, how do we know whether it is a blurred view of trees or something else entirely? Even for humans, this assessment is subjective, and there is no clear path to an automated, algorithmic way of declaring the image “clean enough” or not. Here, manual review might be best. In the absence of that, we can again have an algorithm that scores the clarity of the input image, along with processes that rate the confidence in the descriptions the model generates. Many AI applications do this today, but there is surely room for improvement.
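
One common, lightweight way to score image clarity is the variance of the Laplacian: blurry images have few sharp edges, so the variance comes out low. The sketch below uses OpenCV; the threshold of 100 is a frequently cited rule of thumb, not a universal constant, and would need tuning per image source.

```python
# Score image sharpness via the variance of the Laplacian (edge response).
# Low variance suggests blur. The threshold is an assumption to calibrate
# on your own images, not a standard value.
import cv2

BLUR_THRESHOLD = 100.0  # hypothetical cutoff

def clarity_score(image_path: str) -> float:
    """Higher scores mean sharper images."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise ValueError(f"Could not read image: {image_path}")
    return cv2.Laplacian(gray, cv2.CV_64F).var()

def clean_enough(image_path: str) -> bool:
    return clarity_score(image_path) >= BLUR_THRESHOLD

# Images that fail the check can be routed to manual review before scoring.
```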

Rising To The Challenge

The examples provided illustrate that classic data quality approaches like missing value imputation and outlier detection can’t be applied directly to data such as images or audio. These new data types, on which AI is heavily dependent, will require novel methodologies for assessing quality on both the input and output ends of the models. Given that it took us many years to develop our approaches for traditional data, it should come as no surprise that we have not yet achieved similar standards for the unstructured data that AI uses.

Until those standards arise, it is necessary to:

  1. Constantly scan industry blogs, papers, and code repositories to keep tabs on newly developed approaches
  2. Make your data quality processes modular so that it is easy to alter or add procedures as new advances emerge (see the sketch after this list)
  3. Be diligent in studying identified errors so that you can spot patterns in where your cleansing processes and models perform better and worse
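
As a minimal illustration of point 2, the registry pattern below keeps each quality check as an independent, swappable unit, so a new tag validator or clarity scorer can be added without touching the rest of the pipeline. All check names and record fields here are hypothetical.

```python
# Minimal modular quality pipeline: each check is registered independently,
# so new techniques can be swapped in without rewriting the pipeline.
# All check names and record fields are hypothetical.
from typing import Callable, Dict

QualityCheck = Callable[[dict], bool]  # returns True if the record passes
CHECKS: Dict[str, QualityCheck] = {}

def register(name: str) -> Callable[[QualityCheck], QualityCheck]:
    def decorator(fn: QualityCheck) -> QualityCheck:
        CHECKS[name] = fn
        return fn
    return decorator

@register("has_tag")
def has_tag(record: dict) -> bool:
    return bool(record.get("tag"))

@register("clarity")
def clarity(record: dict) -> bool:
    # Plug a real scorer (e.g., the Laplacian check above) in here.
    return record.get("clarity_score", 0.0) >= 100.0

def run_checks(record: dict) -> Dict[str, bool]:
    """Run every registered check; failures can be logged for pattern study."""
    return {name: check(record) for name, check in CHECKS.items()}

print(run_checks({"tag": "cat", "clarity_score": 42.0}))
# -> {'has_tag': True, 'clarity': False}
```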

Data quality has always been a thorn in the side of data and analytics practitioners. Not only do the traditional issues remain as AI is deployed, but the different data that AI uses introduces all sorts of novel and difficult data quality challenges to address. Those working in the data quality realm should have job security for some time to come!

Bronson Lee, MPEI, CPC, SPC

I have most recently been frustrated by mapping data - which should be so straightforward! ChatGPT is consistently off when using maps and pinning locations ... and can't seem to fix it even with extra coaching. How do you tackle that one?

Julia Bardmesser

Bill, beyond the technical challenges, I’ve seen the cultural hurdle of making data quality a real priority become even tougher with unstructured data. Many teams still struggle to embed it in decision-making. How are you seeing organizations align leadership around this without slowing AI initiatives?

Michael McIntire

Bill - We also have to think about what is *implied* in our data. We assume quality = accuracy, but that's not always the case. The implied part is how we structured the data: how we modeled it, relationally or otherwise. Even with log data, how the developer chose to write the logs implies relationships between entities & attributes we may not have intended. The related topic I've been stewing about of late (no answers, not even good questions!) is what effect the non-determinism of AI has on all the deterministic systems we have built around computing over the last 75 years. Testing, coding, pipelines, even data design & analytics are all largely implemented under the presumption of deterministic outcomes. Data quality is another one.

Ali Khalid

Testing different modalities is something I have not spent too much time on either; that's a tricky one.

