A Data Quality Operating System for today's AI world

Having worked in this space for a long time, I see only bits and pieces of data quality processes, but no integrated process that covers the diverse needs of the different areas.

Looking at SAP as just one example, we have:

  • Data Quality services to plug things like address suggestion lists into the ERP system, for faster and more accurate customer address entry.
  • Tools to query the data in the database according to specified rules, e.g. count the rows where the company name is null, and then report that on a daily basis as Data Quality dashboards (see the sketch after this list).
  • Data Quality transformations in the ETL process.
  • A Master Data Management solution that enables users to create golden records combining the information from different rows and different systems.
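
As a minimal sketch of what such a rule query could look like, assuming a SQLite database with a customer table (the database file, table and column names are made up for illustration):

```python
# Sketch of a rule-based query check; database file, table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect("erp.db")
null_names = conn.execute(
    "SELECT COUNT(*) FROM customer WHERE company_name IS NULL"
).fetchone()[0]
print(f"Rule 'company_name is not null' violated by {null_names} rows today")
conn.close()
```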

Bits and pieces, as I said. But why are there so many options? What about the statements below?

  • The data quality should be perfect in the ERP system already.
  • Whatever issue is found during the ETL process should be played back into the source system.
  • Data Quality dashboards are informational only; they do not help increase the data quality.
  • We can only store the data quality results in an aggregated fashion due to the volume. 10m rows times 100 rules would mean 1000m rows with rule results. That is way too expensive to store in a database.

Based on my experience, all of these problems stem from misunderstandings. First, Data Quality is not absolute but is measured in a specific context. What is perfect quality in one context is unusable in another. Second, data quality comes at a cost. The higher you aim, the exponentially higher the cost.

Let's prove these statements with an example:

When is an address perfect? When all fields are set and correct, agreed? Okay, so "2551 N First St, San Jose, CA 95131, US" is correct, yes? If that is correct, then the city name "San José" is not correct. And "SAN JOSE" is not correct either. This is where the context comes into play. For the ERP system we would argue all of these records are correct, that there are multiple perfect values, because the ERP system works mostly in the context of a single record, this one record, and its sole purpose is to ship something. Hence in this context even a missing state CA would be okay, because the product would still reach its destination.

We cannot even say the perfect record is the one the US postal service uses as reference data, because a newly finished building might not be part of the postal reference data yet. We still want to have those tenants as customers.

For an analytical application, in a context where multiple rows are viewed together, we need a higher degree of quality. We need city names to be standardized. Otherwise the report shows 10 rows when grouping the data per city, or worse, the user queries for "SAN JOSE" and gets only a single record out of many without noticing it.
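
To make that concrete, here is a tiny illustration in pandas with made-up rows: three spellings of the same city produce three groups until a standardized column is added.

```python
# Tiny illustration of the grouping problem; the data is made up.
import pandas as pd

orders = pd.DataFrame({
    "city": ["San Jose", "SAN JOSE", "San José"],
    "revenue": [100, 200, 300],
})

# Without standardization, one city shows up as three groups in the report.
print(orders.groupby("city")["revenue"].sum())

# A standardized city column collapses them into a single group.
orders["city_std"] = (
    orders["city"]
    .str.normalize("NFKD")
    .str.encode("ascii", "ignore")
    .str.decode("ascii")
    .str.title()
)
print(orders.groupby("city_std")["revenue"].sum())
```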

This also illustrates the second misunderstanding, that data quality comes at a cost. In the ERP system the data should be entered as quickly as possible. Just imagine a system where error messages pop up constantly: "SAN JOSE" is not a valid city, select the correct one from a list of 300'000 US cities. This address is not found in the postal reference data, sorry, you cannot order!?! In the ERP system, good-enough is way more important than perfection.

Finally, many of these tools were introduced years ago, when there was no such thing as a Lakehouse. Yes, storing 1000m rows in a database is expensive, but storing them in the Lakehouse as an array column is cheap. Reality has changed; the assumptions of back then are no longer valid.
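
As a minimal sketch of that idea, assuming PySpark on a Lakehouse table where each record carries its rule results in one array column instead of 1000m separate result rows (the table, column and rule names are hypothetical):

```python
# Sketch only: table and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# One row per record; all rule results attached as array<struct<rule:string, passed:boolean>>.
customers = spark.read.table("silver.customer")

# Records that failed at least one rule, grouped by their source system.
failed = customers.where(F.exists("rule_results", lambda r: ~r["passed"]))
failed.groupBy("source_system").count().show()
```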

I want an architecture that follows these rules:

  • If the data quality and speed of entry can be improved in the source system, go with that. The customer starts entering address information and a short suggestion list lets them fill out all other fields automatically. Perfect.
  • Data standardization happens while loading the data into the analytical platform.
  • Data validation also happens in the data integration pipeline and augments the analytical platform with its metadata: "yes, the customer master record is correct"; "don't create analytics on the field GENDER as it is mostly empty".
  • The output of the validation can be used to trigger a workflow or a notification. For example, update the record, send it to the MDM system as a proposal, and the end user can then accept the standardized address data.
  • Data Quality dashboards can be created using the rule results attached to each single record (see the sketch after this list). I am not only interested that today there are five records in the database with a null customer name, I want to know which records these are and analyze their similarities, e.g. do they all come from the same source system or not?
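
Here is a minimal sketch of what such a record with attached rule results could look like; the record layout and rule names are made up for illustration.

```python
# Sketch of per-record rule results; record layout and rules are hypothetical.
import json
from datetime import date

record = {
    "customer_id": 4711,
    "company_name": None,
    "city": "SAN JOSE",
    "source_system": "ERP-EU",
}

rules = {
    "company_name_not_null": lambda r: r["company_name"] is not None,
    "city_is_standardized": lambda r: r["city"] == (r["city"] or "").title(),
}

# Attach every rule result to the record itself, so a dashboard can later
# drill down to the exact failing records and group them by source system.
record["rule_results"] = [
    {"rule": name, "passed": bool(check(record)), "checked_on": date.today().isoformat()}
    for name, check in rules.items()
]

print(json.dumps(record, indent=2))
```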

The interesting bit is that the above requirements fit perfectly into a modern data integration architecture. We get that essentially for free.

Whatever the source system does to increase data quality is its own business. If something can be improved there, it will certainly be a project of its own.

All changed data is pushed into Kafka, into the raw layer, and a rules engine (github repo) validates the records. The original record, with all rule results attached, is put into the silver layer and can now be used: in the context of reporting, because the data is stored in the Lakehouse, but also by other realtime processes that act on it immediately.
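
A minimal sketch of that raw-to-silver step, assuming kafka-python and JSON-encoded records; the topic names and rules are hypothetical, and the linked rules engine will look different in detail.

```python
# Sketch of the raw -> validate -> silver flow; topics and rules are hypothetical.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "customer-raw",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

RULES = {
    "company_name_not_null": lambda r: r.get("company_name") is not None,
    "city_not_empty": lambda r: bool(r.get("city")),
}

for msg in consumer:
    record = msg.value
    # Keep the original record and attach all rule results instead of rejecting it.
    record["rule_results"] = [
        {"rule": name, "passed": bool(check(record))} for name, check in RULES.items()
    ]
    producer.send("customer-silver", value=record)
```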

The AI processes can now be trained with the cleansed data of the Lakehouse, they know about the data quality issues of every single record, and the trained model can consume either the raw data or the data from the cleansed (silver) Kafka topic.
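
A small sketch of how a training job could use those per-record results, assuming the silver data is available as Parquet with the rule_results column from above (the path and column names are hypothetical):

```python
# Sketch: filter training data by the attached rule results; names are hypothetical.
import pandas as pd

df = pd.read_parquet("lakehouse/silver/customer")

# Use only records that passed every rule as training input.
clean = df[df["rule_results"].apply(lambda results: all(r["passed"] for r in results))]
training_set = clean.drop(columns=["rule_results"])
```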

From an organizational point of view, the Data Quality dashboards are important, as they allow us to find problems, quantify them and present them to the source systems and to management, creating pressure to work on this topic.

More details in my company blog.
