A Data Quality Operating System for today's AI world

Having worked in this space for a long time, I see only bits and pieces of data quality processes, but no integrated process that covers the diverse needs of the different areas.

Looking at SAP as just one example, we have:

  • Data Quality services to plug things like address suggestion lists into the ERP system, for faster and more accurate customer address entry.
  • Tools to query the data in the database according to specified rules, e.g. count the rows where the company name is null, and then report that on a daily basis as Data Quality dashboards (see the sketch after this list).
  • Data Quality transformations in the ETL process.
  • A Master Data Management solution that enables users to create golden records combining the information from different rows and different systems.
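
As a minimal sketch of what such a rule query could look like, assuming a SQLite database with a customer table (the database file, table and column names are made up for illustration):

```python
# Sketch of a rule-based query check; database file, table and column names are hypothetical.
import sqlite3

conn = sqlite3.connect("erp.db")
null_names = conn.execute(
    "SELECT COUNT(*) FROM customer WHERE company_name IS NULL"
).fetchone()[0]
print(f"Rule 'company_name is not null' violated by {null_names} rows today")
conn.close()
```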

Bits and pieces, as I said. But why are there so many options? What about the statements below?

  • The data quality should be perfect in the ERP system already.
  • Whatever issue is found during the ETL process should be played back into the source system.
  • Data Quality dashboards are informational only; they do not help increase the data quality.
  • We can only store the data quality results in an aggregated fashion due to the volume. 10m rows times 100 rules would mean 1000m rows with rule results. That is way too expensive to store in a database.

Based on my experience, all of these problems stem from misunderstandings. First, Data Quality is not absolute but is measured in a specific context. What is perfect quality in one context is unusable in another. Second, data quality comes at a cost. The higher you aim, the exponentially higher the cost.

Let's prove these statements with an example:

When is an address perfect? When all fields are set and correct, agreed? Okay, so "2551 N First St, San Jose, CA 95131, US" is correct, yes? If that is correct, then the city name "San José" is not correct. And "SAN JOSE" is not correct either. This is where the context comes into play. For the ERP system we would argue all of these records are correct, that there are multiple perfect values, because the ERP system works mostly in the context of a single record, this one record, and its sole purpose is to ship something. Hence in this context even a missing state CA would be okay, because the product would still reach its destination.

We cannot even say the perfect record is the one the US postal service uses as reference data, because a newly finished building might not be part of the postal reference data yet. We still want to have those tenants as customers.

For an analytical application, in a context where multiple rows are viewed together, we need a higher degree of quality. We need city names to be standardized. Otherwise the report shows 10 rows when grouping the data per city, or worse, the user queries for "SAN JOSE" and gets only a single record out of many without noticing it.
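
To make that concrete, here is a tiny illustration in pandas with made-up rows: three spellings of the same city produce three groups until a standardized column is added.

```python
# Tiny illustration of the grouping problem; the data is made up.
import pandas as pd

orders = pd.DataFrame({
    "city": ["San Jose", "SAN JOSE", "San José"],
    "revenue": [100, 200, 300],
})

# Without standardization, one city shows up as three groups in the report.
print(orders.groupby("city")["revenue"].sum())

# A standardized city column collapses them into a single group.
orders["city_std"] = (
    orders["city"]
    .str.normalize("NFKD")
    .str.encode("ascii", "ignore")
    .str.decode("ascii")
    .str.title()
)
print(orders.groupby("city_std")["revenue"].sum())
```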

This also illustrates the second misunderstanding, that data quality comes at a cost. In the ERP system the data should be entered as quickly as possible. Just imagine a system where error messages pop up constantly: "SAN JOSE" is not a valid city, select the correct one from a list of 300'000 US cities. This address is not found in the postal reference data, sorry, you cannot order!?! In the ERP system, good-enough is way more important than perfection.

Finally, many of these tools were introduced years ago, when there was no such thing as a Lakehouse. Yes, storing 1000m rows in a database is expensive, but storing them in the Lakehouse as an array column is cheap. Reality has changed; the assumptions of back then are no longer valid.
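
As a minimal sketch of that idea, assuming PySpark on a Lakehouse table where each record carries its rule results in one array column instead of 1000m separate result rows (the table, column and rule names are hypothetical):

```python
# Sketch only: table and column names are hypothetical.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# One row per record; all rule results attached as array<struct<rule:string, passed:boolean>>.
customers = spark.read.table("silver.customer")

# Records that failed at least one rule, grouped by their source system.
failed = customers.where(F.exists("rule_results", lambda r: ~r["passed"]))
failed.groupBy("source_system").count().show()
```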

I want an architecture that follows these rules:

  • If the data quality and speed of entry can be improved in the source system, go with that. The customer starts entering address information and a short suggestion list lets them fill out all other fields automatically. Perfect.
  • Data standardization happens while loading the data into the analytical platform.
  • Data validation also happens in the data integration pipeline and augments the analytical platform with its metadata: "yes, the customer master record is correct"; "don't create analytics on the field GENDER as it is mostly empty".
  • The output of the validation can be used to trigger a workflow or a notification. For example, update the record, send it to the MDM system as a proposal, and the end user can then accept the standardized address data.
  • Data Quality dashboards can be created using the rule results attached to each single record (see the sketch after this list). I am not only interested that today there are five records in the database with a null customer name, I want to know which records these are and analyze their similarities, e.g. do they all come from the same source system or not?
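
Here is a minimal sketch of what such a record with attached rule results could look like; the record layout and rule names are made up for illustration.

```python
# Sketch of per-record rule results; record layout and rules are hypothetical.
import json
from datetime import date

record = {
    "customer_id": 4711,
    "company_name": None,
    "city": "SAN JOSE",
    "source_system": "ERP-EU",
}

rules = {
    "company_name_not_null": lambda r: r["company_name"] is not None,
    "city_is_standardized": lambda r: r["city"] == (r["city"] or "").title(),
}

# Attach every rule result to the record itself, so a dashboard can later
# drill down to the exact failing records and group them by source system.
record["rule_results"] = [
    {"rule": name, "passed": bool(check(record)), "checked_on": date.today().isoformat()}
    for name, check in rules.items()
]

print(json.dumps(record, indent=2))
```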

The interesting bit is that the above requirements fit perfectly into a modern data integration architecture. We get that essentially for free.

Whatever the source system does to increase data quality is its own business. If something can be improved there, it will certainly be a project of its own.

All changed data is pushed into Kafka, into the raw layer, and a rules engine (github repo) validates the records. The original record, with all rule results attached, is put into the silver layer and can now be used: in the context of reporting, because the data is stored in the Lakehouse, but also by other realtime processes that act on it immediately.
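
A minimal sketch of that raw-to-silver step, assuming kafka-python and JSON-encoded records; the topic names and rules are hypothetical, and the linked rules engine will look different in detail.

```python
# Sketch of the raw -> validate -> silver flow; topics and rules are hypothetical.
import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer(
    "customer-raw",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

RULES = {
    "company_name_not_null": lambda r: r.get("company_name") is not None,
    "city_not_empty": lambda r: bool(r.get("city")),
}

for msg in consumer:
    record = msg.value
    # Keep the original record and attach all rule results instead of rejecting it.
    record["rule_results"] = [
        {"rule": name, "passed": bool(check(record))} for name, check in RULES.items()
    ]
    producer.send("customer-silver", value=record)
```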

The AI processes can now be trained with the cleansed data of the Lakehouse, they know about the data quality issues of every single record, and the trained model can consume either the raw data or the data from the cleansed (silver) Kafka topic.
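
A small sketch of how a training job could use those per-record results, assuming the silver data is available as Parquet with the rule_results column from above (the path and column names are hypothetical):

```python
# Sketch: filter training data by the attached rule results; names are hypothetical.
import pandas as pd

df = pd.read_parquet("lakehouse/silver/customer")

# Use only records that passed every rule as training input.
clean = df[df["rule_results"].apply(lambda results: all(r["passed"] for r in results))]
training_set = clean.drop(columns=["rule_results"])
```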

From an organizational point of view, the Data Quality dashboards are important, as they allow us to find problems, quantify them and present them to the source systems and to management, creating pressure to work on this topic.

More details in my company blog.
