A Data Quality Operating System for today's AI world
Having worked in this space for a long time, I see only bits and pieces of data quality processes, but no integrated process that covers the diverse needs of the different areas.
Looking at SAP as just one example, we have
Bits and pieces, as said. But why are there so many options? Consider the statements below.
Based on my experience, all of these problems stem from two misunderstandings. First, data quality is not absolute but is measured in a specific context. What is perfect quality in one context is unusable in another. Second, data quality comes at a cost, and the costs grow exponentially the higher you aim.
Let's prove these statements with an example:
When is an address perfect? When all fields are set and correct, agreed? Okay, so "2551 N First St, San Jose, CA 95131, US" is correct, yes? If that is correct, then the city name "San José" is not correct, and neither is "SAN JOSE". This is where the context comes into play. For the ERP system we would argue all of these records are correct - there are multiple perfect values - because the ERP system works mostly in the context of a single record, this record, and its sole purpose is to ship something. Hence in this context even a missing state CA would be okay, because the product would still reach its destination.
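To make the context argument concrete, here is a minimal Python sketch of two hypothetical rule sets applied to the same record. The field names, rule names, and the choice of "San Jose" as the standardized spelling are assumptions for illustration only:

```python
# A minimal sketch of context-dependent validation (all rule names and
# thresholds here are hypothetical).

RECORD = {"street": "2551 N First St", "city": "SAN JOSE",
          "state": None, "zip": "95131", "country": "US"}

def erp_rules(rec):
    # ERP context: the record only has to be deliverable.
    return {
        "has_street": bool(rec.get("street")),
        "has_zip_and_country": bool(rec.get("zip")) and bool(rec.get("country")),
        # state is optional - the parcel still arrives without it
    }

def analytics_rules(rec):
    # Analytical context: values must also be standardized for grouping.
    checks = erp_rules(rec)
    checks["city_standardized"] = rec.get("city") == "San Jose"
    return checks

print(erp_rules(RECORD))        # every check passes -> "perfect" in this context
print(analytics_rules(RECORD))  # city_standardized fails -> not good enough here
```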
We cannot even say that the perfect record is the one the US postal service uses as reference data, because a newly finished building might not be part of the postal reference data yet. We still want those tenants as customers.
For an analytical application, whose context is multiple rows viewed together, we need a higher degree of quality. We need city names to be standardized. Otherwise the report shows 10 rows when grouping the data per city, or worse, the user queries for "SAN JOSE" and gets only a single record out of many without noticing it.
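A small, hypothetical pandas example shows the effect. The data and spellings are made up, but the grouping behaviour is exactly what happens when city names are not standardized:

```python
import pandas as pd

# Hypothetical extract from the analytical layer: the same city, three spellings.
orders = pd.DataFrame({
    "order_id": [1, 2, 3, 4],
    "city": ["San Jose", "San José", "SAN JOSE", "San Jose"],
})

# Grouping per city: the report shows three "cities" instead of one.
print(orders.groupby("city").size())

# A filter on one spelling silently drops the other records.
print(orders[orders["city"] == "SAN JOSE"])  # 1 row out of 4
```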
This also illustrates the second point, that data quality comes with costs. In the ERP system the data should be entered as quickly as possible. Just imagine a system where error messages pop up constantly: "SAN JOSE" is not a valid city - select the correct one from the list of 300'000 US cities. This address is not found in the postal reference data - sorry, you cannot order!?! In the ERP system, good enough is far more important than perfection.
Finally, many tools were introduced years ago, when there was no such thing as a Lakehouse. Yes, storing 1000 million rows in a relational database is expensive, but storing them in the Lakehouse as an array column is cheap. Reality has changed; the assumptions from back then are no longer valid.
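As a rough illustration of the storage argument, attaching the rule results to each record as an array column in an open columnar format is a one-liner. The schema below, a rule_results list column with "rule:outcome" strings, is purely an assumption for this sketch:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical silver-layer layout: rule results travel with the record
# as a list column instead of one row per result in a relational table.
table = pa.table({
    "customer_id": [1001, 1002],
    "city": ["San Jose", "SAN JOSE"],
    "rule_results": [
        ["has_street:pass", "city_standardized:pass"],
        ["has_street:pass", "city_standardized:fail"],
    ],
})

pq.write_table(table, "customers_silver.parquet")  # columnar, cheap to store and scan
```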
I want an architecture that follows these rules:
The interesting bit is that the above requirements fit perfectly into a modern data integration architecture. We get that essentially for free.
What the source system does to increase data quality is its own business. If something can be improved there, it will certainly be a project of its own.
All changed data is pushed into Kafka, into the raw layer, and a rules engine (github repo) validates the records. The original record, with all rule results attached, is put into the silver layer and can now be used: in the context of reporting, because the data is stored in the Lakehouse, but also by other realtime processes that act on it immediately.
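A simplified stand-in for that flow might look like the sketch below. The broker address, topic names and rules are assumptions for illustration; the real rule definitions live in the rules engine linked above:

```python
import json
from confluent_kafka import Consumer, Producer

# Hypothetical broker and topic names.
consumer = Consumer({"bootstrap.servers": "localhost:9092",
                     "group.id": "dq-rules-engine",
                     "auto.offset.reset": "earliest"})
producer = Producer({"bootstrap.servers": "localhost:9092"})
consumer.subscribe(["customers-raw"])

def apply_rules(record):
    # Each rule yields a name plus pass/fail; the results travel with the record.
    return [
        {"rule": "has_street", "passed": bool(record.get("street"))},
        {"rule": "city_standardized", "passed": record.get("city") == "San Jose"},
    ]

while True:
    msg = consumer.poll(1.0)
    if msg is None or msg.error():
        continue
    record = json.loads(msg.value())
    record["_dq_results"] = apply_rules(record)  # original record + attached rule results
    producer.produce("customers-silver", json.dumps(record).encode("utf-8"))
    producer.poll(0)  # serve delivery callbacks
```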
The AI processes can now be trained on the cleansed data in the Lakehouse and know about the data quality issues of every single record, and the trained model can consume either the raw data or the data from the cleansed (silver) Kafka topic.
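A minimal sketch of that training-side view, assuming the hypothetical silver-layer Parquet layout from the earlier example, could filter (or weight) records by their attached rule results:

```python
import pandas as pd

# Read the silver layer from the Lakehouse (hypothetical file and layout).
silver = pd.read_parquet("customers_silver.parquet")

def all_rules_passed(results):
    return all(r.endswith(":pass") for r in results)

# Train only on records without known quality issues - or keep the flags as features.
train_df = silver[silver["rule_results"].apply(all_rules_passed)]
print(f"training on {len(train_df)} of {len(silver)} records")
```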
From an organizational point of view, the Data Quality dashboards are important: they allow us to find problems, quantify them, and present them to the source systems and to management, creating the pressure needed to work on this topic.
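The numbers behind such a dashboard can come straight from the attached rule results. A hypothetical aggregation, again based on the assumed silver-layer layout from above:

```python
import pandas as pd

# Failure counts per rule; in practice this would also be broken down per source system.
silver = pd.read_parquet("customers_silver.parquet")

results = silver.explode("rule_results")["rule_results"].str.split(":", expand=True)
results.columns = ["rule", "outcome"]
summary = results.groupby(["rule", "outcome"]).size().unstack(fill_value=0)
print(summary)  # pass/fail counts per rule -> the dashboard's raw numbers
```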
More details in my company blog.