This document describes an approach used by PoolParty (PPT) to validate and repair taxonomies integrated from various data sources, so that they conform to a set of quality heuristics. The approach specifies data constraints in SHACL, validates datasets against those constraints to detect violations, and combines the formal constraint definitions with reusable repair strategies that can resolve violations automatically. It was tested on both PPT-generated and third-party datasets: validation performance varied with dataset size and structure, while the repair strategies scaled well to larger datasets.
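The idea of combining constraint definitions with reusable repair strategies can be illustrated with a minimal sketch. It is not PPT's implementation and it does not use SHACL: the constraints are plain Python predicates over a toy in-memory taxonomy, and every name here (`Constraint`, `validate`, `validate_and_repair`, the example URIs) is hypothetical. In SHACL, the first constraint would correspond roughly to a property shape with `sh:path skos:prefLabel` and `sh:minCount 1`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Toy taxonomy: concept URI -> property name -> list of values.
# This structure is an assumption for illustration, not PPT's data model.
Taxonomy = Dict[str, Dict[str, List[str]]]


@dataclass
class Constraint:
    """A constraint check paired with a reusable repair strategy."""
    name: str
    violates: Callable[[Taxonomy, str], bool]  # True if the concept breaks the rule
    repair: Callable[[Taxonomy, str], None]    # mutates the taxonomy to fix it


def missing_pref_label(tax: Taxonomy, uri: str) -> bool:
    return len(tax[uri].get("prefLabel", [])) == 0


def label_from_uri(tax: Taxonomy, uri: str) -> None:
    # Repair: derive a label from the last path segment of the URI.
    local = uri.rstrip("/").rsplit("/", 1)[-1]
    tax[uri]["prefLabel"] = [local.replace("_", " ")]


def self_broader(tax: Taxonomy, uri: str) -> bool:
    return uri in tax[uri].get("broader", [])


def drop_self_broader(tax: Taxonomy, uri: str) -> None:
    # Repair: remove the reflexive broader link, keep all others.
    tax[uri]["broader"] = [b for b in tax[uri]["broader"] if b != uri]


CONSTRAINTS = [
    Constraint("pref-label-required", missing_pref_label, label_from_uri),
    Constraint("no-self-broader", self_broader, drop_self_broader),
]


def validate(tax: Taxonomy, constraints: List[Constraint]) -> List[Tuple[str, str]]:
    """Return (constraint name, concept URI) for every violation found."""
    return [(c.name, uri) for c in constraints for uri in tax if c.violates(tax, uri)]


def validate_and_repair(tax: Taxonomy, constraints: List[Constraint]) -> List[Tuple[str, str]]:
    """Validate, apply the matching repair to each violation, return the report."""
    report = validate(tax, constraints)
    by_name = {c.name: c for c in constraints}
    for name, uri in report:
        by_name[name].repair(tax, uri)
    return report


def example_taxonomy() -> Taxonomy:
    # Hypothetical data with one violation of each constraint.
    return {
        "http://example.org/Red_Wine": {
            "prefLabel": [],
            "broader": ["http://example.org/Wine"],
        },
        "http://example.org/Wine": {
            "prefLabel": ["Wine"],
            "broader": ["http://example.org/Wine"],
        },
    }
```

Calling `validate_and_repair(example_taxonomy(), CONSTRAINTS)` reports both violations and fixes them in place, so a second `validate` pass on the same taxonomy returns an empty list. Keeping each repair next to the constraint it resolves is what makes the strategies reusable across datasets in this sketch.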