AN EXTENDED DATA OBJECT-DRIVEN APPROACH TO DATA
QUALITY EVALUATION: CONTEXTUAL DATA QUALITY ANALYSIS
21st International Conference on Enterprise Information Systems (ICEIS),
Heraklion, Crete – Greece, 2019
Anastasija Nikiforova, Janis Bicevskis
Faculty of Computing, University of Latvia
Anastasija.Nikiforova@lu.lv
• “Quality” is a desirable goal to be achieved through management of the
production process.
• «Data quality» is a relative concept, largely dependent on specific
requirements resulting from the data use.
QUALITY AND DATA QUALITY
Source: Bičevska (2018)
Source: ISO 9001:2015: Quality
management principles.
2016
Decisions resulting from bad
data cost the US economy
$3.1 trillion per year
-IBM
2017
Organizations believe poor data quality
to be responsible for an average of $15
million per year in losses
-Gartner
Data quality weaknesses
can lead to huge losses
!!! The same data may be
of sufficient quality in one case
BUT
completely useless under other
circumstances.
«Dimensions are not defined in a measurable and formal way»
-Batini et al., 2016, DAMA, 2019, Huang et al., 1999, Eppler, 2006
«…Even amongst data quality professionals the key data quality dimensions
are not universally agreed. This state of affairs has led to much confusion
within the data quality community and is even more bewildering for those
who are new to the discipline and more importantly to business
stakeholders…»
-DAMA, 2019
RELATED RESEARCH
• General studies on data and information quality: define different
dimensions of quality and their groupings, as well as data assessment
methodologies.
• Assessments of specific industry data and information quality:
sector-specific methods.
  • Cancer registry, Healthcare, Manufacturing, Chemical Hazard Risk Assessments, etc.
BUT!!!
• There is no consensus on data quality dimensions
and their usability.
• How to relate a particular dimension (and which
one?) to a particular use-case???
• Dimensions of the same name can have different
semantics in different studies.
Problem: the need to involve data quality experts at every stage of the data
quality analysis process
Solution: data object-driven approach to data quality evaluation
(Bicevskis, Bicevska, Nikiforova, Oditis, 2018)
TDQM data quality lifecycle
Data quality definition
Data quality measuring
Data quality analysis
Data quality improvement
MAIN PRINCIPLES OF THE PROPOSED
SOLUTION
• Each specific application can have its own specific DQ checks;
• DQ requirements can be formulated on several levels:
  • from informal text in natural language
  • to an automatically executable model, SQL statements or program
  code;
• DQ can be checked at various stages of data processing;
• The DQ definition language is a graphical DSL:
  • the diagrams are easy to read, create, understand and edit, even by
  non-IT and non-Data Quality professionals;
  • syntax and semantics can be easily applied to any new IS.
!!! All three components are
defined by using a graphical
domain specific language
(DSL)**
**Three DSL families were developed as graphic languages
based on the possibilities of the modelling platform DIMOD
1. DATA OBJECT (DO) - the set of values of the parameters that characterize a real-life object
• primary data object - the initial DO whose quality is analysed;
• secondary data object - a DO that determines the context for analysis of the primary DO.
* Many objects of the same structure form a class of data objects
2. DATA QUALITY REQUIREMENTS - conditions that must be met for a data object to be
considered of high quality.
** May contain: informal or formalized implementation-independent descriptions of conditions
3. DATA QUALITY MEASURING PROCESS - procedures that should be performed to evaluate the
data object’s quality.
DATA QUALITY MODEL
instead of dimensions
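The three components above can be pictured in ordinary code. The following is a minimal sketch only: the authors define these components in graphical DSLs, not in Python, and the names `QualityRequirement`, `MeasuringProcess` and `evaluate` are hypothetical, introduced here purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

# 1. Data object (DO): the parameter values that characterise one real-life object.
DataObject = Dict[str, Optional[str]]


@dataclass
class QualityRequirement:
    """2. A named condition the data object must meet (hypothetical structure)."""
    name: str
    condition: Callable[[DataObject], bool]


@dataclass
class MeasuringProcess:
    """3. Applies every requirement to a data object and reports the failed ones."""
    requirements: List[QualityRequirement] = field(default_factory=list)

    def evaluate(self, data_object: DataObject) -> List[str]:
        return [r.name for r in self.requirements if not r.condition(data_object)]


def has_eight_digits(do: DataObject) -> bool:
    value = do.get("CompanyNumber") or ""
    return value.isascii() and value.isdigit() and len(value) == 8


# Illustrative company record; the values are made up.
company: DataObject = {"CompanyName": "ACME LTD", "CompanyNumber": "0123456"}

process = MeasuringProcess([
    QualityRequirement("name exists", lambda do: bool(do.get("CompanyName"))),
    QualityRequirement("number has 8 digits", has_eight_digits),
])
failed = process.evaluate(company)
print(failed)
```

Keeping the requirements separate from the data object mirrors the idea that the same data can be checked against different requirements for different use cases.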
DATA QUALITY ANALYSIS. STEP-BY-STEP GUIDE
0-1. Definition of the use case
0-2. Analysis of source data
1-1. Definition of the primary data object
1-2. Definition of the secondary data object(-s)
1-3. Primary and secondary data objects linking
2-1. Primary data object quality specification
2-2. Primary and secondary data objects linking conditions
3. Data quality measuring process
defined using
graphical DSL
4-1. Analysis of the results
4-2. Data quality improvement (MS DQS)
Use-cases:
1. company search/ identification
(by its name, registration
number, incorporation date);
2. contacting by post
(by address and postal code)
Company registers of:
• United Kingdom (UK)
• Latvia (LV)
• Estonia (EE)
• Norway (NOR)
Global Open Data Index
UK: 1st place
LV: 18th place
EE: -
NOR: 1st place
APPROBATION. DATA SETS
Country          # of columns   # of columns with quality problems (number, %)
United Kingdom   55             15 (27.3%)
Latvia           22             11 (50%)
Estonia          14             7 (50%)
Norway           42             8 (19%)
1) company identification
(by its name, registration number and incorporation date)
2) contacting by post
(by its address and postal code)
Country   Identification   Name           Reg. Nr.   Incorporation date
UK        -                1 (0.0001%)    0          3 invalid (0.0004%)
Latvia    -                10 (0.0025%)   0          94 NULL (0.02%)
Estonia   +                0              0          -
Norway    -                0              0          9 doubtful (0.001%)

Country   Contacting by post   Address                                Postal code
UK        -                    7 514 NULL (1%), 4 invalid (0.0005%)   12 151 (1.6%)
Latvia    -                    366 (0.09%)                            20 498 (5.16%)
Estonia   -                    29 918 (11.24%)                        22 621 (8.5%)
Norway    -                    68 128 (6.2%)                          14 683 (1.3%)
APPROBATION. RESULTS
Mainly syntactic analysis was done -
analysis in scope of one data object
!!!
More in-depth and comprehensive
analysis should be done -
analysis in scope of multiple data
objects
TOTAL: 128 different values,
that possibly contain data quality problems
Various names indicating the
same country
USA
United States
United States of America
Northern Ireland
Republic of Ireland
Ireland
Virgin Island
British Virgin Island
Virgin Islands, British
Scotland
Scotland UK
…
???
Which of them
is valid?
APPROBATION. ADDITIONAL CHECKS
OF «COMPANIES HOUSE» (UK)
#   Type of issue                                           Example
1   various names indicating the same country               USA, United States and United States of America, etc.
2   names of dissolved countries                            Czechoslovakia, Yugoslavia, USSR
3   values indicating administrative division or region     Wales, Scotland, England & Wales, England, …
4   not countries at all                                    “SW7”, “EAST SUSSEX”, “BWI”, “DE 19901”
The single data object analysis indicates the mere
existence of the data quality problem without
detecting all the defective records.
The secondary data object is
needed!!!
• A data object is platform-independent.
• Checking parameter values is a local and
formal process.
• Quality checking of a DO parameter’s values
is an examination of the properties
of the individual values, e.g. whether:
  • (1) a text string may serve as a value of the field Name,
  • (2) the value of the field Address is a correct address.
• Can be formulated at different levels of abstraction:
  • from a formal language grammar
  • to definitions of variables in programming languages.
DATA OBJECT
Secondary DO
Primary DO
• Quality conditions are defined only for the
primary data object.
• DQ requirements are defined by using logical
expressions.
• The names of DO attributes/ fields serve as
operands in the logical expressions.
• Both syntactical and semantical data quality can
be analysed according to unified principles.
DATA QUALITY SPECIFICATION
[Diagram: data quality specification of the primary DO in the graphical DSL. Branches are labelled OK / NO, and the diagram contains SendMessage nodes.]
Assess Field "CountryOfOrigin": checkvalueExists(CountryOfOrigin)
Assess Field "URI": checkValueExists(URI); checkValueURI(URI, 'http://business.data.gov.uk.id/company/$CompanyName')
Assess Field "CompanyNumber": checkValueExists(CompanyNumber); checkValueDigits(8)
Assess Field "RegAddress AddressLine1": checkValueExists(RegAddress AddressLine1)
Assess Field "IncorporationDate": checkValueExists(IncorporationDate); checkValueDate(IncorporationDate, "DD/MM/YYYY")
Assess Field "RegAddress AddressPostCode": checkvalueExists(RegAddress AddressPostCode)
Assess Field "CompanyName": checkValueExists(CompanyName)
Assess Field "RegAddressCountry": checkvalueExists(RegAddressCountry)
Secondary DO (Country): ShortName, OfficialName, ISO2, ISO3, UNDP
Link between primary and secondary DOs (informal rule):
checkCountryOfOriginName(Country, CountryOfOrigin)
checkRegAddressCountryName(Country, RegAddressCountry)
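A few of the checks named in the diagram could be written as plain predicates. This is a hedged sketch, not the authors' implementation: the Python function names, the message texts and the sample record are assumptions chosen to echo `checkValueExists`, `checkValueDigits` and `checkValueDate`, and the returned list stands in for the diagram's SendMessage nodes.

```python
from datetime import datetime
from typing import Dict, List, Optional


def check_value_exists(value: Optional[str]) -> bool:
    """True when the field holds a non-blank value."""
    return value is not None and value.strip() != ""


def check_value_digits(value: Optional[str], length: int) -> bool:
    """True when the value consists of exactly `length` ASCII digits."""
    return (value is not None and value.isascii()
            and value.isdigit() and len(value) == length)


def check_value_date(value: Optional[str], fmt: str = "%d/%m/%Y") -> bool:
    """True when the value parses as a real calendar date in DD/MM/YYYY form.

    Note: strptime also accepts days and months without a leading zero.
    """
    if value is None:
        return False
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


def assess_company(record: Dict[str, Optional[str]]) -> List[str]:
    """Run the checks field by field and collect one message per failed check."""
    messages: List[str] = []
    number = record.get("CompanyNumber")
    date = record.get("IncorporationDate")
    if not check_value_exists(record.get("CompanyName")):
        messages.append("CompanyName: value missing")
    if not check_value_exists(number):
        messages.append("CompanyNumber: value missing")
    elif not check_value_digits(number, 8):
        messages.append("CompanyNumber: expected 8 digits")
    if not check_value_exists(date):
        messages.append("IncorporationDate: value missing")
    elif not check_value_date(date):
        messages.append("IncorporationDate: not a valid DD/MM/YYYY date")
    return messages


# Made-up record: a 7-digit number and an impossible date (31 February).
messages = assess_company({
    "CompanyName": "ACME LTD",
    "CompanyNumber": "1234567",
    "IncorporationDate": "31/02/2019",
})
print(messages)
```

All of these checks look at one value at a time, which is why they cannot tell whether a non-empty country field actually names a country; that needs the secondary DO.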
DATA QUALITY MEASURING
PROCESS
The activities to be taken to select data object values from data sources.
One or more steps to evaluate the quality of the data, each of which describes one
test for the compliance of the data object with a specific quality specification.
+
Gather values of the secondary DOs from the data sources if the parameter indicating
the secondary DO’s value in scope of defined quality condition is true:
1. read/ write operations from data source into database,
2. connection of primary and secondary data objects via appropriate
parameters
The steps to improve data quality automatically or manually triggering changes in
the data source.
For contextual
checks
The language describing the quality evaluation
process involves verification activities for a
particular DO that can be defined:
• informally, as natural language text,
• using UML activity diagrams,
• in its own DSL.
Additionally, processing instances of a DO class
may require looping constructs, similar to
the iterators used in C#.
• A concrete DO or a class of DO is used as an
input for a quality verification process.
• The quality verification process creates a test
protocol.
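The looping over instances of a DO class and the resulting test protocol can be sketched with a generator. This is an illustration under assumptions: the protocol layout (record index, check name, OK/NO) and the name `verify` are hypothetical and are not taken from the authors' DSL.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Record = Dict[str, str]
Check = Tuple[str, Callable[[Record], bool]]
ProtocolEntry = Tuple[int, str, str]


def verify(records: Iterable[Record], checks: List[Check]) -> Iterator[ProtocolEntry]:
    """Iterate over all instances of a DO class and yield one protocol entry per check."""
    for index, record in enumerate(records):
        for name, predicate in checks:
            yield (index, name, "OK" if predicate(record) else "NO")


# Made-up input: two instances of the same DO class.
records = [{"CompanyName": "ACME LTD"}, {"CompanyName": ""}]
checks: List[Check] = [
    ("CompanyName exists", lambda r: bool(r.get("CompanyName", "").strip())),
]

protocol = list(verify(records, checks))
for entry in protocol:
    print(entry)
```

Because `verify` is lazy, the same loop works for a single concrete DO or for a large class of DOs read from a file or database cursor.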
In case of SQL:
• the SELECT statement specifies the target DO
• the WHERE clause specifies quality
requirements
+
• the JOIN clause links primary and secondary
DOs
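The SQL mapping above can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch with assumed table names (`company`, `country`) and made-up rows; it is not the query used in the study.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company (CompanyName TEXT, CountryOfOrigin TEXT);  -- primary DO
    CREATE TABLE country (ShortName TEXT, ISO2 TEXT);               -- secondary DO
    INSERT INTO company VALUES
        ('A LTD', 'Latvia'), ('B LTD', 'SW7'), ('C LTD', 'Greece'), ('D LTD', NULL);
    INSERT INTO country VALUES ('Latvia', 'LV'), ('Greece', 'GR');
""")

# Single data object: SELECT names the target DO, WHERE states the requirement
# (here: report records whose country value is missing).
missing = conn.execute("""
    SELECT CompanyName
    FROM company
    WHERE CountryOfOrigin IS NULL OR CountryOfOrigin = ''
""").fetchall()

# Contextual check: the JOIN links primary and secondary DOs; rows with a value
# but no matching country short name violate the requirement.
unmatched = conn.execute("""
    SELECT c.CompanyName, c.CountryOfOrigin
    FROM company AS c
    LEFT JOIN country AS k ON UPPER(c.CountryOfOrigin) = UPPER(k.ShortName)
    WHERE c.CountryOfOrigin IS NOT NULL AND k.ShortName IS NULL
    ORDER BY c.CompanyName
""").fetchall()

print(missing)
print(unmatched)
conn.close()
```

The first query only sees that a value is present, so 'SW7' passes it; the joined query is what exposes 'SW7' as not being a country.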
DATA QUALITY MEASURING
PROCESS
BERMUDA
BWI
…
CZECHOSLOVAKIA
DE 19901
EAST SUSSEX
ENGLAND
ENGLAND & WALES
GIBRALTAR
Great Britain
HOLLAND
…
JERSEY
…
ST VINCENT
NORTHERN IRELAND
REPUBLIC OF IRELAND
Country Of Origin          Short Name                 Official Name                  ISO3   ISO2
…                          …                          …                              …      …
DE 19901                   NULL                       NULL                           NULL   NULL
GREECE                     Greece                     the Hellenic Republic          GRC    GR
…                          …                          …                              …      …
LATVIA                     Latvia                     the Republic of Latvia         LVA    LV
…                          …                          …                              …      …
United States of America   United States of America   the United States of America   USA    US
…                          …                          …                              …      …
Invalid names
TOTAL: 128 different values,
that possibly can contain data quality
problems
TOTAL: 48 different values,
that definitely have data quality problems
Various names indicating
the same country
USA
United States
United States of America
Northern Ireland
Republic of Ireland
Ireland
Virgin Island
British Virgin Island
Virgin Islands, British
Scotland
Scotland UK
…
REPUBLIC OF NIGERIA
…
SCOTLAND UK
SOUTH KOREA
SW7
TADJIKISTAN
TAIWAN
TURKS & CAICOS ISLANDS
UNITED STATES
UK
USSR
VENEZUELA
VIETNAM
VIRGIN ISLANDS
WEST GERMANY
YEMEN ARAB REPUBLIC
YUGOSLAVIA
???
Which of
them is valid?
Results in scope of single data object Results in scope of multiple data objects
SINGLE vs MULTIPLE DATA OBJECT ANALYSIS
• Analysis of 2 parameters containing names of
countries against 4 representations of countries’
names and their subdivisions.
• Although this problem was observed in 27.6% of
records, it could be solved by making just 48
corrections.
• All values of “CountryOfOrigin” and 73 of 74 values
of “RegAddress Country” conform to one standard,
i.e., the short name.
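Checking a value against several representations of a country name can be sketched as a lookup that also reports which representation matched. The two reference rows come from the table shown earlier in the deck; the function name `matching_representation` and the sample values are assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, List, Optional

# Secondary DO: a tiny excerpt of the country reference data.
COUNTRIES: List[Dict[str, str]] = [
    {"ShortName": "Greece", "OfficialName": "the Hellenic Republic",
     "ISO3": "GRC", "ISO2": "GR"},
    {"ShortName": "Latvia", "OfficialName": "the Republic of Latvia",
     "ISO3": "LVA", "ISO2": "LV"},
]


def matching_representation(value: str) -> Optional[str]:
    """Return the name of the first representation equal to the value, ignoring case."""
    needle = value.strip().casefold()
    for country in COUNTRIES:
        for representation, name in country.items():
            if name.casefold() == needle:
                return representation
    return None


# Made-up values of a primary DO country field.
values = ["GREECE", "LV", "DE 19901", "Latvia"]

usage = Counter(matching_representation(v) for v in values)
invalid = [v for v in values if matching_representation(v) is None]
print(usage)
print(invalid)
```

Counting matches per representation is one way to see which standard a field mostly follows, and the unmatched values are the candidates for correction.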
ONLY 13 instead of 48 invalid
values were detected!!!
• Data quality analysis in the context of multiple data objects was applied to 23 «external» open datasets;
22 different secondary DOs were used.
• 21 of 23 datasets (91.3%) have at least a few data quality issues that weren’t detected previously:
  • initial version: indicated records potentially containing data quality problems, a very resource-consuming
  process;
  • proposed extension of the approach: detects only the records that definitely have data quality problems.
• The initial analysis detected 128 values:
  • only 13 values had data quality problems, instead of 48;
  • 115 values didn’t have data quality problems (false positives).
In this particular case, results of analysis were
improved by 72.9%.
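The slides do not state how the 72.9% figure was computed. One reading that reproduces it, offered here as an assumption, is the share of the 48 confirmed invalid values that the single-object analysis missed:

```python
# Figures quoted in the slides.
flagged_initial = 128      # values flagged by the single data object analysis
confirmed_initial = 13     # of those, values that really had a problem
confirmed_contextual = 48  # invalid values found with the secondary data object

# Flagged values that turned out to be fine.
no_problem = flagged_initial - confirmed_initial

# Assumed formula: additional invalid values found, relative to all 48.
improvement = (confirmed_contextual - confirmed_initial) / confirmed_contextual
improvement_pct = round(improvement * 100, 1)

print(no_problem, improvement_pct)
```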
FEW REMARKS
!!! The proposed structure eliminated the necessity of additional in-depth quality
analysis, as well as writing complex queries and individual analysis of the results.
• A data object-driven approach to data quality evaluation:
  • 3 components: data object, quality specification, quality measuring process, defined using graphical DSLs;
  • provides the ability to analyse «foreign»/ «external» data without the involvement of data holders (higher level of
  abstraction);
  • very intuitive: suitable even for non-IT and non-DQ experts.
• The contextual quality analysis significantly improves data quality analysis results:
  • possibility to analyse a real data object’s quality within the context of multiple data objects;
  • detects the records that definitely have a data quality problem;
  • the number of possible controls where the proposed extended approach can yield valuable results is very high.
• Both syntactical and contextual data quality are analysed according to unified description principles →
the diagram’s structure remained easy to read, create, understand and edit.
User participation in [open] data quality analysis using the presented approach benefits not
only the users themselves but also data holders, when users share their feedback, as data holders are not
even aware of data quality problems.
RESULTS
• application and evaluation of the extended approach in cases of complex data object
structure, including supplementing data objects when a direct connection between the primary and
the secondary data objects is not possible,
• detection of possible limitations of the proposed extended approach,
• ensuring the possibility to evaluate data sets’ evolution,
• assessment of the possibility to provide users with suggestions for data improvement,
• developing data quality theory.
FUTURE WORK
THANK YOU!
For more information, see ResearchGate
See also anastasijanikiforova.com
For questions or any other queries, contact me via email - Anastasija.Nikiforova@lu.lv
Article: Nikiforova, A., & Bicevskis, J. (2019). An Extended Data Object-driven Approach to Data Quality
Evaluation: Contextual Data Quality Analysis. In ICEIS (1) (pp. 274-281).

More Related Content

PDF
Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can...
PPTX
Analysis of open health data quality using data object-driven approach to dat...
PDF
OPEN DATA: ECOSYSTEM, CURRENT AND FUTURE TRENDS, SUCCESS STORIES AND BARRIERS
PDF
IoTSE-based Open Database Vulnerability inspection in three Baltic Countries:...
PDF
Comparative analysis of national open data portals or whether your portal is ...
PDF
Towards enrichment of the open government data: a stakeholder-centered determ...
PDF
Assessment of the usability of Latvia’s open data portal or how close are we ...
PDF
Knowledge graphs dedicated to the memory of amrapali zaveri 3388748
Stakeholder-centred Identification of Data Quality Issues: Knowledge that Can...
Analysis of open health data quality using data object-driven approach to dat...
OPEN DATA: ECOSYSTEM, CURRENT AND FUTURE TRENDS, SUCCESS STORIES AND BARRIERS
IoTSE-based Open Database Vulnerability inspection in three Baltic Countries:...
Comparative analysis of national open data portals or whether your portal is ...
Towards enrichment of the open government data: a stakeholder-centered determ...
Assessment of the usability of Latvia’s open data portal or how close are we ...
Knowledge graphs dedicated to the memory of amrapali zaveri 3388748

What's hot (20)

PDF
Linked Data Quality Assessment: A Survey
PDF
Loops of humans and bots in Wikidata
PPTX
Metadata Quality Assurance
PDF
Crowdsourcing Linked Data Quality Assessment
PPTX
Metadata Quality Assurance Part II. The implementation begins
PDF
TIMELINESS OF OPEN DATA IN OPEN GOVERNMENT DATA PORTALS THROUGH PANDEMIC-RELA...
PDF
A SURVEY OF LINK MINING AND ANOMALIES DETECTION
PDF
Efficient, Scalable, and Provenance-Aware Management of Linked Data
PDF
A comprehensive survey of link mining and anomalies detection
PDF
PETROCHEMICAL PRODUCTION BIG DATA AND ITS FOUR TYPICAL APPLICATION PARADIGMS
PPTX
Semantics-enhanced Cyberinfrastructure for ICMSE : Interoperability, Analyti...
PDF
Amrapali Zaveri Defense
PDF
Data Collection Methods for Building a Free Response Training Simulation
PPTX
PhD Consortium ADBIS presetation.
DOCX
Sherlock a deep learning approach to semantic data type dete
PPTX
Characterizing Data and Software for Social Science Research
PPTX
PhD defense
PPTX
Semantic Data Retrieval: Search, Ranking, and Summarization
PPTX
Enhancing educational data quality in heterogeneous learning contexts using p...
PPTX
A Framework for Linked Data Quality based on Data Profiling and RDF Shape Ind...
Linked Data Quality Assessment: A Survey
Loops of humans and bots in Wikidata
Metadata Quality Assurance
Crowdsourcing Linked Data Quality Assessment
Metadata Quality Assurance Part II. The implementation begins
TIMELINESS OF OPEN DATA IN OPEN GOVERNMENT DATA PORTALS THROUGH PANDEMIC-RELA...
A SURVEY OF LINK MINING AND ANOMALIES DETECTION
Efficient, Scalable, and Provenance-Aware Management of Linked Data
A comprehensive survey of link mining and anomalies detection
PETROCHEMICAL PRODUCTION BIG DATA AND ITS FOUR TYPICAL APPLICATION PARADIGMS
Semantics-enhanced Cyberinfrastructure for ICMSE : Interoperability, Analyti...
Amrapali Zaveri Defense
Data Collection Methods for Building a Free Response Training Simulation
PhD Consortium ADBIS presetation.
Sherlock a deep learning approach to semantic data type dete
Characterizing Data and Software for Social Science Research
PhD defense
Semantic Data Retrieval: Search, Ranking, and Summarization
Enhancing educational data quality in heterogeneous learning contexts using p...
A Framework for Linked Data Quality based on Data Profiling and RDF Shape Ind...
Ad

Similar to AN EXTENDED DATA OBJECT-DRIVEN APPROACH TO DATA QUALITY EVALUATION: CONTEXTUAL DATA QUALITY ANALYSIS (20)

PDF
A step towards a data quality theory
PDF
META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING
PPTX
Data Quality
PPTX
Towards an extensible measurement of metadata quality (DATeCH 2017)
PPTX
Quality key users
PPTX
Data Exploration and Transformation.pptx
DOC
Pradeep_ETL Testing_CV with 3 years of Exerience
PDF
TejGaurThesis
PPTX
Data Management Best Practices
PPTX
Prof. Melinda Laituri, Colorado State University | Map Data Integrity | SotM ...
PPTX
Intro to Data Management
PPT
Dublin Core In Practice
PDF
Creating a Data validation and Testing Strategy
PPTX
Intro to Data warehousing lecture 10
PPTX
How Clean is your Database? Data Scrubbing for all Skill Sets
PDF
Making data typing efforts or automatically detecting data types for automat...
PDF
Dqs mds-matching 15042015
DOC
Etl And Data Test Guidelines For Large Applications
PDF
DIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITY
PDF
Lecture 6 Data Pre-processing in data mining.pdf
A step towards a data quality theory
META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING
Data Quality
Towards an extensible measurement of metadata quality (DATeCH 2017)
Quality key users
Data Exploration and Transformation.pptx
Pradeep_ETL Testing_CV with 3 years of Exerience
TejGaurThesis
Data Management Best Practices
Prof. Melinda Laituri, Colorado State University | Map Data Integrity | SotM ...
Intro to Data Management
Dublin Core In Practice
Creating a Data validation and Testing Strategy
Intro to Data warehousing lecture 10
How Clean is your Database? Data Scrubbing for all Skill Sets
Making data typing efforts or automatically detecting data types for automat...
Dqs mds-matching 15042015
Etl And Data Test Guidelines For Large Applications
DIRA : A FRAMEWORK OF DATA INTEGRATION USING DATA QUALITY
Lecture 6 Data Pre-processing in data mining.pdf
Ad

More from Anastasija Nikiforova (20)

PPTX
From the evolution of public data ecosystems to the evolving horizons of the ...
PDF
Data Quality for AI or AI for Data quality: advances in Data Quality Manageme...
PDF
Towards High-Value Datasets determination for data-driven development: a syst...
PDF
Public data ecosystems in and for smart cities: how to make open / Big / smar...
PDF
Artificial Intelligence for open data or open data for artificial intelligence?
PDF
Overlooked aspects of data governance: workflow framework for enterprise data...
PDF
Data Quality as a prerequisite for you business success: when should I start ...
PDF
Framework for understanding quantum computing use cases from a multidisciplin...
PPTX
Data Lake or Data Warehouse? Data Cleaning or Data Wrangling? How to Ensure t...
PPTX
Putting FAIR Principles in the Context of Research Information: FAIRness for ...
PDF
Open data hackathon as a tool for increased engagement of Generation Z: to h...
PDF
Barriers to Openly Sharing Government Data: Towards an Open Data-adapted Inno...
PDF
Combining Data Lake and Data Wrangling for Ensuring Data Quality in CRIS
PDF
The role of open data in the development of sustainable smart cities and smar...
PDF
Data security as a top priority in the digital world: preserve data value by ...
PDF
ShoBeVODSDT: Shodan and Binary Edge based vulnerable open data sources detect...
PDF
Invited talk "Open Data as a driver of Society 5.0: how you and your scientif...
PDF
Atvērto datu potenciāls
PDF
ATVĒRTO DATU SAVLAICĪGUMS NACIONĀLAJOS ATVĒRTO DATU PORTĀLOS AR PANDĒMIJU SAI...
PDF
Towards a Concurrence Analysis in Business Processes
From the evolution of public data ecosystems to the evolving horizons of the ...
Data Quality for AI or AI for Data quality: advances in Data Quality Manageme...
Towards High-Value Datasets determination for data-driven development: a syst...
Public data ecosystems in and for smart cities: how to make open / Big / smar...
Artificial Intelligence for open data or open data for artificial intelligence?
Overlooked aspects of data governance: workflow framework for enterprise data...
Data Quality as a prerequisite for you business success: when should I start ...
Framework for understanding quantum computing use cases from a multidisciplin...
Data Lake or Data Warehouse? Data Cleaning or Data Wrangling? How to Ensure t...
Putting FAIR Principles in the Context of Research Information: FAIRness for ...
Open data hackathon as a tool for increased engagement of Generation Z: to h...
Barriers to Openly Sharing Government Data: Towards an Open Data-adapted Inno...
Combining Data Lake and Data Wrangling for Ensuring Data Quality in CRIS
The role of open data in the development of sustainable smart cities and smar...
Data security as a top priority in the digital world: preserve data value by ...
ShoBeVODSDT: Shodan and Binary Edge based vulnerable open data sources detect...
Invited talk "Open Data as a driver of Society 5.0: how you and your scientif...
Atvērto datu potenciāls
ATVĒRTO DATU SAVLAICĪGUMS NACIONĀLAJOS ATVĒRTO DATU PORTĀLOS AR PANDĒMIJU SAI...
Towards a Concurrence Analysis in Business Processes

Recently uploaded (20)

PPTX
climate analysis of Dhaka ,Banglades.pptx
PPTX
Introduction-to-Cloud-ComputingFinal.pptx
PPT
Reliability_Chapter_ presentation 1221.5784
PPTX
Introduction to machine learning and Linear Models
PPTX
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
PPTX
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
PPT
Quality review (1)_presentation of this 21
PPTX
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
PPTX
Database Infoormation System (DBIS).pptx
PDF
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
PPTX
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
PPTX
STUDY DESIGN details- Lt Col Maksud (21).pptx
PPTX
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
PDF
annual-report-2024-2025 original latest.
PDF
Fluorescence-microscope_Botany_detailed content
PPTX
Business Ppt On Nestle.pptx huunnnhhgfvu
PPTX
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
PPTX
IBA_Chapter_11_Slides_Final_Accessible.pptx
PDF
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
PPTX
advance b rammar.pptxfdgdfgdfsgdfgsdgfdfgdfgsdfgdfgdfg
climate analysis of Dhaka ,Banglades.pptx
Introduction-to-Cloud-ComputingFinal.pptx
Reliability_Chapter_ presentation 1221.5784
Introduction to machine learning and Linear Models
iec ppt-1 pptx icmr ppt on rehabilitation.pptx
Introduction to Basics of Ethical Hacking and Penetration Testing -Unit No. 1...
Quality review (1)_presentation of this 21
DISORDERS OF THE LIVER, GALLBLADDER AND PANCREASE (1).pptx
Database Infoormation System (DBIS).pptx
“Getting Started with Data Analytics Using R – Concepts, Tools & Case Studies”
Introduction to Firewall Analytics - Interfirewall and Transfirewall.pptx
STUDY DESIGN details- Lt Col Maksud (21).pptx
01_intro xxxxxxxxxxfffffffffffaaaaaaaaaaafg
annual-report-2024-2025 original latest.
Fluorescence-microscope_Botany_detailed content
Business Ppt On Nestle.pptx huunnnhhgfvu
The THESIS FINAL-DEFENSE-PRESENTATION.pptx
IBA_Chapter_11_Slides_Final_Accessible.pptx
TRAFFIC-MANAGEMENT-AND-ACCIDENT-INVESTIGATION-WITH-DRIVING-PDF-FILE.pdf
advance b rammar.pptxfdgdfgdfsgdfgsdgfdfgdfgsdfgdfgdfg

AN EXTENDED DATA OBJECT-DRIVEN APPROACH TO DATA QUALITY EVALUATION: CONTEXTUAL DATA QUALITY ANALYSIS

  • 1. AN EXTENDED DATA OBJECT-DRIVEN APPROACH TO DATA QUALITY EVALUATION: CONTEXTUAL DATA QUALITY ANALYSIS 21st International Conference on Enterprise Information Systems (ICEIS), Heraklion, Crete – Greece, 2019 Anastasija Nikiforova, Janis Bicevskis Faculty of Computing, University of Latvia Anastasija.Nikiforova@lu.lv
  • 2.  “Quality” is a desirable goal to be achieved through management of the production process.  «Data quality» is a relative concept, largely dependent on specific requirements resulting from the data use. QUALITY AND DATA QUALITY Source: Bičevska (2018) Source: ISO 9001:2015: Quality management principles. 2016 Decisions resulting from bad data cost the US economy $3.1 trillion dollars per year -IBM 2017 Organizations believe poor data quality to be responsible for an average of $15 million per year in losses -Gartner Data quality weaknesses can lead to huge losses !!! The same data may be sufficiently qualitative in one case BUT completely useless under other circumstances.
  • 3. «Dimensions are not defined in a measurable and formal way» -Batini et al., 2016, DAMA, 2019, Huang et al., 1999, Eppler, 2006 «…Even amongst data quality professionals the key data quality dimensions are not universally agreed. This state of affairs has led to much confusion within the data quality community and is even more bewildering for those who are new to the discipline and more importantly to business stakeholders…» -DAMA, 2019 RELATED RESEARCHES  General studies on data and information quality - define different dimensions of quality and their groupings as well as data assessment methodologies.  Assessments of specific industry data and information quality - sector- specific methods. • Cancer registry, Healthcare, Manufacturing, Chemical Hazard Risk Assessments, etc. BUT!!!  There is no consensus on data quality dimensions and their usability.  How to relate particular dimension (and which one?) to a particular use-case???  Dimensions of the same name can have different semantics in different researches. Problem: necessity to involve data quality experts at every stage of data quality analysis process Solution: data object-driven approach to data quality evaluation (Bicevskis, Bicevska, Nikiforova, Oditis, 2018)
  • 4. TDQM data quality lifecycle Data quality definition Data quality measuring Data quality analysis Data quality improvemen t MAIN PRINCIPLES OF THE PROPOSED SOLUTION  Each specific application can have its own specific DQ checks;  DQ requirements can be formulated on several levels • from informal text in natural language • to an automatically executable model, SQL statements or program code;  DQ can be checked in various stages of the data processing;  DQ definition language is graphical DSL: • the diagrams are easy to read, create, understand and edit even by non-IT and non-Data Quality professionals; • syntax and semantics can be easily applied to any new IS.
  • 5. !!! All three components are defined by using a graphical domain specific language (DSL)** **Three DSL families were developed as graphic languages based on the possibilities of the modelling platform DIMOD 1. DATA OBJECT (DO) - the set of values of the parameters that characterize a real-life object  primary data object - the initial DO which quality is analysed;  secondary data object – DO that determines the context for analysis of the primary DO. * Many objects of the same structure form class of data objects 2. DATA QUALITY REQUIREMENTS - conditions that must be met in order a data object is considered of high quality. ** May contain: informal or formalized implementation-independent descriptions of conditions 3. DATA QUALITY MEASURING PROCESS - procedures should be performed to evaluate the data object’s quality. DATA QUALITY MODEL instead of dimensions
  • 6. DATA QUALITY ANALYSIS. STEP-BY-STEP GUIDE 0-1. Definition of the use case 0-2. Analysis of source data 1-1. Definition of the primary data object 1-2. Definition of the secondary data object(-s) 1-3. Primary and secondary data objects linking 2-1. Primary data object quality specification 2-2. Primary and secondary data objects linking conditions 3. Data quality measuring process defined using graphical DSL 4-1. Analysis of the results 4-2. Data quality improvement (MS DQS)
  • 7. Use-cases: 1. company search/ identification (by its name, registration number, incorporation date); 2. contacting by post (by address and postal code) Company registers of:  United Kingdom (UK)  Latvia (LV)  Estonia (EE)  Norway (NOR) Global Open Data Index UK: 1st place LV: 18st place EE: - NOR: 1st place APPROBATION. DATA SETS Country # of columns # of columns with quality problems (number, %) United Kingdom 55 15 (27.3%) Latvia 22 11 (50%) Estonia 14 7 (50%) Norway 42 8 (19%)
  • 8. 1) company identification (by its name, registration number and incorporation date) 2) contacting by post (by its address and postal code) Country Identificat ion Name Reg. Nr. Incorporation date UK - 1 0.0001% 0 3 invalid 0.0004% Latvia - 10 0.0025% 0 94 NULL 0.02% Estonia + 0 0 - Norway - 0 0 9 doubtful 0.001% Contactin g by post Address Postal code - 7 514 NULL – 1% 4 invalid – 0.0005% 12 151 1.6% - 366 0.09% 20 498 5.16% - 29 918 11.24% 22 621 8.5% - 68 128 6.2% 14 683 1.3% APPROBATION. RESULTS Mainly syntactic analysis was done - analysis in scope of one data object !!! More in-depth and comprehensive analysis should be done - analysis in scope of multiple data objects
  • 9. TOTAL: 128 different values, that possibly contain data quality problems Various names indicating the same country USA United States United States of America Northern Ireland Republic of Ireland Ireland Virgin Island British Virgin Island Virgin Islands, British Scotland Scotland UK … ??? Which of them is valid? APPROBATION. ADDITIONAL CHECKS OF «COMPANIES HOUSE» (UK) # Type of issue Example 1 various names indicating the same country USA, United States and United States of America etc. 2 names of dissolved countries Czechoslovakia Yugoslavia USSR 3 values indicating administrative division or region Wales Scotland England & Wales England … 4 not countries at all “SW7” “EAST SUSSEX” “BWI” “DE 19901” The single data object analysis indicates the mere existence of the data quality problem without detecting all the defective records. The secondary data object is needed!!!
  • 10. • Data object is platform-independent. • The checking of parameter values is local and formal process. • The quality checking for one of the DO parameters values is an examination of properties of the individual values, e.g. whether: • (1) a text string may serve as a value of the field Name, • (2) value of the field Address is a correct address. • Can be formulated at different levels of abstraction: • from the formal language grammar • to definitions of variables in programming languages. DATA OBJECT Secondary DO Primary DO
  • 11. DATA QUALITY SPECIFICATION
• Quality conditions are defined only for the primary data object.
• DQ requirements are defined using logical expressions.
• The names of DO attributes/fields serve as operands in the logical expressions.
• Both syntactic and semantic data quality can be analysed according to unified principles.
[Diagram: a sequence of "Assess Field" steps; each step has an OK outcome, a NO outcome and a SendMessage action]
• Assess Field "CountryOfOrigin": checkValueExists(CountryOfOrigin)
• Assess Field "URI": checkValueExists(URI), checkValueURI(URI, 'http://business.data.gov.uk.id/company/$CompanyName')
• Assess Field "CompanyNumber": checkValueExists(CompanyNumber), checkValueDigits(8)
• Assess Field "RegAddress AddressLine1": checkValueExists(RegAddress AddressLine1)
• Assess Field "IncorporationDate": checkValueExists(IncorporationDate), checkValueDate(IncorporationDate, "DD/MM/YYYY")
• Assess Field "RegAddress AddressPostCode": checkValueExists(RegAddress AddressPostCode)
• Assess Field "CompanyName": checkValueExists(CompanyName)
• Assess Field "RegAddressCountry": checkValueExists(RegAddressCountry)
Secondary DO (fields): ShortName, OfficialName, ISO2, ISO3, UNDP
Link between primary and secondary DOs (informal rule):
• checkCountryOfOriginName(Country, CountryOfOrigin)
• checkRegAddressCountryName(Country, RegAddressCountry)
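A minimal sketch of how such a specification could look in code, assuming Python in place of the graphical DSL shown on the slide. The check names mirror those in the diagram, but their bodies, the `SPEC` mapping, the `assess` helper and the sample record are illustrative assumptions, not the authors' implementation.

```python
from datetime import datetime

# Hypothetical re-implementations of checks named in the diagram.
def check_value_exists(value):
    """True when the field holds a non-empty value."""
    return value is not None and str(value).strip() != ""

def check_value_digits(value, length):
    """True when the value consists of exactly `length` ASCII digits."""
    text = str(value)
    return text.isascii() and text.isdigit() and len(text) == length

def check_value_date(value, fmt="%d/%m/%Y"):
    """True when the value parses as a valid day/month/year date."""
    try:
        datetime.strptime(str(value), fmt)
        return True
    except ValueError:
        return False

# Quality specification: field name -> conditions that must all hold.
# Field names serve as operands, as on the slide; only three fields are shown.
SPEC = {
    "CompanyName": [check_value_exists],
    "CompanyNumber": [check_value_exists, lambda v: check_value_digits(v, 8)],
    "IncorporationDate": [check_value_exists, check_value_date],
}

def assess(record, spec=SPEC):
    """Return an OK/NO verdict per field for one data object instance."""
    return {
        field: "OK" if all(check(record.get(field)) for check in checks) else "NO"
        for field, checks in spec.items()
    }

# Made-up record: 7-digit number and an impossible date both fail.
record = {
    "CompanyName": "EXAMPLE LTD",
    "CompanyNumber": "0123456",
    "IncorporationDate": "31/02/2019",
}
print(assess(record))
```

Keeping the specification as data (a mapping from field to conditions) keeps the conditions separate from the data object itself, which is the separation the slide describes.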
  • 12. DATA QUALITY MEASURING PROCESS
• The activities to be taken to select data object values from data sources.
• One or more steps to evaluate the quality of the data, each of which describes one test of the data object's compliance with a specific quality specification.
• + For contextual checks: gather values of the secondary DOs from the data sources if the parameter indicating the secondary DO’s value in the scope of the defined quality condition is true:
  1. read/write operations from the data source into a database,
  2. connection of the primary and secondary data objects via appropriate parameters.
• The steps to improve data quality by automatically or manually triggering changes in the data source.
The language describing the quality evaluation process involves verification activities for a particular DO that can be defined:
 informally, as natural language text,
 using UML activity diagrams,
 in its own DSL.
Additionally, processing instances of DO classes may require looping constructs, similar to the iterators used in C#.
  • 13. DATA QUALITY MEASURING PROCESS
• A concrete DO or a class of DOs is used as input for a quality verification process.
• The quality verification process creates a test protocol.
In the case of SQL:
 the SELECT statement specifies the target DO,
 the WHERE clause specifies the quality requirements,
 + the JOIN clause links the primary and secondary DOs.
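The SQL mapping above can be sketched with Python's standard `sqlite3` module and an in-memory database. The table and column names (`company`, `country`, `short_name`, etc.) and the sample rows are hypothetical and chosen only for illustration; the slides do not give the actual queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- primary DO (hypothetical schema)
    CREATE TABLE company (company_number TEXT, country_of_origin TEXT);
    -- secondary DO (hypothetical schema)
    CREATE TABLE country (short_name TEXT, iso2 TEXT, iso3 TEXT);
    INSERT INTO company VALUES
        ('00000001', 'LATVIA'), ('00000002', 'DE 19901'),
        ('00000003', 'GREECE'), ('00000004', NULL);
    INSERT INTO country VALUES ('Latvia', 'LV', 'LVA'), ('Greece', 'GR', 'GRC');
""")

# Syntactic check on the primary DO alone:
# SELECT names the target DO, WHERE states the quality requirement.
missing = conn.execute("""
    SELECT company_number
    FROM company
    WHERE country_of_origin IS NULL OR TRIM(country_of_origin) = ''
    ORDER BY company_number
""").fetchall()

# Contextual check: JOIN links the primary DO to the secondary DO;
# rows with a value but no matching country name are reported.
unmatched = conn.execute("""
    SELECT c.company_number, c.country_of_origin
    FROM company AS c
    LEFT JOIN country AS k
        ON UPPER(c.country_of_origin) = UPPER(k.short_name)
    WHERE c.country_of_origin IS NOT NULL AND k.short_name IS NULL
    ORDER BY c.company_number
""").fetchall()

print(missing)    # [('00000004',)]
print(unmatched)  # [('00000002', 'DE 19901')]
```

The two result sets together play the role of a simple test protocol: the first lists records failing a single-DO condition, the second lists records failing the condition that needs the secondary DO.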
  • 14. SINGLE vs MULTIPLE DATA OBJECT ANALYSIS
Results in scope of single data object
TOTAL: 128 different values that possibly contain data quality problems
Various names indicating the same country: USA; United States; United States of America; Northern Ireland; Republic of Ireland; Ireland; Virgin Island; British Virgin Island; Virgin Islands, British; Scotland; Scotland UK; …
??? Which of them is valid?
Results in scope of multiple data objects
Country Of Origin | Short Name | Official Name | ISO3 | ISO2
… | … | … | … | …
DE 19901 | NULL | NULL | NULL | NULL
GREECE | Greece | the Hellenic Republic | GRC | GR
… | … | … | … | …
LATVIA | Latvia | the Republic of Latvia | LVA | LV
… | … | … | … | …
United States of America | United States of America | the United States of America | USA | US
… | … | … | … | …
TOTAL: 48 different values that definitely have data quality problems
Invalid names: BERMUDA; BWI; …; CZECHOSLOVAKIA; DE 19901; EAST SUSSEX; ENGLAND; ENGLAND & WALES; GIBRALTAR; Great Britain; HOLLAND; …; JERSEY; …; ST VINCENT; NORTHERN IRELAND; REPUBLIC OF IRELAND; REPUBLIC OF NIGERIA; …; SCOTLAND UK; SOUTH KOREA; SW7; TADJIKISTAN; TAIWAN; TURKS & CAICOS ISLANDS; UNITED STATES; UK; USSR; VENEZUELA; VIETNAM; VIRGIN ISLANDS; WEST GERMANY; YEMEN ARAB REPUBLIC; YUGOSLAVIA
• Analysis of 2 parameters containing names of countries against 4 representations of countries’ names and their subdivisions.
• Although this problem was observed in 27.6% of records, it could be solved by making just 48 corrections.
• All values of “CountryOfOrigin” and 73 of 74 values of “RegAddress Country” conform to one standard, i.e., the short name.
ONLY 13 instead of 48 invalid values were detected!!!
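As an illustration of checking values against several representations of a country name, the following Python sketch looks each observed value up in a small reference list. The three reference rows are taken from the example table on the slide; the helper names (`build_index`, `classify`), the case-insensitive matching rule and the choice of observed values are assumptions, and a real secondary DO would hold the complete country list.

```python
# Secondary DO: excerpt of a country reference list (rows from the slide).
COUNTRIES = [
    {"short": "Greece", "official": "the Hellenic Republic",
     "iso3": "GRC", "iso2": "GR"},
    {"short": "Latvia", "official": "the Republic of Latvia",
     "iso3": "LVA", "iso2": "LV"},
    {"short": "United States of America",
     "official": "the United States of America",
     "iso3": "USA", "iso2": "US"},
]

def build_index(countries):
    """Map every known name or code (case-folded) to its representation."""
    index = {}
    for row in countries:
        for representation, name in row.items():
            # First representation wins if the same string occurs twice.
            index.setdefault(name.casefold(), representation)
    return index

def classify(values, index):
    """Return {value: matched representation, or None when nothing matches}."""
    return {v: index.get(v.strip().casefold()) for v in values}

INDEX = build_index(COUNTRIES)

# Values as they might appear in the primary DO (all occur on the slides).
observed = ["GREECE", "LATVIA", "United States of America",
            "USA", "DE 19901", "USSR"]
result = classify(observed, INDEX)
invalid = sorted(v for v, rep in result.items() if rep is None)

print(result)
print(invalid)  # ['DE 19901', 'USSR']
```

Recording which representation matched, rather than only a yes/no verdict, also makes it possible to see whether a field follows one standard consistently, as the slide reports for the short name.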
  • 15. FEW REMARKS
 Data quality analysis in the context of multiple data objects was applied to 23 «external» open datasets; 22 different secondary DOs were used.
 21 of 23 datasets (91.3%) have at least a few data quality issues that weren’t detected previously:
  • initial version: indicated records potentially containing data quality problems, a very resource-consuming process;
  • proposed extension of the approach: detects only the records that certainly have a data quality problem.
 The initial analysis detected 128 values:
  • only 13 values with data quality problems instead of 48;
  • 115 values didn’t have data quality problems (false negative).
In this particular case, the results of the analysis were improved by 72.9%.
!!! The proposed structure eliminated the need for additional in-depth quality analysis, as well as for writing complex queries and analysing the results individually.
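The slides do not state how the 72.9% figure is computed. One reading that reproduces it, offered here as an assumption, is the share of the 48 invalid values that the single data object analysis missed:

```python
# Figures taken from the slide; the formula itself is an assumed reading.
listed = 128          # distinct values listed by the initial analysis
detected_single = 13  # invalid values found with a single data object
detected_multi = 48   # invalid values found with the secondary data object

not_flagged = listed - detected_single                  # 115
improvement = (detected_multi - detected_single) / detected_multi

print(not_flagged)
print(f"{improvement:.1%}")  # 72.9%
```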
  • 16. RESULTS
 A data object-driven approach to data quality evaluation:
  • 3 components (data object, quality specification, quality measuring process) defined using graphical DSLs;
  • provides the ability to analyse «foreign»/«external» data without the involvement of data holders (higher level of abstraction);
  • very intuitive: suitable even for non-IT and non-DQ experts.
 The contextual quality analysis significantly improves data quality analysis results:
  • makes it possible to analyse a real data object’s quality within the context of multiple data objects;
  • detects the records that certainly have a data quality problem;
  • the number of possible controls where the proposed extended approach can yield valuable results is very high.
 Both syntactic and contextual data quality are analysed according to unified description principles: the diagram’s structure remains easy to read, create, understand and edit.
Users’ participation in [open] data quality analysis using the presented approach benefits not only the users themselves but also data holders when users share their feedback, since data holders may not even be aware of data quality problems.
  • 17. FUTURE WORK
 application and evaluation of the extended approach in the cases of complex data object’s structure, including supplementing data objects when direct connection between the primary and the secondary data objects is not possible,
 detection of possible limitations of the proposed extended approach,
 ensuring possibility to evaluate data sets’ evolution,
 assessment of possibility to provide users with suggestions for data improvement,
 developing data quality theory.
  • 18. THANK YOU!
For more information, see ResearchGate. See also anastasijanikiforova.com.
For questions or any other queries, contact me via email: Anastasija.Nikiforova@lu.lv
Article: Nikiforova, A., & Bicevskis, J. (2019). An Extended Data Object-driven Approach to Data Quality Evaluation: Contextual Data Quality Analysis. In ICEIS (1) (pp. 274-281).