A STEP TOWARDS A DATA QUALITY THEORY
The Third International Workshop on Data Science Engineering and its Applications (DSEA 2019)
In conjunction with
The Fifth International Conference on Social Networks Analysis, Management and Security (SNAMS-2019)
Granada, Spain. October 22-25, 2019.
Janis Bicevskis, Anastasija Nikiforova, Zane Bicevska, Ivo Oditis, Girts Karnitis
Faculty of Computing, University of Latvia
Anastasija.Nikiforova@lu.lv
A BRIEF INSIGHT INTO THE HISTORY
 Def. I: «Quality» is a desirable goal to be achieved through management of the production process.
 Def. II: «Data quality» is a relative concept, largely dependent on the specific requirements resulting from the use of the data.
late 60’s: data quality issues were first researched by statisticians, who proposed a mainly mathematical theory for dealing with duplicates in statistical data sets;
late 80’s: the data quality issue attracted management researchers;
early 90’s: computer scientists began their own research, focusing on data stored in databases and data warehouses, examining how to define, measure and improve the quality of different types of data, and relating the concept of “data quality” to “data quality dimensions”;
nowadays: almost 30 years later, with data everywhere and their amount increasing significantly, the issue is still popular and relevant but, unfortunately, has not yet been solved.
Data quality weaknesses can lead to huge losses:
2017: Organizations believe poor data quality to be responsible for an average of $15 million per year in losses (Gartner).
2016: Decisions resulting from bad data cost the US economy $3.1 trillion per year (IBM).
The aggregate economic impact of applications based on open data across the EU27 economy is estimated at €140 billion annually.
DATA QUALITY
 [Open] data are usually used by a wide audience that may not have deep knowledge of IT or data quality
  a solution should be simple enough to give such users the possibility to take part in the analysis of «third-party» data for their own purposes.
!!! The same data may be of sufficient quality in one case BUT completely useless under other circumstances.
Solution: the previously proposed user-oriented data object-driven approach (Bicevskis, Bicevska, Nikiforova, Oditis, 2018), (Nikiforova, 2019)
RELATED RESEARCH
Problem I: the necessity to involve data quality experts at every stage of the data quality analysis process.
Solution: data object-driven approach to data quality evaluation (Bicevskis, Bicevska, Nikiforova, Oditis, 2018)
Problem II: the absence of a data quality theory.
* «… This state of affairs has led to much confusion within the data quality
community and is even more bewildering for those who are new to the
discipline and more importantly to business stakeholders…» (DAMA UK,
2018)
** In different proposals, dimensions of the same name can have different
semantics and vice versa. (Batini, 2016)
Example I: (Kerr, et al., 2007):
New Zealand’s healthcare data:
 6 data quality dimensions,
 24 characteristics
 69 data quality criteria.
Example II: (Dahbi et al., 2018; Weiskopf et al.,
2013):
 2 data quality dimensions: accuracy
and completeness
 Most theoretical research is characterized by a wide range of data and information quality dimensions:
✘ data quality theoretical studies have not yet provided a unified system of data quality concepts*;
✘ the exact meaning of each dimension and how it should be assessed is still under discussion**;
✘ different proposals often use the same notation to indicate semantically different dimensions, and vice versa;
✘ sometimes the difference between some of them is almost unnoticeable;
✘ each dimension can be supplied with one or more metrics that vary from one solution to another;
✘ the number of dimensions and their definitions are often useful only for a particular solution.
Question: How to relate a particular dimension (and which one?) to a particular use case?
SUMMARY
 This research is of a theoretical nature, the main objectives of which are:
 to provide a clear and straightforward definition of data quality concepts, ensuring that all stakeholders perceive them in the same way;
 to provide a family of languages for describing data quality requirements and assessing the quality of data, taking into account the various possible uses of the data and their variability over time;
 to provide a formalisation of the previously proposed practical solution and thereby take a step towards a theory of data quality, which has not yet been established, despite numerous attempts.
MAIN PRINCIPLES OF THE PROPOSED SOLUTION
[Figure: TDQM data quality lifecycle: data quality definition → data quality measuring → data quality analysis → data quality improvement]
 Each specific application can have its own specific DQ checks;
 DQ requirements can be formulated at several levels: from informal text in natural language (PIM) to an automatically executable model, SQL statements or program code (PSM) (see the sketch below);
 DQ can be checked at various stages of data processing;
 The DQ definition language is a graphical DSL:
• the diagrams are easy to read, create, understand and edit even by non-IT and non-DQ experts;
• the syntax and semantics can easily be applied to any new IS.
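To make the PIM-to-PSM refinement concrete, a minimal sketch follows (not part of the original slides; plain Python with SQLite is used only as an illustration of the PSM level). The table and column names Students, Programs and progCode follow the example data object used later in this presentation; the requirement itself is hypothetical.

import sqlite3

# PIM level (informal text in natural language):
#   "Every student must be registered to an existing study programme."

# PSM level: the same requirement refined into an executable SQL statement.
PSM_CHECK = """
SELECT s.Name
FROM Students AS s
LEFT JOIN Programs AS p ON p.Code = s.progCode
WHERE p.Code IS NULL
"""

def violations(conn):
    # Returns the names of students that violate the requirement (empty list = requirement met).
    return [row[0] for row in conn.execute(PSM_CHECK)]

# Illustrative in-memory data set.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Students (Name TEXT, progCode TEXT)")
conn.execute("CREATE TABLE Programs (Code TEXT, Name TEXT)")
conn.executemany("INSERT INTO Students VALUES (?, ?)", [("J. Smith", "P1"), ("A. Doe", "P9")])
conn.execute("INSERT INTO Programs VALUES ('P1', 'Computing')")
print(violations(conn))   # ['A. Doe']: registered to a non-existent programme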
DATA QUALITY MODEL (instead of dimensions)
1. DATA OBJECT (DO) - the set of values of the parameters that characterize a real-life object:
 primary data object - the initial DO whose quality is analysed;
 secondary data object - a DO that determines the context for the analysis of the primary DO;
 both primary and secondary DOs may contain an unlimited number of data sub-objects.
* Many objects of the same structure form a class of data objects.
** There is usually one primary data object, but the number of secondary data objects is not limited and is determined by the nature of the primary data object and the specific use-case.
2. DATA QUALITY REQUIREMENTS - conditions that must be met in order for a data object to be considered of high quality.
** May contain informal or formalized implementation-independent descriptions of conditions.
3. DATA QUALITY MEASURING PROCESS - the procedure to be followed to assess the quality of the data.
!!! All three components are defined by using a graphical domain-specific language (DSL)**
** Three DSL families were developed as graphical languages based on the possibilities of the modelling platform DIMOD.
[Figure: a data object with related data objects d1, d2, d3, d4, …, dn]
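A hedged sketch of how the three components fit together, in plain Python rather than the graphical DSL; the attribute names follow the inputMessage example used later in this presentation, while the concrete requirement is invented for illustration.

# 1. Data object: the set of parameter values characterizing a real-life object.
inputMessage = {"studentName": "J. Smith", "courseCode": "CS101"}

# 2. Data quality requirement: a condition that must be met for the data object
#    to be considered of high quality, expressed over its attribute values.
def requirement(do):
    return bool(do.get("studentName")) and bool(do.get("courseCode"))

# 3. Data quality measuring process: the procedure that applies the requirement
#    to the data object and records the outcome.
def measure(do, req):
    return {"data_object": do, "requirement_fulfilled": req(do)}

print(measure(inputMessage, requirement))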
ARCHITECTURE OF DATA QUALITY SYSTEM
DATA OBJECT
 A DO is a set of attribute values that characterize one real object.
 The address of an attribute value of a single data object, <dataObjectName.attributeName>, is used at the stage of determining data quality requirements.
 A DO can be formulated at different levels of abstraction:
 from a formal language grammar
 to definitions of variables in programming languages.
[Figure «DO»: data object diagram showing the primary DO inputMessage, the secondary DOs Students (with data sub-object Success) and Programs (with data sub-object Courses), and their attributes (studentName, courseCode, progCode, Name, Code, Assessment, Date) with types varchar, enumerable and date]
QUALITY SPECIFICATION FOR DATA OBJECT’S CLASS
 In order to include quality requirements in the contextual requirements, the addresses of the secondary data object’s parameters are used in the corresponding conditions: <secondaryDataObjectName(instanceIdent).attributeName>.
 If the secondary data object has to be searched for by its attribute values, a search command similar to that for the primary data object is used: <instanceIdent = seekInst(secondaryObjectName, expression)>.
 When processing a data object class:
 instances of the data object class are selected,
 and the fulfilment of the quality requirements is examined for each individual instance.
 The instance processing cycle is determined by users.
The most commonly used options (see the sketch below):
If quality is analysed for all instances of a DO: review all class instances by changing the address <dataObjectName(instanceIdent).attributeName>, which is (a) calculated first by selecting the first instance with the method <instanceIdent = getFirst(dataObjectName)>, and (b) then moved to the next instance with the method <instanceIdent = getNext(dataObjectName)>.
If quality is analysed for only one instance of a DO: use a dynamically calculated address <instanceIdent = seekInst(dataObject, expression)>. If an instance of the DO is found, then (a) a reference to the DO is inserted into the variable instanceIdent and (b) the value TRUE is returned to the environment; otherwise FALSE is returned and NULL is inserted into the variable.
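A hedged sketch of the two instance-processing options in plain Python; get_first, get_next and seek_inst are illustrative stand-ins for the DSL methods getFirst, getNext and seekInst, and the Students class below is invented.

# Illustrative stand-ins for the DSL methods getFirst / getNext / seekInst.
def get_first(data_object_class):
    return 0 if data_object_class else None

def get_next(data_object_class, ident):
    return ident + 1 if ident + 1 < len(data_object_class) else None

def seek_inst(data_object_class, predicate):
    # Returns a reference to the first matching instance, or None (NULL) if not found.
    for ident, instance in enumerate(data_object_class):
        if predicate(instance):
            return ident
    return None

Students = [{"Name": "J. Smith", "progCode": "P1"}, {"Name": "A. Doe", "progCode": ""}]

# Option 1: quality is analysed for all instances of the DO class.
ident = get_first(Students)
while ident is not None:
    instance = Students[ident]            # corresponds to <Students(instanceIdent).attributeName>
    print(instance["Name"], "OK" if instance["progCode"] else "missing progCode")
    ident = get_next(Students, ident)

# Option 2: quality is analysed for only one instance, located by attribute value.
ident = seek_inst(Students, lambda s: s["Name"] == "J. Smith")
found = ident is not None                 # TRUE/FALSE returned to the environment; NULL when not found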
PRE-CONDITION QUALITY DEFINITIONS
 When processing a data object class:
 instances of the data object class are selected,
 and the fulfilment of the quality requirements is examined for each individual instance.
 DQ requirements are defined by using logical expressions.
 The names of DO attributes/fields serve as operands in the logical expressions.
[Flowchart: pre-condition checks; each check continues on YES and sends an error message on NO]
1. Check Student: instStudent = seekInst(Students, 'Students.Name = inputMessage.studentName'); on NO: sendMessage(17, inputMessage.studentName)
2. Check Program: instProgram = seekInst(Programs, 'Programs.Code = Students(instStudent).progCode'); on NO: sendMessage(18, inputMessage.courseCode)
3. Check Course: instCourse = seekInst(Programs(instProgram).Courses, 'Courses.Code = inputMessage.courseCode'); on NO: sendMessage(19, inputMessage.courseCode)
 The pre-condition verifies (bold lines in Fig. «DO»):
 whether the student to whom inputMessage applies exists;
 whether the student is registered to any training program;
 whether the course specified in inputMessage belongs to the training program.
[Figure «DO» (repeated): primary DO inputMessage, secondary DOs Students (with sub-object Success) and Programs (with sub-object Courses) and their attributes]
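A hedged Python rendering of the pre-condition above (the original is a graphical DSL diagram); seek_inst and send_message are illustrative stand-ins for the DSL's seekInst and sendMessage, and the pairing of message numbers 17, 18 and 19 with the three checks follows the flowchart reconstruction above.

def seek_inst(data_object_class, predicate):
    for ident, instance in enumerate(data_object_class):
        if predicate(instance):
            return ident
    return None

def send_message(code, value):
    print(f"DQ message {code}: {value}")

def pre_condition(input_message, Students, Programs):
    # Check Student: does the student referenced by inputMessage exist?
    inst_student = seek_inst(Students, lambda s: s["Name"] == input_message["studentName"])
    if inst_student is None:
        send_message(17, input_message["studentName"])
        return False
    # Check Program: is the student registered to an existing training program?
    inst_program = seek_inst(Programs, lambda p: p["Code"] == Students[inst_student]["progCode"])
    if inst_program is None:
        send_message(18, input_message["courseCode"])
        return False
    # Check Course: does the course in inputMessage belong to that program?
    inst_course = seek_inst(Programs[inst_program]["Courses"],
                            lambda c: c["Code"] == input_message["courseCode"])
    if inst_course is None:
        send_message(19, input_message["courseCode"])
        return False
    return True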
EXAMPLE: POST-CONDITION QUALITY DEFINITIONS
 A concrete DO or a class of DOs is used as the input for the quality verification process.
 The quality verification process creates a test protocol.
[Flowchart: post-condition checks; each check continues on YES and sends an error message on NO]
1. Seek Student: instStudent = seekInst(Students, 'Students.Name = inputMessage.studentName')
2. Check Course Insert: instSuccess = seekInst(Students(instStudent).Success, 'Success.courseCode = inputMessage.courseCode'); on NO: sendMessage(21, inputMessage.courseCode)
3. Check Assessment Insert: Success(instSuccess).Assessment = inputMessage.Assessment; on NO: sendMessage(22, inputMessage.Assessment)
4. Check Date Insert: Success(instSuccess).Date = inputMessage.Date; on NO: sendMessage(23, inputMessage.Date)
 The post-condition is executed after Data_Input and verifies (thin arrows in Fig. «DO»):
 whether a new instance has been added to the Success sub-object of the Students data object;
 whether a new instance with the corresponding course assessment has been added to the Success sub-object;
 whether a new instance with the corresponding exam date has been added to the Success sub-object.
[Figure «DO» (repeated): primary DO inputMessage, secondary DOs Students (with sub-object Success) and Programs (with sub-object Courses) and their attributes]
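A hedged Python rendering of the post-condition above, returning a test protocol as described on this slide; seek_inst is the same illustrative stand-in as before, and it is assumed that the pre-condition has already ensured that the student exists.

def seek_inst(data_object_class, predicate):
    for ident, instance in enumerate(data_object_class):
        if predicate(instance):
            return ident
    return None

def post_condition(input_message, Students):
    # Verify the post-condition and return a test protocol (list of DQ messages).
    protocol = []
    inst_student = seek_inst(Students, lambda s: s["Name"] == input_message["studentName"])
    success = Students[inst_student]["Success"]
    # Check Course Insert: a new Success instance for the course must exist.
    inst_success = seek_inst(success, lambda x: x["courseCode"] == input_message["courseCode"])
    if inst_success is None:
        protocol.append((21, input_message["courseCode"]))
        return protocol
    # Check Assessment Insert: the recorded assessment must match the input message.
    if success[inst_success]["Assessment"] != input_message["Assessment"]:
        protocol.append((22, input_message["Assessment"]))
    # Check Date Insert: the recorded exam date must match the input message.
    if success[inst_success]["Date"] != input_message["Date"]:
        protocol.append((23, input_message["Date"]))
    return protocol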
EXPERIENCE OF EVALUATION OF OPEN DATA QUALITY
 Structured and semi-structured open data sets provided by different data publishers were analysed;
 the data quality requirements formulated for each data set vary from very simple to fairly complex.
 In total: 25 data sets  23 (92%) have at least several data quality issues.
 The most frequently occurring data quality issues (a sketch of such checks follows below):
✘ lack of values even for the primary parameters;
✘ doubtful/invalid dates;
✘ issues in interrelated parameters;
✘ multiple notations for the same object;
✘ values that do not belong to the list of valid values;
✘ contextual data quality issues such as lack of values and conflicting values.
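A hedged sketch of simple checks for three of the issue types listed above, applied to a hypothetical open data record; all field names and valid values are invented for illustration.

from datetime import date, datetime

VALID_STATUSES = {"active", "liquidated", "suspended"}

def check_record(record):
    issues = []
    # Lack of values even for the primary parameters.
    if not record.get("registration_number"):
        issues.append("missing primary parameter: registration_number")
    # Doubtful or invalid dates.
    try:
        d = datetime.strptime(record.get("registration_date", ""), "%Y-%m-%d").date()
        if d > date.today() or d.year < 1900:
            issues.append(f"doubtful date: {d}")
    except ValueError:
        issues.append(f"invalid date: {record.get('registration_date')!r}")
    # Values that do not belong to the list of valid values.
    if record.get("status") not in VALID_STATUSES:
        issues.append(f"value outside the list of valid values: {record.get('status')!r}")
    return issues

print(check_record({"registration_number": "", "registration_date": "2105-02-30", "status": "unknown"}))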
RESULTS
 The research proposes a data object-driven theory of data quality, which arose from previous studies and eliminates their lack of formalization.
 An end-user who is interested in analysing data quality according to their needs is placed at the centre of the data quality analysis.
 The most significant advantages:
 all concepts of the proposed data quality theory are straightforward;
 the proposed approach is an «external» mechanism that allows describing DQ and verifying the applicability of data to a specific use case independently of the IS accumulating and processing the data;
 the use of graphical DSLs simplifies the interaction process by allowing multiple stakeholders to be involved;
 designing the diagrams is fairly simple  it is assumed that DQ analysis can be performed even by non-IT and non-DQ experts;
 the application of the proposed solution to the analysis of “third-party” data sets demonstrates its simplicity and effectiveness.
THANK YOU FOR YOUR ATTENTION!
For more information, see ResearchGate
See also anastasijanikiforova.com
For questions or any other queries, contact me via email - Anastasija.Nikiforova@lu.lv
Article: Bicevskis, J., Nikiforova, A., Bicevska, Z., Oditis, I., & Karnitis, G. (2019, October). A step towards a data
quality theory. In 2019 Sixth International Conference on Social Networks Analysis, Management and Security
(SNAMS) (pp. 303-308). IEEE.