P. Missier, 2016
Diachron workshop panel
Big Data Quality Panel
Diachron Workshop @EDBT
Panta Rhei (Heraclitus, through Plato)
Paolo Missier
Newcastle University, UK
Bordeaux, March 2016
(*) Painting by Johannes Moreelse
The “curse” of Data and Information Quality
• Quality requirements are often specific to the application that makes use of the data (“fitness for purpose”)
• Quality Assurance (the actions required to meet those requirements) is specific to the data types
A few generic quality techniques exist (record linkage, blocking, …), but solutions are mostly ad hoc
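As an illustration of one of the generic techniques mentioned above, here is a minimal sketch of blocking for record linkage: records are grouped by a cheap key so that expensive pairwise comparison only runs within each block. The records and the key function are invented for illustration, not from the slides.

```python
from itertools import combinations
from collections import defaultdict

def block(records, key):
    """Group records by a cheap blocking key (e.g. a name prefix)."""
    blocks = defaultdict(list)
    for r in records:
        blocks[key(r)].append(r)
    return blocks

def candidate_pairs(records, key):
    """Only compare records that share a block, instead of all n^2 pairs."""
    pairs = []
    for group in block(records, key).values():
        pairs.extend(combinations(group, 2))
    return pairs

people = [{"name": "Ann Smith"}, {"name": "An Smith"}, {"name": "Bob Jones"}]
# Blocking key: first letter of the name (a deliberately crude choice)
pairs = candidate_pairs(people, key=lambda r: r["name"][0].lower())
# Only the two "A..." records form a candidate pair; "Bob Jones" is never compared
```

A real linker would follow this with a similarity function applied only to the candidate pairs; blocking is what makes that step scale.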
V for “Veracity”?
Q3. To what extent are traditional approaches for diagnosis, prevention
and curation challenged by the Volume, Variety and Velocity
characteristics of Big Data?
V             | Issues                                                          | Example
High Volume   | Scalability: which QC steps can be parallelised?                | Parallel meta-blocking
              | Human curation not feasible                                     |
High Velocity | Statistics-based diagnosis is data-type specific                | Reliability of sensor readings
              | Human curation not feasible                                     |
High Variety  | Heterogeneity is not a new issue!                               | Data fusion for decision making
Recent contributions on Quality & Big Data (IEEE Big Data 2015):
- Chung-Yi Li et al., Recommending missing sensor values
- Yang Wang and Kwan-Liu Ma, Revealing the fog-of-war: A visualization-directed, uncertainty-aware approach for exploring high-dimensional data
- S. Bonner et al., Data quality assessment and anomaly detection via map/reduce and linked data: A case study in the medical domain
- V. Efthymiou, K. Stefanidis and V. Christophides, Big data entity resolution: From highly to somehow similar entity descriptions in the Web
- V. Efthymiou, G. Papadakis, G. Papastefanatos, K. Stefanidis and T. Palpanas, Parallel meta-blocking: Realizing scalable entity resolution over large, heterogeneous data
Can we ignore quality issues?
Q4: How difficult is it to evaluate the threshold below which data
quality issues can be ignored?
• Some analytics algorithms may be tolerant to {outliers, missing
values, implausible values} in their input
• But this “meta-knowledge” is specific to each algorithm; it is hard to
derive general models
• i.e., the relative importance and danger of false positives / false negatives (FP / FN)
A possible incremental learning approach:
Build a database of past analytics tasks:
H = {<In, P, Out>}
Try to learn (In, Out) correlations over a growing collection H
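The incremental approach above can be sketched as follows: record each past task as an <In, P, Out> triple and predict output quality from similar past runs of the same process. The input-quality feature (a missing-value rate) and the nearest-neighbour rule are illustrative assumptions, not part of the slides.

```python
# H = {<In, P, Out>}: a growing history of past analytics tasks.
# In = input-quality features, P = the analytics process, Out = output quality.
history = []

def record_task(in_features, process, out_score):
    history.append({"In": in_features, "P": process, "Out": out_score})

def estimate_out(in_features, process):
    """Predict output quality from the most similar past run of the same
    process (nearest neighbour on the missing-value rate; deliberately simple)."""
    past = [h for h in history if h["P"] == process]
    if not past:
        return None  # no meta-knowledge yet for this algorithm
    nearest = min(past, key=lambda h: abs(h["In"]["missing_rate"]
                                          - in_features["missing_rate"]))
    return nearest["Out"]

record_task({"missing_rate": 0.01}, "clustering", 0.95)
record_task({"missing_rate": 0.30}, "clustering", 0.60)
# A new input with 5% missing values resembles the first run, so ~0.95 is predicted
```

As H grows, a proper regression model could replace the nearest-neighbour lookup; the point is only that the meta-knowledge is learned per algorithm, as the slide argues.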
Data to Knowledge
The Data-to-Knowledge pattern of the Knowledge Economy:
Big Data → The Big Analytics Machine → “Valuable Knowledge”
Meta-knowledge: algorithms, tools, middleware, reference datasets
The missing element: time
Successive versions of Big Data (V1, V2, V3, each at some time t) → The Big Analytics Machine → “Valuable Knowledge”
Meta-knowledge (algorithms, tools, middleware, reference datasets) also changes over time
Change → data currency
The ReComp decision support system
Observe change
• in big data
• in meta-knowledge
Assess and measure
• knowledge decay
Estimate
• cost and benefits of a refresh
Enact
• reproduce (analytics) processes
Currency of data and of meta-knowledge:
- What knowledge should be refreshed?
- When, and how?
- At what cost / benefit?
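The four stages above can be sketched as a decision loop. This is a hypothetical skeleton: the callables, the decay threshold, and the cost/benefit rule are assumptions for illustration, not the project's actual policy.

```python
def recomp_loop(changes, assess_decay, estimate_refresh, reproduce):
    """Observe -> assess -> estimate -> enact, for each observed change.
    The callables stand in for ReComp's actual components."""
    refreshed = []
    for change in changes:
        decay = assess_decay(change)           # how much has knowledge decayed?
        cost, benefit = estimate_refresh(change)
        if benefit > cost and decay > 0.1:     # illustrative decision rule
            reproduce(change)                   # re-run the analytics process
            refreshed.append(change)
    return refreshed

# Toy run: one change is worth a refresh, the other is not
changes = ["reference-db-update", "minor-log-rotation"]
out = recomp_loop(
    changes,
    assess_decay=lambda c: 0.8 if c == "reference-db-update" else 0.0,
    estimate_refresh=lambda c: (1.0, 5.0) if c == "reference-db-update" else (1.0, 0.0),
    reproduce=lambda c: None,
)
# out == ["reference-db-update"]
```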
ReComp: 2016-18
Inputs to the ReComp DSS: change events, Diff(.,.) functions, “business rules”
Stages: observe change → assess and measure → estimate → enact
Backed by a History DB of past KAs (Knowledge Assets) and their metadata, including provenance (META-K)
Outputs: prioritised KAs, cost estimates, reproducibility assessment
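A hedged sketch of what a Diff(.,.) function over two versions of a dataset might look like: a comparison keyed by record identifier that reports additions, removals, and in-place updates, which could then be emitted as change events. The record structure is invented for illustration.

```python
def diff(old, new):
    """Diff(.,.): compare two dataset versions keyed by record id and
    report additions, removals, and in-place updates."""
    old_ids, new_ids = set(old), set(new)
    return {
        "added":   sorted(new_ids - old_ids),
        "removed": sorted(old_ids - new_ids),
        "updated": sorted(i for i in old_ids & new_ids if old[i] != new[i]),
    }

v1 = {"r1": "A", "r2": "B"}
v2 = {"r2": "B*", "r3": "C"}
change_event = diff(v1, v2)
# {'added': ['r3'], 'removed': ['r1'], 'updated': ['r2']}
```

In practice each data type would need its own Diff, which is consistent with the slide listing Diff(.,.) functions as a pluggable input to the DSS.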
Metadata + Analytics
The knowledge is in the metadata!
Research hypothesis: supporting the analysis can be achieved through analytical reasoning applied to a collection of metadata items which describe details of past computations.
Process flow: change events → estimate change impact (Change Impact Model) → identify recomp candidates → estimate reproducibility cost/effort (Cost Model) → large-scale recomp, with model updates fed back at each estimation step.
Meta-K: logs, provenance, dependencies
Editor's Notes
  • #2: The times they are a-changin'
  • #10: Problem: this is “blind” and expensive. Can we do better?
  • #11: These items are partly collected automatically, and partly as manual annotations. They include:
    - Logs of past executions, automatically collected, to be used for post-hoc performance analysis and estimation of future resource requirements and thus costs (S1);
    - Runtime provenance traces and prospective provenance. The former are automatically collected graphs of data dependencies, captured from the computation [11]; the latter are formal descriptions of the analytics process, obtained from the workflow specification or, more generally, by manually annotating a script. Both are instrumental to understanding how the knowledge outcomes have changed and why (S5), as well as to estimating future re-computation effects;
    - External data and system dependencies, process and data versions, and system requirements associated with the analytics process, which are used to understand whether it will be practically possible to re-compute the process.