1
Capturing and querying fine-grained provenance of
preprocessing pipelines in data science
(DP4DS)
Adriane Chapman1, Paolo Missier2, Luca Lauro3, Riccardo Torlone3
(1) University of Southampton, UK
(2) Newcastle University, UK
(3) Università Roma Tre, Italy
[1] Chapman, A.; Missier, P.; Simonelli, G.; and Torlone, R. Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science. PVLDB, 14(4): 507–520. January 2021.
[2] Chapman, A.; Missier, P.; Lauro, L.; and Torlone, R. DPDS: Assisting Data Science with Data Provenance. PVLDB, 15(12): 3614–3617. 2022.
2
[Figure: the data science pipeline. Stages shown: data sources; acquisition and wrangling; preparing for learning; training / test split (training set, test set); model selection; model learning; model validation; model testing; usage of the learned model M, producing predictions.]
Decision points:
- Source selection
- Sample / population shape
- Cleaning
- Integration
Decision points:
- Sampling / stratification
- Feature selection
- Feature engineering
- Dimensionality reduction
- Regularisation
- Imputation
- Class rebalancing
- …
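[Figure: pipeline structure with provenance annotations. The provenance trace covers the training / test split, imputation and feature selection steps (intermediate datasets D’, D’’, …, annotated C1, C2, C3), the training set, the hyperparameters, and model learning, which produces the model M.]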
3
Provenance of what?
Base case:
- Opaque program Po
- Coarse-grained dataset
Default provenance:
- Every output depends on every input
Other cases:
- Transparent program PT, fine-grained datasets
- Transparent pipeline P’T (steps labelled PnT in the figure), fine-grained datasets
- Transparent program PT, coarse-grained datasets
Example of a transparent program:
if c:
    y1 ← x1
else:
    y1 ← x2
y2 ← f(x1, x2)
Runtime: c == True
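As a rough illustration of the transparent-program case, the sketch below runs the toy program above and records which inputs each output actually used at runtime. The function name `run_with_provenance` and the dictionary layout are assumptions made for this example; they do not come from the talk.

```python
def run_with_provenance(x1, x2, c, f):
    """Run the toy program and record the inputs each output depends on."""
    deps = {}
    if c:
        y1 = x1
        deps["y1"] = {"x1"}  # only the branch taken at runtime counts
    else:
        y1 = x2
        deps["y1"] = {"x2"}
    y2 = f(x1, x2)
    deps["y2"] = {"x1", "x2"}
    return (y1, y2), deps


outputs, deps = run_with_provenance(3, 4, c=True, f=lambda a, b: a + b)

# Default provenance for an opaque program: every output depends on every input.
opaque_deps = {"y1": {"x1", "x2"}, "y2": {"x1", "x2"}}
```

With c == True, `deps["y1"]` contains only `x1`, which is strictly more precise than the opaque default.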
4
Typical operators used in data prep
5
Data reduction
- Conditional projection
- Selection
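A minimal pandas sketch of the two reduction operators, on made-up data. Reading "conditional projection" as "keep the columns that satisfy a condition" is an assumption here, as are the column names.

```python
import pandas as pd

df = pd.DataFrame({
    "age": [22, 41, 35],
    "zip": ["SO17", "NE1", None],
    "income": [20, 55, 40],
})

# Selection: keep the rows that satisfy a predicate.
selected = df[df["age"] >= 30]

# Conditional projection (as interpreted here): keep the columns that satisfy
# a condition, in this case the columns without missing values.
complete_columns = [col for col in df.columns if df[col].notna().all()]
projected = df[complete_columns]
```

Selection removes rows and projection removes columns, which is what makes the two distinguishable from the input and output shapes alone.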
6
Data augmentation
Vertical augmentation
Horizontal augmentation
avg(age)
group by age
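The two kinds of augmentation can be sketched in pandas using the functions described in the editor's notes: f1 labels an age as young or adult, and f2 computes an average. The extracted slide text does not say which of the two is called vertical and which horizontal, so the sketch leaves them unlabelled; the data and column names are invented.

```python
import pandas as pd

df = pd.DataFrame({"name": ["a", "b", "c"], "age": [19, 30, 52]})


def f1(age):
    """Label an age as 'young' (under 25) or 'adult'."""
    return "young" if age < 25 else "adult"


# Augmentation by a derived column: one new value per existing record.
with_group = df.assign(age_group=df["age"].map(f1))

# Augmentation by aggregation: f2 (the average) applied to each group of records.
averages = with_group.groupby("age_group", as_index=False)["age"].mean()
```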
7
Data transformation
Example: data imputation. Here f replaces the nulls in column Zip with that column's most frequent value
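A small pandas sketch of this imputation, on invented rows. It also records which cells changed, since those are the cells whose provenance a fine-grained trace would need to describe.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Ann", "Bob", "Cy", "Di"],
    "Zip": ["NE1", None, "NE1", "SO17"],
})

# f: replace nulls in Zip with the most frequent non-null value.
most_frequent = df["Zip"].mode().iloc[0]
imputed = df.assign(Zip=df["Zip"].fillna(most_frequent))

# Row indexes whose Zip value was filled in by f.
changed_rows = df.index[df["Zip"].isna()].tolist()
```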
8
Data fusion: join and append
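Both fusion operators take more than one input dataframe. A hedged pandas sketch with made-up data: `merge` plays the role of the join and `concat` the role of the append; the key names K1 and K2 mirror the later example slide, everything else is an assumption.

```python
import pandas as pd

left = pd.DataFrame({"K1": [1, 2], "A": ["a1", "a2"]})
right = pd.DataFrame({"K2": [2, 3], "B": ["b2", "b3"]})

# Join: combine records whose keys match (outer join keeps unmatched records too).
joined = left.merge(right, how="outer", left_on="K1", right_on="K2")

# Append: stack the records of two dataframes with the same columns.
extra = pd.DataFrame({"K1": [4], "A": ["a4"]})
appended = pd.concat([left, extra], ignore_index=True)
```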
9
Provenance model
10
Capturing provenance: Assumptions
- Common data abstraction: (Pandas) dataframes
- Observability: the runtime execution of a (Python) program can be observed
- Each input and output dataframe to each operator can be inspected
11
Capturing provenance: templates
A different provenance template pt𝜏 is associated with each type 𝜏 of operator
12
Capturing provenance: bindings
At runtime, when operator o of type 𝜏 is executed, the appropriate template pt𝜏 for 𝜏 is selected
Data items from the inputs and outputs of the operator are used to bind the variables in the template
op: {old values: F, I, V} → {new values: F’, J, V’}
+ Binding rules
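The template-plus-bindings idea can be sketched in plain Python. Everything named here is hypothetical: the template contents, the variable names `$op`, `$old`, `$new`, and the reading of F, I, V as feature, index and value are assumptions for illustration, not the templates used by the authors.

```python
from string import Template

# Hypothetical template for an operator that rewrites one data item.
# $op, $old and $new are the template variables to be bound at runtime.
TRANSFORM_TEMPLATE = [
    "activity($op)",
    "used($op, $old)",
    "wasGeneratedBy($new, $op)",
    "wasDerivedFrom($new, $old)",
]


def item_id(feature, index, value):
    """Identify a data item by feature name, record index and value."""
    return f"{feature}/{index}/{value}"


def instantiate(template, bindings):
    """Substitute each binding (a dict of variable values) into the template."""
    return [Template(line).substitute(b) for b in bindings for line in template]


bindings = [{
    "op": "impute_1",
    "old": item_id("Zip", 1, None),
    "new": item_id("Zip", 1, "NE1"),
}]
statements = instantiate(TRANSFORM_TEMPLATE, bindings)
```

One binding per changed data item yields one copy of the template, so the size of the provenance grows with the number of items an operator touches.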
13
This applies to all operators
14
Join provenance pattern -- keys
[Figure: PROV graph over the Left, Right and Output dataframes and the Join activity. Relations shown: Used (twice), wasGeneratedBy, wasDerivedFrom.]
15
Join provenance pattern -- non-key elements
[Figure: PROV graph over the Left, Right and Output dataframes and the Join activity. Relations shown: Used, wasGeneratedBy, wasDerivedFrom.]
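One possible reading of the two join patterns, written as a plain-Python generator of PROV-style triples. The assumption, not confirmed by the extracted slide text, is that an output key element derives from the key elements of both inputs, while an output non-key element derives only from the input it was copied from. The identifier scheme is invented.

```python
def join_provenance(activity, key, left_cols, right_cols, row):
    """Emit PROV-style triples for one output record of a join."""
    statements = []

    # Key pattern: the output key derives from both the left and the right key.
    out_key = f"out.{key}.{row}"
    for side in ("left", "right"):
        source = f"{side}.{key}.{row}"
        statements.append(("used", activity, source))
        statements.append(("wasDerivedFrom", out_key, source))
    statements.append(("wasGeneratedBy", out_key, activity))

    # Non-key pattern: each output element derives from a single input element.
    for side, cols in (("left", left_cols), ("right", right_cols)):
        for col in cols:
            source, target = f"{side}.{col}.{row}", f"out.{col}.{row}"
            statements.append(("used", activity, source))
            statements.append(("wasGeneratedBy", target, activity))
            statements.append(("wasDerivedFrom", target, source))
    return statements


statements = join_provenance("join_1", "K", ["A"], ["B"], row=0)
```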
17
Capturing provenance: a more practical approach
The approach just described requires recognizing the type of each operation from the source code
This restricts it to a closed set of operators → the set needs to be maintained over time
We take a more generic route to implementing the same idea:
1. Look at each operator's input / output dataframes Din, Dout, regardless of the specific operator
2. Dataframe diff: compare both the shapes and the values of Din, Dout (*)
3. Use the diff to:
• Select the appropriate template
• Bind the template variables using the relevant values in the two dataframes
(*) extends to joins, append
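The three steps above can be sketched for shape changes as follows. This is a sketch under stated assumptions: the function names are invented, and the mapping from diff to template name is an illustrative reading of the talk (for instance, treating "columns added" as horizontal augmentation follows the notation in the editor's notes), not the authors' implementation.

```python
import pandas as pd


def shape_diff(d_in, d_out):
    """Compare the shapes of an operator's input and output dataframes."""
    cols_in, cols_out = set(d_in.columns), set(d_out.columns)
    return {
        "rows_added": len(d_out) > len(d_in),
        "rows_removed": len(d_out) < len(d_in),
        "cols_added": sorted(cols_out - cols_in),
        "cols_removed": sorted(cols_in - cols_out),
    }


def select_template(diff):
    """Map a shape diff to a template name (illustrative mapping, an assumption)."""
    if diff["cols_added"] and diff["cols_removed"]:
        return "data transformation (composite)"
    if diff["cols_added"]:
        return "horizontal augmentation"
    if diff["cols_removed"]:
        return "reduction by projection"
    if diff["rows_removed"]:
        return "reduction by selection"
    if diff["rows_added"]:
        return "unclassified"  # joins and appends involve more than one input
    return "data transformation"


d_in = pd.DataFrame({"A": [1, 2, 3], "B": ["x", "y", "x"]})
d_out = d_in[d_in["A"] > 1]
diff = shape_diff(d_in, d_out)
template = select_template(diff)
```

The benefit of this route is that nothing in `shape_diff` depends on which pandas call produced `d_out`, so new operators need no new recognition logic.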
18
Example
Consider the following sequence: imputation → join → append → one-hot encoding
[Figure: Da → Impute K → D1; Db, Dc → Join K1=K2 → D2; D1, D2 → append → D3; D3 → Add ‘B0’, ‘B1’ → D4; D4 → Remove ‘B’ → D5]
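The example sequence can be reproduced in pandas roughly as follows. The sample rows are invented, imputing K with its most frequent value is an assumption, and the schema-alignment step before the append is added only so the made-up dataframes line up; none of these details come from the talk.

```python
import pandas as pd

Da = pd.DataFrame({"K": [1.0, None, 1.0], "A": [10, 20, 30], "B": [0, 1, 0]})
Db = pd.DataFrame({"K1": [2.0], "A": [40]})
Dc = pd.DataFrame({"K2": [2.0], "B": [1]})

# D1: impute K (assumption: use the most frequent value).
D1 = Da.assign(K=Da["K"].fillna(Da["K"].mode().iloc[0]))

# D2: outer join of Db and Dc on K1 = K2.
D2 = Db.merge(Dc, how="outer", left_on="K1", right_on="K2")
D2 = D2.rename(columns={"K1": "K"})[["K", "A", "B"]]  # align schema (assumption)

# D3: append D2 to D1.
D3 = pd.concat([D1, D2], ignore_index=True)

# D4: one-hot encoding, step 1: add the columns B0 and B1.
one_hot = pd.get_dummies(D3["B"], prefix="B", prefix_sep="")
D4 = pd.concat([D3, one_hot], axis=1)

# D5: one-hot encoding, step 2: keep A, B0 and B1, which removes B.
D5 = D4[["A", "B0", "B1"]]
```

Seen only through input and output dataframes, the one-hot encoding shows up as two consecutive shape changes: columns added, then columns removed.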
19
Example
Dataframes | Diff | Template
D1, Da | Value change, reduced number of null values | Data transformation
D2, {Db, Dc} | | Join provenance
D3, {D1, D2} | | Append provenance
D4, D3 | Shape change, column(s) added | <wait!>
D5, D4 | Shape change, column(s) removed | Data transformation, composite
20
Summary: Shape and value changes
Shape changes:
- Questions: Rows added? Rows removed? Columns added? Columns removed?
- Templates: horizontal augmentation; reduction by selection; reduction by projection; data transformation (composite); data transformation
Value changes for each column:
- Questions: Nulls reduced? Values changed?
- Templates: data transformation (imputation); data transformation; 1-1 derivations
[Figure: two decision trees mapping the Y / N answers to the templates]
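The per-column value questions can be sketched in pandas as below. The order of the checks (nulls reduced first, then values changed, otherwise 1-1 derivations) is a reading of the slide, and the function name and sample data are assumptions. The sketch also assumes the input and output columns share the same index.

```python
import pandas as pd


def column_template(col_in, col_out):
    """Pick a template for one column from its input and output values."""
    if col_out.isna().sum() < col_in.isna().sum():
        return "data transformation (imputation)"
    # Two cells match if they are equal or both null.
    unchanged = (col_in == col_out) | (col_in.isna() & col_out.isna())
    if not unchanged.all():
        return "data transformation"
    return "1-1 derivations"


d_in = pd.DataFrame({
    "Zip": ["NE1", None, "NE1"],
    "Age": [20, 30, 40],
    "Name": ["a", "b", "c"],
})
d_out = d_in.assign(Zip=d_in["Zip"].fillna("NE1"), Age=d_in["Age"] + 1)

templates = {col: column_template(d_in[col], d_out[col]) for col in d_in.columns}
```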
21
Code instrumentation
A Python tracker object intercepts dataframe operations, using an observer pattern
The tracker collects the values required to generate the bindings
Create a provenance object and a tracker object
Simple column transform
One-hot encoding
join
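The slide's code screenshots did not survive extraction, so the following is only a guess at the general shape of such instrumentation. The `Tracker` class, its `subscribe` method and the `df` property are hypothetical and are not the API of the authors' tool; the sketch just shows how an observer can see each operator's input and output dataframes.

```python
import pandas as pd


class Tracker:
    """Hypothetical tracker: wraps a dataframe and notifies observers on change."""

    def __init__(self, df):
        self._df = df
        self._observers = []

    def subscribe(self, callback):
        """Register a callback taking (input dataframe, output dataframe)."""
        self._observers.append(callback)

    @property
    def df(self):
        return self._df

    @df.setter
    def df(self, new_df):
        old_df, self._df = self._df, new_df
        for notify in self._observers:
            notify(old_df, new_df)


events = []


def record_shape_change(d_in, d_out):
    """Collect the values a binding step would need (here: column changes only)."""
    events.append({
        "cols_added": sorted(set(d_out.columns) - set(d_in.columns)),
        "cols_removed": sorted(set(d_in.columns) - set(d_out.columns)),
    })


tracker = Tracker(pd.DataFrame({"B": [0, 1]}))
tracker.subscribe(record_shape_change)

# Each assignment to tracker.df is observed as one operator application.
tracker.df = tracker.df.assign(B_flag=tracker.df["B"] == 1)
tracker.df = tracker.df.drop(columns="B")
```

Routing every update through one assignment point keeps the user's pandas code almost unchanged, which is the appeal of the observer approach.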
22
Evaluation – benchmark datasets
Census pipeline:
- Clerical cleaning on every cell (removing blanks)
- Replace all ‘?’ with NaN
- One-hot encoding of 7 categorical variables
- Map binary labels to 0, 1
- Drop one column
Evaluation – benchmark pipelines
24
Evaluation: Provenance capture times
25
Evaluation: Provenance query times on Neo4J
26
Scalability: provenance query times
Synthetic Benchmarking datasets created using TPC-DI
27
Scalability: operations on TPC-DI datasets
Basic operators Join + append operators
28
Tool demo
DPDS: Assisting Data Science with Data Provenance. Chapman, A.; Missier, P.; Lauro, L.; and Torlone, R. PVLDB, 15(12): 3614–3617. 2022. (demo paper)
29
Summary
A method, infrastructure and tooling for collecting, querying, and visualizing very fine-grained provenance from data processing pipelines
Open questions:
1. What is the killer app for such granular provenance?
2. How general is the technique with respect to arbitrary pandas programs?



Editor's Notes

  • #7: $f_1$ associates the string \emph{young} with an age less than 25 and the string \emph{adult} otherwise; $f_2$ computes the average of a set of numbers.
  • #19:
    \begin{align*}
    D_1 &= \tau_{f(K)}(D_a)\\
    D_2 &= D_b \join^{\tt outer}_{K_1=K_2} D_c\\
    D_3 &= D_1 \union D_2\\
    D_4 &= \horaug_{h(B)}(D_3)\\
    D_5 &= \pi_{\{A,B_0,B_1\}}(D_4)
    \end{align*}