The lifecycle of reproducible science data
and what provenance has got to do with it
Paolo Missier
School of Computing Science
Newcastle University, UK
Alan Turing Institute
Symposium On Reproducibility for Data-Intensive Research
Oxford, April 6, 2016
With material contributed by:
Yang Cao, Bertram Ludäscher, Tim McPhillips, Dave Vieglais, Matt Jones and
the DataONE CyberInfrastructure group
Rawaa Qasha at Newcastle University
Carole Goble at the University of Manchester
P. Missier
ATI Symposium on Reproducibility
Oxford, April 6th, 2016
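(Yet another) Data Lifecycle picture
[Diagram. Stages: search and discover; package and publish; deploy; compute; compare.]
[Labels: published package <D, P, dep, spec(P), prov(D)>; deploy spec(P’), P’ → Env(dep’); changes D → D1, P → P’, dep → dep’; compute in Env producing D’ with prov(D’); compare (P, P’, D, D’) using spec(P), prov(D).]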
Reproducibility: working. reporting
submit article and move on…
publish article
Research Environment
Publication Environment
Peer Review
Re-what?
Re-*
ReRun: vary experiment and setup, same lab
P → P’, D → D’, dep → dep’
Repeat: same experiment, setup, lab
P, D, dep, env(dep)
Replicate: same experiment, setup, different lab
P, D, dep, env’(dep)
Reproduce: vary experiment and setup, different lab
P → P’, D → D’, dep → dep’, env(dep) → env’(dep’)
Reuse: different experiment
D, P → Q
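The Re-* vocabulary above can be read as a two-way split: did the experiment or setup (P, D, dep) vary, and did the lab change? The following is a minimal sketch of that reading, not a definition from the talk. The `Run` record and the `classify` function are hypothetical names introduced for illustration, and "Reuse" (a different experiment Q) is deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One execution, described by the symbols used on the slide."""
    process: str  # P
    data: str     # D
    deps: str     # dep
    lab: str      # the lab in which env(dep) was set up


def classify(original: Run, new: Run) -> str:
    """Name the relationship between two runs using the Re-* vocabulary."""
    varied = (original.process, original.data, original.deps) != (
        new.process, new.data, new.deps)
    same_lab = original.lab == new.lab
    if varied:
        return "ReRun" if same_lab else "Reproduce"
    return "Repeat" if same_lab else "Replicate"


base = Run("P", "D", "dep", "lab-A")
print(classify(base, Run("P", "D", "dep", "lab-A")))
print(classify(base, Run("P", "D", "dep", "lab-B")))
print(classify(base, Run("P'", "D'", "dep'", "lab-A")))
print(classify(base, Run("P'", "D'", "dep'", "lab-B")))
```

Comparing plain strings is a stand-in; in practice each field would be a version identifier or content hash.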
Mapping the reproducibility space
Goal: to help scientists understand the effect of workflow / data / dependencies evolution on workflow execution results
Approach: compare provenance traces generated during the runs: PDIFF
P. Missier, S. Woodman, H. Hiden, P. Watson. Provenance and data differencing for workflow reproducibility analysis. Concurrency Computat.: Pract. Exper., 2013.
Workflow evolution
Each of the elements in an execution may evolve (semi-)independently from the others.
Repeatability: can tr_t be computed again at some time t’ > t?
Requires saving ED_t, but this may be impractical (e.g. large DB state)
Reproducibility
Can a new version tr_t’ of tr_t be computed at some later time t’ > t, after one or more of the elements has changed?
Potential issues:
• W_i may not run with new ED_j’
• W_i may not run with wfms_k’
• W_i’ may not run with d_h’
• ...
(Yet another) Data Lifecycle picture
[Diagram. Stages: search and discover; package and publish; deploy; compute; compare.]
[Labels: deploy spec(P’), P’ → Env; changes D → D1, P → P’, dep → dep’; compute in Env producing D’ with prov(D’); compare (P, P’, D, D’) using spec(P), prov(D).]
Tools and infrastructure placed on the picture:
• Research Objects
• DataONE Federated Research Data Repositories
• TOSCA-based virtualisation
• Pdiff: differencing provenance
• YesWorkflow
• Workflow Provenance
• NoWorkflow
• Matlab provenance recorder (DataONE)
• ReproZip
You are here
Data packaging: Research Objects
DataONE: Data packaging, publication, search and discovery, hosting
• R provenance recorder
• Process-as-a-dataflow: YesWorkflow
Process Virtualisation using TOSCA
Provenance recorders
• Workflow Provenance
• Taverna, eScience Central, Kepler, Pegasus, VisTrails…
• NoWorkflow: provenance recording for Python
• Pdiff: provenance differencing for understanding workflow
differences
Computational Workflow Runs
workflowrun.prov.ttl
(RDF)
outputA.txt
outputC.jpg
outputB/
intermediates/
1.txt
2.txt
3.txt
de/def2e58b-50e2-4949-9980-fd310166621a.txt
inputA.txt
workflow attribution
execution
environment
Aggregating in Research Object
ZIP folder structure (RO Bundle)
mimetype
application/vnd.wf4ever.robundle+zip
.ro/manifest.json
URI
references
Exchange
Reproducibility
Same data
Same code
Systematic and extensible metadata collection
Workflow
Annotation Profile
Wf4Ever
Project
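The RO Bundle layout on this slide (a ZIP with a `mimetype` entry and a `.ro/manifest.json`) can be sketched with the Python standard library. This is an illustrative sketch under stated assumptions, not the Wf4Ever tooling. The function name `build_ro_bundle` is hypothetical, and the manifest keys (`@context`, `id`, `aggregates`, `uri`) follow my reading of the RO Bundle specification and should be checked against it before real use.

```python
import io
import json
import zipfile

# Media type taken from the slide.
MIMETYPE = "application/vnd.wf4ever.robundle+zip"


def build_ro_bundle(files: dict) -> bytes:
    """Pack files into an in-memory ZIP laid out like an RO Bundle."""
    manifest = {
        "@context": ["https://w3id.org/bundle/context"],  # assumed context URI
        "id": "/",
        # One aggregate entry per packaged resource.
        "aggregates": [{"uri": "/" + name} for name in files],
    }
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bundle:
        # The mimetype entry goes first and is stored uncompressed.
        bundle.writestr("mimetype", MIMETYPE, compress_type=zipfile.ZIP_STORED)
        bundle.writestr(".ro/manifest.json", json.dumps(manifest, indent=2))
        for name, payload in files.items():
            bundle.writestr(name, payload)
    return buffer.getvalue()


# File names from the slide; the contents are placeholders.
packed = build_ro_bundle({
    "inputA.txt": b"input data",
    "outputA.txt": b"output data",
    "workflowrun.prov.ttl": b"# provenance trace (RDF) would go here\n",
})
with zipfile.ZipFile(io.BytesIO(packed)) as bundle:
    print(bundle.namelist())
```

A fuller manifest would also carry the identification, annotation and provenance fields listed on the Manifest Metadata slide.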
Manifests and Containers
Container
Packaging:
Zip files, Docker images, BagIt, …
Catalogues & Commons Platforms:
FAIRDOM SEEK, Farr Commons CKAN,
STELAR eLab, myExperiment
Manifest
Metadata
Describes the aggregated resources, their
annotations and their provenance
Manifest
Manifest Metadata
Manifest Construction
• Identification – id, title, creator, status….
• Aggregates – list of ids/links to resources
• Annotations – list of annotations about resources
Manifest
Manifest Description
• Checklists – what should be there
• Provenance – where it came from
• Versioning – its evolution
• Dependencies – what else is needed
Manifest
You are here
Data packaging: Research Objects
DataONE: Data packaging, publication, search and discovery, hosting
• R provenance recorder
• Process-as-a-dataflow: YesWorkflow
Process Virtualisation using TOSCA
Provenance recorders
• Workflow Provenance
• Taverna, eScience Central, Kepler, Pegasus, VisTrails…
• NoWorkflow: provenance recording for Python
• Pdiff: provenance differencing for understanding workflow
differences
Components for a flexible, scalable, sustainable network
Cyberinfrastructure Component 2
Member Nodes
www.dataone.org/member-nodes
Coordinating Nodes
• retain complete metadata catalog
• indexing for search
• network-wide services
• ensure content availability (preservation)
• replication services
Member Nodes
• diverse institutions
• serve local community
• provide resources for managing their data
• retain copies of data
Cyberinfrastructure
Data Services: Extraction, sub-setting etc
Provenance Semantics-enabled Discovery
ontology
annotation
System
Metadata
Science
Data
Search
API
Science
Metadata
Provenance
Replicate
Metadata
Index
Data Holdings
Use Provenance for Transparency, Reproducibility
What input data went into this study?
What methods were used? … with what parameter settings, calibrations, …?
Can we trust the data and methods?
• Provenance (lineage): track origin and processing history of data → trust, data quality ~ audit trail for attribution, credit
• Discovery of data, methodologies, experiments
W3C PROV model
• W3C has published the ‘PROV’ standard
See w3.org/TR/prov-o/
Node types: Entity, Activity, Agent
Relations: used, wasGeneratedBy, wasAssociatedWith, wasAttributedTo
Using a common model
• Example: Scientific workflow
Nodes: map image, R script, Execution, Scientist, CSV data
Relations: used, wasGeneratedBy, wasAssociatedWith, wasAttributedTo, wasDerivedFrom
< “map image” wasDerivedFrom “CSV data” >
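The scientific workflow example can be written down as plain triples. The sketch below assumes the edges that the diagram labels suggest (which node connects to which is my assumption, since only the wasDerivedFrom statement is spelled out on the slide). The helper `candidate_derivations` is a hypothetical name. It proposes derivations by pairing each generated entity with the entities its generating activity used. That is a heuristic: in PROV, generation plus usage does not by itself imply derivation.

```python
# (subject, relation, object) triples; edge placement is assumed from the slide.
TRACE = [
    ("map image", "wasGeneratedBy", "Execution"),
    ("Execution", "used", "CSV data"),
    ("Execution", "used", "R script"),
    ("Execution", "wasAssociatedWith", "Scientist"),
    ("map image", "wasAttributedTo", "Scientist"),
]


def candidate_derivations(trace):
    """Pair each generated entity with everything its generating activity used."""
    used_by_activity = {}
    for subject, relation, obj in trace:
        if relation == "used":
            used_by_activity.setdefault(subject, []).append(obj)
    return [
        (entity, "wasDerivedFrom", source)
        for entity, relation, activity in trace
        if relation == "wasGeneratedBy"
        for source in used_by_activity.get(activity, [])
    ]


for triple in candidate_derivations(TRACE):
    print(triple)
```

The candidate list includes the slide's statement that "map image" was derived from "CSV data"; whether the R script also counts as a derivation source is a modelling choice.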
ProvONE Motivation: Different Kinds of Provenance
• Prospective Provenance: method/workflow description (“workflow-land”)
• Retrospective Provenance: runtime provenance tracking (“trace-land”)
• Better together!
ProvONE extends PROV for science!
“Trace-Land”
“Workflow-Land”
“Data-Land”
http://purl.dataone.org/provone-v1-dev
DataONE data packages: Provenance inside!
[Diagram labels: resource map (OAI-ORE with ProvONE trace); science metadata; science data; figures; software; a system metadata record attached to each.]
Provenance … of Figures
Provenance … of Data
YesWorkflow (YW): Scripts as prospective provenance
# @begin CreateGulfOfAlaskaMaps
# @in hcdb @as Total_Aromatic_Alkanes_PWS.csv
# @in world @as RWorldMap
# @out map @as Map_Of_Sampling_Locations.png
# @out detailMap @as Detailed_Map_Of_SamplingLocations.png
... mapping code is here ...
# @end CreateGulfOfAlaskaMaps
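To illustrate how annotations like these can be turned into a prospective-provenance description, here is a small sketch that reads `@begin`, `@in`, `@out` and `@as` tags out of script comments. It is not the YesWorkflow tool or its API. The function `parse_yw` is a hypothetical name, and the sketch handles only a single, non-nested block.

```python
import re

# Matches an annotation keyword followed by its single argument.
YW_TAG = re.compile(r"@(begin|end|in|out|as)\s+(\S+)")

SCRIPT = """\
# @begin CreateGulfOfAlaskaMaps
# @in hcdb @as Total_Aromatic_Alkanes_PWS.csv
# @in world @as RWorldMap
# @out map @as Map_Of_Sampling_Locations.png
# @out detailMap @as Detailed_Map_Of_SamplingLocations.png
# ... mapping code is here ...
# @end CreateGulfOfAlaskaMaps
"""


def parse_yw(script: str) -> dict:
    """Collect the block name and its input/output ports from comment tags."""
    block = {"name": None, "inputs": [], "outputs": []}
    for line in script.splitlines():
        tags = YW_TAG.findall(line)
        if not tags:
            continue
        keyword, value = tags[0]
        # An optional @as on the same line gives the port its alias.
        alias = next((v for k, v in tags[1:] if k == "as"), None)
        if keyword == "begin":
            block["name"] = value
        elif keyword == "in":
            block["inputs"].append({"name": value, "as": alias})
        elif keyword == "out":
            block["outputs"].append({"name": value, "as": alias})
    return block


model = parse_yw(SCRIPT)
print(model["name"])
print([port["as"] for port in model["inputs"]])
print([port["as"] for port in model["outputs"]])
```

Because the tags live in comments, the same approach applies to MATLAB, R or Python scripts without running them.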
YesWorkflow (YW): Scripts as prospective provenance
MATLAB, R, Python, … scripts
• Script + @YW-annotation: workflow-land & trace-land
• Combine provenance:
  • Prospective (workflow)
  • Retrospective (runtime trace)
  • Reconstructed (logs, files, …)
• User can query own data & provenance prior to sharing
• Incentive: accelerate work!
• “Provenance for Self”
Transitive Credit
When a user cites a pub, we know:
• Which data produced it
• What software produced it
• What was derived from it
• Who to credit down the attribution stack
References:
• Katz & Smith. 2014. Implementing Transitive Credit with JSON-LD. arXiv:1407.5117
• Missier, Paolo. “Data Trajectories: Tracking Reuse of Published Data for Transitive Credit Attribution.” 11th Intl. Digital Curation Conference (IDCC). Amsterdam, 2016. (Best Paper Award)
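One simplified way to read "credit down the attribution stack" is that each product assigns fractional credit to its contributors, and credit given to another product is passed on through that product's own credit map. The sketch below implements that reading. The function name, the product names and the weights are all hypothetical, and the model assumes the credit graph has no cycles.

```python
def transitive_credit(product, credit_maps, weight=1.0, totals=None):
    """Push a product's credit down to the contributors at the bottom of the stack."""
    totals = {} if totals is None else totals
    for contributor, share in credit_maps[product].items():
        if contributor in credit_maps:
            # The contributor is itself a product: pass its share further down.
            transitive_credit(contributor, credit_maps, weight * share, totals)
        else:
            totals[contributor] = totals.get(contributor, 0.0) + weight * share
    return totals


# Hypothetical credit maps; each map's shares sum to 1.
CREDIT_MAPS = {
    "paper": {"Alice": 0.5, "dataset": 0.3, "software": 0.2},
    "dataset": {"Bob": 1.0},
    "software": {"Carol": 0.5, "Dave": 0.5},
}

for name, share in sorted(transitive_credit("paper", CREDIT_MAPS).items()):
    print(name, round(share, 3))
```

The provenance links listed above (which data and software produced a publication) are what would let such credit maps be chained automatically.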
Provenance today: Important but hard
Climate Change Impacts in the United States
U.S. National Climate Assessment
U.S. Global Change Research Program
“This report is the result of a three-year analytical effort by a team of over 300 experts, overseen by a broadly constituted Federal Advisory Committee of 60 members. It was developed from information and analyses gathered in over 70 workshops and listening sessions held across the country.”
Provenance today: Important but hard
• data and “code” / method linked
• alt formats
Provenance in action
• Yaxing’s script with inputs & output products
• YesWorkflow model
• Christopher using Yaxing’s outputs as inputs for his script
• Christopher’s results can be traced back all the way to Yaxing’s input
You are here
Data packaging: Research Objects
DataONE: Data packaging, publication, search and discovery, hosting
• R provenance recorder
• Process-as-a-dataflow: YesWorkflow
Process Virtualisation using TOSCA
Provenance recorders
• Workflow Provenance
• Taverna, eScience Central, Kepler, Pegasus, VisTrails…
• NoWorkflow: provenance recording for Python
• Pdiff: provenance differencing for understanding workflow
differences
TOSCA
• Topology and Orchestration Specification for Cloud Applications
Use Case: e-Science Central Workflow
http://www.esciencecentral.co.uk
TOSCA-based mapping of an e-SC Workflow
• Workflow components as Node Types
• Block dependencies as Relationship Types
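The mapping can be sketched as a function that turns workflow blocks into node templates and block-to-block links into requirements. This is not the e-SC service template from the talk. The top-level keys follow the TOSCA Simple Profile in YAML as I understand it, the `esc.nodes.*` type names and the block names are hypothetical, and the generic `dependency` requirement stands in for the custom Relationship Types the slide mentions.

```python
import json


def workflow_to_service_template(blocks, links):
    """Build a TOSCA-like service template from workflow blocks and links."""
    node_templates = {
        name: {"type": "esc.nodes." + block_type, "requirements": []}  # hypothetical types
        for name, block_type in blocks.items()
    }
    for source, target in links:
        # The target block consumes the source's output, so it depends on it.
        node_templates[target]["requirements"].append({"dependency": source})
    return {
        "tosca_definitions_version": "tosca_simple_yaml_1_0",
        "topology_template": {"node_templates": node_templates},
    }


# A hypothetical three-block workflow: import -> filter -> export.
template = workflow_to_service_template(
    blocks={"import": "ImportFile", "filter": "ColumnFilter", "export": "ExportFile"},
    links=[("import", "filter"), ("filter", "export")],
)
print(json.dumps(template, indent=2))
```

Serialising the dictionary as YAML rather than JSON would give the usual TOSCA file form.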
e-SC Workflow Service Template
You are here
Data packaging: Research Objects
DataONE: Data packaging, publication, search and discovery, hosting
• R provenance recorder
• Process-as-a-dataflow: YesWorkflow
Process Virtualisation using TOSCA
Provenance recorders
• Workflow Provenance
• Taverna, eScience Central, Kepler, Pegasus, VisTrails…
• NoWorkflow: provenance recording for Python
• Pdiff: provenance differencing for understanding workflow
differences
Data divergence analysis using provenance
All work done with reference to the e-Science Central WFMS
Assumption: workflow WFj (new version) runs to completion, thus it produces a new provenance trace; however, it may be dysfunctional relative to WFi (the original)
Example: only input data changes: d != d’, WFj == WFi
Note: results may diverge even when the input datasets are identical, for example when one or more of the services exhibits non-deterministic behaviour, or depends on external state that has changed between executions.
Provenance traces for two runs
used
genBy
Delta graphs
A graph obtained as a result of a traces “diff”, which can be used to explain observed differences in workflow outputs in terms of differences throughout the two executions.
This is the simplest possible delta “graph”!
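The idea of a traces "diff" can be sketched as a co-traversal: start from the two final outputs, walk both traces upstream in lockstep, and record every pair of data items that differ. This is a simplified illustration, not the PDIFF algorithm from the cited paper. The `Item` record and `delta` function are hypothetical, and the sketch assumes both traces have the same shape, so inputs can be paired by position.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Item(NamedTuple):
    digest: str              # content hash of the data item
    service: Optional[str]   # service that generated it (None for workflow inputs)
    inputs: Tuple[str, ...]  # ids of the data items that service used


def delta(trace_a: Dict[str, Item], trace_b: Dict[str, Item],
          id_a: str, id_b: str) -> List[Tuple[str, str, str]]:
    """Walk two traces upstream in lockstep and list the diverging pairs."""
    a, b = trace_a[id_a], trace_b[id_b]
    if a.digest == b.digest:
        return []  # identical items: nothing upstream to explain on this path
    if a.service != b.service:
        reason = "generated by different services"
    elif not a.inputs and not b.inputs:
        reason = "workflow input differs"
    else:
        reason = "same service, different result"
    found = [(id_a, id_b, reason)]
    for parent_a, parent_b in zip(a.inputs, b.inputs):
        found.extend(delta(trace_a, trace_b, parent_a, parent_b))
    return found


# Two runs of the same workflow d -> S1 -> x -> S2 -> out, where only the input changed.
run_i = {"d": Item("h-d", None, ()),
         "x": Item("h-x", "S1", ("d",)),
         "out": Item("h-out", "S2", ("x",))}
run_j = {"d": Item("h-d2", None, ()),
         "x": Item("h-x2", "S1", ("d",)),
         "out": Item("h-out2", "S2", ("x",))}

for row in delta(run_i, run_j, "out", "out"):
    print(row)
```

Here the walk ends at the workflow input, which corresponds to the case where only the input data changed (d != d’).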
More involved workflow differences
WA
WB
sv2
The corresponding traces
Delta graph computed by PDIFF
References
Research Objects: www.researchobject.org
Bechhofer, Sean, Iain Buchan, David De Roure, Paolo Missier, J. Ainsworth, J. Bhagat, P. Couch, et
al. “Why Linked Data Is Not Enough for Scientists.” Future Generation Computer Systems (2011).
doi:10.1016/j.future.2011.08.004.
DataONE: dataone.org
Cuevas-Vicenttín, Víctor, Parisa Kianmajd, Bertram Ludäscher, Paolo Missier, Fernando Chirigati,
Yaxing Wei, David Koop, and Saumen Dey. “The PBase Scientific Workflow Provenance Repository.”
In Procs. 9th International Digital Curation Conference, 9:28–38. San Francisco, CA, USA, 2014.
doi:10.2218/ijdc.v9i2.332.
Process Virtualisation using TOSCA
Qasha, Rawaa, Jacek Cala, and Paul Watson. “Towards Automated Workflow Deployment in the Cloud Using
TOSCA.” In 2015 IEEE 8th International Conference on Cloud Computing, 1037–1040. New York, 2015.
doi:10.1109/CLOUD.2015.146.
NoWorkflow: provenance recording for Python
Murta, Leonardo, Vanessa Braganholo, Fernando Chirigati, David Koop, and Juliana Freire.
“noWorkflow: Capturing and Analyzing Provenance of Scripts.” In Procs. IPAW’14. Cologne,
Germany: Springer, 2014.
Pdiff: provenance differencing for understanding workflow differences
Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing
for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience
(2013). doi:10.1002/cpe.3035.

More Related Content

PPTX
Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud
PPTX
Data Trajectories: tracking the reuse of published data for transitive credi...
PPTX
Your data won’t stay smart forever: exploring the temporal dimension of (big ...
PPTX
ReComp project kickoff presentation 11-03-2016
PPTX
ReComp: challenges in selective recomputation of (expensive) data analytics t...
PDF
Introduction to Data streaming - 05/12/2014
PDF
Moa: Real Time Analytics for Data Streams
PDF
Sentiment Knowledge Discovery in Twitter Streaming Data
Scalable Whole-Exome Sequence Data Processing Using Workflow On A Cloud
Data Trajectories: tracking the reuse of published data for transitive credi...
Your data won’t stay smart forever: exploring the temporal dimension of (big ...
ReComp project kickoff presentation 11-03-2016
ReComp: challenges in selective recomputation of (expensive) data analytics t...
Introduction to Data streaming - 05/12/2014
Moa: Real Time Analytics for Data Streams
Sentiment Knowledge Discovery in Twitter Streaming Data

What's hot (20)

PDF
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
PDF
Artificial intelligence and data stream mining
PDF
Mining big data streams with APACHE SAMOA by Albert Bifet
PDF
ISNCC 2017
PDF
Fast Perceptron Decision Tree Learning from Evolving Data Streams
PPTX
Data Streaming in Big Data Analysis
PPT
5.1 mining data streams
PDF
Mining Big Data Streams with APACHE SAMOA
PDF
18 Data Streams
PDF
MOA for the IoT at ACML 2016
PPTX
Analytics of analytics pipelines: from optimising re-execution to general Dat...
PDF
Cloud-based Data Stream Processing
PDF
parallel OLAP
PPTX
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
PDF
Efficient Duplicate Detection Over Massive Data Sets
PDF
Real-Time Big Data Stream Analytics
PDF
Indexing Techniques for Scalable Record Linkage and Deduplication
PPT
Chapter 08 Data Mining Techniques
PDF
Hadoop ensma poitiers
PDF
Efficient Online Evaluation of Big Data Stream Classifiers
Albert Bifet – Apache Samoa: Mining Big Data Streams with Apache Flink
Artificial intelligence and data stream mining
Mining big data streams with APACHE SAMOA by Albert Bifet
ISNCC 2017
Fast Perceptron Decision Tree Learning from Evolving Data Streams
Data Streaming in Big Data Analysis
5.1 mining data streams
Mining Big Data Streams with APACHE SAMOA
18 Data Streams
MOA for the IoT at ACML 2016
Analytics of analytics pipelines: from optimising re-execution to general Dat...
Cloud-based Data Stream Processing
parallel OLAP
Accelerating the Experimental Feedback Loop: Data Streams and the Advanced Ph...
Efficient Duplicate Detection Over Massive Data Sets
Real-Time Big Data Stream Analytics
Indexing Techniques for Scalable Record Linkage and Deduplication
Chapter 08 Data Mining Techniques
Hadoop ensma poitiers
Efficient Online Evaluation of Big Data Stream Classifiers
Ad

Similar to The lifecycle of reproducible science data and what provenance has got to do with it (20)

PDF
Sharing massive data analysis: from provenance to linked experiment reports
PPTX
Setting Up a Qualitative or Mixed Methods Research Project in NVivo 10 to Cod...
PPTX
Identifying semantics characteristics of user’s interactions datasets through...
PDF
Camp 4-data workshop presentation
PDF
Workflow Provenance: From Modelling to Reporting
PPT
Services For Science April 2009
PDF
Using Neo4j for exploring the research graph connections made by RD-Switchboard
PDF
ISMB Workshop 2014
PPTX
2016 07 12_purdue_bigdatainomics_seandavis
PDF
The Role of Metadata in Reproducible Computational Research
PPTX
Enabling semantic integration
PPTX
Keynote speech - Carole Goble - Jisc Digital Festival 2015
PPTX
RARE and FAIR Science: Reproducibility and Research Objects
PDF
Christine borgman keynote
PPTX
Cloud com foster december 2010
PDF
Tools für das Management von Forschungsdaten
PDF
Towards research data knowledge graphs
PPTX
Using Embeddings for Dynamic Diverse Summarisation in Heterogeneous Graph Str...
PPT
RO Advisory Kickoff Slides
Sharing massive data analysis: from provenance to linked experiment reports
Setting Up a Qualitative or Mixed Methods Research Project in NVivo 10 to Cod...
Identifying semantics characteristics of user’s interactions datasets through...
Camp 4-data workshop presentation
Workflow Provenance: From Modelling to Reporting
Services For Science April 2009
Using Neo4j for exploring the research graph connections made by RD-Switchboard
ISMB Workshop 2014
2016 07 12_purdue_bigdatainomics_seandavis
The Role of Metadata in Reproducible Computational Research
Enabling semantic integration
Keynote speech - Carole Goble - Jisc Digital Festival 2015
RARE and FAIR Science: Reproducibility and Research Objects
Christine borgman keynote
Cloud com foster december 2010
Tools für das Management von Forschungsdaten
Towards research data knowledge graphs
Using Embeddings for Dynamic Diverse Summarisation in Heterogeneous Graph Str...
RO Advisory Kickoff Slides
Ad

More from Paolo Missier (20)

PPTX
Data and end-to-end Explainability (XAI,XEE)
PPTX
A simple Introduction to Explainability in Machine Learning and AI (XAI)
PPTX
A simple Introduction to Algorithmic Fairness
PDF
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
PDF
Design and Development of a Provenance Capture Platform for Data Science
PDF
Towards explanations for Data-Centric AI using provenance records
PPTX
Interpretable and robust hospital readmission predictions from Electronic Hea...
PPTX
Data-centric AI and the convergence of data and model engineering: opportunit...
PPTX
Realising the potential of Health Data Science: opportunities and challenges ...
PPTX
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
PDF
A Data-centric perspective on Data-driven healthcare: a short overview
PPTX
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
PPTX
Tracking trajectories of multiple long-term conditions using dynamic patient...
PPTX
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
PPTX
Digital biomarkers for preventive personalised healthcare
PPTX
Digital biomarkers for preventive personalised healthcare
PPTX
Data Provenance for Data Science
PPTX
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
PPTX
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
PPTX
Data Science for (Health) Science: tales from a challenging front line, and h...
Data and end-to-end Explainability (XAI,XEE)
A simple Introduction to Explainability in Machine Learning and AI (XAI)
A simple Introduction to Algorithmic Fairness
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
Design and Development of a Provenance Capture Platform for Data Science
Towards explanations for Data-Centric AI using provenance records
Interpretable and robust hospital readmission predictions from Electronic Hea...
Data-centric AI and the convergence of data and model engineering: opportunit...
Realising the potential of Health Data Science: opportunities and challenges ...
Provenance Week 2023 talk on DP4DS (Data Provenance for Data Science)
A Data-centric perspective on Data-driven healthcare: a short overview
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Tracking trajectories of multiple long-term conditions using dynamic patient...
Delivering on the promise of data-driven healthcare: trade-offs, challenges, ...
Digital biomarkers for preventive personalised healthcare
Digital biomarkers for preventive personalised healthcare
Data Provenance for Data Science
Capturing and querying fine-grained provenance of preprocessing pipelines in ...
Quo vadis, provenancer?  Cui prodest?  our own trajectory: provenance of data...
Data Science for (Health) Science: tales from a challenging front line, and h...

Recently uploaded (20)

PDF
Empathic Computing: Creating Shared Understanding
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
Accuracy of neural networks in brain wave diagnosis of schizophrenia
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PDF
Univ-Connecticut-ChatGPT-Presentaion.pdf
PDF
Approach and Philosophy of On baking technology
PPTX
SOPHOS-XG Firewall Administrator PPT.pptx
PDF
Encapsulation_ Review paper, used for researhc scholars
PDF
Unlocking AI with Model Context Protocol (MCP)
PDF
Getting Started with Data Integration: FME Form 101
PDF
gpt5_lecture_notes_comprehensive_20250812015547.pdf
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PDF
Encapsulation theory and applications.pdf
PPTX
OMC Textile Division Presentation 2021.pptx
PDF
Video forgery: An extensive analysis of inter-and intra-frame manipulation al...
PPTX
TechTalks-8-2019-Service-Management-ITIL-Refresh-ITIL-4-Framework-Supports-Ou...
PDF
Spectral efficient network and resource selection model in 5G networks
PDF
A comparative study of natural language inference in Swahili using monolingua...
PDF
NewMind AI Weekly Chronicles - August'25-Week II
PDF
Mushroom cultivation and it's methods.pdf
Empathic Computing: Creating Shared Understanding
Advanced methodologies resolving dimensionality complications for autism neur...
Accuracy of neural networks in brain wave diagnosis of schizophrenia
Mobile App Security Testing_ A Comprehensive Guide.pdf
Univ-Connecticut-ChatGPT-Presentaion.pdf
Approach and Philosophy of On baking technology
SOPHOS-XG Firewall Administrator PPT.pptx
Encapsulation_ Review paper, used for researhc scholars
Unlocking AI with Model Context Protocol (MCP)
Getting Started with Data Integration: FME Form 101
gpt5_lecture_notes_comprehensive_20250812015547.pdf
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
Encapsulation theory and applications.pdf
OMC Textile Division Presentation 2021.pptx
Video forgery: An extensive analysis of inter-and intra-frame manipulation al...
TechTalks-8-2019-Service-Management-ITIL-Refresh-ITIL-4-Framework-Supports-Ou...
Spectral efficient network and resource selection model in 5G networks
A comparative study of natural language inference in Swahili using monolingua...
NewMind AI Weekly Chronicles - August'25-Week II
Mushroom cultivation and it's methods.pdf

The lifecycle of reproducible science data and what provenance has got to do with it

  • 1. The lifecycle of reproducible science data and what provenance has got to do with it Paolo Missier School of Computing Science Newcastle University, UK Alan Turing Institute Symposium On Reproducibility for Data-Intensive Research Oxford, April 6, 2016 With material contributed by: Yang Cao, Bertram Ludascher, Tim McPhillips, Dave Vieglais, Matt Jones and the DataONE CyberInfrastructure group Rawaa Qasha at Newcastle University Carole Goble at the University of Manchester
  • 2. P.Missier ATISymposiumonReproducibility OxfordApril6th,2016 (Yet another) Data Lifecycle picture Search discover packagepublish spec(P’) Deploy P’  Env(dep’) prov(D’) Compare (P,P’,D,D’) spec(P) prov(D) D  D1 P  P’ dep  dep’ <D,P,dep,spec(P), prov(D)> compute Env D’ D1
  • 3. Reproducibility: working. reporting submit article and move on… publish article Research Environment Publication Environment Peer Review
  • 4. P.Missier ATISymposiumonReproducibility OxfordApril6th,2016 Re-what? Re-* ReRun: vary experiment and setup, same lab P P’ DD’ depdep’ Repeat: Same experiment, setup, lab P, D, dep, env(dep) Replicate: Same experiment, setup, different lab P, D, dep, env’(dep) Reproduce: vary experiment and setup, different lab P P’ DD’ depdep’ env(dep) env’(dep’) Reuse: Different experiment D, P  Q
  • 5. P.Missier ATISymposiumonReproducibility OxfordApril6th,2016 Mapping the reproducibility space 5 Goal: to help scientists understand the effect of workflow / data / dependencies evolution on workflow execution results Approach: compare provenance traces generated during the runs: PDIFF P. Missier, S. Woodman, H Hiden, P. Watson. Provenance and data differencing for workflow reproducibility analysis, Concurrency Computat.: Pract. Exper., 2013.
  • 6. P.Missier ATISymposiumonReproducibility OxfordApril6th,2016 Workflow evolution 6 Each of the elements in an execution may evolve (semi) independently from the others: Can trt be computed again at some time t’>t? Requires saving EDt but may be impractical (eg large DB state) Repeatability:
  • 7. P.Missier ATISymposiumonReproducibility OxfordApril6th,2016 Reproducibility 7 Can a new version trt’ of trt be computed at some later time t’ > t, after one of more of the elements has changed? • Wi may not run new EDj’ • Wi may not run with wfmsk’ • Wi’ may not run with dh’ • ... Potential issues:
  • 8. P.Missier ATISymposiumonReproducibility OxfordApril6th,2016 (Yet another) Data Lifecycle picture Search discover packagepublish spec(P’) Deploy P’  Env D  D1 P  P’ dep  dep’ compute Env D’ prov(D’) Compare (P,P’,D,D’) spec(P) prov(D) Research Objects DataONE Federated Research Data Repositories - Matlab provenance recorder TOSCA-based virtualisation Pdiff: differencing provenance YesWorkflow - Workflow Provenance - NoWorkflow Matlab provenance recorder (DataONE) ReproZip
  • 9. P.Missier ATISymposiumonReproducibility OxfordApril6th,2016 You are here Data packaging: Research Objects DataONE: Data packaging, publication, search and discovery, hosting • R provenance recorder • Process-as-a-dataflow: YesWorkflow Process Virtualisation using TOSCA Provenance recorders • Workflow Provenance • Taverna, eScience Central, Kepler, Pegasus, VisTrails… • NoWorkflow: provenance recording for Python • Pdiff: provenance differencing for understanding workflow differences
  • 10. Computational Workflow Runs workflowrun.prov.ttl (RDF) outputA.txt outputC.jpg outputB/ intermediates/ 1.txt 2.txt 3.txt de/def2e58b-50e2-4949-9980- fd310166621a.txt inputA.txt workflow attribution execution environment Aggregating in Research Object ZIP folder structure (RO Bundle) mimetype application/vnd.wf4ever.robundle+zi p .ro/manifest.jso n URI references Exchange Reproducibility Same data Same code Systematic and extensible meta- data collection Workflow Annotation Profile Wf4Ever Project
  • 11. P.Missier ATISymposiumonReproducibility OxfordApril6th,2016 Manifests and Containers Container Packaging: Zip files, Docker images, BagIt, … Catalogues & Commons Platforms: FAIRDOM SEEK, Farr Commons CKAN, STELAR eLab, myExperiment Manifest Metadata Describes the aggregated resources, their annotations and their provenance Manifest
  • 12. P.Missier ATISymposiumonReproducibility OxfordApril6th,2016 Manifest Metadata Manifest Construction • Identification – id, title, creator, status…. • Aggregates – list of ids/links to resources • Annotations – list of annotations about resources Manifest Manifest Description • Checklists – what should be there • Provenance – where it came from • Versioning – its evolution • Dependencies – what else is needed Manifest
  • 13. P.Missier ATISymposiumonReproducibility OxfordApril6th,2016 You are here Data packaging: Research Objects DataONE: Data packaging, publication, search and discovery, hosting • R provenance recorder • Process-as-a-dataflow: YesWorkflow Process Virtualisation using TOSCA Provenance recorders • Workflow Provenance • Taverna, eScience Central, Kepler, Pegasus, VisTrails… • NoWorkflow: provenance recording for Python • Pdiff: provenance differencing for understanding workflow differences
  • 14. Components for a flexible, scalable, sustainable network Cyberinfrastructure Component 2 Member Nodes www.dataone.org/member-nodes Coordinating Nodes • retain complete metadata catalog • indexing for search • network-wide services • ensure content availability (preservation) • replication services Member Nodes • diverse institutions • serve local community • provide resources for managing their data • retain copies of data 14
  • 15. Cyberinfrastructure Data Services: Extraction, sub-setting etc Provenance Semantics-enabled Discovery ontolog y annotation System Metadata Science Data Search API Science Metadata Provenance Replicate Metadata Index 15
  • 17. What input data went into this study? What methods were used? … with what parameter settings, calibrations, …? Can we trust the data and methods?  Provenance (lineage): track origin and processing history of data  trust, data quality ~ audit trail for attribution, credit  Discovery of data, methodologies, experiments Use Provenance for Transparency, Reproducibility 17
  • 18.  W3C has published the ‘PROV’ standard Entity Activity Agent wasAssociatedWith wasAttributedTo wasGeneratedBy W3C PROV model See w3.org/TR/prov-o/ used 20
  • 20. map image R script Execution Scientist wasAssociatedWith wasAttributedTo wasGeneratedBy CSV data used wasDerivedFrom Using a common model  Example: Scientific workflow 22
  • 21. map image R script Execution Scientist wasAssociatedWith wasAttributedTo wasGeneratedBy CSV data used wasDerivedFrom < “map image” wasDerivedFrom “CSV data” > Using a common model  Example: Scientific workflow 23
  • 22. ProvONE Motivation: Different Kinds of Provenance  Prospective Provenance  method/workflow description (“workflow-land”)  Retrospective Provenance  runtime provenance tracking (“trace-land”)  Better together! 24
  • 23. ProvONE extends PROV for science! “Trace-Land” “Workflow-Land” “Data-Land” http://purl.dataone.org/provone-v1-dev 25
  • 24. DataONE data packages: Provenance inside! resource map science metadata system metadata science data system metadata system metadata OAI-ORE with ProvONE trace figures system metadata software system metadata 29
  • 27. 1 # @begin CreateGulfOfAlaskaMaps 2 # @in hcdb @as Total_Aromatic_Alkanes_PWS.csv 3 # @in world @as RWorldMap 4 # @out map @as Map_Of_Sampling_Locations.png 5 # @out detailMap @as Detailed_Map_Of_SamplingLocations.png ... mapping code is here ... 25 # @end CreateGulfOfAlaskaMaps YesWorkflow (YW): Scripts as prospective provenance 33
  • 28. MATLAB, R , Python … Scripts YesWorkflow (YW): Scripts as prospective provenance  Script + @YW-annotation workflow-land & trace-land  Combine provenance:  Prospective (workflow)  Retrospective (runtime trace)  Reconstructed (logs, files, …)  User can query own data & provenance prior to sharing  Incentive: accelerate work!  “Provenance for Self” 34
  • 29. When a user cites a pub, we know:  Which data produced it  What software produced it  What was derived from it  Who to credit down the attribution stack  Katz & Smith. 2014. Implementing Transitive Credit with JSON-LD. arXiv:1407.5117  Missier, Paolo. “Data Trajectories: Tracking Reuse of Published Data for Transitive Credit Attribution.” 11th Intl. Data Curation Conference (IDCC). Amsterdam, 2016. (Best Paper Award) Transitive Credit 36
  • 30. Provenance today: Important but hard. Climate Change Impacts in the United States: U.S. National Climate Assessment, U.S. Global Change Research Program. “This report is the result of a three-year analytical effort by a team of over 300 experts, overseen by a broadly constituted Federal Advisory Committee of 60 members. It was developed from information and analyses gathered in over 70 workshops and listening sessions held across the country.”
  • 31. Provenance today: Important but hard. [Screenshot callouts: data and “code” / method linked; alt formats]
  • 32. Provenance in action. Yaxing’s script with inputs & output products. YesWorkflow model. Christopher using Yaxing’s outputs as inputs for his script. Christopher’s results can be traced back all the way to Yaxing’s input.
  • 33. You are here. Data packaging: Research Objects. DataONE: Data packaging, publication, search and discovery, hosting • R provenance recorder • Process-as-a-dataflow: YesWorkflow. Process Virtualisation using TOSCA. Provenance recorders • Workflow Provenance • Taverna, eScience Central, Kepler, Pegasus, VisTrails… • NoWorkflow: provenance recording for Python • Pdiff: provenance differencing for understanding workflow differences
  • 34. TOSCA: Topology and Orchestration Specification for Cloud Applications
  • 35. Use Case: e-Science Central Workflow. http://www.esciencecentral.co.uk
  • 36. TOSCA-based mapping of an e-SC Workflow • Workflow components as Node Types • Block dependencies as Relationship Types
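The mapping on this slide can be sketched as a function that turns a workflow's blocks and dependencies into a TOSCA-shaped structure. The top-level key names follow the TOSCA Simple Profile in YAML; the workflow, the `esc.nodes.WorkflowBlock` and `esc.relationships.BlockLink` type names, and the `to_service_template` helper are hypothetical and are not taken from the e-Science Central templates.

```python
import json

# Hypothetical e-SC workflow: block name -> blocks it depends on.
WORKFLOW = {
    "ImportFile": [],
    "FilterRows": ["ImportFile"],
    "ExportFile": ["FilterRows"],
}


def to_service_template(workflow):
    """Map workflow blocks to node templates and dependencies to requirements."""
    node_templates = {}
    for block, upstream in workflow.items():
        node_templates[block] = {
            "type": "esc.nodes.WorkflowBlock",  # hypothetical Node Type
            "requirements": [
                {
                    "dependency": {
                        "node": source,
                        # hypothetical Relationship Type
                        "relationship": "esc.relationships.BlockLink",
                    }
                }
                for source in upstream
            ],
        }
    return {
        "tosca_definitions_version": "tosca_simple_yaml_1_0",
        "node_types": {
            "esc.nodes.WorkflowBlock": {"derived_from": "tosca.nodes.Root"}
        },
        "relationship_types": {
            "esc.relationships.BlockLink": {
                "derived_from": "tosca.relationships.DependsOn"
            }
        },
        "topology_template": {"node_templates": node_templates},
    }


template = to_service_template(WORKFLOW)
print(json.dumps(template["topology_template"], indent=2))
```

A real template would normally declare one Node Type per kind of block rather than a single generic type; the single type here only keeps the sketch short.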
  • 37. e-SC Workflow Service Template
  • 38. You are here. Data packaging: Research Objects. DataONE: Data packaging, publication, search and discovery, hosting • R provenance recorder • Process-as-a-dataflow: YesWorkflow. Process Virtualisation using TOSCA. Provenance recorders • Workflow Provenance • Taverna, eScience Central, Kepler, Pegasus, VisTrails… • NoWorkflow: provenance recording for Python • Pdiff: provenance differencing for understanding workflow differences
  • 39. Data divergence analysis using provenance. All work done with reference to the e-Science Central WFMS. Assumption: workflow WFj (the new version) runs to completion and thus produces a new provenance trace; however, it may be dysfunctional relative to WFi (the original). Example: only the input data changes: d != d’, WFj == WFi. Note: results may diverge even when the input datasets are identical, for example when one or more of the services exhibits non-deterministic behaviour, or depends on external state that has changed between executions.
  • 41. Delta graphs. A delta graph is obtained by “diffing” two traces; it can be used to explain observed differences in workflow outputs in terms of differences throughout the two executions. This is the simplest possible delta “graph”!
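A minimal sketch of the delta-graph idea for the case where only the input data changes (d != d’, WFj == WFi): walk back from a diverging output through both traces and keep only the data items whose values differ. This is a simplified illustration, not the PDIFF algorithm from the cited paper; the trace layout, item names and values are hypothetical.

```python
# A trace maps each data item to its value and the items it was computed from.
# Assumption: both runs use the same workflow, so item names line up.
# Item names and values are hypothetical.
TRACE_A = {
    "d": {"value": 10, "inputs": []},
    "ref": {"value": 3, "inputs": []},
    "x": {"value": 20, "inputs": ["d"]},
    "y": {"value": 6, "inputs": ["ref"]},
    "out": {"value": 26, "inputs": ["x", "y"]},
}
TRACE_B = {
    "d": {"value": 11, "inputs": []},
    "ref": {"value": 3, "inputs": []},
    "x": {"value": 22, "inputs": ["d"]},
    "y": {"value": 6, "inputs": ["ref"]},
    "out": {"value": 28, "inputs": ["x", "y"]},
}


def delta_graph(trace_a, trace_b, output):
    """Walk back from `output`, keeping only items whose values differ."""
    edges, roots, seen, stack = [], [], set(), [output]
    while stack:
        item = stack.pop()
        if item in seen or trace_a[item]["value"] == trace_b[item]["value"]:
            continue
        seen.add(item)
        differing = [
            i for i in trace_a[item]["inputs"]
            if trace_a[i]["value"] != trace_b[i]["value"]
        ]
        edges.extend((item, i) for i in differing)
        if not differing:  # divergence starts here
            roots.append(item)
        stack.extend(differing)
    return edges, roots


edges, roots = delta_graph(TRACE_A, TRACE_B, "out")
print(edges, roots)
```

Here the only root is the changed input `d`. A root that is not a workflow input would indicate that the divergence started inside a block, which is the non-determinism case noted on slide 39.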
  • 45. References. Research Objects (www.researchobject.org): Bechhofer, Sean, Iain Buchan, David De Roure, Paolo Missier, J. Ainsworth, J. Bhagat, P. Couch, et al. “Why Linked Data Is Not Enough for Scientists.” Future Generation Computer Systems (2011). doi:10.1016/j.future.2011.08.004. DataONE (dataone.org): Cuevas-Vicenttín, Víctor, Parisa Kianmajd, Bertram Ludäscher, Paolo Missier, Fernando Chirigati, Yaxing Wei, David Koop, and Saumen Dey. “The PBase Scientific Workflow Provenance Repository.” In Procs. 9th International Digital Curation Conference, 9:28–38. San Francisco, CA, USA, 2014. doi:10.2218/ijdc.v9i2.332. Process Virtualisation using TOSCA: Qasha, Rawaa, Jacek Cala, and Paul Watson. “Towards Automated Workflow Deployment in the Cloud Using TOSCA.” In 2015 IEEE 8th International Conference on Cloud Computing, 1037–1040. New York, 2015. doi:10.1109/CLOUD.2015.146. NoWorkflow, provenance recording for Python: Murta, Leonardo, Vanessa Braganholo, Fernando Chirigati, David Koop, and Juliana Freire. “noWorkflow: Capturing and Analyzing Provenance of Scripts.” In Procs. IPAW’14. Cologne, Germany: Springer, 2014. Pdiff, provenance differencing for understanding workflow differences: Missier, Paolo, Simon Woodman, Hugo Hiden, and Paul Watson. “Provenance and Data Differencing for Workflow Reproducibility Analysis.” Concurrency and Computation: Practice and Experience (2013). doi:10.1002/cpe.3035.

Editor's Notes

  • #12: Packaging: physical and logical containers. Open Archives Initiative Object Reuse and Exchange (OAI-ORE) is a standard for describing aggregations of web resources (http://www.openarchives.org/ore/). It uses a Resource Map to describe the aggregated resources; proxies allow for statements about the resources within the aggregation, capturing context and viewpoints. Several concrete serialisations exist: RDF/XML, Atom, RDFa. The Open Annotation specification is a community-developed data model for annotation of web resources (http://www.openannotation.org/spec/core/), developed by the W3C Open Annotation Community Group. It allows for “stand-off” annotations, treats the annotation as a first-class citizen, and was developed to fit with Web Architecture. How do you make a research object? Well, gather your resources and describe them in the manifest. Different types of containers can be used to transfer and package the Research Object. The Research Object Bundle is a structured ZIP file format, but more specific and more general formats are also used, such as Docker images (a bit low-level, capturing the whole execution environment), BagIt (a digital archiving format that is commonly used by libraries), or simply existing Web resources (which may be subject to change). You can register and archive research objects in domain-specific repositories like FAIRDOM’s SEEK (systems biology models) and FARR Commons CKAN (public health medical data), technology-specific repositories (myExperiment for workflow-centric workflows), or generic data repositories you have probably already heard of, like Zenodo and Figshare.
  • #13: Linked Resource Model very relevant. Dublin Core Application Profile. Pericles Linked Resource Model. Identification includes properties for identifying the “mime type” annotation profile of the RO.
  • #15: Need to update with new / upcoming MN locations and logos Amber notes: Retain CN, MN logo? Required if used elsewhere, if not cut? Not all MN logos will fit – select representative or cut? Cross reference with google MN Rebecca: Need updated logos for KNB, AOOS (FIXED) – I would select a different set of MNs to highlight since all won’t fit
  • #16: Rebecca: Can we do a better job than the quad chart? If not, are all the logos in 1st quadrant appropriate?
  • #17: Update before RSV Figure shows from 2020 – edit?
  • #18: Rebecca: the green axis and legend on the right is difficult to read – another color would be better. Bertram: Agreed. But this isn’t our chart. Maybe we can “patch” it? Also: should credit source!
  • #19: Still missing; EYE CANDY Also removed (redundant with next slide!): DataONE Provenance Products & Tools: New ProvONE model extends W3C PROV standard for workflows New Matlab provenance recorder ITK also includes R, Python recorders DataONE Web UI integration UI is “provenance-aware”
  • #21: These statements are the low-level pieces of information that we keep track of.
  • #22: These statements are the low-level pieces of information that we keep track of.
  • #23: These statements are the low-level pieces of information that we keep track of.
  • #24: These statements are the low-level pieces of information that we keep track of.
  • #27: We want to enhance analysis software that scientists are already familiar with. So for our first round, we are working on a Matlab Toolbox and an R library. In conjunction with Bertram, Paolo, and other colleagues, we are incorporating the YesWorkflow Java library into our Matlab Toolbox to capture ‘prospective’ provenance.
  • #31: Is the logo supposed to be R or ONE R?
  • #35: Use tools, concepts scientists are already familiar with
  • #36: Query 3: Where is the raw image corresponding to corrected image DRT322_11000ev_028.img? Scientist: Look at the image files nested within the raw directory. Find the image file that contains the values DRT322, 11000, and 028 in the file access path. YW: Extract the URI template variable names and values from the path to DRT322_11000ev_028.img output by the port named corrected_image, look at the paths for all files output by the raw_image port, and return the file whose path includes template variables with names and values matching those for DRT322_11000ev_028.img.
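The YW answer to Query 3 can be sketched as follows: turn each port's URI template into a regular expression, extract the variable values from the corrected image's path, and return the raw files whose paths yield the same values. This is an illustrative reimplementation, not YesWorkflow code; only the file name DRT322_11000ev_028.img comes from the note, while both templates, the directory layout and the helper names are assumptions.

```python
import re

# Hypothetical URI templates for the two ports; only the corrected file name
# DRT322_11000ev_028.img comes from the note, the directory layout is assumed.
CORRECTED_TEMPLATE = "run/data/{sample}/{sample}_{energy}ev_{frame}.img"
RAW_TEMPLATE = "run/raw/{sample}/e{energy}/image_{frame}.raw"


def template_to_regex(template):
    """Turn {name} placeholders into named groups; repeats become backrefs."""
    seen, pattern = set(), ""
    for part in re.split(r"(\{\w+\})", template):
        if part.startswith("{") and part.endswith("}"):
            name = part[1:-1]
            if name in seen:
                pattern += f"(?P={name})"
            else:
                seen.add(name)
                pattern += f"(?P<{name}>[^/_]+)"
        else:
            pattern += re.escape(part)
    return re.compile(pattern + r"$")


def extract(template, path):
    """Return the template variable values found in `path`, or None."""
    match = template_to_regex(template).match(path)
    return match.groupdict() if match else None


def find_raw(corrected_path, raw_paths):
    """Return raw paths whose template variables match the corrected file's."""
    wanted = extract(CORRECTED_TEMPLATE, corrected_path)
    if wanted is None:
        return []
    return [p for p in raw_paths if extract(RAW_TEMPLATE, p) == wanted]


RAW_PATHS = [
    "run/raw/DRT322/e11000/image_027.raw",
    "run/raw/DRT322/e11000/image_028.raw",
    "run/raw/DRT240/e11000/image_028.raw",
]
print(find_raw("run/data/DRT322/DRT322_11000ev_028.img", RAW_PATHS))
```

Matching on named variables rather than on substrings is what distinguishes the YW approach from the manual one: the value 028 only counts when it appears in the `frame` position.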
  • #40: In the DataONE Search, we can search for ‘grass’, and two data packages show up. The Yaxing Wei (Alice) soil map processing workflow and the Christopher Schwalm (Bob) analysis workflows both show that they have provenance information associated with the Data Packages (via the icon in the search record). We will next choose Wei’s Data Package to see the details. This can be seen at https://search-sandbox-2.test.dataone.org.
  • #41: Viewing the Wei soil processing workflow, we see on the left that the Matlab script (C3_C4_map_present_with_comments.m) has 25 inputs. It also has 6 outputs on the right. The top three outputs are the YesWorkflow diagrams (dataflow, processflow, combined). The bottom three are the NetCDF data files that represent three different world map grids of percentage of grass types (C3 grass fraction, C4 grass fraction, and total grass fraction). The script can be downloaded with the Download button in the center. This can be accessed at https://search-sandbox-2.test.dataone.org/#view/metadata_e859d2dd-c5e6-4ec6-892f-1b00bb6f8f65.xml. Bertram, if you want to show the YesWorkflow diagram (combined) for this run showing how monthly air and precipitation values are used as the inputs, the combined diagram can be accessed from this page, or directly from https://cn-sandbox-2.test.dataone.org/cn/v2/resolve/d87e1a6a-1a78-4f96-bba8-cb74ac2b1efb.