Provenance as a Building Block for an Open
Science Infrastructure
Andreas Schreiber
German Aerospace Center (DLR)
Cologne/Berlin, Germany
ISGC 2018, Taipei, Taiwan
> ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 1
Topics
• Reproducibility
• Provenance and PROV
• Storing provenance
• Gathering provenance
Reproducibility
Reproducibility in (data) science is based on
• Open Source Software
• Code Reviews
• Code Repositories
• Publications with code
• Containers (Docker etc.)
• Workflows
• (Electronic) laboratory notebooks
• Open data formats
• Data management
• Metadata and Provenance
Provenance
Basics
• Provenance refers to the source of information and the process that led to its existence
• Where did I get this file?
• How did it come to exist?
• Provenance information is critical to users trying to understand where a particular data file came from
Other and related terms
• Traceability
• Lineage
• Logging
• Monitoring
Provenance Information
Capture, archive, and distribute provenance information, for example
• The source of all externally supplied data files
• The source of the algorithms used to transform the data within the system
• The algorithm design documents
• A complete description of the processing environment
• A complete description of the processing framework
• A record of each job’s execution
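A minimal sketch of how a job record with a description of the processing environment could be captured, using only the Python standard library. The function names, the record layout, and the example job are assumptions made for illustration; they do not come from the talk.

```python
import json
import platform
import sys
from datetime import datetime, timezone


def describe_environment() -> dict:
    """Collect a small, serializable description of the processing environment."""
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "operating_system": platform.system(),
        "machine": platform.machine(),
        "executable": sys.executable,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def job_record(job_name: str, inputs: list, outputs: list) -> dict:
    """Record one job execution together with the environment it ran in."""
    return {
        "job": job_name,
        "inputs": inputs,
        "outputs": outputs,
        "environment": describe_environment(),
    }


# Hypothetical job: the names below are illustrative only.
record = job_record("plot_weights", ["WeightReport.csv"], ["weights.png"])
print(json.dumps(record, indent=2))
```

A real system would also archive the record next to the data it describes, so that both can be distributed together.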
Data Science Workflows
More Formal Definition of Provenance
Provenance is
information about entities, activities, and people
involved in
producing a piece of data or thing,
which can be used to form
assessments about its quality, reliability or trustworthiness.
PROV W3C Working Group
https://www.w3.org/TR/prov-overview
W3C Specification "PROV"
• PROV-O, the PROV ontology, an OWL2 ontology allowing the mapping of the PROV data
model to RDF
• PROV-DM, the PROV data model for provenance
• PROV-N, a notation for provenance aimed at human consumption
• PROV-CONSTRAINTS, a set of constraints applying to the PROV data model
• PROV-XML, an XML schema for the PROV data model
• PROV-AQ, mechanisms for accessing and querying provenance
• PROV-DICTIONARY introduces a specific type of collection, consisting of key-entity pairs
• PROV-DC provides a mapping between PROV-O and Dublin Core Terms
• PROV-SEM, a declarative specification in terms of first-order logic of the PROV data model
• PROV-LINKS introduces a mechanism to link across bundles
PROV Elements
Entities
• Physical, digital, conceptual, or other kinds of things
• For example, documents, web sites, graphics, or data sets
Activities
• Activities generate new entities or
make use of existing entities
• Activities could be actions or processes
Agents
• Agents take a role in an activity and bear responsibility for it
• For example, persons, pieces of software,
or organizations
PROV Relations
[Figure: the element types Entity, Activity, and Agent, connected by the relations wasGeneratedBy, used, wasDerivedFrom, wasAttributedTo, and wasAssociatedWith]
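The five relations each connect a fixed pair of element types. The sketch below spells out those pairs as defined in the PROV data model and checks a statement against them. The table and function names are illustrative, and the check ignores that PROV also allows an agent to be an entity or an activity at the same time.

```python
# Expected (subject type, object type) for the five relations shown above,
# following the PROV data model (PROV-DM).
RELATION_TYPES = {
    "wasGeneratedBy": ("Entity", "Activity"),
    "used": ("Activity", "Entity"),
    "wasDerivedFrom": ("Entity", "Entity"),
    "wasAttributedTo": ("Entity", "Agent"),
    "wasAssociatedWith": ("Activity", "Agent"),
}


def check_relation(relation: str, subject_type: str, object_type: str) -> bool:
    """Return True if the relation may connect the two element types."""
    return RELATION_TYPES.get(relation) == (subject_type, object_type)


print(check_relation("used", "Activity", "Entity"))            # True
print(check_relation("wasGeneratedBy", "Activity", "Entity"))  # False
```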
Baking a Cake
[Figure: provenance graph of the activity "bake", which used the entities 100 g butter, 2 eggs, 100 g sugar, and 100 g flour; the entity "cake" wasGeneratedBy "bake", and one wasDerivedFrom edge is shown]
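The cake example can be written down in PROV-N with a few lines of plain Python. This is only a sketch: the `ex` prefix and its URI are placeholders, and deriving the cake from every ingredient is an assumption, since the figure shows a single wasDerivedFrom edge.

```python
INGREDIENTS = ["butter", "eggs", "sugar", "flour"]


def cake_provn() -> str:
    """Build a PROV-N document for the 'baking a cake' example."""
    lines = ["document", "  prefix ex <http://example.org/>"]  # placeholder namespace
    for name in INGREDIENTS:
        lines.append(f"  entity(ex:{name})")
    lines.append("  entity(ex:cake)")
    lines.append("  activity(ex:bake)")
    # The activity used every ingredient ("-" marks an unspecified time).
    for name in INGREDIENTS:
        lines.append(f"  used(ex:bake, ex:{name}, -)")
    # The cake was generated by the activity.
    lines.append("  wasGeneratedBy(ex:cake, ex:bake, -)")
    # Assumption: the cake is derived from each ingredient.
    for name in INGREDIENTS:
        lines.append(f"  wasDerivedFrom(ex:cake, ex:{name})")
    lines.append("endDocument")
    return "\n".join(lines)


print(cake_provn())
```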
PROV Notations and Representations
• Textual representations and visualizations
• Formats: PROV-N, JSON, Turtle, XML, …
document
prefix userdata <http://software.dlr.de/qs/userdata/>
. . .
wasDerivedFrom(userdata:weights, userdata:WeightReport.csv,
wasDerivedFrom(qs:graphic/weights, userdata:weights,
wasAssociatedWith(qs:graphic/weights, qs:user/onyame@gmail.com, -)
used(python_method:read_csv, library:pandas, -)
used(python_method:matplotlib_plot, userdata:weights, -)
used(python_method:matplotlib_plot, library:matplotlib, -)
used(python_method:read_csv, userdata:WeightReport.csv, -)
wasAttributedTo(userdata:WeightReport.csv, qs:user/onyame@gmail.com)
agent(qs:user/onyame@gmail.com, [prov:type="prov:Person"])
entity(library:pandas, [library:version="0.17.1"])
entity(userdata:WeightReport.csv)
entity(userdata:weights)
. . .
endDocument
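For comparison with the PROV-N text above, here is a sketch of how two of its statements might look in the PROV-JSON serialization, built as a plain Python dictionary. The namespace URIs and the activity name `ex:read_csv` are placeholders chosen for illustration; the `_:u1` and `_:d1` keys are arbitrary local identifiers.

```python
import json

prov_json = {
    "prefix": {
        "userdata": "http://example.org/userdata/",  # placeholder URI
        "ex": "http://example.org/",                 # placeholder URI
    },
    "entity": {
        "userdata:WeightReport.csv": {},
        "userdata:weights": {},
    },
    "activity": {"ex:read_csv": {}},
    # used(ex:read_csv, userdata:WeightReport.csv, -)
    "used": {
        "_:u1": {
            "prov:activity": "ex:read_csv",
            "prov:entity": "userdata:WeightReport.csv",
        }
    },
    # wasDerivedFrom(userdata:weights, userdata:WeightReport.csv)
    "wasDerivedFrom": {
        "_:d1": {
            "prov:generatedEntity": "userdata:weights",
            "prov:usedEntity": "userdata:WeightReport.csv",
        }
    },
}

print(json.dumps(prov_json, indent=2))
```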
Storing and Retrieving Provenance
Provenance Architecture
[Figure: architecture with the components Application, Data (Results), and Provenance Store; the application records data processing information in the provenance store]
Storing and Retrieving Provenance
Some Storage Technologies
• Relational databases and SQL
• XML and XPath
• RDF and SPARQL
• Graph databases and Gremlin/Cypher
Services
• REST APIs
• PROVSTORE
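As an illustration of the relational option, the sketch below stores PROV relations in a single SQLite table and retrieves the lineage of an entity with a recursive query. The table layout and the sample rows are assumptions made for this example, not a schema from the talk.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# One row per PROV relation: kind(subject, object). Hypothetical schema.
conn.execute("CREATE TABLE relation (kind TEXT, subject TEXT, object TEXT)")
conn.executemany(
    "INSERT INTO relation VALUES (?, ?, ?)",
    [
        ("wasDerivedFrom", "graphic", "weights"),
        ("wasDerivedFrom", "weights", "WeightReport.csv"),
        ("wasAttributedTo", "WeightReport.csv", "user"),
    ],
)

# Follow wasDerivedFrom edges transitively, starting at one entity.
LINEAGE_QUERY = """
WITH RECURSIVE lineage(name) AS (
    SELECT object FROM relation
    WHERE kind = 'wasDerivedFrom' AND subject = ?
    UNION
    SELECT r.object FROM relation AS r
    JOIN lineage AS l ON r.subject = l.name
    WHERE r.kind = 'wasDerivedFrom'
)
SELECT name FROM lineage
"""

ancestors = [row[0] for row in conn.execute(LINEAGE_QUERY, ("graphic",))]
print(ancestors)
```

The recursive query is what a graph database expresses as a variable-length path, which is one reason graph stores are a natural fit for provenance.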
ProvStore
University of Southampton
• RESTful web service
• Storage of and access to provenance documents
• Public and private documents
• Conversion to various text formats
• Simple visualizations
• APIs: Python, jQuery
https://provenance.ecs.soton.ac.uk/store/
Graphs
Provenance is a Directed Acyclic Graph (DAG)
[Figure: example directed acyclic graph with the nodes A to G]
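A small sketch of what the DAG property gives in practice: the nodes can be ordered so that every node comes after everything it depends on, and a cycle can be detected as an error. The node names follow the figure, but the edges are hypothetical because the transcript does not preserve them. The example uses `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import CycleError, TopologicalSorter

# Node -> set of nodes it depends on. The edges are hypothetical.
depends_on = {
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
    "E": {"D"},
    "F": {"D"},
    "G": {"E", "F"},
}


def is_acyclic(graph: dict) -> bool:
    """Return True if the dependency graph contains no cycle."""
    try:
        TopologicalSorter(graph).prepare()
    except CycleError:
        return False
    return True


# Dependencies come first in the resulting order.
order = list(TopologicalSorter(depends_on).static_order())
print(order)
print(is_acyclic(depends_on))
```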
Graph Databases
Naturally, graph databases are a good technology for storing (provenance) graphs
Many graph databases are available
• Neo4j
• Titan
• ArangoDB
• ...
Query languages
• Cypher
• Gremlin (TinkerPop)
• GraphQL
Neo4j
• Open-Source
• Implemented in Java
• Stores property graphs (key-value-based, directed)
http://neo4j.com
Storing Provenance in Graph Database
Graph database Neo4j
MATCH (e:Entity)-[*]-(u:Agent) RETURN u
Trusted Provenance: Storing Provenance in a Blockchain
PROV2BIGCHAINDB
https://github.com/DLR-SC/prov2bigchaindb
Blockchain
Combination of multiple techniques
• Peer-to-peer network
• Public/Private key signing
• Time-stamping
• Proof-of-Work
• Merkle trees
Proposed as a solution to
• The double-spending problem
• The Byzantine generals problem
• The need for a tamper-resistant distributed database
Blockchain Transactions
• Linked by hashes of the current and preceding transactions
• Bitcoin: transfers an amount of BTC
• Public/private key signing
• All transactions are broadcast across the network
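The hash linking can be sketched in a few lines: each transaction stores the hash of its predecessor, so changing an earlier transaction invalidates the chain. This is a deliberately simplified illustration with hypothetical field names; it leaves out signing, broadcasting, and consensus, and it is not how Bitcoin or BigchainDB encode transactions.

```python
import hashlib
import json


def tx_hash(payload: dict, previous_hash: str) -> str:
    """Hash a transaction payload together with the hash of its predecessor."""
    body = json.dumps({"payload": payload, "previous": previous_hash}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(chain: list, payload: dict) -> None:
    """Append a transaction that is linked to the last one in the chain."""
    previous_hash = chain[-1]["hash"] if chain else "0" * 64
    chain.append({
        "payload": payload,
        "previous": previous_hash,
        "hash": tx_hash(payload, previous_hash),
    })


def verify(chain: list) -> bool:
    """Recompute every hash and check the links between transactions."""
    previous_hash = "0" * 64
    for tx in chain:
        if tx["previous"] != previous_hash:
            return False
        if tx["hash"] != tx_hash(tx["payload"], previous_hash):
            return False
        previous_hash = tx["hash"]
    return True


chain = []
append(chain, {"relation": "used", "activity": "bake", "entity": "flour"})
append(chain, {"relation": "wasGeneratedBy", "entity": "cake", "activity": "bake"})
print(verify(chain))  # True

chain[0]["payload"]["entity"] = "salt"  # tamper with the first transaction
print(verify(chain))  # False
```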
Document-based Storage of PROV Documents
• Only one user/address on the blockchain
• Provenance is stored as one valid document
Pros
• Less complex
• Ownership restricted to one participant
• Easy to query
• Less costly, if less data is added
Cons
• Single point of failure
• Less tamper-resistant
• No chaining of transactions
• Huge amount of data in transactions
Role-based Storage of PROV Documents
• Every agent is a blockchain user/address
• Each agent generates transactions for its entities and activities
• Relations are modeled as references to other transactions
Pros
• Close to typical process structures
• Implicit ownership and responsibility
Cons
• Agent needs to know relevant transactions for references
• Difficult to query, if no ownership transfer is used
Graph-based Storage of PROV Documents
• All PROV relations are modeled as ownership transfers
• All agents, activities, and entities are actual blockchain users/addresses
Pros
• Mapping close to the PROV model
• Small amount of data per transaction
• Strongly tamper-resistant due to multiple owners and a large number of transactions
Cons
• Complex implementation
• High costs due to many transactions
• Very slow to query if traversal of transactions is needed
Test Setup
Performance Comparison
Gathering Provenance
[Figure: PROV provenance is gathered from data and metadata, workflows, algorithms/scripts, machine learning, data management, and software development, and is kept in a provenance store]
Gather or Generate Provenance
Depends on your application (tools, languages, etc.)
• Generation at run-time, compile-time, or retrospectively
Runtime
• Instrumentation of the application
• Cumbersome from a software engineering perspective
• Combined with logging or with aspect-oriented approaches
Compile time
• Based on static code analysis (dependency analysis, program slicing, etc.)
Retrospectively
• Reconstructed from files or filesystem metadata
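Run-time instrumentation in an aspect-oriented style can be sketched with a Python decorator that records what a function used and what it generated, without touching the function body. The log format and all names below are hypothetical; a real setup would write proper PROV records to a provenance store instead of a list.

```python
import functools

PROVENANCE_LOG = []  # collected (relation, subject, object) statements


def record_provenance(func):
    """Record which inputs a function used and which result it generated."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        activity = f"activity:{func.__name__}"
        # Every argument counts as an entity used by the activity.
        for value in list(args) + list(kwargs.values()):
            PROVENANCE_LOG.append(("used", activity, f"entity:{value!r}"))
        result = func(*args, **kwargs)
        # The return value counts as an entity generated by the activity.
        PROVENANCE_LOG.append(("wasGeneratedBy", f"entity:{result!r}", activity))
        return result

    return wrapper


@record_provenance
def mean(values):
    return sum(values) / len(values)


mean([70.5, 71.0, 69.5])
for statement in PROVENANCE_LOG:
    print(statement)
```

The decorator keeps the instrumentation out of the scientific code, which addresses part of the software engineering burden mentioned above.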
Tools and Libraries for Generating Provenance
Libraries for Python
• PROVPY
• PROVNEO4J
Other Tools
• NOWORKFLOW
• GIT2PROV
Python Library ProvPy (PROV)
https://github.com/trungdong/prov
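from prov.model import ProvDocument

# Create a new provenance document
d1 = ProvDocument()
# Declare the prefixes used below (omitted on the slide; these URIs are placeholders)
d1.add_namespace('now', 'http://example.org/now/')
d1.add_namespace('nowpeople', 'http://example.org/nowpeople/')
d1.add_namespace('govftp', 'http://example.org/govftp/')
d1.add_namespace('is', 'http://example.org/is/')

# Entity: now:employment-article-v1.html
e1 = d1.entity('now:employment-article-v1.html')
# Agent: nowpeople:Bob
d1.agent('nowpeople:Bob')
# Attributing the article to the agent
d1.wasAttributedTo(e1, 'nowpeople:Bob')

# Entity for the data set the article was derived from
d1.entity('govftp:oesm11st.zip',
          {'prov:label': 'employment-stats-2011',
           'prov:type': 'void:Dataset'})
d1.wasDerivedFrom('now:employment-article-v1.html',
                  'govftp:oesm11st.zip')

# Adding an activity
d1.activity('is:writeArticle')
d1.used('is:writeArticle', 'govftp:oesm11st.zip')
d1.wasGeneratedBy('now:employment-article-v1.html', 'is:writeArticle')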
PROVNEO4J – Storing PROV Documents in Neo4j
https://github.com/DLR-SC/provneo4j
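import provneo4j.api

# Connect to a local Neo4j instance (URL and credentials are examples)
provneo4j_api = provneo4j.api.Api(
    base_url="http://localhost:7474/db/data",
    username="neo4j", password="python")

# prov_doc is a prov.model.ProvDocument, for example one built with ProvPy
provneo4j_api.document.create(prov_doc, name="MyProv")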
Provenance Instrumentation of TENSORFLOW
Provenance of TENSORFLOW workflows
• Tensor → PROV Entity
• Operations → PROV Activity
Example: MNIST with 400 training iterations
• 64581 database nodes
• 33549 Entities
• 31032 Activities
Provenance Instrumentation of TENSORFLOW
Example Query
• Shortest paths from all tensors in the 400th iteration to the init operation
MATCH path=allShortestPaths((root)<-[*]-(n))
WHERE root.`tf:type`="tf:Session_init" and n.`tf:name` =~ ".*_400"
RETURN path
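To make the intent of the query concrete, here is an in-memory sketch: select all nodes whose name matches `.*_400` and run a breadth-first search from each of them to the init node. The graph and its node names are made up for illustration, and unlike `allShortestPaths` the sketch returns only one shortest path per start node.

```python
import re
from collections import deque

# Directed edges "node -> nodes it depends on"; all names are hypothetical.
EDGES = {
    "loss_400": ["weights_400"],
    "weights_400": ["weights_399"],
    "weights_399": ["init"],
    "bias_400": ["init"],
}


def shortest_path(start, goal, edges):
    """Breadth-first search; returns one shortest path or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in edges.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Like the =~ operator in Cypher, the pattern must match the whole name.
pattern = re.compile(r".*_400")
paths = {
    node: shortest_path(node, "init", EDGES)
    for node in EDGES
    if pattern.fullmatch(node)
}
for node, path in paths.items():
    print(node, "->", path)
```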
NOWORKFLOW – Provenance of Scripts
https://github.com/gems-uff/noworkflow
[Figure: project containing the files experiment.py, precipitation.py, p12.dat, p13.dat, p14.dat, and out.png]
$ now run -e Tracker experiment.py
GIT2PROV
http://git2prov.org
• Generate PROV documents from git repositories
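The general idea of such a mapping can be sketched as follows: a commit becomes an activity, each file version an entity, and the author an agent. This is a simplified illustration with a hypothetical commit record and identifier scheme, not the actual GIT2PROV implementation; for example, it treats every file as having a previous version whenever the commit has a parent.

```python
def commit_to_prov(commit: dict) -> list:
    """Map one commit record to (relation, subject, object) statements."""
    activity = f"commit:{commit['sha']}"
    agent = f"user:{commit['author']}"
    statements = [("wasAssociatedWith", activity, agent)]
    for path in commit["files"]:
        new_version = f"file:{path}@{commit['sha']}"
        statements.append(("wasGeneratedBy", new_version, activity))
        statements.append(("wasAttributedTo", new_version, agent))
        if commit.get("parent"):
            # Simplification: assume the file already existed in the parent commit.
            old_version = f"file:{path}@{commit['parent']}"
            statements.append(("used", activity, old_version))
            statements.append(("wasDerivedFrom", new_version, old_version))
    return statements


# Hypothetical commit record; in practice it would be read from the repository.
example_commit = {
    "sha": "b2c3",
    "parent": "a1b2",
    "author": "alice",
    "files": ["experiment.py"],
}
for statement in commit_to_prov(example_commit):
    print(statement)
```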
GIT2PROV Example Output
https://provenance.ecs.soton.ac.uk/store/documents/116377/
Provenance Visualization
Visualization of Provenance is an ongoing research topic
• Especially, for non-experts (“Provenance for people”)
• Example: PROV COMICS
Key Messages and Summary
Recording the Provenance of science workflows is important
• to understand where data came from
• to reproduce data processing steps or whole workflows
Use a standard for Provenance
• W3C standard PROV
• Mapping to (graph) databases allows easy querying
• A standard allows interoperability and comparison
• Storing in blockchains for increasing trust
Recording Provenance is not hard
• APIs and tools available
Thank You!
Questions?
Andreas.Schreiber@dlr.de
www.DLR.de/sc | @onyame

More Related Content

PPTX
Privacy by design
PDF
10 ways to stumble with big data
PPTX
2016 urisa track: nhd hydro linked data registery by michael tinker
PDF
Protecting privacy in practice
PDF
2017-01-08-scaling tribalknowledge
PDF
ER 2016 Tutorial
PDF
Kubernetes as data platform
PDF
ISNCC 2017
Privacy by design
10 ways to stumble with big data
2016 urisa track: nhd hydro linked data registery by michael tinker
Protecting privacy in practice
2017-01-08-scaling tribalknowledge
ER 2016 Tutorial
Kubernetes as data platform
ISNCC 2017

What's hot (20)

PDF
Massively Scalable Computational Finance with SciDB
PDF
CCCB Germline Variant Analysis on Cloud Platform
PDF
Real-time Data Analytics mit Elasticsearch
PPTX
Dataset Descriptions in Open PHACTS and HCLS
PDF
Don't build a data science team
PDF
Data pipelines from zero to solid
PDF
Data democratised
PDF
TPC-H analytics' scenarios and performances on Hadoop data clouds
PPTX
معرفی کاربردهای یادگیری عمیق و چالش های آن در کلان داده
PDF
Open core summit: Observability for data pipelines with OpenLineage
PDF
iRODS UGM 2018 Fair data management and DISQOVERability
PDF
Data lineage and observability with Marquez - subsurface 2020
PDF
Bicod2017
PDF
Time Series Analytics for Big Fast Data
PDF
Eventually, time will kill your data processing
PPTX
The Elastic Stack as a SIEM
PDF
Testing data streaming applications
PPTX
Processing genetic data at scale
PDF
Engineering data quality
PDF
Mortal analytics - Covid-19 and the problem of data quality
Massively Scalable Computational Finance with SciDB
CCCB Germline Variant Analysis on Cloud Platform
Real-time Data Analytics mit Elasticsearch
Dataset Descriptions in Open PHACTS and HCLS
Don't build a data science team
Data pipelines from zero to solid
Data democratised
TPC-H analytics' scenarios and performances on Hadoop data clouds
معرفی کاربردهای یادگیری عمیق و چالش های آن در کلان داده
Open core summit: Observability for data pipelines with OpenLineage
iRODS UGM 2018 Fair data management and DISQOVERability
Data lineage and observability with Marquez - subsurface 2020
Bicod2017
Time Series Analytics for Big Fast Data
Eventually, time will kill your data processing
The Elastic Stack as a SIEM
Testing data streaming applications
Processing genetic data at scale
Engineering data quality
Mortal analytics - Covid-19 and the problem of data quality
Ad

Similar to Provenance as a building block for an open science infrastructure (20)

PPTX
Provenance for Reproducible Data Science
PPTX
Reproducible Science with Python
PDF
Provenance Analysis and RDF Query Processing: W3C PROV for Data Quality and T...
PPTX
PROV Tutorials (Data Provenance Standard)
PDF
2010 06 rdf_next
PPTX
Thoughts on Knowledge Graphs & Deeper Provenance
PPT
Keepit Course 3: Provenance (and OPM), based on slides by Luc Moreau
PDF
Prov-O-Viz: Interactive Provenance Visualization
PDF
Transcript - Provenance and Social Science data
PDF
Camp 4-data workshop presentation
PPT
Provinance in scientific workflows in e science
PPT
Reflections on Provenance Ontology Encodings
PDF
Provenance and DataONE: Facilitating Reproducible Science
PDF
Publishing metadata provenance
PDF
Works 2015-provenance-mileage
PPT
Recording and Reasoning Over Data Provenance in Web and Grid Services
PPTX
"Data Provenance: Principles and Why it matters for BioMedical Applications"
PDF
Data Provenance and PROV Ontology
PDF
Provenance and Trust
PDF
Provenance And Annotation Of Data And Processes Revised Selected Papers Debor...
Provenance for Reproducible Data Science
Reproducible Science with Python
Provenance Analysis and RDF Query Processing: W3C PROV for Data Quality and T...
PROV Tutorials (Data Provenance Standard)
2010 06 rdf_next
Thoughts on Knowledge Graphs & Deeper Provenance
Keepit Course 3: Provenance (and OPM), based on slides by Luc Moreau
Prov-O-Viz: Interactive Provenance Visualization
Transcript - Provenance and Social Science data
Camp 4-data workshop presentation
Provinance in scientific workflows in e science
Reflections on Provenance Ontology Encodings
Provenance and DataONE: Facilitating Reproducible Science
Publishing metadata provenance
Works 2015-provenance-mileage
Recording and Reasoning Over Data Provenance in Web and Grid Services
"Data Provenance: Principles and Why it matters for BioMedical Applications"
Data Provenance and PROV Ontology
Provenance and Trust
Provenance And Annotation Of Data And Processes Revised Selected Papers Debor...
Ad

More from Andreas Schreiber (20)

PPTX
Provenance-based Security Audits and its Application to COVID-19 Contact Trac...
PPTX
Visualization of Software Architectures in Virtual Reality and Augmented Reality
PPTX
Raising Awareness about Open Source Licensing at the German Aerospace Center
PDF
Open Source Licensing for Rocket Scientists
PDF
Interactive Visualization of Software Components with Virtual Reality Headsets
PPTX
Visualizing Provenance using Comics
PPTX
Quantified Self Comics
PPTX
Nachvollziehbarkeit mit Hinblick auf Privacy-Verletzungen
PPTX
Python at Warp Speed
PPTX
A Provenance Model for Quantified Self Data
PPTX
Open Source im DLR
PDF
Tracking after Stroke: Doctors, Dogs and All The Rest
PPTX
High Throughput Processing of Space Debris Data
PDF
Bericht von der QS15 Conference & Exposition
PPTX
Telemedizin: Gesundheit, messbar für jedermann
PDF
Big Python
PDF
Quantified Self mit Wearable Devices und Smartphone-Sensoren
PDF
Example Blood Pressure Report of BloodPressureCompanion
PDF
Beispiel-Blutdruckbericht des BlutdruckBegleiter
PDF
Informatik für die Welt von Morgen
Provenance-based Security Audits and its Application to COVID-19 Contact Trac...
Visualization of Software Architectures in Virtual Reality and Augmented Reality
Raising Awareness about Open Source Licensing at the German Aerospace Center
Open Source Licensing for Rocket Scientists
Interactive Visualization of Software Components with Virtual Reality Headsets
Visualizing Provenance using Comics
Quantified Self Comics
Nachvollziehbarkeit mit Hinblick auf Privacy-Verletzungen
Python at Warp Speed
A Provenance Model for Quantified Self Data
Open Source im DLR
Tracking after Stroke: Doctors, Dogs and All The Rest
High Throughput Processing of Space Debris Data
Bericht von der QS15 Conference & Exposition
Telemedizin: Gesundheit, messbar für jedermann
Big Python
Quantified Self mit Wearable Devices und Smartphone-Sensoren
Example Blood Pressure Report of BloodPressureCompanion
Beispiel-Blutdruckbericht des BlutdruckBegleiter
Informatik für die Welt von Morgen

Recently uploaded (20)

PPTX
history of c programming in notes for students .pptx
PDF
medical staffing services at VALiNTRY
PDF
Why TechBuilder is the Future of Pickup and Delivery App Development (1).pdf
PPTX
Odoo POS Development Services by CandidRoot Solutions
PDF
Which alternative to Crystal Reports is best for small or large businesses.pdf
PPTX
Introduction to Artificial Intelligence
PPTX
Online Work Permit System for Fast Permit Processing
PDF
Addressing The Cult of Project Management Tools-Why Disconnected Work is Hold...
PDF
Audit Checklist Design Aligning with ISO, IATF, and Industry Standards — Omne...
PDF
Navsoft: AI-Powered Business Solutions & Custom Software Development
PDF
How Creative Agencies Leverage Project Management Software.pdf
PDF
2025 Textile ERP Trends: SAP, Odoo & Oracle
PDF
Design an Analysis of Algorithms II-SECS-1021-03
PDF
Flood Susceptibility Mapping Using Image-Based 2D-CNN Deep Learnin. Overview ...
PDF
PTS Company Brochure 2025 (1).pdf.......
PPTX
CHAPTER 2 - PM Management and IT Context
PPTX
Lecture 3: Operating Systems Introduction to Computer Hardware Systems
PPTX
ai tools demonstartion for schools and inter college
PDF
How to Choose the Right IT Partner for Your Business in Malaysia
PPTX
Operating system designcfffgfgggggggvggggggggg
history of c programming in notes for students .pptx
medical staffing services at VALiNTRY
Why TechBuilder is the Future of Pickup and Delivery App Development (1).pdf
Odoo POS Development Services by CandidRoot Solutions
Which alternative to Crystal Reports is best for small or large businesses.pdf
Introduction to Artificial Intelligence
Online Work Permit System for Fast Permit Processing
Addressing The Cult of Project Management Tools-Why Disconnected Work is Hold...
Audit Checklist Design Aligning with ISO, IATF, and Industry Standards — Omne...
Navsoft: AI-Powered Business Solutions & Custom Software Development
How Creative Agencies Leverage Project Management Software.pdf
2025 Textile ERP Trends: SAP, Odoo & Oracle
Design an Analysis of Algorithms II-SECS-1021-03
Flood Susceptibility Mapping Using Image-Based 2D-CNN Deep Learnin. Overview ...
PTS Company Brochure 2025 (1).pdf.......
CHAPTER 2 - PM Management and IT Context
Lecture 3: Operating Systems Introduction to Computer Hardware Systems
ai tools demonstartion for schools and inter college
How to Choose the Right IT Partner for Your Business in Malaysia
Operating system designcfffgfgggggggvggggggggg

Provenance as a building block for an open science infrastructure

  • 1. Provenance as a Building Block for an Open Science Infrastructure Andreas Schreiber German Aerospace Center (DLR) Cologne/Berlin, Germany ISGC 2018, Taipei, Taiwan > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 1
  • 2. Topics • Reproducibility • Provenance and PROV • Storing provenance • Gathering provenance > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 2
  • 3. Reproducibility Reproducibility in (data) science is based on • Open Source Software • Code Reviews • Code Repositories • Publications with code • Container (Docker etc.) • Workflows • (Electronic) laboratory notebooks • Open data formats • Data management • Metadata and Provenance > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 3
  • 4. Provenance Basics • Provenance refers to the source of information and the process that led to its existence • Where did I get this file? • How did it come to exist? • Provenance information is critical to users trying to understand where a particular data file came from > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 4 Other and related terms • Traceability • Lineage • Logging • Monitoring
  • 5. Provenance Information Capture, archive, and distribute provenance information, for example • The source of all externally supplied data files • The source of the algorithms used to transform the data within the system • The Algorithm design documents • A complete description of the processing environment • A complete description of the processing framework • A record of each job’s execution > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 5
  • 6. > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 6 Data Science Workflows
  • 7. More Formal Definition of Provenance Provenance is information about entities, activities, and people involved in producing a piece of data or thing, which can be used to form assessments about its quality, reliability or trustworthiness. PROV W3C Working Group https://www.w3.org/TR/prov-overview > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 7
  • 8. W3C Specification „PROV“ • PROV-O, the PROV ontology, an OWL2 ontology allowing the mapping of the PROV data model to RDF • PROV-DM, the PROV data model for provenance • PROV-N, a notation for provenance aimed at human consumption • PROV-CONSTRAINTS, a set of constraints applying to the PROV data model • PROV-XML, an XML schema for the PROV data model • PROV-AQ, mechanisms for accessing and querying provenance • PROV-DICTIONARY introduces a specific type of collection, consisting of key-entity pairs • PROV-DC provides a mapping between PROV-O and Dublin Core Terms • PROV-SEM, a declarative specification in terms of first-order logic of the PROV data model • PROV-LINKS introduces a mechanism to link across bundles > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 8
  • 9. PROV Elements Entities • Physical, digital, conceptual, or other kinds of things • For example, documents, web sites, graphics, or data sets Activities • Activities generate new entities or make use of existing entities • Activities could be actions or processes Agents • Agents takes a role in an activity and have the responsibility for the activity • For example, persons, pieces of software, or organizations > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 9 Activity Entity Agent
  • 10. PROV Relations > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 10 Activity Entity Agent wasGeneratedBy used wasDerivedFrom wasAttributedTo wasAssociatedWith
  • 11. Baking a Cake > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 11 100 g butter bake 2 eggs 100 g sugar 100 g flour cake used used used used wasGeneratedBy wasDerivedFrom
  • 12. Textual Representations Visualizations PROV Notations and Representations • Formats: PROV-N, JSON, Turtle, XML, … > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 12 document prefix userdata http://software.dlr.de/qs/userdata/ . . . wasDerivedFrom(userdata:weights, userdata:WeightReport.csv, wasDerivedFrom(qs:graphic/weights, userdata:weights, wasAssociatedWith(qs:graphic/weights, qs:user/onyame@gmail.com, -) used(python_method:read_csv, library:pandas, -) used(python_method:matplotlib_plot, userdata:weights, -) used(python_method:matplotlib_plot, library:matplotlib, -) used(python_method:read_csv, userdata:WeightReport.csv, -) wasAttributedTo(userdata:WeightReport.csv, qs:user/onyame@gmail.com) agent(qs:user/onyame@gmail.com, [prov:type="prov:Person"]) entity(library:pandas, [library:version="0.17.1"]) entity(userdata:WeightReport.csv) entity(userdata:weights) . . . endDocument
  • 13. Storing and Retrieving Provenance > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 13
  • 14. Provenance Architecture > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 14 Recording of Data Processing Information Application Data (Results) Provenance Store
  • 15. Storing and Retrieving Provenance Some Storage Technologies • Relational databases and SQL • XML and Xpath • RDF and SPARQL • Graph databases and Gremlin/Cypher Services • REST APIs • PROVSTORE > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 15
  • 16. ProvStore University of Southampton • RESTful web service • storage and access of provenance documents • Public and private documents • Conversion to various text formats • Simple visualizations • APIs • Python • jQuery > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 16 https://provenance.ecs.soton.ac.uk/store/
  • 17. Graphs Provenance is a Directed Acyclic Graph (DAG) > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 17 A B E F G D C
  • 18. Graph Databases Naturally, graph databases are a good technology for storing (Provenance) graphs Many graph databases are available • Neo4j • Titan • ArangoDB • ... Query languages • Cypher • Gremlin (TinkerPop) • GraphQL > ISGC 2018 > A. Schreiber • Provenance as a Building Block for an Open Science Infrastructure > 23.03.2018DLR.de • Chart 18
  • 19. Neo4j • Open source • Implemented in Java • Stores property graphs (key-value-based, directed) • http://neo4j.com
  • 20. Storing Provenance in a Graph Database (Neo4j) • Example query: MATCH (e:Entity)-[*]-(u:Agent) RETURN u
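
To make the intent of the Cypher query on slide 20 concrete, here is a rough in-memory analogue in plain Python. It is a sketch under assumptions: the node names and relations are hypothetical, and unlike Cypher (which returns one row per matched path) it returns the distinct set of agents connected to any entity by a path of any length, ignoring edge direction.

```python
from collections import deque

# Hypothetical labelled nodes and relations (not taken from the slides).
LABELS = {
    "weights.csv": "Entity",
    "plot.png": "Entity",
    "plotting": "Activity",
    "alice": "Agent",
    "bob": "Agent",  # not connected to anything
}
EDGES = [
    ("plot.png", "wasGeneratedBy", "plotting"),
    ("plotting", "used", "weights.csv"),
    ("plotting", "wasAssociatedWith", "alice"),
]


def connected_agents(labels, edges):
    """Return all Agent nodes reachable from any Entity node."""
    # Build an undirected adjacency map, mirroring the undirected -[*]- pattern.
    neighbours = {node: set() for node in labels}
    for source, _relation, target in edges:
        neighbours[source].add(target)
        neighbours[target].add(source)

    agents = set()
    for start, label in labels.items():
        if label != "Entity":
            continue
        # Breadth-first search from this entity.
        seen = {start}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
                    if labels[nxt] == "Agent":
                        agents.add(nxt)
    return agents
```

In a graph database this traversal is expressed declaratively and executed by the engine, which is the main convenience the slide points to.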
  • 21. Trusted Provenance: Storing Provenance in a Blockchain • PROV2BIGCHAINDB https://github.com/DLR-SC/prov2bigchaindb
  • 22. Blockchain. Combination of multiple techniques: • Peer-to-peer network • Public/private key signing • Time-stamping • Proof-of-work • Merkle trees. Proposed solutions to: • The double-spending problem • The Byzantine generals problem • Tamper-resistant distributed database
  • 23. Blockchain Transactions • Linked by hash of current and preceding transactions • Bitcoin: transfers an amount of BTC • Public/private key signing • All transactions are broadcast across the network
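
The hash linking of transactions on slide 23 can be illustrated with a small sketch. It uses only `hashlib` and `json` from the standard library; the transaction layout (`payload`, `prev`, `hash`) is an assumption for illustration, and signing, broadcasting, and consensus are deliberately left out.

```python
import hashlib
import json


def tx_hash(payload, prev_hash):
    """Hash a transaction's payload together with the hash of its predecessor."""
    body = json.dumps({"payload": payload, "prev": prev_hash}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append_tx(chain, payload):
    """Append a transaction that references the hash of the previous one."""
    prev_hash = chain[-1]["hash"] if chain else None
    chain.append(
        {"payload": payload, "prev": prev_hash, "hash": tx_hash(payload, prev_hash)}
    )


def verify(chain):
    """Recompute every hash and check that each link points at its predecessor."""
    prev_hash = None
    for tx in chain:
        if tx["prev"] != prev_hash:
            return False
        if tx["hash"] != tx_hash(tx["payload"], tx["prev"]):
            return False
        prev_hash = tx["hash"]
    return True
```

Because each hash covers the preceding hash, changing any earlier payload invalidates that transaction's stored hash during verification, which is the tamper-evidence property the talk relies on.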
  • 24. Document-based Storage of PROV Documents • Only one user/address on the blockchain • Provenance is stored as one valid document. Pros: • Less complex • Ownership restricted to one participant • Easy to query • Less costly, if less data is added. Cons: • Single point of failure • Less tamper-resistant • No chaining of transactions • Huge amount of data in transactions
  • 25. Role-based Storage of PROV Documents • Every agent is a blockchain user/address • Generates transactions for its entities and activities • Relations modeled with references to other transactions. Pros: • Close to typical process structures • Implicit ownership and responsibility. Cons: • Agent needs to know relevant transactions for references • Difficult to query, if no ownership transfer is used
  • 26. Graph-based Storage of PROV Documents • All PROV relations are modeled as ownership transfers • All agents, activities, and entities are actual blockchain users/addresses. Pros: • Mapping close to PROV model • Small amount of data per transaction • Strongly tamper-resistant due to multiple owners and a large number of transactions. Cons: • Complex implementation • High costs due to many transactions • Very slow in querying, if traversal of transactions is needed
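
The trade-off between the document-based and graph-based concepts (slides 24 and 26) can be sketched in plain Python. This is an illustration under assumptions, not the PROV2BIGCHAINDB implementation: the document layout, the `single-account` owner, and the direction of each transfer are hypothetical. The role-based concept is omitted because it additionally needs a responsible agent assigned to every entity and activity.

```python
# Hypothetical PROV document, reduced to element names and relations.
DOC = {
    "agents": ["bob"],
    "entities": ["dataset", "article"],
    "activities": ["writeArticle"],
    "relations": [
        ("article", "wasAttributedTo", "bob"),
        ("writeArticle", "used", "dataset"),
        ("article", "wasGeneratedBy", "writeArticle"),
        ("writeArticle", "wasAssociatedWith", "bob"),
    ],
}


def document_based(doc):
    """One transaction by a single account that carries the whole document."""
    return [{"owner": "single-account", "data": doc}]


def graph_based(doc):
    """One small transfer transaction per PROV relation.

    Every element is treated as its own address; the transfer direction
    (subject -> object) is an assumption made for this sketch.
    """
    return [
        {"from": subject, "to": obj, "data": {"relation": relation}}
        for subject, relation, obj in doc["relations"]
    ]
```

With this document, the document-based mapping yields a single large transaction, while the graph-based mapping yields four small ones, which matches the "huge amount of data in transactions" versus "high costs due to many transactions" entries in the lists above.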
  • 27. Test Setup
  • 28. Performance Comparison
  • 29. Gathering Provenance
  • 30. (diagram labels: Data & Metadata; Workflows; Algorithms / Scripts; Machine Learning; Data Management; Software Development; PROV; Provenance Store)
  • 31. Gather or Generate Provenance. Depends on your application (tools, languages, etc.) • Generation at run time, compile time, or retrospectively. Run time: • Instrumentation of the application • Cumbersome from a software engineering perspective • Combined with logging or with aspect-oriented approaches. Compile time: • Based on static code analysis (dependency analysis, program slicing, etc.) Retrospectively: • Reconstructed from files or filesystem metadata
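
Run-time instrumentation, as listed on slide 31, is often done with a wrapper around the functions of interest, similar in spirit to the aspect-oriented approaches mentioned there. The sketch below is one possible shape, assuming a hypothetical in-memory `PROVENANCE_LOG` and a hypothetical `record_provenance` decorator; a real setup would write PROV statements to a provenance store instead.

```python
import functools
from datetime import datetime, timezone

# Hypothetical in-memory log of PROV-like statements (tuples).
PROVENANCE_LOG = []


def record_provenance(func):
    """Record inputs, output, and timing of each call as PROV-like statements."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        activity = func.__name__
        started = datetime.now(timezone.utc)
        result = func(*args, **kwargs)
        ended = datetime.now(timezone.utc)

        # Every argument is treated as an entity used by the activity.
        for value in list(args) + list(kwargs.values()):
            PROVENANCE_LOG.append(("used", activity, repr(value)))
        # The return value is treated as an entity generated by the activity.
        PROVENANCE_LOG.append(("wasGeneratedBy", repr(result), activity))
        PROVENANCE_LOG.append(
            ("activity", activity, started.isoformat(), ended.isoformat())
        )
        return result

    return wrapper


@record_provenance
def mean(values):
    """Example function whose calls should be traced."""
    return sum(values) / len(values)
```

Calling `mean([1, 2, 3])` returns the result as usual and appends three statements to the log. Keeping the recording in a decorator leaves the scientific code itself untouched, which softens the software engineering burden noted on the slide.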
  • 32. Tools and Libraries for Generating Provenance. Libraries for Python: • PROVPY • PROVNEO4J. Other tools: • NOWORKFLOW • GIT2PROV
  • 33. Python Library ProvPy (PROV) https://github.com/trungdong/prov
      from prov.model import ProvDocument

      # Create a new provenance document
      d1 = ProvDocument()
      # Note: the prefixes used below (now, nowpeople, govftp, void, is) must be
      # registered first via d1.add_namespace(prefix, uri); omitted on the slide.

      # Entity: now:employment-article-v1.html
      e1 = d1.entity('now:employment-article-v1.html')
      # Agent: nowpeople:Bob
      d1.agent('nowpeople:Bob')
      # Attributing the article to the agent
      d1.wasAttributedTo(e1, 'nowpeople:Bob')

      # Source dataset and the derivation of the article from it
      d1.entity('govftp:oesm11st.zip',
                {'prov:label': 'employment-stats-2011', 'prov:type': 'void:Dataset'})
      d1.wasDerivedFrom('now:employment-article-v1.html', 'govftp:oesm11st.zip')

      # Adding an activity
      d1.activity('is:writeArticle')
      d1.used('is:writeArticle', 'govftp:oesm11st.zip')
      d1.wasGeneratedBy('now:employment-article-v1.html', 'is:writeArticle')
  • 34. Python Library ProvPy (PROV) https://github.com/trungdong/prov
  • 35. PROVNEO4J – Storing PROV Documents in Neo4j https://github.com/DLR-SC/provneo4j
      import provneo4j.api

      # Connect to a local Neo4j instance
      provneo4j_api = provneo4j.api.Api(
          base_url="http://localhost:7474/db/data",
          username="neo4j",
          password="python")
      # Store an existing PROV document (prov_doc) under a name
      provneo4j_api.document.create(prov_doc, name="MyProv")
  • 36. PROVNEO4J – Storing PROV Documents in Neo4j https://github.com/DLR-SC/provneo4j
  • 37. Provenance Instrumentation of TENSORFLOW. Provenance of TENSORFLOW workflows: • Tensor → PROV Entity • Operation → PROV Activity. Example: MNIST with 400 training iterations • 64581 database nodes • 33549 entities • 31032 activities
  • 38. Provenance Instrumentation of TENSORFLOW. Example query: • Shortest paths from all tensors in the 400th iteration to the init operation
      MATCH path=allShortestPaths((root)<-[*]-(n))
      WHERE root.`tf:type`="tf:Session_init" AND n.`tf:name` =~ ".*_400"
      RETURN path
  • 39. NOWORKFLOW – Provenance of Scripts https://github.com/gems-uff/noworkflow (diagram: project with experiment.py, precipitation.py, p12.dat, p13.dat, p14.dat, out.png) • Command: $ now run -e Tracker experiment.py
  • 40. GIT2PROV http://git2prov.org • Generate PROV documents from git repositories
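
How a commit history could be turned into PROV statements, as slide 40 describes, can be sketched as follows. This is not GIT2PROV's actual mapping or output format: the commit records, the identifier scheme (`commit:`, `user:`, `file:`), and the PROV-N-like strings are assumptions for illustration. Commits are assumed to be listed in chronological order.

```python
# Hypothetical, already-parsed commit history in chronological order.
COMMITS = [
    {"id": "c1", "author": "alice", "files": ["experiment.py"]},
    {"id": "c2", "author": "bob", "files": ["experiment.py", "plot.py"]},
]


def commits_to_prov(commits):
    """Map commits to activities, authors to agents, file versions to entities."""
    statements = []
    last_version = {}  # file path -> entity id of its most recent version
    for commit in commits:
        activity = f"commit:{commit['id']}"
        agent = f"user:{commit['author']}"
        statements.append(f"activity({activity})")
        statements.append(f"agent({agent})")
        statements.append(f"wasAssociatedWith({activity}, {agent})")
        for path in commit["files"]:
            entity = f"file:{path}@{commit['id']}"
            statements.append(f"entity({entity})")
            statements.append(f"wasGeneratedBy({entity}, {activity})")
            # A new version of a known file is derived from its previous version.
            if path in last_version:
                statements.append(f"wasDerivedFrom({entity}, {last_version[path]})")
            last_version[path] = entity
    return statements
```

Version control already stores who changed what and when, so this kind of retrospective reconstruction needs no instrumentation of the code being tracked.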
  • 41. GIT2PROV Example Output https://provenance.ecs.soton.ac.uk/store/documents/116377/
  • 42. Provenance Visualization. Visualization of provenance is an ongoing research topic • Especially for non-experts (“Provenance for people”) • Example: PROV COMICS
  • 43. Key Messages and Summary. Recording the provenance of science workflows is important • to understand where data came from • to reproduce data processing steps or whole workflows. Use a standard for provenance • W3C standard PROV • Mapping to (graph) databases allows easy querying • A standard allows interoperability and comparison • Storing in blockchains increases trust. Recording provenance is not hard • APIs and tools are available (diagram: PROV core model with Entity, Activity, Agent and the relations wasGeneratedBy, used, wasDerivedFrom, wasAttributedTo, wasAssociatedWith)
  • 44. Thank You! Questions? Andreas.Schreiber@dlr.de www.DLR.de/sc | @onyame