Recording and Reasoning Over Data Provenance in Web and Grid Services
Martin Szomszor and Luc Moreau
University of Southampton
Contents
• A definition of provenance
• Example 1: Aerospace engineering
• Example 2: Organ transplant management
• Example 3: Bioinformatics grid
• Provenance architecture
• Provenance service
• Conclusion
The Grid and Virtual Organisations
• The Grid problem is defined as coordinated resource sharing and problem solving in dynamic, multi-institutional virtual organisations [FKT01].
• Effort is required to allow users to place their trust in the data produced by such virtual organisations.
• Understanding how a given service is likely to modify the data flowing into it, and how that data was generated, is crucial.
Provenance and Virtual Organisations
• Given a set of services in an open grid environment that decide to form a virtual organisation with the aim of producing a given result, how can we determine the process that generated the result, especially after the virtual organisation has been disbanded?
• The lack of information about the origin of results makes it harder for users to trust such open environments.
Provenance and Workflows
• Workflow enactment has become popular in the Web Services and Grid communities.
• Workflow enactment can be seen as a scripted form of virtual organisation.
• The problem is similar: how can we determine the origin of enactment results?
Provenance: Definition
• Provenance is an annotation able to explain how a particular result has been derived.
• In a service-oriented architecture, provenance identifies what data is passed between services, what services are available, and what results are generated for particular sets of input values.
• Using provenance, a user can trace the “process” that led to the aggregation of services producing a particular output.
Provenance in Aerospace Engineering
• Aerospace engineering involves scientific simulations, data pre- and post-processing, and visualisation, composed into complex workflows.
Provenance in Aerospace Engineering
• Provenance is crucial in this context: many customers who use the end results of simulations require a historical record of the outputs from each sub-system.
• For instance, an aircraft's provenance data may need to be kept for up to 99 years when it is sold to some countries.
• Currently, however, little direct support is available for this.
Provenance in Organ Transplant Management
• Medical information systems, and in particular decision support systems for organ and tissue transplant, rely on a wide range of data sources, patient data, and knowledge added by doctors, surgeons and other individuals using the systems.
Provenance in Organ Transplant Management
• Such a domain is heavily regulated.
• European, national, regional and site-specific rules govern how decisions are made.
• Application of these rules must be ensured and auditable, and the rules may change over time.
• Patient recovery is highly dependent on organ allocation choice, extraction and insertion methods, and the care/recovery regime.
Provenance in Organ Transplant Management
• Tracking back previous decisions in any one centre makes it possible to identify whether the best match was made, who was involved in the decision, and what the context was.
• The goal is to maximise the efficiency of matching and the recovery rate of patients.
Provenance in a Bioinformatics Grid (myGrid)
• myGrid aims to build a personalised problem-solving environment, in which the scientist can:
  • construct in silico experiments,
  • find and adapt others,
  • store results in data repositories,
  • have their own view on public repositories,
  • be better informed as to the provenance and the currency of the tools and data directly relevant to their experimental space.
Provenance in a Bioinformatics Grid (myGrid)
• Two major forms of provenance [Greenwood03]. The first is the derivation path:
  • The derivation path records the process by which results are generated from input data.
  • Derivation data answers questions about what initial data was used for a result, and how the transformation from initial data to result was achieved.
  • The FDA requires drug companies to keep a record of the provenance of drug discovery for as long as the drug is in use (sometimes up to 50 years).
Provenance in a Bioinformatics Grid (myGrid)
• Two major forms of provenance [Greenwood03]. The second is annotations:
  • Annotations are attached to objects, or collections of objects.
  • Annotation data provides more contextual information that might be of interest: who performed an experiment and when, whether they supplied any comments on the specific methods and materials used, when an object was created and last updated, who owns it, and its format.
  • Annotations are useful for providing a personalised environment.
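The two forms of provenance can be pictured with a small sketch. This is only an illustration and not the myGrid representation: the `DataItem` class, its field names, and the example service and file names are all assumptions made here. Each item carries its derivation links (which service produced it, and from which inputs) alongside free-form annotations, and the derivation path is walked backwards to answer "what initial data was used for this result?".

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class DataItem:
    """A piece of data with both forms of provenance (hypothetical structure)."""
    name: str
    # Derivation path: the service that produced the item and the inputs it used.
    produced_by: Optional[str] = None
    derived_from: List["DataItem"] = field(default_factory=list)
    # Annotations: contextual metadata such as owner, creation time or format.
    annotations: Dict[str, str] = field(default_factory=dict)


def initial_inputs(item: DataItem) -> List[str]:
    """Follow the derivation path back to the items that were not derived."""
    if not item.derived_from:
        return [item.name]
    names: List[str] = []
    for parent in item.derived_from:
        for name in initial_inputs(parent):
            if name not in names:
                names.append(name)
    return names


# Illustrative data only; the names below are invented for this sketch.
sequence = DataItem("sequence.fasta", annotations={"owner": "alice"})
database = DataItem("reference_db", annotations={"format": "fasta"})
hits = DataItem("hits", produced_by="alignment_service",
                derived_from=[sequence, database])
report = DataItem("report", produced_by="summary_service", derived_from=[hits])

print(initial_inputs(report))
```

Keeping derivation links and annotations as separate fields mirrors the distinction drawn above: the former answers "how was this produced?", the latter "who, when, and in what form?".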
Other Provenance Requirements and Uses
• Standard lineage representation, automated lineage recording, unobtrusive information collecting [Frew and Bose].
• Provenance supports reliability and quality, justification and audit, re-usability, reproducibility and repeatability, change and evolution, ownership, security, credit and copyright [Goble].
What is the problem?
• Provenance recording should be part of the infrastructure, so that users can elect to enable it when they execute their complex tasks over the Grid or in Web Services environments.
• Currently, the Web Services protocol stack and the Open Grid Services Architecture do not provide any support for recording provenance.
Our Contributions
• A service-oriented architecture for provenance support in Grid and Web Services environments, based on the idea of a provenance service.
• A client-side API for recording provenance data for Web Service invocation.
• A data model for storing provenance data.
• A server-side interface for querying provenance data.
• Two components making use of provenance: provenance browsing and provenance validation.
[Figure: Overall Architecture]
Overall Architecture
• Provenance gathering is a collaborative process that involves multiple entities, including the workflow enactment engine, the enactment engine's client, the service directory, and the invoked services.
• Provenance data will be submitted to one or more “provenance repositories” acting as storage for provenance data.
• At the user's request, analysis, navigation and reasoning over provenance data can be undertaken.
Overall Architecture
• Storage could be achieved by a provenance service.
• A library, optionally hosted in the provenance service, would perform the analysis, navigation or reasoning.
• A client-side library would submit provenance data to the provenance service.
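A minimal sketch of this split between a storing service and a submitting client-side library might look as follows. It is not the authors' API: the class names, the `submit`, `query` and `invoke` methods, and the record fields are assumptions made for illustration, and a local Python callable stands in for a real Web Service invocation.

```python
from typing import Any, Callable, Dict, List


class ProvenanceService:
    """In-memory stand-in for a provenance repository (hypothetical interface)."""

    def __init__(self) -> None:
        self._records: List[Dict[str, Any]] = []

    def submit(self, record: Dict[str, Any]) -> None:
        # Store a copy so later changes by the caller do not alter the record.
        self._records.append(dict(record))

    def query(self, **criteria: Any) -> List[Dict[str, Any]]:
        # Return every record whose fields match all of the given criteria.
        return [r for r in self._records
                if all(r.get(key) == value for key, value in criteria.items())]


class ProvenanceClient:
    """Client-side library: performs an invocation and reports it to the service."""

    def __init__(self, service: ProvenanceService) -> None:
        self._service = service

    def invoke(self, service_name: str, operation: str,
               func: Callable[..., Any], *args: Any) -> Any:
        result = func(*args)
        self._service.submit({
            "service": service_name,
            "operation": operation,
            "inputs": list(args),
            "output": result,
        })
        return result


store = ProvenanceService()
client = ProvenanceClient(store)
total = client.invoke("adder", "add", lambda a, b: a + b, 2, 3)
print(store.query(service="adder"))
```

Keeping `query` on the service side leaves room for the analysis library to be hosted there or elsewhere, which is the option the architecture leaves open.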
[Figure: System Overview]
Sequence Diagram
• Purpose: to identify the interactions between the provenance service, the client-side library and the enactment engine.
• A session is created first.
• The protocol needs to support the most complex workflows, including conditional branching, iteration, recursion and parallel execution.
• Provenance data should be submitted asynchronously, so that provenance submission does not delay workflow execution.
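One common way to keep submission off the workflow's critical path is a queue drained by a background thread. The sketch below shows that general technique only; it is not the protocol from the sequence diagram, and the `AsyncProvenanceRecorder` class, its method names and the `deliver` callback are hypothetical.

```python
import queue
import threading
import uuid
from typing import Any, Callable, Dict


class AsyncProvenanceRecorder:
    """Queues provenance records and delivers them on a background thread."""

    def __init__(self, deliver: Callable[[Dict[str, Any]], None]) -> None:
        self._deliver = deliver          # sends one record to the provenance store
        self._pending: "queue.Queue" = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def create_session(self) -> str:
        # A session identifier groups all records of one workflow run.
        return uuid.uuid4().hex

    def record(self, session_id: str, payload: Dict[str, Any]) -> None:
        # Returns immediately; the workflow does not wait for delivery.
        self._pending.put({"session": session_id, **payload})

    def _drain(self) -> None:
        while True:
            item = self._pending.get()
            if item is None:             # sentinel placed by close()
                return
            self._deliver(item)

    def close(self) -> None:
        # Flush everything queued so far, then stop the worker.
        self._pending.put(None)
        self._worker.join()


received = []
recorder = AsyncProvenanceRecorder(received.append)
session = recorder.create_session()
for step in range(3):
    recorder.record(session, {"step": step})
recorder.close()
print(received)
```

A single worker preserves the order in which records were queued; error handling and retries for a failed delivery are deliberately left out of this sketch.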
[Figure: Sequence Diagram]
Provenance Data Model
• Must support recording of all information necessary to replay execution.
• Must support all complex forms of workflows (recursion, iteration, parallel execution).
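The two requirements can be illustrated with a small sketch. It is not the data model shown in the slides; the `InvocationRecord` fields and the `children` and `replay` helpers are assumptions introduced here. Each record stores the inputs and output of one invocation, which is enough to re-run it, and a parent link, which lets nested (recursive) and sibling (parallel) invocations be represented.

```python
import math
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class InvocationRecord:
    record_id: int
    parent_id: Optional[int]       # enclosing invocation; None at the top level
    service: str
    inputs: Tuple[Any, ...]
    output: Any


def children(records: List[InvocationRecord],
             parent_id: Optional[int]) -> List[InvocationRecord]:
    """Invocations made directly within the given invocation."""
    return [r for r in records if r.parent_id == parent_id]


def replay(records: List[InvocationRecord],
           services: Dict[str, Callable[..., Any]]) -> List[int]:
    """Re-run every recorded invocation; return ids whose output now differs."""
    mismatches = []
    for r in records:
        if services[r.service](*r.inputs) != r.output:
            mismatches.append(r.record_id)
    return mismatches


# A recursive computation recorded as a chain of nested invocations.
records = [
    InvocationRecord(1, None, "fact", (3,), 6),
    InvocationRecord(2, 1, "fact", (2,), 2),
    InvocationRecord(3, 2, "fact", (1,), 1),
]
services = {"fact": math.factorial}
tampered = records[:2] + [InvocationRecord(3, 2, "fact", (1,), 99)]

print(replay(records, services), replay(tampered, services))
```

Comparing replayed outputs against recorded ones is one simple form that validation over provenance data could take; it assumes the services are deterministic.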
[Figure: Provenance Data Model]
Discussion
• For provenance data to be useful, we expect the protocol used to submit it to support some “classical” properties of distributed algorithms.
• Using mutual authentication, an invoked service can ensure that it submits data to a specific provenance server and, vice versa, a provenance server can ensure that it receives data from a given service.
• With non-repudiation, we can retain evidence that a service has committed to executing a particular invocation and has produced a given result.
• We anticipate that cryptographic techniques will be useful to ensure such properties.
The purpose of the PASOA project is to investigate provenance in Grid architectures.
• Funded by EPSRC under the “fundamental computer science for e-Science” call.
• In collaboration with Cardiff.
• www.pasoa.org
Conclusion
• Provenance is a rather unexplored domain.
• It is strategic for bringing trust to open environments.
• Our provenance service is the first attempt to incorporate provenance in the infrastructure of Web and Grid services.
• The algorithmic foundations of provenance need further investigation, which will lead to scalable and secure industrial solutions.
Acknowledgements
• Syd Chapman, IBM
• Omer Rana, Cardiff
• Andreas Schreiber and Rolf Hempel, DLR
• Laszlo Varga, SZTAKI
• Ulises Cortes and Steven Willmott, UPC
• Mark Greenwood, Carole Goble, Manchester
