Provenance of workflow data products

               Paolo Missier
            School of Computing,
           Newcastle University, UK




  TAPP’11 workshop
   Heraklion, Crete
   June 20-21, 2011
Workflow provenance
Taverna type system:
- strings + nested lists
- “cat”, [“cat”, “dog”], [ [“cat”, “dog”], [“large”, “small”] ]


      Dataflow model:

      - data-driven execution
      - services activate when input is ready


      Raw provenance:
      A detailed trace of workflow execution
      - tasks performed, data transformations
      - inputs used, outputs produced


                                   Linköping, Sweden -- January 2010
[Figure: example Taverna workflow; node labels: lister, get pathways by genes1, merge pathways, gene_id, concat gene pathway ids, output, pathway_genes]
Implicit iteration in Taverna

Processor P has input ports X1, X2, X3 and output port Y:
      - X1, annotated (0,1), receives a = [a1 ... an]
      - X2, annotated (1,1), receives c = [c1 ... ck]
      - X3, annotated (0,1), receives b = [b1 ... bm]

How y is computed at P:
      let I = a ⊗ b = [ [ <ai, bj> | bj ∈ b ] | ai ∈ a ] // cross product

      I’ = [ [ <ai, c, bj> | bj ∈ b ] | ai ∈ a ] // same product but with c interleaved

      y = (map (map P) I’) = [ (map P [ <a1, c, b1> ... <a1, c, bm> ]), ...,
                               (map P [ <an, c, b1> ... <an, c, bm> ]) ]
                           = [ [y11 ... y1m], ..., [yn1 ... ynm] ]

Bottom line:
      yij depends only on values ai, c, bj
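The computation of y can be sketched in a few lines of Python. This is an illustrative sketch only, not Taverna code: `implicit_iteration` and the sample processor `P` are hypothetical names, and the processor is assumed to take one element of a, the whole list c, and one element of b.

```python
from typing import Callable, List, Sequence, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")
Y = TypeVar("Y")


def implicit_iteration(
    P: Callable[[A, Sequence[C], B], Y],
    a: Sequence[A],
    c: Sequence[C],
    b: Sequence[B],
) -> List[List[Y]]:
    """Compute y = map (map P) I' with I' = [[<ai, c, bj> | bj in b] | ai in a]."""
    # I' is the cross product of a and b, with the whole list c interleaved.
    i_prime = [[(ai, c, bj) for bj in b] for ai in a]
    # Apply P to every tuple while preserving the nesting of I'.
    return [[P(ai, cc, bj) for (ai, cc, bj) in row] for row in i_prime]


def P(ai: str, c: Sequence[str], bj: str) -> str:
    # Hypothetical processor: joins its three inputs into one string.
    return f"{ai}|{','.join(c)}|{bj}"


a = ["a1", "a2"]
c = ["c1", "c2", "c3"]
b = ["b1", "b2", "b3"]
y = implicit_iteration(P, a, c, b)

# y has len(a) rows and len(b) columns; y[i][j] is built from a[i], c, b[j] only.
print(y[1][2])  # a2|c1,c2,c3|b3
```

Because each y[i][j] is produced from exactly one tuple of I’, the index pair (i, j) of an output element is enough to identify the input elements it depends on.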
Fine-grained (precise?) provenance

[Figure: trees of list index paths. Each tree has root [], children [0], [1], [2], and leaves [0,0], [0,1], [1,0], [1,1], [2,1], [2,2].]
Which provenance model/language?

     • Let’s look at the Open Provenance Model as a starting point

     Core OPM
     • agnostic wrt Artifact, Processor types
     • roles: annotations on binary relations
     • extensions by subclassing
        • node types
        • relation types

     Formal (temporal) semantics hopefully available soon

Excerpt on OPM shown in the slide background (cropped in the original):

     The Open Provenance Model aims to capture the causal dependencies between
     the artifacts, processes, and agents. Therefore, a provenance graph is
     defined as a directed graph, whose nodes are artifacts, processes and
     agents, and whose edges belong to one of the following categories depicted
     in Figure 1. An edge represents a causal dependency, between its source,
     denoting the effect, and its destination, denoting the cause.

     Figure 1: Edges in the Open Provenance Model: sources are effects,
     destinations causes

     The first two edges express that a process used an artifact and that an
     artifact was generated by a process. Since a process may have used several
     artifacts, it is important to identify the roles under which these
     artifacts were used. (Roles are denoted by letter ‘R’ in Figure 1.)
     Likewise, ...

     ... make this dependency explicit, it is required to assert that artifact
     A2 was derived from another artifact A1. This edge gives us a dataflow
     oriented view of provenance.

     It is also recognized that we may not be aware of the exact artifact that
     process P2 used, but that there was some artifact generated by another
     process P1. Process P2 is then said to have been triggered by P1. In
     contrast to edge was derived from, a was triggered by edge allows for a
     process oriented view of past executions to be adopted. (Since these edges
     summarize some activities for which all details are not being exposed, it
     was felt that it was not necessary to associate a role with them.)

     As far as conventions are concerned, we note that causality edges use past
     tense to indicate that they refer to past execution. Causal relationships
     are defined as follows.

     Definition 4 (Causal Relationship). A causal relationship is represented
     by an arc and denotes the presence of a causal dependency between the
     source of the arc (the effect) and the destination of the arc (the cause).

     Five causal relationships are recognized: a process used an artifact, an
     artifact was generated by a process, a process was triggered by a process,
     an artifact was derived from an artifact, and a process was controlled by
     an agent. By means of annotations (see Section 8), we allow edges to be
     further subtyped from these five categories.

     Multiple notions of causal dependencies were considered for OPM. A very
     strong notion of causal dependency would express that a set of entities
     was necessary and sufficient to explain the existence of another entity.
     It was felt that such a notion was not practical, since, with an open
     world assumption, one could always argue that additional factors may have
     influenced an outcome (e.g. electricity was used, temperature range
     allowed computer to work, ...
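The five causal relationships can be read as typed edges between artifacts, processes and agents. The sketch below is one possible encoding, not an official OPM API: `NodeKind`, `Node`, `Edge`, `EDGE_SIGNATURES` and `is_well_typed` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class NodeKind(Enum):
    ARTIFACT = "artifact"
    PROCESS = "process"
    AGENT = "agent"


# For each relationship: (kind of the source = effect, kind of the destination = cause).
EDGE_SIGNATURES = {
    "used": (NodeKind.PROCESS, NodeKind.ARTIFACT),
    "wasGeneratedBy": (NodeKind.ARTIFACT, NodeKind.PROCESS),
    "wasTriggeredBy": (NodeKind.PROCESS, NodeKind.PROCESS),
    "wasDerivedFrom": (NodeKind.ARTIFACT, NodeKind.ARTIFACT),
    "wasControlledBy": (NodeKind.PROCESS, NodeKind.AGENT),
}


@dataclass(frozen=True)
class Node:
    name: str
    kind: NodeKind


@dataclass(frozen=True)
class Edge:
    relation: str
    effect: Node                # source of the arc
    cause: Node                 # destination of the arc
    role: Optional[str] = None  # optional role annotation, e.g. on used edges


def is_well_typed(edge: Edge) -> bool:
    """Check that the end points of the edge match the relation's signature."""
    signature = EDGE_SIGNATURES.get(edge.relation)
    return signature == (edge.effect.kind, edge.cause.kind)


p = Node("P", NodeKind.PROCESS)
a1 = Node("A1", NodeKind.ARTIFACT)
a2 = Node("A2", NodeKind.ARTIFACT)

ok = Edge("used", p, a1, role="list")
bad = Edge("wasDerivedFrom", a2, p)  # an artifact cannot be derived from a process
print(is_well_typed(ok), is_well_typed(bad))  # True False
```

Subtyping by annotation, as mentioned in the excerpt, would amount to adding entries that refine one of these five signatures.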
OPM relations in the workflow context
[Figure: Single User’s View (Alice), in three layers]
    - Process space: Workflow Specification (WA), the chain X, A, Y, B, Z with edges labelled in and out
    - Data space: Data Storage and Binding (SA), with bindings X / [x1, x2] and Z / [z1, z2]
    - Provenance space: Provenance Trace (TA), two chains x1, a1, y1, b1, z1 and x2, a2, y2, b2, z2,
      with edges labelled read, write, idep and ddep
    The layers are linked by Data Binding & Enactment and by Execution & Trace Capture.

Provenance Queries:
    ans(X) :- ddep*(X, z1).

Missier, P., Ludascher, B., Bowers, S., Anand, M. K., Altintas, I., Dey, S., et al. (2010). Linking Multiple Workflow Provenance Traces for Interoperable
     Collaborative Science. Proc. 5th Workshop on Workflows in Support of Large-Scale Science (WORKS).
Data and invocation dependencies
    - read, write are natural observables for a workflow run
    - possible additional relations (recorded or inferred):


    •  invocation dependencies:
       explicit, or inferred via:
       a2 depends on a1 because a1 has written data d and a2 has read d

    •  data dependencies:
       explicit, or inferred via:
       d2 depends on d1 because some actor invocation a read d1 prior to writing d2
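The two inference rules above can be sketched over a list of observed read and write events. This is a minimal sketch under stated assumptions: the event format `(kind, invocation, data)`, the function names `infer_idep` and `infer_ddep`, and the pair order (dependency first, dependent second) are all illustrative choices, not taken from the source.

```python
from typing import List, Set, Tuple

# A trace is an ordered list of observed events: (kind, invocation, data item),
# where kind is "read" or "write". The list order stands in for time.
Event = Tuple[str, str, str]


def infer_idep(trace: List[Event]) -> Set[Tuple[str, str]]:
    """Pairs (a1, a2) such that a2 depends on a1: a1 wrote some d that a2 read."""
    writers = {(d, a) for kind, a, d in trace if kind == "write"}
    readers = {(d, a) for kind, a, d in trace if kind == "read"}
    return {(a1, a2) for (dw, a1) in writers for (dr, a2) in readers if dw == dr}


def infer_ddep(trace: List[Event]) -> Set[Tuple[str, str]]:
    """Pairs (d1, d2) such that d2 depends on d1: some invocation read d1
    prior to writing d2."""
    ddep = set()
    for i, (kind_w, inv_w, d2) in enumerate(trace):
        if kind_w != "write":
            continue
        # Look only at events that happened before this write.
        for kind_r, inv_r, d1 in trace[:i]:
            if kind_r == "read" and inv_r == inv_w:
                ddep.add((d1, d2))
    return ddep


trace = [
    ("read", "a1", "x1"),
    ("write", "a1", "y1"),
    ("read", "b1", "y1"),
    ("write", "b1", "z1"),
]
print(sorted(infer_idep(trace)))  # [('a1', 'b1')]
print(sorted(infer_ddep(trace)))  # [('x1', 'y1'), ('y1', 'z1')]
```

The sample trace reuses the names x1, a1, y1, b1, z1 from the provenance trace figure; the events themselves are made up for the example.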




Provenance queries
    •  Closure queries:
       operate on the transitive closure ddep* over ddep, e.g. ans(X) :- ddep*(X, z1).




    But also:
    - queries on the workflow structure
    - queries on the data structures (e.g. collections)

    and importantly:
    use workflow graphs to justify/explain the provenance graph for one
    workflow run:
                            TA is a trace instance of WA:
                            there is a homomorphism h: TA ➔ WA, with
                            h(x1 ➔ a1) = h(x2 ➔ a2) = X ➔ A,
                            h(a1 ➔ y1) = h(a2 ➔ y2) = A ➔ Y
                            ...
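Both ideas on this slide, closure queries and the trace-to-workflow homomorphism, can be sketched with plain sets of edges. The sketch assumes that ddep(X, Y) is read as "Y depends on X", so that ans(X) :- ddep*(X, z1) returns everything z1 depends on; the function names and the sample ddep facts are hypothetical.

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]


def transitive_closure(edges: Set[Edge]) -> Set[Edge]:
    """Transitive closure of a binary relation, computed as a naive fixpoint."""
    closure = set(edges)
    while True:
        new_pairs = {(x, z) for (x, y) in closure for (y2, z) in closure if y == y2}
        if new_pairs <= closure:
            return closure
        closure |= new_pairs


def is_homomorphism(h: Dict[str, str], trace: Set[Edge], workflow: Set[Edge]) -> bool:
    """Check that h maps every trace edge onto an edge of the workflow graph."""
    return all((h[u], h[v]) in workflow for (u, v) in trace)


# Closure query ans(X) :- ddep*(X, z1), over assumed ddep facts.
ddep = {("x1", "y1"), ("y1", "z1"), ("x2", "y2"), ("y2", "z2")}
ans = sorted(x for (x, target) in transitive_closure(ddep) if target == "z1")
print(ans)  # ['x1', 'y1']

# Homomorphism h: TA -> WA for the edges listed on the slide.
workflow_edges = {("X", "A"), ("A", "Y")}
trace_edges = {("x1", "a1"), ("x2", "a2"), ("a1", "y1"), ("a2", "y2")}
h = {"x1": "X", "x2": "X", "a1": "A", "a2": "A", "y1": "Y", "y2": "Y"}
print(is_homomorphism(h, trace_edges, workflow_edges))  # True
```

A trace edge with no counterpart in the workflow graph, such as x1 ➔ y1, makes the check fail, which is the sense in which the workflow graph justifies the provenance graph.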
OPM extensions, principled

    core OPM / PIL(*)
        - used
        - wasGeneratedBy
        - wasDerivedFrom

    Extensions around the core:
        - Data types: nested ordered lists
        - Processor types: operators on lists
            - create, insert, delete, select...
            - map, fold
        - (Additional context): who, when, where, ...

    Open questions:
        - Graph models? Graph matching queries?
        - Relations (sets of tuples)? Relational queries?

    (*) PIL = Provenance Interchange Language, W3C Provenance Working Group
Proposed extensions

Table 2: New roles for Used and Contained relations

      Role                       Used in relation    Context of use
      element                    Contained           L Contained(element) x
      list                       Used                P Used(list) L
      position                   Used                P Used(position) p
      term, generator, filter    Used                List comprehensions, see Sec. 4.2
      function, operand          Used                map, see Sec. 4.3

Table 3: Specialised OPM dependency relations

      Causal relation                            Example
      Contained(R) ⊆ [τ] × A × [Int]             L Contained([i1 ... in]) x
                                                 x was inserted into L at position [i1 ... in]
      wasSelectedFrom(R) ⊆ [τ] × [τ] × [Int]     L' wasSelectedFrom([i1 ... in]) L
                                                 L' at position [i1 ... in] was selected from L
      wasRemovedFrom(R) ⊆ [τ] × [τ] × [Int]      L' wasRemovedFrom([i1 ... in]) L
                                                 L' at position [i1 ... in] was deleted from L
      wasSameAs ⊆ A × A                          L wasSameAs x
                                                 inferred (various contexts)
OPM fragments for elementary operations
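     [Figures: one OPM graph template per elementary list operation]

     empty list:
         L --wasGeneratedBy--> (process)

     unit:
         L --wasGeneratedBy--> P:unit
         P:unit --used(element)--> x
         L --contained--> x

     insertion:
         L' --wasGeneratedBy--> P:ins
         P:ins --used(list)--> L
         P:ins --used(element)--> x
         P:ins --used(position)--> p
         L' --wasDerivedFrom--> L
         L' --contained--> x

     selection:
         L' --wasGeneratedBy--> P:sel
         P:sel --used(list)--> L
         P:sel --used(position)--> p
         L' --WasSelectedFrom--> L

     deletion:
         L' --wasGeneratedBy--> P:del
         P:del --used(list)--> L
         P:del --used(position)--> p
         L' --WasRemovedFrom--> L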
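As a rough illustration, the insertion and selection fragments can be generated by small template functions that return lists of edges. The edge tuple layout and the names `ins_template` and `sel_template` are hypothetical, and the edges follow one reading of the slide's figures rather than a published definition.

```python
from typing import List, Optional, Tuple

# An edge is (relation, effect, cause, role); role is None when the edge has none.
Edge = Tuple[str, str, str, Optional[str]]


def ins_template(proc: str, new_list: str, old_list: str, elem: str, pos: str) -> List[Edge]:
    """OPM fragment for new_list = ins elem old_list pos."""
    return [
        ("wasGeneratedBy", new_list, proc, None),
        ("used", proc, old_list, "list"),
        ("used", proc, elem, "element"),
        ("used", proc, pos, "position"),
        ("wasDerivedFrom", new_list, old_list, None),
        ("contained", new_list, elem, None),
    ]


def sel_template(proc: str, result: str, lst: str, pos: str) -> List[Edge]:
    """OPM fragment for result = sel lst pos."""
    return [
        ("wasGeneratedBy", result, proc, None),
        ("used", proc, lst, "list"),
        ("used", proc, pos, "position"),
        ("wasSelectedFrom", result, lst, None),
    ]


# Insert x into L0 at position p, then select from the resulting list L1 at p.
graph = ins_template("P1", "L1", "L0", "x", "p") + sel_template("P2", "y", "L1", "p")
for edge in graph:
    print(edge)
```

Composing two operators then corresponds to concatenating their fragments, with the shared node (here L1) acting as the join point.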
Operator composition → graph composition

       • Equalities on operators may translate into inferences on the graphs

... templates for the two operations. Additionally, however, the following
equality holds whenever insertion is followed by selection, if no other intervening
operation changes the state of the list:

                              sel (ins x L p) p = x                              (7)

This equality now translates into the following OPM inference rule:
      sameAs(L1,X) :-
          pType(P1, ins), used(P1, X, element), used(P1, Pos, position), wgby(L,P1),
          pType(P2, sel), used(P2, L, list), used(P2, Pos, position), wgby(L1,P2).

[Figure: composed OPM graph for insertion followed by selection. P:ins used(list) L,
used(element) x and used(position) p, and generated L', which wasDerivedFrom L and
contained x. P:sel used(list) L' and used(position) p, and generated its result,
which WasSelectedFrom L'. The inferred wasSameAs edge links the result of P:sel to x.]
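The inference can be tried out by evaluating the rule body as a join over sets of facts. In this sketch the head relates the result of the selection to the inserted element x, following the equality sel (ins x L p) p = x; the function name `infer_same_as` and the sample facts are hypothetical, while the predicate names pType, used and wgby come from the rule.

```python
from typing import Set, Tuple

PType = Set[Tuple[str, str]]      # pType(P, type)
Used = Set[Tuple[str, str, str]]  # used(P, node, role)
Wgby = Set[Tuple[str, str]]       # wgby(node, P)


def infer_same_as(p_type: PType, used: Used, wgby: Wgby) -> Set[Tuple[str, str]]:
    """Pairs (L1, X): L1 was selected at the position where X had just been inserted."""
    return {
        (l1, x)
        for (p1, t1) in p_type if t1 == "ins"
        for (p2, t2) in p_type if t2 == "sel"
        # The list generated by the insertion is the list used by the selection.
        for (l, g1) in wgby if g1 == p1 and (p2, l, "list") in used
        # Both processes used the same position.
        for (u1, pos, r1) in used
        if u1 == p1 and r1 == "position" and (p2, pos, "position") in used
        for (u2, x, r2) in used if u2 == p1 and r2 == "element"
        for (l1, g2) in wgby if g2 == p2
    }


p_type = {("P1", "ins"), ("P2", "sel")}
used = {
    ("P1", "L0", "list"), ("P1", "x", "element"), ("P1", "p", "position"),
    ("P2", "L1", "list"), ("P2", "p", "position"),
}
wgby = {("L1", "P1"), ("y", "P2")}
print(infer_same_as(p_type, used, wgby))  # {('y', 'x')}
```

Requiring the selection to use exactly the list generated by the insertion is how the sketch encodes the condition that no intervening operation changes the list.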
Approach applies to map, fold (reduce)...

... and p a position in the list L' = map f L. The following equality follows from
the definition of map:

                              sel (map f x) p = f (sel x p)                      (10)

This equality is captured by the inferred dependency (y wasSameAs y') in the
enhanced OPM template of Fig. 7 (inference rule omitted for simplicity).

[Figure: enhanced OPM template for selection after map.
 Upper path: L' wasGeneratedBy P:map, which used(list) L and used(function) f;
 y wasGeneratedBy Q2:sel, which used(list) L' and used(position) p; y WasSelectedFrom L'.
 Lower path: x wasGeneratedBy Q1:sel, which used(list) L and used(position) p;
 y' wasGeneratedBy R:apply, which used(operand) x and used(function) f.
 Inferred edge: y wasSameAs y'.]
Take-home message
     • OPM (PIL) is a candidate starting point for workflow-based
       provenance
     • extension mechanisms are provided, but they must be used
       sensibly
       – data types
       – processor types

     • Provenance of nested ordered lists used as a prototypical example
       – semantics of provenance graphs and graph composition grounded in the
         semantics of lists



     • Can this approach be useful for other interesting data types?
       – sets of tuples / relational algebra





provenance of lists - TAPP'11 Mini-tutorial

  • 1. Provenance of workflow data products. Paolo Missier, School of Computing, Newcastle University, UK. TAPP’11 workshop, Heraklion, Crete, June 20-21, 2011
  • 2. Workflow provenance
      Taverna type system:
      - strings + nested lists
      - “cat”, [“cat”, “dog”], [ [“cat”, “dog”], [“large”, “small”] ]
      Dataflow model:
      - data-driven execution
      - services activate when input is ready
      Raw provenance: a detailed trace of workflow execution
      - tasks performed, data transformations
      - inputs used, outputs produced
      Linköping, Sweden -- January 2010
  • 4. Workflow provenance
      Taverna type system:
      - strings + nested lists
      - “cat”, [“cat”, “dog”], [ [“cat”, “dog”], [“large”, “small”] ]
      Dataflow model:
      - data-driven execution
      - services activate when input is ready
      Raw provenance: a detailed trace of workflow execution
      - tasks performed, data transformations
      - inputs used, outputs produced
      [Figure: example workflow with labels lister, get pathways by genes1, merge pathways, gene_id, concat gene pathway ids, output, pathway_genes]
  • 5. Implicit iteration in Taverna
      Inputs: c = [c1 ... ck], a = [a1 ... an], b = [b1 ... bm]
      Input ports: X1, X2, X3. Processor: P. Output port: Y.
  • 6. Implicit iteration in Taverna
      Inputs: c = [c1 ... ck], a = [a1 ... an], b = [b1 ... bm]
      Input ports: X1, X2, X3, annotated (0,1), (1,1), (0,1). Processor: P. Output port: Y.
  • 7. Implicit iteration in Taverna
      Inputs: c = [c1 ... ck], a = [a1 ... an], b = [b1 ... bm]
      Input ports: X1, X2, X3, annotated (0,1), (1,1), (0,1). Processor: P. Output port: Y.
      Output: y = [ [y11 ... y1m], ..., [yn1 ... ynm] ]
  • 8. Implicit iteration in Taverna
      Inputs: c = [c1 ... ck], a = [a1 ... an], b = [b1 ... bm]
      Input ports: X1, X2, X3, annotated (0,1), (1,1), (0,1). Processor: P. Output port: Y.
      Output: y = [ [y11 ... y1m], ..., [yn1 ... ynm] ]
      How y is computed at P:
        let I  = a ⊗ b = [ [ <ai, bj> | bj ∈ b ] | ai ∈ a ]     // cross product
            I’ = [ [ <ai, c, bj> | bj ∈ b ] | ai ∈ a ]          // same product but with c interleaved
        y = (map (map P) I’)
          = [ (map P [ <a1, c, b1> ... <a1, c, bm> ]), ..., (map P [ <an, c, b1> ... <an, c, bm> ]) ]
          = [ [y11 ... y1m], ..., [yn1 ... ynm] ]
  • 9. Implicit iteration in Taverna
      Inputs: c = [c1 ... ck], a = [a1 ... an], b = [b1 ... bm]
      Input ports: X1, X2, X3, annotated (0,1), (1,1), (0,1). Processor: P. Output port: Y.
      Output: y = [ [y11 ... y1m], ..., [yn1 ... ynm] ]
      Bottom line: yij depends only on values ai, c, bj
      How y is computed at P:
        let I  = a ⊗ b = [ [ <ai, bj> | bj ∈ b ] | ai ∈ a ]     // cross product
            I’ = [ [ <ai, c, bj> | bj ∈ b ] | ai ∈ a ]          // same product but with c interleaved
        y = (map (map P) I’)
          = [ (map P [ <a1, c, b1> ... <a1, c, bm> ]), ..., (map P [ <an, c, b1> ... <an, c, bm> ]) ]
          = [ [y11 ... y1m], ..., [yn1 ... ynm] ]
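The computation on this slide can be sketched in Python. This is only an illustration under stated assumptions: `implicit_iteration` and `toy_p` are hypothetical names, the tuples <ai, c, bj> are modelled as Python tuples, and P is taken to be any three-argument function. It is not Taverna's implementation.

```python
from typing import Callable, List, Sequence, TypeVar

A = TypeVar("A")
B = TypeVar("B")
C = TypeVar("C")
Y = TypeVar("Y")


def implicit_iteration(
    p: Callable[[A, Sequence[C], B], Y],
    a: Sequence[A],
    c: Sequence[C],
    b: Sequence[B],
) -> List[List[Y]]:
    """Apply p to every <ai, c, bj>, iterating over a and b but not over c."""
    # I' = [ [ <ai, c, bj> | bj in b ] | ai in a ]
    i_prime = [[(ai, c, bj) for bj in b] for ai in a]
    # y = map (map P) I'
    return [[p(*triple) for triple in row] for row in i_prime]


def toy_p(ai: str, c: Sequence[str], bj: str) -> str:
    """Hypothetical processor: joins its two scalar inputs and the whole list c."""
    return f"{ai}|{','.join(c)}|{bj}"


a = ["a1", "a2"]        # n = 2, iterated over
b = ["b1", "b2", "b3"]  # m = 3, iterated over
c = ["c1", "c2"]        # passed whole to every invocation

y = implicit_iteration(toy_p, a, c, b)
# y has n rows of m elements; y[i][j] depends only on a[i], c and b[j]
print(y[1][2])  # a2|c1,c2|b3
```

The nested comprehension mirrors the two nested maps, so the index path of each output element identifies exactly which input elements it came from.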
  • 10. Fine-grained (precise?) provenance
      [Figure: trees of list index paths, with nodes [], [0], [1], [2], [0,0], [0,1], [1,0], [1,1], [2,1], [2,2]]
  • 11. Which provenance model/language?
      • Let’s look at the Open Provenance Model as a starting point
      Core OPM:
      • agnostic wrt Artifact, Processor types
      • roles: annotations on binary relations
      • extensions by subclassing
        • node types
        • relation types
      Formal (temporal) semantics hopefully available soon
      [Background: excerpt from the OPM specification, partly cut off]
      The Open Provenance Model aims to capture the causal dependencies between the artifacts, processes, and agents. Therefore, a provenance graph is defined as a directed graph, whose nodes are artifacts, processes and agents, and whose edges belong to one of the following categories depicted in Figure 1. An edge represents a causal dependency, between its source, denoting the effect, and its destination, denoting the cause.
      Figure 1: Edges in the Open Provenance Model: sources are effects, destinations causes
      The first two edges express that a process used an artifact and that an artifact was generated by a process. Since a process may have used several artifacts, it is important to identify the roles under which these artifacts were used. (Roles are denoted by letter ‘R’ in Figure 1.) Likewise, [...]
      [...] make this dependency explicit, it is required to assert that artifact A2 was derived from another artifact A1. This edge gives us a dataflow oriented view of provenance. It is also recognized that we may not be aware of the exact artifact that process P2 used, but that there was some artifact generated by another process P1. Process P2 is then said to have been triggered by P1. In contrast to edge was derived from, a was triggered by edge allows for a process oriented view of past executions to be adopted. (Since these edges summarize some activities for which all details are not being exposed, it was felt that it was not necessary to associate a role with them.)
      As far as conventions are concerned, we note that causality edges use past tense to indicate that they refer to past execution. Causal relationships are defined as follows.
      Definition 4 (Causal Relationship). A causal relationship is represented by an arc and denotes the presence of a causal dependency between the source of the arc (the effect) and the destination of the arc (the cause). Five causal relationships are recognized: a process used an artifact, an artifact was generated by a process, a process was triggered by a process, an artifact was derived from an artifact, and a process was controlled by an agent.
      By means of annotations (see Section 8), we allow edges to be further subtyped from these five categories.
      Multiple notions of causal dependencies were considered for OPM. A very strong notion of causal dependency would express that a set of entities was necessary and sufficient to explain the existence of another entity. It was felt that such a notion was not practical, since, with an open world assumption, one could always argue that additional factors may have influenced an outcome (e.g. electricity was used, temperature range allowed computer to work, [...]
  • 12. OPM relations in the workflow context
      [Figure: single user’s view (Alice), in three layers]
      - Process space: workflow specification (WA): X ➔ A ➔ Y ➔ B ➔ Z, with in/out ports
      - Data space: data binding & enactment; data storage and binding (SA): X / [x1, x2], Z / [z1, z2]
      - Provenance space: execution & trace capture; provenance trace (TA) over x1, a1, y1, b1, z1 and x2, a2, y2, b2, z2, with read, write, idep and ddep edges
      - Provenance queries: ans(X) :- ddep*(X, z1).
      Missier, P., Ludascher, B., Bowers, S., Anand, M. K., Altintas, I., Dey, S., et al. (2010). Linking Multiple Workflow Provenance Traces for Interoperable Collaborative Science. Proc. 5th Workshop on Workflows in Support of Large-Scale Science (WORKS).
  • 13. Data and invocation dependencies
      - read, write are natural observables for a workflow run
      - possible additional relations (recorded or inferred):
        • invocation dependencies: explicit, or via: a2 depends on a1 because a1 has written data d and a2 has read d
        • data dependencies: explicit, or via: d2 depends on d1 because some actor invocation a read d1 prior to writing d2
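The two inference routes on this slide can be illustrated with a small Python sketch. The trace below is a hypothetical example, and the names `reads`, `writes`, `data_dependencies` and `invocation_dependencies` are assumptions made for illustration. The sketch also assumes that every invocation performs all its reads before its writes, since the toy trace carries no timestamps.

```python
from typing import Set, Tuple

Pair = Tuple[str, str]

# Observed events as (invocation, data item) pairs; hypothetical example trace.
reads: Set[Pair] = {("a1", "x1"), ("b1", "y1"), ("a2", "x2"), ("b2", "y2")}
writes: Set[Pair] = {("a1", "y1"), ("b1", "z1"), ("a2", "y2"), ("b2", "z2")}


def data_dependencies(reads: Set[Pair], writes: Set[Pair]) -> Set[Pair]:
    """(d2, d1) whenever some invocation read d1 and wrote d2."""
    return {
        (d2, d1)
        for (reader, d1) in reads
        for (writer, d2) in writes
        if reader == writer
    }


def invocation_dependencies(reads: Set[Pair], writes: Set[Pair]) -> Set[Pair]:
    """(a2, a1) whenever a1 wrote a data item that a2 read."""
    return {
        (a2, a1)
        for (a1, written) in writes
        for (a2, read) in reads
        if written == read
    }


ddep = data_dependencies(reads, writes)
idep = invocation_dependencies(reads, writes)
print(sorted(ddep))  # [('y1', 'x1'), ('y2', 'x2'), ('z1', 'y1'), ('z2', 'y2')]
print(sorted(idep))  # [('b1', 'a1'), ('b2', 'a2')]
```

Both relations are plain joins over the observed read and write events, which is why they can be either recorded at run time or inferred afterwards.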
  • 14. Provenance queries
      • Closure queries: operate on the transitive closure ddep* over ddep
      But also:
      - queries on the workflow structure
      - queries on the data structures (e.g. collections)
      and importantly: use workflow graphs to justify/explain the provenance graph for one workflow run:
      TA is a trace instance of WA: h: TA ➔ WA is a homomorphism
        h(x1 ➔ a1) = h(x2 ➔ a2) = X ➔ A,  h(a1 ➔ y1) = h(a2 ➔ y2) = A ➔ Y, ...
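A closure query and the homomorphism check can be sketched as follows. This is a minimal illustration, not the authors' query engine. The orientation of `ddep` pairs as (dependent, dependency) is an assumption, because the transcript does not spell out the argument order of ddep*. The names `lineage`, `is_homomorphism` and the example graphs are hypothetical.

```python
from collections import deque
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]

# (dependent, dependency) pairs: y1 depends on x1, z1 depends on y1, ...
ddep: Set[Edge] = {("y1", "x1"), ("z1", "y1"), ("y2", "x2"), ("z2", "y2")}


def lineage(target: str, ddep: Set[Edge]) -> Set[str]:
    """All data items that target transitively depends on (target excluded)."""
    direct: Dict[str, Set[str]] = {}
    for dependent, dependency in ddep:
        direct.setdefault(dependent, set()).add(dependency)
    seen: Set[str] = set()
    queue = deque([target])
    while queue:  # breadth-first walk along dependency edges
        node = queue.popleft()
        for dep in direct.get(node, ()):
            if dep not in seen:
                seen.add(dep)
                queue.append(dep)
    return seen


def is_homomorphism(h: Dict[str, str], trace: Set[Edge], workflow: Set[Edge]) -> bool:
    """True if h maps every trace edge onto an edge of the workflow graph."""
    return all((h[src], h[dst]) in workflow for (src, dst) in trace)


workflow_edges: Set[Edge] = {("X", "A"), ("A", "Y"), ("Y", "B"), ("B", "Z")}
trace_edges: Set[Edge] = {
    ("x1", "a1"), ("a1", "y1"), ("y1", "b1"), ("b1", "z1"),
    ("x2", "a2"), ("a2", "y2"), ("y2", "b2"), ("b2", "z2"),
}
# h sends each trace node to the workflow node it is an instance of.
h = {node: node[0].upper() for edge in trace_edges for node in edge}

print(sorted(lineage("z1", ddep)))                      # ['x1', 'y1']
print(is_homomorphism(h, trace_edges, workflow_edges))  # True
```

A trace edge with no counterpart in the workflow graph, such as x1 ➔ b1, makes the check fail, which is one way to use the workflow graph to validate a provenance graph.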
  • 15. OPM extensions, principled
      core OPM / PIL(*): used, wasGeneratedBy, wasDerivedFrom
      (*) PIL = Provenance Interchange Language, W3C Provenance Working Group
  • 16. OPM extensions, principled
      core OPM / PIL: used, wasGeneratedBy, wasDerivedFrom
      Extensions around the core:
      - Processor types
      - Data types
      - (Additional context): who, when, where, ...
  • 17. OPM extensions, principled
      core OPM / PIL: used, wasGeneratedBy, wasDerivedFrom
      Extensions around the core:
      - Processor types: operators on lists (create, insert, delete, select..., map, fold)
      - Data types: nested ordered lists
      - (Additional context): who, when, where, ...
  • 18. OPM extensions, principled
      core OPM / PIL: used, wasGeneratedBy, wasDerivedFrom
      Extensions around the core:
      - Processor types: operators on lists (create, insert, delete, select..., map, fold); graph matching queries? relational queries?
      - Data types: nested ordered lists; graph models? relations (sets of tuples)?
      - (Additional context): who, when, where, ...
  • 19. Proposed extensions
      Table 2: New roles for Used and Contained relations
        Role        Used in relation   Context of use
        element     Contained          L Contained(element) x
        list        Used               P Used(list) L
        position    Used               P Used(position) p
        term        Used               List comprehensions, see Sec. 4.2
        generator   Used               List comprehensions, see Sec. 4.2
        filter      Used               List comprehensions, see Sec. 4.2
        function    Used               map, see Sec. 4.3
        operand     Used               map, see Sec. 4.3
      Table 3: Specialised OPM dependency relations
        Causal relation                            Example
        Contained(R) ⊆ [τ] × A × [Int]             L Contained([i1 ... in]) x: x was inserted into L at position [i1 ... in]
        wasSelectedFrom(R) ⊆ [τ] × [τ] × [Int]     L' wasSelectedFrom([i1 ... in]) L: L' at position [i1 ... in] was selected from L
        wasRemovedFrom(R) ⊆ [τ] × [τ] × [Int]      L' wasRemovedFrom([i1 ... in]) L: L' at position [i1 ... in] was deleted from L
        wasSameAs ⊆ A × A                          L wasSameAs x: inferred (various contexts)
  • 20. OPM fragments for elementary operations
      [Figure: one OPM graph fragment per list operation]
      - empty list: L wasGeneratedBy a process (label lost in extraction)
      - unit: L wasGeneratedBy P:unit; P:unit used(element) x; L contained x
      - insertion: L' wasGeneratedBy P:ins; P:ins used(list) L, used(element) x, used(position) p; L' wasDerivedFrom L; L' contained x
      - selection: L' wasGeneratedBy P:sel; P:sel used(list) L, used(position) p; L' wasSelectedFrom L
      - deletion: L' wasGeneratedBy P:del; P:del used(list) L, used(position) p; L' wasRemovedFrom L
  • 21. Operator composition → graph composition
      • Equalities on operators may translate into inferences on the graphs
      [Background: paper excerpt, partly cut off]
      [...] templates for the two operations. Additionally, however, the following equality holds whenever insertion is followed by selection, if no other intervening operation changes the state of the list:
        sel (ins x L p) p = x
      This translates into the following OPM inference rule:
        sameAs(L1, X) :- pType(P1, ins), used(P1, X, element), used(P1, Pos, position), wgby(L, P1),
                         pType(P2, sel), used(P2, L, list), used(P2, Pos, position), wgby(L1, P2).
  • 22. Operator composition → graph composition
      • Equalities on operators may translate into inferences on the graphs
      [Background: paper excerpt, partly cut off]
      [...] templates for the two operations. Additionally, however, the following equality holds whenever insertion is followed by selection, if no other intervening operation changes the state of the list:
        sel (ins x L p) p = x
      This translates into the following OPM inference rule:
        sameAs(L1, X) :- pType(P1, ins), used(P1, X, element), used(P1, Pos, position), wgby(L, P1),
                         pType(P2, sel), used(P2, L, list), used(P2, Pos, position), wgby(L1, P2).
      [Figure: insertion and selection fragments composed. P:ins used(list) L, used(element) x and used(position) p, and generated L' (wasDerivedFrom L, contained x). P:sel used(list) L' and used(position) p, and generated its result (wasSelectedFrom L'). Inferred edge: the result wasSameAs x.]
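To make the rule concrete, here is a minimal Python sketch that evaluates it over a tiny set of facts. It derives wasSameAs between the value produced by the selection and the inserted element, as the equality sel (ins x L p) p = x requires. The fact encoding, the node names (`P_ins`, `P_sel`, `L1`, `y`) and the function `infer_same_as` are assumptions made for illustration. This is a hand-written join, not a Datalog engine.

```python
from typing import Dict, Set, Tuple

Used = Tuple[str, str, str]   # (process, artifact, role)
Wgby = Tuple[str, str]        # (artifact, process): artifact wasGeneratedBy process


def infer_same_as(ptype: Dict[str, str], used: Set[Used], wgby: Set[Wgby]) -> Set[Tuple[str, str]]:
    """Pairs (selected value, inserted element) linked by the ins-then-sel rule."""
    inferred: Set[Tuple[str, str]] = set()
    ins_procs = [p for p, kind in ptype.items() if kind == "ins"]
    sel_procs = [p for p, kind in ptype.items() if kind == "sel"]
    for p1 in ins_procs:
        elements = {n for (p, n, role) in used if p == p1 and role == "element"}
        positions = {n for (p, n, role) in used if p == p1 and role == "position"}
        inserted_lists = {n for (n, p) in wgby if p == p1}
        for p2 in sel_procs:
            # sel must read the very list that ins generated, at the same position
            same_list = any((p2, lst, "list") in used for lst in inserted_lists)
            same_pos = any((p2, pos, "position") in used for pos in positions)
            if same_list and same_pos:
                results = {n for (n, p) in wgby if p == p2}
                inferred |= {(r, x) for r in results for x in elements}
    return inferred


ptype = {"P_ins": "ins", "P_sel": "sel"}
used = {
    ("P_ins", "L", "list"), ("P_ins", "x", "element"), ("P_ins", "p", "position"),
    ("P_sel", "L1", "list"), ("P_sel", "p", "position"),
}
wgby = {("L1", "P_ins"), ("y", "P_sel")}

print(infer_same_as(ptype, used, wgby))  # {('y', 'x')}
```

Because the selection must use exactly the list node generated by the insertion, the side condition that no intervening operation changed the list is enforced by node identity in the graph.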
  • 23. Approach applies to map, fold (reduce)...
      [Background: paper excerpt, partly cut off]
      [...] and p a position in the list L' = map f L. The following equality follows from the definition of map:
        sel (map f x) p = f (sel x p)     (10)
      This equality is captured by the inferred dependency (y wasSameAs y') in the enhanced OPM template of Fig. 7 (inference rule omitted for simplicity).
      [Figure: enhanced OPM template. Upper path: y wasGeneratedBy Q2:sel; Q2:sel used(list) L' and used(position) p; L' wasGeneratedBy P:map; P:map used(list) L and used(function) f; y wasSelectedFrom L'. Lower path: y' wasGeneratedBy R:apply; R:apply used(operand) x and used(function) f; x wasGeneratedBy Q1:sel; Q1:sel used(list) L and used(position) p. Inferred edge: y wasSameAs y'.]
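Equality (10) is easy to check on concrete data. The sketch below assumes flat Python lists, integer positions and a pure function f. The names `sel`, `map_list` and `check_sel_map` are hypothetical helpers, not part of any provenance library.

```python
from typing import Callable, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")


def sel(lst: List[T], p: int) -> T:
    """Select the element at position p."""
    return lst[p]


def map_list(f: Callable[[T], U], lst: List[T]) -> List[U]:
    """Apply f to every element, preserving positions."""
    return [f(v) for v in lst]


def check_sel_map(f: Callable[[T], U], lst: List[T]) -> bool:
    """True if sel (map f lst) p == f (sel lst p) holds at every position p."""
    return all(
        sel(map_list(f, lst), p) == f(sel(lst, p))  # upper path vs lower path
        for p in range(len(lst))
    )


words = ["cat", "dog", "large"]
print(check_sel_map(str.upper, words))  # True
```

The left-hand side corresponds to the upper path of the template (map, then select) and the right-hand side to the lower path (select, then apply), which is why the two results can be linked by wasSameAs.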
  • 24. Take-home message
      • OPM (PIL) is a candidate starting point for workflow-based provenance
      • extension mechanisms are provided, but they must be used sensibly
        – data types
        – processor types
      • Provenance of nested ordered lists used as a prototypical example
        – semantics of provenance graphs and graph composition grounded in the semantics of lists
      • Can this approach be useful for other interesting data types?
        – sets of tuples / relational algebra

Editor's Notes