Annotating the Behavior of
Scientific Modules Using Data
Examples: A Practical Approach
Khalid Belhajjame
Université Paris-Dauphine, LAMSADE
Khalid.Belhajjame@dauphine.fr
Scientific Workflows
We have recorded a dramatic
increase in the number of scientists
who use scientific modules as
building blocks in the composition of
their experiments.
In 2011, the EBI recorded 21
million invocations of the
scientific modules it hosts.
Typically, an experiment is designed
as a workflow, the steps of which
represent invocations of scientific
modules.
Scientific Module Annotation
Semantic annotations can be used to describe scientific modules.
Existing semantic annotations are confined to the description of
module parameters.
Annotations describing the
behavior of modules, that is, the
task they perform, are rarely available.
Designing an ontology that captures precisely the behavior of modules is
challenging.
Proposal: To describe the behavior of scientific modules using data examples
[Figure: a data example describes the behavior of a module]
Outline
[Figure: overview of the approach around a Scientific Module Registry]
Annotation: Annotate Module Parameters, Generate Data Examples
Use: Explore and Understand Modules, Compare Modules
Roles: Curator, Experiment Designer
Systems shown: APIHUT, Radiant, Meteor-s, Galaxy, Taverna, Vistrails
Generating Data Examples
Data examples can be used as a means to
describe the behavior of scientific modules.
Enumerating all possible data examples that
can be used to describe a given module may be
expensive, and may contain redundant data
examples that describe the same behavior.
Issue: which data examples should be used to characterize the functionality
of a given module?
Solution: We show how software testing techniques can be adapted
to the problem of generating data examples without relying on the
availability of the module specification, which often is not accessible.
Identifying the Classes of
Behavior of a Scientific Module
To generate data examples, we start by identifying the classes of
behavior of the module.
Consider a module m with an input parameter i. The
domain of legal values of i is divided into partitions p1, …,
pn. The partitioning is performed so as to cover all
classes of behavior of the module.
Doing so normally requires access to the module specification, which is
rarely available.
In this work, we use a different source of information, namely
the domain ontology used for annotating module parameters.
Identifying the Classes of
Behavior of a Scientific Module
An ontology can be viewed as a hierarchy of concepts.
We use this hierarchy to specify the classes of behavior
of scientific modules.
Consider the module getAccession,
which, given an input annotated as a
biological sequence, returns the
accession used for its identification.
The input domain of this module can be partitioned into the following:
BiologicalSequence, NucleotideSequence, RNASequence,
DNASequence, and ProteinSequence.
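The partitioning step can be sketched as follows. This is a minimal illustration rather than the author's implementation: the `SUBCONCEPTS` dictionary is a hand-written stand-in for a domain ontology, and the shape of the hierarchy (which concept sits under which) is an assumption that is merely consistent with the concept names listed on the slide.

```python
# Hypothetical toy ontology: concept -> direct subconcepts (assumed shape).
SUBCONCEPTS = {
    "BiologicalSequence": ["NucleotideSequence", "ProteinSequence"],
    "NucleotideSequence": ["RNASequence", "DNASequence"],
    "RNASequence": [],
    "DNASequence": [],
    "ProteinSequence": [],
}


def partitions(concept, subconcepts=SUBCONCEPTS):
    """Return the concept followed by all of its descendants, depth-first."""
    result = [concept]
    for child in subconcepts.get(concept, []):
        result.extend(partitions(child, subconcepts))
    return result


# Partitions for an input parameter annotated as BiologicalSequence.
print(partitions("BiologicalSequence"))
```

Each concept returned stands for one candidate class of behavior, so an input annotated with a more specific concept yields fewer partitions.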
Generating Data Examples Covering
Input Parameter Partitions
Given the partitions of input parameters identified
using the domain ontology, and given a pool of
annotated instances, the input values necessary for
constructing data examples can be automatically
identified:
Data examples covering the partitions in question can
then be constructed by invoking the module using the
input values identified.
[Excerpt from the paper:]
[...] that cover those partitions. Such data examples can be specified by
soliciting from the human annotator example input values that belong
to the respective partitions, and then invoking the module m
to obtain the corresponding output values, necessary for constructing
the data examples. The construction of such data examples
can, however, be fully automated if a pool of annotated instances is
available. Specifically, given pl, a pool of annotated instances, the
values of i necessary for constructing data examples that cover the
partitions of the input i of the module m can be obtained as follows:
$\{ \langle c, \mathit{getInstance}(c, pl) \rangle \ \text{s.t.}\ c \sqsubseteq \mathit{sem}(i) \}$
where getInstance(c, pl) is a function that returns an instance
of the concept c from the annotated pool of instances pl. Note that
this function returns a realization of the concept in question [25],
in the sense that the instance of c chosen is not an instance of any
strict subconcept of c, i.e. not an instance of any concept $c' \sqsubset c$.
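The selection of input values from a pool can be sketched as below. This is an illustrative sketch under stated assumptions, not the paper's code: `PARENT`, `POOL`, `is_subconcept`, `get_instance` and `input_values` are hypothetical names, the toy hierarchy is assumed, and each pool entry is assumed to be annotated with its most specific concept, so that matching the annotation exactly yields a realization of that concept.

```python
# Hypothetical toy ontology: concept -> direct parent (None for the root).
PARENT = {
    "BiologicalSequence": None,
    "NucleotideSequence": "BiologicalSequence",
    "ProteinSequence": "BiologicalSequence",
    "DNASequence": "NucleotideSequence",
    "RNASequence": "NucleotideSequence",
}

# Hypothetical pool: (value, most specific concept annotating the value).
POOL = [
    ("ACGTTGCA", "DNASequence"),
    ("ACGUUGCA", "RNASequence"),
    ("MKTAYIAK", "ProteinSequence"),
]


def is_subconcept(c, ancestor):
    """True if c equals ancestor or lies below it in the hierarchy."""
    while c is not None:
        if c == ancestor:
            return True
        c = PARENT[c]
    return False


def get_instance(c, pool):
    """Return a value annotated with exactly c (a realization), or None."""
    for value, concept in pool:
        if concept == c:
            return value
    return None


def input_values(sem_i, pool):
    """Map every concept c subsumed by sem_i to an instance from the pool."""
    return {c: get_instance(c, pool) for c in PARENT if is_subconcept(c, sem_i)}


print(input_values("NucleotideSequence", POOL))
```

A `None` entry signals a partition for which the pool holds no realization; in that case a value would still have to be supplied by a human annotator.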
Generating Data Examples Covering
Output Parameter Partitions
Constructing data examples based on
the partitioning of the domains of output parameters
can be difficult to implement.
Given a partition po of the output parameter o of a
module m, we need to find input values for m such that
the value produced for the output o
belongs to the partition po.
One source that we use for identifying (some of) the data
examples that cover the output partitions is the set of
data examples generated to cover the partitions of the
input parameters.
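Reusing the input-driven data examples to check output coverage can be sketched as follows. This is a minimal sketch with assumed names: `covered_output_partitions` and `classify` are hypothetical helpers, the two output partition names are invented for illustration, and the prefix-based classifier merely stands in for whatever mechanism assigns an output value to an ontology concept.

```python
def covered_output_partitions(data_examples, output_partitions, classify):
    """Return the output partitions hit by at least one data example."""
    covered = set()
    for _input_value, output_value in data_examples:
        partition = classify(output_value)
        if partition in output_partitions:
            covered.add(partition)
    return covered


# Hypothetical output partitions (invented concept names).
OUTPUT_PARTITIONS = {"ProteinAccession", "NucleotideAccession"}


def classify(accession):
    """Toy classifier: decide the partition from the value's prefix."""
    return "ProteinAccession" if accession.startswith("P") else "NucleotideAccession"


# Data examples (input, output) built while covering the input partitions.
examples = [("MKTAYIAK", "P-001"), ("MSLLTEVE", "P-002")]

covered = covered_output_partitions(examples, OUTPUT_PARTITIONS, classify)
uncovered = OUTPUT_PARTITIONS - covered
print(sorted(covered), sorted(uncovered))
```

Partitions left in `uncovered` are those for which the input-driven examples are not enough and further input values would have to be found.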
Evaluation
The method that we have just described is not an exact
method. Rather, it is a heuristic that provides a working
solution. Because of this:
The domain of a module may be over-partitioned, or
conversely, it may be under-partitioned.
We therefore assessed the effectiveness of the proposed method
by generating data examples for 252 scientific modules.
Note that the availability of a pool of annotated instances
is crucial to our method.
We constructed such a pool by harvesting existing
provenance traces of scientific workflows.
Evaluation: Metrics
Coverage
Completeness
Conciseness
Coverage
We were able to construct data examples that cover all
the partitions of the input parameters.
Moreover, the data examples generated were found to
cover most of the partitions of the output parameters.
Indeed, with the exception of the partitions of the
outputs of 19 modules (e.g., get_genes_by_enzyme,
link and binfo), all the partitions of the outputs of the
remaining 233 modules were covered by the data
examples generated.
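As a quick check on "most", the figures reported in the evaluation (252 modules in total, 19 with uncovered output partitions) give the share of modules whose output partitions were all covered:

```latex
\frac{252 - 19}{252} = \frac{233}{252} \approx 0.925
```

That is, roughly 92.5% of the modules had every output partition covered.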
Completeness
Conciseness
Understanding the Behavior of a
Module Using Data Examples
Question: Do data examples help human users understand
the behavior of scientific modules?
Evaluation exercise: given a module m, we adopted the
following two-step process:
1. In the first step, the user was asked to describe the
behavior of the module based on its name, the names of its
input and output parameters, and the structural and
semantic types of those parameters.
2. In the second step, the user was additionally given the data
examples that characterize the module and was asked to update the
description of the module's behavior if they deemed it necessary.
Understanding the Behavior of a
Module Using Data Examples
An analysis of the results and the modules showed that whether
human users could identify the behavior of a module is correlated
with the nature of the transformation carried out by the module.
The human users correctly identified the behavior of modules
implementing data retrieval, format transformation and identifier
mapping.
On the other hand, they were less successful with modules implementing
data filtering and complex data analysis, such as text mining.
Kind of data manipulation    # of modules
Format transformation        53
Data retrieval               51
Mapping identifiers          62
Filtering                    27
Data analysis                59
Table 3: Kinds of data manipulation carried out by the scientific
modules.
[Excerpt from the paper:]
[...] complex data analysis, data examples may not have the same value
as for other module kinds, as far as the human user is concerned.
Note, however, that a large proportion of scientific modules implement
format transformation, data retrieval and mapping identifiers,
which are referred to in the scientific workflow literature using the
term Shims [35]. For example, Table 3 classifies the modules that
we analyzed in the experiment. It shows that format transformation,
data retrieval and mapping identifiers modules represent between
them 66% of the total number of modules that we analyzed.
That said, it is worth stressing, as we will demonstrate in the next [...]
Comparing Scientific Modules
Using Data Examples
As well as understanding
scientific modules, users may
be interested in comparing the
behavior of two or more
modules.
Module comparison, as a
functionality, is particularly
requested by workflow
curators to repair broken
workflows.
Comparing Scientific Modules
Using Data Examples
Consider two modules m and m’, and consider that
the inputs and outputs of those modules are
semantically and structurally compatible.
To be able to compare the behavior of m and m’, we
generate data examples that characterize their behavior
using the method presented earlier.
However, to make the comparison of their behavior
straightforward, we generate the data examples of m
and m’ so that they have the same
input values.
Comparing Scientific Modules
Using Data Examples
By comparing the output values of the data examples
of m and m’ that have the same input values, we
determine if the two modules have behaviors that are:
Equivalent: the data examples of the two modules have
the same output values
Overlapping: Some (but not all) of the data examples of
the two modules have the same output values.
Disjoint: None of the data examples of the two modules
have the same output values.
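The three-way classification above can be sketched in a few lines. This is a minimal sketch, not the author's implementation: `compare_modules` is a hypothetical name, and each module's data examples are assumed to be given as a dictionary from input value to output value, built over the same input values.

```python
def compare_modules(examples_m, examples_m2):
    """Classify two modules as 'equivalent', 'overlapping' or 'disjoint'.

    Each argument maps an input value to the output the module returned.
    Only input values present in both dictionaries are compared.
    """
    shared_inputs = examples_m.keys() & examples_m2.keys()
    if not shared_inputs:
        raise ValueError("no common input values to compare")
    matches = sum(examples_m[x] == examples_m2[x] for x in shared_inputs)
    if matches == len(shared_inputs):
        return "equivalent"
    if matches == 0:
        return "disjoint"
    return "overlapping"


# Toy data examples for two hypothetical modules.
m = {"ACGT": "acc1", "MKTA": "acc2"}
m_prime = {"ACGT": "acc1", "MKTA": "other"}
print(compare_modules(m, m_prime))
```

Note that the verdict only holds relative to the data examples used: two modules judged equivalent here may still differ on inputs that no data example covers.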
Evaluation
To assess the effectiveness of the above method for
comparing modules’ behavior, we used it to assist in
the curation of broken workflows.
We identified 72 modules that are used in
scientific workflows (in the
myExperiment repository), that are no longer provided
by their suppliers, and for which we were able to
construct data examples.
We compared those modules with the 252 modules
that we characterized using data examples.
[Figure: chart of the comparison results; values shown: 16, 23, 33]
Conclusions
We showed that it is possible to characterize scientific
modules using data examples without relying on module
specifications.
We also presented two functionalities that use the
generated data examples:
Understanding of module behavior by human users
Module comparison
Research questions for future work:
How can we make data examples more concise (less redundant)?
How can we compose modules based only on data examples?


Edbt2014 talk
