2016-09-04 BioExcel SIG, ECCB, Amsterdam
Advances in Scientific
Workflow Environments
Carole Goble, Stian Soiland-Reyes
The University of Manchester
carole.goble@manchester.ac.uk
http://esciencelab.org.uk/
What is a Workflow?
• Orchestrating multiple
computational tasks
• Managing the control and
data flow between them
• In a world that is
homogeneous or
heterogeneous
• Tasks
– Local / remote
– Local / third party
– White, grey or black boxes
– Reliable / fragile
– Reserved / dynamic
– Various underpinning
infrastructure
– Various access controls
BioExcel: Biomolecular recognition
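The definition above (orchestrating tasks and managing the control and data flow between them) can be illustrated with a minimal sketch. It is not taken from any particular workflow system: the task names (`prepare`, `simulate`, `analyse`) and the `run_workflow` helper are hypothetical, and a real WFMS would add remote execution, fault handling and access control on top.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(tasks, dependencies, inputs):
    """Run tasks in dependency order, passing upstream results downstream.

    tasks:        name -> function taking a dict of available data
    dependencies: name -> set of upstream task names (control flow)
    inputs:       initial data, visible to every task under "inputs"
    """
    graph = {name: dependencies.get(name, set()) for name in tasks}
    results = {}
    for name in TopologicalSorter(graph).static_order():
        # Data flow: each task sees only the outputs of its declared upstreams.
        upstream = {dep: results[dep] for dep in graph[name]}
        results[name] = tasks[name]({"inputs": inputs, **upstream})
    return results


# Hypothetical three-step pipeline standing in for real tools.
tasks = {
    "prepare": lambda d: d["inputs"]["structure"].upper(),
    "simulate": lambda d: [d["prepare"]] * d["inputs"]["steps"],
    "analyse": lambda d: len(d["simulate"]),
}
dependencies = {"simulate": {"prepare"}, "analyse": {"simulate"}}

results = run_workflow(tasks, dependencies, {"structure": "1abc", "steps": 3})
print(results["analyse"])
```

Declaring the dependencies separately from the task bodies is what lets an engine reorder, parallelise or re-run parts of the graph without touching the tasks themselves.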
What is a Workflow?
Automation
– Automate computational aspects
– Repetitive pipelines, sweep campaigns
Scaling – compute cycles
– Make use of computational infrastructure
& handle large data
Abstraction – people cycles
– Shield complexity and incompatibilities
– Report, re-use, evolve, share, compare
– Repeat – Tweak – Repeat
– First class commodities
Provenance - reporting
– Capture, report and utilize logs and data
lineage for auto-documentation
– Traceable evolution, audit, transparency
– Compare
With thanks to Bertram Ludäscher: WORKS 2015 Keynote
Findable
Accessible
Interoperable
Reusable
(Reproducible)
https://pegasus.isi.edu/2016/02/11/pegasus-powers-ligo-gravitational-waves-detection-analysis/
Laser Interferometer Gravitational-Wave
Observatory – first detection of gravitational
waves from colliding black holes
Morphological, hemodynamic and
structural analyses linked to aneurysm
genesis, growth and rupture.
[Susheel Varma] http://www.vph-share.eu/
http://taverna.org.uk
Galaxy
https://usegalaxy.org/
Marine metagenomics
+ Bespoke Scripts
[Rob Finn]
Open PHACTS
https://www.knime.org/
BioExcel
workflow
https://www.openphacts.org/
Targets
Pharmacological queries
target, compound and pathway data
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0115460
Scripts, Ensemble toolkit, execution patterns
http://www.extasy-project.org/
http://www.myexperiment.org
WF Zoo
Advances in Scientific Workflow Environments
Workflow Patterns, templates
Data wrangling & analytics + Simulations + Instrument pipelines
http://tpeterka.github.io/maui-project/
The Future of Scientific Workflows, Report of DOE Workshop 2015,
http://science.energy.gov/~/media/ascr/pdf/programdocuments/docs/workflows_final_report.pd
Garijo et al., Common Motifs in Scientific Workflows: An Empirical Analysis, FGCS, 36, July 2014, 338–351
Workflow Patterns, templates
• Long running and complex code
• Tunable parameters and input sets
• Simulation sweeps / iterations
• Ensembles, comparisons
• Tricky set-ups, human-in-the-loop
interaction
• Computational steering
• In situ workflows – multiple tasks, same
box, within fixed time
– data locality.
– human-in-the-loop.
– capture provenance.
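The "tunable parameters" and "simulation sweeps" patterns listed above can be sketched as a small parameter sweep. This is an illustration only: `toy_energy` is a hypothetical stand-in for a long-running simulation code, and the parameter names are invented.

```python
from itertools import product


def sweep(simulate, grid):
    """Run `simulate` once for every combination of parameter values in `grid`."""
    names = sorted(grid)
    runs = []
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        runs.append({"params": params, "result": simulate(**params)})
    return runs


def toy_energy(temperature, pressure):
    # Hypothetical stand-in for an expensive simulation.
    return temperature * 0.5 + pressure


runs = sweep(toy_energy, {"temperature": [280, 300, 320], "pressure": [1, 2]})

# Ensemble comparison: pick the run with the lowest result.
best = min(runs, key=lambda run: run["result"])
print(len(runs), best["params"])
```

Keeping the parameters next to each result is a first step towards the provenance capture mentioned in the same list, since every output can be traced to the settings that produced it.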
Traction + Examples
Reuse behaviours
Exploratory vs Production
Different kinds of user / deployment
Developer – User Ratios
Biologist, Developer, Computational Scientist
Existing computational research
workflow systems
https://github.com/common-workflow-language/common-workflow-language/wiki/Existing-Workflow-systems
WFMS Zoo
“Multi-scale” WFMS
• Workflow
Management
System
– Its design and reporting
environment
– Its execution
environment
• The tasks
– tools, codes and services
and their execution
environments
• Stack layer
– App level, infrastructure
level
Component making
Tasks loosely coupled through files,
• execute on geographically distributed
clusters, clouds, grids across systems
• execute on multiple facilities
• call host services (web / grid services)
DAIC
Distributed Area/Instrument
Computing
“Multi-scale” WFMS
Tasks tightly coupled
• exchanging info over memory/storage
• network of supercomputers
• In situ workflows – multiple tasks, same
box, within fixed time
HPC
Interoperability
Portability
Granularity
Maintenance
Workflow Environment Ecosystem
Copernicus workflow engine for
parallel adaptive molecular dynamics
• Peer-to-peer distributed
computing platform
– high-level parallelization of
statistical sampling problems
• Consolidation of heterogeneous
compute resources
• Automatic resource matching of
jobs against compute resources
• Automatic fault tolerance of
distributed work
• Workflow execution engine to
define a problem (reporting) and
trace its results live (provenance)
• Flexible plugin facilities
– programs to be integrated to the
workflow execution engine
Free Energy
Workflow using
GROMACS
http://copernicus-computing.org/
COMPSs/PyCOMPSs:
Programmer Productivity
framework
• Sequential programming
– Parallelisation and
distribution heavy-lifting
– Dependency detection
• Infrastructure unaware
– Abstract application from
underlying infrastructure
– Portability
• Standard Programming
Languages
– Java, Python, C/C++
• No (or few!) APIs
– Standard Java
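The idea of sequential programming with automatic dependency detection can be sketched with standard-library futures. This is emphatically not the COMPSs or PyCOMPSs API: the `task` decorator, the thread pool and the function names are all assumptions made for illustration. The sketch only shows how passing one task's result into another can be enough for a runtime to infer the dependency.

```python
from concurrent.futures import Future, ThreadPoolExecutor

_pool = ThreadPoolExecutor(max_workers=4)


def task(func):
    """Turn a plain function into one that runs asynchronously in the pool."""
    def submit(*args):
        def call():
            # A Future argument is a data dependency: wait for its value first.
            resolved = [a.result() if isinstance(a, Future) else a for a in args]
            return func(*resolved)
        return _pool.submit(call)
    return submit


@task
def minimise(structure):
    return structure + ":min"


@task
def merge(a, b):
    return a + "+" + b


# Reads like sequential code. The two minimise calls share no data,
# so they may run in parallel; merge waits for both.
left = minimise("A")
right = minimise("B")
combined = merge(left, right).result()
print(combined)
```

Because tasks are submitted in program order, any task a later one waits on has already been queued ahead of it, which is why this simple scheme does not deadlock.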
Shield the
user/programmer
Exposure to the
infrastructure
System Design
Manage/minimize data transfers
Stop Press!
GUIs not essential!
• Canvas, drag-drop blocks, arrows,
run button
• Command-line & embedding in
developer or user applications
Scripts can be workflows!
• WMS<->Scripts
• Script vs Workflows/ASAP:
– Automation: *****
– Scaling: **
– Abstraction: *
– Provenance: **
Work close to a problem-
specific ad-hoc data model
Domain Specific Language
"programming-lite" scripts
• wire with declarative
"makefile"-like DAG
Plus
• procedural scripting and
expressions in languages
like JavaScript and Python
Nextflow, Snakemake,
Common Workflow Language
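The combination described above, a declarative "makefile"-like DAG wired together with small procedural expressions, can be sketched as follows. This is not Nextflow, Snakemake or CWL syntax: the `build` function and the rule names are hypothetical, and the data is invented.

```python
def build(target, rules, built=None):
    """Produce `target` by first building the inputs that its rule declares."""
    built = {} if built is None else built
    if target not in built:
        inputs, action = rules[target]
        args = [build(name, rules, built) for name in inputs]
        built[target] = action(*args)  # each target is built at most once
    return built[target]


# Declarative part: output name -> (input names, procedural action).
rules = {
    "reads": ((), lambda: ["ACGT", "ACGG", "TTGA"]),
    "filtered": (("reads",), lambda reads: [r for r in reads if r.startswith("AC")]),
    "count": (("filtered",), lambda filtered: len(filtered)),
}

print(build("count", rules))
```

The user asks for the final target only; the order of execution falls out of the declared inputs, as it does with a makefile.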
GUIs Are Essential → take-up by the user base
Workflowising script software eco-systems
prime example: provenance
ASAP
• common,
interoperable
provenance recording
– W3C PROV
ASAP
• YesWorkflow.org
– Annotations in script
yield workflow view
ASAP
• Library profilers
– noWorkflow
• runtime provenance
recorders
– Sumatra, RDataTracker
Provenance: the link between computation and results
W3C PROV model standard
record for reporting
compare diffs/discrepancies
provenance analytics
track changes, adapt
partial repeat/reproduce
carry attributions
compute credits
compute data quality/trust
select data to keep/release
optimisation and debugging
Metadata propagation – where was the
physical sample collected, and who
should be attributed?
Task-based abstractions: simplifying
provenance using motifs and tool
annotations
“Free energy calculation” rather than 5
steps including preparation of PDB files
and GROMACS execution
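A minimal sketch of provenance recording and lineage queries follows. It borrows relation names from the W3C PROV model (`used`, `wasGeneratedBy`, `wasAssociatedWith`) but stores them in plain Python lists rather than in a PROV serialisation. The `record` and `lineage` helpers, the file names and the agent name are all hypothetical.

```python
def record(prov, activity, func, inputs, output, agent):
    """Run `func` on named input entities and log PROV-style relations."""
    value = func(*(prov["entity"][name] for name in inputs))
    prov["entity"][output] = value
    prov["used"] += [(activity, name) for name in inputs]
    prov["wasGeneratedBy"].append((output, activity))
    prov["wasAssociatedWith"].append((activity, agent))
    return value


def lineage(prov, entity):
    """Return every entity that `entity` was transitively computed from."""
    sources = set()
    for out, activity in prov["wasGeneratedBy"]:
        if out == entity:
            for act, used in prov["used"]:
                if act == activity:
                    sources |= {used} | lineage(prov, used)
    return sources


prov = {
    "entity": {"sample.pdb": "raw"},
    "used": [],
    "wasGeneratedBy": [],
    "wasAssociatedWith": [],
}
record(prov, "prepare", lambda s: s + "-clean", ["sample.pdb"], "clean.pdb", "alice")
record(prov, "simulate", lambda s: s + "-traj", ["clean.pdb"], "traj.xtc", "alice")

print(sorted(lineage(prov, "traj.xtc")))
```

A lineage query of this kind is what makes questions such as "which sample does this result come from, and who should be attributed?" answerable after the fact.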
Provenance: the link between workflow variants,
workflow reuse and repurposing
W3C PROV model standard?
record for reporting
compare diffs/discrepancies
provenance analytics
track changes, adapt
carry attributions
compute design credits
versioning, forking, cloning
Nested workflows
functions by stealth
Copy and paste fragmentation
Designing for reuse
Find and Go
Software practices
Systematic reuse
Guidelines for persistently identifying
software using DataCite
https://epubs.stfc.ac.uk/work/24058274
https://www.force11.org/software-citation-principles
ASAP WFMS for FAIR Science
Automate: workflows,
programs and services folks
already use or want to use
Scale: Enable computational
productivity
Abstract: Enable human
productivity
Provenance: Record and use
Usability, Workflow, Plugged in Code, Reporting, Comparison
Thanks to Bertram Ludäscher
Dependency Management
Codes Behaviours & Reliability
● Task-specific “mini-workflow”
fragments
– e.g. using GROMACS, CPMD,
HADDOCK
● Packaged
– EGI VM images and Docker
containers
● Backed by existing registries
– ELIXIR’s bio.tools and EGI App DB
● Instantiated as cloud instances
– private (OpenNebula, OpenStack)
– public (e.g. Amazon AWS)
Application Building Blocks
BioExcel Virtualised Software Library
“transversal workflow units”, higher level operations
BioExcel Use cases
● Genomics
● Ensembl Molecular
simulations
● Free Energy simulations
● Multiscale modelling of
molecular basis for odor
and taste
● Biomolecular recognition
● Pharmacological queries
● Virtual Screening
Finding valid pathways through free-energy
landscapes: implementation of the “string of
swarms” method using Copernicus as a
workflow manager, and GROMACS as a
compute engine.
Workflow Interoperability.
• Common format for bioinformatics tool &
workflow execution
• Community based standards effort
• Designed for clusters & clouds
• Supports the use of containers (e.g. Docker)
• Specify data dependencies between steps
• Scatter/gather on steps
• Nest workflows in steps
• Develop your pipeline on your local computer
(optionally with Docker)
• Execute on your research cluster or in the cloud
• Deliver to users via workbenches
• EDAM ontology (ELIXIR-DK) to specify file
formats and reason about them: “FASTQ
Sanger” encoding is a type of FASTQ file
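The "scatter/gather on steps" feature listed above can be sketched in plain Python. A workflow format such as the one described expresses this declaratively; the `scatter_gather` helper and the `gc_content` step below are hypothetical illustrations, not part of any standard.

```python
from concurrent.futures import ThreadPoolExecutor


def scatter_gather(step, items, gather, max_workers=4):
    """Apply `step` to each item independently, then combine the results."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial = list(pool.map(step, items))  # map preserves input order
    return gather(partial)


def gc_content(seq):
    # Fraction of G and C bases in one sequence.
    return sum(base in "GC" for base in seq) / len(seq)


mean_gc = scatter_gather(
    gc_content,
    ["GGCC", "ATAT", "GCAT"],
    lambda values: sum(values) / len(values),
)
print(mean_gc)
```

Because each scattered item is processed independently, an engine is free to run the step on a laptop, a cluster or in the cloud without changing the step itself.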
Workflow Research Object Bundle
researchobject.org
Belhajjame et al. (2015) Using a suite of ontologies for preserving workflow-centric research objects,
J. Web Semantics doi:10.1016/j.websem.2015.01.003
application/vnd.wf4ever.robundle+zip
Z. Zhao et al., “Workflow bus for e-Science”, in IEEE e-Science 2006, Amsterdam
2007
2015
http://bioexcel.eu/events/bioexcel-workflow-training-for-computational-biomolecular-research/
Adam Hospital (IRB), Anna Montras (IRB), Stian Soiland-Reyes (UNIMAN), Alexandre Bonvin
(UU), Adrien Melquiond (UU), Josep Lluís Gelpí (BSC), Daniele Lezzi (BSC), Steven Newhouse
(EBI), Jose A. Dianes (EBI), Mark Abraham (KTH), Rossen Apostolov (KTH), Emiliano Ippoliti
(Jülich), Adam Carter (UEDIN), Darren J. White (UEDIN)
Slides: Bertram Ludäscher, Ewa Deelman, Vasa Curcin, Paolo Missier, Pinar Alper, Susheel
Varma, Rob Finn, Michael Crusoe, Rizos Sakellariou
Sign up
ASAP!
Bonus Slides
