FAIR Computational Workflows
Professor Carole Goble
The University of Manchester UK
EU Research Infrastructures ELIXIR, IBISBA, EOSC-Life
BioExcel Centre of Excellence
Software Sustainability Institute UK
FAIRDOM Consortium
carole.goble@manchester.ac.uk
16th Workshop on Workflows in Support of Large-Scale Science
November 15, 2021
20 years+
Computational workflows
decades in the making…
finally coming of age….?
doi: 10.1093/gigascience/giaa140
Nature 573, 149-150 (2019)
https://doi.org/10.1038/d41586-019-02619-z
https://doi.org/10.1038/s41592-021-01254-9
An open collaborative space for digital biology in Europe
https://lifescience-ri.eu/
https://www.eosc-life.eu/
Computational Workflows for Data intensive Bioscience
CryoEM Image Analysis
Metagenomic Pipelines
[Rob Finn]
[Carlos Oscar Sorzano Sanchez]
Nature 573, 149-150 (2019)
https://doi.org/10.1038/d41586-019-02619-z
Data pipelines, simulation
sweeps, workflow ensembles.
Mixture of workflow systems,
notebooks and scripts.
Chaining different codes.
Genome Annotation
[Romain Dallet]
High Throughput Sequencing
[Fabrice Allain]
Interactive &
exploratory analysis
Production, automated,
repetitive & workflow-
integrated software
Workflow System Landscape
Inter-twingled, mix and matching
Scripting
environments
Interactive Electronic
Research Notebooks
Repositories
Registries
Workflow
Management
Systems & execution
platforms
*https://s.apache.org/existing-workflow-systems
300+ Systems*
General and Specialised
General Repositories
https://snakemake.github.io/
Workflows are rules:
Graph of jobs for automatic parallelisation,
DIY package & containerisation
installation, auto-documentation
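The "workflows are rules" idea can be sketched in a few lines of Python. This is an illustration only, not how Snakemake itself is implemented: the rule names and file names are hypothetical, and the job graph is derived simply by matching each rule's inputs to the outputs of other rules. Rules in the same batch have no unmet dependencies, so they could run in parallel.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical rules: each lists the files it reads and the files it writes.
RULES = {
    "trim": {"inputs": ["reads.fastq"], "outputs": ["trimmed.fastq"]},
    "align": {"inputs": ["trimmed.fastq", "genome.fa"], "outputs": ["aligned.bam"]},
    "qc": {"inputs": ["trimmed.fastq"], "outputs": ["qc.html"]},
    "report": {"inputs": ["aligned.bam", "qc.html"], "outputs": ["report.html"]},
}


def job_graph(rules):
    """Map each rule to the set of rules that produce its inputs."""
    producer = {out: name for name, rule in rules.items() for out in rule["outputs"]}
    return {
        name: {producer[i] for i in rule["inputs"] if i in producer}
        for name, rule in rules.items()
    }


def parallel_batches(rules):
    """Group rules into batches; every rule in a batch is ready to run at once."""
    sorter = TopologicalSorter(job_graph(rules))
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


if __name__ == "__main__":
    for number, batch in enumerate(parallel_batches(RULES), start=1):
        print(number, batch)
```

Files that no rule produces (here `reads.fastq` and `genome.fa`) are treated as external inputs and impose no ordering.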
From frameworks to web based analysis platforms
Communities cluster round a few systems.
Take-up of a WfMS typically depends on its “plugged-in” support for data types & specific codes, the skill level of the workflow developers, and its popularity & sustainability.
Online portals: users build and reuse workflows around publicly available or user-uploaded data and pre-wrapped, pre-installed tools.
A FAIR data and workflow commons
sharing and running workflows
Workflows are:
an entry point to tools and datasets,
democratising resources
functions for FAIR data processing
and secure data processing
FAIR digital objects
Honour legacy & diversity of WfMS -> Buy-in & on-boarding of WfMS
FAIR Guiding Principles for Research Data
Findable, Accessible, Interoperable, Reusable
A set of guiding principles to enhance the value
of all digital resources and their reuse
by people and by machines
A community journey to common guidelines
The glue to federate data and services,
to apply to all objects
Benefit both consumers and producers.
The FAIR Research Data Principles
RDA FAIR Data Maturity Model. Specification and Guidelines https://zenodo.org/record/3909563#.YORYkUzTX19
https://www.go-fair.org/fair-principles/
Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016).
https://doi.org/10.1038/sdata.2016.18
tl;dr FAIR Research Data Principles
https://www.go-fair.org/fair-principles/
Persistent human-readable & machine-actionable metadata
• Linked
• Community standards
Persistent identifiers
Clear licensing and access rules
Protocols for machine accessibility & AAI
Registration
Searching & Indexing
Enabling automation
FAIR Research Data Principles update in a nutshell
Policy
Rallying point
I’m FAIR!
What is it?
Definition
Spectrum
Contextual
Methodology
FAIRification
FAIR by Design
Assessment
Compliance
Certification
FREE
Infrastructure
Services
Adoption
Incentives
Stewardship
Services
FAIR Research Software Principles
Software is a digital object but research software is not (just) data
https://www.rd-alliance.org/groups/fair-4-researchsoftware-fair4rs-wg
FAIR for Research Software (FAIR4RS) Working Group
FAIR4RS First Draft of FAIR4RS principles
Katz, et al PATTERNS 2, 2021
Lamprecht et al., 2020
FAIR Research Software Principles
Software is a digital object but research software is not (just) data
Findable
Accessible
Interoperable
I1. Software should read, write or exchange data in a way that meets domain-relevant community standards
I2. Software includes qualified references to other objects.
Reusable
R. The software is usable (it can be executed) & reusable (it can be understood, modified, built upon, or incorporated into other software).
R1. Software is richly described with a plurality of accurate & relevant attributes
R1.1. Software is made available with a clear & accessible software usage license
R1.2. Software is associated with detailed provenance
R1.3. Software meets domain-relevant community standards
R2. Software includes qualified references to other software
(Katz et al, 2021 PATTERNS, https://doi.org/10.1016/j.patter.2021.100222)
Enabling FAIR?
FAIRification. Assessment. Services.
Governance. Incentives.
FAIR takes a Village*
*Borgman, C. L., & Bourne, P. E. (2021). Why it takes a village to manage and share data.
Harvard Data Science Review (under Review). https://arxiv.org/abs/2109.01694
FAIR Computational
Workflow Principles?
FAIR Principles for Workflows
Abstraction 1: Hybrid Processual Digital Objects
Image credit: BioExcel Centre of Excellence
different
components,
codes,
languages,
third parties
FAIR Principles for Workflows
Abstraction 2: Compositional Objects
Interoperability and Reusability
FAIR Unit Test
FAIR Principles for Workflows
Method “Data” Objects
Workflows as
FAIR Software
FAIR+R and FAIR++
Quality, maturity, maintainability
The principles revised
Workflows as
FAIR Digital Objects
Data-like method objects
Associated objects
The principles adapted
Workflows as
FAIR Data Instruments
FAIRification of the dataflow
The data principles supported
C. Goble, S. Cohen-Boulakia, S. Soiland-Reyes, D. Garijo, Y. Gil, M.R. Crusoe, K. Peters & D. Schober. FAIR computational workflows. Data Intelligence 2(2020), 108–121.
doi: 10.1162/dint_a_000
Workflow Objects
Software Objects
Data FAIRification
FAIR enabling services
Services
Findable & Accessible
WORKS 2007
WORKS 2021
https://workflowhub.eu https://workflowhub.org
https://myexperiment.org
Findable & Accessible
register workflows with assigned PID + metadata in a searchable resource.
https://workflowhub.eu
Publishing Services
Journals
Digital Objects of Scholarship
published, cited, exchanged, reviewed, validated & reused
• Versioning, DOI/PID assignment
• Collections, workflow libraries
scripts
Repos
Containers Deploys
Tools
WfMS Agnostic degrees of onboarding, support & access
• Native repositories
• Metadata standards framework,
handle associated objects and links
between objects.
• Execution API
https://dockstore.org/
Link up providers and users
Building visibility & reputation
Close the
“Find – Get– Use – Credit”
loop
Credit, Attribution, Citation
Knowledge Graphs linking
out to OpenAIRE, DataCite
Associate workflows
Associate sister objects
myExperiment influence
Social aspects
Teams, People
licensing
authors
& credit
analytics
access
search
versions & status
other
workflows Smoothing onboarding
• GitHub integration
• WfMS metadata support
• Accessible by API
scripts
Tool Registry Service API
Accessible
“metadata & workflow retrievable by PID using a standardized communication protocol”
GitHub page: https://github.com/ga4gh/tool-registry-service-schemas
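To make "retrievable by PID using a standardized communication protocol" concrete, here is a minimal sketch that builds the URL for fetching a workflow descriptor from a TRS-style registry. It assumes the GA4GH TRS v2 path layout (`/ga4gh/trs/v2/tools/{id}/versions/{version_id}/{type}/descriptor`) as I understand it; check the schemas repository before relying on it. The registry host and the tool identifier are hypothetical, and no request is actually sent.

```python
from urllib.parse import quote

# Assumed TRS v2 path prefix; verify against the GA4GH schemas repository.
TRS_PREFIX = "/ga4gh/trs/v2"


def descriptor_url(base_url, tool_id, version_id, descriptor_type="CWL"):
    """Build the URL of a workflow descriptor in a TRS-style registry.

    Identifiers may contain characters such as '/' or '#', so they are
    percent-encoded before being placed in the path.
    """
    tool = quote(tool_id, safe="")
    version = quote(version_id, safe="")
    return (
        f"{base_url.rstrip('/')}{TRS_PREFIX}"
        f"/tools/{tool}/versions/{version}/{descriptor_type}/descriptor"
    )


if __name__ == "__main__":
    # Hypothetical registry host and tool identifier, for illustration only.
    print(descriptor_url("https://registry.example.org", "#workflow/demo", "1"))
```

Because the path is the same for any conforming registry, a client written against one service should be able to retrieve workflows from another by changing only the base URL.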
Accessible
an implementation of
GA4GH WES
https://github.com/sapporo-wes/sapporo
top layer over the tools, the
workflow languages, and the
workflow runners
GA4GH TRS
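A sketch of what a WES run submission could look like, under stated assumptions: the endpoint path (`/ga4gh/wes/v1/runs`) and the form field names follow the GA4GH WES v1 specification as I recall it, while the service host, workflow URL and parameters are hypothetical. The function only prepares the request; it does not send anything and is not taken from any particular WES implementation.

```python
import json

# Assumed WES v1 path for submitting runs; verify against the GA4GH WES spec.
WES_RUNS_PATH = "/ga4gh/wes/v1/runs"


def wes_run_request(base_url, workflow_url, workflow_type, workflow_type_version, params):
    """Return (endpoint, form_fields) for a WES run submission, without sending it."""
    endpoint = base_url.rstrip("/") + WES_RUNS_PATH
    fields = {
        "workflow_url": workflow_url,
        "workflow_type": workflow_type,
        "workflow_type_version": workflow_type_version,
        # Parameters travel as a JSON-encoded string in the form data.
        "workflow_params": json.dumps(params, sort_keys=True),
    }
    return endpoint, fields


if __name__ == "__main__":
    # Hypothetical service and workflow, for illustration only.
    endpoint, fields = wes_run_request(
        "https://wes.example.org",
        "https://example.org/workflows/demo.cwl",
        "CWL",
        "v1.0",
        {"reads": "reads.fastq"},
    )
    print(endpoint)
    print(fields)
```

The same request shape serves different workflow languages; only `workflow_type` and `workflow_type_version` change, which is the sense in which the service sits as a layer over the runners.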
FAIR Workflows are FAIR Software
lifecycle support for living objects
Git Coupling
Publishing
Status
Testing
Benchmarking
Extensible Metadata Framework
catering for those processual FAIR criteria
Common metadata
about the workflow,
tools & parameters
Canonical workflow
description of the
steps of the workflow
Type the input and
outputs of the steps
Run Provenance / Histories / Tests
WfMS native history logs
Format for packaging a workflow, its metadata and companion objects (links to containers, data etc) for exchange, archiving, reporting, citing.
WorkflowHub and Services
create and consume Crates
FAIR Digital Object
Adopting Open
Community efforts
FAIR Metadata for Machines & Humans
https://www.commonwl.org
WfMS neutral canonical description
Linked to containerised tools
• Portable, reusable workflows
• Standardise expression of workflow
• Standardise compatible I/O for steps
• Reduce vendor / project lock-in
• Workflow comparisons
• Collaboration & knowledge transfer
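Why typed, standardised step I/O matters can be shown with a small check. This is a sketch only: it does not use CWL or any CWL tooling, and the `Step` class, the step names and the format labels are all hypothetical. It reports every link where the format a step produces differs from the format the next step expects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """A hypothetical workflow step with typed ports (port name -> format label)."""
    name: str
    inputs: dict
    outputs: dict


def incompatible_links(links, steps):
    """Return the links whose source output format differs from the target input format.

    Each link is ((source_step, output_port), (target_step, input_port)).
    """
    by_name = {step.name: step for step in steps}
    problems = []
    for (src, out_port), (dst, in_port) in links:
        produced = by_name[src].outputs[out_port]
        expected = by_name[dst].inputs[in_port]
        if produced != expected:
            problems.append((src, out_port, dst, in_port))
    return problems


if __name__ == "__main__":
    steps = [
        Step("trim", {"reads": "fastq"}, {"trimmed": "fastq"}),
        Step("align", {"reads": "fastq"}, {"alignment": "bam"}),
        Step("call", {"alignment": "cram"}, {"variants": "vcf"}),
    ]
    links = [
        (("trim", "trimmed"), ("align", "reads")),
        (("align", "alignment"), ("call", "alignment")),
    ]
    print(incompatible_links(links, steps))
```

A real system would compare richer type information than a single label, but the principle is the same: declared I/O lets composition errors be caught before anything runs.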
https://openwdl.org/
Computational workflow profile
Formal
parameter
profile
https://bioschemas.org
Opinionated use of schema.org,
the web resource mark-up used
by search engines, knowledge
graphs and scientific resources.
Computational tool
profile
FAIR Metadata for Machines & Humans
data and software objects
RO-Crate Digital Objects
Packaging everything together regardless where or what it is
https://www.researchobject.org/ro-crate/
Self-describing format for packaging up scattered resources
Integrated view + context: metadata and PIDs reference digital and real things - datasets, workflows, services, software & people, places etc.
Web-native, COTS, machine and human readable, search engine & developer friendly.
Infrastructure independent & self-describing
Avoid repository silos
Extensible and open-ended profiles, duck-typing, cope with diversity and legacy
RO-Crate Profile
https://www.researchobject.org/ro-crate/profiles
WfMS produce and consume
Workflow-RO-Crates
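What a Workflow RO-Crate's metadata might contain can be sketched by building the JSON-LD by hand. The overall shape (a metadata descriptor, a root dataset whose `mainEntity` is the workflow, and Bioschemas-style `ComputationalWorkflow` and `FormalParameter` entities) follows my reading of the RO-Crate 1.1 specification and the Workflow RO-Crate profile; the file name, workflow name, licence and input parameter are hypothetical, and a conforming crate may need further properties.

```python
import json


def workflow_crate(workflow_path, name, license_url, language_name, date_published):
    """Build a minimal, illustrative ro-crate-metadata.json document as a dict."""
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {   # Metadata descriptor: says which entity the crate is about.
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {   # Root dataset: the package itself, pointing at its main workflow.
                "@id": "./",
                "@type": "Dataset",
                "name": name,
                "datePublished": date_published,
                "license": {"@id": license_url},
                "hasPart": [{"@id": workflow_path}],
                "mainEntity": {"@id": workflow_path},
            },
            {   # The workflow file, typed as a computational workflow.
                "@id": workflow_path,
                "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
                "name": name,
                "programmingLanguage": {"@id": "#language"},
                "input": [{"@id": "#input-reads"}],
            },
            {"@id": "#language", "@type": "ComputerLanguage", "name": language_name},
            # Hypothetical typed input, in the style of the formal parameter profile.
            {"@id": "#input-reads", "@type": "FormalParameter", "name": "reads"},
        ],
    }


if __name__ == "__main__":
    crate = workflow_crate(
        "workflow.cwl",
        "Demo alignment workflow",
        "https://spdx.org/licenses/Apache-2.0",
        "Common Workflow Language",
        "2021-11-15",
    )
    print(json.dumps(crate, indent=2))
```

Everything is plain JSON-LD in a single file, so the crate can be read without any RO-Crate-specific tooling; the entities are linked by `@id` rather than nested.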
Provenance & Preservation
Transparency & Reuse matter
more than Reproducibility?
Traceability more important?
When is it FAIR enough?
WfMS heavy lifting needed …
R1.2: (Meta)data, software and workflows are
associated with detailed provenance – data lineage,
workflow lineage & workflow logs
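Data lineage and workflow logs can be illustrated with a toy recorder. This is a sketch under simplifying assumptions, not any WfMS's provenance model: steps are plain Python functions over in-memory bytes, data items are identified by SHA-256 digests, and the step names are hypothetical.

```python
import hashlib
from datetime import datetime, timezone


def digest(data: bytes) -> str:
    """Content-based identifier for a data item."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def run_step(name, func, inputs, log):
    """Run one step and append a record of what it used and generated."""
    outputs = func(inputs)
    log.append({
        "step": name,
        "ended": datetime.now(timezone.utc).isoformat(),
        "used": {key: digest(value) for key, value in inputs.items()},
        "generated": {key: digest(value) for key, value in outputs.items()},
    })
    return outputs


def lineage(log, target_digest):
    """Walk the log backwards and list the steps that led to a given output."""
    steps, wanted = [], {target_digest}
    for record in reversed(log):
        if wanted & set(record["generated"].values()):
            steps.append(record["step"])
            wanted |= set(record["used"].values())
    return list(reversed(steps))


if __name__ == "__main__":
    log = []
    upper = run_step("upper", lambda i: {"text": i["text"].upper()}, {"text": b"acgt"}, log)
    final = run_step("reverse", lambda i: {"text": i["text"][::-1]}, upper, log)
    print(lineage(log, digest(final["text"])))
```

The log doubles as a workflow log (what ran, when) and as data lineage (which inputs each output depends on), which is the kind of record R1.2 asks for.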
ProvenanceWeek 2021, T7 Workshop on Provenance for Transparent Research, July 2021
https://iitdbgroup.github.io/ProvenanceWeek2021/t7.html
A2. Metadata are accessible, even when the workflow is no longer available
Read-reproducible as a method description if it no longer runs. Metadata preserved beyond any one service, republished in a long-term archive.
R. The workflow is usable (it can be executed) and reusable (it can be understood, modified, built upon, or incorporated into other workflows).
FAIR Services
Law of decline
All workflows decay over time.
Complexity of Dependencies
Description persists -> Review, Repair, Remake
Reusable and Usable
i.e. can be executed once accessed
Quality, maturity, maintainability -> FAIR++
Multiple wf/test backends:
Galaxy Planemo, CWL,
Jenkins …
Check workflow
performance,
provenance on
containers, memory
usage …
Testing and monitoring -> metadata
into WorkflowHub
Portability
High-level workflow
execution service backend,
sensitive data analysis &
running on private clouds
“Interoperable” Execution
Is a workflow reusable if it’s resource greedy or too slow or needs special resources or unavailable data or cannot be ported or run by anyone other than the developers? Like Google ML…
Interoperable and Reusable Workflows…
a portability viewpoint
All good WORKS
stuff which I am not
going to talk about….
exascale computing
Composability -> Interoperability and Reusability
Community driven Reusability first
I1: Software interoperates through APIs and metadata standards.
FAIR Unit tested & validated canonical workflows & blocks.
Well documented, well maintained
CWL Canonical descriptions
• Recycle descriptions and sub-workflows
• Platform independent exchange and comparison
• Standardised I/O formats
Thanks: Rob Finn
Composability -> Interoperability and Reusability
Community driven Reusability first
I1: Software interoperates through APIs and metadata standards.
FAIR Unit tested & validated canonical workflows & blocks.
Canonical Workflow Frameworks
for Research (CWFR)
https://www.rd-alliance.org/canonical-workflow-frameworks-research-cwfr
https://fairdo.org/wg/fdo-cwfr/
Thanks: Stian Soiland-Reyes
Workflow Data FAIRification & FAIR Data by Design
Assisted by WfMS
Challenge of diverse API & AAI landscape, formats and packaging
Reviewing
Curation
Certification
Governance
Best Practice
Golden
Examples
Canonical
workflows
Design for
FAIR Data
and Reuse
Metadata generated for data products
FAIR Reusable Workflow Design is Hard and Hard Work
Nearly always post-hoc
Third party dependencies 
Technology Debt and Refactoring
Software Engineering
In the Sweatshop
of Science who
has the Time?
Inclination?
Skills? Resources?
FAIR Reusable Workflow Design is Hard and Hard Work
Nearly always post-hoc
Workflow developers
Tool and data set
providers
Workflow readiness
FAIR Unit Testing
Brack, et al (2021). 10 Simple Rules for
making a software tool workflow-ready.
https://doi.org/10.5281/zenodo.5636487
What’s the reward?
What’s a FAIR Unit?
How will we assess?
How to refactor?
WfMS platforms
Programmatic access to workflow metadata
Common metadata, PID & API standards
FAIR Software.
Service that is FAIR enabling*
Ramezani et al . (2021). D2.7 Framework for
assessing FAIR Services (V1.0_DRAFT).
https://doi.org/10.5281/zenodo.5336234
Can we FAIR assist? automate?
Abstraction framework for granularity assessment & (semi)-automated refactoring
2021 IEEE International Conference on Cluster Computing
DOI: 10.1109/Cluster48925.2021.00053
Professionalisation
Community activism
Service activism
Can we FAIR assist? Best practice, stewardship.
Training
https://society-rse.org/
WORKFLOW
APPLICATION USER
FAIR takes a Village
Shared responsibility, shared benefits, shared curation
TOOL
DEVELOPER
WORKFLOW
USER
WFMS
DEVOP
WORKFLOW
DEVELOPER
& CUSTODIAN
COMPUTATIONAL
USER
Platform Service
Workflow
Labour
Use
Reach
Software
What can a lab do to be FAIR?
As developer and user of workflows, datasets, tools?
Get Help
Skill the Team with
Best Practice
Register/Publish
Cite & credit makers
Document
for Strangers
https://fair-software.nl/
Professionalisation
Pre and post hoc
Corpas M et al (2018) A FAIR guide for data providers to maximise sharing of human genomic data, PLOS Comp Bio
Boeckhout M et al (2018) The FAIR guiding principles for data stewardship: fair enough?, E J of Human Genetics
Use WfMSs and
tools that are FAIR
enabling
Checklists
A Management Plan
Use Standards
Use IDs
What can the WfMS Community do?
Collective action by a few WfMS and services nails 80% of workflow use.
Ferreira da Silva et al, A Community Roadmap for Scientific Workflows Research and Development, arXiv:2110
Best Practice
Support a FAIR
metadata framework
TL;DL FAIR Computational Workflows
FAIR Principles laid the foundation for sharing
digital assets
Computational workflows are Hybrid Digital
Objects of scholarship
Should support the creation of FAIR data and
themselves adhere to FAIR Principles
Metadata matters
FAIR takes a Village.
Life Sciences has begun work.
Acknowledgements
The WorkflowHub Club, Bioschemas Community, RO-Crate
Community, CWL Community, Galaxy Europe, EOSC-Life and
ELIXIR Tools Platform.
https://about.workflowhub.eu/community/
Special Thanks
Rafael Ferreira da Silva (Oak Ridge)
Stian Soiland-Reyes (U Manchester / U Amsterdam)
Paul Brack, Stuart Owen, Finn Bacall, Alan Williams, Doug Lowe (U Manchester)
Björn Grüning (U Freiburg)
Frederik Coppens (VIB)
Sarah Jones (GEANT)
Herve Menager (Pasteur Institute)
Sarah Cohen-Boulakia (U Paris-Saclay)
Dan Katz (U Illinois Urbana-Champaign)
Simone Leo (CRS4)
Laura Rodriguez-Navas (BSC)
José Mª Fernández (BSC)
Denis Yuen (Ontario Institute for Cancer Research)
Tristan Glatard (Concordia University)
Chris Erdmann (AGU)
WorkflowHub https://workflowhub.eu/ and https://workflowhub.org
EOSC-Life https://www.eosc-life.eu/
ELIXIR http://elixir-europe.org
RO-Crate https://www.researchobject.org/ro-crate/
Galaxy Europe https://galaxyproject.eu/
Bioschemas https://bioschemas.org
Common Workflow Language https://www.commonwl.org/
WorkflowsRI https://workflowsri.org/
Dockstore https://dockstore.org/
RDMkit https://rdmkit.elixir-europe.org
Whither Workflow Interoperability? FAR not FAIR?
(Question by Rafael Ferreira da Silva)
What is Workflow Interoperability?
• CWL /WDL - WfMS independence rather than interoperability?
• Execution of sub-workflows – (re)usability rather than interoperability?
• Multiple WfMS execution – are WfMS really executed in mixed workflows
or is this front/backends that can run multiple WfMS (e.g. TES/WES)?
• Composability of workflow units - Data I/O compatibility
I1. Software should read, write or exchange data in a way that meets domain-relevant
community standards

FAIR Computational Workflows

  • 1. FAIR Computational Workflows Professor Carole Goble The University of Manchester UK EU Research Infrastructures ELIXIR, IBISBA, EOSC-Life BioExcel Centre of Excellence Software Sustainability Institute UK FAIRDOM Consortium carole.goble@manchester.ac.uk 16th Workshop on Workflows in Support of Large-Scale Science November 15, 2021
  • 2. 20 years+ Computational workflows decades in the making… finally coming of age….? doi: 10.1093/gigascience/giaa140 Nature 573, 149-150 (2019) https://doi.org/10.1038/d41586-019-02619-z https://doi.org/10.1038/s41592-021-01254-9
  • 3. An open collaborative space for digital biology in Europe https://lifescience-ri.eu/ https://www.eosc-life.eu/
  • 4. Computational Workflows for Data intensive Bioscience CryoEM Image Analysis Metagenomic Pipelines [Rob Finn] [Carlos Oscar Sorzano Sanchez] Nature 573, 149-150 (2019) https://doi.org/10.1038/d41586-019-02619-z Data pipelines, simulation sweeps, workflow ensembles. Mixture of workflow systems, notebooks and scripts. Chaining different codes. Genome Annotation [Romain Dallet] High Throughput Sequencing [Fabrice Allain] Interactive & exploratory analysis Production, automated, repetitive & workflow- integrated software
  • 5. Workflow System Landscape Inter-twingled, mix and matching Scripting environments Interactive Electronic Research Notebooks Repositories Registries Workflow Management Systems & execution platforms *https://s.apache.org/existing-workflow-systems 300+ Systems* General and Specialised General Repositories
  • 6. https://snakemake.github.io/ Workflows are rules: Graph of jobs for automatic parallelisation, DIY package & containerisation installation, auto-documentation From frameworks to web based analysis platforms Communities cluster round a few systems. Take up of a WfMS typically depends on the “plugged-in” support of data types & specific codes, skills level of the workflow developers, its popularity & sustainability. Online portals users build and reuse workflows around publicly available or user-uploaded data and pre- wrapped, pre-installed tools.
  • 7. A FAIR data and workflow commons sharing and running workflows Workflows are: an entry point to tools and datasets, democratising resources functions for FAIR data processing and secure data processing FAIR digital objects Honour legacy & diversity of WfMS -> Buy-in & on-boarding of WfMS
  • 8. A FAIR data and workflow commons Workflows are: an entry point to tools and datasets, democratising resources functions for FAIR data processing and secure data processing FAIR digital objects Honour legacy & diversity of WfMS -> Buy-in & on-boarding of WfMS
  • 9. FAIR Guiding Principles for Research Data Findable, Accessible, Interoperable, Reusable A set of guiding principles to enhance the value of all digital resources and their reuse by people and by machines A community journey to common guidelines The glue to federate data and services, to apply to all objects Benefit both consumers and producers.
  • 10. The FAIR Research Data Principles RDA FAIR Data Maturity Model. Specification and Guidelines https://zenodo.org/record/3909563#.YORYkUzTX19 https://www.go-fair.org/fair-principles/ Wilkinson, M., Dumontier, M., Aalbersberg, I. et al. The FAIR Guiding Principles for scientific data management and stewardship. Sci Data 3, 160018 (2016). https://doi.org/10.1038/sdata.2016.18
  • 11. tl;dr FAIR Research Data Principles https://www.go-fair.org/fair-principles/ Persistent human readable & machine- actionable metadata • Linked • Community standards Persistent identifiers Clear licensing and access rules Protocols for machine accessibility & AAI Registration Searching & Indexing Enabling automation
  • 12. FAIR Research Data Principles update in a nutshell Policy Rallying point I’m FAIR! What is it? Definition Spectrum Contextual Methodology FAIRification FAIR by Design Assessment Compliance Certification FREE Infrastructure Services Adoption Incentives Stewardship Services
  • 13. FAIR Research Software Principles Software is a digital object but research software is not (just) data https://www.rd-alliance.org/groups/fair-4-researchsoftware-fair4rs-wg FAIR for Research Software (FAIR4RS) Working Group FAIR4RS First Draft of FAIR4RS principles Katz, et al PATTERNS 2, 2021 Lamprecht et al., 2020
  • 14. FAIR Research Software Principles. Software is a digital object but research software is not (just) data. Findable. Accessible. Interoperable: I1. Software should read, write or exchange data in a way that meets domain-relevant community standards. I2. Software includes qualified references to other objects. Reusable: R. The software is usable (it can be executed) & reusable (it can be understood, modified, built upon, or incorporated into other software). R1. Software is richly described with a plurality of accurate & relevant attributes. R1.1. Software is made available with a clear & accessible software usage license. R1.2. Software is associated with detailed provenance. R1.3. Software meets domain-relevant community standards. R2. Software includes qualified references to other software. (Katz et al, 2021 PATTERNS, https://doi.org/10.1016/j.patter.2021.100222)
  • 15. Enabling FAIR? FAIRification. Assessment. Services. Governance. Incentives. FAIR takes a Village* *Borgman, C. L., & Bourne, P. E. (2021). Why it takes a village to manage and share data. Harvard Data Science Review (under Review). https://arxiv.org/abs/2109.01694 FAIR Computational Workflow Principles?
  • 16. FAIR Principles for Workflows Abstraction 1: Hybrid Processual Digital Objects
  • 17. Image credit: BioExcel Centre of Excellence different components, codes, languages, third parties FAIR Principles for Workflows Abstraction 2: Compositional Objects Interoperability and Reusability FAIR Unit Test
  • 18. FAIR Principles for Workflows Method “Data” Objects Workflows as FAIR Software FAIR+R and FAIR++ Quality, maturity, maintainability The principles revised Workflows as FAIR Digital Objects Data-like method objects Associated objects The principles adapted Workflows as FAIR Data Instruments FAIRification of the dataflow The data principles supported C. Goble, S. Cohen-Boulakia, S. Soiland-Reyes, D. Garijo, Y. Gil, M.R. Crusoe, K. Peters & D. Schober. FAIR computational workflows. Data Intelligence 2(2020), 108–121. doi: 10.1162/dint_a_000 Workflow Objects Software Objects Data FAIRification FAIR enabling services Services
  • 19. Findable & Accessible WORKS 2007 WORKS 2021 https://workflowhub.eu https://workflowhub.org https://myexperiment.org
  • 20. Findable & Accessible: register workflows with an assigned PID + metadata in a searchable resource. https://workflowhub.eu Publishing Services, Journals. Digital Objects of Scholarship: published, cited, exchanged, reviewed, validated & reused • Versioning, DOI/PID assignment • Collections, workflow libraries. Scripts, Repos, Containers, Deploys, Tools. WfMS agnostic: degrees of onboarding, support & access • Native repositories • Metadata standards framework, handle associated objects and links between objects • Execution API https://dockstore.org/
  • 21. Link up providers and users Building visibility & reputation Close the “Find – Get– Use – Credit” loop Credit, Attribution, Citation Knowledge Graphs linking out to OpenAIRE, DataCite Associate workflows Associate sister objects myExperiment influence Social aspects Teams, People
  • 22. licensing authors & credit analytics access search versions & status other workflows Smoothing onboarding • GitHub integration • WfMS metadata support • Accessible by API scripts
  • 23. Tool Registry Service API Accessible “metadata & workflow retrievable by PID using a standardized communication protocol” GitHub page: https://github.com/ga4gh/tool-registry-service-schemas
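"Retrievable by PID using a standardized communication protocol" can be made concrete with a small sketch of how a client might build GA4GH Tool Registry Service (TRS) v2 request URLs. The path layout follows the public TRS v2 schema as I understand it; the host name, tool identifier and helper names below are hypothetical, and no network call is made.

```python
from urllib.parse import quote

# Assumed base path of a TRS v2 deployment.
TRS_PREFIX = "/ga4gh/trs/v2"


def trs_tool_url(base_url, tool_id):
    """URL for the metadata record of one registered tool or workflow."""
    # Identifiers may contain '/' or '#', so they are fully percent-encoded.
    return f"{base_url.rstrip('/')}{TRS_PREFIX}/tools/{quote(tool_id, safe='')}"


def trs_descriptor_url(base_url, tool_id, version_id, descriptor_type):
    """URL for the primary descriptor (e.g. CWL, WDL) of one tool version."""
    return (
        f"{trs_tool_url(base_url, tool_id)}"
        f"/versions/{quote(version_id, safe='')}"
        f"/{quote(descriptor_type, safe='')}/descriptor"
    )


# Hypothetical registry host and identifier, for illustration only.
example = trs_descriptor_url("https://trs.example.org/", "workflow/42", "1", "CWL")
print(example)
```

A client would then issue an HTTP GET on the resulting URL; authentication and error handling are left out of this sketch.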
  • 24. Accessible an implementation of GA4GH WES https://github.com/sapporo-wes/sapporo top layer over the tools, the workflow languages, and the workflow runners GA4GH TRS
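As a rough illustration of what submitting a workflow to a GA4GH Workflow Execution Service (WES) involves, the sketch below assembles the endpoint and form fields for a run request. The field names follow the WES v1 specification as I understand it; the host, workflow URL, parameter values and the `wes_run_request` helper are hypothetical, and nothing here is specific to Sapporo.

```python
import json


def wes_run_request(base_url, workflow_url, workflow_type,
                    workflow_type_version, params):
    """Return the endpoint and form fields for a WES 'run workflow' POST."""
    endpoint = f"{base_url.rstrip('/')}/ga4gh/wes/v1/runs"
    form = {
        "workflow_url": workflow_url,
        "workflow_type": workflow_type,  # e.g. "CWL" or "WDL"
        "workflow_type_version": workflow_type_version,
        # Workflow inputs travel as a JSON-encoded string inside the form.
        "workflow_params": json.dumps(params, sort_keys=True),
    }
    return endpoint, form


# Hypothetical values, for illustration only.
endpoint, form = wes_run_request(
    "https://wes.example.org/",
    "https://example.org/workflows/align.cwl",
    "CWL",
    "v1.2",
    {"reads": "a.fastq"},
)
print(endpoint)
print(form)
```

The pair could be handed to any HTTP client as a multipart form POST; combined with a TRS lookup, the workflow URL would come from the registry rather than being hard-coded.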
  • 25. FAIR Workflows are FAIR Software: lifecycle support for living objects. Git Coupling, Publishing, Status, Testing, Benchmarking.
  • 26. Extensible Metadata Framework catering for those processual FAIR criteria Common metadata about the workflow, tools & parameters Canonical workflow description of the steps of the workflow Type the input and outputs of the steps Run Provenance / Histories / Tests WfMS native history logs Format for packaging a workflow, its metadata and companion objects (links to containers, data etc) for exchange, archiving, reporting, citing. WorkflowHub and Services create and consume Crates FAIR Digital Object Adopting Open Community efforts
  • 27. FAIR Metadata for Machines & Humans https://www.commonwl.org WfMS neutral canonical description Linked to containerised tools • Portable, reusable workflows • Standardise expression of workflow • Standardise compatible I/O for steps • Reduce vendor / project lock-in • Workflow comparisons • Collaboration & knowledge transfer https://openwdl.org/
  • 28. Computational workflow profile Formal parameter profile https://bioschemas.org Opinionated use of schema.org, the web resource mark-up used by search engines, knowledge graphs and scientific resources. Computational tool profile FAIR Metadata for Machines & Humans data and software objects
  • 29. RO-Crate Digital Objects: packaging everything together regardless of where or what it is. https://www.researchobject.org/ro-crate/ Self-describing format for packaging up scattered resources: integrated view + context, metadata and PIDs; reference digital and real things - datasets, workflows, services, software & people, places etc. Web-native, COTS, machine and human readable, search engine & developer friendly. Infrastructure independent & self-describing. Avoid repository silos. Extensible and open-ended profiles: duck-typing, cope with diversity and legacy.
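To give a feel for what "self-describing" means in practice, here is a minimal sketch of the JSON-LD that an `ro-crate-metadata.json` file for a workflow might contain, built as a plain Python dictionary. It follows my reading of the RO-Crate 1.1 specification and the Workflow RO-Crate typing convention; the file name, title, description, date and licence are hypothetical, and the result has not been run through a crate validator.

```python
import json


def minimal_workflow_crate(workflow_path, name, description, date_published, license_url):
    """Build a minimal RO-Crate metadata document describing one workflow file."""
    return {
        "@context": "https://w3id.org/ro/crate/1.1/context",
        "@graph": [
            {   # Metadata descriptor: says which entity is the crate root.
                "@id": "ro-crate-metadata.json",
                "@type": "CreativeWork",
                "conformsTo": {"@id": "https://w3id.org/ro/crate/1.1"},
                "about": {"@id": "./"},
            },
            {   # Root dataset: the crate as a whole.
                "@id": "./",
                "@type": "Dataset",
                "name": name,
                "description": description,
                "datePublished": date_published,
                "license": {"@id": license_url},
                "mainEntity": {"@id": workflow_path},
                "hasPart": [{"@id": workflow_path}],
            },
            {   # The workflow itself, typed as both a file and a workflow.
                "@id": workflow_path,
                "@type": ["File", "SoftwareSourceCode", "ComputationalWorkflow"],
                "name": name,
            },
        ],
    }


# Hypothetical values, for illustration only.
crate = minimal_workflow_crate(
    "workflow.cwl",
    "Example alignment workflow",
    "Toy crate used to illustrate the packaging format.",
    "2021-11-15",
    "https://spdx.org/licenses/Apache-2.0",
)
print(json.dumps(crate, indent=2))
```

Because every entity is referenced by `@id`, companion objects such as containers, test data or authors can be added as further graph entries without changing the existing ones.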
  • 31. Provenance & Preservation Transparency & Reuse matter more than Reproducibility? Traceability more important? When is it FAIR enough? WfMS heavy lifting needed … R1.2: (Meta)data, software and workflows are associated with detailed provenance – data lineage, workflow lineage & workflow logs
  • 32. ProvenanceWeek 2021, T7 Workshop on Provenance for Transparent Research, July 2021 https://iitdbgroup.github.io/ProvenanceWeek2021/t7.html
  • 33. A2. metadata are accessible, even when the workflow is no longer available Read-reproducible as a method description if no longer runs, Metadata preserved beyond any one service republished in a long-term archive R. The workflow is usable (it can be executed) and reusable (it can be understood, modified, built upon, or incorporated into other workflows). FAIR Services Law of decline All workflows decay over time. Complexity of Dependencies Description persists -> Review, Repair, Remake
  • 34. Reusable and Usable, i.e. can be executed once accessed. Quality, maturity, maintainability -> FAIR++. Multiple wf/test backends: Galaxy Planemo, CWL, Jenkins … Check workflow performance, provenance on containers, memory usage … Testing and monitoring -> metadata into WorkflowHub. Portability: high-level workflow execution service backend, sensitive data analysis & running on private clouds. “Interoperable” Execution. Is a workflow reusable if it’s resource greedy, or too slow, or needs special resources or unavailable data, or cannot be ported or run by anyone other than the developers? Like Google ML…
  • 35. Interoperable and Reusable Workflows… a portability viewpoint All good WORKS stuff which I am not going to talk about…. exascale computing
  • 36. Composability -> Interoperability and Reusability Community driven Reusability first I1: Software interoperates through APIs and metadata standards. FAIR Unit tested & validated canonical workflows & blocks. Well documented, well maintained CWL Canonical descriptions • Recycle descriptions and sub-workflows • Platform independent exchange and comparison • Standardised I/O formats Thanks: Rob Finn
  • 37. Composability -> Interoperability and Reusability. Community driven. Reusability first. I1: Software interoperates through APIs and metadata standards. FAIR Unit tested & validated canonical workflows & blocks. Canonical Workflow Frameworks for Research (CWFR) https://www.rd-alliance.org/canonical-workflow-frameworks-research-cwfr https://fairdo.org/wg/fdo-cwfr/ Thanks: Stian Soiland-Reyes
  • 38. Workflow Data FAIRification & FAIR Data by Design Assisted by WfMS Challenge of diverse API & AAI landscape, formats and packaging Reviewing Curation Certification Governance Best Practice Golden Examples Canonical workflows Design for FAIR Data and Reuse Metadata generated for data products
  • 39. FAIR Reusable Workflow Design is Hard and Hard Work Nearly always post-hoc Third party dependencies  Technology Debt and Refactoring Software Engineering In the Sweatshop of Science who has the Time? Inclination? Skills? Resources?
  • 40. FAIR Reusable Workflow Design is Hard and Hard Work. Nearly always post-hoc. Workflow developers. Tool and data set providers: workflow readiness, FAIR Unit Testing. Brack et al. (2021). 10 Simple Rules for making a software tool workflow-ready. https://doi.org/10.5281/zenodo.5636487 What’s the reward? What’s a FAIR Unit? How will we assess? How to refactor? WfMS platforms: programmatic access to workflow metadata; common metadata, PID & API standards; FAIR Software. Service that is FAIR enabling* Ramezani et al. (2021). D2.7 Framework for assessing FAIR Services (V1.0_DRAFT). https://doi.org/10.5281/zenodo.5336234
  • 41. Can we FAIR assist? automate? Abstraction framework for granularity assessment & (semi-)automated refactoring. 2021 IEEE International Conference on Cluster Computing DOI: 10.1109/Cluster48925.2021.00053
  • 42. Professionalisation Community activism Service activism Can we FAIR assist? Best practice, stewardship. Training https://society-rse.org/
  • 43. WORKFLOW APPLICATION USER FAIR takes a Village Shared responsibility, shared benefits, shared curation TOOL DEVELOPER WORKFLOW USER WFMS DEVOP WORKFLOW DEVELOPER & CUSTODIAN COMPUTATIONAL USER Platform Service Workflow Labour Use Reach Software
  • 44. What can a lab do to be FAIR? As developer and user of workflows, datasets, tools? Get Help Skill the Team with Best Practice Register/Publish Cite & credit makers Document for Strangers https://fair-software.nl/ Professionalisation Pre and post hoc Corpas M et al (2018) A FAIR guide for data providers to maximise sharing of human genomic data, PLOS Comp Bio Boeckhout M et al (2018) The FAIR guiding principles for data stewardship: fair enough?, E J of Human Genetics Use WfMSs and tools that are FAIR enabling Checklists A Management Plan Use Standards Use IDs
  • 46. What can the WfMS Community do? Collective action by a few WfMS and services nails 80% workflow use. Ferreira da Silva et al, A Community Roadmap for Scientific Workflows Research and Development, arXiv:2110 Best Practice Support a FAIR metadata framework
  • 47. TL;DL FAIR Computational Workflows FAIR Principles laid the foundation for sharing digital assets Computational workflows are Hybrid Digital Objects of scholarship Should support the creation of FAIR data and themselves adhere to FAIR Principles Metadata matters FAIR takes a Village. Life Sciences has begun work.
  • 48. Acknowledgements: The WorkflowHub Club, Bioschemas Community, RO-Crate Community, CWL Community, Galaxy Europe, EOSC-Life and ELIXIR Tools Platform. https://about.workflowhub.eu/community/ Special Thanks: Rafael Ferreira da Silva (Oak Ridge), Stian Soiland-Reyes (U Manchester / U Amsterdam), Paul Brack, Stuart Owen, Finn Bacall, Alan Williams, Doug Lowe (U Manchester), Björn Grüning (U Freiburg), Frederik Coppens (VIB), Sarah Jones (GEANT), Herve Menager (Pasteur Institute), Sarah Cohen-Boulakia (U Paris-Saclay), Dan Katz (U Illinois Urbana-Champaign), Simone Leo (CRS4), Laura Rodriguez-Navas (BSC), José Mª Fernández (BSC), Denis Yuen (Ontario Institute for Cancer Research), Tristan Glatard (Concordia University), Chris Erdmann (AGU). WorkflowHub https://workflowhub.eu/ and https://workflowhub.org EOSC-Life https://www.eosc-life.eu/ ELIXIR http://elixir-europe.org RO-Crate https://www.researchobject.org/ro-crate/ Galaxy Europe https://galaxyproject.eu/ Bioschemas https://bioschemas.org Common Workflow Language https://www.commonwl.org/ WorkflowsRI https://workflowsri.org/ Dockstore https://dockstore.org/ RDMkit https://rdmkit.elixir-europe.org
  • 49. Whither Workflow Interoperability? FAR not FAIR? (Question by Rafael Ferreira da Silva) What is Workflow Interoperability? • CWL/WDL – WfMS independence rather than interoperability? • Execution of sub-workflows – (re)usability rather than interoperability? • Multiple WfMS execution – are WfMS really executed in mixed workflows, or are these front/backends that can run multiple WfMS (e.g. TES/WES)? • Composability of workflow units – Data I/O compatibility. I1. Software should read, write or exchange data in a way that meets domain-relevant community standards
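One reading of "composability of workflow units - Data I/O compatibility" is that two units compose only if each connected output is in a format the receiving input accepts. The sketch below checks that under a deliberately simple assumption: formats are plain string labels compared for equality. The `Step` type, the step names and the format labels are hypothetical; a real system would more likely compare ontology terms or schema identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """A workflow unit with named, format-typed input and output ports."""
    name: str
    inputs: dict   # port name -> format label
    outputs: dict  # port name -> format label


def check_links(upstream, downstream, links):
    """Return a list of problems for the given (output_port, input_port) links."""
    problems = []
    for out_port, in_port in links:
        produced = upstream.outputs.get(out_port)
        expected = downstream.inputs.get(in_port)
        if produced is None:
            problems.append(f"{upstream.name} has no output '{out_port}'")
        elif expected is None:
            problems.append(f"{downstream.name} has no input '{in_port}'")
        elif produced != expected:
            problems.append(
                f"{upstream.name}.{out_port} ({produced}) does not match "
                f"{downstream.name}.{in_port} ({expected})"
            )
    return problems


# Hypothetical steps, for illustration only.
align = Step("align", {"reads": "fastq"}, {"alignments": "bam"})
call = Step("call_variants", {"alignments": "bam", "reference": "fasta"}, {"variants": "vcf"})

print(check_links(align, call, [("alignments", "alignments")]))
print(check_links(align, call, [("alignments", "reference")]))
```

An empty list means the units can be chained as wired; any entry names the mismatch that would need a conversion step or a different connection.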