Data-intensive bioinformatics on HPC and Cloud
Ola Spjuth <ola.spjuth@farmbio.uu.se>
Department of Pharmaceutical Biosciences
and Science for Life Laboratory
Uppsala University
Today: We have access to high-throughput
technologies to study biological phenomena
2017: Human whole genome sequenced
in 3 days for ~$1100
…requires supercomputers
for analysis and storage
Massively parallel sequencing…
2017: Illumina HiSeq X systems. 15K whole human
genomes per year
2016: NGI data velocity 950 Mbp/minute ≈ 16 Mbp/s
Mode of operation
[Diagram: scientists send samples to the platforms; data is pre-processed at NGI, delivered, and analysed in research projects at SNIC.]
A national e-infrastructure
• Software + reference data
• Support
• Education
• Compute resources
• Storage resources
• Efficiency + automation
What we sequenced at NGI /
National distribution
7
2016
Some statistics
[Charts: storage usage; projects at SNIC-UPPMAX, data-intensive bioinformatics vs. other disciplines; support tickets.]
Biggest challenge: Data growth
• Storage is filling up. Projects do not end. Users do not clean up data. WGS
projects are very large.
• At the heart of the problem: Services are currently free of charge.
Our strategy:
• Cost of data storage and analysis should be assessed and included in the
budget before data production
• Move data away from expensive storage near clusters
– Constrain project lifetimes (shorten allowed time for data on systems)
– Move towards tiered storage solutions
– Improve efficiency in analyses (education, monitoring, support)
• Investigate scaling out to other centres and public cloud providers
• Long-term storage: Unresolved question…
9
NGS users
• Key observations
– Storage is the biggest challenge
– Many, often inefficient, users and lots of software (admin burden, support,
education)
– Free resources (no cost) do not promote efficient usage
• Resource strategies
– When investing in computational hardware, it takes a long time from
funding decision until the resources are operational (10-12 months on
average).
– Expansion of resources is done at specific points in time, with low
flexibility in between.
– Decisions on resources are made by the SNIC board with limited
influence from life-science scientists (SciLifeLab)
• Can we improve on this?
10
Cloud computing
• Purchase a service instead of hardware
• Pay-per-use pricing
• Scale up and down as you need
• Virtual infrastructure/machines/storage
11
Cloud in Bioinformatics
How can we take advantage of cloud resources?
Simplest example:
• Start a pre-made virtual machine image (VMI)
• Upload data
• Perform the scientific task
• Download results
• Terminate the instance
Easy to scale this up to many instances!
Or… is it?
• What if I want to run 100 instances in parallel?
• What if I want a new tool? Or a later version?
• Do I need to upload data every time?
12
So we want to set up and use a virtual
cluster
• Multiple compute nodes
• Network
• Distributed storage
• Firewall, DNS, reverse proxy, etc.
So, we now have a virtual cluster. And now?
Batch-like system
– Install a queueing system, e.g. SLURM
– Install bioinformatics software on the VMI
Big Data system
– Install HDFS + Hadoop/Spark on the system
(There are tools that help automate these procedures)
13
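The batch-like route above can be made concrete with a job script. A sketch of a SLURM submission script (the project allocation, partition, module names and the `bwa` command line are hypothetical placeholders, not a recipe for any specific cluster):

```bash
#!/bin/bash
# SLURM reads the #SBATCH lines; to the shell they are just comments.
#SBATCH -A proj2017-0-00        # project allocation (hypothetical)
#SBATCH -p core -n 8            # 8 cores
#SBATCH -t 12:00:00             # wall-time limit
#SBATCH -J align_sample1        # job name

# Load site-installed software (module names vary between clusters).
module load bioinfo-tools bwa

# Hypothetical alignment step for one sample.
bwa mem -t 8 ref.fa sample1_R1.fq.gz sample1_R2.fq.gz > sample1.sam
```

Submitted with `sbatch`, the same script pattern works whether the SLURM cluster is physical or virtual.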
Why cloud in the life sciences?
• Access to resources
– Flexible configurations
– On-demand
– Cost-efficient?
• Collaborate on international level
– Publish/federate data
– E.g. Large sequencing initiatives, “move compute to the
data”
• New types of analysis environments
– Hadoop/Spark/Flink etc.
– Microservices, Docker, Kubernetes, Mesos
14
Challenges with cloud
• Tradition: strong HPC heritage in academia
– Sweden: existing HPC resources, funded by the Research
Council, and personnel at six centres (SNIC)
• Economy: Cost model is new
– Difficult to assess the costs
• Legal: Working with sensitive data
• Educational: New technology for many
15
Some SciLifeLab cloud options
16
SNIC Cloud in Sweden
● Geographically distributed, federated IaaS cloud
based on 2nd-generation HPC hardware
● Built using OpenStack
Needs in bioinformatics
• Primarily resources with a lot of RAM and storage (high I/O)
• Preferably a transparent system; users don't want to deal with
e-infrastructure at all
• How to work with storage (tiered?)
• How to work with storage (tiered?)
18
Virtual Machines and Containers
Virtual machines:
• Package entire systems (heavy)
• Completely isolated
• Suitable in cloud environments
Containers:
• Share the host OS
• Smaller, faster, portable
• E.g. Docker
19
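As a concrete example of the container side, a single tool can be packaged in a few lines of Dockerfile (base image and package choice are illustrative only; in practice exact versions should be pinned):

```dockerfile
# Package one bioinformatics tool in one container (illustrative sketch).
FROM ubuntu:22.04
RUN apt-get update \
    && apt-get install -y --no-install-recommends samtools \
    && rm -rf /var/lib/apt/lists/*
# The container does one thing: run samtools with the given arguments.
ENTRYPOINT ["samtools"]
```

After `docker build -t samtools-demo .`, the image runs anywhere Docker runs, e.g. `docker run --rm samtools-demo --version`.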
Microservices
• Decompose functionality into small, loosely coupled, on-demand
services communicating via APIs
– "Do one thing and do it well"
• Services are easy to replace and language-agnostic
– Minimize risk, maximize agility
– Suitable for loosely coupled teams
– Portable and easy to scale
– Multiple services can be chained into larger tasks
Software containers (e.g. Docker) are
ideal for microservices!
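The "do one thing and do it well" idea can be made concrete with a toy service. The sketch below uses only the Python standard library (no framework); the `/revcomp` endpoint and the reverse-complement task are invented for illustration, standing in for any small bioinformatics function exposed behind an HTTP API:

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

COMPLEMENT = str.maketrans("ACGT", "TGCA")

class RevCompService(BaseHTTPRequestHandler):
    """One microservice, one job: reverse-complement a DNA sequence."""

    def do_GET(self):
        if self.path.startswith("/revcomp/"):
            seq = self.path.rsplit("/", 1)[1].upper()
            body = json.dumps({"result": seq.translate(COMPLEMENT)[::-1]})
            self.send_response(200)
            self.send_header("Content-Type", "application/json")
            self.end_headers()
            self.wfile.write(body.encode())
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep the example quiet
        pass

def start_service():
    """Start the service on a free local port; return the server object."""
    server = HTTPServer(("127.0.0.1", 0), RevCompService)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server

def call_service(port, seq):
    """Act as a client, as another service in a chain would."""
    url = f"http://127.0.0.1:{port}/revcomp/{seq}"
    with urllib.request.urlopen(url) as resp:
        return json.loads(resp.read())["result"]

if __name__ == "__main__":
    srv = start_service()
    print(call_service(srv.server_address[1], "AAGT"))  # prints ACTT
    srv.shutdown()
```

Because the contract is just HTTP + JSON, the same service could be reimplemented in any language without its callers noticing.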
Orchestrating containers (e.g. Kubernetes)
• Origin: Google
• A declarative way of launching containers
• Start, stop, update, and manage a cluster of machines running
containers in a consistent and maintainable way
• Suitable for microservices
[Diagram: containers scheduled and packed onto nodes]
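Declarative here means describing the desired state and letting the orchestrator converge to it. A minimal Kubernetes Deployment sketch (the names, image and resource numbers are hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: seq-tool                 # hypothetical service name
spec:
  replicas: 3                    # desired state: keep 3 copies running
  selector:
    matchLabels:
      app: seq-tool
  template:
    metadata:
      labels:
        app: seq-tool
    spec:
      containers:
      - name: seq-tool
        image: example/seq-tool:1.0   # hypothetical container image
        resources:
          requests:
            cpu: "1"
            memory: 2Gi
```

Applied with `kubectl apply -f`, this keeps three replicas running: if a container or node dies, the scheduler packs a replacement onto another node.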
Connecting the microservices
• A natural way of using containers is to connect them
into a (scientific) workflow.
• Tools like Pachyderm (http://pachyderm.io/),
Luigi (https://github.com/spotify/luigi) and
Galaxy (https://galaxyproject.org/)
can assist with this.
• Goal: reproducible, fault-tolerant, scalable execution.
22
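The workflow idea can be sketched without any particular engine. The toy runner below is pure Python; the task names and the three-step pipeline are made up, and real tools such as Luigi or Galaxy add scheduling, caching and fault tolerance on top of this basic pattern:

```python
class Task:
    """A named unit of work with explicit upstream dependencies."""
    def __init__(self, name, run, requires=()):
        self.name = name
        self.run = run            # callable producing this task's output
        self.requires = list(requires)

def execute(tasks):
    """Run tasks in dependency order; each task runs exactly once.
    Returns the execution order. (A real engine would also handle
    retries, caching of finished outputs, and parallel branches.)"""
    by_name = {t.name: t for t in tasks}
    done, order = set(), []

    def visit(task):
        if task.name in done:
            return
        for dep in task.requires:
            visit(by_name[dep])
        task.run()
        done.add(task.name)
        order.append(task.name)

    for task in tasks:
        visit(task)
    return order

# A made-up three-step sequencing pipeline: QC, then alignment, then calling.
results = {}
pipeline = [
    Task("variant_call", lambda: results.setdefault("vcf", "variants"),
         requires=["align"]),
    Task("align", lambda: results.setdefault("bam", "alignments"),
         requires=["qc"]),
    Task("qc", lambda: results.setdefault("qc_report", "ok")),
]
```

Running `execute(pipeline)` yields the order qc, align, variant_call even though the tasks were declared in the opposite order; declaring dependencies instead of an execution order is what makes workflows reproducible and restartable.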
How can regular users take advantage of
these technologies?
• A Virtual Research Environment (VRE) is an appealing option
– Easy and user-friendly access to computational resources, tools and
data, commonly for a scientific domain
– Usually accessed from a browser
• Multi-tenant VRE: log into a shared system
• Private VRE: deploy on your favorite cloud provider
23
Virtual Research Environments
[Diagram: a researcher and other researchers on one side, tools and data on the other; VREs aim to bridge this gap.]
[Diagram: the researcher reaches tools, data, and compute and storage resources through a Virtual Research Environment.]
PhenoMeNal
• Horizon 2020 project, 2015-2018
• Virtual Research Environments (VRE), microservices, workflows
• Towards interoperable and scalable metabolomics data analysis
• Private environments for sensitive data
http://phenomenal-h2020.eu/
PhenoMeNal approach and stack: KubeNow
• Enable users to deploy their own virtual
infrastructure on an IaaS provider
• Containerize tools, orchestrate microservices
with workflow systems on top of Kubernetes
• Stack: Packer, Terraform, kubeadm, kubectl and Cloudflare;
container images on DockerHub, code on GitHub
Users should not see this…
Users should see this!
29
End-to-end MS (mass spectrometry) analysis
30
Research focus in my group
• e-Science methods development: smart data management,
predictive modeling
• Applied e-Science research: drug discovery and
individualized diagnostics
• e-Infrastructure development: automation, Big Data
[Diagram: data management and predictive modeling at the centre, connected to privacy preservation, workflows, Big Data frameworks, data federation and compute federation.]
[Figure: scatter plots of job efficiency (%) vs. date, 2014-2017, for NGS projects and other projects on shared HPC; a marker on the NGS panel indicates when efficiency feedback to users began.]
Selected research questions
How can we improve efficiency on shared HPC for data-intensive bioinformatics?
1. M. Dahlö, D. Schofield, W. Schaal and O. Spjuth. Tracking the NGS revolution: usage and system support of bioinformatics
projects on shared high-performance computing clusters. In preparation.
2. O. Spjuth, E. Bongcam-Rudloff, J. Dahlberg, M. Dahlö, A. Kallio, L. Pireddu, F. Vezzi and E. Korpelainen. Recommendations on
e-infrastructures for next-generation sequencing. GigaScience, 2016, 5:26.
3. S. Lampa, M. Dahlö, P. I. Olason, J. Hagberg and O. Spjuth. Lessons learned from implementing a national infrastructure in
Sweden for storage and analysis of next-generation sequencing data. GigaScience, 2013, 2:9.
Data locality?
Outsourcing?
Martin Dahlö
Selected research questions
Can Big Data frameworks aid data-intensive bioinformatics?
1. A. Siretskiy, L. Pireddu, T. Sundqvist and O. Spjuth. A quantitative assessment of the Hadoop framework for analyzing
massively parallel DNA sequencing data. GigaScience, 2015, 4:26.
2. L. Ahmed, A. Edlund, E. Laure and O. Spjuth. Using iterative MapReduce for parallel virtual screening. 2013 IEEE 5th
International Conference on Cloud Computing Technology and Science (CloudCom), vol. 2, pp. 27-32, 2013.
3. M. Capuccini, L. Carlsson, U. Norinder and O. Spjuth. Conformal prediction in Spark: large-scale machine learning with
confidence. IEEE/ACM 2nd International Symposium on Big Data Computing (BDC), Limassol, 2015, pp. 61-67.
4. M. Capuccini, L. Ahmed, W. Schaal, E. Laure and O. Spjuth. Large-scale virtual screening on public cloud resources with
Apache Spark. Journal of Cheminformatics, 2017, 9:15.
Laeeq
Valentin
Marco
Efficient Virtual Screening
with Apache Spark and
Machine Learning
Hadoop pipeline scales better than HPC
and is economical for current data sizes
"EasyMapReduce: Leverage the power of Spark and Docker
to scale scientific tools in MapReduce fashion"
35
https://spark-summit.org/east-2017/events/easymapreduce-leverage-the-power-of-spark-and-docker-to-scale-scientific-tools-in-mapreduce-fashion/
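The pattern named in the talk title can be illustrated in miniature. The sketch below is not the EasyMapReduce code: it is a stripped-down, pure-Python stand-in, with a GC-count function playing the role of a containerized command-line tool and a thread pool playing the role of Spark executors:

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import chain

def gc_count(partition):
    """Stand-in for running one containerized CLI tool over one chunk
    of records; here it just counts G/C bases per sequence."""
    return [seq.count("G") + seq.count("C") for seq in partition]

def map_partitions(records, n_partitions, tool):
    """Split records into contiguous chunks, apply the same tool to each
    chunk in parallel, and concatenate the per-chunk outputs in order."""
    size = -(-len(records) // n_partitions)  # ceiling division
    parts = [records[i:i + size] for i in range(0, len(records), size)]
    with ThreadPoolExecutor(max_workers=n_partitions) as pool:
        return list(chain.from_iterable(pool.map(tool, parts)))
```

For example, `map_partitions(["GGCC", "AT", "GAT"], 2, gc_count)` returns `[4, 0, 1]`. The scientific tool never needs to know about the parallelism: it only ever sees one chunk.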
Selected research questions
How useful are Scientific Workflows in data-intensive research?
O. Spjuth et al. Experiences with workflows for automating data-intensive bioinformatics. Biology Direct,
2015, 10:43.
S. Lampa, J. Alvarsson and O. Spjuth. Towards agile large-scale predictive modelling in drug discovery with
flow-based programming design principles. Journal of Cheminformatics, 2016, 8:67.
Samuel, Jon
• Streamline analysis on high-performance e-infrastructures
• Support reproducible data analysis
• Enable large-scale data analysis
Selected research questions
How can we deploy smart, high-availability services with APIs?
http://www.openrisknet.org
• Horizon 2020 project, 2017-2020
• E-infrastructure for chemical safety assessment
• Multi-tenant virtual environments, microservices
• APIs, "semantic interoperability"
• Academia-industry collaboration
• Much focus on standardizing chemical data and predictive modeling
Staffan
Jonathan
Arvid
Research questions around the
corner
• Public and private data sources are not static. How can we
continuously improve predictive models as data changes?
• We can generate too much data. Can predictive modeling aid data
acquisition, storage and analysis?
38
Reactive/continuous modeling
[Diagram: data sources are coordinated, integrated, versioned and monitored; models are trained and assessed against the changing data, then published and archived for users.]
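The loop in the diagram can be sketched in a few lines. Assumptions: the "model" here is just a mean predictor and the publish criterion is hold-out error on the newest batch; the real setting involves versioned data sources and proper predictive models:

```python
from statistics import mean

class ContinuousModel:
    """Sketch of a reactive-modeling loop: as data batches arrive,
    retrain a candidate model, assess it on the newest batch, and
    publish it only if it beats the currently deployed model.
    The 'model' is a plain mean predictor purely for illustration."""

    def __init__(self):
        self.history = []
        self.published = None        # currently deployed model (a float here)
        self.versions = []           # archived model versions

    def _error(self, model, batch):
        """Mean absolute error of a constant predictor on a batch."""
        return mean(abs(model - y) for y in batch)

    def ingest(self, batch):
        if self.history:
            candidate = mean(self.history)          # "train" on all past data
            old_err = (self._error(self.published, batch)
                       if self.published is not None else float("inf"))
            if self._error(candidate, batch) < old_err:
                self.published = candidate           # publish new version
                self.versions.append(candidate)      # archive it
        self.history.extend(batch)
```

Each incoming batch serves first as held-out assessment data and only afterwards as training data, so the published model is monitored continuously instead of being trained once and frozen.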
HASTE
Hierarchical Analysis of Spatial and TEmporal image data
From intelligent data acquisition via smart data management to confident predictions
PI, Aim 1: Carolina Wählby · Aim 2: Ola Spjuth · Aim 3: Andreas Hellander
29 MSEK, 2017-2022
Aim 2: Guiding data acquisition with machine learning
[Diagram, online setting:]
• Can we use privileged information to improve machine learning models?
• Can we make a valid ranking and guide data acquisition (collect more data)?
• Is something interesting happening, and can we assign valid probabilities for that?
Aim 3: Explore a hierarchical model based on information layers
[Diagram: edge devices → cloudlet/private cloud → data warehouse and distributed storage]
European Open Science Cloud (EOSC)
• The vast majority of all data in the world (in fact up to 90%) has been
generated in the last two years.
• Scientific data is in direct need of openness, better handling, careful
management, machine actionability and sheer re-use.
• European Open Science Cloud: A vision of a future infrastructure to
support Open Research Data and Open Science in Europe
– It should enable trusted access to services, systems and the re-use
of shared scientific data across disciplinary, social and geographical
borders
– Research data should be findable, accessible, interoperable and reusable (FAIR)
– provide the means to analyze datasets of huge sizes
43
http://ec.europa.eu/research/openscience/index.cfm?pg=open-science-cloud
Acknowledgements
Wes Schaal
Jonathan Alvarsson
Staffan Arvidsson
Arvid Berg
Samuel Lampa
Marco Capuccini
Martin Dahlö
Valentin Georgiev
Anders Larsson
Polina Georgiev
Maris Lapins
Jon-Ander Novella
44
Lars Carlsson
Ernst Ahlberg
Ola Engqvist
SNIC Science Cloud
Andreas Hellander
Salman Toor
Caramba.clinic
Kim Kultima
Stephanie Herman
Payam Emami
Research group website: http://pharmb.io
Thank you
Editor's Notes
• #6: Access to computers (many if you need), access to storage (a lot if you need), pre-installed software and reference genomes, free of charge.
• #20: Drop applications into VMs running Docker in different clouds.
• #33: How can we improve efficiency on shared HPC for data-intensive bioinformatics? Can cloud computing and Big Data frameworks aid data-intensive research? How useful are scientific workflows in data-intensive research? Can predictive modeling aid data acquisition, storage and analysis? How can we continuously improve predictive models as data changes?
• #42: Making predictions with valid estimates of uncertainty; using privileged information in model training; deploying models efficiently on different e-infrastructures.