CHPC Workshop Morning Session

Cloud Computing Solutions for Genomics
Across Geographic, Institutional and Economic Barriers

Ntinos Krampis
Asst. Professor
J. Craig Venter Institute
kkrampis@jcvi.org

http://www.jcvi.org/cms/about/bios/kkrampis/

Workshop Schedule

Morning Session: Background Presentations and Prep

11:00 – 11:45 Introduction to Cloud Computing for Bioinformatics
11:45 – 12:00 Questions and Answers
12:00 – 12:30 Using Cloud BioLinux on the Amazon EC2 Cloud
12:30 – 13:00 Preparation: install Cloud Virtual Machines on laptops

Afternoon Session: Hands on Session

14:00 – 14:30 Preparation: install Cloud Virtual Machines on laptops
14:30 – 16:00 Bioinformatic Analysis using Cloud BioLinux
16:30 – 17:30 Customized Bioinformatics Solutions for Participants

A little bit of background information...

● Konstantinos (Ntinos) Krampis, started working at J. Craig Venter Inst.
(JCVI) in 2009

● Background training in Molecular Biology, PhD in Bioinformatics

● Research: cloud and high-performance computing, genome assembly

● Projects: Cloud BioLinux (cloudbiolinux.org)

● Taught Cloud BioLinux workshop at Univ. of Limpopo last May

● Slides available at http://www.slideshare.com/agbiotec

● Email me for slides, meeting, questions: kkrampis@jcvi.org

J. Craig Venter Institute (JCVI)
Large-scale genome sequencing and bioinformatics computing

●Human Microbiome Project (HMP): genome
sequencing of microbes living in and on the human
body

● Global Ocean Sampling (GOS) survey: genome
sequencing of microbes sampled from oceans around
the world

JCVI: sequencing and computing infrastructure

● core sequencing laboratory: 454, Solexa, HiSeq, IonTorrent on the way

● dedicated bioinformatics department (57 bioinformaticians)

● large-scale computations, ~1000 node Sun Grid Engine (SGE) cluster

Low-cost sequencing instruments

● small form-factor sequencers available: GS Junior by 454, MiSeq by Illumina

● bacterial, viral, small fungal genomes, sequencing for variant discovery

● sequencing as a standard technique in molecular biology and genetics

● RNA-seq (instead of microarrays) and ChIP-seq (instead of yeast 2-hybrid)

http://www.gsjunior.com/ http://www.illumina.com/systems/miseq.ilmn

More small laboratories doing genome sequencing

[Figure: axes labeled "amount of sequencing" and "number of labs"]

acquiring the sequence data is only the first step...

Sequencing instruments shipped with minimal
computational capacity

●Problem 1: sequencing data analysis requires high-performance and expensive
computing hardware, for example: genome assembly, BLAST, genome annotation

● Problem 2: much bioinformatics software is difficult for biologists to install;
it needs technical expertise with operating systems, compiling source code etc.

Each lab building their own informatics infrastructure ?

●small labs need additional funds to build
computing clusters

●funds for bioinformaticians and software
developers to maintain the clusters and
software

● duplication of effort across labs

●sub-optimal utilization of the hardware
due to small amounts of sequencing

Large sequencing centers offering bioinformatics analysis services ?

● Bioinformatic Resource Centers (BRC)

●bioinformatic analysis coupled with sequencing
of an organism

● mostly provide data browsing and few analysis
tools to the public

●cannot serve the bioinformatic needs of every
small lab acquiring a sequencing instrument

●need end-to-end solutions, users submit
sequence data and get final annotation

Solving Problem 1: using high-performance computing
hardware available on the cloud

●cloud computing : high performance
computers and data storage, remotely
accessible through the Internet

●we are all using the cloud: Gmail,
Google Docs, FaceBook; you store and
access data on a remote computer

●cloud computers rented pay-as-you-go
by service providers such as Amazon
Elastic Compute Cloud (EC2)

The Amazon EC2 cloud computing service
● a subsidiary company of Amazon.com, rents computing capacity pay-as-you-go

● cloud computers cost $0.085 - $2 per hr (max 64GB memory and 8 processors)

● used by companies that need additional computers without investing in hardware

● physical locations: US East / West regions, EU, Singapore, Japan

● democratizes access to computing resources outside of institutional, economic or
national boundaries

750 hours free for new users, sign up here:
http://aws.amazon.com/free/
http://aws.amazon.com
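To make the pay-as-you-go pricing concrete, here is a minimal sketch of the arithmetic, assuming only the hourly range quoted above ($0.085 to $2 per instance-hour). The workload (4 machines for 24 hours) and the function name are hypothetical, and data transfer or storage charges are not included.

```python
# Hourly price range quoted on the slide (USD per instance-hour).
MIN_RATE_USD = 0.085
MAX_RATE_USD = 2.00


def rental_cost(hours: float, instances: int, rate_usd: float) -> float:
    """Return the cost of renting `instances` machines for `hours` each."""
    if hours < 0 or instances < 0 or rate_usd < 0:
        raise ValueError("hours, instances and rate must be non-negative")
    return round(hours * instances * rate_usd, 2)


if __name__ == "__main__":
    # Hypothetical workload: 4 machines running for 24 hours.
    low = rental_cost(24, 4, MIN_RATE_USD)
    high = rental_cost(24, 4, MAX_RATE_USD)
    print(f"24 h on 4 instances: ${low:.2f} to ${high:.2f}")
```

The cost scales linearly with both the number of machines and the hours used, which is why short bursts of large-scale analysis can be cheaper than buying a cluster.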

How does cloud computing work ?

● cloud computing evolved from virtualization technology

● operating system, bioinformatics software and data, are
installed on a Virtual Machine (VM)

● VM is emulation of a computer system, in the form of a
single, executable binary file

● runs inside a physical computer such as a laptop

● why Virtualization: simplify IT maintenance

How does cloud computing work ?

● a VM is uploaded on the cloud service; runs by renting computing
capacity from Amazon EC2 (up to 64GB RAM / 8 core computers)

● bioinformatics software can be executed from anywhere in the world
through a desktop computer with Internet access

● removes need for local computer clusters at each laboratory

● alternatively if you have a cluster locally it can run on a private cloud

[Diagram: VMs on the remote Amazon EC2 cloud computing service, reached over the Internet from local computers]

Solving problem 2: pre-installed and
configured bioinformatics software on cloud
Virtual Machines

●Cloud BioLinux: a publicly accessible Virtual
Machine (VM) on the Amazon EC2 cloud

● 100+ pre-configured and installed bioinformatics software tools

● sequence analysis, genome assembly, annotation,
phylogeny, molecular modeling, gene expression

●a researcher can initiate a practically unlimited
number of VMs for large-scale data analysis and
access them using a local computer

Cloud BioLinux for Bioinformatics

● how the Cloud BioLinux project came to be, what it can offer to small
labs for genome sequence analysis

●where and how do I run Cloud BioLinux , especially if I am not a
computer expert

● besides end-users, bioinformatics developers are provided a framework
for modifying and sharing VM configurations and data

Creating Cloud BioLinux

● JCVI bioinformatics cloud computing research

● NEBC BioLinux software repository (tinyurl.com/BioLinux-NEBC)

● community effort at BOSC 2009 – 11

● initially: a VM on Amazon EC2 with the tools copied
and installed from the NEBC repository

● now: developer's framework for creating customized
cloud VMs

● major contributors:

http://www.cloudbiolinux.org

Research at JCVI with Cloud BioLinux

●Eucalyptus private cloud currently installed at JCVI,
OpenStack on the way

●open-source cloud platforms, fully compatible with
Amazon EC2 (identical API)

●easy to set up on a local computer cluster, comes with
Ubuntu Linux server

●develop VMs in-house with complex bioinformatics
pipelines pre-installed and upload to Amazon EC2 for
public access

● free to use on your laptop with VirtualBox
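For the laptop route, a minimal sketch of how a downloaded VM could be imported and started with VirtualBox's `VBoxManage` command-line tool. The appliance file name `cloudbiolinux.ova` and the VM name are hypothetical placeholders; the slides do not say in which file format the VM is distributed. The script only prints the commands, it does not run them.

```python
import shlex
from typing import List


def virtualbox_commands(appliance_path: str, vm_name: str) -> List[List[str]]:
    """Build the VBoxManage calls to import an appliance and boot it."""
    return [
        # Register the appliance as a new VM under the given name.
        ["VBoxManage", "import", appliance_path, "--vsys", "0", "--vmname", vm_name],
        # Boot the VM with its normal graphical window.
        ["VBoxManage", "startvm", vm_name, "--type", "gui"],
    ]


if __name__ == "__main__":
    # Hypothetical file and VM names, for illustration only.
    for command in virtualbox_commands("cloudbiolinux.ova", "CloudBioLinux"):
        print(shlex.join(command))
```

Each command can be pasted into a terminal, or passed to `subprocess.run(command, check=True)` once VirtualBox is installed.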


● bioinformatics data analysis pipelines have complex
dependencies: operating system, software libraries,
reference databases etc.

● approach: pre-install pipelines and all dependencies
in a single binary VM file using a private cloud

● upload VM on Amazon EC2: pipelines ready to
execute, no need to purchase hardware


● Funded by NIAID until 2013

●port complex bioinformatic pipelines on the
Amazon EC2 cloud

● focus on viral reads-to-annotation data pipelines

● benefits to small laboratories that lack resources

● if you own a cluster: download and run VM on your
private Eucalyptus or OpenStack cloud
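Because the private clouds expose an EC2-style API, the same client code can in principle target either Amazon or a local installation by changing only the endpoint. The sketch below builds the connection settings; using the boto3 library is an assumption (the slides do not name a client library), the endpoint URL is a hypothetical placeholder, and how compatible a given private cloud is depends on its version.

```python
from typing import Dict, Optional


def ec2_client_settings(private_endpoint: Optional[str] = None,
                        region: str = "us-east-1") -> Dict[str, str]:
    """Return keyword arguments for an EC2-style client.

    With no endpoint the settings target the public Amazon EC2 service;
    with an endpoint they target a private cloud that speaks the same API.
    """
    settings = {"service_name": "ec2", "region_name": region}
    if private_endpoint:
        settings["endpoint_url"] = private_endpoint
    return settings


if __name__ == "__main__":
    print(ec2_client_settings())
    # Hypothetical address of a private cloud front end.
    print(ec2_client_settings("https://cloud.example.org:8773/services/compute"))
```

With boto3 installed and credentials configured, the result would be passed as `boto3.client(**ec2_client_settings(...))`.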

Running Cloud BioLinux on the Amazon EC2 cloud

Account on the Amazon EC2 cloud http://aws.amazon.com/ec2

Launch Cloud BioLinux through the EC2 cloud console

http://tinyurl.com/cloud-biolinux-tutorial

Cloud BioLinux launch wizard: steps 1 & 2

1. go to the “Community AMIs” tab, specify the Cloud BioLinux VM identifier
(most recent update: cloudbiolinux.org)

2. select computational capacity

Cloud BioLinux launch wizard: step 3

3. specify a password for login to Cloud BioLinux in the “User Data” box
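The three wizard steps map onto three parameters of a programmatic launch. The sketch below is one possible scripted equivalent, assuming the boto3 library (not mentioned in the slides). The AMI identifier, the instance type and the User Data text are hypothetical placeholders: look up the current identifier on cloudbiolinux.org and follow the tutorial for the exact User Data content.

```python
from typing import Any, Dict


def launch_request(ami_id: str, instance_type: str, user_data: str) -> Dict[str, Any]:
    """Collect the wizard's three choices as EC2 RunInstances parameters."""
    if not ami_id.startswith("ami-"):
        raise ValueError("expected an AMI identifier such as 'ami-...'")
    return {
        "ImageId": ami_id,              # step 1: the Cloud BioLinux VM identifier
        "InstanceType": instance_type,  # step 2: computational capacity
        "UserData": user_data,          # step 3: text typed in the "User Data" box
        "MinCount": 1,
        "MaxCount": 1,
    }


def launch(request: Dict[str, Any]) -> str:
    """Start one instance and return its ID (needs boto3 and AWS credentials)."""
    import boto3  # third-party library, imported only when actually launching

    ec2 = boto3.client("ec2")
    response = ec2.run_instances(**request)
    return response["Instances"][0]["InstanceId"]


if __name__ == "__main__":
    # Placeholder values for illustration; nothing is launched here.
    print(launch_request("ami-00000000", "m1.large", "my-login-password"))
```

Calling `launch(...)` starts a billable instance, so the example only prints the request.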


Distributing Data Analysis Results with Cloud BioLinux

Whole System Snapshot Exchange

● how difficult is it to share bioinformatics work on your computer with a collaborator ?
● capture the state of the computing system (OS + software), data, analysis results
● make VM snapshots: executable binary file, replica of a running VM

● distribute a VM snapshot with pre-installed software and data analysis results

● collaborators can replicate, re-run, add to your analysis results

● a snapshot can be shared directly on the Amazon cloud, downloaded on a private
cloud or run on desktop using virtualization software

Cloud BioLinux: whole system snapshot exchange

storage cost: $0.10 / GB / month
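A quick sketch of what the quoted storage rate means for a shared snapshot. The snapshot size and retention period are hypothetical; only the rate comes from the slide.

```python
# Storage rate quoted on the slide (USD per GB per month).
STORAGE_USD_PER_GB_MONTH = 0.10


def snapshot_storage_cost(size_gb: float, months: float) -> float:
    """Return the cost of keeping a snapshot of `size_gb` for `months`."""
    if size_gb < 0 or months < 0:
        raise ValueError("size and duration must be non-negative")
    return round(size_gb * months * STORAGE_USD_PER_GB_MONTH, 2)


if __name__ == "__main__":
    # Hypothetical example: a 20 GB snapshot kept for one year.
    print(f"${snapshot_storage_cost(20, 12):.2f}")
```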

Cloud BioLinux: whole system snapshot exchange
authorize access to the VM: public or for certain users

other researchers can access the VM with all the
software, data, analysis results directly on the cloud
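The choice between public access and access for certain users can be expressed as a launch-permission setting on the VM image. The sketch below builds that setting in the shape the EC2 API's ModifyImageAttribute call accepts; the account IDs are hypothetical, and using boto3 to send it is an assumption not stated in the slides.

```python
from typing import Any, Dict, List, Sequence


def launch_permission(public: bool = False,
                      account_ids: Sequence[str] = ()) -> Dict[str, List[Dict[str, Any]]]:
    """Describe who may launch a shared VM image.

    public=True opens the image to everyone; otherwise only the listed
    AWS account IDs are granted access.
    """
    if public:
        return {"Add": [{"Group": "all"}]}
    if not account_ids:
        raise ValueError("give at least one account ID, or set public=True")
    return {"Add": [{"UserId": account} for account in account_ids]}


if __name__ == "__main__":
    print(launch_permission(public=True))
    # Hypothetical collaborator account IDs.
    print(launch_permission(account_ids=["111111111111", "222222222222"]))
```

With boto3, the result would be sent as `ec2.modify_image_attribute(ImageId=..., LaunchPermission=launch_permission(...))`.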

Cloud BioLinux for Software Developers

● Issue 1: for researchers with sensitive data a public cloud might not be an option
● Problem 1: moving VMs across clouds is not trivial, need low level operations

● Issue 2: bioinformatic specializations (e.g. sequencing, phylogeny, protein structure)
● Problem 2: one VM with many tools to fit all becomes over-sized

● Cloud BioLinux VM deployment framework

Cloud BioLinux for Software Developers

● framework to customize software installed in cloud VM / image
● based on the Python Fabric automated deployment tool
● software components listed in simple text configuration files
● edit the files to mix and match software according to your needs
● use source code repository to share configuration files for customized VMs
● start with a bare-bones VM on Amazon EC2 or Eucalyptus private cloud
● Fabric scripts automatically install specified software based on configuration files

Free, available from: https://github.com/chapmanb/cloudbiolinux
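The idea of "mix and match" from text configuration files can be illustrated with a small sketch. The file format, group names and package names below are hypothetical and do not reproduce the actual Cloud BioLinux configuration files. The code only builds the install commands, which a deployment tool such as Fabric would then run on the bare-bones VM.

```python
from typing import Dict, List

# Hypothetical configuration: one software group per line, tools after the colon.
EXAMPLE_CONFIG = """\
# group: tools
assembly: velvet, abyss
phylogeny: mrbayes, phyml
alignment: bowtie
"""


def parse_groups(text: str) -> Dict[str, List[str]]:
    """Parse 'group: tool, tool' lines, ignoring blanks and '#' comments."""
    groups: Dict[str, List[str]] = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        if not line:
            continue
        name, _, tools = line.partition(":")
        groups[name.strip()] = [t.strip() for t in tools.split(",") if t.strip()]
    return groups


def install_commands(groups: Dict[str, List[str]], selected: List[str]) -> List[str]:
    """Return one package-install command per selected software group."""
    commands = []
    for name in selected:
        if name not in groups:
            raise KeyError(f"unknown software group: {name}")
        if groups[name]:
            commands.append("apt-get install -y " + " ".join(groups[name]))
    return commands


if __name__ == "__main__":
    # A customized VM that needs only assembly and alignment tools.
    for command in install_commands(parse_groups(EXAMPLE_CONFIG), ["assembly", "alignment"]):
        print(command)
```

Editing the text file or the list of selected groups changes what ends up on the VM, without touching the installation code.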

software domains in Cloud BioLinux:

Genome sequencing, de novo assembly, annotation,
phylogeny, molecular structures, gene expression
analysis

[Figure: a high-level configuration file describes the software groups;
for each group, a second file lists the individual bioinformatics tools]

Acknowledgments & Credits

Brad Chapman - development of the Fabric scripts, website
Tim Booth, Mesude Bicak, Dawn Field – BioLinux 6.0 development
Enis Afgan – Cloudman and Cloud BioLinux integration

Members of the Cloud BioLinux community:
http://groups.google.com/group/cloudbiolinux

Thank you !

And again our contacts:
kkrampis@jcvi.org
http://www.cloudbiolinux.org
http://www.slideshare.com/agbiotec
