Data analysis with Galaxy on the Cloud
Enis Afgan
How to use the cloud?
1. Get an account on the supported cloud
2. Start a master instance via a launcher app
3. Use CloudMan’s web interface on the master instance to manage the platform
Agenda details
• Launch an instance
• Demonstrate the following CloudMan features and prepare for the data analysis part:
• Manual & Auto-scaling
• Using an S3 bucket as a data source
• Accessing an instance over ssh
• Customizing an instance
• Controlling Galaxy
• Sharing-an-instance
• Perform data analysis in Galaxy
Interaction flow
YOUR TURN
http://bit.ly/goc-ws
Launch an instance
1. Slides @ bit.ly/goc-ws
2. Load biocloudcentral.org
3. Enter the access key and secret key provided at http://bit.ly/ws-creds
4. Provide your email address
5. Use your initials as the cluster name
6. Set any password (and remember it)
7. Keep the Large instance type
8. Start your instance
Wait for the instance to start (~2-3 minutes)
For more details, see http://cloudman.irb.hr
Scaling computation
Agenda details
• Launch an instance ✓
• Demonstrate the following CloudMan features and prepare for the data analysis part:
• Manual & Auto-scaling
• Using an S3 bucket as a data source
• Accessing an instance over ssh
• Customizing an instance
• Controlling Galaxy
• Sharing-an-instance
• Perform data analysis in Galaxy
Interaction flow
Manual scaling
• Explicitly add 1 worker node to your cluster
• Node type corresponds to node processing capacity
• Research use of Spot instances
Auto-scaling
Agenda details
• Launch an instance ✓
• Demonstrate the following CloudMan features and prepare for the data analysis part:
• Manual & Auto-scaling ✓
• Using an S3 bucket as a data source
• Accessing an instance over ssh
• Customizing an instance
• Controlling Galaxy
• Sharing-an-instance
• Perform data analysis in Galaxy
Interaction flow
On human chromosome 22, which coding exons have the most SNPs in them?
A Rough Plan
• Get some data
  • Coding exons on chromosome 22
  • SNPs on chromosome 22
• Mess with it
  • Identify which exons have SNPs
  • Count SNPs per exon
• Visualize our results
Workflow (diagram, built up over several slides):
1. Inputs: Exons, from UCSC; SNPs, from UCSC
2. Overlap pairings
3. Exon overlap counts
4. Join on exon name
5. Rearrange columns w/ cut
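The plan above (pair exons with the SNPs they overlap, then count SNPs per exon) can be tried on the command line before doing it in Galaxy. This is a minimal sketch on made-up toy data: the coordinates, the exon and SNP names, and the file names are all hypothetical, and awk stands in for the Galaxy tools used in the workshop. Intervals are treated as BED-style (chrom, start, end, name; half-open).

```shell
# Work in a scratch directory so nothing on the instance is overwritten.
cd "$(mktemp -d)"

# Toy inputs (hypothetical): three exons and four SNPs on chr22.
printf 'chr22\t100\t200\texonA\nchr22\t300\t400\texonB\nchr22\t500\t600\texonC\n' > exons.bed
printf 'chr22\t150\t151\trs1\nchr22\t160\t161\trs2\nchr22\t350\t351\trs3\nchr22\t450\t451\trs4\n' > snps.bed

# First file (exons.bed): remember each exon's chromosome, start, and end.
# Second file (snps.bed): for every SNP, bump the counter of each exon it overlaps.
# END: print "exon name <TAB> SNP count"; sort puts the highest count first.
awk -F'\t' '
    NR == FNR { chrom[$4] = $1; estart[$4] = $2 + 0; eend[$4] = $3 + 0; next }
    {
        for (name in chrom)
            if (chrom[name] == $1 && ($2 + 0) < eend[name] && ($3 + 0) > estart[name])
                count[name]++
    }
    END { for (name in count) printf "%s\t%d\n", name, count[name] }
' exons.bed snps.bed | sort -k2,2nr > snps_per_exon.tsv

cat snps_per_exon.tsv
```

On this toy input, exonA is listed first with 2 SNPs and exonB second with 1; exonC overlaps no SNP, so it does not appear in the output at all. The nested loop compares every SNP with every exon, which is fine for a toy example but slow for full UCSC tracks.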
Your turn
http://usegalaxy.org/galaxy101
Slides @ http://bit.ly/gxy-ws
Accessing an instance over ssh
Use the terminal (or install Secure Shell for Chrome)
SSH as user ubuntu, with the password you chose when launching the instance:
[local machine]$ ssh ubuntu@<instance IP address>
Once logged in
• You have full system access to your instance, including sudo; use it as you would any other system
• A galaxy user exists on the system and should be used when manipulating Galaxy (sudo su galaxy)
• You can submit any jobs via the standard qsub command
• Edit Galaxy’s configuration
$ sudo su galaxy
$ cd /mnt/galaxy/galaxy-app
$ vi universe_wsgi.ini
allow_library_path_paste = True
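The same edit can be scripted instead of made by hand in vi. This is a minimal sketch, assuming the option is present in the file as a commented-out default (`#allow_library_path_paste = False`); that default line is an assumption, so check your copy first. It runs against a temporary stand-in file; to apply it for real, set `ini` to /mnt/galaxy/galaxy-app/universe_wsgi.ini and run it as the galaxy user.

```shell
# Hypothetical stand-in for universe_wsgi.ini, so the sketch is safe to try.
ini=$(mktemp)
printf '[app:main]\n#allow_library_path_paste = False\n' > "$ini"

# Uncomment the option if needed and set it to True.
# -i.bak edits in place and keeps the original as "$ini.bak".
sed -i.bak -E 's/^#? *allow_library_path_paste *=.*/allow_library_path_paste = True/' "$ini"

# Confirm the result.
grep 'allow_library_path_paste' "$ini"
```

Galaxy generally reads this file at startup, so restart Galaxy (Start/stop Galaxy application in CloudMan) for the change to take effect.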
Controlling Galaxy
• Start/stop Galaxy application
• Add an admin user
• Use the email you registered with
S3 bucket as a data library
• Within Galaxy, create a Data Library, using S3 bucket path as the data source (/mnt/workshop-data)
• This will import all the datasets into the Data Library
• Import that dataset into a history
Sharing-an-Instance
• Share the entire CloudMan platform
• Includes all user data and even the customizations
• Publish a self-contained analysis
• Make a note of the share-string and send it to your neighbor
CloudMan architecture (diagram)
Components shown: Management Console; Application(s) (e.g., Galaxy); CloudMan instance; CM-w (several); Instance block storage (FS 2, FS ..., Snap2, Snap..); CloudMan machine image (CloudMan MI); Persistent data repository (S3/Swift)
Steps 1° to 11°, including: Start CloudMan; Contextualize image; Setup services
Troubleshooting (diagram, steps 1° to 3°)
Locations shown: /tmp/cm/cm_boot.log; /usr/bin/ec2autorun.log; /mnt/cm/paster.log; cm-<hash>; /mnt/galaxy[Indices]
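The log files named on the Troubleshooting slide can be inspected once you are logged in over ssh. This is a minimal sketch: the helper name `show_cm_logs` and the order in which the files are listed are choices made here, not something the slides specify. Only the three paths come from the slide.

```shell
# Print the tail of each log file named on the Troubleshooting slide.
# Usage: show_cm_logs [number_of_lines]   (defaults to 20)
show_cm_logs() {
    lines=${1:-20}
    for log in /tmp/cm/cm_boot.log /usr/bin/ec2autorun.log /mnt/cm/paster.log; do
        if [ -r "$log" ]; then
            echo "== $log (last $lines lines) =="
            tail -n "$lines" "$log"
        else
            # Expected anywhere other than a CloudMan instance.
            echo "== $log (not readable on this machine) =="
        fi
    done
}

show_cm_logs 20
```

If a file is reported as not readable on the instance itself, retrying with sudo is a reasonable next step.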
CLUSTER ON THE COMMAND LINE
Distributed Resource Manager (DRM)
• Controls job execution on a (set of) resource(s)
• A job: an invocation of a program
• Manages job load
• Provides the ability to monitor system and job status
• Popular DRMs: Sun Grid Engine (SGE), Portable Batch System (PBS), TORQUE, Condor, Load Sharing Facility (LSF)
Customize your instance - install a new tool
$ cd /mnt/galaxy/export
$ wget http://heanet.dl.sourceforge.net/project/dnaclust/parallel_release_3/dnaclust_linux_release3.zip
$ unzip dnaclust_linux_release3.zip
$ cd dnaclust_linux_release3
$ chmod +x *
$ cp /mnt/workshop-data/mtDNA.fasta .
Slide callouts: "Get a copy of a sample dataset" (the cp command); "Don’t forget"
Use the new tool in the cluster mode
1. Create a new sample shell file to run the tool; call it job_script.sh, with the following content:
#$ -cwd
./dnaclust -l -s 0.9 /mnt/workshop-data/mtDNA.fasta
2. Submit a single job to the SGE queue:
qsub job_script.sh
3. Check the queue: qstat -f
4. Job output will be in the local directory, in the file job_script.sh.o# (where # is the job ID)
5. Start a number of instances of the script:
qsub job_script.sh (*10)
watch qstat -f
See all jobs lined up
6. See auto-scaling in action (using /cloud) [1.5-2 mins]
7. Go back to the command prompt and see jobs being distributed
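Steps 1, 2, and 5 can be combined into one short script. This is a minimal sketch, meant to be run from the dnaclust_linux_release3 directory on the master instance. The loop and the check for qsub are additions made here for illustration; the slides simply repeat the qsub command ten times.

```shell
# Step 1: write the job script. The quoted 'EOF' keeps "#$" literal,
# so SGE sees the "-cwd" directive (run the job from the current directory).
cat > job_script.sh <<'EOF'
#$ -cwd
./dnaclust -l -s 0.9 /mnt/workshop-data/mtDNA.fasta
EOF

# Steps 2 and 5: submit ten copies, then show the queue once.
# On a machine without SGE, skip submission instead of failing.
if command -v qsub >/dev/null 2>&1; then
    i=1
    while [ "$i" -le 10 ]; do
        qsub job_script.sh
        i=$((i + 1))
    done
    qstat -f
else
    echo "qsub not found: run this on the CloudMan master instance"
fi
```

Each submitted job writes its output to its own job_script.sh.o# file, so the ten runs do not overwrite each other.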

Editor's Notes

  • #14: This is based on the Galaxy101 example for human.
  • #23: Note to self: workshop-data bucket is owned by the Galaxy outreach acct on AWS.