Integrating the NCBI BLAST+ suite into Galaxy
https://github.com/peterjc/galaxy_blast
http://slideshare.net/pjacock/
1 July 2014, GCC2014, Baltimore
Peter Cock¹, John Chilton², Björn Grüning³, Jim Johnson⁴, Nicola Soranzo⁵
¹ James Hutton Institute, Scotland, UK; ² Penn State University, USA; ³ Albert-Ludwigs-University, Germany; ⁴ University of Minnesota, USA; ⁵ CRS4, Italy
Everyone loves BLAST!
• Altschul et al. (1990)
Basic local alignment search tool.
http://doi.org/10.1016/S0022-2836(05)80360-2
• . . .
• Camacho et al. (2009)
BLAST+: architecture and applications.
http://doi.org/10.1186/1471-2105-10-421
Example use cases and workflows
• Homology searches / functional annotation
• Contamination detection
• Identification of unmappable reads
• Reciprocal best BLAST hits
• BLAST2GO
• Identification of nearby located genes (Gene cluster searches)
Open Source Licensing → Ecosystem
• NCBI BLAST and BLAST+
• Public Domain (United States Government Work)
• Galaxy
• Academic Free License version 3.0
• BLAST+ wrappers for Galaxy
• MIT Licence
Like most Bioinformatics tools, these are all open source :)
• This is important for redistribution via the Galaxy Tool Shed
Galaxy BLAST wrapper early history
• Jan 2008 – Galaxy team wrapped NCBI ‘legacy’ MEGABLAST
• Sep 2010 – I contributed wrappers for NCBI BLAST+ core
• Development within main Galaxy bitbucket.org repository
• August 2012 – BLAST+ wrapper moved to ToolShed
• Development on my Galaxy bitbucket.org branch
• Summer 2013 – Published Galaxy tools paper
• Cock, Grüning, Paszkiewicz, and Pritchard (2013)
Galaxy tools and workflows for sequence analysis with
applications in molecular plant pathology
http://doi.org/10.7717/peerj.167
• Summer 2013 – Reorganised to facilitate contributions
Galaxy BLAST wrapper recent history
• July 2013
• Held Birds-of-a-Feather meeting at GCC2013
• BLAST+ wrapper development moved to GitHub
• Adopted MIT Licence
• Merged first pull request on GitHub
• Sept 2013 – updated to BLAST+ 2.2.26
• Added RPSBLAST, RPSTBLASTN, and protein domain
databases
• Dec 2013 – updated to BLAST+ 2.2.28
• Added description in tabular output, $GALAXY_SLOTS, macros,
masking, etc
• March 2014 – updated to BLAST+ 2.2.29
• Added pick-your-own columns, more masking, etc
• We’re also working on a paper. . .
Source code & history on GitHub
https://github.com/peterjc/galaxy_blast/tree/master/tools/ncbi_blast_plus
Issue Tracker on GitHub
(For Bug Reports etc)
https://github.com/peterjc/galaxy_blast/issues
GitHub contributors
Git & GitHub facilitate an open development model:
• Anyone can “fork” the code to try out modifications
• Can then offer their improvements to be merged back
https://github.com/peterjc/galaxy_blast/network
Changes on GitHub tested via TravisCI
http://blastedbio.blogspot.co.uk/2013/09/using-travis-ci-for-testing-galaxy-tools.html
https://travis-ci.org/peterjc/galaxy_blast/builds
Development & Testing Process
General process,
• Development & local testing on galaxy-central
(on internal Galaxy server for development, using our cluster)
• Push to GitHub, runs TravisCI tests (also galaxy-central)
Then, for a release-candidate,
• Update Test Tool Shed (also on galaxy-central)
• Local testing on our live server (using galaxy-dist)
• Encourage others to test on their systems
Finally,
• Update main ToolShed (tested on galaxy-dist)
All dependencies via Tool Shed
• Main BLAST+ tool wrappers
• http://toolshed.g2.bx.psu.edu/view/devteam/ncbi_blast_plus/
• Packages for NCBI BLAST+ binaries
• http://toolshed.g2.bx.psu.edu/view/iuc/package_blast_plus_2_2_26
• http://toolshed.g2.bx.psu.edu/view/iuc/package_blast_plus_2_2_27
• http://toolshed.g2.bx.psu.edu/view/iuc/package_blast_plus_2_2_28
• http://toolshed.g2.bx.psu.edu/view/iuc/package_blast_plus_2_2_29
• BLAST datatypes
• http://toolshed.g2.bx.psu.edu/view/devteam/blast_datatypes/
(and same set again for the Test Tool Shed)
See: Blankenberg et al. (2014)
http://doi.org/10.1186/gb4161
All dependencies via Tool Shed
[Diagram: Tool Shed repositories drawn as linked boxes labelled BLAST datatypes, BLAST+ binaries and BLAST+ wrappers; a second build of the slide adds Reciprocal Best Hits and Blast2GO (b2g4pipe).]
Galaxy datatypes used in BLAST
• tabular – BLAST tabular output
• txt – BLAST plain text output
• html – BLAST webpage output
• fasta – FASTA sequence files
New Galaxy datatypes for BLAST
• maskinfo-asn1 and maskinfo-asn1-binary – Masking Info
• pssm-asn1 – Position Specific Scoring Matrices (PSSMs)
• Simple subclasses of Galaxy’s ASN.1 datatypes
• Each defined by one line of XML in datatypes_conf.xml:
<datatype extension="pssm-asn1"
          type="galaxy.datatypes.data:GenericAsn1"
          mimetype="text/plain" subclass="True"
          display_in_upload="true" />
https://github.com/peterjc/galaxy_blast/blob/master/datatypes/blast_datatypes/datatypes_conf.xml
New Galaxy datatypes for BLAST
• blastxml – BLAST XML output
• Python subclass of Galaxy’s XML datatype
• Defines .sniff(...) method for detecting on upload
• Defines .merge(...) method for job splitting
from galaxy.datatypes.xml import GenericXml

class BlastXml(GenericXml):
    """NCBI Blast XML Output data."""
    file_ext = "blastxml"

    def sniff(self, filename):
        # Detect BLAST XML files on upload
        ...

    @staticmethod
    def merge(split_files, output_file):
        # Combine the per-batch outputs after job splitting
        ...
https://github.com/peterjc/galaxy_blast/blob/master/datatypes/blast_datatypes/blast.py
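As a rough illustration of what a sniff method might check, here is a minimal Python sketch that looks at the first three lines of a file for the header that BLAST+ XML output typically starts with. It is not the code from blast.py; the function names header_is_blast_xml and sniff_blast_xml are hypothetical, and the exact header lines tested are an assumption about typical BLAST+ output.

```python
from itertools import islice


def header_is_blast_xml(lines):
    """Return True if the first three lines look like a BLAST XML header.

    `lines` is any iterable of text lines (an open file handle works).
    The expected header lines are an assumption, not taken from blast.py.
    """
    header = [line.strip() for line in islice(lines, 3)]
    return (
        len(header) == 3
        and header[0] == '<?xml version="1.0"?>'
        and header[1].startswith("<!DOCTYPE BlastOutput PUBLIC")
        and header[2] == "<BlastOutput>"
    )


def sniff_blast_xml(filename):
    """Filename-based wrapper, mirroring the sniff(self, filename) signature."""
    with open(filename) as handle:
        return header_is_blast_xml(handle)
```

Only the header is read, so the check stays cheap even for very large result files.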
New Galaxy datatypes for BLAST
• blastdbn, blastdbp, blastdbd – BLAST databases
• Composite datatypes made up of multiple files
• Python subclasses of Galaxy’s base datatype
class BlastNucDb(...):
    """Nucleotide BLAST database files."""
    file_ext = "blastdbn"
    allow_datatype_change = False
    composite_type = "basic"
    ...
https://github.com/peterjc/galaxy_blast/blob/master/datatypes/blast_datatypes/blast.py
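To make "composite datatype" concrete: a BLAST database is a set of files sharing one path prefix. The sketch below, in Python, checks that the three core files are present for a given prefix. It is an illustration only, not part of the Galaxy datatype; the function name missing_blastdb_files is hypothetical, and multi-volume databases and alias files are deliberately ignored.

```python
import os

# Core files of a single-volume BLAST database (header, index, sequence).
NUCLEOTIDE_EXTENSIONS = (".nhr", ".nin", ".nsq")
PROTEIN_EXTENSIONS = (".phr", ".pin", ".psq")


def missing_blastdb_files(prefix, nucleotide=True, exists=os.path.exists):
    """List the core database files that are absent for this path prefix.

    `exists` is injectable so the check can be exercised without touching disk.
    """
    extensions = NUCLEOTIDE_EXTENSIONS if nucleotide else PROTEIN_EXTENSIONS
    return [prefix + ext for ext in extensions if not exists(prefix + ext)]
```

An empty list means all three core files were found under that prefix.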
Galaxy Macros
• BLAST+ tool suite had a lot of repeated XML
• Consistency/maintenance was hard
• Galaxy does not support:
• XML <!ENTITY ...> tags
• XInclude
• Galaxy has its own macros and token system
• Downside is more complexity
• Ideal for tool suites like BLAST+
Galaxy Macros - XML macros
Turn re-used chunks of XML into macros:
<xml name="input_conditional_nucleotide_db">
    <conditional name="db_opts">
        <param name="db_opts_selector" type="select" label="Database/sequences">
            <option value="db" selected="True">Locally installed BLAST DB</option>
            <option value="histdb">BLAST database from your history</option>
            <option value="file">FASTA file from history (see warning)</option>
        </param>
        <when value="db">
            <param name="database" type="select" label="Nucleotide BLAST database">
                <options from_file="blastdb.loc">
                    ...
                </options>
            </param>
            <param name="histdb" type="hidden" value="" />
            <param name="subject" type="hidden" value="" />
        </when>
        <when value="histdb">
            ...
        </when>
        <when value="file">
            ...
        </when>
    </conditional>
</xml>
https://github.com/peterjc/galaxy_blast/blob/master/tools/ncbi_blast_plus/ncbi_macros.xml
Galaxy Macros - Cheetah tokens
Turn re-used chunks of Cheetah markup (or help text) into tokens:
<token name="@BLAST_DB_SUBJECT@">
#if $db_opts.db_opts_selector == "db":
  -db "${db_opts.database.fields.path}"
#elif $db_opts.db_opts_selector == "histdb":
  -db "${os.path.join($db_opts.histdb.extra_files_path, 'blastdb')}"
#else:
  -subject "$db_opts.subject"
#end if
</token>
https://github.com/peterjc/galaxy_blast/blob/master/tools/ncbi_blast_plus/ncbi_macros.xml
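The three-way branch in the @BLAST_DB_SUBJECT@ token can be restated in plain Python to show which command-line arguments each choice produces. This is a sketch for illustration, not wrapper code: the function name db_or_subject_args and its parameter names are hypothetical, while the selector values and the "blastdb" file prefix are taken from the token.

```python
import os


def db_or_subject_args(selector, database_path=None,
                       histdb_extra_files_path=None, subject=None):
    """Return the BLAST+ arguments chosen by the database/subject selector."""
    if selector == "db":
        # Locally installed database, path looked up from blastdb.loc
        return ["-db", database_path]
    elif selector == "histdb":
        # Database built in the user's history (composite dataset files)
        return ["-db", os.path.join(histdb_extra_files_path, "blastdb")]
    else:
        # Plain FASTA file from the history used as the subject
        return ["-subject", subject]
```

For example, db_or_subject_args("db", database_path="/mnt/scratch/local/blast/ncbi/nt") yields the two arguments -db and the nt path.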
Galaxy Macros – Example Tool
Now can embed the @TOKENS@ and XML fragments:
<tool id="ncbi_blastn_wrapper" ...>
    ...
    <macros>
        <import>ncbi_macros.xml</import>
    </macros>
    ...
    <command>
blastn ... @BLAST_DB_SUBJECT@ ...
    </command>
    ...
    <inputs>
        ...
        <expand macro="input_conditional_nucleotide_db" />
        ...
    </inputs>
    ...
</tool>
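Conceptually, macro expansion replaces each <expand> element with the children of the named <xml> macro, and substitutes each @TOKEN@ with its text. The Python sketch below imitates that with the standard library. It is not Galaxy's implementation: the function names load_macros and expand_tool are hypothetical, the sample XML is cut down for illustration, nested macros are not handled, and tokens are only replaced in element text.

```python
import copy
import xml.etree.ElementTree as ET

# Cut-down stand-ins for ncbi_macros.xml and a tool file (illustrative only).
MACROS_XML = """
<macros>
  <token name="@BLAST_DB_SUBJECT@">-db "$db_opts.database.fields.path"</token>
  <xml name="input_conditional_nucleotide_db">
    <param name="database" type="select" label="Nucleotide BLAST database" />
  </xml>
</macros>
"""

TOOL_XML = """
<tool id="example_blastn">
  <command>blastn @BLAST_DB_SUBJECT@</command>
  <inputs>
    <expand macro="input_conditional_nucleotide_db" />
  </inputs>
</tool>
"""


def load_macros(macros_root):
    """Collect XML macros (name -> child elements) and tokens (name -> text)."""
    xml_macros = {m.get("name"): list(m) for m in macros_root.findall("xml")}
    tokens = {t.get("name"): (t.text or "") for t in macros_root.findall("token")}
    return xml_macros, tokens


def expand_tool(tool_root, xml_macros, tokens):
    """Expand <expand macro="..."/> elements and @TOKENS@ in place."""
    for parent in list(tool_root.iter()):
        # Walk children backwards so earlier indices stay valid after edits.
        for index, child in reversed(list(enumerate(parent))):
            if child.tag == "expand":
                parent.remove(child)
                for offset, element in enumerate(xml_macros[child.get("macro")]):
                    parent.insert(index + offset, copy.deepcopy(element))
    for element in tool_root.iter():
        if element.text:
            for name, value in tokens.items():
                element.text = element.text.replace(name, value)
    return tool_root


xml_macros, tokens = load_macros(ET.fromstring(MACROS_XML))
expanded = expand_tool(ET.fromstring(TOOL_XML), xml_macros, tokens)
print(ET.tostring(expanded, encoding="unicode"))
```

Keeping the shared fragments in one file means a fix to the database selector reaches every tool that expands it.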
Galaxy BLAST+ setup at
James Hutton Institute
• Task splitting/job parallelization enabled
• Batches of 1000 query sequences
• Use $GALAXY_SLOTS=4 to exploit our entire cluster
• Ensures even our older 4 core nodes are used
• Update NCBI BLAST databases using cron job
• We don’t keep date-stamped versions
• Wrapper script caches BLAST databases on cluster nodes,
https://github.com/peterjc/picobio/tree/master/blast
• Databases on network storage were too slow
• (BeeGFS currently under evaluation ... talk to Björn)
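Splitting a query file into batches of 1000 sequences can be pictured with the following minimal Python sketch. It is not Galaxy's task-splitting code; the function names fasta_records and batches are hypothetical and the parser handles only simple FASTA input.

```python
def fasta_records(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
    if header is not None:
        yield header, "".join(chunks)


def batches(records, size=1000):
    """Group records into lists of at most `size` items, preserving order."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch
```

Each batch can then be searched as a separate job, with the per-batch outputs merged afterwards.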
Work in progress
• Data manager(s) for BLAST databases
• PSI-BLAST
• DELTA-BLAST
• Tests for every tool
• ...
• See https://github.com/peterjc/galaxy_blast/issues
DB configuration via *.loc files
$ cat /mnt/galaxy/galaxy-central/tool-data/blastdb.loc
nt	NCBI nt nucleotide sequence database	/mnt/scratch/local/blast/ncbi/nt
est_others	NCBI EST others (non-human, non-mouse) nucleotide sequence database	/mnt/scratch/local/blast/ncbi/est_others
est_mouse	NCBI EST mouse nucleotide sequence database	/mnt/scratch/local/blast/ncbi/est_mouse
est_human	NCBI EST human nucleotide sequence database	/mnt/scratch/local/blast/ncbi/est_human
est	NCBI EST nucleotide sequence database	/mnt/scratch/local/blast/ncbi/est
oomycete_genes	Oomycete predicted genes	/mnt/shared/cluster/blast/galaxy/oomycete_genes
oomycete_ests	Oomycete ESTs	/mnt/shared/cluster/blast/galaxy/oomycete_ests
oomycete_scaffolds	Oomycete genome scaffolds	/mnt/shared/cluster/blast/galaxy/oomycete_scaffolds
...
• Data Managers could make this easier...
• Blankenberg et al. (2014) Wrangling Galaxy’s reference data.
http://doi.org/10.1093/bioinformatics/btu119
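Each line of blastdb.loc holds three tab-separated columns: an identifier, a display name and the database path prefix. The Python sketch below reads such lines into dictionaries. It is an illustration rather than Galaxy's own loader; the function name parse_blastdb_loc and the dictionary keys are hypothetical, and skipping "#" comment lines is an assumption about how .loc files are usually written.

```python
def parse_blastdb_loc(lines):
    """Parse tab-separated .loc lines into dicts with value, name and path."""
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment lines
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 3:
            continue  # ignore malformed lines rather than guessing
        entries.append({"value": fields[0], "name": fields[1], "path": fields[2]})
    return entries
```

The path is a prefix shared by the database files, not a single file on disk.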
Acknowledgements – Groups
• NCBI BLAST developers (and their help desk team)
• Galaxy Community
• Galaxy Team, especially Core Developers & Tool Shed team
• Intergalactic Utilities Commission (IUC, “Tool Shed Police”)
• GitHub and Bitbucket for repository hosting
• TravisCI for testing open source projects free of charge
• Our various testers @ JHI and ALU-Freiburg
• Our funders
Acknowledgements – Individuals
• Björn Grüning (packaging, macros, datatypes, masking, etc)
• Dan Blankenberg (Data Managers)
• Dannon Baker (Tool Shed migration)
• Edward Kirton (datatype for BLAST databases)
• Jim Johnson (tabular output enhancements)
• John Chilton (macros, $GALAXY_SLOTS, test framework, etc)
• Kanwei Li (merges while using main Galaxy repository)
• Luobin Yang (initial PSI-BLAST work)
• Nicola Soranzo (masking support, datatype work, etc)
Key References
• Camacho et al. (2009)
BLAST+: architecture and applications.
http://doi.org/10.1186/1471-2105-10-421
• Cock et al. (2013)
Galaxy tools and workflows for sequence analysis with
applications in molecular plant pathology.
http://doi.org/10.7717/peerj.167
• Galaxy BLAST+ wrappers
https://github.com/peterjc/galaxy_blast
https://github.com/peterjc/galaxy_blast
http://slideshare.net/pjacock/
@pjacock #usegalaxy
