Biogrid – Bioinformatics for the grid

    Joel Hedlund <yohell@ifm.liu.se>
       Biogrid User and Developer
      Linköping University, Sweden

      Birds-of-a-feather session tonight: see me after this talk!
Outline
•   What is it?
•   What is it good for?
•   Does it really work?
•   Gory details.
•   Why did we do this?
•   Profit!
What is it?



NDGF BIO Community Grid
   Bioinformatics for the Grid
What is it?
• Unified interface
  ...to popular bioinformatic applications
  ...on shared, distributed computational resources
  ...using versioned and cached databases
What is it good for?
• Burst computing
  – High demand for short periods of time
     • high during development / production
     • low during analysis / writing papers
  – Share resources to enable more efficient use
• Database accessibility
• Availability
• Unified interface
What is NDGF?
• Nordic Data Grid Facility
• A WLCG Tier1 facility
  – Worldwide LHC Computing Grid
  – Stores and processes data from LHC at CERN
     • peak rate ≈ 1.6 Gb/s, when the accelerator is running
       (and that’s after most of the data have been filtered away)
“Does it really work, this
  distributed thingie?”
 Why yes, very well, thank you!
NDGF
• 96% availability
  (highest of all Tier1 facilities)

• Third largest Tier1 facility in the world
• Lowest ratio of failed ATLAS jobs
• Production goals met, and beyond
   – Goal: 8% of all ATLAS resources (10.5% provided)
   – Goal: 9% of all ALICE resources (12% provided)




* Data graciously stolen from Leif Nixon's NorduNet 2008 talk. Thank you, Leif :-)
DISTRIBUTION
    IS A
 STRENGTH
It enforces unification

It ensures availability
Does it really work?


 It’s good enough for LHC.
It’s good enough for Bioinformatics.
Gory details
Biogrid provides
Optimised applications:
  – BLAST
  – ClustalW
  – HMMER
  – Muscle
  – Mafft




                          Planned: molecular dynamics, phylogeny...
Biogrid provides
Versioned, indexed and cached databases
  – UniProtKB (subreleases)
  – UniRef (subreleases)




                       Planned: genomes (EnsEMBL), nucleotides (EMBL)...
Cached database access




Database files are transferred to the cluster at most once per project.
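
The "at most once" idea can be illustrated with a small shell sketch. This is not Biogrid's actual caching mechanism: fetch_once, the FETCH variable and the cache directory layout are all hypothetical, and plain cp stands in for whatever transfer tool a site really uses.

```shell
#!/bin/sh
# Sketch of "transfer at most once": copy a database file into a cache
# directory only if it is not already there.
# Assumptions: fetch_once, FETCH and the cache layout are hypothetical;
# cp stands in for a real transfer tool.

FETCH=${FETCH:-cp}

fetch_once() {
    # $1: source file, $2: cache directory
    src=$1
    cache=$2
    dest=$cache/$(basename "$src")
    mkdir -p "$cache"
    if [ -e "$dest" ]; then
        # Already cached: no transfer needed.
        echo "cached: $dest"
    else
        # Transfer to a temporary name first, so an interrupted copy is
        # never mistaken for a complete cached file.
        $FETCH "$src" "$dest.part" && mv "$dest.part" "$dest" && echo "fetched: $dest"
    fi
}
```

Calling fetch_once twice with the same arguments transfers the file on the first call only; the second call just reports the cached copy.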
Unified Interface


[Figure labels: DATA, RESULTS]
Unified Interface
• XRSL Job Description
  Standard in ARC Grid Middleware

• Well-defined runtime environments
   $HMMERDIR: node-local (fast) scratch dir containing db files
   prepare_db: download and unpack db files on the fly from front node to $HMMERDIR
XRSL Job Description
(jobName=refinehmm-family023)
(runTimeEnvironment=APPS/BIO/HMMER2.3.2)
(cpuTime=3000)
(executable=refinehmm.jobscript.sh)
(inputFiles=
  (sp.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_sprot.fasta.gz)
  (tr.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_trembl.fasta.gz)
  (family023.hmm "")
)
(outputFiles=
  (family023.refined.hmm "")
)
XRSL Job Description
(jobName=refinehmm-$HMM_NAME)
(runTimeEnvironment=APPS/BIO/HMMER2.3.2)
(cpuTime=3000)
(executable=refinehmm.jobscript.sh)
(inputFiles=
  (sp.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_sprot.fasta.gz)
  (tr.gz srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8/uniprot_trembl.fasta.gz)
  ($HMM_NAME.hmm "")
)
(outputFiles=
  ($HMM_NAME.refined.hmm "")
)
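
One way to turn the parametrised description above into concrete job files is sketched below. It assumes that $HMM_NAME is a placeholder to be expanded before submission rather than something the grid middleware substitutes; render_xrsl and the db variable are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Sketch: expand the job description template for one HMM name.
# Assumption: $HMM_NAME is a placeholder that we expand ourselves;
# render_xrsl and db are illustrative names, not part of ARC or Biogrid.

render_xrsl() {
    # $1: HMM name, e.g. family023. Prints a job description to stdout.
    HMM_NAME=$1
    db=srm://srm.ndgf.org/biogrid/db/uniprot/UniProt14.8
    cat <<EOF
(jobName=refinehmm-${HMM_NAME})
(runTimeEnvironment=APPS/BIO/HMMER2.3.2)
(cpuTime=3000)
(executable=refinehmm.jobscript.sh)
(inputFiles=
  (sp.gz ${db}/uniprot_sprot.fasta.gz)
  (tr.gz ${db}/uniprot_trembl.fasta.gz)
  (${HMM_NAME}.hmm "")
)
(outputFiles=
  (${HMM_NAME}.refined.hmm "")
)
EOF
}

# Example: print the description for one family.
render_xrsl family023
```

Redirecting the output to a file, one per family, gives a set of job descriptions that can be submitted individually.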
Unified Interface
• Run on any resource I can access:
  $ ngsub myjob.xrsl

• ...or run on my buddy’s cluster:
  $ ngsub -c kiniini.csc.fi myjob.xrsl

• Check jobs:
  $ ngstat refinehmm-family023
  (or use Grid Monitor web interface at www.nordugrid.org)

• Fetch results:
  $ ngget refinehmm-family*
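
The commands above can be combined into a small driver script. The sketch below uses only the ngsub, ngstat and ngget invocations shown on this slide; the run helper and the DRY_RUN switch are hypothetical additions that print the commands instead of executing them, and the second file name is illustrative.

```shell
#!/bin/sh
# Sketch: submit one job per description file, then check and fetch by
# job name. Only the command forms shown on the slide are used.
# Assumption: run() and DRY_RUN are illustrative helpers, not part of ARC.

run() {
    # In dry-run mode (the default here) print the command, else run it.
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

submit_all() {
    # $@: job description files to submit
    for xrsl in "$@"; do
        run ngsub "$xrsl"
    done
}

check_job() { run ngstat "$1"; }
fetch_job() { run ngget "$1"; }

submit_all refinehmm-family023.xrsl refinehmm-family024.xrsl
check_job refinehmm-family023
fetch_job refinehmm-family023
```

Setting DRY_RUN=0 makes the script call the real client commands, which requires an installed ARC client and a valid grid certificate.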



[Figure labels: DATA, GRID, RESULTS]
What do I need?
    1. A resource with ARC and Biogrid REs
    2. An ARC client
    3. A Grid Certificate
       (available from a number of global certificate authorities)

    4. Time allowance on the resource



    (5. Biogrid VO membership:
       not strictly necessary, but it will get you 1 & 4)
What do I need?



...or you can just grab the RE scripts off the Biogrid website,
        and your db of choice from the Biogrid dCache.
Why did we do this?
Bioinformatic applications...
  – CPU intensive
  – Small input and output files
  – “Large” databases can be cached

...are very well suited for distributed computing.
Profit!
Subclassification of the MDR superfamily

• 15000 members
    from all kingdoms of life

• 500 families
    25% sequence identity

•   40 human members
•   Different substrate specificities
•   Different subunit & cofactor count
•   2 HMMs available for superfamily detection
•   None for any of the individual families
Subclassification of the MDR superfamily

• We made HMMs for all MDR (sub)families
  with 20+ members.
• 86 families
• 34 subfamilies detected in 14 of these
• 11579 / 15000 sequences classified
• ≈ 5000 hmmsearch runs vs UniProtKB



                                Manuscript in preparation
refinehmm
• Algorithm for automated HMM refinement
• Produces stable and reliable HMMs
• Developed using Biogrid REs and resources




                Will also be open source software once the paper is out.
Acknowledgements
  • Olli Tourunen
    Biogrid developer
  • Bengt Persson
    Biogrid PI
  • NDGF
    Michael Grønager
    Josva Kleist
  • Biogrid co-applicants
    Ann-Charlotte Berglund Sonnhammer
    Erik Sonnhammer
    Inge Jonassen

  Supercomputing centers
  • NSC
    Jens Larsson, Leif Nixon
  • HPC2N
    Åke Sandgren
  • Others
    C3SE, CSC, Uppmax, Lunarc, PDC,
    Aalborg University, Oslo University

Joel Hedlund
yohell@ifm.liu.se
Biogrid User and Developer
Linköping University, Sweden

Birds-of-a-feather session tonight: see me after the talk!

