Hadoop for Bioinformatics
                       Deepak Singh
                    Amazon Web Services




Hadoop World, NYC
Via Reavel under a CC-BY-NC-ND license
By ~Prescott under a CC-BY-NC license
data sets
many data sets
PFAM, PDB, GENBANK, ENSEMBL
Many Others
manageable
Image: Matt Wood
Human genome
Image: Matt Wood
Image: Matt Wood
~100 TB/Week
                       >2 PB/Year
Image: Matt Wood
years
days
hours
gigabytes
terabytes
petabytes
really fast
typical informatics workflow
Via Christolakis under a CC-BY-NC-ND license
Via Argonne National Labs under a CC-BY-SA license
killer app




Via Argonne National Labs under a CC-BY-SA license
Via asklar under a CC-BY license
Image: Chris Dagdigian
rethink algorithms
rethink computing
rethink data management
rethink data sharing
operational mindset
scalability
we are data geeks not data center geeks
two key trends
develop applications
distribute applications
use applications
some work
filters
some work
   ^
High Throughput Sequence Analysis
Mike Schatz, University of Maryland
• Read Mapping
• Mapping & SNP Discovery
• De novo Genome Assembly
Short Read Mapping
Asian Individual Genome: 3.3 Billion 35bp, 104 GB (Wang et al., 2008)

African Individual Genome: 4.0 Billion 35bp, 144 GB (Bentley et al., 2008)
Alignment > 10000 CPU hrs
Seed & Extend
Good alignments must contain a significant exact alignment

Minimal exact alignment length = l/(k+1), for a read of length l aligned with at most k differences

Expensive to scale

Need parallelization framework
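
The l/(k+1) bound is a pigeonhole argument: cut a read of length l into k+1 non-overlapping seeds, and k differences can touch at most k of them, so at least one seed must match exactly. A minimal Python sketch of that idea follows; the function names and the toy sequences are illustrative assumptions, and only substitutions are considered.

```python
def split_into_seeds(read, k):
    """Cut a read into k+1 non-overlapping seeds of length len(read) // (k + 1)."""
    seed_len = len(read) // (k + 1)
    return [(i * seed_len, read[i * seed_len:(i + 1) * seed_len])
            for i in range(k + 1)]


def exact_seed_offsets(read, target, k):
    """Offsets of the seeds that match the target exactly at the same position."""
    return [offset for offset, seed in split_into_seeds(read, k)
            if target[offset:offset + len(seed)] == seed]


read = "TTGACTAG"
target = "TTGACCAG"  # toy target: differs from the read at one position, so k = 1

print(split_into_seeds(read, 1))            # [(0, 'TTGA'), (4, 'CTAG')]
print(exact_seed_offsets(read, target, 1))  # [0]
```

The single mismatch falls in the second seed, so the first seed still matches exactly and can serve as the anchor for extension.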
CloudBurst




Catalog k-mers     Collect seeds   End-to-end alignment
http://cloudburst-bio.sourceforge.net; Bioinformatics 2009 25: 1363-1369
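
The three phases named above (catalog k-mers, collect seeds, end-to-end alignment) line up with map, shuffle, and reduce. The sketch below simulates them in a single Python process. It is not CloudBurst's code: the function names are hypothetical, reads are assumed to emit only non-overlapping seeds, and extension counts substitutions only.

```python
from collections import defaultdict


def map_kmers(name, sequence, seed_len, is_reference):
    """Map: emit (seed, record) pairs. The reference emits every overlapping
    k-mer; a read emits only its non-overlapping seeds (sketch assumption)."""
    step = 1 if is_reference else seed_len
    for pos in range(0, len(sequence) - seed_len + 1, step):
        yield sequence[pos:pos + seed_len], (name, pos, is_reference)


def shuffle(pairs):
    """Shuffle: collect all records that share the same seed."""
    groups = defaultdict(list)
    for seed, record in pairs:
        groups[seed].append(record)
    return groups


def reduce_seed(records, reads, reference, k):
    """Reduce: extend each shared seed into an end-to-end alignment."""
    ref_hits = [r for r in records if r[2]]
    read_hits = [r for r in records if not r[2]]
    for _, ref_pos, _ in ref_hits:
        for read_name, read_pos, _ in read_hits:
            read = reads[read_name]
            start = ref_pos - read_pos
            window = reference[start:start + len(read)]
            if start < 0 or len(window) != len(read):
                continue  # the read would hang off either end of the reference
            diffs = sum(a != b for a, b in zip(read, window))
            if diffs <= k:
                yield read_name, start, diffs


def align(reads, reference, k):
    """Run the three phases and return sorted (read, start, differences) tuples."""
    seed_len = min(len(r) for r in reads.values()) // (k + 1)
    pairs = list(map_kmers("ref", reference, seed_len, True))
    for name, read in reads.items():
        pairs.extend(map_kmers(name, read, seed_len, False))
    results = set()
    for records in shuffle(pairs).values():
        results.update(reduce_seed(records, reads, reference, k))
    return sorted(results)


print(align({"r1": "TTGACTAG"}, "GGGACGTTTGACCAGTAGGG", k=1))  # [('r1', 7, 1)]
```

On a cluster, each seed group would go to a separate reducer, which is where the parallelism comes from.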
CloudBurst efficiently reports every k-difference alignment of every read
many applications only need the best alignment
Bowtie: Ultrafast short read aligner

Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biol 10 (3): R25.

SOAPsnp: Consensus alignment and SNP calling

Crossbow: Rapid whole genome SNP analysis
Ben Langmead
Preprocessed reads
Map: Bowtie
Sort: Bin and partition
Reduce: SOAPsnp
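
The shape of this pipeline can be sketched in a few lines of Python. Everything below is a toy stand-in, not Crossbow itself: a brute-force aligner takes the place of Bowtie, a majority vote takes the place of SOAPsnp, and `BIN_SIZE`, the function names, and the sequences are assumptions made for illustration.

```python
from collections import Counter, defaultdict

BIN_SIZE = 4  # hypothetical width of a genomic partition, in bases


def map_align(read, reference):
    """Map (stand-in for Bowtie): place the read where it has fewest mismatches."""
    best = min(
        range(len(reference) - len(read) + 1),
        key=lambda s: sum(a != b for a, b in zip(read, reference[s:s + len(read)])),
    )
    # Emit one (bin, position, base) record per aligned base.
    for i, base in enumerate(read):
        pos = best + i
        yield pos // BIN_SIZE, pos, base


def sort_and_partition(records):
    """Sort: group records by bin, ordered by position within each bin."""
    bins = defaultdict(list)
    for bin_id, pos, base in sorted(records):
        bins[bin_id].append((pos, base))
    return bins


def reduce_call_snps(pileup, reference):
    """Reduce (stand-in for SOAPsnp): majority-vote consensus per position."""
    by_pos = defaultdict(Counter)
    for pos, base in pileup:
        by_pos[pos][base] += 1
    for pos in sorted(by_pos):
        consensus, _ = by_pos[pos].most_common(1)[0]
        if consensus != reference[pos]:
            yield pos, reference[pos], consensus


def run(reads, reference):
    """Chain the three stages; each bin could be reduced on a different node."""
    records = [rec for read in reads for rec in map_align(read, reference)]
    snps = []
    for _, pileup in sorted(sort_and_partition(records).items()):
        snps.extend(reduce_call_snps(pileup, reference))
    return snps


reference = "ACGTTGCAAGCT"
reads = ["ACGTTT", "GTTTCA", "TTCAAG", "CAAGCT"]  # sampled from ACGTTTCAAGCT
print(run(reads, reference))  # [(5, 'G', 'T')]
```

Binning by position means each reducer sees every base covering its stretch of the genome, so SNP calls for different bins are independent of one another.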
Crossbow condenses over 1,000 hours
of resequencing computation into a few
hours without requiring the user to own
or operate a computer cluster
Comparing Genomes
Estimating relative evolutionary rates from sequence comparisons:
Identification of probable orthologs
Admissible comparisons: A or B vs. D; C vs. E
Inadmissible comparisons: A or B vs. E; C vs. D
[Figure: species tree and gene tree; A, B, C under S. cerevisiae and D, E under C. elegans]
Estimating relative evolutionary rates from sequence comparisons:
1. Orthologs found using the reciprocal smallest distance algorithm
2. Build alignment between two orthologs
>Sequence C
MSGRTILASTIAKPFQEEVTKAVKQLNFT-----PKLVGLLSNEDPAAKMYANWTGKTCESLGFKYEL-…
>Sequence E
MSGRTILASKVAETFNTEIINNVEEYKKTHNGQGPLLVGFLANNDPAAKMYATWTQKTSESMGFRYDL…
3. Estimate distance given a substitution matrix
[Figure: partial substitution matrix over Phe, Ala, Pro, Leu, Thr with entries of the form µπ]
RSD algorithm summary
[Figure: Genome I (sequence Ib) and Genome J (sequence Jc); in each direction, align sequences and calculate distances]
b vs. a: D=0.2; b vs. b: D=0.3; b vs. c: D=0.1
c vs. a: D=1.2; c vs. b: D=0.1; c vs. c: D=0.9
Orthologs: ib - jc, D = 0.1
Prof. Dennis Wall
Harvard Medical School
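
The reciprocal step of RSD can be sketched with a lookup table in place of the alignment and distance estimation. This is a simplified illustration under stated assumptions: the function names are hypothetical, and reading the figure's two distance columns as "Ib against genome J" and "Jc against genome I" is this sketch's interpretation of the slide.

```python
def closest(query, distances):
    """Return the subject sequence with the smallest distance to the query."""
    candidates = {subj: d for (q, subj), d in distances.items() if q == query}
    return min(candidates, key=candidates.get)


def rsd_ortholog(query, forward, reverse):
    """Return (query, hit, distance) if the smallest-distance hit is reciprocal,
    otherwise None."""
    hit = closest(query, forward)
    if closest(hit, reverse) == query:
        return query, hit, forward[(query, hit)]
    return None


# Distances taken from the figure; the direction labels are assumed.
forward = {("Ib", "Ja"): 0.2, ("Ib", "Jb"): 0.3, ("Ib", "Jc"): 0.1}
reverse = {("Jc", "Ia"): 1.2, ("Jc", "Ib"): 0.1, ("Jc", "Ic"): 0.9}

print(rsd_ortholog("Ib", forward, reverse))  # ('Ib', 'Jc', 0.1)
```

Ib's nearest sequence in genome J is Jc, and Jc's nearest sequence in genome I is Ib, so the pair is reported as orthologs with D = 0.1.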
Roundup is a database of orthologs
and their evolutionary distances.
To get started, click browse. Alternatively, you can
read our documentation here.
Good luck, researchers!
massive computational demand
1000 genomes = 5,994,000 processes = 23,976,000 hours
2737 years
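
These figures can be reproduced with a few lines of arithmetic. The two per-unit factors below are not stated on the slide; they are inferred by dividing the slide's own numbers, and the year is taken as 365 days.

```python
from math import comb

genomes = 1000
runs_per_pair = 12      # inferred: 5,994,000 processes / C(1000, 2) genome pairs
hours_per_process = 4   # inferred: 23,976,000 hours / 5,994,000 processes

pairs = comb(genomes, 2)              # every genome compared with every other
processes = pairs * runs_per_pair
hours = processes * hours_per_process
years = hours / (24 * 365)

print(pairs, processes, hours, round(years))
# 499500 5994000 23976000 2737
```

The pair count grows with the square of the number of genomes, which is why the workload has to scale out rather than run on a single machine.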
periodic task
must scale up
not scalability gurus
hadoop streaming
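
Hadoop Streaming lets the mapper and reducer be any program that reads lines on standard input and writes tab-separated key/value lines on standard output, with the reducer's input sorted by key. The sketch below shows that contract as two Python generators. It is not the Roundup implementation; the input format (`query subject distance` per line) and the function names are assumptions chosen to match the ortholog example.

```python
from itertools import groupby


def mapper(lines):
    """Emit 'query<TAB>subject,distance' for each 'query subject distance' line."""
    for line in lines:
        query, subject, distance = line.split()
        yield f"{query}\t{subject},{distance}"


def reducer(lines):
    """For each query (input sorted by key), emit its smallest-distance subject."""
    parsed = (line.rstrip("\n").split("\t", 1) for line in lines)
    for query, group in groupby(parsed, key=lambda kv: kv[0]):
        subject, distance = min(
            (value.split(",") for _, value in group),
            key=lambda sd: float(sd[1]),
        )
        yield f"{query}\t{subject},{distance}"


# Local simulation: sorting the mapper output stands in for the shuffle phase.
records = ["Ib Ja 0.2", "Ib Jb 0.3", "Ib Jc 0.1", "Ic Ja 0.5"]
for out in reducer(sorted(mapper(records))):
    print(out)
```

In a real job each function would sit in its own script that iterates over `sys.stdin` and prints its results, so existing command-line tools can be reused without writing Java.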
compared 50+ genomes
what’s next?
de novo assembly
machine learning and statistics
protein structure prediction
docking
trajectory analysis
key driving factors?
the ecosystem
Pig
Cascading
Hive
RHIPE
domain specific libraries and tools
http://aws.amazon.com/publicdatasets/
http://aws.amazon.com/education/
Thank you!




     deesingh@amazon.com; Twitter:@mndoci
Presentation ideas from @mza, @simon and @lessig
