Large Scale Resequencing: Approaches and Challenges


    Thomas Keane
    Vertebrate Resequencing Informatics group
    Wellcome Trust Sanger Institute
    Hinxton, Cambridge, UK

    thomas.keane@sanger.ac.uk



AGBT Tutorial Workshop   15th February, 2012
Sanger total sequence (2007-2009)
[Figure: total sequence output, axis in Gbp]

Sanger total sequence to-date
[Figure: total sequence output, axis in Gbp]

Vertebrate Resequencing Informatics Group

     Established in 2008 with Jim Stalker
         PIs: Richard Durbin and David Adams
     Initial projects
         1000 Genomes Project (http://www.1000genomes.org)
               Data processing, releases, aligner evaluation, sequencing
               Pilot 2008-2009: ~5Tbp (Nature 2010;467)
               Phase 1 2009-2011: ~30Tbp
               Phase 2 2011-: ~36.9Tbp (low-coverage Illumina data only)
         Mouse Genomes Project (http://www.sanger.ac.uk/mousegenomes)
               Sequencing 17 laboratory mouse strains
               SNPs, indels, SVs, de novo assembly
               ~1.2Tbp (Nature 2011;477)


UK10K

 Investigating the role of rare genetic variants in health and disease
 Whole genome cohorts: 4,000 individuals across two well-established and deeply
 phenotyped UK cohorts with ongoing longitudinal phenotype collection:
     TWINSUK – 2,000
     ALSPAC – 2,000
     6x (18Gbp) per sample

 Exomes: 6,000 exomes from 3 sets of extreme phenotype individuals
    Neurodevelopmental diseases – 3,000
        e.g. schizophrenia, autism spectrum disorders
    Obesity – 2,000
        e.g. severe childhood onset obesity
    Rare diseases – 1,000
        e.g. severe insulin resistance, congenital heart disease, ciliopathies
    5Gbp per sample

 Expect to generate ~100Tbp by end 2012
    ~40Tbp from BGI
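
As a quick sanity check of the ~100Tbp figure, the sample counts and per-sample yields above can be multiplied out. A minimal Python sketch, using only numbers stated on this slide (the variable names are illustrative):

```python
# Sample counts and per-sample yields as stated on the slide.
WGS_SAMPLES = 4_000          # TWINSUK 2,000 + ALSPAC 2,000
WGS_GBP_PER_SAMPLE = 18      # 6x coverage
EXOME_SAMPLES = 6_000
EXOME_GBP_PER_SAMPLE = 5

wgs_tbp = WGS_SAMPLES * WGS_GBP_PER_SAMPLE / 1_000        # Gbp -> Tbp
exome_tbp = EXOME_SAMPLES * EXOME_GBP_PER_SAMPLE / 1_000  # Gbp -> Tbp
total_tbp = wgs_tbp + exome_tbp

print(f"Whole genomes: {wgs_tbp:.0f} Tbp")    # 72 Tbp
print(f"Exomes:        {exome_tbp:.0f} Tbp")  # 30 Tbp
print(f"Total:         {total_tbp:.0f} Tbp")  # 102 Tbp
```

The whole-genome cohorts account for roughly 70% of the expected bases, even though there are fewer whole-genome samples than exomes.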


Current Status

 Recently passed the 1000 Genomes Project in terms of total Gbp

What are the challenges?

 NGS at scale raises four challenges:
    Storage
    Compute
    Software/Workflows
    Power

Data Production Workflow

 Import + Improvement (per lane)
    Fastq -> Alignment (bwa, smalt etc) -> BAM
    BAM -> Improvement -> BAM
 Merge Up
    Library merge: lane BAMs -> one BAM per library
    Sample/Platform merge: library BAMs -> one BAM per sample (e.g. NA34842, NA87465)
 Freeze

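
The merge-up part of this workflow amounts to grouping lane-level BAMs by library, and libraries by sample. A rough Python sketch of that grouping follows; the lane records, library names, file names, and the `plan_merges` helper are all hypothetical and stand in for whatever tracking system is actually used:

```python
from collections import defaultdict

def plan_merges(lanes):
    """Group lane BAM paths by sample and library.

    `lanes` is an iterable of (sample, library, lane_bam) tuples.
    Returns {sample: {library: [lane_bam, ...]}}.
    """
    plan = defaultdict(lambda: defaultdict(list))
    for sample, library, lane_bam in lanes:
        plan[sample][library].append(lane_bam)
    # Convert to plain dicts so the result prints and compares cleanly.
    return {sample: dict(libraries) for sample, libraries in plan.items()}

# Hypothetical lane records; only the sample names come from the slide.
lanes = [
    ("NA34842", "lib1", "lane1.bam"),
    ("NA34842", "lib1", "lane2.bam"),
    ("NA34842", "lib2", "lane3.bam"),
    ("NA87465", "lib3", "lane4.bam"),
]
merge_plan = plan_merges(lanes)
for sample, libraries in merge_plan.items():
    for library, bams in libraries.items():
        print(f"{sample} / {library}: merge {len(bams)} lane BAM(s)")
```

Each inner list would become one library-level merge, and each outer entry one sample-level merge.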
Data Production Workflow

 Merge across
    Per-sample BAMs (NA19294, NA18943, NA19305, ... NA19309) split by chromosome (Chr1, Chr2, Chr3, ...)
    Cross-sample BAMs, with reads tagged by sample (RG:NA19294, RG:NA18943, RG:NA19305)
 Variant Calling
    SNPs/indels: samtools, GATK
    SVMerge: Genome STRiP
    VQSR
    BEAGLE/Impute2
    VEP Annotation
    Final VCF

Storage Challenges

 Expect ~200Tbp of sequence in 2011-2012
   Working estimate of 10 bytes per bp, covering processing, release, and variant calling
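
Multiplying the two figures above gives the implied storage footprint. A minimal Python sketch, assuming decimal units (1 PB = 10^15 bytes):

```python
SEQUENCE_TBP = 200   # expected sequence for 2011-2012, from the slide
BYTES_PER_BP = 10    # working estimate, from the slide

total_bytes = SEQUENCE_TBP * 10**12 * BYTES_PER_BP
total_pb = total_bytes / 10**15  # decimal petabytes (assumed unit convention)

print(f"{total_pb:.1f} PB")  # 2.0 PB
```

That is about 2 PB of disk for two years of sequencing, which is why the number of copies matters so much.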

 Storage considerations
   Scalability – can we easily add more storage units?
   Backup and disaster recovery – what do we really need to keep?
   Performance – sufficient I/O throughput to serve compute nodes
   Cost

 Data Formats
   Standardised formats – BAM & VCF

 Minimise the number of copies
   Aim for two copies at most – original lanes + release (stripped) BAM

A Tiered Storage Solution


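 [Figure: three storage tiers linked to the CPU farm. Values as extracted, top to bottom:
  cost 2, size 1, 3Gb/sec; cost 1, size 3, 800Mb/sec; cost 2, size 2, two off-site copies]
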
       Level 1
           Data: Current release vertical BAMs
           Processes: BAM merging + splitting, Variant calling (SNPs, indels, SVs)
       Level 2
           Data: Lane level BAMs
           Processes: Alignment, recalibration, local realignment
       Level 3
           Data: Previous release BAMs + variant calls backup

Data release + archiving: iRODS

 Integrated Rule-Oriented Data System (iRODS)
     Open source – origins in particle physics world
     Most important feature of iRODS is the Rule Engine
     Akin to source control system
 [Figure: iRODS spanning storage servers nfs01, nfs02, nfs03, nfs20 and an off-site server]
 Customise own application level metadata
     e.g. run, lane, plex, sample, library…
 Stores/searches key-value metadata on files:
            List all files from UK10K studies:
                     imeta -z seq qu -d study like 'UK10K_%'
                          /seq/5363/5363_1.bam
                          /seq/5363/5363_2.bam (.....and a whole lot more)
                Get metadata about a file:
                     imeta ls -d /seq/6534/6534_3#7.bam sample
                          attribute: sample
                          value: QTL191953
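
Output in the `attribute:` / `value:` form shown above is easy to turn into a lookup table for downstream scripts. A minimal Python sketch follows; `parse_imeta_ls` is a hypothetical helper, not part of iRODS, and it assumes only the two line types shown on the slide, ignoring anything else:

```python
def parse_imeta_ls(text):
    """Parse `attribute:` / `value:` pairs from `imeta ls` style output.

    Returns a dict mapping each attribute name to a list of its values.
    Lines that are not `attribute:` or `value:` lines are ignored.
    """
    metadata = {}
    current_attribute = None
    for line in text.splitlines():
        key, sep, rest = line.strip().partition(":")
        if not sep:
            continue  # no colon on this line, so not a key/value line
        if key == "attribute":
            current_attribute = rest.strip()
        elif key == "value" and current_attribute is not None:
            metadata.setdefault(current_attribute, []).append(rest.strip())
    return metadata

# Output text copied from the slide.
sample_output = """\
attribute: sample
value: QTL191953
"""
parsed = parse_imeta_ls(sample_output)
print(parsed)  # {'sample': ['QTL191953']}
```

Values are kept as lists because the same attribute can be attached to a file more than once.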

 Sanger production: BAM files from runs per lane per plex deposited
      BMC Bioinformatics 2011, 12:361

 Recently adopted for UK10K internal data release and archiving
      Users use meta-data queries to find their data
      Files can be part of multiple releases
 http://www.irods.org

Compute Pipeline Management: VRPipe

 VRPipe
   Managed and automated execution of sequences of arbitrary
     software against massive datasets across large compute clusters
   Error handling, optimal memory requests, batching of jobs, retrying
     failures, failure reporting, highly extendable, detailed job statistics
 1000 Genomes Phase 2 processed through VRPipe
   Tracked ~1 million jobs
   Total serial wall time: 9886 days, 3 hrs, 43 mins, 25 secs
   bwa_aln_fastq: ~2443 days total serial wall time
   Mean memory: 941MB/job (max 5637MB)
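
To put these job statistics in proportion, the totals can be reduced to a per-job mean and a share for the alignment step. A small Python sketch, treating "~1 million" as exactly 1,000,000 jobs (a rounding assumption):

```python
from datetime import timedelta

total = timedelta(days=9886, hours=3, minutes=43, seconds=25)  # from the slide
bwa_aln = timedelta(days=2443)   # approximate figure from the slide
jobs = 1_000_000                 # "~1 million", rounded

mean_job_seconds = total.total_seconds() / jobs
bwa_fraction = bwa_aln / total   # timedelta / timedelta gives a float

print(f"Mean wall time per job: {mean_job_seconds / 60:.1f} min")  # 14.2 min
print(f"bwa_aln_fastq share of wall time: {bwa_fraction:.0%}")     # 25%
print(f"Serial wall time: {total.days / 365:.1f} years")           # 27.1 years
```

So the typical job is short, and a single alignment step accounts for about a quarter of all wall time.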
 2012
   Fully migrate all NGS processes to VRPipe (data processing, SNP/indel/SV
     variant calling, and RNA-seq/ChIP-Seq pipelines)
   Management front-ends
   Create distributable VM for cloud rollout
 http://www.github.com/VertebrateResequencing/vr-pipe/wiki
 Contact: sb10@sanger.ac.uk

Even more scale up in 2012 – HiSeq 2500

 Currently takes 1-2 weeks to sequence a human genome
   High depth human genomes in a single day – Illumina HiSeq 2500
   Caucasian family with a severe T-cell deficiency in affected sibling
   Single run on HiSeq 2500 by Illumina per individual

   Sample     PF Yield (Gbp)   % Align   % ≥Q30   Mismatch R1 (%)   Mismatch R2 (%)   Run time (hrs)
   Father     117.7            89        92.6     0.4               0.5               25.5
   Mother     125.7            90.2      92.8     0.4               0.5               25.5
   Affected   124.4            90.3      92.4     0.4               0.5               25.5
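
The yields in this table translate into approximate depth once divided by the genome size. A rough Python sketch, assuming a 3 Gbp genome (the size implied by "6x (18Gbp)" on the UK10K slide) and ignoring duplicates and clipping; the `aligned_depth` helper is illustrative:

```python
GENOME_GBP = 3.0  # assumed genome size, implied by "6x (18Gbp)" on the UK10K slide

# (PF yield in Gbp, % aligned) per sample, from the table.
runs = {
    "Father": (117.7, 89.0),
    "Mother": (125.7, 90.2),
    "Affected": (124.4, 90.3),
}

def aligned_depth(yield_gbp, pct_aligned, genome_gbp=GENOME_GBP):
    """Approximate mean depth from aligned bases; ignores duplicates and clipping."""
    return yield_gbp * pct_aligned / 100 / genome_gbp

depths = {name: aligned_depth(*values) for name, values in runs.items()}
for name, depth in depths.items():
    print(f"{name}: ~{depth:.0f}x")
```

Under these assumptions each single run lands in the mid-to-high 30s for aligned depth.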




What does the data look like?




Upcoming Changes in 2012

 We cannot keep all of the data
   2007-2008: Keep everything including images from runs
   2009: BAM/Fastq – all of the base quality information
   2010-2011: Stripping original qualities and other unused tags
   2012-: Current formats contain lots of repetition
       Reference based compression
       Reducing quality information e.g. quality binning or quality budgets
       Potential formats: CRAM and/or Reduced BAM
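
Quality binning replaces each base quality with a representative value for its bin, so fewer distinct symbols need to be stored. A minimal Python sketch of the idea follows; the bin boundaries and representative values are hypothetical and do not reproduce any particular tool's scheme:

```python
import bisect

# Hypothetical bins. Bin i covers qualities below BIN_UPPER_BOUNDS[i]
# and at or above the previous bound; the last bin is open-ended.
BIN_UPPER_BOUNDS = [2, 10, 20, 30, 40]
BIN_REPRESENTATIVES = [0, 6, 15, 25, 35, 40]

def bin_quality(q):
    """Map a Phred quality score to its bin's representative value."""
    return BIN_REPRESENTATIVES[bisect.bisect_right(BIN_UPPER_BOUNDS, q)]

def bin_qualities(quals):
    """Apply bin_quality to every score in a read."""
    return [bin_quality(q) for q in quals]

original = [2, 9, 10, 19, 33, 38, 40, 41]
binned = bin_qualities(original)
print(binned)  # [6, 6, 15, 15, 35, 35, 40, 40]
```

Eight distinct scores collapse to four here, which is what makes the quality stream more compressible at the cost of resolution.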




CRAM Format

 CRAM models for compression
    Example read: TGAGCTCTAAGTACC with qualities 329183050298757
    Do nothing
    Lossless
    Quality lossy
       Horizontal: TGAGCTCTAAGTACC / 002020010022212
       Vertical:   TGAGCTCTAAGTACC / -2---30---9---7

 CRAM current performance
    [Figure: log scale 100, 10, 1, 0.1 comparing Untreated, CRAM lossless, CRAM combination model, and CRAM substitutions/insertions model]

 CRAM v0.6 released 13.2.12:
    •  Pairing information preservation regardless of distance
    •  Revised and improved lossless mode
    •  Option to preserve all unmapped reads
    •  Performance and bug fixes
    •  Arbitrary tags

 http://www.ebi.ac.uk/ena/about/cram_toolkit
 Source: Ewan Birney/Guy Cochrane, EBI
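
The core idea of reference-based compression is to store where a read aligns and how it differs from the reference, instead of storing every base. The Python sketch below is a toy illustration of that idea only, not the CRAM encoding itself; it handles substitutions but no insertions or deletions, and the reference string and the substituted base are made up:

```python
def encode_read(reference, position, read):
    """Encode a read as (position, length, substitutions) against the reference.

    Only handles ungapped alignments: substitutions, no insertions or deletions.
    """
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read extends past the end of the reference")
    subs = [(i, base)
            for i, (ref_base, base) in enumerate(zip(window, read))
            if ref_base != base]
    return position, len(read), subs

def decode_read(reference, encoded):
    """Rebuild the read from the reference and its stored differences."""
    position, length, subs = encoded
    bases = list(reference[position:position + length])
    for offset, base in subs:
        bases[offset] = base
    return "".join(bases)

reference = "ACGTTGAGCTCTAAGTACCGGA"  # toy reference containing the slide's 15-mer
read = "TGAGCTCTTAGTACC"              # the slide's read with one made-up substitution
encoded = encode_read(reference, 4, read)
print(encoded)  # (4, 15, [(8, 'T')])
assert decode_read(reference, encoded) == read
```

A read that matches the reference exactly encodes to a position and a length with an empty substitution list, which is where most of the saving comes from.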

Any questions?

 Richard Durbin, David Adams

 URLs
  •  VRPipe: https://github.com/VertebrateResequencing/vr-pipe
  •  iRODS@Sanger: BMC Bioinformatics 2011, 12:361
  •  http://www.slideshare.net/thomaskeane

