Evaluating and improving the chick
     genome & transcriptome


                 C. Titus Brown
        Asst Prof, CSE and Microbiology;
               BEACON NSF STC
            Michigan State University
                  ctb@msu.edu
Acknowledgements
This is joint work with Hans Cheng (USDA ADOL), Jerry
  Dodgson (MSU).

Likit Preeyanon (MSU) and Alexis Black Pyrkosz (ADOL)
  did the work.

All of the software discussed in this talk is available.


    This work was primarily supported by the USDA NIFA
                    through a grant to me.
Simulations show that incomplete gene reference
=> inaccurate differential expression from mRNAseq
                                                                                Single End Reads                                                                                            Paired End Reads
  % Transcripts Expressed Inaccurately (2-fold Difference)




                                                                                                              % Transcripts Expressed Inaccurately (2-fold Difference)
                                                             100%                                                                                                        100%
                                                                      10                                                                                                           10
                                                                        0%                                                                                                           0%
                                                             90%                                                                                                         90%
                                                                             ex                                                                                                           ex
                                                                                pr                                                                                                          pr
                                                             80%                   e   ss                                                                                80%                     es
                                                                                          io                                                                                                        sio
                                                                      75                    n                                                                                     75                    n
                                                             70%        %                                                                                                70%        %
                                                                             ex                                                                                                         ex
                                                                                pre                                                                                                        pre
                                                                                    ss                                                                                                        s
                                                             60%                       ion                                                                               60%                       sio
                                                                                                                                                                                                      n
                                                                      50%                                                                                                         50%
                                                             50%             expr                                                                                        50%            ex p
                                                                                 essio                                                                                                         ress
                                                                                                n                                                                                                     ion
                                                             40%                                                                                                         40%

                                                             30%      25% expressi                                                                                       30%      25% expre
                                                                                                on                                                                                                 ssion
                                                             20%                                                                                                         20%

                                                             10%                                                                                                         10%

                                                             0% 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%                                                              0% 0% 10% 20% 30% 40% 50% 60% 70% 80% 90% 100%
                                                                      % Reference Completeness                                                                                      % Reference Completeness




                                                                                                                                                                                                            Alexis Black Pyrkosz
Existing chick gene models lack exons,
isoforms



                                                      Our data




                                                         Models



 *This gene contains at least 4 isoforms.
                                            Likit Preeyanon
(Exon detection is pretty good.)




                            Likit Preeyanon
Different approaches to gene set prediction
yield distinct splice junction predictions




     > 95% of the assembly-based splice junctions are
        supported by 4 or more independent reads.
                                                     Likit Preeyanon
mRNAseq analysis with a combined de novo
and genome-based approach.




                                 Likit Preeyanon
We can produce combined gene models.




       Cufflinks (ref based) + de novo assembly + known mRNA
Gene Model Summary
     (note: spleen mRNAseq)
         Method                           Gene                         Transcript
Global Assembly                           14,832                            32,311
Local Assembly                            15,297                            23,028
Global + Local Assembly                   15,934                            46,797




 *Numbers of genes and transcripts may be overestimated due to incomplete
 assembly and spurious splice junctions.
Cross-validation with technical replicates
     Dataset                   Single-end                        Paired-end
                        Mapped          Unmapped          Mapped         Unmapped
Line 6 uninfected         18,375,966        5,203,586       21,598,218     12,065,659
                            (77.93%)         (22.07%)         (64.16%)       (35.84%)
Line 6 infected           17,160,695        6,288,286       15,274,638      8,633,855
                            (73.18%)         (26.82%)         (63.89%)        (36.11%)
Line 7 uninfected         18,130,072        5,795,737       20,961,033     11,960,299
                            (75.77%)         (24.22%)         (63.67%)       (36.33%)
Line 7 infected           19,912,046        5,450,521       22,485,833     11,992,002
                            (78.51%)         (21.49%)         (65.22%)       (34.78%)


            Single-end reads were used to generate gene models; paired-end data
                   were used for technical-replicate cross-validation.
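The percentages in the table are simply mapped / (mapped + unmapped). A small sketch that recomputes them from the read counts above; the dictionary layout and the function name `mapped_percent` are illustrative, not part of the original analysis.

```python
# Read counts copied from the table: (mapped, unmapped) per library type.
counts = {
    "Line 6 uninfected": {"single": (18_375_966, 5_203_586), "paired": (21_598_218, 12_065_659)},
    "Line 6 infected":   {"single": (17_160_695, 6_288_286), "paired": (15_274_638, 8_633_855)},
    "Line 7 uninfected": {"single": (18_130_072, 5_795_737), "paired": (20_961_033, 11_960_299)},
    "Line 7 infected":   {"single": (19_912_046, 5_450_521), "paired": (22_485_833, 11_992_002)},
}


def mapped_percent(mapped, unmapped):
    """Percentage of reads that mapped, out of all reads in the library."""
    return 100.0 * mapped / (mapped + unmapped)


for dataset, libraries in counts.items():
    for library, (mapped, unmapped) in libraries.items():
        print(f"{dataset:18s} {library:6s} {mapped_percent(mapped, unmapped):6.2f}% mapped")
```

In every dataset the paired-end libraries map at a lower rate than the single-end libraries, which is what the table shows.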
Gene Modeler Pipeline (“gimme”)
 Merge transcripts together based on transcript mapping to genome; can
  include existing gene predictions, & iteratively combine
  predictions.
 Construct gene models
 Remove redundant sequences
 Predict strands and ORFs




                                                           Likit Preeyanon
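To make the "merge" and "remove redundant" steps concrete, here is a toy sketch. It is not the gimme implementation: it assumes transcripts are already mapped to the genome as exon intervals, drops exact-duplicate exon chains, and groups the rest into loci wherever their genomic spans overlap. The function names and the tuple layout are hypothetical.

```python
from collections import defaultdict


def remove_redundant(transcripts):
    """Keep the first transcript seen for each (chrom, exon chain)."""
    seen, kept = set(), []
    for name, chrom, exons in transcripts:
        key = (chrom, tuple(exons))
        if key not in seen:
            seen.add(key)
            kept.append((name, chrom, exons))
    return kept


def group_into_loci(transcripts):
    """Group transcripts on the same chromosome whose spans overlap.

    transcripts: list of (name, chrom, [(start, end), ...]) with
    half-open exon intervals in genome coordinates.
    """
    by_chrom = defaultdict(list)
    for name, chrom, exons in transcripts:
        start = min(s for s, _ in exons)
        end = max(e for _, e in exons)
        by_chrom[chrom].append((start, end, name))

    loci = []
    for chrom in sorted(by_chrom):
        current, current_end = [], None
        for start, end, name in sorted(by_chrom[chrom]):
            if current and start < current_end:
                current.append(name)          # overlaps the open locus
                current_end = max(current_end, end)
            else:
                if current:
                    loci.append((chrom, current))
                current, current_end = [name], end
        if current:
            loci.append((chrom, current))
    return loci


# Made-up example mixing reference-based, de novo, and known-mRNA models.
transcripts = [
    ("cufflinks_1", "chr1", [(100, 200), (300, 400)]),
    ("denovo_1", "chr1", [(150, 200), (300, 450)]),
    ("denovo_dup", "chr1", [(100, 200), (300, 400)]),
    ("mrna_1", "chr1", [(1000, 1200)]),
    ("denovo_2", "chr2", [(100, 200)]),
]
unique = remove_redundant(transcripts)
loci = group_into_loci(unique)
print(loci)
```

Grouping by span overlap is the simplest choice; a real gene modeler would also need to respect strand and shared splice junctions.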
Next problem: chick reference!
 We like using the reference genome to scaffold RNAseq contigs;
  purely de novo RNAseq assembly is messy.
 Genomes are also useful for other things, we hear.
Problems:
 Poor sensitivity: the chick genome is missing a substantial number
  of genes from microchromosomes:
  723 genes from HSA19q are missing from chicken galGal4.
  ESTs and RNAseq transcripts exist for many or most of them.
 Gaps
  9900 gaps on ordered chromosomes
  21k gaps on chromosome-aligned but low-confidence, or unaligned, sequence
 Over-collapsed tandem duplications and under-collapsed heterozygous regions
Sensitivity – where is the problem?
Are microchromosomes hard to sequence or is
  microchromosomal sequence hard to assemble?

Sequences that simply don’t show up in the data are hard to
  include in the assembly…
  Unclonable (Sanger)
  Strong GC or AT bias


Sequences with biased (generally low) coverage are often
  discarded by assemblers.
Can we “even out” coverage?
(Digital normalization)


                         If you have two loci, or two
                         mRNA species, with uneven
                        coverage, can you remove the
                                extra coverage?
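A minimal sketch of one way to do this, assuming the median-k-mer-count formulation of digital normalization: stream through the reads once, and keep a read only if the median count of its k-mers (among reads kept so far) is still below a cutoff. The names `digital_normalization`, `k`, and `cutoff` are illustrative, the exact `Counter` stands in for the memory-efficient counting a real data set would need, and reverse complements are ignored for brevity.

```python
from collections import Counter
from statistics import median


def kmers(seq, k):
    """All overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def digital_normalization(reads, k=20, cutoff=20):
    """Single-pass filter: discard reads from already well-covered regions."""
    counts = Counter()
    kept = []
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read is shorter than k
        # The median k-mer count estimates this read's coverage so far.
        if median(counts[km] for km in read_kmers) < cutoff:
            kept.append(read)
            counts.update(read_kmers)
    return kept


# Two toy "loci" with very uneven coverage: 50x versus 5x.
high = "ACGTTGCAAG"
low = "TTTTGGGGCC"
reads = [high] * 50 + [low] * 5
kept = digital_normalization(reads, k=4, cutoff=10)
print(kept.count(high), kept.count(low))
```

In the toy input the 50x locus is capped at the cutoff while the 5x locus is kept in full, which is the "remove the extra coverage" behaviour asked about above.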
[Slides 15-20 of 2013 pag-poultry-workshop: images only, no text recovered]
Coverage before digital normalization:


                                  (MD amplified)
Coverage after digital normalization:

                            Normalizes coverage

                            Discards redundancy

                            Eliminates majority of
                            errors

                            Scales assembly dramatically.

                            Assembly is 98% identical.
Prelim results from digital
normalization
Reassembled chick genome contigs from 70x Illumina ->
 normalized reads in ~24 hours.
Obtained 40 Mbp of assembled contigs that were not present
 in galGal4.
Contig assembly contained partial or complete matches to
 70% of previously unmappable transcripts assembled from
 chick spleen mRNAseq.

⇒Bioinformatics remedies may help but are probably not
  sufficient.

                                                  Likit Preeyanon
Can we improve the assembly?
  Read cleaning and improvement

        1. Digital normalization evens out relative
        coverage, permitting recovery of
        difficult-to-sequence regions in assemblies.
        2. Error correction and read-to-graph
        concordance editing collapses
        heterozygous regions.
        3. Paired-end de Bruijn graphs can be
        used to include long-distance constraints
        in primary contig assembly.
        4. RNAseq data indicates contigs that can
        be combined into scaffolds.

  Selection of strategies and parameters

  Contig assembly and/or scaffolding

  Assembly assessment

        1. High-abundance k-mers present in the
        sequence data but missing from the
        assembly indicate poor sensitivity.
        2. Discordant long-insert mate pairs
        indicate potentially erroneous contigs and
        scaffolds.
        3. De novo RNAseq assembly can identify
        likely misassemblies and positively
        identify missing genomic sequence.
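Assessment point 1 can be sketched directly: count k-mers in the reads, collect the k-mers in the assembly, and report abundant read k-mers the assembly lacks. This is a toy under stated assumptions: exact in-memory counting, canonical k-mers (the lexicographically smaller of a k-mer and its reverse complement) so that strand does not matter, and illustrative names and thresholds (`missing_abundant_kmers`, `k`, `min_count`).

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Strand-independent form: the smaller of the k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def canonical_kmers(seq, k):
    return [canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)]


def missing_abundant_kmers(reads, contigs, k=21, min_count=5):
    """k-mers seen at least min_count times in reads but absent from contigs."""
    read_counts = Counter(km for read in reads for km in canonical_kmers(read, k))
    assembly = {km for contig in contigs for km in canonical_kmers(contig, k)}
    return {km: n for km, n in read_counts.items()
            if n >= min_count and km not in assembly}


# Two loci are sequenced, but only one made it into the assembly.
reads = ["ACGTTGCAAG"] * 6 + ["GGGGCCCCTT"] * 6
contigs = ["ACGTTGCAAG"]
missing = missing_abundant_kmers(reads, contigs, k=4, min_count=5)
print(sorted(missing.items()))
```

A large set of missing high-abundance k-mers points at sequence that is in the data but not in the assembly, i.e. a sensitivity problem rather than a sequencing one.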
Longer reads!
[Figure: diagram labeled Repeat copy 1, Repeat copy 2, Contig 1, Polymorphic contig 2, Polymorphic contig 3, Contig 4]
       Long reads can span repeats and heterozygous regions
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt

PacBio: first results (cod/salmon)
[Figure: Raw reads]
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
Cod: PacBio results
         Mapping to the published genome
[Figure: three subread alignments: 11.4 kbp, 10.6 kbp, and 10.9 kbp subreads]
          slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
Need to combine Illumina + PacBio still.
                                           P_errorCorrection pipeline from
  2.7x + 23x
  24 cpus, 4.5 days, 100 Gb RAM
  93% of reads recovered
[Figure: Alignments of at least 1kb to cod published assembly; series: Raw reads, Error-corrected reads]
slides from http://slideshare.net/flxlex/ ; Lex Nederbragt
Concluding thoughts/comments
Gene models and reference genome both need work.


This is going to be a continuing process…


Together with Wes Warren (WUSTL), Hans Cheng (USDA
  ADOL), and Jerry Dodgson (MSU), we are proposing to apply PacBio
  sequencing and digital normalization to improve the chick
  genome and regularly integrate community improvements;
  this should be a generalizable approach.

   Questions? Contact me at: ctb@msu.edu
