2013 pag-poultry-workshop

Evaluating and improving the chick
genome & transcriptome

C. Titus Brown
Asst Prof, CSE and Microbiology;
BEACON NSF STC
Michigan State University
ctb@msu.edu

Acknowledgements
This is joint work with Hans Cheng (USDA ADOL) and Jerry
Dodgson (MSU).

Likit Preeyanon (MSU) and Alexis Black Pyrkosz (ADOL)
did the work.

All of the software discussed in this talk is available.

This work was primarily supported by the USDA NIFA
through a grant to me.

Simulations show that incomplete gene reference
=> inaccurate differential expression from mRNAseq
[Figure: two panels, "Single End Reads" and "Paired End Reads".
x-axis: % Reference Completeness (0% to 100%).
y-axis: % Transcripts Expressed Inaccurately (2-fold Difference) (0% to 100%).
Curves are labeled by expression level: 100%, 75%, 50%, and 25% expression.]

Alexis Black Pyrkosz

Existing chick gene models lack exons,
isoforms

Our data

Models

*This gene contains at least 4 isoforms.
Likit Preeyanon

(Exon detection is pretty good.)

Likit Preeyanon

Different approaches to gene set prediction
yield distinct splice junction predictions

> 95% of the assembly-based splice junctions are
supported by 4 or more independent reads.
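
A minimal sketch of how a support figure like the one above could be computed, assuming each spliced read has already been reduced to a (chromosome, donor, acceptor) tuple. The function name, the input layout, and the toy data are hypothetical, and the sketch does not check that reads are independent (for example, it does not remove duplicates).

```python
from collections import Counter

def fraction_supported(junction_reads, min_reads=4):
    """Fraction of distinct junctions seen in at least `min_reads` reads.

    junction_reads: iterable of (chrom, donor, acceptor) tuples,
    one tuple per spliced read (hypothetical input layout).
    """
    support = Counter(junction_reads)
    if not support:
        return 0.0
    well_supported = sum(1 for n in support.values() if n >= min_reads)
    return well_supported / len(support)

# Toy data: three junctions with 5, 4 and 2 supporting reads.
toy_reads = (
    [("chr1", 100, 200)] * 5
    + [("chr1", 300, 400)] * 4
    + [("chr2", 50, 90)] * 2
)
print(fraction_supported(toy_reads))
```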
Likit Preeyanon

mRNAseq analysis with a combined de novo
and genome-based approach.

Likit Preeyanon

We can produce combined gene models.

Cufflinks (ref based) + de novo assembly + known mRNA

Gene Model Summary
(note: spleen mRNAseq)
Method                     Genes     Transcripts
Global Assembly            14,832    32,311
Local Assembly             15,297    23,028
Global + Local Assembly    15,934    46,797

*The number of genes and transcripts may be overestimated due to incomplete
assembly and spurious splice junctions.

Cross-validation with technical replicates
Dataset              Single-end                                  Paired-end
                     Mapped                Unmapped              Mapped                Unmapped
Line 6 uninfected    18,375,966 (77.93%)   5,203,586 (22.07%)    21,598,218 (64.16%)   12,065,659 (35.84%)
Line 6 infected      17,160,695 (73.18%)   6,288,286 (26.82%)    15,274,638 (63.89%)   8,633,855 (36.11%)
Line 7 uninfected    18,130,072 (75.77%)   5,795,737 (24.22%)    20,961,033 (63.67%)   11,960,299 (36.33%)
Line 7 infected      19,912,046 (78.51%)   5,450,521 (21.49%)    22,485,833 (65.22%)   11,992,002 (34.78%)

Single-end reads were used to generate gene models; paired-end data
were used as a technical replicate for cross-validation.
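
The percentages in the table are mapped (or unmapped) reads divided by the total for that dataset. As a quick arithmetic check, here is a small sketch using the Line 6 uninfected counts from the table; the function and variable names are only illustrative.

```python
def mapping_rate(mapped, unmapped):
    """Percentage of reads that mapped, out of mapped + unmapped."""
    return 100.0 * mapped / (mapped + unmapped)

# Counts copied from the "Line 6 uninfected" row of the table.
single_end = (18_375_966, 5_203_586)
paired_end = (21_598_218, 12_065_659)

print(f"single-end: {mapping_rate(*single_end):.2f}% mapped")
print(f"paired-end: {mapping_rate(*paired_end):.2f}% mapped")
```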

Gene Modeler Pipeline (“gimme”)
• Merge transcripts together based on transcript mapping to genome; can
  include existing gene predictions, & iteratively combine
  predictions.
• Construct gene models
• Remove redundant sequences
• Predict strands and ORFs

Likit Preeyanon
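
To make the first steps of the pipeline concrete, here is a toy sketch of merging genome-mapped transcripts into loci and dropping redundant ones. This is not the gimme implementation: the data layout, the function names, and the rule "transcripts whose exons overlap on the same chromosome belong to one locus" are assumptions made for illustration, and strand and ORF prediction are left out.

```python
def dedupe(transcripts):
    """Keep one transcript per identical (chrom, exon list) structure."""
    seen = {}
    for name, (chrom, exons) in transcripts.items():
        key = (chrom, tuple(sorted(exons)))
        seen.setdefault(key, name)  # first name seen wins
    return {name: (key[0], list(key[1])) for key, name in seen.items()}

def group_into_loci(transcripts):
    """Group transcripts that share overlapping exons (half-open coords)."""
    parent = {name: name for name in transcripts}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    exons = sorted(
        (chrom, start, end, name)
        for name, (chrom, exs) in transcripts.items()
        for start, end in exs
    )
    cur_chrom, cur_end, cur_name = None, -1, None
    for chrom, start, end, name in exons:
        if chrom == cur_chrom and start < cur_end:
            # Overlaps the current exon cluster: same locus.
            parent[find(name)] = find(cur_name)
            cur_end = max(cur_end, end)
        else:
            cur_chrom, cur_end, cur_name = chrom, end, name

    loci = {}
    for name in transcripts:
        loci.setdefault(find(name), set()).add(name)
    return sorted(loci.values(), key=sorted)

# Hypothetical transcripts: t3 duplicates t1, t4 is a separate locus.
toy = {
    "t1": ("chr1", [(100, 200), (300, 400)]),
    "t2": ("chr1", [(150, 200), (300, 450)]),
    "t3": ("chr1", [(100, 200), (300, 400)]),
    "t4": ("chr1", [(1000, 1100)]),
}
print(group_into_loci(dedupe(toy)))
```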

Next problem: chick reference!
• We like using the reference genome to scaffold RNAseq contigs;
  purely de novo RNAseq assembly is messy.
• Genomes are also useful for other things, we hear.
Problems:
• Poor sensitivity: the chick genome is missing a substantial number
  of genes from microchromosomes:
    723 genes from HSA19q missing from chicken galGal4.
    ESTs and RNAseq transcripts exist for many or most of them.
• Gaps
    9900 gaps on ordered chromosomes
    21k gaps on chr-aligned but low-confidence/unaligned
• Over-collapsed tandem duplications and under-collapsed heterozygosity

Sensitivity – where is the problem?
Are microchromosomes hard to sequence or is
microchromosomal sequence hard to assemble?

Sequences that simply don’t show up in the data are hard to
include in the assembly…
Unclonable (Sanger)
Strong GC or AT bias

Sequences with biased (generally low) coverage are often
discarded by assemblers.

Can we “even out” coverage?
(Digital normalization)

If you have two loci, or two
mRNA species, with uneven
coverage, can you remove the
extra coverage?
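
A minimal sketch of the idea behind digital normalization: stream through the reads, estimate each read's coverage as the median count of its k-mers among the reads kept so far, and keep the read only if that estimate is below a cutoff. This toy version uses an exact in-memory counter and tiny hypothetical reads; the parameter values and function names are illustrative, not those of any particular tool.

```python
from collections import Counter
from statistics import median

def kmers(seq, k):
    """All overlapping substrings of length k."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def digital_normalize(reads, k=20, cutoff=20):
    """Keep a read only while its estimated coverage is below `cutoff`."""
    counts = Counter()  # k-mer counts from the reads kept so far
    kept = []
    for read in reads:
        ks = kmers(read, k)
        if not ks:
            continue  # read shorter than k
        if median(counts[km] for km in ks) < cutoff:
            kept.append(read)
            counts.update(ks)
    return kept

# Two hypothetical loci with uneven coverage: 10x versus 2x.
locus_a = "ACGGTCATTGCA"
locus_b = "GGATCCTTAGAC"
toy_reads = [locus_a] * 10 + [locus_b] * 2

kept = digital_normalize(toy_reads, k=4, cutoff=3)
print(kept.count(locus_a), kept.count(locus_b))
```

With a cutoff of 3, the high-coverage locus is capped at three reads while the low-coverage locus keeps both of its reads, which is the "evening out" the slide asks about.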

Coverage before digital normalization:

(MD amplified)

Coverage after digital normalization:

Normalizes coverage

Discards redundancy

Eliminates majority of
errors

Scales assembly dramatically.

Assembly is 98% identical.

Prelim results from digital
normalization
Reassembled chick genome contigs from 70x Illumina ->
normalized reads in ~24 hours.
Obtained 40 Mbp of assembled contigs that were not present
in galGal4.
Contig assembly contained partial or complete matches to
70% of previously unmappable transcripts assembled from
chick spleen mRNAseq.

⇒ Bioinformatics remedies may help but are probably not
sufficient.

Likit Preeyanon

Can we improve the assembly?
Read cleaning and improvement

1. Digital normalization evens out relative
coverage, permitting recovery of difficult-to-sequence
regions in assemblies.
2. Error correction and read-to-graph
concordance editing collapse
heterozygous regions.
3. Paired-end de Bruijn graphs can be
used to include long-distance constraints
in primary contig assembly.
4. RNAseq data indicates contigs that can
be combined into scaffolds.
[Diagram label: Selection of strategies and parameters]
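
As an illustration of point 4, here is a small sketch that proposes contig links from transcript alignments: if consecutive pieces of one transcript align to two different contigs, those contigs are candidates for joining. The tuple layout, names, and toy alignments are hypothetical, and orientation and gap size are ignored.

```python
from collections import defaultdict

def scaffold_links(alignments):
    """Suggest ordered contig pairs joined by at least one transcript.

    alignments: iterable of (transcript, contig, t_start, t_end), where
    t_start/t_end are coordinates on the transcript (hypothetical layout).
    """
    by_transcript = defaultdict(list)
    for tx, contig, t_start, t_end in alignments:
        by_transcript[tx].append((t_start, t_end, contig))

    links = defaultdict(set)
    for tx, hits in by_transcript.items():
        hits.sort()  # order pieces along the transcript
        for (_, _, c1), (_, _, c2) in zip(hits, hits[1:]):
            if c1 != c2:
                links[(c1, c2)].add(tx)
    return {pair: sorted(txs) for pair, txs in links.items()}

toy_alignments = [
    ("tx1", "contigA", 0, 500),
    ("tx1", "contigB", 500, 900),
    ("tx2", "contigA", 0, 300),
    ("tx2", "contigB", 310, 700),
    ("tx3", "contigC", 0, 400),
]
print(scaffold_links(toy_alignments))
```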

Assembly assessment

1. High-abundance k-mers present in the
sequence data but missing from the
assembly indicate poor sensitivity.
2. Discordant long-insert mate pairs
indicate potentially erroneous contigs and
scaffolds.
3. De novo RNAseq assembly can identify
likely misassemblies and positively
identify missing genomic sequence.
[Diagram label: Contig assembly and/or scaffolding]
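
Point 1 can be sketched as a set comparison: count k-mers in the reads, then report those that are abundant in the reads but absent from the assembled contigs. The sketch below is a toy, in-memory version with hypothetical sequences and thresholds; k-mers are made canonical (the smaller of a k-mer and its reverse complement) so strand does not matter.

```python
from collections import Counter

_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def revcomp(seq):
    """Reverse complement of an A/C/G/T string."""
    return seq.translate(_COMPLEMENT)[::-1]

def canonical_kmers(seq, k):
    """Yield the strand-independent form of each k-mer in seq."""
    for i in range(len(seq) - k + 1):
        km = seq[i:i + k]
        yield min(km, revcomp(km))

def missing_abundant_kmers(reads, contigs, k=21, min_count=5):
    """k-mers seen >= min_count times in reads but absent from contigs."""
    read_counts = Counter(km for r in reads for km in canonical_kmers(r, k))
    assembled = {km for c in contigs for km in canonical_kmers(c, k)}
    return {
        km: n
        for km, n in read_counts.items()
        if n >= min_count and km not in assembled
    }

# Toy data: one sequence is well covered but not in the assembly.
toy_reads = ["ACGGTCAT"] * 6 + ["TTAGACCA"] * 6 + ["GGGCCCAA"]
toy_contigs = ["ACGGTCAT"]
missing = missing_abundant_kmers(toy_reads, toy_contigs, k=5, min_count=5)
print(sorted(missing))
```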

slides from http://slideshare.net/flxlex/ ; Lex Nederbragt

Longer reads!
Repeat copy 1 Repeat copy 2

Long reads can span repeats and heterozygous regions

[Figure labels: Contig 1; Polymorphic contig 2; Polymorphic contig 3; Contig 4]
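
A read only resolves a repeat or a polymorphic region if it covers the whole region plus some unique sequence on both sides. A minimal sketch of that test, with hypothetical coordinates and an assumed flank requirement of 100 bp:

```python
def spans_region(read_start, read_end, region_start, region_end, min_flank=100):
    """True if the read covers the region plus min_flank bases on each side."""
    return (
        read_start <= region_start - min_flank
        and read_end >= region_end + min_flank
    )

# Hypothetical 6 kbp repeat at positions 10,000-16,000.
repeat = (10_000, 16_000)
print(spans_region(9_500, 20_500, *repeat))   # ~11 kbp read covering the repeat
print(spans_region(12_000, 12_100, *repeat))  # short read inside the repeat
```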


PacBio: first results (cod/salmon)
Raw reads

Cod: PacBio results
Mapping to the published genome
11.4 kbp subread

10.6 kbp subread

10.9 kbp subread


Need to combine Illumina + PacBio still.
P_errorCorrection pipeline from

• 93% of reads recovered
2.7x
Alignments of at least 1kb to cod published assembly

+

Error-corrected reads
23x

+
Raw reads
24 CPUs
4.5 days
100 GB RAM

slides from http://slideshare.net/flxlex/ ; Lex Nederbragt

Concluding thoughts/comments
Gene models and reference genome both need work.

This is going to be a continuing process…

Together with Wes Warren (WUSTL), Hans Cheng (USDA
ADOL), and Jerry Dodgson (MSU), we are proposing to apply PacBio
sequencing and digital normalization to improve the chick
genome and regularly integrate community improvements;
this should be a generalizable approach.

Questions? Contact me at: ctb@msu.edu
