Next-Generation Sequencing Analysis Series
January 28, 2015
Andrew Oler, PhD
High-throughput Sequencing Bioinformatics Specialist
BCBB/OCICB/NIAID/NIH
Bioinformatics and Computational Biosciences Branch
§  Bioinformatics Software Developers
§  Computational Biologists
§  Project Managers & Analysts
http://www.niaid.nih.gov/about/organization/odoffices/omo/ocicb/Pages/bcbb.aspx
Objectives
§  Introduce common methods used to process and analyze next-generation sequencing (NGS) data
§  Learn methods for 1) mapping NGS reads and 2) de novo assembly of NGS reads
§  Give exposure to various applications of NGS experiments
Illumina
Sample DNA library
Illumina sequencing
What other platforms?
[Figure: Illumina sequencing, cycles 1 to 5]
Illumina Paired-End Library Preparation
Source: Illumina
Illumina Mate Pair Libraries
A growing list of NGS applications
High-throughput sequencing is used for:
§  RNA-Seq / miRNA-seq (noncoding, differential expression, novel splice forms, antisense)
§  Epigenetics (ChIP-seq, MNase-seq, Bisulfite-seq)
§  CNV, structural variations
§  Targeted resequencing (“exome analysis”)
§  Whole genome sequencing
§  Metagenomics (16S microbiome, environmental WGS)
§  Somatic mutations
§  Variants in Mendelian diseases
§  De novo genome assembly
What applications?
Alignment versus De Novo Assembly
Short Sequence “Reads”
Is a Reference Genome available?
Yes → Alignment to Reference
No → de novo Assembly
Not sure? Check http://www.ncbi.nlm.nih.gov/sites/genome (“Browse by organism groups”)
Short Read Alignment
CTCTGCACGCGTGGGTTCGAATCCCACCTTCGTCGA
Coordinate: chr6 27,373,801
Steps in Alignment/Mapping
1.  Get your sequence data
2.  Check quality of sequence data
3.  Choose an alignment/mapping program
4.  Run the alignment
5.  View the alignments
6.  Downstream Processing
Step 1: Get your sequence data
Public Short Read Repositories
§  NIH/NCBI
•  Sequence Read Archive, formerly Short Read Archive (http://www.ncbi.nlm.nih.gov/sra)
•  Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/)
•  1000 Genomes (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/)
§  EMBL-EBI
•  European Nucleotide Archive (http://www.ebi.ac.uk/ena/)
fastq-dump SRR036642
Understanding file formats
@F29EPBU01CZU4O
GCTCCGTCGTAAAAGGGG
+
24469:666811//..,,
@F29EPBU01D60ZF
CTCGTTCTTGATTAATGAAACATTCTTGGCAAATGCTTTCGCTCTGGTCCGTCTTGCGCCGGTCCAAGAATTTCACCTCTAGCGGCGCAATACGAATGCCCAAACACACCCAACACACCA
+
G???HHIIIIIIIIIBG555?=IIIIIIIIHHGHHIHHHIIIIIIHHHIIHHHIIIIIIIIIH99;;CBBCCEI???DEIIIIII??;;;IIGDBCEA?9944215BB@>>@A=BEIEEE
@F29EPBU01EIPCX
TTAATGATTGGAGTCTTGGAAGCTTGACTACCCTACGTTCTCCTACAAATGGACCTTGAGAGCTTGTTTGGAGGTTCTAGCAGGGGAGCGCATCTCCCCAAACACACCCAACACACCA
+
IIIIIIIIIIIIIIIIIIIIIIHHHHIIIIHHHIIIIIIIIIIIIIHHHIIIIIIIIIIIIIIIIIHHHIIIIIIIIEIIB94422=4GEEEEEIBBBBHHHFIH???CII=?AEEEE
@F29EPBU01DER7Q
TGACGTGCAAATCGGTCGTCCGACCTCGGTAT
AGGGGCGAAGACTAATCGAACCATCTAGTAGC
Sequence file formats
§  Next gen sequence file formats are based on the
commonly used
FASTA format
>sequence_ID and optional comments
ATTCCGGTGCGGTGCGGTGCTGCCGTGCCGGTGC
TTCGAAATTGGCGTCAGT
§  The Phred quality scores per base were added to form
the FASTQ format
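As an illustration of the FASTQ layout described above, here is a minimal Python sketch of a reader. It assumes every record occupies exactly four lines (header, sequence, `+`, quality), which holds for typical Illumina output but not for line-wrapped files. The function name `read_fastq` is made up for this example.

```python
import io

def read_fastq(handle):
    """Yield (read_id, sequence, quality) from a FASTQ stream with 4 lines per record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        # A valid record starts with '@' and has one quality character per base.
        if not header.startswith("@") or len(sequence) != len(quality):
            raise ValueError(f"malformed FASTQ record: {header!r}")
        yield header[1:], sequence, quality

# First record from the example file shown earlier in the slides.
example = io.StringIO("@F29EPBU01CZU4O\nGCTCCGTCGTAAAAGGGG\n+\n24469:666811//..,,\n")
records = list(read_fastq(example))
print(records)
```

For real data, the same function can be given an open file handle instead of the in-memory `io.StringIO` object.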
Sequence file formats
§  Illumina FASTQ format (FASTA format with quality values for each base)
@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA - base calls
+
BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@ - Base quality+33
Full read header description:
@<instrument-name>:<run ID>:<flowcell ID>:<lane-number>:<tile-number>:<x-pos>:<y-pos> <read number>:<is filtered>:<control number>:<barcode sequence>
The part before the space is the Read ID; the space separates the Read ID from the rest of the header.
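The header layout above can be split mechanically. The sketch below is one possible parser, assuming the header follows exactly the field order shown on the slide; the class and function names are hypothetical.

```python
from typing import NamedTuple

class IlluminaHeader(NamedTuple):
    instrument: str
    run_id: str
    flowcell_id: str
    lane: int
    tile: int
    x_pos: int
    y_pos: int
    read_number: int
    is_filtered: bool
    control_number: int
    barcode: str

def parse_illumina_header(line):
    """Split an Illumina FASTQ header into the fields listed on the slide."""
    # The space separates the read ID from the read description.
    read_id, description = line.lstrip("@").split(" ", 1)
    instrument, run_id, flowcell, lane, tile, x_pos, y_pos = read_id.split(":")
    read_number, is_filtered, control, barcode = description.split(":")
    return IlluminaHeader(
        instrument, run_id, flowcell, int(lane), int(tile), int(x_pos), int(y_pos),
        int(read_number), is_filtered == "Y", int(control), barcode,
    )

header = parse_illumina_header("@EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG")
print(header.lane, header.tile, header.barcode)
```

Headers produced by other pipeline versions may have a different number of fields, in which case the tuple unpacking raises a `ValueError`.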
FASTQ quality values
Phred-scaled quality scores normally range up to about 40.
Each score is stored as an ASCII character (http://en.wikipedia.org/wiki/ASCII) whose code is the score plus 33.
BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@
The highest base quality score in this sequence: ‘D’ = (68 - 33) = 35
If base quality Q = 35, the probability that the base call is wrong is $P = 10^{-Q/10} = 10^{-35/10} \approx 0.00032$ (about 1 in 3,200)
From http://en.wikipedia.org/wiki/FASTQ_format
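The quality arithmetic on this slide can be reproduced in a few lines of Python. This is a sketch assuming the Phred+33 encoding stated earlier ("Base quality+33"); the function names are illustrative.

```python
def decode_qualities(quality_string, offset=33):
    """Convert a FASTQ quality string into a list of Phred scores."""
    return [ord(character) - offset for character in quality_string]

def phred_to_error_probability(q):
    """Probability that a base call with Phred score q is wrong: P = 10^(-q/10)."""
    return 10 ** (-q / 10)

scores = decode_qualities("BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@")
best = max(scores)
print(best, phred_to_error_probability(best))
```

For older Illumina pipelines (1.3 to 1.7) that use Phred+64, the same function can be called with `offset=64`.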
Step 2: Check quality of sequence data
Running FastQC
Open FastQC program
Open in browser:
fastqc_report.html
Per base sequence quality
[Figure: FastQC per-base quality plot annotated with base-call error probabilities 0.0001 (Q40), 0.001 (Q30), 0.01 (Q20), and 0.05 (Q13)]
Babraham Bioinformatics http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Step 3: Choose an alignment/mapping program
Short Read Alignment Software
§ BFAST
§ BLASTN
§ BLAT
§ Bowtie
§ BWA
§ ELAND
§ GNUMAP
§ GMAP and GSNAP
§ MAQ
§ mrFAST and mrsFAST
§ MOSAIK
§ Novoalign
§ RUM
§ SHRiMP
§ SOAP
§ SpliceMap
§ SSAHA and SSAHA2
§ STAR
§ TopHat
§ ~20 more…
http://en.wikipedia.org/wiki/List_of_sequence_alignment_software
http://tinyurl.com/seqanswers-mapping
ls /usr/local/bio_apps/
Issues of Consideration for Alignment
Software
§  Library types:
•  Genomic DNA (for resequencing)
•  ChIP DNA (PCR bias)
•  RNA-seq cDNA
–  mRNA-seq (junction mapping)
–  smRNA-seq (adapter trimming)
Excerpt from an Illumina protocol for ChIP DNA libraries:
“This protocol explains how to prepare libraries of chromatin-immunoprecipitated DNA for analysis on the Illumina Cluster Station and Genome Analyzer. You will add adapter sequences onto the ends of DNA fragments to generate the following template format:”
[Figure 1: Fragments after Sample Preparation (DNA fragment flanked by adapters)]
“The adapter sequences correspond to the two surface-bound oligos on the flow cells used in the Cluster Station.”
Excerpt from an Illumina protocol for small RNA libraries:
“This protocol explains how to prepare libraries of small RNA for subsequent cDNA sequencing on the Illumina Cluster Station and Genome Analyzer. You will physically isolate small RNA, ligate the adapters necessary for use during cluster creation, and reverse-transcribe and PCR to generate the following template format:”
[Figure 1: Fragments after Sample Preparation (small RNA, adapter ligation, RT-PCR, cDNA fragment with adapters)]
“The 5’ small RNA adapter is necessary for reverse transcription and amplification of the small RNA fragment. This adapter also contains the DNA sequencing primer binding site. The 3’ small RNA adapter corresponds to …”
Source: Illumina
Issues of Consideration for Alignment
Software
§  Types of reads
•  Single-end (e.g. SRR036642.fastq)
•  Paired-end (e.g. SRR027894_1.fastq, SRR027894_2.fastq); the inner distance between read 1 and read 2 is described by its mean and standard deviation
Issues of Consideration for Alignment
Software
§  Library types, continued:
•  Multiplexed library (demultiplex)
•  Mate pair library
•  Bisulfite-converted (C->T reference genome)
Sources: Illumina; Heng Li, 2010
Issues of Consideration for Alignment
Software
§  Platform differences
•  Bases (ACTG)
•  Colorspace (2-base encoding, SOLiD)
•  Read Length
•  454 (homopolymers)
Issues of Consideration for Alignment
Software
§  Software Properties
•  Open-source or proprietary ($)
•  Accuracy
•  Speed of algorithm
•  Multi-threaded or single processor
•  RAM requirements (2GB vs 50GB for loading index)
•  Use of base quality score
•  Gapped alignment (indels)
Step 4: Run the alignment
The Command Line Terminal
A New World to Some
File Manager/Browser by Operating System
OS:             Windows    Mac OS X   Unix
File manager:   Explorer   Finder     Shell
Input Method:
Anatomy of the Terminal, “Command Line”,
or “Shell”
Prompt (computer_name:current_directory username)
Cursor
Command Argument
Window
Output
Mac: Applications -> Utilities -> Terminal
Windows: Download open source software
PuTTY http://www.chiark.greenend.org.uk/~sgtatham/putty/
Other SSH Clients (http://en.wikipedia.org/wiki/Comparison_of_SSH_clients)
Cygwin (http://www.cygwin.com/)
How to execute a command
command argument
output
output
ls (“list”)
ls       list the files, links, subdirectories, etc. in a directory
ls -a    same as “ls”, but also show the “hidden” files
ls -l    list files with details (size, timestamp, ownership, permissions)
ls -lh   use “human-readable” file sizes
*See handout for more options!*
cd (“change directory”),
mkdir (“make directory”) and viewing files
cd ~                 change to home directory
cd test_data         change to “test_data” directory
cd ..                change to the parent directory (“go up”)
cd ~/unix_hpc        change to the “unix_hpc” directory inside the home directory
mkdir dir_name       make directory “dir_name”
pwd                  “print working directory”
head junctions.bed   view the first 10 lines of “junctions.bed”
head -5 file1        view the first 5 lines of “file1”
tail lymph3K.fastq   view the last 10 lines of “lymph3K.fastq”
tail -5 file1        view the last 5 lines of “file1”
less lymph3K.fastq   view a file page by page (faster for huge files); Space to page down, Ctrl-b to page up, arrow keys also work; “/” to search, “q” to quit
Mapping ChIP-seq Reads
with Bowtie
Using ChIP-seq to analyze protein-DNA
contacts
§  Proteins called transcription factors (TFs) are involved
in regulation of gene activation
§  The first step in gene activation is binding of the TF to
its target gene.
[Figure: TF and RNA Polymerase at Gene X]
Chromatin Immunoprecipitation (ChIP)
See also: ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia.
Genome Res. 2012 Sep;22(9):1813-31
Burrows-Wheeler Transformation
§  Aligners that index the reference with the Burrows-Wheeler Transformation (BWT) gain:
•  small genome index
•  small memory footprint (RAM) during alignment
•  faster alignment
§  Good at getting very accurate alignments quickly
§  Used in BWA, Bowtie, Bowtie2
Reference Sequence → Burrows-Wheeler Transformation → Indexed Sequence
Create all rotations (cyclic permutations) of the reference sequence, then sort them
Langmead B, Trapnell C, Pop M, Salzberg SL. Genome Biol 10:R25.
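The "create all rotations, then sort" idea can be tried on a toy sequence. The sketch below is a naive illustration only: it materializes every rotation, which is far too slow and memory-hungry for a genome, and real aligners such as Bowtie and BWA build their indexes with much more efficient algorithms. The `$` end marker and the function names are assumptions of this example.

```python
def bwt(text, terminator="$"):
    """Burrows-Wheeler transform via sorted rotations (toy implementation)."""
    s = text + terminator  # the terminator sorts before A, C, G, T in ASCII
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    # The transform is the last column of the sorted rotation matrix.
    return "".join(rotation[-1] for rotation in rotations)

def inverse_bwt(last_column, terminator="$"):
    """Rebuild the original text by repeatedly prepending the last column and re-sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(ch + row for ch, row in zip(last_column, table))
    original = next(row for row in table if row.endswith(terminator))
    return original[:-1]

transformed = bwt("ACAACG")
print(transformed)               # GC$AAAC
print(inverse_bwt(transformed))  # ACAACG
```

The transform is reversible, which is why the index can stand in for the reference sequence during alignment.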
Mapping RNA-seq Reads
with TopHat
Mapping RNA-seq Reads
Steps in TopHat Alignment
Genome Biology (2013) 14:R36.
Alignment for Variant Analysis
§  Variants
•  Small-scale
–  single nucleotide variants (SNV) or single nucleotide polymorphism (SNP)
–  short insertions or deletions
–  deletion followed by insertion (indel)
•  Large-scale, structural
–  copy number variants (CNV)
–  inversions and translocations
§  Alignment software that supports gapped alignment, which is needed to detect small-scale variation
•  BWA (also uses the Burrows-Wheeler algorithm)
•  Novoalign
•  Bowtie2
•  GSNAP
•  GEM
•  mrFAST
•  MOSAIK
•  RMAP
•  rNA
•  RTG Investigator
•  Segemehl
•  SHRiMP
•  Stampy
•  SToRM
http://en.wikipedia.org/wiki/List_of_sequence_alignment_software
http://www.hgvs.org/mutnomen/recs-DNA.html
iGenomes
§  Common standard datasets for genomic analysis, organized in a standardized directory structure
•  Found in /gpfs/bio_data/iGenomes on NIAID HPC
•  Found in /fdb/igenomes on Biowulf
§  Files have additional formatting required by TopHat, Cufflinks
§  Maintained by Illumina, hosted on TopHat / Cufflinks website
•  http://tophat.cbcb.umd.edu/igenomes.html
§  Approximately 500 GB for all species together
§  Genomes available:
Arabidopsis_thaliana
Bacillus_cereus_ATCC_10987
Bacillus_subtilis_168
Bos_taurus
Caenorhabditis_elegans
Canis_familiaris
Drosophila_melanogaster
Enterobacteriophage_lamdba
Equus_caballus
Escherichia_coli_K_12_DH10B
Escherichia_coli_K_12_MG1655
Gallus_gallus
Glycine_max
Homo_sapiens
Macaca_mulatta
Mus_musculus
Mycobacterium_tuberculosis_H37RV
Oryza_sativa_japonica
Pan_troglodytes
PhiX
Pseudomonas_aeruginosa_PAO1
Rattus_norvegicus
Rhodobacter_sphaeroides_2.4.1
Saccharomyces_cerevisiae
Schizosaccharomyces_pombe
Sorangium_cellulosum_So_ce_56
Sorghum_bicolor
Staphylococcus_aureus_NCTC_8325
Sus_scrofa
Zea_mays
iGenomes Directory Structure
[Species]
  Ensembl | NCBI | UCSC
    hg18 | … | hg19
      Annotation (GTF, other formats)
        Genes | SmallRNA | Variation
      Sequence
        BWAIndex | BowtieIndex (pre-built indexes)
        Chromosomes | WholeGenomeFasta | AbundantSequences (FASTA files)
      GenomeStudio
Examples:
iGenomes/Homo_sapiens/UCSC/hg19/Sequence/BowtieIndex/genome
iGenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa
iGenomes/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf
Are you still awake?
Mapping Demo with Bowtie and BWA
§  SRR036642 from SRA
•  ChIP-seq
•  Map using Bowtie
§  SRR062634 from SRA
•  Human Resequencing data
•  Map using BWA
No HPC available to you? Free, Alternative Ways to Map NGS Reads
§  Galaxy
•  Web-based analysis workflow interface
•  https://main.g2.bx.psu.edu/
•  Emphasis on NGS tools
•  Includes Bowtie, BWA, TopHat
§  KBase
•  Web-based command-line interface
•  http://kbase.science.energy.gov/
•  Includes Bowtie, BWA
§  Disadvantages of online tools:
•  Uploading data to the servers takes a long time
•  Disk space limitations
•  Limited customization of analysis workflow
Step 5: View the alignments
Most commonly used alignment file formats
§  SAM (Sequence Alignment/Map)
Unified format for storing alignments to a reference genome
§  BAM (binary version of SAM)
Compressed SAM file, normally indexed
SAM/BAM fields:
QNAME FLAG RNAME POS MAPQ CIGAR MRNM MPOS TLEN SEQ QUAL OPT
(MRNM and MPOS are named RNEXT and PNEXT in the current SAM specification.)
http://samtools.sourceforge.net/samtools.shtml
http://samtools.sourceforge.net/SAM1.pdf
http://picard.sourceforge.net/explain-flags.html
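To make the SAM columns concrete, here is a small Python sketch that splits one tab-delimited alignment line into the eleven mandatory fields and decodes the bitwise FLAG. The column names follow the current SAM specification (RNEXT and PNEXT correspond to MRNM and MPOS on the slide). The example alignment line is invented for illustration, and the function names are hypothetical.

```python
SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]

FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}

def parse_sam_line(line):
    """Return a dict of the 11 mandatory SAM fields plus a list of optional tags."""
    columns = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_FIELDS, columns[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["OPT"] = columns[11:]
    return record

def explain_flag(flag):
    """List the names of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]

# Invented example alignment (not taken from the demo BAM files).
line = "\t".join(["read1", "99", "chr6", "27373801", "60", "36M", "=", "27373950", "185",
                  "CTCTGCACGCGTGGGTTCGAATCCCACCTTCGTCGA", "I" * 36, "NM:i:0"])
record = parse_sam_line(line)
print(record["RNAME"], record["POS"], explain_flag(record["FLAG"]))
```

The Picard "explain flags" page linked above performs the same FLAG decoding interactively.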
Picard Tools
AddOrReplaceReadGroups.jar
BamIndexStats.jar
BamToBfq.jar
BuildBamIndex.jar
CalculateHsMetrics.jar
CleanSam.jar
CollectAlignmentSummaryMetrics.jar
CollectCDnaMetrics.jar
CollectGcBiasMetrics.jar
CollectInsertSizeMetrics.jar
CollectMultipleMetrics.jar
CompareSAMs.jar
CreateSequenceDictionary.jar
EstimateLibraryComplexity.jar
ExtractIlluminaBarcodes.jar
ExtractSequences.jar
FastqToSam.jar
FixMateInformation.jar
IlluminaBasecallsToSam.jar
MarkDuplicates.jar
MeanQualityByCycle.jar
MergeBamAlignment.jar
MergeSamFiles.jar
NormalizeFasta.jar
picard-1.45.jar
QualityScoreDistribution.jar
ReorderSam.jar
ReplaceSamHeader.jar
RevertSam.jar
sam-1.45.jar
SamFormatConverter.jar
SamToFastq.jar
SortSam.jar
ValidateSamFile.jar
ViewSam.jar
http://broadinstitute.github.io/picard/
java -jar QualityScoreDistribution.jar I=file.bam CHART=file.pdf
/usr/local/bio_apps/java/bin/java -jar /usr/local/bio_apps/picard-tools/CollectMultipleMetrics.jar …
Visualization of output in Integrative Genomics Viewer (IGV)
§  IGV download
•  http://www.broadinstitute.org/igv/projects/current/igv_mm.jnlp (Windows 1.2GB)
•  http://www.broadinstitute.org/igv/projects/current/igv_lm.jnlp (Mac 2GB)
§  Open IGV by double-clicking. Load data by selecting File → Load from URL and entering the following links.
§  Links to BAM files
•  ChIP-seq:
–  https://dl.dropbox.com/u/12821862/SRR036642.bam
•  DNA-seq:
–  https://dl.dropbox.com/u/30379708/SRR062634.sorted.bam
•  RNA-seq:
–  https://dl.dropbox.com/u/30379708/Upenn/lymph_accepted_hits.bam
–  https://dl.dropbox.com/u/30379708/Upenn/wbc_accepted_hits.bam
§  Examples:
•  chr6:26,224,647-26,402,373
•  rs1205023, chr6:3,582,094-3,582,266
•  AIF1
•  LST1
Step 6: Downstream Processing
Downstream Processing
§  Finding peaks (ChIP-seq)
§  Annotating peaks to genes (ChIP-seq)
§  Assembling transcripts (RNA-seq)
§  Annotating transcripts to genes (RNA-seq)
§  Etc.
Park, Nat Rev Genet, 2009
http://grimmond.imb.uq.edu.au/mammalian_transcriptome.html
Examples of using different mapping
strategies for NGS
ChIP-seq and Differential Expression RNA-seq
§  The transcription factor T-bet is induced by
multiple pathways and prevents an endogenous
Th2 cell program during Th1 cell responses.
Immunity. 2012 Oct 19;37(4):660-73. doi: 10.1016/j.immuni.2012.09.007. Epub
2012 Oct 4. Zhu J, Jankovic D, Oler AJ, Wei G, Sharma S, Hu G, Guo L, Yagi R,
Yamane H, Punkosdy G, Feigenbaum L, Zhao K, Paul WE.
§  ChIP-seq Methods
•  Mapping: Bowtie
•  Peaks: MACS
§  RNA-seq Methods
•  Mapping: TopHat
•  Expression: USeq
Resequencing/Variant Analysis
§  Whole genome sequencing of peach (Prunus
persica L.) for SNP identification and selection.
BMC Genomics. 2011 Nov 22;12:569. doi: 10.1186/1471-2164-12-569. Ahmad R,
Parfitt DE, Fass J, Ogundiwin E, Dhingra A, Gradziel TM, Lin D, Joshi NA,
Martinez-Garcia PJ, Crisosto CH.
§  Methods
•  Mapping: BWA
•  SNP calling: SAMtools
Image source: http://www.themoneytimes.com/files/peach.jpg?1270231192
RNA-seq alternative splicing
§  RNA-Seq analysis of the parietal cortex in
Alzheimer's disease reveals alternatively spliced
isoforms related to lipid metabolism. Neurosci Lett. 2013
Mar 1;536:90-5. doi: 10.1016/j.neulet.2012.12.042. Epub 2013 Jan 7. Mills JD,
Nalpathamkalam T, Jacobs HI, Janitz C, Merico D, Hu P, Janitz M.
§  Methods
•  Mapping: TopHat
•  Splicing: Cufflinks, Cuffdiff
Stranded RNA-seq
§  Directional gene expression and antisense
transcripts in sexual and asexual stages of
Plasmodium falciparum. BMC Genomics. 2011 Nov 30;12:587.
doi: 10.1186/1471-2164-12-587. López-Barragán MJ, Lemieux J, Quiñones M,
Williamson KC, Molina-Cruz A, Cui K, Barillas-Mury C, Zhao K, Su XZ.
§  Methods
•  Mapping: TopHat
•  Expression: Cufflinks
Genome-wide Bisulfite Sequencing
§  Whole-genome bisulfite DNA sequencing of a
DNMT3B mutant patient. Epigenetics. 2012 Jun 1;7(6):542-50.
doi: 10.4161/epi.20523. Epub 2012 Jun 1. Heyn H, Vidal E, Sayols S,
Sanchez-Mut JV, Moran S, Medina I, Sandoval J, Simó-Riudalbas L,
Szczesna K, Huertas D, Gatto S, Matarazzo MR, Dopazo J, Esteller M.
§  Methods:
•  Mapping and Bisulfite analysis: BSMAP
Cross-linking immunoprecipitation
sequencing (CLIP-seq)
§  LIN28 binds messenger RNAs at GGAGA motifs
and regulates splicing factor abundance. Mol Cell. 2012
Oct 26;48(2):195-206. doi: 10.1016/j.molcel.2012.08.004. Epub 2012 Sep 6.
Wilbert ML, Huelga SC, Kapeli K, Stark TJ, Liang TY, Chen SX, Yan BY,
Nathanson JL, Hutt KR, Lovci MT, Kazan H, Vu AQ, Massirer KB, Morris Q,
Hoon S, Yeo GW.
§  Methods
•  Mapping: Bowtie
•  Peaks: Custom scripts
Ribosome Profiling Sequencing (Ribo-seq)
§  Genome-wide ribosome profiling reveals complex
translational regulation in response to oxidative
stress. Proc Natl Acad Sci U S A. 2012 Oct 23;109(43):17394-9. doi:
10.1073/pnas.1120799109. Epub 2012 Oct 8. Gerashchenko MV, Lobanov AV,
Gladyshev VN.
§  Methods
•  Mapping: Bowtie
•  Translation Efficiency: Custom Perl scripts
Chromosome Conformation Capture
Sequencing (4C)
§  Multiplexed chromosome conformation capture
sequencing for rapid genome-scale high-resolution
detection of long-range chromatin interactions. Nat Protoc.
2013 Feb 14;8(3):509-24. doi: 10.1038/nprot.2013.018. Epub 2013 Feb 14. Stadhouders R,
Kolovos P, Brouwer R, Zuin J, van den Heuvel A, Kockx C, Palstra RJ, Wendt KS,
Grosveld F, van Ijcken W, Soler E.
§  Methods
•  Mapping: Bowtie via NARWHAL
•  Post-alignment: BED Tools
DNase I Hypersensitivity (DNase-seq)
§  Modeling gene expression using chromatin
features in various cellular contexts. Genome Biol. 2012
Jun 13;13(9):R53. doi: 10.1186/gb-2012-13-9-r53. Dong X, Greven MC,
Kundaje A, Djebali S, Brown JB, Cheng C, Gingeras TR, Gerstein M, Guigó R,
Birney E, Weng Z.
§  Methods
•  Mapping: Maq
•  Peak calling: F-Seq
http://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=328941501&g=wgEncodeChromatinMap&hgTracksConfigPage=configure
16S rRNA Microbiome Sequencing
§  Reducing the effects of PCR amplification and
sequencing artifacts on 16S rRNA-based studies.
PLoS One. 2011;6(12):e27310. doi: 10.1371/journal.pone.0027310. Epub 2011
Dec 14. Schloss PD, Gevers D, Westcott SL.
§  Methods
•  Taxonomic Assignment: Mothur classify.seqs
•  Alignment: Mothur align.seqs
Polyploid Genome Re-sequencing
§  PolyCat: A Resource for Genome Categorization
of Sequencing Reads From Allopolyploid
Organisms. G3 (Bethesda). 2013 Mar;3(3):517-25.
doi: 10.1534/g3.112.005298. Epub 2013 Mar 1.
Page JT, Gingle AR, Udall JA.
§  Methods
•  Mapping: GSNAP
•  Homoeo-SNP calling: PolyCat
Additional Resources
§  Commercial Software for NGS Analysis (No Command Line!)
•  Partek Genomics Suite
–  http://www.partek.com/?q=partekgs
•  CLC bio Genomics Workbench
–  http://www.clcbio.com/products/clc-genomics-workbench/
Next-Generation Sequencing Analysis Series
Andrew Oler, PhD
High-throughput Sequencing Bioinformatics Specialist
BCBB/OCICB/NIAID/NIH
Alignment versus De Novo Assembly
Short Sequence “Reads”
Is a Reference Genome available?
Yes → Alignment to Reference
No → de novo Assembly
Not sure? Check http://www.ncbi.nlm.nih.gov/sites/genome (“Browse by organism groups”)
General strategy of assembling a genome de novo
1.  Pre-process short reads (trim, quality filter…)
2.  Assemble sequences into contigs
3.  Order contigs into scaffolds
4.  Annotate genome
Basic Preprocessing
Tools for evaluating quality
•  PrinSeq (web and command line) - http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi
•  FastQC (stand-alone and command line) - http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Tools for trimming reads and removing adaptors
•  Btrim - http://www.ncbi.nlm.nih.gov/pubmed/21651976
–  Trims off adapters, barcodes and/or low quality regions from single or paired-end reads
•  Cutadapt - http://code.google.com/p/cutadapt/
–  Provides many options of trimming
–  Accepts fasta, fastq and csfasta/qual
–  Read pairs need to be put back in order after trimming; this can be done with the cmpfastq script
§  http://compbio.brc.iop.kcl.ac.uk/software/download/cmpfastq
Assembly of Sequences
§  Algorithms
1.  Greedy
2.  Overlap-layout-consensus (OLC)
3.  De Bruijn Graph
Schatz M C et al. Genome Res. 2010;20:1165-1173
Greedy
Used in the earliest next-generation assemblers (e.g. SSAKE, VCAKE)
1- Starting from the highest-scoring overlap, the contig is extended with the read that has the next highest-scoring overlap
2- Paired-end reads are used to generate supercontigs
3- Mate pairs can also be used to determine contig order
* Repeats can cause big problems in this approach
Imperfect Overlap Between Reads Can Lead to
Incorrect Assembly in the Greedy Approach
Brief Bioinform. 2009 July; 10(4): 354–366.
Correct!
Incorrect
Imperfect overlap
Greedy Extension Leads to Arrested
Assembly if Multiple Matches are Found
Existing Contig
Two Unassembled Reads that Match Contig
Can’t Resolve, so Assembly Stops
Overlap Graph or Overlap-layout-consensus (OLC)
•  Performs better overall (compared with greedy assemblers)
•  Overlaps are found by all-against-all comparison of reads with a seed-and-extend algorithm that uses k-mers as seeds
•  Good for long reads (e.g. Sanger or other >100 bp reads, such as 454, Ion Torrent, PacBio) because of the minimum overlap threshold
•  Examples: CABOG (Celera), ARACHNE
•  Newbler, developed for 454, is based on OLC and is now also used for Ion Torrent
De Bruijn Graph
•  Reads are broken into successive k-mers, and the graph is built from those k-mers
•  Each k-mer is a node, and edges connect consecutive k-mers within a read
•  Repeat sequences create a fork in the graph; alternative sequences create a bubble
•  The k-mer size can only be determined by “trial and error”
•  A small value of k creates a complex graph, but a large value of k may miss small overlaps. A good starting point is a k-mer size that is 2/3 of the read length
•  Good for short reads or small genomes. With long reads and/or large genomes, it may require lots of RAM (e.g., ~0.5 TB for human)
•  Examples: Velvet, SOAPdenovo, ALLPATHS-LG, ABySS
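The graph construction described on this slide can be sketched on toy reads. The example below follows the slide's formulation (each k-mer is a node, edges join consecutive k-mers in a read) and then walks the graph while the path is unambiguous. It is a simplified illustration that ignores reverse complements, sequencing errors, and in-degree checks, so it is not how Velvet, SOAPdenovo, or ABySS are actually implemented; the reads and function names are made up.

```python
from collections import defaultdict

def build_de_bruijn_graph(reads, k):
    """Map each k-mer to the set of k-mers that directly follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
    return graph

def walk_unambiguous_path(graph, start):
    """Extend a contig from `start` while each node has exactly one successor."""
    contig = start
    node = start
    visited = {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in visited:  # stop on cycles, e.g. tandem repeats
            break
        visited.add(node)
        contig += node[-1]  # consecutive k-mers overlap by k-1 bases
    return contig

reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn_graph(reads, k=4)
contig = walk_unambiguous_path(graph, "ATGG")
print(contig)  # ATGGCGTGCAAT
```

A repeat would give some node two successors (a fork), at which point this simple walk stops, mirroring the ambiguity described in the bullets above.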
Evaluating the assembly
§  Genome assembly results:
•  contig size and number of contigs produced
•  scaffold size and number
•  N50 and N90
§  Coverage
§  GC Content
§  Genome annotation
•  repeats analysis and annotation
•  protein-coding gene annotation (including gene structure prediction and gene function annotation)
•  non-coding RNA gene annotation (including annotation of microRNA, tRNA, rRNA, and other ncRNA)
•  transposon and tandem repeats annotation
§  Comparative genomics and evolution (chromosome structure, conserved gene families)
Evaluating the assembly
Basic statistics
N50 = the length of the shortest contig such that the sum of contigs of equal length or longer is at least 50% of the total length of all contigs.
N90 = the length of the shortest contig such that the sum of contigs of equal length or longer is at least 90% of the total length of all contigs.
Contig size (bp)
3000
2000  N50 (3000 + 2000 = 5000, which is at least 50% of 8000)
1200
800
600   N90 (3000 + 2000 + 1200 + 800 + 600 = 7600, which is at least 90% of 8000)
400
Total: 8000
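Applying the N50 and N90 definitions to the six contig sizes listed on this slide, a minimal Python sketch (the function name `n_statistic` is illustrative):

```python
def n_statistic(contig_lengths, fraction):
    """Length of the shortest contig in the smallest set of longest contigs
    that together cover at least `fraction` of the total assembly length."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    raise ValueError("contig_lengths must not be empty")

contigs = [3000, 2000, 1200, 800, 600, 400]
n50 = n_statistic(contigs, 0.5)
n90 = n_statistic(contigs, 0.9)
print(n50, n90)  # 2000 600
```

The same function gives N25 or N75 by passing 0.25 or 0.75 as the fraction.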
To Determine Optimal kmer Size, Try Many
[Figure: Effect of kmer Length on Contig Length in ABySS. x-axis: kmer (bp), 20 to 64; y-axis: Contigs (bp), 0 to 1000; series: ABySS N25, ABySS N50, ABySS N75]
*This will vary based on dataset (genome, read length, etc.)
*Good starting point is 2/3 of the read length.
Example of de novo genome assembly
from start to finish: Giant Panda
§ “The sequence and de novo assembly of the giant panda genome.” Nature. 2010 Jan 21;463(7279):311-7. doi: 10.1038/nature08696. Epub 2009 Dec 13.
Panda Genome Karyotype
Sequencing Strategies for De novo Assembly
Complex genome (if any condition met)
•  GC content: < 35% or > 65%
•  Repeat content: >50%
•  Heterozygous diploid or polyploid
•  Heterozygosity rate > 0.5%
Flowchart of the panda genome assembly
Supplementary Methods for Details about
Panda Genome Assembly
§  Illumina GA Platform, 35-71 bp paired-end reads
§  “In total, we generated 176-Gb of usable sequence (equal to 73-fold coverage of the whole genome), with an average read length of 52 bp”
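The quoted figures are related by simple arithmetic: fold coverage is the total number of sequenced bases divided by the genome size. The sketch below back-calculates the genome size and read count implied by the quote; these derived numbers are arithmetic consequences of the quoted values, not figures stated on the slide, and the function names are illustrative.

```python
def fold_coverage(total_bases, genome_size):
    """Average number of times each genome position is covered by a read."""
    return total_bases / genome_size

def implied_genome_size(total_bases, coverage):
    """Genome size implied by a sequencing yield and a stated fold coverage."""
    return total_bases / coverage

total_bases = 176e9        # 176 Gb of usable sequence (from the quote)
stated_coverage = 73       # 73-fold coverage (from the quote)
mean_read_length = 52      # average read length in bp (from the quote)

genome_size = implied_genome_size(total_bases, stated_coverage)
number_of_reads = total_bases / mean_read_length
print(f"implied genome size: {genome_size / 1e9:.2f} Gb")
print(f"implied number of reads: {number_of_reads / 1e9:.2f} billion")
```

The same relationship can be used in the other direction when planning an experiment: required yield equals target coverage times genome size.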
Summary of Sequencing Reads
Sequencing Error Correction and Filtering
§  “The quality requirements for de novo sequencing is far higher than
for re-sequencing, because sequencing errors can create difficulties
for the short-read assembly algorithm. We therefore carried out a
stringent filtering process.”
§  Remove reads that contain only/mostly adapter.
•  How would you do that?
§  Exclude datasets/lanes with too much low-quality sequence.
§  Trim at the 3’ end to remove low-quality bases.
§  Remove duplicate base call reads.
§  Remove reads with a significant excess of “N” and low-quality bases.
•  How would you do that?
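One possible answer to the two "How would you do that?" prompts, sketched in Python. All thresholds here are arbitrary placeholders, not the values used for the panda genome, and the adapter check is a plain exact-match search; dedicated tools such as Cutadapt or Btrim tolerate mismatches and are the better choice in practice. The adapter sequence used in the example is an assumption.

```python
def is_mostly_adapter(sequence, adapter, min_insert=10):
    """True if the adapter starts so early that fewer than `min_insert` bases of insert remain."""
    position = sequence.find(adapter[:12])  # exact match on the adapter's first 12 bases
    return position != -1 and position < min_insert

def passes_filter(sequence, quality, max_n_fraction=0.1, min_quality=20,
                  max_low_quality_fraction=0.5, offset=33):
    """Reject reads with too many N calls or too many low-quality bases (Phred+33 assumed)."""
    if not sequence:
        return False
    n_fraction = sequence.upper().count("N") / len(sequence)
    low_quality = sum(1 for ch in quality if ord(ch) - offset < min_quality)
    return (n_fraction <= max_n_fraction
            and low_quality / len(quality) <= max_low_quality_fraction)

ADAPTER = "AGATCGGAAGAGC"  # assumed adapter prefix, for illustration only

print(is_mostly_adapter("ACGT" + ADAPTER + "TTTT", ADAPTER))  # True
print(passes_filter("ACGTACGTAC", "IIIIIIIIII"))               # True
print(passes_filter("ACGTNNNNNN", "IIIIIIIIII"))               # False
```

Reads that fail either check would be dropped before assembly; for paired-end data both mates are usually removed together so the files stay in sync.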
Sequencing Error Correction and Filtering
§  Error correction by K-mer frequency: “Prior to assembly, the
sequence errors were corrected based on K-mer frequency
information. For the panda genome assembly, we chose K=17
bp, and corrected sequencing errors for the 17-mers with a
frequency lower than 4. In summary, we corrected 8.4% of the
reads and 0.2% of the bases. The total, the number of distinct
27-mers (we used 27-mer in graph construction and assembly)
was reduced from 8.62 billion to 2.69 billion (3.2 times smaller)
through this error correction step.”
§  Internal to SOAPdenovo
§  Quake or ALLPATHS-LG error corrector can be used
as standalone methods to do this
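The idea behind k-mer frequency error correction is that a k-mer containing a sequencing error is seen only a few times, while true genomic k-mers are seen roughly as often as the coverage. The toy sketch below only identifies the suspicious low-frequency k-mers and does not attempt the correction itself. It uses k = 4 so the example stays readable (the panda assembly used K = 17 with a frequency cutoff of 4), ignores reverse complements, and uses made-up reads and function names.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def low_frequency_kmers(counts, min_count):
    """K-mers seen fewer than `min_count` times, which likely contain errors."""
    return {kmer for kmer, n in counts.items() if n < min_count}

# Five identical reads plus one read with a single substitution (A -> T at position 5).
reads = ["ACGTACGT"] * 5 + ["ACGTTCGT"]
counts = count_kmers(reads, k=4)
suspicious = low_frequency_kmers(counts, min_count=4)
print(sorted(suspicious))  # ['CGTT', 'GTTC', 'TCGT', 'TTCG']
```

Every k-mer that overlaps the erroneous base falls below the cutoff, which is what lets a corrector locate the error within the read.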
http://soap.genomics.org.cn/down/soapdenovo.pdf
A. Create Graph
Kmer = 27
Strategy of SOAPdenovo
http://1.usa.gov/oTUrWC
B. Simplify the graph by removing errors
72 million 2.6 million
(Keep Contigs >100bp)
N50: 1483
N90: 224
Strategy of SOAPdenovo
C. Realign reads into contigs and use paired end
information to create scaffolds
•  Require at least 3 consistent pairs to make a
connection
•  Start with small inserts, progressively add larger insert
libraries
Strategy of SOAPdenovo
Scaffolding Statistics
“In principle, the scaffold size could have been further improved by using
even more distant insert-sized paired-end data, such as fosmid ends (~35
Kb) and BAC ends (100~150 Kb).”
§  D. Close Gaps using Paired-end reads
•  Mainly repeats (masked during scaffold construction)
•  Local assembly of the reads that align to the gap
•  If unknown copy number, fill with Ns
•  97% of gaps filled
•  Increased coverage from 84.2% to 93.6%
Strategy of SOAPdenovo
Are you still awake?
SOAPdenovo Demo
§  Xpr1 variants in mouse
§  “Endogenous gammaretrovirus acquisition in Mus
musculus subspecies carrying functional variants of
the XPR1 virus receptor.”
•  J Virol. 2013 Sep;87(17):9845-55
§  In IGV, go to Mouse, mm9
§  “Load from URL” :
•  https://dl.dropbox.com/u/30379708/H12.bam
•  https://dl.dropbox.com/u/30379708/H15.bam
§  Go to chr1:157,136,824-157,137,961 in browser
SOAPdenovo2
§  Updates
•  Reduced memory consumption in graph
construction
•  Resolves more repeat regions in contig assembly
•  Increased coverage and length in scaffold
construction
•  Improved gap closing
•  Optimization for large genome
§  Luo et al.: SOAPdenovo2: an empirically improved
memory-efficient short-read de novo assembler.
GigaScience 2012 1:18.
Annotation of Assembly: Repeats
§  RepeatMasker
•  http://www.binfo.ncku.edu.tw/RM/webrepeatmaskerhelp.html
•  Known repeats
–  Uses RepBase database of known repeats
•  Low complexity repeats, satellites, etc.
–  “100 bp stretch of DNA is masked when it is >87% AT or
>89% GC, a 30 bp stretch has to contain 29 A/T (or GC)
nucleotides”
Annotation of Assembly: Gene Structure
and Function
§  Known Genes
•  Mapped human and dog genes to panda assembly
•  ~20,000 genes
§  Novel Gene Prediction
•  Genscan
•  Augustus
•  Required at least 3 exons, and at least 30% of the
translated sequence should align to SwissProt
§  Gene Function
•  Predict Domains (InterPro)
•  Functional Gene Ontology
•  ncRNAs (e.g., tRNAs, etc.) using INFERNAL
•  Pathways using KEGG
Assessment of Assembly: Coverage and
Annotation
Choosing a de novo Assembler
§  Assemblathon 1
•  Genome Res. 2011 21: 2224-2241
§  Genome Assembly Gold-standard Evaluations (GAGE)
•  Genome Res. 2012 22: 557-567
•  http://gage.cbcb.umd.edu/results/index.html
Assemblathon 1
•  BROAD (ALLPATHS-LG) and BGI (SOAPdenovo) performed best overall.
GAGE
§  Multiple genomes
•  Human chr14 (88 Mb)
•  S. aureus (2.9 Mb)
•  R. sphaeroides (4.6 Mb)
•  B. impatiens (~250 Mb)
§  Used Quake and ALLPATHS-LG for error correction
for all datasets prior to assembly (chose the best for
final report)
§  Compared assembly to known reference to determine
how many errors, etc.
GAGE Results
•  Corrected N50 is most instructive
GAGE Results
Annotations on the results table:
•  Unnecessary duplication/compression (goal: 100%)
•  Small contigs (goal: 0%)
•  Reference bases missing (goal: 0%)
•  Sequence not in reference (goal: 0%)
•  Human is more difficult
GAGE Results
GAGE Results
Misjoins in the assembly are visible in dot-plot graphs
GAGE Summary
•  N50 is average of the three genomes with a known reference
•  Vertical axis is distance between errors
•  “Best” is top right area of graph
GAGE Conclusions
•  “ALLPATHS-LG demonstrated consistently strong performance based on
contig and scaffold size, with the best trade-off between size and error rate”
•  “Considering all metrics, and with the caveat that it requires a precise recipe
of input libraries, ALLPATHS-LG appears to be the most consistently
performing assembler, both in terms of contiguity and correctness.”
•  “SOAPdenovo produced results that initially seemed superior to most
assemblers, but on closer inspection it generated many misassemblies that
would be impossible to detect without access to a reference genome.”
•  “Despite its poor performance on human, SOAPdenovo performed very
well on the bacteria, creating contigs that were eight times larger than it built
on the human data.”
•  “Velvet had a particularly high error rate for its scaffolds, creating many
more inversions and translocations than any other algorithm.”
•  “Finally, we should note that all of the assemblers considered here are
under constant development, and many will be improved by the time this
analysis appears.”
106
Some Strategies for Refining an Assembly
§  Deeper coverage: the shorter the reads, the deeper
the coverage needed to produce long contigs
§  Mix of short and long read sizes
§  Combinatorial approach
•  e.g., assemble short reads with a de Bruijn graph
assembler (e.g., Velvet), then treat the contigs as long
reads in an OLC (overlap-layout-consensus) assembler (e.g., CABOG)
§  Comparative assembly (using a reference sequence
to assist)
§  Libraries with a variety of insert sizes and mate pair
libraries to scaffold contigs into supercontigs
107
Schatz M C et al. Genome Res. 2010;20:1165-1173
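For the "deeper coverage" strategy, a quick way to reason about depth is the standard relation coverage = (number of reads x read length) / genome size. The Python sketch below applies it. The genome size is the S. aureus figure from the GAGE slide; the 100 bp read length and 50x target are hypothetical. The uncovered-fraction estimate assumes reads land uniformly at random (an idealized Poisson model), which real libraries only approximate.

```python
import math


def expected_coverage(num_reads, read_length, genome_size):
    """Average depth of coverage: total sequenced bases / genome size."""
    return num_reads * read_length / genome_size


def reads_needed(target_coverage, read_length, genome_size):
    """Smallest number of reads giving at least the target coverage."""
    return math.ceil(target_coverage * genome_size / read_length)


def fraction_uncovered(coverage):
    """Expected fraction of bases hit by zero reads, assuming reads
    are placed uniformly at random (Poisson model)."""
    return math.exp(-coverage)


# Genome size from the GAGE slide (S. aureus, 2.9 Mb); read length and
# target depth are hypothetical values chosen for illustration.
genome = 2_900_000
n = reads_needed(50, 100, genome)
print(n, expected_coverage(n, 100, genome), fraction_uncovered(50))
```

The same functions show the trade-off on the slide: halving the read length doubles the number of reads needed for the same depth.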
Thank You
Questions or Comments please contact:
andrew.oler@nih.gov
ScienceApps@niaid.nih.gov
108
Bowtie Command-line
bowtie [options]* <ebwt> {-1 <m1> -2 <m2> | --12 <r> | <s>} [<hit>]
e.g., Paired-end
bowtie hg19 -1 SRR027894_1.fastq -2 SRR027894_2.fastq
e.g., Single-end
bowtie hg19 SRR036642.fastq
bowtie hg19 SRR036642.fastq,SRR036643.fastq
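In the usage line (choose one of the three input forms):
•  <ebwt>: index name (genome)
•  -1 <m1> -2 <m2>: paired-end reads
•  --12 <r>: tab-delimited reads file (uncommon)
•  <s>: single-end reads
•  [<hit>]: output file (optional)
http://bowtie-bio.sourceforge.net/manual.shtml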
Bowtie Command-line Options
To get options, type:
/usr/local/bio_apps/bowtie/bowtie
--solexa1.3-quals Use for Illumina pipeline 1.3-1.7 quality scores (phred+64) (omit for Illumina 1.8)
-p <int> Number of threads/processors (default: 1)
Alignment:
-v <int> Number of mismatches allowed in sequence
OR
-n <int> Number of mismatches allowed in “seed” portion (first part of read) (default: 2)
-l <int> Length of seed (default: 28bp)
-e <int> Maximum sum of scores of all mismatched bases (default: 70)
Reporting Reads:
-k <int> Number of alignments to report (default: 1)
-a Report all alignments (disables -k; default: off)
-m <int> Skip read if more than this many alignments (default: no limit)
-M <int> Like -m but reports one random alignment instead of skipping (default: no limit)
--best Order in best-to-worst quality alignment (i.e., fewest mismatches first)
--strata Only consider those alignments with the fewest mismatches
Output:
-t Print out time at each step (to terminal)
-S Output in SAM format
--un <file> Save unaligned reads to a file (give it a name)
--max <file> Save reads with more alignments than -m to a file (i.e. repeats; give it a name)
110
http://bowtie-bio.sourceforge.net/manual.shtml
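The --solexa1.3-quals option matters because the same quality character means different scores under the two encodings. This Python sketch decodes a quality string with either offset and re-encodes Phred+64 as Phred+33. The function names and the example characters are illustrative only and are not part of Bowtie.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def to_phred33(quality_string_phred64):
    """Re-encode a Phred+64 quality string as Phred+33."""
    return "".join(chr(ord(ch) - 64 + 33) for ch in quality_string_phred64)


# 'h' is Q40 under Phred+64 but would decode to an implausible Q71
# if the file were wrongly treated as Phred+33.
print(phred_scores("h", offset=64), phred_scores("h", offset=33))
print(to_phred33("hhB"))
```

If decoded scores run well above 40, the file is probably Phred+64 and the option is needed.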
Bowtie n mode versus v mode
Reference sequence:
ATTGTGCTCTGCACGCGTGGGTTCGAATCCCACCTTCGTCGACCGTTT
Read sequence:
CTCTGCACGTGTGGGTTCGAGTCCCACCTTCGTTTG
Read quality string:
FHHHHIGHHFHIFFFGHGCD/DBA>=@?A980/*-)
Quality at mismatched bases: 37 + 14 + 9 + 8 = 68
In v mode (e.g., -v 2 commonly used): REJECT (because >2 mismatches)
In n mode (default -n 2 -e 70 -l 28): KEEP (because sum of mismatch qualities <=70)
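The two acceptance rules can be sketched in Python as follows. This is a simplified model, not Bowtie's implementation: it assumes the read is already placed against a reference window of the same length, uses Phred+33 qualities, and sums raw quality values (Bowtie by default rounds qualities Maq-style before applying -e). The 12-base read, reference window and quality string are hypothetical, and the seed length is shortened to 8 to fit the toy read.

```python
def mismatches(read, ref_window):
    """0-based read positions where read and reference window differ."""
    return [i for i, (r, g) in enumerate(zip(read, ref_window)) if r != g]


def passes_v_mode(read, ref_window, v=2):
    """-v: at most v mismatches over the whole read; qualities ignored."""
    return len(mismatches(read, ref_window)) <= v


def passes_n_mode(read, ref_window, quals, n=2, seed_len=28, max_qual_sum=70):
    """-n/-l/-e: at most n mismatches within the first seed_len bases, and
    summed Phred quality over all mismatched positions <= max_qual_sum."""
    mm = mismatches(read, ref_window)
    in_seed = sum(1 for i in mm if i < seed_len)
    qual_sum = sum(ord(quals[i]) - 33 for i in mm)
    return in_seed <= n and qual_sum <= max_qual_sum


# Hypothetical toy data: three mismatches, one of them inside the seed.
ref = "ACGTACGTACGT"
read = "ACGAACGTTCGA"
quals = "III5IIII+II)"   # mismatched bases have qualities 20, 10 and 8
print(mismatches(read, ref))
print(passes_v_mode(read, ref, v=2))
print(passes_n_mode(read, ref, quals, n=2, seed_len=8))
```

As on the slide, the toy read fails the -v 2 rule (three mismatches) but passes the n-mode rule, because only one mismatch falls in the seed and the mismatched bases are low quality.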
Example Bowtie Commands
§  These are some things you could add to a script
#Default alignment settings (plus threaded and SAM output):
bowtie -p 2 -t -S bowtie_hg19/genome SRR036642.fastq out.sam
#Unique alignments:
bowtie -p 2 -t -S -m 1 -a --best --strata bowtie_hg19/genome SRR036642.fastq out.sam
bowtie -p 2 -t -S -m 1 -a bowtie_hg19/genome SRR036642.fastq out.sam
#Allowing up to 10 repeats (for gene families):
bowtie -p 2 -t -S -m 10 -a --best --strata bowtie_hg19/genome SRR036642.fastq out.sam
bowtie -p 2 -t -S -m 10 -a bowtie_hg19/genome SRR036642.fastq out.sam
bowtie -p 2 -t -S -k 10 --best --strata bowtie_hg19/genome SRR036642.fastq out.sam
#Input a gzipped file to bowtie (- means stdin)
gunzip -c SRR036642.fastq.gz | bowtie -p 2 -t bowtie_hg19/genome -
112
Effects of Various Options on Bowtie Output
alignment settings | time (s) | reads aligned | reads not aligned* | reads suppressed by -m (repeat) | # alignments reported | reads/alignments ratio
default | 3 | 164801 (86.60%) | 25492 (13.40%) | 0 | 164801 | 1
unique: m1, a, best, strata | 4 | 132168 (69.45%) | 25379 (13.34%) | 32746 (17.21%) | 132168 | 1
unique: m1, a | 11 | 120069 (63.10%) | 25492 (13.40%) | 44732 (23.51%) | 120069 | 1
max10: m10, a, best, strata | 5 | 147459 (77.49%) | 25379 (13.34%) | 17455 (9.17%) | 191860 | 0.768
max10: m10, a | 14 | 135517 (71.21%) | 25492 (13.40%) | 29284 (15.39%) | 180796 | 0.750
max10: k10, best, strata | 5 | 164914 (86.66%) | 25379 (13.34%) | 0 | 366410 | 0.450
Total reads in test dataset: 190293
* Playing with -l, -n, -e settings could decrease this number, but you will still have some reads not aligned
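To see why the same reads give different counts under different reporting options, the following Python sketch models the decision for a single read. It is a simplified model based on the option descriptions in this deck, not Bowtie's code: it takes the mismatch count of every valid alignment for the read, applies --best/--strata, then the -m limit, then -k or -a. In real Bowtie, without --best, the order in which alignments are found is not guaranteed. The example read is hypothetical.

```python
def reported_alignments(mismatch_counts, k=1, m=None, report_all=False,
                        best=False, strata=False):
    """Return the mismatch counts of the alignments reported for one read.

    mismatch_counts: mismatch count of every valid alignment of the read.
    strata only has an effect together with best, as in Bowtie.
    """
    hits = list(mismatch_counts)
    if best:
        hits.sort()  # fewest mismatches first
        if strata and hits:
            hits = [h for h in hits if h == hits[0]]  # best stratum only
    if m is not None and len(hits) > m:
        return []  # read suppressed as a repeat
    return hits if report_all else hits[:k]


# Hypothetical read with three valid alignments (0, 1 and 2 mismatches).
hits = [1, 0, 2]
print(reported_alignments(hits, m=1, report_all=True, best=True, strata=True))
print(reported_alignments(hits, m=1, report_all=True))
print(reported_alignments(hits, m=10, report_all=True))
print(reported_alignments(hits, k=10, best=True, strata=True))
```

With --best --strata the read counts as unique because only one alignment sits in the best stratum; without them, -m 1 sees three alignments and suppresses the read. That is consistent with the "unique" rows of the table, where adding best and strata increases the number of reads aligned.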
Let’s Run Bowtie
Exercise 1: Get today’s alignment dataset
cp -r /scratch/aln ~
ls ~/aln
Exercise 2:
cd ~/aln
*Hint: You can use nano (or other text editor) to change email in test_bowtie.sh script to your
email address.
qsub test_bowtie.sh
**qsub output will tell you your jobID (needed for the next step)
qstat -u $LOGNAME (to check status of job occasionally)
cat test_bowtie.sh (to look at the script that we submitted)
*Hint: Use “genome” as the genome name in commands instead of hg19 because of the index
basename in the folder e.g.,
bowtie bowtie_hg19/genome SRR036642.fastq
*Hint: to learn PBS syntax for submitting jobs on Biowulf, see their website:
http://biowulf/user_guide.html
114
Bowtie Output Stats
Exercise 2, continued:
cat bowtie_test.oXXXXXX (substitute your jobID for XXXXXX)
Unique alignments
Seeded quality full-index search: 00:00:09
# reads processed: 190293
# reads with at least one reported alignment: 132168 (69.45%)
# reads that failed to align: 25379 (13.34%)
# reads with alignments suppressed due to -m: 32746 (17.21%)
Reported 132168 alignments to 1 output stream(s)
Overall time: 00:00:51
Unique + repeat alignments max 10
Seeded quality full-index search: 00:00:10
# reads processed: 190293
# reads with at least one reported alignment: 147459 (77.49%)
# reads that failed to align: 25379 (13.34%)
# reads with alignments suppressed due to -m: 17455 (9.17%)
Reported 191860 alignments to 1 output stream(s)
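When comparing several runs, it can help to pull these counts out of the job output programmatically. The Python sketch below parses the summary lines shown above with regular expressions. The text is copied from the "Unique alignments" example on this slide; the function name and dictionary keys are arbitrary choices, and other Bowtie versions may word the summary differently.

```python
import re

SUMMARY = """\
# reads processed: 190293
# reads with at least one reported alignment: 132168 (69.45%)
# reads that failed to align: 25379 (13.34%)
# reads with alignments suppressed due to -m: 32746 (17.21%)
Reported 132168 alignments to 1 output stream(s)
"""


def parse_bowtie_summary(text):
    """Extract read and alignment counts from Bowtie summary text.
    Counts that are not present in the text are returned as 0."""
    patterns = {
        "processed": r"# reads processed: (\d+)",
        "aligned": r"# reads with at least one reported alignment: (\d+)",
        "failed": r"# reads that failed to align: (\d+)",
        "suppressed": r"# reads with alignments suppressed due to -m: (\d+)",
        "alignments": r"Reported (\d+) alignments",
    }
    counts = {}
    for key, pattern in patterns.items():
        match = re.search(pattern, text)
        counts[key] = int(match.group(1)) if match else 0
    return counts


stats = parse_bowtie_summary(SUMMARY)
print(stats)
print(stats["aligned"] / stats["processed"])
```

A useful sanity check is that aligned + failed + suppressed equals the number of reads processed, which holds for the figures above.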
115
TopHat Command Line
tophat [options]* <index_base> <reads>
•  See “TopHat” Section of Exercise Handout
•  Copy/Paste these commands, waiting for each to finish before
going to the next (should take ~1-2 minutes altogether):
§  cd ~/rnaseq_upenn
§  tophat -o wbc -p 2 -G hg19_chr6_refFlat_noRandomHapUn.gtf index/chr6 wbc_aln.fastq.gz
§  tophat -o lymph -p 2 -G hg19_chr6_refFlat_noRandomHapUn.gtf index/chr6 lymph_aln.fastq.gz
•  <index_base>: index name (genome), e.g., index/chr6
•  <reads>: FASTQ file, e.g., wbc_aln.fastq.gz
http://tophat.cbcb.umd.edu/manual.html
Using SAM Tools to Get Sorted BAM File
Convert SAM to BAM
samtools view [options] <in.bam or in.sam>
Options:
-b Output is BAM
-S Input is SAM
-h Include header if output is SAM
-o Output file (default: stdout)
e.g.,
samtools view -bS -o SRR062634.bam SRR062634.sam
samtools view -h SRR062634.bam | head -n 100
samtools sort [options] <in.bam> <out.prefix>
(.bam extension will be added to the “prefix”)
e.g.,
samtools sort SRR062634.bam SRR062634.sorted
samtools index <in.sorted.bam>
e.g.,
samtools index SRR062634.sorted.bam
http://samtools.sourceforge.net/samtools.shtml
•  samtools view: convert SAM to BAM (to compress) or BAM to SAM
•  samtools sort: sort BAM file
•  samtools index: index BAM file
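The three steps above can be chained in a script. The Python sketch below only builds the command lines, without running them, so it works without samtools installed. The helper name and the legacy_sort switch are hypothetical. The legacy form matches the syntax on this slide (output prefix); newer samtools releases (1.3 and later) expect an explicit output file via -o instead, so check which version is on your system.

```python
import shlex


def sam_to_sorted_bam_commands(sam_path, legacy_sort=True):
    """Return the three samtools commands (as argument lists) that turn
    a SAM file into a sorted, indexed BAM file."""
    stem = sam_path[:-4] if sam_path.endswith(".sam") else sam_path
    bam = stem + ".bam"
    sorted_bam = stem + ".sorted.bam"
    view = ["samtools", "view", "-bS", "-o", bam, sam_path]
    if legacy_sort:
        # Syntax used on the slide: output prefix, .bam is appended
        sort = ["samtools", "sort", bam, stem + ".sorted"]
    else:
        # Newer samtools: explicit output file with -o
        sort = ["samtools", "sort", "-o", sorted_bam, bam]
    index = ["samtools", "index", sorted_bam]
    return [view, sort, index]


for cmd in sam_to_sorted_bam_commands("SRR062634.sam"):
    print(shlex.join(cmd))
```

To actually run the steps, each list could be passed to subprocess.run(cmd, check=True) so that a failure in one step stops the pipeline.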
Get BAM Stats
BAM Stats with SAMtools and Picard
#Put samtools and other binaries on your PATH
export PATH=/usr/local/bio_apps/R/bin/:/usr/local/bio_apps/java/bin/:/usr/local/bio_apps/samtools/:$PATH
samtools idxstats <in.bam>
#Outputs chr, length of chr, # mapped reads, # unmapped reads
e.g.,
samtools idxstats SRR062634.sorted.bam
samtools flagstat SRR062634.sorted.bam
java -jar /usr/local/bio_apps/picard-tools/CollectMultipleMetrics.jar I=SRR062634.sorted.bam O=SRR062634.sorted
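The idxstats output is tab-separated with four columns (chr, length, # mapped reads, # unmapped reads), so it is easy to summarize. Below is a minimal Python sketch of such a parser. The example text and its read counts are made up for illustration, and the final "*" line stands for reads with no reference assigned.

```python
def parse_idxstats(text):
    """Parse `samtools idxstats` output: one tab-separated line per
    reference with name, length, mapped reads, unmapped reads."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        name, length, mapped, unmapped = line.split("\t")
        rows.append((name, int(length), int(mapped), int(unmapped)))
    return rows


def mapped_per_megabase(rows):
    """Mapped reads per Mb for each reference with non-zero length."""
    return {name: mapped / (length / 1e6)
            for name, length, mapped, _ in rows if length > 0}


# Hypothetical idxstats output; the read counts are invented.
example = "chr6\t171115067\t1500\t12\nchrM\t16571\t40\t0\n*\t0\t0\t300\n"
rows = parse_idxstats(example)
total_mapped = sum(mapped for _, _, mapped, _ in rows)
print(total_mapped, mapped_per_megabase(rows))
```

Normalizing by reference length makes it easier to spot unexpected enrichment, for example a very high density of reads on chrM.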
118
Alignment QC Plots
[Figure: "Insert Size Histogram for All_Reads in file accepted_hits.bam"; x-axis: Insert Size, y-axis: Count; series: FR]
[Figure: "accepted_hits.bam Quality By Cycle"; x-axis: Cycle, y-axis: Mean Quality; series: Mean Quality, Mean Original Quality]
[Figure: "accepted_hits.bam Quality Score Distribution"; x-axis: Quality Score, y-axis: Observations; series: Quality Scores, Original Quality Scores]

NGS: Mapping and de novo assembly

  • 1. Next-Generation Sequencing Analysis Series January 28, 2015 Andrew Oler, PhD High-throughput Sequencing Bioinformatics Specialist BCBB/OCICB/NIAID/NIH
  • 2. Bioinformatics and Computational Biosciences Branch §  Bioinformatics Software Developers §  Computational Biologists §  Project Managers & Analysts http://www.niaid.nih.gov/about/organization/odoffices/omo/ocicb/Pages/bcbb.aspx 2
  • 3. Objectives §  Give you an introduction to common methods used to process and analyze Next Generation Sequence data §  Learn methods for 1) Mapping NGS reads and 2) De novo assembly of NGS reads §  Give exposure to various applications for NGS experiments 3
  • 4. Illumina Sample DNA library Illumina sequencing What other platforms? Cycle 1 Cycle 2 Cycle 3 Cycle 4 Cycle 5
  • 5. Illumina Paired-End Library Preparation 5Illumina
  • 6. Illumina Mate Pair Libraries 6
  • 7. A growing list of NGS applications RNA-Seq / miRNA-seq (noncoding, differential expression, Novel splice forms, antisense) Epigenetics (Chip- Seq, Mnase-seq, Bisulfite-Seq) CNV, Structural variations Targeted resequencing “Exome analysis” Whole genome sequencing Metagenomics (16S microbiome, environmental WGS) Somatic mutations Variants in mendelian diseases High-throughput sequencing De novo genome assembly What applications?
  • 8. Alignment versus De Novo Assembly 8 Short Sequence “Reads” Is a Reference Genome available? Yes No Alignment to Reference de novo Assembly ? http://www.ncbi.nlm.nih.gov/sites/genome “Browse by organism groups”
  • 10. Steps in Alignment/Mapping 1.  Get your sequence data 2.  Check quality of sequence data 3.  Choose an alignment/mapping program 4.  Run the alignment 5.  View the alignments 6.  Downstream Processing 10
  • 11. Steps in Alignment/Mapping 1.  Get your sequence data 2.  Check quality of sequence data 3.  Choose an alignment/mapping program 4.  Run the alignment 5.  View the alignments 6.  Downstream Processing 11
  • 12. Public Short Read Repositories §  NIH/NCBI •  Short Read Archive (http://www.ncbi.nlm.nih.gov/sra) •  Gene Expression Omnibus (http://www.ncbi.nlm.nih.gov/geo/) •  1000 Genomes (ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/) •  European Nucleotide Archive (http://www.ebi.ac.uk/ena/) 12 fastq-dump SRR036642
  • 14. Sequence file formats §  Next gen sequence file formats are based on the commonly used FASTA format >sequence_ID and optional comments ATTCCGGTGCGGTGCGGTGCTGCCGTGCCGGTGC TTCGAAATTGGCGTCAGT §  The Phred quality scores per base were added to form the FASTQ format 14
  • 15. Sequence file formats §  Illumina Fastq format (fasta format with Quality values for each base) 15 @EAS139:136:FC706VJ:2:5:1000:12850 1:Y:18:ATCACG AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA - base calls + BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@ - Base quality+33 Full read header description" @ <instrument-name>:<run ID>:<flowcell ID>:<lane-number>:<tile-number>: <x-pos>: <y-pos> <read number>:<is filtered>:<control number>:<barcode sequence> Space to separate Read ID Read ID "
  • 16. Fastq Quality values 16 Quality scores are normally expected up to 40 in a Phred scale. ASCII characters <http://en.wikipedia.org/wiki/ASCII> BBBBCCCC?<A?BC?7@@???????DBBA@@@@A@@ " The highest base quality score in this sequence: ‘D’=(68-33)=35 From http://en.wikipedia.org/wiki/FASTQ_format = 0.00032 (or 1/3200 incorrect)P=10 -35/10 If base quality = 35
  • 17. Steps in Alignment/Mapping 1.  Get your sequence data 2.  Check quality of sequence data 3.  Choose an alignment/mapping program 4.  Run the alignment 5.  View the alignments 6.  Downstream Processing 17
  • 18. Running FastQC Open FastQC program Open in browser: fastqc_report.html 18 Per base sequence quality p-value = 0.0001 p-value = 0.001 p-value = 0.01 p-value = 0.05 Babraham Bioinformatics http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
  • 19. Steps in Alignment/Mapping 1.  Get your sequence data 2.  Check quality of sequence data 3.  Choose an alignment/mapping program 4.  Run the alignment 5.  View the alignments 6.  Downstream Processing 19
  • 20. Short Read Alignment Software § BFAST § BLASTN § BLAT § Bowtie § BWA § ELAND § GNUMAP § GMAP and GSNAP § MAQ § mrFAST and mrsFAST § MOSAIK § Novoalign § RUM § SHRiMP § SOAP § SpliceMap § SSAHA and SSAHA2 § STAR § TopHat § ~20 more… 20 http://en.wikipedia.org/wiki/List_of_sequence_alignment_software http://tinyurl.com/seqanswers-mapping ls /usr/local/bio_apps/
  • 21. Issues of Consideration for Alignment Software §  Library types: •  Genomic DNA (for resequencing) •  ChIP DNA (PCR bias) •  RNA-seq cDNA –  mRNA-seq (junction mapping) –  smRNA-seq (adapter trimming) 21 3 n This protocol explains how to prepare libraries of chromatin-immuno- precipitated DNA for analysis on the Illumina Cluster Station and Genome Analyzer. You will add adapter sequences onto the ends of DNA fragments to generate the following template format: Figure 1 Fragments after Sample Preparation The adapter sequences correspond to the two surface-bound oligos on the flow cells used in the Cluster Station. DNA Fragment Adapters 3 Introduction This protocol explains how to prepare libraries of small RNA for subsequent cDNA sequencing on the Illumina Cluster Station and Genome Analyzer. You will physically isolate small RNA, ligate the adapters necessary for use during cluster creation, and reverse-transcribe and PCR to generate the following template format: Figure 1 Fragments after Sample Preparation The 5’ small RNA adapter is necessary for reverse transcription and amplification of the small RNA fragment. This adapter also contains the DNA sequencing primer binding site. The 3’ small RNA adapter corresponds to Small RNA Adapters cDNA Fragment Adapter Ligation RT-PCR Illumina
  • 22. Issues of Consideration for Alignment Software §  Types of reads •  Single-end •  Paired-end 22 1 2 Mean, Standard Deviation of Inner Distance e.g. SRR036642.fastq e.g. SRR027894_1.fastq, SRR027894_2.fastq
  • 23. Issues of Consideration for Alignment Software §  Library types, continued: •  Multiplexed library (demultiplex) •  Mate pair library •  Bisulfite-converted (C->T reference genome) 23Illumina Heng Li, 2010
  • 24. Issues of Consideration for Alignment Software §  Platform differences •  Bases (ACTG) •  Colorspace (2-base encoding, SOLiD) •  Read Length •  454 (homopolymers) 24
  • 25. Issues of Consideration for Alignment Software §  Software Properties •  Open-source or proprietary ($) •  Accuracy •  Speed of algorithm •  Multi-threaded or single processor •  RAM requirements (2GB vs 50GB for loading index) •  Use of base quality score •  Gapped alignment (indels) 25
  • 26. Steps in Alignment/Mapping 1.  Get your sequence data 2.  Check quality of sequence data 3.  Choose an alignment/mapping program 4.  Run the alignment 5.  View the alignments 6.  Downstream Processing 26
  • 27. The Command Line Terminal A New World to Some
  • 28. File Manager/Browser by Operating System 28 OS: Windows Mac OSX Unix FM: Explorer Finder Shell Input Method:
  • 29. Anatomy of the Terminal, “Command Line”, or “Shell” Prompt (computer_name:current_directory username) Cursor Command Argument Window Output Mac: Applications -> Utilities -> Terminal Windows: Download open source software PuTTY http://www.chiark.greenend.org.uk/~sgtatham/putty/ Other SSH Clients (http://en.wikipedia.org/wiki/Comparison_of_SSH_clients) Cygwin (http://www.cygwin.com/) 29
  • 30. How to execute a command command argument output output 30
  • 31. ls (“list”) ls list the files, links, subdirectories, etc. in a directory ls -a same as “ls”, but also show the “hidden” files ls -l list files with details (size, timestamp, ownership, permissions) ls -lh use “human-readable” file sizes *See handout for more options!* 31
  • 32. cd (“change directory”), mkdir (“make directory”) and viewing files cd ~ change to home directory cd test_data change to “test_data” directory cd .. change to higher directory (“go up”) cd ~/unix_hpc change to home directory > “unix_hpc” directory mkdir dir_name make directory “dir_name” pwd “print working directory” head junctions.bed view the first 10 lines of “junctions.bed” head -5 file1 view the first 5 lines of “file1” tail lymph3K.fastq view the last 10 lines of “lymph3K.fastq” tail -5 file1 view the last 5 lines of “file1” less lymph3K.fastq view a file; Space to page down, Ctrl-b to page up, arrow keys also work; “/” to search, “q” to quit (faster for huge files) 32
  • 34. Using ChIP-seq to analyze protein-DNA contacts §  Proteins called transcription factors (TFs) are involved in regulation of gene activation §  The first step in gene activation is binding of the TF to its target gene. Gene X RNA Polymerase TF
  • 35. Chromatin Immunoprecipitation (ChIP) See also: ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 2012 Sep;22(9):1813-31
  • 36. Burrows-Wheeler Transformation §  Uses Burrows-Wheeler Transformation •  small genome index •  small memory footprint (RAM) during alignment •  faster alignment §  Good at getting very accurate alignments quickly §  Used in BWA, Bowtie, Bowtie2 36 Reference Sequence Indexed Sequence Burrows-Wheeler Transformation Langmead B, Trapnell C, Pop M, Salzberg SL. Genome Biol 10:R25. Create all permutations, then sort
  • 39. Steps in TopHat Alignment 39Genome Biology (2013) 14:R36.
  • 40. Alignment for Variant Analysis §  Variants •  Small-scale –  single nucleotide variants (SNV) or single nucleotide polymorphism (SNP) –  short insertions or deletions –  deletion followed by insertion (indel) •  Large-scale, structural –  copy number variants (CNV) –  inversions and translocations §  Alignment software that will support gapped alignment for small-scale variation •  BWA (also uses Burroughs-Wheeler algorithm) •  Novoalign •  Bowtie2 •  GSNAP •  GEM •  mrFAST •  MOSAIK •  RMAP •  rNA •  RTG Investigator •  Segemehl •  SHRiMP •  Stampy •  SToRM 40http://en.wikipedia.org/wiki/List_of_sequence_alignment_software http://www.hgvs.org/mutnomen/recs-DNA.html
  • 41. iGenomes §  Common standard datasets for genomic analysis, organized in standardized directory structure •  Found in /gpfs/bio_data/iGenomes on NIAID HPC •  Found in /fdb/igenomes on Biowulf §  Files have additional formatting required by TopHat, Cufflinks §  Maintained by Illumina, hosted on TopHat / Cufflinks website •  http://tophat.cbcb.umd.edu/igenomes.html §  Approximately 500Gb for all species together §  Genomes available: 41 Arabidopsis_thaliana Bacillus_cereus_ATCC_10987 Bacillus_subtilis_168 Bos_taurus Caenorhabditis_elegans Canis_familiaris Drosophila_melanogaster Enterobacteriophage_lamdba Equus_caballus Escherichia_coli_K_12_DH10B Escherichia_coli_K_12_MG1655 Gallus_gallus Glycine_max Homo_sapiens Macaca_mulatta Mus_musculus Mycobacterium_tuberculosis_H37RV Oryza_sativa_japonica Pan_troglodytes PhiX Pseudomonas_aeruginosa_PAO1 Rattus_norvegicus Rhodobacter_sphaeroides_2.4.1 Saccharomyces_cerevisiae Schizosaccharomyces_pombe Sorangium_cellulosum_So_ce_56 Sorghum_bicolor Staphylococcus_aureus_NCTC_8325 Sus_scrofa Zea_mays
  • 42. iGenomes Directory Structure [Species] Ensembl NCBI UCSC hg18 … hg19 Annotation Genes SmallRNA Variation Sequence BWAIndex BowtieIndex Chromosomes WholeGenomeFasta AbundantSequences GenomeStudio 42 GTF, other formats FASTA filesPre-built Indexes Examples: iGenomes/Homo_sapiens/UCSC/hg19/Sequence/BowtieIndex/genome iGenomes/Homo_sapiens/UCSC/hg19/Sequence/WholeGenomeFasta/genome.fa iGenomes/Homo_sapiens/UCSC/hg19/Annotation/Genes/genes.gtf
  • 43. Are you still awake? 43
  • 44. Mapping Demo with Bowtie and BWA §  SRR036642 from SRA •  ChIP-seq •  Map using Bowtie §  SRR062634 from SRA •  Human Resequencing data •  Map using BWA 44
  • 45. No HPC available to you? Free, Alternative Ways to Map NGS Reads §  Galaxy •  Web-based analysis workflow interface •  https://main.g2.bx.psu.edu/ •  Emphasis on NGS tools •  Includes Bowtie, BWA, TopHat §  Kbase •  Web-based command-line interface •  http://kbase.science.energy.gov/ •  Includes Bowtie, BWA §  Disadvantages of online tools: •  Takes long time to upload data to servers •  Disk space limitations •  Limited customization of analysis workflow 45
  • 46. Steps in Alignment/Mapping 1.  Get your sequence data 2.  Check quality of sequence data 3.  Choose an alignment/mapping program 4.  Run the alignment 5.  View the alignments 6.  Downstream Processing 46
  • 47. Most commonly used alignment file formats §  SAM (sequence alignment map) Unified format for storing alignments to a reference genome §  BAM (binary version of SAM) Compressed SAM file, is normally indexed 47
  • 48. SAM/BAM format (sequence alignment map): Most commonly used alignment file formats 48 QNAME FLAG RNAME POSITION MAPQ CIGAR MRNM MPOS TLEN SEQ QUAL OPT Unified format for storing alignments to a reference genome BAM is a compressed SAM file, normally indexed http://samtools.sourceforge.net/samtools.shtml http://samtools.sourceforge.net/SAM1.pdf http://picard.sourceforge.net/explain-flags.html
  • 49. Picard Tools AddOrReplaceReadGroups.jar BamIndexStats.jar BamToBfq.jar BuildBamIndex.jar CalculateHsMetrics.jar CleanSam.jar CollectAlignmentSummaryMetrics.jar CollectCDnaMetrics.jar CollectGcBiasMetrics.jar CollectInsertSizeMetrics.jar CollectMultipleMetrics.jar CompareSAMs.jar CreateSequenceDictionary.jar EstimateLibraryComplexity.jar ExtractIlluminaBarcodes.jar ExtractSequences.jar FastqToSam.jar FixMateInformation.jar IlluminaBasecallsToSam.jar MarkDuplicates.jar MeanQualityByCycle.jar MergeBamAlignment.jar MergeSamFiles.jar NormalizeFasta.jar picard-1.45.jar QualityScoreDistribution.jar ReorderSam.jar ReplaceSamHeader.jar RevertSam.jar sam-1.45.jar SamFormatConverter.jar SamToFastq.jar SortSam.jar ValidateSamFile.jar ViewSam.jar 49 http://broadinstitute.github.io/picard/ java -jar QualityScoreDistribution.jar I=file.bam CHART=file.pdf /usr/local/bio_apps/java/bin/java -jar /usr/local/bio_apps/picard-tools/ CollectMultipleMetrics.jar …
  • 50. Visualization of output in Integrated Genome Browser (IGV) §  IGV download •  http://www.broadinstitute.org/igv/projects/current/igv_mm.jnlp (Windows 1.2GB) •  http://www.broadinstitute.org/igv/projects/current/igv_lm.jnlp (Mac 2GB) §  Open IGV by double-clicking. Upload data by selecting File → Load from URL, and entering the following links. §  Links to BAM files •  ChIP-seq: –  https://dl.dropbox.com/u/12821862/SRR036642.bam •  DNA-seq: –  https://dl.dropbox.com/u/30379708/SRR062634.sorted.bam •  RNA-seq: –  https://dl.dropbox.com/u/30379708/Upenn/lymph_accepted_hits.bam –  https://dl.dropbox.com/u/30379708/Upenn/wbc_accepted_hits.bam §  Examples: •  chr6:26,224,647-26,402,373 •  rs1205023, chr6:3,582,094-3,582,266 •  AIF1 •  LST1 50
  • 51. Steps in Alignment/Mapping 1.  Get your sequence data 2.  Check quality of sequence data 3.  Choose an alignment/mapping program 4.  Run the alignment 5.  View the alignments 6.  Downstream Processing 51
  • 52. Downstream Processing §  Finding peaks (ChIP-seq) §  Annotating peaks to genes (ChIP-seq) §  Assembling transcripts (RNA-seq) §  Annotating transcripts to genes (RNA-seq) §  Etc. 52Park, Nat Rev Genet, 2009 http://grimmond.imb.uq.edu.au/mammalian_transcriptome.html
  • 53. 53 Examples of using different mapping strategies for NGS
  • 54. ChIP-seq and Differential Expression RNA- seq §  The transcription factor T-bet is induced by multiple pathways and prevents an endogenous Th2 cell program during Th1 cell responses. Immunity. 2012 Oct 19;37(4):660-73. doi: 10.1016/j.immuni.2012.09.007. Epub 2012 Oct 4. Zhu J, Jankovic D, Oler AJ, Wei G, Sharma S, Hu G, Guo L, Yagi R, Yamane H, Punkosdy G, Feigenbaum L, Zhao K, Paul WE. §  ChIP-seq Methods •  Mapping: Bowtie •  Peaks: MACS §  RNA-seq Methods •  Mapping: TopHat •  Expression: USeq 54
  • 55. Resequencing/Variant Analysis
      §  Whole genome sequencing of peach (Prunus persica L.) for SNP identification and selection. BMC Genomics. 2011 Nov 22;12:569. doi: 10.1186/1471-2164-12-569. Ahmad R, Parfitt DE, Fass J, Ogundiwin E, Dhingra A, Gradziel TM, Lin D, Joshi NA, Martinez-Garcia PJ, Crisosto CH.
      §  Methods
          •  Mapping: BWA
          •  SNP calling: SAMtools
      http://www.themoneytimes.com/files/peach.jpg?1270231192
  • 56. RNA-seq alternative splicing
      §  RNA-Seq analysis of the parietal cortex in Alzheimer's disease reveals alternatively spliced isoforms related to lipid metabolism. Neurosci Lett. 2013 Mar 1;536:90-5. doi: 10.1016/j.neulet.2012.12.042. Epub 2013 Jan 7. Mills JD, Nalpathamkalam T, Jacobs HI, Janitz C, Merico D, Hu P, Janitz M.
      §  Methods
          •  Mapping: TopHat
          •  Splicing: Cufflinks, Cuffdiff
  • 57. Stranded RNA-seq
      §  Directional gene expression and antisense transcripts in sexual and asexual stages of Plasmodium falciparum. BMC Genomics. 2011 Nov 30;12:587. doi: 10.1186/1471-2164-12-587. López-Barragán MJ, Lemieux J, Quiñones M, Williamson KC, Molina-Cruz A, Cui K, Barillas-Mury C, Zhao K, Su XZ.
      §  Methods
          •  Mapping: TopHat
          •  Expression: Cufflinks
  • 58. Genome-wide Bisulfite Sequencing
      §  Whole-genome bisulfite DNA sequencing of a DNMT3B mutant patient. Epigenetics. 2012 Jun 1;7(6):542-50. doi: 10.4161/epi.20523. Epub 2012 Jun 1. Heyn H, Vidal E, Sayols S, Sanchez-Mut JV, Moran S, Medina I, Sandoval J, Simó-Riudalbas L, Szczesna K, Huertas D, Gatto S, Matarazzo MR, Dopazo J, Esteller M.
      §  Methods
          •  Mapping and Bisulfite analysis: BSMAP
  • 59. Cross-linking immunoprecipitation sequencing (CLIP-seq)
      §  LIN28 binds messenger RNAs at GGAGA motifs and regulates splicing factor abundance. Mol Cell. 2012 Oct 26;48(2):195-206. doi: 10.1016/j.molcel.2012.08.004. Epub 2012 Sep 6. Wilbert ML, Huelga SC, Kapeli K, Stark TJ, Liang TY, Chen SX, Yan BY, Nathanson JL, Hutt KR, Lovci MT, Kazan H, Vu AQ, Massirer KB, Morris Q, Hoon S, Yeo GW.
      §  Methods
          •  Mapping: Bowtie
          •  Peaks: Custom scripts
  • 60. Ribosomal Profiling sequencing (Ribo-seq)
      §  Genome-wide ribosome profiling reveals complex translational regulation in response to oxidative stress. Proc Natl Acad Sci U S A. 2012 Oct 23;109(43):17394-9. doi: 10.1073/pnas.1120799109. Epub 2012 Oct 8. Gerashchenko MV, Lobanov AV, Gladyshev VN.
      §  Methods
          •  Mapping: Bowtie
          •  Translation Efficiency: Custom Perl scripts
  • 61. Chromosome Conformation Capture Sequencing (4C)
      §  Multiplexed chromosome conformation capture sequencing for rapid genome-scale high-resolution detection of long-range chromatin interactions. Nat Protoc. 2013 Feb 14;8(3):509-24. doi: 10.1038/nprot.2013.018. Epub 2013 Feb 14. Stadhouders R, Kolovos P, Brouwer R, Zuin J, van den Heuvel A, Kockx C, Palstra RJ, Wendt KS, Grosveld F, van Ijcken W, Soler E.
      §  Methods
          •  Mapping: Bowtie via NARWHAL
          •  Post-alignment: BED Tools
  • 62. DNase I Hypersensitivity (DNase-seq)
      §  Modeling gene expression using chromatin features in various cellular contexts. Genome Biol. 2012 Jun 13;13(9):R53. doi: 10.1186/gb-2012-13-9-r53. Dong X, Greven MC, Kundaje A, Djebali S, Brown JB, Cheng C, Gingeras TR, Gerstein M, Guigó R, Birney E, Weng Z.
      §  Methods
          •  Mapping: Maq
          •  Peak calling: F-Seq
      http://genome.ucsc.edu/cgi-bin/hgTrackUi?hgsid=328941501&g=wgEncodeChromatinMap&hgTracksConfigPage=configure
  • 63. 16S rRNA Microbiome Sequencing
      §  Reducing the effects of PCR amplification and sequencing artifacts on 16S rRNA-based studies. PLoS One. 2011;6(12):e27310. doi: 10.1371/journal.pone.0027310. Epub 2011 Dec 14. Schloss PD, Gevers D, Westcott SL.
      §  Methods
          •  Taxonomic Assignment: Mothur classify.seqs
          •  Alignment: Mothur align.seqs
  • 64. Polyploid Genome Re-sequencing
      §  PolyCat: A Resource for Genome Categorization of Sequencing Reads From Allopolyploid Organisms. G3 (Bethesda). 2013 Mar;3(3):517-25. doi: 10.1534/g3.112.005298. Epub 2013 Mar 1. Page JT, Gingle AR, Udall JA.
      §  Methods
          •  Mapping: GSNAP
          •  Homoeo-SNP calling: PolyCat
  • 65. Additional Resources
      §  Commercial Software for NGS Analysis (No Command Line!)
          •  Partek Genomics Suite
              –  http://www.partek.com/?q=partekgs
          •  CLCBio Genomics Workbench
              –  http://www.clcbio.com/products/clc-genomics-workbench/
  • 66. Next-Generation Sequencing Analysis Series Andrew Oler, PhD High-throughput Sequencing Bioinformatics Specialist BCBB/OCICB/NIAID/NIH
  • 67. Alignment versus De Novo Assembly
      §  Starting point: short sequence “reads”
      §  Is a reference genome available?
          •  Yes: alignment to reference
          •  No: de novo assembly
      §  Not sure? Check http://www.ncbi.nlm.nih.gov/sites/genome (“Browse by organism groups”)
  • 68. General strategy of assembling a genome de novo
      1.  Pre-process short reads (trim, quality filter…)
      2.  Assemble sequences into contigs
      3.  Order contigs into scaffolds
      4.  Annotate genome
  • 69. Basic Preprocessing
      §  Tools for evaluating quality
          •  PrinSeq (web and command line) - http://edwards.sdsu.edu/cgi-bin/prinseq/prinseq.cgi
          •  FastQC (stand-alone and command line) - http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
      §  Tools for trimming reads and removing adaptors
          •  Btrim - http://www.ncbi.nlm.nih.gov/pubmed/21651976
              –  Trims off adapters, barcodes and/or low quality regions from single or paired-end reads
          •  Cutadapt - http://code.google.com/p/cutadapt/
              –  Provides many options for trimming
              –  Accepts fasta, fastq and csfasta/qual
              –  Needs ordering of pairs; could be done with the cmpfastq script: http://compbio.brc.iop.kcl.ac.uk/software/download/cmpfastq
  • 70. Assembly of Sequences
      §  Algorithms
          1.  Greedy
          2.  Overlap-layout-consensus (OLC)
          3.  De Bruijn Graph
      Schatz M C et al. Genome Res. 2010;20:1165-1173
  • 71. Greedy
      §  Was used in the very early next-gen assemblers (e.g. SSAKE, VCAKE)
          1.  Starting from the highest-scoring overlap, the growing contig is repeatedly extended with the read that overlaps it with the highest score
          2.  The paired-end reads are used to generate supercontigs
          3.  Mate pairs can also be used to determine contig order
      §  * Repeats can cause big problems in this approach
  • 72. Imperfect Overlap Between Reads Can Lead to Incorrect Assembly in the Greedy Approach
      [Figure: a correct assembly compared with an incorrect assembly caused by an imperfect overlap]
      Brief Bioinform. 2009 July; 10(4): 354–366.
  • 73. Greedy Extension Leads to Arrested Assembly if Multiple Matches are Found
      [Figure: an existing contig and two unassembled reads that both match the contig. The assembler can't resolve which read to use, so assembly stops.]
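      The arrested-assembly behaviour above can be sketched in a few lines of Python. This is a toy illustration, not the algorithm of SSAKE or VCAKE: the function names, the exact-match overlap scoring, and the `min_len` threshold are all assumptions made for the example.

```python
def overlap(contig, read, min_len):
    """Length of the longest suffix of contig that exactly matches a prefix of read.

    Returns 0 if no overlap of at least min_len bases exists.
    """
    for n in range(min(len(contig), len(read)), min_len - 1, -1):
        if contig.endswith(read[:n]):
            return n
    return 0


def greedy_extend(contig, reads, min_len=3):
    """Extend contig to the right with the best-overlapping read, one read at a time.

    Stops when no read overlaps, or when several reads tie for the best
    overlap but would add different bases (the ambiguous case).
    """
    remaining = list(reads)
    while remaining:
        scored = [(overlap(contig, r, min_len), r) for r in remaining]
        best = max(score for score, _ in scored)
        if best == 0:
            break  # nothing overlaps the contig end
        tied = [r for score, r in scored if score == best]
        if len({r[best:] for r in tied}) > 1:
            break  # multiple conflicting matches: assembly is arrested
        contig += tied[0][best:]
        remaining = [r for r in remaining if r not in tied]
    return contig


# Unambiguous case: the contig grows through both reads.
print(greedy_extend("ATGGC", ["GGCTA", "CTAAC"]))
# Ambiguous case: two reads match the contig end but disagree, so nothing is added.
print(greedy_extend("ATGGC", ["GGCTA", "GGCAT"]))
```

      Real assemblers score inexact overlaps and use pairing information to break such ties; the sketch only shows why a repeat boundary halts a purely greedy extension.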
  • 74. Overlap Graph or Overlap-layout-consensus (OLC)
      •  Perform better overall (compared with greedy assemblers)
      •  All reads are compared against all reads, using k-mers as seeds; a seed-and-extend algorithm is used.
      •  Good for long reads (e.g. Sanger or other >100bp, such as 454, Ion Torrent, PacBio) due to the minimum overlap threshold
      •  Examples: CABOG (Celera), ARACHNE
      •  Newbler, developed for 454, is based on OLC and is now also being used for Ion Torrent
  • 75. De Bruijn Graph
      •  Reads are broken into successive k-mers, and the graph records how the k-mers connect
      •  Each k-mer is a node, and edges are drawn between consecutive k-mers in a read.
      •  Repeat sequences create a fork in the graph; alternative sequences create a bubble.
      •  The k-mer size can only be determined by “trial and error”.
      •  A small value of K will create a complex graph, but a large value of K may miss small overlaps. A good starting point is a k-mer size that is 2/3 the size of the read
      •  Good for short reads or small genomes. With long reads and/or large genomes, may require lots of RAM (e.g., ~0.5 TB for human)
      •  Examples: Velvet, SOAPdenovo, ALLPATHS-LG, ABySS
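      As a rough illustration of the graph described above, the following Python sketch builds a graph with k-mers as nodes and edges between consecutive k-mers of each read, then lists the nodes where the graph forks. The function names and the toy reads are assumptions for the example; real assemblers add error pruning, reverse complements, and compact storage.

```python
from collections import defaultdict


def de_bruijn(reads, k):
    """Map each k-mer (node) to the set of k-mers that directly follow it in some read."""
    graph = defaultdict(set)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for left, right in zip(kmers, kmers[1:]):
            graph[left].add(right)
    return graph


def fork_nodes(graph):
    """Nodes with more than one successor, i.e. where the path is ambiguous."""
    return sorted(node for node, successors in graph.items() if len(successors) > 1)


# Two toy reads that agree up to GCG and then diverge.
graph = de_bruijn(["ATGGCGT", "GGCGA"], k=3)
print(fork_nodes(graph))
```

      With a smaller k, more unrelated reads share k-mers and the number of forks grows; with a larger k, short overlaps between reads are no longer represented at all, which is the trade-off the slide describes.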
  • 76. Evaluating the assembly
      §  Genome assembly results:
          •  contig size and number of contigs produced
          •  scaffold size and number
          •  N50 and N90
      §  Coverage
      §  GC Content
      §  Genome annotation
          •  repeats analysis and annotation
          •  protein-coding gene annotation (including gene structure prediction and gene function annotation)
          •  non-coding RNA gene annotation (including annotation of microRNA, tRNA, rRNA, and other ncRNA)
          •  transposon and tandem repeats annotation
      §  Comparative genomics and evolution (chromosome structure, conserved gene families)
  • 77. Evaluating the assembly: Basic statistics
      §  N50 = the length of the shortest contig such that the sum of contigs of equal length or longer is at least 50% of the total length of all contigs.
      §  N90 = the length of the shortest contig such that the sum of contigs of equal length or longer is at least 90% of the total length of all contigs.
      §  Example:
          Contig size (bp)
          3000
          2000    ← N50 (3000 + 2000 = 5000, at least 50% of 8000)
          1200
          800
          600     ← N90 (cumulative 7600, at least 90% of 8000)
          400
          Total: 8000
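      The definition above translates directly into code. The sketch below is a minimal Python version; the function name `nx` and the integer `percent` argument are choices made for this example, and the contig lengths are the ones from the slide.

```python
def nx(lengths, percent):
    """Length of the shortest contig in the smallest set of longest contigs
    whose combined length is at least `percent` % of the total assembly length."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding at the threshold.
        if running * 100 >= percent * total:
            return length
    return None  # only reached for an empty list


contigs = [3000, 2000, 1200, 800, 600, 400]  # total 8000 bp
print("N50:", nx(contigs, 50))
print("N90:", nx(contigs, 90))
```

      Because the statistic depends only on contig lengths, a large N50 says nothing about whether the contigs are correctly joined.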
  • 78. To Determine Optimal kmer Size, Try Many
      [Figure: Effect of kmer Length on Contig Length in ABySS. x-axis: kmer (bp), 20 to 64; y-axis: Contigs (bp), 0 to 1000; series: ABySS N25, ABySS N50, ABySS N75]
      *This will vary based on dataset (genome, read length, etc.)
      *Good starting point is 2/3 of the read length.
  • 79. Example of de novo genome assembly from start to finish: Giant Panda
      §  “The sequence and de novo assembly of the giant panda genome.” Nature. 2010 Jan 21;463(7279):311-7. doi: 10.1038/nature08696. Epub 2009 Dec 13.
  • 81. Sequencing Strategies for De novo Assembly
      §  Complex genome (if any condition met)
          •  GC content: < 35% or > 65%
          •  Repeat content: >50%
          •  Heterozygous diploid or polyploid
          •  Heterozygosity rate > 0.5%
  • 82. Flowchart of the panda genome assembly 82
  • 83. Supplementary Methods for Details about Panda Genome Assembly §  Illumina GA Platform, 35-71 bp paired-end reads §  “In total, we generated 176-Gb of usable sequence (equal to 73-fold coverage of the whole genome), with an average read length of 52  bp” 83
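      The figures quoted above are linked by simple arithmetic: fold coverage is total sequenced bases divided by genome size. The short Python sketch below works backwards from the quoted numbers; the function names are made up for the example, and the derived genome size and read count are back-of-the-envelope values, not figures from the paper.

```python
def fold_coverage(total_bases, genome_size):
    """Average number of times each base of the genome is sequenced."""
    return total_bases / genome_size


def implied_genome_size(total_bases, coverage):
    """Genome size implied by a given amount of sequence and fold coverage."""
    return total_bases / coverage


total_bases = 176e9          # 176 Gb of usable sequence
coverage = 73                # 73-fold coverage
average_read_length = 52     # bp

genome_size = implied_genome_size(total_bases, coverage)
approx_reads = total_bases / average_read_length

print(f"Implied genome size: {genome_size / 1e9:.2f} Gb")
print(f"Approximate number of reads: {approx_reads / 1e9:.2f} billion")
```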
  • 85. Sequencing Error Correction and Filtering
      §  “The quality requirements for de novo sequencing is far higher than for re-sequencing, because sequencing errors can create difficulties for the short-read assembly algorithm. We therefore carried out a stringent filtering process.”
      §  Remove reads that contain only/mostly adapter.
          •  How would you do that?
      §  Exclude datasets/lanes with too much low-quality sequence.
      §  Trimming at 3’ end to remove low-quality bases
      §  Remove duplicate base call reads
      §  Remove reads with significant excess of “N” and low-quality bases.
          •  How would you do that?
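      One possible answer to the last “How would you do that?” is sketched below in Python: trim low-quality bases from the 3’ end, then discard reads with too many “N” or low-quality bases. Every threshold here (`min_q`, `max_n_frac`, `max_lowq_frac`) and the Phred+33 default `offset` are illustrative assumptions, not the values used for the panda assembly; older Illumina pipelines encode qualities as Phred+64.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Remove bases from the 3' end while their quality is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def keep_read(seq, qual, max_n_frac=0.1, max_lowq_frac=0.5, min_q=20, offset=33):
    """Return True if the read has an acceptable fraction of N and low-quality bases."""
    if not seq:
        return False  # nothing left after trimming
    n_frac = seq.upper().count("N") / len(seq)
    lowq_frac = sum(1 for c in qual if ord(c) - offset < min_q) / len(seq)
    return n_frac <= max_n_frac and lowq_frac <= max_lowq_frac


# 'I' encodes Q40 and '#' encodes Q2 under Phred+33.
seq, qual = trim_3prime("ACGTACGT", "IIIIII##")
print(seq, qual, keep_read(seq, qual))
```

      In practice the tools listed earlier in the deck (for example Btrim or Cutadapt) handle this step, including adapter removal, which the sketch does not attempt.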
  • 86. Sequencing Error Correction and Filtering
      §  Error correction by K-mer frequency: “Prior to assembly, the sequence errors were corrected based on K-mer frequency information. For the panda genome assembly, we chose K=17 bp, and corrected sequencing errors for the 17-mers with a frequency lower than 4. In summary, we corrected 8.4% of the reads and 0.2% of the bases. The total, the number of distinct 27-mers (we used 27-mer in graph construction and assembly) was reduced from 8.62 billion to 2.69 billion (3.2 times smaller) through this error correction step.”
      §  Internal to SOAPdenovo
      §  Quake or the ALLPATHS-LG error corrector can be used as standalone methods to do this
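      The idea behind K-mer frequency correction is that, at high coverage, a true genomic k-mer is seen many times while a k-mer containing a sequencing error is seen rarely. The Python sketch below covers only the detection half (counting k-mers and flagging the rare ones); it is not SOAPdenovo's corrector, and the function names and toy reads are assumptions. The panda paper used K=17 and a cutoff of 4; the toy example uses a small k so the output is readable.

```python
from collections import Counter


def kmer_counts(reads, k):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def low_frequency_kmers(counts, min_count):
    """K-mers seen fewer than min_count times: candidates for containing an error."""
    return {kmer for kmer, n in counts.items() if n < min_count}


# Four identical reads plus one read with a single-base difference (A -> T).
reads = ["ACGTAC"] * 4 + ["ACGTTC"]
counts = kmer_counts(reads, k=4)
print(sorted(low_frequency_kmers(counts, min_count=4)))
```

      A corrector would then try to change a base in the affected read so that all of its k-mers become high-frequency ones; that step is left out here.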
  • 87. Strategy of SOAPdenovo
      §  A. Create Graph
          •  Kmer = 27
      http://soap.genomics.org.cn/down/soapdenovo.pdf
      http://1.usa.gov/oTUrWC
  • 88. Strategy of SOAPdenovo
      §  B. Simplify the graph by removing errors
          •  72 million; 2.6 million (Keep Contigs >100bp)
          •  N50: 1483
          •  N90: 224
  • 89. Strategy of SOAPdenovo
      §  C. Realign reads into contigs and use paired end information to create scaffolds
          •  Require at least 3 consistent pairs to make a connection
          •  Start with small inserts, progressively add larger insert libraries
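      The “at least 3 consistent pairs” rule can be illustrated with a small Python sketch that counts how many read pairs link each pair of contigs and keeps only the well-supported links. The input format (one tuple of contig names per read pair) and the function name are hypothetical, and the sketch ignores the orientation and insert-size checks that make a pair “consistent” in a real scaffolder.

```python
from collections import Counter


def scaffold_links(pair_mappings, min_pairs=3):
    """Count read pairs joining two different contigs; keep links with enough support.

    pair_mappings: iterable of (contig_of_read1, contig_of_read2) tuples.
    """
    links = Counter()
    for contig_a, contig_b in pair_mappings:
        if contig_a != contig_b:  # pairs within one contig carry no scaffold information
            links[tuple(sorted((contig_a, contig_b)))] += 1
    return {pair: n for pair, n in links.items() if n >= min_pairs}


pairs = (
    [("contig1", "contig2")] * 3
    + [("contig2", "contig1")]      # same link, reads mapped in the other order
    + [("contig2", "contig3")] * 2  # only two supporting pairs: not enough
    + [("contig1", "contig1")]
)
print(scaffold_links(pairs))
```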
  • 90. Scaffolding Statistics 90 “In principle, the scaffold size could have been further improved by using even more distant insert-sized paired-end data, such as fosmid ends (~35 Kb) and BAC ends (100~150 Kb).”
  • 91. Strategy of SOAPdenovo
      §  D. Close Gaps using Paired-end reads
          •  Mainly repeats (masked during scaffold construction)
          •  Local assembly of the reads that align to the gap
          •  If unknown copy number, fill with Ns
          •  97% of gaps filled
          •  Increased coverage from 84.2% to 93.6%
  • 92. Are you still awake? 92
  • 93. SOAPdenovo Demo
      §  Xpr1 variants in mouse
      §  “Endogenous gammaretrovirus acquisition in Mus musculus subspecies carrying functional variants of the XPR1 virus receptor.”
          •  J Virol. 2013 Sep;87(17):9845-55
      §  In IGV, go to Mouse, mm9
      §  “Load from URL”:
          •  https://dl.dropbox.com/u/30379708/H12.bam
          •  https://dl.dropbox.com/u/30379708/H15.bam
      §  Go to chr1:157,136,824-157,137,961 in browser
  • 94. SOAPdenovo2 §  Updates •  Reduced memory consumption in graph construction •  Resolves more repeat regions in contig assembly •  Increased coverage and length in scaffold construction •  Improved gap closing •  Optimization for large genome §  Luo et al.: SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience 2012 1:18. 94
  • 95. Annotation of Assembly: Repeats
      §  RepeatMasker
          •  http://www.binfo.ncku.edu.tw/RM/webrepeatmaskerhelp.html
          •  Known repeats
              –  Uses RepBase database of known repeats
          •  Low complexity repeats, satellites, etc.
              –  “100 bp stretch of DNA is masked when it is >87% AT or >89% GC, a 30 bp stretch has to contain 29 A/T (or GC) nucleotides”
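      The quoted low-complexity thresholds can be written out as two small checks. The Python below is only a literal reading of that one sentence, with made-up function names; it is not RepeatMasker's implementation, which scans the sequence and applies further rules.

```python
def base_counts(window):
    """Return (number of A/T bases, number of G/C bases) in the window."""
    w = window.upper()
    return w.count("A") + w.count("T"), w.count("G") + w.count("C")


def masked_100bp(window):
    """A 100 bp stretch is masked when it is >87% AT or >89% GC."""
    at, gc = base_counts(window)
    # Integer arithmetic keeps the comparison exact at the threshold.
    return at * 100 > 87 * len(window) or gc * 100 > 89 * len(window)


def masked_30bp(window):
    """A 30 bp stretch is masked when at least 29 bases are A/T (or G/C)."""
    at, gc = base_counts(window)
    return at >= 29 or gc >= 29


print(masked_100bp("A" * 88 + "G" * 12))  # 88% AT
print(masked_30bp("AT" * 14 + "AG"))      # 29 of 30 bases are A/T
```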
  • 96. Annotation of Assembly: Gene Structure and Function §  Known Genes •  Mapped human and dog genes to panda assembly •  ~20,000 genes §  Novel Gene Prediction •  Genscan •  Augustus •  Required at least 3 exons, and at least 30% of the translated sequence should align to SwissProt §  Gene Function •  Predict Domains (InterPro) •  Functional Gene Ontology •  ncRNAs (e.g., tRNAs, etc.) using INFERNAL •  Pathways using KEGG 96
  • 97. Assessment of Assembly: Coverage and Annotation 97
  • 98. Choosing a de novo Assembler
      §  Assemblathon 1
          •  Genome Res. 2011 21: 2224-2241
      §  Genome Assembly Gold-standard Evaluations (GAGE)
          •  Genome Res. 2012 22: 557-567
          •  http://gage.cbcb.umd.edu/results/index.html
  • 99. Assemblathon 1
      •  BROAD (ALLPATHS-LG) and BGI (SOAPdenovo) performed best overall.
  • 100. GAGE §  Multiple genomes •  Human chr14 (88 Mb) •  S. aureus (2.9 Mb) •  R. sphaeroides (4.6 Mb) •  B. impatiens (~250 Mb) §  Used Quake and ALLPATHS-LG for error correction for all datasets prior to assembly (chose the best for final report) §  Compared assembly to known reference to determine how many errors, etc. 100
  • 101. GAGE Results 101 •  Corrected N50 is most instructive
  • 102. GAGE Results
      [Figure: per-assembler metrics, with the ideal value for each]
      •  Unnecessary Duplication/Compression (Goal: 100%)
      •  Small contigs (Goal: 0%)
      •  Reference Bases Missing (Goal: 0%)
      •  Sequence not in Reference (Goal: 0%)
      •  Human more difficult
  • 104. GAGE Results 104 Misjoins in the assembly are visible in dot-plot graphs
  • 105. GAGE Summary 105 •  N50 is average of the three genomes with a known reference •  Vertical axis is distance between errors •  “Best” is top right area of graph
  • 106. GAGE Conclusions •  “ALLPATHS-LG demonstrated consistently strong performance based on contig and scaffold size, with the best trade-off between size and error rate” •  “Considering all metrics, and with the caveat that it requires a precise recipe of input libraries, ALLPATHS-LG appears to be the most consistently performing assembler, both in terms of contiguity and correctness.” •  “SOAPdenovo produced results that initially seemed superior to most assemblers, but on closer inspection it generated many misassemblies that would be impossible to detect without access to a reference genome.” •  “Despite its poor performance on human, SOAPdenovo performed very well on the bacteria, creating contigs that were eight times larger than it built on the human data.” •  “Velvet had a particularly high error rate for its scaffolds, creating many more inversions and translocations than any other algorithm.” •  “Finally, we should note that all of the assemblers considered here are under constant development, and many will be improved by the time this analysis appears.” 106
  • 107. Some Strategies for Refining an Assembly
      §  Deeper coverage: the shorter the reads, the deeper the coverage needed to produce long contigs
      §  Mix of short and long read sizes
      §  Combinatorial approach
          •  e.g., assemble short reads with de Bruijn (e.g., Velvet), then treat the contigs as long reads in an OLC assembler (e.g., CABOG)
      §  Comparative assembly (using a reference sequence to assist)
      §  Libraries with a variety of insert sizes and mate pair libraries to scaffold contigs into supercontigs
      Schatz M C et al. Genome Res. 2010;20:1165-1173
  • 108. Thank You Questions or Comments please contact: andrew.oler@nih.gov ScienceApps@niaid.nih.gov 108
  • 109. Bowtie Command-line
      §  Usage:
          bowtie [options]* <ebwt> {-1 <m1> -2 <m2> | --12 <r> | <s>} [<hit>]
          •  <ebwt>: index name (genome)
          •  -1 <m1> -2 <m2>: paired-end reads, “OR”
          •  --12 <r>: tab-delimited reads (uncommon), “OR”
          •  <s>: single-end reads
          •  [<hit>]: output file (optional)
      §  e.g., Paired-end
          bowtie hg19 -1 SRR027894_1.fastq -2 SRR027894_2.fastq
      §  e.g., Single-end
          bowtie hg19 SRR036642.fastq
          bowtie hg19 SRR036642.fastq,SRR036643.fastq
      http://bowtie-bio.sourceforge.net/manual.shtml
  • 110. Bowtie Command-line Options
      §  To get options, type: /usr/local/bio_apps/bowtie/bowtie
      §  General:
          --solexa1.3-quals   Use for Illumina pipeline 1.3-1.7 quality scores (phred+64) (omit for Illumina 1.8)
          -p <int>            Number of threads/processors (default: 1)
      §  Alignment:
          -v <int>            Number of mismatches allowed in sequence
          OR
          -n <int>            Number of mismatches allowed in “seed” portion (first part of read) (default: 2)
          -l <int>            Length of seed (default: 28bp)
          -e <int>            Maximum sum of scores of all mismatched bases (default: 70)
      §  Reporting Reads:
          -k <int>            Number of alignments to report (default: 1)
          -a                  Report all alignments (disables -k; default: off)
          -m <int>            Skip read if more than this many alignments (default: no limit)
          -M <int>            Like -m but reports one random alignment instead of skipping (default: no limit)
          --best              Order in best-to-worst quality alignment (i.e., fewest mismatches first)
          --strata            Only consider those alignments with the fewest mismatches
      §  Output:
          -t                  Print out time at each step (to terminal)
          -S                  Output in SAM format
          --un <file>         Save unaligned reads to a file (give it a name)
          --max <file>        Save reads with more alignments than -m to a file (i.e. repeats; give it a name)
      http://bowtie-bio.sourceforge.net/manual.shtml
  • 111. Bowtie n mode versus v mode
      §  Read sequence:      CTCTGCACGTGTGGGTTCGAGTCCCACCTTCGTTTG
      §  Reference sequence: ATTGTGCTCTGCACGCGTGGGTTCGAATCCCACCTTCGTCGACCGTTT
      §  Quality:            FHHHHIGHHFHIFFFGHGCD/DBA>=@?A980/*-)
      §  Qualities at the mismatched positions: 37 + 14 + 9 + 8 = 68
      §  In v mode (e.g., -v 2 commonly used): REJECT (because >2 mismatches)
      §  In n mode (default -n 2 -e 70 -l 28): KEEP (because <=70)
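      The difference between the two modes can be made concrete with a simplified Python model. This is a sketch of the decision rule as described on the slides, not Bowtie's code: the function names and the toy read are invented, the read is assumed to be already placed against the reference without gaps, and the quality rounding that Bowtie applies by default is ignored.

```python
def mismatch_positions(read, ref):
    """0-based positions where the read differs from the aligned reference segment."""
    return [i for i, (a, b) in enumerate(zip(read, ref)) if a != b]


def passes_v_mode(read, ref, v=2):
    """-v: at most v mismatches over the whole read; qualities are not consulted."""
    return len(mismatch_positions(read, ref)) <= v


def passes_n_mode(read, ref, quals, n=2, seed_len=28, e=70):
    """-n: at most n mismatches in the seed (first seed_len bases), and the summed
    quality at all mismatched positions must not exceed e."""
    mismatches = mismatch_positions(read, ref)
    seed_mismatches = sum(1 for i in mismatches if i < seed_len)
    quality_sum = sum(quals[i] for i in mismatches)
    return seed_mismatches <= n and quality_sum <= e


# Toy 10 bp read with three mismatches; two fall at low-quality 3' positions.
read = "ACGTACGTAC"
ref = "ACCTACGTTG"
quals = [40, 40, 30, 40, 40, 40, 40, 40, 10, 10]

print(passes_v_mode(read, ref, v=2))
print(passes_n_mode(read, ref, quals, n=2, seed_len=4, e=70))
```

      The toy read fails the -v 2 test because it has three mismatches, yet passes the n-mode test because only one mismatch lies in the seed and the mismatched bases have a low combined quality, mirroring the REJECT/KEEP outcome on the slide.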
  • 112. Example Bowtie Commands
      §  These are some things you could add to a script
          #Default alignment settings (plus threaded and SAM output):
          bowtie -p 2 -t -S bowtie_hg19/genome SRR036642.fastq out.sam
          #Unique alignments:
          bowtie -p 2 -t -S -m 1 -a --best --strata bowtie_hg19/genome SRR036642.fastq out.sam
          bowtie -p 2 -t -S -m 1 -a bowtie_hg19/genome SRR036642.fastq out.sam
          #Allowing up to 10 repeats (for gene families):
          bowtie -p 2 -t -S -m 10 -a --best --strata bowtie_hg19/genome SRR036642.fastq out.sam
          bowtie -p 2 -t -S -m 10 -a bowtie_hg19/genome SRR036642.fastq out.sam
          bowtie -p 2 -t -S -k 10 --best --strata bowtie_hg19/genome SRR036642.fastq out.sam
          #Input a gzipped file to bowtie (- means stdin)
          gunzip -c SRR036642.fastq.gz | bowtie -p 2 -t bowtie_hg19/genome -
  • 113. Effects of Various Options on Bowtie Output
      Total reads in test dataset: 190293

      alignment settings             time (s)  reads aligned     reads not aligned*  reads suppressed by -m (repeat)  # alignments reported  reads/alignments ratio
      default                        3         164801 (86.60%)   25492 (13.40%)      0                                164801                 1
      unique: m1, a, best, strata    4         132168 (69.45%)   25379 (13.34%)      32746 (17.21%)                   132168                 1
      unique: m1, a                  11        120069 (63.10%)   25492 (13.40%)      44732 (23.51%)                   120069                 1
      max10: m10, a, best, strata    5         147459 (77.49%)   25379 (13.34%)      17455 (9.17%)                    191860                 0.768
      max10: m10, a                  14        135517 (71.21%)   25492 (13.40%)      29284 (15.39%)                   180796                 0.750
      max10: k10, best, strata       5         164914 (86.66%)   25379 (13.34%)      0                                366410                 0.450

      * Playing with -l, -n, -e settings could decrease this number, but you will still have some not aligned
  • 114. Let’s Run Bowtie
      §  Exercise 1: Get today’s alignment dataset
          cp -r /scratch/aln ~
          ls ~/aln
      §  Exercise 2:
          cd ~/aln
          *Hint: You can use nano (or other text editor) to change the email in the test_bowtie.sh script to your email address.
          qsub test_bowtie.sh
          **qsub output will tell you your jobID (needed for the next step)
          qstat -u $LOGNAME (to check status of job occasionally)
          cat test_bowtie.sh (to look at the script that we submitted)
          *Hint: Use “genome” as the genome name in commands instead of hg19 because of the index basename in the folder
          e.g., bowtie bowtie_hg19/genome SRR036642.fastq
          *Hint: To learn PBS syntax for submitting jobs on Biowulf, see their website: http://biowulf/user_guide.html
  • 115. Bowtie Output Stats
      §  Exercise 2, continued:
          cat bowtie_test.oXXXXXX (substitute XXXXXX for jobID)
      §  Unique alignments
          Seeded quality full-index search: 00:00:09
          # reads processed: 190293
          # reads with at least one reported alignment: 132168 (69.45%)
          # reads that failed to align: 25379 (13.34%)
          # reads with alignments suppressed due to -m: 32746 (17.21%)
          Reported 132168 alignments to 1 output stream(s)
          Overall time: 00:00:51
      §  Unique + repeat alignments max 10
          Seeded quality full-index search: 00:00:10
          # reads processed: 190293
          # reads with at least one reported alignment: 147459 (77.49%)
          # reads that failed to align: 25379 (13.34%)
          # reads with alignments suppressed due to -m: 17455 (9.17%)
          Reported 191860 alignments to 1 output stream(s)
  • 116. TopHat Command Line
      §  Usage:
          tophat [options]* <index_base> <reads>
          •  <index_base>: index name (genome)
          •  <reads>: Fastq file
      §  See “TopHat” Section of Exercise Handout
      §  Copy/Paste these commands, waiting for each to finish before going to the next (should take ~1-2 minutes altogether):
          cd ~/rnaseq_upenn
          tophat -o wbc -p 2 -G hg19_chr6_refFlat_noRandomHapUn.gtf index/chr6 wbc_aln.fastq.gz
          tophat -o lymph -p 2 -G hg19_chr6_refFlat_noRandomHapUn.gtf index/chr6 lymph_aln.fastq.gz
      http://tophat.cbcb.umd.edu/manual.html
  • 117. Using SAM Tools to Get Sorted BAM File
      §  Convert sam to bam (to compress) or bam to sam
          samtools view [options] <in.bam or in.sam>
          Options:
          -b   Output is BAM
          -S   Input is SAM
          -h   Include header if output is SAM
          -o   Output file (default: stdout)
          e.g., samtools view -bS -o SRR062634.bam SRR062634.sam
                samtools view -h SRR062634.bam | head -n 100
      §  Sort bam file
          samtools sort [options] <in.bam> <out.prefix> (.bam extension will be added to the “prefix”)
          e.g., samtools sort SRR062634.bam SRR062634.sorted
      §  Index bam file
          samtools index <in.sorted.bam>
          e.g., samtools index SRR062634.sorted.bam
      http://samtools.sourceforge.net/samtools.shtml
  • 118. Get BAM Stats
      §  BAM Stats with SAMtools and Picard
          #Put samtools and other binaries on your PATH
          export PATH=/usr/local/bio_apps/R/bin/:/usr/local/bio_apps/java/bin/:/usr/local/bio_apps/samtools/:$PATH
          samtools idxstats <in.bam>
          #Outputs chr, length of chr, # mapped reads, # unmapped reads
          e.g., samtools idxstats SRR062634.sorted.bam
          samtools flagstat SRR062634.sorted.bam
          java -jar /usr/local/bio_apps/picard-tools/CollectMultipleMetrics.jar I=SRR062634.sorted.bam O=SRR062634.sorted
  • 119. Alignment QC Plots
      [Figure: Insert Size Histogram for All_Reads in file accepted_hits.bam. x-axis: Insert Size; y-axis: Count; series: FR]
      [Figure: accepted_hits.bam Quality By Cycle. x-axis: Cycle; y-axis: Mean Quality; series: Mean Quality, Mean Original Quality]
      [Figure: accepted_hits.bam Quality Score Distribution. x-axis: Quality Score; y-axis: Observations; series: Quality Scores, Original Quality Scores]