Workshop NGS data analysis - 2

Sequencing data analysis
Workshop – part 2 / mapping to a reference genome

Outline

Previously in this workshop…

Mapping to a reference genome – the steps

Mapping to a reference genome – the workshop

Maté Ongenaert

Introduction – the real cost of sequencing

The workflow of NGS data analysis
Data analysis

Raw machine reads… What’s next?

Preprocessing (machine/technology)
- adaptors, indexes, conversions,…
- machine/technology dependent

Reads with associated qualities (universal)
- FASTQ
- QC check

Depending on application (generally applicable)
- ‘de novo’ assembly of genome (bacterial genomes,…)
- Mapping to a reference genome → mapped reads
- SAM/BAM/…

High-level analysis (specific for application)
- SNP calling
- Peak calling

The workflow of NGS data analysis

Main data formats
Raw sequence reads:

- Represent the sequence ~ FASTA
>SEQUENCE_IDENTIFIER
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT

- Extension: represent the quality, per base ~ FASTQ – Q for quality
Score ~ Phred quality, stored as the ASCII character with code Phred + 33 (Sanger encoding)
@SEQUENCE_IDENTIFIER
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
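To make the Phred + 33 encoding concrete, here is a minimal Python sketch that decodes a quality line into integer scores and converts a score into an error probability. The function names are illustrative assumptions and not part of any tool mentioned in this workshop.

```python
def phred33_to_scores(quality_line):
    """Decode a FASTQ quality line (Sanger / Phred+33) into integer Phred scores."""
    return [ord(char) - 33 for char in quality_line]


def error_probability(phred_score):
    """Phred definition: Q = -10 * log10(P), so P = 10 ** (-Q / 10)."""
    return 10 ** (-phred_score / 10)


# First eight characters of the quality line shown above
scores = phred33_to_scores("!''*((((")
print(scores)                 # [0, 6, 6, 9, 7, 7, 7, 7]
print(error_probability(20))  # 0.01, i.e. 1 error in 100 base calls
```

A low score such as 6 or 7 at the start of the read means those base calls are unreliable, which is why mappers can take the qualities into account.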

- Machine- and platform-independent, compressed: SRA (NCBI)
Get the original FASTQ file using the SRA Toolkit (NCBI)

Main data formats
- Now moving to a common file format → SAM / BAM (Sequence Alignment/Map)
- BAM: binary (read: computer-readable, indexed, compressed) ‘form’ of SAM

DESCRIPTION OF THE 11 FIELDS IN THE ALIGNMENT SECTION

# QNAME: template name
# FLAG: bitwise flag
# RNAME: reference name
# POS: mapping position
# MAPQ: mapping quality
# CIGAR: CIGAR string
# RNEXT: reference name of the mate/next fragment
# PNEXT: position of the mate/next fragment
# TLEN: observed template length
# SEQ: fragment sequence
# QUAL: ASCII of Phred-scaled base quality + 33

#Headers
@HD VN:1.3 SO:coordinate
@SQ SN:ref LN:45

#Alignment block
r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
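As an illustration of how the 11 fields and the CIGAR string fit together, the following Python sketch parses one alignment line and derives the last reference position covered by the read. It is a simplified sketch: the names `SamRecord`, `parse_sam_line`, `reference_span` and `read_length` are assumptions made for this example, and real SAM files are tab-separated (the slide shows spaces).

```python
import re
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual tags",
)

# One CIGAR operation = a length followed by an operation letter
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_sam_line(line):
    """Split one alignment line into the 11 mandatory fields plus optional tags."""
    f = line.split()  # whitespace split is enough for the slide example
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:])


def reference_span(cigar):
    """Reference bases covered: M, D, N, = and X consume the reference."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MDN=X")


def read_length(cigar):
    """Bases present in SEQ: M, I, S, = and X consume the read."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MIS=X")


record = parse_sam_line("r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *")
last_pos = record.pos + reference_span(record.cigar) - 1
print(record.rname, record.pos, last_pos)             # ref 7 22
print(read_length(record.cigar) == len(record.seq))   # True
```

Note that hard-clipped bases (H) are absent from SEQ, which is why r003 (5H6M) carries only six bases.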

Main data formats
- BED files (location / annotation / scores): Browser Extensible Data
Used for mapping / annotation / peak locations – extension: bigBed (binary)
FIELDS USED:
# chr
# start
# end
# name
# score
# strand

track name=pairedReads description="Clone Paired Reads" useScore=1
#chr start end name score strand
chr22 1000 5000 cloneA 960 +
chr22 2000 6000 cloneB 900 -
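A short Python sketch of reading the six BED fields listed above. The helper names are assumptions for illustration; the one detail worth remembering is that BED start positions are 0-based and end positions are exclusive.

```python
def parse_bed_line(line):
    """Parse the first six BED columns: chr, start, end, name, score, strand."""
    chrom, start, end, name, score, strand = line.split()[:6]
    return {
        "chrom": chrom,
        "start": int(start),
        "end": int(end),
        "name": name,
        "score": int(score),
        "strand": strand,
    }


def bed_length(feature):
    # BED start is 0-based and end is exclusive, so the length is end - start
    return feature["end"] - feature["start"]


clone_a = parse_bed_line("chr22 1000 5000 cloneA 960 +")
print(clone_a["name"], bed_length(clone_a))  # cloneA 4000
```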

- BEDGraph files (location, combined with score)
Used to represent peak scores
track type=bedGraph name="BedGraph Format" description="BedGraph format"
visibility=full color=200,100,0 altColor=0,100,200 priority=20
#chr start end score
chr19 59302000 59302300 -1.0
chr19 59302300 59302600 -0.75
chr19 59302600 59302900 -0.50

Main data formats
- WIG files (location / annotation / scores): wiggle
Used to visualize or summarize data, in most cases count data or normalized count
data (RPKM) – extension: bigWig, the binary version (often used in GEO for ChIP-seq peaks)

browser position chr19:59304200-59310700
browser hide all

#150 base wide bar graph at arbitrarily spaced positions,
#threshold line drawn at y=11.76
#autoScale off viewing range set to [0:25]
#priority = 10 positions this as the first graph

track type=wiggle_0 name="variableStep" description="variableStep format"
visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255
yLineMark=11.76 yLineOnOff=on priority=10
variableStep chrom=chr19 span=150
59304701 10.0
59304901 12.5
59305401 15.0
59305601 17.5
59305901 20.0
59306081 17.5
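To show how the `variableStep` data lines relate to bedGraph-style intervals, here is a hedged Python sketch. It assumes only `variableStep` blocks (no `fixedStep`), and the function name is made up for this example. Wiggle positions are 1-based, whereas bedGraph uses 0-based, end-exclusive intervals.

```python
def variable_step_to_bedgraph(lines):
    """Convert variableStep wiggle lines into (chrom, start, end, value) tuples."""
    chrom, span, rows = None, 1, []
    for line in lines:
        parts = line.split()
        # Skip empty lines, comments and browser/track configuration lines
        if not parts or parts[0].startswith("#") or parts[0] in ("track", "browser"):
            continue
        if parts[0] == "variableStep":
            options = dict(item.split("=", 1) for item in parts[1:])
            chrom = options["chrom"]
            span = int(options.get("span", 1))  # span defaults to 1 base
        else:
            start = int(parts[0])  # 1-based position in the wiggle file
            rows.append((chrom, start - 1, start - 1 + span, float(parts[1])))
    return rows


wig = [
    "variableStep chrom=chr19 span=150",
    "59304701 10.0",
    "59304901 12.5",
]
for row in variable_step_to_bedgraph(wig):
    print(*row)
```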

Main data formats
- GFF format (General Feature Format) or GTF
Used for annotation of genetic / genomic features – such as all coding genes in Ensembl
Often used in downstream analysis to assign annotation to regions / peaks / …
FIELDS USED:

# seqname (the name of the sequence)
# source (the program that generated this feature)
# feature (the name of this type of feature – for example: exon)
# start (the starting position of the feature in the sequence)
# end (the ending position of the feature)
# score (a score between 0 and 1000)
# strand (valid entries include '+', '-', or '.')
# frame (if the feature is a coding exon, frame should be a number between
0-2 that represents the reading frame of the first base. If the feature is
not a coding exon, the value should be '.'.)
# group (all lines with the same group are linked together into a single
item)

track name=regulatory description="TeleGene(tm) Regulatory Regions"
#chr source feature start end score strand frame group
chr22 TeleGene enhancer 1000000 1001000 500 + . touch1
chr22 TeleGene promoter 1010000 1010100 900 + . touch1
chr22 TeleGene promoter 1020000 1020000 800 - . touch2
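The nine GFF fields can be read in the same way. This Python sketch uses assumed helper names; unlike BED, GFF coordinates are 1-based and include the end position.

```python
GFF_FIELDS = ["seqname", "source", "feature", "start", "end",
              "score", "strand", "frame", "group"]


def parse_gff_line(line):
    """Parse one GFF line into a dict keyed by the nine field names."""
    # At most 8 splits, so a group value containing spaces stays in one piece
    values = line.strip().split(None, 8)
    record = dict(zip(GFF_FIELDS, values))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


def gff_length(record):
    # GFF coordinates are 1-based and inclusive, hence the + 1
    return record["end"] - record["start"] + 1


enhancer = parse_gff_line("chr22 TeleGene enhancer 1000000 1001000 500 + . touch1")
print(enhancer["feature"], gff_length(enhancer))  # enhancer 1001
```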

Main data formats
- VCF format (Variant Call Format)
Represents called variants (SNPs, short indels)

Main data formats
- http://genome.ucsc.edu/FAQ/FAQformat.html

- UCSC Genome Browser data formats: covers the most commonly used and widely
accepted formats

- In addition, ENCODE data formats (narrowPeak / broadPeak)

Mapping to a reference genome
The workflow
Mapping:

Aligning the raw sequence reads to a reference genome by using an indexing strategy and
aligning algorithm, taking into account the quality scores and with specific conditions

- Raw sequence reads with quality scores: FASTQ
- Reference genome: FASTA files can be downloaded (UCSC/Ensembl)

- Sequence reads <> reference genome: alignment
- To perform an efficient alignment, an indexing strategy is used
- For instance (BWA/Bowtie): FM-indexes (based on the Burrows-Wheeler transform) on the
reference genome and/or the sequence reads

- Specific conditions: single-end or paired-end; how many mismatches allowed; trade-off
speed/accuracy/specificity; local re-alignment afterwards for improved indel calling; …

>> Result: mapped sequence reads: chr / start / end / quality >> SAM file (>> BAM)
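To give an idea of why an FM-index makes alignment efficient, here is a toy Python sketch of the Burrows-Wheeler transform and exact-match counting by backward search. It is a teaching sketch only, not how BWA or Bowtie are implemented: it sorts all rotations in memory, counts characters by scanning the string, handles exact matches only, and every function name is an assumption for this example.

```python
from collections import Counter


def bwt(text):
    """Burrows-Wheeler transform: last column of the sorted rotations of text + '$'."""
    text += "$"  # sentinel, sorts before A, C, G, T
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rotation[-1] for rotation in rotations)


def build_c_table(bwt_string):
    """For each character: how many characters in the text are strictly smaller."""
    counts = Counter(bwt_string)
    table, total = {}, 0
    for char in sorted(counts):
        table[char] = total
        total += counts[char]
    return table


def count_occurrences(bwt_string, c_table, pattern):
    """Backward search: count exact occurrences of pattern in the indexed text."""
    lo, hi = 0, len(bwt_string)  # current suffix-array interval [lo, hi)
    for char in reversed(pattern):
        if char not in c_table:
            return 0
        lo = c_table[char] + bwt_string.count(char, 0, lo)
        hi = c_table[char] + bwt_string.count(char, 0, hi)
        if lo >= hi:
            return 0
    return hi - lo


reference = "GATTACAGATTACA"
index = bwt(reference)
c_table = build_c_table(index)
print(count_occurrences(index, c_table, "ATTA"))  # 2
print(count_occurrences(index, c_table, "GG"))    # 0
```

The search cost depends on the read length rather than on the genome size, which is what makes it practical to index the reference once and then query millions of reads. The interval [lo, hi) that the search narrows down corresponds to the SA coordinates mentioned later for BWA.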

The workflow
The reference genome

- Sequences (human, rat, mouse, …) can be downloaded from UCSC (Golden Path) or
Ensembl
- Difficulty: the download is in 2bit format (needs a converter) >> FASTA files (.fa)
- Need to be indexed by the mapping program you are going to use

- BWA: bwa index
- Bowtie: bowtie-build (pre-computed indexes available)

- BWA example:

bwa index [-p prefix] [-a algoType] [-c] <in.db.fasta>
Index database sequences in the FASTA format.

OPTIONS:
-c Build color-space index. The input FASTA should be in nucleotide space.
-p STR Prefix of the output database [same as db filename]
-a STR Algorithm for constructing BWT index. Available options are:
is IS linear-time algorithm for constructing suffix array.
It requires 5.37N memory where N is the size of the database.
bwtsw Algorithm implemented in BWT-SW. This method works with the whole human genome

The workflow
The sequencing reads

- Sequence reads with quality scores: FASTQ files from the machine
- Depending on the mapping program, need to be indexed as well

- BWA: converts reads to SA coordinates (Suffix Array) based on the reference genome
index
- Bowtie: not needed: indexing and aligning in one step

- BWA:
- Index reference genome
- Index sequence reads (INPUT: FASTQ and REF. GENOME) >> SA coordinates (OUTPUT:
SAI)
- SA coordinates (INPUT: SAI/FASTQ and REF. GENOME) >> SAM/BAM (OUTPUT)

The workflow
aln bwa aln [-n][-o][-e][-d][-i][-k][-l][-t][-cRN][-M][-O][-E][-q]
<in.db.fasta> <in.query.fq> > <out.sai>

Find the SA coordinates of the input reads.
Maximum maxSeedDiff differences are allowed in the first seedLen subsequence and
maximum maxDiff differences are allowed in the whole sequence.

OPTIONS:
-n NUM Maximum edit distance if the value is INT, or the fraction of missing alignments given a 2% uniform base error rate if FLOAT
-o INT Maximum number of gap opens
-e INT Maximum number of gap extensions, -1 for k-difference mode
-d INT Disallow a long deletion within INT bp towards the 3’-end
-i INT Disallow an indel within INT bp towards the ends [5]
-l INT Take the first INT subsequence as seed.
-k INT Maximum edit distance in the seed
-t INT Number of threads (multi-threading mode)
-M INT Mismatch penalty
-O INT Gap open penalty
-E INT Gap extension penalty
-R INT Proceed with suboptimal alignments
-c Reverse query but not complement it
-N Disable iterative search.
-q INT Parameter for read trimming.
-I The input is in the Illumina 1.3+ read format (quality equals ASCII-64)
-B INT Length of barcode starting from the 5’-end.
-b Specify the input read sequence file is the BAM format.
-0 When -b is specified, only use single-end reads in mapping.
-1 When -b is specified, only use the first read in a read pair in mapping
-2 When -b is specified, only use the second read in a read pair in mapping

The workflow
samse bwa samse [-n maxOcc] <in.db.fasta> <in.sai> <in.fq> > <out.sam>
Generate alignments in the SAM format given single-end reads
Repetitive hits will be randomly chosen.

OPTIONS:
-n INT Maximum number of alignments to output in the XA tag for reads paired properly.
-r STR Specify the read group in a format like ‘@RG\tID:foo\tSM:bar’

sampe bwa sampe [-a] [-o] [-n] [-N] [-P] <in.db.fasta>
<in1.sai> <in2.sai> <in1.fq> <in2.fq> > <out.sam>
Generate alignments in the SAM format given paired-end reads.
Repetitive read pairs will be placed randomly.

OPTIONS:
-a INT Maximum insert size for a read pair to be considered being mapped properly.
-o INT Maximum occurrences of a read for pairing.
-P Load the entire FM-index into memory to reduce disk operations
-n INT Maximum number of alignments to output in the XA tag for reads paired properly
-N INT Maximum number of alignments to output in the XA tag for discordant read pairs
-r STR Specify the read group in a format like ‘@RG\tID:foo\tSM:bar’

The workshop
Mapping using BWA

bwa-0.5.9 aln -t 4 /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.fastq > SRR058523.sai

bwa-0.5.9: BWA and its version
aln: alignment functionality of BWA
-t 4: use 4 threads (CPU cores) at the same time to speed up
/opt/genomes/GRCh37/index/bwa/GRCh37: location of the reference genome index
SRR058523.fastq: FASTQ file to align to the reference
>: redirects the output to a file
SRR058523.sai: the output file (SA Index file)

Maps the input sequences (FASTQ) to the reference genome index → output: indexes of
the reads

This is not yet a ‘real genomic mapping’ (no chr / start / end), so a next step is needed…

The workshop
Mapping using BWA

bwa-0.5.9 samse /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.sai SRR058523.fastq |
samtools-0.1.18 view -bhSo PHF8-unsorted.bam -

bwa-0.5.9: BWA and its version
samse: single-end mapping and output in SAM format
/opt/genomes/GRCh37/index/bwa/GRCh37: location of the reference genome index
SRR058523.sai: the reads index
SRR058523.fastq: the raw reads and quality scores

On its own, this would output a SAM file (for instance with > SRR058523.sam)
But we don’t need the SAM file, we would like a BAM file → processing by samtools

|: the ‘pipe’ symbol, hands over the output of one command to the next
samtools-0.1.18: samtools and its version
view: the command to process SAM files
-b: output BAM; -h: print the headers; -S: input is SAM; -o: output file name
PHF8-unsorted.bam: output file name
-: read the input from standard input, i.e. the output of bwa samse passed through the pipe

The workshop
Mapping using BWA

bwa-0.5.9 aln -t 4 /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.fastq > SRR058523.sai

bwa-0.5.9 samse /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.sai SRR058523.fastq |
samtools-0.1.18 view -bhSo PHF8-unsorted.bam -

Two-step process in BWA

Next steps: process the BAM file → sort and index it (using samtools)

samtools-0.1.18 sort PHF8-unsorted.bam PHF8-sorted

Creates a sorted BAM file (PHF8-sorted.bam)
samtools-0.1.18 index PHF8-sorted.bam

Indexes the sorted BAM file (and creates a BAM index file – PHF8-sorted.bam.bai)
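The four commands of this workshop (aln, samse piped into view, sort, index) can be assembled for any sample with a small helper. The Python sketch below only builds the command strings, using exactly the options shown on these slides. The function name and its parameters are assumptions for illustration, and the `sort` syntax (output prefix as second argument) is that of the samtools 0.1.x series used here.

```python
import shlex


def build_bwa_commands(sample, index_prefix, out_prefix, threads=4,
                       bwa="bwa-0.5.9", samtools="samtools-0.1.18"):
    """Return the single-end BWA mapping pipeline as a list of shell command strings."""
    q = shlex.quote  # protects paths that contain spaces or special characters
    fastq, sai = f"{sample}.fastq", f"{sample}.sai"
    unsorted_bam = f"{out_prefix}-unsorted.bam"
    return [
        # Step 1: reads -> SA coordinates
        f"{bwa} aln -t {threads} {q(index_prefix)} {q(fastq)} > {q(sai)}",
        # Step 2: SA coordinates -> SAM, converted to BAM on the fly
        f"{bwa} samse {q(index_prefix)} {q(sai)} {q(fastq)} | "
        f"{samtools} view -bhSo {q(unsorted_bam)} -",
        # Step 3: sort by coordinate (second argument is the output prefix)
        f"{samtools} sort {q(unsorted_bam)} {q(out_prefix + '-sorted')}",
        # Step 4: index the sorted BAM file
        f"{samtools} index {q(out_prefix + '-sorted.bam')}",
    ]


for command in build_bwa_commands(
        "SRR058523", "/opt/genomes/GRCh37/index/bwa/GRCh37", "PHF8"):
    print(command)
```

To actually run the steps, each string could be passed to `subprocess.run(command, shell=True, check=True)`, which stops the pipeline as soon as one step fails.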

The workshop
BAM: what’s next?

So, now we have the sorted and indexed BAM file – what’s next?

This file is the starting point for all further analyses, depending on the application:

ChIP-seq: peak calling
SNP calling
RNA-seq: calculate gene-expression levels of the transcripts / find splice variants

What are the first things?
- Visualize it (IGV can load BAM files)
- First downstream analysis: QC and basic statistics (how many mapped reads, quality
distribution, distribution across chromosomes,…)

The workshop
First downstream analysis

- QC and basic statistics (how many mapped reads, quality distribution, distribution
across chromosomes, information on paired-end reads,…)

Samstat
/opt/samstat/samstat PHF8-sorted.bam

- Outputs a HTML file with statistics

The workshop


BamUtil (stats)

bam stats --in PHF8-sorted.bam --basic --phred --baseSum

Number of records read = 15732744

TotalReads(e6) 15.73
MappedReads(e6) 15.04
PairedReads(e6) 15.73
ProperPair(e6) 14.65
DuplicateReads(e6) 0.00
QCFailureReads(e6) 0.00

MappingRate(%) 95.59
PairedReads(%) 100.00
ProperPair(%) 93.11
DupRate(%) 0.00
QCFailRate(%) 0.00

TotalBases(e6) 802.37
BasesInMappedReads(e6) 766.95

Quality Count
33 0
34 0
35 71373
36 0
37 0
38 203544
39 403649
40 921714
41 2081099
42 1974615
43 2285826

The workshop


Samtools
samtools-0.1.18 idxstats PHF8-sorted.bam

1 249250621 503714 0
2 243199373 345217 0
3 198022430 273477 0
4 191154276 229016 0
5 180915260 360339 0
6 171115067 257468 0
7 159138663 269704 0
8 146364022 242656 0
9 141213431 203505 0
10 135534747 237496 0
11 135006516 218116 0
12 133851895 231426 0
13 115169878 106831 0
14 107349540 119062 0
15 102531392 141351 0
16 90354753 183004 0
17 81195210 187024 0
18 78077248 86101 0
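The four columns of the idxstats output are: reference name, reference length, number of mapped reads and number of unmapped reads. As a sketch of a first sanity check, the following Python code normalizes the mapped reads by chromosome length, so that chromosomes of different sizes can be compared. The function names are assumptions for this example.

```python
def parse_idxstats(text):
    """Parse idxstats output into (name, length, mapped, unmapped) tuples."""
    rows = []
    for line in text.strip().splitlines():
        name, length, mapped, unmapped = line.split()
        rows.append((name, int(length), int(mapped), int(unmapped)))
    return rows


def reads_per_megabase(rows):
    """Mapped reads per million reference bases, per chromosome."""
    # Entries with length 0 (such as the '*' line for unplaced reads) are skipped
    return {name: mapped / length * 1e6
            for name, length, mapped, _ in rows if length > 0}


# First three lines of the output shown above
example = """
1 249250621 503714 0
2 243199373 345217 0
3 198022430 273477 0
"""
rows = parse_idxstats(example)
print(sum(mapped for _, _, mapped, _ in rows))  # 1122408
for name, density in reads_per_megabase(rows).items():
    print(name, round(density, 1))
```

A chromosome whose density deviates strongly from the others can point to an enriched region, an amplification or a mapping artifact.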

The workshop

- Think about PCR duplicates → you may want to remove them (or set a ‘flag’ in the BAM
file, indicating it is a duplicate)
- Samtools rmdup or Picard MarkDuplicates

- Find out how these tools work and what other flags are used in BAM files
- Can you compute statistics from the BAM flags?
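As a starting point for the flag exercise: the FLAG field is a sum of bits, each with its own meaning. The Python sketch below decodes a flag value and tallies flags over a list of reads. The bit values follow the SAM specification; the short labels and function names are assumptions chosen for this example.

```python
from collections import Counter

# Bit values from the SAM specification; the labels are informal
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
}


def decode_flag(flag):
    """List the labels of all bits that are set in a FLAG value."""
    return [label for bit, label in SAM_FLAGS.items() if flag & bit]


def flag_statistics(flags):
    """Count how many reads have each bit set."""
    stats = Counter()
    for flag in flags:
        stats.update(decode_flag(flag))
    return stats


print(decode_flag(163))
# ['paired', 'proper_pair', 'mate_reverse_strand', 'second_in_pair']
print(flag_statistics([163, 83, 0, 1024])["duplicate"])  # 1
```

Flag 163 is the value of read r001 in the SAM example earlier; a read marked as a duplicate by a tool such as Picard MarkDuplicates has bit 0x400 (1024) set.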

The workshop
Mapping – now let’s start!

- Mapping is only the starting point for most downstream analysis tools
- Depends on the application and what you want to do:

- Exome sequencing / whole genome sequencing: SNP calling (samtools): based on
mapping quality / coverage → identification of SNPs (VCF output format)

- ChIP-seq: peak calling: based on coverage of ChIP and input, enriched regions are
identified (BED output, BEDgraph and/or WIG files)

- RNA-seq: assign reads to the transcripts, normalize (length of exon and number of
reads in the sequencing library = RPKM) → (relative) expression levels →
identification of differentially expressed genes
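The RPKM normalization mentioned for RNA-seq (Reads Per Kilobase of exon per Million mapped reads) can be written out as a one-line formula. A minimal Python sketch, with an assumed function name and made-up example numbers:

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads Per Kilobase of feature per Million mapped reads."""
    # reads / (length in kb) / (library size in millions)
    #   = reads * 1e9 / (length in bp * total mapped reads)
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical example: 100 reads on a 2 kb transcript, 10 million mapped reads
print(rpkm(100, 2000, 10_000_000))  # 5.0
```

Dividing by the feature length corrects for long transcripts collecting more reads, and dividing by the library size makes samples sequenced at different depths comparable.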
