ChIP-seq - Data processing

ChIP - Data processing
Sebastian Schmeier
s.schmeier@gmail.com
http://sschmeier.github.io/bioinf-workshop/
2015

Common analyses overview
Figure: from a biological sample, DNA sequencing and RNA sequencing lead to the common analyses: gene regulation and chromatin structure, genome variation, gene expression, genome assembly, transcriptome assembly, splice variant detection, metabarcoding.

Gene regulation, chromatin structure
• How do we analyse it?
• Mapping reads to a reference genome
• Calling peaks

Mapping reads
Figure: Chromatin immunoprecipitation (ChIP)
http://www.nature.com/nrg/journal/v11/n7/full/nrg2795.html

Mapping reads
• Challenges
• Approximate String Matching Problem
• Burrows-Wheeler transform
• Bowtie

Challenges of mapping short reads
• If the reference genome is very large and we have billions of
reads, how quickly can we align the reads to the genome?
• Mapping billions of sequences to a mammalian-sized genome
calls for extraordinarily efficient algorithms in which every bit
of memory is used optimally or near optimally.

Challenges of mapping short reads
• If a read comes from a repetitive element in the reference, a
program must pick which copy of the repeat the read belongs
to
• The program may choose to report multiple possible
locations or to pick a location heuristically
• Sequencing errors or variations between the sequenced
chromosomes and the reference genome exacerbate this
problem, because the alignment between the read and its
true source in the genome may actually have more
differences than the alignment between the read and some
other copy of the repeat

Choice?
• Intelligently make tradeoffs in
• Speed
• Memory utilisation
• Accuracy
• Ease of use
• Adoption and maintenance
• Understanding of the fundamental methods
https://www.ebi.ac.uk/~nf/hts_mappers/

Mapping algorithms
• One could find the true locations using exact matching,
assuming:
• a genome had no repeats and a sequencing experiment
introduced no errors
• a sufficient read length relative to the genome size
• These assumptions do NOT hold

Mapping algorithms
• Searching for occurrences of the read sequence within the
reference sequence but allowing for some mismatches and
gaps between the two
• Standard algorithm: dynamic programming
• Too slow
• Too much memory required
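To make the cost of the standard approach concrete, here is a minimal sketch of approximate matching by dynamic programming. It assumes unit costs for mismatches and gaps, and the function name best_approximate_match is hypothetical. Every cell of a read-by-reference table is filled, which is why the method does not scale to billions of reads against a large genome.

```python
def best_approximate_match(read, reference):
    """Semi-global alignment: the read may start anywhere in the reference.

    Returns (edit_distance, end_offset) for the best match, where
    end_offset is the exclusive end position in the reference.
    Illustrative only: unit costs are an assumption.
    """
    # prev[j] = fewest edits aligning the read prefix processed so far
    # against some substring of the reference that ends at position j.
    prev = [0] * (len(reference) + 1)  # an empty read matches anywhere for free
    for i, read_char in enumerate(read, start=1):
        curr = [i] + [0] * len(reference)  # i read characters against nothing
        for j, ref_char in enumerate(reference, start=1):
            curr[j] = min(
                prev[j - 1] + (read_char != ref_char),  # match or mismatch
                prev[j] + 1,      # read character missing from the reference
                curr[j - 1] + 1,  # reference character missing from the read
            )
        prev = curr
    best_end = min(range(len(prev)), key=prev.__getitem__)
    return prev[best_end], best_end


if __name__ == "__main__":
    print(best_approximate_match("aba", "abaaba$"))
    print(best_approximate_match("abb", "abaaba$"))
```

The work grows with len(read) × len(reference) per read, which motivates the filtering and indexing ideas.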

Mapping algorithms
• Two main ideas for addressing large input sizes (in # of reads and size of the reference):
• filtering
• quickly exclude large regions of the reference where no approximate match can be
found
• indexing
• Preprocessing the reference sequence and/or the set of reads to establish string indices
• The benefit of string indices is that a query typically does not need to scan
the whole reference, so queries run much faster, at the expense
of larger memory consumption.
• The string indices that are currently used are:
• Suffix array
• Enhanced suffix array
• FM-index (Full-text index in Minute space) + Burrows-Wheeler transform
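As a small illustration of indexing, the sketch below preprocesses a text into a suffix array and then answers exact queries by binary search instead of scanning the text. The function names are hypothetical, and the naive construction (sorting whole suffixes) is only suitable for toy inputs; real tools use far more memory-efficient constructions.

```python
def build_suffix_array(text):
    """Start offsets of all suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_exact(pattern, text, suffix_array):
    """Offsets where pattern occurs, found by binary search on the index."""
    def prefix(row):
        start = suffix_array[row]
        return text[start:start + len(pattern)]

    # Lower bound: first row whose prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first_row = lo

    # Upper bound: first row whose prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first_row:lo])


if __name__ == "__main__":
    reference = "abaaba$"
    index = build_suffix_array(reference)
    print(index)
    print(find_exact("aba", reference, index))
```

Each query inspects only about log2(len(text)) rows of the index, which is the speed-for-memory trade described above.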

Burrows-Wheeler transform (BWT)
Creation
• Write down all rotations of the string
• Sort the matrix of rotations lexicographically
• The last column is BWT(T)
• The rows in the matrix are essentially
the sorted suffixes of the text
• SA(T) gives, for each row, the start offset of that suffix in the
original string
Example: T = abaaba$, BWT(T) = abba$aa, SA(T) = 6 5 2 3 0 4 1
Introduction to the Burrows-Wheeler Transform and FM Index. Ben Langmead, Department of Computer Science, JHU
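The creation steps can be written down almost literally. This is a minimal sketch for toy strings, assuming the text ends in a unique terminator "$" that sorts before every other character; the function name bwt_and_suffix_array is hypothetical.

```python
def bwt_and_suffix_array(text):
    """Return (BWT(T), SA(T)) by sorting all rotations of text.

    Assumes text ends with a unique '$' that sorts before all other
    characters. Quadratic memory, so for illustration only.
    """
    # Sort the start offsets of the rotations by the rotation they produce.
    order = sorted(range(len(text)), key=lambda i: text[i:] + text[:i])
    # The last character of the rotation starting at i is text[i - 1]
    # (for i == 0 this wraps around to the final '$').
    bwt = "".join(text[i - 1] for i in order)
    return bwt, order


if __name__ == "__main__":
    bwt, sa = bwt_and_suffix_array("abaaba$")
    print(bwt)  # abba$aa
    print(sa)   # [6, 5, 2, 3, 0, 4, 1]
```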

LF mapping
• We rank each character by how many
times the same character occurred
previously in BWT(T)
• We keep an array SA(T) with the position of each
rotation in the original string
• We keep an index of occurrences
starting at zero
Example: T = abaaba$, BWT(T) = abba$aa, SA(T) = 6 5 2 3 0 4 1; occurrence ranks shown in the figure: 0 1 2 3 (a) and 0 1 (b)
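One way to see what the ranks are for: the i-th occurrence of a character in the last column (BWT) is the same text character as its i-th occurrence in the first column. The sketch below uses that property to walk backwards through the text and recover T from BWT(T). Function names are hypothetical, and the input is assumed to contain exactly one "$".

```python
def rank_bwt(bwt):
    """For each BWT position, count earlier occurrences of the same character."""
    counts = {}
    ranks = []
    for ch in bwt:
        ranks.append(counts.get(ch, 0))
        counts[ch] = counts.get(ch, 0) + 1
    return ranks, counts


def first_column_starts(counts):
    """Row at which each character's block begins in the sorted first column."""
    starts = {}
    total = 0
    for ch in sorted(counts):
        starts[ch] = total
        total += counts[ch]
    return starts


def invert_bwt(bwt):
    """Rebuild the original text by repeatedly applying the LF mapping."""
    ranks, counts = rank_bwt(bwt)
    starts = first_column_starts(counts)
    row = 0          # row 0 is the rotation that starts with '$'
    text = "$"
    while bwt[row] != "$":
        ch = bwt[row]
        text = ch + text                 # ch precedes what we have so far
        row = starts[ch] + ranks[row]    # LF: jump to the row starting with ch
    return text


if __name__ == "__main__":
    print(rank_bwt("abba$aa")[0])  # [0, 0, 1, 1, 0, 2, 3]
    print(invert_bwt("abba$aa"))   # abaaba$
```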

Exact matching
Figure (animation over six slides): search for P = aba in the BWT matrix of T = abaaba$ (BWT(T) = abba$aa, SA(T) = 6 5 2 3 0 4 1), using the occurrence ranks to narrow the range of matching rows one character at a time.
• Each matching step takes constant time, so the search time depends on the length of P, not on the length of T
• Result: P = aba occurs in T = abaaba$ at offsets 0 and 3
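The search in the figure can be sketched as a backward search over an FM-index-like structure. This is an uncompressed toy version: the names build_fm_index and find_offsets are hypothetical, the occurrence table is stored in full (a real FM index samples it to save memory), and the text is assumed to end in a unique "$".

```python
def build_fm_index(text):
    """Toy index: suffix array, first-column block starts, occurrence counts."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    # first[c]: first row whose suffix starts with character c.
    first = {}
    for row, start in enumerate(sa):
        first.setdefault(text[start], row)
    # occ[c][k]: number of occurrences of c in bwt[:k].
    occ = {c: [0] for c in first}
    for ch in bwt:
        for c in occ:
            occ[c].append(occ[c][-1] + (c == ch))
    return sa, first, occ


def find_offsets(pattern, sa, first, occ):
    """Offsets of exact matches, processing the pattern right to left."""
    top, bottom = 0, len(sa)  # half-open range of candidate rows
    for ch in reversed(pattern):
        if ch not in first:
            return []
        # One LF step per pattern character narrows the row range.
        top = first[ch] + occ[ch][top]
        bottom = first[ch] + occ[ch][bottom]
        if top >= bottom:
            return []
    return sorted(sa[top:bottom])


if __name__ == "__main__":
    index = build_fm_index("abaaba$")
    print(find_offsets("aba", *index))  # [0, 3]
```

The loop runs once per pattern character and never scans the text, which is why the lookup cost is independent of the reference length.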

Bowtie
• The FM index finds exact sequence matches quickly and in little
memory, but short-read alignment demands more:
• Allowances for mismatches
• Consideration of quality values

Bowtie
• Bowtie’s solution: backtracking quality-aware search
• If a particular base is not found in the index while traversing
the matrix, backtrack, try another base chosen according to
quality, and continue with the search string
Figure: backtracking search illustrated on the BWT matrix of T = abaaba$ (SA(T) = 6 5 2 3 0 4 1)
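The idea of a quality-aware backtracking search can be sketched on the same toy index. This is a simplified illustration, not Bowtie's actual algorithm: it assumes that substituting a base costs that base's quality value and that a hit is accepted while the summed cost stays within a hypothetical max_penalty budget, so low-quality positions are the cheapest to change. All function and parameter names are made up for the example.

```python
def build_fm_index(text):
    """Toy index: suffix array, first-column block starts, occurrence counts."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)
    first = {}
    for row, start in enumerate(sa):
        first.setdefault(text[start], row)
    occ = {c: [0] for c in first}
    for ch in bwt:
        for c in occ:
            occ[c].append(occ[c][-1] + (c == ch))
    return sa, first, occ


def search_with_backtracking(read, qualities, index, max_penalty):
    """Return sorted (offset, penalty) hits with penalty <= max_penalty."""
    sa, first, occ = index
    hits = {}

    def extend(pos, top, bottom, penalty):
        if pos < 0:  # whole read consumed: every row in the range is a hit
            for offset in sa[top:bottom]:
                hits[offset] = min(penalty, hits.get(offset, penalty))
            return
        # Try the base that was read first, then the alternatives.
        alternatives = [c for c in first if c not in (read[pos], "$")]
        for base in [read[pos]] + alternatives:
            cost = 0 if base == read[pos] else qualities[pos]
            if base not in first or penalty + cost > max_penalty:
                continue
            new_top = first[base] + occ[base][top]
            new_bottom = first[base] + occ[base][bottom]
            if new_top < new_bottom:  # otherwise dead end: backtrack
                extend(pos - 1, new_top, new_bottom, penalty + cost)

    extend(len(read) - 1, 0, len(sa), 0)
    return sorted(hits.items())


if __name__ == "__main__":
    index = build_fm_index("abaaba$")
    # The last base has low quality, so replacing it is affordable.
    print(search_with_backtracking("abb", [30, 30, 5], index, max_penalty=10))
```

With a budget of 0 the function reduces to exact matching; raising the budget lets the search recover from a dead end by substituting a low-quality base.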

Burrows-Wheeler transform at genome scale
• Some clever tricks are involved to
achieve more compression of the
data structures (FM-Index*)
• Use the BWT of the reference
genome to build the index
• Look up each read
• Convert to genome locations
How to map billions of short reads onto genomes. Trapnell & Salzberg. Nature Biotechnology 2009
*Ferragina & Manzini (2000). Opportunistic Data Structures with Applications. Proc. of the 41st Annual Symposium on Foundations of Computer Science

Peak calling
Figure: Chromatin immunoprecipitation (ChIP)
http://www.nature.com/nrg/journal/v11/n7/full/nrg2795.html

Peak calling
• ChIP profile
• Challenges
• MACS

ChIP profile
• Only the 5’ ends of ChIPed fragments
are sequenced
• Shifted read distribution
• Expected symmetry between
Watson/Crick read distributions
http://www.nature.com/nrg/journal/v10/n10/abs/nrg2641.html

Peak calling challenges
• Adjust for sequence mappability: regions that contain repetitive elements
have a different expected tag count
http://www.nature.com/nbt/journal/v27/n1/full/nbt.1518.html

• Different ChIP-seq applications
produce different types of peaks.
• Most current tools have been
designed to detect sharp peaks
(TF binding, histone modifications
at regulatory elements)
Computation for ChIP-seq and RNA-seq studies. Pepke et al. Nat. Methods 2009

• Definition of enriched regions/peaks:
• Which statistic to use?
• What boundaries should be
reported?
• What score to use
(ratio, p-value, q-value)?
• Compute/estimate an FDR?
http://www.nature.com/nrg/journal/v10/n10/abs/nrg2641.html

MACS
Step 1: Modelling the tag shift
1. Scan the genome with a window of
user-defined sonication size
2. Keep the best 1000 (or fewer) peaks
with a fold enrichment > mfold (default
32, relative to a random model)
3. Separate Watson/Crick tags
4. The shift size is modelled as the
distance d between the modes of
the Watson and Crick peaks
Model-based Analysis of ChIP-Seq (MACS). Zhang et al. Genome Biology 2008
http://www.biologie.ens.fr/~mthomas/other/chip-seq-training/booklet/booklet_chip-seq.pdf
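Steps 3 and 4 can be illustrated for a single peak. This is a rough sketch, not the MACS implementation: it assumes the tags around one strong peak are already given as lists of 5' end coordinates per strand, takes the most frequent coordinate as the mode, and uses a hypothetical function name. MACS itself builds the model from many high-quality peaks.

```python
from collections import Counter


def estimate_shift_distance(watson_starts, crick_starts):
    """Distance d between the modes of the Watson and Crick tag positions.

    Both arguments are lists of 5' end coordinates of tags near one peak
    (an assumption of this toy example).
    """
    watson_mode = Counter(watson_starts).most_common(1)[0][0]
    crick_mode = Counter(crick_starts).most_common(1)[0][0]
    return crick_mode - watson_mode


if __name__ == "__main__":
    # Made-up coordinates: Watson tags pile up left of the binding site,
    # Crick tags to the right of it.
    watson = [100, 101, 101, 102]
    crick = [199, 201, 201, 203]
    print(estimate_shift_distance(watson, crick))  # 100
```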

MACS
Step 2: Peak detection
1. Shift every tag by d/2
2. Slide a 2d window across the
genome to find candidate peaks with significant tag
enrichment (according to a Poisson distribution, default
p-value cutoff = 10^-5)
3. Merge overlapping peaks
4. Report:
• Fold enrichment for called peaks: ratio between the observed tag
count and the count expected under the Poisson distribution (using
input data if provided)
• The position with the highest pile-up is defined as the summit of
the peak
• Empirical FDR if a control sample is provided (sample swap),
FDR = #control peaks / #ChIP peaks
http://www.biologie.ens.fr/~mthomas/other/chip-seq-training/booklet/booklet_chip-seq.pdf
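The arithmetic behind the detection step can be sketched with the standard library alone. This is an illustration under simplifying assumptions, not MACS code: function names are hypothetical, Watson tags are shifted by +d/2 and Crick tags by -d/2 (towards the fragment centre), and the expected tag count per window is taken as given rather than estimated from the background or control.

```python
import math


def shift_tags(watson_starts, crick_starts, d):
    """Move every tag d/2 towards the presumed centre of its fragment."""
    half = d // 2
    return sorted([p + half for p in watson_starts] +
                  [p - half for p in crick_starts])


def poisson_upper_tail(count, expected):
    """P(X >= count) for X ~ Poisson(expected).

    Computed as 1 minus the lower tail, so precision is limited for
    extremely small p-values; adequate for a cutoff such as 1e-5.
    """
    term = math.exp(-expected)  # P(X = 0)
    below = 0.0
    for i in range(count):
        below += term
        term *= expected / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - below)


def is_candidate_peak(count, expected, p_cutoff=1e-5):
    """True if a window with this tag count is significantly enriched."""
    return poisson_upper_tail(count, expected) <= p_cutoff


def empirical_fdr(n_control_peaks, n_chip_peaks):
    """Sample-swap FDR: peaks called in the control divided by ChIP peaks."""
    return n_control_peaks / n_chip_peaks


if __name__ == "__main__":
    print(shift_tags([100], [200], d=100))        # [150, 150]
    print(is_candidate_peak(20, expected=2.0))    # True
    print(is_candidate_peak(3, expected=2.0))     # False
    print(empirical_fdr(5, 100))                  # 0.05
```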

Again lots of choice
Computation for ChIP-seq and RNA-seq studies. Pepke et al. Nat. Methods 2009
and more…

Visualise to assess quality
• Assess the data quality, e.g. positive controls, background
• Determine cutoffs (looking at positive controls)
• Compare different peak finder outputs
• Integration of data / co-visualization
Identifying ChIP-seq enrichment using MACS. Feng et al. Nat. Protocols, 2012

References
Introduction to the Burrows-Wheeler Transform and FM Index. Ben Langmead, Department of Computer Science, JHU
How to map billions of short reads onto genomes. Trapnell & Salzberg. Nature Biotechnology 2009
Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data. Bailey et al. PLoS Comp. Bio. 2013
Computation for ChIP-seq and RNA-seq studies. Pepke et al. Nat. Methods 2009