ChIP - Data processing
Sebastian Schmeier
s.schmeier@gmail.com
http://sschmeier.github.io/bioinf-workshop/
2015
Common analyses overview
[Figure: a biological sample is processed by DNA sequencing or RNA sequencing. Analyses shown: gene regulation and chromatin structure; genome variation; gene expression; genome assembly; transcriptome assembly and splice variant detection; metabarcoding]
Gene regulation, chromatin structure
• How do we analyse it?
• Mapping reads to a reference genome
• Calling peaks
Mapping reads
Chromatin immunoprecipitation (ChIP)
http://www.nature.com/nrg/journal/v11/n7/full/nrg2795.html
Mapping reads
• Challenges
• Approximate String Matching Problem
• Burrows-Wheeler transform
• Bowtie
Challenges of mapping short reads
• If the reference genome is very large, and if we have billions of
reads, how quickly can we align the reads to the genome?
• The task of mapping billions of sequences to a mammalian-sized genome calls for extraordinarily efficient algorithms, in which every bit of memory is used optimally or near optimally.
Challenges of mapping short reads
• If a read comes from a repetitive element in the reference, a
program must pick which copy of the repeat the read belongs
to
• The program may choose to report multiple possible
locations or to pick a location heuristically
• Sequencing errors or variations between the sequenced
chromosomes and the reference genome exacerbate this
problem, because the alignment between the read and its
true source in the genome may actually have more
differences than the alignment between the read and some
other copy of the repeat
Choice?
• Intelligently make tradeoffs in
• Speed
• Memory utilisation
• Accuracy
• Ease of use
• Adoption and maintenance
• Understanding of the fundamental methods
https://www.ebi.ac.uk/~nf/hts_mappers/
Mapping algorithms
• One could find the true locations using exact matching, assuming:
• the genome had no repeats and the sequencing experiment introduced no errors
• the reads were sufficiently long relative to the genome size
• These assumptions do NOT hold
Mapping algorithms
• Approximate String Matching Problem
• Searching for occurrences of the read sequence within the
reference sequence but allowing for some mismatches and
gaps between the two
• Standard algorithm: dynamic programming
• Too slow
• Too much memory required
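The dynamic-programming approach named above can be sketched as follows. This is a minimal illustration, assuming unit costs for mismatches and gaps; the function name `approximate_matches` and the `max_edits` parameter are hypothetical and not taken from any aligner. The work grows with read length × reference length, which is why this approach does not scale to billions of reads against a large genome.

```python
def approximate_matches(pattern: str, text: str, max_edits: int) -> list[tuple[int, int]]:
    """Return (end_offset, edits) for every position in text where an
    occurrence of pattern ends with at most max_edits differences."""
    m = len(pattern)
    # prev[i] = fewest edits aligning pattern[:i] to a substring of text
    # that ends at the previous text position.
    prev = list(range(m + 1))
    hits = []
    for j, t_char in enumerate(text, start=1):
        # Row 0 stays 0: an occurrence may start anywhere in the text.
        curr = [0] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == t_char else 1
            curr[i] = min(
                prev[i - 1] + cost,  # match or mismatch
                prev[i] + 1,         # extra character in the text
                curr[i - 1] + 1,     # extra character in the pattern
            )
        if curr[m] <= max_edits:
            hits.append((j, curr[m]))
        prev = curr
    return hits


print(approximate_matches("aba", "abaaba", 0))  # exact occurrences end at 3 and 6
```

The sketch keeps only two columns of the table, so it reports end positions and edit counts but not the alignment itself; recovering the alignment would need the full table, which is where the memory cost comes from.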
Mapping algorithms
• Approximate String Matching Problem
• Two main ideas for addressing large input sizes (in # of reads and size of the reference):
• filtering
• quickly exclude large regions of the reference where no approximate match can be
found
• indexing
• Preprocessing the reference sequence and/or the set of reads to establish string indices
• Once the string index is built, a query typically does not require scanning
the whole reference, so queries run much faster at the expense
of larger memory consumption.
• The string indices that are currently used are:
• Suffix array
• Enhanced suffix array
• FM-index (Full-text index in Minute space) + Burrows-Wheeler transform
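As an illustration of the indexing idea, here is a minimal sketch of the first index in the list, a suffix array, queried by binary search. The function names are hypothetical, and the construction by plain sorting is chosen for readability only; it would be far too slow and memory-hungry for a real genome.

```python
def build_suffix_array(text: str) -> list[int]:
    """Start offsets of all suffixes of text, in lexicographic order."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_exact(pattern: str, text: str, sa: list[int]) -> list[int]:
    """Offsets where pattern occurs in text, found without scanning text."""
    m = len(pattern)

    def first_row(strictly_greater: bool) -> int:
        # Binary search over the sorted suffixes, comparing only the
        # first m characters of each suffix with the pattern.
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + m]
            if prefix < pattern or (strictly_greater and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    start = first_row(strictly_greater=False)
    end = first_row(strictly_greater=True)
    return sorted(sa[start:end])


text = "abaaba$"
sa = build_suffix_array(text)
print(sa)                           # offsets of the sorted suffixes
print(find_exact("aba", text, sa))  # offsets where "aba" occurs
```

Each query needs only about log2(len(text)) comparisons, which is the speed benefit described above; the price is storing one integer per reference position.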
Burrows-Wheeler transform (BWT)
Creation
• Write down all rotations of the string
• Sort the matrix lexicographically
• The last column is BWT(T)
• The rows in the matrix are essentially
the sorted suffixes of the text
• SA(T), the suffix array, holds the start offset of each
sorted suffix in the original string
Introduction to the Burrows-Wheeler Transform and FM Index
Ben Langmead, Department of Computer Science, JHU
T = abaaba$
BWT(T) = abba$aa
SA(T) = 6 5 2 3 0 4 1
[Figure: rotation matrix of T before sorting (offsets 6 5 4 3 2 1 0) and after sorting (offsets given by SA(T))]
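The creation steps above can be reproduced in a few lines. This is a small sketch for the slide's example string only, with hypothetical function names; it assumes the text ends in the terminator `$`, which sorts before the letters.

```python
def rotations(text: str) -> list[str]:
    """All rotations of text, one per start offset."""
    return [text[i:] + text[:i] for i in range(len(text))]


def bwt_from_rotations(text: str) -> str:
    """Sort the rotation matrix and read off its last column."""
    return "".join(row[-1] for row in sorted(rotations(text)))


def suffix_array(text: str) -> list[int]:
    """Start offsets of the sorted suffixes of text."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def bwt_from_suffix_array(text: str) -> str:
    """The BWT character of a row is the character preceding its suffix."""
    return "".join(text[i - 1] for i in suffix_array(text))


t = "abaaba$"
print(bwt_from_rotations(t))     # abba$aa
print(suffix_array(t))           # [6, 5, 2, 3, 0, 4, 1]
print(bwt_from_suffix_array(t))  # abba$aa
```

The second route shows why the sorted rotations are "essentially the sorted suffixes": once the suffix array is known, the last column follows directly without building the matrix.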
Burrows-Wheeler transform (BWT)
LF mapping
• We rank each character by how many
times the same character occurred
previously in BWT(T)
• We keep the array SA(T) of rotation
start positions
• We keep an index of occurrences,
starting at zero
Introduction to the Burrows-Wheeler Transform and FM Index
Ben Langmead, Department of Computer Science, JHU
T = abaaba$
BWT(T) = abba$aa
SA(T) = 6 5 2 3 0 4 1
[Figure: sorted rotation matrix with character ranks 0 1 2 3 for a and 0 1 for b]
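The ranking described above is what makes the LF mapping work: the occurrence of a character with rank r in the last column is the same text character as the occurrence with rank r in the first column. The sketch below uses that property to rebuild T from BWT(T); the function names are hypothetical and the code assumes a single `$` terminator.

```python
from collections import Counter


def rank_bwt(bw: str) -> tuple[list[int], Counter]:
    """Rank each character by how often it occurred earlier in bw."""
    seen = Counter()
    ranks = []
    for c in bw:
        ranks.append(seen[c])  # ranks start at zero
        seen[c] += 1
    return ranks, seen


def first_column_starts(totals: Counter) -> dict[str, int]:
    """Row at which each character's block begins in the sorted first column."""
    starts, row = {}, 0
    for c in sorted(totals):
        starts[c] = row
        row += totals[c]
    return starts


def reverse_bwt(bw: str) -> str:
    """Rebuild the original text by repeatedly applying the LF mapping."""
    ranks, totals = rank_bwt(bw)
    starts = first_column_starts(totals)
    row, chars = 0, ["$"]  # row 0 is the rotation that starts with '$'
    while bw[row] != "$":
        c = bw[row]
        chars.append(c)
        row = starts[c] + ranks[row]  # LF mapping: last column -> first column
    return "".join(reversed(chars))


print(rank_bwt("abba$aa")[0])  # [0, 0, 1, 1, 0, 2, 3]
print(reverse_bwt("abba$aa"))  # abaaba$
```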
Burrows-Wheeler transform (BWT)
Exact matching
Introduction to the Burrows-Wheeler Transform and FM Index
Ben Langmead, Department of Computer Science, JHU
[Figure: step-by-step search for P = aba in the sorted rotation matrix of T = abaaba$, with BWT(T) = abba$aa and SA(T) = 6 5 2 3 0 4 1]
• Matching time is proportional to the length of P and independent of the length of T
• P = aba occurs in T = abaaba$ at offsets 0 and 3
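The exact-matching walk-through can be written as a backward search over the index. This is a sketch under simplifying assumptions: the occurrence table is stored in full (one count per row and character), and the function names are hypothetical. The pattern is consumed from its last character to its first, narrowing a range of rows at each step.

```python
def build_fm_index(text: str):
    """Return (suffix array, occurrence table, first-column starts) for text ending in '$'."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bw = "".join(text[i - 1] for i in sa)
    # occ[c][k] = number of occurrences of c in bw[:k]
    occ = {c: [0] for c in sorted(set(bw))}
    for ch in bw:
        for c, counts in occ.items():
            counts.append(counts[-1] + (c == ch))
    # first[c] = row where the block of c starts in the sorted first column
    first, total = {}, 0
    for c in occ:  # keys were inserted in sorted order
        first[c] = total
        total += occ[c][-1]
    return sa, occ, first


def backward_search(pattern: str, sa, occ, first) -> list[int]:
    """Offsets in the text where pattern occurs exactly."""
    lo, hi = 0, len(sa)  # half-open range of rows still matching
    for c in reversed(pattern):
        if c not in first:
            return []
        lo = first[c] + occ[c][lo]
        hi = first[c] + occ[c][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


sa, occ, first = build_fm_index("abaaba$")
print(backward_search("aba", sa, occ, first))  # [0, 3]
```

Each pattern character costs two table lookups, so the loop runs len(pattern) times regardless of how long the text is.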
Bowtie
• FM Index finds exact sequence matches quickly in small
memory, but short read alignment demands more:
• Allowances for mismatches
• Consideration of quality values
Bowtie
• Bowtie’s solution: backtracking quality-aware search
• If a particular base is not found in the index while traversing
the matrix, backtrack, try another base chosen according to the
quality values, and continue the search
Introduction to the Burrows-Wheeler Transform and FM Index
Ben Langmead, Department of Computer Science, JHU
[Figure: backtracking search over the sorted rotation matrix of T = abaaba$, SA(T) = 6 5 2 3 0 4 1]
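The idea of backtracking over the index can be sketched as below. This is a deliberately simplified stand-in, not Bowtie's actual algorithm: it tries the observed base first, then every alternative base within a mismatch budget, and ranks the hits by the summed quality of the substituted positions (a low-quality base is a cheaper place to assume an error). The names `search_with_mismatches`, `max_mismatches` and the penalty scheme are assumptions made for this example.

```python
def build_fm_index(text: str):
    """Return (suffix array, occurrence table, first-column starts) for text ending in '$'."""
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bw = "".join(text[i - 1] for i in sa)
    occ = {c: [0] for c in sorted(set(bw))}
    for ch in bw:
        for c, counts in occ.items():
            counts.append(counts[-1] + (c == ch))
    first, total = {}, 0
    for c in occ:
        first[c] = total
        total += occ[c][-1]
    return sa, occ, first


def search_with_mismatches(read, quals, index, max_mismatches=1):
    """Return (offset, penalty) pairs, best first, allowing substitutions only."""
    sa, occ, first = index
    hits = {}  # text offset -> lowest penalty found

    def extend(i, lo, hi, mismatches, penalty):
        if i < 0:  # whole read consumed: every row in the range is a hit
            for row in range(lo, hi):
                hits[sa[row]] = min(penalty, hits.get(sa[row], penalty))
            return
        # Observed base first, then the alternatives (the backtracking step).
        candidates = sorted((c for c in first if c != "$"),
                            key=lambda c: c != read[i])
        for c in candidates:
            cost = 0 if c == read[i] else 1
            if mismatches + cost > max_mismatches:
                continue
            new_lo = first[c] + occ[c][lo]
            new_hi = first[c] + occ[c][hi]
            if new_lo < new_hi:
                extend(i - 1, new_lo, new_hi,
                       mismatches + cost, penalty + cost * quals[i])

    extend(len(read) - 1, 0, len(sa), 0, 0)
    return sorted(hits.items(), key=lambda kv: (kv[1], kv[0]))


index = build_fm_index("abaaba$")
# The last base of the read has low quality, so changing it is the cheapest fix.
print(search_with_mismatches("abb", [30, 30, 10], index))
```

Gaps are not handled here, and the search explores every substitution within the budget rather than pruning by quality as it goes.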
Burrows-Wheeler transform
genome scale
• Some clever tricks are involved to
achieve more compression of the
data structures (FM-Index*)
• Use BWT on the reference
genome to build the index
• Look up each read
• Convert to genome locations
How to map billions of short reads onto genomes. Trapnell & Salzberg. Nature Biotechnology 2009
*Ferragina & Manzini (2000). Opportunistic Data Structures with Applications. Proc. of the 41st Annual Symposium on Foundations of Computer Science
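The slide does not say which tricks are meant. One commonly described way to shrink an FM index, shown here purely as an assumption, is to store occurrence counts only at every k-th row of the BWT (checkpoints) and to count the few remaining characters on demand. The class name `CheckpointedOcc` and the `step` parameter are hypothetical.

```python
class CheckpointedOcc:
    """Occurrence counts for a BWT string, stored only every `step` rows."""

    def __init__(self, bw: str, step: int = 4):
        self.bw = bw
        self.step = step
        self.checkpoints = {c: [] for c in set(bw)}
        running = dict.fromkeys(self.checkpoints, 0)
        for k, ch in enumerate(bw):
            if k % step == 0:
                # Snapshot of the counts in bw[:k].
                for c, n in running.items():
                    self.checkpoints[c].append(n)
            running[ch] += 1

    def occ(self, c: str, k: int) -> int:
        """Return how many times c occurs in bw[:k]."""
        if c not in self.checkpoints:
            return 0
        marks = self.checkpoints[c]
        # Nearest stored checkpoint at or before row k.
        j = min(k // self.step, len(marks) - 1)
        # Count the remaining characters between the checkpoint and k.
        return marks[j] + self.bw.count(c, j * self.step, k)


index = CheckpointedOcc("abba$aa", step=4)
print(index.checkpoints["a"])  # counts of 'a' stored at rows 0 and 4
print(index.occ("a", 7))       # number of 'a' in the whole BWT string
```

A larger `step` saves memory and costs a slightly longer scan per lookup; the same trade-off can be applied to the suffix array by keeping only every k-th entry.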
Peak calling
Chromatin immunoprecipitation (ChIP)
http://www.nature.com/nrg/journal/v11/n7/full/nrg2795.html
Peak calling
• ChIP profile
• Challenges
• MACS
ChIP profile
• Only the 5’ ends of ChIPed fragments
are sequenced
• Shifted read distribution
• Expected symmetry between
Watson/Crick read distributions
http://www.nature.com/nrg/journal/v10/n10/abs/nrg2641.html
Peak calling challenges
• Adjust for sequence mappability: regions that contain repetitive elements
have a different expected tag count
http://www.nature.com/nbt/journal/v27/n1/full/nbt.1518.html
Peak calling challenges
• Different ChIP-seq applications
produce different types of peaks.
• Most current tools have been
designed to detect sharp peaks
(TF binding, histone modifications
at regulatory elements)
Computation for ChIP-seq and RNA-seq studies. Pepke et al. Nat. Methods 2009
Peak calling challenges
• Definition of enriched regions/peaks:
• Which statistic to use?
• What boundaries should be
reported?
• What score to use
(ratio, p-val, q-val)?
• Compute/estimate an FDR?
http://www.nature.com/nrg/journal/v10/n10/abs/nrg2641.html
MACS
Step 1: Modelling the tag shift
1. Scan the genome with a window of
user-defined sonication size
2. Keep the best 1000 (or fewer) peaks
having a fold enrichment > mfold (default
32, relative to a random model)
3. Separate Watson/Crick tags
4. Shift size is modelled as the
distance d between the modes of
the Watson and Crick peaks
Model-based Analysis of ChIP-Seq (MACS). Zhang et al. Genome Biology 2008
http://www.biologie.ens.fr/~mthomas/other/chip-seq-training/booklet/booklet_chip-seq.pdf
[Figure: MACS tag shift model showing the distance d]
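Step 4 can be illustrated for a single peak. This is a rough sketch, not MACS code: MACS builds its model from many high-confidence peaks, whereas this example takes the 5' positions of the Watson and Crick tags around one peak, finds the mode of each by simple binning, and reports the distance. The function names and the `bin_size` parameter are hypothetical.

```python
from collections import Counter


def mode_position(positions: list[int], bin_size: int = 10) -> int:
    """Centre of the most populated bin of tag positions."""
    bins = Counter(p // bin_size for p in positions)
    # Most tags wins; ties go to the leftmost bin.
    best_bin = min(bins, key=lambda b: (-bins[b], b))
    return best_bin * bin_size + bin_size // 2


def estimate_d(watson: list[int], crick: list[int], bin_size: int = 10) -> int:
    """Distance d between the modes of the Watson and Crick tag distributions."""
    return mode_position(crick, bin_size) - mode_position(watson, bin_size)


# Toy data: Watson tags pile up near 105, Crick tags near 305.
watson = [100, 102, 105, 108, 140]
crick = [300, 301, 305, 260]
print(estimate_d(watson, crick))  # 200
```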
MACS
Step 2: Peak detection
1. Shift every tag by d/2
2. Slide a 2d window across the genome to find candidate peaks with significant tag
enrichment (according to a Poisson distribution, default p-value = 10^-5)
3. Merge overlapping peaks
4. Report:
• Fold enrichment for called peaks: ratio between the observed tag
counts and the counts expected under the Poisson distribution (using
input data if provided)
• The position with the highest pile-up is defined as the summit of
the peak
• Empirical FDR if a control sample is provided (sample swap),
FDR = #control peaks / #ChIP peaks
Model-based Analysis of ChIP-Seq (MACS). Zhang et al. Genome Biology 2008
http://www.biologie.ens.fr/~mthomas/other/chip-seq-training/booklet/booklet_chip-seq.pdf
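The peak detection steps can be sketched with the standard library alone. This is an illustration under strong assumptions, not MACS itself: the background rate is a single genome-wide average, whereas MACS can also estimate it from control data, and the function names, the `step` parameter and the toy coordinates are hypothetical.

```python
import math
from bisect import bisect_left


def poisson_sf(k: int, lam: float) -> float:
    """P(X >= k) for X ~ Poisson(lam)."""
    term, cdf = math.exp(-lam), 0.0
    for i in range(k):
        cdf += term            # adds P(X = i)
        term *= lam / (i + 1)  # P(X = i + 1) from P(X = i)
    return max(0.0, 1.0 - cdf)


def shift_tags(watson, crick, d):
    """Move Watson tags right and Crick tags left by d/2; return them sorted."""
    half = d // 2
    return sorted([p + half for p in watson] + [p - half for p in crick])


def call_peaks(tags, genome_length, d, p_cutoff=1e-5, step=None):
    """Slide a 2d window over sorted tags and merge significant windows."""
    window = 2 * d
    step = step or max(1, d // 2)
    lam = len(tags) * window / genome_length  # expected tags per window
    peaks = []
    for start in range(0, genome_length - window + 1, step):
        end = start + window
        count = bisect_left(tags, end) - bisect_left(tags, start)
        if count and poisson_sf(count, lam) < p_cutoff:
            if peaks and start <= peaks[-1][1]:
                peaks[-1][1] = end  # merge with the previous overlapping peak
            else:
                peaks.append([start, end])
    return [tuple(p) for p in peaks]


def empirical_fdr(n_control_peaks: int, n_chip_peaks: int) -> float:
    """FDR = #control peaks / #ChIP peaks (sample swap)."""
    return n_control_peaks / n_chip_peaks if n_chip_peaks else 0.0


# Toy example: a binding site at 5000 plus a few background tags.
watson = [4950] * 20 + [500, 2500, 7500, 9000]
crick = [5050] * 20
tags = shift_tags(watson, crick, d=100)
print(call_peaks(tags, genome_length=10000, d=100))
```

After the shift, the Watson and Crick tags of the toy binding site coincide at position 5000, which is what makes the pile-up stand out against the background rate.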
Again lots of choice
Computation for ChIP-seq and RNA-seq studies. Pepke et al. Nat. Methods 2009
and more…
Visualise to assess quality
• Assess the data quality e.g. positive controls, background
• Determine cutoffs (looking at positive controls)
• Compare different peak finder outputs
• Integration of data / co-visualization
Identifying ChIP-seq enrichment using MACS. Feng et al. Nat. Protocols, 2012
References
Introduction to the Burrows-Wheeler Transform and FM Index. Ben Langmead, Department of Computer Science, JHU
How to map billions of short reads onto genomes. Trapnell & Salzberg. Nature Biotechnology 2009
Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data. Bailey et al. PLoS Comp. Bio. 2013
Model-based Analysis of ChIP-Seq (MACS). Zhang et al. Genome Biology 2008
Computation for ChIP-seq and RNA-seq studies. Pepke et al. Nat. Methods 2009
Sebastian Schmeier
s.schmeier@gmail.com
http://sschmeier.github.io/bioinf-workshop/
