Analysis of ChIP-Seq Data

Bioinformatics Analysis of ChIP-Seq
Phil Ewels, NGI Stockholm
phil.ewels@scilifelab.se
Epigenetics and its applications
in clinical research (2601)
2017-03-21

Phil Ewels - Bioinformatics Analysis of ChIP-Seq / 42
Talk Overview
• Overview of ChIP-Seq
• ChIP-Seq data processing
• Peak Calling
• Normalisation & quality control
• Analysis Pipelines
• Downstream analyses

Overview of ChIP-Seq
• Question
- Can we find where a protein of interest binds across the genome?
• Requirements
- Good antibody
- Reference genome
• Assumptions
- Protein binds in a stable pattern
- Binding is comparable across a population of cells

• Cross-link DNA and proteins
• Isolate chromatin & fragment the DNA
• Chromatin immunoprecipitation
• Reverse cross-links and purify DNA
• Add adapters & sequence

Data processing
• Sequence QC
- FastQC / FastQ Screen
• Trimming
- Cutadapt / Trimmomatic / AlienTrimmer / FASTX-Toolkit
• Alignment
- Bowtie / BWA / STAR
• Duplicate removal
- Picard / Samtools / SeqMonk

Data processing: FastQC

Data processing: FastQ Screen

Data processing: cutadapt
http://opensource.scilifelab.se/projects/cutadapt/

Data processing: cutadapt + FastQC

Data processing: Alignment
• Bowtie 1 & 2
- Bowtie 1 good for short reads (less than 50 bp)
- Bowtie 2 better with longer reads
• STAR
- As good as Bowtie but much faster
- Has a large memory footprint (~30 GB for human)
• BWA / Subread / SOAP / MAQ
• Alignments should be unique (discard multi-mapping reads)
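
One common way to approximate the "unique alignments" rule is to filter on mapping quality. The sketch below assumes plain-text SAM input and a MAPQ cut-off of 10; both the function name and the threshold are illustrative assumptions, and how MAPQ relates to uniqueness differs between aligners.

```python
def keep_confident_alignments(sam_lines, min_mapq=10):
    """Yield SAM lines that are mapped and have MAPQ >= min_mapq.

    Header lines (starting with '@') pass through unchanged.
    min_mapq=10 is an illustrative assumption, not a recommended value.
    """
    for line in sam_lines:
        if line.startswith("@"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])   # column 2: bitwise FLAG
        mapq = int(fields[4])   # column 5: mapping quality
        if flag & 0x4:          # 0x4 means the read is unmapped
            continue
        if mapq >= min_mapq:
            yield line


example = [
    "@HD\tVN:1.6",
    "r1\t0\tchr1\t100\t42\t36M\t*\t0\t0\t*\t*",
    "r2\t0\tchr1\t200\t0\t36M\t*\t0\t0\t*\t*",   # low-confidence placement
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",          # unmapped read
]
kept = list(keep_confident_alignments(example))
print(len(kept), "lines kept (header plus confident alignments)")
```

In practice the same filter is usually applied with an existing tool on BAM files; the point here is only which SAM columns carry the decision.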

Data processing: Duplicate Removal
• Duplicates can come from multiple sources
- PCR duplicates
- Optical duplicates
- Deep sequencing (genuine duplicates)
• How do you define duplicates?
- Sequence content - errors?
- Mapping position
• Sonication makes genuine duplicates unlikely
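
Defining duplicates by mapping position can be sketched as follows. This is a simplified illustration, not a reimplementation of Picard or Samtools: it assumes reads are given as (chrom, start, end, strand) tuples and simply keeps the first read seen at each 5' position and strand.

```python
def remove_position_duplicates(reads):
    """Keep the first read seen at each (chrom, 5' position, strand).

    reads: iterable of (chrom, start, end, strand) tuples.
    The 5' end is `start` for '+' reads and `end` for '-' reads.
    """
    seen = set()
    kept = []
    for chrom, start, end, strand in reads:
        five_prime = start if strand == "+" else end
        key = (chrom, five_prime, strand)
        if key not in seen:
            seen.add(key)
            kept.append((chrom, start, end, strand))
    return kept


reads = [
    ("chr1", 100, 136, "+"),
    ("chr1", 100, 136, "+"),  # same position and strand: duplicate
    ("chr1", 100, 136, "-"),  # same span, opposite strand: kept
    ("chr1", 105, 141, "+"),
]
unique = remove_position_duplicates(reads)
print(len(unique), "of", len(reads), "reads kept")
```

Real tools typically choose which copy to keep by base quality rather than by order of appearance.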

Results Summary: MultiQC
• Scans your results directory and parses log files
• Builds a single report summarising everything
http://multiqc.info

Peak Calling: Considerations
• What kind of mark are you looking for?
• Point-source factors
- Few, sharp peaks
- Most transcription factors
• Many peaks
- RNA Polymerase II
• Broad peaks
- Some histone marks (H3K27me3)

Peak Calling: Tools
• Huge number of tools available
• Many different statistical approaches
• The most important thing to remember: your results should look sensible and you must be consistent
• If in doubt, use MACS v2 or SPP
- https://github.com/taoliu/MACS/
- http://compbio.med.harvard.edu/Supplements/ChIP-seq/
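
To give a feel for one of the statistical approaches, several peak callers (MACS among them) model read counts with a Poisson distribution. The sketch below is a deliberately simplified version that assumes a single genome-wide background rate; the function names and the toy numbers are illustrative assumptions, and real callers use local background estimates and multiple-testing correction.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), via one minus the lower tail.

    Adequate for a sketch; very small tails lose precision this way.
    """
    if k <= 0:
        return 1.0
    term = math.exp(-lam)   # P(X = 0)
    cdf = term
    for i in range(1, k):
        term *= lam / i     # P(X = i) from P(X = i - 1)
        cdf += term
    return max(0.0, 1.0 - cdf)


def window_pvalue(chip_count, total_reads, genome_size, window_size):
    """P-value for seeing >= chip_count reads in a window under a uniform background."""
    lam = total_reads * window_size / genome_size  # expected reads per window
    return poisson_sf(chip_count, lam)


# Toy numbers: 1M reads on a 100 Mb genome gives 2 expected reads per 200 bp window.
p_strong = window_pvalue(10, 1_000_000, 100_000_000, 200)
p_weak = window_pvalue(2, 1_000_000, 100_000_000, 200)
print(f"10 reads: p = {p_strong:.2e}; 2 reads: p = {p_weak:.2f}")
```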

Normalisation & quality control
• What we expect to see:
• Assumptions:
- Specific antibody
- Perfect purification
- Equal representation
• Reality:
- Non-specific antibody binding
- Unbound DNA being sequenced
- Open chromatin bias, repetitive regions not aligned

Normalisation: input controls
• Typically run an input sample
- Cross-linked DNA, but no ChIP step
- Can use a non-nuclear antibody such as IgG
- Same sample, same prep
- Captures systematic biases (e.g. chromatin type, GC)
• Can use the data in multiple ways
- Just determine regions to exclude
- Subtraction normalisation
- Typically used when calling peaks
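
Subtraction normalisation can be sketched as below, assuming reads have already been counted in matching genomic bins for the ChIP and input samples. Scaling the input to the ChIP sequencing depth by total read count is one simple choice among several; the function name and toy counts are illustrative assumptions.

```python
def subtract_input(chip_counts, input_counts):
    """Subtract depth-scaled input counts from ChIP counts, bin by bin.

    Negative values are floored at zero.
    """
    scale = sum(chip_counts) / sum(input_counts)  # bring input to ChIP depth
    return [
        max(c - i * scale, 0.0)
        for c, i in zip(chip_counts, input_counts)
    ]


# Toy bins: bin 2 is a real peak; bin 4 is high in both samples (systematic bias).
chip = [5, 50, 5, 40]
inp = [20, 20, 20, 140]   # input sequenced twice as deeply
normalised = subtract_input(chip, inp)
print(normalised)
```

The fourth bin shows why the control matters: it looks enriched in the ChIP sample alone, but disappears once the input signal is taken into account.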

Normalisation: Signal and Noise
• We will sequence lots of irrelevant stuff
- What is signal and what is noise?
• Essentially, we’re looking for enrichment
- Peak callers do a lot of this for you
• Most peak callers need an input sample
- Some can use mappability and GC content instead

Quality Control: visualisation
• Visualising the data is quick and very helpful
- UCSC / SeqMonk / IGV
• Fast impression of how the experiment has worked
• Not enough on its own

Quality Control: SeqMonk

Quality Control: Saturation Analysis
• If you sequence more reads, you’ll find more peaks
• If you’ve sequenced enough, you should be nearing a plateau
• Look into complexity of data
- Preseq
- SPP subsampling
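
The idea behind subsampling can be sketched with a toy complexity curve: draw increasing fractions of the reads and count how many distinct positions each subsample covers. This only illustrates the plateau; it is not how Preseq or SPP work internally, and the simulated library below is an assumption made for the example.

```python
import random


def complexity_curve(positions, fractions=(0.25, 0.5, 0.75, 1.0), seed=0):
    """Return (reads sampled, distinct positions seen) for each fraction."""
    rng = random.Random(seed)
    curve = []
    for frac in fractions:
        n = int(len(positions) * frac)
        subsample = rng.sample(positions, n)
        curve.append((n, len(set(subsample))))
    return curve


# Toy library: 10,000 reads drawn from only 2,000 possible positions,
# so deeper sequencing mostly finds positions that were already seen.
toy_rng = random.Random(1)
library = [toy_rng.randrange(2000) for _ in range(10000)]
curve = complexity_curve(library)
for n_reads, n_distinct in curve:
    print(n_reads, n_distinct)
```

When the gain in distinct positions per extra read becomes small, the library is close to saturation.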

Quality Control: Strand Cross-Correlation
• Single-end sequencing should give a bimodal peak around binding sites on the two DNA strands
• Some peak callers use this to aid in region calling and for QC
• Can define NSC and RSC scores…
Landt et al. 2012
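
A minimal sketch of the calculation, following the definitions in Landt et al. 2012: NSC is the cross-correlation at the fragment-length shift divided by the minimum cross-correlation, and RSC is (fragment-length value minus minimum) divided by (read-length "phantom peak" value minus minimum). The per-base strand counts and the cross-correlation profile below are made-up toy data, and the function names are assumptions.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def cross_correlation(fwd, rev, max_shift):
    """Correlate forward-strand counts with reverse-strand counts shifted by d bp."""
    return {d: pearson(fwd[: len(fwd) - d], rev[d:]) for d in range(max_shift + 1)}


def nsc_rsc(cc, fragment_shift, read_length):
    """NSC and RSC from a {shift: correlation} profile (Landt et al. 2012)."""
    cc_min = min(cc.values())
    nsc = cc[fragment_shift] / cc_min
    rsc = (cc[fragment_shift] - cc_min) / (cc[read_length] - cc_min)
    return nsc, rsc


# Toy coverage: reverse-strand pile-ups sit 100 bp downstream of forward ones.
fwd = [0] * 300
rev = [0] * 300
for site in (40, 170):
    fwd[site] = 10
    rev[site + 100] = 10
profile = cross_correlation(fwd, rev, max_shift=150)
best_shift = max(profile, key=profile.get)
print("estimated fragment shift:", best_shift)

# Made-up profile values, only to show the arithmetic of the two scores.
toy_cc = {0: 0.210, 36: 0.225, 100: 0.230, 180: 0.260, 400: 0.200}
nsc, rsc = nsc_rsc(toy_cc, fragment_shift=180, read_length=36)
print(f"NSC = {nsc:.2f}, RSC = {rsc:.2f}")
```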

Quality Control: Stats, stats, stats
• NSC and RSC
- Normalised strand cross-correlation coefficient
- Relative strand cross-correlation coefficient
• FRiP
- Fraction of reads in peaks
• FDRs, IDRs, p-values of peaks
- False discovery rates
- Irreproducible discovery rates
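
FRiP is simple enough to sketch directly. The example assumes each read is reduced to a single (chrom, position) and peaks are half-open (chrom, start, end) intervals; the linear scan is for clarity only and would be far too slow for real data.

```python
def frip(read_positions, peaks):
    """Fraction of reads whose position falls inside any peak.

    read_positions: list of (chrom, pos)
    peaks: list of (chrom, start, end), with start <= pos < end counted as inside
    """
    in_peaks = 0
    for chrom, pos in read_positions:
        if any(c == chrom and s <= pos < e for c, s, e in peaks):
            in_peaks += 1
    return in_peaks / len(read_positions)


toy_reads = [("chr1", 150), ("chr1", 500), ("chr1", 1200), ("chr2", 150), ("chr2", 900)]
toy_peaks = [("chr1", 100, 200), ("chr1", 1000, 1500)]
score = frip(toy_reads, toy_peaks)
print("FRiP =", score)
```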

Quality Control: Stats, stats, stats
• Useful if you have a lot of samples
- Allows benchmarking and identification of failed samples
• Don’t be overwhelmed by the acronyms
• Believe your eyes - if the data looks trustworthy, it probably is trustworthy

Bioinformatics Workflows
• Running all of these steps for many samples is repetitive
- Difficult, dull, prone to errors
• Processing can be automated by a Workflow Manager
- Also known as Pipeline Tools
• Execute processing steps for you, managing files and dependencies

Cluster Flow
• Available on UPPMAX
• ChIP-seq pipeline
- Runs QC, alignment, deduplication and generates coverage tracks / fingerprint plots
- Written with J Westholm
#fastqc
#bowtie1
#samtools_sort_index
#bedtools_bamToBed
#bedToNrf
#picard_dedup
#phantompeaktools_runSpp
#deeptools_bamCoverage
#deeptools_bamFingerprint
#bedtools_intersectNeg
module load clusterflow
cf --setup
cf-uppmax --add_genomes
cf --genome GRCh37 chipseq_qc *.fq.gz

Nextflow
• Runs on UPPMAX
• Several pipelines built at NGI, including ChIP-seq
- Still under development, could be a little buggy
• Also runs elsewhere. Docker coming soon.
curl -fsSL get.nextflow.io | bash
nextflow run SciLifeLab/NGI-ChIPseq \
  --project b2017123 \
  --reads '*_R{1,2}.fastq.gz' \
  --macsconfig 'macssetup.config' \
  --genome GRCh37

Bioinformatics Workflows
• These are great, but come with some caveats
- Some setup is required
- They don’t always work…
- Results must be checked
• They are not a substitute for understanding the analysis steps
• You are still responsible for your results!

Downstream Analysis: Annotation
• You have reads! Peaks! But where are they?
- Co-ordinates are not helpful by themselves
• BEDTools
- closest: distance to nearest genes
- intersect: overlap with feature classes
• HOMER annotation
• SeqMonk Average quantitation plots
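
The "closest" idea can be sketched as a nearest-feature lookup. This is a simplification of what BEDTools does: it assumes a single chromosome, represents each gene by one transcription start site (TSS) and each peak by its midpoint. The gene names and coordinates are hypothetical.

```python
import bisect


def nearest_gene(peak_mid, tss_sorted):
    """Return (gene name, distance in bp) of the TSS nearest to peak_mid.

    tss_sorted: list of (position, gene_name), sorted by position.
    """
    positions = [pos for pos, _ in tss_sorted]
    i = bisect.bisect_left(positions, peak_mid)
    # Only the neighbours either side of the insertion point can be nearest.
    candidates = tss_sorted[max(i - 1, 0): i + 1]
    pos, name = min(candidates, key=lambda t: abs(t[0] - peak_mid))
    return name, abs(pos - peak_mid)


# Hypothetical TSS positions on one chromosome.
tss = [(1000, "geneA"), (5000, "geneB"), (9000, "geneC")]
for mid in (100, 4200, 9500):
    print(mid, nearest_gene(mid, tss))
```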

• HOMER can annotate read intensities

• SeqMonk average quantitation plot across genes

• GO analysis is increasingly popular
- Gene Ontology search
• Databases classify every gene with a restricted
vocabulary
• Use your data to find if any GO terms are enriched
• Like peak callers, lots of software available
- DAVID and GREAT are popular
- Cytoscape good for visualisation
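
Enrichment of a single GO term is often tested with a hypergeometric (one-sided Fisher) test. The sketch below shows that calculation only; the tools named above differ in their background models and all of them correct for testing many terms at once, which is omitted here. The gene counts are made up.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """P(X >= k): chance of at least k annotated genes in a random gene set.

    N: genes in the background, K: background genes carrying the GO term,
    n: genes in your list (e.g. genes near peaks), k: of those, genes with the term.
    """
    upper = min(n, K)
    total = comb(N, n)
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total


# Made-up example: 20 of 500 peak-associated genes carry a term that
# annotates 200 of 20,000 genes overall (5 would be expected by chance).
p_term = hypergeom_pvalue(k=20, n=500, K=200, N=20000)
print(f"p = {p_term:.2e}")
```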

Downstream Analysis: Motif Searching
• Search peaks for enriched sequence motifs
- Could indicate a TF binding motif
- Interesting for new ChIP factors
- Can be informative for co-operative binding
• HOMER is one of many tools to do this
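
The simplest form of motif enrichment is to compare k-mer frequencies in peak sequences against background sequences. This sketch is not HOMER's algorithm: it ignores reverse complements, degenerate positions and statistical testing, and the toy sequences are invented around the string TGACTCA purely for illustration.

```python
from collections import Counter


def kmer_counts(seqs, k):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def enriched_kmers(peak_seqs, background_seqs, k=7, pseudocount=1.0):
    """Rank k-mers by frequency in peaks relative to background."""
    fg = kmer_counts(peak_seqs, k)
    bg = kmer_counts(background_seqs, k)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    scores = {}
    for kmer, n in fg.items():
        fg_freq = (n + pseudocount) / (fg_total + pseudocount)
        bg_freq = (bg[kmer] + pseudocount) / (bg_total + pseudocount)
        scores[kmer] = fg_freq / bg_freq
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy sequences: every "peak" contains TGACTCA, the background does not.
peak_seqs = ["ACGTGACTCATTT", "GGTGACTCAGCA", "TTTGACTCAAAC"]
background_seqs = ["ACGTACGTACGTA", "GGCATTTAGCAAC", "TTTCCCGGGAAAC"]
ranked = enriched_kmers(peak_seqs, background_seqs)
print(ranked[0][0])
```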

Downstream Analysis: Differential binding
• May want to compare samples across conditions or
time series
• Overlapping peaks is too simplistic
• DiffBind: R Bioconductor package
- ChIP-seq equivalent of DESeq and edgeR
- Extensive documentation and tutorials
- http://bioconductor.org/packages/release/bioc/html/DiffBind.html

Conclusions
• There is no “correct way” to analyse ChIP-seq
- Depends on biological system and question
- Affected by number of samples and experimental setup
- Defined by your experience and skills
• Two packages that do a lot of steps:
- HOMER
- SeqMonk
- Lots of YouTube walk-through videos
- https://youtu.be/LcMVb4zQBXI and https://youtu.be/Cy13yV6Rf6s

Further Reading
• Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data
- Bailey et al. PLOS Comp Bio (2013)
• ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia
- Landt et al. Genome Research (2012)
• ChIP–seq: advantages and challenges of a maturing technology
- Park. Nature Reviews Genetics (2009)
• http://seqanswers.com and http://biostars.org

Questions?
phil.ewels@scilifelab.se
Slides: http://tiny.cc/chipseq
