Bioinformatics Analysis of ChIP-Seq
Phil Ewels, NGI Stockholm
phil.ewels@scilifelab.se
Epigenetics and its applications
in clinical research (2601)
2017-03-21
Phil Ewels - Bioinformatics Analysis of ChIP-Seq / 42
Talk Overview
• Overview of ChIP-Seq
• ChIP-Seq data processing
• Peak Calling
• Normalisation & quality control
• Analysis Pipelines
• Downstream analyses
Overview of ChIP-Seq
• Question
- Can we find where a protein of interest binds across
the genome?
• Requirements
- Good antibody
- Reference genome
• Assumptions
- Protein binds in a stable pattern
- Binding is comparable across a population of cells
Overview of ChIP-Seq
• Cross-link DNA and proteins
• Isolate DNA & fragmentation
• Chromatin Immunoprecipitation
• Reverse cross-links and purify DNA
• Add adapters & sequence
ChIP-Seq data processing
Data processing
• Sequence QC
- FastQC / FastQ Screen
• Trimming
- Cutadapt / Trimmomatic / AlienTrimmer / FASTX-Toolkit
• Alignment
- Bowtie / BWA / STAR
• Duplicate removal
- Picard / Samtools / SeqMonk
Data processing: FastQC
Data processing: FastQ Screen
Data processing: cutadapt
http://opensource.scilifelab.se/projects/cutadapt/
Data processing: cutadapt + FastQC
Data processing: Alignment
• Bowtie 1 & 2
- Bowtie 1 good for short reads (less than 50bp)
- Bowtie 2 better with longer reads
• STAR
- As good as Bowtie but much faster
- Has a large memory footprint (~30 GB for human)
• BWA / Subread / SOAP / MAQ
• Alignments should be unique (discard multi-mapped reads)
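One common way to approximate "unique" alignments is to filter on mapping quality. The sketch below reads SAM text lines and keeps mapped reads at or above a MAPQ cut-off. The function name `filter_unique` and the default threshold of 20 are assumptions for illustration; MAPQ scales differ between aligners, so the cut-off should be checked against the aligner's documentation.

```python
def filter_unique(sam_lines, min_mapq=20):
    """Yield SAM header lines plus mapped records with MAPQ >= min_mapq.

    min_mapq=20 is an illustrative default, not a recommendation.
    """
    for line in sam_lines:
        if line.startswith("@"):
            # Header lines are passed through unchanged.
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])   # SAM column 2: bitwise FLAG
        mapq = int(fields[4])   # SAM column 5: mapping quality
        if flag & 0x4:
            # 0x4 is set when the read is unmapped.
            continue
        if mapq >= min_mapq:
            yield line
```

In practice the same filter is usually applied with an existing tool on BAM files; this version only shows the logic.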
Data processing: Duplicate Removal
• Duplicates can come from multiple sources
- PCR duplicates
- Optical duplicates
- Deep sequencing (genuine duplicates)
• How do you define duplicates?
- Sequence content: but what about sequencing errors?
- Mapping position
• Sonication makes genuine duplicates unlikely
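Defining duplicates by mapping position can be sketched as follows. This is a simplified illustration, not how Picard or Samtools are implemented: reads are treated as duplicates when they share chromosome, strand and 5' position. The `Read` type and `mark_duplicates` name are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Tuple


class Read(NamedTuple):
    name: str
    chrom: str
    five_prime: int  # 5' end of the read (alignment end for reverse-strand reads)
    strand: str      # "+" or "-"


def mark_duplicates(reads: Iterable[Read]) -> Tuple[List[Read], List[Read]]:
    """Split reads into (first-seen, duplicates) by chrom / 5' position / strand."""
    seen = set()
    unique, duplicates = [], []
    for read in reads:
        key = (read.chrom, read.five_prime, read.strand)
        if key in seen:
            duplicates.append(read)
        else:
            seen.add(key)
            unique.append(read)
    return unique, duplicates
```

Because the key ignores sequence content, reads that differ only by sequencing errors still collapse together, which is the main argument for position-based definitions.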
Results Summary: MultiQC
• Scans your results directory and parses log files
• Builds a single report summarising everything
http://multiqc.info
Peak Calling
Peak Calling: Considerations
• What kind of mark are you looking for?
• Point-source factors
- Few, sharp peaks
- Most transcription factors
• Many peaks
- RNA Polymerase II
• Broad peaks
- Some histone marks (H3K27me3)
Peak Calling: Tools
• Huge number of tools available
• Many different statistical approaches
• Only important thing to remember: your results should look sensible and you must be consistent
• If in doubt, use MACS v2 or SPP
- https://github.com/taoliu/MACS/
- http://compbio.med.harvard.edu/Supplements/ChIP-seq/
Normalisation & QC
Normalisation & quality control
• What we expect to see:
• Assumptions:
- Specific antibody
- Perfect purification
- Equal representation
• Reality:
- Non-specific antibody binding
- Unbound DNA being sequenced
- Open chromatin bias, repetitive regions not aligned
Normalisation: input controls
• Typically run an input sample
- Cross-linked DNA, but no ChIP step
- Can instead use a mock IP with a non-specific antibody such as IgG
- Same sample, same prep
- Captures systematic biases (e.g. chromatin type, GC content)
• Can use the data in multiple ways
- Just determine regions to exclude
- Subtraction normalisation
- Typically used when calling peaks
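Subtraction normalisation can be sketched as: scale the input to the ChIP sequencing depth, then subtract bin by bin. This is a minimal illustration under the assumption that both samples have already been counted into the same genomic bins; the function name and the choice to clip negative values to zero are not from the slides.

```python
def subtract_input(chip_bins, input_bins, chip_total, input_total):
    """Depth-scale input counts and subtract them from ChIP counts per bin.

    chip_total / input_total are total mapped read counts for each sample.
    Negative results are clipped to zero (an illustrative choice).
    """
    scale = chip_total / input_total
    return [max(0.0, c - i * scale) for c, i in zip(chip_bins, input_bins)]
```

For example, with an input sequenced twice as deeply as the ChIP sample, each input count is halved before subtraction.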
Normalisation: Signal and Noise
• We will sequence lots of irrelevant stuff
- What is signal and what is noise?
• Essentially, we’re looking for enrichment
- Peak callers do a lot of this for you
• Most peak callers need an input sample
- Some can use mappability and GC content instead
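To make "looking for enrichment" concrete, here is a deliberately simplified sketch: the depth-scaled input count in a window is used as the expected rate of a Poisson distribution, and the p-value is the chance of seeing at least the observed ChIP count. Real peak callers use more careful background models; the function names and the `min_lambda` floor are assumptions for illustration.

```python
import math


def poisson_sf(k, lam):
    """P(X >= k) for X ~ Poisson(lam), computed as 1 - P(X <= k - 1).

    Plain floating point: fine for a sketch, imprecise for very small p-values.
    """
    if k <= 0:
        return 1.0
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term              # add P(X = i)
        term *= lam / (i + 1)    # advance to P(X = i + 1)
    return max(0.0, 1.0 - cdf)


def window_pvalue(chip_count, input_count, chip_total, input_total, min_lambda=1.0):
    """Enrichment p-value for one window, using scaled input as the background rate."""
    lam = max(min_lambda, input_count * chip_total / input_total)
    return poisson_sf(chip_count, lam)
```

A window with 30 ChIP reads against an expected 5 gives a very small p-value, while 5 against 5 does not.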
Quality Control: visualisation
• Visualising the data is quick and very helpful
- UCSC / SeqMonk / IGV
• Fast impression of how the experiment has worked
• Not enough on its own
Quality Control: SeqMonk
Quality Control: Saturation Analysis
• If you sequence more reads, you’ll find more peaks
• If you’ve sequenced enough, you should be nearing a plateau
• Look into complexity of data
- Preseq
- SPP subsampling
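The idea behind a complexity curve can be sketched by subsampling: shuffle the reads, take increasing fractions, and count how many distinct mapping positions each subset covers. A curve that flattens suggests the library is close to saturation. This is only the observed curve; Preseq additionally extrapolates beyond the sequenced depth. The function name and default fractions are illustrative assumptions.

```python
import random


def complexity_curve(positions, fractions=(0.25, 0.5, 0.75, 1.0), seed=0):
    """Return [(reads_sampled, distinct_positions), ...] for nested subsamples.

    positions: one hashable mapping position per read, e.g. (chrom, pos, strand).
    """
    rng = random.Random(seed)
    shuffled = list(positions)
    rng.shuffle(shuffled)
    curve = []
    for frac in fractions:
        n = int(len(shuffled) * frac)
        curve.append((n, len(set(shuffled[:n]))))
    return curve
```

The same subsampling loop could be wrapped around a peak caller to count peaks rather than positions, which is closer to the saturation question on the slide.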
Quality Control: Strand Cross-Correlation
• Single-end sequencing should give a bimodal peak around binding sites on the two DNA strands
• Some peak callers use this to aid in region calling and for QC
• Can define NSC and RSC scores…
Landt et al. 2012
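A minimal sketch of the calculation, assuming per-base counts of read 5' ends on each strand for one chromosome. The cross-correlation is the Pearson correlation between forward-strand counts and reverse-strand counts at each strand shift. NSC is then the correlation at the fragment-length shift divided by the minimum correlation, and RSC is the fragment-length peak relative to the read-length ("phantom") peak, both after subtracting the minimum. Function names are illustrative; this is not the phantompeakqualtools implementation.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def cross_correlation(fwd, rev, max_shift):
    """Correlation of fwd[i] with rev[i + d] for each shift d in 0..max_shift.

    Requires len(fwd) == len(rev) and max_shift < len(fwd).
    """
    return [pearson(fwd[: len(fwd) - d], rev[d:]) for d in range(max_shift + 1)]


def nsc_rsc(cc, fragment_shift, read_length_shift):
    """NSC and RSC from a cross-correlation profile indexed by shift."""
    floor = min(cc)
    if floor <= 0 or cc[read_length_shift] == floor:
        raise ValueError("profile not suitable for NSC/RSC in this simple sketch")
    nsc = cc[fragment_shift] / floor
    rsc = (cc[fragment_shift] - floor) / (cc[read_length_shift] - floor)
    return nsc, rsc
```

The shift with the highest correlation is an estimate of the fragment length.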
Quality Control: Stats, stats, stats
• NSC and RSC
- Normalised strand cross-correlation coefficient
- Relative strand cross-correlation coefficient
• FRiP
- Fraction of reads in peaks
• FDRs, IDRs, p-values of peaks
- False discovery rates
- Irreproducible discovery rates
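FRiP is simple enough to sketch directly: count the reads whose position falls inside any called peak and divide by the total number of reads. The sketch assumes half-open, non-overlapping peak intervals and one representative position per read; the function name and input layout are assumptions for illustration.

```python
from bisect import bisect_right


def frip(read_positions, peaks):
    """Fraction of reads in peaks.

    read_positions: list of (chrom, pos)
    peaks: list of (chrom, start, end), half-open and non-overlapping
    """
    by_chrom = {}
    for chrom, start, end in peaks:
        by_chrom.setdefault(chrom, []).append((start, end))
    for intervals in by_chrom.values():
        intervals.sort()

    in_peaks = 0
    for chrom, pos in read_positions:
        intervals = by_chrom.get(chrom, [])
        # Index of the last interval whose start is <= pos.
        idx = bisect_right(intervals, (pos, float("inf"))) - 1
        if idx >= 0 and intervals[idx][0] <= pos < intervals[idx][1]:
            in_peaks += 1
    return in_peaks / len(read_positions) if read_positions else 0.0
```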
• Useful if you have a lot of samples
- Allows benchmarking and identification of failed samples
• Don’t be overwhelmed by the acronyms
• Believe your eyes - if the data looks trustworthy, it probably is trustworthy
Analysis Pipelines
Bioinformatics Workflows
• Running all of these steps for many samples is repetitive
- Difficult, dull, prone to errors
• Processing can be automated by a Workflow Manager
- Also known as Pipeline Tools
• Execute processing steps for you, managing files and dependencies.
Cluster Flow
• Available on UPPMAX
• ChIP-seq pipeline
- Runs QC, alignment, deduplication and generates coverage tracks / fingerprint plots
- Written with J Westholm
#fastqc
#bowtie1
	#samtools_sort_index
		#bedtools_bamToBed
			#bedToNrf
		#picard_dedup
			#samtools_sort_index
				#phantompeaktools_runSpp
					#deeptools_bamCoverage
					#deeptools_bamFingerprint
				#bedtools_intersectNeg
					#samtools_sort_index
module load clusterflow
cf --setup
cf-uppmax --add_genomes
cf --genome GRCh37 chipseq_qc *.fq.gz
Nextflow
• Runs on UPPMAX
• Several pipelines built at NGI, including ChIP-seq
- Still under development, could be a little buggy
• Also runs elsewhere. Docker coming soon.
curl -fsSL get.nextflow.io | bash
nextflow run SciLifeLab/NGI-ChIPseq \
  --project b2017123 \
  --reads '*_R{1,2}.fastq.gz' \
  --macsconfig 'macssetup.config' \
  --genome GRCh37
Bioinformatics Workflows
• These are great, but come with some caveats
- Some setup is required
- They don’t always work…
- Results must be checked
• They are not a substitute for understanding the analysis steps
• You are still responsible for your results!
Downstream Analysis
Downstream Analysis: Annotation
• You have reads! Peaks! But where are they?
- Co-ordinates are not helpful by themselves
• BEDTools
- closest: distance to nearest genes
- intersect: overlap with feature classes
• HOMER annotation
• SeqMonk Average quantitation plots
• HOMER can annotate read intensities
• SeqMonk average quantitation plot across genes
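Annotating a peak with its nearest gene can be sketched with a binary search over sorted transcription start sites. This mirrors the kind of answer BEDTools closest gives, but it is not a reimplementation: it handles one chromosome, ignores strand, and the function name and input layout are assumptions for illustration.

```python
from bisect import bisect_left


def nearest_gene(peak_mid, tss_list):
    """Return (gene_name, distance) for the TSS closest to peak_mid.

    tss_list: non-empty list of (position, gene_name) on one chromosome,
    sorted by position.
    """
    positions = [pos for pos, _ in tss_list]
    i = bisect_left(positions, peak_mid)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = tss_list[max(0, i - 1): i + 1]
    pos, name = min(candidates, key=lambda t: abs(t[0] - peak_mid))
    return name, abs(pos - peak_mid)
```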
• GO analysis is increasingly popular
- Gene Ontology search
• Databases classify every gene with a restricted vocabulary
• Use your data to find if any GO terms are enriched
• Like peak callers, lots of software available
- DAVID and GREAT are popular
- Cytoscape good for visualisation
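Testing whether a GO term is enriched is often framed as a hypergeometric test. The sketch below shows that calculation only; DAVID and GREAT each use their own variations, and no multiple-testing correction is applied here. The function and parameter names are illustrative.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """P(X >= k) under the hypergeometric distribution.

    N: genes in the background
    K: background genes annotated with the GO term
    n: genes near peaks (the test set)
    k: test-set genes annotated with the GO term
    """
    upper = min(n, K)
    total = comb(N, n)
    # math.comb returns 0 when asked to choose more items than are available.
    hits = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return hits / total
```

Requires Python 3.8 or later for `math.comb`.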
Downstream Analysis: Motif Searching
• Search peaks for enriched sequence motifs
- Could indicate a TF binding motif
- Interesting for new ChIP factors
- Can be informative for co-operative binding
• HOMER is one of many tools to do this
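The simplest form of motif enrichment is counting short sequences (k-mers) in peaks and comparing their frequency to a background set. This is a toy sketch, far simpler than HOMER: it ignores reverse complements, degenerate positions and statistics. Function names, the default k and the pseudocount are assumptions for illustration.

```python
from collections import Counter


def kmer_counts(seqs, k):
    """Count every overlapping k-mer across a list of sequences."""
    counts = Counter()
    for seq in seqs:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def enriched_kmers(peak_seqs, background_seqs, k=6, pseudocount=1.0):
    """Rank k-mers seen in peaks by frequency ratio against the background."""
    fg = kmer_counts(peak_seqs, k)
    bg = kmer_counts(background_seqs, k)
    fg_total = sum(fg.values())
    bg_total = sum(bg.values())
    scores = {}
    for kmer, count in fg.items():
        fg_freq = (count + pseudocount) / (fg_total + pseudocount)
        bg_freq = (bg[kmer] + pseudocount) / (bg_total + pseudocount)
        scores[kmer] = fg_freq / bg_freq
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```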
Downstream Analysis: Differential binding
• May want to compare samples across conditions or time series
• Simply overlapping peak sets is too simplistic
• DiffBind: R Bioconductor package
- ChIP-seq equivalent of DESeq and edgeR
- Extensive documentation and tutorials
- http://bioconductor.org/packages/release/bioc/html/DiffBind.html
Conclusions
• There is no “correct way” to analyse ChIP-seq
- Depends on biological system and question
- Affected by number of samples and experimental setup
- Defined by your experience and skills
• Two packages that do a lot of steps:
- HOMER
- SeqMonk
- Lots of YouTube walkthrough videos
- https://youtu.be/LcMVb4zQBXI and https://youtu.be/Cy13yV6Rf6s
Further Reading
• Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data
- Bailey et al. PLOS Comp Bio (2013)
• ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia
- Landt et al. Genome Research (2012)
• ChIP–seq: advantages and challenges of a maturing technology
- Park. Nature Reviews Genetics (2009)
• http://seqanswers.com and http://biostars.org
Questions?
phil.ewels@scilifelab.se
Slides: http://tiny.cc/chipseq
