Luca Cozzuto
Sarah Bonnin
Bioinformatics Core Facility
Additional topics (parsing methods) for biologists with a focus on ChIP-seq data
ChIP-Seq experiment
[Figure: ChIP-Seq experiment diagram] By Jkwchui - Cell diagram adapted from LadyOfHats' Animal Cell diagram. Information based on Illumina data sheet, as well as ChIP and immunoprecipitation articles & references., CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=17890854
Raw data, reads in FASTQ format
@HWI-ST227:389:C4WA2ACXX:7:1204:2272:59979
GGAGGAAGGTCCTCGCTCCTCTTTCATATAAGGGAAATGGCTGAAT
+
FFFFHHHHHHJIJJJJJJJJIJJJIGIGIGGIJJIJIJJJJJJIII
@HWI-ST227:389:C4WA2ACXX:7:1205:15214:42893
GAGGATCCCAGGGAGGAAGGTCCTCGCTCCTCTTTCATCTAAGGGA
+
12BAFB?A:3<AE1@<FF;1*@EG*)?0?DBD>9BF9B*?######
@HWI-ST227:389:C4WA2ACXX:8:2208:2467:44624
AAAGAGGAGAGAGGACCATCCTCCCTGGGATCCTCAGAAGTCTACT
+
BDDA:DB?2AA@FC>F?EEGC<FED>GFD;?GBB?<?F99*/9?9?
Each record has four lines: a header starting with @, the sequence, a separator line (+) and the per-base quality string.
Counting FASTQ reads (the slow way)
zcat B7_H3K4me1.fastq.gz | awk '{num++}END{print num/4}'
41103741
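A quicker variant of the count above, as a minimal sketch: `wc -l` only counts newlines, so it is usually faster than running an awk action on every line. The function name `count_fastq_reads` is hypothetical, and the approach assumes that every record spans exactly four lines (no wrapped sequences).

```shell
# count_fastq_reads FILE.fastq.gz
# Prints the number of reads, assuming exactly four lines per record.
count_fastq_reads() {
    echo $(( $(gzip -dc "$1" | wc -l) / 4 ))
}
```

Usage: `count_fastq_reads B7_H3K4me1.fastq.gz`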
Phred quality score
• Q = -10 log10(p)
• p = probability that the corresponding base call is incorrect
• Example: p = 0.001 means a quality of 30
Quality characters and their scores (Q = ASCII code - 33):
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
0.........................26...31........41
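The two conversions above can be sketched in a few lines of shell and awk. The function names `phred_to_prob` and `decode_quality` are hypothetical, and the decoder assumes the Phred+33 encoding shown in the scale above (`!` = 0).

```shell
# phred_to_prob Q: error probability p = 10^(-Q/10).
phred_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%g\n", 10 ^ (-q / 10) }'
}

# decode_quality STRING: print the Phred score of each character,
# assuming Phred+33 encoding (ASCII code minus 33).
decode_quality() {
    printf '%s\n' "$1" | awk '
        BEGIN {
            # Lookup table from printable character to ASCII code.
            for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i
        }
        {
            out = ""
            for (j = 1; j <= length($0); j++) {
                q = ord[substr($0, j, 1)] - 33
                out = out (j > 1 ? " " : "") q
            }
            print out
        }'
}
```

Usage: `phred_to_prob 30` gives the probability for Q = 30, and `decode_quality 'FFFFHHHH'` lists the scores of a quality string.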
Analyzing the quality (FastQC)
[Figure: example FastQC quality plots, GOOD vs. BAD]
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Alignment
• Align 20-30 million reads per sample to the reference genome.
• The reference genome can be very long (the human genome is about 3 gigabases).
• We need ultra-fast mappers:
  • Bowtie (http://bowtie-bio.sourceforge.net/index.shtml)
  • BWA (http://bio-bwa.sourceforge.net/)
  • GEM (https://github.com/smarco/gem3-mapper)
  • …
Reference genome (Fasta file)
>1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
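Header: the line starting with > names the sequence; the lines below it hold the bases (N = unknown base).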
Reference genome (Fasta file)
zcat GRCm38.primary_assembly.genome.fa.gz | grep ">"
>chr1 1
>chr2 2
>chr3 3
>chr4 4
>chr5 5
>chr6 6
>chr7 7
>chr8 8
>chr9 9
>chr10 10
>chr11 11
>chr12 12
>chr13 13
>chr14 14
>chr15 15
>chr16 16
>chr17 17
>chr18 18
>chr19 19
>chrX X
>chrY Y
>chrM MT
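Beyond listing the headers, a short awk sketch can also report the length of every sequence. The function name `fasta_lengths` is hypothetical; it assumes uncompressed FASTA on standard input or as file arguments, and it takes the first word of each header as the sequence name.

```shell
# fasta_lengths [FILE...]: print "name<TAB>length" for each FASTA record.
fasta_lengths() {
    awk '
        /^>/ {
            if (name != "") print name "\t" len   # flush previous record
            name = substr($1, 2)                  # header word without ">"
            len = 0
            next
        }
        { len += length($0) }                     # sequence line
        END { if (name != "") print name "\t" len }
    ' "$@"
}
```

Usage: `zcat GRCm38.primary_assembly.genome.fa.gz | fasta_lengths`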
Annotations (GTF format)
#!genome-build GRCm38.p5
#!genome-version GRCm38
#!genome-date 2012-01
#!genome-build-accession NCBI:GCA_000001635.7
#!genebuild-last-updated 2017-01
1 havana gene 3073253 3074322 . + . gene_id
"ENSMUSG00000102693"; gene_version "1"; gene_name "4933401J01Rik";
gene_source "havana"; gene_biotype "TEC"; havana_gene "OTTMUSG00000049935";
havana_gene_version "1";
https://www.ensembl.org/info/website/upload/gff.html
Header: lines starting with #! describe the genome build.
Each record has nine tab-separated fields:
Reference sequence // Source // Feature (gene, transcript, exon, etc.) //
Start // End // Score // Strand // Frame (0, 1 or 2) //
Attributes (key "value" pairs separated by ";")
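As an illustration of parsing the attribute column, the sketch below pulls gene_id and gene_name out of every gene record. The function name `gtf_gene_table` and the awk helper `attr` are hypothetical, and the code assumes that attribute values never contain a ";" themselves.

```shell
# gtf_gene_table [FILE...]: print chrom, start, end, strand, gene_id and
# gene_name (tab-separated) for every "gene" feature of a GTF file.
gtf_gene_table() {
    awk -F '\t' '
        # Return the value of attribute key from a GTF attribute string.
        function attr(s, key,    n, i, parts, a) {
            n = split(s, parts, ";")
            for (i = 1; i <= n; i++) {
                a = parts[i]
                sub(/^ +/, "", a)                 # trim leading blanks
                if (index(a, key " \"") == 1) {   # key followed by an opening quote
                    a = substr(a, length(key) + 3)
                    sub(/"$/, "", a)              # drop the closing quote
                    return a
                }
            }
            return ""
        }
        /^#/ { next }                             # skip header lines
        $3 == "gene" {
            print $1 "\t" $4 "\t" $5 "\t" $7 "\t" attr($9, "gene_id") "\t" attr($9, "gene_name")
        }
    ' "$@"
}
```

Usage: `gtf_gene_table annotation.gtf`, where the file name is a placeholder.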
Alignment
• Align 20-30 million reads per sample to the reference genome.
• The reference genome has to be indexed first.
• Repetitive sequences are a problem: a read can match several locations equally well.
• PCR artifacts are a problem: duplicate reads have to be marked.
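The indexing and mapping steps listed above could be run roughly as follows. This is a minimal sketch that assumes bowtie2 and samtools are installed; the function names and argument order are hypothetical, 8 threads is an arbitrary choice, and duplicate marking is left out.

```shell
# build_index GENOME.fa PREFIX: index the reference genome (done once).
build_index() {
    bowtie2-build "$1" "$2"
}

# align_single_end PREFIX READS.fastq.gz OUT.bam
# Map single-end reads, sort the alignments by coordinate, index the BAM.
align_single_end() {
    bowtie2 -p 8 -x "$1" -U "$2" | samtools sort -o "$3" -
    samtools index "$3"
}
```

Usage: `build_index genome.fa bowtie2genome`, then `align_single_end bowtie2genome B7_H3K4me1.fastq.gz B7_H3K4me1.bam`. The names genome.fa and B7_H3K4me1.bam are placeholders.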
Alignment (SAM / BAM format)
@HD VN:1.5 SO:coordinate
@SQ SN:1 LN:195471971
@SQ SN:2 LN:182113224
@SQ SN:3 LN:160039680
…
@PG ID:bowtie2 PN:bowtie2 VN:2.3.2
CL:"/usr/local/bin/bowtie2-align-s --wrapper basic-0 --non-deterministic -x
bowtie2genome -p 8 -U B7_H3K4me1.fastq.gz"
NS500454:71:H3TV7BGXY:4:22608:3293:16569 16 1 3000101 7
75M * 0 0
TTTTTTTTTTTTTTTTTTTTTTTGGTTTTGAGACTATTGATGACTGCCTCTATTTCTTTAGGGGAAATGGGACTT
E/EEEAAEEEEEEE6E6EAEE/E6EEE//<6/E/EAEE/EE/E/EE66E6E6EEEEEEE/EAAA/E/EE/AAAAA
MD:Z:25G1G0G46 XG:i:0 NM:i:3 XM:i:3 XN:i:0 XO:i:0 AS:i:-11
XS:i:-20 YT:Z:UU PG:Z:MarkDuplicates
Header
@HD: header line // VN: format version // SO: sorting order of alignments
@SQ: reference sequence dictionary // SN: sequence name // LN: sequence length
@PG: program // ID: program record identifier // PN: program name // VN: program version // CL: command line
https://samtools.github.io/hts-specs/SAMv1.pdf
Alignment
Query name // FLAG // Reference name // Leftmost mapping position //
Mapping quality (here 7, i.e. a probability of about 0.2 that the position is wrong) // CIGAR string //
Reference name of the mate read // Position of the mate // Template length // Sequence // Quality
Optional TAG:TYPE:VALUE fields follow (here MD, NM, AS and others).
In this case FLAG 16 means that the read maps to the reverse strand, so the sequence is shown reverse-complemented.
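The FLAG is a sum of bits, and the mapping quality follows the same Phred scale as base qualities. A minimal sketch of both checks, with hypothetical function names; the bit values used in the usage note come from the SAM specification (4 = unmapped, 16 = reverse strand, 1024 = PCR or optical duplicate).

```shell
# flag_is_set FLAG BIT: succeeds when BIT (e.g. 16) is set in FLAG.
flag_is_set() {
    [ $(( $1 & $2 )) -ne 0 ]
}

# mapq_to_prob MAPQ: probability that the mapping position is wrong,
# p = 10^(-MAPQ/10), printed with two decimals.
mapq_to_prob() {
    awk -v q="$1" 'BEGIN { printf "%.2f\n", 10 ^ (-q / 10) }'
}
```

Usage: `flag_is_set 16 16 && echo "reverse strand"` tests the strand bit, `flag_is_set "$flag" 1024` tests whether a read was marked as a duplicate, and `mapq_to_prob 7` reproduces the p of about 0.2 quoted above.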
Alignment (SAM / BAM format)
https://software.broadinstitute.org/software/igv/
Quality control of the enrichment
https://deeptools.readthedocs.io/en/develop/index.html
Distribution of the signal (wiggle format)
https://deeptools.readthedocs.io/en/develop/index.html
variableStep chrom=chr2
300701 12.5
300702 12.5
300703 12.5
300704 12.5
300705 12.5
...
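Wiggle positions are 1-based, whereas BED-like formats are 0-based with an exclusive end. The sketch below converts variableStep wiggle data to bedGraph lines to make that shift explicit. The function name `wig_to_bedgraph` is hypothetical, and fixedStep blocks are not handled.

```shell
# wig_to_bedgraph [FILE...]: convert variableStep wiggle to bedGraph
# (chrom, 0-based start, exclusive end, value).
wig_to_bedgraph() {
    awk '
        $1 == "variableStep" {
            span = 1                               # default span is one base
            for (i = 2; i <= NF; i++) {
                split($i, kv, "=")
                if (kv[1] == "chrom") chrom = kv[2]
                if (kv[1] == "span")  span  = kv[2]
            }
            next
        }
        # Data line: 1-based position followed by the signal value.
        chrom != "" && $1 ~ /^[0-9]+$/ {
            print chrom "\t" ($1 - 1) "\t" ($1 - 1 + span) "\t" $2
        }
    ' "$@"
}
```

Usage: `wig_to_bedgraph signal.wig > signal.bedgraph`, with placeholder file names.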
Peak calling
https://software.broadinstitute.org/software/igv/
Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, Liu XS. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9(9):R137.
The fragment size can be inferred from the data and used to extend the reads, which gives more reliable peaks (i.e. binding sites). Forward- and reverse-strand reads pile up on either side of the binding site, so the peak lies in the middle.
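Read extension can be illustrated on reads stored as 6-field BED. This sketch is not the MACS implementation; it only shows the idea of stretching each read to the fragment size in the direction of its strand. The function name `extend_reads` is hypothetical, the fragment size is supplied by the user, and fragments are clipped at position 0 but not at chromosome ends.

```shell
# extend_reads FRAGMENT_SIZE [BED6...]
# Extend each read to FRAGMENT_SIZE bases along its strand.
extend_reads() {
    d=$1; shift
    awk -v d="$d" 'BEGIN { FS = OFS = "\t" }
        $6 == "+" { $3 = $2 + d }                         # extend to the right
        $6 == "-" { $2 = $3 - d; if ($2 < 0) $2 = 0 }     # extend to the left
        { print }
    ' "$@"
}
```

Usage: `extend_reads 200 reads.bed > fragments.bed`, where 200 and the file names are placeholders.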
Peak coordinates (Bed format)
https://genome.ucsc.edu/FAQ/FAQformat.html#format1
Chromosome // Start // End (3-field BED)
+ Name // Score // Strand (6-field BED)
+ thickStart // thickEnd // itemRgb
+ blockCount // blockSizes // blockStarts (12-field BED)
track name=chipseq description="IP of Ring1B TF"
1 3444977 3445551 peak_1 31 .
1 4773116 4774454 peak_2 114 .
1 4774530 4777431 peak_3 108 .
1 4786374 4786850 peak_4 80 .
1 4806806 4807288 peak_5 66 .
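Because BED starts are 0-based and ends are exclusive, the width of a peak is simply end minus start. A small sketch that lists name, width and score for each peak; the function name `peak_widths` is hypothetical, and track, browser and comment lines are skipped.

```shell
# peak_widths [BED...]: print "name<TAB>width<TAB>score" for each peak.
peak_widths() {
    awk '
        $1 == "track" || $1 == "browser" || /^#/ { next }   # skip non-data lines
        NF >= 5 { print $4 "\t" ($3 - $2) "\t" $5 }
    ' "$@"
}
```

Usage: `peak_widths peaks.bed | sort -k2,2nr | head` lists the widest peaks first; the file name is a placeholder.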
bigBed and bigWig format
https://genome.ucsc.edu/goldenpath/help/bigWig.html
https://genome.ucsc.edu/goldenpath/help/bigBed.html
Indexed binary formats: bigBed is generated from BED files and bigWig from wiggle files.
Annotating peaks
https://bedtools.readthedocs.io/en/latest/
Quinlan AR. BEDTools: The Swiss-Army Tool for Genome Feature Analysis. Curr Protoc Bioinformatics. 2014 Sep 8;47:11.12.1-34
Crossing information from GTF files and BED files (BEDTools)
intersectBed -a Peaks/B7_H3K4me1_vs_B7_input-macs-narrow--q_0_peaks.bed \
-b gencode.vM17.annotation.gtf \
-wa -wb -nonamecheck |
awk '{if ($9 == "gene") print }'
chr1 3444977 3445551 peak_15 31 .
chr1 HAVANA gene 3205901 3671498 . -
. gene_id "ENSMUSG00000051951.5"; gene_type
"protein_coding"; gene_name "Xkr4"; level 2; havana_gene
"OTTMUSG00000026353.2";
Finding the gene closest to each peak (closestBed; -d reports the distance)
awk '{if ($3 == "gene") print }' gencode.vM17.annotation.gtf |
closestBed -a Peaks/B7_H3K4me1_vs_B7_input-macs-narrow--q_0_peaks.bed \
-d -b -
