Course on parsing methods for biologists with a focus on ChIP-seq data

Luca Cozzuto
Sarah Bonnin
Bioinformatics Core Facility
Additional topics (parsing methods) for biologists with a focus on ChIP-seq data

ChIP-Seq experiment
By Jkwchui - Cell diagram adapted from LadyOfHats' Animal Cell diagram. Information based on Illumina data sheet, as well as ChIP and immunoprecipitation articles & references, CC BY-SA 3.0, https://commons.wikimedia.org/w/index.php?curid=17890854

@HWI-ST227:389:C4WA2ACXX:7:1204:2272:59979
GGAGGAAGGTCCTCGCTCCTCTTTCATATAAGGGAAATGGCTGAAT
+
FFFFHHHHHHJIJJJJJJJJIJJJIGIGIGGIJJIJIJJJJJJIII
@HWI-ST227:389:C4WA2ACXX:7:1205:15214:42893
GAGGATCCCAGGGAGGAAGGTCCTCGCTCCTCTTTCATCTAAGGGA
+
12BAFB?A:3<AE1@<FF;1*@EG*)?0?DBD>9BF9B*?######
@HWI-ST227:389:C4WA2ACXX:8:2208:2467:44624
AAAGAGGAGAGAGGACCATCCTCCCTGGGATCCTCAGAAGTCTACT
+
BDDA:DB?2AA@FC>F?EEGC<FED>GFD;?GBB?<?F99*/9?9?
Raw data, reads in FASTQ format

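Each FASTQ record spans four lines: header (starts with "@"), sequence, separator ("+") and quality (one character per base).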
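As a minimal sketch of how the four-line FASTQ layout can be parsed, the awk program below uses the line number modulo 4 to tell header, sequence and quality lines apart. The helper name demo_fastq is hypothetical; the two records are copied from the example above.

```shell
# Hypothetical helper that prints two of the records shown above.
demo_fastq() {
cat <<'EOF'
@HWI-ST227:389:C4WA2ACXX:7:1204:2272:59979
GGAGGAAGGTCCTCGCTCCTCTTTCATATAAGGGAAATGGCTGAAT
+
FFFFHHHHHHJIJJJJJJJJIJJJIGIGIGGIJJIJIJJJJJJIII
@HWI-ST227:389:C4WA2ACXX:7:1205:15214:42893
GAGGATCCCAGGGAGGAAGGTCCTCGCTCCTCTTTCATCTAAGGGA
+
12BAFB?A:3<AE1@<FF;1*@EG*)?0?DBD>9BF9B*?######
EOF
}

# NR % 4 tells which line of a record awk is reading:
# 1 = header, 2 = sequence, 3 = separator, 0 = quality.
summary=$(demo_fastq | awk '
    NR % 4 == 1 { id = substr($0, 2) }          # drop the leading "@"
    NR % 4 == 2 { print id "\t" length($0) }    # read name and read length
')
echo "$summary"
```

This prints one line per read with its name and length. It assumes that every record occupies exactly four lines, which holds for the reads shown here.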

zcat B7_H3K4me1.fastq.gz | awk '{num++}END{print num/4}'
41103741
Counting fastq reads (the slow way)

Phred quality score.
• Q = -10 log10(p)
• p = probability that the corresponding base call is incorrect
• Example: p = 0.001 means a quality of 30
!"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJ
0.........................26...31.........41
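The Phred formula and the character scale above can be tried out with a short shell sketch. The function names phred_from_p and phred_from_char are hypothetical, and the decoding assumes the offset of 33 implied by the scale (the character "!" has ASCII code 33 and quality 0).

```shell
# Q = -10 * log10(p); awk only provides the natural logarithm,
# so log10(p) is computed as log(p) / log(10).
phred_from_p() {
    awk -v p="$1" 'BEGIN { printf "%.0f\n", -10 * log(p) / log(10) }'
}

# Decode one quality character, ASSUMING the offset of 33 used by the
# scale above: quality = ASCII code - 33.
phred_from_char() {
    code=$(printf '%d' "'$1")   # a leading quote makes printf return the character code
    echo $((code - 33))
}

phred_from_p 0.001     # example from the slide
phred_from_char 'J'    # last character of the scale
phred_from_char '#'    # character that ends the second read shown earlier
```

The three calls print 30, 41 and 2. Older Illumina pipelines used a different offset, so the encoding should be checked before decoding real data.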

Analyzing the quality (FASTQC)
[Figure: example FastQC reports for a GOOD and a BAD sample]
https://www.bioinformatics.babraham.ac.uk/projects/fastqc/

Alignment
• Align 20-30 million reads per sample to the reference genome.
• The reference genome can be very long (human is about 3 gigabases).
• We need ultra-fast mappers:
  • Bowtie (http://bowtie-bio.sourceforge.net/index.shtml)
  • BWA (http://bio-bwa.sourceforge.net/)
  • GEM (https://github.com/smarco/gem3-mapper)
  • …

Reference genome (Fasta file)
>1 dna:chromosome chromosome:GRCm38:1:1:195471971:1 REF
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN

Header: the line starting with ">" names the sequence; the sequence itself follows on the next lines.

zcat GRCm38.primary_assembly.genome.fa.gz | grep ">"
>chr1 1
>chr2 2
>chr3 3
>chr4 4
>chr5 5
>chr6 6
>chr7 7
>chr8 8
>chr9 9
>chr10 10
>chr11 11
>chr12 12
>chr13 13
>chr14 14
>chr15 15
>chr16 16
>chr17 17
>chr18 18
>chr19 19
>chrX X
>chrY Y
>chrM MT
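Beyond listing headers with grep, the same file layout allows per-sequence lengths to be computed with awk. The sketch below runs on a tiny made-up FASTA (the helper toy_fasta and its sequences are hypothetical, not taken from GRCm38).

```shell
# Hypothetical toy FASTA with two short sequences.
toy_fasta() {
cat <<'EOF'
>chrA toy sequence
ACGTACGTAC
GTAC
>chrB another toy sequence
NNNNACGT
EOF
}

lengths=$(toy_fasta | awk '
    /^>/ { if (name != "") print name "\t" len    # flush the previous sequence
           name = substr($1, 2); len = 0; next }  # first word of the header, without ">"
         { len += length($0) }                    # sequence lines add to the length
    END  { if (name != "") print name "\t" len }
')
echo "$lengths"
```

On a real genome the input would come from zcat, as in the command above, instead of from toy_fasta.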

Annotations (GTF format)
#!genome-build GRCm38.p5
#!genome-version GRCm38
#!genome-date 2012-01
#!genome-build-accession NCBI:GCA_000001635.7
#!genebuild-last-updated 2017-01
1 havana gene 3073253 3074322 . + . gene_id
"ENSMUSG00000102693"; gene_version "1"; gene_name "4933401J01Rik";
gene_source "havana"; gene_biotype "TEC"; havana_gene "OTTMUSG00000049935";
havana_gene_version "1";
https://www.ensembl.org/info/website/upload/gff.html

Header: lines starting with "#!" describe the genome build and the gene build.
Fields: reference sequence // source // feature (gene, transcript, exon, etc.) //
start // end // score // strand // frame (0, 1, 2) //
attributes separated by ";"
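The field layout described above can be parsed with awk by splitting on tabs and then splitting the ninth column on ";". This is a sketch only: the helper gtf_line is hypothetical and prints a shortened copy of the example record, assuming tab-separated columns as in a real GTF file.

```shell
# Hypothetical helper: a shortened, tab-separated copy of the example record.
gtf_line() {
    printf '1\thavana\tgene\t3073253\t3074322\t.\t+\t.\t'
    printf 'gene_id "ENSMUSG00000102693"; gene_version "1"; gene_name "4933401J01Rik"; gene_biotype "TEC";\n'
}

gene_table=$(gtf_line | awk -F '\t' '
    $0 !~ /^#/ && $3 == "gene" {              # skip header lines, keep gene features
        id = ""; name = ""
        n = split($9, attrs, /; */)           # one element per attribute
        for (i = 1; i <= n; i++) {
            split(attrs[i], kv, " ")          # key, then quoted value
            gsub(/"/, "", kv[2])              # strip the double quotes
            if (kv[1] == "gene_id")   id = kv[2]
            if (kv[1] == "gene_name") name = kv[2]
        }
        print id "\t" name "\t" $1 ":" $4 "-" $5 "\t" $7
    }')
echo "$gene_table"
```

The output has four columns: gene identifier, gene name, coordinates and strand. Attribute values that contain spaces would need a more careful split.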

Alignment
• The reference genome has to be indexed.
• Problems with repetitive sequences (a read can match several locations equally well).
• Problems with PCR artifacts (handled by marking duplicates).

Alignment (SAM / BAM format)
@HD VN:1.5 SO:coordinate
@SQ SN:1 LN:195471971
@SQ SN:2 LN:182113224
@SQ SN:3 LN:160039680
…
@PG ID:bowtie2 PN:bowtie2 VN:2.3.2
CL:"/usr/local/bin/bowtie2-align-s --wrapper basic-0 --non-deterministic -x
bowtie2genome -p 8 -U B7_H3K4me1.fastq.gz"
NS500454:71:H3TV7BGXY:4:22608:3293:16569 16 1 3000101 7
75M * 0 0
TTTTTTTTTTTTTTTTTTTTTTTGGTTTTGAGACTATTGATGACTGCCTCTATTTCTTTAGGGGAAATGGGACTT
E/EEEAAEEEEEEE6E6EAEE/E6EEE//<6/E/EAEE/EE/E/EE66E6E6EEEEEEE/EAAA/E/EE/AAAAA
MD:Z:25G1G0G46 XG:i:0 NM:i:3 XM:i:3 XN:i:0 XO:i:0 AS:i:-11
XS:i:-20 YT:Z:UU PG:Z:MarkDuplicates
https://samtools.github.io/hts-specs/SAMv1.pdf

Header
@HD: header line // VN: format version // SO: sorting order of alignments
@SQ: reference sequence dictionary // SN: sequence name // LN: sequence length
@PG: program // ID: program record identifier // PN: program name // VN: program version // CL: command line

Alignment
Query name // FLAG // reference name // leftmost mapping position //
mapping quality (7, i.e. p ≈ 0.2 that the mapping position is wrong) // CIGAR string // reference name of the mate read //
position of the mate // template length // sequence // quality
In this case FLAG 16 means "read is reverse complemented".
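The FLAG and mapping quality of the example alignment can be decoded with shell arithmetic. The sketch below tests three FLAG bits defined in the SAM specification (4 = unmapped, 16 = reverse strand, 1024 = PCR or optical duplicate) and converts the mapping quality into an error probability. The variable names are illustrative.

```shell
flag=16   # FLAG of the example alignment
mapq=7    # mapping quality of the example alignment

# Test individual FLAG bits with a bitwise AND.
if [ $((flag & 4)) -ne 0 ];    then mapped=no;     else mapped=yes;   fi
if [ $((flag & 16)) -ne 0 ];   then strand='-';    else strand='+';   fi
if [ $((flag & 1024)) -ne 0 ]; then duplicate=yes; else duplicate=no; fi

# MAPQ = -10 * log10(p)  =>  p = 10^(-MAPQ / 10), written with exp() and log().
p_wrong=$(awk -v q="$mapq" 'BEGIN { printf "%.2f\n", exp(-q / 10 * log(10)) }')

echo "mapped=$mapped strand=$strand duplicate=$duplicate p_wrong=$p_wrong"
```

For the example read this prints mapped=yes strand=- duplicate=no p_wrong=0.20, which agrees with the p of about 0.2 quoted above.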

https://software.broadinstitute.org/software/igv/

Quality control of the enrichment
https://deeptools.readthedocs.io/en/develop/index.html

Distribution of the signal (wiggle format)
https://deeptools.readthedocs.io/en/develop/index.html
variableStep chrom=chr2
300701 12.5
300702 12.5
300703 12.5
300704 12.5
300705 12.5
...
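A variableStep wiggle track like the one above can be turned into four-column bedGraph lines with awk. This is a sketch under the UCSC coordinate conventions (wiggle positions are 1-based, bedGraph intervals are 0-based with the end excluded); the helper name wig is hypothetical and the optional span parameter defaults to 1.

```shell
# Hypothetical helper printing the first lines of the track shown above.
wig() {
cat <<'EOF'
variableStep chrom=chr2
300701 12.5
300702 12.5
300703 12.5
EOF
}

bedgraph=$(wig | awk '
    $1 == "variableStep" {                  # declaration line: read chrom= and span=
        span = 1
        for (i = 2; i <= NF; i++) {
            split($i, kv, "=")
            if (kv[1] == "chrom") chrom = kv[2]
            if (kv[1] == "span")  span  = kv[2]
        }
        next
    }
    { printf "%s\t%d\t%d\t%s\n", chrom, $1 - 1, $1 - 1 + span, $2 }
')
echo "$bedgraph"
```

The first output line is chr2, 300700, 300701, 12.5: the 1-based position 300701 becomes the 0-based interval starting at 300700.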

Peak calling
https://software.broadinstitute.org/software/igv/

Peak calling
Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nusbaum C, Myers RM, Brown M, Li W, Liu XS. Model-based analysis of ChIP-Seq (MACS). Genome Biol. 2008;9(9):R137.
The fragment size can be inferred from the data and used to extend the reads,
which gives more reliable peaks (i.e. binding sites). The true peak lies midway
between the read pile-ups on the forward and on the reverse strand.
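The idea of extending reads to the fragment size can be illustrated with awk on reads stored as six-column BED lines. This is only a sketch of the extension step, not the MACS algorithm itself: the fragment size of 200 and the two toy reads are hypothetical.

```shell
frag=200   # hypothetical fragment size; a peak caller would estimate it from the data

# Two hypothetical reads, one per strand, in tab-separated BED6 format.
reads() {
    printf 'chr1\t1000\t1050\tread_a\t0\t+\n'
    printf 'chr1\t1150\t1200\tread_b\t0\t-\n'
}

extended=$(reads | awk -v frag="$frag" 'BEGIN { FS = OFS = "\t" }
    $6 == "+" { $3 = $2 + frag }                       # forward read: extend to the right
    $6 == "-" { $2 = $3 - frag; if ($2 < 0) $2 = 0 }   # reverse read: extend to the left
    { print }')
echo "$extended"
```

Both toy reads end up covering chr1 from 1000 to 1200, so after extension the signal from the two strands piles up over the same region.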

Peak coordinates (Bed format)
https://genome.ucsc.edu/FAQ/FAQformat.html#format1
Chromosome // Start // End (3 fields BED)
+ Name // Score // Strand (6 fields BED)
+ thickStart // thickEnd // itemRgb
+ blockCount // blockSizes // blockStarts (12 fields BED)
track name=chipseq description="IP of Ring1B TF"
1 3444977 3445551 peak_1 31 .
1 4773116 4774454 peak_2 114 .
1 4774530 4777431 peak_3 108 .
1 4786374 4786850 peak_4 80 .
1 4806806 4807288 peak_5 66 .
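Because BED is a plain column format, simple questions about peaks can be answered with awk. The sketch below skips the track line, keeps peaks with a score of at least 100 (an arbitrary threshold chosen for illustration) and prints their length, which in BED is end minus start because the start is 0-based and the end is excluded. The helper name peaks is hypothetical.

```shell
# Hypothetical helper printing the peak file shown above.
peaks() {
cat <<'EOF'
track name=chipseq description="IP of Ring1B TF"
1 3444977 3445551 peak_1 31 .
1 4773116 4774454 peak_2 114 .
1 4774530 4777431 peak_3 108 .
1 4786374 4786850 peak_4 80 .
1 4806806 4807288 peak_5 66 .
EOF
}

strong=$(peaks | awk -v min_score=100 '
    $1 == "track"   { next }                      # skip the track definition line
    $5 >= min_score { print $4 "\t" ($3 - $2) }   # peak name and length in bases
')
echo "$strong"
```

With this threshold only peak_2 (1338 bases) and peak_3 (2901 bases) are reported.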

bigBed and bigWig format
https://genome.ucsc.edu/goldenpath/help/bigWig.html
https://genome.ucsc.edu/goldenpath/help/bigBed.html
Indexed binary formats generated from BED and wiggle files.

Annotating peaks
https://bedtools.readthedocs.io/en/latest/
Quinlan AR. BEDTools: The Swiss-Army Tool for Genome Feature Analysis. Curr Protoc Bioinformatics. 2014 Sep 8;47:11.12.1-34
Intersecting information from GTF and BED files (BEDTools)
intersectBed -a Peaks/B7_H3K4me1_vs_B7_input-macs-narrow--q_0_peaks.bed \
  -b gencode.vM17.annotation.gtf \
  -wa -wb -nonamecheck |
  awk '{if ($9 == "gene") print }'

chr1 3444977 3445551 peak_15 31 .
chr1 HAVANA gene 3205901 3671498 . -
. gene_id "ENSMUSG00000051951.5"; gene_type
"protein_coding"; gene_name "Xkr4"; level 2; havana_gene
"OTTMUSG00000026353.2";
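What the intersection reports for this peak and gene can be checked by hand with awk. The sketch below is not how BEDTools is implemented; it only shows the interval arithmetic, converting the BED start (0-based) to the 1-based, end-inclusive convention of GTF before comparing. The variable names are illustrative and the coordinates are those of the output above.

```shell
peak='chr1 3444977 3445551 peak_15'   # BED: 0-based start, end excluded
gene='chr1 3205901 3671498 Xkr4'      # GTF: 1-based start, end included

overlap=$(printf '%s\n%s\n' "$peak" "$gene" | awk '
    NR == 1 { chrom = $1; p_start = $2 + 1; p_end = $3; next }   # BED to 1-based inclusive
    $1 == chrom && p_start <= $3 && p_end >= $2 {                # the two intervals overlap
        lo = (p_start > $2) ? p_start : $2
        hi = (p_end   < $3) ? p_end   : $3
        print hi - lo + 1                                        # overlapping bases
    }')
echo "$overlap"
```

The result is 574, the full length of the peak, because the peak lies entirely inside the Xkr4 gene.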

Annotating peaks
awk '{if ($3 == "gene") print }' gencode.vM17.annotation.gtf |
  closestBed -a Peaks/B7_H3K4me1_vs_B7_input-macs-narrow--q_0_peaks.bed \
  -d -b -
