epigenomics_2020_lecture1_epigenomics_2020_lecture1.pdf

Epigenomics data analysis
11/30/2020 – 12/18/2020
Qi Sun, William Lai and Jeff Glaubitz
Bioinformatics Facility & Epigenomics Facility

Genome feature annotation & Read enrichment pattern
GFF files: genome features (gff3 feature types)
• gene
• CDS
• mRNA
• exon
• five_prime_UTR
• three_prime_UTR
• rRNA
• tRNA
• ncRNA
• tmRNA
• transcript
• mobile_genetic_element
• origin_of_replication
• promoter
• repeat_region

Genome feature annotation & Epigenomics data analysis
Week 1: Genome features
• File formats
• Public databases
• Genome browser
• Bedtools
• Deeptools
• Homer
Week 2: Peak calling
• Peak calling
• Data QC
Week 3: Integrated analysis

File formats listed on UCSC web site
https://genome.ucsc.edu/FAQ/FAQformat.html#format3
• Axt format
• BAM format
• BED format
• BED detail format
• bedGraph format
• barChart and bigBarChart format
• bigBed format
• bigGenePred table format
• bigPsl table format
• bigMaf table format
• bigChain table format
• bigNarrowPeak table format
• bigWig format
• Chain format
• CRAM format
• GenePred table format
• GFF format
• GTF format
• HAL format
• Hic format
• Interact and bigInteract format
• MAF format
• Microarray format
• Net format
• Personal Genome SNP format
• PSL format
• VCF format
• WIG format

File formats listed on UCSC Genome Browser web site
https://genome.ucsc.edu/FAQ/FAQformat.html#format3
General format: Axt, BAM, BED, BED detail, bedGraph, barChart & bigBarChart, bigBed, bigGenePred table, bigPsl table, bigMaf table, bigChain table, bigNarrowPeak table, bigWig, Chain, CRAM, GenePred table, GFF, GTF, HAL, Hic, Interact & bigInteract, MAF, Microarray, Net, Personal Genome SNP, PSL, VCF, WIG
Encode format: broadPeak, gappedPeak, narrowPeak, pairedTagAlign, peptideMapping, RNA elements, tagAlign
Sequence: 2bit, fasta, fastQ, nib

GFF2, GFF3 and GTF
9-column tab-delimited text file
Column  Name
1       Chromosome/contig
2       Source of annotation
3       Feature type
4       Start position
5       End position
6       Score
7       Strand ( + or -)
8       Phase (0, 1 or 2)
9       Attributes or Group
Gff2 example:
chr1 Maker CDS 380 401 . + 0 gene_001
chr1 Maker start_codon 380 382 . + 0 gene_001
chr1 Maker stop_codon 708 710 . + 0 gene_001
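The 9-column layout can be read with a few lines of Python. This is a minimal sketch, not part of the lecture material: the `GffRecord` class and `parse_gff_line` function are hypothetical names, and the example line is the GFF2 CDS record from the slide written with real tab characters.

```python
from dataclasses import dataclass


@dataclass
class GffRecord:
    """One GFF/GTF line; field names are illustrative, not an official API."""
    seqid: str         # column 1: chromosome/contig
    source: str        # column 2: source of annotation
    feature_type: str  # column 3: feature type
    start: int         # column 4: 1-based, inclusive
    end: int           # column 5: 1-based, inclusive
    score: str         # column 6: "." when absent
    strand: str        # column 7: "+" or "-"
    phase: str         # column 8: 0, 1, 2 or "."
    attributes: str    # column 9: attributes (GFF3/GTF) or group (GFF2)


def parse_gff_line(line: str) -> GffRecord:
    """Split one tab-delimited GFF line into its nine columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-delimited columns, got {len(fields)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = fields
    return GffRecord(seqid, source, ftype, int(start), int(end),
                     score, strand, phase, attrs)


record = parse_gff_line("chr1\tMaker\tCDS\t380\t401\t.\t+\t0\tgene_001")
# GFF coordinates are 1-based and inclusive, so length = end - start + 1.
cds_length = record.end - record.start + 1
print(record.feature_type, record.start, record.end, cds_length)
```

Only the start and end columns are converted to integers; score and phase stay as text because "." is a legal value for both.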

GFF2 vs GTF
Gff2 example (column 3 is the feature type, column 9 is the group):
chr1 Maker start_codon 380 382 . + 0 gene_001
chr1 Maker stop_codon 708 710 . + 0 gene_001
Feature type:
GFF2: "CDS", "start_codon", "stop_codon" and "exon"
GTF: All GFF2 types plus optional types: 5UTR, 3UTR, inter, inter_CNS, and intron_CNS
Group:
GFF2: Single group attribute, not able to represent multiple splicing forms of the same gene
GTF: Two mandatory attributes: gene_id "gene_001"; transcript_id "gene_001.1"

GFF3 vs GTF
GFF3 example:
chr1 Maker gene 1000 9000 . + . ID=gene00001;Name=EDEN
chr1 Maker mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1
chr1 Maker exon 1050 1500 . + . ID=exon00002;Parent=mRNA00001
chr1 Maker CDS 1201 1500 . + 0 ID=cds00001;Parent=mRNA00001
chr1 Maker CDS 3000 3902 . + 0 ID=cds00001;Parent=mRNA00001
GFF3: Two key attributes: ID and Parent (a child feature's Parent value is the ID of its parent feature; top-level features such as gene have no Parent)
GTF: Two mandatory attributes: gene_id "gene0001"; transcript_id "mRNA00001"
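The ID/Parent relationship in the GFF3 example can be followed programmatically. The sketch below is illustrative only: `parse_attributes` and `children_by_parent` are hypothetical helper names, and the input is the five GFF3 lines from the slide written with tab separators.

```python
from collections import defaultdict

# The GFF3 example from the slide, tab-delimited.
GFF3_LINES = [
    "chr1\tMaker\tgene\t1000\t9000\t.\t+\t.\tID=gene00001;Name=EDEN",
    "chr1\tMaker\tmRNA\t1050\t9000\t.\t+\t.\tID=mRNA00001;Parent=gene00001;Name=EDEN.1",
    "chr1\tMaker\texon\t1050\t1500\t.\t+\t.\tID=exon00002;Parent=mRNA00001",
    "chr1\tMaker\tCDS\t1201\t1500\t.\t+\t0\tID=cds00001;Parent=mRNA00001",
    "chr1\tMaker\tCDS\t3000\t3902\t.\t+\t0\tID=cds00001;Parent=mRNA00001",
]


def parse_attributes(column9):
    """Turn 'ID=x;Parent=y' into {'ID': 'x', 'Parent': 'y'}."""
    attrs = {}
    for pair in column9.strip().split(";"):
        if pair:
            key, _, value = pair.partition("=")
            attrs[key] = value
    return attrs


def children_by_parent(lines):
    """Map each Parent ID to the (type, start, end) of its child features."""
    children = defaultdict(list)
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip comments and blank lines
        fields = line.rstrip("\n").split("\t")
        attrs = parse_attributes(fields[8])
        # A feature may list several parents separated by commas.
        for parent in attrs.get("Parent", "").split(","):
            if parent:
                children[parent].append((fields[2], int(fields[3]), int(fields[4])))
    return dict(children)


tree = children_by_parent(GFF3_LINES)
for parent_id, kids in tree.items():
    print(parent_id, kids)
```

The gene line has no Parent attribute, so it appears only as a key (the parent of the mRNA), never as a child.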

gffread, a tool to convert between GFF3 and GTF
GFF3 -> GTF format:
gffread -E -T -o output.gtf input.gff3
GTF -> GFF3 format:
gffread -E -G -o output.gff3 input.gtf

Lost information when converting from GFF3 to GTF
GTF:
chr1 Maker exon 1050 1500 . + . gene_id "gene0001"; transcript_id "mRNA00001"; gene_name "AT1G01010";
GTF -> GFF3:
gffread -E -G -o output.gff3 input.gtf
(mRNA feature created from first exon)

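Extract mRNA and protein sequences from the genome using gffread
gffread input.gff3 -g input.fasta -w transcript.fasta -y protein.fasta
(-w writes spliced transcript sequences; -y writes translated protein sequences; -x would write spliced CDS nucleotide sequences)
• Chromosome names should match between files;
• "Chr1", "chr1" and "1" are different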
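A quick check for mismatched chromosome names before running a tool on a FASTA/GFF pair might look like the sketch below. The function names and the small in-memory inputs are made up for illustration; with real files you would pass open file handles instead of lists.

```python
def fasta_sequence_names(fasta_lines):
    """Collect sequence names: the first word after '>' on each header line."""
    return {
        line[1:].split()[0]
        for line in fasta_lines
        if line.startswith(">") and line[1:].strip()
    }


def gff_sequence_names(gff_lines):
    """Collect column 1 of every non-comment, non-blank GFF line."""
    return {
        line.split("\t")[0]
        for line in gff_lines
        if line.strip() and not line.startswith("#")
    }


def missing_from_fasta(fasta_lines, gff_lines):
    """Names used in the annotation that do not occur in the genome FASTA."""
    return sorted(gff_sequence_names(gff_lines) - fasta_sequence_names(fasta_lines))


# Hypothetical inputs: the FASTA uses "Chr1"/"Chr2", the GFF mixes in "chr1".
fasta = [">Chr1 example description", "ACGTACGT", ">Chr2", "GGCCGGCC"]
gff = [
    "##gff-version 3",
    "chr1\tMaker\tgene\t1000\t9000\t.\t+\t.\tID=gene00001",
    "Chr2\tMaker\tgene\t200\t900\t.\t-\t.\tID=gene00002",
]
mismatches = missing_from_fasta(fasta, gff)
print(mismatches)
```

The comparison is deliberately case-sensitive, because "Chr1", "chr1" and "1" are treated as three different sequences by downstream tools.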

Other features in gff files:
zcat ara.gff.gz | cut -f3 | sort | uniq
CDS
antisense_RNA
antisense_lncRNA
exon
five_prime_UTR
gene
lnc_RNA
mRNA
miRNA
miRNA_primary_transcript
ncRNA
protein
pseudogene
pseudogenic_exon
pseudogenic_tRNA
pseudogenic_transcript
rRNA
snRNA
snoRNA
tRNA
three_prime_UTR
transcript_region
transposable_element
transposable_element_gene
transposon_fragment
uORF
Example lines:
Chr1 Araport11 lnc_RNA 11101 11372 . + . ID=AT1G03987.1;Parent=AT1G03987;Name=AT1G03987.1
Chr3 Araport11 tRNA 2918413 2918486 74 - . ID=AT3G09505.1;Parent=AT3G09505;Note=tRNA-Arg
Chr2 Araport11 transposable_element 2782038 2785706 . + . ID=AT2TE12270;Name=AT2TE12270;Alias=ATCOPIA75
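The same survey of column 3 can be done in Python, which also makes it easy to count how often each feature type occurs (similar to adding `-c` to `uniq`). This is a sketch under assumptions: `count_feature_types` and `count_feature_types_in_file` are hypothetical names, and the demo runs on the three Araport11 example lines rather than on the full `ara.gff.gz` file.

```python
import gzip
from collections import Counter


def count_feature_types(lines):
    """Count the values in column 3, skipping comments and blank lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.split("\t")
        if len(fields) >= 3:
            counts[fields[2]] += 1
    return counts


def count_feature_types_in_file(path):
    """Same as above for a gzip-compressed GFF file such as ara.gff.gz."""
    with gzip.open(path, "rt") as handle:
        return count_feature_types(handle)


# Demo on the three example lines from the slide (tab-delimited).
example = [
    "##gff-version 3",
    "Chr1\tAraport11\tlnc_RNA\t11101\t11372\t.\t+\t.\tID=AT1G03987.1;Parent=AT1G03987;Name=AT1G03987.1",
    "Chr3\tAraport11\ttRNA\t2918413\t2918486\t74\t-\t.\tID=AT3G09505.1;Parent=AT3G09505;Note=tRNA-Arg",
    "Chr2\tAraport11\ttransposable_element\t2782038\t2785706\t.\t+\t.\tID=AT2TE12270;Name=AT2TE12270;Alias=ATCOPIA75",
]
type_counts = count_feature_types(example)
for feature_type in sorted(type_counts):
    print(feature_type, type_counts[feature_type])
```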

BED
Column  Name
1       Chromosome/contig
2       Start position
3       End position
4       Name
5       Score
6       Strand ( + or -)
7-9     thickStart, thickEnd, itemRgb
10      blockCount
11      blockSizes
12      blockStarts
3-column format
chr1 1000 5000
chr2 2000 6000
12-column format
chr1 1000 5000 gene1 . + 1000 5000 0 2 567,488, 0,3512
chr2 2000 6000 gene2 . - 2000 6000 0 2 433,399, 0,3601
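In the 12-column format, blockSizes and blockStarts describe the exons, with blockStarts given relative to the start position in column 2. The sketch below expands them into absolute intervals; `bed12_blocks` is a hypothetical helper, and it splits on any whitespace so that it accepts both real tab-delimited BED and the space-separated lines shown on the slide.

```python
def bed12_blocks(line):
    """Return absolute (chrom, start, end) intervals for each block of a BED12 line."""
    fields = line.split()
    if len(fields) < 12:
        raise ValueError("a BED12 line needs 12 columns")
    chrom, chrom_start = fields[0], int(fields[1])
    block_count = int(fields[9])
    # Both lists may carry a trailing comma, so empty items are dropped.
    sizes = [int(x) for x in fields[10].split(",") if x]
    starts = [int(x) for x in fields[11].split(",") if x]
    if not (len(sizes) == len(starts) == block_count):
        raise ValueError("blockCount does not match blockSizes/blockStarts")
    # blockStarts are offsets from chromStart; coordinates stay 0-based, half-open.
    return [
        (chrom, chrom_start + offset, chrom_start + offset + size)
        for offset, size in zip(starts, sizes)
    ]


gene1_blocks = bed12_blocks("chr1 1000 5000 gene1 . + 1000 5000 0 2 567,488, 0,3512")
gene2_blocks = bed12_blocks("chr2 2000 6000 gene2 . - 2000 6000 0 2 433,399, 0,3601")
print(gene1_blocks)
print(gene2_blocks)
```

For both example lines the last block ends exactly at the end position in column 3, which is a useful sanity check on a BED12 file.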

BED vs GFF3
BED:
chr1 999 9000 . . +
• Most software ignores the last 6 columns of BED
Coordinate system (for a 100-bp sequence)
BED, bam: 0-start, half-open (0-based): first nucleotide: 0; last nucleotide: 100
GFF, GTF, vcf, genbank, sam, blast output: 1-start, fully-closed (1-based): first nucleotide: 1; last nucleotide: 100
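The two coordinate systems differ only in the start position, which is the usual source of off-by-one errors. A small sketch of the conversion and of the matching length formulas follows; the function names are illustrative.

```python
def bed_to_gff(start0, end0):
    """0-based half-open (BED) -> 1-based fully-closed (GFF/GTF)."""
    return start0 + 1, end0


def gff_to_bed(start1, end1):
    """1-based fully-closed (GFF/GTF) -> 0-based half-open (BED)."""
    return start1 - 1, end1


def bed_length(start0, end0):
    """Number of bases in a BED interval."""
    return end0 - start0


def gff_length(start1, end1):
    """Number of bases in a GFF interval."""
    return end1 - start1 + 1


# The 100-bp sequence from the slide: BED 0..100 is the same region as GFF 1..100.
whole_bed = (0, 100)
whole_gff = bed_to_gff(*whole_bed)
print(whole_gff, bed_length(*whole_bed), gff_length(*whole_gff))

# The gene from the GFF3 example: 1000..9000 becomes 999..9000 in BED.
print(gff_to_bed(1000, 9000))
```

The end coordinate is numerically the same in both systems; only its interpretation changes (exclusive in BED, inclusive in GFF).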

Start/end positions of stranded DNA sequence:
BED
Plus strand: chr1 0 100 . . +
Minus strand: chr1 0 100 . . -
BLAST output
Plus strand: chr1 1 100
Minus strand: chr1 100 1
Transcript start site in BED file:
Gene on plus strand: start position (add 1)
Gene on minus strand: end position
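The transcript start site rule can be written as a small function. This is a sketch with hypothetical names: `tss_position` returns the 1-based position of the first transcribed base, and `tss_bed_interval` expresses the same base as a 1-bp BED interval.

```python
def tss_position(start0, end0, strand):
    """1-based transcript start site for a gene given as a BED interval."""
    if strand == "+":
        return start0 + 1  # BED start is 0-based, so add 1
    if strand == "-":
        return end0        # BED end already equals the 1-based last base
    raise ValueError(f"strand must be '+' or '-', got {strand!r}")


def tss_bed_interval(start0, end0, strand):
    """The same site as a 1-bp, 0-based half-open BED interval."""
    position = tss_position(start0, end0, strand)
    return position - 1, position


# The two BED lines from the slide: chr1 0 100 on the plus and minus strand.
plus_tss = tss_position(0, 100, "+")
minus_tss = tss_position(0, 100, "-")
print(plus_tss, tss_bed_interval(0, 100, "+"))
print(minus_tss, tss_bed_interval(0, 100, "-"))
```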

Stranded read position in SAM file format (1-based)
Leftmost alignment position on
reference

Use Linux "awk" to convert file format
GFF3:
chr1 Maker gene 1000 9000 . + . ID=gene00001;Name=EDEN
BED:
chr1 999 9000 . . +
Awk command:
awk 'BEGIN { FS = OFS = "\t" } {if ($3=="gene") print $1, $4-1, $5, ".", ".", $7}' input.gff3 > output.bed

SAM/BAM file formats
Read alignment to the reference genome:
HWUSI-EAS525_0042_FC:6:23:10200:18582#0/1 16 1 10 40 35M * 0 0 AGCCAAAGATTGCATCAGTTCTGCTGCTATTTCCT agafgfaffcfdf[fdcffcggggccfdffagggg MD:Z:35 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0
HWUSI-EAS525_0042_FC:3:28:18734:20197#0/1 16 1 10 40 35M * 0 0 AGCCAAAGATTGCATCAGTTCTGCTGCTATTTCCT hghhghhhhhhhhhhhhhhhhhghhhhhghhfhhh MD:Z:35 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0
HWUSI-EAS525_0042_FC:3:94:1587:14299#0/1 16 1 10 40 35M * 0 0 AGCCAAAGATTGCATCAGTTCTGCTGCTATTTCCT hfhghhhhhhhhhhghhhhhhhhhhhhhhhhhhhg MD:Z:35 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0
Convert to bed file:
Chr start end name score strand
1 9 44 . 40 -
1 9 44 . 40 -
1 9 44 . 40 -
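The SAM-to-BED conversion above follows from three SAM fields: the flag (bit 16 marks a reverse-strand alignment), the 1-based leftmost position, and the CIGAR string. The sketch below is a simplified illustration with hypothetical function names; it uses a made-up read name and "*" for the quality string, and production work would normally rely on an established tool instead.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def reference_span(cigar):
    """Number of reference bases covered by an alignment."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)


def sam_to_bed(line):
    """Convert one SAM alignment line to (chrom, start, end, name, score, strand)."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    if flag & 4:
        return None                      # unmapped read: no interval
    start0 = int(fields[3]) - 1          # SAM POS is 1-based, BED start is 0-based
    end0 = start0 + reference_span(fields[5])
    strand = "-" if flag & 16 else "+"   # bit 16: read aligned to the reverse strand
    return (fields[2], start0, end0, ".", int(fields[4]), strand)


# Same flag, position, MAPQ, CIGAR and sequence as the reads on the slide.
sam_line = "read1\t16\t1\t10\t40\t35M\t*\t0\t0\tAGCCAAAGATTGCATCAGTTCTGCTGCTATTTCCT\t*"
bed_row = sam_to_bed(sam_line)
print(bed_row)
```

Position 10 with a 35M alignment covers bases 10 to 44, which is 9 to 44 in BED, matching the converted lines on the slide.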

Sparse vs dense data profile
Dense
Sparse

Wiggle (wig), BedGraph, and bigwig file format:
Fixed step wig (1-based)
fixedStep chrom=chr3 start=400601 step=100 span=5
11
22
33
Variable step wig (1-based)
variableStep chrom=chr3 span=5
400601 11
400701 22
400801 33
BedGraph (0-based)
chr3 400600 400605 11
chr3 400700 400705 22
chr3 400800 400805 33
Good for dense data
Good for sparse data
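The relationship between the fixedStep wig example and the bedGraph example can be checked with a short conversion. This is a sketch for illustration: `fixedstep_to_bedgraph` is a hypothetical function that handles fixedStep blocks only, and real conversions are usually done with the utilities listed on the file conversion slide.

```python
def fixedstep_to_bedgraph(wig_lines):
    """Convert fixedStep wig lines to bedGraph rows (chrom, start0, end0, value)."""
    rows = []
    chrom = position = step = span = None
    for raw in wig_lines:
        line = raw.strip()
        if not line or line.startswith(("track", "#")):
            continue
        if line.startswith("variableStep"):
            raise ValueError("this sketch only handles fixedStep blocks")
        if line.startswith("fixedStep"):
            # Declaration line: key=value pairs after the keyword.
            params = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = params["chrom"]
            position = int(params["start"])   # 1-based
            step = int(params["step"])
            span = int(params.get("span", 1))
            continue
        if chrom is None:
            raise ValueError("data line found before a fixedStep declaration")
        # wig start is 1-based; bedGraph is 0-based, half-open.
        rows.append((chrom, position - 1, position - 1 + span, float(line)))
        position += step
    return rows


wig = ["fixedStep chrom=chr3 start=400601 step=100 span=5", "11", "22", "33"]
bedgraph_rows = fixedstep_to_bedgraph(wig)
for chrom, start0, end0, value in bedgraph_rows:
    print(chrom, start0, end0, value)
```

The three rows produced here cover the same intervals as the bedGraph example on the slide.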

BigWig file format replacing the Wig format:
• Compressed and indexed binary format;
• For both sparse and dense data;
• Recommended format for genome browser;

Software for file conversion:
Bedtools genomecov: Bam => bedgraph
Deeptools bamCoverage: Bam => bedgraph or bigwig
Kent Utility: Conversion between wig, bigwig,

Public genomic databases
NCBI Refseq : NCBI gene annotation
Ensembl: Ensembl gene annotation
UCSC: Integrated with UCSC genome browser
* All three databases use the same genome assembly

Download from NCBI Refseq FTP site
https://ftp.ncbi.nlm.nih.gov/genomes/refseq/
• Genome fasta
• GFF3
• GTF
• Protein sequence
• Transcript sequence

Ensembl Biomart:
Ortholog in other species:

General databases: NCBI, Ensembl, UCSC, JGI
Species specific databases:
Drosophila: Flybase
C. elegans: Wormbase
Yeast: SGD
Maize: MaizeGDB
Arabidopsis: TAIR
Matching Gene/Transcript ID:
Ensembl BioMart provides IDs from many different databases

Genome browsers
IGV: desktop application
UCSC Genome browser: web based

IGV (Integrative Genomics Viewer)
Software running on your laptop/desktop

Select pre-loaded reference genome
Create your own reference genome db: Genome fasta + GFF3 or GTF file

View multiple tracks
View large data files, no need to upload
