Epigenomics data analysis
11/30/2020 – 12/18/2020
Qi Sun, William Lai and Jeff Glaubitz
Bioinformatics Facility & Epigenomics Facility
Genome feature annotation & Read enrichment pattern

Genome features (GFF3 feature types) in GFF files:
• gene
• CDS
• mRNA
• exon
• five_prime_UTR
• three_prime_UTR
• rRNA
• tRNA
• ncRNA
• tmRNA
• transcript
• mobile_genetic_element
• origin_of_replication
• promoter
• repeat_region
Genome feature annotation & Epigenomics data analysis

Week 1: Genome features
• File formats
• Public databases
• Genome browser
• Bedtools
• Deeptools
• Homer

Week 2: Peak calling
• Peak calling
• Data QC

Week 3: Integrated analysis
File formats listed on UCSC web site
https://genome.ucsc.edu/FAQ/FAQformat.html#format3
• Axt format
• BAM format
• BED format
• BED detail format
• bedGraph format
• barChart and bigBarChart format
• bigBed format
• bigGenePred table format
• bigPsl table format
• bigMaf table format
• bigChain table format
• bigNarrowPeak table format
• bigWig format
• Chain format
• CRAM format
• GenePred table format
• GFF format
• GTF format
• HAL format
• Hic format
• Interact and bigInteract format
• MAF format
• Microarray format
• Net format
• Personal Genome SNP format
• PSL format
• VCF format
• WIG format
File formats listed on UCSC Genome Browser web site, by category
https://genome.ucsc.edu/FAQ/FAQformat.html#format3

General format: Axt, BAM, BED, BED detail, bedGraph, barChart & bigBarChart, bigBed, bigGenePred table, bigPsl table, bigMaf table, bigChain table, bigNarrowPeak table, bigWig, Chain, CRAM, GenePred table, GFF, GTF, HAL, Hic, Interact & bigInteract, MAF, Microarray, Net, Personal Genome SNP, PSL, VCF, WIG

ENCODE format: broadPeak, gappedPeak, narrowPeak, pairedTagAlign, peptideMapping, RNA elements, tagAlign

Sequence: 2bit, fasta, fastQ, nib
GFF2, GFF3 and GTF
9-column tab-delimited text file

Column  Name
1       Chromosome/contig
2       Source of annotation
3       Feature type
4       Start position
5       End position
6       Score
7       Strand (+ or -)
8       Phase (0, 1 or 2)
9       Attributes or Group

GFF2 example:
chr1 Maker CDS 380 401 . + 0 gene_001
chr1 Maker CDS 501 650 . + 2 gene_001
chr1 Maker CDS 700 707 . + 2 gene_001
chr1 Maker start_codon 380 382 . + 0 gene_001
chr1 Maker stop_codon 708 710 . + 0 gene_001
GFF2 vs GTF

GFF2 example (column 3 is the feature type, column 9 is the group):
chr1 Maker CDS 380 401 . + 0 gene_001
chr1 Maker CDS 501 650 . + 2 gene_001
chr1 Maker CDS 700 707 . + 2 gene_001
chr1 Maker start_codon 380 382 . + 0 gene_001
chr1 Maker stop_codon 708 710 . + 0 gene_001

Feature type:
GFF2: "CDS", "start_codon", "stop_codon" and "exon"
GTF: All GFF2 types plus optional types: 5UTR, 3UTR, inter, inter_CNS, and intron_CNS

Group:
GFF2: Single group attribute, not able to represent multiple splicing forms of the same gene
GTF: Two mandatory attributes: gene_id "gene_001"; transcript_id "gene_001.1"
GFF3 vs GTF

GFF3 example:
chr1 Maker gene 1000 9000 . + . ID=gene00001;Name=EDEN
chr1 Maker mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1
chr1 Maker exon 1050 1500 . + . ID=exon00002;Parent=mRNA00001
chr1 Maker exon 3000 3902 . + . ID=exon00003;Parent=mRNA00001
chr1 Maker CDS 1201 1500 . + 0 ID=cds00001;Parent=mRNA00001
chr1 Maker CDS 3000 3902 . + 0 ID=cds00001;Parent=mRNA00001

GFF3: Two mandatory attributes: ID and Parent
GTF: Two mandatory attributes: gene_id "gene0001"; transcript_id "mRNA00001"
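
The ID/Parent links in column 9 of a GFF3 file can be listed with a short awk sketch like the one below. The function name gff3_id_parent is hypothetical (it is not part of any tool), and the sketch assumes a tab-delimited GFF3 file with key=value attributes separated by semicolons.

```shell
# Hypothetical helper: print feature type, ID and Parent for every GFF3 record.
# Output is tab-separated; a missing attribute is printed as ".".
gff3_id_parent() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ || NF < 9 { next }            # skip comments and malformed lines
       {
         id = "."; parent = "."
         n = split($9, attrs, ";")        # column 9: key=value pairs
         for (i = 1; i <= n; i++) {
           eq = index(attrs[i], "=")
           if (eq == 0) continue
           key = substr(attrs[i], 1, eq - 1)
           val = substr(attrs[i], eq + 1)
           if (key == "ID") id = val
           else if (key == "Parent") parent = val
         }
         print $3, id, parent
       }' "$@"
}
```

Usage: gff3_id_parent input.gff3. A gene line is expected to show "." as its Parent, while mRNA, exon and CDS lines point to the feature one level up.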
gffread, a tool to convert between GFF3 and GTF

GFF3 -> GTF format:
gffread -E -T -o output.gtf input.gff3

GTF -> GFF3 format:
gffread -E -G -o output.gff3 input.gtf
Lost information when converting from GFF3 to GTF

GFF3:
chr1 Maker gene 1000 9000 . + . ID=gene00001;Name=EDEN
chr1 Maker mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1
chr1 Maker exon 1050 1500 . + . ID=exon00002;Parent=mRNA00001

GTF:
chr1 Maker exon 1050 1500 . + . gene_id "gene0001"; transcript_id "mRNA00001"; gene_name "AT1G01010";

Converting back to GFF3 (mRNA feature created from first exon):
gffread -E -G -o output.gff3 input.gtf
Extract mRNA and protein sequences from the genome using gffread

gffread input.gff3 -g input.fasta -w transcript.fasta -y protein.fasta

• Chromosome names should match between files;
• "Chr1", "chr1" and "1" are different
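
A quick way to check that chromosome names match is to compare the FASTA headers with column 1 of the annotation file. The sketch below is one possible approach; the function name missing_chroms is hypothetical, and it assumes the sequence name is the first word of each FASTA header line.

```shell
# Hypothetical helper: list sequence names used in a GFF/GTF file (column 1)
# that do not occur as a sequence name in the genome FASTA file.
# Usage: missing_chroms genome.fasta annotation.gff3
missing_chroms() {
  awk -F '\t' '
    NR == FNR {                           # first file: the FASTA
      if (/^>/) {
        sub(/^>/, "")
        split($0, words, /[ \t]/)         # name = first word of the header
        seen[words[1]] = 1
      }
      next
    }
    /^#/ || NF < 9 { next }               # second file: skip comments
    !($1 in seen) && !reported[$1]++ { print $1 }
  ' "$1" "$2"
}
```

An empty result means every name in the annotation was found in the FASTA file. Any name printed (for example "chr1" when the FASTA uses "Chr1") needs to be renamed in one of the two files before running gffread.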
Other features in GFF files:

zcat ara.gff.gz | cut -f3 | sort | uniq

CDS
antisense_RNA
antisense_lncRNA
exon
five_prime_UTR
gene
lnc_RNA
mRNA
miRNA
miRNA_primary_transcript
ncRNA
protein
pseudogene
pseudogenic_exon
pseudogenic_tRNA
pseudogenic_transcript
rRNA
snRNA
snoRNA
tRNA
three_prime_UTR
transcript_region
transposable_element
transposable_element_gene
transposon_fragment
uORF

Examples:
Chr1 Araport11 lnc_RNA 11101 11372 . + . ID=AT1G03987.1;Parent=AT1G03987;Name=AT1G03987.1
Chr3 Araport11 tRNA 2918413 2918486 74 - . ID=AT3G09505.1;Parent=AT3G09505;Note=tRNA-Arg
Chr2 Araport11 transposable_element 2782038 2785706 . + . ID=AT2TE12270;Name=AT2TE12270;Alias=ATCOPIA75
BED

Column  Name
1       Chromosome/contig
2       Start position
3       End position
4       Name
5       Score
6       Strand (+ or -)
7-9     thickStart, thickEnd, itemRgb
10      blockCount
11      blockSizes
12      blockStarts

3-column format:
chr1 1000 5000
chr2 2000 6000

12-column format:
chr1 1000 5000 gene1 . + 1000 5000 0 2 567,488, 0,3512
chr2 2000 6000 gene2 . - 2000 6000 0 2 433,399, 0,3601
BED and GFF3

GFF3:
chr1 Maker gene 1000 9000 . + . ID=gene00001;Name=EDEN
chr1 Maker mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001;Name=EDEN.1
chr1 Maker exon 1050 1500 . + . ID=exon00002;Parent=mRNA00001

BED:
chr1 999 9000 . . +

• Most software ignores the last 6 columns of BED
Coordinate system (for a 100-bp sequence)

BED, bam
0-start, half-open (0-based): start: 0; end: 100 (the end is exclusive, so the last nucleotide is at index 99)

GFF, GTF, vcf, genbank, sam, blast output
1-start, fully-closed (1-based): first nucleotide: 1; last nucleotide: 100
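
The relation between the two coordinate systems can be sketched in awk: only the start changes (add 1), and the interval length in BED is simply end minus start. The function name bed_to_one_based is hypothetical, and tab-delimited BED input is assumed.

```shell
# Hypothetical helper: convert 0-based half-open BED intervals to
# 1-based fully-closed coordinates.
# Output columns: chromosome, 1-based start, 1-based end, length in bp.
bed_to_one_based() {
  awk 'BEGIN { FS = OFS = "\t" }
       { print $1, $2 + 1, $3, $3 - $2 }' "$@"
}
```

For the 100-bp example, the BED interval "chr1 0 100" becomes start 1, end 100, length 100.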
Start/end positions of stranded DNA sequence:

BED
Plus strand:  chr1 0 100 . . +
Minus strand: chr1 0 100 . . -

BLAST output
Plus strand:  chr1 1 100
Minus strand: chr1 100 1

Transcript start site in BED file:
Gene on plus strand: start position (add 1)
Gene on minus strand: end position
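
The rule above can be written as a small awk sketch that reports the 1-based transcript start site for each record of a 6-column BED file. The function name bed_tss is hypothetical; the input is assumed to be tab-delimited with the strand in column 6.

```shell
# Hypothetical helper: print the 1-based transcript start site (TSS)
# for each BED6 record.
# Output columns: chromosome, TSS position, name, strand.
bed_tss() {
  awk 'BEGIN { FS = OFS = "\t" }
       $6 == "+" { print $1, $2 + 1, $4, $6 }   # plus strand: start + 1
       $6 == "-" { print $1, $3, $4, $6 }       # minus strand: end
      ' "$@"
}
```

Records without a "+" or "-" strand are skipped, since the start site cannot be decided for them.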
Stranded read position in SAM file format (1-based)
The position is the leftmost alignment position on the reference, for reads on both the plus and the minus strand
Use Linux "awk" to convert file format

GFF3:
chr1 Maker gene 1000 9000 . + . ID=gene00001;Name=EDEN

BED:
chr1 999 9000 . . +

Awk command:
awk 'BEGIN { FS = OFS = "\t" } {if ($3=="gene") print $1, $4-1, $5, ".", ".", $7}' input.gff3 > output.bed
SAM/BAM file formats

Read alignment to the reference genome:
HWUSI-EAS525_0042_FC:6:23:10200:18582#0/1 16 1 10 40 35M * 0 0 AGCCAAAGATTGCATCAGTTCTGCTGCTATTTCCT agafgfaffcfdf[fdcffcggggccfdffagggg MD:Z:35 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0
HWUSI-EAS525_0042_FC:3:28:18734:20197#0/1 16 1 10 40 35M * 0 0 AGCCAAAGATTGCATCAGTTCTGCTGCTATTTCCT hghhghhhhhhhhhhhhhhhhhghhhhhghhfhhh MD:Z:35 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0
HWUSI-EAS525_0042_FC:3:94:1587:14299#0/1 16 1 10 40 35M * 0 0 AGCCAAAGATTGCATCAGTTCTGCTGCTATTTCCT hfhghhhhhhhhhhghhhhhhhhhhhhhhhhhhhg MD:Z:35 NH:i:1 HI:i:1 NM:i:0 SM:i:40 XQ:i:40 X2:i:0

Convert to bed file:
Chr start end name score strand
1 9 44 . 40 -
1 9 44 . 40 -
1 9 44 . 40 -
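
The SAM-to-BED conversion shown here can be sketched in awk: the BED start is POS minus 1, the end is the start plus the number of reference bases covered by the CIGAR string, the score is taken from MAPQ, and the strand comes from flag bit 0x10. The function name sam_to_bed is hypothetical, and the sketch assumes plain-text, tab-delimited SAM input.

```shell
# Hypothetical helper: convert SAM alignment records to 6-column BED.
# Output columns: chromosome, start (0-based), end, ".", MAPQ, strand.
sam_to_bed() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^@/ { next }                      # skip header lines
       int($2 / 4) % 2 == 1 { next }      # flag 0x4: read is unmapped
       {
         cigar = $6; ref_len = 0
         # Walk the CIGAR string; M, D, N, = and X consume reference bases.
         while (match(cigar, /^[0-9]+[MIDNSHP=X]/)) {
           n = substr(cigar, 1, RLENGTH - 1) + 0
           op = substr(cigar, RLENGTH, 1)
           if (op ~ /[MDN=X]/) ref_len += n
           cigar = substr(cigar, RLENGTH + 1)
         }
         strand = (int($2 / 16) % 2 == 1) ? "-" : "+"   # flag 0x10: reverse strand
         print $3, $4 - 1, $4 - 1 + ref_len, ".", $5, strand
       }' "$@"
}
```

Usage: samtools view input.bam | sam_to_bed > output.bed. For the reads above (flag 16, position 10, CIGAR 35M, MAPQ 40) this gives "1 9 44 . 40 -". bedtools bamtobed performs a similar conversion directly from a BAM file.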
Wiggle (wig) file format:
[Figure: sparse vs dense data profile]

Wiggle (wig), BedGraph, and bigWig file format:

Fixed step wig (1-based), good for dense data:
fixedStep chrom=chr3 start=400601 step=100 span=5
11
22
33

Variable step wig (1-based), good for sparse data:
variableStep chrom=chr3 span=5
400601 11
400701 22
400801 33

BedGraph (0-based):
chr3 400600 400605 11
chr3 400700 400705 22
chr3 400800 400805 33
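
The correspondence between the wig and BedGraph examples can be made explicit with an awk sketch that rewrites either wig flavor as BedGraph: the 1-based wig position becomes a 0-based start, and the end is the start plus the span. The function name wig_to_bedgraph is hypothetical; this is an illustration of the coordinate shift, not a replacement for the Kent utilities.

```shell
# Hypothetical helper: convert fixedStep or variableStep wig to BedGraph.
# Output columns: chromosome, start (0-based), end, value.
wig_to_bedgraph() {
  awk 'BEGIN { OFS = "\t"; span = 1 }
       NF == 0 || /^track/ || /^browser/ || /^#/ { next }
       /^fixedStep/ || /^variableStep/ {
         # Declaration line: read the key=value settings.
         mode = $1; span = 1
         for (i = 2; i <= NF; i++) {
           split($i, kv, "=")
           if (kv[1] == "chrom") chrom = kv[2]
           else if (kv[1] == "start") pos = kv[2] + 0
           else if (kv[1] == "step") step = kv[2] + 0
           else if (kv[1] == "span") span = kv[2] + 0
         }
         next
       }
       mode == "variableStep" { print chrom, $1 - 1, $1 - 1 + span, $2 }
       mode == "fixedStep"    { print chrom, pos - 1, pos - 1 + span, $1; pos += step }
      ' "$@"
}
```

Both wig examples above produce the same three BedGraph lines, starting with "chr3 400600 400605 11".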
BigWig file format replacing the Wig format:
• Compressed and indexed binary format;
• For both sparse and dense data;
• Recommended format for genome browser;
Software for file conversion:

Bedtools genomecov
Bam => bedgraph

Deeptools bamCoverage
Bam => bedgraph or bigwig

Kent Utility:
Conversion between wig, bigwig,
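
Typical invocations of these tools might look like the sketch below. The wrapper function names are hypothetical, each wrapper assumes the named tool is installed and on the PATH, and bedGraphToBigWig is picked here as one example of a Kent utility.

```shell
# Hypothetical wrappers; file names are passed as arguments.

# BAM -> bedGraph with bedtools (-bg reports coverage in bedGraph format).
# The BAM file should be sorted by position.
# Usage: bam_to_bedgraph input.bam output.bedgraph
bam_to_bedgraph() {
  bedtools genomecov -ibam "$1" -bg > "$2"
}

# BAM -> bigWig with deepTools (the BAM file must be sorted and indexed).
# Usage: bam_to_bigwig input.bam output.bw
bam_to_bigwig() {
  bamCoverage -b "$1" -o "$2" --outFileFormat bigwig
}

# bedGraph -> bigWig with the Kent utility bedGraphToBigWig.
# chrom.sizes is a two-column file: chromosome name and length.
# The bedGraph file must be sorted by chromosome and start.
# Usage: bedgraph_to_bigwig input.bedgraph chrom.sizes output.bw
bedgraph_to_bigwig() {
  bedGraphToBigWig "$1" "$2" "$3"
}
```

bamCoverage can also write bedGraph by setting --outFileFormat bedgraph.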
Public genomic databases
NCBI RefSeq: NCBI gene annotation
Ensembl: Ensembl gene annotation
UCSC: Integrated with UCSC genome browser
* All three databases use the same genome assembly
Download from NCBI RefSeq FTP site
https://ftp.ncbi.nlm.nih.gov/genomes/refseq/
• Genome fasta
• GFF3
• GTF
• Protein sequence
• Transcript sequence
Download from Ensembl FTP
UCSC Table Browser
Ensembl BioMart: Features
Ensembl BioMart: Ortholog in other species
General databases
• NCBI
• Ensembl
• UCSC
• JGI

Species specific databases
• Drosophila: Flybase
• C. elegans: Wormbase
• Yeast: SGD
• Maize: MaizeGDB
• Arabidopsis: TAIR

Matching Gene/Transcript ID
Ensembl BioMart provides IDs from many different databases
Genome browsers
• IGV – desktop application
• UCSC Genome browser – web based

IGV – Integrative Genomics Viewer
Software running on your laptop/desktop
• Select pre-loaded reference genome
• Create your own reference genome db: genome fasta plus GFF3 or GTF file
• Load data track
• View multiple tracks
• View large data files, no need to upload
