Sequencing data analysis
Workshop – part 2 / mapping to a reference genome



                      Outline

            Previously in this workshop…

      Mapping to a reference genome – the steps

    Mapping to a reference genome – the workshop




                  Maté Ongenaert
Previously in this workshop…
Introduction – the real cost of sequencing
Previously in this workshop…
  The workflow of NGS data analysis
                            Data analysis

                 Raw machine reads… What’s next?

                Preprocessing (machine/technology)
                 - adaptors, indexes, conversions,…
                 - machine/technology dependent

              Reads with associated qualities (universal)
                              - FASTQ
                            - QC check

            Depending on application (generally applicable)
        - ‘de novo’ assembly of genome (bacterial genomes,…)
         - Mapping to a reference genome → mapped reads
                          - SAM/BAM/…

             High-level analysis (specific for application)
                            - SNP calling
                           - Peak calling
Previously in this workshop…
                                     Main data formats
                                     Raw sequence reads:

- Represent the sequence ~ FASTA
  >SEQUENCE_IDENTIFIER
  GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT


- Extension: add a quality score per base ~ FASTQ (Q for quality)
  Each score is a Phred value stored as one ASCII character: ASCII code = Phred + 33 (Sanger encoding)
  @SEQUENCE_IDENTIFIER
  GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
  +
  !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65



- Machine- and platform-independent, compressed archive format: SRA (NCBI)
  The original FASTQ file can be recovered with the SRA Toolkit (NCBI)
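
As an illustration of the Phred + 33 encoding described above, the following minimal Python sketch decodes the quality line of the example FASTQ record. The function names are made up for this example; the error-probability formula P = 10^(-Q/10) is the standard Phred definition.

```python
def phred33_scores(quality_string):
    """Decode a Sanger (Phred+33) quality string into integer Phred scores."""
    return [ord(ch) - 33 for ch in quality_string]


def error_probability(q):
    """Convert a Phred score Q into the estimated base-calling error probability."""
    return 10 ** (-q / 10)


# Quality line of the FASTQ record shown on the slide
quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
scores = phred33_scores(quality)

print(scores[:5])             # first five Phred scores
print(error_probability(20))  # Q20 corresponds to a 1 in 100 error rate
```

A '!' (ASCII 33) therefore means Phred 0, the lowest possible quality.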
Previously in this workshop…
                                Main data formats
- Now moving to a common file format → SAM / BAM (Sequence Alignment/Map)
- BAM: binary (read: computer-readable, indexed, compressed) ‘form’ of SAM

DESCRIPTION OF THE 11 FIELDS IN THE ALIGNMENT SECTION

# QNAME: template name
# FLAG: bitwise flag
# RNAME: reference name
# POS: mapping position
# MAPQ: mapping quality
# CIGAR: CIGAR string
# RNEXT: reference name of the mate/next fragment
# PNEXT: position of the mate/next fragment
# TLEN: observed template length
# SEQ: fragment sequence
# QUAL: ASCII of Phred-scaled base quality + 33

#Headers
@HD VN:1.3 SO:coordinate
@SQ SN:ref LN:45

#Alignment block
r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
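
To make the 11 fields concrete, here is a small Python sketch that splits one alignment line into named fields and uses the CIGAR string to work out where the alignment ends on the reference. It is only an illustration: real SAM files are tab-separated (the slide shows spaces, so the sketch splits on any whitespace), and the helper names are invented for this example.

```python
import re

SAM_FIELDS = ["QNAME", "FLAG", "RNAME", "POS", "MAPQ", "CIGAR",
              "RNEXT", "PNEXT", "TLEN", "SEQ", "QUAL"]


def parse_sam_line(line):
    """Split one alignment line into the 11 mandatory fields plus optional tags."""
    parts = line.split()
    record = dict(zip(SAM_FIELDS, parts[:11]))
    for key in ("FLAG", "POS", "MAPQ", "PNEXT", "TLEN"):
        record[key] = int(record[key])
    record["TAGS"] = parts[11:]
    return record


def reference_span(cigar):
    """Count reference bases covered: M, D, N, = and X consume the reference."""
    operations = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(length) for length, op in operations if op in "MDN=X")


rec = parse_sam_line("r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *")
end = rec["POS"] + reference_span(rec["CIGAR"]) - 1  # POS is 1-based
print(rec["QNAME"], rec["POS"], end)
```

Insertions (I), soft clips (S), hard clips (H) and padding (P) do not advance the position on the reference, which is why they are left out of the sum.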
Previously in this workshop…
                                         Main data formats
- BED (Browser Extensible Data) files: location / annotation / scores
  Used for mapped regions, annotation and peak locations; binary extension: bigBed
FIELDS USED:
# chr
# start
# end
# name
# score
# strand

track   name=pairedReads description="Clone Paired Reads" useScore=1
#chr    start end name score strand
chr22   1000 5000 cloneA 960 +
chr22   2000 6000 cloneB 900 -


- BEDGraph files (location, combined with score)
Used to represent peak scores
track type=bedGraph name="BedGraph Format" description="BedGraph format" visibility=full color=200,100,0 altColor=0,100,200 priority=20
#chr start    end      score
chr19 59302000 59302300 -1.0
chr19 59302300 59302600 -0.75
chr19 59302600 59302900 -0.50
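
A minimal Python sketch of reading BED or bedGraph lines, assuming whitespace-separated columns as in the examples above. It relies on the BED convention that the start coordinate is 0-based and the end coordinate is exclusive; the function name is invented for this example.

```python
def parse_bed(lines):
    """Yield (chrom, start, end, extra_fields) for each BED/bedGraph data line."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue  # skip track/browser definitions and comment lines
        fields = line.split()
        yield fields[0], int(fields[1]), int(fields[2]), fields[3:]


bed_lines = [
    'track name=pairedReads description="Clone Paired Reads" useScore=1',
    "#chr    start end name score strand",
    "chr22   1000 5000 cloneA 960 +",
    "chr22   2000 6000 cloneB 900 -",
]

for chrom, start, end, extra in parse_bed(bed_lines):
    # 0-based start, exclusive end: the length is simply end - start,
    # and the first covered base in 1-based coordinates is start + 1
    print(chrom, start + 1, end, end - start, extra)
```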
Previously in this workshop…
                                       Main data formats
- WIG files (location / annotation / scores): wiggle
Used to visualize or summarize data, in most cases count data or normalized count
data (RPKM); binary extension: bigWig (often used in GEO for ChIP-seq peaks)




browser position chr19:59304200-59310700
browser hide all

#150 base wide bar graph at arbitrarily spaced positions,
#threshold line drawn at y=11.76
#autoScale off viewing range set to [0:25]
#priority = 10 positions this as the first graph

track type=wiggle_0 name="variableStep" description="variableStep format" visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255 yLineMark=11.76 yLineOnOff=on priority=10
variableStep chrom=chr19 span=150
59304701 10.0
59304901 12.5
59305401 15.0
59305601 17.5
59305901 20.0
59306081 17.5
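
The variableStep layout can be unfolded into explicit intervals. The Python sketch below assumes the usual wiggle conventions (1-based positions, each value covering `span` bases from its start position); the function name is made up for this example.

```python
def expand_variable_step(lines):
    """Turn variableStep wiggle lines into (chrom, start, end, value) tuples."""
    chrom, span = None, 1
    intervals = []
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue
        if tokens[0] == "variableStep":
            # declaration line, e.g. "variableStep chrom=chr19 span=150"
            settings = dict(token.split("=", 1) for token in tokens[1:])
            chrom = settings["chrom"]
            span = int(settings.get("span", 1))
        elif chrom is not None and tokens[0].isdigit():
            start = int(tokens[0])
            intervals.append((chrom, start, start + span - 1, float(tokens[1])))
    return intervals


wig_lines = [
    "variableStep chrom=chr19 span=150",
    "59304701 10.0",
    "59304901 12.5",
]
for interval in expand_variable_step(wig_lines):
    print(interval)
```

Track, browser and comment lines are ignored because they appear before any variableStep declaration or do not start with a position.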
Previously in this workshop…
                                    Main data formats
- GFF format (General Feature Format) or GTF
Used for annotation of genetic / genomic features – such as all coding genes in Ensembl
Often used in downstream analysis to assign annotation to regions / peaks / …
FIELDS USED:

# seqname (the name of the sequence)
# source (the program that generated this feature)
# feature (the name of this type of feature – for example: exon)
# start (the starting position of the feature in the sequence)
# end (the ending position of the feature)
# score (a score between 0 and 1000)
# strand (valid entries include '+', '-', or '.')
# frame (if the feature is a coding exon, frame should be a number between
0-2 that represents the reading frame of the first base. If the feature is
not a coding exon, the value should be '.'.)
# group (all lines with the same group are linked together into a single
item)

track name=regulatory description="TeleGene(tm)    Regulatory Regions"
#seqname source feature start end score strand frame group
chr22 TeleGene enhancer 1000000 1001000 500        + . touch1
chr22 TeleGene promoter 1010000 1010100 900        + . touch1
chr22 TeleGene promoter 1020000 1025000 800        - . touch2
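
As a sketch of how the nine GFF fields can be read in practice, the Python snippet below parses two of the example lines and computes feature lengths. It assumes whitespace-separated columns as on the slide (real GFF files are tab-separated) and the GFF convention of 1-based, inclusive coordinates; the helper names are invented for this example.

```python
GFF_FIELDS = ["seqname", "source", "feature", "start", "end",
              "score", "strand", "frame", "group"]


def parse_gff_line(line):
    """Split one GFF line into its nine named fields (coordinates as integers)."""
    values = line.strip().split(None, 8)  # the group field keeps any inner spaces
    record = dict(zip(GFF_FIELDS, values))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    return record


def feature_length(record):
    """GFF coordinates are 1-based and inclusive on both ends."""
    return record["end"] - record["start"] + 1


gff_lines = [
    "chr22 TeleGene enhancer 1000000 1001000 500        + . touch1",
    "chr22 TeleGene promoter 1010000 1010100 900        + . touch1",
]
records = [parse_gff_line(line) for line in gff_lines]
for rec in records:
    print(rec["group"], rec["feature"], feature_length(rec))
```

Both features share the group `touch1`, so a genome browser would draw them as one linked item.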
Previously in this workshop…
                                     Main data formats
- VCF format (Variant Call Format)
Used to represent sequence variants such as SNPs and indels
Previously in this workshop…
                                  Main data formats
- http://genome.ucsc.edu/FAQ/FAQformat.html

- UCSC browser data formats: covers the most commonly used and widely accepted
  formats

- In addition, ENCODE data formats (narrowPeak / broadPeak)
Mapping to a reference genome
                                      The workflow
Mapping:

Aligning the raw sequence reads to a reference genome by using an indexing strategy and
an alignment algorithm, taking into account the quality scores and user-specified conditions

- Raw sequence reads with quality scores: FASTQ
- Reference genome: FASTA files can be downloaded (UCSC/Ensembl)

- Sequence reads <> reference genome: alignment
- To perform an efficient alignment, an indexing strategy is used
- For instance (BWA/Bowtie): FM indexes (based on the Burrows-Wheeler transform) on the
  reference genome and/or the sequence reads

- Specific conditions: single-end or paired-end; how many mismatches allowed; trade-off
  speed/accuracy/specificity; local re-alignment afterwards for improved indel calling; …

>> Result: mapped sequence reads: chr / start / end / quality >> SAM file (>> BAM)
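
To give a feel for why an FM index makes alignment efficient, here is a toy Python sketch of exact matching by backward search on a Burrows-Wheeler transformed text. It is a teaching example under simplifying assumptions (naive suffix sorting, full occurrence table, exact matches only) and not how BWA or Bowtie are implemented; all names are invented for this example.

```python
def build_fm_index(text):
    """Build a toy FM index (suffix array, C table, Occ table) for text."""
    text += "$"  # unique terminator that sorts before A, C, G and T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)  # character preceding each suffix
    alphabet = sorted(set(text))
    # c[ch] = number of characters in the text strictly smaller than ch
    c, total = {}, 0
    for ch in alphabet:
        c[ch] = total
        total += text.count(ch)
    # occ[ch][i] = number of occurrences of ch in bwt[:i]
    occ = {ch: [0] for ch in alphabet}
    for b in bwt:
        for ch in alphabet:
            occ[ch].append(occ[ch][-1] + (b == ch))
    return sa, c, occ


def find_exact(pattern, index):
    """Return sorted 0-based positions of exact matches via backward search."""
    sa, c, occ = index
    lo, hi = 0, len(sa)  # half-open range of suffix-array rows
    for ch in reversed(pattern):
        if ch not in c:
            return []
        lo = c[ch] + occ[ch][lo]
        hi = c[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[lo:hi])


index = build_fm_index("GATTACAGATTA")
print(find_exact("ATTA", index))
```

The search takes one step per base of the read, independent of the size of the reference, which is the property real aligners exploit (with compressed tables and support for mismatches and gaps).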
Mapping to a reference genome
                                       The workflow
The reference genome

- Sequences (human, rat, mouse, …) can be downloaded from UCSC (Golden Path) or
  Ensembl
- Difficulty: downloads may be in 2bit format, which needs a converter (for example UCSC
  twoBitToFa) >> FASTA files (.fa)
- The reference needs to be indexed by the mapping program you are going to use

- BWA: bwa index
- Bowtie: bowtie-build (pre-computed indexes available)

- BWA example:

bwa index [-p prefix] [-a algoType] [-c] <in.db.fasta>
Index database sequences in the FASTA format.

OPTIONS:
-c         Build color-space index. The input fasta should be in nucleotide space.
-p STR     Prefix of the output database [same as db filename]
-a STR     Algorithm for constructing BWT index. Available options are:
           is      IS linear-time algorithm for constructing suffix array.
                   It requires 5.37N memory where N is the size of the database.
           bwtsw   Algorithm implemented in BWT-SW. This method works with the whole human genome.
Mapping to a reference genome
                                     The workflow
The sequencing reads

- Sequence reads with quality scores: FASTQ files from the machine
- Depending on the mapping program, the reads need to be indexed as well

- BWA: converts reads to SA coordinates (Suffix Array) based on the reference genome
  index
- Bowtie: not needed: indexing and aligning in one step

- BWA:
  - Index the reference genome
  - Index the sequence reads (INPUT: FASTQ and reference genome index) >> SA coordinates
    (OUTPUT: SAI)
  - Convert the SA coordinates (INPUT: SAI, FASTQ and reference genome index) >> SAM/BAM
    (OUTPUT)
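
The three BWA steps above can be sketched as a small Python helper that only assembles the command lines without running them. The sub-commands and the `-p` and `-t` options are the ones listed in this workshop; the file names (`GRCh37.fa`, the `GRCh37` prefix) and the function names are hypothetical and serve only as an illustration.

```python
import shlex


def bwa_single_end_steps(reference_fasta, index_prefix, fastq, threads=4):
    """Return (argv, stdout_file) pairs for the three single-end BWA steps."""
    stem = fastq.rsplit(".", 1)[0]
    sai, sam = stem + ".sai", stem + ".sam"
    return [
        # 1. index the reference genome (done once per genome)
        (["bwa", "index", "-p", index_prefix, reference_fasta], None),
        # 2. find the SA coordinates of the reads
        (["bwa", "aln", "-t", str(threads), index_prefix, fastq], sai),
        # 3. convert the SA coordinates into SAM alignments
        (["bwa", "samse", index_prefix, sai, fastq], sam),
    ]


def as_shell(argv, stdout_file):
    """Render one step as a shell command line with optional output redirection."""
    line = " ".join(shlex.quote(arg) for arg in argv)
    return line if stdout_file is None else f"{line} > {shlex.quote(stdout_file)}"


steps = bwa_single_end_steps("GRCh37.fa", "GRCh37", "SRR058523.fastq")
for argv, out in steps:
    print(as_shell(argv, out))
```

Keeping the steps as argument lists makes it straightforward to hand them to `subprocess.run` later, once the paths match your own installation.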
Mapping to a reference genome
                                       The workflow
aln        bwa aln [-n][-o][-e][-d][-i][-k][-l][-t][-cRN][-M][-O][-E][-q]
            <in.db.fasta> <in.query.fq> > <out.sai>

Find the SA coordinates of the input reads.
At most maxSeedDiff differences are allowed in the first seedLen bases (the seed), and
at most maxDiff differences are allowed in the whole sequence.

OPTIONS:
-n NUM     Maximum edit distance if the value is INT, or the fraction of missing
           alignments given a 2% uniform base error rate if the value is FLOAT
-o INT     Maximum number of gap opens
-e INT     Maximum number of gap extensions, -1 for k-difference mode
-d INT     Disallow a long deletion within INT bp towards the 3’-end
-i INT     Disallow an indel within INT bp towards the ends [5]
-l INT     Take the first INT subsequence as seed.
-k INT     Maximum edit distance in the seed
-t INT     Number of threads (multi-threading mode)
-M INT     Mismatch penalty
-O INT     Gap open penalty
-E INT     Gap extension penalty
-R INT     Proceed with suboptimal alignments if there are no more than INT equally best hits
-c         Reverse query but not complement it
-N         Disable iterative search.
-q INT     Parameter for read trimming.
-I         The input is in the Illumina 1.3+ read format (quality equals ASCII-64)
-B INT     Length of barcode starting from the 5’-end.
-b         Specify that the input read sequence file is in BAM format.
-0         When -b is specified, only use single-end reads in mapping.
-1         When -b is specified, only use the first read in a read pair in mapping
-2         When -b is specified, only use the second read in a read pair in mapping
Mapping to a reference genome
                                       The workflow
samse      bwa samse [-n maxOcc] <in.db.fasta> <in.sai> <in.fq> > <out.sam>
Generate alignments in the SAM format given single-end reads
Repetitive hits will be randomly chosen.

OPTIONS:
-n INT     Maximum number of alignments to output in the XA tag for reads paired properly.
-r STR     Specify the read group in a format like ‘@RG\tID:foo\tSM:bar’


sampe      bwa sampe [-a][-o][-n][-N][-P] <in.db.fasta>
<in1.sai> <in2.sai> <in1.fq> <in2.fq> > <out.sam>
Generate alignments in the SAM format given paired-end reads.
Repetitive read pairs will be placed randomly.

OPTIONS:
-a INT     Maximum insert size for a read pair to be considered being mapped properly.
-o INT     Maximum occurrences of a read for pairing.
-P         Load the entire FM-index into memory to reduce disk operations
-n INT     Maximum number of alignments to output in the XA tag for reads paired properly
-N INT     Maximum number of alignments to output in the XA tag for discordant read pairs
-r STR     Specify the read group in a format like ‘@RG\tID:foo\tSM:bar’
Mapping to a reference genome
                                       The workshop
Mapping using BWA

bwa-0.5.9 aln -t 4 /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.fastq > SRR058523.sai

bwa-0.5.9: BWA and its version
aln: alignment functionality of BWA
-t 4: use 4 threads (CPU cores) at the same time to speed up the alignment
/opt/genomes/GRCh37/index/bwa/GRCh37: location of the reference genome index
SRR058523.fastq: FASTQ file to align to the reference
>: redirects the output to a file
SRR058523.sai: the output file (SA coordinates, SAI file)

Maps the input sequences (FASTQ) to the reference genome index → output: indexes of
 the reads

This is not yet a ‘real genomic mapping’: that requires a next step…
Mapping to a reference genome
                                       The workshop
Mapping using BWA

bwa-0.5.9 samse /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.sai SRR058523.fastq |
samtools-0.1.18 view -bhSo PHF8-unsorted.bam -


bwa-0.5.9: BWA and its version
samse: single-end mapping and output in SAM format
/opt/genomes/GRCh37/index/bwa/GRCh37: location of the reference genome index
SRR058523.sai: the reads index
SRR058523.fastq: the raw reads and quality scores

On its own, this command could write a SAM file (for instance with > SRR058523.sam)
But we don’t need the SAM file, we would like a BAM file → processing by samtools

|: the ‘pipe’ symbol, hands over the output of one command to the next
samtools-0.1.18: samtools and its version
view: the command to process SAM files
-b: output BAM; -h: print the headers; -S: input is SAM; -o: output file name
PHF8-unsorted.bam: output file name
-: read the input from standard input, i.e. from the pipe (end of the second command)
Mapping to a reference genome
                                        The workshop
Mapping using BWA

bwa-0.5.9 aln -t 4 /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.fastq > SRR058523.sai

bwa-0.5.9 samse /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.sai SRR058523.fastq |
samtools-0.1.18 view -bhSo PHF8-unsorted.bam -

Two-step process in BWA

Next steps: process the BAM file → sort and index it (using samtools)

samtools-0.1.18 sort PHF8-unsorted.bam PHF8-sorted

Creates a sorted BAM file (PHF8-sorted.bam)
samtools-0.1.18 index PHF8-sorted.bam

Indexes the sorted BAM file (and creates a BAM index file: PHF8-sorted.bam.bai)
Mapping to a reference genome
                                         The workshop
BAM: what’s next?

So, now we have the sorted and indexed BAM file – what’s next?

This file is the starting point for all other analysis, depending on the application:

ChIP-seq: peak calling
SNP calling
RNA-seq: calculate gene-expression levels of the transcripts / find splice variants

What are the first things to do?
- Visualize it (IGV can load BAM files)
- First downstream analysis: QC and basic statistics (how many mapped reads, quality
  distribution, distribution across chromosomes,…)
Mapping to a reference genome
                                        The workshop
First downstream analysis

- QC and basic statistics (how many mapped reads, quality distribution, distribution
  across chromosomes, information on paired-end reads,…)

Samstat
/opt/samstat/samstat PHF8-sorted.bam



- Outputs an HTML file with statistics
Mapping to a reference genome
                                                The workshop
First downstream analysis

- QC and basic statistics (how many mapped reads, quality distribution, distribution
  across chromosomes, information on paired-end reads,…)

BamUtil (stats)

bam stats --in PHF8-sorted.bam --basic --phred --baseSum

Number of records read = 15732744

TotalReads(e6)          15.73
MappedReads(e6)         15.04
PairedReads(e6)         15.73
ProperPair(e6)          14.65
DuplicateReads(e6)      0.00
QCFailureReads(e6)      0.00

MappingRate(%)          95.59
PairedReads(%)          100.00
ProperPair(%)           93.11
DupRate(%)              0.00
QCFailRate(%)           0.00

TotalBases(e6)          802.37
BasesInMappedReads(e6)  766.95

Quality          Count
33               0
34               0
35               71373
36               0
37               0
38               203544
39               403649
40               921714
41               2081099
42               1974615
43               2285826
Mapping to a reference genome
                                       The workshop
First downstream analysis

- QC and basic statistics (how many mapped reads, quality distribution, distribution
  across chromosomes, information on paired-end reads,…)

Samtools
samtools-0.1.18 idxstats PHF8-sorted.bam

1      249250621        503714   0
2      243199373        345217   0
3      198022430        273477   0
4      191154276        229016   0
5      180915260        360339   0
6      171115067        257468   0
7      159138663        269704   0
8      146364022        242656   0
9      141213431        203505   0
10     135534747        237496   0
11     135006516        218116   0
12     133851895        231426   0
13     115169878        106831   0
14     107349540        119062   0
15     102531392        141351   0
16     90354753         183004   0
17     81195210         187024   0
18     78077248         86101    0
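
The idxstats columns are: reference sequence name, sequence length, number of mapped reads and number of unmapped reads. As a minimal sketch of a first downstream check, the Python snippet below parses two of the lines shown above and normalizes the mapped read count by chromosome length; the function names are invented for this example.

```python
def parse_idxstats(text):
    """Parse idxstats output into (name, length, mapped, unmapped) tuples."""
    rows = []
    for line in text.strip().splitlines():
        name, length, mapped, unmapped = line.split()
        rows.append((name, int(length), int(mapped), int(unmapped)))
    return rows


def reads_per_megabase(rows):
    """Mapped reads per million reference bases, per reference sequence."""
    return {name: mapped / (length / 1e6) for name, length, mapped, _ in rows}


# Two lines copied from the output above (chromosomes 1 and 13)
idxstats_output = """
1      249250621        503714   0
13     115169878        106831   0
"""
rows = parse_idxstats(idxstats_output)
density = reads_per_megabase(rows)
for name, value in density.items():
    print(name, round(value, 1))
```

Comparing read density rather than raw counts makes it easier to spot chromosomes that are over- or under-represented relative to their size.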
Mapping to a reference genome
                                     The workshop
First downstream analysis

- Think about PCR duplicates → you may want to remove them (or set a ‘flag’ in the BAM
  file, indicating it is a duplicate)
- Samtools rmdup or Picard MarkDuplicates

- Find out how these tools work and what other flags are used in BAM files
- Can you make statistics with the BAM flags?
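
One possible way to approach the flag question, sketched in Python: the FLAG field is a sum of bits, each with a fixed meaning in the SAM specification. The bit values below follow the specification (0x800 was added in a later version than the 1.3 header shown earlier); the short labels and function names are chosen for this example only.

```python
from collections import Counter

FLAG_BITS = {
    0x1: "paired", 0x2: "proper_pair", 0x4: "unmapped", 0x8: "mate_unmapped",
    0x10: "reverse", 0x20: "mate_reverse", 0x40: "first_in_pair",
    0x80: "second_in_pair", 0x100: "secondary", 0x200: "qc_fail",
    0x400: "duplicate", 0x800: "supplementary",
}


def decode_flag(flag):
    """List the labels of all bits that are set in a SAM FLAG value."""
    return [label for bit, label in FLAG_BITS.items() if flag & bit]


def flag_statistics(flags):
    """Count how often each flag bit is set over a collection of FLAG values."""
    counts = Counter()
    for flag in flags:
        counts.update(decode_flag(flag))
    return counts


print(decode_flag(163))  # the flag of read r001 in the SAM example
stats = flag_statistics([163, 83, 4])
print(stats["paired"], stats["unmapped"], stats["duplicate"])
```

Tools such as Picard MarkDuplicates set the 0x400 bit instead of deleting reads, so a count of that bit gives the number of marked duplicates.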
Mapping to a reference genome
                                     The workshop
Mapping – now let’s start!

- Mapping is only the starting point for most downstream analysis tools
- Depends on the application and what you want to do:

    - Exome sequencing / whole genome sequencing: SNP calling (samtools): based on
      mapping quality and coverage, SNPs are identified (VCF output format)

    - ChIP-seq: peak calling: based on coverage of ChIP and input, enriched regions are
      identified (BED output, BEDgraph and/or WIG files)

    - RNA-seq: assign reads to the transcripts, normalize (length of exon and number of
      reads in the sequencing library = RPKM) → (relative) expression levels →
      identification of differentially expressed genes
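
The RPKM normalization mentioned for RNA-seq (reads per kilobase of exon per million mapped reads) can be written out as a short Python sketch. The input numbers are hypothetical and only illustrate the arithmetic.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads Per Kilobase of feature per Million mapped reads."""
    # Equivalent to read_count / (feature_length_bp / 1e3) / (total_mapped_reads / 1e6)
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical example: 500 reads on a 2 kb transcript, 10 million mapped reads
print(rpkm(500, 2000, 10_000_000))
```

Dividing by feature length and by library size makes expression values comparable between long and short transcripts and between samples sequenced at different depths.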

Workshop NGS data analysis - 2

  • 1. Sequencing data analysis Workshop – part 2 / mapping to a reference genome Outline Previously in this workshop… Mapping to a reference genome – the steps Mapping to a reference genome – the workshop Maté Ongenaert
  • 2. Previously in this workshop… Introduction – the real cost of sequencing
  • 3. Previously in this workshop… Introduction – the real cost of sequencing
  • 4. Previously in this workshop… The workflow of NGS data analysis Data analysis Raw machine reads… What’s next? Preprocessing (machine/technology) - adaptors, indexes, conversions,… - machine/technology dependent Reads with associated qualities (universal) - FASTQ - QC check Depending on application (general applicable) - ‘de novo’ assembly of genome (bacterial genomes,…) - Mapping to a reference genome  mapped reads - SAM/BAM/… High-level analysis (specific for application) - SNP calling - Peak calling
  • 5. Previously in this workshop… The workflow of NGS data analysis
  • 6. Previously in this workshop… Main data formats Raw sequence reads: - Represent the sequence ~ FASTA >SEQUENCE_IDENTIFIER GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT - Extension: represent the quality, per base ~ FASTQ – Q for quality Score ~ phred ~ ASCII table ~ phred + 33 = Sanger @SEQUENCE_IDENTIFIER GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT + !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65 - Machine and platform independent and compressed: SRA (NCBI) Get the original FASTQ file using SRATools (NCBI)
  • 7. Previously in this workshop… Main data formats - Now moving to a common file format  SAM / BAM (Sequence Alignment/Map) - BAM: binary (read: computer-readable, indexed, compressed) ‘form’ of SAM DESCRIPTION OF THE 11 FIELDS IN THE ALIGNMENT SECTION # QNAME: template name #FLAG #RNAME: reference name # POS: mapping position #MAPQ: mapping quality #CIGAR: CIGAR string #RNEXT: reference name of the mate/next fragment #PNEXT: position of the mate/next fragment #TLEN: observed template length #SEQ: fragment sequence #QUAL: ASCII of Phred-scale base quality+33 #Headers @HD VN:1.3 SO:coordinate @SQ SN:ref LN:45 #Alignment block r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG * r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA * r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1 r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
  • 8. Previously in this workshop… Main data formats - BED files (location / annotation / scores): Browser Extensible Data Used for mapping / annotation / peak locations / - extension: bigBED (binary) FIELDS USED: # chr # start # end # name # score # strand track name=pairedReads description="Clone Paired Reads" useScore=1 #chr start end name score strand chr22 1000 5000 cloneA 960 + chr22 2000 6000 cloneB 900 – - BEDGraph files (location, combined with score) Used to represent peak scores track type=bedGraph name="BedGraph Format" description="BedGraph format" visibility=full color=200,100,0 altColor=0,100,200 priority=20 #chr start end score chr19 59302000 59302300 -1.0 chr19 59302300 59302600 -0.75 chr19 59302600 59302900 -0.50
  • 9. Previously in this workshop… Main data formats - WIG files (location / annotation / scores): wiggle Used for visulization or summarize data, in most cases count data or normalized count data (RPKM) – extension: BigWig – binary versions (often used in GEO for ChIP-seq peaks) browser position chr19:59304200-59310700 browser hide all #150 base wide bar graph at arbitrarily spaced positions, #threshold line drawn at y=11.76 #autoScale off viewing range set to [0:25] #priority = 10 positions this as the first graph track type=wiggle_0 name="variableStep" description="variableStep format" visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255 yLineMark=11.76 yLineOnOff=on priority=10 variableStep chrom=chr19 span=150 59304701 10.0 59304901 12.5 59305401 15.0 59305601 17.5 59305901 20.0 59306081 17.5
  • 10. Previously in this workshop… Main data formats - GFF format (General Feature Format) or GTF Used for annotation of genetic / genomic features – such as all coding genes in Ensembl Often used in downstream analysis to assign annotation to regions / peaks / … FIELDS USED: # seqname (the name of the sequence) # source (the program that generated this feature) # feature (the name of this type of feature – for example: exon) # start (the starting position of the feature in the sequence) # end (the ending position of the feature) # score (a score between 0 and 1000) # strand (valid entries include '+', '-', or '.') # frame (if the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be '.'.) # group (all lines with the same group are linked together into a single item) track name=regulatory description="TeleGene(tm) Regulatory Regions" #chr source feature start end scores tr fr group chr22 TeleGene enhancer 1000000 1001000 500 + . touch1 chr22 TeleGene promoter 1010000 1010100 900 + . touch1 chr22 TeleGene promoter 1020000 1020000 800 - . touch2
  • 11. Previously in this workshop… Main data formats - VCF format (Variant Call Format) For SNP representation
  • 12. Previously in this workshop… Main data formats
      - http://genome.ucsc.edu/FAQ/FAQformat.html
      - UCSC Genome Browser data formats, including the most commonly used and widely accepted formats
      - In addition, ENCODE data formats (narrowPeak / broadPeak)
  • 14. Mapping to a reference genome The workflow
      Mapping: aligning the raw sequence reads to a reference genome by using an indexing strategy and an alignment algorithm, taking into account the quality scores and under specific conditions
      - Raw sequence reads with quality scores: FASTQ
      - Reference genome: FASTA files can be downloaded (UCSC/Ensembl)
      - Sequence reads <> reference genome: alignment
      - To perform an efficient alignment, an indexing strategy is used
      - For instance (BWA/Bowtie): FM-indexes (based on the Burrows-Wheeler transform) on the reference genome and/or the sequence reads
      - Specific conditions: single-end or paired-end; how many mismatches are allowed; trade-off between speed, accuracy and specificity; local re-alignment afterwards for improved indel calling; …
      >> Result: mapped sequence reads: chr / start / end / quality
      >> SAM file (>> BAM)
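To give an idea of what an FM-index does, here is a toy Python sketch of exact-match lookup by backward search over the Burrows-Wheeler transform. It is a teaching example only: real mappers such as BWA and Bowtie use compressed, sampled data structures and allow mismatches and gaps. The function names are hypothetical.

```python
def build_index(reference):
    """Return (suffix_array, bwt) for the reference with a '$' terminator appended."""
    text = reference + "$"
    # Naive suffix array: fine for a toy example, far too slow for a genome.
    suffix_array = sorted(range(len(text)), key=lambda i: text[i:])
    # BWT = character preceding each sorted suffix (wraps around to '$').
    bwt = "".join(text[i - 1] for i in suffix_array)
    return suffix_array, bwt


def find_positions(reference, pattern):
    """Return sorted 0-based positions of exact matches of pattern in reference."""
    suffix_array, bwt = build_index(reference)
    # first[c]: row of the first suffix starting with character c.
    first = {}
    for row, char in enumerate(sorted(bwt)):
        first.setdefault(char, row)
    low, high = 0, len(bwt)  # current suffix-array interval [low, high)
    for char in reversed(pattern):  # extend the match one base to the left
        if char not in first:
            return []
        low = first[char] + bwt[:low].count(char)
        high = first[char] + bwt[:high].count(char)
        if low >= high:
            return []
    return sorted(suffix_array[low:high])


hits = find_positions("GATTACAGATTACA", "ATTA")
no_hits = find_positions("GATTACAGATTACA", "GGG")
```

The search narrows an interval of the suffix array one base at a time, starting from the last base of the read, which is why the index is built once on the reference and reused for every read.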
  • 15. Mapping to a reference genome The workflow
      The reference genome
      - Sequences (human, rat, mouse, …) can be downloaded from UCSC (Golden Path) or Ensembl
      - Difficulty: downloads may be in 2bit format, which needs a converter >> FASTA files (.fa)
      - The genome needs to be indexed by the mapping program you are going to use
        - BWA: bwa index
        - Bowtie: bowtie-build (pre-computed indexes available)
      - BWA example:
          bwa index [-p prefix] [-a algoType] [-c] <in.db.fasta>
          Index database sequences in the FASTA format.
          OPTIONS:
            -c      Build color-space index. The input FASTA should be in nucleotide space.
            -p STR  Prefix of the output database [same as db filename]
            -a STR  Algorithm for constructing the BWT index. Available options are:
                    is     IS linear-time algorithm for constructing the suffix array. It requires 5.37N memory where N is the size of the database.
                    bwtsw  Algorithm implemented in BWT-SW. This method works with the whole human genome.
  • 16. Mapping to a reference genome The workflow
      The sequencing reads
      - Sequence reads with quality scores: FASTQ files from the machine
      - Depending on the mapping program, the reads need to be indexed as well
        - BWA: converts reads to SA (suffix array) coordinates based on the reference genome index
        - Bowtie: not needed, indexing and aligning happen in one step
      - BWA:
        - Index the reference genome
        - Index the sequence reads (INPUT: FASTQ and REF. GENOME) >> SA coordinates (OUTPUT: SAI)
        - Convert the SA coordinates (INPUT: SAI/FASTQ and REF. GENOME) >> SAM/BAM (OUTPUT)
  • 17. Mapping to a reference genome The workflow
      aln
        bwa aln [-n][-o][-e][-d][-i][-k][-l][-t][-cRN][-M][-O][-E][-q] <in.db.fasta> <in.query.fq> > <out.sai>
        Find the SA coordinates of the input reads. Maximum maxSeedDiff differences are allowed in the first seedLen subsequence and maximum maxDiff differences are allowed in the whole sequence.
        OPTIONS:
          -n NUM  Maximum edit distance if the value is INT, or the fraction of missing alignments given 2% uniform base error rate if FLOAT
          -o INT  Maximum number of gap opens
          -e INT  Maximum number of gap extensions, -1 for k-difference mode
          -d INT  Disallow a long deletion within INT bp towards the 3'-end
          -i INT  Disallow an indel within INT bp towards the ends [5]
          -l INT  Take the first INT subsequence as seed
          -k INT  Maximum edit distance in the seed
          -t INT  Number of threads (multi-threading mode)
          -M INT  Mismatch penalty
          -O INT  Gap open penalty
          -E INT  Gap extension penalty
          -R INT  Proceed with suboptimal alignments
          -c      Reverse query but not complement it
          -N      Disable iterative search
          -q INT  Parameter for read trimming
          -I      The input is in the Illumina 1.3+ read format (quality equals ASCII-64)
          -B INT  Length of barcode starting from the 5'-end
          -b      Specify that the input read sequence file is in the BAM format
          -0      When -b is specified, only use single-end reads in mapping
          -1      When -b is specified, only use the first read in a read pair in mapping
          -2      When -b is specified, only use the second read in a read pair in mapping
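The -I option exists because Illumina 1.3+ files store qualities as Phred + 64, while Sanger FASTQ uses Phred + 33. The following Python sketch shows the re-encoding that distinction implies; the function name `illumina13_to_sanger` is hypothetical and this is not how BWA implements it internally.

```python
def illumina13_to_sanger(quality_string):
    """Re-encode a Phred+64 (Illumina 1.3+) quality string as Phred+33 (Sanger)."""
    scores = [ord(char) - 64 for char in quality_string]
    if any(score < 0 for score in scores):
        # Characters below '@' cannot occur in Phred+64 data.
        raise ValueError("quality string does not look like Phred+64")
    return "".join(chr(score + 33) for score in scores)


# 'h' is Phred 40 and '@' is Phred 0 in the Illumina 1.3+ encoding.
converted = illumina13_to_sanger("hhh@")
```

Mapping a Phred + 64 file without telling the aligner would make every base look 31 quality points better than it is, which affects trimming (-q) and mapping quality.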
  • 18. Mapping to a reference genome The workflow
      samse
        bwa samse [-n maxOcc] <in.db.fasta> <in.sai> <in.fq> > <out.sam>
        Generate alignments in the SAM format given single-end reads. Repetitive hits will be randomly chosen.
        OPTIONS:
          -n INT  Maximum number of alignments to output in the XA tag for reads paired properly
          -r STR  Specify the read group in a format like '@RG\tID:foo\tSM:bar'
      sampe
        bwa sampe [-a][-o][-n][-N][-P] <in.db.fasta> <in1.sai> <in2.sai> <in1.fq> <in2.fq> > <out.sam>
        Generate alignments in the SAM format given paired-end reads. Repetitive read pairs will be placed randomly.
        OPTIONS:
          -a INT  Maximum insert size for a read pair to be considered being mapped properly
          -o INT  Maximum occurrences of a read for pairing
          -P      Load the entire FM-index into memory to reduce disk operations
          -n INT  Maximum number of alignments to output in the XA tag for reads paired properly
          -N INT  Maximum number of alignments to output in the XA tag for discordant read pairs
          -r STR  Specify the read group in a format like '@RG\tID:foo\tSM:bar'
  • 20. Mapping to a reference genome The workshop
      Mapping using BWA
        bwa-0.5.9 aln -t 4 /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.fastq > SRR058523.sai

        bwa-0.5.9: BWA and its version
        aln: alignment functionality of BWA
        -t 4: use 4 threads (CPU cores) at the same time to speed up
        /opt/genomes/GRCh37/index/bwa/GRCh37: location of the reference genome index
        SRR058523.fastq: FASTQ file to align to the reference
        >: redirects the output to a file
        SRR058523.sai: the output file (SA index file)

      Maps the input sequences (FASTQ) to the reference genome index >> output: SA coordinates of the reads
      This is not yet a 'real genomic mapping', so a next step is needed…
  • 21. Mapping to a reference genome The workshop
      Mapping using BWA
        bwa-0.5.9 samse /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.sai SRR058523.fastq | samtools-0.1.18 view -bhSo PHF8-unsorted.bam -

        bwa-0.5.9: BWA and its version
        samse: single-end mapping and output to SAM format
        /opt/genomes/GRCh37/index/bwa/GRCh37: location of the reference genome index
        SRR058523.sai: the reads index
        SRR058523.fastq: the raw reads and quality scores

      On its own, this would output a SAM file (for instance with > SRR058523.sam)
      But we don't need the SAM file, we would like a BAM file >> processing by samtools
        |: the 'pipe' symbol, hands over the output of one command to the next
        samtools-0.1.18: samtools and its version
        view: the command to process SAM files
        -b: output BAM; -h: print the headers; -S: input is SAM; -o: output file name
        PHF8-unsorted.bam: output file name
        -: read the input from standard input (the SAM output piped in from bwa samse)
  • 22. Mapping to a reference genome The workshop
      Mapping using BWA
        bwa-0.5.9 aln -t 4 /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.fastq > SRR058523.sai
        bwa-0.5.9 samse /opt/genomes/GRCh37/index/bwa/GRCh37 SRR058523.sai SRR058523.fastq | samtools-0.1.18 view -bhSo PHF8-unsorted.bam -
      Two-step process in BWA

      Next steps: process the BAM file >> sort and index it (using samtools)
        samtools-0.1.18 sort PHF8-unsorted.bam PHF8-sorted
          Creates a sorted BAM file (PHF8-sorted.bam)
        samtools-0.1.18 index PHF8-sorted.bam
          Indexes the sorted BAM file (and creates a BAM index file, PHF8-sorted.bam.bai)
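The four commands of this slide can be generated for any sample with a small helper. The Python sketch below only builds the command strings and does not run anything; `build_mapping_commands` is a hypothetical name, and the binary names and options are the ones used in this workshop. Note that the `sort <in.bam> <out.prefix>` form belongs to the old samtools 0.1.x series used here; newer samtools releases expect an explicit `-o` output file instead.

```python
def build_mapping_commands(index_prefix, fastq, sample, threads=4):
    """Return the shell commands for single-end BWA mapping, sorting and indexing."""
    sai = fastq.rsplit(".", 1)[0] + ".sai"  # SRR058523.fastq -> SRR058523.sai
    unsorted_bam = sample + "-unsorted.bam"
    sorted_prefix = sample + "-sorted"
    return [
        # Step 1: reads -> SA coordinates
        f"bwa-0.5.9 aln -t {threads} {index_prefix} {fastq} > {sai}",
        # Step 2: SA coordinates -> SAM, piped straight into BAM conversion
        f"bwa-0.5.9 samse {index_prefix} {sai} {fastq} | "
        f"samtools-0.1.18 view -bhSo {unsorted_bam} -",
        # Step 3: sort by coordinate (samtools 0.1.x syntax: output prefix, no extension)
        f"samtools-0.1.18 sort {unsorted_bam} {sorted_prefix}",
        # Step 4: build the .bai index
        f"samtools-0.1.18 index {sorted_prefix}.bam",
    ]


commands = build_mapping_commands(
    "/opt/genomes/GRCh37/index/bwa/GRCh37", "SRR058523.fastq", "PHF8"
)
```

Printing the list gives a script that can be reviewed before it is executed, which is useful when the same steps have to be repeated for several FASTQ files.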
  • 23. Mapping to a reference genome The workshop
      BAM: what's next?
      So, now we have the sorted and indexed BAM file. What's next?
      This file is the starting point for all other analyses, depending on the application:
        ChIP-seq: peak calling
        SNP calling
        RNA-seq: calculate gene-expression levels of the transcripts / find splice variants
      What are the first things to do?
      - Visualize it (IGV can load BAM files)
      - First downstream analysis: QC and basic statistics (how many mapped reads, quality distribution, distribution across chromosomes, …)
  • 24. Mapping to a reference genome The workshop
      First downstream analysis
      - QC and basic statistics (how many mapped reads, quality distribution, distribution across chromosomes, information on paired-end reads, …)
      Samstat
        /opt/samstat/samstat PHF8-sorted.bam
      - Outputs an HTML file with statistics
  • 25. Mapping to a reference genome The workshop
      First downstream analysis
      - QC and basic statistics (how many mapped reads, quality distribution, distribution across chromosomes, information on paired-end reads, …)
      BamUtil (stats)
        bam stats --in PHF8-sorted.bam --basic --phred --baseSum

        Number of records read = 15732744
        TotalReads(e6)          15.73
        MappedReads(e6)         15.04
        PairedReads(e6)         15.73
        ProperPair(e6)          14.65
        DuplicateReads(e6)      0.00
        QCFailureReads(e6)      0.00
        MappingRate(%)          95.59
        PairedReads(%)          100.00
        ProperPair(%)           93.11
        DupRate(%)              0.00
        QCFailRate(%)           0.00
        TotalBases(e6)          802.37
        BasesInMappedReads(e6)  766.95

        Quality  Count
        33       0
        34       0
        35       71373
        36       0
        37       0
        38       203544
        39       403649
        40       921714
        41       2081099
        42       1974615
        43       2285826
  • 26. Mapping to a reference genome The workshop
      First downstream analysis
      - QC and basic statistics (how many mapped reads, quality distribution, distribution across chromosomes, information on paired-end reads, …)
      Samtools
        samtools-0.1.18 idxstats PHF8-sorted.bam

        Columns: reference sequence name, sequence length, number of mapped reads, number of unmapped reads
        1   249250621  503714  0
        2   243199373  345217  0
        3   198022430  273477  0
        4   191154276  229016  0
        5   180915260  360339  0
        6   171115067  257468  0
        7   159138663  269704  0
        8   146364022  242656  0
        9   141213431  203505  0
        10  135534747  237496  0
        11  135006516  218116  0
        12  133851895  231426  0
        13  115169878  106831  0
        14  107349540  119062  0
        15  102531392  141351  0
        16  90354753   183004  0
        17  81195210   187024  0
        18  78077248   86101   0
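Raw read counts per chromosome are hard to compare because chromosomes differ in length. A minimal Python sketch, assuming the four-column idxstats layout (name, length, mapped, unmapped), normalizes the counts to mapped reads per megabase; `reads_per_megabase` is a hypothetical helper name.

```python
def reads_per_megabase(idxstats_text):
    """Map each reference name to its mapped-read density (reads per Mb)."""
    density = {}
    for line in idxstats_text.strip().splitlines():
        name, length, mapped, _unmapped = line.split()
        length = int(length)
        if length == 0:
            # idxstats reports unplaced reads as '*' with length 0; skip them.
            continue
        density[name] = int(mapped) / (length / 1e6)
    return density


# Two rows taken from the idxstats output of this workshop.
idxstats_text = "1\t249250621\t503714\t0\n13\t115169878\t106831\t0\n"
density = reads_per_megabase(idxstats_text)
```

With these two rows, chromosome 1 carries roughly twice as many mapped reads per megabase as chromosome 13, a difference that the raw counts alone would overstate.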
  • 27. Mapping to a reference genome The workshop
      First downstream analysis
      - Think about PCR duplicates >> you may want to remove them (or set a 'flag' in the BAM file, indicating that the read is a duplicate)
      - Samtools rmdup or Picard MarkDuplicates
      - Find out how these tools work and what other flags are used in BAM files
      - Can you make statistics with the BAM flags?
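As a starting point for the flag exercise: the FLAG field of a SAM/BAM record is a bitwise combination of the values defined in the SAM specification, with 0x400 marking a PCR or optical duplicate. The Python sketch below decodes a flag and counts duplicates; the short labels and function names are hypothetical, the bit values are those of the specification.

```python
# Bit values from the SAM specification; the labels are shorthand for this example.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the labels of all bits set in a SAM FLAG value."""
    return [label for bit, label in SAM_FLAGS.items() if flag & bit]


def count_duplicates(flags):
    """Count records whose duplicate bit (0x400) is set."""
    return sum(1 for flag in flags if flag & 0x400)


labels = decode_flag(1107)  # 1107 = 1024 + 64 + 16 + 2 + 1
duplicates = count_duplicates([0, 16, 1024, 1107, 4])
```

The flag values themselves are column 2 of each SAM record, so the same counting works on the output of `samtools view` for any of the other bits as well.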
  • 28. Mapping to a reference genome The workshop
      Mapping – now let's start!
      - Mapping is only the starting point for most downstream analysis tools
      - What follows depends on the application and what you want to do:
        - Exome sequencing / whole genome sequencing: SNP calling (samtools): based on mapping quality / coverage / … >> identification of SNPs (VCF output format)
        - ChIP-seq: peak calling: based on the coverage of ChIP and input, enriched regions are identified (BED output, BEDgraph and/or WIG files)
        - RNA-seq: assign reads to the transcripts, normalize (length of exon and number of reads in the sequencing library = RPKM) >> (relative) expression levels >> identification of differentially expressed genes
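The RPKM normalization mentioned for RNA-seq (reads per kilobase of feature per million mapped reads) can be written out as a short Python sketch. The function name and the example numbers are made up for illustration.

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical gene: 500 reads on 2000 bp of exon, in a library of 10 million mapped reads.
expression = rpkm(500, 2000, 10_000_000)
```

Dividing by the feature length corrects for longer transcripts collecting more reads, and dividing by the library size makes samples sequenced to different depths comparable.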