RNA-Seq Data Analysis

National Bureau of Animal Genetic Resources
Karnal
Transcriptome Sequencing
Sequencing the steady-state RNA in a sample is known as RNA-Seq. It is free of limitations such as the need for prior knowledge about the organism.
RNA-Seq is useful for unravelling otherwise inaccessible complexities of the transcriptome, such as novel transcripts and isoforms.
The data set produced is large and complex; interpretation is not straightforward.
Making sense of RNA-Seq data...
It depends upon the scientific question of interest.
For example, allele-specific expression requires accurate determination of the transcribed SNPs.
Finding novel transcripts helps in detecting fusion gene events and aberrations in cancer samples.
Applications of RNA-Seq
1. Abundance estimation
2. Alternative splicing
3. RNA editing
4. Finding novel transcripts
5. Finding isoforms
And many more...
From RNA-seq reads to differential expression results: Oshlack et al., Genome Biology 2010, 11:220
Mapping Reads to Reference: CLC bio Workbench
• The RNA-Seq analysis is done in several steps: First, all genes are extracted from the reference genome (using annotations of type gene). Other annotations on the gene sequences are preserved (e.g. CDS information about coding sequences).
• Next, all annotated transcripts (using annotations of type mRNA) are extracted. If there are several annotated splice variants, they are all extracted. Note that the mRNA annotation type is used for extracting the exon-exon boundaries.
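How exon-exon boundaries follow from an mRNA annotation can be sketched in a few lines of Python. This is only an illustration, not the workbench's implementation: the coordinate convention (1-based, inclusive), the function names and the two example splice variants are all assumptions made for the sketch.

```python
from typing import Dict, List, Tuple

# Assumption: an exon is a (start, end) pair in 1-based inclusive coordinates.
Exon = Tuple[int, int]


def exon_exon_boundaries(exons: List[Exon]) -> List[Tuple[int, int]]:
    """Return (last base of upstream exon, first base of downstream exon)
    for every pair of consecutive exons in one transcript."""
    ordered = sorted(exons)
    return [(left[1], right[0]) for left, right in zip(ordered, ordered[1:])]


def spliced_length(exons: List[Exon]) -> int:
    """Total length of the transcript after the introns are removed."""
    return sum(end - start + 1 for start, end in exons)


# Hypothetical gene with two annotated splice variants; both are extracted.
transcripts: Dict[str, List[Exon]] = {
    "variant_1": [(100, 200), (300, 400), (500, 650)],
    "variant_2": [(100, 200), (500, 650)],
}

for name, exons in transcripts.items():
    print(name, exon_exon_boundaries(exons), spliced_length(exons))
```

Each splice variant contributes its own set of junctions, which is why all annotated variants are extracted rather than just one per gene.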
Mapping Examples
The mapping parameters
• Maximum number of mismatches: applies to short reads (shorter than 56 nucleotides; color space data are always treated as long reads). This is the maximum number of mismatches allowed. The maximum value is 3, except for color space, where it is 2.
• Minimum length fraction: the default is 0.9, which means that at least 90% of the bases need to align to the reference.
• Minimum similarity fraction: the default is 0.8. Together with the default length fraction, this means that at least 90% of the read must align with at least 80% similarity for the read to be included.
• Maximum number of hits for a read: a read that matches more distinct places in the reference than this value will not be mapped.
• Strand-specific alignment: reads are mapped only to a specific strand.
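The way the length fraction, similarity fraction and hit limit combine can be sketched as a simple filter. This is a minimal sketch under stated assumptions, not the workbench's actual algorithm: the `Alignment` record, the function name, the definition of similarity as matches divided by aligned bases, and the `max_hits` default of 10 are all hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    """Hypothetical summary of one read's best alignment."""
    read_length: int     # total number of bases in the read
    aligned_length: int  # bases of the read that are part of the alignment
    matches: int         # aligned bases identical to the reference


def passes_filters(
    aln: Alignment,
    n_hits: int,
    min_length_fraction: float = 0.9,      # default quoted in the slide
    min_similarity_fraction: float = 0.8,  # default quoted in the slide
    max_hits: int = 10,                    # assumed value, for illustration only
) -> bool:
    """Return True if the read would be kept under the three thresholds."""
    if n_hits > max_hits:
        return False  # matches too many distinct places: not mapped
    if aln.aligned_length == 0:
        return False
    length_fraction = aln.aligned_length / aln.read_length
    # Assumption: similarity is measured over the aligned part of the read.
    similarity = aln.matches / aln.aligned_length
    return (length_fraction >= min_length_fraction
            and similarity >= min_similarity_fraction)


print(passes_filters(Alignment(100, 95, 80), n_hits=1))   # long enough, similar enough
print(passes_filters(Alignment(100, 85, 85), n_hits=1))   # only 85% of the read aligns
print(passes_filters(Alignment(100, 95, 70), n_hits=1))   # similarity below 0.8
print(passes_filters(Alignment(100, 95, 95), n_hits=11))  # too many hits
```

Both fractions must be satisfied at once, so a read that aligns perfectly over only 85% of its length is rejected just like a full-length alignment with too many mismatches.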
Summarization
Summarization : Mapping Statistics
Summarization : Detailed Mapping Statistics
Summarization : Parameters
• Transcripts: The number of transcripts based on the mRNA annotations on the reference. Note that this is not based on the sequencing data, only on the annotations already on the reference sequence(s).
• Exon length: The total length of all exons (not all transcripts).
• Unique gene reads: The number of reads that match uniquely to the gene.
• Total gene reads: All the reads that are mapped to this gene: both reads that map uniquely to the gene and reads that matched more positions in the reference (but fewer than the 'Maximum number of hits for a read' parameter) and were assigned to this gene.
• RPKM: Reads Per Kilobase of exon model per Million mapped reads, the expression value [Mortazavi et al., 2008]:
RPKM = total exon reads / (mapped reads (millions) × exon length (kb))
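The RPKM definition above translates directly into a small function. The sketch below is illustrative: the function name, argument names and the example numbers are assumptions, and only the formula itself comes from the slide.

```python
def rpkm(total_exon_reads: int, exon_length_bp: int, mapped_reads: int) -> float:
    """Reads Per Kilobase of exon model per Million mapped reads.

    total_exon_reads: reads mapped to the exons of the gene
    exon_length_bp:   total exon length of the gene, in base pairs
    mapped_reads:     total number of mapped reads in the sample
    """
    if exon_length_bp <= 0 or mapped_reads <= 0:
        raise ValueError("exon length and mapped read count must be positive")
    exon_length_kb = exon_length_bp / 1_000
    mapped_reads_millions = mapped_reads / 1_000_000
    return total_exon_reads / (mapped_reads_millions * exon_length_kb)


# Hypothetical gene: 500 exon reads, 2,000 bp of exon, 20 million mapped reads.
print(rpkm(500, 2_000, 20_000_000))  # 500 / (20 * 2) = 12.5
```

Dividing by both exon length and library size makes the value comparable between genes of different length and between samples sequenced to different depths.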
Visualizing Mapping
Read Quality Assessment
Basic Statistics Summary
The Basic Statistics module generates some simple composition statistics for the file analysed.
• Filename: The original filename of the file which was analysed.
• File type: Says whether the file appeared to contain actual base calls or colorspace data which had to be converted to base calls.
• Total Sequences: A count of the total number of sequences processed. There are two values reported, actual and estimated.
• Sequence Length: Provides the length of the shortest and longest sequence in the set. If all sequences are the same length, only one value is reported.
• %GC: The overall %GC of all bases in all sequences.
Warning
Basic Statistics never raises a warning.
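The composition statistics listed above are simple to reproduce for a small file. The following is a minimal sketch, not FastQC's own code: it assumes a plain four-line-per-record FASTQ layout, and the function names and the three example reads are made up for illustration.

```python
import io
from typing import Dict, Iterator, TextIO, Tuple


def read_fastq(handle: TextIO) -> Iterator[Tuple[str, str, str]]:
    """Yield (name, sequence, quality) from a four-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is ignored
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality


def basic_statistics(handle: TextIO) -> Dict[str, float]:
    """Total sequences, shortest and longest read length, and overall %GC."""
    total = gc = bases = 0
    lengths = []
    for _, sequence, _ in read_fastq(handle):
        total += 1
        lengths.append(len(sequence))
        bases += len(sequence)
        gc += sum(1 for base in sequence.upper() if base in "GC")
    return {
        "total_sequences": total,
        "min_length": min(lengths) if lengths else 0,
        "max_length": max(lengths) if lengths else 0,
        "percent_gc": 100.0 * gc / bases if bases else 0.0,
    }


# Three hypothetical reads, supplied in memory instead of from a file.
example = io.StringIO(
    "@r1\nACGTACGT\n+\nIIIIIIII\n"
    "@r2\nGGCC\n+\nIIII\n"
    "@r3\nATATATATAT\n+\nIIIIIIIIII\n"
)
stats = basic_statistics(example)
print(stats)
```

For a real file, the same function can be called on an open file handle, for example `basic_statistics(open(path))`, where `path` is whatever FASTQ file is being inspected.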


This view shows an overview of the range of quality values across all bases at each position in the FastQ file. For each position a box-and-whisker plot is drawn. The elements of the plot are as follows:
• The central red line is the median value.
• The yellow box represents the inter-quartile range (25-75%).
• The upper and lower whiskers represent the 10% and 90% points.
• The blue line represents the mean quality.
The y-axis on the graph shows the quality scores. The higher the score, the better the base call. The background of the graph divides the y-axis into very good quality calls (green), calls of reasonable quality (orange), and calls of poor quality (red). The quality of calls on most platforms will degrade as the run progresses, so it is common to see base calls falling into the orange area towards the end of a read. It should be mentioned that there are a number of different ways to encode a quality score in a FastQ file. FastQC attempts to automatically determine which encoding method was used; the title of the graph describes the encoding FastQC thinks your file used.
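The numbers behind each box in this plot can be approximated with the Python standard library. This is a rough sketch rather than FastQC's implementation: it assumes Phred+33 encoding (one of the several encodings mentioned above), uses the "inclusive" interpolation method of `statistics.quantiles`, which may differ slightly from FastQC's percentile calculation, and works on four made-up quality strings of equal length.

```python
import statistics
from typing import Dict, List


def phred33(quality_string: str) -> List[int]:
    """Decode a quality string, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality_string]


def per_position_summary(quality_strings: List[str]) -> List[Dict[str, float]]:
    """Median, quartiles, 10%/90% points and mean of the scores at each position.

    Each position needs scores from at least two reads, because
    statistics.quantiles cannot work on a single value in older Python versions.
    """
    columns: Dict[int, List[int]] = {}
    for quality in quality_strings:
        for position, score in enumerate(phred33(quality)):
            columns.setdefault(position, []).append(score)

    summary = []
    for position in sorted(columns):
        scores = columns[position]
        deciles = statistics.quantiles(scores, n=10, method="inclusive")
        quartiles = statistics.quantiles(scores, n=4, method="inclusive")
        summary.append({
            "position": position + 1,            # 1-based position in the read
            "p10": deciles[0],                   # lower whisker
            "q1": quartiles[0],                  # bottom of the box
            "median": statistics.median(scores), # central line
            "q3": quartiles[2],                  # top of the box
            "p90": deciles[-1],                  # upper whisker
            "mean": statistics.fmean(scores),    # mean-quality line
        })
    return summary


# Hypothetical reads whose quality drops towards the 3' end.
reads = ["IIII", "II5+", "I55+", "II5#"]
summary = per_position_summary(reads)
for row in summary:
    print(row)
```

In this toy input the first position has a median of 40 while the last has a median of 10, mirroring the degradation towards the end of the read described above.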
The per sequence quality score report allows you to see if a subset of your sequences have universally low quality values. It is often the case that a subset of sequences will have universally poor quality, often because they are poorly imaged (on the edge of the field of view etc.); however, these should represent only a small percentage of the total sequences. If a significant proportion of the sequences in a run have overall low quality, then this could indicate some kind of systematic problem, possibly with just part of the run (for example one end of a flowcell).
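The idea of checking what proportion of reads have low overall quality can be sketched as follows. This is an illustration only: Phred+33 encoding is assumed, the threshold of 20 is an arbitrary choice for the example and not a FastQC setting, and the function names and quality strings are hypothetical.

```python
from typing import List


def mean_quality(quality_string: str) -> float:
    """Mean Phred score of one read, assuming Phred+33 encoding."""
    scores = [ord(ch) - 33 for ch in quality_string]
    return sum(scores) / len(scores)


def low_quality_fraction(quality_strings: List[str], threshold: float = 20.0) -> float:
    """Fraction of reads whose mean quality falls below the threshold.

    The threshold of 20 is an assumption made for this sketch.
    """
    means = [mean_quality(q) for q in quality_strings]
    low = sum(1 for m in means if m < threshold)
    return low / len(means)


# Four hypothetical reads: three good ones and one universally poor one.
qualities = ["IIII", "IIII", "II5I", "####"]
print([mean_quality(q) for q in qualities])
print(low_quality_fraction(qualities))
```

A small fraction is expected in most runs; a large one would be the signal of a systematic problem that the report is designed to reveal.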
Normalization
Differential expression
Clustering
Comparison of Expression Profile
Expression Profile of Specific Pathways
Systems Biology : Gostat Analysis

Best GOs | Genes (Max: 100) | P-Value
GO:0003735 Mitochondria | mrpl42 mrpl41 ndufa13 ndufb5 timm13 etfb ndufa3 atp5d atp5j2 ndufb7 mrpl14 ndufa5 ndufa11 mrpl34 | 4.78E-06
GO:0005840 Ribosome | rps2 mrpl42 rps18 rps17 mrpl41 rps23 mrps18c rplp2 mrpl14 rpl9 rps29 mrpl34 | 4.78E-06

Count: 150, 12, 12
Total: 18253, 156, 163
