RNA-Seq Analysis
Simon Andrews, Laura Biggins, Sarah Inglesfield
simon.andrews@babraham.ac.uk
v2023-02
RNA-Seq Libraries
rRNA depleted mRNA
Fragment
Random prime + RT
2nd strand synthesis (+ U)
A-tailing
Adapter Ligation
(U strand degradation)
Sequencing
Reference based RNA-Seq Analysis
QC → Trimming → Mapping → Mapped QC → Exploration and Quantitation → Statistical Analysis
Sequence Data Processing
Raw Sequence Quality Control
@HWUSI-EAS611:34:6669YAAXX:1:1:5069:1159 1:N:0:
TCGATAATACCGTTTTTTTCCGTTTGATGTTGATACCATT
+
IIHIIHIIIIIIIIIIIIIIIIIIIIIIIHIIIIHIIIII
@HWUSI-EAS611:34:6669YAAXX:1:1:5243:1158 1:N:0:
TATCTGTAGATTTCACAGACTCAAATGTAAATATGCAGAG
+
DF=DBD<BBFGGGGGGGBD@GGGD4@CA3CGG>DDD:D,B
@HWUSI-EAS611:34:6669YAAXX:1:1:5266:1162 1:N:0:
GGAGGAAGTATCACTTCCTTGCCTGCCTCCTCTGGGGCCT
+
:GBGGGGGGGGGDGGDEDGGDGGGGDHHDHGHHGBGG:GG
FastQ Format Data
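Each FastQ record above takes four lines: an @-prefixed header, the base calls, a '+' separator, and a quality string with one character per base. The sketch below reads such records and decodes the quality characters. It assumes the Phred+33 encoding (which is consistent with the characters in the example records), and the function names are illustrative rather than taken from any tool in this course.

```python
import io

def read_fastq(handle):
    """Yield (name, sequence, phred_scores) for each 4-line FastQ record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line carries no data here
        quality = handle.readline().rstrip("\n")
        # Assumption: Phred+33 encoding, so score = ASCII code - 33
        yield header[1:], sequence, [ord(c) - 33 for c in quality]

def error_probability(phred):
    """Convert a Phred score into the probability that the call is wrong."""
    return 10 ** (-phred / 10)

example = io.StringIO(
    "@HWUSI-EAS611:34:6669YAAXX:1:1:5069:1159 1:N:0:\n"
    "TCGATAATACCGTTTTTTTCCGTTTGATGTTGATACCATT\n"
    "+\n"
    "IIHIIHIIIIIIIIIIIIIIIIIIIIIIIHIIIIHIIIII\n"
)
records = list(read_fastq(example))
name, seq, quals = records[0]
print(name, min(quals), max(quals), error_probability(max(quals)))
```

Under this encoding the 'I' characters in the first record decode to a score of 40, an error probability of about 1 in 10,000.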
FastQC
• Base call quality
• Composition
• Duplication
• Contamination
QC: Base Call Quality
[Plot: call quality (Phred score) by read position]
QC: Composition
[Plot: base composition by read position]
QC: Duplication (blue trace)
[Plot: percentage of library at each level of duplication]
Adapters and Trimming
Library Structure
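[Diagram, single end: Adapter - Insert - Adapter, with Primer and Read 1]
[Diagram, paired end: Adapter - Insert - Adapter, with Primer and Read 1, Primer and Read 2]
Trimming Adapters
[Diagram: Adapter - Insert - Adapter, with Primer and Read 1]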
Trimming Quality
Poor quality data tends to be at the 3’ end
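Because poor calls cluster at the 3’ end, the simplest form of quality trimming removes bases from that end until one passes a threshold. This is only a sketch of the idea: the threshold of 20 and the function name are assumptions, and real trimming programs generally use more robust algorithms than this.

```python
def trim_3prime(sequence, qualities, min_quality=20):
    """Drop bases from the 3' end until one reaches min_quality."""
    keep = len(sequence)
    while keep > 0 and qualities[keep - 1] < min_quality:
        keep -= 1
    return sequence[:keep], qualities[:keep]

# Hypothetical read whose quality decays towards the 3' end
trimmed_seq, trimmed_quals = trim_3prime(
    "ACGTACGT", [38, 37, 36, 30, 25, 12, 8, 2]
)
print(trimmed_seq, trimmed_quals)
```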
Mapping to a reference
Mapping
Exon 1 Exon 2 Exon 3 Genome
Simple mapping within exons
Mapping between exons
Spliced mapping
RNA-Seq Mapping Software
• HiSat2 (https://ccb.jhu.edu/software/hisat2/)
• STAR (http://code.google.com/p/rna-star/)
• TopHat (http://tophat.cbcb.umd.edu/)
HiSat2 pipeline
• Reference FastA files → Indexed Genome
• Reference GTF Models → Pool of known splice junctions
• Reads (fastq)
– Maps with known junctions? Yes → Report
– Maps convincingly with novel junction? Yes → Report, and Add the junction to the pool
– No → Discard
Mapped Data QC
Mapping Statistics
Time loading forward index: 00:01:10
Time loading reference: 00:00:05
Multiseed full-index search: 00:20:47
24548251 reads; of these:
24548251 (100.00%) were paired; of these:
1472534 (6.00%) aligned concordantly 0 times
21491188 (87.55%) aligned concordantly exactly 1 time
1584529 (6.45%) aligned concordantly >1 times
94.00% overall alignment rate
Time searching: 00:20:52
Overall time: 00:22:02
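The percentages in the report above can be checked directly from the read counts. The sketch below parses the lines shown on the slide; the parsing function is illustrative and only handles the concordant lines that appear here. A full aligner report can contain further categories that also feed into the overall rate.

```python
import re

summary = """24548251 reads; of these:
  24548251 (100.00%) were paired; of these:
    1472534 (6.00%) aligned concordantly 0 times
    21491188 (87.55%) aligned concordantly exactly 1 time
    1584529 (6.45%) aligned concordantly >1 times
94.00% overall alignment rate"""

def parse_summary(text):
    """Return the total read count and the concordant alignment counts."""
    total = int(re.search(r"(\d+) reads; of these:", text).group(1))
    counts = {}
    pattern = r"(\d+) \([\d.]+%\) aligned concordantly (.+)"
    for count, label in re.findall(pattern, text):
        counts[label.strip()] = int(count)
    return total, counts

total, counts = parse_summary(summary)
unique_pct = 100 * counts["exactly 1 time"] / total
aligned_pct = 100 * (total - counts["0 times"]) / total
print(f"unique: {unique_pct:.2f}%  aligned: {aligned_pct:.2f}%")
```

Comparing these figures between samples is usually more informative than any single value.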
Mapping Statistics
Exercise: RNA-Seq QC and Data Processing
Running programs in Linux
• Open a shell (text based OS interface)
• Type the name of the program you want to run
– Add on any options the program needs
– Press return - the program will run
– When the program ends control will return to the shell
• Run the next program!
Running programs
user@server:~$ ls
Desktop Documents Downloads examples.desktop
Music Pictures Public Templates Videos
user@server:~$
Command prompt - you can't enter a command unless you can see this
The command we're going to run (ls in this case, to list files)
The output of the command - just text in this case
The structure of a unix command
ls -ltd --reverse Downloads/ Desktop/ Documents/
Program name: ls
Switches: -ltd --reverse
Data (normally files): Downloads/ Desktop/ Documents/
Each option or section is separated by spaces. Options or files with spaces in must be put in quotes.
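Python's standard `shlex` module splits a command line with shell-like rules, which makes the effect of spaces and quotes easy to see. The folder name "My Documents/" is a made-up example of a name containing a space.

```python
import shlex

command = 'ls -ltd --reverse Downloads/ Desktop/ "My Documents/"'
parts = shlex.split(command)

program = parts[0]
switches = [p for p in parts[1:] if p.startswith("-")]
data = [p for p in parts[1:] if not p.startswith("-")]

print(program)   # the program name
print(switches)  # the switches
print(data)      # the data; the quoted name stays in one piece
```

Without the quotes, "My" and "Documents/" would be passed to the program as two separate file names.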
Command line switches
• Change the behaviour of the program
• Come in two flavours (each option often has both types available)
– Minus plus single letter (eg -x -c -z)
• Can be combined (eg -xcz)
– Two minuses plus a word (eg --extract --gzip)
• Can't be combined
• Some take an additional value
-f somefile.txt (specify a filename)
--width=30 (specify a value)
Specifying file paths
• Specify names from whichever directory you are currently in
– If I'm in /home/simon
– Data/big_data.fq.gz
• is the same as /home/simon/Data/big_data.fq.gz
• Move to the directory with the data and just use file names
– cd Data
– big_data.fq.gz
home
simon
Data
big_data.fq.gz
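The equivalence between the relative and full path can be shown with Python's `pathlib`; this is only an illustration of how the two names relate, not something you need to run in the practical.

```python
from pathlib import PurePosixPath

current_dir = PurePosixPath("/home/simon")

# A relative path is interpreted starting from the current directory
absolute = current_dir / "Data/big_data.fq.gz"

# After 'cd Data' the bare file name points at the same file
after_cd = (current_dir / "Data") / "big_data.fq.gz"

print(absolute)
print(absolute == after_cd)
```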
Command line completion
• Most errors in commands are typing errors in either program names or file paths
• Shells (eg BASH) can help with this by offering to complete path names for you
• Command line completion is achieved by typing a partial path and then pressing the TAB key (to the left of Q)
Command line completion
List of files / folders:
Desktop
Documents
Downloads
Music
Public
Published
Templates
Videos
T [TAB] → Templates
P [TAB] → Publ
Do [TAB] → [beep]
Do [TAB] [TAB] → Documents Downloads
Doc [TAB] → Documents
You should ALWAYS use TAB completion to fill in paths for locations which exist so you can't make typing mistakes (it obviously won't work for output files though)
Debugging Tips
• If anything (except the splice site extraction) completes almost immediately then it didn't work!
• Look for errors before asking for help. They will either be
– The last piece of text before the program exited
– The first piece of text produced after it started (followed by the help file)
• To see if a program is running go to another shell and look at the last file produced to see if it's growing
• Programs which are stuck can be cancelled with Control+C
Some useful commands
cd mydir      Change directory to mydir
ls -ltrh      List files in the current directory, show details and put the newest files at the bottom
less x.txt    View the x.txt text file
              Return = down one line
              Space = down one page
              q = quit
Data Visualisation and Exploration
Viewing Mapped Data
• Reads over exons
• Reads over introns
• Reads in intergenic regions
• Strand specificity
SeqMonk RNA-Seq QC (good)
SeqMonk RNA-Seq QC (bad)
SeqMonk RNA-Seq QC (bad)
Look at poor QC samples
Duplication (again)
Exon
Exon
Duplication (good)
Duplication (moderate)
Duplication (bad)
Fixing Duplication?
• If duplication is biased (some genes more than others)
– Can’t be ‘fixed’ – can still analyse but be cautious
• If it’s unbiased (everything is duplicated)
– Doesn’t affect quantitation
– Will affect statistics
– Can estimate global level and correct raw counts
Quantitation
Exon 1 Exon 2 Exon 3
Exon 1 Exon 3
Splice form 1
Splice form 2
Definitely splice form 1
Definitely splice form 2
Ambiguous
Simple Quantitation - Forget splicing
• Count read overlaps with exons of each gene
– Consider library directionality
– Simple
– Gene level quantitation
– Many programs
• Seqmonk (graphical)
• Feature Counts (subread)
• BEDTools
• HTSeq
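The counting idea can be sketched in a few lines: a read counts towards a gene if it overlaps any of that gene's exons and, for directional libraries, lies on the expected strand. The genes, coordinates and reads below are made up, and the sketch assumes a same-strand library; the programs listed above handle ambiguous reads and library types far more carefully.

```python
from collections import Counter

# Hypothetical annotation: gene -> (strand, [(exon_start, exon_end), ...])
genes = {
    "geneA": ("+", [(100, 200), (300, 400)]),
    "geneB": ("-", [(350, 500)]),
}

# Hypothetical mapped reads: (start, end, strand)
reads = [(150, 190, "+"), (190, 310, "+"), (360, 400, "-"), (600, 640, "+")]

def count_reads(genes, reads, same_strand=True):
    """Count reads overlapping the exons of each gene."""
    counts = Counter({gene: 0 for gene in genes})
    for start, end, strand in reads:
        for gene, (gene_strand, exons) in genes.items():
            if same_strand and strand != gene_strand:
                continue  # wrong strand for a directional library
            if any(start <= exon_end and end >= exon_start
                   for exon_start, exon_end in exons):
                counts[gene] += 1
    return counts

directional = count_reads(genes, reads)
non_directional = count_reads(genes, reads, same_strand=False)
print(dict(directional), dict(non_directional))
```

Ignoring strand assigns the antisense read to geneA as well as geneB, which is the kind of mixed signal that considering directionality avoids.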
Analysing Splicing
• Try to quantitate transcripts (cufflinks, RSEM, bitSeq)
• Quantitate exons and compare to gene (EdgeR, DEXSeq)
• Quantitate splicing events (rMATS, MAJIQ)
Normalisation: RPKM / FPKM / TPM
• RPKM (Reads per kilobase of transcript per million reads of library)
– Corrects for total library coverage
– Corrects for gene length
– Comparable between different genes within the same dataset
• FPKM (Fragments per kilobase of transcript per million fragments of library)
– Only relevant for paired end libraries
– Pairs are not independent observations
– Effectively halves raw counts
• TPM (transcripts per million)
– Normalises to transcript copies instead of reads
– Corrects for cases where the average transcript length differs between samples
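The definitions above translate directly into arithmetic. The sketch below uses three made-up genes; for FPKM the same function applies with fragment counts in place of read counts.

```python
def rpkm(counts, lengths):
    """Reads per kilobase of transcript per million reads of library."""
    total = sum(counts.values())
    return {g: counts[g] / (lengths[g] / 1_000) / (total / 1_000_000)
            for g in counts}

def tpm(counts, lengths):
    """Transcripts per million: length-correct first, then scale to 1e6."""
    rate = {g: counts[g] / lengths[g] for g in counts}
    scale = sum(rate.values())
    return {g: rate[g] / scale * 1_000_000 for g in counts}

# Hypothetical raw counts and transcript lengths (bp)
counts = {"geneA": 500, "geneB": 1_000, "geneC": 500}
lengths = {"geneA": 1_000, "geneB": 4_000, "geneC": 500}

rpkm_values = rpkm(counts, lengths)
tpm_values = tpm(counts, lengths)
print(rpkm_values)
print(tpm_values)
```

TPM values always sum to one million within a sample, whereas the sum of RPKM values depends on the lengths of the transcripts that happen to be expressed.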
[Figure: expression shown on Log2 and Linear scales; labelled genes: Eef1a1, Actb, Lars2, Eef2, CD74]
Visualising Expression and Normalisation
Visualising Normalisation
Visualising Normalisation
Size Factor Normalisation
• Make an ‘average’ sample from the mean of expression for each gene across all samples
• For each sample calculate the distribution of differences between the data in that sample and the equivalent in the ‘average’ sample
• Use the median of the difference distribution to normalise the data
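The three steps above can be sketched as follows, working on log2 expression values so that a difference corresponds to a fold change. The numbers are made up, and this follows the description on the slide rather than the exact calculation used by any particular package.

```python
from statistics import mean, median

# Hypothetical log2 expression values, one list per sample, same gene order
samples = {
    "s1": [2.0, 5.0, 8.0, 10.0],
    "s2": [3.0, 6.0, 9.0, 11.0],
    "s3": [2.5, 5.5, 8.5, 14.0],
}

def size_factor_normalise(samples):
    """Shift each sample by its median difference from the average sample."""
    n_genes = len(next(iter(samples.values())))
    # Step 1: the 'average' sample is the per-gene mean across samples
    average = [mean(values[i] for values in samples.values())
               for i in range(n_genes)]
    normalised = {}
    for name, values in samples.items():
        # Step 2: differences between this sample and the average sample
        offset = median(v - a for v, a in zip(values, average))
        # Step 3: remove the median difference
        normalised[name] = [v - offset for v in values]
    return normalised

normalised = size_factor_normalise(samples)
for name, values in normalised.items():
    print(name, values)
```

Because the median is used, the single outlying gene in s3 does not drag the correction for the other genes, which end up aligned across all three samples.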
Normalisation – Coverage Outliers
Normalisation – DNA Contamination
Normalisation – DNA Contamination
Normalisation – DNA Contamination
Exploratory Analyses
• Time to understand your data
– Behaviour of raw data and annotation
– Clustering of samples (PCA / tSNE etc)
– Pairwise comparisons of samples and groups
– Are expected effects present (eg KO)?
– Can I validate other aspects of the samples (eg sex)?
– Can I see obvious changes?
– Are the changes convincing?
Differential Expression Statistics
Differential Expression
Mapped data → Normalised expression (+ confidence) → Continuous stats
Mapped data → Raw counts → Binomial stats
DESeq2 Negative Binomial Stats
• Are the counts we see for gene X in condition 1 consistent with those for gene X in condition 2?
• Size factors
– Estimator of library sampling depth
– More stable measure than total coverage
– Based on median ratio between conditions
• Variance – required for Negative Binomial distribution
– Insufficient observations to allow direct measure
– Custom variance distribution fitted to real data
– Smooth distribution assumed to allow fitting
Dispersion shrinkage
• Plot observed per gene dispersion
• Calculate average dispersion for genes with similar observation levels
• Individual dispersions regressed towards the mean. Weighted by
– Distance from mean
– Number of observations
• Points more than 2SD above the mean are not regressed
5x5 Replicates
8,022 out of 18,570 genes (43%) identified as DE using DESeq (p<0.05)
Needs further filtering
Two options:
1. Decrease the p-value cutoff
2. Filter on magnitude of change
(both are a bit rubbish)
Visualising Differential Expression Results
Visualising Differential Expression Results
Filter by p-value (FDR < 10^-20)
Filter by fold change (abs log2 change > 1.5)
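Applying both filters together is a simple selection over the results table. The genes and values below are made up, and the thresholds are taken from the two filters above.

```python
# Hypothetical results: (gene, log2 fold change, FDR-corrected p-value)
results = [
    ("gene1", 3.2, 1e-30),
    ("gene2", 0.4, 1e-25),
    ("gene3", -2.1, 1e-22),
    ("gene4", 2.8, 1e-5),
]

def filter_hits(results, max_fdr=1e-20, min_abs_log2fc=1.5):
    """Keep genes passing both the significance and the magnitude filter."""
    return [gene for gene, log2fc, fdr in results
            if fdr < max_fdr and abs(log2fc) > min_abs_log2fc]

hits = filter_hits(results)
print(hits)
```

Here gene2 is highly significant but changes too little, and gene4 changes a lot but is not significant enough, so only two genes survive both filters.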
Fold Change Shrinkage
• Aims to make the log2 Fold change a more useful value
• Tries to remove systematic biases
• Two types:
1. Fold Change Shrinkage – removes bias from both expression level and variance, produces a modified fold change
2. Intensity difference – removes bias from just expression level, produces a p-value
Fold Change Shrinkage
RNA sequencing data analysis course by Simon Andrews
Result Validation
(Can I believe the hits?)
2900097C17Rik RIKEN cDNA 2900097C17 gene
Hbb-b1 hemoglobin, beta adult major chain
Rps27a-ps2 ribosomal protein S27A, pseudogene 2
C230073G13Rik RIKEN cDNA C230073G13 gene
mt-Atp8 mitochondrially encoded ATP synthase 8
mt-Nd4l mitochondrially encoded NADH dehydrogenase
AC151712.4 erythroid differentiation regulator 1
Gm5641 predicted gene 5641
Validation
Data Exploration and Analysis Practical
Experimental Design for RNA-Seq
Practical Experiment Design
• What type of library?
• What type of sequencing?
• How many reads?
• How many replicates?
What type of library?
• Directional libraries if possible
– Easier to spot contamination
– No mixed signals from antisense transcription
– May be difficult for low input samples
• mRNA vs total vs depletion etc.
– Down to experimental questions
– Remember lincRNA may not have a polyA tail
– Active transcription vs standing mRNA pool
What type of sequencing
• Depends on your interest
– Expression quantitation of known genes
• 50bp single end is fine
– Expression plus splice junction usage
• 100bp (or longer if possible) single end
– Novel transcript discovery or per transcript expression
• 100bp paired end
How many reads
• Typically aim for 20 million reads for a human / mouse sized genome
• More reads:
– De-novo discovery
– Low expressed transcripts
• More replicates more useful than more reads
Replicates
• Compared to arrays, RNA-Seq is a very clean technical measure of expression
– Generally don’t run technical replicates
– Must run biological replicates
• For clean systems (eg cell lines) 3x3 or 4x4 is common
• Higher numbers required as the system gets more variable
• Always plan for at least one sample to fail
• Randomise across sample groups
Power Analysis
• Power Analysis is not simple for RNA-Seq data
– Not a single test – one test per gene
– Need to apply multiple testing correction
– Each gene will have different power
• Power correlates with observation level
• Variations in variance per gene
• Several tools exist to automate power analysis
– All require parameters which are difficult to estimate, and have dramatic effects on the outcome
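Multiple testing correction is one reason power is hard to reason about: with one test per gene, each raw p-value has to be adjusted for the number of tests run. The sketch below implements the widely used Benjamini-Hochberg FDR adjustment as an illustration. The slides do not say which correction a given tool applies, so treat the choice of method as an assumption.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    n = len(pvalues)
    # Walk from the largest p-value down so the adjusted values stay monotonic
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, index in enumerate(order):
        rank = n - position  # 1-based rank, smallest p-value has rank 1
        running_min = min(running_min, pvalues[index] * n / rank)
        adjusted[index] = running_min
    return adjusted

# Hypothetical raw p-values for four genes
pvalues = [0.01, 0.04, 0.03, 0.005]
fdr = benjamini_hochberg(pvalues)
print(fdr)
```

With thousands of genes the adjustment is much harsher than in this four-gene example, so a gene needs a far smaller raw p-value to survive.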
Power Analysis
Tools available
• RnaSeqSampleSize https://cqs-vumc.shinyapps.io/rnaseqsamplesizeweb/
• Scotty http://scotty.genetics.utah.edu/
• All require an estimate of count vs variance
– Pilot data (if only!)
– “Similar” studies
We are planning a RNA sequencing experiment to identify differential gene expression between two groups. Prior data
indicates that the minimum average read counts among the prognostic genes in the control group is 500, the maximum
dispersion is 0.1, and the ratio of the geometric mean of normalization factors is 1. Suppose that the total number of genes
for testing is 10000 and the top 100 genes are prognostic. If the desired minimum fold change is 3, we will need to study 4
subjects in each group to be able to reject the null hypothesis that the population means of the two groups are equal with
probability (power) 0.8 using exact test. The FDR associated with this test of this null hypothesis is 0.05.
Power Curves
Useful links
• FastQC http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
• HiSat2 https://ccb.jhu.edu/software/hisat2/
• SeqMonk http://www.bioinformatics.babraham.ac.uk/projects/seqmonk/
• Cufflinks http://cufflinks.cbcb.umd.edu/
• DESeq2 https://bioconductor.org/packages/release/bioc/html/DESeq2.html
• Bioconductor http://www.bioconductor.org/
• DupRadar http://sourceforge.net/projects/dupradar/



Editor's Notes

  • #1: The format of the course will be… Firstly – theory Steps involved in an RNA-seq analysis Practical session
  • #2: Point out parts in library prep which are relevant later on. Start with single strand RNA. Can't sequence because it's RNA and too long. Must fragment to make short bits. Must convert to DNA, but that loses directionality, which is bad. Random priming and RT (causes biases later). Normal Illumina library prep once double stranded. Can retain strand information by tagging and degrading one of the strands. Which one is degraded determines whether the library is same strand or opposing strand specific.
  • #3: Focus on experiment types where you have a reference. Mention transcriptome assembly.
  • #8: Some QC looks very similar for different types of sequencing
  • #9: Random primed so no positional bias is expected. Expect to see 4 horizontal lines, not always all at 25%; GC and AT might be different. Will see biases in the actual data. Will be flagged as a problem. Assume hexamer libraries have all possible hexamers in equal proportion. Assume that all possible hexamers bind and extend with equal efficiency. Should worry if you don't see this. Any slight biases in the analysis will be similar over all samples.
  • #10: Definitely a problem. Hard to assess from raw sequence. Shouldn't expect that all sequences are present equally, since we are measuring expression levels. You will see high duplication levels; this isn't necessarily a problem. Can look much more sensitively after mapping.
  • #16: Explain paired end sequence and colours. Must use a splicing aware read mapper.
  • #17: TopHat was the original. Mapped to transcriptome first, then to genome if no hit. STAR and HiSat2 are very similar. Map directly to genome but with knowledge of splice sites. We're using HiSat2 as it uses less memory. Pick one and stick to it.
  • #18: Discovery of new junctions depends a lot on where the junction is in the read. 50/50 is easy to discover; 90/10 is probably impossible. Can do 1 or 2 pass processing. 2 pass is more complete (it uses the first pass just to find junctions) but is slow. In practice 1 pass is fine and you hardly lose anything.
  • #21: Important thing to check is consistency. MultiQC.
  • #33: Zoomed right in on one gene
  • #34: To get a broader overview – RNA-seq QC report examines all the data
  • #39: Coincidental vs technical (PCR) duplication. Anything that isn’t a smooth distribution is more worrying
  • #41: What to do about bad duplication: duplication doesn't affect corrected quantitation. Mainly affects count based statistics. Can estimate basal level of duplication and divide raw counts by it. Only matters for statistical tests. Don't use unless there is a problem.
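The "estimate and divide" correction in the note above can be sketched as below. This assumes the basal duplication level is summarised as one library-wide factor, taken here (hypothetically) as total reads divided by distinct reads; the notes do not say how the estimate is made, so both function names and the estimator are assumptions.

```python
def duplication_factor(total_reads, distinct_reads):
    """Average number of times each distinct fragment was observed (assumed estimator)."""
    return total_reads / distinct_reads

def correct_counts(raw_counts, factor):
    """Divide raw counts by the basal duplication factor."""
    return [count / factor for count in raw_counts]

# Toy numbers, invented for illustration only.
factor = duplication_factor(total_reads=3000, distinct_reads=1000)
print(correct_counts([30, 300, 9], factor))
```

The corrected values are no longer guaranteed to be integers, so they may need rounding before being passed to a count-based test that expects whole numbers.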
  • #43: This is a simple example – about as simple as you get. Paired end mapping as before. Get a lot of ambiguous reads.
  • #45: Not going to do any of this today. Transcript level expression: expectation maximisation model to create the most likely redistribution of counts between different splice forms. Can work well in some cases. Can be *very* wrong in others. Very difficult to evaluate. Counting junctions is much less complex. Gives a set of counts which you can treat the same as expression levels. Splicing decisions are nice: ratio test.
  • #46: RPKM and TPM are functionally very similar. Main thing is the scale: TPM values are much higher. Can be an issue when you log transform. Very common to get negative logRPKM values, which is fine, but people don't like them.
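The relationship between the two measures can be sketched with the standard definitions. The function names and the toy numbers are illustrative; in real use TPM must be computed over all genes in the sample, not a subset.

```python
import math

def rpkm(counts, lengths_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    per_million = total_mapped_reads / 1_000_000
    return [c / (length / 1_000) / per_million
            for c, length in zip(counts, lengths_bp)]

def tpm(counts, lengths_bp):
    """Transcripts per million: length-normalise first, then scale to sum to 1e6."""
    rates = [c / length for c, length in zip(counts, lengths_bp)]
    total_rate = sum(rates)
    return [rate / total_rate * 1_000_000 for rate in rates]

# A gene with 5 reads over 2 kb in a hypothetical 20-million-read library.
low = rpkm([5], [2000], 20_000_000)[0]   # 0.125
print(math.log2(low))                    # -3.0
```

Because TPM is RPKM rescaled so that every sample sums to one million, the two differ only by a per-sample constant. Any RPKM below 1 gives a negative log value, as in the example.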
  • #48: Point out the upper scoop on a log scale. Changes in highly expressed genes can mess up the normalisation. Particularly rRNA, but there are others too. If normalisation is messed up then apply additional correction. Percentile normalisation: pick a nicely behaving part of the distribution. Size factor normalisation: take the median difference between all genes of two sets. Most of the time it's fine. Don't do additional correction if you don't need to.
  • #49: Same notes as #48.
  • #51: Same notes as #48.
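The two extra corrections named in the normalisation notes above can be sketched as follows. This is a rough illustration under stated assumptions, not SeqMonk's or DESeq2's implementation: the function names, the nearest-rank percentile method and the choice of the 75th percentile are all assumptions.

```python
import math
import statistics

def size_factor(reference, sample):
    """Median log2 difference between two samples, over genes observed in both."""
    diffs = [math.log2(s / r) for r, s in zip(reference, sample) if r > 0 and s > 0]
    return statistics.median(diffs)

def percentile_value(values, percentile):
    """Nearest-rank percentile of a list (percentile between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(percentile / 100 * len(ordered)))
    return ordered[rank - 1]

def percentile_normalise(values, percentile=75, target=1.0):
    """Scale a sample so that the chosen percentile equals a common target."""
    scale = target / percentile_value(values, percentile)
    return [v * scale for v in values]

# Toy data: the last gene is a highly expressed outlier (rRNA-like).
reference = [10, 20, 40, 80, 1000]
sample = [20, 40, 80, 160, 100000]
print(size_factor(reference, sample))
```

Both approaches use a robust summary (a median or a mid-range percentile) rather than the total count, so a handful of very highly expressed genes has little influence on the correction.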
  • #52: Apply DNA contamination estimation and subtraction only if you know you have a problem. Can fix problems which a mathematical transformation can't.
  • #54: Same notes as #52.
  • #57: Count based statistics are a natural fit for raw count quantitation. Continuous statistics would work with Cufflinks or other transcript level quantitation. Binomial stats have good power to detect changes.
  • #58: Talk about DESeq but others are very similar (EdgeR, BaySeq etc). Big problem is that people don't do enough replicates. Need a statistical kludge to work around this. We don’t get enough observations for an individual gene to get a good measure of variance. Need to share information between genes.
  • #59: There is a global relationship between variance and observation level. Makes sense: low observed is very variable, high observed is more stable. Can construct a global regression line of the relationship. In the test we don't use the observed variance but a 'shrunken' version of it, where it is moved towards the global line. The amount it moves depends on the number of replicates and the distance from the line. For most genes this is fine and improves the analysis. For some hyper-variable genes, though, it's *bad*. Worst offenders are removed by not shrinking any points more than 2SD above the mean. Nothing statistical or magical about 2SD. Other points won't hit that limit. Bottom line is that the fewer replicates you have, the more you rely on the global model.
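The shrinkage idea in the note above can be illustrated with a toy weighted average in log space. This is explicitly not DESeq2's estimator: the function name, `prior_weight`, `residual_sd` and the weighting formula are assumptions chosen only to show the behaviour described (more replicates means less movement, and points far above the line are left alone).

```python
import math

def shrink_variance(observed_var, trend_var, n_replicates,
                    prior_weight=4.0, residual_sd=1.0, max_sd=2.0):
    """Move an observed per-gene variance towards the global trend (toy model)."""
    log_obs = math.log(observed_var)
    log_trend = math.log(trend_var)

    # Hyper-variable genes: more than max_sd residual SDs above the trend,
    # so keep the observed value rather than shrinking it.
    if log_obs - log_trend > max_sd * residual_sd:
        return observed_var

    # More replicates -> more weight on the gene's own observation.
    data_weight = n_replicates - 1
    weight = data_weight / (data_weight + prior_weight)
    return math.exp(weight * log_obs + (1 - weight) * log_trend)

# Same gene, same trend value, different numbers of replicates.
print(shrink_variance(4.0, 1.0, n_replicates=3))
print(shrink_variance(4.0, 1.0, n_replicates=11))
```

With 3 replicates the estimate sits close to the trend; with 11 it stays closer to the observed value, matching the "fewer replicates, more reliance on the global model" point.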
  • #60: Always look at your results. Spots obvious errors in calculation.
  • #73: Examples: cell lines in a dish with a compound added. Samples collected in the wild, dissected, extracted, posted, processed at different times.