RNA-seq data analysis tutorial
Andrea Sboner
2015-05-21

NGS Experiment
Data management:
– Mapping the reads
– Creating summaries
Downstream analysis: the interesting stuff
Differential expression, chimeric transcripts, novel transcribed regions, etc.

What is RNA-seq?
• Next-generation sequencing applied to the “transcriptome”
Applications:
– Gene (exon, isoform) expression estimation
– Differential gene (exon, isoform) expression analysis
– Discovery of novel transcribed regions
– Discovery/Detection of chimeric transcripts
– Allele specific expression
– …

QC and pre-processing
• First step in QC:
– Look at quality scores to see if sequencing was successful
• Sequence data usually stored in FASTQ format:
@BI:080831_SL-XAN_0004_30BV1AAXX:8:1:731:1429#0/1
GTTTCAACGGGTGTTGGAATCCACACCAAACAATGGCTACCTCTATCACCC
+
hbhhP_Z\[`VFhHNU]KTWPHHIKMIIJKDJGGJGEDECDCGCABEAFEB
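The four lines of the record are: header (typically w/ flowcell #), sequence, "+" separator, quality scores.
Header fields: flow cell lane, tile number, x-coordinate, y-coordinate; one part is provided by the user; the trailing /1 marks the 1st end of a paired read.
Quality characters are converted through the ASCII table to numerical quality scores:
40,34,40,40,16,31,26,28,27,32,22,6,40,8,14,21,29,11,20,23,16,…
Typical range of quality scores: 0 ~ 40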
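The conversion from quality characters to numbers can be sketched in a few lines of Python. This is a minimal illustration, assuming an ASCII offset of 64 (used by older Illumina pipelines; the offset is inferred from the numbers listed above, not stated on the slide, and current instruments typically use 33). The function names `decode_quality` and `parse_header` are made up for this example, and the header layout is assumed from the field labels above.

```python
def decode_quality(quality_string, offset=64):
    """Convert a FASTQ quality string into a list of integer scores."""
    return [ord(char) - offset for char in quality_string]


def parse_header(header_line):
    """Split an old-style Illumina header into its labelled fields.

    Assumed layout: @<prefix>:<lane>:<tile>:<x>:<y>#<index>/<read end>
    """
    name, _, read_end = header_line.lstrip("@").partition("/")
    location, _, index = name.partition("#")
    prefix, lane, tile, x, y = location.rsplit(":", 4)
    return {
        "prefix": prefix,
        "lane": int(lane),
        "tile": int(tile),
        "x": int(x),
        "y": int(y),
        "index": index,
        "read_end": int(read_end),
    }


# First seven quality characters of the example record
scores = decode_quality("hbhhP_Z")
print(scores)

fields = parse_header("@BI:080831_SL-XAN_0004_30BV1AAXX:8:1:731:1429#0/1")
print(fields["lane"], fields["tile"], fields["x"], fields["y"], fields["read_end"])
```

Passing `offset=33` to `decode_quality` would handle files written with the newer encoding.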
  
	
  
Freely available tools for QC
• FastQC
– http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
– Nice GUI and command line interface
• FASTX-Toolkit
– http://hannonlab.cshl.edu/fastx_toolkit/index.html
– Tools for QC as well as trimming reads, removing adapters, filtering by read quality, etc.
• Galaxy
– http://main.g2.bx.psu.edu/
– Web interface
– Many functions but analyses are done on remote server

FastQC
• GUI mode
fastqc
• Command line mode
fastqc fastq_files -o output_directory
– will create fastq_file_fastqc.zip in output directory

FastQC
[Figure: FastQC quality-score plot for read 1; legend: median, mean]
Rule of thumb: average quality > 20 for the first 36 bp
What to do when quality is poor?
• Trim the reads
• FASTX-toolkit
– fastx_trimmer -f N -l N
– fastq_quality_filter -q N -p N
– fastx_clipper -a ADAPTER
Mapping
Institute for Computational Biomedicine
[Figure: example read ATCCAGCATTCGCGAAGTCGTA]

Mapping to a reference
• Genome
• Transcriptome
• Genome + Transcriptome
• Transcriptome + Genome
• Genome + splice junction library
[Figure labels: reference, transcriptome]

Alignment tools
• BWA
– http://bio-bwa.sourceforge.net/bwa.shtml
– Gapped alignments (good for indel detection)
• Bowtie
– http://bowtie-bio.sourceforge.net/index.shtml
– Supports gapped alignments in latest version (bowtie 2)
• TopHat
– http://tophat.cbcb.umd.edu/
– Good for discovering novel transcripts in RNA-seq data
– Builds exon models and splice junctions de novo
– Requires more CPU time and disk space
• STAR
– https://code.google.com/p/rna-star/
– Detects splice junctions de novo
– Super fast: ~10 min for 200M reads, but requires 21 GB of memory
• More than 70 short-read aligners:
– http://en.wikipedia.org/wiki/List_of_sequence_alignment_software

Analyzing RNA-Seq experiments
• How many molecules of mRNA1 are in my sample?
– Estimating expression
• Is the amount of mRNA1 in sample/group A different from sample/group B?
– Differential analysis

Estimating expression: counting how many RNA-seq reads map to genes
• Using R
– summarizeOverlaps in GenomicRanges
– easyRNASeq
• Using Python
– htseq-count
• How it works:
– SAM/BAM files (TopHat2, STAR, …)
– Gene annotation (GFF, GTF format)
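The counting step itself can be sketched in a few lines of Python. This is a toy illustration under simplifying assumptions (one chromosome, unspliced reads, hypothetical gene coordinates) and not the algorithm of htseq-count or summarizeOverlaps: a read is assigned to a gene when it overlaps exactly one gene, and is otherwise tallied as no-feature or ambiguous.

```python
from collections import Counter

# Hypothetical annotation: gene name -> (start, end), 1-based inclusive
genes = {
    "geneA": (100, 500),
    "geneB": (450, 900),
    "geneC": (1200, 1500),
}

# Hypothetical aligned reads as (start, end) intervals on the same chromosome
reads = [(120, 170), (300, 350), (460, 510), (880, 930), (1000, 1050), (1300, 1350)]


def overlaps(read, gene):
    """True if two closed intervals share at least one base."""
    return read[0] <= gene[1] and gene[0] <= read[1]


def count_reads(reads, genes):
    """Count reads per gene; reads hitting zero or several genes are set aside."""
    counts = Counter({name: 0 for name in genes})
    for read in reads:
        hits = [name for name, interval in genes.items() if overlaps(read, interval)]
        if len(hits) == 1:
            counts[hits[0]] += 1
        elif not hits:
            counts["__no_feature"] += 1
        else:
            counts["__ambiguous"] += 1
    return counts


counts = count_reads(reads, genes)
for name, value in counts.items():
    print(name, value)
```

Running this over every sample and stacking the per-gene counts side by side gives a genes-by-samples count matrix.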
  
GFF/GTF file format:
http://en.wikipedia.org/wiki/General_feature_format
http://useast.ensembl.org/info/website/upload/gff.html
http://www.sanger.ac.uk/resources/software/gff/
http://www.sequenceontology.org/gff3.shtml
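A GTF record is a single tab-separated line with nine fields. The Python sketch below parses one such line; the example record and its gene name are hypothetical, and the attribute parsing assumes the GTF `key "value";` style (GFF3 uses `key=value` instead).

```python
GTF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]


def parse_gtf_line(line):
    """Parse one GTF line into a dict; attributes become a nested dict."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GTF_COLUMNS):
        raise ValueError("expected 9 tab-separated fields, got %d" % len(fields))
    record = dict(zip(GTF_COLUMNS, fields))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])

    attributes = {}
    for item in record["attribute"].split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip('"')
    record["attribute"] = attributes
    return record


# Hypothetical exon record
example = 'chr1\texample\texon\t1000\t1200\t.\t+\t.\tgene_id "GeneX"; transcript_id "GeneX.1";'
record = parse_gtf_line(example)
print(record["feature"], record["start"], record["end"], record["attribute"]["gene_id"])
```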
  	
  	
  
Tutorial: RNA-seq count matrix
• Download
– http://icb.med.cornell.edu/faculty/sboner/lab/EpigenomicsWorkshop/count_matrix.txt
• Load into R, inspect
# working directory
getwd()
# read in the tab-separated count matrix; the first column (gene names) becomes the row names
countData <- read.csv("count_matrix.txt",
                      header=TRUE, row.names=1, sep="\t")
# number of genes (rows) and samples (columns)
dim(countData)
head(countData)

Read counts
GENE           ctrl1  ctrl2  ctrl3  treat1  treat2  treat3
0610005C13Rik   1438   1104   1825    1348    1154    1005
0610007N19Rik   1012   1152   1139     878     885     835
0610007P14Rik    704    796    881     826     865     929
0610009B22Rik    757    802    780     885     853     987
0610009D07Rik   1107   1183   1220    1258    1221    1428
…                  …      …      …       …       …       …
24009 rows, i.e. genes
6 columns, i.e. samples

Tutorial: Basic QC
barplot(colSums(countData)*1e-6,
names=colnames(countData),
ylab="Library size (millions)")

Tutorial: Installing Bioconductor packages
source("http://bioconductor.org/biocLite.R")
biocLite("DESeq2")
# Bioconductor 3.8 and later replaced biocLite with BiocManager:
# install.packages("BiocManager"); BiocManager::install("DESeq2")
http://www.bioconductor.org/
M. I. Love, W. Huber, S. Anders: Moderated estimation of fold change and dispersion for RNA-Seq data with DESeq2. bioRxiv (2014). doi:10.1101/002832 [1]

Tutorial: DESeq2 analysis
# load library
library(DESeq2)
# create experiment labels (two conditions)
colData <- DataFrame(condition=factor(c("ctrl","ctrl", "ctrl", "treat", "treat", "treat")))
# create DESeq input matrix
dds <- DESeqDataSetFromMatrix(countData, colData, formula(~ condition))
# run DEseq
dds <- DESeq(dds)
# visualize differentially expressed genes
plotMA(dds)
# get differentially expressed genes
res <- results(dds)
# order by BH adjusted p-value
resOrdered <- res[order(res$padj),]
# top of ordered matrix
head(resOrdered)
DataFrame with 6 rows and 6 columns
          baseMean log2FoldChange      lfcSE      stat       pvalue         padj
         <numeric>      <numeric>  <numeric> <numeric>    <numeric>    <numeric>
Pck1    19300.0081     -2.3329116 0.16519373 -14.12228 2.768978e-45 3.986497e-41
Fras1    1202.1842     -0.8469410 0.06499738 -13.03039 8.219001e-39 5.916448e-35
S100a14   590.6305      2.1903041 0.17608923  12.43860 1.612985e-35 7.740716e-32
Ugt1a2   2759.7012     -1.7037495 0.15339576 -11.10689 1.161372e-28 4.180067e-25
Crip1     681.0106      0.7717364 0.07264577  10.62328 2.322502e-26 5.572844e-23
Smpdl3a 11152.4458      0.3398371 0.03195000  10.63653 2.014913e-26 5.572844e-23

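The `padj` column holds Benjamini-Hochberg (BH) adjusted p-values. A minimal Python sketch of the BH procedure on a toy list of p-values follows; the function name `bh_adjust` and the input values are illustrative, and DESeq2 additionally applies independent filtering before the adjustment, which this sketch leaves out.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    n = len(pvalues)
    # Indices of the p-values from largest to smallest
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for position, index in enumerate(order):
        rank = n - position  # rank of this p-value in ascending order
        candidate = pvalues[index] * n / rank
        # Enforce monotonicity: an adjusted value never exceeds that of a larger p-value
        running_min = min(running_min, candidate)
        adjusted[index] = running_min
    return adjusted


# Toy p-values (hypothetical)
pvalues = [0.01, 0.04, 0.03, 0.005]
padj = bh_adjust(pvalues)
print(padj)
```

Each raw p-value is scaled by the number of tests divided by its rank, which is why tied `padj` values can appear for neighbouring genes in the table.
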
Tutorial: DESeq2 analysis
# how many differentially expressed genes? FDR=10%, |fold-change|>=2 (up and down), i.e. |log2 fold-change|>=1
# get differentially expressed gene matrix
sig <- resOrdered[!is.na(resOrdered$padj) &
                  resOrdered$padj<0.10 &
                  abs(resOrdered$log2FoldChange)>=1,]
head(sig)
DataFrame with 6 rows and 6 columns
         baseMean log2FoldChange     lfcSE      stat    pvalue      padj
        <numeric>      <numeric> <numeric> <numeric> <numeric> <numeric>
Pck1        19300          -2.33     0.165    -14.12  2.77e-45  3.99e-41
S100a14       591           2.19     0.176     12.44  1.61e-35  7.74e-32
Ugt1a2       2760          -1.70     0.153    -11.11  1.16e-28  4.18e-25
Pklr          787          -1.00     0.097    -10.34  4.62e-25  9.49e-22
Mlph         1321           1.20     0.117     10.20  1.90e-24  3.42e-21
Ifit1         285           1.39     0.156      8.94  3.76e-19  3.38e-16
dim(sig)

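The filter combines two cutoffs: an adjusted p-value below 0.10 and an absolute log2 fold change of at least 1, which corresponds to a two-fold change in either direction. The Python sketch below applies the same logic to a few rows taken from the DESeq2 results table (values rounded); the dictionary layout, the function name, and the `GeneNA` row are assumptions added for illustration.

```python
import math

# Rows from the results table (rounded); "GeneNA" is a hypothetical row with no padj
results = [
    {"gene": "Pck1", "log2FoldChange": -2.33, "padj": 3.99e-41},
    {"gene": "Fras1", "log2FoldChange": -0.85, "padj": 5.92e-35},
    {"gene": "S100a14", "log2FoldChange": 2.19, "padj": 7.74e-32},
    {"gene": "Crip1", "log2FoldChange": 0.77, "padj": 5.57e-23},
    {"gene": "GeneNA", "log2FoldChange": 3.0, "padj": None},
]


def is_significant(row, fdr=0.10, min_fold_change=2.0):
    """Apply the FDR cutoff and the fold-change cutoff (given on the linear scale)."""
    if row["padj"] is None:  # mirrors the !is.na(padj) condition
        return False
    min_log2 = math.log2(min_fold_change)  # fold change 2 -> log2 fold change 1
    return row["padj"] < fdr and abs(row["log2FoldChange"]) >= min_log2


significant = [row["gene"] for row in results if is_significant(row)]
print(significant)
```

Fras1 and Crip1 have very small adjusted p-values but change by less than two-fold, so they are excluded, consistent with their absence from `sig`.
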
Tutorial: Heat Map
# how to create a heat map
# select genes
selected <- rownames(sig)
selected
## load libraries for the heat map
library("RColorBrewer")
source("http://bioconductor.org/biocLite.R")
biocLite("gplots")
library("gplots")
# colors of the heat map
hmcol <- colorRampPalette(brewer.pal(9, "GnBu"))(100) ## hmcol <- heat.colors
# log2 of normalized counts for the selected genes (+1 avoids -Inf for zero counts), scaled by row
heatmap.2(log2(counts(dds, normalized=TRUE)[rownames(dds) %in% selected,] + 1),
          col = hmcol, scale="row",
          Rowv = TRUE, Colv = FALSE,
          dendrogram="row",
          trace="none",
          margin=c(4,6), cexRow=0.5, cexCol=1, keysize=1)

Selecting the most differentially expressed genes and running GO analysis
# universe
universe <- rownames(resOrdered)
# load mouse annotation and ID library
biocLite("org.Mm.eg.db")
library(org.Mm.eg.db)
# convert gene names to Entrez ID
genemap <- select(org.Mm.eg.db, selected, "ENTREZID", "SYMBOL")
univmap <- select(org.Mm.eg.db, universe, "ENTREZID", "SYMBOL")
# load GO scoring package
biocLite("GOstats")
library(GOstats)
# set up analysis (geneIds and universeGeneIds take vectors of Entrez IDs)
param <- new("GOHyperGParams",
             geneIds = unique(na.omit(genemap$ENTREZID)),
             universeGeneIds = unique(na.omit(univmap$ENTREZID)),
             annotation="org.Mm.eg.db", ontology="BP",
             pvalueCutoff=0.01, conditional=FALSE, testDirection="over")
# run analysis
hyp <- hyperGTest(param)
# visualize
summary(hyp)
## Select/sort on Pvalue, Count, etc.
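Behind `hyperGTest` with `conditional=FALSE` and `testDirection="over"` is a one-sided hypergeometric test: given a universe of N genes of which K carry a GO term, how surprising is it to see at least k annotated genes among n selected ones? A minimal Python sketch using only the standard library follows; the numbers are hypothetical and the function name is illustrative.

```python
from math import comb


def hypergeom_over_pvalue(universe_size, annotated_in_universe,
                          selected_size, annotated_in_selected):
    """P(X >= k) for X ~ Hypergeometric(N, K, n): over-representation p-value."""
    N, K, n, k = universe_size, annotated_in_universe, selected_size, annotated_in_selected
    total = comb(N, n)
    upper = min(K, n)
    # Sum the probabilities of drawing k, k+1, ..., upper annotated genes
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total


# Hypothetical example: 20000 genes in the universe, 200 annotated to a GO term,
# 300 selected genes, 12 of which carry the term (3 would be expected by chance)
pvalue = hypergeom_over_pvalue(20000, 200, 300, 12)
print(pvalue)
```

A term would be reported when this p-value falls below the `pvalueCutoff` of 0.01 used in the R code.
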
Summary
• Intro to RNA-seq
• Estimating expression levels
• Differential expression analysis with DESeq2
• Andrea Sboner: ans2077@med.cornell.edu
