Pasteur deep seq analysis practical Part - 2015

Deep Seq Data Analysis
Part II
Christophe.antoniewski@upmc.fr
http://drosophile.org
Mouse Genetics
January 29, 2015, 13:30–15:00
http://fr.slideshare.net/christopheantoniewski/

The methods section, available online
RNA isolation and library construction
Both human and mouse blastomeres were prepared using identical protocols. Single
blastomeres were isolated by removing the zona pellucida using acidic tyrode
solution (Sigma, catalogue no. T1788), then separated by gentle mouth pipetting in a
calcium-free medium. Single cells were washed twice with 1× PBS containing 0.1%
BSA before placing in lysis buffer. RNA was isolated from single cells or single morula
embryos and amplified as described previously [14]. Library construction was
performed following Illumina manufacturer suggestions. Libraries were sequenced
on the Illumina Hiseq2000 platform and sequencing reads that contained polyA, low
quality, and adapters were pre-filtered before mapping. Filtered reads were mapped
to the hg19 genome and mm9 genome using default parameters from BWA aligner [29],
and reads that failed to map to the genome were re-mapped to their respective
mRNA sequences to capture reads that span exons.
Transcriptional profiling
In both human and mouse cases, data normalization was performed by transforming
uniquely mapped transcript reads to RPKM [30]. Genes with low expression in all stages
(average RPKM < 0.5) were filtered out, followed by quantile normalization. For
differential expression, we compared every time point to its previous time point
using default parameters in DESeq using normalized read counts. Genes were called
differentially expressed if they exhibited a Benjamini and Hochberg–adjusted P value
(FDR) <5% and a mean fold change of >2.
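The normalization and filtering steps described above can be sketched in Python as follows. This is a minimal illustration, assuming the standard RPKM definition (reads per kilobase of transcript per million mapped reads). The function names and the toy numbers are hypothetical and do not come from the article; only the 0.5 threshold is taken from the quoted methods.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads Per Kilobase of transcript per Million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


def filter_low_expression(rpkm_by_gene, threshold=0.5):
    """Drop genes whose average RPKM across all stages is below the threshold."""
    kept = {}
    for gene, values in rpkm_by_gene.items():
        if sum(values) / len(values) >= threshold:
            kept[gene] = values
    return kept


# Toy example (made-up numbers): 500 reads on a 2 kb gene, 20 million mapped reads
example_rpkm = rpkm(500, 2000, 20_000_000)

# Hypothetical RPKM values per stage for two genes
stages = {"geneA": [0.1, 0.2, 0.3], "geneB": [4.0, 6.5, 3.2]}
expressed = filter_low_expression(stages)  # geneA averages 0.2 and is dropped
```

Note that RPKM depends on the gene lengths and on the total number of mapped reads, neither of which is reported in the methods.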

Data 1
GEO dataset accession: GSE44183
http://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE44183
• Take the SRP identifier at the bottom of the page: SRP018525
• Search for this identifier with the “EBI SRA ENA SRA” Galaxy tool
• Check for your experiment accession by clicking on the SRX…. links
• Click on the fastq files (galaxy) links
→ Files are uploaded as yellow datasets that show up in the current history
GSM1080195: mouse oocyte 1; Mus musculus; RNA-Seq
1 ILLUMINA (Illumina HiSeq 2000) run: 16.4M spots, 3G bases, 1.9Gb downloads
Accession: SRX229784
GSM1080196: mouse oocyte 2; Mus musculus; RNA-Seq
1 ILLUMINA (Illumina HiSeq 2000) run: 20.2M spots, 3.6G bases, 2.4Gb downloads
GSM1080197: mouse pronuclei 1; Mus musculus; RNA-Seq
1 ILLUMINA (Illumina HiSeq 2000) run: 17.2M spots, 3.1G bases, 2Gb downloads
• Register on mississippi.fr
• Take an identifier:
  • oocyte1@pasteur.fr
  • oocyte2@pasteur.fr
  • pronuclei1@pasteur.fr
• And the same password:
  gsgalaxy
• Click on “Analyze Data”
• By default, you are in an unnamed history
• Name it “Datasets”

Data 2
• Click on “Share Data → Data Libraries”
• Click on “Public Datasets”
• Click on “Mouse Pasteur”
• Check the boxes corresponding to RefSeq_Genes_mm9.gtf and your datasets
• Click on the “Go” item
• Click on “Analyze Data”
• Look at the imported data sets (3 green boxes)
• Look at their content (eye)
• Look at their metadata (info icon)
The datasets are already available on the server

Read Mapping
1. Type “fastqc” in the search field of the left-hand column
2. Click on “FastQC:Read QC reports using FastQC”
3. Select your first fastq data set
4. Run the tool
5. Select the yellow box (running tool)
6. Click on the “redo” box
7. Select your second fastq data set
8. Run the tool → it will take 4-5 min at most
9. Search for “bwa” in the tool search field
10. Select “Map with BWA for Illumina”
11. Let’s have a look at the tool form
Filtered reads were mapped to the hg19 genome and mm9 genome using
default parameters from BWA aligner [29], and reads that failed to map to the
genome were re-mapped to their respective mRNA sequences to capture
reads that span exons.
1. The procedure is not reproducible because metadata and parameters are lacking.
2. The procedure is out of date:
• The article was published in 2013
• Tophat was published in 2009 and 2012, and Tophat2 in April 2013

Read Mapping using Tophat2
See https://wiki.galaxyproject.org/Events/GCC2014/TrainingDay?action=AttachFile&do=view&target=RNA-SeqAltSlides.pdf
for a nice introduction to RNA-seq analysis

Read Mapping using Tophat2 in Galaxy
1. Create a new history and name it “tophat2 alignment”
2. Copy your 2 fastq files from the previous history, as well as the RefSeq.gtf reference file
3. Rename the files and add an annotation
4. Find and fill in the tophat2 tool form
5. Run the tool
6. Select your first fastq data set
7. Run the tool
8. While it is running look at the metadata
9. Rename the datasets using the pencil box
10. Import two other datasets
11. Re-run Tophat2 on these datasets
12. Look at the job in the admin panel (reproducible analyses)
13. Look at the tool in the Galaxy tool repository
14. Stop all running tools
15. Import the history “GS SRP018525 tophat2”
16. Visualize your reads in Trackster (1 gtf track + 1 condition mapping)
17. Optional: visualize junctions, etc.
18. Compare with another public genome browser (UCSC or Ensembl)
Paired-end reads were mapped to the mm9 genome using Tophat2 with the
parameters ---, and the RefSeq gtf mm9 annotation as a guide.

Read Counting using featureCounts in Galaxy
1. Create a new history called “Read Counts”
2. Copy the accepted hits datasets from the “imported: GS SRP018525 tophat2” history
as well as the RefSeq GTF guide
3. You now have 6 datasets in the “Read Counts” history
4. Run featureCounts once on the oocyte 1 data
5. Re-run the tool for oocyte 2 and pronuclei 1, 2, 3
6. Change the metadata of the featureCounts summaries
7. Iteratively paste the featureCounts outputs using the “Paste two files side by side” tool
8. → We now have a hit table
9. Rename it FeatureCounts HIT TABLE
10. We can visualize the data using charts
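Outside Galaxy, the iterative pasting of step 7 can be sketched in Python as below. This is a hedged illustration only: it assumes each featureCounts output is a two-column, tab-delimited table (gene identifier, read count) with one header line, and it fills genes missing from a sample with 0. The function names and the toy counts are hypothetical. Unlike a plain side-by-side paste, the rows are matched on the gene identifier, so the result does not depend on the files sharing the same row order.

```python
import csv
import io


def read_counts(handle):
    """Read a two-column, tab-delimited count table with one header line."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header line
    return {row[0]: int(row[1]) for row in reader if row}


def build_hit_table(samples):
    """Merge per-sample counts into one table.

    samples: list of (sample_name, file handle) pairs.
    Returns a list of rows; the first row is the header.
    """
    names = [name for name, _ in samples]
    counts = [read_counts(handle) for _, handle in samples]
    genes = sorted(set().union(*counts))  # all gene ids seen in any sample
    table = [["Geneid"] + names]
    for gene in genes:
        table.append([gene] + [c.get(gene, 0) for c in counts])
    return table


# Toy inputs (made-up counts) standing in for two featureCounts outputs
oocyte1 = io.StringIO("Geneid\toocyte1\nActb\t120\nGapdh\t75\n")
pronuclei1 = io.StringIO("Geneid\tpronuclei1\nActb\t98\nGapdh\t60\n")
hit_table = build_hit_table([("oocyte1", oocyte1), ("pronuclei1", pronuclei1)])
```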

Differential count analysis
1. Create a new history called “Differential count analysis”
2. Copy the “FeatureCounts HIT TABLE”
3. Run “Differential_Count models using BioConductor packages” on the FeatureCounts HIT TABLE
4. Review the results
5. So far, we have not reproduced Supplementary Fig. 1

DESeq Analysis
1. Let’s examine Fig.1, together with the published methods
2. The published information is wrong, but we will try to approach the figure by guessing what
was actually done
3. Copy the “FeatureCounts HIT TABLE” into a new history called “my DESeq approach”
4. To run the DESeq (version 1) package, we need to reformat the HIT TABLE
5. This can be done with a text editor OR within Galaxy:
   1. Cut columns
   2. Remove header
   3. Upload new header
   4. Manipulate header
   5. Concatenate files
6. Run the tool “DESeq Profiling (replicates) with sample replicates”
7. Get the R code available in the public library: Rscript_for_Sup_Fig1a
8. Run the Docker Tool Factory tool with this R code to generate the figure
9. Run the tool “DESeq2 Profiling”
10. Re-run the Docker Tool Factory tool with the same R code on the DESeq2 DE analysis
Transcriptional profiling
In both human and mouse cases, data normalization was performed by
transforming uniquely mapped transcript reads to RPKM [30]. Genes with low
expression in all stages (average RPKM < 0.5) were filtered out, followed by
quantile normalization. For differential expression, we compared every time
point to its previous time point using default parameters in DESeq using
normalized read counts. Genes were called differentially expressed if they
exhibited a Benjamini and Hochberg–adjusted P value (FDR) <5% and a mean
fold change of >2.
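The final calling rule of the quoted methods (Benjamini and Hochberg–adjusted P value below 5% and fold change above 2) can be sketched in Python as follows. DESeq already reports adjusted P values, so this sketch only illustrates the thresholding. Two assumptions are made here that the methods do not state: fold changes are expressed as log2 values, and both up- and down-regulated genes are retained (absolute log2 fold change above 1). Function names and toy values are hypothetical.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P values, in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])  # indices by ascending P
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P value down, enforcing monotonic adjusted values
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def call_differential(results, fdr=0.05, min_abs_log2fc=1.0):
    """results: list of (gene, log2_fold_change, raw_pvalue) tuples."""
    padj = benjamini_hochberg([p for _, _, p in results])
    return [gene
            for (gene, log2fc, _), q in zip(results, padj)
            if q < fdr and abs(log2fc) > min_abs_log2fc]


# Toy results (made-up values) for three genes
toy = [("geneA", 3.0, 0.001), ("geneB", -0.5, 0.001), ("geneC", 2.0, 0.5)]
called = call_differential(toy)  # only geneA passes both thresholds
```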

Optional: comparison between the tophat2 approach and the BWA approach
1. Sharing the “SRP018525 BWA” history
2. Sharing the “Comparison BWA / Tophat” visualization
3. Analyze the differences
