Deep Seq Data Analysis
Theoretical training
Christophe.antoniewski@upmc.fr
http://artbio.fr
Mouse Genetics
January 21, 2016, 13:30–15:00
Sequencing Technologies
Latest commercialized Sequencing Technology
Sequencing-by-pH-variations in ION TORRENT
Sequencing Technologies: Quantitative Facts
Sequencing Technologies: Focus on Illumina technology
Deep sequencing applications
High throughput sequencing of DNA or RNA provides Qualitative (sequence) and Quantitative (number of reads) information
Stranded RNAseq library
20-30nt RNA gel purification
Small RNA library (Biases)
Library “Bar coding”
ChIPseq library preparation
(Non Directional)
What can I do with my sequence reads ?
Flowchart of a sequencing project
What am I going to sequence? For what analysis?
• Platform Selection: technical biases and limitations; specific benefits (read length, single or paired ends, number of reads)
• Library Preparation: whole genome, whole exome, target enrichment; size selection; stranded/unstranded?; amplification; single cell protocol
• Sequencing: length of the read; single or paired ends; number of lanes (depth of sequencing)
• Quality Control: adapter clipping; quality trimming; contaminants and sequencing errors; biases in GC content
• Alignment: Bowtie, BWA… (P Flicek & E Birney, Nature Methods 2009)
• Assembly: Velvet, Oases, Trinity, SOAP, SSAKE… (Zhang W, Chen J, et al. (2011), PLoS ONE 6(3))
• Visualization & Statistics: R, MATLAB & Open Source software tools
• Normalization (library comparison)
• Peak finding (Binding sites, Breakpoints, etc.)
• Differential Calling (expression, variants, etc.)
Think about the number of replicates
Basic Material for mining sequencing data
Connect to our server
$ ssh lbcd41.snv.jussieu.fr
$ mkdir <mydir>
$ cd <mydir>
What does this big fastq file contain?
mouse@GED-Server:~/raw_data$ more GKG-13.fastq
@HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1    <- Header
TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA     <- Sequence
+HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1    <- Header
bBb`bfffffhhhhhhhhhhhhhhhhhhhfhhhhhhgh     <- Sequence Quality (ASCII encoded)
@HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1
TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC
+HWIEAS210R_0028:2:1:3925:1114#AGAAGA/1
]B]VWaaaaaagggfggggggcggggegdgfgeggbab
@HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1
TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA
+HWIEAS210R_0028:2:1:6220:1114#AGAAGA/1
aB^^afffffhhhhhhhhhhhhhhhhhhhhhhhchhhh
@HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1
TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC
+HWIEAS210R_0028:2:1:6252:1115#AGAAGA/1
aBa^ddeeehhhhhhhhhhhhhhhhghhhhhhhefff
@HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1
TNAATGCACTATCTGGTACGACTGTAGGCACCATCAAT
+HWIEAS210R_0028:2:1:6534:1114#AGAAGA/1
aB^^eeeeegcggfffffffcfffgcgcfffffR^^]
@HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1
GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC
+HWIEAS210R_0028:2:1:8869:1114#AGAAGA/1
aBaaaeeeeehhhhhhhhhhhhfgfhhgfhhhhgga^^
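The four-line record layout shown above (header, sequence, "+" line, quality) can be read with a few lines of Python. This is only a minimal sketch: the function names are invented for illustration, and the quality offset of 64 is an assumption based on the characters seen in this file (older Illumina encoding); more recent files usually use an offset of 33.

```python
import io
from itertools import islice

def read_fastq(handle):
    """Yield (name, sequence, quality) from a FASTQ stream with 4 lines per record."""
    while True:
        record = [line.rstrip("\n") for line in islice(handle, 4)]
        if not record:
            return  # clean end of file
        if len(record) < 4:
            raise ValueError("truncated FASTQ record")
        header, sequence, plus, quality = record
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record: " + header)
        yield header[1:], sequence, quality

def phred_scores(quality, offset=64):
    """Decode ASCII-encoded qualities. offset=64 is an assumption (use 33 for Sanger-style files)."""
    return [ord(char) - offset for char in quality]

# First record of the slide, held in memory instead of reading GKG-13.fastq
SAMPLE = (
    "@HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1\n"
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA\n"
    "+HWIEAS210R_0028:2:1:3019:1114#AGAAGA/1\n"
    "bBb`bfffffhhhhhhhhhhhhhhhhhhhfhhhhhhgh\n"
)

records = list(read_fastq(io.StringIO(SAMPLE)))
name, seq, qual = records[0]
scores = phred_scores(qual)
print(name, len(seq), scores[:5])
```

With a real file, `read_fastq(open("GKG-13.fastq"))` would stream the records without loading the whole file in memory.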
How many sequence reads are in my file?
→ wc -l <path/to/my/file>
mouse@GED-Server:~/raw_data$ wc -l GKG-13.fastq
25703828 GKG-13.fastq
mouse@GED-Server:~/raw_data$ grep -c -e "^@" GKG-13.fastq
6425957
In the Python interpreter (integer division):
>>> 25703828 // 4
6425957
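The same count can be sketched in Python. Dividing the line count by four is generally the safer of the two approaches: a quality line may itself start with "@" in some encodings, in which case counting "^@" lines would overcount. The second record below is made up to illustrate that case; `count_reads` is a hypothetical helper name.

```python
import io

def count_reads(handle):
    """Count FASTQ records as (number of lines) // 4."""
    n_lines = sum(1 for _ in handle)
    if n_lines % 4:
        raise ValueError("line count is not a multiple of 4")
    return n_lines // 4

# Two records; the second one is hypothetical and has a quality line starting with '@'
EXAMPLE = (
    "@read1\n"
    "ACGTACGT\n"
    "+\n"
    "hhhhhhhh\n"
    "@read2\n"
    "ACGTACGT\n"
    "+\n"
    "@hhhhhhh\n"
)

n_reads = count_reads(io.StringIO(EXAMPLE))
n_at_lines = sum(1 for line in io.StringIO(EXAMPLE) if line.startswith("@"))
print(n_reads, n_at_lines)

# Same arithmetic as on the slide, using the wc -l result
print(25703828 // 4)
```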
Do my sequence reads contain the adapter?
→ cat <path/file> | grep CTGTAGG | wc -l
→ grep -c "CTGTAGG" <path/file>
mouse@GED-Server:~/raw_data$ cat GKG-13.fastq | grep CTGTAGG | wc -l
6355061
mouse@GED-Server:~/raw_data$ grep -c "CTGTAGG" GKG-13.fastq
6355061
6 355 061 out of 6 425 957 sequences … not bad (98.9%)
My 3’ adapter: CTGTAGGCACCATCAAT
By contrast, with a sequence that is not part of the adapter:
mouse@GED-Server:~/raw_data$ cat GKG-13.fastq | grep ATCTCGT | wc -l
308
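The adapter check above amounts to counting the reads that contain a short motif. A rough Python equivalent is sketched below on the six reads displayed earlier; `count_containing` is a hypothetical helper, and unlike `grep` on the whole file it looks at sequence lines only.

```python
# The six sequences shown in the fastq excerpt
SLIDE_READS = [
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",
    "TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC",
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",
    "TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC",
    "TNAATGCACTATCTGGTACGACTGTAGGCACCATCAAT",
    "GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC",
]

def count_containing(sequences, motif):
    """Number of sequences that contain the motif at least once."""
    return sum(1 for sequence in sequences if motif in sequence)

adapter_hits = count_containing(SLIDE_READS, "CTGTAGG")   # start of the 3' adapter
control_hits = count_containing(SLIDE_READS, "ATCTCGT")   # unrelated 7-mer used as control

# Fraction of adapter-containing reads, from the two counts reported for the full file
percent_with_adapter = 100 * 6355061 / 6425957

print(adapter_hits, control_hits, round(percent_with_adapter, 1))
```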
A more advanced example of combining Unix commands
mouse@GED-Server:~/raw_data$ cat GKG-13.fastq | perl -ne 'print if /^[ATGCN]{22}CTGTAGG/' | wc -l
• cat GKG-13.fastq: outputs the content of a file, line by line
• |: the output is passed to the input of the next command
• perl -ne: the perl interpreter is called with the -ne options (loop & execute)
• 'print if /^[ATGCN]{22}CTGTAGG/': inline perl code; /^[ATGCN]{22}CTGTAGG/ is a regular expression
• |: the output is passed to the input of the next command
• wc -l: wc with the -l option counts the lines
Result: 1 675 469 22nt long reads with 3’ flanking CTGTAGG adapter sequence
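The same regular expression works unchanged in Python's `re` module. The sketch below applies it to the six reads of the fastq excerpt and, as an extension not shown on the slide, tabulates the length of the sequence found before the first CTGTAGG occurrence. Taking the first occurrence as the adapter start is an assumption.

```python
import re
from collections import Counter

ADAPTER_SEED = "CTGTAGG"
# re.match anchors at the start of the string, like ^ in the perl expression
PATTERN_22NT = re.compile(r"[ATGCN]{22}CTGTAGG")

SLIDE_READS = [
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",
    "TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC",
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",
    "TNCTTGGACTACATATGGTTGAGGGTTGTACTGTAGGC",
    "TNAATGCACTATCTGGTACGACTGTAGGCACCATCAAT",
    "GNGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC",
]

def count_22nt(sequences):
    """Count reads made of exactly 22 nt followed by the adapter seed."""
    return sum(1 for sequence in sequences if PATTERN_22NT.match(sequence))

def insert_length_distribution(sequences):
    """Map insert length -> number of reads, using the first adapter seed occurrence."""
    lengths = Counter()
    for sequence in sequences:
        position = sequence.find(ADAPTER_SEED)
        if position != -1:
            lengths[position] += 1
    return lengths

n_22nt = count_22nt(SLIDE_READS)
distribution = insert_length_distribution(SLIDE_READS)
print(n_22nt, sorted(distribution.items()))
```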
Clipping adapter sequences
Unix Operating Systems already contain powerful native tools for sequence analyses
cat GKG-13.fastq | perl -ne 'if (/^(.+CTGTAGG)/) {print "$1\n"}' | more
Final command line clipper:
mouse@GED-Server:~/raw_data$ cat GKG-13.fastq | perl -ne 'if (/^([GATC]{18,})CTGTAGG/) {$count++; print ">$count\n"; print "$1\n"}' > clipped_GKG13.fasta
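For comparison, here is a rough Python counterpart of the perl clipper: it keeps reads made of at least 18 unambiguous bases followed by CTGTAGG and writes them as numbered FASTA entries. The function name is hypothetical. Reads containing an N are rejected by the `[GATC]` class, so the two test reads below are the slide's reads with the N replaced by an arbitrary base, purely for illustration.

```python
import re

# Same expression as in the perl one-liner; re.match anchors at the start of the read
CLIP_PATTERN = re.compile(r"([GATC]{18,})CTGTAGG")

def clip_reads(sequences):
    """Return FASTA text with the clipped inserts, numbered from 1."""
    entries = []
    count = 0
    for sequence in sequences:
        match = CLIP_PATTERN.match(sequence)
        if match:
            count += 1
            entries.append(">%d\n%s\n" % (count, match.group(1)))
    return "".join(entries)

reads = [
    "TNGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",  # from the slide: contains N, rejected
    "TGGGAACTTCATACCGTGCTCTCTGTAGGCACCATCAA",  # hypothetical: N replaced, 22 nt insert, kept
    "GAGGACTGAAGTGGAGCTGTAGGCACCATCAATAGATC",  # hypothetical: 16 nt insert, too short, rejected
]

fasta = clip_reads(reads)
print(fasta, end="")
```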
Sequence Quality Control
FastQC, GUI version
http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Bowtie aligns reads on indexed genomes
http://bowtie-bio.sourceforge.net/
A bowtie alignment (command lines)
mouse@GED-Server:~/instructor$ bowtie ../genomes/Dmel_r5.49 -f clipped_GKG13.fasta -v 1 -k 1 -p 6 --al droso_matched_GKG-13.fa --un unmatched_GKG13.fa -S > GKG13_bowtie_output.sam
../genomes/Dmel_r5.49
-f clipped_GKG13.fasta
-v 1
-k 1
-p 6
--al droso_matched_GKG-13.fa
--un unmatched_GKG13.fa
-S
> GKG13_bowtie_output.sam
# reads processed: 5930851
# reads with at least one reported alignment: 4992296 (84.18%)
# reads that failed to align: 938555 (15.82%)
Reported 4992296 alignments to 1 output stream(s)
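The numbers in this report can be cross-checked with a little arithmetic. The sketch below assumes that the reads processed by bowtie are exactly the reads written by the clipper (the input is clipped_GKG13.fasta), which also gives the fraction of the raw reads that survived clipping.

```python
total_reads = 6425957   # reads in GKG-13.fastq (line count divided by 4)
processed = 5930851     # reads processed by bowtie, assumed equal to the clipped reads
aligned = 4992296       # reads with at least one reported alignment
failed = 938555         # reads that failed to align

def percent(part, whole):
    """Percentage of part in whole."""
    return 100.0 * part / whole

aligned_pct = percent(aligned, processed)
failed_pct = percent(failed, processed)
clipped_pct = percent(processed, total_reads)

print("aligned: %.2f%%, failed: %.2f%%, clipped: %.1f%%" % (aligned_pct, failed_pct, clipped_pct))
```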
mouse@GED-Server:~/genomes$ bowtie-build Dmel_r5.49.fa Dmel_r5.49
Bowtie outputs
deepseq$ ls -laht
-rw-r--r-- 1 deepseq staff 351M Mar 24 17:46 GKG13_bowtie_output.tabulated
-rw-r--r-- 1 deepseq staff 156M Mar 24 17:46 droso_matched_GKG-13.fa
-rw-r--r-- 1 deepseq staff 28M Mar 24 17:46 unmatched_GKG13.fa
SAM alignment : $ more GKG13_bowtie_output.sam
Aligned reads: $ more droso_matched_GKG-13.fa
Unaligned reads: $ more unmatched_GKG13.fa
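A SAM file is tab-delimited text with eleven mandatory columns, and bit 0x4 of the FLAG column marks an unmapped read. The following sketch counts aligned and unaligned records; the three alignment lines are made up for illustration and are not taken from GKG13_bowtie_output.sam, and the function names are hypothetical.

```python
SAM_COLUMNS = ("qname", "flag", "rname", "pos", "mapq", "cigar",
               "rnext", "pnext", "tlen", "seq", "qual")

def parse_sam_line(line):
    """Split one alignment line into a dict of the 11 mandatory SAM fields."""
    fields = line.rstrip("\n").split("\t")
    record = dict(zip(SAM_COLUMNS, fields[:11]))
    for key in ("flag", "pos", "mapq"):
        record[key] = int(record[key])
    return record

def count_alignments(lines):
    """Return (aligned, unaligned) counts, skipping '@' header lines."""
    aligned = unaligned = 0
    for line in lines:
        if line.startswith("@"):
            continue
        if parse_sam_line(line)["flag"] & 0x4:  # 0x4: read is unmapped
            unaligned += 1
        else:
            aligned += 1
    return aligned, unaligned

# Hypothetical records: forward hit (flag 0), reverse hit (flag 16), unmapped (flag 4)
insert = "TGGGAACTTCATACCGTGCTCT"
EXAMPLE_SAM = [
    "@HD\tVN:1.0\tSO:unsorted",
    "1\t0\t2L\t1000\t255\t22M\t*\t0\t0\t%s\t%s" % (insert, "I" * len(insert)),
    "2\t16\t3R\t2000\t255\t22M\t*\t0\t0\t%s\t%s" % (insert, "I" * len(insert)),
    "3\t4\t*\t0\t0\t*\t*\t0\t0\t%s\t%s" % (insert, "I" * len(insert)),
]

n_aligned, n_unaligned = count_alignments(EXAMPLE_SAM)
print(n_aligned, n_unaligned)
```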
SAM - BAM
Formats
Raw sequence: Fastq (quality), Fasta (w/o quality)
Aligned sequence: Sam, Bam (sorted, indexed, compressed)
Genome annotation: GFF, GTF
GFF - GTF
Pileup Format
seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&
seq1 273 T 23 ,.....,,.,.,...,,,.,..A <<<;<<<<<<<<<3<=<<<;<<+
seq1 274 T 23 ,.$....,,.,.,...,,,.,... 7<7;<;<<<<<<<<<=<;<;<<6
seq1 275 A 23 ,$....,,.,.,...,,,.,...^l. <+;9*<<<<<<<<<=<<:;<<<<
seq1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<
seq1 277 T 22 ....,,.,.,.C.,,,.,..G. +7<;<<<<<<<&<=<<:;<<&<
seq1 278 G 23 ....,,.,.,...,,,.,....^k. %38*<<;<7<<7<=<<<;<<<<<
seq1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<
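Each pileup line gives the reference name, position, reference base, depth, read bases and base qualities. In the read-bases column, "." and "," are matches on the forward and reverse strand, a letter is a mismatch, "^" (followed by a mapping-quality character) marks a read start and "$" a read end. The sketch below, with a hypothetical function name, counts matches and mismatches for lines copied from the listing above; columns are split on whitespace because the excerpt uses spaces where real files use tabs.

```python
def summarize_pileup_line(line):
    """Return (position, depth, matches, mismatches) for one pileup line."""
    chrom, pos, ref, depth, bases, quals = line.split()[:6]
    matches = mismatches = 0
    i = 0
    while i < len(bases):
        char = bases[i]
        if char == "^":            # read start: skip the marker and the mapping-quality character
            i += 2
            continue
        if char in "+-":           # indel: sign, length digits, then that many bases
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            i = j + int(bases[i + 1:j])
            continue
        if char in ".,":
            matches += 1
        elif char.upper() in "ACGTN":
            mismatches += 1
        # '$' (read end) and '*' (deleted base) are not counted
        i += 1
    return int(pos), int(depth), matches, mismatches

LINES = [
    "seq1 272 T 24 ,.$.....,,.,.,...,,,.,..^+. <<<+;<<<<<<<<<<<=<;<;7<&",
    "seq1 276 G 22 ...T,,.,.,...,,,.,.... 33;+<<7=7<<7<&<<1;<<6<",
    "seq1 279 C 23 A..T,,.,.,...,,,.,..... ;75&<<<<<<<<<=<<<9<<:<<",
]

summaries = [summarize_pileup_line(line) for line in LINES]
for summary in summaries:
    print(summary)
```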
Next week, we will perform an NGS analysis using the Galaxy framework.
We will speak about Accessibility, Reproducibility and Transparency.
Please have a look at http://galaxyproject.org/
You can register and try it.
Also, connect to http://lbcd41.snv.jussieu.fr with
login: (to be communicated)
password: (to be communicated)
AND
Register (Menu “user” → “register”) with your email address

Pasteur deep seq_analysis_theory_2016
