Sequencing data analysis
Workshop – part 1 / main principles and data formats

Outline

- Introduction
- Sequencing flow
- Main data formats throughout this flow

Maté Ongenaert
Introduction
Sequencing technology

The real cost of sequencing

Question:

- What fraction of the cost of an NGS study goes to:
  (1) Sample collection and experimental design
  (2) Sequencing itself
  (3) Data reduction and management
  (4) Downstream analysis

Is this a surrealistic question? Not at all: imagine you are writing a
grant proposal for an NGS ChIP-seq experiment with 24 samples.

 You would need 3 HiSeq 2000 lanes, which cost you      8000 €
 Sample preparation cost                                1000 €
 Others                                                 1000 €
Do you ever include analysis costs? Personnel, infrastructure, …
Sequencing flow
Steps in sequencing experiments

Data analysis

Raw machine reads… What's next?

Preprocessing (machine/technology)
- adaptors, indexes, conversions, …
- machine/technology dependent

Reads with associated qualities (universal)
- FASTQ
- QC check

Depending on application (generally applicable)
- 'de novo' assembly of genome (bacterial genomes, …)
- Mapping to a reference genome → mapped reads
- SAM/BAM/…

High-level analysis (specific for application)
- SNP calling
- Peak calling

Main data formats:
- Raw reads
- Mapped reads
- Application dependent: ChIP-seq peaks, SNPs: their location and their characteristics
 > Intended for: visualization / further analysis (by humans or computers) / reduction ??
Sequencing data formats
Raw reads

Raw sequence reads:

- Represent the sequence ~ FASTA
     >SEQUENCE_IDENTIFIER
     GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT


- Extension: represent the quality, per base ~ FASTQ – Q for quality
     @SEQUENCE_IDENTIFIER
     GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
     +
     !''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65



- OK, the strange signs on the last line indicate the quality of the corresponding base…
  But what's the decoding scheme? (Nerd alert ahead !!)
- We want to represent quality scores ~ Phred scores
- Q = -10 log10(P) (with P being the probability that a base was called in error)
Phred quality scores are logarithmically linked to error probabilities

Phred quality score   Probability of incorrect base call   Base call accuracy
20                    1 in 100                             99 %
30                    1 in 1000                            99.9 %
40                    1 in 10000                           99.99 %
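
The relation between Q and P is easy to check in code. A minimal Python sketch (the function names are illustrative and not part of any sequencing tool) that reproduces the rows of the table:

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for an error probability p (0 < p <= 1)."""
    return -10 * math.log10(p)


for q in (20, 30, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: 1 in {round(1 / p)} wrong, accuracy {100 * (1 - p):.2f} %")
```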

- Phred scores thus typically have 2 digits, but you want a single character per base so that
  sequence and quality line up in the file… What would a nerd do? Use ASCII as a lookup table
  of course! → one character ~ one decimal number

 - OK, so the character 5 has ASCII code 53… But the visible characters only start at 33 (!)… So 5
   actually means 53 - 33 = 20 Phred quality…
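
Decoding a whole quality line works the same way for every character. A minimal sketch, assuming the Phred+33 encoding of the example record; `decode_phred` is an illustrative name, not a function from an existing library:

```python
def decode_phred(quality: str, offset: int = 33) -> list[int]:
    """Turn a FASTQ quality string into a list of Phred scores."""
    return [ord(ch) - offset for ch in quality]


# Quality line of the example record shown on the slides
quality = "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65"
scores = decode_phred(quality)

print(scores[:10])                        # scores of the first ten bases
print(min(scores), max(scores))           # worst and best base in the read
print(sum(q >= 20 for q in scores), "bases with Q >= 20")
```

Passing `offset=64` would decode the older Illumina encodings instead.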

Example of the identifier line for Illumina data (non-multiplexed):

#@machine_id:lane:tile:x:y#multiplex/pair
@HWUSI-EAS100R:6:73:941:1973#0/1

 -   Sanger → Phred+33
 -   Illumina 1.3+ → Phred+64
 -   Illumina 1.5+ → Phred+64
 -   Illumina 1.8+ → Phred+33
 -   SOLiD → Sanger (Phred+33)

 Check your instrument + version → FastQC will give you a hint as to which scoring scheme is
 probably used

 File extensions: .fastq / .fq
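
The kind of hint such a check can give follows from the character ranges: anything below ASCII 59 cannot occur in a +64 file. The sketch below is a simple heuristic under that assumption. It is not FastQC's actual implementation, and `guess_phred_offset` is a hypothetical helper:

```python
def guess_phred_offset(quality_lines):
    """Guess the ASCII offset (33 or 64) from a sample of quality lines.

    Returns None when the sample is empty or falls in the ambiguous range
    59..63 (old Solexa-style files), which this sketch does not handle.
    """
    codes = [ord(ch) for line in quality_lines for ch in line.strip()]
    if not codes:
        return None
    lowest = min(codes)
    if lowest < 59:      # characters this low never occur with a +64 offset
        return 33
    if lowest >= 64:     # nothing below '@': most likely a +64 file
        return 64
    return None


print(guess_phred_offset(["!''*((((***+))%%%++)(%%%%).1***-+*''))**55CC"]))
```

A file that contains only very high qualities could be misclassified, so checking the instrument and pipeline version remains the safer route.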

- Special: SRA files from the NCBI/EBI Sequence Read Archive
- Contain raw sequence data from (GEO) studies for all kinds of instruments and
  platforms
- Exercise: we have submitted NGS (MBD-seq) data for 8 NB cell lines to GEO and the raw
  data to SRA; find the SRA files. How would you obtain our originally submitted FASTQ
  files? (HINT: SRA Toolkit)
- Exercise (caution: nerd alert): working in the terminal… Retrieve the FASTQ file from
  the SRA file and perform a FastQC analysis
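
One possible way to script the second exercise is sketched below in Python. It assumes that `fastq-dump` (SRA Toolkit) and `fastqc` are installed and on the PATH of the server. The helper names and the output folder are made up for illustration, and the run accession has to be looked up in GEO/SRA first.

```python
import subprocess
from pathlib import Path


def fastq_dump_command(accession: str, outdir: str) -> list[str]:
    """Command line that converts an SRA run to FASTQ (one file per read of a pair)."""
    return ["fastq-dump", "--split-files", "--outdir", outdir, accession]


def fastqc_command(fastq_files: list[str], outdir: str) -> list[str]:
    """Command line that runs FastQC and writes the reports to outdir."""
    return ["fastqc", "--outdir", outdir, *fastq_files]


def fetch_and_qc(accession: str, outdir: str = "fastq") -> None:
    """Download one run, convert it to FASTQ and run FastQC on the result."""
    out = Path(outdir)
    out.mkdir(parents=True, exist_ok=True)  # both tools expect an existing folder
    subprocess.run(fastq_dump_command(accession, str(out)), check=True)
    fastq_files = sorted(str(p) for p in out.glob(f"{accession}*.fastq"))
    subprocess.run(fastqc_command(fastq_files, str(out)), check=True)
```

Calling `fetch_and_qc("SRR0000000")` with a real run accession in place of the placeholder would leave the FASTQ files and the FastQC reports in the `fastq` folder.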
Linux… for human beings?
The terminal

What they show in 'The Matrix' is a real Linux terminal with
real commands…

Server: ***********
Port: *****

Login: *********
Password: *********
Nothing appears on screen while you type the password…

You are interactively logged in now! Everything you type is sent
to the server and executed.

+ Fast, no eye-candy
+ Easy to develop a command-line interface

- Not so intuitive
- Steep learning curve
- High nerd-level

You may have to type bash to see a prompt that
starts with student@mellfire:/home/student

Where are you?
/ is the root of the file system
/home is the folder with the users' documents

cd
Change directory: cd .. goes one level up, cd ../../.. goes three levels up

mkdir
Make directory (a directory is a folder)

cp
Copy

mv
Move (also used to rename)

ls (-ahl)
List the contents of a folder (DOS: dir); -a includes hidden files, -h prints human-readable sizes, -l uses the long format

rm
Remove (DOS: del)

man
Manual (press q to quit man)

vi
Text editor (:q! to exit from vi without saving)

head and tail
Show the first lines / last lines of a text file

top
Table of running processes

who and whoami
List the users who are logged in (who) and show your own user name (whoami)
Sequencing data formats
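Mapped reads

- Mapping: 'align' these raw reads to a reference genome
- Single-end or paired-end data?
- How would you align a short read to the reference?

- Old-school: Smith-Waterman, BLAST, BLAT, …
- Now: mapping tools for short reads that use intelligent indexing and allow mismatches

Program   Reference   Algorithm (index)         Other features
SOAP      [51]        hash table of reference   quality, paired end, long reads
MAQ       [54]        hash table of reads       colorspace, quality, paired end, bisulfite
Mosaik                hash table of reference   colorspace, quality, paired end, long reads
Eland                 hash table of reads       quality
SSAHA2    [61]        hash table of reference   paired end, long reads
Bowtie    [67]        FM-index                  colorspace, quality, paired end
BWA       [69]        FM-index                  colorspace, paired end, long reads
BWA-SW    [69]        FM-index                  colorspace, 454, paired end, long reads
SOAP2     [70]        FM-index                  colorspace, quality, paired end, long reads

None of the programs listed uses a plain suffix tree, an enhanced suffix array or merge sorting; the FM-index belongs to the suffix tree family of algorithms.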

- Most commonly used worldwide and in our lab as well: BWA and Bowtie, both using
  the Burrows-Wheeler transform and FM-indexes
- Optimized for short NGS reads (from about 30 bp to roughly 200 bp)
- Versions exist for longer reads (such as 454): Bowtie2 and BWA-SW
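
How a lookup in a Burrows-Wheeler/FM-index works can be sketched in a few lines of Python. This is a toy version under strong assumptions: it builds the index by naively sorting all suffixes, keeps the full suffix array in memory and finds exact matches only. Real mappers such as BWA and Bowtie use far more compact structures and allow mismatches. All names are illustrative.

```python
def build_fm_index(reference: str):
    """Return (suffix array, first-column offsets, occurrence table)."""
    text = reference + "$"                      # '$' sorts before A, C, G, T
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    bwt = "".join(text[i - 1] for i in sa)      # character preceding each suffix
    counts = {}
    for ch in text:
        counts[ch] = counts.get(ch, 0) + 1
    first, total = {}, 0                        # first[c]: rank of the first suffix starting with c
    for ch in sorted(counts):
        first[ch] = total
        total += counts[ch]
    occ = {ch: [0] for ch in counts}            # occ[c][i]: number of c in bwt[:i]
    for b in bwt:
        for ch in occ:
            occ[ch].append(occ[ch][-1] + (b == ch))
    return sa, first, occ


def find_read(read: str, sa, first, occ) -> list[int]:
    """0-based start positions of all exact occurrences of read."""
    lo, hi = 0, len(sa)                         # current range of matching suffixes
    for ch in reversed(read):                   # backward search: extend the match to the left
        if ch not in first:
            return []
        lo = first[ch] + occ[ch][lo]
        hi = first[ch] + occ[ch][hi]
        if lo >= hi:
            return []
    return sorted(sa[i] for i in range(lo, hi))


index = build_fm_index("ACGTACGT")
print(find_read("CGT", *index))
```

The search time depends on the length of the read and not on the size of the reference, which is why this family of indexes suits millions of short reads.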

-   What would a file describing mapped reads contain?
-   Position: chr / start / stop
-   Sequence: read / reference
-   Mismatches / indels vs. the reference
-   Quality information

- A few years ago, each tool had its own output format → Bowtie, …
- Now moving to a common file format → SAM / BAM (Sequence Alignment/Map)
DESCRIPTION OF THE 11 FIELDS IN THE ALIGNMENT SECTION

#QNAME: template name
#FLAG: bitwise flag
#RNAME: reference name
#POS: mapping position (1-based, leftmost)
#MAPQ: mapping quality
#CIGAR: CIGAR string
#RNEXT: reference name of the mate/next fragment
#PNEXT: position of the mate/next fragment
#TLEN: observed template length
#SEQ: fragment sequence
#QUAL: ASCII of Phred-scaled base quality + 33

#Headers
@HD VN:1.3 SO:coordinate
@SQ SN:ref LN:45

#Alignment block
r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *
r002 0 ref 9 30 3S6M1P1I4M * 0 0 AAAAGATAAGGATA *
r003 0 ref 9 30 5H6M * 0 0 AGCTAA * NM:i:1
r004 0 ref 16 30 6M14N5M * 0 0 ATAGCTTCAGC *
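
To see what the fields of such a line mean in practice, here is a minimal Python sketch that splits an alignment line into its 11 mandatory fields, decodes the lower eight FLAG bits and derives the span on the reference from the CIGAR string. The names are illustrative, and for real files a library such as pysam or the samtools command line is the usual choice.

```python
import re
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str
    flag: int
    rname: str
    pos: int
    mapq: int
    cigar: str
    rnext: str
    pnext: int
    tlen: int
    seq: str
    qual: str


# Lower eight FLAG bits; higher bits (secondary, duplicate, …) are left out here
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
}


def parse_sam_line(line: str) -> SamRecord:
    # Real SAM is tab-separated; split() also accepts the spaces used on the slide
    f = line.split()
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]),
                     f[5], f[6], int(f[7]), int(f[8]), f[9], f[10])


def decode_flag(flag: int) -> list[str]:
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered: M, D, N, = and X consume the reference."""
    ops = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(n) for n, op in ops if op in "MDN=X")


rec = parse_sam_line("r001 163 ref 7 30 8M2I4M1D3M = 37 39 TTAGATAAAGGATACTG *")
print(decode_flag(rec.flag))
print(rec.pos, rec.pos + reference_span(rec.cigar) - 1)  # first and last reference base
```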

- BAM: compressed binary version of SAM; not human-readable, but it can be indexed for fast
  access by other tools / visualisation / …

- Exercise: view a BAM file in IGV
Sequencing data formats
                                            Other formats

- BED files (location / annotation / scores): Browser Extensible Data
Used for mapping / annotation / peak locations; binary extension: bigBed
FIELDS USED:
# chr
# start
# end
# name
# score
# strand

track   name=pairedReads description="Clone Paired Reads" useScore=1
#chr    start end name score strand
chr22   1000 5000 cloneA 960 +
chr22   2000 6000 cloneB 900 -
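
Reading such a file takes only a few lines. A minimal Python sketch with illustrative names: it skips track, browser and comment lines and relies on the BED convention that start is 0-based and end is exclusive, so the length of a feature is simply end minus start.

```python
from typing import NamedTuple


class BedFeature(NamedTuple):
    chrom: str
    start: int   # 0-based
    end: int     # exclusive
    name: str
    score: int
    strand: str


def parse_bed(lines):
    """Parse BED lines with 3 to 6 columns into BedFeature tuples."""
    features = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        f = line.split()
        features.append(BedFeature(
            f[0], int(f[1]), int(f[2]),
            f[3] if len(f) > 3 else ".",
            int(f[4]) if len(f) > 4 else 0,
            f[5] if len(f) > 5 else ".",
        ))
    return features


example = [
    'track name=pairedReads description="Clone Paired Reads" useScore=1',
    "chr22   1000 5000 cloneA 960 +",
    "chr22   2000 6000 cloneB 900 -",
]
for feature in parse_bed(example):
    print(feature.name, feature.end - feature.start, "bp", feature.strand)
```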


- bedGraph files (location, combined with score)
Used to represent peak scores
track type=bedGraph name="BedGraph Format" description="BedGraph format" visibility=full color=200,100,0 altColor=0,100,200 priority=20
#chr start    end      score
chr19 59302000 59302300 -1.0
chr19 59302300 59302600 -0.75
chr19 59302600 59302900 -0.50
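
Because every bedGraph line carries its own interval, a summary has to weight each score by the interval length. A small Python sketch of that idea (`bedgraph_weighted_mean` is a made-up name; the track line is assumed to sit on a single line):

```python
def bedgraph_weighted_mean(lines) -> float:
    """Mean score per base over all intervals of a bedGraph track."""
    total_bases = 0
    weighted_sum = 0.0
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end, score = line.split()[:4]
        length = int(end) - int(start)   # 0-based start, exclusive end
        total_bases += length
        weighted_sum += length * float(score)
    return weighted_sum / total_bases if total_bases else 0.0


example = [
    "chr19 59302000 59302300 -1.0",
    "chr19 59302300 59302600 -0.75",
    "chr19 59302600 59302900 -0.50",
]
print(bedgraph_weighted_mean(example))
```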

- WIG files (location / annotation / scores): wiggle
Used to visualize or summarize data, in most cases count data or normalized count
data (RPKM); binary extension: bigWig (often used in GEO for ChIP-seq peaks)

browser position chr19:59304200-59310700
browser hide all

#150 base wide bar graph at arbitrarily spaced positions,
#threshold line drawn at y=11.76
#autoScale off viewing range set to [0:25]
#priority = 10 positions this as the first graph

track type=wiggle_0 name="variableStep" description="variableStep format" visibility=full autoScale=off viewLimits=0.0:25.0 color=50,150,255 yLineMark=11.76 yLineOnOff=on priority=10
variableStep chrom=chr19 span=150
59304701 10.0
59304901 12.5
59305401 15.0
59305601 17.5
59305901 20.0
59306081 17.5
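
A variableStep block can be rewritten as bedGraph-style intervals, which also shows the coordinate difference between the two formats: wiggle positions are 1-based, bedGraph starts are 0-based with an exclusive end. A minimal sketch with illustrative names that handles variableStep only (no fixedStep):

```python
def variable_step_to_bedgraph(lines):
    """Convert variableStep wiggle lines to (chrom, start, end, value) tuples."""
    chrom, span = None, 1
    intervals = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith(("#", "track", "browser")):
            continue
        if line.startswith("variableStep"):
            # e.g. "variableStep chrom=chr19 span=150"; span defaults to 1
            params = dict(item.split("=", 1) for item in line.split()[1:])
            chrom = params["chrom"]
            span = int(params.get("span", 1))
            continue
        position, value = line.split()
        start = int(position) - 1        # 1-based wiggle position to 0-based start
        intervals.append((chrom, start, start + span, float(value)))
    return intervals


example = [
    "variableStep chrom=chr19 span=150",
    "59304701 10.0",
    "59304901 12.5",
]
for interval in variable_step_to_bedgraph(example):
    print(*interval)
```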
- GFF format (General Feature Format)
Used for annotation of genetic / genomic features – such as all coding genes in Ensembl
Often used in downstream analysis to assign annotation to regions / peaks / …
FIELDS USED:

# seqname (the name of the sequence)
# source (the program that generated this feature)
# feature (the name of this type of feature – for example: exon)
# start (the starting position of the feature in the sequence)
# end (the ending position of the feature)
# score (a score between 0 and 1000)
# strand (valid entries include '+', '-', or '.')
# frame (if the feature is a coding exon, frame should be a number between
0-2 that represents the reading frame of the first base. If the feature is
not a coding exon, the value should be '.'.)
# group (all lines with the same group are linked together into a single
item)

track name=regulatory description="TeleGene(tm)    Regulatory Regions"
#chr   source   feature   start    end   score     strand frame group
chr22 TeleGene enhancer 1000000 1001000 500        + . touch1
chr22 TeleGene promoter 1010000 1010100 900        + . touch1
chr22 TeleGene promoter 1020000 1020000 800        - . touch2
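
The nine fields map naturally onto a dictionary. A minimal Python sketch with illustrative names; note that GFF coordinates are 1-based and include the end position, unlike BED:

```python
GFF_FIELDS = ("seqname", "source", "feature", "start", "end",
              "score", "strand", "frame", "group")


def parse_gff_line(line: str) -> dict:
    """Split one GFF line into its nine fields."""
    # Real GFF is tab-separated; splitting at most 8 times keeps spaces in 'group'
    values = line.strip().split(None, 8)
    if len(values) != 9:
        raise ValueError(f"expected 9 fields, got {len(values)}")
    record = dict(zip(GFF_FIELDS, values))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    record["score"] = None if record["score"] == "." else float(record["score"])
    return record


def feature_length(record: dict) -> int:
    """Length in bases; start and end are both inclusive."""
    return record["end"] - record["start"] + 1


rec = parse_gff_line("chr22 TeleGene enhancer 1000000 1001000 500 + . touch1")
print(rec["feature"], feature_length(rec), rec["group"])
```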

- VCF format (Variant Call Format)
For representing variants such as SNPs and small indels
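
A VCF data line starts with eight fixed, tab-separated columns: CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO. The sketch below parses those columns in Python. The record is an illustrative line in the style of the example in the VCF specification and is not data from this workshop, and the function names are made up.

```python
VCF_COLUMNS = ("CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO")


def parse_vcf_line(line: str) -> dict:
    """Parse the eight fixed columns of one VCF data line."""
    values = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_COLUMNS, values[:8]))
    record["POS"] = int(record["POS"])          # 1-based position
    record["ALT"] = record["ALT"].split(",")    # several alternative alleles are allowed
    info = {}
    if record["INFO"] != ".":
        for item in record["INFO"].split(";"):
            key, _, value = item.partition("=")
            info[key] = value if value else True  # keys without a value are flags
    record["INFO"] = info
    return record


def is_snp(record: dict) -> bool:
    """True when the reference and all alternative alleles are single bases."""
    bases = ("A", "C", "G", "T")
    return record["REF"] in bases and all(alt in bases for alt in record["ALT"])


# Illustrative record, not real workshop data
rec = parse_vcf_line("20\t14370\trs6054257\tG\tA\t29\tPASS\tNS=3;DP=14;AF=0.5;DB;H2")
print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"], is_snp(rec))
```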

- http://genome.ucsc.edu/FAQ/FAQformat.html

- UCSC browser data formats, covering the most commonly used formats that are
  widely accepted

- In addition, ENCODE data formats (narrowPeak / broadPeak)