Surya Saha ss2489@cornell.edu
BTI PGRP Summer Internship Program 2014
Slides: https://bitly.com/BioinfoInternEx2014
Quality Control of NGS Data
1. Evaluation
2. Preprocessing
Quality Control of NGS Data
7/8/2014 BTI PGRP Summer Internship Program 2014 2
Slide credit: Aureliano Bombarely
Evaluation
Goal:
Learn to use read evaluation programs, paying attention to
relevant parameters such as quality score distribution, read
length distribution, and read duplication.
Data:
(Illumina data for two tomato ripening stages)
/home/bioinfo/Data/ch4_demo_dataset.tar.gz
Tools:
tar -zxvf (command line, untar and unzip the files)
head (command line, take a quick look at the files)
mv (command line, rename the files)
grep (command line, find/count patterns in files)
FASTX toolkit (command line, process fasta/fastq)
FastQC (GUI, calculates several stats for each file)
Slide credit: Aureliano Bombarely
Evaluation
Exercise 1:
1. Untar and unzip the file:
/home/bioinfo/Data/ch4_demo_dataset.tar.gz
2. The raw data will be in two directories: breaker and
immature_fruit. Print the first 10 lines of the files
SRR404331_ch4.fq, SRR404333_ch4.fq,
SRR404334_ch4.fq and SRR404336_ch4.fq.
Question 1.1: Are these files in fastq format?
3. Change the extension of the .fq files to .fastq
Slide credit: Aureliano Bombarely
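Steps 1 to 3 could be sketched roughly as follows. This is one possible approach, not the official solution: the archive path is the one given on the slide, while the helper names `preview_fq` and `rename_fq_to_fastq` are made up for illustration, and the layout inside the archive is assumed from the exercise text.

```shell
# Sketch of Exercise 1, steps 1-3. The helper function names are hypothetical.
# Assumption: the .fq files sit somewhere below the directory you unpack into
# (in breaker/ and immature_fruit/, according to the exercise text).

archive=/home/bioinfo/Data/ch4_demo_dataset.tar.gz

# Step 1: untar and unzip (z = gunzip, x = extract, v = verbose, f = archive file).
# Skipped silently when the archive is not present on this machine.
if [ -f "$archive" ]; then
    tar -zxvf "$archive"
fi

# Step 2: print the first 10 lines of every .fq file below a directory.
preview_fq() {
    find "$1" -name '*.fq' | sort | while IFS= read -r fq; do
        echo "== $fq"
        head -n 10 "$fq"
    done
}

# Step 3: rename every .fq file below a directory to .fastq.
rename_fq_to_fastq() {
    find "$1" -name '*.fq' | while IFS= read -r fq; do
        mv "$fq" "${fq%.fq}.fastq"
    done
}

# Example calls, run from the directory where the archive was unpacked:
#   preview_fq .
#   rename_fq_to_fastq .
```

For Question 1.1, a fastq record normally shows up in the `head` output as four lines: a header starting with `@`, the sequence, a line starting with `+`, and a quality string of the same length as the sequence.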
Evaluation
Exercise 1 (continued):
4. Count the number of sequences in each fastq file using
commands you learnt earlier.
5. Convert the fastq files to fasta.
6. Explore other tools in the FASTX toolkit.
7. Now count the number of sequences in each fasta file and see
if the number of sequences has changed.
Tip: Use ‘grep’
Tip: Use ‘fastq_to_fasta -h’ to see help
Use Google if you are stuck
Slide credit: Aureliano Bombarely
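Steps 4 to 7 might look something like the sketch below. The function names are hypothetical, the read-name prefix is assumed to be the SRR accession, and the awk converter is only a stand-in written with standard tools; it is not the FASTX `fastq_to_fasta` program and does not reproduce its filtering.

```shell
# Sketch of Exercise 1, steps 4-7, using standard tools only.
# All function names here are hypothetical.

# Step 4 (grep, as the tip suggests): count header lines. A bare '^@' can
# overcount because '@' is also a legal quality character, so the pattern
# includes the read-name prefix (assumed to be the SRR accession).
count_fastq_headers() {
    grep -c "^@$2" "$1"      # $1 = fastq file, $2 = prefix such as SRR404331
}

# Step 4 (cross-check): assumes four-line records (no wrapped sequences).
count_fastq_records() {
    awk 'END { print NR / 4 }' "$1"
}

# Step 5: the FASTX command would be along the lines of
#   fastq_to_fasta -i reads.fastq -o reads.fasta
# (some builds may need -Q33 for Phred+33 qualities; check fastq_to_fasta -h).
# The awk stand-in below is NOT the FASTX tool: it keeps every read.
fastq_to_fasta_awk() {
    awk 'NR % 4 == 1 { print ">" substr($0, 2) }
         NR % 4 == 2 { print }' "$1"
}

# Step 7: every fasta record starts with '>'.
count_fasta_records() {
    grep -c '^>' "$1"
}

# Example calls:
#   count_fastq_headers SRR404331_ch4.fastq SRR404331
#   fastq_to_fasta_awk SRR404331_ch4.fastq > SRR404331_ch4.fasta
#   count_fasta_records SRR404331_ch4.fasta
```

If the counts differ after running the real `fastq_to_fasta`, its help text is worth a second look: as far as that help describes, reads containing unknown (N) nucleotides are discarded by default unless `-n` is given.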
Evaluation: Sequence Quality
[Figure: good Illumina dataset]
Evaluation: Sequence Quality
[Figure panels: good Illumina dataset, poor Illumina dataset]
Evaluation: Sequence Quality
[Figure panels: 454, Pacific Biosciences]
Evaluation: Sequence Content
[Figure: good Illumina dataset]
Evaluation: Sequence Content
[Figure panels: good Illumina dataset, poor Illumina dataset]
Evaluation: Duplication
[Figure: good Illumina dataset]
Evaluation: Duplication
[Figure panels: good Illumina dataset, poor Illumina dataset]
Evaluation: Overrepresented Sequences
[Figure: good Illumina dataset]
Evaluation: Overrepresented Sequences
[Figure panels: good Illumina dataset, poor Illumina dataset]
Evaluation: Kmer content
[Figure: good Illumina dataset]
Evaluation: Kmer content
[Figure panels: good Illumina dataset, poor Illumina dataset]
Evaluation: Kmer content
[Figure panels: 454, Pacific Biosciences]
Evaluation
Exercise 2:
1. Type ‘fastqc’ to start the FastQC program. Load the four
fastq sequence files in the program.
Question 2.2: How many sequences are there per file according to FastQC?
Question 2.3: What is the length range of these reads?
Question 2.4: What is the quality score range of these reads? Which
dataset looks best quality-wise?
Question 2.5: Do these datasets have overrepresented reads?
Question 2.6: Looking at the kmer content, do you think that the samples
contain an adaptor?
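The numbers FastQC reports for Questions 2.3 to 2.5 can be cross-checked on the command line. The sketch below is only a rough check under stated assumptions: four-line fastq records, Phred+33 quality encoding, and hypothetical function names. It does not reproduce FastQC's own statistics.

```shell
# Rough command-line cross-checks for Questions 2.3-2.5.
# Assumptions: four-line fastq records and Phred+33 quality encoding.
# All function names are hypothetical.

# Question 2.3: shortest and longest read (the sequence is line 2 of each record).
read_length_range() {
    awk 'NR % 4 == 2 {
             n = length($0)
             if (min == "" || n < min) min = n
             if (n > max) max = n
         }
         END { print min, max }' "$1"
}

# Question 2.4: lowest and highest quality score. The quality string is
# line 4 of each record; od prints the ASCII code of every character.
quality_score_range() {
    awk 'NR % 4 == 0' "$1" | tr -d '\n' | od -An -v -tu1 |
        tr -s ' ' '\n' | grep -v '^$' | sort -n -u |
        awk 'NR == 1 { min = $1 } { max = $1 } END { print min - 33, max - 33 }'
}

# Question 2.5: the N most frequent read sequences with their counts.
top_repeated_reads() {
    awk 'NR % 4 == 2' "$1" | sort | uniq -c | sort -k1,1nr | head -n "${2:-10}"
}

# FastQC can also be run without the GUI by passing files on the command
# line, e.g. (the output directory must already exist):
#   mkdir -p fastqc_out && fastqc -o fastqc_out *.fastq
```

A sequence that shows up many times in `top_repeated_reads` is a candidate for the overrepresented sequences table in FastQC, and comparing it against known adaptor sequences is one way to approach Question 2.6.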
Preprocessing
Goal:
Trim the low-quality ends of the reads and remove
the short reads.
Data:
(Illumina data for two tomato ripening stages)
ch4_demo_dataset.tar.gz
Tools:
fastq-mcf (command line tool to process reads)
FastQC (GUI, calculates several stats for each file)
Preprocessing
Exercise 3:
• Download the file adapters1.fa from
ftp://ftp.solgenomics.net/user_requests/aubombarely/courses/RNAseqCorpoica/adapters1.fa
• Run the read processing program over each of the datasets
using
  • Min. qscore of 30
  • Min. length of 40 bp
• Type ‘fastqc’ to start the FastQC program. Load the four
new fastq sequence files. Compare the results with the
previous datasets.
Tip: Use ‘fastqc -h’ to see help
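To make the two thresholds concrete, here is a simplified sketch of what a minimum quality of 30 and a minimum length of 40 bp mean for a single read. It is a stand-in written in awk with a hypothetical function name, not fastq-mcf and not its algorithm: it trims only the 3' end, assumes Phred+33 qualities and four-line records, and does no adaptor clipping.

```shell
# With fastq-mcf itself, the call would be along the lines of
#   fastq-mcf -q 30 -l 40 -o trimmed.fastq adapters1.fa reads.fastq
# (flag meanings as I understand them from the ea-utils help; confirm with
# fastq-mcf -h before relying on them).

# Simplified stand-in (NOT fastq-mcf's algorithm): drop bases from the 3' end
# while their Phred+33 quality is below min_q, then discard reads that end up
# shorter than min_len. No adaptor clipping is attempted.
trim_3prime() {
    # $1 = fastq file, $2 = min. quality (default 30), $3 = min. length (default 40)
    awk -v min_q="${2:-30}" -v min_len="${3:-40}" '
        BEGIN { for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i }
        NR % 4 == 1 { header = $0 }
        NR % 4 == 2 { bases = $0 }
        NR % 4 == 3 { plus = $0 }
        NR % 4 == 0 {
            keep = length($0)
            while (keep > 0 && ord[substr($0, keep, 1)] - 33 < min_q) keep--
            if (keep >= min_len) {
                print header
                print substr(bases, 1, keep)
                print plus
                print substr($0, 1, keep)
            }
        }' "$1"
}

# Example call:
#   trim_3prime SRR404331_ch4.fastq 30 40 > SRR404331_ch4.trimmed.fastq
```

Loading the trimmed files into FastQC next to the originals should show the effect of both thresholds in the per-base quality and length distribution plots.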
Need Help??
Solutions: https://bitly.com/BioinfoInternExSol2014
