Making powerful science: an
introduction to NGS data analysis
Dr Adam Cribbs
Group leader in systems biology
Botnar Research Centre
Introduction
PhD
Prof Fionula Brennan
Tregs in Rheumatoid Arthritis
Postdoctoral scientist
Prof Sir Marc Feldmann
Prof Udo Oppermann
Epigenetics of T cells
MRC Career development fellowship
Prof Chris Ponting
Dr David Sims
Systems biology
MRC Career development fellowship
PI position
Current research
Purpose of this section
• Introduction to the concepts in NGS data analysis
• Data formats and quality control
• Challenges in data analysis
• Software and pipelines
Application of NGS sequencing
Sequencing of
Genomic DNA
Sequencing of
DNA library
Sequencing of
cDNA library
Whole genome sequencing
• Genome re-sequencing
• de novo genome sequencing
• Metagenomics applications
Epigenetic profiling
• Methylation sequencing
• Nucleosome footprinting
Genomic footprinting
• ChIP sequencing
Targeted sequencing
• PCR-amplified regions
• Capture-enriched DNA
Transcriptome analysis
• Novel RNA classes (lncRNAs)
• Novel splice variants
Transcriptome expression
• mRNA
• Small RNA
RNA footprinting
• Ribosomal footprinting
• RNA-IP sequencing
Bioinformatic challenges
Now I have my data what do I do????
Bioinformatic challenges
• Cost of sequencing a human genome: from about $2.7 billion
(the Human Genome Project) to hundreds of £
• NGS pushed the need for
bioinformatics and big
data analytics
• Need for power!!
Need for computation
• Need for computer power
• VERY large files (10s of millions of lines)
• Impractical to use familiar tools, such as simple Python scripts that load a whole file
• Impossible memory usage and execution time
• Need for a large amount of compute power
• Compute clusters
• Parallel code and multi threading to speed up analysis
• Need for faster software
• Pipelines
• Bioinformatics power!
• Properly structured working
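One way to keep memory usage flat on very large files is to stream them record by record instead of loading them whole. The sketch below counts reads in a FASTQ file this way; the function name and the tiny in-memory example are illustrative assumptions, and a real file would be opened with open() or gzip.open().

```python
import io
from itertools import islice


def count_fastq_reads(handle):
    """Count reads by streaming the file four lines at a time."""
    n_reads = 0
    while True:
        record = list(islice(handle, 4))  # one FASTQ record spans 4 lines
        if not record:
            break  # end of file
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        n_reads += 1
    return n_reads


# Tiny in-memory stand-in for a real FASTQ file (hypothetical reads)
example = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nGGCA\n+\nIIII\n")
n_example_reads = count_fastq_reads(example)
```

Only one record is held in memory at any time, so the same loop works on a file with tens of millions of lines.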
Data management issues
• How to store data – very large raw data
• Alternative data structures – e.g. binary storage (BAM files)
• Certain studies use different amounts of storage
• RNA-seq – about 2 GB per file
• WGS – about 500 GB per file
• Less of an issue now than it used to be 3-5 years ago
– hardware improvements
Computational clusters
• Multi-nodes (servers) with multi-cores
• High performance storage (expensive)
• In-line storage
• Fast networks (50Gb Ethernet between nodes)
• Located in a single data centre
• Need skilled data-admin staff to monitor and fix
issues
Cloud based analysis
• Pros
• Flexible
• Pay for what you use
• Don’t need to maintain a data centre
• Cons
• Transferring big data over the internet is slow
• You pay for bandwidth
• Lower performance – disk IO
• Privacy/data concerns
• More expensive for long term projects
The future
• NGS arrived in 2007/2008
• No-one predicted NGS in 2001
• How can we really predict the future?
• Problems will always remain:
• Software always lags behind hardware
Bioinformatics and computational biology
• The term bioinformatician can mean many things
• Usually little biology background but quantitative
skills
• Computational biologist is usually someone with a
biology and quantitative background
• There is definitely a massive skills shortage in both
How to learn computational skills
• Introduction to Next-gen data analysis
• EBI in Cambridge - https://www.ebi.ac.uk/training/online/course/functional-genomics-ii-common-technologies-and-data-analysis-methods/next-generation
• OBDI program
• 3 month short term training for a particular skill
• https://www.imm.ox.ac.uk/research/units-and-centres/mrc-wimm-centre-for-computational-biology/training
• Undertake part of your PhD in a computational group
NGS analysis
NGS data analysis
Raw reads from
sequencer
Quality assessment of
reads
Mapping
Pathway
analysis
Gene
networks
Data storage and
visualisation
Quality control of reads
• Sequencing output:
• Reads + quality
• Flat files – are very large – inefficient but it’s the
standard
• Question: is the quality of my sequencing data good?
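To make "reads + quality" concrete: each FASTQ record carries a quality string with one character per base. The sketch below decodes it, assuming the common Phred+33 encoding (older Illumina data may use a different offset); the function names are illustrative.

```python
def phred_scores(quality_string, offset=33):
    """Convert a FASTQ quality string to a list of Phred scores."""
    return [ord(ch) - offset for ch in quality_string]


def mean_quality(quality_string, offset=33):
    """Average Phred score across the read."""
    scores = phred_scores(quality_string, offset)
    return sum(scores) / len(scores)


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


example_scores = phred_scores("II5#")  # 'I' -> 40, '5' -> 20, '#' -> 2
```

A Phred score of 20 corresponds to a 1 in 100 chance of a wrong base call, and 30 to 1 in 1000.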
Quality control of reads
• FastQC – Babraham Institute
• https://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Tools to deal with read QC
• FASTX-Toolkit – pre-process and clean up different datasets
• FastQ Screen – check that your data is not contaminated
• Trimming to improve quality
• Trimmomatic
• Cutadapt
• There are many many more!
• But beware of removing too many reads or trimming
too much
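The idea behind quality trimming can be sketched in a few lines. This is a deliberately simple 3' end trimmer, not the algorithm used by Trimmomatic or Cutadapt; the thresholds and function names are assumptions for illustration, and Phred+33 encoding is assumed.

```python
def trim_3prime(seq, qual, min_q=20, offset=33):
    """Remove trailing bases whose Phred score is below min_q."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    return seq[:end], qual[:end]


def keep_read(seq, min_len=20):
    """Discard reads that became too short after trimming."""
    return len(seq) >= min_len


# The last two bases have quality '#' (Phred 2) and are trimmed away
trimmed_seq, trimmed_qual = trim_3prime("ACGTACGT", "IIIIII##")
```

Raising min_q or min_len removes more data, which is exactly the trade-off the warning above refers to.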
Mapping reads to genome/transcriptome
• It is very important to get the mapping right
• Many different mappers – make sure you use the
latest software
• Always treat your samples consistently
Mapping reads to genome/transcriptome
• Main issues:
• Number of mismatches
• Number of multi-hits
• Expected distance between mates (paired-end reads)
• Reads spanning exon junctions
GTF file for mapping
• File format for the reference annotation (genes, transcripts, exons); the reference sequence itself is stored as FASTA
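A GTF line has nine tab-separated columns describing one feature (such as an exon), with the last column holding key "value" attribute pairs. The parser below is a minimal sketch: the example line and its identifiers (GENE1, TX1) are made up, and attribute values that themselves contain semicolons are not handled.

```python
def parse_gtf_line(line):
    """Split one GTF feature line into a dict of typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError("a GTF line must have 9 tab-separated columns")
    seqname, source, feature, start, end, score, strand, frame, attr_text = fields
    attributes = {}
    for item in attr_text.strip().split(";"):
        item = item.strip()
        if not item:
            continue  # skip the empty piece after the final ';'
        key, _, value = item.partition(" ")
        attributes[key] = value.strip('"')
    return {
        "seqname": seqname,
        "source": source,
        "feature": feature,
        "start": int(start),  # 1-based, inclusive
        "end": int(end),      # 1-based, inclusive
        "score": score,
        "strand": strand,
        "frame": frame,
        "attributes": attributes,
    }


# Hypothetical exon record
example_line = 'chr1\texample\texon\t11869\t12227\t.\t+\t.\tgene_id "GENE1"; transcript_id "TX1";'
exon = parse_gtf_line(example_line)
```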
Mapping reads to genome
• Which one to use???
• Depends on application
Mapping reads to transcriptome
• Which one to use???
• Depends on application
• Don’t use TopHat or HISAT – use TopHat2 and HISAT2
SAM/BAM format
• Standard mapping output
• Sequence alignment map (SAM)
• Tab delimited
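• 11 mandatory fields
1. Read name (QNAME)
2. Flag (FLAG)
3. Reference name (RNAME)
4. Position (POS)
5. Mapping quality (MAPQ)
6. CIGAR
7. Ref name of mate (RNEXT)
8. Pos of mate (PNEXT)
9. Template len (TLEN)
10. Seq (SEQ)
11. Base qualities (QUAL)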
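As a sketch of how the mandatory columns map onto a record, the snippet below splits one SAM alignment line into the 11 fields defined in the SAM specification (QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, PNEXT, TLEN, SEQ, QUAL). The example read is made up; in practice a library such as pysam does this for you.

```python
from collections import namedtuple

SamRecord = namedtuple(
    "SamRecord",
    "qname flag rname pos mapq cigar rnext pnext tlen seq qual",
)


def parse_sam_line(line):
    """Parse the 11 mandatory columns of one SAM alignment line."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("a SAM alignment line needs at least 11 columns")
    qname, flag, rname, pos, mapq, cigar, rnext, pnext, tlen, seq, qual = fields[:11]
    # Optional TAG:TYPE:VALUE columns after the 11th are ignored here
    return SamRecord(qname, int(flag), rname, int(pos), int(mapq),
                     cigar, rnext, int(pnext), int(tlen), seq, qual)


# Hypothetical alignment with one optional tag at the end
record = parse_sam_line("read1\t99\tchr1\t100\t60\t4M\t=\t200\t104\tACGT\tIIII\tNM:i:0")
```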
SAM/BAM format
• FLAG
• CIGAR
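The FLAG is a bit field and the CIGAR is a run-length encoded description of the alignment. The sketch below decodes both using the bit values and operation letters from the SAM specification; the short flag names are my own labels, not official ones.

```python
import re

# Bit values from the SAM specification; names are informal labels
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """List the properties that are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


def parse_cigar(cigar):
    """Split a CIGAR string into (length, operation) pairs."""
    return [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]


def reference_span(cigar):
    """Number of reference bases covered (M, D, N, = and X consume reference)."""
    return sum(n for n, op in parse_cigar(cigar) if op in "MDN=X")


flag_99 = decode_flag(99)                 # 99 = 1 + 2 + 32 + 64
spliced = parse_cigar("5S50M1000N45M")    # N marks a skipped region, e.g. an intron
```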
SAM/BAM tools
• Command line
• Samtools
• view
• index
• sort
• Picard
• MarkDuplicates
• Python
• Pysam – maintained and developed by CGAT (Andreas Heger)
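The samtools subcommands listed above are typically chained: sort, then index, then view. The sketch below only builds the command lines from Python; the file names and thread count are placeholders, and actually running them (via run_all) assumes samtools is installed and on the PATH.

```python
import subprocess


def samtools_commands(in_bam, sorted_bam, threads=4):
    """Build a sort -> index -> count-mapped-reads command sequence."""
    return [
        # Coordinate-sort the alignments
        ["samtools", "sort", "-@", str(threads), "-o", sorted_bam, in_bam],
        # Index the sorted BAM for random access
        ["samtools", "index", sorted_bam],
        # Count alignments, excluding unmapped reads (FLAG bit 4)
        ["samtools", "view", "-c", "-F", "4", sorted_bam],
    ]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for cmd in commands:
        subprocess.run(cmd, check=True)


# Placeholder file names; nothing is executed here
commands = samtools_commands("sample.bam", "sample.sorted.bam")
```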
Workflows: RNA-seq
RNA-seq workflow for DEG
• Workflow 1:
• TopHat2 (align) -> Cufflinks (transcript assembly) -> Cuffmerge (merge assemblies) -> Cuffdiff (DEG)
• Workflow 2:
• HISAT2 (align; any spliced mapper will do) -> featureCounts (counting reads per gene or transcript) -> DESeq2 or edgeR (DEG)
HISAT2 alignment -> featureCounts -> DESeq2
DESeq2: generalised linear model that accounts for the negative binomial distribution of counts
Count data
• Following featureCounts you are left with a counts table
A few genes have large counts and
most genes have small counts
DEG methods compared
• Which model to use????
• My preference is DESeq2
• Well written and better supported
• edgeR may not control type I errors as well?
[Figure: comparison of DEG methods; panels: Microarray, RNA-seq]
DESeq2 model
• Model overview:
• Fits a negative binomial GLM to the counts, using a size factor for each sample
• Cook's distance for detecting count outliers
• Dispersion is estimated for each gene
• Zero-centred normal prior shrinks the fold changes of low-count genes
• Wald test or likelihood ratio test (LRT)
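The per-sample size factor can be illustrated with the median-of-ratios idea that DESeq2 uses for normalisation. This is a plain-Python sketch of that idea, not DESeq2's own code; the toy counts are made up, and real analyses should use DESeq2 in R.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts (one per sample).
    """
    n_samples = len(counts[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if any(c == 0 for c in row):
            continue  # geometric mean would be zero; skip this gene
        log_geo_mean = sum(math.log(c) for c in row) / n_samples
        for j, c in enumerate(row):
            log_ratios[j].append(math.log(c) - log_geo_mean)
    # Median ratio of each sample to the per-gene geometric mean
    return [math.exp(median(r)) for r in log_ratios]


# Toy table: sample 2 was sequenced twice as deeply as sample 1
toy_counts = [[10, 20], [100, 200], [5, 10], [0, 7]]
factors = size_factors(toy_counts)
```

Dividing each sample's counts by its factor puts the samples on a common scale; here the two factors differ by a factor of two, matching the difference in depth.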
Pathway analysis
• Pathway analysis helps to identify novel pathways that may be
disease relevant
• Skewed towards cancer
• Not always informative
• Paid vs public
Biological interpretation
• The most important part and most difficult
• Can be a problem when dealing with a company
• Language barrier between biologist and bioinformatician
• Visualising data helps overcome this
Developing pipelines
• To speed up your analysis and make your code
reproducible you need to write pipelines
https://github.com/Acribbs/scflow
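The core idea of a pipeline can be sketched as a list of steps that each turn an input file into an output file, and that are skipped when the output is already up to date. This is a toy illustration under my own assumptions (step names, stand-in functions), not the design of any particular workflow framework.

```python
import os
import tempfile


def needs_update(src, dst):
    """A step must run if its output is missing or older than its input."""
    return not os.path.exists(dst) or os.path.getmtime(dst) < os.path.getmtime(src)


def run_pipeline(steps):
    """steps: list of (name, src, dst, func) in dependency order."""
    executed = []
    for name, src, dst, func in steps:
        if needs_update(src, dst):
            func(src, dst)
            executed.append(name)
    return executed


def uppercase(src, dst):
    """Stand-in for a cleaning step such as trimming."""
    with open(src) as fin, open(dst, "w") as fout:
        fout.write(fin.read().upper())


def count_lines(src, dst):
    """Stand-in for a summarising step such as read counting."""
    with open(src) as fin, open(dst, "w") as fout:
        fout.write(str(sum(1 for _ in fin)))


with tempfile.TemporaryDirectory() as workdir:
    raw = os.path.join(workdir, "raw.txt")
    clean = os.path.join(workdir, "clean.txt")
    summary = os.path.join(workdir, "summary.txt")
    with open(raw, "w") as fh:
        fh.write("acgt\nttga\n")
    steps = [("clean", raw, clean, uppercase), ("count", clean, summary, count_lines)]
    first_run = run_pipeline(steps)   # both steps execute
    second_run = run_pipeline(steps)  # outputs are up to date, nothing runs
    with open(summary) as fh:
        summary_text = fh.read()
```

Re-running the pipeline only repeats the steps whose inputs changed, which is what makes the analysis both faster and reproducible.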
Further resources
Further resources
• Please email me
• MOOCS:
• Coursera: https://www.coursera.org/learn/bioinformatics-methods-1
• edX: https://www.edx.org/micromasters/bioinformatics
• Programming skills:
• Codecademy
• EBI Introduction to Next-generation sequencing
course - competitive
