The importance (and absence)
of annotation in Next
Generation Sequencing Data
Hugh Shanahan & Jamie Alnasir
Hugh.Shanahan@rhul.ac.uk
@hughshanahan
Results to be published in GigaScience
It was the best of times
• Many exciting experiments based on gathering huge amounts of data.
• 100,000 Genomes in the UK, many others
• Elixir - Exabytes of biomedical data in the next decade
• Large experiments - SKA, LHC
• Opening up of Government data
• Up ahead - Sensor networks and Monitoring Cities
• Machine Learning is now a widely accepted tool in analysing data and
in making decisions.
• Evidence-based policy becoming the norm.
It was the worst of times
• Leaks appearing in the Scientific process.
• In domains with many possible relationships, most
published results are wrong (Ioannidis, PLoS
Medicine, 2005).
• Only about 1/4 of 67 published experiments on drug targets
could be reproduced (Prinz et al., Nat. Rev. Drug Disc., 2011)
• Only 39% of key Psychology experiments could be
reproduced (Nature News, 2015).
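The Ioannidis point can be made concrete with the bias-free form of the positive predictive value from that paper, PPV = (1 - β)R / ((1 - β)R + α), where R is the ratio of true to false relationships among those tested. The sketch below only evaluates that formula; the values of R, α and power are illustrative assumptions, not figures from this talk.

```python
def positive_predictive_value(prior_odds, alpha=0.05, power=0.8):
    """Chance that a statistically significant finding is actually true.

    prior_odds: R, ratio of true to false relationships among those tested.
    alpha: type I error rate. power: 1 - beta.
    Bias and multiple competing teams are ignored in this simple form.
    """
    true_positives = power * prior_odds
    false_positives = alpha
    return true_positives / (true_positives + false_positives)


# Illustrative (assumed) prior odds: from even odds down to 1 in 100.
for odds in (1.0, 0.1, 0.01):
    ppv = positive_predictive_value(odds)
    print(f"R = {odds:<5} PPV = {ppv:.2f}")
```

When only a small fraction of the tested relationships are real, the PPV falls below one half even with good power, which is the sense in which most published results in such domains are expected to be wrong.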
Poor statistics?
• Naive use of p-value
calculations across fields.
• Banning use of Null
Hypothesis Significance Test
Procedure in Basic and
Applied Social Psychology
(Trafimow and Marks, BASP,
2015)
• Not the end of the story…more
like the tip of the iceberg
(Leek and Peng, Nature 2015)
Lessons learnt
• Results from individual experiments are probably
wrong.
• Bias in your data means your conclusions are
even more likely to be wrong.
• Meta-analyses help.
• Understand how you got the data you have.
Sequence Read Archive
• Central repository of sequence data.
• Nearly 30,000 genomic and transcriptomics
experiments stored and freely available.
• 2 x 10^15 nucleotides stored
• Based on Next Generation Sequencing
• Step reduction in cost of sequencing
• ~$thousands for a human genome
• Potentially an enormous resource
• But how do you get that data?
Good news
• SRA data is open
• Stored in a sensible way (uses SQL)
• API and documentation to access it
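As a rough illustration of what SQL access to experiment metadata makes possible, the sketch below builds a tiny in-memory SQLite table and counts how many records carry any protocol text. The table name `experiment`, the column `protocol_text` and the four rows are hypothetical stand-ins invented for this example; they are not the actual SRA schema or real accessions.

```python
import sqlite3


def count_with_protocol(conn):
    """Return (records with non-empty protocol text, total records)."""
    total = conn.execute("SELECT COUNT(*) FROM experiment").fetchone()[0]
    described = conn.execute(
        "SELECT COUNT(*) FROM experiment "
        "WHERE protocol_text IS NOT NULL AND TRIM(protocol_text) <> ''"
    ).fetchone()[0]
    return described, total


# Hypothetical toy data: accessions and protocol strings are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE experiment (accession TEXT PRIMARY KEY, protocol_text TEXT)"
)
conn.executemany(
    "INSERT INTO experiment VALUES (?, ?)",
    [
        ("EXP001", "Poly-A selection, random hexamer priming, 12 PCR cycles"),
        ("EXP002", None),
        ("EXP003", "   "),
        ("EXP004", "standard protocol"),
    ],
)

described, total = count_with_protocol(conn)
print(f"{described} of {total} toy records carry any protocol text")
```

A non-empty field is only a lower bar: "standard protocol" counts as described here even though it says nothing about the individual steps.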
Mucky business
• Data stored in SRA are short reads.
• ~100 nucleotide-long fragments which are then
assembled.
• Very long pipeline to get from a sample to this
step.
• Pipeline (Protocol in their lingo) is VARIABLE
Obvious question
• Is there any evidence of bias in the data due to
varying the protocol?
Even More Obvious
Question
• Where is the metadata on the pipeline
(protocol)?
4% of experiments describe all of the
steps
What’s more…
• Metadata are stored as text fields.
• Hugely difficult task to parse.
• Submitters are not obliged to fill this data in.
• Confusion about what level to enter data in.
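To see why free-text metadata is so hard to parse, consider a minimal keyword-matching sketch. The three step names and their regular expressions are assumptions chosen for illustration; they are not the step list or the method used in the study.

```python
import re

# Hypothetical protocol steps and the phrasings we guess submitters might use.
STEP_PATTERNS = {
    "fragmentation": re.compile(r"fragment|sonicat|shear", re.I),
    "priming": re.compile(r"random hexamer|oligo[- ]?\(?dT\)?", re.I),
    "amplification": re.compile(r"\bPCR\b|amplif", re.I),
}


def steps_mentioned(text):
    """Return the set of step names whose pattern occurs in the free text."""
    return {name for name, pat in STEP_PATTERNS.items() if pat.search(text or "")}


def describes_all_steps(text):
    """True only if every step in STEP_PATTERNS is mentioned."""
    return steps_mentioned(text) == set(STEP_PATTERNS)


examples = [
    "RNA was sheared by sonication, primed with random hexamers "
    "and amplified with 12 PCR cycles.",
    "cDNA synthesis used oligo-dT; fragmented enzymatically.",
    "Library prepared according to manufacturer's instructions.",
]
for text in examples:
    print(sorted(steps_mentioned(text)), describes_all_steps(text))
```

Every new synonym, abbreviation or kit name needs another pattern, and a field such as "according to manufacturer's instructions" cannot be recovered by any amount of pattern matching.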
Bottom line
• For much of the SRA data, there is a “known
unknown” about biases due to preparation.
• It’s very unlikely we’ll ever be able to figure that
out.
Why should you be paying
attention?
• As a member of the public - it’s your money
down the drain ($10^8-$10^9)
• As a researcher - all of this undermines
confidence in Science as a whole.
• If you work with big (and more particularly)
complex data - the same issues will crop up for
you.
Answers?
• Understand how you got your data - even if it’s a step
for modelling.
• Metadata is crucial.
• Organising your data is crucial.
• Use Ontologies
• Use discrete keywords
• Get people to use it
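One way to picture "discrete keywords" is a submission form that only accepts terms from a fixed vocabulary. The sketch below uses Python enums for this; the field names and allowed terms are hypothetical and are not drawn from any real ontology or from the SRA submission format.

```python
from dataclasses import dataclass
from enum import Enum


class Fragmentation(Enum):
    # Hypothetical controlled terms for illustration only.
    SONICATION = "sonication"
    ENZYMATIC = "enzymatic"
    CHEMICAL = "chemical"


class Priming(Enum):
    RANDOM_HEXAMER = "random_hexamer"
    OLIGO_DT = "oligo_dt"


@dataclass(frozen=True)
class ProtocolRecord:
    fragmentation: Fragmentation
    priming: Priming
    pcr_cycles: int


def parse_submission(raw):
    """Validate a submitted dict; raise ValueError on unknown or missing terms."""
    try:
        return ProtocolRecord(
            fragmentation=Fragmentation(raw["fragmentation"]),
            priming=Priming(raw["priming"]),
            pcr_cycles=int(raw["pcr_cycles"]),
        )
    except KeyError as missing:
        raise ValueError(f"missing required field: {missing}") from None


record = parse_submission(
    {"fragmentation": "sonication", "priming": "oligo_dt", "pcr_cycles": "12"}
)
print(record)
```

Rejecting unknown or missing terms at submission time is what keeps the metadata machine-readable later, at the cost of asking more of the submitter, which is why getting people to use it matters.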
In summary :-
We want to do all the clever stuff…
Most of the time we need to deal with
a ton of pitchblende to find the milligram
of radium.
