On the importance (and absence) of annotation in Next Generation Sequencing Data

The importance (and absence)
of annotation in the Next
Generation Sequence Data
Hugh Shanahan & Jamie Alnasir
Hugh.Shanahan@rhul.ac.uk
@hughshanahan
Results to be published in GigaScience

It was the best of times
• Many exciting experiments based on gathering huge amounts of data.
• 100,000 Genomes in the UK, many others
• Elixir - Exabytes of biomedical data in the next decade
• Large experiments - SKA, LHC
• Opening up of Government data
• Up ahead - Sensor networks and Monitoring Cities
• Machine Learning is now a widely accepted tool in analysing data and
in making decisions.
• Evidence-based policy becoming the norm.

It was the worst of times
• Leaks appearing in the Scientific process.
• In domains with many possible relationships, most
published results are wrong (Ioannidis, PLoS
Medicine, 2005).
• 1/4 of 67 published experiments on drug targets
reproduced (Prinz et al., Nat. Rev. Drug Disc., 2011)
• 39% of key Psychology experiments could be
reproduced (Nature News, 2015).

Poor statistics?
• Naive use of p-value
calculations across fields.
• Banning use of Null
Hypothesis Significance Test
Procedure in Basic and
Applied Social Psychology
(Trafimow and Marks, BASP,
2015)
• Not the end of the story…more
like the tip of the iceberg
(Leek and Peng, Nature 2015)

Lessons learnt
• Results from individual experiments are probably
wrong.
• Bias in your data means your conclusions are
even more likely to be wrong.
• Meta-analyses help.
• Understand how you got the data you have.

Sequence Read Archive
• Central repository of sequence data.
• Nearly 30,000 genomic and transcriptomics
experiments stored and freely available.
• 2 x 10^15 nucleotides stored

• Based on Next Generation Sequencing
• Step reduction in cost of sequencing
• ~$thousands for a human genome
• Potentially an enormous resource
• But how do you get that data?

Good news
• SRA data is open
• Stored in a sensible way (uses SQL)
• API and documentation to access it

Mucky business
• Data stored in SRA are short reads.
• ~100 nucleotide-long fragments which are then
assembled.
• Very long pipeline to get from a sample to this
step.
• Pipeline (Protocol in their lingo) is VARIABLE

Obvious question
• Is there any evidence of bias in the data due to
varying the protocol?

Even More Obvious
Question
• Where is the metadata on the pipeline
(protocol)?

4% of experiments describe all of the
steps

What’s more…
• Metadata are stored as text fields.
• Hugely difficult task to parse.
• Submitters are not obliged to fill this data in.
• Confusion about what level to enter data in.
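To make the parsing problem concrete, here is a minimal Python sketch of keyword matching against free-text protocol fields. It is not the method used in the study: the step names, the regular expressions and the example records are all hypothetical, chosen only to show why free text is hard to score reliably.

```python
import re

# Hypothetical protocol steps and keyword patterns (illustration only).
STEP_PATTERNS = {
    "fragmentation": re.compile(r"fragment|shear|sonicat", re.IGNORECASE),
    "amplification": re.compile(r"\bpcr\b|amplif", re.IGNORECASE),
    "size_selection": re.compile(r"size[- ]select", re.IGNORECASE),
}


def steps_mentioned(text):
    """Return the set of steps whose pattern matches the free-text field."""
    if not text:
        return set()
    return {step for step, pat in STEP_PATTERNS.items() if pat.search(text)}


def fraction_fully_described(records):
    """Fraction of records whose text mentions every step in STEP_PATTERNS."""
    if not records:
        return 0.0
    all_steps = set(STEP_PATTERNS)
    complete = sum(1 for text in records if steps_mentioned(text) == all_steps)
    return complete / len(records)


# Made-up free-text fields, not real SRA entries.
EXAMPLE_RECORDS = [
    "DNA was sheared by sonication, size-selected on a gel and amplified by PCR.",
    "Standard Illumina protocol.",
    "",
    "Fragmented RNA; 12 cycles of PCR.",
]
```

Even in this toy case the second record may well describe a complete protocol, but nothing in the text lets a parser confirm it, and every new phrasing needs another hand-written pattern.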

Bottom line
• For much of the SRA data, there is a “known
unknown” about biases due to preparation.
• It’s very unlikely we’ll ever be able to figure that
out.

Why should you be paying
attention?
• As a member of the public - it’s your money
down the drain ($10^8-$10^9)
• As a researcher - all of this undermines
confidence in Science as a whole.
• If you work with big (and more particularly)
complex data - the same issues will crop up for
you.

Answers?
• Understand how you got your data - even if it’s a step
for modelling.
• Metadata is crucial.
• Organising your data is crucial.
• Use Ontologies
• Use discrete keywords
• Get people to use it
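As a rough illustration of the "discrete keywords" point, the sketch below checks a submission against a small controlled vocabulary instead of accepting free text. The field name, the allowed terms and the `validate` helper are hypothetical and are not taken from any real ontology or from the SRA submission system.

```python
from enum import Enum


class Fragmentation(Enum):
    """Hypothetical controlled vocabulary for one protocol step."""
    SONICATION = "sonication"
    ENZYMATIC = "enzymatic"
    NEBULISATION = "nebulisation"
    NONE = "none"


# Hypothetical mapping of required metadata fields to their vocabularies.
REQUIRED_FIELDS = {"fragmentation": Fragmentation}


def validate(record):
    """Return a list of problems; an empty list means the record passes."""
    problems = []
    for field, vocab in REQUIRED_FIELDS.items():
        allowed = {member.value for member in vocab}
        value = record.get(field)
        if value is None:
            problems.append(f"missing field: {field}")
        elif value not in allowed:
            problems.append(f"unknown term for {field}: {value!r}")
    return problems
```

With a check like this at submission time, a missing or free-text entry is rejected immediately, and the stored value can later be counted or compared without any text parsing.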

In summary :-
We want to do all the clever stuff….

Most of the time we need to deal with
a ton of pitchblende to find the milligram
of Radium ..
