Big biomedical data is a lie

Taming large datasets for translational research
Paul Agapow 
Data Science Institute

Imperial College London

<p.agapow@imperial.ac.uk>

2018/1/31

Disclosure / About me
• Data Science Institute (Imperial College London)
• Big rich biomedical datasets for translational research & precision medicine
• Novel & advanced computation for research
• No actual or potential conflict of interest in relation to this presentation

“Nice training set. Where’s your data?”
– An analyst

Biomedical big data is often not big enough
• Average trial size on ClinicalTrials.gov < 100
• Average number of samples per GEO dataset < 100
• Average GWAS cohort size ~9000 (median ~2500)
• 1,064 ICU admissions for flu in UK 2016/2017 season
• Curse of dimensionality
• Deep learning requires “thousands” of samples for training (at least p²?)
• GWAS needs 3K+ for large effects, 10K or more for small effects …
• Sub-populations will be smaller
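
One face of the curse of dimensionality can be sketched numerically: as features are added, the nearest and farthest neighbours of a point end up almost equally far away, so a fixed number of samples tells you less and less. The following is a minimal illustration on uniform random data, not an analysis of any dataset named above; the function name `relative_contrast`, the 100 samples and the dimensions tried are all illustrative assumptions.

```python
import math
import random


def relative_contrast(n_points: int, n_dims: int, seed: int = 0) -> float:
    """Return (farthest - nearest) / nearest for distances from one query point.

    Points are drawn uniformly from the unit hypercube. A value near zero
    means every sample looks about equally far from the query.
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        distances.append(math.dist(query, point))
    nearest, farthest = min(distances), max(distances)
    return (farthest - nearest) / nearest


if __name__ == "__main__":
    # Same sample size throughout; only the number of features changes.
    for dims in (2, 10, 100, 1000):
        print(dims, round(relative_contrast(100, dims), 2))
```

With the sample size held at 100, the contrast shrinks steadily as dimensions are added, which is one reason a cohort that looks adequate for a handful of clinical variables can be far too small for omics-scale feature sets.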

Platforms are a problem not a panacea
• Biomedical data lakes / warehouses aren’t working
• Each is an island unto itself
• Tools can’t understand data formats
• High demands on user (meaning, context)
• Poor standardisation / harmonisation tools (curation effort == analysis effort)
• A world of distributed data
• A world of many computational idioms
• (Self) lock-in

Computers are not getting faster
• Data is embiggening
• Can’t rely on cheap computation to get us out of a hole
• Many HPC idioms, most awkward (e.g. Map-Reduce)
• Database schemas struggle at scale

What if every gene affects every other gene?
• Pritchard’s omnigenics (2017):
• Kevin Bacon effect
• Implicated genes are a few drivers and an enormous number of “related” loci
• How do we pick the “important” genes?

Statisticians hate us
• P-hacking
• Garden of forking paths
• Reversion to mean
• Multiple hypothesis testing
• False discovery
• P-values
• Which method is best?
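
The multiple-testing and false-discovery points can be made concrete with a short sketch. It shows the chance of at least one false positive across many independent tests, and the Benjamini-Hochberg step-up procedure as one common way to control the false discovery rate. The function names are illustrative, and the procedure is offered as an example of the problem, not as the method used in any work described here.

```python
def family_wise_error(n_tests, alpha=0.05):
    """Chance of at least one false positive across n independent true-null tests."""
    return 1 - (1 - alpha) ** n_tests


def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Step-up rule: find the largest rank k (1-based, ascending p-values) with
    p_(k) <= (k / m) * alpha, then reject the k smallest p-values.
    """
    m = len(p_values)
    # Indices of the p-values in ascending order, so results map back to input order.
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_k = rank
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    print(round(family_wise_error(20), 2))
    print(benjamini_hochberg([0.041, 0.001, 0.212, 0.008]))
```

Twenty uncorrected tests at 0.05 already give roughly a two-in-three chance of a spurious "finding", which is the arithmetic behind p-hacking and the garden of forking paths.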

In summary
• Data isn’t big (enough)
• Platforms are a problem
• Computation isn’t saving us
• Diseases are complicated
• We don’t know what we’re doing

Allow bigger datasets
• “Allow” reuse & combining, not “build”
• Assemble datasets according to standards (CDISC, EDAM, HPO)
• Poor tools but getting better: tmtk / Arborist, eHS
• Issue of trust
[Poster: “From Excel to tranSMART in five simple steps”, covering tmtk (Python library) and the Arborist (visual editor)]
• Starting point: your study data in Excel
• Steps named on the poster: Import, Validate, Edit, Save, Load
• Import: start the import wizard to create a study based on your study data.
• Validate: let the toolkit check the tranSMART-specific requirements.
• Send to the Arborist web application for easy collaboration
• The Arborist: collaborate on data modelling with non-technical data experts in the secure Arborist web application
• Restructure the tranSMART tree with drag and drop
• Rename variables and values
• Add and edit metadata for any tree node
• Work with both low and high dimensional data
• tmtk notable Python commands: the main object in the tmtk workflow is the Study. It provides an API for modifying and …
Try it at http://arborist-test-trait.thehyve.net/demo.
Code at https://github.com/thehyve/arborist under GPL v3 license.

eTRIKS project
• Via IMI: Europe’s largest public-private initiative
• Data intensive translational research
• Sharing data (standards, starter kit)
• Open knowledge platform
• Sustainable service

Example: U-BIOPRED
• Unbiased BIOmarkers in PREDiction of respiratory disease outcomes
• 900+ patients, 16 clinical centres + other studies combined via standards
• Outputs:
• Common tranSMART db
• 40+ academic publications
• Subtyping of asthmatics

Use your data better
• Pre-training (data without labels)
• Initial training with mediocre data
• Adapt
• Transfer learning (labels / output changes)
• Domain adaptation (data / input changes)
• Don’t use deep learning

Example: text extraction
• Aim: extract biological relationships from publications to build asthma knowledge base
• Using BEL statements
• Domain expert time is prohibitive
• Use previous efforts as training

Example: text classification for systematic reviews
• Aim: find similar or related publications within corpus
• Actual aim: find which method of text classification is “best” (Validation)
• Data: 15 Drug Control Reviews & Neuropathic Pain dataset
• Classify with random forest, naive bayes, SVM & CNNs
• Which has best recall?

When you don’t know what to use, use SVMs
Conclusion
Dataset                     WSS    Classifier
ACE Inhibitors              0.26   SVM
ADHD                        0.35   MNB
Antihistamines              0.19   MNB
Atypical Antipsychotics     0.12   SVM
Beta Blockers               0.13   SVM
CCB                         0.21   SVM
Estrogen                    0.25   SVM
Neuropathic Pain            0.61   CNN
NSAIDS                      0.14   SVM
Opioids                     0.23   SVM
Oral Hypoglycemics          0.21   SVM
PPI                         0.17   SVM
Skeletal Muscle Relaxants   0.21   SVM
Statins                     0.19   SVM
Triptans                    0.22   SVM
Urinary Incontinence        0.25   SVM
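
The slides do not define WSS. The sketch below assumes it stands for "work saved over sampling", a metric used in the systematic-review screening literature: the fraction of documents a reviewer can skip when reading in classifier-ranked order, minus the fraction they could skip by random sampling at the same recall. The 95% recall target, the function name `wss_at_recall` and the toy data are assumptions for illustration and do not reproduce the figures in the table.

```python
import math


def wss_at_recall(scores, labels, target_recall=0.95):
    """Work saved over sampling at a target recall (assumed definition).

    scores: classifier scores, higher meaning more likely relevant.
    labels: 1 for a relevant document, 0 otherwise.
    """
    total_relevant = sum(labels)
    if total_relevant == 0:
        raise ValueError("need at least one relevant document")
    # Read documents from the highest score downwards.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    needed = math.ceil(target_recall * total_relevant)
    found = 0
    for n_screened, (_, label) in enumerate(ranked, start=1):
        found += label
        if found >= needed:
            break
    n = len(ranked)
    # Fraction left unread, minus what random sampling would save at this recall.
    return (n - n_screened) / n - (1 - target_recall)


if __name__ == "__main__":
    # Toy corpus: 100 documents, 10 relevant, all ranked within the top 40.
    scores = list(range(100, 0, -1))
    labels = [1 if i < 9 or i == 39 else 0 for i in range(100)]
    print(round(wss_at_recall(scores, labels), 2))
```

Under this definition a value of 0.25 would mean a quarter of the screening workload is saved relative to random reading at the same recall, and a value near zero means the classifier is doing little better than chance ordering.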

Not platforms but meta-platforms
• The monolithic platform is dead
• We live in a world of distributed data
• Avoid lock-in
• Don’t try to do everything
• Interoperability
• Allow different computational idioms

tranSMART redevelopment
• eTRIKS enhancements
• i2b2 merger
• Next-generation tranSMART
• Major refactoring & performance fixes
• Additional tools & visualisation
• Component architecture
• Just a warehouse with API

Better HPC idioms
• Spark
• Map-Reduce but doesn’t persist back between steps
• Better for iterative processing
• Does less violence to problem
• Graphs & ML

Example: Spark for clustering
• Subtyping / stratification
• Popular methods are computationally prohibitive on rich data
• (Also ground truth unclear)
• “Sparkify”, compare, validate on asthma cohort
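
To show why iterative clustering suits an engine that keeps data in memory between steps, here is k-means written as repeated map (assign each point to its nearest centroid) and reduce (average each group) rounds. This is plain Python imitating the pattern, not Spark code and not the clustering method used on the asthma cohort; k-means is swapped in only because it is compact, and the names `assign`, `recompute` and `kmeans` are illustrative.

```python
import math
import random
from collections import defaultdict


def assign(point, centroids):
    """Map step: emit (index of nearest centroid, point)."""
    nearest = min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))
    return nearest, point


def recompute(assignments, old_centroids):
    """Reduce step: average the points assigned to each centroid."""
    groups = defaultdict(list)
    for idx, point in assignments:
        groups[idx].append(point)
    new_centroids = []
    for i, old in enumerate(old_centroids):
        members = groups.get(i)
        if not members:
            # An empty cluster keeps its previous centroid.
            new_centroids.append(old)
            continue
        new_centroids.append(tuple(sum(coord) / len(members) for coord in zip(*members)))
    return new_centroids


def kmeans(points, k, n_iter=20, seed=0):
    """Run map/reduce rounds until the centroids stop moving or n_iter is reached."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(n_iter):
        # The same in-memory `points` list is reused on every round. Classic
        # Map-Reduce would write results to disk and reread the data each time.
        assignments = [assign(p, centroids) for p in points]
        updated = recompute(assignments, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
    print(sorted(kmeans(data, 2)))
```

Every round touches the full dataset again, so the cost of a disk round trip is paid once per iteration under classic Map-Reduce; keeping the working set in memory (in Spark, typically by caching it) removes that cost.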

Hypothesis generation vs validation
• Generating leads vs. testing
• Machine learning for:
• hypothesis generation / exploration
• streamlining of laborious manual tasks
• Validate!

Conclusions
• Big biomedical data is often not big, but we can make it bigger
• We don't need more platforms, we need platforms that work together
• Sometimes Big Data approaches are useful, sometimes not: choose wisely
• Trust but verify (especially machine learning)

Thanks
• Data Science Institute, ICL
• Fayzal Ghantiwala (Bloomberg)
• Nazanin Zounemat Kermani (ICL)
• Mansoor Saqi (EISBM / ICL)
• Jose Saray (EISBM)
• eTRIKS consortium
• U-BIOPRED consortium
