Big Biomedical Data is a Lie

Taming large datasets for translational research
Paul Agapow

Data Science Institute

Imperial College London

<p.agapow@imperial.ac.uk>

2018/1/31
Disclosure / About me
• Data Science Institute
(Imperial College London)
• Big rich biomedical datasets
for translational research &
precision medicine
• Novel & advanced
computation for research
• No actual or potential
conflict of interest in relation
to this presentation
“Nice training set. Where’s your data?”
– An analyst
Biomedical big data is often not big enough
• Average trial size on
ClinicalTrials.gov < 100
• Average #samples per
GEO dataset < 100
• Average GWAS cohort
size ~9000 (median
~2500)
• 1,064 ICU admissions for
flu in UK 2016/2017
season
• Curse of dimensionality
• Deep learning requires
“thousands” of samples
for training (at least p²?)
• GWAS needs 3K+ for
large effects, 10K or
more for small effects …
• Sub-populations will be
smaller
Platforms are a problem not a panacea
• Biomedical data lakes / warehouses aren’t working
• Each is an island unto itself
• Tools can’t understand data formats
• High demands on user (meaning, context)
• Poor standardisation / harmonisation tools (curation effort == analysis
effort)
• A world of distributed data
• A world of many computational idioms
• (Self) lock-in
Computers are not getting faster
• Data is embiggening
• Can’t rely on cheap
computation to get us out
of a hole
• Many HPC idioms, most
awkward (e.g. Map-Reduce)
• Database schemas struggle
at scale
What if every gene affects every other gene?
• Pritchard’s omnigenics
(2017):
• Kevin Bacon effect
• Implicated genes are a
few drivers and an
enormous number of
“related” loci
• How do we pick the
“important” genes?
Statisticians hate us
• P-hacking
• Garden of forking paths
• Reversion to mean
• Multiple hypothesis testing
• False discovery
• P-values
• Which method is best?
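To make the multiple-testing and false-discovery bullets concrete, here is a minimal sketch (not from the talk) that compares three ways of calling "significant" results from the same list of p-values: no correction, Bonferroni, and the Benjamini-Hochberg procedure. The p-values are invented for illustration, and the function names are hypothetical.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject only where p <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """Reject the k smallest p-values, where k is the largest rank (1-based)
    with p_(k) <= k / m * alpha (controls the false discovery rate)."""
    m = len(p_values)
    # Indices of the p-values in ascending order of p
    order = sorted(range(m), key=lambda i: p_values[i])
    max_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_k = rank
    rejected = [False] * m
    for idx in order[:max_k]:
        rejected[idx] = True
    return rejected


if __name__ == "__main__":
    # Invented p-values from ten hypothetical tests
    p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
    print("uncorrected (p <= 0.05):", sum(x <= 0.05 for x in p))
    print("Bonferroni:", sum(bonferroni(p)))
    print("Benjamini-Hochberg:", sum(benjamini_hochberg(p)))
```

On these made-up numbers, five tests pass the naive 0.05 threshold, two survive Benjamini-Hochberg and one survives Bonferroni, which is the "which method is best?" problem in miniature.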
In summary
• Data isn’t big (enough)
• Platforms are a problem
• Computation isn’t saving us
• Diseases are complicated
• We don’t know what we’re doing
Big biomedical data is a lie
Solutions Responses
Allow bigger datasets
• “Allow” reuse & combining
not “build”
• Assemble datasets
according to standards
(CDISC, EDAM, HPO)
• Poor tools but getting
better: tmtk / Arborist, eHS
• Issue of trust
[Poster: “From Excel to tranSMART in five simple steps”; most of the step text is truncated in the source]
tmtk: Python library
• Import: start the import wizard to create a study based on your study data.
• Validate: let the toolkit check the tranSMART-specific requirements.
• Edit, Save, Load: (step text truncated)
• Send to the Arborist web application for easy collaboration!
The Arborist: visual editor
Collaborate on data modelling with non-technical data experts in the secure Arborist web application.
• Restructure the tranSMART tree with drag and drop
• Rename variables and values
• Add and edit metadata for any tree node
• Work with both low and high dimensional data
Try it at http://arborist-test-trait.thehyve.net/demo.
Code at https://github.com/thehyve/arborist under GPL v3 license.
tmtk notable python commands
The main object in the tmtk workflow is the Study. It provides an API for modifying and […]
eTRIKS project
• Via IMI: Europe’s largest public-private initiative
• Data intensive translational research
• Sharing data (standards, starter kit)
• Open knowledge platform
• Sustainable service
Example: U-BIOPRED
• Unbiased BIOmarkers in PREDiction
of respiratory disease outcomes
• 900+ patients, 16 clinical centres +
other studies combined via
standards
• Outputs:
• Common tranSMART db
• 40+ academic publications
• Subtyping of asthmatics
Use your data better
• Pre-training (data without labels)
• Initial training with mediocre data
• Adapt
• Transfer learning (labels / output changes)
• Domain adaptation (data / input changes)
• Don’t use deep learning
Example: text extraction
• Aim: extract biological relationships from publications to
build asthma knowledge base
• Using BEL statements
• Domain expert time is prohibitive
• Use previous efforts as training
Example: text classification for systematic reviews
• Aim: find similar or related publications within corpus
• Actual aim: find which method of text classification
is “best” (Validation)
• Data: 15 Drug Control Reviews & Neuropathic Pain
dataset
• Classify with random forest, naive Bayes, SVM & CNNs
• Which has best recall?
When you don’t know what to use, use SVMs
Conclusion
Dataset                     WSS    Classifier
ACE Inhibitors              0.26   SVM
ADHD                        0.35   MNB
Antihistamines              0.19   MNB
Atypical Antipsychotics     0.12   SVM
Beta Blockers               0.13   SVM
CCB                         0.21   SVM
Estrogen                    0.25   SVM
Neuropathic Pain            0.61   CNN
NSAIDS                      0.14   SVM
Opioids                     0.23   SVM
Oral Hypoglycemics          0.21   SVM
PPI                         0.17   SVM
Skeletal Muscle Relaxants   0.21   SVM
Statins                     0.19   SVM
Triptans                    0.22   SVM
Urinary Incontinence        0.25   SVM
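The slides do not expand "WSS". Assuming it is the usual systematic-review metric, Work Saved over Sampling, a rough sketch of how it could be computed from a classifier's ranking follows. The 95% recall level, the function name and the toy data are assumptions, not taken from the talk.

```python
import math


def wss_at_recall(labels, scores, recall=0.95):
    """Work Saved over Sampling at a target recall.

    labels: 1 for a relevant document, 0 otherwise.
    scores: classifier scores, higher meaning more likely relevant.
    WSS = (documents left unread) / N - (1 - recall), where reading
    proceeds in score order until the target recall is reached.
    """
    n = len(labels)
    total_relevant = sum(labels)
    if total_relevant == 0:
        raise ValueError("need at least one relevant document")
    # Number of relevant documents that must be found to hit the target recall
    needed = math.ceil(recall * total_relevant)
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    found = 0
    screened = 0
    for _, label in ranked:
        screened += 1
        found += label
        if found >= needed:
            break
    unread = n - screened  # true negatives + false negatives
    return unread / n - (1.0 - recall)


if __name__ == "__main__":
    # Toy corpus: 10 documents, 2 relevant, ranked 1st and 3rd by the classifier
    labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
    scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0]
    print(round(wss_at_recall(labels, scores), 2))
```

In the toy case the reviewer stops after three documents, leaving seven of ten unread, so WSS is about 0.65. A ranking no better than reading everything gives a value at or below zero.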
Not platforms but meta-platforms
• The monolithic platform is dead
• We live in a world of
distributed data
• Avoid lock-in
• Don’t try to do everything
• Interoperability
• Allow different computational
idioms
tranSMART redevelopment
• eTRIKS enhancements
• i2b2 merger
• Next-generation tranSMART
• Major refactoring & performance fixes
• Additional tools & visualisation
• Component architecture
• Just a warehouse with API
Better HPC idioms
• Spark
• Like Map-Reduce, but doesn’t
persist to disk between steps
• Better for iterative
processing
• Does less violence to
problem
• Graphs & ML
Example: Spark for clustering
• Subtyping / stratification
• Popular methods are
computationally prohibitive
on rich data
• (Also ground truth unclear)
• “Sparkify”, compare, validate
on asthma cohort
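As an illustration of why an iterative clustering method fits this idiom, here is a pure-Python sketch of k-means written as repeated map and reduce steps over data that stays in memory between iterations. It is not Spark code and not the method used on the asthma cohort; the function names and toy points are hypothetical. In Spark, `map` and `reduceByKey` on a cached RDD would play the roles of `assign` and `combine`.

```python
def assign(point, centroids):
    """Map step: emit (index of nearest centroid, (point, 1))."""
    distances = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return distances.index(min(distances)), (point, 1)


def combine(a, b):
    """Reduce step: add the coordinate sums and counts for one cluster."""
    (sum_a, n_a), (sum_b, n_b) = a, b
    return tuple(x + y for x, y in zip(sum_a, sum_b)), n_a + n_b


def kmeans(points, centroids, iterations=10):
    # The data is loaded once and reused on every pass; classic Map-Reduce
    # would write results out and read them back between steps.
    points = list(points)
    for _ in range(iterations):
        totals = {}
        for key, value in map(lambda p: assign(p, centroids), points):
            totals[key] = combine(totals[key], value) if key in totals else value
        # New centroid = mean of assigned points; empty clusters keep the old one
        centroids = [
            tuple(x / totals[k][1] for x in totals[k][0]) if k in totals else c
            for k, c in enumerate(centroids)
        ]
    return centroids


if __name__ == "__main__":
    pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    print(kmeans(pts, [(0.0, 0.0), (10.0, 10.0)]))
```

Each iteration needs the full dataset again, which is why keeping it in memory rather than persisting between steps matters for this kind of workload.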
Hypothesis generation vs validation
• Generating leads vs.
testing
• Machine learning for:
• hypothesis generation
/ exploration
• streamlining of
laborious manual
tasks
• Validate!
Conclusions
• Big biomedical data is often not big, but we can make it
bigger
• We don't need more platforms, we need platforms that
work together
• Sometimes Big Data approaches are useful, sometimes
not: choose wisely
• Trust but verify (especially machine learning)
Thanks
• Data Science Institute, ICL
• Fayzal Ghantiwala (Bloomberg)
• Nazanin Zounemat Kermani (ICL)
• Mansoor Saqi (EISBM / ICL)
• Jose Saray (EISBM)
• eTRIKS consortium
• U-BIOPRED consortium
