Big Data & Machine Learning for
Clinical Data
Paul Agapow <p.agapow@imperial.ac.uk>
Data Science Institute, Imperial College London
 Biomedical science is now data
science
 I was a biochemist, immunologist,
and then an infectious disease
bioinformatician
 I’m now a “biomedical data
scientist”
 I will be a Health Informatics
Director at AstraZeneca
About me & these lectures
WikiMedia Commons
 We increasingly use & need:
 Lots of complex data
 Real world evidence (outside RCTs)
 Computational tools
 Statistical analysis
 Complex interactions
 Precision medicine: prediction &
(sub)typing
 Also:
 Cheap
 Successful in other domains
 But lots of hype and jargon
Biomedical science is now data science
WikiMedia Commons
 The world is increasingly
“datafied” – we make more and
bigger datasets
 Devices
 Routine collection
 Aggregation & integration
 Big Data is “too big” for
conventional approaches
Part 1: Big Data
WikiMedia Commons
 “Quantity has a quality of its
own”
 Often free
 Real
 Rich, deep, interactions
 Needed for ML and other
assumption-light approaches
Why Big Data?
By Ender005 - Own work, CC BY-SA 4.0,
https://commons.wikimedia.org/w/index.php?curid=49888192
 Many diseases with the same clinical presentation have different
molecular phenotypes
 Several overlapping terms
 stratified: separate patients into groups for treatment
 precision:
 tailor treatment to individual
 improved targeted therapies with fewer side effects
 “Right medication, right dose, right patient, right time, right route”
 Also personalised, P4 …
 E.g. asthma
Why Big Data? Precision medicine
 Volume
 Velocity
 Variety
 Veracity
 Value
The 3 / 4 / 5 Vs of Big Data
By MuhammadAbuHijleh - Own work, CC BY-SA 4.0,
https://commons.wikimedia.org/w/index.php?curid=46431834
 Limits labile to technological
progress
 Memory
 Compute
 Data schema
 Solutions: distributed & parallel
computation, new high-end
databases
The problem with volume: tools & platforms
WikiMedia Commons
 Multiple hypothesis testing
and false discovery
 Bias: a sample is not the
population
 The Past is not the Present
 Observation without
understanding
 The curse of dimensionality
 Privacy
 Some ML-specific issues
The problem with volume: methodology
From KDNuggets
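Multiple hypothesis testing at Big Data scale is usually handled with false-discovery-rate control. A minimal sketch of the Benjamini-Hochberg procedure (function name and toy p-values are illustrative, not from the lecture):

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg: indices of hypotheses rejected at FDR level alpha."""
    m = len(p_values)
    # Sort p-values ascending, remembering their original positions.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest rank k whose sorted p-value is under the line (k / m) * alpha.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            k_max = rank
    return sorted(order[:k_max])

# Ten tests, two genuinely small p-values: only those survive at FDR 0.05.
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(pvals))  # → [0, 1]
```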
 Many, many types of data
 How do we use multiple types?
 Which type do we use?
 Disease is systemic
 Interactions
 Evidence
 Solutions: integrated analysis,
independent analysis with
validation
The problem with variety
Wu, Sanin, Wang (2016) Clinical Applications and Systems
Biomedicine
 Much biodata is uncertain
 Noise
 Mistakes
 People lie
 A sample is not a population
 Incompatible systems
 Most analyses are not reproducible
 Solutions: imputation, standards,
cross-validation etc.
The problem with veracity
By Khaydock - Own work, CC BY-SA 3.0,
https://commons.wikimedia.org/w/index.php?curid=25102900
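Of the veracity fixes listed above, imputation is the easiest to sketch. A toy mean-imputation (name and data are illustrative; real pipelines use more careful, model-based methods):

```python
def impute_mean(column):
    """Replace missing values (None) with the mean of the observed values."""
    observed = [v for v in column if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in column]

# A lab measurement missing for two patients: observed mean is 6.0.
print(impute_mean([5.0, None, 7.0, None, 6.0]))  # → [5.0, 6.0, 7.0, 6.0, 6.0]
```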
 How do we
 Re-use data
 Compare data
 Store data from multiple sources
 Even know what data is
 FAIR, OHDSI / OMOP, HPO
 Even just metadata helps for
cataloguing
 But: multiple & incomplete
standards, translation, complexity
Solution: Standards & ontologies
WikiMedia Commons
 Much data cannot leave its
home institution
 Hospitals
 Registries
 Insurance companies
 Governance is hard & slow
 So take the analysis to the data
 Data looks the same but may
be internally different
Solution: Federated analysis
International Collaboration for Autism Registry Epidemiology
 In a vast sea of biodata, how do you
discover anything? How do you avoid
cherry-picking?
 Solutions:
 Distinguish discovery from
exploration
 Non-parametric methods (e.g.
machine learning)
 Some problems don’t have a single
solution but many (e.g. prediction)
The problem with it all: discoverability
EnterpriseKnowledge.com
 Write analyses as recipes
 Snakemake, Nextflow, Flowr
 Use recreatable computational
systems
 Docker
 “Your biggest collaborator is
you, six months ago”
 But: it’s work
Solution: Reproducibility
From RevolutionR
 Big Data is “too big” for current conventional tools & practices
 But it’s ideal for solving many biomedical problems
 There are problems with valid discovery and just handling the data
 Standards, distributed databases and federated analysis can help
Summary: Big Data
 “a field of Artificial Intelligence”
 “(the science of) getting computers to learn and act like humans do”
 “getting computers to act without being explicitly programmed”
 “computer systems that automatically improve with experience”
 “neural networks”
 “using statistical techniques to give computer systems the ability to
learn”
Part 2: Machine Learning
In practice:
 broadly-defined set of
algorithms that recognise &
generalise patterns in data
 “non-parametric” or
assumption-light
 may require training over
initial dataset
What is Machine Learning?
By Chire - Own work, Public Domain,
https://commons.wikimedia.org/w/index.php?curid=11711077
 Enough data
 Enough compute
 Technical progress
 Need 'good enough'
solutions
 Prediction & forecasting
 Categorization
 Pattern recognition
 Early, startling success
Why now?
Ray Kurzweil The Singularity is Near
How is ML different to stats?

              Statistical      Machine learning
Assumptions   strong           weak
Data          small            large
Optimize by   fitting          training
Solutions     “the best”       “good enough”
Hypothesis    proof            exploration
Test          p-values etc.    validation
In practice:
 a field of scientific research
 machine learning
 neural networks
 deep learning
 more of an objective than a methodology
 computational systems that duplicate / emulate / replace human effort
What is Artificial Intelligence?
• Many methods
• Broadly split into:
• Unsupervised: finds structure within data
• e.g. (most) clustering, self-organised maps, principal component
analysis
• Supervised: trained using labelled examples
• e.g. regression, decision trees, naive Bayes, neural networks
• Categories can blur
• e.g. k-means, nearest neighbour?
• Which is better?
What are ML methods?
• (Train a model from data)
• This model encapsulates or generalizes the data
• (Validate the model against test data)
• This model transforms features into labels
• Continuous outputs (e.g. real numbers) are regressions
• Discrete outputs (e.g. categories) are classifications
ML terms & process
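The train/validate loop above can be sketched with a toy 1-nearest-neighbour classifier (all names and data here are illustrative):

```python
import math

def predict_1nn(train, labels, x):
    """Classify x with the label of its nearest training example (1-NN)."""
    best = min(train, key=lambda t: math.dist(t, x))
    return labels[best]

# Toy feature vectors for two well-separated groups.
train = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (5.2, 4.9)]
labels = {train[0]: "healthy", train[1]: "healthy",
          train[2]: "disease", train[3]: "disease"}

# Validation: score the model on held-out examples with known labels.
held_out = {(0.1, 0.2): "healthy", (4.8, 5.0): "disease"}
accuracy = sum(predict_1nn(train, labels, x) == y
               for x, y in held_out.items()) / len(held_out)
print(accuracy)  # → 1.0
```

The discrete output makes this a classification; predicting a real number instead would make it a regression.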
• Take gene expression profiles from patients and cluster to:
• See genes with similar expression profiles
• Group similar patients
• Train a model on radiographs with tumours labelled, use to diagnose
unlabelled images
• Find patients with similar symptoms & signs (computational
phenotypes) in EHRs
• Train on histories of patients to forecast their future condition
• Find out how terms in a medical corpus relate to each other
Examples of ML
It’s everywhere
Unsupervised learning: clustering
 What does ‘similar’ mean? How
do we measure it?
 Which features & how weighted?
 Noise & overlapping clusters
 Non-numeric, non-ordered data
 What shapes can clusters be?
 How many clusters? When do we
stop?
 …
Clustering isn’t simple
By Chire - Own work, CC BY-SA 3.0,
https://commons.wikimedia.org/w/index.php?curid=17085331
Varies but:
 Start with record-feature matrix
 Normalise data
 (“Supervised”: select number of
clusters)
 Run algorithm
 Validate
Clustering process
WikiMedia Commons
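The process above can be sketched as min-max normalisation followed by a plain k-means (naive “first k points” initialisation; names and data are illustrative):

```python
import math

def minmax_normalise(points):
    """Scale every feature to [0, 1] so no single feature dominates distances."""
    dims = range(len(points[0]))
    lo = [min(p[d] for p in points) for d in dims]
    hi = [max(p[d] for p in points) for d in dims]
    return [tuple((p[d] - lo[d]) / ((hi[d] - lo[d]) or 1) for d in dims)
            for p in points]

def kmeans(points, k, iters=50):
    """Plain k-means: assign points to nearest centre, then move the centres."""
    centres = points[:k]  # naive initialisation, fine for a toy example
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centres[i]))
            clusters[nearest].append(p)
        centres = [tuple(sum(col) / len(cl) for col in zip(*cl)) if cl else centres[i]
                   for i, cl in enumerate(clusters)]
    return clusters

# Two features on very different scales (e.g. a count vs a lab value):
# without normalisation the second feature would dominate the distances.
records = [(1, 100), (2, 110), (1.5, 105), (9, 900), (10, 950), (9.5, 925)]
clusters = kmeans(minmax_normalise(records), k=2)
print(sorted(len(c) for c in clusters))  # → [3, 3]
```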
How not to do it
 A cluster partitioning is a hypothesis
 How do we assess? Validate:
 External: compare against external label or data
 e.g. accuracy, entropy
 Internal: goodness of clustering
 e.g. sum squared errors, cluster cohesion & separation,
silhouette
 Relative: against another clustering scheme
 e.g. is this better with 3 or 4 clusters
Validating clusters
Average over each point:
1. Calculate the average distance to all
other members of its cluster, a
2. For each other cluster, calculate the
average distance to every member.
The minimum of these is b
3. The silhouette width is (b−a) /
max(a,b), the higher the better
Validating clusters: silhouette width
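The three steps above can be coded directly (a toy sketch; it assumes every cluster has at least two members, and the names are illustrative):

```python
import math

def silhouette_widths(clusters):
    """Silhouette width of every point; clusters given as lists of points."""
    widths = []
    for ci, cluster in enumerate(clusters):
        for pi, p in enumerate(cluster):
            # a: average distance to the other members of p's own cluster.
            others = [q for qi, q in enumerate(cluster) if qi != pi]
            a = sum(math.dist(p, q) for q in others) / len(others)
            # b: smallest average distance to the members of any other cluster.
            b = min(sum(math.dist(p, q) for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            widths.append((b - a) / max(a, b))
    return widths

# Two tight, well-separated clusters: every width is close to +1.
print(silhouette_widths([[(0, 0), (0, 1)], [(10, 10), (10, 11)]]))
```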
What if there are sub-clusters or
structure?
• Use hierarchical clustering
• Use homogeneity or
completeness metrics to
compare
Nesting & hierarchies
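Hierarchical (agglomerative) clustering can be sketched with single linkage; this brute-force version is O(n³) and purely illustrative:

```python
import math

def cluster_distance(c1, c2):
    """Single linkage: distance between the closest pair of members."""
    return min(math.dist(a, b) for a in c1 for b in c2)

def agglomerate(points):
    """Repeatedly merge the two closest clusters; return the merge order."""
    clusters = [[p] for p in points]
    merges = []
    while len(clusters) > 1:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: cluster_distance(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return merges

# The two nearby points merge first; the outlier joins last.
print(agglomerate([(0, 0), (0, 1), (5, 5)]))
```

The merge order is exactly the nesting structure a dendrogram draws.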
• Complex, heterogeneous
disease
• Many attempts at clustering
• Use transcriptomic &
proteomic data
• Validate with clinical
• 4 clusters with characteristic
genes & clinical behaviour
Example: asthma
 a.k.a. deep learning, (artificial)
neural networks, “AI”
 A series of layers of nodes, each of
which transforms the previous layer.
 Training sets weights on
transformations
 Capable of learning representations
Supervised learning: deep networks
WikiMedia Commons
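A minimal sketch of layers transforming the layer below: a two-layer network with hand-set (not trained) weights that computes XOR, which no single layer can. The weights are illustrative, chosen by hand rather than learned:

```python
import math

def layer(inputs, weights, biases):
    """One dense layer: weighted sums squashed by a sigmoid non-linearity."""
    return [1 / (1 + math.exp(-(sum(w * x for w, x in zip(row, inputs)) + b)))
            for row, b in zip(weights, biases)]

# The hidden layer builds two intermediate features ("at least one input on",
# "not both inputs on"); the output layer simply ANDs them together.
def predict(x):
    hidden = layer(x, [[10, 10], [-10, -10]], [-5, 15])
    return layer(hidden, [[10, 10]], [-15])[0] > 0.5

print([predict(x) for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# → [False, True, True, False]
```

Training would set such weights automatically; the point here is only that the hidden layer is a learned representation.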
 There’s little information in an
individual pixel (gene, data point …)
 But individual data points make up
more complete entities
 Each layer takes the layer below and
creates higher-level entities
(representations) from it.
 The system “recognises” higher-
level features that can appear
anywhere in the data.
What’s a representation?
WikiMedia Commons
 Radiologists are overwhelmed
 Want to catch errors &
double-check
 Train ANN over medical
imagery with tumour labelled
 Accuracy similar to humans
Example: diagnosis from medical imagery
From Nvidia
• The model is right but learns
the wrong thing (from our
point of view)
• Solutions:
• Interpreting models
• Better (more examined) data
Problem: useless solutions
Ribeiro et al. (2016) Why Should I Trust You?
 Reversing the model & asking “why”
 What features are important
 Mechanistic insight
 But many ML models are tangled & horribly complex
 And ML community often uninterested
 Solutions:
 Choose an interpretable model
 Software that explores feature space (LIME, Lift, IML)
Problem: interpretability
• Bias (systematic error) vs. Variance
(random error)
• Want a model that captures the
regularities in training data AND
generalizes to unseen data.
• Doing both perfectly is impossible (the bias-variance trade-off)
• Solutions:
• Use a variety of data
• Feature selection
• Regularization
Problem: how do models get it wrong?
From KDNuggets
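Regularization can be illustrated with a one-dimensional ridge estimate, which has a closed form. A toy sketch (no intercept; data and names are illustrative):

```python
def ridge_slope(xs, ys, lam):
    """Closed-form ridge estimate of a no-intercept slope:
    minimises sum((y - w*x)^2) + lam * w^2."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs, ys = [1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8]  # roughly y = x plus noise
print(ridge_slope(xs, ys, 0.0))   # ordinary least squares fit
print(ridge_slope(xs, ys, 10.0))  # the penalty shrinks the slope toward zero
```

Shrinking coefficients trades a little bias for less variance, which is why it helps generalization.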
• What do we want from our ML
models?
• Power / accuracy
• Insight
• Error tolerance
• e.g. drug discovery vs drug safety
Problem: how good do models have to be?
After Harel
• Much (most) data has few positives
• Results in an imbalanced model
• Solutions:
• Over- and under-sampling
• Pre-train with poor data
• Ensemble methods
Problem: imbalanced data & lack of data
DataScience.com
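Random oversampling, the simplest of the balancing tricks above, can be sketched as follows (names and toy data are illustrative):

```python
import random

def oversample_minority(records, label_of, seed=0):
    """Duplicate minority-class records (with replacement) until classes balance."""
    by_label = {}
    for r in records:
        by_label.setdefault(label_of(r), []).append(r)
    target = max(len(group) for group in by_label.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_label.values():
        balanced.extend(group)
        balanced.extend(rng.choices(group, k=target - len(group)))
    return balanced

# Nine controls and one case: typical of rare clinical outcomes.
records = [("control", i) for i in range(9)] + [("case", 0)]
balanced = oversample_minority(records, label_of=lambda r: r[0])
print(sum(r[0] == "case" for r in balanced))  # → 9 (now balanced with controls)
```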
 Machine learning uses large amounts of data with few assumptions to
make models that generalise that data
 This is useful for situations where we don’t have an explicit model and
just need ‘a’ solution.
 But this means we need to examine our data and validate our
solutions
 A ‘bad’ solution can be useful, depending on what you want to
achieve.
Summary: Machine Learning