RISE OF THE MACHINES
THE USE OF MACHINE LEARNING
IN SIMS DATA ANALYSIS
Alex Henderson
University of Manchester
SurfaceSpectra Ltd
http://about.me/henderson.alex
Twitter: @AlexHenderson00
LOOK OUT!
THE MACHINES ARE COMING!!
NO NEED TO BE AFRAID…
Result!
QUESTIONS WE MIGHT ASK
• Exploratory data analysis
  • What can we find out about these samples?
  • No prior knowledge required
• Differences in chemical or physical state between groups of samples
  • Highlights spectral changes as function of group membership
  • Need to know which group each spectrum belongs to
• Trend analysis
  • Spectral changes as a function of dependent variable: time, concentration, disease state etc.
• Classification of samples
  • Spectral characteristics of groups
  • Prediction of unseen samples into known groups
DATA ANALYSIS APPROACHES
CLASSICAL ANALYSIS
Hypothesis driven
Assumes a distribution
of spectral response
MACHINE LEARNING
Data driven
Interrogation of data
leads to hypothesis
Validation always required when
building a predictive model
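A minimal sketch of what validating a predictive model can look like, assuming scikit-learn and k-fold cross-validation. The data here are synthetic stand-ins for spectra, and the array sizes, the shifted channel and the number of folds are illustrative choices, not values from the talk.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in for spectra: 60 samples x 20 channels, two groups.
X = rng.random((60, 20))
y = np.repeat([0, 1], 30)
X[y == 1, 5] += 1.0  # group 1 has extra intensity in one channel

# Each of the 5 folds is held out once while the model is trained on the rest.
model = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(model, X, y, cv=5)

print("accuracy per fold:", scores)
print("mean accuracy:", scores.mean())
```

The spread of the per-fold scores gives a rough idea of how much the result depends on which samples were used for training.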
CLASSICAL ANALYSIS
Assumes data obey the Central Limit Theorem
Data are Normally distributed
(Gaussian or bell-shaped curve)
Mathematically we can derive 4 ‘moments’
• Mean (average)
• Variance (square of the standard deviation)
• Skewness (asymmetry)
• Kurtosis (heaviness of the tails, often described as pointedness)
Other descriptions lead from these parameters
e.g. Student’s t-test, ANOVA, MANOVA
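As an illustration of the four moments, here is a small sketch using only the Python standard library. The function name `moments` and the example intensities are made up for this example. It uses the population form (dividing by n) and reports excess kurtosis, which is 0 for a Normal distribution.

```python
import math

def moments(values):
    """Return mean, variance, skewness and excess kurtosis of a sample."""
    n = len(values)
    mean = sum(values) / n
    # Central moments of order 2, 3 and 4 (population form, dividing by n).
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a Normal distribution
    return mean, m2, skewness, excess_kurtosis

# Hypothetical peak intensities, for illustration only.
intensities = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, variance, skewness, kurtosis = moments(intensities)
std_dev = math.sqrt(variance)  # the standard deviation is the root of the variance

print(mean, variance, std_dev, skewness, kurtosis)
```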
MACHINE LEARNING
No underlying assumption about how the data are distributed
Need to generate a description of data
HISTORY OF MVA
Classical multivariate analysis dates from 1930s
Harold Hotelling, Ronald Fisher, Herman Wold and others
• Principal components analysis (PCA)
• Partial least squares (PLS)
• Fisher’s discriminant analysis
• Linear discriminant analysis (LDA), etc.
Slide rule is King!
HISTORY OF MVA CONTINUED
Computers become generally available in the 1950s
Speed and reproducibility of calculations improve
New approaches are developed
Term ‘Machine Learning’ coined in 1959 (Arthur Samuel); widely adopted in the 1990s
as the name of a branch of computer science
BRAVE NEW WORLD!
Mechanical Turk plays chess in 1770
NEW?
Mechanical Turk CHEATS at chess in 1770
WELL…
WHAT FITS WHERE?
Question                       Classical Analysis        Machine Learning
Exploratory analysis           PCA                       K-means, HCA
Differences in state between   Discriminant analysis:    Random Forest Classification
groups of samples              LDA, QDA, CVA
Trend analysis                 Regression analysis,      Random Forest Regression
                               MCR
Classification of samples      LDA, QDA                  Random Forest Classification,
                                                         SVM
MACHINE LEARNING
CHEAT SHEET
RANDOM FOREST
Ensemble method
Collect lots of weak classifiers to build one strong one
Collection of Decision Trees
Computationally intensive
Developed 1995 – 2001
MATLAB: TreeBagger
Python: sklearn.ensemble.RandomForestClassifier (scikit-learn)
DECISION TREE
An expression of an algorithm
Weak classifier
Move through each step in turn
[Figure: example decision tree. Questions: Boss around? (Yes / No), Weather? (Sunny / Rainy / Windy), Pay day? (Recent / Long ago). Outcomes: Beer or Work]
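A decision tree can be written directly as nested conditions. The sketch below is one possible reading of the example tree: the questions and answers come from the slide, but which answer leads to which outcome is an assumption, since the figure layout is not recoverable from the text. The function name `decide` is hypothetical.

```python
def decide(boss_around, weather, pay_day):
    """Walk through the tree one question at a time and return the outcome.

    The branch-to-outcome mapping is an assumed arrangement of the slide's labels.
    """
    if boss_around:              # Boss around? Yes
        return "Work"
    if weather == "Sunny":       # Boss around? No, then ask about the weather
        return "Beer"
    if weather == "Rainy":       # only this branch asks the pay day question
        return "Beer" if pay_day == "Recent" else "Work"
    return "Work"                # Windy

print(decide(False, "Rainy", "Recent"))
```

Each call answers at most three questions, which is why a single tree is cheap to evaluate but also a weak classifier on its own.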
ENSEMBLE OF TREES
Randomly select subsets of variables (m/z intensities)
Train multiple (few hundred) decision trees, each with
different variables
Each tree does the
best it can with only
a portion of the data
Combine the votes of
all trees; the majority
class wins
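The idea of many trees, each seeing only some of the variables, can be sketched as follows. This is a simplified per-tree random subspace ensemble combined by an unweighted majority vote, not the exact algorithm inside scikit-learn's RandomForestClassifier, which draws bootstrap samples and picks a random subset of variables at every split. The data are synthetic, and the tree count and subset size are arbitrary choices.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Synthetic data: 30 variables, of which only columns 3 and 17 carry the class.
X = rng.random((200, 30))
y = (X[:, 3] + X[:, 17] > 1.0).astype(int)
X_train, y_train = X[:150], y[:150]
X_test, y_test = X[150:], y[150:]

n_trees, n_vars = 101, 10  # odd number of trees avoids tied votes
trees = []
for i in range(n_trees):
    # Each tree is trained on its own random subset of the variables.
    cols = rng.choice(X.shape[1], size=n_vars, replace=False)
    tree = DecisionTreeClassifier(random_state=i).fit(X_train[:, cols], y_train)
    trees.append((cols, tree))

# Every tree votes on every test sample; the majority class wins.
votes = np.array([tree.predict(X_test[:, cols]) for cols, tree in trees])
majority = (votes.mean(axis=0) > 0.5).astype(int)
accuracy = (majority == y_test).mean()

print("ensemble accuracy:", accuracy)
```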
VARIABLE ZOO
Ratio measurements taken
for many animals
For example:
• Length of leg
• Number of legs
• Number of wings
• Has horns/antlers?
• Length of neck
• Length of tail
Many examples of each
animal used
No tree gets all measurements
TOO EASY?
The giraffe is easily recognised by the number of legs and
length of neck. Oh, and it’s not a bird…
If any tree has those variables it would always identify the
animal as a giraffe. No need for anything else.
WRONG!
A Gerenuk is a four-legged mammal with a long neck.
The decision tree was good, but not good enough
It needed to be tamed by other trees
The Random Forest model prevents some trees from dominating the overall result
GERENUKS. WHO KNEW!?
CLASSIFICATION EXAMPLE
Polystyrene beads
Each bead coated with a different amino acid
SIMS image using Biotof
256 × 256 pixels
1000 amu, bin-summed to 1 amu
Data courtesy of Nick Winograd, Penn State University, USA, ~1999
TRAINING AND TEST REGIONS
Two regions on each bead and also the substrate selected
One region to train, the other to test
Each region 400 pixels
Square root taken
Vector normalised
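The two preprocessing steps can be sketched with NumPy as below. This assumes that "vector normalised" means scaling each spectrum to unit Euclidean length, and that the square root is applied first, in the order listed. The function name `preprocess` and the example counts are illustrative only.

```python
import numpy as np

def preprocess(spectra):
    """Square-root scale, then normalise each spectrum (row) to unit length."""
    spectra = np.sqrt(np.asarray(spectra, dtype=float))
    norms = np.linalg.norm(spectra, axis=1, keepdims=True)
    norms[norms == 0] = 1.0  # leave all-zero spectra unchanged
    return spectra / norms

# Three made-up spectra with three mass channels each.
counts = np.array([[9.0, 16.0, 0.0],
                   [0.0, 0.0, 0.0],
                   [1.0, 4.0, 4.0]])
processed = preprocess(counts)

print(processed)
```

The square root compresses the dynamic range of the ion counts, and the normalisation removes differences in total signal between pixels.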
RANDOM FOREST MODEL
Training data (3 × 400 spectra)
Each spectrum labelled: bead 1, bead 2, or substrate
Random Forest model constructed using
sklearn.ensemble.RandomForestClassifier (scikit-learn) in Python 3.5
300 trees selected.
Other parameters left as default
Code executed in PyCharm 2017.1.2
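A sketch of how such a model might be set up in scikit-learn. It is not the code used for the talk: the spectra are synthetic, only 100 channels are simulated to keep it quick, and `random_state` is fixed purely so that the example is repeatable. The 300 trees and the 3 × 400 labelled spectra follow the description above.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
labels = ["bead 1", "bead 2", "substrate"]
n_per_class, n_channels = 400, 100  # placeholder channel count

def fake_spectra(peak_channel):
    """Random spectra with one enhanced channel that marks the class."""
    spectra = rng.random((n_per_class, n_channels))
    spectra[:, peak_channel] += 2.0
    return spectra

# Separate training and test sets, 400 spectra per class in each.
X_train = np.vstack([fake_spectra(p) for p in (10, 40, 70)])
y_train = np.repeat(labels, n_per_class)
X_test = np.vstack([fake_spectra(p) for p in (10, 40, 70)])
y_test = np.repeat(labels, n_per_class)

# 300 trees; all other parameters at their defaults apart from the seed.
model = RandomForestClassifier(n_estimators=300, random_state=0)
model.fit(X_train, y_train)
accuracy = model.score(X_test, y_test)

print("test accuracy:", accuracy)
```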
CONFUSION MATRIX
Rows: Truth; columns: Prediction
              Bead 1   Bead 2   Substrate
Bead 1          97.5      1.5         1.0
Bead 2           3.0     96.0         1.0
Substrate        8.3      3.3        88.5
Values are percentages of the test spectra in each true class
Diagonal indicates > 88% of test spectra correctly classified in every class
Caution: Result should be verified by cross-validation or bootstrap
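The percentages in the confusion matrix can be turned back into counts to get a single overall figure. This sketch assumes that each row is a true class and that each class had 400 test spectra, as described for the test regions. Both assumptions are consistent with every row summing to about 100%.

```python
classes = ["Bead 1", "Bead 2", "Substrate"]

# Rows: true class; columns: predicted class (percent of that row's test spectra).
percent = [
    [97.5, 1.5, 1.0],
    [3.0, 96.0, 1.0],
    [8.3, 3.3, 88.5],
]
n_test = 400  # assumed number of test spectra per class

# Convert percentages to whole numbers of spectra.
counts = [[round(p * n_test / 100) for p in row] for row in percent]

correct = sum(counts[i][i] for i in range(len(classes)))
total = sum(sum(row) for row in counts)
overall_accuracy = correct / total

print(counts)
print("overall accuracy:", overall_accuracy)
```

Under these assumptions 1128 of the 1200 test spectra are classified correctly, so the overall accuracy is 94%, with the substrate as the weakest class.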
VARIABLE IMPORTANCE
Each decision tree uses different combination of mass values
Determine which m/z values were used by the most accurate trees
This is a measure of the importance of those variables: m/z
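In scikit-learn, a fitted Random Forest exposes a per-variable score through `feature_importances_`. Note that this score is impurity-based (how much each variable improves the splits, averaged over all trees), which is related to but not identical to the description above. The data below are synthetic, and treating the column index as a nominal mass only holds if the spectra are binned to 1 amu starting from a known mass.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic spectra: 3 classes, 50 channels; classes differ in channels 12 and 30.
X = rng.random((300, 50))
y = np.repeat([0, 1, 2], 100)
X[y == 1, 12] += 2.0
X[y == 2, 30] += 2.0

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# One importance value per channel; the values sum to 1.
importances = model.feature_importances_
top_channels = np.argsort(importances)[::-1][:2]

print("most important channels:", top_channels)
```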
PREDICTION OF ENTIRE IMAGE
Trained Random Forest model used to predict class for each pixel in original image
Render the image using result of classification
Total time 15 sec
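Predicting a whole image amounts to unfolding the data cube into one spectrum per pixel, predicting, and folding the labels back into image shape. The sketch below uses a small synthetic 32 × 32 image rather than the 256 × 256 data set; the image contents and training pixel selection are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
height, width, n_channels = 32, 32, 20  # small stand-in for a 256 x 256 image

# Synthetic image: the right half has extra intensity in channel 3.
image = rng.random((height, width, n_channels))
image[:, width // 2:, 3] += 2.0
truth = np.zeros((height, width), dtype=int)
truth[:, width // 2:] = 1

# Unfold the cube: one row per pixel, one column per mass channel.
pixels = image.reshape(-1, n_channels)

# Train on a random selection of labelled pixels.
train_idx = rng.choice(pixels.shape[0], size=200, replace=False)
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(pixels[train_idx], truth.reshape(-1)[train_idx])

# Predict every pixel, then fold the labels back into image shape.
class_map = model.predict(pixels).reshape(height, width)
agreement = (class_map == truth).mean()

print("fraction of pixels matching the known layout:", agreement)
```

The resulting `class_map` is a 2D array of class labels that can be rendered with any image display routine, for example matplotlib's `imshow`.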
FTIR – CANCER TISSUE
Epithelium 24.3%
Smooth muscle 50.7%
Lymphocytes 2.5%
Blood 0.2%
Concretion 0.0%
Fibrous stroma 12.3%
ECM 10.0%
Random forest classifier
Trained on exemplars identified with a pathologist
6 hour data acquisition
2.5 million spectra classified in < 60 sec
No staining required, or de-waxing of the sample
Proc. SPIE 9041, Medical Imaging 2014: Digital Pathology, 90410D doi:10.1117/12.2043290
SUMMARY
Machine Learning methods appear to be useful tools that we should consider for adoption
Unsupervised, supervised classification and supervised regression options are all available
Increased computer power may be required, but Moore’s Law is on our side here
IMAGE CREDITS
Mechagodzilla: http://list25.com/25-famous-fictional-robots-history/
Simply explained: http://geekandpoke.typepad.com/geekandpoke/2012/01/simply-explained-dp.html
Slide rule: https://commons.wikimedia.org/wiki/File:Slide_rule_scales_back.jpg
Brave new world: https://commons.wikimedia.org/wiki/File:IBM_150_Extra_Engineers_1951.jpg
Mechanical Turk 1: https://commons.wikimedia.org/wiki/File:Tuerkischer_schachspieler_windisch4.jpg
Mechanical Turk 2: https://commons.wikimedia.org/wiki/File:Tuerkischer_schachspieler_racknitz3.jpg
Scikit-learn cheat sheet: http://scikit-learn.org/stable/tutorial/machine_learning_map/
Forest: https://commons.wikimedia.org/wiki/File:Forest_Osaka_Japan.jpg
Animal silhouettes: https://clipartfest.com/categories/view/4c03d8ea8a4bc1ffca947c8b8dab48af25908403/african-animal-silhouettes-clipart.html
Giraffe: https://img.clipartfest.com/4294d3fb2739e14cec3845ef668dcdc0_life-size-african-animal-wall-african-animal-silhouettes-clipart_221-203.gif
Gerenuk 1: https://500px.com/cindy_wheeler
Gerenuk 2: http://wordwomanpartialellipsisofthesun.blogspot.co.uk/2015/01/gerenuk-giraffe-necked-gazelle-with.html
Many Gerenuks: http://wordwomanpartialellipsisofthesun.blogspot.co.uk/2015/01/gerenuk-giraffe-necked-gazelle-with.html
Beads: Nick Winograd, Penn State University, USA
Cancer tissue: Proc. SPIE 9041, Medical Imaging 2014: Digital Pathology, 90410D doi:10.1117/12.2043290
Tetsujin 28: http://goldenani.blogspot.co.uk/2013/01/1963-part-1-on-outside-looking-in.html
