Rise of the Machines: The Use of Machine Learning in SIMS Data Analysis

RISE OF THE MACHINES
THE USE OF MACHINE LEARNING
IN SIMS DATA ANALYSIS
Alex Henderson
University of Manchester
SurfaceSpectra Ltd
http://about.me/henderson.alex
Twitter: @AlexHenderson00

LOOK OUT!
THE MACHINES ARE COMING!!

NO NEED TO BE AFRAID…
Result!

QUESTIONS WE MIGHT ASK
• Exploratory data analysis
  • What can we find out about these samples?
  • No prior knowledge required
• Differences in chemical or physical state between groups of samples
  • Highlights spectral changes as function of group membership
  • Need to know which group each spectrum belongs to
• Trend analysis
  • Spectral changes as a function of dependent variable: time, concentration, disease state etc.
• Classification of samples
  • Spectral characteristics of groups
  • Prediction of unseen samples into known groups

DATA ANALYSIS APPROACHES
CLASSICAL ANALYSIS
Hypothesis driven
Assumes a distribution of spectral response
MACHINE LEARNING
Data driven
Interrogation of data leads to hypothesis
Validation always required when building a predictive model

CLASSICAL ANALYSIS
Assumes the data are Normally distributed (Gaussian or bell-shaped curve)
This is often justified by the Central Limit Theorem
Mathematically we can derive 4 ‘moments’
• Mean (average)
• Variance (square of the standard deviation)
• Skewness (asymmetry)
• Kurtosis (pointedness)
Other descriptions lead from these parameters,
e.g. Student’s t-test, ANOVA, MANOVA
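
As a rough illustration of the four descriptors listed on this slide, the sketch below computes them for a small set of made-up intensity values. The function name `four_moments` and the example numbers are assumptions for illustration only; the population (divide by n) form is used, and kurtosis is reported as excess kurtosis so that a Normal distribution gives zero.

```python
def four_moments(values):
    """Return mean, variance, skewness and excess kurtosis of a sample."""
    n = len(values)
    mean = sum(values) / n
    # Central moments in population form (dividing by n, not n - 1)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    skewness = m3 / m2 ** 1.5            # 0 for a symmetric distribution
    excess_kurtosis = m4 / m2 ** 2 - 3.0  # 0 for a Normal distribution
    return mean, m2, skewness, excess_kurtosis


# Made-up intensities for one m/z channel across eight spectra
intensities = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
mean, variance, skewness, kurtosis = four_moments(intensities)
print(mean, variance, skewness, kurtosis)
```

A positive skewness here reflects the longer tail towards high intensities; the standard deviation is simply the square root of the variance.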

MACHINE LEARNING
No underlying assumption
Need to generate a description of data

HISTORY OF MVA
Classical multivariate analysis dates from the 1930s onwards
Harold Hotelling, Ronald Fisher, Herman Wold and others
• Principal components analysis (PCA)
• Partial least squares (PLS)
• Fisher’s discriminant analysis
• Linear discriminant analysis (LDA), etc.
Slide rule is King!

HISTORY OF MVA CONTINUED
Computers become generally available in the 1950s
Calculations become faster and more reproducible
New approaches are developed
Term ‘Machine Learning’ coined in 1959 (Arthur Samuel), describing a branch of computer science

Mechanical Turk plays chess in 1770
NEW?

Mechanical Turk CHEATS at chess in 1770
WELL…

WHAT FITS WHERE?
Question | Classical Analysis | Machine Learning
Exploratory analysis | PCA | K-means, HCA
Differences in state between groups of samples | Discriminant analysis: LDA, QDA, CVA | Random Forest Classification
Trend analysis | Regression analysis, MCR | Random Forest Regression
Classification of samples | LDA, QDA | Random Forest Classification, SVM

RANDOM FOREST
Ensemble method
Collect lots of weak classifiers to build one strong one
Collection of Decision Trees
Computationally intensive
Developed 1995 – 2001
MATLAB: TreeBagger
Python: scikit-learn, sklearn.ensemble.RandomForestClassifier

DECISION TREE
An expression of an algorithm
Weak classifier
Move through each step in turn
[Figure: decision tree for choosing between Beer and Work. Questions: Boss around? (Yes / No), Weather? (Sunny / Rainy / Windy), Pay day? (Recent / Long ago). Outcomes: Beer or Work.]
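
A decision tree of this kind can be written as a chain of nested questions. The sketch below is only one plausible arrangement of the questions on the slide: the exact branch layout of the original figure is not recoverable here, so which answer leads to which outcome is an assumption, as is the function name `beer_or_work`.

```python
def beer_or_work(boss_around, weather, pay_day):
    """Walk the tree one question at a time and return 'Beer' or 'Work'.

    The branch layout is assumed for illustration; it is not taken
    from the original figure.
    """
    if boss_around == "Yes":
        return "Work"
    # Boss is not around: the weather decides next
    if weather == "Sunny":
        return "Beer"
    if weather == "Windy":
        return "Work"
    # Remaining case is "Rainy": the outcome depends on pay day
    if pay_day == "Recent":
        return "Beer"
    return "Work"


print(beer_or_work("No", "Rainy", "Recent"))
```

Each call follows a single path from the first question to one outcome, which is what makes an individual tree simple to read but weak as a classifier.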

ENSEMBLE OF TREES
Randomly select subsets of variables (m/z intensities)
Train multiple (few hundred) decision trees, each with different variables
Each tree does the best it can with only a portion of the data
Combine the predictions of all the trees (majority vote), so that no single tree decides the result
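
The idea of an ensemble built from random variable subsets can be sketched by hand with scikit-learn's `DecisionTreeClassifier`. This is a simplified illustration on synthetic data, not the implementation inside any library: here each tree is given one fixed random subset of variables and a bootstrap sample of spectra, and the trees are combined by a plain majority vote. (scikit-learn's own Random Forest instead draws a fresh random subset of variables at every split.) All sizes and names below are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n_vars = 50                      # stand-in for the number of m/z channels


def make_spectra(n_samples):
    """Synthetic two-class data; only the first 5 variables carry class information."""
    labels = rng.integers(0, 2, size=n_samples)
    spectra = rng.normal(size=(n_samples, n_vars))
    spectra[:, :5] += labels[:, None] * 2.0
    return spectra, labels


train_x, train_y = make_spectra(200)
test_x, test_y = make_spectra(100)

n_trees, vars_per_tree = 101, 7  # odd number of trees avoids tied votes
forest = []
for _ in range(n_trees):
    cols = rng.choice(n_vars, size=vars_per_tree, replace=False)  # random variable subset
    rows = rng.integers(0, len(train_y), size=len(train_y))       # bootstrap sample of spectra
    tree = DecisionTreeClassifier(random_state=0)
    tree.fit(train_x[rows][:, cols], train_y[rows])
    forest.append((cols, tree))


def predict(spectra):
    """Majority vote over all trees (two classes labelled 0 and 1)."""
    votes = np.array([tree.predict(spectra[:, cols]) for cols, tree in forest])
    return (votes.mean(axis=0) > 0.5).astype(int)


accuracy = (predict(test_x) == test_y).mean()
print(f"ensemble accuracy on unseen data: {accuracy:.2f}")
```

Many of the individual trees never see an informative variable and vote close to randomly; the vote across the whole ensemble is what produces a usable classifier.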

VARIABLE ZOO
Ratio measurements taken for many animals
For example:
• Length of leg
• Number of legs
• Number of wings
• Has horns/antlers?
• Length of neck
• Length of tail
Many examples of each animal used
No tree gets all measurements

TOO EASY?
The giraffe is easily recognised by the number of legs and
length of neck. Oh, and it’s not a bird…
If any tree has those variables it would always identify the
animal as a giraffe. No need for anything else.

WRONG!
A Gerenuk is a four-legged mammal with a long neck.
The decision tree was good, but not good enough
It needed to be tamed by other trees
The Random Forest model prevents some trees from dominating the overall result

CLASSIFICATION EXAMPLE
Polystyrene beads
Each bead coated with a different amino acid
SIMS image using Biotof
256 × 256 pixels
1000 amu, bin-summed to 1 amu
Data courtesy of Nick Winograd, Penn State University, USA, ~1999

TRAINING AND TEST REGIONS
Two regions on each bead and also the substrate selected
One region to train, the other to test
Each region 400 pixels
Square root taken
Vector normalised
[Figure: SIMS image with the three test regions marked "Test"]
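
The two preprocessing steps named on this slide can be sketched for a single pixel spectrum as follows. "Vector normalised" is assumed here to mean scaling to unit Euclidean (L2) length; the function name `preprocess` and the example counts are made up for illustration.

```python
import math


def preprocess(spectrum):
    """Square-root scale a spectrum, then normalise it to unit vector length."""
    rooted = [math.sqrt(count) for count in spectrum]   # ion counts are non-negative
    length = math.sqrt(sum(v * v for v in rooted))      # Euclidean (L2) norm, an assumption
    if length == 0:
        return rooted                                   # empty pixel: leave as zeros
    return [v / length for v in rooted]


pixel = [9.0, 16.0, 0.0]      # made-up ion counts at three m/z values
scaled = preprocess(pixel)    # square roots are 3, 4, 0; the vector length is 5
print(scaled)
```

The square root reduces the dominance of very intense peaks, and the normalisation removes differences in overall signal between pixels.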

RANDOM FOREST MODEL
Training data (3 × 400 spectra)
Each spectrum labelled: bead 1, bead 2, or substrate
Random Forest model constructed using scikit-learn's
RandomForestClassifier (sklearn.ensemble) in Python 3.5
300 trees selected.
Other parameters left as default
Code executed in PyCharm 2017.1.2
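
A minimal sketch of this kind of model set-up is shown below. The spectra are synthetic placeholders, not the bead data, and the number of mass channels is reduced to 200 to keep the example quick; `fake_regions` is a hypothetical helper. `random_state` is fixed only so the sketch is repeatable, which is a departure from the defaults described on the slide.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
class_names = ["bead 1", "bead 2", "substrate"]


def fake_regions(n_pixels=400, n_channels=200):
    """Return synthetic spectra and labels: one region of n_pixels per class."""
    spectra, labels = [], []
    for class_index, name in enumerate(class_names):
        block = rng.random((n_pixels, n_channels))
        # Give each class its own group of ten raised channels
        block[:, class_index * 10:(class_index + 1) * 10] += 1.0
        spectra.append(block)
        labels += [name] * n_pixels
    return np.vstack(spectra), np.array(labels)


train_spectra, train_labels = fake_regions()   # 3 x 400 training spectra
test_spectra, test_labels = fake_regions()     # separate regions held back for testing

# 300 trees; all other parameters left at their defaults (apart from random_state)
model = RandomForestClassifier(n_estimators=300, random_state=0)
model.fit(train_spectra, train_labels)

predicted = model.predict(test_spectra)
accuracy = (predicted == test_labels).mean()
print(f"fraction of test spectra correctly classified: {accuracy:.3f}")
```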

CONFUSION MATRIX
            Bead 1   Bead 2   Substrate
Bead 1       97.5      1.5       1.0
Bead 2        3.0     96.0       1.0
Substrate     8.3      3.3      88.5
Axes: Truth and Prediction
Values are percentages of test spectra; each row sums to approximately 100%
Every diagonal entry exceeds 88%, so > 88% of test spectra were correctly classified in every class
Caution: Result should be verified by cross-validation or bootstrap
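
A percentage confusion matrix of this form can be built by counting (truth, prediction) pairs and dividing each row by its total. The sketch below assumes rows are normalised by the true class, which is consistent with the rows of the table summing to about 100%; the labels are a tiny made-up example, not the bead results, and `confusion_percentages` is a hypothetical helper.

```python
from collections import Counter


def confusion_percentages(truth, prediction, classes):
    """Return {true_class: {predicted_class: percentage}}; each row sums to 100."""
    counts = Counter(zip(truth, prediction))
    table = {}
    for t in classes:
        row_total = sum(counts[(t, p)] for p in classes)  # assumes every class occurs in truth
        table[t] = {p: 100.0 * counts[(t, p)] / row_total for p in classes}
    return table


classes = ["Bead 1", "Bead 2", "Substrate"]
truth = ["Bead 1"] * 4 + ["Bead 2"] * 4 + ["Substrate"] * 4
prediction = ["Bead 1", "Bead 1", "Bead 1", "Bead 2",
              "Bead 2", "Bead 2", "Bead 2", "Bead 2",
              "Substrate", "Substrate", "Bead 1", "Substrate"]

table = confusion_percentages(truth, prediction, classes)
for t in classes:
    print(t, [round(table[t][p], 1) for p in classes])
```

The diagonal entries are the per-class percentages correctly classified; off-diagonal entries show which classes are being confused with which.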

VARIABLE IMPORTANCE
Each decision tree uses different combination of mass values
Determine which m/z values were used by the most accurate trees
This is a measure of the importance of those variables: m/z
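
In scikit-learn, a fitted `RandomForestClassifier` exposes a `feature_importances_` array with one value per variable. Note that this particular measure is impurity-based (how much each variable improves the splits, averaged over the trees), which is related to, but not the same as, the description on the slide. The data and the two "informative" channel indices below are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_spectra, n_channels = 300, 100
labels = rng.integers(0, 2, size=n_spectra)
spectra = rng.random((n_spectra, n_channels))

informative = [12, 57]                       # hypothetical channels that differ between classes
spectra[:, informative] += labels[:, None]   # raise those channels for class 1

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(spectra, labels)

importance = model.feature_importances_      # one value per channel; values sum to 1
top_channels = np.argsort(importance)[::-1][:2]
print("most important channels:", sorted(top_channels.tolist()))
```

Plotting `importance` against m/z gives a spectrum-like view of which masses drive the classification.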

PREDICTION OF ENTIRE IMAGE
Trained Random Forest model used to predict class for each pixel in original image
Render the image using result of classification
Total time 15 sec
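
Predicting a whole image amounts to unfolding it into one spectrum per row, predicting a class for every row, and folding the result back into the image shape. The sketch below uses a small synthetic 32 × 32 image in place of the 256 × 256 data; the image contents, sizes and training-pixel selection are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
height, width, n_channels = 32, 32, 50       # small stand-in for a 256 x 256 image

# Synthetic image: left half is class 0, right half is class 1
true_map = np.zeros((height, width), dtype=int)
true_map[:, width // 2:] = 1
image = rng.random((height, width, n_channels))
image[..., :5] += true_map[..., None]        # first five channels are brighter in class 1

# Train on a sparse grid of labelled pixels (every fourth pixel in each direction)
train_pixels = image[::4, ::4].reshape(-1, n_channels)
train_labels = true_map[::4, ::4].ravel()
model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(train_pixels, train_labels)

# Unfold to (pixels, channels), predict, then fold back to (height, width)
pixels = image.reshape(-1, n_channels)
class_map = model.predict(pixels).reshape(height, width)

agreement = (class_map == true_map).mean()
print(f"pixels matching the known layout: {agreement:.3f}")
```

The resulting `class_map` is an integer image that can be rendered with any image viewer, for example matplotlib's `imshow`.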

FTIR – CANCER TISSUE
Epithelium 24.3%
Smooth muscle 50.7%
Lymphocytes 2.5%
Blood 0.2%
Concretion 0.0%
Fibrous stroma 12.3%
ECM 10.0%
Random forest classifier
Trained on exemplars with pathologist
6 hour data acquisition
2.5 million spectra classified in < 60 sec
No staining required, or de-waxing of the sample
Proc. SPIE 9041, Medical Imaging 2014: Digital Pathology, 90410D doi:10.1117/12.2043290

SUMMARY
Machine Learning methods appear to be useful tools that we should consider for adoption
Unsupervised, supervised classification and supervised regression options are all available
Increased computer power may be required, but Moore’s Law is on our side here

IMAGE CREDITS
Mechagodzilla: http://list25.com/25-famous-fictional-robots-history/
Simply explained: http://geekandpoke.typepad.com/geekandpoke/2012/01/simply-explained-dp.html
Slide rule: https://commons.wikimedia.org/wiki/File:Slide_rule_scales_back.jpg
Brave new world: https://commons.wikimedia.org/wiki/File:IBM_150_Extra_Engineers_1951.jpg
Mechanical Turk 1: https://commons.wikimedia.org/wiki/File:Tuerkischer_schachspieler_windisch4.jpg
Mechanical Turk 2: https://commons.wikimedia.org/wiki/File:Tuerkischer_schachspieler_racknitz3.jpg
Scikit-learn cheat sheet: http://scikit-learn.org/stable/tutorial/machine_learning_map/
Forest: https://commons.wikimedia.org/wiki/File:Forest_Osaka_Japan.jpg
Animal silhouettes: https://clipartfest.com/categories/view/4c03d8ea8a4bc1ffca947c8b8dab48af25908403/african-animal-silhouettes-clipart.html
Giraffe: https://img.clipartfest.com/4294d3fb2739e14cec3845ef668dcdc0_life-size-african-animal-wall-african-animal-silhouettes-clipart_221-203.gif
Gerenuk 1: https://500px.com/cindy_wheeler
Gerenuk 2: http://wordwomanpartialellipsisofthesun.blogspot.co.uk/2015/01/gerenuk-giraffe-necked-gazelle-with.html
Many Gerenuks: http://wordwomanpartialellipsisofthesun.blogspot.co.uk/2015/01/gerenuk-giraffe-necked-gazelle-with.html
Beads: Nick Winograd, Penn State University, USA
Cancer tissue: Proc. SPIE 9041, Medical Imaging 2014: Digital Pathology, 90410D doi:10.1117/12.2043290
Tetsujin 28: http://goldenani.blogspot.co.uk/2013/01/1963-part-1-on-outside-looking-in.html
