Valencian Summer School in Machine Learning
3rd edition
September 14-15, 2017
BigML, Inc 2
Clusters
Finding Similarities
Poul Petersen
CIO, BigML, Inc
What is Clustering?
• An unsupervised learning technique
• No labels necessary
• Useful for finding similar instances
• Smart sampling/labelling
• Finds “self-similar” groups of instances
• Customer: groups with similar behavior
• Medical: patients with similar diagnostic measurements
• Defines each group by a “centroid”
• Geometric center of the group
• Represents the “average” member
• Number of centroids (k) can be specified or determined
Cluster Centroids
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thr   Sally     6788     sign  food     26339  51
Similar instances:
Same:
auth = pin
amount ~ $100
Different:
date: Mon != Wed
customer: Sally != Bob
account: 6788 != 3421
class: clothes != gas
zip: 26339 != 46140
Centroid:
date = Wed (2 out of 3)
customer = Bob
account = 3421
auth = pin
class = gas
zip = 46140
amount = $104
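The centroid on this slide can be reproduced with a small sketch. It assumes the three "similar" rows are the pin transactions with amounts near $100 (an inference from the listed centroid values, not something the slide states), and it takes the mean for the numeric field and the most common category for every other field. The function name `centroid` is a hypothetical helper.

```python
from collections import Counter
from statistics import mean

# Assumed "similar" rows: the three pin transactions with amount near $100.
similar_rows = [
    {"date": "Mon", "customer": "Bob", "account": "3421", "auth": "pin",
     "class": "clothes", "zip": "46140", "amount": 135},
    {"date": "Wed", "customer": "Sally", "account": "6788", "auth": "pin",
     "class": "gas", "zip": "26339", "amount": 94},
    {"date": "Wed", "customer": "Bob", "account": "3421", "auth": "pin",
     "class": "gas", "zip": "46140", "amount": 83},
]


def centroid(rows, numeric_fields):
    """Mean for numeric fields, most common value for categorical fields."""
    result = {}
    for field in rows[0]:
        values = [row[field] for row in rows]
        if field in numeric_fields:
            result[field] = mean(values)
        else:
            result[field] = Counter(values).most_common(1)[0][0]
    return result


c = centroid(similar_rows, numeric_fields={"amount"})
```

With these rows the sketch yields the centroid listed above (date Wed, customer Bob, class gas, amount 104).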
Use Cases
• Customer segmentation
• Which customers are similar?
• How many natural groups are there?
• Item discovery
• What other items are similar to this one?
• Similarity
• What other instances share a specific property?
• Recommender (almost)
• If you like this item, what other items might you like?
• Active learning
• Labelling unlabelled data efficiently
Customer Segmentation
GOAL: Cluster the users by usage
statistics. Identify clusters with a
higher percentage of high LTV users.
Since they have similar usage
patterns, the remaining users in
these clusters may be good
candidates for up-sell.
• Dataset of mobile game users.
• Data for each user consists of usage
statistics and an LTV based on in-game purchases
• Assumption: Usage correlates to LTV
[Figure: clusters labeled 0%, 3%, and 1%]
Similarity
GOAL: Cluster the loans by
application profile to rank loan
quality by percentage of trouble
loans in population
• Dataset of Lending Club Loans
• Mark any loan that is currently or has
ever been late as “trouble”
[Figure: clusters labeled 0%, 3%, 7%, and 1%]
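Ranking clusters by their share of "trouble" loans can be sketched as follows. The cluster ids and loan flags are made-up illustration data, and `trouble_rate_by_cluster` is a hypothetical helper, not a BigML call.

```python
from collections import defaultdict


def trouble_rate_by_cluster(assignments):
    """Return (cluster_id, trouble_rate) pairs sorted from best to worst.

    `assignments` is an iterable of (cluster_id, is_trouble) pairs.
    """
    totals = defaultdict(int)
    troubled = defaultdict(int)
    for cluster_id, is_trouble in assignments:
        totals[cluster_id] += 1
        troubled[cluster_id] += int(is_trouble)
    rates = {cid: troubled[cid] / totals[cid] for cid in totals}
    return sorted(rates.items(), key=lambda item: item[1])


# Hypothetical loans: (cluster the loan was assigned to, trouble flag).
loans = [
    ("A", False), ("A", False), ("A", False),
    ("B", True), ("B", False),
    ("C", True), ("C", True), ("C", False),
]
ranking = trouble_rate_by_cluster(loans)
```

The same counting applies to the customer segmentation case, with a high-LTV flag in place of the trouble flag.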
Active Learning
GOAL:
Rather than sample randomly, use clustering to group
patients by similarity and then test a sample from each
cluster to label the data.
• Dataset of diagnostic measurements
of 768 patients.
• Want to test each patient for
diabetes and label the dataset to
build a model but the test is
expensive*.
*For a more realistic example of high cost, imagine a dataset with a
billion transactions, each one needing to be labelled as fraud/not-fraud.
Or a million images which need to be labelled as cat/not-cat.
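The sampling idea can be sketched as below: once every instance has a cluster assignment, draw a few instances from each cluster for the expensive labelling step. The assignment dictionary is a stand-in (patient id modulo 4), and `sample_per_cluster` and the per-cluster budget of 5 are assumptions for illustration.

```python
import random
from collections import defaultdict


def sample_per_cluster(cluster_of, per_cluster, seed=0):
    """Pick up to `per_cluster` instance ids at random from every cluster."""
    rng = random.Random(seed)
    members = defaultdict(list)
    for instance_id, cluster_id in cluster_of.items():
        members[cluster_id].append(instance_id)
    return {
        cluster_id: rng.sample(ids, min(per_cluster, len(ids)))
        for cluster_id, ids in members.items()
    }


# Stand-in assignment of 768 patients to 4 clusters (not real clustering output).
cluster_of = {patient: patient % 4 for patient in range(768)}
to_test = sample_per_cluster(cluster_of, per_cluster=5)
```

Only the sampled patients are tested, so the number of tests is bounded by the number of clusters times the per-cluster budget.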
Item Discovery
GOAL: Cluster the whiskies by flavor
profile to discover whiskies that have
similar taste.
• Dataset of 86 whiskies
• Each whisky scored on a scale from
0 to 4 for each of 12 possible flavor
characteristics.
[Figure labels: Smoky, Fruity]
Clusters Demo #1
Human Expert
Cluster into 3 groups…
• Jesa used prior knowledge to select possible features that
separated the objects.
• “round”, “skinny”, “edges”, “hard”, etc
• Items were then clustered based on the chosen features
• Separation quality was then tested to ensure:
• met criteria of K=3
• groups were sufficiently “distant”
• no crossover
• Length/Width
• greater than 1 => “skinny”
• equal to 1 => “round”
• less than 1 => invert
• Number of Surfaces
• distinct surfaces require “edges” which have corners
• easier to count
Create features that capture these object differences
Clustering Features
Object   Length / Width  Num Surfaces
penny    1               3
dime     1               3
knob     1               4
eraser   2.75            6
box      1               6
block    1.6             6
screw    8               3
battery  5               3
key      4.25            3
bead     1               2
Plot by Features
[Figure: objects plotted by Length / Width and Num Surfaces; point labels: box, block, eraser; knob, penny, dime, bead; key, battery, screw]
K-Means Key Insight: We can find clusters using distances in n-dimensional feature space
K=3
K-Means: Find “best” (minimum distance) circles that include all points
K-Means Algorithm
K=3
Repeat until centroids stop moving
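The loop on these slides (assign each point to its nearest centroid, move each centroid to the mean of its members, repeat until nothing moves) can be sketched in plain Python. This is a minimal illustration of the standard k-means iteration, not BigML's implementation. The points are the Length / Width and Num Surfaces values from the Clustering Features table, and the three starting centroids (penny, eraser, screw) are an arbitrary choice made for this example.

```python
import math

names = ["penny", "dime", "knob", "eraser", "box",
         "block", "screw", "battery", "key", "bead"]
points = [(1, 3), (1, 3), (1, 4), (2.75, 6), (1, 6),
          (1.6, 6), (8, 3), (5, 3), (4.25, 3), (1, 2)]


def kmeans(points, centroids, max_iter=100):
    """Plain k-means with Euclidean distance and caller-supplied start points."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Assignment step: index of the nearest centroid for every point.
        labels = [
            min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:  # centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids


# Arbitrary starting centroids for this example: penny, eraser, screw.
labels, centroids = kmeans(points, [(1, 3), (2.75, 6), (8, 3)])
```

On this data the sketch settles on the groups {penny, dime, knob, bead}, {eraser, box, block}, and {screw, battery, key}; other starting points may converge to a different grouping.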
Features Matter
[Figure labels: Metal, Wood, Other]
Convergence
Convergence guaranteed but not necessarily unique
Starting points important (K++)
Starting Points
• Random points or instances in n-dimensional space
• Choose points “farthest” away from each other
• but this is sensitive to outliers
• k++ (k-means++)
• the first center is chosen randomly from instances
• each subsequent center is chosen from the remaining
instances with probability proportional to its squared distance
from the point's closest existing cluster center
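The k++ seeding described above can be sketched as follows. The function name `kmeans_pp_init` and the fixed seed are illustrative assumptions. The points reuse the object features from the clustering example.

```python
import math
import random

points = [(1, 3), (1, 3), (1, 4), (2.75, 6), (1, 6),
          (1.6, 6), (8, 3), (5, 3), (4.25, 3), (1, 2)]


def kmeans_pp_init(points, k, seed=0):
    """Choose k starting centers with k-means++ style weighting."""
    rng = random.Random(seed)
    # The first center is chosen uniformly at random from the instances.
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Weight each point by the squared distance to its closest center;
        # points that coincide with a center get weight 0 and are never picked.
        weights = [min(math.dist(p, c) ** 2 for c in centers) for p in points]
        centers.append(rng.choices(points, weights=weights, k=1)[0])
    return centers


centers = kmeans_pp_init(points, k=3, seed=1)
```

Because far-away points are only more likely (not certain) to be chosen, a single outlier influences the seeding less than a strict "farthest point" rule would. The sketch raises an error if `k` exceeds the number of distinct points, since all weights are then zero.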
Scaling Matters
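[Figure: houses plotted by price and number of bedrooms; the distance along price is d = 160,000 and the distance along number of bedrooms is d = 1]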
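One common way to keep a large-valued field such as price from dominating the distance is to standardize every field before clustering. The sketch below uses z-scores; the house values are invented for illustration, and standardization is one scaling option among several rather than the method the slides prescribe.

```python
import math
from statistics import mean, pstdev


def standardize(rows):
    """Rescale every column to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    stats = [(mean(col), pstdev(col)) for col in columns]
    return [
        tuple((v - m) / s if s else 0.0 for v, (m, s) in zip(row, stats))
        for row in rows
    ]


# Hypothetical houses: (price, number of bedrooms).
houses = [(200_000, 2), (360_000, 3), (280_000, 4), (520_000, 3)]
scaled = standardize(houses)

raw_distance = math.dist(houses[0], houses[1])     # dominated by price
scaled_distance = math.dist(scaled[0], scaled[1])  # both fields contribute
```

Before scaling, the first two houses are about 160,000 apart and the one-bedroom difference is invisible; after scaling, both fields contribute on a comparable footing.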
Other Tricks
• What is the distance to a “missing value”?
• What is the distance between categorical values?
• What is the distance between text features?
• Does it have to be Euclidean distance?
• Unknown “K”?
Distance to Missing?
• Nonsense! Try replacing missing values with:
• Maximum
• Mean
• Median
• Minimum
• Zero
• Ignore instances with missing values
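The replacement options listed above can be sketched for a single numeric field, with `None` standing in for a missing value. The names `fill_missing` and `drop_missing` are hypothetical helpers for this illustration.

```python
from statistics import mean, median

# One fill rule per option on the slide.
STRATEGIES = {
    "maximum": max,
    "mean": mean,
    "median": median,
    "minimum": min,
    "zero": lambda values: 0,
}


def fill_missing(values, strategy="mean"):
    """Replace None entries using a statistic of the present values."""
    present = [v for v in values if v is not None]
    fill = STRATEGIES[strategy](present)
    return [fill if v is None else v for v in values]


def drop_missing(rows):
    """Alternative: ignore instances that have any missing value."""
    return [row for row in rows if None not in row]


amounts = [1, None, 3]
filled_mean = fill_missing(amounts, "mean")
filled_zero = fill_missing(amounts, "zero")
complete_rows = drop_missing([(1, 2), (None, 3)])
```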
Distance to Categorical?
• Define special distance function: For two instances $x$ and $y$ and the categorical field $a$:
• if $x_a = y_a$ then
$\mathrm{distance}(x, y) = 0$ (or field scaling value)
else
$\mathrm{distance}(x, y) = 1$
Approach: similar to “k-prototypes”
animal  favorite toy  toy color
cat     ball          red
cat     ball          green
d=0     d=0           d=1        D = 1

cat     laser         red
dog     squeaky       red
d=1     d=1           d=0        D = √2

Then compute Euclidean distance between vectors
Note: the centroid is assigned the most common
category of the member instances
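The two worked distances above can be checked with a short sketch: each categorical field contributes 0 when the values match and 1 when they differ, and the per-field values are combined with the Euclidean norm. Field scaling is left out here, and `categorical_distance` is a hypothetical helper name.

```python
import math


def categorical_distance(x, y):
    """Euclidean norm of per-field mismatch indicators (0 = same, 1 = different)."""
    per_field = [0 if a == b else 1 for a, b in zip(x, y)]
    return math.sqrt(sum(d * d for d in per_field))


d_first_pair = categorical_distance(("cat", "ball", "red"), ("cat", "ball", "green"))
d_second_pair = categorical_distance(("cat", "laser", "red"), ("dog", "squeaky", "red"))
```

The first pair differs in one field and the second in two, which gives the D = 1 and D = √2 shown on the slide.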
Text Vectors
[Figure: Cosine Similarity axis marked 1, 0, -1]
               "hippo"  "safari"  "zebra"  …
Text Field #1  1        0         1        …
Text Field #2  1        1         0        …
               0        1         1        …
Features (thousands)
• Cosine Similarity
• cos() between two vectors
• 1 if collinear, 0 if orthogonal
• only positive vectors: 0 ≤ CS ≤ 1
• Cosine Distance = 1 − Cosine Similarity
• CD(TF1, TF2) = 0.5
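The cosine distance between the two text fields can be verified with a short sketch. It assumes the vectors are truncated to the three terms shown ("hippo", "safari", "zebra"), ignoring the elided columns.

```python
import math


def cosine_similarity(u, v):
    """cos of the angle between two term vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_distance(u, v):
    return 1 - cosine_similarity(u, v)


# Term vectors restricted to the three terms shown on the slide.
tf1 = (1, 0, 1)
tf2 = (1, 1, 0)
cd = cosine_distance(tf1, tf2)
```

The two fields share one of their two terms each, so the similarity is 1 / (√2 · √2) = 0.5 and the distance is 0.5 (up to floating-point rounding).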
Finding K: G-Means
Let K=2
Keep 1, Split 1
New K=3
Let K=3
Keep 1, Split 2
New K=5
Let K=5
K=5
Clusters Demo #2
Summary
• Cluster Purpose
• Unsupervised technique for finding self-similar groups
of instances
• Number of centroids (k) can be specified or computed
• Outputs list of centroids
• Configuration:
• Algorithm: K-means / G-means
• Cluster Parameter: k or critical value
• Default missing / Summary fields / Scales / Weights
• Model Clusters
• Centroid / Batchcentroids
Anomaly Detection
Finding the Unusual
Poul Petersen
CIO, BigML, Inc
What is Anomaly Detection?
• An unsupervised learning technique
• No labels necessary
• Useful for finding unusual instances
• Filtering, finding mistakes, 1-class classifiers
• Finds instances that do not match
• Customer: big or small spender for profile
• Medical: healthy patient despite indicative diagnostics
• Defines each unusual instance by an “anomaly score”
• in BigML: 0 = normal, 1 = unusual, and 0.7 ≫ 0.6 > 0.5
• Standard deviation, distributions, etc
Clusters
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thr   Sally     6788     sign  food     26339  51
similar
Anomaly Detection
date  customer  account  auth  class    zip    amount
Mon   Bob       3421     pin   clothes  46140  135
Tue   Bob       3421     sign  food     46140  401
Tue   Alice     2456     pin   food     12222  234
Wed   Sally     6788     pin   gas      26339  94
Wed   Bob       3421     pin   tech     21350  2459
Wed   Bob       3421     pin   gas      46140  83
Thr   Sally     6788     sign  food     26339  51
anomaly
• Amount $2,459 is higher than all other transactions
• It is the only transaction
• in zip 21350
• for the purchase class "tech"
Use Cases
• Unusual instance discovery - "exploration"
• Intrusion Detection - "looking for unusual usage patterns"
• Fraud - "looking for unusual behavior"
• Identify Incorrect Data - "looking for mistakes"
• Remove Outliers - "improve model quality"
• Model Competence / Input Data Drift
Removing Outliers
• Models need to generalize
• Outliers negatively impact generalization
GOAL: Use anomaly detector to identify most anomalous
points and then remove them before modeling.
[Diagram: DATASET, ANOMALY DETECTOR, FILTERED DATASET, CLEAN MODEL]
Diabetes Anomalies
[Diagram: workflow with nodes DIABETES SOURCE, DIABETES DATASET, TRAIN SET, TEST SET, ANOMALY DETECTOR, FILTER, CLEAN DATASET, ALL MODEL (twice), ALL EVALUATION, CLEAN EVALUATION, COMPARE EVALUATIONS]
Anomaly Demo #1
Intrusion Detection
GOAL: Identify unusual command line behavior per user and
across all users that might indicate an intrusion.
• Dataset of command line history for users
• Data for each user consists of commands,
flags, working directories, etc.
• Assumption: Users typically issue the
same flag patterns and work in certain
directories
[Figure panels: Per User, Per Dir, All User, All Dir]
Fraud
• Dataset of credit card transactions
• Additional user profile information
GOAL: Cluster users by profile and use multiple anomaly
scores to detect transactions that are anomalous on multiple
levels.
[Figure panels: Card Level, User Level, Similar User Level]
Model Competence
• After putting a model into production, data that is being
predicted can become statistically different than the
training data.
• Train an anomaly detector at the same time as the model.
GOAL: For every prediction, compute an anomaly score. If the
anomaly score is high, then the model may not be competent
and should not be trusted.
At Training Time: DATASET, MODEL, ANOMALY DETECTOR
At Prediction Time:
Prediction     T       T
Confidence     86%     84%
Anomaly Score  0.5367  0.7124
Competent?     Y       N
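The competence check can be sketched as a simple threshold on the anomaly score computed alongside each prediction. The cut-off of 0.6 is an assumption chosen so that the two example scores come out as on the slide; the slides do not state the threshold used.

```python
def is_competent(anomaly_score, threshold=0.6):
    """Trust the model only when the input looks like the training data.

    The threshold is a hypothetical value for illustration.
    """
    return anomaly_score < threshold


# The two predictions from the slide.
predictions = [
    {"prediction": "T", "confidence": 0.86, "anomaly_score": 0.5367},
    {"prediction": "T", "confidence": 0.84, "anomaly_score": 0.7124},
]
competent = ["Y" if is_competent(p["anomaly_score"]) else "N" for p in predictions]
```

Note that the two predictions have nearly the same confidence, so the confidence alone would not separate them.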
Univariate Approach
• Single variable: heights, test scores, etc
• Assume the value is distributed “normally”
• Compute standard deviation
• a measure of how “spread out” the numbers are
• the square root of the variance (the average of the squared
differences from the mean)
• Depending on the number of instances, choose a “multiple”
of standard deviations to indicate an anomaly. A multiple of 3
for 1000 instances removes ~ 3 outliers.
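The univariate rule can be sketched as follows: flag every value that lies more than a chosen multiple of standard deviations from the mean. The height values are invented for illustration, and `univariate_outliers` is a hypothetical helper.

```python
from statistics import mean, pstdev


def univariate_outliers(values, multiple=3.0):
    """Values farther than `multiple` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)  # square root of the average squared difference
    if sigma == 0:
        return []
    return [v for v in values if abs(v - mu) > multiple * sigma]


# Hypothetical heights in cm, with one implausible entry.
heights = [168, 170, 172, 169, 171, 170, 173, 167, 170, 171,
           169, 172, 170, 168, 171, 170, 169, 172, 170, 250]
outliers = univariate_outliers(heights)
```

Because the outlier itself inflates the mean and the standard deviation, a single extreme value can hide smaller anomalies, which is one limitation of this approach on small samples.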
[Figure: frequency plotted against measurement, with outliers marked in both tails]
• Available in BigML API
Benford’s Law
• In real-life numeric sets the small digits occur
disproportionately often as leading significant digits.
• Applications include:
• accounting records
• electricity bills
• street addresses
• stock prices
• population numbers
• death rates
• lengths of rivers
• Available in BigML API
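Benford's law gives the expected share of each leading digit d as log10(1 + 1/d). The sketch below compares that expectation with observed leading-digit frequencies; the powers of 2 are used only as a convenient sequence known to follow the law closely, and the helper names are hypothetical.

```python
import math
from collections import Counter


def benford_expected():
    """Expected share of each leading digit 1..9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(value):
    """First significant digit, read from scientific notation."""
    return int(f"{abs(value):.15e}"[0])


def leading_digit_frequencies(values):
    counts = Counter(leading_digit(v) for v in values if v != 0)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


expected = benford_expected()
observed = leading_digit_frequencies([2 ** n for n in range(1, 101)])
```

A dataset whose observed frequencies deviate strongly from `expected` may deserve a closer look, for example with a goodness-of-fit test.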
Multivariate Matters
Human Expert
Most Unusual?
[Figure: objects partitioned as “Round”, “Skinny”, and “Corners”; annotations: “Skinny” but not “smooth”, No “Corners”, Not “Round”; one object marked Most unusual]
Key Insight: The “most unusual” object is different in some way from every partition of the features.
• Human used prior knowledge to select possible features
that separated the objects.
• “round”, “skinny”, “smooth”, “corners”
• Items were then separated based on the chosen features
• Each cluster was then examined to see which object fit
the least well in its cluster and did not fit any other cluster
• Length/Width
• greater than 1 => “skinny”
• equal to 1 => “round”
• less than 1 => invert
• Number of Surfaces
• distinct surfaces require “edges” which have corners
• easier to count
• Smooth - true or false
Create features that capture these object differences
Anomaly Features
Object   Length / Width  Num Surfaces  Smooth
penny    1               3             TRUE
dime     1               3             TRUE
knob     1               4             TRUE
eraser   2.75            6             TRUE
box      1               6             TRUE
block    1.6             6             TRUE
screw    8               3             FALSE
battery  5               3             TRUE
key      4.25            3             FALSE
bead     1               2             TRUE
Random Splits
[Figure: decision tree with True/False branches; splits: length/width > 5, smooth?, num surfaces = 6, length/width = 1, length/width < 2; leaves: box, block, eraser, knob, penny/dime, bead, key, battery, screw]
Know that “splits” matter - don’t know the order
Isolation Forest
Grow a random decision tree until each instance is in its own leaf
[Figure: tree with a Depth axis; shallow leaves are “easy” to isolate, deep leaves are “hard” to isolate]
Now repeat the process several times and use average Depth to compute anomaly score: 0 (similar) -> 1 (dissimilar)
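The idea can be sketched in plain Python under two stated assumptions. First, instead of storing whole trees, the sketch follows one instance down a freshly drawn random tree each time, which is enough to measure that instance's isolation depth. Second, the mapping from average depth to a score uses the normalization from the original isolation forest formulation, 2^(-average depth / c(n)); the slides do not say which mapping BigML applies, so the scores here need not match BigML's. The data are the numeric object features (Length / Width, Num Surfaces).

```python
import math
import random

names = ["penny", "dime", "knob", "eraser", "box",
         "block", "screw", "battery", "key", "bead"]
data = [(1, 3), (1, 3), (1, 4), (2.75, 6), (1, 6),
        (1.6, 6), (8, 3), (5, 3), (4.25, 3), (1, 2)]


def isolation_depth(point, rows, rng, depth=0, max_depth=50):
    """Number of random splits needed before `point` is alone in its partition."""
    if len(rows) <= 1 or depth >= max_depth:
        return depth
    # Only features that still vary within this partition can be split on.
    splittable = [f for f in range(len(point))
                  if min(r[f] for r in rows) < max(r[f] for r in rows)]
    if not splittable:
        return depth  # remaining rows are identical to each other
    feature = rng.choice(splittable)
    low = min(r[feature] for r in rows)
    high = max(r[feature] for r in rows)
    threshold = rng.uniform(low, high)
    # Keep only the rows that fall on the same side of the split as the point.
    same_side = [r for r in rows
                 if (r[feature] < threshold) == (point[feature] < threshold)]
    return isolation_depth(point, same_side, rng, depth + 1, max_depth)


def average_path_length(n):
    """Normalizing constant c(n) from the original isolation forest formulation."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + 0.5772156649  # approximation of H(n - 1)
    return 2 * harmonic - 2 * (n - 1) / n


def anomaly_score(point, rows, trees=200, seed=0):
    """Average depth over many random trees, mapped to a score in (0, 1]."""
    rng = random.Random(seed)
    avg_depth = sum(isolation_depth(point, rows, rng) for _ in range(trees)) / trees
    return 2 ** (-avg_depth / average_path_length(len(rows)))


scores = {name: anomaly_score(point, data) for name, point in zip(names, data)}
```

Instances that are isolated after few splits (such as the screw, which sits far from the rest on Length / Width) receive higher scores than instances buried among near-duplicates (such as the penny and dime).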
Isolation Forest Scoring
    f1    f2   f3
i1  red   cat  ball
i2  red   cat  ball
i3  red   cat  box
i4  blue  dog  pen
For the instance i2, find the depth in each tree: D = 3, D = 6, D = 2
Map avg depth to final score: S=0.45
Model Competence
• A low anomaly score means the loan is similar to the
modeled loans.
• A high anomaly score means you should not trust the
model.
Prediction     T       T
Confidence     86%     84%
Anomaly Score  0.5367  0.7124
Competent?     Y       N
[Diagram: CLOSED LOAN MODEL and CLOSED LOAN ANOMALY DETECTOR, OPEN LOANS, PREDICTION, ANOMALY SCORE]
Anomaly Demo #2
1-Class Classifier?
• You place an advertisement in a local newspaper
• You collect demographic information about all responders
• Now you want to market in a new locality with direct letters
• To optimize mailing costs, need to predict who will respond
• But, cannot distinguish “not interested” from “didn’t see the ad”
• Train an anomaly detector on the 1-class data
• Pick the households with the lowest scores for mailing:
• If a household has a low anomaly score, then they are
“similar” to enough of your positive responders and
therefore may respond as well
• If an individual has a high anomaly score, then they are
dissimilar from all previous responders and therefore are
less likely to respond.
Summary
• Anomaly detection is the process of finding unusual instances
• Some techniques and how they work:
• Univariate: standard deviation
• Benford’s law
• Isolation Forest
• Applications
• Filtering to improve models
• Finding mistakes, fraud, and intruders
• Knowing when to retrain a model (competence)
• 1-class classifiers
• In general… unsupervised learning techniques:
• Require more finesse and interpretation
• Are more commonly part of a multistep workflow
VSSML17 L3. Clusters and Anomaly Detection

VSSML17 L3. Clusters and Anomaly Detection

  • 1. Valencian Summer School in Machine Learning 3rd edition September 14-15, 2017
  • 2. BigML, Inc 2 Clusters Finding Similarities Poul Petersen CIO, BigML, Inc
  • 3. BigML, Inc 3Clusters What is Clustering? • An unsupervised learning technique • No labels necessary • Useful for finding similar instances • Smart sampling/labelling • Finds ā€œself-similar" groups of instances • Customer: groups with similar behavior • Medical: patients with similar diagnostic measurements • Defines each group by a ā€œcentroidā€ • Geometric center of the group • Represents the ā€œaverageā€ member • Number of centroids (k) can be specified or determined
  • 4. BigML, Inc 4Clusters Cluster Centroids date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51
  • 5. BigML, Inc 5Clusters Cluster Centroids date customer account auth class zip amount Mon Bob 3421 pin clothes 46140 135 Tue Bob 3421 sign food 46140 401 Tue Alice 2456 pin food 12222 234 Wed Sally 6788 pin gas 26339 94 Wed Bob 3421 pin tech 21350 2459 Wed Bob 3421 pin gas 46140 83 Thr Sally 6788 sign food 26339 51 auth = pin amount ~ $100 Same: date: Mon != Wed customer: Sally != Bob account: 6788 != 3421 class: clothes != gas zip: 26339 != 46140 Different: date = Wed (2 out of 3) customer = Bob account = 3421 auth = pin class = gas zip = 46140 amount = $104 Centroid: similar
  • 6. BigML, Inc 6Clusters Use Cases • Customer segmentation • Which customers are similar? • How many natural groups are there? • Item discovery • What other items are similar to this one? • Similarity • What other instances share a specific property? • Recommender (almost) • If you like this item, what other items might you like? • Active learning • Labelling unlabelled data efficiently
  • 7. BigML, Inc 7Clusters Customer Segmentation GOAL: Cluster the users by usage statistics. Identify clusters with a higher percentage of high LTV users. Since they have similar usage patterns, the remaining users in these clusters may be good candidates for up-sell. • Dataset of mobile game users. • Data for each user consists of usage statistics and a LTV based on in- game purchases • Assumption: Usage correlates to LTV 0% 3% 1%
  • 8. BigML, Inc 8Clusters Similarity GOAL: Cluster the loans by application profile to rank loan quality by percentage of trouble loans in population • Dataset of Lending Club Loans • Mark any loan that is currently or has even been late as ā€œtroubleā€ 0% 3% 7% 1%
  • 9. BigML, Inc 9Clusters Active Learning GOAL: Rather than sample randomly, use clustering to group patients by similarity and then test a sample from each cluster to label the data. • Dataset of diagnostic measurements of 768 patients. • Want to test each patient for diabetes and label the dataset to build a model but the test is expensive*.
  • 10. BigML, Inc 10Clusters Active Learning *For a more realistic example of high cost, imagine a dataset with a billion transactions, each one needing to be labelled as fraud/not- fraud. Or a million images which need to be labeled as cat/not-cat. 2323
  • 11. BigML, Inc 11Clusters Item Discovery GOAL: Cluster the whiskies by flavor profile to discover whiskies that have similar taste. • Dataset of 86 whiskies • Each whiskey scored on a scale from 0 to 4 for each of 12 possible flavor characteristics. Smoky Fruity
  • 13. BigML, Inc 13Clusters Human Expert Cluster into 3 groups…
  • 15. BigML, Inc 15Clusters Human Expert • Jesa used prior knowledge to select possible features that separated the objects. • ā€œroundā€, ā€œskinnyā€, ā€œedgesā€, ā€œhardā€, etc • Items were then clustered based on the chosen features • Separation quality was then tested to ensure: • met criteria of K=3 • groups were sufficiently ā€œdistantā€ • no crossover
  • 16. BigML, Inc 16Clusters Human Expert • Length/Width • greater than 1 => ā€œskinnyā€ • equal to 1 => ā€œroundā€ • less than 1 => invert • Number of Surfaces • distinct surfaces require ā€œedgesā€ which have corners • easier to count Create features that capture these object differences
  • 17. BigML, Inc 17Clusters Clustering Features Object Length / Width Num Surfaces penny 1 3 dime 1 3 knob 1 4 eraser 2.75 6 box 1 6 block 1.6 6 screw 8 3 battery 5 3 key 4.25 3 bead 1 2
  • 18. BigML, Inc 18Clusters Plot by Features Num Surfaces Length / Width box block eraser knob penny dime bead key battery screw K-Means Key Insight: We can find clusters using distances in n-dimensional feature space K=3
  • 19. BigML, Inc 19Clusters Plot by Features Num Surfaces Length / Width box block eraser knob penny dime bead key battery screw K-Means Find ā€œbestā€ (minimum distance) circles that include all points
  • 21. BigML, Inc 21Clusters K-Means Algorithm K=3 Repeat until centroids stop moving
  • 22. BigML, Inc 22Clusters Features Matter Metal Other Wood
  • 23. BigML, Inc 23Clusters Convergence Convergence guaranteed but not necessarily unique Starting points important (K++)
  • 24. BigML, Inc 24Clusters Starting Points • Random points or instances in n-dimensional space • Chose points ā€œfarthestā€ away from each other • but this is sensitive to outliers • k++ • the first center is chosen randomly from instances • each subsequent center is chosen from the remaining instances with probability proportional to its squared distance from the point's closest existing cluster center
  • 25. BigML, Inc 25Clusters Scaling Matters price number of bedrooms d = 160,000 d = 1
  • 26. BigML, Inc 26Clusters Other Tricks • What is the distance to a ā€œmissing valueā€? • What is the distance between categorical values? • What is the distance between text features? • Does it have to be Euclidean distance? • Unknown ā€œKā€?
  • 27. BigML, Inc 27Clusters Distance to Missing? • Nonsense! Try replacing missing values with: • Maximum • Mean • Median • Minimum • Zero • Ignore instances with missing values
  • 28. BigML, Inc 28Clusters Distance to Categorical? • Define special distance function: For two instances š‘„ and š‘¦ Ā  and the categorical field š‘Ž: • if š‘„ š‘Ž Ā ļ¼ Ā  š‘¦ š‘Ž then
 (š‘„,š‘¦)distanceļ¼0 Ā (or field scaling value) 
 else 
 (š‘„,š‘¦)distanceļ¼1 Approach: similar to ā€œk-prototypesā€
  • 29. BigML, Inc 29Clusters Distance to Categorical? animal favorite toy toy color cat ball red cat ball green d=0 d=0 d=1 cat laser red dog squeaky red d=1 d=1 d=0 D = 1 Then compute Euclidean distance between vectors D = √2 Note: the centroid is assigned the most common category of the member instances
  • 30. BigML, Inc 30Clusters Text Vectors 1 Cosine Similarity 0 -1 "hippo" "safari" "zebra" …. 1 0 1 … 1 1 0 … 0 1 1 … Text Field #1 Text Field #2 Features(thousands) • Cosine Ā Similarity Ā  • cos() between two vectors • 1 if collinear, 0 if orthogonal • only positive vectors: 0  ≤ Ā CS  ≤ Ā 1 • Cosine Ā Distanceļ¼1ļ¼Cosine Ā Similarity Ā  • CD(TF1, Ā TF2) Ā = Ā 0.5
  • 33. Clusters: Finding K: G-Means • Let K=2 • Keep 1, Split 1 • New K=3
  • 34. Clusters: Finding K: G-Means • Let K=3 • Keep 1, Split 2 • New K=5
  • 35. Clusters: Finding K: G-Means • Let K=5 • K=5
  • 37. Clusters: Summary • Cluster Purpose • Unsupervised technique for finding self-similar groups of instances • Number of centroids (k) can be specified or computed • Outputs a list of centroids • Configuration: • Algorithm: K-means / G-means • Cluster Parameter: k or critical value • Default missing / Summary fields / Scales / Weights • Model Clusters • Centroid / Batch centroids
  • 38. Anomaly Detection: Finding the Unusual • Poul Petersen, CIO, BigML, Inc
  • 39. Anomaly Detection: What is Anomaly Detection? • An unsupervised learning technique • No labels necessary • Useful for finding unusual instances • Filtering, finding mistakes, 1-class classifiers • Finds instances that do not match • Customer: big or small spender for profile • Medical: healthy patient despite indicative diagnostics • Defines each unusual instance by an “anomaly score” • in BigML: 0 = normal, 1 = unusual, and 0.7 ≫ 0.6 > 0.5 • Standard deviation, distributions, etc.
  • 40. Anomaly Detection: Clusters
    date  customer  account  auth  class    zip    amount
    Mon   Bob       3421     pin   clothes  46140  135
    Tue   Bob       3421     sign  food     46140  401
    Tue   Alice     2456     pin   food     12222  234
    Wed   Sally     6788     pin   gas      26339  94
    Wed   Bob       3421     pin   tech     21350  2459
    Wed   Bob       3421     pin   gas      46140  83
    Thu   Sally     6788     sign  food     26339  51
  • 41. Anomaly Detection: Clusters • Same transaction table as slide 40, with a group of rows marked “similar”
  • 42. Anomaly Detection: Anomaly Detection • Same transaction table as slide 40
  • 43. Anomaly Detection: Anomaly Detection • Same transaction table as slide 40, with the row (Wed, Bob, 3421, pin, tech, 21350, 2459) marked as the anomaly • Amount $2,459 is higher than all other transactions • It is the only transaction • in zip 21350 • for the purchase class "tech"
  • 44. Anomaly Detection: Use Cases • Unusual instance discovery - "exploration" • Intrusion Detection - "looking for unusual usage patterns" • Fraud - "looking for unusual behavior" • Identify Incorrect Data - "looking for mistakes" • Remove Outliers - "improve model quality" • Model Competence / Input Data Drift
  • 45. Anomaly Detection: Removing Outliers • Models need to generalize • Outliers negatively impact generalization • GOAL: Use an anomaly detector to identify the most anomalous points and then remove them before modeling. • [Diagram labels: DATASET, FILTERED DATASET, ANOMALY DETECTOR, CLEAN MODEL]
  • 46. Anomaly Detection: Diabetes Anomalies • [Diagram labels: DIABETES SOURCE, DIABETES DATASET, TRAIN SET, TEST SET, ALL MODEL, CLEAN DATASET, FILTER, ALL MODEL, ALL EVALUATION, CLEAN EVALUATION, COMPARE EVALUATIONS, ANOMALY DETECTOR]
  • 47. Anomaly Detection: Anomaly Demo #1
  • 48. Anomaly Detection: Intrusion Detection • GOAL: Identify unusual command line behavior per user and across all users that might indicate an intrusion. • Dataset of command line history for users • Data for each user consists of commands, flags, working directories, etc. • Assumption: Users typically issue the same flag patterns and work in certain directories • [Diagram labels: Per User, Per Dir, All User, All Dir]
  • 49. Anomaly Detection: Fraud • Dataset of credit card transactions • Additional user profile information • GOAL: Cluster users by profile and use multiple anomaly scores to detect transactions that are anomalous on multiple levels. • [Diagram labels: Card Level, User Level, Similar User Level]
  • 50. Anomaly Detection: Model Competence • After putting a model into production, the data being predicted can become statistically different from the training data. • Train an anomaly detector at the same time as the model. • GOAL: For every prediction, compute an anomaly score. If the anomaly score is high, then the model may not be competent and should not be trusted. • Example 1: Prediction T, Confidence 86%, Anomaly Score 0.5367, Competent? Y • Example 2: Prediction T, Confidence 84%, Anomaly Score 0.7124, Competent? N • [Diagram labels: At Training Time: DATASET, MODEL, ANOMALY DETECTOR; At Prediction Time]
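The competence check can be sketched as a simple threshold on the anomaly score. The threshold of 0.6 is an assumption chosen so that the slide's two examples come out as Y and N; the slides do not state the cutoff that was actually used.

```python
def is_competent(anomaly_score, threshold=0.6):
    """Trust the model only when the input looks like the training data.

    `threshold` is a hypothetical cutoff, not a value given in the slides.
    """
    return anomaly_score < threshold

# The two example predictions from the slide.
predictions = [
    {"prediction": "T", "confidence": 0.86, "anomaly_score": 0.5367},
    {"prediction": "T", "confidence": 0.84, "anomaly_score": 0.7124},
]
flags = [is_competent(p["anomaly_score"]) for p in predictions]
```

Both predictions have similar confidence, so confidence alone would not reveal that the second input is unlike the training data.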
  • 51. Anomaly Detection: Univariate Approach • Single variable: heights, test scores, etc. • Assume the value is distributed “normally” • Compute the standard deviation • a measure of how “spread out” the numbers are • the square root of the variance (the average of the squared differences from the mean) • Depending on the number of instances, choose a “multiple” of standard deviations to indicate an anomaly. A multiple of 3 for 1000 instances removes ~3 outliers.
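A minimal sketch of the standard deviation rule, together with the arithmetic behind the “~3 outliers” figure. The function names and the synthetic height data are assumptions for illustration.

```python
import math
import statistics

def flag_outliers(values, multiple=3.0):
    """Return values farther than `multiple` standard deviations from the mean."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > multiple * std]

def expected_outliers(n, multiple=3.0):
    """Expected count beyond +/- `multiple` standard deviations for normal data."""
    # For a normal variable, P(|Z| > k) = erfc(k / sqrt(2)).
    return n * math.erfc(multiple / math.sqrt(2))

# Synthetic heights in cm: 1000 values between 165 and 175, plus one extreme value.
heights = [170 + (i % 11) - 5 for i in range(1000)] + [260]
outliers = flag_outliers(heights)
expected = expected_outliers(1000)  # about 2.7, i.e. roughly 3
```

The rule only works one field at a time and relies on the normality assumption, which is what motivates the multivariate approach later in the deck.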
  • 52. Anomaly Detection: Univariate Approach • [Figure: frequency vs. measurement, with outliers marked in both tails] • Available in BigML API
  • 53. Anomaly Detection: Benford’s Law • In real-life numeric sets the small digits occur disproportionately often as leading significant digits. • Applications include: • accounting records • electricity bills • street addresses • stock prices • population numbers • death rates • lengths of rivers • Available in BigML API
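Benford's law gives the expected share of leading digit d as log10(1 + 1/d). The sketch below compares that with observed leading digits; powers of 2 stand in for a real dataset, and the function names are hypothetical. This is not the BigML API call mentioned on the slide.

```python
import math
from collections import Counter

def benford_expected():
    """Expected frequency of each leading digit 1..9 under Benford's law."""
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}

def leading_digit_frequencies(values):
    """Observed leading-digit frequencies for non-zero integers."""
    digits = [int(str(abs(v))[0]) for v in values if v != 0]
    counts = Counter(digits)
    return {d: counts[d] / len(digits) for d in range(1, 10)}

expected = benford_expected()
# Stand-in data: the first 1000 powers of 2, which follow Benford's law closely.
observed = leading_digit_frequencies([2 ** n for n in range(1000)])
max_gap = max(abs(observed[d] - expected[d]) for d in range(1, 10))
```

A large gap between observed and expected frequencies is a hint that the numbers may have been altered or generated artificially.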
  • 54. Anomaly Detection: Multivariate Matters
  • 55. Anomaly Detection: Multivariate Matters
  • 56. Anomaly Detection: Human Expert • Most unusual?
  • 57. Anomaly Detection: Human Expert • [Figure labels: “Round”, “Skinny”, “Corners”, “Skinny” but not “smooth”, No “Corners”, Not “Round”, Most unusual] • Key Insight: The “most unusual” object is different in some way from every partition of the features.
  • 58. Anomaly Detection: Human Expert • Human used prior knowledge to select possible features that separated the objects. • “round”, “skinny”, “smooth”, “corners” • Items were then separated based on the chosen features • Each cluster was then examined to see which object fit the least well in its cluster and did not fit any other cluster
  • 59. Anomaly Detection: Human Expert • Create features that capture these object differences • Length/Width • greater than 1 => “skinny” • equal to 1 => “round” • less than 1 => invert • Number of Surfaces • distinct surfaces require “edges” which have corners • easier to count • Smooth - true or false
  • 60. Anomaly Detection: Anomaly Features
    Object   Length / Width  Num Surfaces  Smooth
    penny    1               3             TRUE
    dime     1               3             TRUE
    knob     1               4             TRUE
    eraser   2.75            6             TRUE
    box      1               6             TRUE
    block    1.6             6             TRUE
    screw    8               3             FALSE
    battery  5               3             TRUE
    key      4.25            3             FALSE
    bead     1               2             TRUE
  • 61. Anomaly Detection: Random Splits • [Tree diagram: split conditions length/width > 5, smooth?, num surfaces = 6, length/width = 1, length/width < 2, each with True/False branches; leaves box, block, eraser, knob, penny/dime, bead, key, battery, screw] • We know that “splits” matter, but we don’t know the order
  • 62. Anomaly Detection: Isolation Forest • Grow a random decision tree until each instance is in its own leaf • [Figure labels: “easy” to isolate, “hard” to isolate, Depth] • Now repeat the process several times and use the average depth to compute an anomaly score: 0 (similar) -> 1 (dissimilar)
  • 63. Anomaly Detection: Isolation Forest Scoring • Instances (f1, f2, f3): i1 = (red, cat, ball), i2 = (red, cat, ball), i3 = (red, cat, box), i4 = (blue, dog, pen) • For the instance i2, find the depth in each tree: D = 3, D = 6, D = 2 • Map the average depth to the final score: S = 0.45
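The isolation idea can be sketched on the object features from the “Anomaly Features” slide. This is a simplified illustration under stated assumptions: each “tree” is grown only along the path of the instance being scored, there is no sub-sampling, and the depth-to-score mapping is the normalization from the original isolation forest paper, which may differ from the mapping BigML uses.

```python
import math
import random

# (length/width, num surfaces, smooth as 1/0) from the "Anomaly Features" slide.
OBJECTS = {
    "penny": (1.0, 3, 1), "dime": (1.0, 3, 1), "knob": (1.0, 4, 1),
    "eraser": (2.75, 6, 1), "box": (1.0, 6, 1), "block": (1.6, 6, 1),
    "screw": (8.0, 3, 0), "battery": (5.0, 3, 1), "key": (4.25, 3, 0),
    "bead": (1.0, 2, 1),
}

def isolation_depth(target, rows, rng, depth=0):
    """Number of random splits before `target` sits alone in its partition."""
    if len(rows) <= 1:
        return depth
    # Only fields that still vary can separate the remaining rows.
    splittable = [f for f in range(len(target)) if len({r[f] for r in rows}) > 1]
    if not splittable:
        return depth  # exact duplicates (penny/dime) cannot be separated
    field = rng.choice(splittable)
    low = min(r[field] for r in rows)
    high = max(r[field] for r in rows)
    cut = rng.uniform(low, high)
    # Keep only the rows that fall on the same side of the cut as the target.
    same_side = [r for r in rows if (r[field] < cut) == (target[field] < cut)]
    return isolation_depth(target, same_side, rng, depth + 1)

def average_path_length(n):
    """Normalizer c(n): approximate average depth of a random tree on n instances."""
    if n <= 1:
        return 0.0
    harmonic = math.log(n - 1) + 0.5772156649
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_scores(objects, trees=200, seed=0):
    """Average the isolation depth over many random trees and map it to (0, 1)."""
    rng = random.Random(seed)
    rows = list(objects.values())
    norm = average_path_length(len(rows))
    scores = {}
    for name, features in objects.items():
        depths = [isolation_depth(features, rows, rng) for _ in range(trees)]
        scores[name] = 2.0 ** (-(sum(depths) / trees) / norm)
    return scores

scores = anomaly_scores(OBJECTS)
most_unusual = max(scores, key=scores.get)
```

Objects with extreme feature values, such as the screw, tend to be isolated after few splits and so score higher than objects like the penny that sit among near-identical neighbours.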
  • 64. Anomaly Detection: Model Competence • A low anomaly score means the loan is similar to the modeled loans. • A high anomaly score means you should not trust the model. • Example 1: Prediction T, Confidence 86%, Anomaly Score 0.5367, Competent? Y • Example 2: Prediction T, Confidence 84%, Anomaly Score 0.7124, Competent? N • [Diagram labels: OPEN LOANS, PREDICTION, ANOMALY SCORE, CLOSED LOAN MODEL, CLOSED LOAN ANOMALY DETECTOR]
  • 65. Anomaly Detection: Anomaly Demo #2
  • 66. Anomaly Detection: 1-Class Classifier? • You place an advertisement in a local newspaper • You collect demographic information about all responders • Now you want to market in a new locality with direct letters • To optimize mailing costs, you need to predict who will respond • But you cannot distinguish “not interested” from “didn’t see the ad” • Train an anomaly detector on the 1-class data • Pick the households with the lowest scores for mailing: • If a household has a low anomaly score, then it is “similar” to enough of your positive responders and therefore may respond as well • If a household has a high anomaly score, then it is dissimilar from all previous responders and therefore is less likely to respond.
  • 67. Anomaly Detection: Summary • Anomaly detection is the process of finding unusual instances • Some techniques and how they work: • Univariate: standard deviation • Benford’s law • Isolation Forest • Applications • Filtering to improve models • Finding mistakes, fraud, and intruders • Knowing when to retrain a model (competence) • 1-class classifiers • In general… unsupervised learning techniques: • Require more finesse and interpretation • Are more commonly part of a multistep workflow