Data Cleaning – Outlier Detection
Group 01-IT
Contents
1. Types of outliers
2. Outlier detection
3. Statistical (or model-based) approaches
4. Proximity-based approaches
5. Clustering-based approaches
6. Classification approaches
7. Outlier detection in high dimensional data
Introduction
What are outliers?
Outlier: A data object that deviates significantly from the normal objects, as if it were generated by a different mechanism.
Outliers are interesting because they violate the mechanism that generates the normal data.
Applications of Outlier Detection:
◦ Credit card fraud detection
◦ Telecom fraud detection
◦ Customer segmentation
◦ Detecting measurement errors
◦ Medical analysis
◦ Public health
◦ Sports statistics
Types of Outliers
Three types:
Global outliers (or point anomaly)
Contextual outliers (or conditional outlier)
Collective outliers
1. Global outlier (or point anomaly)
• An object Og is a global outlier if it significantly deviates from the rest of the data set
• Ex. Intrusion detection in computer networks
• Issue: finding an appropriate measure of deviation
2. Contextual outlier (or conditional outlier)
• An object Oc is a contextual outlier if it deviates significantly within a selected context
• Ex. 80°F in Urbana: outlier? (depends on whether it is summer or winter)
• Issue: how to define or formulate a meaningful context?
3. Collective outliers
• A subset of data objects that collectively deviates significantly from the whole data set, even if the individual data objects are not outliers themselves
• E.g., intrusion detection: a set of related events that is anomalous only as a group (collective outlier)
Challenges of Outlier Detection
Modeling normal objects and outliers properly
Application-specific outlier detection
Handling noise in outlier detection
Understandability
Categorization of Outlier Detection: 1 of 2
Categorization of Outlier Detection Methods
There are two ways to categorize outlier detection methods:
Based on whether user-labeled examples of outliers can be obtained:
◦ Supervised methods
◦ Semi-supervised methods
◦ Unsupervised methods
Based on assumptions about normal data and outliers:
◦ Statistical methods
◦ Proximity-based methods
◦ Clustering-based methods
Supervised Methods
• Model outlier detection as a classification problem
• Methods for learning a classifier for outlier detection effectively:
◦ Model normal objects and report those not matching the model as outliers, or
◦ Model outliers and treat those not matching the model as normal
• Challenges:
◦ Imbalanced classes, i.e., outliers are rare: boost the outlier class and generate some artificial outliers
◦ Catch as many outliers as possible, i.e., recall is more important than accuracy (not mislabeling normal objects as outliers)
Unsupervised Methods
Assume the normal objects are somewhat "clustered" into multiple groups, each having some distinct features
An outlier is expected to be far away from any groups of normal objects
Weakness: Cannot detect collective outlier effectively
Ex. In some intrusion or virus detection, normal activities are diverse
Many clustering methods can be adapted for unsupervised methods
Semi-Supervised Methods
Labels could be on outliers only, normal objects only, or both
Semi-supervised outlier detection can be regarded as an application of semi-supervised learning
This can be done in two ways
1. If some labeled normal objects are available
2. If only some labeled outliers are available
Categorization of Outlier Detection : 2 of 2
Statistical Methods
Statistical methods (also known as model-based methods) assume that the
normal data follow some statistical model
Example :
STEP 1: Use a Gaussian distribution model.
STEP 2: Consider an object y in region R.
STEP 3: Estimate the probability that y fits the Gaussian distribution, gD(y).
STEP 4: If gD(y) is very low, y is an outlier.
Effectiveness highly depends on whether the assumed statistical model holds for the real data.
There is a rich choice of alternative statistical models.
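The four steps above can be sketched in a few lines of Python. This is a minimal illustration only: the data, the density threshold of 0.05, and the function name are assumptions, not part of the slides.

```python
import math
import statistics

def gaussian_outliers(data, threshold=0.05):
    """Fit a Gaussian to the data and flag points with very low density.

    The threshold is an illustrative choice, not a principled cutoff.
    """
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)  # sample standard deviation
    def density(x):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))
    return [x for x in data if density(x) < threshold]

print(gaussian_outliers([5.0, 5.1, 4.9, 5.2, 5.0, 4.8, 12.0]))  # [12.0]
```

In practice the threshold would be chosen from the application's tolerance for false positives rather than fixed by hand.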
Proximity-Based Methods
An object is an outlier if its proximity to its neighbors significantly deviates from the proximity of most of the other objects to their neighbors in the same data set.
Example :
• Model the proximity of an object using its 3 nearest neighbors.
• Objects in region R are different.
• Thus the objects in R are outliers.
The effectiveness relies highly on the proximity measure.
In some applications, proximity or distance measures cannot be obtained easily.
Such methods often have difficulty finding a group of outliers that stay close to each other.
Clustering-Based Methods
Normal data belong to large and dense clusters, whereas outliers belong to small
clusters, or do not belong to any clusters.
Example: two clusters
• All points not in R form a large cluster
• The two points in R form a tiny cluster, thus are outliers
Since there are many clustering methods, there are many clustering-based outlier detection methods as well.
Clustering is expensive: a straightforward adaptation of a clustering method for outlier detection can be costly and does not scale well to large data sets.
Statistical Approaches
Parametric Methods - Detection of Univariate Outliers Based on Normal Distribution
Univariate data: A data set involving only one attribute or variable.
Often assume that data are generated from a normal distribution, learn the
parameters from the input data, and identify the points with low probability as
outliers.
Ex: Avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
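For this sample, the low-probability point can be found with a simple z-score rule. This is a sketch: the 2.5σ cutoff is an illustrative choice for such a tiny sample (a common rule of thumb is 3σ), not a value from the slides.

```python
import statistics

temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]
mu = statistics.mean(temps)   # 28.61
s = statistics.stdev(temps)   # about 1.63

# Flag points whose |z-score| exceeds an illustrative 2.5 cutoff.
outliers = [t for t in temps if abs(t - mu) / s > 2.5]
print(outliers)  # [24.0]
```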
Parametric Methods - Grubbs' Test
Detect outliers in univariate data
Assumes the data come from a normal distribution
Detects one outlier at a time: remove that outlier and repeat
H0: There is no outlier in data
HA: There is at least one outlier
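One iteration of the test can be sketched as follows. The critical value is an approximate tabulated constant for n = 10 at α = 0.05, stated here as an assumption; real code should compute it from the t-distribution.

```python
import statistics

def grubbs_statistic(data):
    """G = max |x_i - mean| / s, the Grubbs test statistic."""
    mu = statistics.mean(data)
    s = statistics.stdev(data)
    candidate = max(data, key=lambda x: abs(x - mu))
    return abs(candidate - mu) / s, candidate

temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]
G, candidate = grubbs_statistic(temps)

# Approximate two-sided critical value for n = 10, alpha = 0.05 (assumed constant).
G_CRIT = 2.29
if G > G_CRIT:
    print(f"reject H0: {candidate} is an outlier (G = {G:.2f})")
```

Here G ≈ 2.83 exceeds the critical value, so 24.0 is rejected; the test would then be rerun on the remaining nine values.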
Parametric Methods - Detection of Multivariate Outliers
Multivariate data: A data set involving two or more attributes or variables
Transform the multivariate outlier detection task into a univariate outlier
detection problem
Method 1. Compute the Mahalanobis distance
Method 2. Use the χ²-statistic
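Method 1 can be sketched for two dimensions, with the 2×2 covariance inverse written out by hand and the standard χ² critical value for 2 degrees of freedom at α = 0.05. The data set is made up for illustration.

```python
def mahalanobis2d(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix entries.
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    # d^2 = (dx, dy) Sigma^-1 (dx, dy)^T with the 2x2 inverse expanded.
    return [(syy * (x - mx) ** 2 - 2 * sxy * (x - mx) * (y - my)
             + sxx * (y - my) ** 2) / det for x, y in points]

CHI2_2DF_95 = 5.991  # chi-squared critical value, 2 df, alpha = 0.05

pts = [(1, 1), (1, 2), (2, 1), (2, 2), (1, 1.5), (1.5, 1), (2, 1.5), (1.5, 2),
       (1.5, 1.5), (1.2, 1.8), (1.8, 1.2), (1.3, 1.3), (1.7, 1.7),
       (1.4, 1.6), (1.6, 1.4), (6, 6)]
d2 = mahalanobis2d(pts)
outliers = [p for p, d in zip(pts, d2) if d > CHI2_2DF_95]
print(outliers)  # [(6, 6)]
```

This is the univariate reduction the slide describes: each multivariate point is mapped to a single number d², which is then thresholded.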
Parametric Methods - Using a Mixture of Parametric Distributions
Assuming the data are generated by a single normal distribution can sometimes be overly simplistic.
Example: the objects between the two clusters cannot be captured as outliers, since they are close to the estimated mean.
To overcome this problem, assume the normal data are generated by a mixture of two normal distributions.
Non-Parametric Methods: Detection Using Histograms
The model of normal data is learned from the input data without any a priori
structure.
Often makes fewer assumptions about the data, and thus can be applicable in more
scenarios.
Outlier detection using histogram:
◦ Problem: hard to choose an appropriate bin size for the histogram
◦ Too small a bin size → normal objects fall into empty or rare bins: false positives
◦ Too large a bin size → outliers land in some frequent bins: false negatives
◦ Solution: adopt kernel density estimation to estimate the probability density distribution of the data. If the estimated density at an object is high, the object is likely normal; otherwise, it is likely an outlier.
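The kernel density idea can be sketched with a Gaussian kernel. The bandwidth of 0.5 and the data are illustrative assumptions; bandwidth selection is itself a tuning problem, much like bin size.

```python
import math

def kde_scores(data, bandwidth=0.5):
    """Gaussian kernel density estimate evaluated at each data point."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)
            for x in data]

data = [1.0, 1.2, 0.9, 1.1, 1.0, 6.0]
scores = kde_scores(data)
# The isolated point 6.0 receives the lowest estimated density.
print(data[scores.index(min(scores))])  # 6.0
```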
Proximity-Based Approaches
Distance-Based Outlier Detection
Judge a point based on the distance(s) to its neighbors.
Several variants have been proposed.
Basic Assumptions
• Normal data objects have a dense neighborhood.
• Outliers are far apart from their neighbors, i.e., have a less dense
neighborhood.
DB(ε,π)-Outliers
• Basic model [Knorr and Ng 1997]
• Given a radius ε and a percentage π
• A point p is considered an outlier if at
most π percent of all other points have a
distance to p less than ε
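The definition translates almost directly into code. A brute-force 1-D sketch (the data, ε, and π values are made-up illustrations; real implementations use the index-, grid-, or nested-loop-based algorithms below):

```python
def db_outliers(points, eps, pi):
    """DB(eps, pi)-outliers for 1-D points: p is an outlier if at most a
    fraction pi of the other points lie within distance eps of p."""
    n = len(points)
    out = []
    for i, p in enumerate(points):
        close = sum(1 for j, q in enumerate(points) if j != i and abs(q - p) < eps)
        if close <= pi * (n - 1):
            out.append(p)
    return out

data = [1.0, 1.1, 1.2, 0.9, 1.05, 9.0]
print(db_outliers(data, eps=0.5, pi=0.1))  # [9.0]
```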
Distance Based Algorithms
• Index-based [Knorr and Ng 1998]
– Compute the distance range join using a spatial index structure.
– Exclude a point from further consideration if its ε-neighborhood contains more than Card(DB) · π points.
• Grid-based [Knorr and Ng 1998]
– Build a grid such that any two points from the same grid cell are within distance ε of each other.
– Points need only be compared with points from neighboring cells.
Distance Based Algorithms Cont.
• Nested-loop based [Knorr and Ng 1998]
– Divide the buffer into two parts.
– Use the second part to scan/compare all points with the points from the first part.
• Outlier scoring based on kNN distances
– Take the kNN distance of a point as its outlier score [Ramaswamy et al. 2000]
– Aggregate the distances of a point to its 1NN, 2NN, …, kNN as an outlier score [Angiulli and Pizzuti 2002]
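Both scoring variants can be sketched in a few lines for 1-D data (the data set and k = 2 are illustrative assumptions):

```python
def knn_score(points, i, k):
    """kth-nearest-neighbour distance of points[i], as in Ramaswamy et al."""
    dists = sorted(abs(points[i] - q) for j, q in enumerate(points) if j != i)
    return dists[k - 1]

def agg_knn_score(points, i, k):
    """Sum of distances to the 1NN..kNN, as in Angiulli and Pizzuti."""
    dists = sorted(abs(points[i] - q) for j, q in enumerate(points) if j != i)
    return sum(dists[:k])

data = [1.0, 1.1, 1.2, 0.9, 9.0]
scores = [knn_score(data, i, k=2) for i in range(len(data))]
print(data[scores.index(max(scores))])  # 9.0
```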
Density Based Outlier Detection
Local outliers: outliers relative to their local neighborhoods, rather than to the global data distribution.
Example: global criteria such as the DB(ε,π)-outlier model or kNN-distance scores can miss local outliers.
Solution: consider relative density.
Distance-based outlier detection models have problems with data sets of varying density, because they compare neighborhoods of points from areas of different densities on a single global scale.
Density-based methods instead compare the density around a point with the density around its local neighbors.
The relative density of a point compared to its neighbors is computed as an outlier score.
Approaches differ in how they estimate density.
Basic assumptions
 The density around a normal data object is similar to the density around its
neighbors.
 The density around an outlier is considerably different from the density around its neighbors.
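These assumptions can be sketched with a simplified relative-density score in the spirit of LOF (this is not the exact LOF formula; the density estimate, k = 2, and the data are illustrative assumptions, and the points are assumed distinct):

```python
def local_density(p, points, k=2):
    """Inverse of the average distance from p to its k nearest neighbours (1-D)."""
    dists = sorted(abs(p - q) for q in points if q != p)[:k]
    return k / sum(dists)

def relative_density_score(p, points, k=2):
    """Average density around p's neighbours divided by the density around p.
    Scores well above 1 mean p sits in a much sparser region than its neighbours."""
    neighbours = sorted((q for q in points if q != p), key=lambda q: abs(p - q))[:k]
    avg = sum(local_density(q, points, k) for q in neighbours) / k
    return avg / local_density(p, points, k)

data = [1.0, 1.1, 1.2, 1.4, 5.0]
scores = {p: relative_density_score(p, data) for p in data}
# 5.0 scores far above 1; the dense points score near 1.
```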
Clustering-Based Approaches
Methods of Clustering in Outlier Detection
Case 1: The object does not belong to any cluster
◦ Ex. identify animals that are not part of a flock
Case 2: The object is far from its closest cluster
◦ Using k-means, partition the data points into clusters
◦ For each object o, assign an outlier score based on its distance from its closest cluster center
Ex. Intrusion detection: consider the similarity between data points and the clusters in a training data set
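Case 2 reduces to a distance-to-centroid score once the clustering has been run. A 1-D sketch, where the centroids are assumed to come from a prior k-means step and all numbers are made up:

```python
def centroid_scores(points, centroids):
    """Outlier score = distance from each 1-D point to its closest cluster centre."""
    return [min(abs(p - c) for c in centroids) for p in points]

data = [1.0, 1.1, 5.0, 5.2, 9.5]
centers = [1.05, 5.1]                   # e.g. obtained from k-means with k = 2
scores = centroid_scores(data, centers)
print(data[scores.index(max(scores))])  # 9.5
```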
Case 3: Detect outliers in small clusters
◦ Find clusters, and sort them in decreasing size
◦ Assign each data point a cluster-based local outlier factor (CBLOF):
◦ If object p belongs to a large cluster, CBLOF = cluster size × similarity between p and its cluster
◦ If p belongs to a small cluster, CBLOF = cluster size × similarity between p and the closest large cluster
Ex. In the figure, o is an outlier: its closest large cluster is C1, but the similarity between o and C1 is small.
For any point in C3, the closest large cluster is C2, but its similarity to C2 is low; in addition, |C3| = 3 is small.
Advantages and Disadvantages of
Clustering Based Methods
1. Advantages
• Detect outliers without requiring any labeled data
• Work for many types of data
• Clusters can be regarded as summaries of the data
• Once the clusters are obtained, each object need only be compared against the clusters to determine whether it is an outlier (fast)
2. Disadvantages
• Effectiveness depends highly on the clustering method used; such methods may not be optimized for outlier detection
• High computational cost: the clusters must be found first
• One method to reduce the cost: fixed-width clustering
Classification Approaches
Classification-Based Methods
Idea: train a classification model that can distinguish "normal" data from outliers.
A brute-force approach: consider a training set that contains samples labeled "normal" and others labeled "outlier".
But the training set is typically heavily biased: the number of "normal" samples likely far exceeds the number of outlier samples.
Such a model also cannot detect unseen anomalies.
Classification-Based Method I: One-Class Model
A classifier is built to describe only the normal class.
◦ Learn the decision boundary of the normal class using classification methods such as an SVM
◦ Any samples that do not belong to the normal class (i.e., fall outside the decision boundary) are declared outliers
◦ Advantage: can detect new outliers that do not appear close to any outlier objects in the training set
◦ Extension: normal objects may belong to multiple classes
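The one-class idea can be illustrated without an SVM: fit any boundary around the normal training data and declare everything outside it an outlier. Below is a deliberately minimal centre-plus-radius stand-in for a learned decision boundary; the data, the 1.1 safety margin, and the function names are all made-up assumptions.

```python
import statistics

def fit_one_class(train_normal):
    """Minimal one-class model: centre of the normal class plus a radius
    covering it, with a small safety margin. A toy stand-in for an SVM boundary."""
    mu = statistics.mean(train_normal)
    radius = 1.1 * max(abs(x - mu) for x in train_normal)
    return mu, radius

def is_outlier(x, model):
    mu, radius = model
    return abs(x - mu) > radius

normal = [4.8, 5.0, 5.1, 5.2, 4.9]   # training data: normal class only
model = fit_one_class(normal)
print(is_outlier(7.0, model), is_outlier(5.15, model))  # True False
```

Note the advantage stated above: 7.0 is detected even though no outlier resembling it ever appeared during training, because only the normal class was modeled.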
Classification-Based Method II: Semi-Supervised Learning
Combine classification-based and clustering-based methods.
Method:
◦ Using a clustering-based approach, find a large cluster C and a small cluster C1
◦ Since some objects in C carry the label "normal", treat all objects in C as normal
◦ Use the one-class model of this cluster to identify normal objects in outlier detection
◦ Since some objects in cluster C1 carry the label "outlier", declare all objects in C1 outliers
◦ Any object that does not fall into the model for C (such as a) is considered an outlier as well
Challenges for Outlier Detection in High-Dimensional Data
Interpretation of outliers
◦ Detecting outliers without saying why they are outliers is not very useful in high dimensions, because many features (dimensions) are involved
◦ E.g., identify which subspaces manifest the outliers, or provide an assessment of the "outlier-ness" of the objects
Data sparsity
◦ Data in high-dimensional spaces are often sparse
◦ The distance between objects becomes heavily dominated by noise as the dimensionality increases
Data subspaces
◦ Methods should adapt to the subspaces that signify the outliers
◦ And capture the local behavior of the data
Scalability with respect to dimensionality
◦ The number of subspaces increases exponentially with dimensionality
THANK YOU!!