Data Cleaning – Outlier Detection
Group 01-IT
Contents
1. Types of outliers
2. Outlier detection
3. Statistical (or model-based) approaches
4. Proximity-based approaches
5. Clustering-based approaches
6. Classification approaches
7. Outlier detection in high dimensional data
Introduction
What are outliers?
Outlier: A data object that deviates significantly from the normal objects, as if it were generated by a different mechanism.
Outliers are interesting because they violate the mechanism that generates the normal data.
Applications of Outlier Detection:
◦ Credit card fraud detection
◦ Telecom fraud detection
◦ Customer segmentation
◦ Detecting measurement errors
◦ Medical analysis
◦ Public health
◦ Sports statistics
Types of Outliers
Three types:
Global outliers (or point anomaly)
Contextual outliers (or conditional outlier)
Collective outliers
1. Global outlier (or point anomaly)
• An object Og is a global outlier if it significantly deviates from the rest of the data set
• Ex. Intrusion detection in computer networks
• Issue: finding an appropriate measure of deviation
2. Contextual outlier (or conditional outlier)
• An object Oc is a contextual outlier if it deviates significantly within a selected context
• Ex. 80°F in Urbana: outlier? (depends on whether it is summer or winter)
• Issue: how to define or formulate a meaningful context?
3. Collective outliers
• A subset of data objects that collectively deviates significantly from the whole data set, even if the individual data objects are not outliers themselves
• E.g., intrusion detection: a set of related events that is anomalous only as a group (collective outlier)
Challenges of Outlier Detection
Modeling normal objects and outliers properly
Application-specific outlier detection
Handling noise in outlier detection
Understandability
Categorization of Outlier Detection: 1 of 2
Categorization of Outlier Detection Methods
There are two ways to categorize outlier detection methods:
Based on whether user-labeled examples of outliers can be obtained:
◦ Supervised methods
◦ Semi-supervised methods
◦ Unsupervised methods
Based on assumptions about normal data and outliers:
◦ Statistical methods
◦ Proximity-based methods
◦ Clustering-based methods
Supervised Methods
• Model outlier detection as a classification problem
• Methods for learning a classifier for outlier detection effectively:
◦ Model normal objects and report those not matching the model as outliers, or
◦ Model outliers and treat those not matching the model as normal
• Challenges:
◦ Imbalanced classes, i.e., outliers are rare: boost the outlier class and generate some artificial outliers
◦ Catch as many outliers as possible, i.e., recall is more important than accuracy (not mislabeling normal objects as outliers)
Unsupervised Methods
Assume the normal objects are somewhat "clustered" into multiple groups, each having some distinct features
An outlier is expected to be far away from any groups of normal objects
Weakness: Cannot detect collective outlier effectively
Ex. In some intrusion or virus detection, normal activities are diverse
Many clustering methods can be adapted for unsupervised methods
Semi-Supervised Methods
Labels could be on outliers only, normal objects only, or both
Semi-supervised outlier detection can be regarded as an application of semi-supervised learning
This can be done in two ways
1. If some labeled normal objects are available
2. If only some labeled outliers are available
Categorization of Outlier Detection : 2 of 2
Statistical Methods
Statistical methods (also known as model-based methods) assume that the
normal data follow some statistical model
Example :
STEP 1: Use a Gaussian distribution model.
STEP 2: Consider an object y in region R.
STEP 3: Estimate the probability that y fits the Gaussian distribution, gD(y).
STEP 4: If gD(y) is very low, y is an outlier.
Effectiveness highly depends on whether the assumed statistical model holds for the real data.
There is a rich choice of alternative statistical models.
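The four steps above can be sketched in a few lines of Python. This is a minimal illustration only: the data, the density threshold of 0.05, and the function name are assumptions, not part of the slides.

```python
import math
import statistics

def gaussian_outliers(data, threshold=0.05):
    """Fit a Gaussian to the data and flag points with very low density.

    The threshold is an illustrative choice, not a principled cutoff.
    """
    mu = statistics.mean(data)
    sigma = statistics.stdev(data)  # sample standard deviation
    def density(x):
        z = (x - mu) / sigma
        return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))
    return [x for x in data if density(x) < threshold]

print(gaussian_outliers([5.0, 5.1, 4.9, 5.2, 5.0, 4.8, 12.0]))  # [12.0]
```

In practice the threshold would be chosen from the application's tolerance for false positives rather than fixed by hand.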
Proximity-Based Methods
An object is an outlier if its proximity to its neighbors significantly deviates from the proximity of most of the other objects to their neighbors in the same data set.
Example :
• Model the proximity of an object using its 3 nearest neighbors.
• Objects in region R are different.
• Thus the objects in R are outliers.
The effectiveness relies highly on the proximity measure.
In some applications, proximity or distance measures cannot be obtained easily.
Such methods often have difficulty finding a group of outliers that stay close to each other.
Clustering-Based Methods
Normal data belong to large and dense clusters, whereas outliers belong to small
clusters, or do not belong to any clusters.
Example: two clusters
• All points not in R form a large cluster
• The two points in R form a tiny cluster, thus are outliers
Since there are many clustering methods, there are many clustering-based outlier detection methods as well.
Clustering is expensive: a straightforward adaptation of a clustering method for outlier detection can be costly and does not scale well to large data sets.
Statistical Approaches
Parametric Methods - Detection of Univariate Outliers Based on Normal Distribution
Univariate data: A data set involving only one attribute or variable.
Often assume that data are generated from a normal distribution, learn the
parameters from the input data, and identify the points with low probability as
outliers.
Ex: Avg. temp.: {24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4}
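For this sample, the low-probability point can be found with a simple z-score rule. This is a sketch: the 2.5σ cutoff is an illustrative choice for such a tiny sample (a common rule of thumb is 3σ), not a value from the slides.

```python
import statistics

temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]
mu = statistics.mean(temps)   # 28.61
s = statistics.stdev(temps)   # about 1.63

# Flag points whose |z-score| exceeds an illustrative 2.5 cutoff.
outliers = [t for t in temps if abs(t - mu) / s > 2.5]
print(outliers)  # [24.0]
```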
Parametric Methods - Grubbs' Test
Detect outliers in univariate data
Assumes the data come from a normal distribution
Detects one outlier at a time: remove that outlier and repeat
H0: There is no outlier in data
HA: There is at least one outlier
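One iteration of the test can be sketched as follows. The critical value is an approximate tabulated constant for n = 10 at α = 0.05, stated here as an assumption; real code should compute it from the t-distribution.

```python
import statistics

def grubbs_statistic(data):
    """G = max |x_i - mean| / s, the Grubbs test statistic."""
    mu = statistics.mean(data)
    s = statistics.stdev(data)
    candidate = max(data, key=lambda x: abs(x - mu))
    return abs(candidate - mu) / s, candidate

temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]
G, candidate = grubbs_statistic(temps)

# Approximate two-sided critical value for n = 10, alpha = 0.05 (assumed constant).
G_CRIT = 2.29
if G > G_CRIT:
    print(f"reject H0: {candidate} is an outlier (G = {G:.2f})")
```

Here G ≈ 2.83 exceeds the critical value, so 24.0 is rejected; the test would then be rerun on the remaining nine values.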
Parametric Methods - Detection of Multivariate Outliers
Multivariate data: A data set involving two or more attributes or variables
Transform the multivariate outlier detection task into a univariate outlier
detection problem
Method 1. Compute the Mahalanobis distance
Method 2. Use the χ²-statistic
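Method 1 can be sketched for two dimensions, with the 2×2 covariance inverse written out by hand and the standard χ² critical value for 2 degrees of freedom at α = 0.05. The data set is made up for illustration.

```python
def mahalanobis2d(points):
    """Squared Mahalanobis distance of each 2-D point from the sample mean."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Sample covariance matrix entries.
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    det = sxx * syy - sxy * sxy
    # d^2 = (dx, dy) Sigma^-1 (dx, dy)^T with the 2x2 inverse expanded.
    return [(syy * (x - mx) ** 2 - 2 * sxy * (x - mx) * (y - my)
             + sxx * (y - my) ** 2) / det for x, y in points]

CHI2_2DF_95 = 5.991  # chi-squared critical value, 2 df, alpha = 0.05

pts = [(1, 1), (1, 2), (2, 1), (2, 2), (1, 1.5), (1.5, 1), (2, 1.5), (1.5, 2),
       (1.5, 1.5), (1.2, 1.8), (1.8, 1.2), (1.3, 1.3), (1.7, 1.7),
       (1.4, 1.6), (1.6, 1.4), (6, 6)]
d2 = mahalanobis2d(pts)
outliers = [p for p, d in zip(pts, d2) if d > CHI2_2DF_95]
print(outliers)  # [(6, 6)]
```

This is the univariate reduction the slide describes: each multivariate point is mapped to a single number d², which is then thresholded.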
Parametric Methods - Using a Mixture of Parametric Distributions
Assuming the data are generated by a single normal distribution can sometimes be overly simplistic.
Example: the objects between the two clusters cannot be captured as outliers, since they are close to the estimated mean.
To overcome this problem, assume the normal data are generated by a mixture of two normal distributions.
Non-Parametric Methods: Detection Using Histograms
The model of normal data is learned from the input data without any a priori
structure.
Often makes fewer assumptions about the data, and thus can be applicable in more
scenarios.
Outlier detection using histogram:
◦ Problem: hard to choose an appropriate bin size for the histogram
◦ Too small a bin size → normal objects fall into empty or rare bins: false positives
◦ Too large a bin size → outliers land in some frequent bins: false negatives
◦ Solution: adopt kernel density estimation to estimate the probability density distribution of the data. If the estimated density at an object is high, the object is likely normal; otherwise, it is likely an outlier.
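The kernel density idea can be sketched with a Gaussian kernel. The bandwidth of 0.5 and the data are illustrative assumptions; bandwidth selection is itself a tuning problem, much like bin size.

```python
import math

def kde_scores(data, bandwidth=0.5):
    """Gaussian kernel density estimate evaluated at each data point."""
    n = len(data)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return [norm * sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in data)
            for x in data]

data = [1.0, 1.2, 0.9, 1.1, 1.0, 6.0]
scores = kde_scores(data)
# The isolated point 6.0 receives the lowest estimated density.
print(data[scores.index(min(scores))])  # 6.0
```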
Proximity-Based Approaches
Distance-Based Outlier Detection
Judge a point based on the distance(s) to its neighbors.
Several variants have been proposed.
Basic Assumptions
• Normal data objects have a dense neighborhood.
• Outliers are far apart from their neighbors, i.e., have a less dense
neighborhood.
DB(ε,π)-Outliers
• Basic model [Knorr and Ng 1997]
• Given a radius ε and a percentage π
• A point p is considered an outlier if at
most π percent of all other points have a
distance to p less than ε
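The definition translates almost directly into code. A brute-force 1-D sketch (the data, ε, and π values are made-up illustrations; real implementations use the index-, grid-, or nested-loop-based algorithms below):

```python
def db_outliers(points, eps, pi):
    """DB(eps, pi)-outliers for 1-D points: p is an outlier if at most a
    fraction pi of the other points lie within distance eps of p."""
    n = len(points)
    out = []
    for i, p in enumerate(points):
        close = sum(1 for j, q in enumerate(points) if j != i and abs(q - p) < eps)
        if close <= pi * (n - 1):
            out.append(p)
    return out

data = [1.0, 1.1, 1.2, 0.9, 1.05, 9.0]
print(db_outliers(data, eps=0.5, pi=0.1))  # [9.0]
```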
Distance Based Algorithms
• Index-based [Knorr and Ng 1998]
– Compute the distance range join using a spatial index structure.
– Exclude a point from further consideration if its ε-neighborhood contains more than Card(DB) · π points.
• Grid-based [Knorr and Ng 1998]
– Build a grid such that any two points from the same grid cell are within distance ε of each other.
– Points need only be compared with points from neighboring cells.
Distance Based Algorithms Cont.
• Nested-loop based [Knorr and Ng 1998]
– Divide the buffer into two parts.
– Use the second part to scan/compare all points with the points from the first part.
• Outlier scoring based on kNN distances
– Take the kNN distance of a point as its outlier score [Ramaswamy et al. 2000]
– Aggregate the distances of a point to its 1NN, 2NN, …, kNN as an outlier score [Angiulli and Pizzuti 2002]
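Both scoring variants can be sketched in a few lines for 1-D data (the data set and k = 2 are illustrative assumptions):

```python
def knn_score(points, i, k):
    """kth-nearest-neighbour distance of points[i], as in Ramaswamy et al."""
    dists = sorted(abs(points[i] - q) for j, q in enumerate(points) if j != i)
    return dists[k - 1]

def agg_knn_score(points, i, k):
    """Sum of distances to the 1NN..kNN, as in Angiulli and Pizzuti."""
    dists = sorted(abs(points[i] - q) for j, q in enumerate(points) if j != i)
    return sum(dists[:k])

data = [1.0, 1.1, 1.2, 0.9, 9.0]
scores = [knn_score(data, i, k=2) for i in range(len(data))]
print(data[scores.index(max(scores))])  # 9.0
```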
Density Based Outlier Detection
Local outliers: outliers relative to their local neighborhoods, rather than to the global data distribution.
Example: global criteria such as the DB(ε,π)-outlier model or kNN-distance scores can miss local outliers.
Solution: consider relative density.
Distance-based outlier detection models have problems with data sets of varying density, because they compare neighborhoods of points from areas of different densities on a single global scale.
Density-based methods instead compare the density around a point with the density around its local neighbors.
The relative density of a point compared to its neighbors is computed as an outlier score.
Approaches differ in how they estimate density.
Basic assumptions
 The density around a normal data object is similar to the density around its
neighbors.
 The density around an outlier is considerably different from the density around its neighbors.
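These assumptions can be sketched with a simplified relative-density score in the spirit of LOF (this is not the exact LOF formula; the density estimate, k = 2, and the data are illustrative assumptions, and the points are assumed distinct):

```python
def local_density(p, points, k=2):
    """Inverse of the average distance from p to its k nearest neighbours (1-D)."""
    dists = sorted(abs(p - q) for q in points if q != p)[:k]
    return k / sum(dists)

def relative_density_score(p, points, k=2):
    """Average density around p's neighbours divided by the density around p.
    Scores well above 1 mean p sits in a much sparser region than its neighbours."""
    neighbours = sorted((q for q in points if q != p), key=lambda q: abs(p - q))[:k]
    avg = sum(local_density(q, points, k) for q in neighbours) / k
    return avg / local_density(p, points, k)

data = [1.0, 1.1, 1.2, 1.4, 5.0]
scores = {p: relative_density_score(p, data) for p in data}
# 5.0 scores far above 1; the dense points score near 1.
```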
Clustering-Based Approaches
Methods of Clustering in Outlier Detection
Case 1: The object does not belong to any cluster
◦ Ex. identify animals that are not part of a flock
Case 2: The object is far from its closest cluster
◦ Using k-means, partition the data points into clusters
◦ For each object o, assign an outlier score based on its distance from its closest cluster center
Ex. Intrusion detection: consider the similarity between data points and the clusters in a training data set
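Case 2 reduces to a distance-to-centroid score once the clustering has been run. A 1-D sketch, where the centroids are assumed to come from a prior k-means step and all numbers are made up:

```python
def centroid_scores(points, centroids):
    """Outlier score = distance from each 1-D point to its closest cluster centre."""
    return [min(abs(p - c) for c in centroids) for p in points]

data = [1.0, 1.1, 5.0, 5.2, 9.5]
centers = [1.05, 5.1]                   # e.g. obtained from k-means with k = 2
scores = centroid_scores(data, centers)
print(data[scores.index(max(scores))])  # 9.5
```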
Case 3: Detect outliers in small clusters
◦ Find clusters, and sort them in decreasing size
◦ Assign each data point a cluster-based local outlier factor (CBLOF):
◦ If object p belongs to a large cluster, CBLOF = cluster size × similarity between p and its cluster
◦ If p belongs to a small cluster, CBLOF = cluster size × similarity between p and the closest large cluster
Ex. In the figure, o is an outlier: its closest large cluster is C1, but the similarity between o and C1 is small.
For any point in C3, the closest large cluster is C2, but its similarity to C2 is low; in addition, |C3| = 3 is small.
Advantages and Disadvantages of
Clustering Based Methods
1. Advantages
• Detect outliers without requiring any labeled data
• Work for many types of data
• Clusters can be regarded as summaries of the data
• Once the clusters are obtained, each object need only be compared against the clusters to determine whether it is an outlier (fast)
2. Disadvantages
• Effectiveness depends highly on the clustering method used; such methods may not be optimized for outlier detection
• High computational cost: the clusters must be found first
• One method to reduce the cost: fixed-width clustering
Classification Approaches
Classification-Based Methods
Idea: train a classification model that can distinguish "normal" data from outliers.
A brute-force approach: consider a training set that contains samples labeled "normal" and others labeled "outlier".
But the training set is typically heavily biased: the number of "normal" samples likely far exceeds the number of outlier samples.
Such a model also cannot detect unseen anomalies.
Classification-Based Method I: One-Class Model
A classifier is built to describe only the normal class.
◦ Learn the decision boundary of the normal class using classification methods such as an SVM
◦ Any samples that do not belong to the normal class (i.e., fall outside the decision boundary) are declared outliers
◦ Advantage: can detect new outliers that do not appear close to any outlier objects in the training set
◦ Extension: normal objects may belong to multiple classes
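The one-class idea can be illustrated without an SVM: fit any boundary around the normal training data and declare everything outside it an outlier. Below is a deliberately minimal centre-plus-radius stand-in for a learned decision boundary; the data, the 1.1 safety margin, and the function names are all made-up assumptions.

```python
import statistics

def fit_one_class(train_normal):
    """Minimal one-class model: centre of the normal class plus a radius
    covering it, with a small safety margin. A toy stand-in for an SVM boundary."""
    mu = statistics.mean(train_normal)
    radius = 1.1 * max(abs(x - mu) for x in train_normal)
    return mu, radius

def is_outlier(x, model):
    mu, radius = model
    return abs(x - mu) > radius

normal = [4.8, 5.0, 5.1, 5.2, 4.9]   # training data: normal class only
model = fit_one_class(normal)
print(is_outlier(7.0, model), is_outlier(5.15, model))  # True False
```

Note the advantage stated above: 7.0 is detected even though no outlier resembling it ever appeared during training, because only the normal class was modeled.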
Classification-Based Method II: Semi-Supervised Learning
Combine classification-based and clustering-based methods.
Method:
◦ Using a clustering-based approach, find a large cluster C and a small cluster C1
◦ Since some objects in C carry the label "normal", treat all objects in C as normal
◦ Use the one-class model of this cluster to identify normal objects in outlier detection
◦ Since some objects in cluster C1 carry the label "outlier", declare all objects in C1 outliers
◦ Any object that does not fall into the model for C (such as a) is considered an outlier as well
Challenges for Outlier Detection in High-Dimensional Data
Interpretation of outliers
◦ Detecting outliers without saying why they are outliers is not very useful in high dimensions, because many features (dimensions) are involved
◦ E.g., identify which subspaces manifest the outliers, or provide an assessment of the "outlier-ness" of the objects
Data sparsity
◦ Data in high-dimensional spaces are often sparse
◦ The distance between objects becomes heavily dominated by noise as the dimensionality increases
Data subspaces
◦ Methods should adapt to the subspaces that signify the outliers
◦ And capture the local behavior of the data
Scalability with respect to dimensionality
◦ The number of subspaces increases exponentially with dimensionality
THANK YOU!!