© 2022, AJCSE. All Rights Reserved
RESEARCH ARTICLE
A Comprehensive Study on Outlier Detection in Data Mining
Deepti Mishra1, Devpriya Soni2
1Department of CSE, Noida International University, Noida, Uttar Pradesh, India
2Department of CSE, Jaypee Institute of Information Technology, Noida, Uttar Pradesh, India
Received on: 25-11-2021; Revised on: 30-12-2021; Accepted on: 10-01-2022
ABSTRACT
The paper presents a survey of the literature on outliers and data mining. Its prime focus is to give an outline of outliers and the various approaches for detecting them. Outliers are data points that partially or totally diverge from the rest of the data set. They can be considered data objects that cannot be fitted into any cluster. An outlier can differ from its neighboring data points only, or from the complete data set. Identifying and detecting outliers in a data set is a necessary task because their presence affects the precision of the outcome. Outliers can exist in any kind of data, from low-dimensional to high-dimensional data sets. Detecting an outlier requires precise mathematical and statistical calculations and appropriate domain knowledge, which are presented in the paper. The paper also presents the significant characteristics of outliers.
Key words: Clustering, Data mining, Knowledge discovery, Outlier detection, Outliers
INTRODUCTION
It is exceedingly difficult to manage large-scale collected data, as nowadays there is an abundance of data in databases and data warehouses. At present, data are growing very rapidly. There may be terabytes, petabytes, or exabytes of data consisting of billions to trillions of records of millions of entities gathered from different sources. For such massive data, manual data analysis and data retrieval are very time consuming.[1] This huge amount of data creates the need for data analysis tools, since the data may be stored at different locations. A competent and robust tool can be built using data mining and its techniques. Data mining is an exciting field of computer science, and scientists and researchers see it as a new field for their research.
Data mining is a combination of techniques that are based on the concept of learning. Acquiring knowledge is a big aspect of data mining; it can further be stated that data mining is a part of knowledge discovery. Data mining is the extraction of information from large data stores using many extraction techniques, and it can handle a large amount of information. Data mining, which extracts previously unrevealed predictive knowledge from huge data sets, is a strong, innovative technology that helps organizations manage the most crucial information in their databases. Data mining tools help to find future trends and patterns, empowering businesses to make knowledge-driven decisions.[2] The automated, prospective analysis provided by data mining goes far beyond the analysis of past events provided by the retrospective tools typical of decision support systems. Data mining tools can efficiently resolve business questions that were conventionally extremely time consuming. The tools examine databases for hidden patterns, detecting predictive information that one may otherwise easily miss.
Address for correspondence:
Deepti Mishra
E-mail: itsdeepti.s@gmail.com
It is the process of semi-automatically analyzing huge data sets to detect patterns that are true, useful, fresh, feasible, and easily assimilated.
Everyday life presents a person with innumerable kinds of pattern recognition (PR) problems involving images, faces, smells, voices, situations, and many more.[3] Initially, most of these problems are solved at a sensory level or instinctively, without the help of any explicit technique or algorithm. The moment we have an algorithm, the problem becomes trivial and is handed over to the computer. Machines have now regularly replaced humans in many previously difficult or impossible tasks, with the exception of a few unvarying PR tasks, namely, medical test reading, fingerprint recognition, mail sorting, military target recognition, signature verification, DNA matching, meteorological forecasting, and others. PR is the scientific discipline whose aim is to classify data, objects, and patterns into different categories or classes. It uses both unsupervised and supervised learning.
Available Online at www.ajcse.info
Asian Journal of Computer Science Engineering 2022;7(1):1-10
ISSN 2581 – 3781
AJCSE/Jan-Mar-2022/Vol 7/Issue 1
Mishra and Soni: Outlier Detection
The basic idea is to use data mining techniques to find pertinent and useful patterns in system features that describe program and user behavior, and to use a set of relevant system features to build, by inductive learning, classifiers that detect outliers. Nowadays, outlier detection is an emerging area of data mining. There is no single accurate method for identifying outliers. Outlier detection is currently a widely researched field of data mining that attracts considerable attention. Uncovering outliers in the generated patterns is a pertinent challenge in data mining. Outlier mining is the technique of spotting rare events, deviant data points, and exceptional objects.[4] Outlier analysis is a major data mining topic in knowledge discovery. The terms data mining and knowledge discovery are often used interchangeably. Data mining in database systems refers to the automatic extraction of useful predictive information that is not otherwise conspicuous.[5,6] The paper focuses on the concepts of data mining and outlier detection and on the different approaches to outlier detection.
The work presents a literature survey of outliers in various kinds of data sets. It is designed to review data mining and the various methodologies for outlier detection. We categorize the approaches to outlier detection and present a comprehensive overview of them under various taxonomies.
The rest of the paper presents a literature survey of outliers and the approaches applied for their detection, together with a comparison and analysis.
CUMULATIVE KNOWLEDGE FACTS FOR OUTLIERS
This section provides the preliminary knowledge required for outlier identification and detection. Before proceeding to outliers, let us look at the key areas needed to understand them.
Data warehouse
A data warehouse is a collection of enormous heterogeneous data gathered from multiple sources for further prediction and analysis. The key features of a data warehouse are that it is subject-oriented, integrated, time-variant, and durable.
A data warehouse is fed by the process named extract, transform, and load (ETL), which is key to analyzing data and discovering business insights. Extraction is the procedure of reading and gathering data from different data sources into one database using a tool. The same tool transforms the data into a usable, required form and then writes the data into the target database, a step called load. The data are pre-processed and cleaned during the procedure. The transfer and loading process is applied with the help of data marts. A data mart is a subset of the data warehouse that concentrates on a specific business domain.[7] OLAP operations are applied for data processing.
The data warehouse processes data from a business point of view, such as customer information, sales details, product information, and review results. A data warehouse offers numerous benefits such as easy retrieval of information, consistency of the database, adaptability to change, timely modification, and many more.[8] It provides a foundation for a good decision-making system.[9]
Data mining
Data mining includes various techniques that facilitate the process of generating beneficial information for prediction and analysis from a huge pre-existing database. Data mining comprises different steps such as data pre-processing, learning, and analysis. Data collected from different sources are cleaned, transformed, and reduced to construct a product that is better suited for use, because information may be lost during the collection of enormous data such as geological or numerical data.[10] Data mining is composed of three techniques: classification, clustering, and association rule mining. Classification follows the concept of supervised learning, while clustering applies the concept of unsupervised learning. Decision trees, regression, Bayesian classification, and neural networks are a few classification techniques. K-means, DBSCAN, OPTICS, and Chameleon are a few clustering techniques.
In current times, when data are generated at an exponential rate, it is equally important to process the collected data to separate the useful data from the not so useful data.
Such mined data are extremely useful for innumerable purposes, namely, fraud detection, weather forecasting, market trend prediction, customer purchasing trend identification, and many more.
Knowledge discovery
Knowledge discovery is the acquisition of predictive results from huge data by applying data mining techniques. It is the extraction of knowledge from structured or unstructured data. Data mining and knowledge discovery are directly or indirectly related to each other.
Fundamentally, the patterns that result from applying data mining techniques to reach conclusions are termed knowledge discovery. First, the objective of the knowledge discovery in databases (KDD) procedure is identified from the user's point of view. Afterward, the data are selected to achieve that goal; while selecting the data set, the user needs to choose an appropriate data mining methodology.
OUTLIERS
Outliers are data points that can be notably differentiated from the rest of the data points. In practice, it is assumed that the number of “normal” observations in the given data is considerably larger than the number of “abnormal” observations (outliers/anomalies).
Many authors have suggested definitions of outliers.
Large-scale gathered data need to be processed and managed properly for efficient analysis. The analysis process requires software tools based on data mining techniques. While examining the data, it is necessary to remove unwanted data to obtain a precise outcome. That unwanted data may be considered outliers.
Outliers can enter the data set through malicious activity, as the outcome of faulty experiments, or merely through erroneous entries in the data.[22]
There are many reasons to handle outliers, because they have a significant impact on the result. Some may hide substantial and useful information for further analysis.
Detecting outliers plays a key role in retrieving precise information and has various applications such as intrusion detection, credit card fraud detection, proofing medical diagnosis data, and analysis of satellite images. It has been observed[23] that outliers often share the same mathematical features, so it can be difficult to distinguish between them. Without relevant domain information to explain why the points are outlying and what the underlying data model is doing, it is difficult to identify them as outliers. To find out whether an outlier is revealing or important, other information is needed, for example, accurate mathematical calculation and appropriate background knowledge.[24] Outlier detection algorithms are intended to automatically detect such objects or anomalous observations in large amounts of data. Since there is no standard definition of an outlier, every algorithm is based on a model that rests on a certain assumption of what qualifies as an outlier.
Outliers carry a few characteristics by which they can be distinguished. They can be identified and separated based on characteristics such as whether they exist in a high-density or low-density region and whether they are identified separately or together with other data points. Some key features used to distinguish outliers are mentioned below.[25]
1. Based on neighborhood: Outliers can be categorized as global or local. An outlier is global if its characteristics differ from those of the whole data set, and local if it can be identified only with respect to its neighborhood.
2. Degree of being an outlier: A scalar (binary) value states whether or not a data object is considered an outlier. In contrast, outlierness is a score assigned to the data point that expresses the degree to which the point can be considered an outlier with respect to the other data.
3. Dimensions used for detecting outliers: Univariate data have one feature, and a point is categorized as an outlier only on the basis that this one feature is unusual in relation to the other data. In contrast, multivariate data contain multiple attributes, and a point can be identified as an outlier when the combination of a few attributes has unusual values.
4. Total count of outliers: Different techniques obtain a different number of outliers at a time. Some techniques identify one outlier, remove it, and repeat the process until no more outliers are found. Other strategies find a collection of outliers in a single effort. Both have their advantages and disadvantages.
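The difference between a binary decision and a degree of outlierness (item 2), and between finding outliers one at a time or all at once (item 4), can be sketched as follows. This is only a minimal illustration, assuming a simple z-score as the outlierness score; the function names, the sample data, and the threshold of 2.0 are assumptions made for the example and are not taken from the surveyed papers.

```python
import statistics


def z_scores(values):
    """Outlierness score of every value: absolute distance from the mean in standard deviations."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    return [abs(v - mean) / stdev if stdev else 0.0 for v in values]


def block_detection(values, threshold=2.0):
    """Single pass: turn each score into a binary decision and return all flagged values."""
    return [v for v, s in zip(values, z_scores(values)) if s > threshold]


def sequential_detection(values, threshold=2.0):
    """Remove the most extreme value, recompute the scores, and repeat until nothing is flagged."""
    remaining = list(values)
    found = []
    while len(remaining) > 2:
        scores = z_scores(remaining)
        worst = max(range(len(remaining)), key=scores.__getitem__)
        if scores[worst] <= threshold:
            break
        found.append(remaining.pop(worst))
    return found


data = [10, 11, 12, 11, 10, 12, 11, 50, 60]
print(block_detection(data))
print(sequential_detection(data))
```

On this sample the single pass flags only 60, because the two large values inflate the standard deviation and mask 50, while the sequential variant flags 60 and then 50. The sequential variant recomputes the scores after every removal, so it costs more.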
It is necessary to distinguish outliers according to their characteristics because they can have a significant impact on the result. Sometimes, they may carry valuable knowledge and vital information.
OUTLIER DETECTION APPROACHES AND TECHNIQUES
Outlier detection techniques and approaches have been comprehensively analyzed in previous decades, and many approaches have been developed to identify outliers [Table 1]. There are a few basic approaches for outlier detection: the statistical approach, depth-based approach, distance-based approach, density-based approach, and deviation-based approach. Other approaches include those for spatial data and the graph-based approach. Approaches have also been developed for high-dimensional data.[14,26] Approaches to identifying outliers can be based on supervised learning or unsupervised learning.[27] Some techniques can be applied only to univariate data, while others are beneficial for multivariate data. Before proceeding with further analysis, it is better to identify and remove outliers.
Statistical approach
Statistical algorithms are the foundational algorithms applied for outlier detection and are based on a probability or distribution model for the given data set. The standard deviation, interquartile range, Chi-square test, ANOVA, and Z-score are a few statistical methods used to identify outliers. Some statistical techniques assume a Gaussian distribution of the data. Statistical methods of this kind can be applied to flag rare and unexpected data in bivariate or multivariate data sets. The traditional way is to apply a discordancy test, which assumes a hypothesis and verifies whether or not the data point is an outlier. Another traditional way is to plot the data to identify the odd data point among the rest of the data set. The data points that do not follow the model are stated to be outliers. Barnett and Lewis[12] and Rousseeuw and Leroy[28] described some techniques that are single dimensional. As the dimension increases, it becomes very tough to maintain a model for the data set.
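Two of the statistical methods named above, the Z-score and the interquartile range, can be sketched for univariate data as follows. The function names, the thresholds (2.5 standard deviations and 1.5 times the interquartile range), and the sample readings are illustrative assumptions; the surveyed papers may use other settings.

```python
import statistics


def zscore_outliers(values, threshold=2.5):
    """Return the values whose absolute z-score exceeds the threshold."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]


def iqr_outliers(values, k=1.5):
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


readings = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 95]
print(zscore_outliers(readings))
print(iqr_outliers(readings))
```

Both tests flag the reading 95 here. The Z-score test relies on the mean and standard deviation, which the outlier itself distorts, whereas the interquartile range is based on quartiles and is less sensitive to extreme values.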
Depth-based approach
In the depth-based approach, the data set is organized into different layers. The outer layers are more prone to contain outliers than the inner layers. The minimum volume ellipsoid is a depth-based technique. The probability of each observation is computed and visualized to flag the outliers.
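One simple way to obtain such layers for two-dimensional data is convex hull peeling: the points on the convex hull form layer 1, they are removed, the hull of the remaining points forms layer 2, and so on. The sketch below assumes 2D points given as tuples and uses the monotone chain hull construction; it illustrates the layering idea only and is not the minimum volume ellipsoid method. All names are hypothetical.

```python
def cross(o, a, b):
    """Z-component of the cross product of vectors OA and OB."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Convex hull by the monotone chain method; collinear boundary points are skipped."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts
    lower, upper = [], []
    for p in pts:
        while len(lower) >= 2 and cross(lower[-2], lower[-1], p) <= 0:
            lower.pop()
        lower.append(p)
    for p in reversed(pts):
        while len(upper) >= 2 and cross(upper[-2], upper[-1], p) <= 0:
            upper.pop()
        upper.append(p)
    return lower[:-1] + upper[:-1]


def peeling_depth(points):
    """Map each point to its layer number; layer 1 is the outermost hull."""
    remaining = set(points)
    depth = {}
    layer = 1
    while remaining:
        hull = convex_hull(list(remaining))
        for p in hull:
            depth[p] = layer
        remaining -= set(hull)
        layer += 1
    return depth


square = [(0, 0), (10, 0), (10, 10), (0, 10)]
inner = [(4, 4), (6, 4), (6, 6), (4, 6)]
depth = peeling_depth(square + inner + [(5, 5)])
print(depth)
```

Points with depth 1 lie on the outermost layer and are the first candidates for outliers; a threshold on the depth decides how many layers are treated as outlying.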
Distance-based approach
The distance-based approach, as the name suggests, requires distance calculations, such as the Euclidean distance, for outlier detection. Knorr and Ng proposed an outlier detection technique using k nearest neighbors.[29,30]
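A common way to turn nearest-neighbor distances into an outlier score is to use the Euclidean distance from each point to its k-th nearest neighbor. The sketch below is a brute-force illustration of that idea and is not necessarily the exact formulation of [29,30]; the function name, the choice k = 2, and the sample points are assumptions.

```python
import math


def kth_neighbor_distance(points, k=2):
    """Score each point by the Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)]
scores = kth_neighbor_distance(points, k=2)
most_outlying = points[scores.index(max(scores))]
print(most_outlying)
```

The point with the largest score is the strongest outlier candidate. The double loop compares every pair of points, which is the unnecessary computation that nested-loop, cell-based, and index-based methods try to reduce.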
Some authors also use distance-based approaches to detect outliers, such as nested-loop, cell-based, and index-based methods, which reduce unnecessary computations. A new hybrid approach was introduced that uses a distance-based method and K-means to identify outliers.[15] Initially, this method uses a clustering algorithm, K-means, to divide the data set into several clusters and then uses distance-based methods to find outliers. The scope of the approach requires alterations so that it can be made applicable to text mining. It can also be applied to more complicated and varying data sets.
Another paper presents the new algorithm PLDOF and tries to prove that it is better than LDOF. The K-means clustering algorithm is applied to divide the data set into clusters. The method is based on first calculating the distance of each point from the centroid of its cluster.[13]
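The general pattern shared by these hybrid methods, clustering with K-means and then judging each point by its distance from the centroid of its cluster, can be sketched as follows. This is not the algorithm of [15] or PLDOF itself: the flagging rule (distance greater than a factor times the mean distance within the cluster), the fixed initial centroids, and all names are assumptions chosen to keep the example deterministic.

```python
import math


def kmeans(points, centroids, iterations=20):
    """Plain K-means starting from the given centroids; returns (centroids, clusters)."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            clusters[nearest].append(p)
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters


def centroid_distance_outliers(points, initial_centroids, factor=2.0):
    """Flag points lying farther from their centroid than factor times the cluster's mean distance."""
    centroids, clusters = kmeans(points, list(initial_centroids))
    outliers = []
    for center, cluster in zip(centroids, clusters):
        if not cluster:
            continue
        dists = [math.dist(p, center) for p in cluster]
        mean_dist = sum(dists) / len(dists)
        outliers += [p for p, d in zip(cluster, dists) if mean_dist > 0 and d > factor * mean_dist]
    return outliers


points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11), (14, 15)]
flagged = centroid_distance_outliers(points, [(0, 0), (10, 10)])
print(flagged)
```

Because each point is compared only with the centroid of its own cluster, the number of distance computations stays small compared with a pairwise comparison of all points.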
Density-based approach
The density-based approach evaluates the density of the data region to detect outliers, as outliers generally lie in low-density regions.
Several methodologies have been proposed and implemented for the density-based approach. One algorithm, named outlier finding technique, uses the K-means clustering algorithm to cluster the data set and find outliers. The methodology works on the basis of density and distance values.[19]
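The notion of a low-density region can be illustrated by counting, for each point, how many other points fall within a fixed radius. The sketch below is a deliberately simple density estimate and is not the outlier finding technique of [19]; the radius eps, the minimum neighbor count, and the names are assumptions.

```python
import math


def low_density_points(points, eps=1.5, min_neighbors=2):
    """Return the points that have fewer than min_neighbors other points within distance eps."""
    outliers = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points) if j != i and math.dist(p, q) <= eps
        )
        if neighbors < min_neighbors:
            outliers.append(p)
    return outliers


points = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5)]
print(low_density_points(points))
```

The result depends strongly on eps and min_neighbors, so in practice both are tuned to the scale of the data.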
Deviation-based approach
Arning et al. defined a deviation-based tactic that works by observing the key qualities of the data to detect outliers. Objects that depart from these qualities are declared outliers.[31]
Another paper proposed a definition of outlier called the class outlier, along with a class outlier factor (COF). COF is a ranking score that measures the degree to which a data point is a class outlier. The main factors in computing COF are the probability of the instance's class, the deviation of the instance from instances of the same class, and the distance between the instance and its K neighbors.[4]
Outlier approaches for Spatial data set
A spatial data set includes geographic data, in other words geographic information. It can be defined as data that can be mapped and related to space, for example, buildings, lakes, mountains, and townships.
Cai et al. proposed an iterative self-organizing map with robust distance estimation for spatial outlier detection.[32] Kou et al. proposed spatial-weighted outlier detection algorithms for spatial data sets.[33]
Outliers can be present in meteorological data and affect the valuable information.[34] The author presented a new approach based on wavelet analysis. Wavelet analysis is applied in image and signal processing. The signals and images are observed at various scales to draw conclusions and find distinct patterns.
Graph-based approach
The paper introduced a new algorithm, SWHOT, which is based on a weighted hypergraph model. It is the union of the BSWH algorithm and the CURE algorithm. The algorithm uses the concepts of feature vector and attribute similarity.[17]
High-dimensional data-based approach
Nowadays, high-dimensional data are a big challenge for outlier analysis. It is difficult to filter the irrelevant data out of high-dimensional data and to segregate the outliers at the same time.
Table 1: Various definitions of outliers
S. No. | Authors | Definitions
1. | Hawkins[11] | An outlier is an observed data point that differs so much from the rest of the data set as to arouse suspicion that it was generated by a different mechanism.
2. | Barnett and Lewis[12] | A data object (or subgroup of observations) that appears to be irregular or inconsistent with the remainder of that data set.
3. | Pamula et al.[13] | An outlier is an object or point that does not share the characteristics of the usual data representing the data set.
4. | Guo et al.[14] | An outlier can act abnormally but may carry valuable information.
5. | Pachgade and Dhande[15] | An outlier can be a type of pattern that behaves differently from the rest of the data in the data set.
6. | Aggarwal and Yu[16] | A data point is considered an outlier if it is not identical to the rest of the data on some characteristics.
7. | Li et al.[17] | Outliers are a small quantity of data objects with abnormal behavior in the data set.
8. | Pham and Pagh[18] | Outliers are items that noticeably differ from, and cannot fit in, the general distribution of the data.
9. | Behera et al.[19] | Irregular data points are marked as outliers with respect to the rest of the data if they appear to be produced by faulty experimental circumstances.
10. | Li and Kitagawa[20] | An outlier is an abnormal data point that is greatly different from the rest of the data.
11. | Breunig et al.[21] | Outliers are data points that can be characterized by density, as they lie in a lower-density region than their neighboring points.
Tests are performed on a high-quality data set with an outlier detection algorithm. The new angle-based outlier detection (ABOD) algorithm is proposed and compared with other algorithms. ABOD computes the scalar products between the vectors of the data points and then the variance over the angles calculated between them. No parameters need to be selected to generate the outcome.[35]
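The angle-based idea can be sketched as follows: for a point A, the scalar product of the difference vectors to every pair of other points B and C is normalized by the squared lengths of those vectors, and the variance of these terms is taken as the score. A point far from the rest sees all other points in a similar direction, so its variance is small. This is a simplified, unweighted variant written for illustration, assuming distinct points; the function name and data are hypothetical, and the published ABOD factor additionally weights each term by the distances.

```python
import itertools
import statistics


def angle_based_factor(points, index):
    """Variance of distance-normalized scalar products seen from points[index]; low values suggest an outlier."""
    a = points[index]
    others = [p for i, p in enumerate(points) if i != index]
    terms = []
    for b, c in itertools.combinations(others, 2):
        ab = [bi - ai for ai, bi in zip(a, b)]
        ac = [ci - ai for ai, ci in zip(a, c)]
        dot = sum(x * y for x, y in zip(ab, ac))
        norm_ab_sq = sum(x * x for x in ab)
        norm_ac_sq = sum(x * x for x in ac)
        terms.append(dot / (norm_ab_sq * norm_ac_sq))
    return statistics.pvariance(terms)


points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
factors = [angle_based_factor(points, i) for i in range(len(points))]
most_outlying = points[factors.index(min(factors))]
print(most_outlying)
```

No radius or neighbor count has to be chosen, which matches the parameter-free property noted above, but evaluating all pairs for every point is cubic in the number of points.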
Another approach for outlier detection, introduced by Pham and Pagh, works on high-dimensional data sets. The approach is efficient and scalable to extremely high-dimensional data sets.[18]
Another approach is outlier detection in high dimensions based on projection. It discovers outliers in the data set by applying the concept of projection. In this methodology, clusters are formed by applying the concepts of projection and weight. Smaller clusters are then pruned to detect outliers.[14]
Aggarwal and Yu present a paper that describes the concept of outliers in high-dimensional data sets.[16] Outliers and the techniques for detecting them are discussed in detail in that paper.
Amit Banerjee suggested a new algorithm for outlier detection based on evolutionary search. A genetic algorithm is used to find the objects that are outliers. The method integrates a Euclidean distance-based approach with one partly based on congestion, using the converted value of Lancaster.[36] An advantage of the approach is that it can also be applied to continuous attributes. The scope of the algorithm is to produce a procedure that can detect the absolute count of outliers by applying a fitness function.
For high-dimensional data, the analysis becomes more critical. A new method, COMPREX, is introduced that applies pattern-based compression. The approach is parameter free, general, scalable, and efficient. It is used on categorical databases.[37]
Approaches based on classification
Classification is based on the concept of supervised learning and is a part of data mining and machine learning. In supervised learning, the class label of every training sample is provided. A trained model is used to classify the data set into predefined groups or classes. Classification includes many techniques such as decision trees, neural networks, and statistics. It is an extremely useful process for knowledge discovery because it identifies the classes for new data.
The decision tree approach to classification generally uses statistical techniques, or in other words the concept of clustering, to identify outliers or to classify data precisely. Due to the presence of outliers, the branches of the tree reflect variances in the classes.
Approaches based on clustering
Clustering is based on the concept of unsupervised learning and organizes the data into clusters or groups. Data with similar traits lie in the same cluster, and they must have different traits from the data objects that lie in another cluster. The clusters identified by any clustering algorithm can be distorted by outliers. Due to the presence of outliers, the outcome may vary and the result may be inappropriate. Clustering techniques can identify outliers, for example, distance-based methodologies such as K-means and density-based methodologies such as CURE and DBSCAN. In distance-based methods, the identification of an outlier depends on the distance between the data point and the cluster centroid, which may in turn affect the accuracy of the clusters. The density-based approach instead identifies outliers on the concept of high- and low-density regions. Hence, density-based approaches can be stated to be better than distance-based approaches for identifying outliers.
A new algorithm, Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH), is applied to generate clusters.[38] Pattern identification in large databases is an important task. This methodology works on large data sets to identify patterns. It can also deal with noise. The paper defines two types of attribute: metric and non-metric. It works in four stages: load the data into memory by building a clustering feature (CF) tree, condense it into a smaller CF tree, apply global clustering, and refine the clusters.[38]
The authors present a new approach for clustering in huge databases.[17,39] The key objective of the work is to apply clustering in two stages: a basic and fast step that divides the data into overlapping divisions named “canopies,” and a second step in which distances are measured only between data objects that share a common canopy.
The canopies are subsets of the data points. Canopy-based clustering is suitable for many existing clustering algorithms, for example, K-means, greedy agglomerative clustering, and expectation maximization. In the paper, the canopies are applied with agglomerative clustering.[39]
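The first, fast stage of canopy clustering can be sketched with two distance thresholds: a loose threshold t1 that decides which points join a canopy and a tight threshold t2 that decides which points may no longer start a canopy of their own. The sketch below assumes the Euclidean distance as the cheap measure and picks canopy centers in input order; the function name, the thresholds, and the data are illustrative assumptions.

```python
import math


def build_canopies(points, t1, t2):
    """Return a list of (center, members) canopies; canopies may overlap. Requires t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 must be larger than t2")
    candidates = list(points)
    canopies = []
    while candidates:
        center = candidates[0]
        # Every point within the loose threshold belongs to this canopy.
        members = [p for p in points if math.dist(center, p) <= t1]
        canopies.append((center, members))
        # Points within the tight threshold cannot become centers later.
        candidates = [p for p in candidates if math.dist(center, p) > t2]
    return canopies


points = [(0,), (1,), (2,), (3,), (10,), (11,)]
canopies = build_canopies(points, t1=3, t2=1.5)
for center, members in canopies:
    print(center, members)
```

In the second stage, an exact clustering algorithm only needs to compare points that share a canopy, which is where the saving on huge databases comes from.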
Authors proposed a new algorithm for outlier detection.[40] The approach is defined in different stages. In the initial stage, a modified K-means approach is applied, followed by the construction of a minimum spanning tree. Then, the longest edge is removed, and the farthest data points it connected are considered outliers.
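The minimum spanning tree step can be sketched on its own: build the tree over the points, remove its longest edge, and treat the smaller of the two resulting groups as outliers. The sketch below omits the modified K-means stage of [40], builds the tree with Prim's algorithm, and uses hypothetical names throughout, so it illustrates the idea rather than the published method.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm; returns the tree edges as (length, i, j) index tuples."""
    in_tree = {0}
    edges = []
    while len(in_tree) < len(points):
        best = min(
            (math.dist(points[i], points[j]), i, j)
            for i in in_tree
            for j in range(len(points))
            if j not in in_tree
        )
        edges.append(best)
        in_tree.add(best[2])
    return edges


def longest_edge_outliers(points):
    """Cut the longest tree edge and return the points of the smaller component."""
    if len(points) < 2:
        return []
    edges = minimum_spanning_tree(points)
    longest = max(edges)
    adjacency = {i: [] for i in range(len(points))}
    for edge in edges:
        if edge is not longest:
            _, i, j = edge
            adjacency[i].append(j)
            adjacency[j].append(i)
    # Collect everything still reachable from one end of the removed edge.
    component, stack = {longest[1]}, [longest[1]]
    while stack:
        node = stack.pop()
        for nxt in adjacency[node]:
            if nxt not in component:
                component.add(nxt)
                stack.append(nxt)
    other = set(range(len(points))) - component
    smaller = min(component, other, key=len)
    return [points[i] for i in sorted(smaller)]


points = [(0, 0), (0, 1), (1, 0), (1, 1), (9, 9)]
print(longest_edge_outliers(points))
```

Cutting a single edge always yields exactly two groups; repeating the cut on the remaining tree would separate further distant groups.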
OLAP data cube techniques
OLAP data cube techniques are applied to identify unexpected data points in a multidimensional data model. The model consists of numerical facts and of data termed dimensions. It is used for analysis and computation on data. Data are gathered from various data collections into the data cube. Basic OLAP operations include drill down, roll up, slice, dice, and pivot. Each data cube comprises data cells. Each data cell contains numeric data to which statistical, mathematical, aggregate, and many more functions are applied. If the data in any cell deviate too much from the data in the other cells, that cell value is spotted as an exception or outlier. A data cube is a good way to store data because it offers worthy visualization; for example, the data in a cell can be shown in distinct colors to signify outliers and exceptions.
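The idea of spotting a cell that deviates too much from the other cells can be sketched with a small cube held as a dictionary keyed by dimension values. The dimensions (region and quarter), the sales figures, and the rule of three standard deviations are assumptions made for this example; OLAP tools typically use richer models of the expected cell value.

```python
import statistics


def exceptional_cells(cube, threshold=3.0):
    """Return the cells whose value lies more than threshold standard deviations from the mean of all cells."""
    values = list(cube.values())
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return {}
    return {
        cell: value
        for cell, value in cube.items()
        if abs(value - mean) / stdev > threshold
    }


regions = ["North", "South", "East"]
quarters = ["Q1", "Q2", "Q3", "Q4"]
# Every cell holds the same sales figure except one deliberately unusual cell.
cube = {(region, quarter): 100 for region in regions for quarter in quarters}
cube[("North", "Q3")] = 500
print(exceptional_cells(cube))
```

The returned cells are the ones a visualization would highlight, for example by showing them in a distinct color.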
COMPARATIVE STUDY AND ANALYSIS OF VARIOUS TECHNIQUES OF OUTLIER DETECTION
From the comparison shown in Table 2, it can be seen that different techniques are required to detect outliers in different kinds of databases. The depth-based technique can identify multiple outliers at a time, since it organizes the data in different layers. It is assumed that most outliers lie in the outer layer. Usually, statistical and distance-based techniques work more efficiently on low-dimensional data sets, while graph-based, density-based, and clustering-based techniques can be applied to high-dimensional data sets.
Table 2: Comparison of different approaches of outlier detection
Column groups: Approach (Supervised, Unsupervised, Semi-supervised) | Types of data set (Single, High-dimensional, Hybrid) | Types of outliers (Global, Local) | Number of outliers (Single, Multiple)
Techniques compared: Statistical, Depth based, Distance based, Density based, Deviation based, Graph, Classification based, Clustering based
Most techniques, such as depth based and density based, are based on unsupervised learning, while some techniques, such as classification and deviation based, rely on the concept of supervised learning, since these latter techniques need predefined characteristics to identify odd data points.
Classification-based techniques are based on supervised learning, as they apply operations on a predefined model. On the contrary, clustering-based approaches are based on unsupervised learning, because there is no trained exemplar on which to apply mathematical operations.
In general, graph-based approaches can function efficiently on high-dimensional data sets. They are a part of unsupervised learning. They generate a graph by using a distance function and finding neighboring points.
OUTLIER DETECTION TOOLS
At present, numerous software tools exist to cope with the identification of outliers and the complications they generate, such as WEKA, R Studio, RapidMiner, and Orange.
These free software tools provide a strong statistical environment for data mining techniques that are capable of identifying outliers.[41] In general, the software comprises various machine learning techniques and offers vivid visualization and an excellent platform for data science. These tools are a foundation for data analysis, computation, and visualization. They offer an integrated environment for data preparation, data mining, and knowledge generation.
Outliers can be detected using pre-processing techniques in the WEKA tool. As it is user-friendly software, outliers can be spotted effortlessly by clicking a few buttons such as filter. In R Studio, the user must install packages, which initially takes a few seconds, to execute the functions that detect outliers. One such package, outliers, applies statistical techniques such as mean values and standard deviation to spot outliers. In RapidMiner, the detect outliers function is capable of identifying outliers easily. A comparative study is given in Table 3, where four different software tools are compared on various platforms.
CONCLUSION AND FUTURE WORK
Considering the never-ending rise in data collection and the concomitant need for processed data for scientific and commercial use, the need to develop data mining tools hardly requires re-emphasis. New algorithms capable of detecting outliers with more precision are deeply needed. Furthermore, processed data need to be presented carefully enough to reveal hidden patterns for public use.
Ongoing research efforts in artificial intelligence need to be either combined or utilized at specific points to make knowledge discovery from data mining more productive, scientific, and sensitive.
The identification of outliers is a subtopic of data mining. Outlier analysis is a highly researched field. Outliers are data points that cannot be fitted into any type of cluster. These objects are somehow objectionably unusual compared with the rest of the data set. They may differ from the entire data set or may be difficult to locate. The presence of outliers makes the results confusing. Patterns generated from the data are incorrect and inaccurate because of outliers. Every kind of data brings a new outlier problem and requires different techniques to handle it. A single straightforward approach cannot deal with all kinds of outliers in various fields and subject areas. Hence, we need different approaches to identify and detect outliers. The paper presents a review of outliers and outlier detection techniques, which is valuable for researchers. Outliers may carry vital information or may influence the accuracy of the result. The idea is how to flag the outliers, so as to retain them for further information or remove them to gain an accurate outcome.
Table 3: Comparison of different tools on various platforms
Criterion | WEKA | R Studio | RapidMiner | Orange
Written in | JAVA | C++ | JAVA XML | Python, C++, Cython
Rapidity | FAST | FAST | FAST | FAST
Open source | YES | YES | YES | YES
Large data | Efficient on small data sets, less than R | Efficient on large and small data sets | Efficient, less than R | Efficient on large and small data sets
Visualization | Less options | Many options | Many options | Less options
Outliers | YES | YES | YES | YES
REFERENCES
1. Houari ME, Rhanoui M, Asri BE. Hybrid Big Data
Warehouse for on-demand Decision Needs. Rabat,
Morocco: International Conference on Electrical and
Information Technologies (ICEIT); 2017.
2. Ming J, Zhang L, Sun J, Zhang Y. Analysis Models of
Technical and Economic Data of Mining Enterprises
Based on Big Data Analysis. Chengdu, China: 2018
IEEE 3rd International Conference on Cloud Computing
and Big Data Analysis (ICCCBDA); 2018.
3. Heling J, Yang A, Yan F, Miao H. Research on pattern
analysis and data classification methodology for data
mining and knowledge discovery. Int J Hybrid Inf
Technol 2016;9:179-88.
4. Hewahi NM, Saad MK. Class outliers mining:
Distance-based approach. Int J Comput Infor Eng
2007;1:2805-18.
5. Koteeswaran S, Visu P, Janet J. A review on clustering
and outlier analysis techniques in data mining. Am J
Appl Sci 2012;9:254-8.
6. Velchamy I, Subramanian R, Vasudevan V. A five step
procedure for outlier analysis in data mining. Eur J Sci
Res 2012;75:327-39.
7. Prakash D, Prakash N. A Requirements Driven
Approach to Data Warehouse Consolidation. Brighton,
UK: 11th International Conference on Research
Challenges in Information Science (RCIS); 2017.
8. Bagambiki E. Enterprise Data Warehouse and Business
Intelligence Solution. In: ICEGOV’18 Proceedings
of the 11th International Conference on Theory and
Practice of Electronic Governance; 2018.
9. Farooqui NA, Mehra R. Design of a Data Warehouse
for Medical Information System Using Data
Mining Techniques. Solan Himachal Pradesh,
India: 2018 5th International Conference on Parallel,
Distributed and Grid Computing (PDGC); 2019.
10. Lgkklw EA, Lingling Z. A BP neural network based
method for geological missing data processing. Gold
Sci Technol 2015;23:53-8.
11. Hawkins DM. Identification of Outliers. Berlin,
Germany: Springer; 1980.
12. Barnett V, Lewis T. Outliers in Statistical Data.
Hoboken, New Jersey: Wiley; 1994.
13. Pamula R, Deka JK, Nandi S. An Outlier Detection
Method Based on Clustering. Kolkata, India: Second
International Conference on Emerging Applications of
Information Technology; 2011.
14. Guo P, Dai JY, Wang YX. Outlier Detection in High
Dimension Based on Projection. Dalian, China:
International Conference on Machine Learning and
Cybernetics; 2006.
15. Pachgade SD, Dhande SS. Outlier detection over data
set using cluster-based and distance-based approach. Int
J Adv Res Comput Sci Softw Eng 2012;2:12-6.
16. Aggarwal CC, Yu PS. Outlier Detection for High
Dimensional Data. Santa Barbara, California,
USA: ACM SIGMOD International Conference on
Management of Data; 2001.
17. Li Y, Wu D, Ren J, Hu C. An Improved Outlier Detection
Method in High-dimension Based on Weighted
Hypergraph. In: Second International Symposium on
Electronic Commerce and Security; 2009.
18. Pham N, Pagh R. A near-linear Time Approximation
Algorithm for Angle Based Outlier Detection in High
Dimensional Data. Beijing, China: Proceedings of
the 18th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining; 2012.
19. Behera HS, Ghosh A, Mishra SK. A new hybridized
k-means clustering based outlier detection technique
for effective data mining. Int J Adv Res Comput Sci
Softw Eng 2012;2:287-92.
20. Li Y, Kitagawa H. DB-outlier Detection by Example
in High Dimensional Datasets. Istanbul, Turkey:
IEEE International Workshop on Databases for Next
Generation Researchers; 2007.
21. Breunig MM, Kriegel HP, Ng RT, Sander J. LOF:
Identifying Density Based Local Outliers. In:
Proceedings of the 2000 ACM SIGMOD International
Conference on Management of Data: Dallas, Texas,
USA; 2000.
22. Hodge VJ, Austin J. A survey of outlier detection
methodologies. Artif Intell Rev 2004;22:85-126.
23. Cheng JG. Outlier Management in Intelligent Data
Analysis. London: University of London; 2000.
24. Zimek A, Campello R, Sander J. Ensembles for
Unsupervised Outlier Detection: Challenges and
Research Questions. SIGKDD Explorations Newsletter;
2013. p. 11-23.
25. Zhang Y, Meratnia N, Havinga P. A Taxonomy
Framework for Unsupervised Outlier Detection
Techniques for Multi-Type Data Sets. CTIT Technical
Report Series; No. Paper P-NS/TR-CTIT-07-79.
Enschede: Centre for Telematics and Information
Technology (CTIT), 2007.
26. Xi J. Outlier Detection Algorithms in Data Mining. In:
2nd International Symposium on Intelligent Information
Technology Application: Shanghai, China; 2008.
27. Zhu C, Kitagawa H, Papadimitriou S, Faloutsos C.
Outlier detection by example. J Intell Inf Syst
2011;36:217-47.
28. Rousseeuw PJ, Leroy AM. Robust regression and
outlier detection. In: Wiley Series in Probability and
Statistics. Hoboken, New Jersey: Wiley; 1987.
29. Knorr EM, Ng RT. Finding Intensional Knowledge
of Distance Based Outliers. Edinburgh, Scotland:
Proceedings of the 25th VLDB Conference; 1999.
30. Knorr EM, Ng RT, Tucakov V. Distance based
outlier: Algorithm and applications. VLDB J Int J
2000;8:237-53.
31. Arning A, Agrawal R, Raghavan P. A linear Method
for Deviation Detection in Large Databases. Portland,
Oregon: Proceedings of the Second International
Conference on Knowledge Discovery and Data Mining;
1996.
32. Cai Q, He H, Man H. Spatial outlier detection
based on iterative self organizing learning model.
Neurocomputing 2013;117:161-72.
33. Kou Y, Lu CT, Chen D. Spatial Weighted Outlier
Detection. Bethesda: Proceedings of SIAM Conference
on Data Mining April 20-22; 2006.
34. Zhao J, Lu CT, Kou Y. Detecting Region Outliers in
Meteorological Data. In: GIS 03 Proceedings of the
11th ACM International Symposium on Advances
in Geographic Information Systems: New Orleans,
Louisiana, USA; 2003.
35. Kriegel HP, Schubert M, Zimek A. Angle Based Outlier
Detection in High Dimensional Data. Las Vegas,
Nevada, USA: KDD ’08 Proceedings of the 14th ACM
SIGKDD International Conference on Knowledge
Discovery and Data Mining; 2008.
36. Banerjee A. Density-based Evolutionary Outlier
Detection. Philadelphia, PA, USA: GECCO’12; 2012.
37. Akoglu L, Tong H, Vreeken J, Faloutsos C. Fast and
Reliable Anomaly Detection in Categorical Data. Maui,
HI, USA: CIKM’12; 2012.
38. Zhang T, Ramakrishnan R, Livny M. BIRCH: An
Efficient Data Clustering Method for Very Large
Databases. Montreal, Quebec, Canada: SIGMOD ’96
Proceedings of the 1996 ACM SIGMOD International
Conference on Management of Data; 1996.
39. McCallum A, Nigam K, Ungar LH. Efficient Clustering of
High-dimensional Data Sets with Application to Reference
Matching. Boston, Massachusetts, USA: Proceedings of
the Sixth ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining; 2000.
40. Jiang MF, Tseng SS, Su CM. Two-phase Clustering
Process for Outliers Detection. Amsterdam,
Netherlands: Pattern Recognition Letters, Elsevier;
2001. p. 691-700.
41. Al-Khoder A, Harmouch H. Evaluating four of the most
popular open source and free data mining tools. Int J
Acad Sci Res 2015;3:13-23.

A Comprehensive Study on Outlier Detection in Data Mining

  • 1. © 2022, AJCSE. All Rights Reserved 1 RESEARCH ARTICLE A Comprehensive Study on Outlier Detection in Data Mining Deepti Mishra1 , Devpriya Soni2 1 Department of CSE, Noida International University, Noida, Uttar Pradesh, India, 2 Department of CSE, Jaypee Institute of Information Technology, Noida, Uttar Pradesh, India Received on: 25-11-2021; Revised on: 30-12-2021; Accepted on: 10-01-2022 ABSTRACT The paper presents a survey on the literature of outliers and data mining. The prime focus of the paper is to deliver an outline of outlier with various approaches for its detection. Outliers are the data points which are partially or totally diverge from the residue data set. They can be considered as those data objects which cannot be fitted in any cluster. Outliers can be different from its neighboring data points only or from complete data set. It is a necessary task to identify and detect outliers from the dataset as their presence effect the preciseness of the outcome. Outliers can exist in any kind of data varies from low-dimensional to high-dimensional data set. The detection of an outlier requires some precise mathematical calculations, appropriate domain knowledge, and statistical calculations which are presented in the paper. The paper presents, the significant characteristics of the outliers. Key words: Clustering, Data mining, Knowledge discovery, Outlier detection, Outliers INTRODUCTION It is exceedingly difficult to manage large scale collected data, as nowadays, there is abundance of data in the data base or data warehouses. At present, increment in data is going on very rapidly. It might be petabytes, terabytes, or exabytes of data consisting of billion to trillions of records of million entities gathered from different sources. For such massive data, the manual data analysis and data retrieval system are very time consuming.[1] This huge amount of data generates necessity of data analysis tool since data may be located at different locations. 
A competent and robust tool can be built using data mining with its techniques. Data mining is an exciting field of computer science. Scientists and researchers find it as a new field for their research. Dataminingisthecombinationoftechniqueswhich are further based on concept of learning.Acquiring knowledge is a big aspect of data mining. Further, it can be stated as the data mining is the part of knowledge discovery. Data mining is the extraction of information from a large data storage which includes many extraction techniques. It can handle Address for correspondence: Deepti Mishra E-mail: itsdeepti.s@gmail.com a large amount of information. Data mining, that extracts unrevealed predictive knowledge from huge datasets, is a strong innovative technology with influential strength to help organizations for managing the most crucial information in their data bases. The data mining tools help to find out the future trends and patterns, empowering businesses to conclude knowledge generated decisions.[2] The automated potential analysis provided by data mining is far ahead of the analyses of data provided by retroactive tools mainly of decision support systems. The tools used for data mining techniques can resolve conventionally and extremely time- consuming issues related to business in efficient way. The tools examine databases for hidden patterns, detecting predictive information that one may miss easily otherwise. It is the process of semi-automatically analyzing huge datasets to detect patterns which are: true, useful, fresh, feasible, and easily assimilated. Everyday life of a person goes through an innumerable kinds of pattern recognition (PR) problems such as images, faces, smells, voices, situations, and many more.[3] Initially, most of these issues are corrected at a sensory level or instinctively, without any help of an explicit technique or algorithm. 
The moment we get an algorithm the problem or issue becomes trivial and is then handed over to the computer. Now, Available Online at www.ajcse.info Asian Journal of Computer Science Engineering 2022;7(1):1-10 ISSN 2581 – 3781
  • 2. AJCSE/Jan-Mar-2022/Vol 7/Issue 1 2 Mishra and Soni: Outlier Detection machines have regularly replaced humans in many previously difficult or impossible way, except only in few unvarying PR tasks, namely, medical test reading, fingerprint recognition, mail sorting, military target recognition, signature verification, DNA matching, meteorological forecasting, and others. PR is the scientific discipline whose aim is to classify data, objects, and patterns into different categories or classes. It utilizes abilities of – unsupervised learning – supervised learning. The basic ideology is to use data mining techniques to find pertinent and useful patterns of system virtues that elucidate program and user nature, with utilization of set of relevant system features to summarize (learning by induction) classifiers which detect outliers. Nowadays, outlier and its detection are an emerging area of data mining. There is no accurate method to identify outlier. At present in data mining, outlier detection is the widely research field which entails excessive attention. Uncovering outliers carried out in generated patterns is a pertinent hindrance in the data mining. Outlier mining is the technique of spotting rare events, deviant data points, and exceptional objects.[4] Outlier analysis is a major data mining topic in knowledge discovery. Data mining and knowledge discovery can be used interchangeably. Data mining in database systems points to self-extraction of useful predictive information that is not conspicuous otherwise.[5,6] The focus of the paper goes around the concept of data mining, outlier detection, and its different approaches. The work presents the literature survey of outliers in various kinds of data set. The work is designed to demonstrate the study on data mining and various methodologies for outlier detection. We categorized and presented a comprehensive overview of approaches of outlier detection in various taxonomy. 
Rest of the paper presents literature survey of outliers with approaches applied for detection additionally with comparison and analysis. CUMULATIVE KNOWLEDGE FACTS FOR OUTLIERS This section provides preliminary knowledge required for outlier identification and detection. Before proceeding to outlier, let have a look on key areas for awareness of outliers. Data warehouse Data warehouse is the collection of enormous heterogenous data gathered from multiple sources for further prediction and analysis. Key features of data warehouse are that it is subject-oriented, integrated, time variant, and durability. Data warehouse facilitates the process named extract, transform, and load, that is, ETL which is a key for analyzing and discovering business perceptions. Extraction is the procedure to read and gather data from different data sources in a one database using a tool. The same tool transforms the data into usable and required form which is followed by writing the data into target database called load. The data are pre-processed and cleaned during the procedure. The transferring and loading process is applied with the help of data marts. It can be stated as data mart as subset of data warehouse which concentrates on specific domain of business.[7] OLAP operations are applied for data processing. It processes the data, incorporate in business point of view such as customers’ information, sale details, products information, and review results. Data warehouse covers numerous benefits such as easy retrieval of information, consistency in database, adaptable to change, modifications in timely manner, and many more.[8] It provides a foundation for good decision-making system.[9] Data mining Data mining includes various techniques to facilitates the process of generating beneficial information for prediction and analysis from huge pre-existing database. Data mining comprises different steps such as data pre-processing, learning, and analysis. 
Data collected from different sources are cleaned, transformed, and reduced to construct a product which is better for the use. As during collection of enormous data, there are chances of loss of information such as geological data, numerical data, etc.[10] It is composed of three techniques classification, clustering, and association rule mining. The classification followed the concept of supervised learning, while clustering is applied on the concept of unsupervised learning. Decision trees, regression, Bayesian classification, and neural network are few methodologies and techniques of
  • 3. AJCSE/Jan-Mar-2022/Vol 7/Issue 1 3 Mishra and Soni: Outlier Detection classification. K-means, DBSCAN, OPTICS, and Chameleon are few techniques of clustering. In the current times, when data generation is occurring at an exponential rate, it is equally important to treat the data collected to harvest out the useful data from the not so useful data. Such mined out, data are extremely useful for innumerable purposes, namely, fraud detection, weather forecasting, market trend predictions, customer purchasing trend identification, and many more. Knowledge discovery Knowledge discovery is acquiring and achieving theresultsofpredictionfromhugedatabyapplying data mining techniques. It is the extraction of knowledge from structured or unstructured data. Data mining and knowledge discovery are directly or indirectly related to each other. Fundamentally, the resulted patterns generated after applying data mining techniques to find some conclusions are termed as knowledge discovery. Initially, the objective of KDD procedure is to be identified according to users’ point of view. Afterward, the selection of data is accomplished to achieve the goal, while selecting the data set user needs to ponder in right direction with exact methodology of data mining. OUTLIERS Outliers are the data points which can be notably differentiated from the residue of the statistics of data points. It is assumed practically that the number of “normal” observations are considerably more than “abnormal” observations (outliers/ anomalies) in the given data. There are numerous definitions of outliers suggested by many authors. Large-scaled gathered data needs to be processed and managed properly for efficient analysis. The analysis process requires software tools based on techniques of data mining. During the process of examining of data, it is mandatory to remove unwanted data for precise outcome.That unwanted data may be considered as outliers. 
Outliers can be flowed into the data set may be by malicious activity, due to outcomes of faulty experiments and merely by erroneous entries in the data.[22] There are many reasons to handle outliers, because they have significant impact on the result. Some may hide substantial and useful information for further analysis. Detecting outliers play a key role in retrieving precise information and have various applications such as intrusion detection, credit card fraud detection, proofing medical diagnosis data, and analysis of satellite images. It has[23] observed as outliers often share the same mathematical features, so it can be difficult to distinguish between them. If there is a lack of relevant domain information that leads to the idea of deciding why they are outside and different and what the data model is doing below it means that it is difficult to identify them as outsiders. To find out, if an outlier is revealing or important other information is needed, for example, an accurate mathematical calculation and appropriate background.[24] External detection algorithms are intended to automatically detect those objects or disturbing visions in large amounts of data. Since there is no standard definition of an outlier, so the whole algorithm is based on a model based on a certain guess of what qualifies as a trader. Outliers carry few characteristics so that they can be distinguished. Outliers can be identified and separated based on their characteristics such as existence in high-density or low-density region, identified separately or with other data points. Some key feature of outliers to distinguish them are mentioned below.[25] 1. Based on neighborhood – They can be categorized as global and local outliers. They can be stated as global having different characteristics from whole data set, and local if they can be identified only from their neighborhood. 2. 
Approach of degree for knowing outlier – A SCALAR (binary) value is used to detect whether the data object can be considered as an outlier or not. The degree for considering data point as outliers implies that to generate the score of the data point for declaring it as an outlier. In contrast, OUTLIERNESS signifies the visualization of characteristics and degree, on which the point can be considered as an outlier in respect with other data. 3. Approach of dimensions for detecting outliers – UNIVARATE data can be categorized as, with one feature that can be categorized
  • 4. AJCSE/Jan-Mar-2022/Vol 7/Issue 1 4 Mishra and Soni: Outlier Detection externally only on the basis that one feature is unique in relation to that other data. In contrast, MULTIVARIATE data contains multiple attributes and can be considered as for identifying outliers as the combination of few factors have unusual data values. 4. Total count of outliers – The number of outliers can be obtained by different techniques in different ways per second. A few tricks identify one outlier and remove it and repeat the process until more exits are found. Some strategies find a collection of outliers in a single effort. Both have their advantages and disadvantages. The necessity and requirement to distinguish the outliers according to their characteristics because they can have significant impact on the result. Sometimes, they may have valuable knowledge and vital information. They may have valuable knowledge and vital information. OUTLIER DETECTION APPROACHES AND TECHNIQUES Outlier detection techniques and approaches have been comprehensively analyzed in the previous decadesandmanyapproacheshavebeendeveloped to identify outliers [Table 1]. There are few basic approaches for outlier detection – statistical approach, depth-based approach, distance-based approach, density-based approach, and deviation- based approach. Other approaches are – for spatial data, graph-based approach. Approaches are also developed for high-dimensional data.[14,26] It can be defined that approaches to identify outliers are based on supervised learning and unsupervised learning.[27] Some techniques can be applied only on univariate data or some technique are beneficial for multivariate data for outlier detection. Before proceeding further, it is better to remove outliers. Statistical approach Statistical algorithms are the foundation and basic algorithms applied for outlier detection, which were based on probability or distribution model for the given data set. 
The standard deviation, interquartile range, Chi-square test, ANNOVA, and Z-score, these few are methods of statistics to identify outliers. Some of the techniques of statistics assume Gaussian distribution of data. These kinds of statistical methods can be applied to notify rare and unexpected data in bivariate or multivariate dataset. The traditional way is to apply discordancy method by assuming the hypothesis which verifies that whether the data point as outlier or not. A traditional way is to plot the data to identify the odd data point from residue data set. The data points that does not follow the model stated as the outliers. Barnett and Lewis[12] and Rousseuw and Leroy[28] described some techniques which are single dimensional. As the dimension increases, it is very tough to manage the model for data set. Depth-based approach In depth-based approach, the data set is organized into different layers. The outer layer is more prone to outliers in comparison to inner layers. Minimum volume ellipsoid is based on depth- based technique. The probability of observation is computed and visualized to notify the outlier. Distance-based approach The distance-based approach as name suggested requires distance calculations for outlier detection such as Euclidean distance. Knorr and Ng proposed the outlier detection technique using k nearest neighbors.[29,30] Some authors also use a distance-based approach to detect external objects such as nested loop, cell-based, index-based methods that reduce unnecessary computations. A new hybrid approach was introduced using a distance-based method and a k-way to identify outliers.[15] Initially, this method uses a different integration algorithm which means K-means dividing the data set into several clusters and then using distance- based methods to find outliers. The scope of the approach necessitates alterations so that it can be made applicable to the textual mining. It can be applied on more complicated data set. 
It can also be applied to varying data sets. Another paper presents the new algorithm PLDOF and attempts to show that it is better than LDOF. The K-means clustering algorithm is applied to divide the data set into clusters. The method first calculates the distance of each point from the centroid of its cluster.[13]
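As an illustration of the distance-based idea, the following minimal Python sketch scores each point by the Euclidean distance to its k-th nearest neighbor and flags the points whose score exceeds a threshold. It is not the algorithm of any paper cited above; the function names, the value of k, and the threshold are assumptions chosen only for illustration.

```python
import numpy as np


def kth_neighbor_distances(points, k):
    """Euclidean distance from each point to its k-th nearest neighbor."""
    pts = np.asarray(points, dtype=float)
    if not 1 <= k < len(pts):
        raise ValueError("k must satisfy 1 <= k < number of points")
    # Pairwise differences, shape (n, n, d), then pairwise distances, shape (n, n).
    diffs = pts[:, None, :] - pts[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # After sorting each row, column 0 is the zero distance of a point to itself,
    # so column k holds the distance to the k-th nearest other point.
    return np.sort(dists, axis=1)[:, k]


def distance_based_outliers(points, k, threshold):
    """Indices of points whose k-th neighbor distance exceeds `threshold`."""
    scores = kth_neighbor_distances(points, k)
    return [int(i) for i in np.flatnonzero(scores > threshold)]


# Hypothetical toy data: four points on a unit square and one distant point.
data = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]]
print(distance_based_outliers(data, k=2, threshold=3.0))  # prints [4]
```

The full pairwise distance matrix costs quadratic time and memory, which is the kind of computation that the nested-loop, cell-based, and index-based variants mentioned above aim to reduce.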
Density-based approach

The density-based approach evaluates the density of the data region to detect outliers, as outliers generally lie in low-density regions. Several methodologies have been proposed and implemented for the density-based approach. One algorithm, named outlier finding technique, uses the K-means clustering algorithm to cluster the data set and to find outliers. The methodology is based on density and distance values.[19]

Deviation-based approach

Arning et al. defined a deviation-based approach that observes the key characteristics of the data to detect outliers. Objects that deviate from these characteristics are declared outliers.[31] Another paper proposed a definition of outlier called the class outlier, together with a class outlier factor (COF). COF is a ranking score that measures the degree to which a data point is a class outlier. The main factors in computing COF are the probability of the instance's class, the deviation of the instance from instances of the same class, and the distance between the instance and its K neighbors.[4]

Outlier approaches for Spatial data set

A spatial data set includes geographic data, in other words geographic information. It can be defined as data which can be mapped and related to space, for example, buildings, lakes, mountains, and townships. Cai et al. proposed an iterative self-organizing map with robust distance estimation for spatial outlier detection.[32] Kou et al. proposed spatial-weighted outlier detection algorithms for spatial data sets.[33] Outliers can be present in meteorological data and affect the valuable information.[34] The authors presented a new approach based on wavelet analysis. Wavelet analysis is applied to study images and signal processing.
The signals and images are observed at various levels of focus to generate conclusions and distinct patterns.

Graph-based approach

One paper introduced a new algorithm, SWHOT, which is based on a weighted hypergraph model. It is the union of the BSWH algorithm and the CURE algorithm. The algorithm provides the concepts of feature vector and attribute similarity.[17]

Table 1: Various definitions of outliers
S. No. | Authors | Definition
1 | Hawkins[11] | An outlier is an observed data point which differs so much from the rest of the data set as to arouse suspicion that it was generated by a different mechanism.
2 | Barnett and Lewis[12] | A data object (or subgroup of observations) which appears to be irregular or inconsistent with the remainder of that data set.
3 | Pamula et al.[13] | An outlier is an object or point which does not share the characteristics of the usual data that make up the data set.
4 | Guo et al.[14] | An outlier can behave abnormally but may contain valuable information.
5 | Pachgade and Dhande[15] | An outlier can be a type of pattern that behaves differently from the residue data in the data set.
6 | Aggarwal and Yu[16] | A data point is considered an outlier if it is not identical with the residue data on some characteristics.
7 | Li et al.[17] | Outliers are a small quantity of data objects with abnormal behavior in the data set.
8 | Pham and Pagh[18] | Outliers are items that noticeably differ and cannot fit in the general distribution of the data.
9 | Behera et al.[19] | Irregular data points are marked as outliers in the residue data if it appears that they were generated by faulty circumstances in experiments.
10 | Li and Kitagawa[20] | An outlier is an abnormal data point which is greatly different from the rest of the data.
11 | Breunig et al.[21] | Outliers are data points which can be categorized by density, as they lie in a lower-density region compared with their neighboring points.

High-dimensional data-based approach

Nowadays, high-dimensional data are a big challenge for outlier analysis. It is difficult to filter out the irrelevant data from high-dimensional data while concurrently segregating outliers. Tests are performed on high-quality data sets by an outlier detection algorithm. The angle-based outlier detection (ABOD) algorithm is proposed and compared to other algorithms. ABOD computes the scalar product between the vectors of the data points and, furthermore, computes
the variance over the angles which are calculated over the data points. No parameter selection is required to generate the outcome.[35] Another approach for outlier detection was introduced by Pham and Pagh, which functions on high-dimensional data sets. The approach is efficient and scalable to extremely high-dimensional data sets.[18] Another approach is outlier detection in high dimension based on projection. It discovers outliers in the data set by applying the concept of projection. In this methodology, clusters are formed by applying the concepts of projection and weight. Smaller clusters are pruned, and outliers are then detected.[14] Another paper describes the concept of outliers in high-dimensional data sets.[16] Outliers and their detection techniques are discussed in detail in that paper. Amit Banerjee suggested a new algorithm for outlier detection which is based on evolutionary search. A genetic algorithm is used to find the outliers. The method integrates a Euclidean distance-based approach, partly based on congestion derived from the converted value of Lancaster.[36] An advantage of this approach is that it can also be applied to continuous attributes. The scope of the algorithm is to generate a procedure that can detect the absolute count of outliers by applying a fitness function. For high-dimensional data, the analysis becomes more critical. A new method, COMPREX, is introduced which applies pattern-based compression. The approach is parameter free, general, scalable, and efficient. It is used on categorical databases.[37]

Approaches based on classification

Classification is based on the concept of supervised learning, which is part of data mining and machine learning. In classification techniques, there are predefined trained models. In supervised learning, the class description of every training sample is provided.
It is used to classify the data set into predefined groups of classes or models. Classification includes many techniques such as decision trees, neural networks, and statistics. In classification, there are predefined groups of models or classes to classify the new data. It is an extremely useful process for knowledge discovery, as it identifies the classes and models for new data. The decision tree approach to classification generally uses statistical techniques, or in other words the concept of clustering, to identify outliers or to classify data precisely. Due to the presence of outliers, branches of the tree reflect variances in the classes.

Approaches based on clustering

Clustering is based on the concept of unsupervised learning and groups the data into clusters. Data with similar traits lie in the same cluster, but they must have different traits from data objects which lie in another cluster. The clusters identified by any clustering algorithm can be affected by outliers. Due to the presence of outliers, there may be variation in the outcome and the result may be inappropriate. Clustering techniques can identify outliers, for example, distance-based methodologies such as K-means and density-based methodologies such as CURE and DBSCAN. In distance-based clustering, the identification of an outlier depends on the distance between the data point and the cluster centroid, which may in turn affect the accuracy of the clusters. In contrast, the density-based approach identifies outliers on the concept of high- and low-density regions. Hence, it can be stated that density-based approaches are better than distance-based approaches for identifying outliers. The algorithm Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is applied to generate clusters.[38] Pattern identification is an important task in large databases. This methodology works on large data sets to identify patterns. It can also deal with noise.
The paper defines two types of attribute: metric and non-metric. It works in four stages: load the database into memory by building a CF tree, condense it into a smaller CF tree, apply global clustering, and refine the clusters.[38] The authors present a new approach for clustering in huge databases.[17,39] The key objective of the work is to apply clustering in two stages: a basic and fast step that divides the data into overlapping subsets named "canopies," followed by a step in which distances are measured only between those data objects that share a common canopy.
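The first, cheap stage of the canopy idea can be sketched in Python as follows. This is a simplified illustration under stated assumptions rather than the cited authors' implementation: the thresholds `t1` (loose) and `t2` (tight), the function name, and the use of plain Euclidean distance as the inexpensive distance are all choices made only for this example.

```python
import math


def build_canopies(points, t1, t2):
    """Group points into overlapping canopies using two thresholds, t1 > t2 >= 0.

    Returns a list of (center_index, member_indices) pairs.
    """
    if not t1 > t2 >= 0:
        raise ValueError("thresholds must satisfy t1 > t2 >= 0")
    remaining = list(range(len(points)))  # indices still eligible to be centers
    canopies = []
    while remaining:
        center = remaining[0]
        # Every point within the loose threshold joins this canopy,
        # so one point may belong to several canopies.
        members = [i for i in range(len(points))
                   if math.dist(points[center], points[i]) <= t1]
        canopies.append((center, members))
        # Points within the tight threshold can no longer start a new canopy.
        remaining = [i for i in remaining
                     if math.dist(points[center], points[i]) > t2]
    return canopies


# Hypothetical one-dimensional layout written as 2-D points.
pts = [(0, 0), (1, 0), (4, 0), (5, 0), (20, 0)]
for center, members in build_canopies(pts, t1=4.5, t2=2.0):
    print(center, members)
```

In the second stage, an expensive distance would be computed only for pairs of points that appear together in at least one canopy, which is where the saving in distance computations comes from.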
The canopies are subsets of the data points. Clustering based on canopies can be combined with many existing clustering algorithms, for example, K-means, greedy agglomerative clustering, and expectation maximization. In the paper, the canopies are applied with agglomerative clustering.[39] Other authors proposed a new algorithm for outlier detection.[40] The approach is defined in stages. In the initial stage, a modified K-means approach is applied, followed by the construction of a minimum spanning tree. The longest edge is then removed, and the farthest data points it connected are considered outliers.

OLAP data cube techniques

OLAP data cube techniques are applied to identify unexpected data points in a multidimensional data model. A data cube consists of numerical facts and of data termed dimensions. It is used for analysis and computations on data. Data are gathered from various data collections into the data cube. Basic OLAP operations include drill down, roll up, slice, dice, and pivot. Each data cube comprises data cells. Each data cell contains numeric data to which statistical, mathematical, aggregate, and many other functions are applied. If the data in any cell deviate too much from the data in other cells, then that cell value is flagged as an exception or outlier. A data cube is a good way to store data as it offers useful visualization; for example, data in cells can be shown in distinct colors to signify outliers and exceptions.

COMPARATIVE STUDY AND ANALYSIS OF VARIOUS TECHNIQUES OF OUTLIER DETECTION

From the comparison shown in Table 2, it can be seen that different techniques are required to manage different kinds of databases to detect outliers. The depth-based technique can identify multiple outliers at a time, since it organizes the data in different layers. It is assumed that the maximum number of outliers lie in the outer layer.
Usually, statistical and distance-based techniques work more efficiently on low-dimensional data sets, while techniques such as graph-based, density-based, and clustering-based ones can be applied to high-dimensional data sets.

Table 2: Comparison of different approaches of outlier detection
Comparison columns:
Approach: Supervised | Unsupervised | Semi-supervised
Types of data set: Single | High-dimensional | Hybrid
Types of outliers: Global | Local
Number of outliers: Single | Multiple
Techniques (rows): Statistical; Depth based; Distance based; Density based; Deviation based; Graph; Classification based; Clustering based
Most techniques are based on unsupervised learning, such as the depth-based and density-based techniques, while some techniques, such as the classification-based and deviation-based ones, rely on supervised learning, since these latter techniques need predefined characteristics to identify the odd data points. Classification-based techniques are based on supervised learning as they apply operations on a predefined model. In contrast, clustering-based approaches are based on unsupervised learning, because there is no trained exemplar on which to apply mathematical operations. In general, graph-based approaches can function efficiently on high-dimensional data sets. They are part of unsupervised learning. They generate a graph using a distance function and by finding neighboring points.

OUTLIER DETECTION TOOLS

At present, numerous software tools exist to cope with the identification of outliers and the complications they generate, such as WEKA, R Studio, RapidMiner, and Orange. This free software provides a strong statistical environment for applying data mining techniques, which are in turn capable of identifying outliers.[41] In general, the software comprises various machine learning techniques and offers vivid visualization and an excellent platform for data science. These tools are a foundation for data analysis, computation, and visualization. They offer an integrated environment for data preparation, data mining, and generating knowledge. Outliers can be detected using pre-processing techniques in the WEKA tool. As it is user-friendly software, outliers can be spotted effortlessly by clicking buttons such as filter. In R Studio, the user must first install packages, which takes a few seconds, before executing the functions that detect outliers. One such package, outliers, applies statistical techniques such as mean values and standard deviation to spot outliers.
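The mean-and-standard-deviation rule mentioned above can be sketched in a few lines of Python. This is a generic Z-score illustration, not the code of any of the tools or packages named here; the function name, the sample readings, and the threshold are assumptions made only for the example.

```python
import statistics


def zscore_outliers(values, threshold=3.0):
    """Return the values whose absolute Z-score exceeds `threshold`."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)  # population standard deviation
    if stdev == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical readings with one clearly unusual value.
readings = [10, 11, 9, 10, 12, 10, 11, 9, 10, 50]
print(zscore_outliers(readings, threshold=2.5))  # prints [50]
```

With the default threshold of 3.0 the same data yield no outliers, because the extreme value inflates the standard deviation that it is measured against. This masking effect is one reason why robust statistics such as the interquartile range are often preferred for small samples.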
In RapidMiner, the detect outliers function can identify outliers easily. A comparative study is given in Table 3, in which four different software tools are compared on various platforms.

CONCLUSION AND FUTURE WORK

Considering the never-ending rise in data collection and the concomitant need for processed data for scientific and commercial use, the need to develop data mining tools hardly requires re-emphasis. New algorithms capable of detecting outliers with more precision are greatly needed. Furthermore, processed data need to be presented in a careful manner that reveals hidden patterns for public use. Ongoing research efforts in artificial intelligence need to be either combined or utilized at specific points to make knowledge discovery from data mining more productive, scientific, and sensitive. The identification of outliers is a subtopic of data mining. Outlier analysis is an actively researched field. Outliers are data points that cannot be fitted into any cluster. These objects are markedly unusual compared with the residue data set. They may differ from the entire data set or may be difficult to locate. The presence of outliers makes the results confusing. Patterns generated from data that contain outliers are incorrect and inaccurate. Every kind of data brings a new outlier problem and requires different techniques to handle it. A single approach cannot deal with all kinds of outliers in the various fields and subject areas.
Table 3: Comparison of different tools on various platforms
Criterion | WEKA | R Studio | RapidMiner | Orange
Written in | JAVA | C++ | JAVA XML | Python, C++, Cython
Rapidity | FAST | FAST | FAST | FAST
Open source | YES | YES | YES | YES
Large data | Efficiently on small data sets less than R | Efficiently on large and small data sets | Efficient less than R | Efficiently on large and small data sets
Visualization | Fewer options | Many options | Many options | Fewer options
Outliers | YES | YES | YES | YES

Hence, we
need different approaches to identify and detect outliers. The paper presents a review of outliers and outlier detection techniques, which is valuable for researchers. Outliers may hide vital information or may influence the accuracy of the result. The question is how to flag outliers, so as either to retain them for further information or to remove them to gain an accurate outcome.

REFERENCES

1. Houari ME, Rhanoui M, Asri BE. Hybrid Big Data Warehouse for on-demand Decision Needs. Rabat, Morocco: International Conference on Electrical and Information Technologies (ICEIT); 2017.
2. Ming J, Zhang L, Sun J, Zhang Y. Analysis Models of Technical and Economic Data of Mining Enterprises Based on Big Data Analysis. Chengdu, China: 2018 IEEE 3rd International Conference on Cloud Computing and Big Data Analysis (ICCCBDA); 2018.
3. Heling J, Yang A, Yan F, Miao H. Research on pattern analysis and data classification methodology for data mining and knowledge discovery. Int J Hybrid Inf Technol 2016;9:179-88.
4. Hewahi NM, Saad MK. Class outliers mining: Distance-based approach. Int J Comput Infor Eng 2007;1:2805-18.
5. Koteeswaran S, Visu P, Janet J. A review on clustering and outlier analysis techniques in data mining. Am J Appl Sci 2012;9:254-8.
6. Velchamy I, Subramanian R, Vasudevan V. A five step procedure for outlier analysis in data mining. Eur J Sci Res 2012;75:327-39.
7. Prakash D, Prakash N. A Requirements Driven Approach to Data Warehouse Consolidation. Brighton, UK: 11th International Conference on Research Challenges in Information Science (RCIS); 2017.
8. Bagambiki E. Enterprise Data Warehouse and Business Intelligence Solution. In: ICEGOV'18 Proceedings of the 11th International Conference on Theory and Practice of Electronic Governance; 2018.
9. Farooqui NA, Mehra R. Design of a Data Warehouse for Medical Information System Using Data Mining Techniques. Solan Himachal Pradesh, India: 2018 5th International Conference on Parallel, Distributed and Grid Computing (PDGC); 2019.
10. Lgkklw EA, Lingling Z. A BP neural network based method for geological missing data processing. Gold Sci Technol 2015;23:53-8.
11. Hawkins DM. Identification of Outliers. Berlin, Germany: Springer; 1980.
12. Barnett V, Lewis T. Outliers in Statistical Data. Hoboken, New Jersey: Wiley; 1994.
13. Pamula R, Deka JK, Nandi S. An Outlier Detection Method Based on Clustering. Kolkata, India: Second International Conference on Emerging Applications of Information Technology; 2011.
14. Guo P, Dai JY, Wang YX. Outlier Detection in High Dimension Based on Projection. Dalian, China: International Conference on Machine Learning and Cybernetics; 2006.
15. Pachgade SD, Dhande SS. Outlier detection over data set using cluster-based and distance-based approach. Int J Adv Res Comput Sci Softw Eng 2012;2:12-6.
16. Aggarwal CC, Yu PS. Outlier Detection for High Dimensional Data. Santa Barbara, California, USA: ACM SIGMOD International Conference on Management of Data; 2001.
17. Li Y, Wu D, Ren J, Hu C. An Improved Outlier Detection Method in High-dimension Based on Weighted Hypergraph. In: Second International Symposium on Electronic Commerce and Security; 2009.
18. Pham N, Pagh R. A near-linear Time Approximation Algorithm for Angle Based Outlier Detection in High Dimensional Data. Beijing, China: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2012.
19. Behera HS, Ghosh A, Mishra SK. A new hybridized k-means clustering based outlier detection technique for effective data mining. Int J Adv Res Comput Sci Softw Eng 2012;2:287-92.
20. Li Y, Kitagawa H. DB-outlier Detection by Example in High Dimensional Datasets. Istanbul, Turkey: IEEE International Workshop on Databases for Next Generation Researchers; 2007.
21. Breunig MM, Kriegel HP, Ng RT, Sander J. LOF: Identifying Density-based Local Outliers. In: Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data: Dallas, Texas, USA; 2000.
22. Hodge VJ, Austin J. A survey of outlier detection methodologies. Artif Intell Rev 2004;22:85-126.
23. Cheng JG. Outlier Management in Intelligent Data Analysis. London: University of London; 2000.
24. Zimek A, Campello R, Sander J. Ensembles for Unsupervised Outlier Detection: Challenges and Research Questions. SIGKDD Explorations Newsletter; 2013. p. 11-23.
25. Zhang Y, Meratnia N, Havinga P. A Taxonomy Framework for Unsupervised Outlier Detection Techniques for Multi-Type Data Sets. CTIT Technical Report Series; No. Paper P-NS/TR-CTIT-07-79. Enschede: Centre for Telematics and Information Technology (CTIT); 2007.
26. Xi J. Outlier Detection Algorithms in Data Mining. In: 2nd International Symposium on Intelligent Information Technology Application: Shanghai, China; 2008.
27. Zhu C, Kitagawa H, Papadimitriou S, Faloutsos C. Outlier detection by example. J Intell Inf Syst 2011;36:217-47.
28. Rousseeuw PJ, Leroy AM. Robust regression and outlier detection. In: Wiley Series in Probability and Statistics. Hoboken, New Jersey: Wiley; 1987.
29. Knorr EM, Ng RT. Finding Intensional Knowledge of Distance Based Outliers. Edinburgh, Scotland: Proceedings of the 25th VLDB Conference; 1999.
30. Knorr EM, Ng RT, Tucakov V. Distance based outlier: Algorithm and applications. VLDB J Int J 2000;8:237-53.
31. Arning A, Agrawal R, Raghavan P. A Linear Method for Deviation Detection in Large Databases. Portland, Oregon: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining; 1996.
32. Cai Q, He H, Man H. Spatial outlier detection based on iterative self organizing learning model. Neurocomputing 2013;117:161-72.
33. Kou Y, Lu CT, Chen D. Spatial Weighted Outlier Detection. Bethesda: Proceedings of SIAM Conference on Data Mining April 20-22; 2006.
34. Zhao J, Lu CT, Kou Y. Detecting Region Outliers in Meteorological Data. In: GIS 03 Proceedings of the 11th ACM International Symposium on Advances in Geographic Information Systems: New Orleans, Louisiana, USA; 2003.
35. Kriegel HP, Schubert M, Zimek A. Angle Based Outlier Detection in High Dimensional Data. Las Vegas, Nevada, USA: KDD '08 Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2008.
36. Banerjee A. Density-based Evolutionary Outlier Detection. Philadelphia, PA, USA: GECCO'12; 2012.
37. Akoglu L, Tong H, Vreeken J, Faloutsos C. Fast and Reliable Anomaly Detection in Categorical Data. Maui, HI, USA: CIKM'12; 2012.
38. Zhang T, Ramakrishnan R, Livny M. BIRCH: An Efficient Data Clustering Method for Very Large Databases. Montreal, Quebec, Canada: SIGMOD '96 Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data; 1996.
39. McCallum A, Nigam K, Ungar LH. Efficient Clustering of High-dimensional Data Sets with Application to Reference Matching. Boston, Massachusetts, USA: Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2000.
40. Jiang MF, Tseng SS, Su CM. Two-phase Clustering Process for Outliers Detection. Amsterdam, Netherlands: Pattern Recognition Letters, Elsevier; 2001. p. 691-700.
41. Al-Khoder A, Harmouch H. Evaluating four of the most popular open source and free data mining tools. Int J Acad Sci Res 2015;3:13-23.