International Research Journal of Engineering and Technology (IRJET) e-ISSN: 2395-0056
Volume: 04 Issue: 10 | Oct -2017 www.irjet.net p-ISSN: 2395-0072
© 2017, IRJET | Impact Factor value: 5.181 | ISO 9001:2008 Certified Journal | Page 687
A Comparative Study for Anomaly Detection in Data Mining
Fenil Shingala1, Shalin Barot2, Bhavesh Oza3
1,2 Student, LDCE Ahmedabad
3Professor, Dept. of Computer Engineering, LD College of Engineering, Gujarat, India
---------------------------------------------------------------------***---------------------------------------------------------------------
Abstract - In this paper, we discuss some of the research we
have surveyed so far and what we have concluded from that
survey. We try to compare and combine three of the methods
we have explored for outlier/anomaly detection. Data mining
is the process of extracting information from data of any
kind, and outlier/anomaly detection is the detection of
irrelevant data, that is, data that does not fit the expected
pattern.
Key Words: moving averages, mean absolute deviation,
DBSCAN, regression, anomaly detection, outlier
detection, prediction analysis, data mining
1. INTRODUCTION
Anomaly detection was originally proposed for intrusion
detection systems (IDS) to prevent cases such as burglary.
Anomaly detection for IDS is normally accomplished with
statistical analysis by defining boundaries, but it can also
be done with soft computing and inductive learning. [1]
Fig -1: Anomaly detection
In data mining, anomaly detection is the identification of
items, events or observations that do not conform to an
expected pattern or to the other items in a dataset. The
anomalous items normally point to some sort of problem, for
example bank fraud, a structural defect, a medical problem
or an error in a text. For example, as the figure shows, a
given time series contains some extreme points that deviate
strongly from the other points.
Anomaly detection:
Anomalies are detected by observing a dataset over a certain
time period and then analyzing its patterns to find abnormal
behavior in upcoming data points. Here we compare a few
methods for doing so. They draw on basic ideas from
statistics and on clustering techniques.
Moving averages:
In statistics, a moving average is a process of creating a
series of averages of successive subsets of the entire data
set. These averages are called the moving mean or rolling
mean. [2]
But how can we do this? The answer is a sliding window.
First, we take a series of numbers. Then we decide on a fixed
subset size, which becomes the window. Starting at the
initial index, we take the average of the first window-size
numbers. Then we slide the window by removing its first
number and adding the next number in the series after the
current window, and take the average again. Repeating this
produces a series of averages, which are called moving
averages.
Moving averages can also be used for anomaly detection.
After finding a moving average, we can define threshold
limits above and below the current moving average and list
the points that fall outside them as anomalies.
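The sliding-window procedure and the threshold test can be sketched in a few lines of Python. This is only a minimal illustration, not the implementation used in the study: the function names, the choice to compare each point with the average of the window just before it, and the fixed `threshold` value are all assumptions made for the example.

```python
from collections import deque


def moving_averages(series, window):
    """Return the average of every consecutive run of `window` values."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    current = deque(series[:window])
    total = sum(current)
    averages = [total / window]
    for value in series[window:]:
        # Slide the window: drop the oldest value, add the next one.
        total += value - current.popleft()
        current.append(value)
        averages.append(total / window)
    return averages


def moving_average_anomalies(series, window, threshold):
    """Flag points farther than `threshold` from the preceding window's average.

    Assumption: the baseline for point i is the mean of series[i-window:i].
    """
    averages = moving_averages(series, window)
    anomalies = []
    for i in range(window, len(series)):
        baseline = averages[i - window]
        if abs(series[i] - baseline) > threshold:
            anomalies.append((i, series[i]))
    return anomalies


monthly_sales = [10, 12, 11, 13, 12, 40, 12, 11]  # made-up numbers
print(moving_average_anomalies(monthly_sales, window=3, threshold=15))
```

With these made-up numbers only the spike of 40 is reported. Note that the spike then enters the following windows and raises their averages, so a tighter threshold could also flag the ordinary values that come right after it.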
Example:
Let’s say we are analyzing data taken from a tractor
company. We take the sales of each month and plot a graph.
There will be three parts to this graph. The middle part
covers the periods in which the number of tractors sold was
neither too high nor too low. The other two parts cover the
periods in which sales were too high or too low.
Let’s say we’re interested in finding the periods during
which sales peak, so that we can keep more tractors in stock
during those periods and avoid a shortage. We also need to
make sure that we do not order more than needed when sales
are low. This can be solved by observing history; that is,
by looking at past sales records and finding out which
months are at a sales peak and which months have
comparatively low sales. That is exactly what moving
averages do in time series analysis.
The figure shows a graph of tractor sales; by applying the
algorithm we can find the anomalies that deviate from
the given period's average. Whenever the graph goes to an
extreme, that point is considered anomalous. In the graph,
such extreme points are depicted in red. They show the upper
and lower extreme values of sales during a particular time,
which is exactly what we seek to find.
Moving averages can be used for measuring the trend of any
series. This method is applicable to linear as well as
non-linear trends. [3]
On the other hand, the trend obtained by moving averages is
generally neither a straight line nor a standard curve. For
this reason the trend cannot be extended to forecast future
values.
Fig -2: Moving Averages
Mean absolute deviation:
In statistics, the median absolute deviation (MAD) is a
measure of the variability of a univariate sample, that is,
a sample of a single quantitative variable. It can also
refer to the population parameter that is estimated by the
MAD calculated from a sample. [4]
For a univariate data set $a_1, a_2, \dots, a_n$, the MAD is
defined as the median of the absolute deviations from the
data's median:
$$\mathrm{MAD} = \operatorname{median}_i\left(\left|a_i - \operatorname{median}(a)\right|\right)$$
So in a nutshell, MAD is the median of absolute deviations
from the data's median.
How to find MAD:
1. Find the mean/median of the data
2. Find the absolute difference between each data value and
the mean/median
3. Find the mean/median of these differences
Note that either the mean or the median can be used,
depending on the type of data. With the mean the result is
the mean absolute deviation; with the median it is the
median absolute deviation.
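The three steps can be written as one small function. The sketch below assumes Python's standard `statistics` module; the function name `absolute_deviation` and the sample values are made up for the illustration.

```python
from statistics import mean, median


def absolute_deviation(data, center=median):
    """Apply the three steps, using `center` (mean or median) as the average."""
    c = center(data)                         # step 1: mean/median of the data
    deviations = [abs(x - c) for x in data]  # step 2: absolute differences
    return center(deviations)                # step 3: mean/median of the differences


sample = [1, 2, 3, 4, 100]  # made-up values with one extreme point
print(absolute_deviation(sample, center=median))
print(absolute_deviation(sample, center=mean))
```

For this sample the median-based value is 1 while the mean-based value is about 31.2, which shows how much more strongly a single extreme point pulls the mean-based version.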
MAD can be used for anomaly detection by thresholding the
ratio of each point's absolute deviation from the center to
the MAD (|point - center| / MAD). These ratios give the
distribution of the data points with respect to their
center. We can then define a threshold according to the
distribution we get and extract the outliers.
Data outside the threshold are considered anomalous. The
method is best suited to data in which most of the points
lie within particular limits and the anomalies deviate so
much that their ratio to the MAD is larger than that of most
of the points.
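A minimal sketch of this thresholding follows, under stated assumptions: the median is used as the center, the default cut-off of 3 is an arbitrary illustrative value rather than one taken from the study, and the handling of a zero MAD is a design choice made only for this example.

```python
from statistics import median


def mad_anomalies(data, threshold=3.0):
    """Return (index, value) pairs whose |value - median| / MAD exceeds `threshold`."""
    center = median(data)
    mad = median(abs(x - center) for x in data)
    if mad == 0:
        # At least half of the points equal the median, so the ratio is
        # undefined; here every point that differs from the median is flagged.
        return [(i, x) for i, x in enumerate(data) if x != center]
    return [(i, x) for i, x in enumerate(data)
            if abs(x - center) / mad > threshold]


cpu_load = [52, 48, 50, 51, 49, 95, 50, 47]  # made-up load percentages
print(mad_anomalies(cpu_load))
```

In practice the threshold would be chosen by looking at the distribution of the ratios, as described above, rather than fixed in advance.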
Example:
Suppose a company hosts a cloud service, for which it
maintains servers and provides subscribed customers with
data storage. Let’s say, hypothetically, that all of the
company's customers use most of their cloud storage and all
access it at the same time; there will then be too much load
on the CPUs and storage devices. Situations of this type are
common on shopping sites, where many customers make
transactions during a particular period, such as a festival.
By detecting when such critical situations arise, that is,
when the load is extreme, the company can take precautionary
steps to avoid such crises in the future.
While the mathematical side of this method's implementation
is a promising advantage, the same side is also the cause of
its drawbacks.
Merits:
• Mean deviation is widely used in different fields, for
example Economics, Business, Commerce and other fields of
that sort.
• When a comparison between two or more series is required,
this is perhaps the best measure.
• The calculation is based on measurement rather than on an
estimate.
• As it is based on the deviations about an average, it
gives a better measure for correlation.
Demerits:
• If the range increases as the sample grows, the average
deviation also increases, but not in the same proportion.
• It is practically not used in sociological studies. [5]
DBSCAN:
Density-based spatial clustering is very useful for finding
shapes based on density. Density-Based Spatial Clustering of
Applications with Noise (DBSCAN) is the most widely used
density-based clustering algorithm. K-means is another
widely used clustering algorithm, but it is centroid based
rather than density based. Density-based algorithms use a
measure of reachability and of eligibility to form a
cluster. [6]
Reachability is the condition defined by the parameter ε
(epsilon). It is the maximum allowed distance between two
points for them to be eligible as neighbors of each other.
Eligibility to form a cluster is defined by a number, which
gives the minimum number of points required to form a
cluster. [6]
Epsilon must be given manually; it is the maximum distance
at which a data point is still considered eligible for
inclusion in a cluster. [7]
This clustering method creates a sort of chaining, by which
two points that are more than ε apart can still end up in
the same cluster. Let’s understand this by an example.
Suppose a point ‘a’ is within ε distance of another point
‘b’, so they are neighbors. Point ‘b’ in turn is a neighbor
of point ‘c’. By chaining, points ‘a’ and ‘c’ are connected
to each other even if they are more than ε apart.
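The chaining effect can be illustrated with one-dimensional points. The sketch below is a simplification: it follows chains of ε-neighbors only and ignores the minimum-points condition, it treats a distance of exactly ε as neighboring, and the function name and coordinates are made up for the example.

```python
def reachable_from(start, points, eps):
    """Collect every point linked to `start` by a chain of hops no longer than eps."""
    reached = {start}
    frontier = [start]
    while frontier:
        current = frontier.pop()
        for name, value in points.items():
            if name not in reached and abs(value - points[current]) <= eps:
                reached.add(name)
                frontier.append(name)
    return reached


points = {"a": 0.0, "b": 1.0, "c": 2.0, "d": 5.0}  # made-up coordinates
eps = 1.2
print(abs(points["a"] - points["c"]) > eps)        # 'a' and 'c' are more than eps apart
print(sorted(reachable_from("a", points, eps)))    # yet 'c' is reached through 'b'
```

Point ‘d’ has no neighbor within ε, so no chain reaches it.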
Algorithmic steps for DBSCAN clustering
Let’s understand this by taking sample data points with
integer values. DBSCAN takes two parameters: ε (epsilon) and
the minimum number of points required to form a cluster (p).
1) Start with a random point that has not been visited.
2) Extract the neighbors of this point using ε (all points
within distance ε are neighbors) by applying the condition
that the difference between each point and the point
currently under observation is less than ε.
3) If there are enough neighbors around this point, the
clustering process starts and the point is marked as
visited; otherwise the point is labeled as noise (it can
still become part of a cluster later).
4) If a point is found to be part of the cluster, then its
ε-neighbors are also part of the cluster, and step 2 is
repeated for all points in the ε-neighborhood. This is
repeated until all points in the cluster are determined.
5) After the cluster is formed, the next random unvisited
point is taken and the same clustering process is repeated.
6) This process continues until all points are marked as
visited.
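The steps above can be sketched for one-dimensional integer data as follows. This is an illustrative sketch rather than a reference implementation, and it makes several assumptions: points are visited in index order instead of at random, a point counts as its own neighbor when comparing against the minimum number of points, a distance of exactly ε counts as neighboring, and the name `dbscan_1d` is made up for the example.

```python
def dbscan_1d(points, eps, min_pts):
    """Label each value with a cluster id (0, 1, ...) or -1 for noise."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Step 2: indices of all points within eps of point i (itself included).
        return [j for j in range(len(points))
                if abs(points[i] - points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):          # steps 1, 5 and 6: next unvisited point
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:          # step 3: too few neighbors
            labels[i] = NOISE             # may be claimed by a cluster later
            continue
        cluster += 1                      # step 3: a new cluster starts here
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:                      # step 4: grow the cluster
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster       # former noise becomes a border point
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            more = neighbors(j)
            if len(more) >= min_pts:      # only dense points extend the cluster
                queue.extend(more)
    return labels


data = [1, 2, 3, 10, 11, 12, 50]  # made-up values
print(dbscan_1d(data, eps=1.5, min_pts=2))
```

For this data the first three values form one cluster, the next three form a second cluster, and the value 50 is left labeled as noise, which is the outlier an anomaly detector would report.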
Example:
Let’s say we are developing an algorithm for automatically
recognizing lone islands on a map of the world, or on any
map whatsoever. DBSCAN provides a good solution for this
type of situation. We can decide epsilon by specifying the
minimum area for a piece of land to fall into the category
“not a lone island”. The lone islands are then the outliers
that remain after applying the algorithm.
Major features:
Discover clusters of arbitrary shape
Handle noise
One scan
Need density parameters[9]
Comparison:
• Unfortunately, moving averages do not work for all
scenarios, especially for very volatile observations or
for those that are heavily influenced by current events.
[10]
• Mean absolute deviation has the limitation that it is only
applicable when there is no measurement error and when
the distributions are normal. [11]
• DBSCAN does not work well when we are dealing with
clusters of varying densities or with high-dimensional
data. [7]
Table -1: Comparison of techniques
3. CONCLUSIONS
This brief paper discusses what anomaly detection is and
compares different techniques for it in data mining. Moving
averages, Mean Absolute Deviation (MAD) and DBSCAN are among
the most frequently used algorithms in practice. After
comparison it can be said that all three have their own
advantages and disadvantages, and each is best applied in
different circumstances.
REFERENCES
[1] Hodge, V. J.; Austin, J. (2004). "A Survey of Outlier Detection Methodologies".
[2] Chou, Ya-lun (1975). Statistical Analysis. Holt International.
[3] http://www.emathzone.com/tutorials/basic-statistics/merits-and-demerits-of-moving-average-method.html
[4] Geary, R. C. (1935). "The ratio of the mean deviation to the standard deviation as a test of normality". Biometrika, 27(3/4), 310–332.
[5] http://www.publishyourarticles.net/knowledge-hub/statistics/merits-and-demerits-of-mean-deviation/1096/
[6] Arlia, Domenica; Coppola, Massimo. "Experiments in Parallel Clustering with DBSCAN".
[7] http://www.hypertextbookshop.com/dataminingbook/public_version/contents/chapters/chapter004/section004/blue/page003.html
[8] http://www.ques10.com/p/9286/dbscan-clustering-algorithm-with-an-example/
[9] http://www.investopedia.com/articles/trading/11/pitfalls-moving-averages.asp
[10] http://www.investopedia.com/articles/trading/11/pitfalls-moving-averages.asp
[11] http://influentialpoints.com/Training/absolute_deviations.htm
Algorithm         Execution Time   Efficiency   Number of false alerts
Moving Averages   High             High         Less
MAD               Less             Medium       More
DBSCAN            Medium           Medium       Average