SlideShare a Scribd company logo
A.Jenefa, S.E Vinodh Ewards / International Journal of Engineering Research and
Applications (IJERA) ISSN: 2248-9622 www.ijera.com
Vol. 3, Issue 2, March -April 2013, pp.1334-1339
1334 | P a g e
Application Identification using Supervised Clustering Method
A.Jenefa*, S.E Vinodh Ewards**
*Department of Computer Science and Engineering, Karunya University, Coimbatore-114
** Department of Computer Science and Engineering, Karunya University, Coimbatore-114
ABSTRACT
The classification of traffic provides
essential data for network management and
research. Several classification approaches are
developed and proposed to protect the network
resources or enforce organizational policies.
Whereas the port number based classification
works only for some well-known applications and
payload based classification is not suitable for
encrypted packet payloads that make the sense to
classify the traffic based on behaviors observed
in networks. In this paper, a supervised
clustering algorithm called Flow Level based
Classification (FLC) is proposed to classify
network flows, which comprises of flows in the
same conversation. In this paper we discussed
recent laurels and various research trends in
supervised and unsupervised clustering
algorithms. We outline the obstinately
mysterious challenges in the field over the last
decade and suggest strategies for tackling these
challenges to promote headway in the art of
Internet traffic classification.
Keywords - Clustering approaches, Machine
learning approaches, Application Identification,
Traffic Classification.
I. INTRODUCTION
Identifying application is essential for
effective network planning and monitoring the
trends of applications. The network traffic
classification becomes more challenging because
modern applications complicated their network
behaviours. The objective of traffic classification is
to understand the type of traffic carried on the
Internet to protect the network resources. A number
of methods have been proposed to identify and to
classify the traffic into applications. There exist
three methods: Payload Based classification, Port
Based classification and Machine Learning
approaches. However the traditional methods: Port
Based [1] and Payload Based approach [11] may not
work well in the modern applications. Identifying
applications in networks using port based and
payload based approach has been greatly diminished
in recent years. But the Machine Learning approach
provides a better result in application identification
in which the statistical characteristics of IP flows are
concerned. In this paper, we present a Machine
learning approach called Supervised Clustering
approach called Flow Level based Classification
(FLC) is used for classifying the traffic using
different statistics. Our aim is to build an efficient
and an accurate classification approach using
clustering techniques as the building block. Such a
Clustering approach would consist of two stages: a
learning phase and a classification phase. The
objective of the offline learning phase is to find out
the information that should be unique to or different
from other applications. After the clusters are
grouped by Euclidean distance, the average packet
sizes for each application is calculated which will be
helpful to identify the applications in the
classification phase. In the classification phase, the
similarity of flows is calculated and grouped by 5-
tuple information and Transport Layer protocol.
Once the flows are grouped, the classification phase
compares the information with the offline phase for
application identification.
The remainder of this paper is organized as
follows. Section II explains the evaluation of both
supervised and unsupervised algorithms. Each
module of Flow Level based Classification is
explained as follows. Section III, IV explains the
details of learning phase and the classification
phase. Section VI presents our conclusions.
II. EVALUATION OF CLASSFICATION
OF TRAFFIC BY CLUSTERING
APPROACHES
Normally the statistical properties are used
to classify the network applications. This Section
summarizes the basic concepts of clustering and
outlines how the supervised and unsupervised
approaches can be applied to traffic classification
1) Unsupervised Clustering Approaches
The unsupervised clustering algorithms
namely), Auto Class, Simple K-Means, Expectation
Maximization (EM and DBSCAN Clustering are
considered in this work.
McGregor et al. [2] used Expectation
maximization technique to group flows based on a
set of flow statistics to classify traffic under
different metrics and criteria. This algorithm is
helpful only for the first step of classifying where
the traffic is completely unknown, and possibly
gives a hint on the group of applications that have
similar traffic characteristics.
Zander et al. [3] used a probabilistic model-based
clustering technique called Auto Class [4, 5] which
allows for the automatic selection of clusters and the
soft clustering of data. In Auto class the clusters are
A.Jenefa, S.E Vinodh Ewards / International Journal of Engineering Research and
Applications (IJERA) ISSN: 2248-9622 www.ijera.com
Vol. 3, Issue 2, March -April 2013, pp.1334-1339
1335 | P a g e
labelled with the most common traffic category of
the flows in it. If two or more categories are tied,
then a label is chosen randomly amongst the tied
category labels.
In [6] K-Means algorithm, an unsupervised
clustering is used and it classifies different types of
applications using the first few packets of the traffic
flow. This algorithm is not effective if the classifier
misses the first few packets of the traffic flow.
The Clustering Algorithm DBSCAN which
relies on a density-based notion of clusters. Density-
Based algorithms have an improvement over
partition-based algorithms because it‟s not limited to
finding spherical shaped clusters but can find
clusters of random shapes. In [7] Density-Based
algorithms have selected DBSCAN algorithm as a
representative. This is in contrast to K-Means and
Auto Class that allocates every object to a cluster.
2) Supervised Clustering Approach
Supervised Clustering requires a prior
knowledge to classify the traffic flows. The phases
of supervised clustering are
 Learning Phase: The Training phase that
builds a set of classification model or rules.
 Classification Phase: The model that has
been built in the learning phase is used to
classify new unseen instances
In Karagiannis et al. [8] present a novel
approach to classify traffic flows into application
behaviours based on connection patterns. The
connection patterns are evaluated in three different
levels. 1) The social level 2) The functional 3) The
application. It correlates Internet host behaviour
patterns with one or more applications and filters the
correlation by behaviour stratification. It is able to
accurately associate hosts with the service they
provide or use by inspecting all the flows generated
by specific hosts. However, it cannot identify
specific application sub types because it has
gathered information from multiple flows for each
individual host to decide the role of the host.
In Nen-Fu Huang et al. [9] present a Machine
Learning technique for traffic classification. This
paper addresses the problem of early identifying
application traffic in protocol level. The Machine
Learning involves mainly two steps. First, extensive
features are defined based on statistical
characteristics of application protocols such as flow
duration, inter-arrival times, packet length etc. A
machine learning classifier is then trained to
associate set of features with known traffic classes,
and apply the well-trained machine learning
classifier to classify unknown traffic using
previously learned rules. It‟s [9] also suitable to
identify encrypted protocols.
In Chun-Nan Lu [10] present a Session Level
Flow Classification (SLFC) algorithm to classify
flows into application behaviours based on flow
classification and session grouping. The flow is
classified into applications by packet size
distribution and then the flows are grouped as
sessions by port locality. The SLFC classifies the
traffic without examining the packet payloads. This
method works even if the packet payloads are
encrypted. Table 2 lists a summary of related works.
III. SUPERVISED LEARNING PHASE
The learning phase is operated offline to
identify the application behaviours in the network.
So it‟s helpful to spot out an individual application
from a mix of applications. The objective of the
offline training phase is to find out the information
that should be unique to or different from other
applications, to be the basis of comparison. For this
reason, this learning phase first collects a set of
traffic traces and tries to extract the information
from the traces. A traffic filter is very much helpful
to collect the traffic traces from the network because
it filters out irrelevant information from the traffic
collection. Our experimental research uses a one day
trace collected at the edge of our university. Any
packet trace analyser can be used to extract the
flows. Initially, to group flows into clusters, the
degree of similarity between flows is to be
measured.
1) Method 1 (H1) Using Euclidean Distance
By Euclidean distance, the similarity
between flows is measured; a small distance
between the two flows shows a strong similarity
(whereas) a large distance shows a low similarity. In
between two flows, the Euclidean distance can be
calculated as
2
1
)
y
x
(
)
y
,
x
(
dist i
n
i
i




Where n be the number of flows. If the distance
between two flows is very small then the two flows
would be grouped together or otherwise they are
grouped as different groups. The flow having
smaller value distance will be grouped together
named as n
S
.......
S
,
S 2
1 .The above step is repeated
till all flows get grouped.
2) Method 2 (H2) Using per-flow Average Packet
size
This method allows us to classify a flow
once it has to be assigned to a cluster. The average
packet sizes for each application is calculated and
stored in the REF table which will be helpful to
identify the application in the classification phase.



n
j
j
S
n
pk
i
size
_
pkt
_
avg
1
Whereas j
pk be the frequent available
number of packet sizes and „n‟ be the number of
flows in i
S .The above process is repeated for each
A.Jenefa, S.E Vinodh Ewards / International Journal of Engineering Research and
Applications (IJERA) ISSN: 2248-9622 www.ijera.com
Vol. 3, Issue 2, March -April 2013, pp.1334-1339
1336 | P a g e
i
S i.e. for each application and recorded in the REF
table for the classification process. At maximum all
the applications shows the unique behaviour
regarding patterns of transfer of packet sizes. Thus,
distinguished packet sizes are helpful to
discriminate certain application. Normally the
packets have the same sizes across all flows but it‟s
necessary to examine that the average packet size
per flow remains constant across all flows in the
network traffic.
IV. SUPERVISED CLASSIFICATION
PHASE
The supervised clustering of classification
phase which includes different heuristics to refine
the classification of traffic. A set of experimental
research has done in our traces to inspect various
applications in the network. The first three methods
are used to group the similar flows and the last
method is used for flow classification.
Fig 1: Overall Diagram of Application identification
TABLE 1
A SUMMARY OF RESEARCH REVIEWED
Related
Work
Machine Learning
Algorithms
Applications Considered Feature Computation
Overhead
McGregor
et al. (2004)
Expectation
Maximization
(unsupervised
Clustering)
HTTP,SMTP,FTP,DNS,IMAP,NTP
etc.
Moderate
Zander et al.
(2005)
Auto
Class(Unsupervised
Clustering)
HTTP,DNS,SMTP,FTP,Telnet etc. Moderate
Bernaille et
al.(2005)
Simple K-Means
(unsupervised
Clustering)
HTTP,FTP,NTP,HTTPS,
SMTP,POP3,SSH etc.
Low
Karagiannis
et al. (2005)
Supervised
Clustering
All applications concerned. High
Huang et al.
(2008)
Supervised
Clustering
BitTorrent,eMule,FTP,HTTP,Skype,
Shoutcast, SMTP, POP3, PPLive.
Moderate
Chun-Nan
Lu et
al.(2009)
Supervised
Clustering
BitTorrent,eMule,FTP,HTTP,Skype,
Shoutcast, SMTP, POP3, PPLive.
Low
Jenefa et al.
(2013)
(Our Work)
Supervised
Clustering
All applications concerned.
(Refer Table 2)
Low
A.Jenefa, S.E Vinodh Ewards / International Journal of Engineering Research and
Applications (IJERA) ISSN: 2248-9622 www.ijera.com
Vol. 3, Issue 2, March -April 2013, pp.1334-1339
1337 | P a g e
TABLE 2
Classification of Applications using the Transport Layer Protocol
Traffic
Classification
Application Identification
Transport
Layer Protocol
Used
Application Layer Protocol
Used
Chat
MSN Messenger, Yahoo
Messenger,AIM,IRC
TCP chat
ftp(Data) ftp,databases TCP ftp
Web http,https TCP http
Mail smtp,pop,nntp,imap,identd TCP mail
p2p
Bit Torrent,eDonkey,Gnutella,
Pee Enabler, WinMX, OpenNap,
MP2P, FastTrack, Direct Connect
TCP/UDP p2p
Attack Port scans,IP address scans ----- -----
Streaming
mms(wmp),real,quicktime,shoutcast,
vbrick streaming.
TCP/UDP streaming
TABLE 3
Classification of Applications using 5-Tuple information
Source IP_ address Destination IP_ address
Port Number
used
Flow_Id
192.16.2.29 192.168.2.51 1400 x
192.16.2.29 192.168.2.51 1400 x
192.16.2.29 192.168.2.51 1400 x
192.16.2.29 192.168.2.51 1401 x+1
192.16.2.29 192.168.2.51 1401 x+1
192.16.2.29 192.168.2.51 1401 x+1
192.16.2.29 74.125.236.181 3210 y
Fig 2: Distinct Packet Sizes of Bit Torrent and Skype
Fig 3: Adjacent Port Numbers used by flows
A.Jenefa, S.E Vinodh Ewards / International Journal of Engineering Research and
Applications (IJERA) ISSN: 2248-9622 www.ijera.com
Vol. 3, Issue 2, March -April 2013, pp.1334-1339
1338 | P a g e
Fig 4: Accuracy rate of different Approaches
1) Method 1 (M1) The Transport Layer protocol
To distinguish the applications in the
network, the protocol information is helpful to
categorise into groups: i) Transport Layer Protocol
(TCP) is used by FTP, mail, chat, and web. ii)
Transport Layer Protocol (UDP) is normally used by
games and Network Management Traffic. iii) p2p
and streaming traffic uses either TCP or UDP. Thus,
the method H1 helps to group the similar
applications for at least to some extent. Table 2
shows the grouping of Applications using Transport
Layer Protocol.
2) Method 2 (M2) The Cardinality of sets
Using ports and IPs, the behaviours of each
application can be distinguished. Normally, the
operating system assigns consecutive port numbers
for similar flows. Figure 3 shows how the operating
system assigns the consecutive port numbers for
each individual flow. Here we used (Source_IP,
Destination_IP, Inter-arrival Time, Port number,
Flow ID) to group similar behaviours. If the
Source_IP address and the Destination_IP address of
different flows are same with the same port number,
then the flows will be grouped together. But all
flows having same port number does not belong to
the same flow. So there comes, Inter-arrival time
(difference of each individual arrival time of
different flows) helps to group the similar flows. If
the arrival time of any two flows is within a
particular limit, then it will be grouped together or
consider as different flow. Table 3 shows the
grouping of similar flows using 5-Tuple information.
3) Method 3 (M3) The Similarity Distance
The individual similarity distance between
the flows is identified by Euclidean distance.
Euclidean distance helps to identify that the two
different flows are identical or not. If the distance
between the flows is within a specified range then it
will be grouped together or else it will be grouped as
different.
4) Method 4 (M4) Flow Classification
After grouping identical flows, the average
packet size of each similar flow is compared with
the REF table to decide which application it should
be. To classify the applications, the unknown flows
are compared with the REF table one by one. Thus
like SLFC [10], this method works even if the packet
payloads are encrypted. These methods work very
well, if the average packet size remains constant
across the flows in the network traffic. Figure 2
shows that the different applications have distinct
packet sizes. So it‟s helpful for us to effectively
classify the application based on its unique
behaviour of packet traces.
V. IMPLEMENTATION DETAILS
Our aim is to produce an efficient
classification algorithm. Normally, the supervised
and the unsupervised algorithms of machine learning
are used to solve the problems in network traffic
classification. Here we used, a supervised algorithm
named as Flow Level based Classification to classify
the network traffic. First of all, TCPDump or
Wireshark is used for traffic filtration. By
implementing our own supervised learning
classification algorithm, effectively classify the
application based on its unique behaviour of packet
traces. Figure 4 shows the accuracy rate of different
approaches based on our test data taken at Karunya
University.
VI. CONCLUSION
Every algorithm is proposed and designed
for certain purposes of improvement. We proposed
Flow Level based Classification algorithm which
runs in two phases: a Learning phase and a
Classification phase. The Learning phase such a
Supervised Clustering approach would consist of
two stages: a learning phase and a classification
phase. Both phases are helpful to improve the
accuracy rate of traffic classification. Our proposed
algorithm achieves a high accuracy rate of traffic
A.Jenefa, S.E Vinodh Ewards / International Journal of Engineering Research and
Applications (IJERA) ISSN: 2248-9622 www.ijera.com
Vol. 3, Issue 2, March -April 2013, pp.1334-1339
1339 | P a g e
classification. Using the proposed algorithm, a high
accuracy rate of 97.9 % is achieved.
ACKNOWLEDGEMENTS
This survey paper is made possible through
the help and support from everyone. First and
foremost, I would like to acknowledge and extend
my heartfelt gratitude to my project guide Mr.S.E
Vinodh Ewards who helped me to complete this
paper possible. Second, I would like to thank my
department staffs, for their vital encouragement and
support. Most especially to god, who made all things
possible.
REFERENCES
[1] IANA, “Internet Assigned Numbers
Authority”,
http://www.iana.org/assignment/port-
numbers.
[2] A. McGregor, M. Hall , P. Lorier , J.
Brunskill , “ Flow clustering using Machine
Learning Techniques”, in: Proc. Passive
and Active Measurement Workshop
(PAM2004), Antibes Juan-Les-Pins,
France, April 2004.
[3] S. Zander, T. Nguyen, G. Armitrage,
“Automated traffic classification and
application identification using machine
learning”, in IEEE 30tsh
conference on
Local Computer Networks (LCN 2005),
Sydney, Australia, November 2005.
[4] P. Cheeseman and J. Strutz, “Bayesian
Classification (AutoClass): Theory and
Results”, in Advances in Knowledge
Discovery and Data Mining, 1996.
[5] A. Dempster, N. Laird, and D. Rubin ,
“Maximum likelihood from incomplete
data via the EM algorithm,” J. Royal
Statistical Society, Vol. 30, no. 1,1997
[6] L. Bernaille, R. Teixeira, I. Akodkenou, A.
Soule, and K. Salamatian, “Traffic
Classification on the fly,” ACM Special
Interest Group on Data Communication
(SIGCOMM) Computer Communication
Review, vol. 36, no. 2,2006.
[7] M. Ester, H. Kriegel, J. Sander, and X.Xu.
A Density-based Algorithm for Discovering
Clusters in Large Spatial Databases with
Noise. In 2nd
International Conference on
Knowledge Discovery and Data Mining
(KDD 96), Portland, USA, 1996.
[8] T. Karagiannis, K. Papagiannaki, and M.
Faloutsos, “BLINC: Multilevel traffic
classification in the dark,” in Proc. of the
Special Interest Group on Data
Communication conference (SIGCOMM)
2005,Philadelphia,PA,USA,August 2005.
[9] Nen-Fu Huang, Han-Chieh Chao, “Early
Identifying Application traffic with
Application Characteristics”, in
Proceedings of the IEEE ICC, 2008.
[10] Chun-Nan-Lu, Chun-Ying Huang, Ying-
Dar Lin, Yuan-Cheng Lai, “Session Level
Flow Classification by packet size
distribution and session grouping, Journal
of Network and Computer
Applications(2009).
[11] S. Sen, O. Spatscheck, and D. Wang,
“Accurate scalable in network identification
of P2P traffic using application signatures,”
in WWW2004, New York, NY, USA, May
2004.

More Related Content

PDF
Jd2516161623
PDF
Jd2516161623
PDF
Automated Traffic Classification And Application Identification Using Machine...
PDF
H0444146
PDF
A Review on Traffic Classification Methods in WSN
PDF
IRJET- Comparative Study on Embedded Feature Selection Techniques for Interne...
PDF
Internet Traffic Classification Using Bayesian Analysis Techniques
PDF
Reduce the False Positive and False Negative from Real Traffic with Intrusion...
Jd2516161623
Jd2516161623
Automated Traffic Classification And Application Identification Using Machine...
H0444146
A Review on Traffic Classification Methods in WSN
IRJET- Comparative Study on Embedded Feature Selection Techniques for Interne...
Internet Traffic Classification Using Bayesian Analysis Techniques
Reduce the False Positive and False Negative from Real Traffic with Intrusion...

Similar to Application Identification Using Supervised Clustering Method (20)

PDF
Traffic Classification using a Statistical Approach
PPT
` Traffic Classification based on Machine Learning
PDF
Analysis of service-oriented traffic classification with imperfect traffic cl...
PDF
Reduce the False Positive and False Negative from Real Traffic with Intrusion...
PDF
E017112028
PDF
Classification of Software Defined Network Traffic to provide Quality of Service
PDF
pam2008
PDF
Ije v4 i2International Journal of Engineering (IJE) Volume (3) Issue (4)
PPTX
Network Measurement and Monitori - Assigment 1, Group3, "Classification"
PDF
Waterfall: Rapid identification of IP flows using cascade classification
DOC
DOWNLOAD
PPT
Exploiting clustering techniques
PDF
Early application identification. CONEXT 2006
PDF
A PHASED APPROACH TO INTRUSION DETECTION IN NETWORK
DOC
CSC 347 – Computer Hardware and Maintenance
PPTX
Presentation
PDF
Walking on AI/ML for Networking
PDF
Traffic classification svm_im2015_10may2015
PPTX
JM Information Retrieval Techniques Unit III
PPTX
Information Storage and Retrieval Techniques Unit III
Traffic Classification using a Statistical Approach
` Traffic Classification based on Machine Learning
Analysis of service-oriented traffic classification with imperfect traffic cl...
Reduce the False Positive and False Negative from Real Traffic with Intrusion...
E017112028
Classification of Software Defined Network Traffic to provide Quality of Service
pam2008
Ije v4 i2International Journal of Engineering (IJE) Volume (3) Issue (4)
Network Measurement and Monitori - Assigment 1, Group3, "Classification"
Waterfall: Rapid identification of IP flows using cascade classification
DOWNLOAD
Exploiting clustering techniques
Early application identification. CONEXT 2006
A PHASED APPROACH TO INTRUSION DETECTION IN NETWORK
CSC 347 – Computer Hardware and Maintenance
Presentation
Walking on AI/ML for Networking
Traffic classification svm_im2015_10may2015
JM Information Retrieval Techniques Unit III
Information Storage and Retrieval Techniques Unit III
Ad

More from Yolanda Ivey (20)

PDF
How Can I Write About Myself. Write My Essay Or
PDF
Do My Essay For Cheap Uk - Can So
PDF
Soal Essay Sistem Inventory Beinyu.Com
PDF
Petra Van Der Merwe Champion Of IOCD Essay Competition
PDF
Premium Writing Paper Sets - Honeytree Person
PDF
PPT - How To Write A Perfect Persuasive Essay Conclusi
PDF
Hire The Cheap Custom Writing Services For Getting Best Quality Of ...
PDF
College Of Charleston Highly Ranked By U.S. News World Report
PDF
Strathmore 400 Series Calligraphy Writing Paper
PDF
Sample Case Study Paper - Example Of Case Study R
PDF
Pay Someone To Write Your Research Paper. Write
PDF
A 10-Step Process For Writing White Papers - B2B Technology Copywriting ...
PDF
Lined Writing Paper For Kids
PDF
AN ANALYIS OF THE LATE PROFESSOR ATIENO-ODHIAMBO S HISTORICAL DISCOURESES.......
PDF
ABE fermentation products recovery methods A review.pdf
PDF
AN ANALYSIS ON WRITING ANXIETY OF INDONESIAN EFL COLLEGE LEARNERS.pdf
PDF
Age, equality, and cultural oppression An argument against ageism.pdf
PDF
A Study on Work Environment, Work-life Balance and Burnout among Health Worke...
PDF
Anthropology A Global Perspective Ninth Edition.pdf
PDF
Attacks in Cognitive Radio Networks (CRN) A Survey.pdf
How Can I Write About Myself. Write My Essay Or
Do My Essay For Cheap Uk - Can So
Soal Essay Sistem Inventory Beinyu.Com
Petra Van Der Merwe Champion Of IOCD Essay Competition
Premium Writing Paper Sets - Honeytree Person
PPT - How To Write A Perfect Persuasive Essay Conclusi
Hire The Cheap Custom Writing Services For Getting Best Quality Of ...
College Of Charleston Highly Ranked By U.S. News World Report
Strathmore 400 Series Calligraphy Writing Paper
Sample Case Study Paper - Example Of Case Study R
Pay Someone To Write Your Research Paper. Write
A 10-Step Process For Writing White Papers - B2B Technology Copywriting ...
Lined Writing Paper For Kids
AN ANALYIS OF THE LATE PROFESSOR ATIENO-ODHIAMBO S HISTORICAL DISCOURESES.......
ABE fermentation products recovery methods A review.pdf
AN ANALYSIS ON WRITING ANXIETY OF INDONESIAN EFL COLLEGE LEARNERS.pdf
Age, equality, and cultural oppression An argument against ageism.pdf
A Study on Work Environment, Work-life Balance and Burnout among Health Worke...
Anthropology A Global Perspective Ninth Edition.pdf
Attacks in Cognitive Radio Networks (CRN) A Survey.pdf
Ad

Recently uploaded (20)

PDF
Classroom Observation Tools for Teachers
PDF
medical_surgical_nursing_10th_edition_ignatavicius_TEST_BANK_pdf.pdf
DOC
Soft-furnishing-By-Architect-A.F.M.Mohiuddin-Akhand.doc
PDF
Paper A Mock Exam 9_ Attempt review.pdf.
PDF
Hazard Identification & Risk Assessment .pdf
PPTX
Orientation - ARALprogram of Deped to the Parents.pptx
PPTX
Introduction to Building Materials
PDF
Complications of Minimal Access Surgery at WLH
PDF
SOIL: Factor, Horizon, Process, Classification, Degradation, Conservation
PPTX
A powerpoint presentation on the Revised K-10 Science Shaping Paper
PDF
RTP_AR_KS1_Tutor's Guide_English [FOR REPRODUCTION].pdf
PPTX
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
PDF
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
PPTX
Cell Types and Its function , kingdom of life
PDF
Practical Manual AGRO-233 Principles and Practices of Natural Farming
PDF
A GUIDE TO GENETICS FOR UNDERGRADUATE MEDICAL STUDENTS
PDF
Supply Chain Operations Speaking Notes -ICLT Program
PDF
Chinmaya Tiranga quiz Grand Finale.pdf
PDF
Weekly quiz Compilation Jan -July 25.pdf
PDF
احياء السادس العلمي - الفصل الثالث (التكاثر) منهج متميزين/كلية بغداد/موهوبين
Classroom Observation Tools for Teachers
medical_surgical_nursing_10th_edition_ignatavicius_TEST_BANK_pdf.pdf
Soft-furnishing-By-Architect-A.F.M.Mohiuddin-Akhand.doc
Paper A Mock Exam 9_ Attempt review.pdf.
Hazard Identification & Risk Assessment .pdf
Orientation - ARALprogram of Deped to the Parents.pptx
Introduction to Building Materials
Complications of Minimal Access Surgery at WLH
SOIL: Factor, Horizon, Process, Classification, Degradation, Conservation
A powerpoint presentation on the Revised K-10 Science Shaping Paper
RTP_AR_KS1_Tutor's Guide_English [FOR REPRODUCTION].pdf
1st Inaugural Professorial Lecture held on 19th February 2020 (Governance and...
Black Hat USA 2025 - Micro ICS Summit - ICS/OT Threat Landscape
Cell Types and Its function , kingdom of life
Practical Manual AGRO-233 Principles and Practices of Natural Farming
A GUIDE TO GENETICS FOR UNDERGRADUATE MEDICAL STUDENTS
Supply Chain Operations Speaking Notes -ICLT Program
Chinmaya Tiranga quiz Grand Finale.pdf
Weekly quiz Compilation Jan -July 25.pdf
احياء السادس العلمي - الفصل الثالث (التكاثر) منهج متميزين/كلية بغداد/موهوبين

Application Identification Using Supervised Clustering Method

  • 1. A.Jenefa, S.E Vinodh Ewards / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 3, Issue 2, March -April 2013, pp.1334-1339 1334 | P a g e Application Identification using Supervised Clustering Method A.Jenefa*, S.E Vinodh Ewards** *Department of Computer Science and Engineering, Karunya University, Coimbatore-114 ** Department of Computer Science and Engineering, Karunya University, Coimbatore-114 ABSTRACT The classification of traffic provides essential data for network management and research. Several classification approaches are developed and proposed to protect the network resources or enforce organizational policies. Whereas the port number based classification works only for some well-known applications and payload based classification is not suitable for encrypted packet payloads that make the sense to classify the traffic based on behaviors observed in networks. In this paper, a supervised clustering algorithm called Flow Level based Classification (FLC) is proposed to classify network flows, which comprises of flows in the same conversation. In this paper we discussed recent laurels and various research trends in supervised and unsupervised clustering algorithms. We outline the obstinately mysterious challenges in the field over the last decade and suggest strategies for tackling these challenges to promote headway in the art of Internet traffic classification. Keywords - Clustering approaches, Machine learning approaches, Application Identification, Traffic Classification. I. INTRODUCTION Identifying application is essential for effective network planning and monitoring the trends of applications. The network traffic classification becomes more challenging because modern applications complicated their network behaviours. The objective of traffic classification is to understand the type of traffic carried on the Internet to protect the network resources. A number of methods have been proposed to identify and to classify the traffic into applications. There exist three methods: Payload Based classification, Port Based classification and Machine Learning approaches. However the traditional methods: Port Based [1] and Payload Based approach [11] may not work well in the modern applications. Identifying applications in networks using port based and payload based approach has been greatly diminished in recent years. But the Machine Learning approach provides a better result in application identification in which the statistical characteristics of IP flows are concerned. In this paper, we present a Machine learning approach called Supervised Clustering approach called Flow Level based Classification (FLC) is used for classifying the traffic using different statistics. Our aim is to build an efficient and an accurate classification approach using clustering techniques as the building block. Such a Clustering approach would consist of two stages: a learning phase and a classification phase. The objective of the offline learning phase is to find out the information that should be unique to or different from other applications. After the clusters are grouped by Euclidean distance, the average packet sizes for each application is calculated which will be helpful to identify the applications in the classification phase. In the classification phase, the similarity of flows is calculated and grouped by 5- tuple information and Transport Layer protocol. Once the flows are grouped, the classification phase compares the information with the offline phase for application identification. The remainder of this paper is organized as follows. Section II explains the evaluation of both supervised and unsupervised algorithms. Each module of Flow Level based Classification is explained as follows. Section III, IV explains the details of learning phase and the classification phase. Section VI presents our conclusions. II. EVALUATION OF CLASSFICATION OF TRAFFIC BY CLUSTERING APPROACHES Normally the statistical properties are used to classify the network applications. This Section summarizes the basic concepts of clustering and outlines how the supervised and unsupervised approaches can be applied to traffic classification 1) Unsupervised Clustering Approaches The unsupervised clustering algorithms namely), Auto Class, Simple K-Means, Expectation Maximization (EM and DBSCAN Clustering are considered in this work. McGregor et al. [2] used Expectation maximization technique to group flows based on a set of flow statistics to classify traffic under different metrics and criteria. This algorithm is helpful only for the first step of classifying where the traffic is completely unknown, and possibly gives a hint on the group of applications that have similar traffic characteristics. Zander et al. [3] used a probabilistic model-based clustering technique called Auto Class [4, 5] which allows for the automatic selection of clusters and the soft clustering of data. In Auto class the clusters are
  • 2. A.Jenefa, S.E Vinodh Ewards / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 3, Issue 2, March -April 2013, pp.1334-1339 1335 | P a g e labelled with the most common traffic category of the flows in it. If two or more categories are tied, then a label is chosen randomly amongst the tied category labels. In [6] K-Means algorithm, an unsupervised clustering is used and it classifies different types of applications using the first few packets of the traffic flow. This algorithm is not effective if the classifier misses the first few packets of the traffic flow. The Clustering Algorithm DBSCAN which relies on a density-based notion of clusters. Density- Based algorithms have an improvement over partition-based algorithms because it‟s not limited to finding spherical shaped clusters but can find clusters of random shapes. In [7] Density-Based algorithms have selected DBSCAN algorithm as a representative. This is in contrast to K-Means and Auto Class that allocates every object to a cluster. 2) Supervised Clustering Approach Supervised Clustering requires a prior knowledge to classify the traffic flows. The phases of supervised clustering are  Learning Phase: The Training phase that builds a set of classification model or rules.  Classification Phase: The model that has been built in the learning phase is used to classify new unseen instances In Karagiannis et al. [8] present a novel approach to classify traffic flows into application behaviours based on connection patterns. The connection patterns are evaluated in three different levels. 1) The social level 2) The functional 3) The application. It correlates Internet host behaviour patterns with one or more applications and filters the correlation by behaviour stratification. It is able to accurately associate hosts with the service they provide or use by inspecting all the flows generated by specific hosts. However, it cannot identify specific application sub types because it has gathered information from multiple flows for each individual host to decide the role of the host. In Nen-Fu Huang et al. [9] present a Machine Learning technique for traffic classification. This paper addresses the problem of early identifying application traffic in protocol level. The Machine Learning involves mainly two steps. First, extensive features are defined based on statistical characteristics of application protocols such as flow duration, inter-arrival times, packet length etc. A machine learning classifier is then trained to associate set of features with known traffic classes, and apply the well-trained machine learning classifier to classify unknown traffic using previously learned rules. It‟s [9] also suitable to identify encrypted protocols. In Chun-Nan Lu [10] present a Session Level Flow Classification (SLFC) algorithm to classify flows into application behaviours based on flow classification and session grouping. The flow is classified into applications by packet size distribution and then the flows are grouped as sessions by port locality. The SLFC classifies the traffic without examining the packet payloads. This method works even if the packet payloads are encrypted. Table 2 lists a summary of related works. III. SUPERVISED LEARNING PHASE The learning phase is operated offline to identify the application behaviours in the network. So it‟s helpful to spot out an individual application from a mix of applications. The objective of the offline training phase is to find out the information that should be unique to or different from other applications, to be the basis of comparison. For this reason, this learning phase first collects a set of traffic traces and tries to extract the information from the traces. A traffic filter is very much helpful to collect the traffic traces from the network because it filters out irrelevant information from the traffic collection. Our experimental research uses a one day trace collected at the edge of our university. Any packet trace analyser can be used to extract the flows. Initially, to group flows into clusters, the degree of similarity between flows is to be measured. 1) Method 1 (H1) Using Euclidean Distance By Euclidean distance, the similarity between flows is measured; a small distance between the two flows shows a strong similarity (whereas) a large distance shows a low similarity. In between two flows, the Euclidean distance can be calculated as 2 1 ) y x ( ) y , x ( dist i n i i     Where n be the number of flows. If the distance between two flows is very small then the two flows would be grouped together or otherwise they are grouped as different groups. The flow having smaller value distance will be grouped together named as n S ....... S , S 2 1 .The above step is repeated till all flows get grouped. 2) Method 2 (H2) Using per-flow Average Packet size This method allows us to classify a flow once it has to be assigned to a cluster. The average packet sizes for each application is calculated and stored in the REF table which will be helpful to identify the application in the classification phase.    n j j S n pk i size _ pkt _ avg 1 Whereas j pk be the frequent available number of packet sizes and „n‟ be the number of flows in i S .The above process is repeated for each
  • 3. A.Jenefa, S.E Vinodh Ewards / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 3, Issue 2, March -April 2013, pp.1334-1339 1336 | P a g e i S i.e. for each application and recorded in the REF table for the classification process. At maximum all the applications shows the unique behaviour regarding patterns of transfer of packet sizes. Thus, distinguished packet sizes are helpful to discriminate certain application. Normally the packets have the same sizes across all flows but it‟s necessary to examine that the average packet size per flow remains constant across all flows in the network traffic. IV. SUPERVISED CLASSIFICATION PHASE The supervised clustering of classification phase which includes different heuristics to refine the classification of traffic. A set of experimental research has done in our traces to inspect various applications in the network. The first three methods are used to group the similar flows and the last method is used for flow classification. Fig 1: Overall Diagram of Application identification TABLE 1 A SUMMARY OF RESEARCH REVIEWED Related Work Machine Learning Algorithms Applications Considered Feature Computation Overhead McGregor et al. (2004) Expectation Maximization (unsupervised Clustering) HTTP,SMTP,FTP,DNS,IMAP,NTP etc. Moderate Zander et al. (2005) Auto Class(Unsupervised Clustering) HTTP,DNS,SMTP,FTP,Telnet etc. Moderate Bernaille et al.(2005) Simple K-Means (unsupervised Clustering) HTTP,FTP,NTP,HTTPS, SMTP,POP3,SSH etc. Low Karagiannis et al. (2005) Supervised Clustering All applications concerned. High Huang et al. (2008) Supervised Clustering BitTorrent,eMule,FTP,HTTP,Skype, Shoutcast, SMTP, POP3, PPLive. Moderate Chun-Nan Lu et al.(2009) Supervised Clustering BitTorrent,eMule,FTP,HTTP,Skype, Shoutcast, SMTP, POP3, PPLive. Low Jenefa et al. (2013) (Our Work) Supervised Clustering All applications concerned. (Refer Table 2) Low
  • 4. A.Jenefa, S.E Vinodh Ewards / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 3, Issue 2, March -April 2013, pp.1334-1339 1337 | P a g e TABLE 2 Classification of Applications using the Transport Layer Protocol Traffic Classification Application Identification Transport Layer Protocol Used Application Layer Protocol Used Chat MSN Messenger, Yahoo Messenger,AIM,IRC TCP chat ftp(Data) ftp,databases TCP ftp Web http,https TCP http Mail smtp,pop,nntp,imap,identd TCP mail p2p Bit Torrent,eDonkey,Gnutella, Pee Enabler, WinMX, OpenNap, MP2P, FastTrack, Direct Connect TCP/UDP p2p Attack Port scans,IP address scans ----- ----- Streaming mms(wmp),real,quicktime,shoutcast, vbrick streaming. TCP/UDP streaming TABLE 3 Classification of Applications using 5-Tuple information Source IP_ address Destination IP_ address Port Number used Flow_Id 192.16.2.29 192.168.2.51 1400 x 192.16.2.29 192.168.2.51 1400 x 192.16.2.29 192.168.2.51 1400 x 192.16.2.29 192.168.2.51 1401 x+1 192.16.2.29 192.168.2.51 1401 x+1 192.16.2.29 192.168.2.51 1401 x+1 192.16.2.29 74.125.236.181 3210 y Fig 2: Distinct Packet Sizes of Bit Torrent and Skype Fig 3: Adjacent Port Numbers used by flows
  • 5. A.Jenefa, S.E Vinodh Ewards / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 3, Issue 2, March -April 2013, pp.1334-1339 1338 | P a g e Fig 4: Accuracy rate of different Approaches 1) Method 1 (M1) The Transport Layer protocol To distinguish the applications in the network, the protocol information is helpful to categorise into groups: i) Transport Layer Protocol (TCP) is used by FTP, mail, chat, and web. ii) Transport Layer Protocol (UDP) is normally used by games and Network Management Traffic. iii) p2p and streaming traffic uses either TCP or UDP. Thus, the method H1 helps to group the similar applications for at least to some extent. Table 2 shows the grouping of Applications using Transport Layer Protocol. 2) Method 2 (M2) The Cardinality of sets Using ports and IPs, the behaviours of each application can be distinguished. Normally, the operating system assigns consecutive port numbers for similar flows. Figure 3 shows how the operating system assigns the consecutive port numbers for each individual flow. Here we used (Source_IP, Destination_IP, Inter-arrival Time, Port number, Flow ID) to group similar behaviours. If the Source_IP address and the Destination_IP address of different flows are same with the same port number, then the flows will be grouped together. But all flows having same port number does not belong to the same flow. So there comes, Inter-arrival time (difference of each individual arrival time of different flows) helps to group the similar flows. If the arrival time of any two flows is within a particular limit, then it will be grouped together or consider as different flow. Table 3 shows the grouping of similar flows using 5-Tuple information. 3) Method 3 (M3) The Similarity Distance The individual similarity distance between the flows is identified by Euclidean distance. Euclidean distance helps to identify that the two different flows are identical or not. If the distance between the flows is within a specified range then it will be grouped together or else it will be grouped as different. 4) Method 4 (M4) Flow Classification After grouping identical flows, the average packet size of each similar flow is compared with the REF table to decide which application it should be. To classify the applications, the unknown flows are compared with the REF table one by one. Thus like SLFC [10], this method works even if the packet payloads are encrypted. These methods work very well, if the average packet size remains constant across the flows in the network traffic. Figure 2 shows that the different applications have distinct packet sizes. So it‟s helpful for us to effectively classify the application based on its unique behaviour of packet traces. V. IMPLEMENTATION DETAILS Our aim is to produce an efficient classification algorithm. Normally, the supervised and the unsupervised algorithms of machine learning are used to solve the problems in network traffic classification. Here we used, a supervised algorithm named as Flow Level based Classification to classify the network traffic. First of all, TCPDump or Wireshark is used for traffic filtration. By implementing our own supervised learning classification algorithm, effectively classify the application based on its unique behaviour of packet traces. Figure 4 shows the accuracy rate of different approaches based on our test data taken at Karunya University. VI. CONCLUSION Every algorithm is proposed and designed for certain purposes of improvement. We proposed Flow Level based Classification algorithm which runs in two phases: a Learning phase and a Classification phase. The Learning phase such a Supervised Clustering approach would consist of two stages: a learning phase and a classification phase. Both phases are helpful to improve the accuracy rate of traffic classification. Our proposed algorithm achieves a high accuracy rate of traffic
  • 6. A.Jenefa, S.E Vinodh Ewards / International Journal of Engineering Research and Applications (IJERA) ISSN: 2248-9622 www.ijera.com Vol. 3, Issue 2, March -April 2013, pp.1334-1339 1339 | P a g e classification. Using the proposed algorithm, a high accuracy rate of 97.9 % is achieved. ACKNOWLEDGEMENTS This survey paper is made possible through the help and support from everyone. First and foremost, I would like to acknowledge and extend my heartfelt gratitude to my project guide Mr.S.E Vinodh Ewards who helped me to complete this paper possible. Second, I would like to thank my department staffs, for their vital encouragement and support. Most especially to god, who made all things possible. REFERENCES [1] IANA, “Internet Assigned Numbers Authority”, http://www.iana.org/assignment/port- numbers. [2] A. McGregor, M. Hall , P. Lorier , J. Brunskill , “ Flow clustering using Machine Learning Techniques”, in: Proc. Passive and Active Measurement Workshop (PAM2004), Antibes Juan-Les-Pins, France, April 2004. [3] S. Zander, T. Nguyen, G. Armitrage, “Automated traffic classification and application identification using machine learning”, in IEEE 30tsh conference on Local Computer Networks (LCN 2005), Sydney, Australia, November 2005. [4] P. Cheeseman and J. Strutz, “Bayesian Classification (AutoClass): Theory and Results”, in Advances in Knowledge Discovery and Data Mining, 1996. [5] A. Dempster, N. Laird, and D. Rubin , “Maximum likelihood from incomplete data via the EM algorithm,” J. Royal Statistical Society, Vol. 30, no. 1,1997 [6] L. Bernaille, R. Teixeira, I. Akodkenou, A. Soule, and K. Salamatian, “Traffic Classification on the fly,” ACM Special Interest Group on Data Communication (SIGCOMM) Computer Communication Review, vol. 36, no. 2,2006. [7] M. Ester, H. Kriegel, J. Sander, and X.Xu. A Density-based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In 2nd International Conference on Knowledge Discovery and Data Mining (KDD 96), Portland, USA, 1996. [8] T. Karagiannis, K. Papagiannaki, and M. Faloutsos, “BLINC: Multilevel traffic classification in the dark,” in Proc. of the Special Interest Group on Data Communication conference (SIGCOMM) 2005,Philadelphia,PA,USA,August 2005. [9] Nen-Fu Huang, Han-Chieh Chao, “Early Identifying Application traffic with Application Characteristics”, in Proceedings of the IEEE ICC, 2008. [10] Chun-Nan-Lu, Chun-Ying Huang, Ying- Dar Lin, Yuan-Cheng Lai, “Session Level Flow Classification by packet size distribution and session grouping, Journal of Network and Computer Applications(2009). [11] S. Sen, O. Spatscheck, and D. Wang, “Accurate scalable in network identification of P2P traffic using application signatures,” in WWW2004, New York, NY, USA, May 2004.