Data Mining Techniques
ITE2006
NETWORK ABUSE DETECTION
PROJECT REPORT
SUBMITTED BY
15BIT0134 RUBAL NANDAL
15BIT0268 KEDAR KUMAR
Guided By:
Dr. Sudha M
CERTIFICATE
This is to certify that the project work entitled "STUDENT Marks
Analysis", submitted by KEDAR KUMAR (15BIT0268) and
RUBAL NANDAL (15BIT0134), is a record of bonafide work done in Data
Mining (ITE2006) under my supervision. The contents of this project work, in
full or in part, have neither been taken from any other source nor been
submitted for any other CAL course.
PLACE: VELLORE
DATE: 1/11/2017
KEDAR KUMAR (15BIT0268)
RUBAL NANDAL (15BIT0134)
Table of contents
Acknowledgements
Problem Statement
Approach
Modules
Proposed Implementation
Implementation
Conclusion
References
ACKNOWLEDGEMENTS
We thank Dr. Sudha M for her guidance and help in carrying out
this project. We also thank everyone else who contributed to its
completion, and the University Management and the School Dean for
giving us the chance to pursue our studies at the University. We are
grateful for this opportunity.
Problem Statement
Many attacks are carried out today with malicious intent, and most of them are network attacks.
We therefore attempt to build a network abuse (intrusion) detector from the KDD-1999 data set
that distinguishes normal connections from attack connections.
Detecting network intrusions protects a computer network from unauthorized users, possibly
including insiders. The intrusion detector learning task is to build a predictive model (i.e. a
classifier) capable of distinguishing between "bad" connections, called intrusions or attacks, and
"good" normal connections.
A connection is a sequence of TCP packets starting and ending at well-defined times,
during which data flows between a source IP address and a target IP address under a
well-defined protocol. Each connection is labelled either as normal or as an attack, with exactly one
specific attack type. Each connection record consists of about 100 bytes.
Attacks fall into four main categories:
- DOS: denial-of-service, e.g. SYN flood;
- R2L: unauthorized access from a remote machine, e.g. password guessing;
- U2R: unauthorized access to local superuser (root) privileges, e.g. various "buffer overflow" attacks;
- PROBING: surveillance and other probing, e.g. port scanning.
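As a rough sketch of how the individual attack labels could be grouped into the four categories above, the mapping below covers the attack labels that appear in the KDD-1999 training data. The names ATTACK_CATEGORY and to_category are illustrative and are not part of the project code; labels that only occur in the test file fall through to "unknown".

```python
# Mapping from KDD Cup 1999 training labels to the four attack categories.
# ATTACK_CATEGORY and to_category are hypothetical names used for illustration.
ATTACK_CATEGORY = {
    "back.": "dos", "land.": "dos", "neptune.": "dos",
    "pod.": "dos", "smurf.": "dos", "teardrop.": "dos",
    "ftp_write.": "r2l", "guess_passwd.": "r2l", "imap.": "r2l",
    "multihop.": "r2l", "phf.": "r2l", "spy.": "r2l",
    "warezclient.": "r2l", "warezmaster.": "r2l",
    "buffer_overflow.": "u2r", "loadmodule.": "u2r",
    "perl.": "u2r", "rootkit.": "u2r",
    "ipsweep.": "probe", "nmap.": "probe",
    "portsweep.": "probe", "satan.": "probe",
}


def to_category(label):
    """Return 'normal', one of the four categories, or 'unknown'."""
    if label == "normal.":
        return "normal"
    return ATTACK_CATEGORY.get(label, "unknown")


sample = ["normal.", "smurf.", "guess_passwd.", "rootkit.", "nmap.", "mailbomb."]
categories = [to_category(label) for label in sample]
print(categories)
```

On the real data the same function could be applied to the label column, for example with kdd_data_10percent['label'].map(to_category).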
ABOUT DATASET
Our dataset contains the following features.
Table 1: Basic features of individual TCP connections
feature name | description | type
duration | length (number of seconds) of the connection | continuous
protocol_type | type of the protocol, e.g. tcp, udp, etc. | discrete
service | network service on the destination, e.g. http, telnet, etc. | discrete
src_bytes | number of data bytes from source to destination | continuous
dst_bytes | number of data bytes from destination to source | continuous
flag | normal or error status of the connection | discrete
land | 1 if connection is from/to the same host/port; 0 otherwise | discrete
wrong_fragment | number of "wrong" fragments | continuous
urgent | number of urgent packets | continuous
Table 2: Content features within a connection suggested by domain knowledge
feature name | description | type
hot | number of "hot" indicators | continuous
num_failed_logins | number of failed login attempts | continuous
logged_in | 1 if successfully logged in; 0 otherwise | discrete
num_compromised | number of "compromised" conditions | continuous
root_shell | 1 if root shell is obtained; 0 otherwise | discrete
su_attempted | 1 if "su root" command attempted; 0 otherwise | discrete
num_root | number of "root" accesses | continuous
num_file_creations | number of file creation operations | continuous
num_shells | number of shell prompts | continuous
num_access_files | number of operations on access control files | continuous
num_outbound_cmds | number of outbound commands in an ftp session | continuous
is_hot_login | 1 if the login belongs to the "hot" list; 0 otherwise | discrete
is_guest_login | 1 if the login is a "guest" login; 0 otherwise | discrete
Table 3: Traffic features computed using a two-second time window
feature name | description | type
count | number of connections to the same host as the current connection in the past two seconds | continuous
Note: The following features refer to these same-host connections.
serror_rate | % of connections that have "SYN" errors | continuous
rerror_rate | % of connections that have "REJ" errors | continuous
same_srv_rate | % of connections to the same service | continuous
diff_srv_rate | % of connections to different services | continuous
srv_count | number of connections to the same service as the current connection in the past two seconds | continuous
Note: The following features refer to these same-service connections.
srv_serror_rate | % of connections that have "SYN" errors | continuous
srv_rerror_rate | % of connections that have "REJ" errors | continuous
srv_diff_host_rate | % of connections to different hosts | continuous
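To make the two-second window features of Table 3 more concrete, here is a small sketch that computes count, serror_rate and same_srv_rate for the most recent connection of a made-up log. The log, the function name window_features and the choice to include the current connection in the window are all assumptions for illustration; the dataset itself ships these values precomputed.

```python
# Hypothetical connection log: (end_time_seconds, dst_host, service, has_syn_error)
connections = [
    (0.2, "10.0.0.5", "http", False),
    (1.0, "10.0.0.5", "http", True),
    (1.5, "10.0.0.9", "ftp", False),
    (2.4, "10.0.0.5", "smtp", True),
    (2.9, "10.0.0.5", "http", False),   # the "current" connection
]


def window_features(conns, window=2.0):
    """Compute count, serror_rate and same_srv_rate for the last connection."""
    now, host, service, _ = conns[-1]
    # Connections to the same destination host within the time window
    # (assumption: the current connection itself is counted).
    same_host = [c for c in conns
                 if c[1] == host and now - window <= c[0] <= now]
    count = len(same_host)
    serror_rate = sum(c[3] for c in same_host) / count
    same_srv_rate = sum(c[2] == service for c in same_host) / count
    return count, serror_rate, same_srv_rate


count, serror_rate, same_srv_rate = window_features(connections)
print(count, round(serror_rate, 2), round(same_srv_rate, 2))
```

A burst of connections with SYN errors to one host, as in a SYN flood, would show up here as a high count together with a serror_rate close to 1.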
Approach
1) First, we do some exploratory data analysis using Pandas.
2) Next, we pre-process the data and remove unnecessary features (attributes) from
our dataset.
3) Then we use clustering and anomaly detection. We want our model to work
well with unknown attack types and also to give an approximation of the closest attack type. We
use k-means clustering.
4) Finally, we build a classifier using Scikit-learn (a machine learning library).
Our classifier only classifies entries as normal or attack. By doing so, we can
generalise the model to new attack types.
Modules
1) Data pre-processing
Initially, we consider all features. Not all of them are numerical, so the categorical variables
(protocol_type, service, flag) need special handling. We apply feature selection to remove them,
which also reduces the dimensionality of our data.
2) KMeans clustering
We apply an anomaly detection approach to the reduced dataset, starting with k-means
clustering. Once we have the cluster centres, we can use them to identify attack or normal
clusters in a new dataset.
3) Classification
We train a classifier on our training dataset, use it to predict the labels of a separate data file,
and then measure the accuracy of those predictions.
4) Predictions
Based on the assumption that new attack types will resemble old ones, we will be able to detect
them. Moreover, anything that falls too far from every cluster will be considered anomalous and
therefore a possible attack.
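The idea that anything too far from every cluster is treated as a possible attack can be sketched as follows. This is only an illustration on synthetic two-dimensional data: the 99th-percentile cut-off and all variable names are assumptions, and a real threshold would have to be tuned on the KDD features.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic stand-in for scaled connection features: two tight groups.
train = np.vstack([rng.normal(0.0, 0.5, size=(200, 2)),
                   rng.normal(5.0, 0.5, size=(200, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(train)

# transform() gives the distance from each row to every cluster centre;
# the row minimum is the distance to the nearest centre.
train_dist = km.transform(train).min(axis=1)
threshold = np.percentile(train_dist, 99)   # assumed cut-off, tune on real data

new_points = np.array([[0.1, -0.2],     # close to the first group
                       [20.0, 20.0]])   # far from both groups
is_anomaly = km.transform(new_points).min(axis=1) > threshold
print(is_anomaly)
```

Points flagged this way are not assigned an attack type; they are only marked as unusual relative to the training data.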
Proposed Implementation Framework
[Figure: flow diagram. Box labels: KDD-1999 labelled raw dataset; KDD-1999 corrected raw dataset; feature selection and scaling; KDD-1999 labelled dataset; KDD-1999 corrected dataset; unlabelled dataset; labelled dataset; clustering and anomaly detection; anomaly detection algorithm; prediction results.]
Implementation
1) CLUSTERING
LOADING THE DATA
import pandas
from time import time

# Column names for the 41 features plus the class label
col_names = ["duration","protocol_type","service","flag","src_bytes",
    "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
    "logged_in","num_compromised","root_shell","su_attempted","num_root",
    "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
    "is_host_login","is_guest_login","count","srv_count","serror_rate",
    "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
    "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
    "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
    "dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]

# Adjust this path to the local copy of the dataset
kdd_data_10percent = pandas.read_csv(
    "D:studysem5dataminingprojectdatasetdatakddcup.data_10_percent_corrected",
    header=None, names=col_names)
kdd_data_10percent.describe()
OUTPUT
VIEWING THE LABELS
# Number of records per label
kdd_data_10percent['label'].value_counts()
OUTPUT
FEATURE SELECTION
# Keep only the numerical columns; the categorical columns
# (protocol_type, service, flag) and the label are left out
num_features = [
    "duration","src_bytes",
    "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
    "logged_in","num_compromised","root_shell","su_attempted","num_root",
    "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
    "is_host_login","is_guest_login","count","srv_count","serror_rate",
    "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
    "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
    "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
    "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
    "dst_host_rerror_rate","dst_host_srv_rerror_rate"
]
features = kdd_data_10percent[num_features].astype(float)
features.describe()
OUTPUT
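The framework mentions scaling alongside feature selection, and distance-based methods such as k-means and nearest neighbours are sensitive to the very different ranges of columns like src_bytes and serror_rate. A minimal sketch of one possible scaling step is shown below, using scikit-learn's StandardScaler on a small made-up sample. The sample values and variable names are assumptions, and this step is not part of the project code shown here.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Small made-up sample; on the real data, `features` would be passed instead.
sample = pd.DataFrame({
    "duration":    [0.0, 2.0, 0.0, 10.0],
    "src_bytes":   [181.0, 239.0, 54540.0, 0.0],
    "serror_rate": [0.0, 0.0, 1.0, 0.5],
})

scaler = StandardScaler()
# fit_transform learns each column's mean and standard deviation,
# then rescales the column to zero mean and unit variance.
scaled = pd.DataFrame(scaler.fit_transform(sample), columns=sample.columns)

print(scaled.round(3))
```

If such a scaler is used, the same fitted scaler should also be applied (with transform, not fit_transform) to the test data before clustering or prediction.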
CLUSTERING
from sklearn.cluster import KMeans

k = 30
km = KMeans(n_clusters=k)
t0 = time()
km.fit(features)
tt = time() - t0
print("Clustered in", round(tt, 3), "seconds")

# Show the cluster assigned to a sample of rows
for i in range(600, 620):
    print(km.labels_[i])
ASSIGNING LABELS
labels = kdd_data_10percent['label']
# For each cluster, collect the true labels of the rows assigned to it
label_names = [labels[km.labels_ == c] for c in range(k)]
for i in range(k):
    print("Cluster", i, "labels:")
    print(label_names[i].value_counts(), "\n")
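The per-cluster label counts printed above can also be summarised in a single table, from which a "typical" label per cluster can be read off. The sketch below does this with pandas.crosstab on a tiny hypothetical example; the series cluster_ids and true_labels stand in for km.labels_ and the label column.

```python
import pandas as pd

# Hypothetical cluster ids and true labels for eight connections.
cluster_ids = pd.Series([0, 0, 0, 1, 1, 2, 2, 2], name="cluster")
true_labels = pd.Series(["smurf.", "smurf.", "normal.", "normal.",
                         "normal.", "neptune.", "neptune.", "neptune."],
                        name="label")

# Rows are clusters, columns are labels, cells are record counts.
table = pd.crosstab(cluster_ids, true_labels)
# idxmax(axis=1) picks the most frequent label in each cluster.
majority = table.idxmax(axis=1)

print(table)
print(majority.to_dict())
```

A cluster whose counts are spread over several labels, like cluster 0 here, is a hint that k or the feature set may need adjusting.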
LOADING TESTING DATA
# Adjust this path to the local copy of the dataset
kdd_data_corrected = pandas.read_csv("D:studysem5dataminingprojectdatasetdatacorrected",
    header=None, names=col_names)
ASSIGNING CLUSTERS
t0 = time()
# Assign each test record to its nearest cluster centre
pred = km.predict(kdd_data_corrected[num_features].astype(float))
tt = time() - t0
print("Assigned clusters in", round(tt, 3), "seconds")
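Once test records have cluster ids, one way to turn them into attack/normal predictions is to look up each cluster's majority training label. The sketch below assumes such a lookup already exists; the dictionary majority, the four test records and the resulting agreement figure are all made up for illustration.

```python
import pandas as pd

# Hypothetical majority label per cluster, derived from the training data.
majority = {0: "smurf.", 1: "normal.", 2: "neptune."}

# Hypothetical cluster assignments and true labels for four test records.
pred_clusters = pd.Series([1, 0, 2, 1])
true_labels = pd.Series(["normal.", "smurf.", "back.", "snmpgetattack."])

# Closest known label for every record, then reduce to attack / normal.
closest_label = pred_clusters.map(majority)
pred_is_attack = closest_label.ne("normal.")
true_is_attack = true_labels.ne("normal.")

agreement = (pred_is_attack == true_is_attack).mean()
print(closest_label.tolist())
print(agreement)
```

In this toy case the unseen attack "back." lands in the "neptune." cluster, which illustrates the approximation of the closest attack type, while "snmpgetattack." is missed because it falls into a normal cluster.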
2) CLASSIFICATIONS
LOADING THE DATA, VIEWING THE LABELS, FEATURE SELECTION
These steps are the same as in the clustering part: the training file is read into
kdd_data_10percent using col_names, the label counts are inspected, and the numerical
columns listed in num_features are converted to float and stored in features.
ADDING LABELS
from sklearn.neighbors import KNeighborsClassifier

# Collapse every attack type into a single 'attack.' class
labels = kdd_data_10percent['label'].copy()
labels[labels != 'normal.'] = 'attack.'
labels.value_counts()
1) TRAINING CLASSIFIER WITH BALL TREE
# algorithm can be 'brute', 'ball_tree' or 'kd_tree'
clf = KNeighborsClassifier(n_neighbors=5, algorithm='ball_tree', leaf_size=500)
t0 = time()
clf.fit(features, labels)
tt = time() - t0
print("Classifier trained in", round(tt, 3), "seconds")
LOADING TESTING DATA
# Adjust this path to the local copy of the dataset
kdd_data_corrected = pandas.read_csv("D:studysem5dataminingprojectdatasetdatacorrected",
    header=None, names=col_names)
kdd_data_corrected['label'].value_counts()
CONVERTING LABELS
# Use .loc to avoid chained assignment on the DataFrame
kdd_data_corrected.loc[kdd_data_corrected['label'] != 'normal.', 'label'] = 'attack.'
kdd_data_corrected['label'].value_counts()
CREATING TEST SAMPLE
# train_test_split lives in sklearn.model_selection in current scikit-learn
from sklearn.model_selection import train_test_split
features_train, features_test, labels_train, labels_test = train_test_split(
    kdd_data_corrected[num_features],
    kdd_data_corrected['label'],
    test_size=0.1,
    random_state=42)
PREDICTING
t0 = time()
pred = clf.predict(features_test)
tt = time() - t0
print("Predicted in", round(tt, 3), "seconds")
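CHECKING ACCURACY
from sklearn.metrics import accuracy_score
# accuracy_score expects (true labels, predicted labels); it returns the
# fraction of correct predictions, which is not an R squared value
acc = accuracy_score(labels_test, pred)
print("Accuracy is", round(acc, 4))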
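Accuracy alone can hide how many attacks are missed, especially when one class dominates. As a hedged illustration, the sketch below computes accuracy, a confusion matrix and the recall for the attack class on six made-up labels; on the real data, labels_test and pred would take the place of labels_true and labels_pred.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Made-up true and predicted labels for six connections.
labels_true = ["attack.", "attack.", "attack.", "normal.", "normal.", "attack."]
labels_pred = ["attack.", "normal.", "attack.", "normal.", "attack.", "attack."]

acc = accuracy_score(labels_true, labels_pred)
# Rows are true classes, columns are predicted classes, in the given order.
cm = confusion_matrix(labels_true, labels_pred, labels=["attack.", "normal."])
# Recall for 'attack.': share of real attacks that were detected.
recall = recall_score(labels_true, labels_pred, pos_label="attack.")

print(round(acc, 4))
print(cm)
print(recall)
```

Here one of the four attacks is predicted as normal, which the recall of 0.75 makes visible even though it is only one of two errors in the accuracy figure.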
2) TRAINING CLASSIFIER WITH KD-TREE
# algorithm can be 'brute', 'ball_tree' or 'kd_tree'
clf = KNeighborsClassifier(n_neighbors=5, algorithm='kd_tree', leaf_size=500)
t0 = time()
clf.fit(features, labels)
tt = time() - t0
print("Classifier trained in", round(tt, 3), "seconds")
ACCURACY
from sklearn.metrics import accuracy_score
# Predict again so the score reflects the classifier trained above
pred = clf.predict(features_test)
acc = accuracy_score(labels_test, pred)
print("Accuracy is", round(acc, 4))
3) TRAINING CLASSIFIER WITH BRUTEFORCE
# algorithm can be 'brute', 'ball_tree' or 'kd_tree';
# leaf_size only affects the tree-based algorithms, so it is omitted here
clf = KNeighborsClassifier(n_neighbors=5, algorithm='brute')
t0 = time()
clf.fit(features, labels)
tt = time() - t0
print("Classifier trained in", round(tt, 3), "seconds")
ACCURACY
from sklearn.metrics import accuracy_score
# Predict again so the score reflects the classifier trained above
pred = clf.predict(features_test)
acc = accuracy_score(labels_test, pred)
print("Accuracy is", round(acc, 4))
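The three runs above differ only in the algorithm argument, so they can be written as one loop that fits, predicts and scores each variant in turn. The sketch below does this on synthetic data so that it runs on its own; the data, the labelling rule and the results dictionary are assumptions, and the timings it prints say nothing about the KDD data.

```python
from time import time

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(42)
# Synthetic stand-in data with a simple rule deciding the class.
X_train = rng.normal(size=(2000, 5))
y_train = np.where(X_train[:, 0] + X_train[:, 1] > 0, "attack.", "normal.")
X_test = rng.normal(size=(500, 5))
y_test = np.where(X_test[:, 0] + X_test[:, 1] > 0, "attack.", "normal.")

results = {}
for algo in ("ball_tree", "kd_tree", "brute"):
    clf = KNeighborsClassifier(n_neighbors=5, algorithm=algo)
    t0 = time()
    clf.fit(X_train, y_train)
    fit_time = time() - t0
    # Re-predict with the classifier that was just trained.
    pred = clf.predict(X_test)
    accuracy = float((pred == y_test).mean())
    results[algo] = (fit_time, accuracy)
    print(algo, round(fit_time, 4), round(accuracy, 4))
```

Keeping the prediction inside the loop ensures that each accuracy belongs to the classifier trained in that iteration. Note also that the brute-force variant does almost no work at fit time, so its cost appears mainly during prediction.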
CONCLUSION
We have formed clusters that can be used with real data to separate attack
connections from normal ones. Anything that falls far from every cluster can also be
treated as a possible attack.
The classification results are tabulated below.
ALGORITHM | TIME FOR TRAINING | ACCURACY
Ball-Tree | Least | 0.925 (near maximum)
KD-Tree | Slightly higher than Ball-Tree | 0.820 (least)
Brute force | High | 0.932 (maximum)
From our experiment we conclude that brute force is the most expensive algorithm but
produced the highest accuracy, while KD-tree gave the lowest accuracy on our data.
Ball-tree offered the best trade-off: it took almost the least time and achieved almost
the highest accuracy.
References
Dataset
[1] http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
Software
[2] https://spark.apache.org/downloads.html
Pyspark tutorial
[3] https://www.dezyre.com/apache-spark-tutorial/pyspark-tutorial
[4] https://www.datacamp.com/community/tutorials/apache-spark-python
Research article
[5] Tavallaee, M., Bagheri, E., Lu, W., & Ghorbani, A. A. (2009, July). A detailed analysis of
the KDD CUP 99 data set. In Computational Intelligence for Security and Defense Applications,
2009. CISDA 2009. IEEE Symposium on (pp. 1-6). IEEE.

More Related Content

PDF
Intelligent soft computing based
DOCX
Entropy based DDos Detection in SDN
PPTX
Time-based DDoS Detection and Mitigation for SDN Controller
PDF
Authentication in Different Scenarios
PDF
DoS Forensic Exemplar Comparison to a Known Sample
PPT
Reliable data transfer CN - prashant odhavani- 160920107003
PDF
Replay of Malicious Traffic in Network Testbeds
PPT
Client server computing in mobile environments part 2
Intelligent soft computing based
Entropy based DDos Detection in SDN
Time-based DDoS Detection and Mitigation for SDN Controller
Authentication in Different Scenarios
DoS Forensic Exemplar Comparison to a Known Sample
Reliable data transfer CN - prashant odhavani- 160920107003
Replay of Malicious Traffic in Network Testbeds
Client server computing in mobile environments part 2

What's hot (20)

PDF
Transforming Security: Containers, Virtualization and Softwarization
PDF
Deadlock in Distributed Systems
PDF
DDoS Attack Detection & Mitigation in SDN
PDF
Pattern-Oriented Network Trace Analysis
PDF
Myriam phd
PDF
Securing tesla broadcast protocol with diffie hellman key exchange
PDF
IRJET- Secure Kerberos System in Distributed Environment
PDF
Report_Summer
PDF
Cyber-security
PDF
Unveiling-Patchwork
PDF
IRJET- Estimating Various DHT Protocols
PDF
An Analytical Approach To Analyze The Impact Of Gray Hole Attacks In Manet
PDF
Deadlock in distribute system by saeed siddik
PDF
DDoS Attack on DNS using infected IoT Devices
DOCX
CONTROL CLOUD DATA ACCESS PRIVILEGE AND ANONYMITY WITH FULLY ANONYMOUS ATTRIB...
PPTX
Exploiting tls to disrupt privacy of web application's traffic
PDF
Manu sheelvant resume
PDF
Early exploring design alterna1ves of smart sensor so5ware with actors
PDF
State of the art parallel approaches for
PDF
Cldap threat-advisory
Transforming Security: Containers, Virtualization and Softwarization
Deadlock in Distributed Systems
DDoS Attack Detection & Mitigation in SDN
Pattern-Oriented Network Trace Analysis
Myriam phd
Securing tesla broadcast protocol with diffie hellman key exchange
IRJET- Secure Kerberos System in Distributed Environment
Report_Summer
Cyber-security
Unveiling-Patchwork
IRJET- Estimating Various DHT Protocols
An Analytical Approach To Analyze The Impact Of Gray Hole Attacks In Manet
Deadlock in distribute system by saeed siddik
DDoS Attack on DNS using infected IoT Devices
CONTROL CLOUD DATA ACCESS PRIVILEGE AND ANONYMITY WITH FULLY ANONYMOUS ATTRIB...
Exploiting tls to disrupt privacy of web application's traffic
Manu sheelvant resume
Early exploring design alterna1ves of smart sensor so5ware with actors
State of the art parallel approaches for
Cldap threat-advisory
Ad

Similar to Data mining final report (20)

PDF
Node Legitimacy Based False Data Filtering Scheme in Wireless Sensor Networks
PDF
ASSURED NEIGHBOR BASED COUNTER PROTOCOL ON MAC-LAYER PROVIDING SECURITY IN MO...
PDF
PT0-003 CompTIA PenTest+ Exam questions pdf 2025
PDF
06558266
PDF
Detecting Hacks: Anomaly Detection on Networking Data
PPTX
Detecting Hacks: Anomaly Detection on Networking Data
PPTX
Dynamic Population Discovery for Lateral Movement (Using Machine Learning)
PDF
5G-USA-Telemetry
PDF
Quantstamp Report - LINKSWAP
PDF
DDoS Attack Detection and Botnet Prevention using Machine Learning
PDF
Secure Checkpointing Approach for Mobile Environment
PDF
IntelFlow: Toward adding Cyber Threat Intelligence to Software Defined Networ...
PDF
Atlas Services Remote Analysis Report Sample
PPTX
Protecting Financial Networks from Cyber Crime
PPTX
The-Vulnerabldde-Algorithm-Hit-List.pptx
DOC
Layered approach using conditional random fields for intrusion detection (syn...
PDF
Proactive ops for container orchestration environments
PDF
Certified ethical hacker (cehv11) exam dumps 2022
PDF
Semantic Metadata Annotation for Network Anomaly Detection
PDF
PREDICTIVE DETECTION OF KNOWN SECURITY CRITICALITIES IN CYBER PHYSICAL SYSTEM...
Node Legitimacy Based False Data Filtering Scheme in Wireless Sensor Networks
ASSURED NEIGHBOR BASED COUNTER PROTOCOL ON MAC-LAYER PROVIDING SECURITY IN MO...
PT0-003 CompTIA PenTest+ Exam questions pdf 2025
06558266
Detecting Hacks: Anomaly Detection on Networking Data
Detecting Hacks: Anomaly Detection on Networking Data
Dynamic Population Discovery for Lateral Movement (Using Machine Learning)
5G-USA-Telemetry
Quantstamp Report - LINKSWAP
DDoS Attack Detection and Botnet Prevention using Machine Learning
Secure Checkpointing Approach for Mobile Environment
IntelFlow: Toward adding Cyber Threat Intelligence to Software Defined Networ...
Atlas Services Remote Analysis Report Sample
Protecting Financial Networks from Cyber Crime
The-Vulnerabldde-Algorithm-Hit-List.pptx
Layered approach using conditional random fields for intrusion detection (syn...
Proactive ops for container orchestration environments
Certified ethical hacker (cehv11) exam dumps 2022
Semantic Metadata Annotation for Network Anomaly Detection
PREDICTIVE DETECTION OF KNOWN SECURITY CRITICALITIES IN CYBER PHYSICAL SYSTEM...
Ad

More from Kedar Kumar (7)

PDF
Big data project
DOCX
.net programming using asp.net to make web project
DOCX
educational website report
PDF
Storage final rev
DOCX
Wireless multimedia sensor networking
PDF
Combinatorial testing
PPTX
Combinatorial testing ppt
Big data project
.net programming using asp.net to make web project
educational website report
Storage final rev
Wireless multimedia sensor networking
Combinatorial testing
Combinatorial testing ppt

Recently uploaded (20)

PPT
Project quality management in manufacturing
PDF
Unit I ESSENTIAL OF DIGITAL MARKETING.pdf
PDF
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
PDF
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
PDF
III.4.1.2_The_Space_Environment.p pdffdf
PPTX
CH1 Production IntroductoryConcepts.pptx
PDF
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
PDF
Well-logging-methods_new................
PPTX
Fundamentals of safety and accident prevention -final (1).pptx
PDF
R24 SURVEYING LAB MANUAL for civil enggi
PPTX
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
PPTX
Safety Seminar civil to be ensured for safe working.
PDF
PREDICTION OF DIABETES FROM ELECTRONIC HEALTH RECORDS
PDF
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
PPTX
web development for engineering and engineering
PPTX
Construction Project Organization Group 2.pptx
PDF
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
PPTX
Geodesy 1.pptx...............................................
PDF
737-MAX_SRG.pdf student reference guides
PDF
Human-AI Collaboration: Balancing Agentic AI and Autonomy in Hybrid Systems
Project quality management in manufacturing
Unit I ESSENTIAL OF DIGITAL MARKETING.pdf
Mitigating Risks through Effective Management for Enhancing Organizational Pe...
Enhancing Cyber Defense Against Zero-Day Attacks using Ensemble Neural Networks
III.4.1.2_The_Space_Environment.p pdffdf
CH1 Production IntroductoryConcepts.pptx
Evaluating the Democratization of the Turkish Armed Forces from a Normative P...
Well-logging-methods_new................
Fundamentals of safety and accident prevention -final (1).pptx
R24 SURVEYING LAB MANUAL for civil enggi
FINAL REVIEW FOR COPD DIANOSIS FOR PULMONARY DISEASE.pptx
Safety Seminar civil to be ensured for safe working.
PREDICTION OF DIABETES FROM ELECTRONIC HEALTH RECORDS
BMEC211 - INTRODUCTION TO MECHATRONICS-1.pdf
web development for engineering and engineering
Construction Project Organization Group 2.pptx
The CXO Playbook 2025 – Future-Ready Strategies for C-Suite Leaders Cerebrai...
Geodesy 1.pptx...............................................
737-MAX_SRG.pdf student reference guides
Human-AI Collaboration: Balancing Agentic AI and Autonomy in Hybrid Systems

Data mining final report

  • 1. 1 Data Mining Techniques ITE2006 NETWORK ABUSE DETECTION PROJECT REPORT SUBMITTED BY 15BIT0134 RUBAL NANDAL 15BIT0268 KEDAR KUMAR Guided By: Dr. Sudha M
  • 2. 2 CERTIFICATE This is to guarantee that the undertaking work entitled "STUDENT Marks Analysis" that is being put together by "KEDAR KUMAR (15BIT0268) and RUBAL NANDAL (15BIT0134)" is a record of bonafide work done in Data MINING (ITE2006) under my watch. The substance of this Project work, in full or in parts, have nor been taken from some other source nor have been submitted for some other CAL course. PLACE:VELLORE DATE:1/11/2017 KEDAR KUMAR (15BIT0268) RUBAL NANDAL (15BIT0134)"
  • 3. 3 Table of components Acknowlegement 2 Problem Statement 3 Approach 6 Modules 7 Proposed Implementation 8 Implementation 9 Conclusi on 22 Referenc es 23
  • 4. 4 ACKNOWLEDGEMENTS We acknowledge SUDHA M mam for the direction and help gave help the execution of the undertaking. We additionally recognize all others worried about accomplishment of this undertaking. It is standard to recognize the University Management/School Dean for giving us a chance to complete our examinations at the University. Thanks for such an outstanding opportunity to us. Problem Statement Now a days there are so many attacks are carried out on various people with malicious intents .Most of them are network attacks , so we attempt to develop an network abuse detection (intrusion detection ) from the KDD-1999 data set and try to identity normal connection and attacked connection To detect network intrusions protects a computer network from unauthorized users, including perhaps insiders. The intrusion detector learning task is to build a predictive model (i.e. a classifier) capable of distinguishing between "bad" connections, called intrusions or attacks, and "good" normal connections. A connection is a sequence of TCP packets starting and ending at some well defined times, between which data flows to and from a source IP address to a target IP address under some well defined protocol. Each connection is labelled as either normal, or as an attack, with exactly one specific attack type. Each connection record consists of about 100 bytes. Attacks fall into four main categories  DOS: denial-of-service, e.g. syn flood;  R2L: unauthorized access from a remote machine, e.g. guessing password;  U2R: unauthorized access to local superuser (root) privileges, e.g., various "buffer overflow" attacks;  PROBING: surveillance and other probing, e.g., port scanning.
  • 5. 5 ABOUT DATASET Our dataset contains these features Table 1: Basic features of individual TCP connections feature name description type duration length (number of seconds) of the connection continuous protocol_type type of the protocol, e.g. tcp, udp, etc. discrete service network service on the destination, e.g., http, telnet, etc. discrete src_bytes number of data bytes from source to destination continuous dst_bytes number of data bytes from destination to source continuous flag normal or error status of the connection discrete land 1 if connection is from/to the same host/port; 0 otherwise discrete wrong_fragment number of "wrong" fragments continuous urgent number of urgent packets continuous Table 2: Content features within a connection suggested by domain knowledge feature name description type hot number of "hot" indicators continuous num_failed_logins number of failed login attempts continuous logged_in 1 if successfully logged in; 0 otherwise discrete num_compromised number of "compromised" conditions continuous root_shell 1 if root shell is obtained; 0 otherwise discrete su_attempted 1 if "su root" command attempted; 0 otherwise discrete num_root number of "root" accesses continuous
  • 6. 6 num_file_creations number of file creation operations continuous num_shells number of shell prompts continuous num_access_files number of operations on access control files continuous num_outbound_cmds number of outbound commands in an ftp session continuous is_hot_login 1 if the login belongs to the "hot" list; 0 otherwise discrete is_guest_login 1 if the login is a "guest"login; 0 otherwise discrete Table 3: Traffic features computed using a two-second time window feature name description> type count number of connections to the same host as the current connection in the past two seconds continuous Note: The following features refer to these same-host connections. serror_rate % of connections that have "SYN" errors continuous rerror_rate % of connections that have "REJ" errors continuous same_srv_rate % of connections to the same service continuous diff_srv_rate % of connections to different services continuous srv_count number of connections to the same service as the current connection in the past two seconds continuous Note: The following features refer to these same-service connections. srv_serror_rate % of connections that have "SYN" errors continuous srv_rerror_rate % of connections that have "REJ" errors continuous srv_diff_host_rate % of connections to different hosts continuous
  • 7. 7 Approach 1)There we will do some exploratory data analysis using Pandas. 2) After that we will do Data pre-processing and remove unnecessary features (attributes) from our dataset 3) Then we will use clustering and anomality detection. We want our model to be able to work well with unknown attack types and also to give an approximation of the closest attack type. We will use K-mean clustering. 4) Then we will build a classifier using Scikit-learn (machine learning library). Our classifier will just classify entries into normal or attack. By doing so, we can generalise the model to new attack types.
  • 8. 8 Modules 1) Data Pre-processing: Initially, we will use all features. We need to do something with our categorical variables. But not all the features are numerical so we will do feature selection to remove unwanted features to reduce the dimensionality of our data. 2) KMeans clustering We will perform anomaly detection approach in the reduced dataset. We will start by doing k- means clustering. Once we have the cluster centres, we can use it to identify the clusters of attack or normal in new dataset 3) Classification In classification we will train our dataset and make a classifier and use that classifier to predict other data file and then we will test our estimation with R2 test to predict the accuracy of our classifier. 4) Predictions Based on the assumption that new attack types will resemble old type, we will be able to detect those. Moreover, anything that falls too far from any cluster, will be considered anomalous and therefore a possible attack.
  • 9. 9 Feature selection and scaling DKK-1999 Labelled dataset Proposed Implementation Framework DKK-1999 Labelled raw dataset DKK-1999 Corrected raw dataset Clustering and anomaly detection Anomaly detection algorithm DKK-1999 Corrected dataset Unlabell ed dataset labelled dataset Predicti on results
  • 10. 10 Implementation 1) CLUSTERING LOADING THE DATA In [2] : import pandas from time import time col_names = ["duration","protocol_type","service","flag","src_bytes", "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins", "logged_in","num_compromised","root_shell","su_attempted","num_root", "num_file_creations","num_shells","num_access_files","num_outbound_cmds", "is_host_login","is_guest_login","count","srv_count","serror_rate", "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate", "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count", "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate", "dst_host_rerror_rate","dst_host_srv_rerror_rate","label"] kdd_data_10percent = pandas.read_csv("D:studysem5dataminingprojectdatasetdatakddcup.data_10_percent_corrected", header=None, names = col_names) kdd_data_10percent.describe()
  • 11. 11 OUTPUT VIEWING THE LABELS In [3] : kdd_data_10percent['label'].value_counts() OUTPUT
  • 12. 12 FEATURE SELECTION In [4] :num_features = [ "duration","src_bytes", "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins", "logged_in","num_compromised","root_shell","su_attempted","num_root", "num_file_creations","num_shells","num_access_files","num_outbound_cmds", "is_host_login","is_guest_login","count","srv_count","serror_rate", "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate", "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count", "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate", "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate", "dst_host_rerror_rate","dst_host_srv_rerror_rate" ] features = kdd_data_10percent[num_features].astype(float) features.describe() OUTPUT
  • 13. 13 CLUSTERING from sklearn.cluster import KMeans k = 30 km = KMeans(n_clusters = k) t0 = time() km.fit(features) tt = time()-t0 print("Clustered in",round(tt,3)," seconds") #visualising cluster sample for i in range(600,620): print (km.labels_[i]) ASSIGINING LABELS labels = kdd_data_10percent['label'] label_names = list(map( lambda x: pandas.Series([labels[i] for i in range(len(km.labels_)) if km.labels_[i]==x]), range(k))) for i in range(k): print ("Cluster ",i," labels:") print (label_names[i].value_counts(),"n") print
  • 14. 14 LOADING TESTING DATA
    # Test data: the labelled "corrected" file of KDD Cup 1999.
    kdd_data_corrected = pandas.read_csv("D:studysem5dataminingprojectdatasetdatacorrected", header=None, names = col_names)

    ASSIGNING CLUSTERS
    t0 = time()
    # Assign each test connection to its nearest trained centroid.
    pred = km.predict(kdd_data_corrected[num_features])
    tt = time() - t0
    print ("Assigned clusters in",round(tt,3)," seconds")
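The report stops at assigning test rows to clusters. One possible next step, shown here only as a sketch under stated assumptions, is to give each test row the majority training label of its cluster and measure how often that agrees with the true label. The helper `toy_blobs`, the synthetic data, and the two-cluster setup are hypothetical and not part of the original notebook.

```python
import numpy
import pandas
from sklearn.cluster import KMeans

rng = numpy.random.RandomState(1)

def toy_blobs(n):
    """Return n synthetic 'normal.' rows near 0 and n 'attack.' rows near 10."""
    x = numpy.vstack([rng.normal(0.0, 0.5, size=(n, 2)),
                      rng.normal(10.0, 0.5, size=(n, 2))])
    y = numpy.array(["normal."] * n + ["attack."] * n)
    return x, y

train_x, train_y = toy_blobs(40)
test_x, test_y = toy_blobs(10)

toy_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(train_x)

# Majority training label of each cluster, keyed by cluster id.
cluster_to_label = (pandas.Series(train_y)
                    .groupby(toy_km.labels_)
                    .agg(lambda s: s.value_counts().idxmax())
                    .to_dict())

# Assign test rows to clusters, then look up each cluster's majority label.
test_clusters = toy_km.predict(test_x)
test_pred = numpy.array([cluster_to_label[c] for c in test_clusters])
agreement = float((test_pred == test_y).mean())
print(agreement)
```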
  • 15. 15 2) CLASSIFICATIONS
    LOADING THE DATA
    In [2]:
    import pandas
    from time import time

    # Column names for the KDD Cup 1999 data, which ships without a header row.
    col_names = ["duration","protocol_type","service","flag","src_bytes",
                 "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
                 "logged_in","num_compromised","root_shell","su_attempted","num_root",
                 "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
                 "is_host_login","is_guest_login","count","srv_count","serror_rate",
                 "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
                 "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
                 "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
                 "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
                 "dst_host_rerror_rate","dst_host_srv_rerror_rate","label"]
    # Training data: the 10 percent subset of KDD Cup 1999.
    kdd_data_10percent = pandas.read_csv("D:studysem5dataminingprojectdatasetdatakddcup.data_10_percent_corrected", header=None, names = col_names)
    kdd_data_10percent.describe()
    OUTPUT
  • 16. 16 VIEWING THE LABELS
    In [3]:
    kdd_data_10percent['label'].value_counts()
    OUTPUT
    FEATURE SELECTION
  • 17. 17 In [4]:
    # Numeric columns only: protocol_type, service, flag and label are left out.
    num_features = [
        "duration","src_bytes",
        "dst_bytes","land","wrong_fragment","urgent","hot","num_failed_logins",
        "logged_in","num_compromised","root_shell","su_attempted","num_root",
        "num_file_creations","num_shells","num_access_files","num_outbound_cmds",
        "is_host_login","is_guest_login","count","srv_count","serror_rate",
        "srv_serror_rate","rerror_rate","srv_rerror_rate","same_srv_rate",
        "diff_srv_rate","srv_diff_host_rate","dst_host_count","dst_host_srv_count",
        "dst_host_same_srv_rate","dst_host_diff_srv_rate","dst_host_same_src_port_rate",
        "dst_host_srv_diff_host_rate","dst_host_serror_rate","dst_host_srv_serror_rate",
        "dst_host_rerror_rate","dst_host_srv_rerror_rate"
    ]
    features = kdd_data_10percent[num_features].astype(float)
    features.describe()
    OUTPUT
  • 18. 18 ADDING LABELS
    from sklearn.neighbors import KNeighborsClassifier

    # Collapse every attack type into a single 'attack.' class.
    labels = kdd_data_10percent['label'].copy()
    labels[labels!='normal.'] = 'attack.'
    labels.value_counts()

    1) TRAINING CLASSIFIER WITH BALL TREE
    # algorithm options: 'brute', 'ball_tree', 'kd_tree'
    clf = KNeighborsClassifier(n_neighbors = 5, algorithm = 'ball_tree', leaf_size=500)
    t0 = time()
    clf.fit(features,labels)
    tt = time() - t0
    print ("Classifier trained in",round(tt,3),"seconds")

    LOADING TESTING DATA
    kdd_data_corrected = pandas.read_csv("D:studysem5dataminingprojectdatasetdatacorrected", header=None, names = col_names)
    kdd_data_corrected['label'].value_counts()
  • 19. 19 CONVERTING LABELS
    # Use .loc so the assignment modifies the frame itself rather than a temporary copy.
    kdd_data_corrected.loc[kdd_data_corrected['label']!='normal.', 'label'] = 'attack.'
    kdd_data_corrected['label'].value_counts()

    CREATING TEST SAMPLE
    # train_test_split lives in sklearn.model_selection (sklearn.cross_validation has been removed).
    from sklearn.model_selection import train_test_split
    # Keep a random 10% of the corrected file as the test sample.
    features_train, features_test, labels_train, labels_test = train_test_split(
        kdd_data_corrected[num_features],
        kdd_data_corrected['label'],
        test_size=0.1,
        random_state=42)

    PREDICTING
    t0 = time()
    pred = clf.predict(features_test)
    tt = time() - t0
    print ("Predicted in",round(tt,3)," seconds")
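  • 20. 20 CHECKING ACCURACY
    from sklearn.metrics import accuracy_score
    # accuracy_score expects (true labels, predicted labels) and returns the fraction of correct predictions.
    acc = accuracy_score(labels_test, pred)
    print("Accuracy is ",round(acc,4))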
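The value returned by `accuracy_score` is a classification accuracy, i.e. the fraction of predictions that match the true labels, rather than a regression R squared. The sketch below checks that by hand on a few invented labels and adds a confusion matrix, which separates missed attacks from false alarms. The label lists are illustrative only and are not results from the report.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hand-made labels for illustration only; not results from the report.
y_true = ["attack.", "attack.", "attack.", "normal.", "normal.", "attack."]
y_pred = ["attack.", "attack.", "normal.", "normal.", "attack.", "attack."]

# Accuracy by hand: number of matching positions divided by the total.
manual_acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
sk_acc = accuracy_score(y_true, y_pred)

# Rows are true classes, columns are predicted classes, in the order given.
cm = confusion_matrix(y_true, y_pred, labels=["attack.", "normal."])
print(round(manual_acc, 4), round(sk_acc, 4))
print(cm)
```

Because attacks far outnumber normal connections in this dataset, the confusion matrix can be more informative than accuracy alone.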
  • 21. 21 2) TRAINING CLASSIFIER WITH KD-TREE
    # algorithm options: 'brute', 'ball_tree', 'kd_tree'
    clf = KNeighborsClassifier(n_neighbors = 5, algorithm = 'kd_tree', leaf_size=500)
    t0 = time()
    clf.fit(features,labels)
    tt = time() - t0
    print ("Classifier trained in",round(tt,3),"seconds")

    ACCURACY
    from sklearn.metrics import accuracy_score
    # Predict again so the score reflects the KD-tree model, not the previous one.
    pred = clf.predict(features_test)
    acc = accuracy_score(labels_test, pred)
    print("Accuracy is ",round(acc,4))
  • 22. 22 3) TRAINING CLASSIFIER WITH BRUTEFORCE
    # algorithm options: 'brute', 'ball_tree', 'kd_tree'
    clf = KNeighborsClassifier(n_neighbors = 5, algorithm = 'brute', leaf_size=500)
    t0 = time()
    clf.fit(features,labels)
    tt = time() - t0
    print ("Classifier trained in",round(tt,3),"seconds")

    ACCURACY
    from sklearn.metrics import accuracy_score
    # Predict again so the score reflects the brute-force model, not the previous one.
    pred = clf.predict(features_test)
    acc = accuracy_score(labels_test, pred)
    print("Accuracy is ",round(acc,4))
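The three training runs above differ only in the `algorithm` argument, so they can be compared in one loop that fits, predicts and scores each setting in turn. This is a sketch on synthetic Gaussian data, not the KDD features; the array names, sample sizes and the `results` dictionary are assumptions. scikit-learn spells the options `'ball_tree'`, `'kd_tree'` and `'brute'`.

```python
from time import time

import numpy
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier

rng = numpy.random.RandomState(42)
# Synthetic stand-in for the KDD features: two Gaussian classes in 5 dimensions.
x_train = numpy.vstack([rng.normal(0.0, 1.0, size=(500, 5)),
                        rng.normal(3.0, 1.0, size=(500, 5))])
y_train = numpy.array(["normal."] * 500 + ["attack."] * 500)
x_test = numpy.vstack([rng.normal(0.0, 1.0, size=(100, 5)),
                       rng.normal(3.0, 1.0, size=(100, 5))])
y_test = numpy.array(["normal."] * 100 + ["attack."] * 100)

results = {}
for algo in ("ball_tree", "kd_tree", "brute"):
    clf = KNeighborsClassifier(n_neighbors=5, algorithm=algo)
    t0 = time()
    clf.fit(x_train, y_train)
    fit_time = time() - t0
    t0 = time()
    pred = clf.predict(x_test)  # re-predict with the freshly trained model
    predict_time = time() - t0
    results[algo] = accuracy_score(y_test, pred)
    print(algo, round(fit_time, 3), round(predict_time, 3), round(results[algo], 4))
```

All three settings perform an exact neighbour search, so on data without distance ties they would be expected to agree closely on the predictions and differ mainly in timing. Timing both `fit` and `predict` matters here because brute force does almost no work at training time and most of its work at prediction time.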
  • 23. 23 CONCLUSION
    We have formed clusters. Those clusters can be used with real data to tell an attack from a normal connection. Anything falling far from every cluster can also be considered an attack.
    From classification we obtained the results tabulated below.

    ALGORITHM     TIME FOR TRAINING              ACCURACY
    Ball-Tree     Least                          0.925 (near max)
    KD-Tree       Little higher than Ball-Tree   0.820 (least)
    Bruteforce    High                           0.932 (maximum)

    From our experiment we concluded that brute force is the most expensive algorithm but produced the maximum accuracy. KD-tree, on the other hand, gave the lowest accuracy on our data. Ball-tree worked best overall, as it took almost the least time and reached almost the maximum accuracy.
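The remark that a connection falling far from every cluster can be treated as an attack can be sketched with `KMeans.transform`, which returns the distance from each row to every centroid. The data below is synthetic, and the cut-off at the 99th percentile of training distances is a hypothetical choice made for illustration; the report does not specify a threshold.

```python
import numpy
from sklearn.cluster import KMeans

rng = numpy.random.RandomState(0)
# Synthetic "known traffic": two tight blobs (illustrative values only).
known = numpy.vstack([rng.normal(0.0, 0.5, size=(200, 2)),
                      rng.normal(10.0, 0.5, size=(200, 2))])
toy_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(known)

# Distance from each training row to its nearest centroid.
train_dist = toy_km.transform(known).min(axis=1)
# Hypothetical cut-off: the 99th percentile of the training distances.
threshold = numpy.percentile(train_dist, 99)

new_rows = numpy.array([[0.1, -0.2],   # close to the first blob
                        [5.0, 5.0]])   # far from both centroids
new_dist = toy_km.transform(new_rows).min(axis=1)
is_far = new_dist > threshold
print(is_far)
```

On the real features the columns have very different scales (for example byte counts versus rates), so distances would probably need scaled features to be meaningful.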
  • 24. 24 References
    Dataset
    [1] http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
    Software
    [2] https://spark.apache.org/downloads.html
    PySpark tutorial
    [3] https://www.dezyre.com/apache-spark-tutorial/pyspark-tutorial
    [4] https://www.datacamp.com/community/tutorials/apache-spark-python
    Research article
    [5] Tavallaee, M., Bagheri, E., Lu, W., & Ghorbani, A. A. (2009, July). A detailed analysis of the KDD CUP 99 data set. In Computational Intelligence for Security and Defense Applications, 2009. CISDA 2009. IEEE Symposium on (pp. 1-6). IEEE.