Network Intrusion Detection Analysis using Random Forest Algorithm on Apache Mahout

Group 7
December 8, 2014
Prof Shanoan Tian

• Random Forest uses decision trees as its base.
  • A conventional decision tree model uses a single tree.
  • That tree is explored in depth, producing many branches and many leaves.
  • It has to be monitored and tailored by a trained statistician.

• Many decision trees
• Shallow exploration of each tree
• A slightly different dataset for each tree
• Specified questions
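The points above (many trees, each shallow, each trained on a slightly different dataset) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Mahout's implementation: `bootstrap_sample`, `train_stump` and `forest_predict` are hypothetical names, the toy data is invented, each "tree" is a single threshold on one feature, and the random feature selection used by real random forests is left out.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so each tree sees a slightly different dataset."""
    return [rng.choice(rows) for _ in rows]

def majority(labels, default):
    """Most common label, or a default when the list is empty."""
    return Counter(labels).most_common(1)[0][0] if labels else default

def train_stump(sample):
    """Hypothetical shallow tree: one threshold (the sample mean) on a single feature."""
    threshold = sum(x for x, _ in sample) / len(sample)
    overall = majority([y for _, y in sample], None)
    left = majority([y for x, y in sample if x <= threshold], overall)
    right = majority([y for x, y in sample if x > threshold], overall)
    return lambda x: left if x <= threshold else right

def forest_predict(trees, x):
    """Each tree votes; the forest returns the most common vote."""
    return Counter(tree(x) for tree in trees).most_common(1)[0][0]

# Invented toy data: small feature values are "normal", large ones "anomaly".
rng = random.Random(0)
data = [(v, "normal") for v in range(10)] + [(v, "anomaly") for v in range(10, 20)]
trees = [train_stump(bootstrap_sample(data, rng)) for _ in range(25)]

print(forest_predict(trees, 1), forest_predict(trees, 18))
```

No single stump is reliable on its own, but the majority vote over 25 of them is stable, which is the motivation for using many shallow trees instead of one deep, hand-tuned tree.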

• Apache Mahout is a library implemented on top of Apache Hadoop
• Scalable machine-learning algorithms using the MapReduce paradigm
• Data mining tools for data stored on a Hadoop system:
  • Clustering
  • Classification
  • Batch-based collaborative filtering

The dataset we are using is the NSL-KDD dataset [2], an improved version of the KDD 99 dataset [1]. The KDD 1999 data set is a collection of simulated computer network intrusion data.

• An Apache Software Foundation project to create scalable machine learning libraries
• http://mahout.apache.org
• Why Mahout?
  • Scalable machine learning algorithms
  • MapReduce implementations on Apache Hadoop

• Apache Mahout has several classification algorithm implementations:
  • Naïve Bayes
  • Complementary Naïve Bayes
  • Random Forest
  • Hidden Markov Models
  • Logistic Regression

• Most algorithms have a Driver program
• The shell script in $MAHOUT_HOME/bin helps with most tasks
• Prepare the data
  • Different algorithms require different setup
• Run the algorithm
  • Single node
  • Hadoop
• Print out the results
  • Several helper classes: ClusterDumper, etc.

• Make a directory in HDFS
$ hadoop fs -mkdir testdata
• Load the data into HDFS
$ hadoop fs -put ./downloads/data/* testdata
• Verify that the data loaded
$ hadoop fs -ls testdata

• Generate the data descriptor for the training set
$ hadoop jar mahout-examples-0.9-job.jar org.apache.mahout.classifier.df.tools.Describe -p testdata/KDDTrain+_20Percent.arff -f testdata/KDDTrain+_20Percent.info -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
• The string "N 3 C 2 N C 4 N C 8 N 2 C 19 N L" describes all the attributes of the data: 1 numerical (N) attribute, followed by 3 categorical (C) attributes, and so on. A number in front of a type repeats that type, and L indicates the label.
 14/12/08 07:37:53 INFO mapreduce.BuildForest:
Build Time: 0h 0m 25s 949
Forest num Nodes: 62706
Forest mean num Nodes: 627
Forest mean max Depth: 14

org.apache.mahout.classifier.df.mapreduce.TestForest -i testdata/KDDTest+.arff -ds testdata/KDDTrain+_20Percent.info -m nsl-forest -a -mr -o predictions
• Predicts on the "KDDTest+.arff" dataset (-i), using the same data descriptor generated for the training set (-ds) and the decision forest built previously (-m), and computes the confusion matrix (-a)
• Passing the -mr parameter runs the job on the Hadoop MapReduce framework

Correctly Classified Instances      17639    78.2425%
Incorrectly Classified Instances     4905    21.7575%
Total Classified Instances          22544   100%

Confusion Matrix
                   A = normal   B = anomaly   Total
Actual normal            9454           257    9711
Actual anomaly           4648          8185   12833
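The accuracy figure can be re-derived from the confusion matrix, along with per-class rates that the summary does not show. The sketch below assumes that rows are the actual classes and columns the predicted classes, and treats "anomaly" as the positive class; `summarize` and the metric names are illustrative choices, not Mahout output.

```python
# Confusion matrix from the slide.
# Assumption: outer keys are actual classes, inner keys are predicted classes.
matrix = {
    "normal":  {"normal": 9454, "anomaly": 257},
    "anomaly": {"normal": 4648, "anomaly": 8185},
}

def summarize(matrix, positive="anomaly", negative="normal"):
    """Derive common metrics from a 2x2 confusion matrix."""
    tp = matrix[positive][positive]   # attacks flagged as attacks
    fn = matrix[positive][negative]   # attacks missed
    fp = matrix[negative][positive]   # normal traffic flagged as attacks
    tn = matrix[negative][negative]   # normal traffic passed as normal
    total = tp + fn + fp + tn
    return {
        "total": total,
        "accuracy": (tp + tn) / total,
        "detection_rate": tp / (tp + fn),      # recall on the anomaly class
        "false_alarm_rate": fp / (fp + tn),
        "precision": tp / (tp + fp),
    }

for name, value in summarize(matrix).items():
    print(f"{name}: {value:.4f}" if isinstance(value, float) else f"{name}: {value}")
```

Under that reading, the 78.24% accuracy combines a low false alarm rate (roughly 2.6% of normal records flagged) with a detection rate of only about 63.8%: most of the 4905 errors are attacks classified as normal.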

• NSL-KDD dataset: http://nsl.cs.unb.ca/NSL-KDD/
