Network Intrusion Detection Analysis using Random Forest Algorithm on Apache Mahout
Group 7
December 8, 2014
Prof. Shanoan Tian
• Uses decision trees as a base.
  • Normally one tree is used.
  • The tree is explored in depth: many branches, many leaves.
  • It has to be monitored and tailored by a trained statistician.
• Many decision trees
• Shallow exploration
• Slightly different dataset for each tree
• Specified questions
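The points above can be illustrated with a minimal Python sketch. The slides do not say how the per-tree datasets differ or how the trees are combined, so this assumes the usual random-forest recipe: each tree gets a bootstrap sample (rows drawn with replacement) and the forest returns the majority vote. The function names are hypothetical and are not part of Mahout.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, so each tree sees a slightly different dataset."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(votes):
    """Return the label that the largest number of trees voted for."""
    return Counter(votes).most_common(1)[0][0]


rng = random.Random(7)      # fixed seed so the run is repeatable
rows = list(range(10))      # stand-in for ten training records

# One bootstrap sample per (hypothetical) tree.
samples = [bootstrap_sample(rows, rng) for _ in range(3)]
for sample in samples:
    print(sorted(sample))

# Three trees vote on one connection record.
print(majority_vote(["normal", "anomaly", "anomaly"]))  # anomaly
```

Because each sample repeats some rows and omits others, the trees disagree on hard cases, and the vote averages out their individual errors.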
• Apache Mahout is a library implemented on top of Apache Hadoop
  • Scalable machine-learning algorithms
  • Uses the MapReduce paradigm
• Data mining tools for data stored on a Hadoop system
  • Clustering
  • Classification
  • Batch-based collaborative filtering
The dataset we are using is the NSL-KDD dataset [2], an improvement on the KDD 99 dataset [1]. The KDD 1999 dataset is a set of simulated computer network intrusions.
• An Apache Software Foundation project to create scalable machine learning libraries
• http://mahout.apache.org
• Why Mahout?
  • Scalable machine learning algorithms
  • MapReduce implementations on Apache Hadoop
• Apache Mahout has implementations of several classification algorithms:
  • Naïve Bayes
  • Complementary Naïve Bayes
  • Random Forest
  • Hidden Markov Models
  • Logistic Regression
• Most algorithms have a Driver program
  • The shell script in $MAHOUT_HOME/bin helps with most tasks
• Prepare the data
  • Different algorithms require different setup
• Run the algorithm
  • Single node
  • Hadoop
• Print out the results
  • Several helper classes: ClusterDumper, etc.
• Make a directory in HDFS
  $ hadoop fs -mkdir testdata
• Load the data into HDFS
  $ hadoop fs -put ./downloads/data/* testdata
• Verify that the data was loaded
  $ hadoop fs -ls testdata
$ hadoop jar mahout-examples-0.9-job.jar \
    org.apache.mahout.classifier.df.tools.Describe \
    -p testdata/KDDTrain+_20Percent.arff \
    -f testdata/KDDTrain+_20Percent.info \
    -d N 3 C 2 N C 4 N C 8 N 2 C 19 N L
• The descriptor "N 3 C 2 N C 4 N C 8 N 2 C 19 N L" describes all the attributes of the data: 1 numerical (N) attribute, followed by 3 categorical (C) attributes, ...; L indicates the label.
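To see what the descriptor string covers, it can be expanded by hand. The Python sketch below assumes the rule described above, that a number repeats the attribute type that follows it; `expand_descriptor` is a hypothetical helper written for illustration and is not a Mahout function.

```python
from collections import Counter


def expand_descriptor(descriptor):
    """Expand a compact descriptor such as "N 3 C L" into one type code per column."""
    expanded = []
    repeat = 1
    for token in descriptor.split():
        if token.isdigit():
            # A number applies to the next type code only.
            repeat = int(token)
        else:
            expanded.extend([token] * repeat)
            repeat = 1
    return expanded


attributes = expand_descriptor("N 3 C 2 N C 4 N C 8 N 2 C 19 N L")
print(attributes[:6])        # ['N', 'C', 'C', 'C', 'N', 'N']
print(len(attributes))       # 42
print(Counter(attributes))   # 34 numerical, 7 categorical, 1 label
```

The expansion yields 42 columns: 41 attributes plus the label.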
14/12/08 07:37:53 INFO mapreduce.BuildForest: Build Time: 0h 0m 25s 949
14/12/08 07:37:53 INFO mapreduce.BuildForest: Forest num Nodes: 62706
14/12/08 07:37:53 INFO mapreduce.BuildForest: Forest mean num Nodes: 627
14/12/08 07:37:53 INFO mapreduce.BuildForest: Forest mean max Depth: 14
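The log does not print the number of trees, but the two node counts allow a rough estimate. The Python sketch below assumes that the logged mean is the total node count divided by the number of trees and truncated to an integer; that is an assumption about how the value is produced, not something the slides state.

```python
total_nodes = 62706   # "Forest num Nodes" from the log
mean_nodes = 627      # "Forest mean num Nodes" from the log

# Estimate the tree count from the two logged values.
estimated_trees = round(total_nodes / mean_nodes)
print(estimated_trees)                  # 100

# Consistency check: integer division by the estimate gives back the logged mean.
print(total_nodes // estimated_trees)   # 627
```

Under that assumption the forest contains about 100 trees, each with roughly 627 nodes.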
$ hadoop jar mahout-examples-0.9-job.jar \
    org.apache.mahout.classifier.df.mapreduce.TestForest \
    -i testdata/KDDTest+.arff \
    -ds testdata/KDDTrain+_20Percent.info \
    -m nsl-forest -a -mr -o predictions
• Predicts on the "KDDTest+.arff" dataset (-i) using the same data descriptor generated for the training set (-ds) and the decision forest built previously (-m), and computes the confusion matrix (-a).
• The -mr parameter runs the prediction on the Hadoop MapReduce framework.
Correctly Classified Instances      17639    78.2425%
Incorrectly Classified Instances     4905    21.7575%
Total Classified Instances          22544   100%
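The percentages follow directly from the instance counts. A short Python check, using only the numbers reported above:

```python
correct = 17639
incorrect = 4905
total = correct + incorrect

accuracy_pct = 100 * correct / total
error_pct = 100 * incorrect / total

print(total)                    # 22544
print(f"{accuracy_pct:.4f}%")   # 78.2425%
print(f"{error_pct:.4f}%")      # 21.7575%
```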
Confusion Matrix (rows: actual class; columns: classified as)

                A = normal   B = anomaly   Total
A = normal            9454           257    9711
B = anomaly           4648          8185   12833
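Overall accuracy hides how the errors are split between the two classes. The Python sketch below derives per-class rates from the four counts. It assumes that each row of the matrix is an actual class and each column a predicted class, which matches the row totals but is not stated on the slide. The variable names are illustrative.

```python
# Counts from the confusion matrix; rows are assumed to be the actual class.
normal_as_normal, normal_as_anomaly = 9454, 257
anomaly_as_normal, anomaly_as_anomaly = 4648, 8185

# Share of actual anomalies that were flagged (recall for the anomaly class).
detection_rate = anomaly_as_anomaly / (anomaly_as_normal + anomaly_as_anomaly)

# Share of normal traffic that was wrongly flagged as an anomaly.
false_alarm_rate = normal_as_anomaly / (normal_as_normal + normal_as_anomaly)

# Share of flagged records that really were anomalies.
precision = anomaly_as_anomaly / (normal_as_anomaly + anomaly_as_anomaly)

print(f"detection rate:   {detection_rate:.4f}")
print(f"false alarm rate: {false_alarm_rate:.4f}")
print(f"precision:        {precision:.4f}")
```

Read this way, most of the 4905 errors are missed anomalies (4648) rather than false alarms (257).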
• http://nsl.cs.unb.ca/NSL-KDD/
