Today

Malware Detection using n-grams
and Evaluation using Machine
Learning Algorithms
11MSE0195-SHERIN JOSEPHIN B

Abstract
• Computer security is a major concern today. The term
malware denotes malicious software that compromises
computer security.
• Most anti-virus software fails to detect new viruses.
Using n-grams as file signatures can help detect
unknown malware and reduce the false positive ratio.
• The dataset is further optimized using a feature
selection algorithm. The final Feature Vector Table
obtained from feature selection and dimension
reduction will be compared and evaluated using
various machine learning algorithms.

Aim and Scope
• The aim of this project is to detect malware files using
n-gram analysis and to evaluate the detection using machine
learning algorithms.
• Because many antivirus products fail to detect new viruses,
n-grams are used as a model to detect malware files efficiently.
• This project focuses on developing a better tool for detecting
malware files, taking space complexity into consideration.
• The technique is currently used in industry, where securing
data is a main focus. Anti-virus software such as Kaspersky
and K7 uses this technique to detect malware files.

LITERATURE SURVEY...
8. “Static Malware Detection with Segmented Sandboxing”
Abstract: This study takes the best parts of the static and dynamic
detection approaches. The combined approach, called “Segmented
Sandboxing”, is applied to detect malware files.
Techniques: 1. Segmented sandboxing
Advantages: 1. Higher detection rate (compared with previous data)

9. “N grams based file signature for malware detection”
Abstract: This study proposes the use of n-grams as file signatures
in order to detect unknown malware.
Techniques: 1. n-grams
Advantages: 1. Low false positive ratio

10. “A Hybrid Model to Detect Malicious Executables”
Abstract: This paper proposes a feature set, called the hybrid feature
set, which is given to a support vector machine that classifies
malware and benign files.
Techniques: 1. n-grams 2. SVM
Advantages: 1. High accuracy 2. Low false positive rate

11. “Detection of New Malicious Code Using N-grams Signatures”
Abstract: This paper describes an n-gram analysis that classifies
malware and benign files.
Techniques: 1. n-grams
Advantages: 1. Efficient 2. Scalable 3. Practical solutions

Module Description
MODULE 1: Dataset preparation
- Executable files (benign or malware) are disassembled using a
disassembler.
- The assembly code is parsed, and the opcode sequence is collected in the dataset.
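
The parsing step in Module 1 could look roughly like the sketch below. The slides do not name the disassembler or its output format, so the listing format assumed here (one instruction per line, `;` comments, labels ending in `:`, directives starting with `.`) and the function name `extract_opcodes` are hypothetical.

```python
def extract_opcodes(asm_text):
    """Collect the opcode (mnemonic) sequence from a disassembly listing.

    Assumed listing format (not specified in the slides): one instruction
    per line, optional ';' comments, labels ending in ':', directives
    starting with '.'.
    """
    opcodes = []
    for line in asm_text.splitlines():
        line = line.split(";", 1)[0].strip()  # drop trailing comments
        if not line or line.endswith(":") or line.startswith("."):
            continue  # skip blank lines, labels, and directives
        opcodes.append(line.split()[0].lower())  # first token is the mnemonic
    return opcodes


# Hypothetical listing used only for illustration.
listing = """
start:
    push ebp        ; save frame pointer
    MOV  ebp, esp
    call helper
    ret
"""
sequence = extract_opcodes(listing)
```

Operands are discarded on purpose: only the order of the mnemonics is kept, since that is what the n-gram step consumes.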
MODULE 2: Create Feature Vector Table (FVT) by n-gram extraction
- The dataset is split into training data and testing data.
- The training data is used for n-gram extraction.
- The extracted n-grams are stored in a table called the Feature Vector Table
(FVT).
- The Feature Vector Table consists of each opcode, its frequency count, and its
respective class.
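
As a minimal sketch of Module 2, the code below slides a window of length n over each opcode sequence and counts how often each n-gram occurs per class. The row layout (n-gram, frequency, class) follows the FVT description above; the function names and the two training samples are hypothetical and only illustrate the idea.

```python
from collections import Counter


def opcode_ngrams(opcodes, n):
    """Return all contiguous n-grams (as tuples) of an opcode sequence."""
    return [tuple(opcodes[i:i + n]) for i in range(len(opcodes) - n + 1)]


def build_fvt(samples, n):
    """Build FVT rows of (n-gram, frequency count, class).

    `samples` is a list of (opcode_sequence, class_label) pairs.
    """
    counts = Counter()
    for opcodes, label in samples:
        for gram in opcode_ngrams(opcodes, n):
            counts[(gram, label)] += 1
    return [(gram, freq, label) for (gram, label), freq in sorted(counts.items())]


# Hypothetical training data: opcode sequences with their class labels.
training = [
    (["push", "mov", "call", "mov", "call"], "malware"),
    (["push", "mov", "add", "ret"], "benign"),
]
fvt = build_fvt(training, 2)
for row in fvt:
    print(row)
```

Changing the second argument of `build_fvt` to 3 or 4 yields the 3-gram and 4-gram tables from the same data.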
MODULE 3: Employing Feature Reduction Algorithm
- Principal Component Analysis (PCA)
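
To illustrate what PCA does to a table of n-gram frequencies, the sketch below centers the data, builds the sample covariance matrix, and finds the first principal component by power iteration. Power iteration is used here only to keep the example dependency-free; the slides do not say how PCA was computed, and a real pipeline would more likely call a library implementation. The input rows are hypothetical frequency vectors.

```python
import math


def first_principal_component(rows, iterations=100):
    """Return (component, projections) for the first principal component.

    `rows` is a list of equal-length numeric feature vectors (at least two).
    """
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: multiply by the covariance matrix and renormalize.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break  # no variance along the current direction
        v = [x / norm for x in w]
    # Project each centered row onto the component (one value per sample).
    projections = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projections


# Hypothetical frequency vectors: the second feature is twice the first.
rows = [[1, 2], [2, 4], [3, 6], [4, 8]]
component, reduced = first_principal_component(rows)
```

Each sample is reduced from two features to one value in `reduced`. Repeating the idea for further components is what shrinks a wide feature table to a handful of columns.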
MODULE 4: Classification using Machine Learning Algorithm
- J48, Support Vector Machine (SVM), and Random Forest

UML Design
•USE CASE DIAGRAM
•CLASS DIAGRAM
•SEQUENCE DIAGRAM
•ACTIVITY DIAGRAM
•STATE CHART DIAGRAM

Results and Discussion
           With PCA   Without PCA
2-grams    8          216
3-grams    9          256
4-grams    8          256
With Feature Selection Algorithm

2-grams         Random Forest   SVM      J48
Classified      95%             82.50%   88%
Misclassified   12.30%          82.50%   36.40%
Precision       95.00%          68.10%   86.90%
Performance Table for 2-grams

Graphic view for 2-grams
[Chart: TPR, FPR, and Precision (0% to 100%) for Random Forest, SVM, and J48]

Performance Table for 3-grams
3-grams         Random Forest   SVM      J48
Classified      92%             94.70%   84%
Misclassified   52.10%          34.70%   53.20%
Precision       92.80%          95.00%   84.20%

Graphic view for 3-grams
[Chart: TPR, FPR, and Precision (0% to 100%) for Random Forest, SVM, and J48]
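
The charts are labelled TPR, FPR, and Precision, but the slides do not define these measures. The sketch below uses the standard definitions, with malware as the positive class: TPR = TP / (TP + FN), FPR = FP / (FP + TN), Precision = TP / (TP + FP). The labels and predictions are hypothetical and are not the project's data.

```python
def confusion_counts(actual, predicted, positive="malware"):
    """Count true/false positives and negatives for a binary classifier."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1
            else:
                fp += 1
        else:
            if a == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def rates(tp, fp, tn, fn):
    """Return (TPR, FPR, precision); a ratio with a zero denominator is 0.0."""
    tpr = tp / (tp + fn) if tp + fn else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return tpr, fpr, precision


# Hypothetical test-set labels and classifier output.
actual = ["malware", "malware", "malware", "benign", "benign"]
predicted = ["malware", "malware", "benign", "malware", "benign"]
tpr, fpr, precision = rates(*confusion_counts(actual, predicted))
```

Computing the three values separately for each classifier and each n-gram size gives one group of bars per classifier, as in the charts.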

Normal Code and Obfuscated Code

Opcode and its frequency and class
