A Deep Learning Approach
For Twitter Spam Detection
Lijie Zhou (lijie@mail.sfsu.edu) & Hao Yue
San Francisco State University
Outline
• Problem and Challenges
• Past Work
• Our Model and Results
• Conclusion
• Future Work
What Is Spam?
Spam on Facebook and Twitter
            # of active users   # of spam accounts   %
Facebook    2.2 billion         60-83 million        2.73%-3.77%
Twitter     330 million         23 million           6.97%
Source: https://www.statista.com/
Various Social Media Sites
Social Media’s Fundamental Design Flaw
• Sophisticated spam accounts know how to exploit various platform features to
cause the greatest harm:
• Use shortened URLs to trick users
• Buy compromised accounts to look legitimate
• Run campaigns to gain traction in a short period of time
• Use bots to amplify the noise
• Social media makes it easier and faster to spread spam.
Related Work
• Detection at the tweet level
• Focus on the content of tweets
• E.g., spam words? Overuse of hashtag, URL, mention, …?
• Detection at the account level
• Focus on the characteristics of spam accounts
• E.g., Age of the account? # of followers? # of followees? …
Challenges
• Large amount of unlabeled data
• Time and labor intensive
• Feature selection may cause model overfitting
• Twitter spam drift
• Spamming behavior changes over time, so the performance of existing
machine-learning-based classifiers degrades.
Research Questions
• Question 1: Can we find an unsupervised way to learn from the
unlabeled data and later apply what we have learned to labeled data?
• Will this approach outperform the hand-labeling process?
• Question 2: Can we find a more systematic way to reduce the feature
dimensions instead of relying on manual feature engineering?
Stage 1: Self-taught Learning From Unlabeled Data
Training Data w/o Label → One-to-N Encoding → Max-Min Normalization → Sparse Auto-encoder → Trained Parameter Set
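A minimal sketch of the Stage 1 preprocessing steps (one-to-N encoding and max-min normalization); the column names are hypothetical, since the deck does not list the actual 62 features:

```python
# Sketch of Stage 1 preprocessing: one-to-N (one-hot) encoding + max-min normalization.
import numpy as np
import pandas as pd

def preprocess(df, categorical_cols):
    """One-to-N encode categorical columns, then max-min normalize to [0, 1]."""
    df = pd.get_dummies(df, columns=categorical_cols)      # one-to-N encoding
    x = df.to_numpy(dtype=float)
    x_min, x_max = x.min(axis=0), x.max(axis=0)
    denom = np.where(x_max > x_min, x_max - x_min, 1.0)    # avoid divide-by-zero
    return (x - x_min) / denom

# Example with made-up feature columns:
unlabeled = pd.DataFrame({"account_age_days": [10, 400, 35],
                          "num_followers": [3, 1200, 87],
                          "has_url": ["yes", "no", "yes"]})
X_unlabeled = preprocess(unlabeled, categorical_cols=["has_url"])
```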
Stage 2: Soft-max Classifier Training
Preprocessed Labeled Training Data → Sparse Auto-encoder → Soft-max Regression → Trained Parameter Set
Stage 3: Classification
Preprocessed Test Data → Sparse Auto-encoder → Soft-max Regression → Spam / Non-Spam
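A minimal sketch of the Stage 3 classification pass, assuming encoder parameters (W1, b1) and soft-max parameters (theta, bias) learned in Stages 1-2; the names and shapes are illustrative, not the authors' implementation:

```python
# Sketch of Stage 3: encode preprocessed test data, then apply the soft-max classifier.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def classify(X_test, W1, b1, theta, bias):
    hidden = sigmoid(X_test @ W1.T + b1)          # encoder hidden representation
    scores = hidden @ theta.T + bias              # soft-max logits (2 classes)
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    return np.where(probs[:, 1] > 0.5, "spam", "non-spam")   # assumes class 1 = spam
```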
Self-taught Learning
• Assumption:
• A single unlabeled record is not very informative
• A large amount of unlabeled records may reveal certain patterns
• Goal:
• Find an effective model to reveal this pattern (if it exists)
• Choose the sparse auto-encoder for its good performance and simplicity
Auto-encoder
• A special neural network whose output is (almost) identical to its input
• A compression tool
• The hidden layer is considered the compressed representation of the input
Auto-encoder
• Model parameters: $(W, b) = (W^{(1)}, b^{(1)}, W^{(2)}, b^{(2)})$
• Activations (with activation function $f$):
$a_1^{(2)} = f(W_{11}^{(1)} x_1 + W_{12}^{(1)} x_2 + W_{13}^{(1)} x_3 + b_1^{(1)})$
$a_2^{(2)} = f(W_{21}^{(1)} x_1 + W_{22}^{(1)} x_2 + W_{23}^{(1)} x_3 + b_2^{(1)})$
$a_3^{(2)} = f(W_{31}^{(1)} x_1 + W_{32}^{(1)} x_2 + W_{33}^{(1)} x_3 + b_3^{(1)})$
• Hypothesis $h_{W,b}(x)$:
$h_{W,b}(x) = a_1^{(3)} = f(W_{11}^{(2)} a_1^{(2)} + W_{12}^{(2)} a_2^{(2)} + W_{13}^{(2)} a_3^{(2)} + b_1^{(2)}) \approx x$
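A minimal numpy sketch of this forward pass with 3 inputs and 3 hidden units; the sizes, random initialization, and the sigmoid choice of $f$ are illustrative assumptions:

```python
# Sketch of the auto-encoder forward pass: input -> hidden a^(2) -> reconstruction h(x).
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 3
W1, b1 = rng.normal(scale=0.1, size=(n_hidden, n_in)), np.zeros(n_hidden)
W2, b2 = rng.normal(scale=0.1, size=(n_in, n_hidden)), np.zeros(n_in)

x = np.array([0.2, 0.7, 0.1])
a2 = sigmoid(W1 @ x + b1)        # hidden activations a^(2)
h  = sigmoid(W2 @ a2 + b2)       # reconstruction h_{W,b}(x) ≈ x
```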
Sparse Auto-encoder
• Sparsity parameter
• Definition: a constraint imposed on the hidden layer
• Goal: ensure the pattern is revealed even if the hidden layer is large
• Average activation: $\hat{\rho}_j = \frac{1}{m}\sum_{i=1}^{m} a_j^{(2)}(x^{(i)})$
• Penalty term
• Enforce $\hat{\rho}_j = \rho$ (with $\rho = 0.05$)
• Kullback-Leibler (KL) divergence: $\sum_{j=1}^{K} KL(\rho \,\|\, \hat{\rho}_j) = \sum_{j=1}^{K} \Big[ \rho \log\frac{\rho}{\hat{\rho}_j} + (1-\rho)\log\frac{1-\rho}{1-\hat{\rho}_j} \Big]$
• $\sum_{j=1}^{K} KL(\rho \,\|\, \hat{\rho}_j) = 0$ if and only if $\hat{\rho}_j = \rho$ for every hidden unit $j$
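A short sketch of the sparsity penalty under these definitions; A2 is assumed to be the m-by-K matrix of hidden activations over the training batch:

```python
# Sketch of the sparsity penalty: average activation per hidden unit and the
# KL divergence from the target sparsity rho = 0.05.
import numpy as np

def kl_penalty(A2, rho=0.05, eps=1e-8):
    """A2: (m, K) matrix of hidden activations a_j^(2)(x^(i)) over m examples."""
    rho_hat = A2.mean(axis=0)                     # average activation per hidden unit
    rho_hat = np.clip(rho_hat, eps, 1 - eps)      # keep log() finite
    kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    return kl.sum()                               # sum over the K hidden units
```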
Cost Function
$J(W,b) = \frac{1}{m}\sum_{i=1}^{m} \lVert x_i - \hat{x}_i \rVert^2 + \frac{\lambda}{2}\Big(\sum_{k,n} W^2 + \sum_{n,k} V^2 + \sum_{k} b_1^2 + \sum_{n} b_2^2\Big) + \beta \sum_{j=1}^{K} KL(\rho \,\|\, \hat{\rho}_j)$
• First term: average sum-of-squares (reconstruction) error
• Second term: weight decay
• Third term: sparsity penalty
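A sketch of the full cost under the formula above, assuming sigmoid() and kl_penalty() from the previous sketches are in scope; lam and beta are hyperparameters not specified in the deck:

```python
# Sketch of the cost: reconstruction error + weight decay + sparsity penalty.
import numpy as np

def cost(X, W1, b1, W2, b2, lam=1e-3, beta=3.0, rho=0.05):
    m = X.shape[0]
    A2 = sigmoid(X @ W1.T + b1)                       # hidden activations
    X_hat = sigmoid(A2 @ W2.T + b2)                   # reconstructions
    recon = np.sum((X - X_hat) ** 2) / m              # average sum-of-squares error
    decay = lam / 2 * (np.sum(W1 ** 2) + np.sum(W2 ** 2)
                       + np.sum(b1 ** 2) + np.sum(b2 ** 2))
    return recon + decay + beta * kl_penalty(A2, rho) # add sparsity penalty
```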
Cost Function
• Goal: minimize J(W, b) as a function of W and b
• Steps
• Initialization
• Update parameters with gradient descent
$W_{ij}^{(l)} := W_{ij}^{(l)} - \alpha \frac{\partial}{\partial W_{ij}^{(l)}} J(W,b)$
$b_i^{(l)} := b_i^{(l)} - \alpha \frac{\partial}{\partial b_i^{(l)}} J(W,b)$
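A one-step sketch of these updates; grad_W and grad_b would come from back-propagation (next slide), and alpha is the learning rate:

```python
# Sketch of one gradient-descent step on (W, b).
def gd_step(W, b, grad_W, grad_b, alpha=0.01):
    W = W - alpha * grad_W
    b = b - alpha * grad_b
    return W, b
```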
Back-propagation
• $\delta_i^{(n_l)}$ is the "error term": how much node $i$ is "responsible" for any error in the output
Back-propagation
1. Perform a feedforward pass, computing the activations for layers $L_2, L_3, \dots$ up to the output layer $L_{n_l}$.
2. For each output unit $i$ in layer $n_l$ (the output layer), set
$\delta_i^{(n_l)} = -(y_i - a_i^{(n_l)}) \, f'(z_i^{(n_l)})$
3. For $l = n_l - 1, n_l - 2, \dots, 2$: for each node $i$ in layer $l$, set
$\delta_i^{(l)} = \Big( \sum_{j=1}^{s_{l+1}} W_{ji}^{(l)} \, \delta_j^{(l+1)} \Big) f'(z_i^{(l)})$
4. Compute the partial derivatives:
$\frac{\partial}{\partial W_{ij}^{(l)}} J(W,b;x,y) = a_j^{(l)} \, \delta_i^{(l+1)}$
$\frac{\partial}{\partial b_i^{(l)}} J(W,b;x,y) = \delta_i^{(l+1)}$
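A minimal sketch of these steps for the single-hidden-layer auto-encoder, assuming sigmoid activations (so $f'(z) = f(z)(1 - f(z))$) and reusing sigmoid() from the earlier sketch; the sparsity term's contribution to the hidden-layer error is omitted for brevity:

```python
# Sketch of back-propagation for one example x through the auto-encoder.
import numpy as np

def backprop(x, W1, b1, W2, b2):
    a2 = sigmoid(W1 @ x + b1)                 # feedforward: hidden activations
    a3 = sigmoid(W2 @ a2 + b2)                # feedforward: reconstruction
    delta3 = -(x - a3) * a3 * (1 - a3)        # output error term delta^(n_l)
    delta2 = (W2.T @ delta3) * a2 * (1 - a2)  # propagate error to the hidden layer
    grad_W2 = np.outer(delta3, a2)            # dJ/dW^(2) = delta^(3) * a^(2)
    grad_b2 = delta3
    grad_W1 = np.outer(delta2, x)             # dJ/dW^(1) = delta^(2) * x
    grad_b1 = delta2
    return grad_W1, grad_b1, grad_W2, grad_b2
```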
Fine-tuning
Preprocessed Labeled Training Data → Sparse Auto-encoder → Soft-max Regression → Trained Parameter Set
• Fine-tuning: after the soft-max classifier is trained, the classification error is back-propagated through both the soft-max layer and the encoder, so all parameters are updated together using the labeled data.
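A sketch of the Stage 2 soft-max training on encoder features (binary spam / non-spam); full fine-tuning would additionally back-propagate these gradients into the encoder weights, which is omitted here. Shapes and hyperparameters are assumptions:

```python
# Sketch of soft-max regression on hidden features from the sparse auto-encoder.
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)      # numerical stability
    e = np.exp(Z)
    return e / e.sum(axis=1, keepdims=True)

def train_softmax(H, y, n_classes=2, alpha=0.1, epochs=200):
    """H: (m, K) hidden features; y: integer labels in {0, 1}."""
    m, K = H.shape
    theta = np.zeros((n_classes, K))
    Y = np.eye(n_classes)[y]                  # one-hot labels
    for _ in range(epochs):
        P = softmax(H @ theta.T)              # class probabilities
        grad = (P - Y).T @ H / m              # cross-entropy gradient
        theta -= alpha * grad                 # gradient-descent update
    return theta
```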
Dataset
• 1065 instances; each instance has 62 features.
• Split 1065 instances into three groups:
• Training w/o label – 600 instances
• Training w label – 365 instances
• Test w label - 100 instances
• Comparison group: SVM, Naïve Bayes, and Random Forests
• Training w label – 365 instances
• Test w label – 100 instances
Evaluation
• True Positive (TP): actual spammer, prediction spammer.
• True Negative (TN): actual non-spammer, prediction non-spammer.
• False Positive (FP): actual non-spammer, prediction spammer.
• False Negative (FN): actual spammer, prediction non-spammer.
Evaluation
Accuracy: the correctly classified instances over the total number of test instances.
Precision: P = TP / (TP + FP) × 100%
Recall: R = TP / (TP + FN) × 100%
F-Measure: F = 2 × P × R / (P + R)
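A small sketch computing these metrics from a confusion matrix; the example numbers are the SAE row of the comparison tables below:

```python
# Sketch of the evaluation metrics from TP/TN/FP/FN counts.
def metrics(tp, tn, fp, fn):
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f_measure

print(metrics(tp=34, tn=52, fp=3, fn=11))   # ≈ (0.86, 0.919, 0.756, 0.830)
```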
Results
Rows: size of hidden layer 1 (Hidden L1); columns: size of hidden layer 2 (Hidden L2).

Hidden L1 \ Hidden L2    15    20    25    30    35    40    45    50    55    Avg
55                       86%   88%   85%   84%   87%   85%   83%   86%   86%   86%
50                       84%   84%   86%   88%   86%   89%   87%   86%   88%   86%
45                       85%   88%   87%   86%   85%   84%   88%   86%   86%   86%
40                       88%   87%   85%   85%   85%   87%   87%   86%   89%   87%
35                       87%   88%   87%   86%   87%   86%   86%   85%   86%   86%
30                       85%   86%   89%   85%   85%   84%   83%   87%   88%   86%
25                       87%   87%   88%   87%   85%   88%   85%   87%   88%   87%
20                       84%   88%   83%   88%   86%   85%   88%   87%   86%   86%
15                       83%   83%   83%   87%   85%   82%   85%   86%   85%   84%
Avg                      85%   87%   86%   86%   86%   86%   86%   86%   87%
Results – Comparison with SVM
         TP   TN   FP   FN   A     P      R      F
SAE      34   52   3    11   86%   91.9%  75.6%  83.0%
Top 5    28   52   2    18   80%   93.3%  60.9%  73.7%
Top 10   27   52   3    18   79%   90.0%  60.0%  72.0%
Top 20   28   52   3    17   80%   90.3%  62.2%  73.7%
Top 30   29   52   3    16   81%   90.6%  64.4%  75.3%
Results – Comparison with Random Forests & Naïve Bayes
                TP   TN   FP   FN   A     P      R      F
SAE             34   52   3    11   86%   91.9%  75.6%  83.0%
Random Forest   32   52   3    13   84%   91.0%  71.0%  80.0%
Naïve Bayes     33   50   5    12   83%   86.8%  73.0%  79.5%
Conclusion
• Self-taught learning: leverages a large amount of unlabeled data plus a small amount of labeled data
• Sparse auto-encoder: reduces the feature dimensions
• Fine-tuning: improves the deep learning model to a large extent
Limitation & Future Work
• The dataset we use is relatively small.
• We are still exploring new ways to apply this model to raw data.
A Deep Learning Approach
For Twitter Spam Detection
Lijie Zhou (lijie@mail.sfsu.edu) and Hao Yue
San Francisco State University
Editor's Notes
• #19 (Back-propagation): The key is to compute the partial derivatives.
• #21 (Fine-tuning): We conducted an experiment on this implementation, but the result was not as expected.