RSC: Mining and Modeling Temporal
Activity in Social Media
Alceu F. Costa* Yuto Yamaguchi Agma J. M. Traina
Caetano Traina Jr. Christos Faloutsos
1
Universidade
de São Paulo
KDD 2015 – Sydney, Australia
*alceufc@icmc.usp.br
Introduction
2
Users generate sequences of time-stamps when
they use a social media Web site
What can we learn from time-stamps?
Are there common patterns?
Can we tell if a user is a bot or a human?
[Figure: sequences of tweet time-stamps; each bar is one tweet time-stamp]
Outline
Pattern Mining
What patterns can we discover from temporal
activities of social media users?
Modeling
Bot Detection
Experiments
Conclusion
3
Reddit Dataset
Time-stamp from comments
21,198 users
20 Million time-stamps
Twitter Dataset
Time-stamp from tweets
6,790 users
16 Million time-stamps
Pattern Mining: Datasets
For each user we have:
Sequence of postings time-stamps: T = (t1, t2, t3, …)
Inter-arrival times (IAT) of postings: (∆1, ∆2, ∆3, …)
4
[Figure: time-line with time-stamps t1, t2, t3, t4 and the IAT ∆1, ∆2, ∆3 between consecutive time-stamps]
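The definition of inter-arrival times can be sketched in a few lines. This is only an illustration: the function name `inter_arrival_times` and the example values are hypothetical, and time-stamps are assumed to be numbers in seconds.

```python
def inter_arrival_times(timestamps):
    """Return the IAT sequence (delta_1, delta_2, ...) of a list of time-stamps.

    Assumes numeric time-stamps (e.g. seconds); they are sorted first so that
    every IAT is non-negative.
    """
    ts = sorted(timestamps)
    return [later - earlier for earlier, later in zip(ts, ts[1:])]


# Hypothetical example: four postings, in seconds
example_iat = inter_arrival_times([0, 60, 180, 10980])
```

A sequence of n time-stamps yields n - 1 inter-arrival times, so a single posting produces an empty list.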
Pattern Mining
Pattern 1: Distribution of IAT is heavy-tailed
Users can be inactive for long periods of time before making
new postings
IAT Complementary Cumulative Distribution Function (CCDF)
(log-log axis)
5
[Figure panels: Reddit Users, Twitter Users]
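An empirical CCDF such as the one plotted on this slide could be computed as follows. This is a sketch under the usual definition P(IAT > x); the function name `empirical_ccdf` is an assumption and not taken from the authors' code.

```python
def empirical_ccdf(iats):
    """Return (x, P(IAT > x)) pairs, one per distinct value x, in ascending order."""
    values = sorted(iats)
    n = len(values)
    points = []
    for idx, x in enumerate(values):
        # Emit one point per distinct value, at its last occurrence
        if idx == n - 1 or values[idx + 1] != x:
            points.append((x, (n - idx - 1) / n))
    return points
```

On log-log axes the last point (probability zero) has to be dropped before plotting. A heavy tail shows up as a CCDF that decays much more slowly than the exponential curve of a Poisson process.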
Pattern Mining
Pattern 2: Bimodal IAT distribution
Users have highly active sections and resting periods
Log-binned histogram of postings IAT
6
[Figure (Twitter Users): log-binned histogram of IAT; x-axis: ∆, IAT (seconds), log scale with ticks at 10^2, 10^4, 10^6; y-axis: PDF; annotations: 1st Mode (1min), 2nd Mode (3h)]
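A log-binned histogram uses bin edges that are equally spaced in log10(IAT), which is what makes the two modes visible. The sketch below assumes 5 bins per decade over the range 1 s to 10^7 s; these values and the function name are illustrative choices, not parameters from the slides.

```python
import math


def log_binned_histogram(iats, bins_per_decade=5, min_exp=0, max_exp=7):
    """Histogram of IAT (seconds) with bin edges equally spaced in log10.

    Returns (edges, fractions), where fractions[k] is the share of IAT in
    [edges[k], edges[k + 1]). Values outside the range are ignored.
    """
    n_bins = (max_exp - min_exp) * bins_per_decade
    edges = [10 ** (min_exp + k / bins_per_decade) for k in range(n_bins + 1)]
    counts = [0] * n_bins
    for delta in iats:
        if delta <= 0:
            continue  # log10 is undefined for non-positive IAT
        k = math.floor((math.log10(delta) - min_exp) * bins_per_decade)
        if 0 <= k < n_bins:
            counts[k] += 1
    total = sum(counts)
    fractions = [c / total for c in counts] if total else [0.0] * n_bins
    return edges, fractions
```

With this layout, an IAT of 60 s falls in the bin starting at about 40 s, and an IAT of 3 h (10,800 s) falls in the bin starting at 10^4 s.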
[Figure: log-binned histogram of IAT; x-axis: ∆, IAT (seconds); y-axis: PDF]
Pattern Mining
Pattern 3: Periodic spikes
in the IAT distribution
Caused by daily sleeping
intervals
7
[Figure (Reddit Users): histogram of IAT; x-axis: ∆, IAT (seconds); y-axis: PDF; spikes marked at 7h, 12h, 24h, 48h, 72h]
Pattern Mining
Pattern 4: Consecutive IAT are correlated
Long/short IAT are likely to be followed by long/short IAT
Heat-map: pairs of consecutive IAT (all Reddit users)
8
Concentration of pairs along the diagonal: positive correlation
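One way to quantify this pattern is the Pearson correlation between each IAT and the one that follows it. The sketch below works on log10 values, which is an assumption made here because IAT span several orders of magnitude; the slides themselves only show the heat-map.

```python
import math


def consecutive_iat_correlation(iats):
    """Pearson correlation between log10(delta_{i-1}) and log10(delta_i).

    Non-positive IAT are dropped before pairing (an assumption of this sketch).
    """
    logs = [math.log10(d) for d in iats if d > 0]
    xs, ys = logs[:-1], logs[1:]
    n = len(xs)
    if n < 2:
        raise ValueError("need at least three positive IAT")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant IAT")
    return cov / math.sqrt(var_x * var_y)
```

A value close to +1 corresponds to pairs concentrated on the diagonal of the heat-map, while independent IAT (as in a Poisson process) give a value near zero.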
Outline
Pattern Mining
Modeling
Can we model the patterns?
Bot Detection
Experiments
Conclusion
9
RSC Model
Can we generate synthetic time-stamps that match
real data patterns?
10
Pattern                Poisson  Queue-Based  CNPP  SFP  RSC
Heavy Tails            -        ✔            -     ✔    ✔
Bimodal Distribution   -        -            ✔     -    ✔
Periodic Spikes        -        -            -     -    ✔
IAT Correlation        -        -            -     ✔    ✔
Poisson: Poisson Process; Queue-Based: Barabási, 2005; CNPP: Malmgren et al., 2009; SFP: Vaz de Melo et al., 2013; RSC: proposed model
Proposed Model: Rest-Sleep-and-Comment
RSC Model
Base model: Self-Correlated Process (SCorr)
Definition: A stochastic process is a SCorr process with
base rate λ and correlation ρ if:
Consecutive IAT are correlated:
The i-th IAT ∆i depends on the previous (i-1)-th IAT ∆i-1
ρ controls correlation strength:
If ρ = 0, SCorr reduces to an exponential distribution
11
$\Delta_i \sim \mathrm{Exp}(\rho\,\Delta_{i-1} + 1/\lambda)$
Here $X \sim \mathrm{Exp}(1/\lambda)$ denotes an exponential random variable with rate λ.
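A minimal sketch of a SCorr generator follows. It assumes that Exp(b) denotes an exponential distribution with mean b (so that Exp(1/λ) has rate λ, as the slide states) and that the first IAT is drawn with mean 1/λ; the function name `generate_scorr` and the optional `rng` argument are illustrative.

```python
import random


def generate_scorr(n, lam, rho, rng=None):
    """Generate n IAT from a SCorr process with base rate lam and correlation rho.

    Assumption: the mean of the i-th IAT is rho * (previous IAT) + 1 / lam,
    and the first IAT has mean 1 / lam.
    """
    rng = rng or random.Random()
    iats = []
    prev = 0.0
    for _ in range(n):
        mean = rho * prev + 1.0 / lam
        prev = rng.expovariate(1.0 / mean)  # expovariate expects a rate
        iats.append(prev)
    return iats
```

With `rho = 0` every IAT is drawn from the same exponential distribution, which matches the special case on the slide. Larger values of `rho` make a long IAT more likely to be followed by another long one.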
SCorr Process
RSC Model
12
✔ Correlated IAT
✔ Heavy Tail
✗ Bimodal Distribution
✗ Periodic Spikes
Consecutive IAT Distribution
SCorr (synthetic data)
λ = 20h, ρ = 0.7
RSC Model
13
λ = 20h, ρ = 0.7
✔ Correlated IAT
✔ Heavy Tail
✗ Bimodal Distribution
✗ Periodic Spikes
IAT CCDF
Reddit Data
SCorr
SCorr Process
RSC Model
14
λ = 20m, ρ = 1.0
✔ Correlated IAT
✔ Heavy Tail
✗ Bimodal Distribution
✗ Periodic Spikes
IAT Log-binned Histogram
Data
SCorr
SCorr Process
RSC Model
Model States
Active:
1. Wait δ ~ SCorr(λA, ρA)
2. Post with probability ppost
3. Transition
Rest:
1. Wait δ ~ SCorr(λR, ρR)
2. Transition
Base rates: λA > λR
The average wait time in the active state is smaller than in the rest state
State Transitions
15
[State diagram: Active to Rest with probability pR (stay Active with 1-pR); Rest to Active with probability pA (stay Rest with 1-pA)]
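The active and rest states can be sketched as a small state machine. Several details below are assumptions, since the slide does not spell them out: Exp(b) is read as an exponential with mean b, each state remembers its own previous wait time, the process starts in the active state at time 0, and the active state moves to rest with probability `p_r` while the rest state moves to active with probability `p_a`. All names are hypothetical.

```python
import random


def generate_active_rest(n_posts, lam_a, rho_a, lam_r, rho_r,
                         p_post, p_a, p_r, rng=None):
    """Return n_posts posting times produced by a two-state (active/rest) process."""
    if not (0 < p_post <= 1 and 0 < p_a <= 1 and 0 <= p_r <= 1):
        raise ValueError("probabilities out of range")
    rng = rng or random.Random()
    params = {"active": (lam_a, rho_a), "rest": (lam_r, rho_r)}
    prev = {"active": 0.0, "rest": 0.0}  # previous wait per state (assumption)
    state, clock, posts = "active", 0.0, []
    while len(posts) < n_posts:
        lam, rho = params[state]
        mean = rho * prev[state] + 1.0 / lam
        delta = rng.expovariate(1.0 / mean)  # SCorr wait for the current state
        prev[state] = delta
        clock += delta
        if state == "active":
            if rng.random() < p_post:
                posts.append(clock)      # posting happens only when active
            if rng.random() < p_r:
                state = "rest"
        elif rng.random() < p_a:
            state = "active"
    return posts
```

Waits spent in the rest state never produce a posting, so they only lengthen the gap until the next one. That is how the second, slower mode of the IAT distribution arises in this sketch.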
RSC Model
16
✔ Heavy Tail
✔ Correlated IAT
✔ Bimodal Distribution
✗ Periodic Spikes
IAT Log-binned Histogram
Data
Synth.
SCorr + Rest and Active States
RSC Model
Keep track of current time:
tclock variable, 0:00h < tclock < 23:59h
Update tclock after each wait time δ
Enter the sleep state if:
Current state = rest and
(tclock < twake or tclock > tsleep)
In the sleep state:
1. Wait until next wake-up time, twake
2. Transition to rest state
17
[Diagram: 24-hour clock showing tclock, twake and tsleep, divided into Sleep and Awake periods]
Modeling periodic spikes: sleep state
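The sleep rule can be sketched with two small helpers. They assume that all clock values are seconds since midnight and that twake is earlier than tsleep, so the awake window does not cross midnight; the function names are hypothetical.

```python
DAY = 24 * 3600  # seconds in one day


def seconds_until_wake(t_clock, t_wake, t_sleep):
    """Return how long to sleep (seconds), or 0.0 if the user is awake.

    Follows the slide's condition: sleep if t_clock < t_wake or t_clock > t_sleep.
    """
    if t_clock < t_wake:       # early morning: wake up later today
        return float(t_wake - t_clock)
    if t_clock > t_sleep:      # late evening: wake up tomorrow
        return float(DAY - t_clock + t_wake)
    return 0.0


def advance_clock(t_clock, delta):
    """Update the time-of-day clock after a wait of delta seconds."""
    return (t_clock + delta) % DAY
```

In a generator, the sleep time returned here would be added to the current wait whenever the process is in the rest state. This produces IAT clustered around the length of a night, which is one plausible reading of how the periodic spikes arise.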
RSC Model
18
✔ Heavy Tail
✔ Correlated IAT
✔ Bimodal Distribution
✔ Periodic Spikes
Parameter estimation uses the
Levenberg-Marquardt algorithm
IAT Log-binned Histogram
Complete RSC Model
Outline
Pattern Mining
Modeling
Bot Detection
Can we spot automated behavior based only on time-stamp data?
Experiments
Conclusion
19
Bot Detection
Problem: Given labeled time-stamp data from a set of users {U1, U2, U3, …}, decide whether an unknown user Ui is a human or a bot.
Solution: RSC-Spotter
Compare the user's IAT to synthetic IAT generated by the RSC model
If they are not similar, then the user is likely to be a bot
20
[Figure: sequence of time-stamps from a single user; x-axis: Time (days), 0 to 70]
Is the user that produced the time-stamps a human or a bot?
RSC-Spotter
Comparing Time-stamps
Estimate the RSC parameters
Using time-stamps from all users
For each user:
1. Compute the IAT histogram
Using log-spaced bins
2. Generate synthetic time-stamps using RSC
RSC can generate the same number of time-stamps as the user
3. Compare the user and synthetic IAT histograms
Cost-sensitive classification is used to decide if a user is a bot given the dissimilarity D
21
User data: bin counts $c_i$ of the IAT histogram
Synthetic data: bin counts $\check{c}_i$ of the IAT histogram
Dissimilarity: $D = \sum_i |c_i - \check{c}_i|$
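The dissimilarity is the sum of absolute differences between the two sets of bin counts. A minimal sketch, assuming both histograms were built with identical bins (the function name is illustrative):

```python
def histogram_dissimilarity(user_counts, synthetic_counts):
    """D = sum_i |c_i - c'_i| for two histograms that share the same bins."""
    if len(user_counts) != len(synthetic_counts):
        raise ValueError("histograms must use the same bins")
    return sum(abs(c - s) for c, s in zip(user_counts, synthetic_counts))
```

Because RSC generates the same number of time-stamps as the user, the two histograms have comparable totals and D reflects differences in shape rather than in activity volume.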
Outline
Pattern Mining
Modeling
Bot Detection
Experiments
Can RSC match real data?
How well can RSC-Spotter detect bots?
Conclusion
22
Experiments: Can RSC Match Real Data?
23
Pattern          CNPP  SFP  RSC
Heavy Tail       ✗     ✔    ✔
Bimodal
Spikes
IAT Correlation
[Figure panels: Reddit Users, Twitter Users; curves: RSC (proposed model), CNPP (Malmgren et al.), SFP (Vaz de Melo et al.)]
CNPP fails to match the heavy tail
Experiments: Can RSC Match Real Data?
24
Pattern          CNPP  SFP  RSC
Heavy Tail       ✗     ✔    ✔
Bimodal          ✔
Spikes           ✗
IAT Correlation
[Figure (Reddit Users): CNPP (Malmgren et al.) fit, annotated "Two Modes" and "No Periodic Spikes"]
Experiments: Can RSC Match Real Data?
25
Pattern          CNPP  SFP  RSC
Heavy Tail       ✗     ✔    ✔
Bimodal          ✔     ✗
Spikes           ✗     ✗
IAT Correlation
[Figure (Reddit Users): SFP (Vaz de Melo et al.) fit, annotated "Single Mode" and "No Periodic Spikes"]
Experiments: Can RSC Match Real Data?
26
Pattern          CNPP  SFP  RSC
Heavy Tail       ✗     ✔    ✔
Bimodal          ✔     ✗    ✔
Spikes           ✗     ✗    ✔
IAT Correlation
[Figure panels: Reddit Users, Twitter Users]
[Figure (Reddit Users): RSC (proposed model) fit, annotated "Two Modes" and "Periodic Spikes"]
Experiments: Can RSC Match Real Data?
27
Pattern          CNPP  SFP  RSC
Heavy Tail       ✗     ✔    ✔
Bimodal          ✔     ✗    ✔
Spikes           ✗     ✗    ✔
IAT Correlation  ✗
[Figure: Twitter data vs. CNPP (Malmgren et al.) fit, annotated "No IAT Correlation"]
Experiments: Can RSC Match Real Data?
28
Pattern          CNPP  SFP  RSC
Heavy Tail       ✗     ✔    ✔
Bimodal          ✔     ✗    ✔
Spikes           ✗     ✗    ✔
IAT Correlation  ✗     ✔
[Figure: Twitter data vs. SFP (Vaz de Melo et al.) fit, annotated "IAT Correlation (but too strong!)"]
Experiments: Can RSC Match Real Data?
29
Pattern          CNPP  SFP  RSC
Heavy Tail       ✗     ✔    ✔
Bimodal          ✔     ✗    ✔
Spikes           ✗     ✗    ✔
IAT Correlation  ✗     ✔    ✔
[Figure: Twitter data vs. RSC (proposed model) fit, annotated "IAT Correlation"]
Outline
Pattern Mining
Modeling
Bot Detection
Experiments
Can RSC Match Real Data?
How well can RSC-Spotter detect bots?
Conclusion
30
Experiments: Can RSC-Spotter Detect Bots?
Methodology
Datasets
Users were manually labeled as bots or humans
Reddit: 1,963 humans, 37 bots
Twitter: 1,353 humans, 64 bots
Training
Same size for train and test subsets (preserved class distribution)
Baseline features:
31
1. IAT Histogram: log-binned IAT histogram
2. Entropy: entropy of the IAT histogram
3. Week Hist.: number of postings for each day of the week
4. All features: combination of 1, 2 and 3
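Two of the baseline features can be sketched as follows. The slides do not define them precisely, so this sketch assumes Shannon entropy in bits for feature 2, and Unix epoch time-stamps interpreted in UTC for feature 3; the function names are illustrative.

```python
import math
from datetime import datetime, timezone


def histogram_entropy(counts):
    """Shannon entropy (in bits) of a histogram given as bin counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def weekday_histogram(timestamps):
    """Number of postings per day of the week (index 0 = Monday), in UTC."""
    counts = [0] * 7
    for ts in timestamps:
        day = datetime.fromtimestamp(ts, tz=timezone.utc).weekday()
        counts[day] += 1
    return counts
```

A bot that posts at perfectly regular intervals concentrates its IAT in one bin and therefore has an entropy close to zero, which is why entropy is a natural baseline feature.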
Experiments: Can RSC-Spotter Detect Bots?
Precision vs. Sensitivity Curves
Good performance: curve close to the top
32
Precision > 94%
Sensitivity > 70%
With strongly
imbalanced datasets
# humans >> # bots
Twitter Dataset
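Precision and sensitivity (recall) are the two quantities plotted on these curves, with bots as the positive class. A minimal sketch of how they could be computed from labels and predictions (names are illustrative):

```python
def precision_and_sensitivity(true_is_bot, predicted_is_bot):
    """Return (precision, sensitivity) with bots as the positive class."""
    pairs = list(zip(true_is_bot, predicted_is_bot))
    tp = sum(1 for t, p in pairs if t and p)
    fp = sum(1 for t, p in pairs if not t and p)
    fn = sum(1 for t, p in pairs if t and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return precision, sensitivity
```

Neither measure uses true negatives, so both stay informative when humans greatly outnumber bots. Accuracy would not: a classifier that labels every user as human would score well on accuracy while detecting nothing.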
Experiments: Can RSC-Spotter Detect Bots?
Precision vs. Sensitivity Curves
Good performance: curve close to the top
33
Precision > 96%
Sensitivity > 47%
With strongly
imbalanced datasets
# humans >> # bots
Reddit Dataset
Outline
Pattern Mining
Modeling
Bot Detection
Experiments
Conclusion
34
Conclusion
Pattern Mining
Discovered four activity patterns
RSC-Model
Model that matches the postings IAT distribution of social media users
RSC-Spotter
Can tell if a user is a bot based only on time-stamp data
35
[Figure: log-binned histogram of IAT; x-axis: ∆, IAT (seconds); y-axis: PDF]
Thank you!
Alceu F. Costa* Yuto Yamaguchi Agma J. M. Traina
Caetano Traina Jr. Christos Faloutsos
36
Universidade
de São Paulo
*alceufc@icmc.usp.br
Datasets and Code: https://github.com/alceufc/rsc_model
Extra Slides
37
RSC Spotter – Training
Goal: decide if a dissimilarity D is big enough to say that a user
is a bot
Input: training set of labeled users
Positive examples: bots
Negative examples: humans
1. Estimate pbot = P(user is a bot | D)
Naive-Bayes classifier
Dissimilarity D is a feature
2. Estimate a probability threshold pthresh
Cost sensitive classification
Minimize the weighted harmonic mean between FP and FN errors
Uses only training data
38
Assign costs to False Positive and False Negative errors
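The two training steps can be sketched as below. This is not the authors' implementation: it assumes a Gaussian likelihood for the single feature D, and it swaps the slide's "weighted harmonic mean" criterion for a simpler weighted sum of false positive and false negative counts. All function names are hypothetical, and both classes must be present in the training set.

```python
import math


def fit_gaussian_nb(d_values, is_bot):
    """Fit a one-feature Gaussian naive Bayes model on dissimilarity values."""
    model = {}
    for label in (True, False):
        xs = [d for d, b in zip(d_values, is_bot) if b == label]
        mean = sum(xs) / len(xs)
        var = sum((x - mean) ** 2 for x in xs) / len(xs) + 1e-9  # avoid zero variance
        model[label] = (len(xs) / len(d_values), mean, var)
    return model


def p_bot(model, d):
    """Posterior P(user is a bot | D = d)."""
    def joint(label):
        prior, mean, var = model[label]
        density = math.exp(-(d - mean) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)
        return prior * density
    bot, human = joint(True), joint(False)
    return bot / (bot + human) if bot + human > 0 else 0.5


def choose_threshold(model, d_values, is_bot, cost_fp=1.0, cost_fn=1.0):
    """Pick p_thresh among the training posteriors with the lowest weighted cost.

    Swapped-in criterion: cost_fp * FP + cost_fn * FN (not the harmonic mean).
    """
    scores = [p_bot(model, d) for d in d_values]
    best_t, best_cost = 0.5, float("inf")
    for t in sorted(set(scores)):
        fp = sum(1 for s, b in zip(scores, is_bot) if s >= t and not b)
        fn = sum(1 for s, b in zip(scores, is_bot) if s < t and b)
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_t, best_cost = t, cost
    return best_t
```

A user would then be flagged as a bot when `p_bot(model, d) >= p_thresh`. Raising `cost_fp` relative to `cost_fn` pushes the threshold up, trading sensitivity for precision.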
Self-Correlated Process (SCorr)
Exponential distribution:
$\Delta_i \sim \mathrm{Exp}(\beta)$
PDF: $f(x) = \frac{1}{\beta} e^{-x/\beta}$
β: mean inter-arrival time
Self-Correlated Process:
Similar to the exponential distribution…
…however β depends on the previous IAT
$\beta_i = \rho\,\Delta_{i-1} + 1/\lambda$
39
RSC: Time-stamp Generation
40
RSC: Complete State Machine
41

Editor's Notes

  • #3: When users use Web sites like Twitter or Reddit, they post content such as photos, comments, or tweets. All these postings are annotated with time-stamps, so each user generates a sequence of time-stamps when they use a social media Web site. For example, we have here the time-line of posting time-stamps from two Twitter users. Each bar is a tweet and the time unit is days. What can we say just by looking at these time-stamps? Are there patterns that are common to all these users? Can we tell if they are from a real user or from a bot? Is it possible to mimic the time-stamps from a user? Obs.: maybe we can close by showing the bot scores for the users from the first slide: show a bot and a regular user here (without the photo) and ask whether we can tell which behavior is normal and which one is not. Final slide => show the scores for the users and reveal their photos
  • #5: In this work we analyzed data from two services: Reddit and Twitter. For the Reddit dataset we have time-stamp sequences from over 20k users, and for the Twitter dataset we have time-stamp sequences from over 6k users. Each user had a sequence of at least 900 time-stamps. For the Twitter dataset the time-stamps were from tweets; for the Reddit dataset they were from user comments. From each sequence of time-stamps we also computed the IAT (inter-arrival times) between postings.
  • #6: The first pattern discovered from the datasets is the heavy-tailed distribution of inter-arrival times. The plots in this slide show the tail part of the IAT distribution for Reddit and Twitter users on log-log axes.
  • #7: The 2nd pattern we discovered is that the distribution of inter-arrival times is bimodal. The two figures at the bottom of the slide show the log-binned histogram of inter-arrival times of all Reddit and Twitter users. We have a first mode at around 6min and the second mode at the 2h mark. This can be explained by users having: highly active sections, where they make more than one posting in a short interval of time, and resting periods (e.g. working or doing some other activity).
  • #9: Another pattern we discovered in our data is that consecutive IAT are correlated. For example, if a user takes a long time to post a tweet, then it is more likely that she will take a long time to post her next tweet. The figure to the right shows a heat-map of pairs of consecutive IAT. There is a concentration of pairs in the diagonal of the plot, which indicates a positive correlation.
  • #11: I will start this next part of the presentation with the following question: can we generate synthetic time-stamps that mimic human behavior? There are many mathematical models for human dynamics. The Poisson Process is not able to match any of the patterns that we found. The Queue-Based model, such as the one proposed by Barabási, is able to generate power-law distributions and matches the heavy-tail pattern. CNPP (Cascading Non-homogeneous Poisson Process) is able to match the bimodal distribution. The SFP process matches both the heavy tails and the correlation between consecutive IAT. However, no existing model is able to match all the communication patterns. We solve this problem by proposing the Rest-Sleep-and-Comment model, or RSC model.
  • #12: We call our model Rest-Sleep-and-Comment (RSC). The base of RSC is a stochastic process that we propose, called the Self-Correlated Process (SCorr), which we use to generate IAT. The green box shows the equation for the IAT generated by SCorr. The SCorr process has two parameters, the base rate lambda and the correlation rho, and is described by the equation shown in the slide. In SCorr, IAT are exponentially distributed; however, the rate of the exponential distribution depends on the following factors: the previous IAT, which makes consecutive IAT correlated, and the rho parameter, which controls the strength of the correlation. For instance, if rho is equal to zero, SCorr reduces to an exponential distribution.
  • #13: Before I show the complete RSC model, this slide shows the distribution of consecutive IAT for the SCorr Process. The SCorr is able to generate correlated consecutive IAT (notice the concentration along the diagonal of the heat-map).
  • #14: When we look at the CCDF of the IAT we can also see that SCorr is able to generate heavy tails. In this slide we are comparing the CCDF of synthetic SCorr time-stamps to real data from Twitter users.
  • #15: However, when we look at the histogram of IAT for the SCorr, it is possible to see that it does not match the bimodal distribution and the periodic spikes.
  • #16: Now we improve the RSC model by adding an active and a rest state. In the active state, RSC will: - wait a time delta generated using SCorr - make a posting with probability p_post - make a transition to the rest state with probability p_R. The rest state only contributes to increasing the IAT. In the rest state, RSC will: - wait a time delta generated using SCorr - make a transition to the active state with probability p_A. The base rate lambda_A for the active state is higher than lambda_R: the average wait time for the active state is smaller than the wait time for the rest state. The important thing about the states is that the average wait times The SCorr parameters for the rest state are selected so that the base rate lambda_A is
  • #17: With the active and rest states we can now model the bimodal pattern. Show that each mode corresponds to a state. However, this version of the model is not able to match the periodic spikes.
  • #19: Show that each mode corresponds to a state.
  • #21: Why should we compare? We can show here that bots have strange IAT histograms. - No heavy tails - Many spikes (posting at regular intervals, e.g. every 10min)
  • #22: In order to generate synthetic time-stamps that mimic real user behavior, we estimate the RSC parameters using data from all users. The next step consists of generating a log-binned histogram of IAT for each user. Now suppose that we have 1,000 time-stamps for this particular user. We can use RSC to generate exactly 1,000 time-stamps. Finally, we generate a histogram of synthetic IAT and compute the distance, that is, the difference between the two histograms.
  • #24: We will use this table to summarize the comparison of the models.
  • #39: The figure shows the distribution of the dissimilarity D computed from users labeled as humans and bots from the Twitter dataset. Most of the users with higher dissimilarity values are bots. Now we need to decide whether a given value of dissimilarity is big enough to say that the user is a bot. Given a training set of users labeled either as bots or humans