A Bot Identification Model and Tool Based on GitHub Activity Sequences

Service public de Wallonie – Recherche
Grant 2010235 “ARIAC BY DIGITALWALLONIA4.AI”
Natarajan Chidambaram, Alexandre Decan, Tom Mens
Software Engineering Lab

Bot accounts and Apps frequently engage in GitHub repositories

Pushing commits
Commenting PR
Merging PR

Repository contributors can be involved in a wide range of activity types:
• Pushing commits
• Opening/closing/reopening/commenting issues or PRs
• Creating/deleting tags
• Creating/deleting branches
• Reviewing code
• …

Comparison of Bot Identification Approaches
• BotHunter (Abdellatif et al., MSR 2022; ahmad-abdellatif/BotHunter): uses too many API queries
• BoDeGHa (Golzadeh et al., JSS 2021; mehdigolzadeh/BoDeGHa): based on PR and issue comments only
• BoDeGiC (Golzadeh et al., BENEVOL 2020; mehdigolzadeh/BoDeGiC): based on git commit messages only
• BIMAN (Dey et al., MSR 2020; ssc-oscar/BIMAN_bot_detection): cannot be run on “live data”; requires World of Code

Dataset
New Ground-truth Dataset of GitHub contributors

          # contributors   # activities   median # activities   mean # activity types
bots               1,035        182,218                   194                     3.6
humans             1,115        155,028                   147                     9.1
total              2,150        337,246                   171                     6.4

• Training set of 1,290 contributors (60%), of which 621 bots and 669 humans
• Test set of 860 contributors (40%), of which 414 bots and 446 humans
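
The 60/40 split above keeps the bot/human proportion the same in both sets (621 of 1,035 bots and 669 of 1,115 humans go to training). A minimal sketch of such a stratified split follows. The slides do not say how the actual split was drawn, so the function name `stratified_split`, the seed, and the synthetic contributor names are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labelled, train_fraction=0.6, seed=42):
    """Split (name, label) pairs so each label keeps the same train fraction."""
    by_label = defaultdict(list)
    for name, label in labelled:
        by_label[label].append(name)

    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    train, test = [], []
    for label, names in by_label.items():
        rng.shuffle(names)
        cut = round(len(names) * train_fraction)
        train += [(n, label) for n in names[:cut]]
        test += [(n, label) for n in names[cut:]]
    return train, test


# Synthetic placeholders with the class sizes reported in the dataset table.
contributors = [(f"bot-{i}", "bot") for i in range(1035)]
contributors += [(f"human-{i}", "human") for i in range(1115)]

train, test = stratified_split(contributors)
print(len(train), len(test))
```

With the class sizes from the table, this reproduces the reported set sizes of 1,290 and 860 contributors.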

Comparison of Bot Identification Approaches
on test set of 860 GitHub contributors

Approach                  P      R      F1     # unknown   Data downloaded   Time       # API queries
“bot in name” heuristic   0.77   0.76   0.76   0           -                 0.01 sec   -
BoDeGHa                   0.92   0.51   0.66   392         3.83 GB           7.7 h      10,222
BoDeGiC                   0.81   0.27   0.41   627         23.3 GB           23.1 h     -
BotHunter                 0.97   0.93   0.95   1           0.261 GB          20.8 h     37,240
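
The “bot in name” baseline needs no downloaded data, which is why it runs in a fraction of a second. A minimal sketch, assuming the heuristic is a case-insensitive substring check on the login; the exact matching rule behind the reported numbers is not given in the slides, and `bot_in_name` is a hypothetical name. The logins are taken from the RABBIT example output later in the deck.

```python
def bot_in_name(login: str) -> bool:
    """Flag a contributor as a bot if the substring 'bot' occurs in the login."""
    return "bot" in login.lower()


# Logins that appear in the RABBIT example output of this deck.
for login in ["github-actions[bot]", "openssl-machine", "renovate", "natarajan-chidambaram"]:
    print(login, "->", "Bot" if bot_in_name(login) else "Human")
```

A rule like this misses bots whose login does not contain “bot”, which is consistent with the heuristic's limited recall in the table.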

Goal: Improved Bot Identification Model
Based on Activity Sequences (BIMBAS)
• Good performance
• Fewer API queries
• Download less data
• Fast enough to apply on thousands of contributors
• Independent of text (future proof for LLM-based bots)

Features of BIMBAS

Counting metrics (# features = 5)
• # activities
• # activity types
• # repositories
• # owners
• # owners / # repositories

Aggregate metrics: mean, std, median, IQR, Gini (# features = 8*5 = 40)
• # activities per repository
• # activities per activity type
• # consecutive activities in a repository
• # activity types per repository
• Time between consecutive activities
• Time spent in a repository
• Time to switch repository
• Time to switch activity type

Feature categories: type, repository, temporal
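
To make the feature definitions concrete, here is a minimal sketch that computes the five counting metrics and the five aggregates for two of the eight per-contributor series. It is not the BIMBAS implementation. The activity tuple layout, the activity type names, the use of population standard deviation, and the quartile method are all assumptions.

```python
from collections import Counter
from datetime import datetime
from statistics import mean, median, pstdev, quantiles


def gini(values):
    """Gini coefficient of non-negative values (0 = perfectly equal)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum(i * x for i, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


def aggregate(values):
    """The five aggregates named on the slide: mean, std, median, IQR, Gini."""
    if not values:
        return {"mean": 0.0, "std": 0.0, "median": 0.0, "IQR": 0.0, "Gini": 0.0}
    iqr = 0.0
    if len(values) >= 2:
        q1, _, q3 = quantiles(values, n=4)  # default 'exclusive' method (assumption)
        iqr = q3 - q1
    return {"mean": mean(values), "std": pstdev(values),
            "median": median(values), "IQR": iqr, "Gini": gini(values)}


# Hypothetical activity sequence: (timestamp, activity type, "owner/repository").
activities = [
    ("2024-07-01T10:00:00", "PushCommit", "org-a/repo-1"),
    ("2024-07-01T10:05:00", "CommentPR", "org-a/repo-1"),
    ("2024-07-01T10:15:00", "MergePR", "org-a/repo-2"),
    ("2024-07-01T10:45:00", "PushCommit", "org-b/repo-3"),
]

repos = [repo for _, _, repo in activities]
owners = {repo.split("/")[0] for repo in repos}
counting = {
    "activities": len(activities),
    "activity_types": len({kind for _, kind, _ in activities}),
    "repositories": len(set(repos)),
    "owners": len(owners),
    "owners_per_repository": len(owners) / len(set(repos)),
}

times = sorted(datetime.fromisoformat(ts) for ts, _, _ in activities)
gaps = [(b - a).total_seconds() for a, b in zip(times, times[1:])]

print(counting)
print(aggregate(list(Counter(repos).values())))  # activities per repository
print(aggregate(gaps))                           # seconds between consecutive activities
```

Every feature here is derived from timestamps, activity types and repository names only, which matches the stated goal of staying independent of text.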

Selecting a classifier for BIMBAS

Steps: grid search on training set (hyperparameter tuning), important features, model evaluation on test set

Grid search on training set
• 7 classifiers
• 13,021 models
• Precision 0.93, Recall 0.93, AUC-ROC 0.97

Binary classifiers
• Gradient Boosting
• Random Forest
• Decision Tree
• XGBoost
• Linear Discriminant Analysis
• Support Vector Machines
• Gaussian Naïve Bayes

Removing unimportant features

Recursive Feature Elimination: removed 7 features

Permutation Importance Analysis
1. # activity types
2. # owners of repositories
3. Median time between activities of different types
4. Median number of activities per type
5. Mean number of activities per type

Evaluating performance of BIMBAS
(model evaluation on test set)

Approach                  precision   recall   F1 score
“bot in name” heuristic   0.77        0.76     0.76
BoDeGHa                   0.92        0.51     0.66
BoDeGiC                   0.81        0.27     0.41
BotHunter                 0.97        0.93     0.95
BIMBAS                    0.90        0.90     0.90
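
The F1 column can be sanity-checked from the precision and recall columns, assuming F1 is the usual harmonic mean of the two; the slides do not state the averaging scheme. Because the inputs are rounded to two decimals, the recomputed value may differ from the reported one in the last digit.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the evaluation table.
reported = {
    "bot-in-name heuristic": (0.77, 0.76, 0.76),
    "BoDeGHa": (0.92, 0.51, 0.66),
    "BoDeGiC": (0.81, 0.27, 0.41),
    "BotHunter": (0.97, 0.93, 0.95),
    "BIMBAS": (0.90, 0.90, 0.90),
}

for approach, (p, r, f1) in reported.items():
    print(f"{approach}: recomputed F1 = {f1_score(p, r):.3f}, reported = {f1}")
```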

BIMBAS Case Study
July to September 2024, 59 projects, 1169 GitHub repositories
802 humans, 34 bot accounts, 13 Apps, 4 internal services

RABBIT: A CLI-based tool implementing BIMBAS
https://github.com/natarajan-chidambaram/RABBIT

% rabbit --input-file names.txt
contributor             type           confidence
github-actions[bot]     Bot            1.0
google                  Organization   1.0
openssl-machine         Bot            0.714
natarajan-chidambaram   Human          0.926
renovate                Unknown        -
gh-ci                   Invalid        -

Evaluating RABBIT’s Efficiency
(X = factor relative to RABBIT)

Approach                        Data downloaded   X      Time       X     # API queries   X
NBH (“bot in name” heuristic)   -                 -      0.01 sec   -     -               -
BoDeGHa                         3.83 GB           34X    7.7 h      21X   10,222          4X
BoDeGiC                         23.3 GB           208X   23.1 h     60X   -               -
BotHunter                       0.261 GB          2.3X   20.8 h     57X   37,240          15X
RABBIT                          0.112 GB                 22 m             2,426

An order of magnitude faster, with fewer queries!
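
The X columns read as each tool's cost divided by RABBIT's cost. This interpretation is inferred from the numbers and is not stated in the slides. A short sketch that recomputes the factors from the rounded figures in the table, so small deviations from the slide values are expected, for example for BoDeGiC's time:

```python
# (data downloaded in GB, time in minutes, API queries) from the efficiency table.
# None marks a cell shown as "-" in the slide.
rabbit = (0.112, 22, 2426)
others = {
    "BoDeGHa": (3.83, 7.7 * 60, 10222),
    "BoDeGiC": (23.3, 23.1 * 60, None),
    "BotHunter": (0.261, 20.8 * 60, 37240),
}


def factor(value, baseline):
    """How many times larger `value` is than RABBIT's `baseline`."""
    return None if value is None else value / baseline


for name, measures in others.items():
    data_x, time_x, query_x = (factor(v, b) for v, b in zip(measures, rabbit))
    queries = "-" if query_x is None else f"{query_x:.1f}X"
    print(f"{name}: data {data_x:.1f}X, time {time_x:.1f}X, queries {queries}")
```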
