Service public de Wallonie – Recherche
Grant 2010235 “ARIAC BY DIGITALWALLONIA4.AI”
A Bot Identification Model and Tool
Based on GitHub Activity Sequences
Natarajan Chidambaram, Alexandre Decan, Tom Mens
Software Engineering Lab
Bot accounts and Apps frequently engage in GitHub repositories
Examples: pushing commits, commenting on PRs, merging PRs
Repository contributors can be involved in a wide range of activity types:
• Pushing commits
• Opening/closing/reopening/commenting issues or PRs
• Creating/deleting tags
• Creating/deleting branches
• Reviewing code
• …
Comparison of Bot Identification Approaches
• BIMAN (Dey et al., MSR 2020; ssc-oscar/BIMAN_bot_detection): cannot be run on “live data”; requires World of Code
• BoDeGHa (Golzadeh et al., JSS 2021; mehdigolzadeh/BoDeGHa): based on PR and issue comments only
• BoDeGiC (Golzadeh et al., BENEVOL 2020; mehdigolzadeh/BoDeGiC): based on git commit messages only
• BotHunter (Abdellatif et al., MSR 2022; ahmad-abdellatif/BotHunter): uses too many API queries
Dataset
New Ground-truth Dataset of GitHub contributors

          # contributors   # activities   median # activities   mean # activity types
bots               1,035        182,218                   194                     3.6
humans             1,115        155,028                   147                     9.1
total              2,150        337,246                   171                     6.4

• Training set of 1,290 contributors (60%), of which 621 bots and 669 humans
• Test set of 860 contributors (40%), of which 414 bots and 446 humans
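The 60/40 split above keeps the bot/human proportions the same in both sets. A minimal sketch of such a stratified split follows; the function name, the seed, and the placeholder contributor names are assumptions for illustration, not the authors' code. Only the class sizes come from the dataset slide.

```python
import random

def stratified_split(labelled, train_fraction=0.6, seed=42):
    """Split (name, label) pairs so each label keeps the same share in both sets."""
    rng = random.Random(seed)
    by_label = {}
    for name, label in labelled:
        by_label.setdefault(label, []).append((name, label))
    train, test = [], []
    for items in by_label.values():
        rng.shuffle(items)
        cut = round(len(items) * train_fraction)
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test

# Hypothetical placeholder names; only the counts (1,035 bots, 1,115 humans) are from the slide.
contributors = [(f"bot-{i}", "bot") for i in range(1035)]
contributors += [(f"human-{i}", "human") for i in range(1115)]

train, test = stratified_split(contributors)
print(len(train), len(test))
```

Splitting per class rather than over the whole list is what keeps the bot share equal in the training and test sets.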
Comparison of Bot Identification Approaches
on test set of 860 GitHub contributors

Approach                  P      R      F1     # unknown   Data downloaded   Time       # API queries
“bot in name” heuristic   0.77   0.76   0.76   0           -                 0.01 sec   -
BoDeGHa                   0.92   0.51   0.66   392         3.83 GB           7.7 h      10,222
BoDeGiC                   0.81   0.27   0.41   627         23.3 GB           23.1 h     -
BotHunter                 0.97   0.93   0.95   1           0.261 GB          20.8 h     37,240
Goal: Improved Bot Identification Model Based on Activity Sequences (BIMBAS)
• Good performance
• Download less data
• Fewer API queries
• Fast enough to apply to thousands of contributors
• Independent of text (future-proof for LLM-based bots)
Features of BIMBAS
Counting metrics (# features = 5):
• # activities
• # activity types
• # repositories
• # owners
• # owners / # repositories
Aggregate metrics (mean, std, median, IQR, Gini; # features = 8*5 = 40):
• # activities per repository
• # activities per activity type
• # consecutive activities in a repository
• # activity types per repository
• Time between consecutive activities
• Time spent in a repository
• Time to switch repository
• Time to switch activity type
Feature groups: type, repository, temporal
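To make the feature definitions concrete, here is a minimal sketch that computes the five counting metrics and the five aggregates for two of the eight series. The event format (a dict with `repo`, `type`, `time`), the function names, and the use of population standard deviation and the default quartile method are all assumptions; the slides do not specify them.

```python
from collections import Counter
from statistics import mean, median, pstdev, quantiles

def gini(values):
    """Gini coefficient of non-negative values (0 means perfectly equal)."""
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n

def aggregate(values):
    """The five aggregates named on the slide: mean, std, median, IQR, Gini."""
    iqr = 0.0
    if len(values) >= 2:
        q1, _, q3 = quantiles(values, n=4)
        iqr = q3 - q1
    return {"mean": mean(values), "std": pstdev(values),
            "median": median(values), "IQR": iqr, "Gini": gini(values)}

def counting_features(events):
    """The five counting metrics; 'repo' is assumed to look like 'owner/name'."""
    repos = {e["repo"] for e in events}
    owners = {e["repo"].split("/")[0] for e in events}
    return {"n_activities": len(events),
            "n_activity_types": len({e["type"] for e in events}),
            "n_repositories": len(repos),
            "n_owners": len(owners),
            "owners_per_repository": len(owners) / len(repos)}

# Hypothetical activity sequence (times in seconds), sorted by time.
events = [
    {"repo": "acme/app", "type": "push", "time": 0},
    {"repo": "acme/app", "type": "push", "time": 60},
    {"repo": "acme/lib", "type": "comment", "time": 120},
    {"repo": "other/tool", "type": "push", "time": 300},
]

counts = counting_features(events)
per_repo = list(Counter(e["repo"] for e in events).values())
gaps = [b["time"] - a["time"] for a, b in zip(events, events[1:])]
features = {"activities_per_repository": aggregate(per_repo),
            "time_between_activities": aggregate(gaps)}
```

The remaining six series (for example time to switch repository) would be built the same way and passed through `aggregate`, giving 8*5 = 40 aggregate features.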
Selecting a classifier for BIMBAS
Steps: grid search on training set → important features → model evaluation on test set
Binary classifiers:
• Gradient Boosting
• Random Forest
• Decision Tree
• XGBoost
• Linear Discriminant Analysis
• Support Vector Machines
• Gaussian Naïve Bayes
Hyperparameter tuning:
• 7 classifiers
• 13,021 models
Precision 0.93, Recall 0.93, AUC-ROC 0.97
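The grid search step can be illustrated with a small, dependency-free sketch: every hyperparameter combination is scored by k-fold cross-validation on the training data and the best one is kept. The threshold rule used as a "classifier", the toy data, and all names here are assumptions for illustration only; the classifiers and grids on the slide are not reproduced. In practice a library routine such as scikit-learn's `GridSearchCV` would typically do this work.

```python
from itertools import product

def k_fold_indices(n, k):
    """Yield (train_idx, val_idx) pairs for k contiguous folds."""
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        val = list(range(start, start + size))
        train = [i for i in range(n) if i < start or i >= start + size]
        yield train, val
        start += size

def grid_search(fit, score, X, y, grid, k=3):
    """Return (best_params, best_mean_score) over all combinations in grid."""
    best_params, best_score = None, float("-inf")
    keys = sorted(grid)
    for combo in product(*(grid[key] for key in keys)):
        params = dict(zip(keys, combo))
        scores = []
        for train, val in k_fold_indices(len(X), k):
            model = fit([X[i] for i in train], [y[i] for i in train], **params)
            scores.append(score(model, [X[i] for i in val], [y[i] for i in val]))
        avg = sum(scores) / len(scores)
        if avg > best_score:  # ties keep the earlier combination
            best_params, best_score = params, avg
    return best_params, best_score

def fit_stump(X, y, feature, threshold):
    """Toy model: learn which side of the threshold is mostly bots (label 1)."""
    below = [label for row, label in zip(X, y) if row[feature] <= threshold]
    low_label = 1 if below and 2 * sum(below) >= len(below) else 0
    return lambda row: low_label if row[feature] <= threshold else 1 - low_label

def accuracy(model, X, y):
    return sum(model(row) == label for row, label in zip(X, y)) / len(y)

# Hypothetical data: column 0 mimics "# activity types", column 1 is noise.
X = [[2, 5], [9, 5], [3, 7], [8, 6], [1, 6], [10, 7], [4, 5], [7, 6], [2, 7]]
y = [1, 0, 1, 0, 1, 0, 1, 0, 1]  # 1 = bot, 0 = human

best_params, best_score = grid_search(
    fit_stump, accuracy, X, y, {"feature": [0, 1], "threshold": [3, 5, 6]})
print(best_params, best_score)
```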
Removing unimportant features
Recursive Feature Elimination: removed 7 features
Permutation Importance Analysis:
1. # activity types
2. # owners of repositories
3. Median time between activities of different types
4. Median number of activities per type
5. Mean number of activities per type
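Permutation importance measures how much a model's score drops when the values of one feature are shuffled across contributors. A minimal sketch of the idea, assuming accuracy as the score and a hypothetical threshold rule as the model (neither is stated on the slide):

```python
import random

def accuracy(predict, X, y):
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def permutation_importance(predict, X, y, n_repeats=10, seed=0):
    """Mean drop in accuracy when each feature column is shuffled in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for col in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            shuffled = [row[col] for row in X]
            rng.shuffle(shuffled)
            # Rebuild the rows with only this one column permuted.
            X_perm = [row[:col] + [value] + row[col + 1:]
                      for row, value in zip(X, shuffled)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Hypothetical model: few activity types (column 0) means bot; column 1 is ignored.
def predict(row):
    return 1 if row[0] <= 5 else 0

X = [[2, 5], [9, 5], [3, 7], [8, 6], [1, 6], [10, 7], [4, 5], [7, 6]]
y = [1, 0, 1, 0, 1, 0, 1, 0]

importances = permutation_importance(predict, X, y)
print(importances)
```

A feature the model never uses gets an importance of exactly zero, which is why this kind of analysis can rank features such as those in the list above.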
Evaluating performance of BIMBAS

Approach                  precision   recall   F1 score
“bot in name” heuristic   0.77        0.76     0.76
BoDeGHa                   0.92        0.51     0.66
BoDeGiC                   0.81        0.27     0.41
BotHunter                 0.97        0.93     0.95
BIMBAS                    0.90        0.90     0.90
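The F1 scores in the comparison can be cross-checked against precision and recall. This sketch assumes the standard definition of F1 as the harmonic mean of the two; the slides do not say how the scores were averaged over classes, so small deviations from the reported values are expected from rounding alone.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) as given on the slide.
reported = {
    "bot in name heuristic": (0.77, 0.76, 0.76),
    "BoDeGHa": (0.92, 0.51, 0.66),
    "BoDeGiC": (0.81, 0.27, 0.41),
    "BotHunter": (0.97, 0.93, 0.95),
    "BIMBAS": (0.90, 0.90, 0.90),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: recomputed F1 = {f1_score(p, r):.3f}, reported = {f1:.2f}")
```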
BIMBAS Case Study
July to September 2024, 59 projects, 1169 GitHub repositories
802 humans, 34 bot accounts, 13 Apps, 4 internal services
RABBIT: A CLI-based tool implementing BIMBAS
https://github.com/natarajan-chidambaram/RABBIT

% rabbit --input-file names.txt
contributor             type           confidence
github-actions[bot]     Bot            1.0
google                  Organization   1.0
openssl-machine         Bot            0.714
natarajan-chidambaram   Human          0.926
renovate                Unknown        -
gh-ci                   Invalid        -
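For downstream analysis, the table printed by the tool can be turned into records. The sketch below assumes the output is whitespace-separated text exactly as shown on the slide; the real output format of RABBIT may differ, and the function name and the 0.7 confidence cut-off are hypothetical.

```python
def parse_rabbit_output(text):
    """Parse 'contributor type confidence' rows; '-' becomes None."""
    lines = [line for line in text.strip().splitlines() if line.strip()]
    rows = []
    for line in lines[1:]:  # skip the header row
        contributor, kind, confidence = line.split()
        rows.append({
            "contributor": contributor,
            "type": kind,
            "confidence": None if confidence == "-" else float(confidence),
        })
    return rows

# Output copied from the slide (assumed to be plain whitespace-separated text).
sample = """contributor type confidence
github-actions[bot] Bot 1.0
google Organization 1.0
openssl-machine Bot 0.714
natarajan-chidambaram Human 0.926
renovate Unknown -
gh-ci Invalid -
"""

rows = parse_rabbit_output(sample)
# Hypothetical filter: keep accounts labelled Bot with confidence of at least 0.7.
bots = [r["contributor"] for r in rows
        if r["type"] == "Bot" and r["confidence"] >= 0.7]
print(bots)
```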
Evaluating RABBIT’s Efficiency

            Data downloaded   X      Time       X     # API queries   X
NBH         -                 -      0.01 sec         -
BoDeGHa     3.83 GB           34X    7.7 h      21X   10,222          4X
BoDeGiC     23.3 GB           208X   23.1 h     60X   -               -
BotHunter   0.261 GB          2.3X   20.8 h     57X   37,240          15X
RABBIT      0.112 GB                 22 m             2,426

An order of magnitude faster, with fewer queries!
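The X columns appear to be each tool's cost divided by RABBIT's cost; that reading is an assumption, and the sketch below recomputes the factors from the figures in the table. Most match the slide after rounding. The recomputed time factor for BoDeGiC comes out near 63 rather than the 60X shown, presumably because the reported times are themselves rounded.

```python
# Figures from the efficiency table; times converted to minutes.
rabbit = {"data_gb": 0.112, "minutes": 22, "api_queries": 2426}
others = {
    "BoDeGHa": {"data_gb": 3.83, "minutes": 7.7 * 60, "api_queries": 10222},
    "BoDeGiC": {"data_gb": 23.3, "minutes": 23.1 * 60, "api_queries": None},
    "BotHunter": {"data_gb": 0.261, "minutes": 20.8 * 60, "api_queries": 37240},
}

def ratios(tool):
    """Cost of a tool relative to RABBIT; None where the slide gives no value."""
    return {key: (None if value is None else value / rabbit[key])
            for key, value in tool.items()}

for name, tool in others.items():
    parts = [f"{key}: " + ("-" if value is None else f"{value:.1f}X")
             for key, value in ratios(tool).items()]
    print(name, ", ".join(parts))
```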