Machine Learning for
Large Scale Code Analysis
gitbase
SQL interface to git repositories
What we do
Delivering AI on Code
source{d} builds the open-source components that enable large-scale code analysis and
machine learning on source code.
Our powerful tools can ingest all of the world’s public git repositories, turning code into ASTs
ready for machine learning and other analyses, all exposed through a flexible and friendly
API.
What we do
source{d} Engine
● Code Retrieval
● Git Analysis
● Language Agnostic Code Analysis
● Querying With Familiar APIs
What is gitbase
MySQL wire protocol compatible (-ish) server to query git
repositories.
● Provide a set of raw git repos as input (no modifications)
● Read only
● Git repository data can be queried with MySQL client
● Written in Go
● Apache license
What is gitbase
Start server
$ gitbase server -g /directory/with/repos
INFO[0000] server started and listening on localhost:3306
What is gitbase
Connect with MySQL CLI
$ mysql -h 127.0.0.1 -u root
MySQL [(none)]> show tables;
+--------------+
| table |
+--------------+
| blobs |
| commit_blobs |
| commit_trees |
| commits |
| files |
| ref_commits |
| refs |
| remotes |
| repositories |
| tree_entries |
+--------------+
10 rows in set (0.01 sec)
What is gitbase
Get the number of commits per repository
SELECT commits.repository_id,
COUNT(commits.commit_hash) as commits
FROM commits
GROUP BY commits.repository_id
What is gitbase
Get the number of commits per repository
+--------------------+---------+
| repository_id | commits |
+--------------------+---------+
| go-siva | 69 |
| rovers | 392 |
| borges | 497 |
| envconfig | 52 |
| gitbase | 627 |
| regression-gitbase | 7 |
| gcfg | 122 |
| go-git-fixtures | 52 |
| regression-borges | 19 |
+--------------------+---------+
What is gitbase
Get the number of blobs per HEAD commit
SELECT blobs.repository_id, COUNT(blobs.blob_hash)
FROM blobs
NATURAL JOIN refs
WHERE refs.ref_name = 'HEAD'
GROUP BY blobs.repository_id
What is gitbase
Get the number of blobs per HEAD commit
+---------------+------------------------+
| repository_id | COUNT(blobs.blob_hash) |
+---------------+------------------------+
| go-siva | 163 |
| gitquery | 3279 |
| go-billy | 353 |
| go-billy-siva | 67 |
+---------------+------------------------+
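NATURAL JOIN may look unusual: it is an inner join that matches rows on every column name the two tables share. The Python sketch below illustrates only that semantics. The rows, column names and the `natural_join` helper are made up for the example and are not gitbase's schema or implementation.

```python
def natural_join(left, right):
    """Inner-join two lists of dict rows on all column names they share."""
    joined = []
    for l_row in left:
        for r_row in right:
            shared = l_row.keys() & r_row.keys()
            # Rows match only if every shared column holds the same value.
            if all(l_row[c] == r_row[c] for c in shared):
                joined.append({**l_row, **r_row})
    return joined


# Hypothetical rows, for illustration only.
refs = [
    {"repository_id": "go-siva", "ref_name": "HEAD", "commit_hash": "c1"},
    {"repository_id": "go-siva", "ref_name": "refs/heads/dev", "commit_hash": "c2"},
]
commits = [
    {"repository_id": "go-siva", "commit_hash": "c1", "author": "alice"},
    {"repository_id": "go-siva", "commit_hash": "c2", "author": "bob"},
]

# Equivalent in spirit to: ... FROM refs NATURAL JOIN commits WHERE ref_name = 'HEAD'
head_commits = [
    row for row in natural_join(refs, commits) if row["ref_name"] == "HEAD"
]
```

Here the shared columns are `repository_id` and `commit_hash`, so each ref is paired only with the commit it points to in the same repository.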
What is gitbase
Number of functions per file, Go language
SELECT files.repository_id, files.file_path,
ARRAY_LENGTH(UAST(
files.blob_content,
LANGUAGE(files.file_path, files.blob_content),
'//*[@roleFunction and @roleDeclaration]'
)) as functions
FROM files
NATURAL JOIN refs
WHERE LANGUAGE(files.file_path,files.blob_content) = 'Go' AND
refs.ref_name = 'HEAD'
What is gitbase
Number of functions per file, Go language
+---------------+---------------------------+-----------+
| repository_id | file_path | functions |
+---------------+---------------------------+-----------+
| gitbase | blobs.go | 12 |
| gitbase | blobs_test.go | 3 |
| gitbase | cmd/gitbase/main.go | 2 |
| gitbase | cmd/gitbase/server.go | 2 |
| gitbase | cmd/gitbase/version.go | 1 |
| gitbase | commits.go | 12 |
| gitbase | commits_test.go | 3 |
| gitbase | database.go | 3 |
| gitbase | database_test.go | 5 |
| gitbase | internal/format/common.go | 1 |
+---------------+---------------------------+-----------+
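The XPath part of the query above selects nodes that carry both the Function and the Declaration role. As a rough illustration of that idea, the sketch below runs a similar query over a tiny hand-written XML tree with Python's standard library. The tree, its tag names and its attribute values are assumptions made for the example; real UASTs are produced by bblfsh and are far richer.

```python
import xml.etree.ElementTree as ET

# A hand-written toy tree standing in for a UAST (hypothetical structure).
TOY_UAST = """
<File>
  <FuncDecl roleFunction="true" roleDeclaration="true" name="main"/>
  <FuncDecl roleFunction="true" roleDeclaration="true" name="helper"/>
  <CallExpr roleFunction="true" roleCall="true" name="helper"/>
  <VarDecl roleDeclaration="true" name="x"/>
</File>
"""

root = ET.fromstring(TOY_UAST)

# ElementTree implements only a subset of XPath, so the two attribute tests
# are written as chained predicates instead of being joined with "and".
declared = root.findall(".//*[@roleFunction][@roleDeclaration]")
function_count = len(declared)
function_names = [node.get("name") for node in declared]
```

Only the two `FuncDecl` nodes carry both attributes, so the call expression and the variable declaration are left out of the count.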
gitbase architecture
● go-mysql-server: storage agnostic database engine and server
● bblfsh: converts source code into universal AST
● enry: language detection
● go-git: git implementation in pure go
gitbase architecture
[Diagram: gitbase is built on go-mysql-server (which uses vitess), enry, bblfsh, go-git (which reads the git repos), and pilosa with boltdb (which store the indexes)]
gitbase architecture
● Volcano/iterator model architecture
● Only has one database with all the repositories
● Maps basic objects as tables and extra tables for relations
● Functions to extract data not available in the repo:
○ Language
○ Universal AST
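In the Volcano/iterator model, a query becomes a tree of iterators, and each node pulls rows from the node below it. A minimal Python sketch with generators follows. The function names and table contents are hypothetical, and this is not how go-mysql-server is actually written.

```python
def scan(rows):
    """Leaf iterator: yields rows one at a time from a table."""
    for row in rows:
        yield row


def where(child, predicate):
    """Filter iterator: pulls from its child and passes matching rows up."""
    for row in child:
        if predicate(row):
            yield row


def project(child, columns):
    """Projection iterator: keeps only the requested columns."""
    for row in child:
        yield {c: row[c] for c in columns}


# Hypothetical table contents.
commits = [
    {"repository_id": "gitbase", "commit_hash": "a1", "author": "alice"},
    {"repository_id": "gitbase", "commit_hash": "b2", "author": "bob"},
    {"repository_id": "go-git", "commit_hash": "c3", "author": "alice"},
]

# SELECT repository_id, commit_hash FROM commits WHERE author = 'alice'
plan = project(
    where(scan(commits), lambda r: r["author"] == "alice"),
    ["repository_id", "commit_hash"],
)
result = list(plan)
```

Nothing is read until the top iterator is consumed, so a row flows through the whole chain before the next one is fetched.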
Data in gitbase
Tables mapping objects
● repositories: repository_id
● refs: name, commit_hash
● remotes: name, url, refspec
● tree_entries: hash, name, type, blob_hash
● commits: hash, author, date, message, tree
● blobs: hash, size, content
Data in gitbase
Relation tables
● files: tree, path, type, content, size
● commit_trees: commit, tree
● commit_blobs: commit, blob
● ref_commits: ref, commit
Making it fast
Performance problems
● Accessing the repository objects is not a cheap operation
● It’s good to retrieve as few objects as possible from the git repo
Making it fast
Solutions
● Cache, cache, cache, CACHE!!
● Squashed tables and filter pushdown
● Indexes
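Caching helps because reading and decompressing a git object is expensive, while serving it again from memory is cheap. The sketch below shows the effect with Python's `functools.lru_cache`. The object store and the `load_object` function are stand-ins invented for the example, not gitbase's cache.

```python
from functools import lru_cache

reads = {"count": 0}  # number of simulated expensive reads

# Hypothetical object store standing in for packfile access.
OBJECTS = {"c1": "commit one", "c2": "commit two"}


@lru_cache(maxsize=1024)
def load_object(object_hash):
    """Simulate an expensive read and decompression of a git object."""
    reads["count"] += 1
    return OBJECTS[object_hash]


# Five lookups, but only two distinct objects.
for object_hash in ["c1", "c2", "c1", "c1", "c2"]:
    load_object(object_hash)
```

After the loop, only the first lookup of each hash has touched the store; the other three were answered from the cache.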
Making it fast
Squashed tables and filter pushdown (simplified)
SELECT refs.repository_id
FROM refs
NATURAL JOIN commits
WHERE commits.commit_author_name = 'Javi Fontan' AND
refs.ref_name='HEAD'
Making it fast
Squashed tables and filter pushdown (simplified)
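For the query above, the difference between a generic join and a squashed chain of iterators can be sketched in Python by counting commit reads. The repositories, hashes and function names below are made up for the example; real gitbase reads git objects through go-git.

```python
# Hypothetical in-memory repositories.
REPOS = {
    "gitbase": {
        "refs": {"HEAD": "c3", "refs/heads/old": "c1"},
        "commits": {"c1": "Someone Else", "c2": "Javi Fontan", "c3": "Javi Fontan"},
    },
    "go-git": {
        "refs": {"HEAD": "d2"},
        "commits": {"d1": "Javi Fontan", "d2": "Someone Else"},
    },
}


def naive(author):
    """Generic join: each ref that reaches the join scans every commit."""
    refs = [(rid, name, h) for rid, r in REPOS.items() for name, h in r["refs"].items()]
    commits = [(rid, h, a) for rid, r in REPOS.items() for h, a in r["commits"].items()]
    reads, found = 0, []
    for rid, name, ref_hash in refs:  # full scan of refs
        if name != "HEAD":
            continue
        for c_rid, c_hash, c_author in commits:  # full scan per surviving ref
            reads += 1
            if (c_rid, c_hash) == (rid, ref_hash) and c_author == author:
                found.append(rid)
    return found, reads


def squashed(author):
    """Chained iterators: look up HEAD directly, then fetch one commit by hash."""
    reads, found = 0, []
    for rid, repo in REPOS.items():
        head_hash = repo["refs"].get("HEAD")  # ref_name filter pushed down
        if head_hash is None:
            continue
        reads += 1  # exactly one commit is read, directly by its hash
        if repo["commits"][head_hash] == author:
            found.append(rid)
    return found, reads
```

Both functions return the same repositories, but `naive` reads every commit once per HEAD ref while `squashed` reads one commit per repository.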
Making it fast
Indexes
● There is no silver-bullet index that speeds up every query
● Indexes can take up a lot of space
● They take a long time to generate, since building one requires a full scan of the table
Making it fast
Indexes
● Defined by the user
● Use bitmaps so indexes can be composed
● Stores packfile and offset of the object
● Around 2 orders of magnitude faster in some queries
CREATE INDEX author ON commits
USING pilosa (commit_author_name)
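Bitmap indexes compose well because combining two conditions is a bitwise operation on their bitmaps. The Python sketch below uses plain integers as bitmaps and keeps a (packfile, offset) pair for each row. The rows and helper names are hypothetical, and this is not pilosa's API or gitbase's index format.

```python
from collections import defaultdict

# Hypothetical rows: (author, repository_id, packfile, offset).
ROWS = [
    ("alice", "gitbase", "pack-1", 12),
    ("bob", "gitbase", "pack-1", 340),
    ("alice", "go-git", "pack-2", 7),
    ("alice", "gitbase", "pack-2", 98),
]


def build_index(column):
    """Map each distinct value to a bitmap (a Python int) of matching row numbers."""
    index = defaultdict(int)
    for row_number, row in enumerate(ROWS):
        index[row[column]] |= 1 << row_number
    return index


def rows_in(bitmap):
    """Yield the row numbers whose bit is set, lowest first."""
    row_number = 0
    while bitmap:
        if bitmap & 1:
            yield row_number
        bitmap >>= 1
        row_number += 1


by_author = build_index(0)
by_repository = build_index(1)

# Composing two indexes is a bitwise AND of their bitmaps.
matches = by_author["alice"] & by_repository["gitbase"]

# The stored (packfile, offset) pairs say where to read each matching object.
locations = [ROWS[n][2:] for n in rows_in(matches)]
```

With the rows above, `locations` holds the positions of rows 0 and 3, the only ones written by "alice" in "gitbase".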
Thanks!
● gitbase: https://github.com/src-d/gitbase
● go-git: https://github.com/src-d/go-git
● engine: https://sourced.tech/engine
We are hiring!


Editor's Notes

  • #2: Hi, I am Javi Fontán and I am an engineer at source{d}. I'm going to show you what gitbase is and some of its implementation details.
  • #4: At source{d} we make tools to do large scale code analysis and machine learning on source code. I am not going to talk about the Machine Learning stuff today. All the tools I'm going to talk about are Open Source.
  • #5: All the tools we make that deal with access to repositories, parse source code, detect language and do searches are under an umbrella called the source{d} Engine. gitbase is one of these tools.
  • #7: gitbase is a database server that uses the MySQL wire protocol and SQL dialect to query git repositories. This means that you can connect to it with MySQL clients and libraries and use a common SQL dialect to make queries. The storage for this database is plain git repositories; no preprocessing of these repos is needed.
  • #8: This is a typical use of gitbase. We provide it with a directory full of git repositories and it starts listening on a port. You will probably find port 3306 familiar.
  • #9: Now we can connect to it using a MySQL compatible client. In this case we are using the standard MySQL client tool and get a list of the tables. You'll also find these tables familiar if you've been using Git.
  • #10: In this example we are using count and group by to get the number of commits per repository. As you can see it looks like a normal SQL query that you can do to your databases.
  • #11: Here's the result.
  • #12: In this other example we use a filter and a join to count only blobs that belong to the HEAD commit of each repository. You may find NATURAL JOIN weird, but it’s an INNER JOIN that automatically matches on all columns with the same name. In this case it matches all the commits pointed to by references named “HEAD” and the blobs contained in those commits.
  • #13: This is what you get in mysql client.
  • #14: This other example is more complex. We are not only getting information from git itself but also using other components to extract more data: * The LANGUAGE function detects the programming language of a file; it is used at the end of the query to keep only the files written in Go * UAST parses the file and generates what we call a Universal Abstract Syntax Tree, that is, a standardized AST that lets us query it in the same way for any supported language. This AST can be queried with XPath, and that's the part with roleFunction and roleDeclaration in the query.
  • #15: Here's an excerpt of the output in one repository.
  • #17: gitbase uses several libraries to accomplish this: * go-mysql-server is a library to create databases. It uses parts of the vitess project for the wire protocol and SQL parsing, and it also contains the database engine. * bblfsh is responsible for generating and querying the Universal AST mentioned previously. * enry is a Go port of the linguist library, used to detect programming languages. * go-git is a library to access git repositories. It's like libgit2 but implemented in Go. * To implement indexes we use pilosa, a library for bitmap indexes, and boltdb as a key-value store for the bitmap-to-value mapping.
  • #18: * gitbase uses the Volcano or iterator model. This is the architecture used by most relational databases. A tree of iterators is generated from the query; each node pulls rows from the one below it and transforms or filters them, and its output is in turn consumed by the iterator above. * gitbase has only one database, with a table for each type of object. The tables contain the data for all the repositories. * It has some extra functions to get information not stored in the repositories, such as languages and Universal ASTs.
  • #19: We map every git object type to its own table, so you can query tables such as commits or refs.
  • #20: And we also have some helper tables that allow us to express relations. A special one is files, which has the information of a file from a commit. This allows easier queries than using relations to match commits with tree entries and blobs.
  • #22: We ran into lots of performance issues accessing git objects. Although we are actively working on making our libraries faster at accessing objects, we are also trying to read as few objects as possible.
  • #23: Some of the techniques we use are: * caching as much data as we can so we don't have to read it from disk or decompress it again * squashed tables, which let us access objects directly when possible instead of iterating over all of them and filtering the ones we want * creating indexes for the values we access most frequently
  • #24: In this example we want to get all the repositories whose HEAD commit has the author name Javi Fontan. Without any optimization we iterate over all the references and all the commits. The WHERE nodes filter the objects we are interested in and the join matches the data from the iterators on both sides. Here we do a full scan of the references table and N full scans of the commits table, one for each reference that reaches the join. As you can see, this is far from optimal.
  • #25: The squashed tables technique exploits some peculiarities of git. We know that each ref tip is exactly one commit, and we can push some filters down to the tables, that is, pass the filtering to the storage layer, where we can get the object directly instead of iterating and filtering. In this example we replace the general join with a chain of iterators. The references iterator can directly get the references with a specific name, in this case HEAD, from all repositories. Then we use the commit hash to feed the next iterator, so we only get HEAD commits and filter them by author. Now we only read the commits that are the HEAD of some repo, a much cheaper operation.
  • #26: Another technique we use is creating indexes for some columns. These indexes are not git indexes but user-defined ones. Depending on the usage we may be interested in different columns, so we let users pick which ones they want. Some people may want to pinpoint commits from specific users, others to analyze files written in some language. Generating indexes takes quite some time and space, and that’s another reason to let users pick their indexes.
  • #27: We use bitmaps as they are easily composable. We also store the packfile and offset for the rows so we bypass git indexing and read directly from the files. In this example we create an index for the author name of commits. After it is completed we can get exactly the commits that match a filter by author name.
  • #28: And that’s what I had prepared for today. If you want to know more about gitbase or any related library you can ask me and my colleagues or go to our project pages shown here. Thank you!