Gitbase, SQL interface to Git repositories

Machine Learning for
Large Scale Code Analysis
gitbase
SQL interface to git repositories

What we do
Delivering AI on Code
source{d} builds the open-source components that enable large-scale code analysis and
machine learning on source code.
Our tools can ingest all of the world’s public git repositories, turning code into ASTs
ready for machine learning and other analyses, all exposed through a flexible and friendly
API.

What we do
source{d} Engine
● Code Retrieval
● Git Analysis
● Language Agnostic Code Analysis
● Querying With Familiar APIs

What is gitbase
A (mostly) MySQL wire-protocol-compatible server for querying git
repositories.
● Takes a set of raw git repos as input (no modifications needed)
● Read-only
● Git repository data can be queried with any MySQL client
● Written in Go
● Apache license

What is gitbase
Start server
$ gitbase server -g /directory/with/repos
INFO[0000] server started and listening on localhost:3306

What is gitbase
Get the number of commits per repository
SELECT commits.repository_id,
       COUNT(commits.commit_hash) AS commits
FROM commits
GROUP BY commits.repository_id

What is gitbase
Get the number of commits per repository
+--------------------+---------+
| repository_id      | commits |
+--------------------+---------+
| go-siva            |      69 |
| rovers             |     392 |
| borges             |     497 |
| envconfig          |      52 |
| gitbase            |     627 |
| regression-gitbase |       7 |
| gcfg               |     122 |
| go-git-fixtures    |      52 |
| regression-borges  |      19 |
+--------------------+---------+

What is gitbase
Get the number of blobs per HEAD commit
SELECT blobs.repository_id, COUNT(blobs.blob_hash)
FROM blobs
NATURAL JOIN refs
WHERE refs.ref_name = 'HEAD'
GROUP BY blobs.repository_id

What is gitbase
Get the number of blobs per HEAD commit
+---------------+------------------------+
| repository_id | COUNT(blobs.blob_hash) |
+---------------+------------------------+
| go-siva       |                    163 |
| gitquery      |                   3279 |
| go-billy      |                    353 |
| go-billy-siva |                     67 |
+---------------+------------------------+

What is gitbase
Number of functions per file, Go language
SELECT files.repository_id, files.file_path,
       ARRAY_LENGTH(UAST(
           files.blob_content,
           LANGUAGE(files.file_path, files.blob_content),
           '//*[@roleFunction and @roleDeclaration]'
       )) AS functions
FROM files
NATURAL JOIN refs
WHERE LANGUAGE(files.file_path, files.blob_content) = 'Go'
  AND refs.ref_name = 'HEAD'

gitbase architecture
● go-mysql-server: storage-agnostic database engine and server
● bblfsh: converts source code into a universal AST
● enry: language detection
● go-git: git implementation in pure Go
[Diagram: gitbase and the components it builds on (vitess, go-mysql-server, enry, bblfsh, go-git, pilosa, boltdb), with git repos and indexes as data]

gitbase architecture
● Volcano/iterator model architecture
● A single database holds all the repositories
● Maps basic git objects to tables, plus extra tables for relations
● Functions to extract data that is not stored directly in the repo:
○ Language
○ Universal AST
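
To make the Volcano/iterator model concrete, here is a minimal Python sketch (gitbase itself is written in Go, and this is not its code). Each operator pulls rows from its child one at a time. The `Scan`, `Filter` and `GroupCount` classes and the sample rows are hypothetical, invented for illustration only.

```python
from collections import Counter

# Hypothetical in-memory stand-in for the commits table (not real data).
COMMITS = [
    {"repository_id": "repo-a", "commit_hash": "a1", "commit_author_name": "alice"},
    {"repository_id": "repo-a", "commit_hash": "a2", "commit_author_name": "bob"},
    {"repository_id": "repo-b", "commit_hash": "b1", "commit_author_name": "alice"},
]


class Scan:
    """Leaf operator: yields the rows of a table one at a time."""

    def __init__(self, rows):
        self.rows = rows

    def __iter__(self):
        yield from self.rows


class Filter:
    """Pulls rows from its child and passes on those matching a predicate."""

    def __init__(self, child, predicate):
        self.child = child
        self.predicate = predicate

    def __iter__(self):
        for row in self.child:
            if self.predicate(row):
                yield row


class GroupCount:
    """Blocking operator: consumes its child fully, then yields one row per group."""

    def __init__(self, child, key):
        self.child = child
        self.key = key

    def __iter__(self):
        counts = Counter(row[self.key] for row in self.child)
        for value, count in counts.items():
            yield {self.key: value, "commits": count}


# Rough equivalent of: SELECT repository_id, COUNT(commit_hash) ... GROUP BY repository_id
commits_per_repo = list(GroupCount(Scan(COMMITS), "repository_id"))

# Same plan with a filter operator placed between the scan and the aggregation.
alice_only = Filter(Scan(COMMITS), lambda r: r["commit_author_name"] == "alice")
alice_per_repo = list(GroupCount(alice_only, "repository_id"))
```

Because every operator exposes the same iteration interface, a plan is just a tree of nested operators, and the top-level consumer drives execution by pulling rows.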

Data in gitbase
Tables mapping objects
● repositories: repository_id
● refs: name, commit_hash
● remotes: name, url, refspec
● tree_entries: hash, name, type, blob_hash
● commits: hash, author, date, message, tree
● blobs: hash, size, content

Data in gitbase
Relation tables
● files: tree, path, type, content, size
● commit_trees: commit, tree
● commit_blobs: commit, blob
● ref_commits: ref, commit

Making it fast
Performance problems
● Accessing the repository objects is not a cheap operation
● Retrieve as few objects as possible from the git repo

Making it fast
Solutions
● Cache, cache, cache, CACHE!!
● Squashed tables and filter pushdown
● Indexes
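
The slides do not say what gitbase caches or how, so the following is only a generic Python illustration of why caching helps when object reads are expensive. `read_object_from_repo`, `FAKE_OBJECT_STORE` and the hashes are hypothetical stand-ins, and an LRU policy is an assumption.

```python
from functools import lru_cache

# Hypothetical object store standing in for a git repository on disk.
FAKE_OBJECT_STORE = {"a1": b"commit one", "b2": b"blob two"}
READS = {"count": 0}


def read_object_from_repo(object_hash):
    """Stand-in for an expensive read (locating and decompressing an object)."""
    READS["count"] += 1
    return FAKE_OBJECT_STORE[object_hash]


@lru_cache(maxsize=1024)
def get_object(object_hash):
    """Cached accessor: repeated requests for the same hash skip the expensive read."""
    return read_object_from_repo(object_hash)


# Five lookups, but only two distinct objects are actually read.
for object_hash in ["a1", "b2", "a1", "a1", "b2"]:
    get_object(object_hash)

cache_stats = get_object.cache_info()
```

In this sketch `READS["count"]` ends at 2 while `cache_stats.hits` is 3: the remaining lookups never touch the store.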

Making it fast
Squashed tables and filter pushdown (simplified)
SELECT refs.repository_id
FROM refs
NATURAL JOIN commits
WHERE commits.commit_author_name = 'Javi Fontan'
  AND refs.ref_name = 'HEAD'
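
A rough Python sketch of the idea behind filter pushdown for a query like this one, under simplifying assumptions: the join is on repository and commit hash, and the data and function names are hypothetical. It does not reproduce how gitbase implements squashed tables. The point is that applying the `ref_name` filter while scanning refs means fewer commit objects have to be read.

```python
# Hypothetical sample data standing in for the refs and commits tables.
REFS = [
    {"repository_id": "repo-a", "ref_name": "HEAD", "commit_hash": "a2"},
    {"repository_id": "repo-a", "ref_name": "refs/heads/dev", "commit_hash": "a1"},
    {"repository_id": "repo-b", "ref_name": "HEAD", "commit_hash": "b1"},
]
COMMITS = {
    ("repo-a", "a1"): {"commit_author_name": "alice"},
    ("repo-a", "a2"): {"commit_author_name": "bob"},
    ("repo-b", "b1"): {"commit_author_name": "alice"},
}


def naive_plan(author, ref_name, stats):
    """Read every commit, join it with every ref, and filter only at the end."""
    result = []
    for (repo, commit_hash), commit in COMMITS.items():
        stats["commit_reads"] += 1
        for ref in REFS:
            joined = ref["repository_id"] == repo and ref["commit_hash"] == commit_hash
            if (joined and ref["ref_name"] == ref_name
                    and commit["commit_author_name"] == author):
                result.append(repo)
    return result


def pushed_down_plan(author, ref_name, stats):
    """Filter refs during the scan, then read only the commits those refs point to."""
    result = []
    for ref in REFS:
        if ref["ref_name"] != ref_name:
            continue  # filter applied before any commit is read
        stats["commit_reads"] += 1
        commit = COMMITS[(ref["repository_id"], ref["commit_hash"])]
        if commit["commit_author_name"] == author:
            result.append(ref["repository_id"])
    return result


naive_stats = {"commit_reads": 0}
pushed_stats = {"commit_reads": 0}
naive_result = naive_plan("alice", "HEAD", naive_stats)
pushed_result = pushed_down_plan("alice", "HEAD", pushed_stats)
```

Both plans return the same repositories, but the pushed-down plan reads one commit per matching ref instead of every commit in every repository.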

Making it fast
Indexes
● There is no silver-bullet index that accelerates every query
● Indexes can take up a lot of space
● They take a long time to generate: building one requires a full scan of the table

Making it fast
Indexes
● Defined by the user
● Use bitmaps so indexes can be composed
● Each entry stores the packfile and offset of the object
● Around 2 orders of magnitude faster in some queries
CREATE INDEX author ON commits
USING pilosa (commit_author_name)
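
Why bitmaps make indexes composable, as a minimal Python sketch. Plain integers serve as bitmaps here, which is a simplification and not how pilosa stores them. The rows, column values and helper names are hypothetical.

```python
# Hypothetical rows; the position in the list acts as the row id.
ROWS = [
    {"commit_author_name": "alice", "repository_id": "repo-a"},
    {"commit_author_name": "bob", "repository_id": "repo-a"},
    {"commit_author_name": "alice", "repository_id": "repo-b"},
    {"commit_author_name": "alice", "repository_id": "repo-a"},
]


def build_bitmap_index(rows, column):
    """Map each distinct value of `column` to a bitmap with one bit per matching row."""
    index = {}
    for row_id, row in enumerate(rows):
        index[row[column]] = index.get(row[column], 0) | (1 << row_id)
    return index


def row_ids(bitmap):
    """Decode a bitmap back into a sorted list of row ids."""
    ids = []
    row_id = 0
    while bitmap:
        if bitmap & 1:
            ids.append(row_id)
        bitmap >>= 1
        row_id += 1
    return ids


author_index = build_bitmap_index(ROWS, "commit_author_name")
repo_index = build_bitmap_index(ROWS, "repository_id")

# WHERE author = 'alice' AND repository = 'repo-a': bitwise AND of two bitmaps.
alice_in_repo_a = row_ids(author_index["alice"] & repo_index["repo-a"])

# WHERE author = 'alice' OR author = 'bob': bitwise OR within one index.
alice_or_bob = row_ids(author_index["alice"] | author_index["bob"])
```

Two independently built single-column indexes answer a combined condition with one bitwise operation, so there is no need for a dedicated index per combination of columns.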

Thanks!
● gitbase: https://github.com/src-d/gitbase
● go-git: https://github.com/src-d/go-git
● engine: https://sourced.tech/engine
We are hiring!
