#MLonCode
Egor Bulychev
@egor_bu
Machine Learning
 GitHub
Machine Learning on Source Code
MLonCode
Examples
Code naturalness
class ???:
    def connect(self, dbname, user, password, host, port):
        # ...
    def query(self, sql):
        # ...
    def close(self):
        # ...
class Database:
    def connect(self, dbname, user, password, host, port):
        # ...
    def query(self, sql):
        # ...
    def close(self):
        # ...
class Foo:
    def bar(self, qux):
        # ...
    def baz(self, waldo):
        # ...
    def do(self, really):
        # ...
Projects similar to MariaDB/server?
• twitter/mysql
• Tokutek/mysql-5.5
• facebook/mysql-5.6
• percona/percona-xtrabackup
• Tokutek/mariadb-5.5
• percona/percona-server
• atcurtis/mariadb
• webscalesql/webscalesql-5.6
• alibaba/AliSQL
• mysql/mysql-server
Details
Exploratory search
Similar code detection
• By style
• By structure
• By identifiers
provides us
Global graph
 Licenses
 Refactoring
Code autocompletion
Code style
class foobar:
    def connecttoserver(self):
        myserverhost = globalconfig.server.host
class FooBar:
    def connect_to_server(self):
        myServerHost = globalConfig.server.host
Many other applications
• Prediction of class, function, variable names
• Type inference
• Which comments do not make sense?
• Which comments are funny?
• Which APIs are bad or misused?
Your code as a crime scene
• Idea: mine the development history
• Book by Adam Tornhill
• codescene.io
Machine learning on source code
Data
Datasets
• GHTorrent - everything except Git repositories, 70GB
• Public Git Archive - only Git repositories, 3TB
• GitHub Data in Google BigQuery
• rovers & borges - DIY
PGA
• 270k siva files
• CSV index
Details in the paper.
Tools for MLonCode
Plumbing: clone, discover, classify, checkout, filter, parse, analyze
source{d} engine
• siva or bare git repository loader for Apache Spark
• Classification, parsing (thanks to bblfsh), filtering, checkout, scaling
• Apache license
GitHub
source{d} engine
>>> from sourced.engine import Engine
>>> engine = Engine(spark, "/path/to/siva/files", "siva")
>>> engine.repositories.references.head_ref \
...     .commits.tree_entries.blobs \
...     .classify_languages() \
...     .select("blob_id", "path", "lang") \
...     .show()
AST (figure): function definition nearest_neighbors with arguments and body. Arguments: identifiers self and origin, keyword argument k with literal 10. Body: comment "origin can be..." and a conditional (condition, true branch, false branch) built from the builtins isinstance, tuple and list and the identifier origin.
How to parse
• Regular expressions - Pygments, highlight.js
• Abstract syntax tree (AST) - ANTLR
• Compilation
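A minimal sketch of the AST option, using Python's standard `ast` module instead of ANTLR (that substitution is an assumption made for illustration). The `SOURCE` snippet and the `collect_identifiers` helper are hypothetical names, not part of any tool mentioned in the slides.

```python
import ast

# Hypothetical input snippet used only for this illustration.
SOURCE = '''
class Database:
    def connect(self, user, password, host, port):
        self._tcp_socket_connect(host, port)
'''


def collect_identifiers(source):
    """Return the identifier names found in the AST, without duplicates."""
    seen = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.ClassDef, ast.FunctionDef)):
            name = node.name      # class and function names
        elif isinstance(node, ast.arg):
            name = node.arg       # function parameters
        elif isinstance(node, ast.Name):
            name = node.id        # plain variable references
        elif isinstance(node, ast.Attribute):
            name = node.attr      # attribute and method names
        else:
            continue
        if name not in seen:
            seen.append(name)
    return seen


identifiers = collect_identifiers(SOURCE)
print(identifiers)
```

Unlike a regular expression, the tree tells a parameter from a method name, which is what later steps such as identifier extraction rely on.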
 Universal AST
• ± uniform structure
• ± standard node types (roles)
• XPath queries
• 4 traversal orders
dashboard.bblf.sh
>>> engine.repositories.references.head_ref \
...     .commits.tree_entries.blobs \
...     .classify_languages() \
...     .filter('lang = "Python"') \
...     .extract_uasts() \
...     .query_uast('//*[@roleIdentifier]') \
...     .extract_tokens("result", "tokens") \
...     .select("blob_id", "path", "tokens")
Powerful
modelforge
 Automatically fetch from the internet
 "Model Store"
 Flexible, modern binary format - ASDF
 Versioning
 Reproducibility
 Support various programming languages
GitHub
hercules
• Go
• Mining Git history
• Relies on go-git, Babelfish and TensorFlow
• Apache license
GitHub
MLonCode in hercules
• Structural hotness
• Comment sentiment
source{d} ml
• Python
• Targets large-scale MLonCode
• Runs on top of source{d} engine, modelforge and TensorFlow
• Apache license
GitHub
Solved practical problems
• Identifier splitting
• O(1) code similarity search
• Topic modeling
• Embeddings
Identifier embeddings
V1 ⇔ "foo"
V2 ⇔ "bar"
V3 ⇔ "integrate"
distance(V1, V2) < distance(V1, V3)
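$$\operatorname{distance}(V_i, V_j) = \arccos \frac{V_i \cdot V_j}{\lVert V_i \rVert \, \lVert V_j \rVert}$$
where $V_i \cdot V_j$ is the scalar product and $\lVert V_i \rVert$ is the norm.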
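A minimal sketch of this distance in plain Python. The three vectors are made-up toy values chosen so that "foo" lands closer to "bar" than to "integrate"; they are not real embeddings, and `angular_distance` is a hypothetical helper name.

```python
import math


def angular_distance(v_i, v_j):
    """arccos of the scalar product divided by the product of the norms."""
    dot = sum(a * b for a, b in zip(v_i, v_j))
    norm_i = math.sqrt(sum(a * a for a in v_i))
    norm_j = math.sqrt(sum(b * b for b in v_j))
    cosine = dot / (norm_i * norm_j)
    # Clamp to [-1, 1] so rounding error never pushes acos out of its domain.
    return math.acos(max(-1.0, min(1.0, cosine)))


# Made-up toy vectors, not real embeddings.
v_foo = [1.0, 0.0]
v_bar = [1.0, 1.0]
v_integrate = [0.0, 1.0]

print(angular_distance(v_foo, v_bar) < angular_distance(v_foo, v_integrate))
```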
How to estimate $V_i \cdot V_j$?
class Database:
    def connect(self, user, password, host, port):
        self._tcp_socket_connect(host, port)
        try:
            self._authenticate(user, password)
        except AuthenticationError as e:
            self.socket.close()
            raise e from None
Splitting and normalization
_tcp_socket_connect -> [tcp, socket, connect]
AuthenticationError -> [authentication, error]
authentication, authenticate -> authenticate
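The three rules above can be sketched as follows. This is a rule-based illustration only: it splits on underscores and case changes, so it cannot split an identifier such as connecttoserver that has no separators. The `NORMALIZE` table is a hypothetical stand-in for whatever stemming the real pipeline uses.

```python
import re

# Hypothetical normalization table; a real pipeline would more likely use a stemmer.
NORMALIZE = {"authentication": "authenticate"}

# Runs of capitals (acronyms), capitalized or lowercase words, or digits.
_PART = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+")


def split_identifier(identifier):
    """Split on underscores and case changes, then lowercase and normalize."""
    parts = [p.lower() for p in _PART.findall(identifier)]
    return [NORMALIZE.get(p, p) for p in parts]


print(split_identifier("_tcp_socket_connect"))
print(split_identifier("AuthenticationError"))
```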
Identifiers per scope of the class above (class, function, then nested blocks and statements):
database, connect, user, password, host, port, tcp, socket, authenticate, error, close
connect, user, password, host, port, tcp, socket, authenticate, error, close
connect, user, password, host, port
tcp, socket, connect, host, port
authenticate, user, password, error, socket, close
authenticate, user, password
authenticate, error, socket, close
authenticate, error
socket, close
Matrix rows and columns: connect, host, port, tcp, error, database, close, password, user, socket, authenticate
Incidence matrix
• $C_{ij}$ = number of times $i$ and $j$ were together
• Also known as the co-occurrence matrix
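A minimal sketch of building such a matrix by counting identifier pairs per scope. The scopes are the statement-level groups from the Database.connect example; counting only these, and counting each scope once, are assumptions made for illustration, since the slides also show larger enclosing scopes.

```python
from collections import Counter
from itertools import combinations

# Statement-level identifier groups from the Database.connect example.
SCOPES = [
    ["connect", "user", "password", "host", "port"],
    ["tcp", "socket", "connect", "host", "port"],
    ["authenticate", "user", "password"],
    ["authenticate", "error"],
    ["socket", "close"],
]


def cooccurrences(scopes):
    """Count, for each pair of identifiers, the scopes in which both appear."""
    counts = Counter()
    for scope in scopes:
        for a, b in combinations(sorted(set(scope)), 2):
            counts[(a, b)] += 1
            counts[(b, a)] += 1  # keep the matrix symmetric
    return counts


C = cooccurrences(SCOPES)
print(C[("host", "port")], C[("user", "password")], C[("close", "tcp")])
```

A `Counter` keyed by pairs acts as a sparse matrix: pairs that never occur together read as zero without being stored.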
Pointwise Mutual Information (PMI)
$$V_i \cdot V_j = \mathrm{PMI}_{ij} = \log \frac{C_{ij} \sum C}{\sum_{k=1}^{N} C_{ik} \, \sum_{k=1}^{N} C_{jk}}$$
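A minimal sketch of the PMI formula on a tiny co-occurrence matrix. The counts are made up for illustration, and returning `None` for a zero count is an assumption: the formula is undefined there and the slides do not say how zeros are treated.

```python
import math

# Made-up symmetric co-occurrence counts for three identifiers.
NAMES = ["host", "port", "close"]
C = [
    [0, 4, 1],
    [4, 0, 1],
    [1, 1, 0],
]


def pmi(C, i, j):
    """log(C_ij * sum(C) / (sum_k C_ik * sum_k C_jk)); None when C_ij is 0."""
    if C[i][j] == 0:
        return None  # log(0) is undefined; zero counts need separate handling
    total = sum(sum(row) for row in C)
    return math.log(C[i][j] * total / (sum(C[i]) * sum(C[j])))


print(pmi(C, 0, 1), pmi(C, 0, 2))
```

Pairs that co-occur more often than their row totals would suggest get a higher PMI, so "host" and "port" score above "host" and "close" here.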
Representation Learning on Explicit Matrix
Stochastic Gradient Descent
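A minimal sketch of the idea: learn one vector per identifier so that the scalar products approximate the entries of an explicit target matrix, updating one randomly chosen cell per step. This is plain stochastic gradient descent on a squared error, not the Swivel algorithm; the target values, dimensions and learning rate are made up for illustration.

```python
import random

# Made-up symmetric target matrix standing in for PMI values.
TARGET = [
    [0.0, 0.65, 0.18],
    [0.65, 0.0, 0.18],
    [0.18, 0.18, 0.0],
]
# Off-diagonal cells only; the diagonal is left unconstrained.
PAIRS = [(i, j) for i in range(3) for j in range(3) if i != j]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def loss(vectors):
    """Sum of squared errors between V_i . V_j and the target entries."""
    return sum((dot(vectors[i], vectors[j]) - TARGET[i][j]) ** 2 for i, j in PAIRS)


def train(dim=2, steps=5000, lr=0.05, seed=0):
    rng = random.Random(seed)
    vectors = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(3)]
    initial = loss(vectors)
    for _ in range(steps):
        i, j = rng.choice(PAIRS)  # one random matrix cell per step
        err = dot(vectors[i], vectors[j]) - TARGET[i][j]
        # Gradients of 0.5 * err**2, computed before either vector changes.
        grad_i = [err * x for x in vectors[j]]
        grad_j = [err * x for x in vectors[i]]
        vectors[i] = [x - lr * g for x, g in zip(vectors[i], grad_i)]
        vectors[j] = [x - lr * g for x, g in zip(vectors[j], grad_j)]
    return vectors, initial, loss(vectors)


vectors, initial_loss, final_loss = train()
print(initial_loss, final_loss)
```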
Swivel
 Multi-GPU
 Multi-node
 Quality tricks
 Swivel: Improving Embeddings by Noticing What's Missing by Shazeer et al.
 TensorFlow implementation
Nearest to “foo”
• afoo
• qux
• myfoo
• baz
• mfoo
• wibble
• dofoo
• quux
• dfoo
• testing
• ifoo
Analogies
“bug” - “test” + “expect” = “suppress”
“database” - “query” + “tune” = “settings”
“send” - “receive” + “pop” = “push”
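A minimal sketch of how such an analogy is evaluated: add and subtract the vectors, then pick the nearest remaining word by cosine similarity. The 3-dimensional vectors are made up so that the "send" analogy works out; real embeddings are learned, and `analogy` is a hypothetical helper name.

```python
import math

# Made-up 3-d vectors (axes: network, stack, outward); real embeddings are learned.
EMBEDDINGS = {
    "send": (1.0, 0.0, 1.0),
    "receive": (1.0, 0.0, -1.0),
    "pop": (0.0, 1.0, -1.0),
    "push": (0.0, 1.0, 1.0),
    "close": (1.0, 1.0, 0.0),
}


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def analogy(a, b, c):
    """Return the word nearest to a - b + c, excluding the three query words."""
    target = tuple(
        x - y + z for x, y, z in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])
    )
    candidates = (w for w in EMBEDDINGS if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(EMBEDDINGS[w], target))


print(analogy("send", "receive", "pop"))
```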
Typos
• recieve = receive
• grey = gray
• calback = callbak = callback
Summary
Summary
 MLonCode is fun
 There is data
 There are tools
 Community is forming
Idea? Doubt? Write to us!
Thank you
 egor@sourced.tech
 egorbu
 egor_bu
 sourcedtech
 blog.sourced.tech
 Awesome #MLonCode
bit.ly/2MAU7qG
