Machine Learning on Graphs

Joseph Gonzalez
Co-Founder, GraphLab Inc.
joseph@graphlab.com
Postdoc, UC Berkeley AMPLab
jegonzal@eecs.berkeley.edu
Big Data
Graphs
More Signal
More Noise
Social Media

Science

Advertising

Web

Graphs encode relationships between:

People

Products
Ideas
Facts
Interests

Big: billions of vertices and edges & rich metadata
Facebook (10/2012): 1B users, 144B friendships
Twitter (2011): 15B follower edges
Graphs are Essential to Data Mining and Machine Learning



Identify influential people and information
Find communities
Understand people’s shared interests
Model complex data dependencies
Predicting User Behavior
[Figure: graph of users and their posts; a few vertices are labeled Liberal or Conservative, the rest are unknown (?)]
Conditional Random Field
Belief Propagation
Finding Communities
Count triangles passing through each vertex:
[Figure: example graph with vertices 1, 2, 3 and 4]
Measures “cohesiveness” of local community

Fewer Triangles
Weaker Community

More Triangles
Stronger Community
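The per-vertex triangle count described above can be sketched in a few lines of plain Python. This is a minimal single-machine illustration, not the GraphLab implementation; the function name `triangles_per_vertex` and the example edge list are assumptions made for this sketch.

```python
from itertools import combinations


def triangles_per_vertex(edges):
    """Count the triangles passing through each vertex of an undirected graph."""
    # Build an adjacency set for every vertex, ignoring self-loops.
    nbrs = {}
    for u, v in edges:
        if u == v:
            continue
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    # A triangle through v is a pair of v's neighbors that are themselves adjacent.
    counts = {}
    for v, adj in nbrs.items():
        counts[v] = sum(1 for a, b in combinations(adj, 2) if b in nbrs[a])
    return counts


# Hypothetical example: the triangle 1-2-3 plus a pendant edge 3-4.
example_edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
triangle_counts = triangles_per_vertex(example_edges)
```

Vertices 1, 2 and 3 sit on one triangle each, while vertex 4 sits on none, which matches the "fewer triangles, weaker community" reading.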
Recommending Products
Users


Ratings


Items
Recommending Products
Low-Rank Matrix Factorization:
Netflix ratings (Users x Movies) ≈ User Factors (U) x Movie Factors (M)
[Figure: graph with factor vertices f(1), f(2), f(3), f(4), f(5) and rating edges r13, r14, r24, r25]
Iterate:
$f[i] = \arg\min_{w \in \mathbb{R}^d} \sum_{j \in \mathrm{Nbrs}(i)} \left( r_{ij} - w^T f[j] \right)^2 + \|w\|_2^2$
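The per-vertex step of this iteration is a small regularized least-squares problem with a closed-form solution. The sketch below solves it in plain Python under stated assumptions: the names `als_update` and `solve_linear` are hypothetical, and the regularization weight `reg` is an added parameter (the slide shows the penalty term without a visible weight, which corresponds to `reg=1.0`). A real implementation would normally use a linear-algebra library instead of the hand-written solver.

```python
def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [list(map(float, A[i])) + [float(b[i])] for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(M[row][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for row in range(n):
            if row != col:
                factor = M[row][col] / M[col][col]
                M[row] = [x - factor * y for x, y in zip(M[row], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def als_update(ratings, neighbor_factors, reg=1.0):
    """Return argmin_w sum_j (r_ij - w^T f[j])^2 + reg * ||w||_2^2."""
    d = len(neighbor_factors[0])
    # Normal equations: (sum_j f[j] f[j]^T + reg * I) w = sum_j r_ij f[j]
    A = [[reg if a == c else 0.0 for c in range(d)] for a in range(d)]
    b = [0.0] * d
    for r, f in zip(ratings, neighbor_factors):
        for a in range(d):
            b[a] += r * f[a]
            for c in range(d):
                A[a][c] += f[a] * f[c]
    return solve_linear(A, b)


# Hypothetical data: vertex i has two rated neighbors whose factors are unit vectors.
neighbor_factors = [[1.0, 0.0], [0.0, 1.0]]
new_factor = als_update([4.0, 2.0], neighbor_factors, reg=1.0)
```

Alternating this update between user vertices and movie vertices gives Alternating Least Squares; each update only reads the factors of adjacent vertices and the ratings on the connecting edges.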
Identifying Leaders
$R[i] = 0.15 + \sum_{j \in \mathrm{Nbrs}(i)} w_{ji} R[j]$
R[i]: rank of user i
Sum term: weighted sum of neighbors’ ranks

Everyone starts with equal ranks
Update ranks in parallel
Iterate until convergence
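The three steps above (equal starting ranks, parallel update, iterate until convergence) can be sketched as a synchronous loop. This is an illustrative single-machine version; the function name, the tolerance, and the choice of edge weights `w_ji = 0.85 / out_degree(j)` in the example are assumptions, not values taken from the slides.

```python
def pagerank_sync(in_edges, tol=1e-10, max_iters=1000):
    """in_edges[i] lists (j, w_ji) pairs for the neighbors j that point at vertex i."""
    # Everyone starts with equal ranks.
    ranks = {i: 1.0 for i in in_edges}
    for _ in range(max_iters):
        # Every new rank reads only the previous iteration's ranks,
        # so all vertices could be updated in parallel.
        new_ranks = {
            i: 0.15 + sum(w_ji * ranks[j] for j, w_ji in nbrs)
            for i, nbrs in in_edges.items()
        }
        change = max(abs(new_ranks[i] - ranks[i]) for i in ranks)
        ranks = new_ranks
        # Iterate until convergence.
        if change < tol:
            break
    return ranks


# Hypothetical graph with edges 1->2, 1->3, 2->3, 3->1 and w_ji = 0.85 / out_degree(j).
example_in_edges = {
    1: [(3, 0.85)],
    2: [(1, 0.425)],
    3: [(1, 0.425), (2, 0.85)],
}
example_ranks = pagerank_sync(example_in_edges)
```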
  
Graph-Parallel Algorithms

Model / Alg. State
Computation depends only on the neighbors
Many More Graph Algorithms
•  Collaborative Filtering
–  Alternating Least Squares
–  Stochastic Gradient Descent
–  Tensor Factorization
–  SVD

•  Structured Prediction
–  Loopy Belief Propagation
–  Max-Product Linear Programs
–  Gibbs Sampling

•  Semi-supervised ML
–  Graph SSL
–  CoEM

•  Graph Analytics
–  PageRank
–  Shortest Path
–  Triangle-Counting
–  Graph Coloring
–  K-core Decomposition
–  Personalized PageRank

•  Classification
–  Neural Networks
–  Lasso
…
How should we program graph-parallel algorithms?
Structure of Computation
Data-Parallel: Table (Row, Row, Row, Row) → Result
Graph-Parallel: Dependency Graph
How should we program graph-parallel algorithms?

“Think like a Vertex.” - Pregel [SIGMOD’10]
The Graph-Parallel Abstraction
A user-defined Vertex-Program runs on each vertex
Graph constrains interaction along edges
Using messages (e.g. Pregel [PODC’09, SIGMOD’10])
Through shared state (e.g., GraphLab [UAI’10, VLDB’12])










Parallelism: run multiple vertex programs simultaneously

The GraphLab Vertex Program
Vertex Programs directly access adjacent vertices and edges

GraphLab_PageRank(i)
  // Compute sum over neighbors
  total = 0
  foreach (j in neighbors(i)):
    total = total + R[j] * wji

  // Update the PageRank
  R[i] = 0.15 + total

  // Trigger neighbors to run again
  if R[i] not converged then
    signal nbrsOf(i) to be recomputed

[Figure: vertex 1 sums R[4] * w41 with the contributions of its other neighbors, vertices 2 and 3]
Signaled vertices are recomputed eventually.
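A runnable sketch of this vertex program is given below, using a simple FIFO queue as the scheduler. It is an illustration under assumptions, not GraphLab's engine: the function name, the convergence tolerance `tol`, and the decision to signal the out-neighbors of `i` (the vertices whose rank depends on `R[i]`) are choices made for this sketch.

```python
from collections import deque


def dynamic_pagerank(in_edges, out_nbrs, tol=1e-6):
    """Run the vertex program only on signaled vertices until none remain."""
    ranks = {i: 1.0 for i in in_edges}
    update_counts = {i: 0 for i in in_edges}
    queue = deque(in_edges)   # initially every vertex is signaled
    queued = set(in_edges)
    while queue:
        i = queue.popleft()
        queued.discard(i)
        # Compute sum over neighbors
        total = sum(ranks[j] * w_ji for j, w_ji in in_edges[i])
        old_rank = ranks[i]
        # Update the PageRank
        ranks[i] = 0.15 + total
        update_counts[i] += 1
        # Trigger neighbors to run again if this rank has not converged
        if abs(ranks[i] - old_rank) > tol:
            for k in out_nbrs[i]:
                if k not in queued:
                    queued.add(k)
                    queue.append(k)
    return ranks, update_counts


# Hypothetical graph with edges 1->2, 1->3, 2->3, 3->1 and w_ji = 0.85 / out_degree(j).
dyn_in_edges = {1: [(3, 0.85)], 2: [(1, 0.425)], 3: [(1, 0.425), (2, 0.85)]}
dyn_out_nbrs = {1: [2, 3], 2: [3], 3: [1]}
dyn_ranks, dyn_counts = dynamic_pagerank(dyn_in_edges, dyn_out_nbrs)
```

Because a vertex is recomputed only when one of its inputs changed noticeably, vertices in quiet parts of the graph are updated far less often than vertices in active parts.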
  
Convergence of Dynamic PageRank
[Figure: Num-Vertices (log scale, 1 to 100000000) plotted against Number of Updates (0 to 70)]
51% updated only once!
Adaptive Belief Propagation
Challenge = Boundaries
[Figure: Graphical Model for a Noisy “Sunset” Image, and the Cumulative Vertex Updates made by Splash, ranging from Few Updates to Many Updates]
Algorithm identifies and focuses on hidden sequential structure
Graph-parallel Abstractions
Better for Machine Learning
Messaging: Synchronous
Shared State: Dynamic Asynchronous
Natural Graphs

Graphs derived from natural phenomena
Properties of Natural Graphs

Regular Mesh

Natural Graph

Power-Law Degree Distribution
Power-Law Degree Distribution
“Star Like” Motif
[Figure: President Obama as a hub vertex surrounded by Followers]
Challenges of High-Degree Vertices
Sequentially process edges
Touches a large fraction of graph
Provably Difficult to Partition
[Figure: a high-degree vertex whose edges span CPU 1 and CPU 2]
Random Partitioning
•  GraphLab resorts to random (hashed) partitioning on natural graphs

[Paper excerpt shown on the slide] ... While fast and easy to implement, ... placement cuts most of the edges:
Theorem 5.1. If vertices are randomly assigned to p machines then the expected fraction of edges cut is
$\mathbb{E}\left[ \frac{|\text{Edges Cut}|}{|E|} \right] = 1 - \frac{1}{p}$
For example, if just two machines are used, half of the edges will be cut ...

10 Machines → 90% of edges cut
100 Machines → 99% of edges cut!
[Figure: graph split across Machine 1 and Machine 2]
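The 1 - 1/p figure is easy to check empirically. The sketch below assigns every vertex to a uniformly random machine and measures the fraction of edges whose endpoints land on different machines. The synthetic graph, seeds, and function name are assumptions made for illustration; the check applies to edges with two distinct endpoints.

```python
import random


def edge_cut_fraction(edges, num_machines, seed=0):
    """Assign each vertex to a random machine and return the fraction of edges cut."""
    rng = random.Random(seed)
    machine_of = {}
    for u, v in edges:
        for x in (u, v):
            if x not in machine_of:
                machine_of[x] = rng.randrange(num_machines)
    # An edge is cut when its two endpoints live on different machines.
    cut = sum(1 for u, v in edges if machine_of[u] != machine_of[v])
    return cut / len(edges)


# Hypothetical synthetic graph: 20,000 random edges over 2,000 vertices, no self-loops.
graph_rng = random.Random(1)
random_edges = []
while len(random_edges) < 20000:
    u, v = graph_rng.randrange(2000), graph_rng.randrange(2000)
    if u != v:
        random_edges.append((u, v))

cut_fractions = {p: edge_cut_fraction(random_edges, p) for p in (2, 10, 100)}
```

The measured fractions should sit close to 0.5, 0.9 and 0.99, matching the "10 Machines, 90% of edges cut" and "100 Machines, 99% of edges cut" lines.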
  
Program For This
Run on This (Machine 1, Machine 2)

•  Split High-Degree vertices
•  New Abstraction → Equivalence on Split Vertices
A Common Pattern in Vertex Programs

GraphLab_PageRank(i)
  // Gather Information About Neighborhood
  // Compute sum over neighbors
  total = 0
  foreach (j in neighbors(i)):
    total = total + R[j] * wji

  // Update Vertex
  // Update the PageRank
  R[i] = 0.15 + total

  // Signal Neighbors & Modify Edge Data
  // Trigger neighbors to run again
  priority = |R[i] – oldR[i]|
  if R[i] not converged then
    signal neighbors(i) with priority
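GAS Decomposition
Gather, Apply, Scatter
[Figure: vertex Y spans Machine 1, Machine 2, Machine 3 and Machine 4 (one Master and three Mirrors); the partial sums Σ1 + Σ2 + Σ3 + Σ4 are combined into Σ, and the updated value Y’ is copied to every machine]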
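The decomposition can be illustrated with PageRank: each machine gathers a partial sum over its local edges, the partial sums are added, the master applies the update, and the scatter phase reports how much the value changed. This is a single-process sketch under assumptions; the function name, the list-of-lists layout for per-machine edges, and the example numbers are hypothetical and do not reflect PowerGraph's actual API.

```python
def gas_pagerank_vertex(i, ranks, in_edges_by_machine):
    """One Gather-Apply-Scatter step for vertex i whose in-edges span several machines."""
    # Gather: every machine sums the contributions of its local edges (Σ1, Σ2, ...).
    partial_sums = [
        sum(ranks[j] * w_ji for j, w_ji in local_edges)
        for local_edges in in_edges_by_machine
    ]
    # The partial sums are combined with a commutative, associative "+".
    total = sum(partial_sums)
    # Apply: the master computes the new vertex value (Y -> Y').
    new_rank = 0.15 + total
    # Scatter: the size of the change is the priority used to signal neighbors.
    priority = abs(new_rank - ranks[i])
    return new_rank, partial_sums, priority


# Hypothetical example: vertex 1 has one in-edge on machine A and two on machine B.
gas_ranks = {1: 1.0, 2: 1.0, 3: 2.0, 4: 1.0}
gas_edges = [[(2, 0.5)], [(3, 0.25), (4, 0.5)]]
gas_new_rank, gas_partials, gas_priority = gas_pagerank_vertex(1, gas_ranks, gas_edges)
```

Only one partial sum per machine has to cross the network during gather, and one updated value per machine during apply, regardless of how many edges the vertex has.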
  
Minimizing Communication in PowerGraph

Communication is linear in the number of machines each vertex spans

A vertex-cut minimizes the number of machines each vertex spans
Percolation theory suggests that power law graphs have good vertex cuts. [Albert et al. 2000]
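Since communication is linear in the number of machines a vertex spans, a useful quantity to measure is the average span (replication factor) under a given edge placement. The sketch below uses uniformly random edge placement, which is a simplifying assumption for illustration and not necessarily the placement strategy PowerGraph uses; the function name and the star-graph example are also hypothetical.

```python
import random


def replication_factor(edges, num_machines, seed=0):
    """Place each edge on a random machine; return the average number of machines a vertex spans."""
    rng = random.Random(seed)
    spans = {}
    for u, v in edges:
        machine = rng.randrange(num_machines)
        # In a vertex-cut, an edge lives on exactly one machine and both
        # of its endpoints get a replica there.
        spans.setdefault(u, set()).add(machine)
        spans.setdefault(v, set()).add(machine)
    return sum(len(s) for s in spans.values()) / len(spans)


# Hypothetical "star like" graph: hub vertex 0 with 1,000 followers, placed on 4 machines.
star_edges = [(0, follower) for follower in range(1, 1001)]
star_replication = replication_factor(star_edges, num_machines=4)
```

In the star example every follower spans exactly one machine and only the hub is replicated, so the average span stays close to 1 even though an edge-cut of the same graph would cut most of its edges.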
Machine Learning and Data-Mining Toolkits
Graph Analytics, Graphical Models, Computer Vision, Clustering, Topic Modeling, Collaborative Filtering
GraphLab2 System
MPI/TCP-IP, PThreads, HDFS, EC2 HPC Nodes
http://graphlab.org
Apache 2 License
PageRank on Twitter Follower Graph
Natural Graph with 40M Users, 1.4 Billion Links
[Figure: bar chart of Runtime Per Iteration (axis 0 to 200) for Hadoop, GraphLab, Twister, Piccolo and PowerGraph]
Order of magnitude by exploiting properties of Natural Graphs
Hadoop results from [Kang et al. '11]
Twister (in-memory MapReduce) [Ekanayake et al. ‘10]
GraphLab2 is Scalable
Yahoo Altavista Web Graph (2002):
One of the largest publicly available web graphs
1.4 Billion Webpages, 6.6 Billion Links

7 Seconds per Iter.
1B links processed per second
30 lines of user code
64 HPC Nodes
1024 Cores (2048 HT)
Topic Modeling
English language Wikipedia
–  2.6M Documents, 8.3M Words, 500M Tokens
–  Computationally intensive algorithm
[Figure: bar chart of Million Tokens Per Second (axis 0 to 160) for Smola et al. and PowerGraph]
Smola et al.: 100 Yahoo! Machines, specifically engineered for this task
PowerGraph: 64 cc2.8xlarge EC2 Nodes, 200 lines of code & 4 human hours
Triangle Counting on Twitter
40M Users, 1.4 Billion Links
Counted: 34.8 Billion Triangles

Hadoop [WWW’11]: 1536 Machines, 423 Minutes
64 Machines, 15 Seconds
1000 x Faster

S. Suri and S. Vassilvitskii, “Counting triangles and the curse of the last reducer,” WWW’11

By exploiting common patterns in graph data and computation:

New ways to represent real-world graphs
New ways to execute graph algorithms
[Figure: graph split across Machine 1 and Machine 2]

Orders of magnitude improvements over existing systems

Possibility
Scalability
Usability
Exciting Time to Work in ML
J Unique opportunities to change the world!!
With ML, I will
cure cancer!!!

With ML I will 
find true love.

Why won’t 
ML read
my mind???

L Building scalable learning system requires experts …
But…
Even basics of scalable ML can be challenging
ML key to any new service we want to build
6 months from prototype to production
State-of-art ML algorithms trapped in research papers

Goal of GraphLab 3:
Make large-scale machine learning accessible to all! :-)
Adding a Python Layer
Python API
Graph Analytics, Graphical Models, Computer Vision, Clustering, Topic Modeling, Collaborative Filtering
GraphLab2 System
MPI/TCP-IP, PThreads, EC2 HPC Nodes, HDFS
Learning ML with GraphLab Notebook
https://beta.graphlab.com/examples
Prototype to Production with Python GraphLab:
Easily install and prototype locally
Deploy to the cluster in one step
Learn: GraphLab Notebook
Prototype: pip install graphlab → local prototyping
Production: Same code scales to EC2 cluster
GraphLab Toolkits
Highly scalable, state-of-the-art machine learning straight from Python
Graph Analytics, Graphical Models, Computer Vision, Clustering, Topic Modeling, Collaborative Filtering
Machine Learning on Graphs
partners@graphlab.com

NIPS Workshop on Big Learning: biglearn.org
Lake Tahoe, December 9th

Joseph Gonzalez
Co-Founder, GraphLab Inc.
joseph@graphlab.com

