Streaming and Online Algorithms for GraphX 
Graph Analytics Team 
Xia (Ivy) Zhu 
Intel Confidential — Do Not Forward
Why Streaming Processing on Graph? 
Every day:
• New stores join
• New users join
• New users browse, click, and buy items
• Existing users browse, click, and buy items
• New ads are added
• …
How to:
• Recommend products based on users’ interests
• Recommend products based on users’ shopping habits
• Recommend products based on users’ purchasing capability
• Place the ads most likely to be clicked by users
• …
A huge number of relationships are created each day; using them wisely is important.
Alibaba Is Not Alone, Graphs are Everywhere 
• 100B neurons, 100T relationships
• 1.23B users, 160B friendships
• 1 trillion pages, hundreds of trillions of links
• Millions of products and users
• 50M users, 1B hours watched per month
• Large biological cell networks
… And Graphs Keep Evolving 
Streaming Processing Pipeline 
[Diagram: Data Stream -> ETL -> Graph Creation -> ML, running on top of a Distributed Messaging System]
• We use Kafka for distributed messaging
• GraphX is the graph processing engine
What is GraphX
• Graph processing engine on Spark
• Supports Pregel-style vertex programming
• Unifies data-parallel and graph-parallel processing
Picture Source: GraphX team
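
To make "Pregel-style vertex programming" concrete, here is a minimal single-machine sketch in plain Python rather than GraphX's Scala API. The function name pregel and its parameters are illustrative only, and the activation rule (an edge is visited only if its source received a message in the previous superstep) is a simplifying assumption, not GraphX's exact behaviour. The example runs single-source shortest paths from vertex 1.

```python
def pregel(vertices, edges, initial_msg, vprog, send_msg, merge_msg, max_iter=50):
    """Tiny Pregel-style loop.

    vertices: {vertex_id: value}; edges: list of (src, dst, attr).
    """
    # Superstep 0: every vertex processes the initial message.
    values = {v: vprog(v, val, initial_msg) for v, val in vertices.items()}
    active = set(values)
    for _ in range(max_iter):
        inbox = {}
        for src, dst, attr in edges:
            if src not in active:  # simplification: only updated sources send
                continue
            for target, msg in send_msg(src, values[src], dst, values[dst], attr):
                inbox[target] = merge_msg(inbox[target], msg) if target in inbox else msg
        if not inbox:  # no messages in flight: converged
            break
        for v, msg in inbox.items():
            values[v] = vprog(v, values[v], msg)
        active = set(inbox)
    return values


INF = float("inf")
edges = [(1, 2, 1.0), (2, 3, 2.0), (1, 3, 5.0), (3, 4, 1.0)]
start = {1: 0.0, 2: INF, 3: INF, 4: INF}

distances = pregel(
    start,
    edges,
    initial_msg=INF,
    # Vertex program: keep the shortest distance seen so far.
    vprog=lambda vid, dist, msg: min(dist, msg),
    # Send a message only when it would shorten the destination's distance.
    send_msg=lambda s, sd, d, dd, w: [(d, sd + w)] if sd + w < dd else [],
    # Combine concurrent messages to the same vertex.
    merge_msg=min,
)
print(distances)  # {1: 0.0, 2: 1.0, 3: 3.0, 4: 4.0}
```

The three callbacks play the same roles as vertexProgram, sendMessage and messageCombiner in the PageRank pseudocode later in the deck.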
Why GraphX
• GraphLab performs well, but is standalone
• Giraph is open source and scales well, but its performance is not good
• GraphX supports both table and graph operations
• On the same platform, Spark Streaming provides a basic streaming framework
[Diagram: the Spark stack. Spark (RDDs, transformations, and actions) underlies Spark Streaming (real-time; DStreams: streams of RDDs), Spark SQL (SchemaRDDs, RDD-based), MLlib (machine learning; RDD-based matrices), and GraphX (graph processing/machine learning; RDD-based graphs)]
Picture Source: Databricks
Naïve Streaming Does Not Scale
• Current GraphX is designed for static graphs
• Current Spark Streaming provides limited types of state DStreams
• Naïve approach:
  • Merge table data before it enters the graph processing pipeline
  • Re-generate the whole graph and re-run ML at each window
  • Requires minimal changes to GraphX and Spark Streaming
  • Straightforward, but does not scale well
[Chart: Throughput vs Latency of Naive Graph Streaming; latency in seconds (axis 0 to 180) at sample points 1 to 9]
Our solution 
• Static algorithms -> Online algorithms 
• Merge information at graph phase 
• Efficient graph store for evolving graph 
• Better partitioning algorithms to reduce replicas 
• Static index -> On the fly indexing method (ongoing)
Static vs Online Algorithms 
• Static algorithms
  • Suited to re-computing the whole graph at each time instance and re-running ML
  • Become increasingly infeasible in the Big Data era, given the size and growth rate of graphs
• Online algorithms
  • Incremental machine learning is triggered by changes in the graph
  • We designed online algorithms based on delta updates
  • PageRank as an example
  • The same idea is applicable to other machine learning algorithms
Static vs Online Page Rank 
Static_PageRank
// Initial vertex value: (PageRank, delta)
(0.0, 0.0)
// First message
initialMessage:
  msg = alpha / (1.0 - alpha)
// Broadcast to neighbors
sendMessage:
  if (edge.srcAttr._2 > tol)
    Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
// Aggregate messages for each vertex
messageCombiner(a, b):
  sum = a + b
// Update vertex
vertexProgram(sum):
  updates = (1.0 - alpha) * sum
  (oldPR + updates, updates)

Online_PageRank
// Initial vertex value: (PageRank, delta)
base graph:
  (0.0, 0.0)
incremental graph:
  old vertices:
    (lastWindowPR, lastWindowDelta)
  new vertices:
    (alpha, alpha)
// First message
initialMessage:
  base graph:
    msg = alpha / (1.0 - alpha)
  incremental graph:
    none
// Broadcast to neighbors
sendMessage:
  oldSrc -> newDst:
    Iterator((edge.dstId, (edge.srcAttr._1 - alpha) * edge.attr))
  newSrc -> newDst, or not converged:
    Iterator((edge.dstId, edge.srcAttr._2 * edge.attr))
// Aggregate messages for each vertex
messageCombiner(a, b):
  sum = a + b
// Update vertex
vertexProgram(sum):
  updates = (1.0 - alpha) * sum
  (oldPR + updates, updates)
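
The delta-update idea can be tried out in a few lines of plain Python. The sketch below is not the authors' GraphX implementation: the function names are hypothetical, and instead of the slide's per-edge message rules it uses a residual warm start (swapped in here because it is easy to check against a from-scratch run). It keeps the previous window's ranks, computes how far they are from satisfying the PageRank equation on the updated graph by looking only at sources whose out-edges changed, and then propagates those residuals with the same "add delta, forward (1 - alpha) * delta * weight" loop as the static version.

```python
from collections import defaultdict

ALPHA = 0.15  # reset probability ("alpha" on the slide)
TOL = 1e-10   # deltas at or below this are applied but not forwarded


def out_weights(edges):
    """Map src -> {dst: 1 / out_degree(src)} (the edge.attr of the slide)."""
    adj = defaultdict(set)
    for s, d in edges:
        adj[s].add(d)
    return {s: {d: 1.0 / len(ds) for d in ds} for s, ds in adj.items()}


def propagate(weights, pr, delta):
    """Apply pending deltas, then forward them to out-neighbours until quiet."""
    iterations = 0
    while delta:
        iterations += 1
        msgs = defaultdict(float)
        for v, d in delta.items():
            pr[v] = pr.get(v, 0.0) + d
            if abs(d) > TOL:
                for w, wt in weights.get(v, {}).items():
                    msgs[w] += (1.0 - ALPHA) * d * wt
        delta = dict(msgs)
    return pr, iterations


def static_pagerank(edges):
    """From scratch: every vertex starts at 0 with a pending delta of alpha."""
    vertices = {v for e in edges for v in e}
    return propagate(out_weights(edges), {}, {v: ALPHA for v in vertices})


def incremental_pagerank(old_edges, new_edges, old_pr):
    """Warm start from old_pr; only sources with new out-edges seed residuals."""
    w_old = out_weights(old_edges)
    w_new = out_weights(old_edges + new_edges)
    residual = defaultdict(float)
    for v in {v for e in new_edges for v in e}:
        if v not in old_pr:
            residual[v] += ALPHA  # new vertex: same initial delta as static
    for s in {s for s, _ in new_edges}:
        rank = old_pr.get(s, 0.0)
        for d, wt in w_new[s].items():        # contribution under new weights
            residual[d] += (1.0 - ALPHA) * rank * wt
        for d, wt in w_old.get(s, {}).items():  # minus what was already sent
            residual[d] -= (1.0 - ALPHA) * rank * wt
    return propagate(w_new, dict(old_pr), dict(residual))


old_edges = [(1, 2), (2, 3), (3, 1), (3, 4)]
new_edges = [(4, 1), (5, 3), (2, 5)]

pr_old, _ = static_pagerank(old_edges)
pr_inc, _ = incremental_pagerank(old_edges, new_edges, pr_old)
pr_full, _ = static_pagerank(old_edges + new_edges)
max_diff = max(abs(pr_inc[v] - pr_full[v]) for v in pr_full)
```

On this toy graph the incremental result agrees with a full recomputation to within the tolerance, while the first superstep only touches vertices adjacent to the changed edges.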
GraphX Data Loading and Data Structure 
[Diagram: GraphX data loading. Labels: Edge lists; EdgeRDD and EdgePartition (SrcId, DstId, Data, Index); Re-HashPartition; RoutingTableMessage (HasSrcId, HasDstId); RoutingTablePartition; VertexRDD; ShippableVertexPartition and VertexPartition (Vid, Data, Mask); ReplicatedVertexView; GraphImpl]
GraphX Data Loading and Data Structure 
[Diagram: the same data-loading diagram as on the previous slide, annotated with "Static Index"]
• A partitioning algorithm can help reduce the replication factor
Partitioning Algorithm 
• Torus-based partitioning
  • Divide the overall partitions into an A x B matrix
  • A vertex’s master partition is decided by a hash function
  • The replica set is in the same column as the master partition (full column), and in the same row as the master partition (… ⁄ … + 1 elements starting from the master partition)
  • The intersection between the source replica set and the target replica set decides where an edge is placed
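
The placement rule above can be sketched in plain Python. Several details are assumptions made only for illustration: the hash is a simple modulo, the tie-break among candidate partitions is arbitrary, and the row span is taken to be cols // 2 + 1. The span expression on the slide did not survive extraction, so that value is a guess, chosen because it is enough to guarantee that any two replica sets overlap.

```python
def master_partition(vertex_id, rows, cols):
    """Stand-in hash (assumption): map a vertex id to one of rows * cols partitions."""
    return vertex_id % (rows * cols)


def replica_set(master, rows, cols, row_span):
    """Full column of the master plus row_span cells of its row, wrapping around."""
    r, c = divmod(master, cols)
    column = {i * cols + c for i in range(rows)}
    row = {r * cols + (c + k) % cols for k in range(row_span)}
    return column | row


def place_edge(src, dst, rows, cols, row_span):
    """Pick a partition from the intersection of the two replica sets."""
    src_set = replica_set(master_partition(src, rows, cols), rows, cols, row_span)
    dst_set = replica_set(master_partition(dst, rows, cols), rows, cols, row_span)
    candidates = sorted(src_set & dst_set)
    # Tie-break is an assumption; the slide only says the intersection decides.
    return candidates[(src + dst) % len(candidates)]


ROWS, COLS = 3, 4
ROW_SPAN = COLS // 2 + 1  # assumed value, see the note above

partition = place_edge(7, 42, ROWS, COLS, ROW_SPAN)
```

With this layout each vertex is replicated on at most rows + row_span - 1 partitions instead of potentially all rows * cols of them, which is where the reduction in replication factor comes from.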
Index Structure for Graph Streaming 
• GraphX uses a CSR (Compressed Sparse Row)-based index
  • Originated from sparse matrix compression
  • Good for finding all out-edges of a source vertex
  • No support for finding all in-edges of a target vertex; that needs a full table scan
• At a minimum, need to add CSC (Compressed Sparse Column) for indexing in-edges
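Raw Edge Lists (per-edge Data column omitted)
Src  Dst
3    2
3    5
3    9
5    2
5    3
7    3
8    5
8    6
10   6

CSR (edges sorted by Src; Idx is the offset of each source's first edge)
Idx  Unique Src
0    3
3    5
5    7
6    8
8    10
Dst array: 2 5 9 2 3 3 5 6 6

CSC (edges sorted by Dst; Idx is the offset of each destination's first edge)
Unique Dst  Idx
2           0
3           2
5           4
6           6
9           8
Src array: 3 5 5 7 3 8 8 10 3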
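
The two indexes can be rebuilt from the raw edge list with a short Python sketch. The function names build_index and neighbors are illustrative and not part of GraphX; the edge list is the one from the slide.

```python
from bisect import bisect_left

EDGES = [(3, 2), (3, 5), (3, 9), (5, 2), (5, 3), (7, 3), (8, 5), (8, 6), (10, 6)]


def build_index(edges, key):
    """key=0 groups by source (CSR); key=1 groups by destination (CSC).

    Returns (unique_keys, offsets, values): offsets[i] is where the edges of
    unique_keys[i] start inside values, which holds the opposite endpoint.
    """
    other = 1 - key
    ordered = sorted(edges, key=lambda e: (e[key], e[other]))  # sort first
    unique, offsets, values = [], [], []
    for pos, edge in enumerate(ordered):
        if not unique or unique[-1] != edge[key]:
            unique.append(edge[key])
            offsets.append(pos)
        values.append(edge[other])
    return unique, offsets, values


def neighbors(index, vertex):
    """Look up all edges of one vertex with a binary search plus a slice."""
    unique, offsets, values = index
    i = bisect_left(unique, vertex)
    if i == len(unique) or unique[i] != vertex:
        return []
    end = offsets[i + 1] if i + 1 < len(offsets) else len(values)
    return values[offsets[i]:end]


csr = build_index(EDGES, key=0)
csc = build_index(EDGES, key=1)

print(neighbors(csr, 8))  # out-edges of 8: [5, 6]
print(neighbors(csc, 3))  # in-edges of 3: [5, 7]
```

Both builds start with a full sort of the edge list, which is why these indexes suit static graphs better than a stream of inserts.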
Index Structure for Graph Streaming 
• Both CSR and CSC need to sort the edge lists first and then create the index
• An even better way is to build the index on the fly
• For graph streaming, need to support both fast insert/write and fast search/read
• HashMap
  • Good for exact-match, point search
  • Fast on insert and search
  • Good for graphs with a fixed/known size
  • Needs to re-hash when the size surpasses capacity
• Trees: B-Tree, LSM-Tree (Log-Structured Merge Tree), COLA (Cache-Oblivious Lookahead Array)
  • Support both point search and range search
  • B-Tree: fast search, slow insert
  • LSM-Tree: fast insert, slow search
  • COLA achieves a good tradeoff: fast insert and good-enough search
• Our choice: a COLA-based index for graph streaming
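
To show why a COLA-style layout gives cheap inserts and usable range search, here is a deliberately simplified Python sketch. It keeps only the core structure (level k is either empty or a sorted run of 2**k keys, and an insert merges full levels upward like a binary counter) and omits the lookahead pointers that give the real structure its search bound. The class name SimpleCola and the (src, dst) key layout are assumptions for illustration, not the authors' design.

```python
from bisect import bisect_left
from heapq import merge


class SimpleCola:
    """Simplified COLA: sorted runs of doubling size, no lookahead pointers."""

    def __init__(self):
        self.levels = []  # levels[k] is [] or a sorted list of 2**k keys

    def insert(self, key):
        carry = [key]
        k = 0
        while True:
            if k == len(self.levels):
                self.levels.append([])
            if not self.levels[k]:
                self.levels[k] = carry  # found an empty level: park the run
                return
            # Level is full: merge it into the carry and move one level up.
            carry = list(merge(self.levels[k], carry))
            self.levels[k] = []
            k += 1

    def range_search(self, lo, hi):
        """All keys with lo <= key < hi, via one binary search per level."""
        found = []
        for level in self.levels:
            found.extend(level[bisect_left(level, lo):bisect_left(level, hi)])
        return sorted(found)

    def out_edges(self, src):
        """Destinations of src, assuming integer ids and (src, dst) keys."""
        lo = (src, float("-inf"))
        hi = (src + 1, float("-inf"))
        return [dst for _, dst in self.range_search(lo, hi)]


index = SimpleCola()
for edge in [(3, 2), (3, 5), (3, 9), (5, 2), (5, 3), (7, 3), (8, 5), (8, 6), (10, 6)]:
    index.insert(edge)  # edges arrive one at a time, no global sort needed

level_sizes = [len(level) for level in index.levels]
```

Each key takes part in at most one merge per level, so inserts are cheap on average, while a search has to probe every non-empty level. A second instance keyed by (dst, src) would serve in-edge lookups in the same way.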
Putting Things Together: Our Streaming Pipeline 
[Diagram: a chain of OML stages joined by "+", continuing with "…"]
Performance - Convergence Rate 
[Chart: Convergence Rate; normalized number of iterations (axis 0.0 to 1.2) versus graph size in number of edges (Base, +20%, +40%, +60%, +80%, +100%, +150%, +200%), comparing Naive and Incremental]

Xia Zhu – Intel at MLconf ATL

Performance - Communication Overhead
[Chart: Communication Overhead; normalized number of messages sent (axis 0% to 120%) versus graph size in number of edges (Base, +20%, +40%, +60%, +80%, +100%, +150%, +200%), comparing Naive and Incremental]
Ongoing and Future Work
• Working on online versions of ML algorithms in different categories
• Performance evaluation of various online algorithms
• Complete the on-the-fly indexing work
• Performance evaluation of different indexing methods