Multi-label Graph Analysis and Computations Using GraphX
Qingbo Hu (1), Qiang Zhu (2)
(1) Qingbo Hu is a Senior Business Analytics Associate at LinkedIn.
(2) Qiang Zhu currently works for Airbnb. The work introduced in this talk was done when he was a manager at LinkedIn.
Overview
• Background
• Motivation and Goal
• Constructing Multi-label Graphs
• Multi-label PageRank
• Experiments
• Conclusion
Background
• Network Analysis
– Applications:
• Telecommunication networks
• Bioinformatics
• Social networks
Background
• Network Analysis (cont’d)
– Features of interest:
• (In/Out) degrees
• # triangles
• (Strongly) connected components
• Etc.
– Graph-based algorithms
• PageRank [1]
• Label Propagation [2]
• HITS [3]
• etc.
Motivation and Goal
• Homogeneous Network
– Single type of nodes and single type of edges
– Example:
Citation networks: author, citation
Friendship networks: user, friendship
– Not enough to depict complicated real-life networks
– Supported by GraphX
Motivation and Goal
• Heterogeneous Networks
– Nodes of multiple types and edges of multiple types
– Example:
Social network user activity graph: user, reply, comment, like, etc.
LinkedIn Economic Graph: member, company, employment, connection, etc.
– Better resembles real-life networks
– Can be represented by labels on nodes and edges => multi-label graphs
– Not directly supported by GraphX
Motivation and Goal
• Social activity graph on LinkedIn
– Nodes: Member
– Edges: [edge types were shown as icons in the original figure]
[Figure: Member 1 and Member 2, with node labels Company 1 and Company 2]
Motivation and Goal
• Social activity graph on LinkedIn (cont’d)
– Questions:
• How many times does a member like, comment on, or share other people's posts? => network features with respect to labels
• Who has the highest PageRank score in each company with respect to like/comment/share behavior? => graph-based algorithms on the label level
• Etc.
– Spark + GraphX:
• No direct support
• Building a separate subgraph for each label => waste of time and resources
• A unified solution is preferred
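The first question above (per-label activity counts) can be answered in a single pass over the labeled edges, without building one subgraph per label. The following is a minimal single-machine sketch in plain Scala, not the authors' GraphX implementation; the `LabeledEdge` record and string labels are assumptions made for illustration.

```scala
object LabelDegrees {
  // Hypothetical edge record: (source ID, target ID, edge label).
  final case class LabeledEdge(src: Long, dst: Long, label: String)

  // For every (member, label) pair, count the outgoing edges that carry that label.
  def outDegreeByLabel(edges: Seq[LabeledEdge]): Map[(Long, String), Int] =
    edges.groupBy(e => (e.src, e.label)).map { case (key, es) => key -> es.size }

  def main(args: Array[String]): Unit = {
    val edges = Seq(
      LabeledEdge(1L, 2L, "like"),
      LabeledEdge(1L, 3L, "like"),
      LabeledEdge(1L, 2L, "comment"),
      LabeledEdge(2L, 1L, "share")
    )
    // Print one line per (member, label) pair with its count.
    outDegreeByLabel(edges).toSeq.sortBy(_._1).foreach(println)
  }
}
```

Grouping by the pair (source, label) keeps all labels in one data structure, which is the idea behind the unified solution the slide asks for.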
Motivation and Goal
• Solutions based on GraphX to provide Multi-label graph analysis
• Short-term goals
– Construction of multi-label graphs
– Efficient computation of PageRank score with respect to all labels
• Long-term goals
– A general API library that supports the following additional operations:
• Multi-label Graph transformation
• Network features on the label level
– Implementations for additional common graph-based algorithms
• Label Propagation
• HITS
• Etc.
Constructing Multi-label Graphs
• Node
– (ID, labels, nodeFeatures)
• ID: a unique long value identifying the node
• labels: a set containing the node's labels
• nodeFeatures: other node-dependent features
• Edge
– (fromID, toID, label, edgeFeatures)
• fromID: the ID of the edge’s source node
• toID: the ID of the edge’s target node
• label: a label associated with the edge
• edgeFeatures: other edge-dependent features
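The node and edge tuples above map naturally onto two small record types. This is a sketch only: the names `MLNode` and `MLEdge`, the use of `String` labels, and the generic feature parameter `F` are assumptions, not the library's actual types.

```scala
object MultiLabelModel {
  // Node: (ID, labels, nodeFeatures). F stands for whatever feature type is attached.
  final case class MLNode[F](id: Long, labels: Set[String], features: F)

  // Edge: (fromID, toID, label, edgeFeatures). Each edge carries exactly one label.
  final case class MLEdge[F](fromId: Long, toId: Long, label: String, features: F)

  // Keep only the nodes that carry the given label.
  def nodesWithLabel[F](nodes: Seq[MLNode[F]], label: String): Seq[MLNode[F]] =
    nodes.filter(_.labels.contains(label))
}
```

A node holds a set of labels while an edge holds a single one, so two activities between the same pair of members are stored as two edges.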
Constructing Multi-label Graphs
• Node labels vs. edge labels
– Edge labels matter more than node labels for many network features
• PageRank score, (in/out) degrees, strongly connected components, etc.
– Node labels are used to filter nodes
– Why?
• Edge labels are usually used to form meaningful subgraphs
– A random walk follows edges, degrees are counted with respect to edge labels, etc.
• Node labels can be absorbed into edges if necessary
– a graph transformation operation
[Figure: top influencers for each company vs. top influencers within each company]
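One way to read "node labels can be absorbed into edges" is to rewrite each edge label so that it also encodes a company. The sketch below is an interpretation, not the authors' transformation: it uses the target node's companies, and the `requireSameCompany` flag (a hypothetical name) switches between activity received from anyone and activity received only from colleagues in the same company.

```scala
object AbsorbNodeLabels {
  final case class Edge(src: Long, dst: Long, label: String)

  // Rewrite each edge label as "company/label", using the companies of the target node.
  // With requireSameCompany = true, keep an edge only if its source is in that company too.
  def absorb(
      edges: Seq[Edge],
      companies: Map[Long, Set[String]],
      requireSameCompany: Boolean
  ): Seq[Edge] =
    for {
      e <- edges
      company <- companies.getOrElse(e.dst, Set.empty[String]).toSeq.sorted
      if !requireSameCompany || companies.getOrElse(e.src, Set.empty[String]).contains(company)
    } yield e.copy(label = s"$company/${e.label}")
}
```

After this transformation every remaining question is about edge labels only, so the same label-aware algorithms apply unchanged.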
Constructing Multi-label Graphs
• Methods to create a multi-label graph
– NodeRDDs + EdgeRDDs
– EdgeRDDs (no node labels)
– Load directly from file:
A list of edges: (source, target, label)
A list of nodes: (ID, label_1, label_2, …, label_n) => optional
– Transformation from other multi-label graphs
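For the file-based option, each line has to be turned into an edge or node record. A minimal parsing sketch follows; the comma delimiter and the `Edge` record are assumptions, since the slides do not specify the file format beyond the field order.

```scala
import scala.util.Try

object MultiLabelLoader {
  final case class Edge(src: Long, dst: Long, label: String)

  // Parse one edge line of the assumed form "source,target,label".
  def parseEdge(line: String): Option[Edge] =
    line.split(",").map(_.trim) match {
      case Array(s, t, l) => Try(Edge(s.toLong, t.toLong, l)).toOption
      case _              => None
    }

  // Parse one node line of the assumed form "ID,label_1,...,label_n".
  def parseNode(line: String): Option[(Long, Set[String])] =
    line.split(",").map(_.trim).toList match {
      case id :: labels => Try((id.toLong, labels.toSet)).toOption
      case Nil          => None
    }
}
```

Malformed lines yield `None` rather than an exception, so a loader can drop or log them; the node list stays optional because nodes can be derived from the edge endpoints.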
Multi-label PageRank
• PageRank
– Developed by Larry Page and Sergey Brin
– Used to rank web pages
– Important pages tend to be linked to by other important pages
– Scores are updated iteratively until they converge
– The resulting score is the PageRank score
Multi-label PageRank
• PageRank (cont’d)
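– For an edge (pj, pi), the edge weight is defined as $1/L(p_j)$, where $L(p_j)$ is the out-degree of pj
– Initial score for every node: 1.0 or 1.0 / N
– Later iterations, with reset probability $r$ and edge set $E$:
$PR(p_i) = r + (1 - r) \sum_{(p_j, p_i) \in E} PR(p_j) / L(p_j)$ (initial score 1.0)
or
$PR(p_i) = r / N + (1 - r) \sum_{(p_j, p_i) \in E} PR(p_j) / L(p_j)$ (initial score 1.0 / N)
– To ensure convergence, the random walk is “teleported” to an arbitrary node with a small probability (the reset probability)
[Figure: example graph with nodes A, B, and C and two edges of weight 0.5]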
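As a worked example of the iteration, here is a minimal single-machine sketch in plain Scala. It assumes the standard update rule with initial score 1.0, that is, score = resetProb + (1 - resetProb) * (sum of incoming score / out-degree); the object and parameter names are illustrative.

```scala
object SimplePageRank {
  // edges: (source, target) pairs. Returns the scores after `iters` synchronous iterations.
  def run(edges: Seq[(Int, Int)], resetProb: Double, iters: Int): Map[Int, Double] = {
    val nodes = edges.flatMap { case (s, t) => Seq(s, t) }.distinct
    val outDeg = edges.groupBy(_._1).map { case (s, es) => s -> es.size }
    // Every node starts with score 1.0.
    var pr: Map[Int, Double] = nodes.map(n => n -> 1.0).toMap
    for (_ <- 1 to iters) {
      // For each target, sum score(source) / outDegree(source) over its incoming edges.
      val incoming = edges
        .map { case (s, t) => t -> pr(s) / outDeg(s) }
        .groupBy(_._1)
        .map { case (t, xs) => t -> xs.map(_._2).sum }
      pr = nodes.map(n => n -> (resetProb + (1 - resetProb) * incoming.getOrElse(n, 0.0))).toMap
    }
    pr
  }
}
```

With nodes A = 0, B = 1, C = 2, edges A→B and A→C (each of weight 0.5), and a reset probability of 0.15, one iteration gives B and C a score of 0.15 + 0.85 × 0.5 = 0.575, while A, which has no incoming edges, drops to 0.15.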
Multi-label PageRank
• PageRank (cont’d)
– Power iteration through matrix manipulation
• Vector: scores
• Matrix: transition matrix
• Each iteration: vector * matrix
• Wastes resources if the transition matrix is sparse
– Directly simulate the computation process
• Easier to implement in parallel
• Pregel
Multi-label PageRank
• Pregel
– A general programming interface for graph-based algorithms
– Proposed by Google
– Supported by GraphX
– Iterative algorithm until convergence conditions are met
– For each iteration, we need to consider:
1. How to construct the message passed along edges?
=> Message sender
2. How to combine received messages on a node?
=> Message combiner
3. How to use the combined message to update the info on a node?
=> Vertex Program
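The three questions above correspond to three functions that a Pregel-style loop calls in every iteration. The following single-machine sketch in plain Scala shows how they fit together; it is not GraphX's implementation, and the signatures (`sendMsg`, `mergeMsg`, `vprog`) are simplified assumptions that only mirror the roles named on the slide.

```scala
object MiniPregel {
  def run[V, M](
      vertices: Map[Long, V],
      edges: Seq[(Long, Long)],
      maxIter: Int
  )(
      vprog: (Long, V, M) => V,                            // 3. vertex program
      sendMsg: (Long, V, Long, V) => Option[(Long, M)],    // 1. message sender
      mergeMsg: (M, M) => M                                // 2. message combiner
  ): Map[Long, V] = {
    var state = vertices
    var iter = 0
    var active = true
    while (active && iter < maxIter) {
      // 1. Build the messages sent along the edges.
      val sent = edges.flatMap { case (s, d) => sendMsg(s, state(s), d, state(d)) }
      // 2. Combine all messages addressed to the same vertex.
      val inbox = sent.groupBy(_._1).map { case (id, ms) => id -> ms.map(_._2).reduce(mergeMsg) }
      // 3. Update every vertex that received a message.
      state = state.map { case (id, v) =>
        id -> inbox.get(id).map(m => vprog(id, v, m)).getOrElse(v)
      }
      // Stop as soon as no messages were sent.
      active = inbox.nonEmpty
      iter += 1
    }
    state
  }
}
```

For example, propagating the maximum value through a cycle of three vertices: the sender forwards a value only to a smaller neighbour, the combiner and vertex program both take the maximum, and the loop stops once no vertex has anything new to send.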
Multi-label PageRank
• Construct a graph used for PageRank computation
– PageRankNodeType: Map[Int, (Double, Double)], i.e. label -> (score, score_diff)
• label: the label associated with the PageRank score
• score: the value of the PageRank score
• score_diff: the difference in score between two consecutive iterations
– PageRankEdgeType: (Int, Double), i.e. (label, weight)
• label: the label associated with the edge
• weight: the transition probability on the edge
– PageRankMsgType: Map[Int, Double], i.e. label -> message
• label: the label associated with the message
• message: a double-valued score used to update the PageRank score
Why do we use Map[Int, Double] instead of (Int, Double)?
Multi-label PageRank
• Message Sender
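import org.apache.spark.graphx.{EdgeTriplet, VertexId}

def sendMessage(edge: EdgeTriplet[PageRankNodeType, (Int, Double)]): Iterator[(VertexId, PageRankMsgType)] = {
  // Label on the current edge
  val label = edge.attr._1
  // Send only while the source's last score change for this label exceeds the tolerance
  if (edge.srcAttr(label)._2 > tol) {
    // Message: label -> (score change at the source) * (edge weight)
    Iterator((edge.dstId, Map(label -> edge.srcAttr(label)._2 * edge.attr._2)))
  } else {
    Iterator.empty
  }
}
Creates the message to be passed along the edge as a map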
Multi-label PageRank
• Message Combiner
def messageCombiner(a: PageRankMsgType, b: PageRankMsgType): PageRankMsgType = {
  // Sum the values label by label; a label present in only one map is kept as-is
  a ++ b.map { case (k, v) => k -> (v + a.getOrElse(k, 0.0)) }
}
Combines the received maps into a single one
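A small usage example makes the choice of a map for the message type concrete: a node can receive messages for several labels in the same iteration, and a single (Int, Double) pair could not hold the combined result. The sketch below reuses the slide's merge rule in plain Scala; the label numbering (0 for like, 1 for share) is an assumption.

```scala
object CombinerDemo {
  type PageRankMsgType = Map[Int, Double]

  // Same merge rule as the slide's combiner: add values label by label.
  def messageCombiner(a: PageRankMsgType, b: PageRankMsgType): PageRankMsgType =
    a ++ b.map { case (k, v) => k -> (v + a.getOrElse(k, 0.0)) }

  def main(args: Array[String]): Unit = {
    val fromLikeEdge: PageRankMsgType = Map(0 -> 0.25)   // label 0, assumed to mean "like"
    val fromShareEdge: PageRankMsgType = Map(1 -> 0.5)   // label 1, assumed to mean "share"
    val anotherLikeEdge: PageRankMsgType = Map(0 -> 0.5)
    // Labels that differ stay separate entries; equal labels are summed.
    println(Seq(fromLikeEdge, fromShareEdge, anotherLikeEdge).reduce(messageCombiner))
  }
}
```

Here the two "like" messages add up to 0.75 under label 0, while the "share" message is kept separately under label 1.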
Multi-label PageRank
• Vertex Program
def vertexProgram(id: VertexId, attr: PageRankNodeType, msgSum: PageRankMsgType): PageRankNodeType = {
  …
  // For every label, add the damped incoming score to the old score
  attr.map { case (label, (oldPR, lastDelta)) =>
    val newPR = oldPR + (1.0 - resetProb) * msgSum.getOrElse(label, 0.0)
    // The change in score decides whether this node keeps sending for this label
    val newDelta = newPR - oldPR
    (label, (newPR, newDelta))
  }
}
Uses the combined message to update the PageRank score
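To see the sender, combiner, and vertex program work together, the following plain-Scala sketch simulates them on one machine. It is an approximation under stated assumptions, not the authors' code: edge weights are taken as 1 / (per-label out-degree), every (node, label) pair starts with score and delta equal to `resetProb`, and the vertex update is applied to all nodes in each round.

```scala
object MultiLabelPageRankSketch {
  type NodeAttr = Map[Int, (Double, Double)] // label -> (score, delta)
  type Msg = Map[Int, Double]                // label -> incoming score
  final case class Edge(src: Long, dst: Long, label: Int)

  def combine(a: Msg, b: Msg): Msg =
    a ++ b.map { case (k, v) => k -> (v + a.getOrElse(k, 0.0)) }

  def run(edges: Seq[Edge], resetProb: Double, tol: Double, maxIter: Int): Map[Long, NodeAttr] = {
    val labels = edges.map(_.label).distinct
    val nodes = edges.flatMap(e => Seq(e.src, e.dst)).distinct
    // Per-label out-degree: the edge weight for a label is 1 / outDeg(src, label).
    val outDeg = edges.groupBy(e => (e.src, e.label)).map { case (k, es) => k -> es.size }
    var state: Map[Long, NodeAttr] =
      nodes.map(n => n -> labels.map(l => (l, (resetProb, resetProb))).toMap).toMap
    var iter = 0
    var active = true
    while (active && iter < maxIter) {
      // Message sender: forward the last score change along each edge, for that edge's label only.
      val sent: Seq[(Long, Msg)] = edges.flatMap { e =>
        val delta = state(e.src)(e.label)._2
        if (delta > tol) Some(e.dst -> Map(e.label -> delta / outDeg((e.src, e.label)))) else None
      }
      // Message combiner: one merged map per receiving node.
      val inbox = sent.groupBy(_._1).map { case (id, ms) => id -> ms.map(_._2).reduce(combine) }
      // Vertex program: update every label's score and record the change.
      state = state.map { case (id, attr) =>
        val msgSum = inbox.getOrElse(id, Map.empty[Int, Double])
        id -> attr.map { case (label, (oldPR, _)) =>
          val newPR = oldPR + (1.0 - resetProb) * msgSum.getOrElse(label, 0.0)
          (label, (newPR, newPR - oldPR))
        }
      }
      active = sent.nonEmpty
      iter += 1
    }
    state
  }
}
```

All labels are processed in the same pass over the edges, which is the point of the multi-label formulation: the graph is traversed once per iteration rather than once per label per iteration.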
Experiments
• LinkedIn social activity graph
– Sampled from all social activities in Nov. 2016
– Nodes: ~2 million users
– Node labels: companies
– Edge labels:
• Like
• Share
• Comment
– Edges: ~76 million
– Reset probability: 0.15
– Convergence granularity: 1e-3
– Number of executors: 50
– Executor cores: 3
– Executor memory: 12G
Experiments
• Converges in around 100 iterations
• Total running time: 30 to 40 minutes
• A case study for LinkedIn (three top-3 lists from the original figure):
– Jeff Weiner, Daniel Roth, Greg Call
– Jeff Weiner, Kathy Caprino, Isabelle Roughol
– Jeff Weiner, Isabelle Roughol, Akshay Kothari
Experiments
• Further discussions and lessons learned
– For the edge type in multi-label graphs
(fromID, toID, label, edgeFeatures) => (fromID, toID, Map(label, edgeFeatures))
• Reduces duplication and saves space
• Slower processing time
– Standard Pregel interface in GraphX
• Although data from the last iteration is unpersisted, the DAG keeps growing
• Might cause an out-of-memory error
• Use a Pregel interface with (local) checkpointing to cut off the DAG after several iterations
– Test on larger data sets and various data sources
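On the lineage-growth point above: later Spark releases (2.2 and newer) expose a setting, `spark.graphx.pregel.checkpointInterval`, that makes the built-in Pregel loop checkpoint the graph and messages periodically. This is not necessarily the mechanism the authors used; the fragment below is a configuration sketch in which the interval of 10, the application name, and the checkpoint directory are placeholder assumptions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object PregelCheckpointConfig {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("multi-label-pagerank") // hypothetical application name
      // Checkpoint every 10 Pregel iterations (arbitrary value); the default of -1 disables it.
      .set("spark.graphx.pregel.checkpointInterval", "10")
    val sc = new SparkContext(conf)
    // Checkpointing needs a directory; this path is a placeholder.
    sc.setCheckpointDir("/tmp/graphx-checkpoints")
    // ... build the graph and run the Pregel-based computation here ...
    sc.stop()
  }
}
```

Each checkpoint truncates the lineage, so the driver no longer has to track a DAG that grows with every iteration.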
Conclusion
• Network Analysis
– Graph features
– Graph-based algorithms
• Homogeneous vs. Heterogeneous Networks
• Multi-label Graphs
– Node & Node labels
– Edge & Edge labels
– Constructing a multi-label graph
• Multi-label PageRank
– PageRank
– Pregel-based implementation
• Experiments
References
[1] Page, Lawrence, et al. "The PageRank citation ranking: Bringing order to the web." Stanford InfoLab, 1999.
[2] Zhu, Xiaojin, and Zoubin Ghahramani. "Learning from labeled and unlabeled data with label propagation." (2002): 1.
[3] Kleinberg, Jon M. "Hubs, authorities, and communities." ACM Computing Surveys (CSUR) 31.4es (1999): 5.
Thank You!
Qingbo Hu (qihu@linkedin.com)



Editor's Notes

  • #23:
    – Daniel Roth: Editor in Chief
    – Greg Call: Head of Veterans Program, works for Amazon now
    – Kathy Caprino: LinkedIn Publishing Member Advisory Board
    – Isabelle Roughol: Senior Editor
    – Akshay Kothari: Head of LinkedIn India