Multi-label Graph Analysis and Computations Using GraphX
Qingbo Hu (1), Qiang Zhu (2)
1. Qingbo Hu is a Senior Business Analytics Associate at LinkedIn.
2. Qiang Zhu currently works for Airbnb. The work introduced in this talk was done when he was a manager at LinkedIn.
Overview
• Background
• Motivation and Goal
• Constructing Multi-label Graphs
• Multi-label PageRank
• Experiments
• Conclusion
Background
• Network Analysis
– Applications: telecommunication networks, bioinformatics, social networks
Background
• Network Analysis (cont’d)
– Features of interest:
• (In/Out) degrees
• # triangles
• (Strongly) connected components
• Etc.
– Graph-based algorithms
• PageRank [1]
• Label Propagation [2]
• HITS [3]
• etc.
Motivation and Goal
• Homogeneous Network
– Single type of nodes and single type of edges
– Example:
Citation networks: author, citation
Friendship networks: user, friendship
– Not enough to depict complicated real-life networks
– Supported by GraphX
Motivation and Goal
• Heterogeneous Networks
– Nodes of multiple types and edges of multiple types
– Example:
Social network user activity graph: user, reply, comment, like, etc.
LinkedIn Economic Graph: member, company, employment, connection, etc.
– Better resemble real-life networks
– Can be represented by labels on nodes and edges => multi-label graphs
– Not directly supported by GraphX
Motivation and Goal
• Social activity graph on LinkedIn
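– Nodes: members
– Edges: social activities between members (like, comment, share)
[Figure: example graph with Member 1 and Member 2 across Company 1 and Company 2]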
Motivation and Goal
• Social activity graph on LinkedIn (cont’d)
– Questions:
• How many times does a member like, comment on, or share other people's posts? (a network feature with respect to labels)
• Who has the highest PageRank score in each company with respect to like/comment/share behavior? (a graph-based algorithm at the label level)
• Etc.
– Spark + GraphX
• No direct support
• Building multiple subgraphs for different labels wastes time and resources
• A unified solution is preferred
Motivation and Goal
• GraphX-based solutions for multi-label graph analysis
• Short-term goals
– Construction of multi-label graphs
– Efficient computation of PageRank scores with respect to all labels
• Long-term goals
– A general API library that supports the following additional operations:
• Multi-label Graph transformation
• Network features on the label level
– Implementations for additional common graph-based algorithms
• Label Propagation
• HITS
• Etc.
Constructing Multi-label Graphs
• Node
– (ID, labels, nodeFeatures)
• ID: a unique Long identifying the node
• labels: a set containing the node's labels
• nodeFeatures: other node-dependent features
• Edge
– (fromID, toID, label, edgeFeatures)
• fromID: the ID of the edge's source node
• toID: the ID of the edge's target node
• label: a label associated with the edge
• edgeFeatures: other edge-dependent features
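The node and edge tuples above can be sketched as plain Scala case classes. This is only an illustration: the names `Node`, `Edge`, and `outDegreeByLabel` are hypothetical, labels are written as strings for readability, and features are simplified to a string map. The helper shows one label-level network feature, the out-degree per edge label.

```scala
object MultiLabelModel {
  // (ID, labels, nodeFeatures); hypothetical names, not the authors' API
  final case class Node(id: Long, labels: Set[String], features: Map[String, String] = Map.empty)

  // (fromID, toID, label, edgeFeatures)
  final case class Edge(fromId: Long, toId: Long, label: String, features: Map[String, String] = Map.empty)

  // Out-degree of every node, broken down by edge label.
  def outDegreeByLabel(edges: Seq[Edge]): Map[(Long, String), Int] =
    edges.groupBy(e => (e.fromId, e.label)).map { case (key, es) => key -> es.size }

  // A tiny example graph: two members and four activities.
  val nodes: Seq[Node] = Seq(Node(1L, Set("Company 1")), Node(2L, Set("Company 2")))
  val edges: Seq[Edge] = Seq(
    Edge(1L, 2L, "like"),
    Edge(1L, 2L, "comment"),
    Edge(2L, 1L, "like"),
    Edge(1L, 2L, "like")
  )
}
```

In a GraphX setting the same records would live in vertex and edge RDDs; the in-memory collections here only keep the sketch self-contained.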
Constructing Multi-label Graphs
• Node labels vs. edge labels
– Edge labels matter more for many network features
• PageRank score, (in/out) degrees, strongly connected components, etc.
– Node labels are used to filter nodes
– Why?
• Edge labels are usually used to form meaningful subgraphs
– Random walks follow edges, degrees are computed with respect to edge labels, etc.
• Node labels can be absorbed into edges if necessary
– a graph transform operation
[Figure: two transformed graphs, "Top influencers for each company" and "Top influencers within each company"]
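Absorbing node labels into edge labels could look roughly like the sketch below. It is one possible reading of the "within each company" case, not the authors' transformation: it assumes every node carries a single company label, keeps only edges whose endpoints share that company, and prefixes the edge label with it. The names `Edge` and `withinCompany` are hypothetical.

```scala
object AbsorbNodeLabels {
  final case class Edge(fromId: Long, toId: Long, label: String)

  // Keep edges whose source and target belong to the same company and
  // fold the company into the edge label, e.g. "like" becomes "A/like".
  // Edges with an endpoint of unknown company are dropped.
  def withinCompany(edges: Seq[Edge], company: Map[Long, String]): Seq[Edge] =
    for {
      e   <- edges
      src <- company.get(e.fromId).toSeq
      dst <- company.get(e.toId).toSeq
      if src == dst
    } yield e.copy(label = s"$src/${e.label}")
}
```

After such a transformation, a label-level PageRank over the new edge labels ranks members separately inside each company.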
Constructing Multi-label Graphs
• Methods to create a multi-label graph
– NodeRDDs + EdgeRDDs
– EdgeRDDs (no node labels)
– Load directly from file:
A list of edges: (source, target, label)
A list of nodes: (ID, label_1, label_2, …, label_n) => optional
– Transformation from other multi-label graphs
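Loading from file can be illustrated with two small line parsers. The slides do not state the file format beyond the field order, so the comma delimiter and the names `parseEdge` and `parseNode` are assumptions made for this sketch.

```scala
object EdgeListParser {
  final case class Edge(fromId: Long, toId: Long, label: String)

  // Parse one "source,target,label" line; malformed lines yield None.
  def parseEdge(line: String): Option[Edge] =
    line.split(",").map(_.trim) match {
      case Array(src, dst, label) =>
        scala.util.Try(Edge(src.toLong, dst.toLong, label)).toOption
      case _ => None
    }

  // Parse one "ID,label_1,...,label_n" line into (ID, set of labels).
  // A line with only an ID gives an empty label set.
  def parseNode(line: String): Option[(Long, Set[String])] =
    line.split(",").map(_.trim).toList match {
      case id :: labels if id.nonEmpty =>
        scala.util.Try(id.toLong -> labels.toSet).toOption
      case _ => None
    }
}
```

Because the node list is optional, a loader built on these parsers can fall back to deriving the node set from the edge endpoints.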
Multi-label PageRank
• PageRank
– Developed by Larry Page and Sergey Brin
– Used to rank web pages
– Important pages tend to be linked to by other important pages
– Scores are updated iteratively until they converge
– The obtained score: PageRank score
Multi-label PageRank
• PageRank (cont’d)
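– For an edge $(p_j, p_i)$, the edge weight is defined as $1 / L(p_j)$, where $L(p_j)$ is the out-degree of $p_j$
– Initial score for every node: 1.0 or 1.0 / N
– Later iterations: $PR(p_i) = \sum_{p_j \in M(p_i)} PR(p_j) / L(p_j)$, where $M(p_i)$ is the set of nodes that link to $p_i$
– To ensure convergence, we allow a small probability $d$ (the reset probability) of being "teleported" to any node:
$PR(p_i) = d + (1 - d) \sum_{p_j \in M(p_i)} PR(p_j) / L(p_j)$ or $PR(p_i) = d / N + (1 - d) \sum_{p_j \in M(p_i)} PR(p_j) / L(p_j)$
[Figure: node A links to B and C; each edge has weight 0.5]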
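As a worked example, the sketch below runs the standard PageRank update with a reset probability on the three-node graph from the slide (A links to B and C). It uses an initial score of 1.0 and the update PR(i) = d + (1 - d) * sum of PR(j) / outDegree(j) over in-links; the object and function names are hypothetical.

```scala
object SimplePageRank {
  // One synchronous update for all nodes.
  // scores: current score per node; edges: (source, target) pairs; d: reset probability.
  def step(scores: Map[String, Double], edges: Seq[(String, String)], d: Double): Map[String, Double] = {
    val outDegree = edges.groupBy(_._1).map { case (src, es) => src -> es.size }
    // Each edge carries the source score times the edge weight 1 / outDegree(source).
    val incoming = edges
      .map { case (src, dst) => dst -> scores(src) / outDegree(src) }
      .groupBy(_._1)
      .map { case (dst, contribs) => dst -> contribs.map(_._2).sum }
    scores.map { case (node, _) => node -> (d + (1 - d) * incoming.getOrElse(node, 0.0)) }
  }

  // Start every node at 1.0 and apply a fixed number of updates.
  def run(nodes: Seq[String], edges: Seq[(String, String)], d: Double, iterations: Int): Map[String, Double] = {
    val initial = nodes.map(_ -> 1.0).toMap
    (1 to iterations).foldLeft(initial)((s, _) => step(s, edges, d))
  }
}
```

With d = 0.15, A has no in-links and settles at 0.15, while B and C each receive half of A's score and settle at 0.15 + 0.85 * 0.075 = 0.21375.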
Multi-label PageRank
• PageRank (cont’d)
– Power iteration through matrix manipulation
• Vector: scores
• Matrix: transition matrix
• Each iteration: vector * matrix
• Wastes resources if the transition matrix is sparse
– Directly simulate the computation process
• Easier for parallel implementation
• Pregel
Multi-label PageRank
• Pregel
– A general programming interface for graph-based algorithms
– Proposed by Google
– Supported by GraphX
– Iterates until convergence conditions are met
– For each iteration, we need to consider:
1. How to construct the message passed along edges? => Message sender
2. How to combine received messages on a node? => Message combiner
3. How to use the combined message to update the info on a node? => Vertex program
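The three questions map onto three functions. The sketch below is a single-machine loop over in-memory maps that shows how they fit together; it is not GraphX's implementation, which is distributed and only invokes the sender on edges adjacent to vertices that received a message in the previous round. All names here are hypothetical.

```scala
object LocalPregel {
  def run[V, E, M](
      vertices: Map[Long, V],
      edges: Seq[(Long, Long, E)], // (source, target, edge attribute)
      initialMsg: M,
      maxIterations: Int
  )(
      vertexProgram: (Long, V, M) => V,
      sendMessage: (V, V, E) => Option[M], // (source value, target value, edge attribute)
      mergeMessage: (M, M) => M
  ): Map[Long, V] = {
    // Every vertex first processes the initial message.
    var state = vertices.map { case (id, v) => id -> vertexProgram(id, v, initialMsg) }
    var iteration = 0
    var active = true
    while (active && iteration < maxIterations) {
      // 1. Message sender: each edge may emit a message to its target.
      val sent = edges.flatMap { case (src, dst, attr) =>
        sendMessage(state(src), state(dst), attr).map(m => dst -> m).toList
      }
      // 2. Message combiner: merge all messages addressed to the same vertex.
      val inbox = sent.groupBy(_._1).map { case (dst, ms) => dst -> ms.map(_._2).reduce(mergeMessage) }
      // 3. Vertex program: vertices that received a message update their value.
      state = state.map { case (id, v) => id -> inbox.get(id).map(m => vertexProgram(id, v, m)).getOrElse(v) }
      // Stop once no messages were sent in this round.
      active = inbox.nonEmpty
      iteration += 1
    }
    state
  }
}
```

A classic usage is propagating the maximum value through a graph: the vertex program keeps the larger of its value and the message, the sender emits the source value only when it exceeds the target's, and the combiner takes the maximum.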
Multi-label PageRank
• Construct a graph used for PageRank computation
– PageRankNodeType: Map[Int, (Double, Double)]
• label: the label associated with the PageRank score
• score: the value of the PageRank score
• score_diff: the difference in score between two iterations
– PageRankEdgeType: (Int, Double)
• label: the label associated with the edge
• weight: the transition probability on the edge
– PageRankMsgType: Map[Int, Double]
• label: the label associated with the message
• message: a Double-valued score used to update the PageRank score
Why do we use Map[Int, Double] instead of (Int, Double)?
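One plausible answer to the question above: a node can receive messages from edges with different labels in the same iteration, and the combiner must return a single value of the message type. A tuple holds only one label, while a map keeps one entry per label. The sketch below restates the slide's type aliases and shows two such messages merged; the label encoding (`Like`, `Share`) is hypothetical.

```scala
object MessageTypes {
  type PageRankNodeType = Map[Int, (Double, Double)] // label -> (score, score_diff)
  type PageRankEdgeType = (Int, Double)              // (label, weight)
  type PageRankMsgType  = Map[Int, Double]           // label -> message

  // Hypothetical encoding of edge labels as Ints.
  val Like = 0
  val Share = 1

  // Messages arriving at the same node over a "like" edge and a "share" edge.
  val fromLikeEdge: PageRankMsgType = Map(Like -> 0.2)
  val fromShareEdge: PageRankMsgType = Map(Share -> 0.1)

  // A single message value that still carries both labels.
  val merged: PageRankMsgType = fromLikeEdge ++ fromShareEdge
}
```

With an (Int, Double) message, merging these two would have to drop one label or mix scores that belong to different labels.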
Multi-label PageRank
• Message Sender
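import org.apache.spark.graphx.{EdgeTriplet, VertexId}

// tol (the convergence tolerance) is assumed to be defined in the enclosing scope
def sendMessage(edge: EdgeTriplet[PageRankNodeType, PageRankEdgeType]): Iterator[(VertexId, PageRankMsgType)] = {
  // Label and transition probability on the current edge
  val (label, weight) = edge.attr
  // Change in the source node's score for this label during the last iteration
  val delta = edge.srcAttr(label)._2
  if (delta > tol) {
    // Pass the weighted change along the edge as a single-entry map
    Iterator((edge.dstId, Map(label -> delta * weight)))
  } else {
    Iterator.empty
  }
}
Creates the message to be passed along the edge as a map.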
Multi-label PageRank
• Message Combiner
def messageCombiner(a: PageRankMsgType, b: PageRankMsgType): PageRankMsgType = {
  // Sum the values of labels present in both maps; keep the others unchanged
  a ++ b.map { case (k, v) => k -> (v + a.getOrElse(k, 0.0)) }
}
Combines the received maps into a single one.
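The combiner can be checked in isolation, without Spark. The sketch below copies its logic and applies it to two small messages; the sample values are made up. GraphX expects the merge function passed to Pregel to be commutative and associative, and a per-label sum satisfies both.

```scala
object CombinerDemo {
  type PageRankMsgType = Map[Int, Double]

  // Same logic as the slide's combiner: sum the values per label.
  def messageCombiner(a: PageRankMsgType, b: PageRankMsgType): PageRankMsgType =
    a ++ b.map { case (k, v) => k -> (v + a.getOrElse(k, 0.0)) }

  // Label 1 appears in both messages; labels 0 and 2 appear in only one.
  val m1: PageRankMsgType = Map(0 -> 0.25, 1 -> 0.5)
  val m2: PageRankMsgType = Map(1 -> 0.25, 2 -> 1.0)

  val combined: PageRankMsgType = messageCombiner(m1, m2)
}
```

Here `combined` holds 0.25 for label 0, 0.75 for label 1, and 1.0 for label 2, and swapping the arguments gives the same map.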
Multi-label PageRank
• Vertex Program
def vertexProgram(id: VertexId, attr: PageRankNodeType, msgSum: PageRankMsgType): PageRankNodeType = {
  …
  attr.map {
    case (label, (oldPR, lastDelta)) =>
      // Add the damped incoming contribution for this label
      val newPR = oldPR + (1.0 - resetProb) * msgSum.getOrElse(label, 0.0)
      val newDelta = newPR - oldPR
      label -> (newPR, newDelta)
  }
}
Uses the combined message to update the PageRank score.
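A single update step can be traced with concrete numbers. The sketch below reproduces only the visible part of the vertex program (the elided portion of the slide is not reconstructed) with a reset probability of 0.15; the name `update` and the sample values are hypothetical.

```scala
object VertexUpdateDemo {
  type PageRankNodeType = Map[Int, (Double, Double)] // label -> (score, score_diff)
  type PageRankMsgType  = Map[Int, Double]           // label -> combined message

  val resetProb = 0.15

  // Per label: add the damped message to the old score and record the change.
  def update(attr: PageRankNodeType, msgSum: PageRankMsgType): PageRankNodeType =
    attr.map { case (label, (oldPR, _)) =>
      val newPR = oldPR + (1.0 - resetProb) * msgSum.getOrElse(label, 0.0)
      (label, (newPR, newPR - oldPR))
    }
}
```

For a node with score 0.15 under two labels that receives 0.5 for label 0 only, label 0 moves to 0.15 + 0.85 * 0.5 = 0.575, while label 1 keeps its score with a change of 0.0. A zero change falls below the tolerance in the message sender, so the node stops sending messages for that label.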
Experiments
• LinkedIn social activity graph
– Sampled from all social activities in Nov. 2016
– Nodes: ~2 million users
– Node labels: companies
– Edge labels:
• Like
• Share
• Comment
– Edges: ~76 million
– Reset probability: 0.15
– Convergence tolerance: 1e-3
– Number of executors: 50
– Executor cores: 3
– Executor memory: 12G
Experiments
• Converged after around 100 iterations
• Total running time: 30 to 40 minutes
• A case study for LinkedIn, three groups of top-ranked members:
– Jeff Weiner, Daniel Roth, Greg Call
– Jeff Weiner, Kathy Caprino, Isabelle Roughol
– Jeff Weiner, Isabelle Roughol, Akshay Kothari
Experiments
• Further discussions and lessons learned
– Alternative edge type for multi-label graphs:
(fromID, toID, label, edgeFeatures) => (fromID, toID, Map(label, edgeFeatures))
• Reduces duplication and saves space
• Slower processing
– Standard Pregel interface in GraphX
• Although data from the last iteration is unpersisted, the DAG keeps growing
• Might cause out-of-memory errors
• Use a Pregel interface with (local) checkpointing to cut off the DAG every few iterations
– Future work: test on larger data sets and various data sources
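The alternative edge type discussed above, one edge per node pair with a map from label to features, can be sketched as a grouping step. The names are hypothetical, the edge feature is reduced to a single weight, and the sketch assumes at most one edge per (fromID, toID, label); duplicates would overwrite each other.

```scala
object EdgeGrouping {
  final case class Edge(fromId: Long, toId: Long, label: Int, weight: Double)

  // (fromID, toID, label, features) => (fromID, toID) -> Map(label -> features)
  def groupByEndpoints(edges: Seq[Edge]): Map[(Long, Long), Map[Int, Double]] =
    edges.groupBy(e => (e.fromId, e.toId)).map { case (ends, es) =>
      ends -> es.map(e => e.label -> e.weight).toMap
    }
}
```

The endpoints are stored once per node pair instead of once per label, which is where the space saving comes from; the extra map lookup for every label is consistent with the slower processing noted in the slide.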
Conclusion
• Network Analysis
– Graph features
– Graph-based algorithms
• Homogeneous vs. Heterogeneous Networks
• Multi-label Graphs
– Node & Node labels
– Edge & Edge labels
– Constructing a multi-label graph
• Multi-label PageRank
– PageRank
– Pregel-based implementation
• Experiments
References
[1] Page, Lawrence, et al. The PageRank citation ranking: Bringing order to the web. Stanford InfoLab, 1999.
[2] Zhu, Xiaojin, and Zoubin Ghahramani. "Learning from labeled and unlabeled data with label propagation." (2002): 1.
[3] Kleinberg, Jon M. "Hubs, authorities, and communities." ACM Computing Surveys (CSUR) 31.4es (1999): 5.
Thank You!
Qingbo Hu (qihu@linkedin.com)

Multi-Label Graph Analysis and Computations Using GraphX with Qiang Zhu and Qingbo Hu
