Improve ML Predictions using
Graph Algorithms
Jennifer Reif, Neo4j
Amy Hodler, Neo4j
July 2019
#Neo4j
#GraphAnalytics
What in Common is Predictive?
Relationships:
Strongest Predictors of Behavior!
“Increasingly we're learning that you can make
better predictions about people by getting all the
information from their friends and their friends’
friends than you can from the information you
have about the person themselves”
James Fowler, David Burkus, Albert-Laszlo Barabasi
• Graphs for Predictions
• Connected Features
• Link Prediction
• Neo4j + Spark Workflow
Amy E. Hodler
Graph Analytics & AI
Program Manager, Neo4j
Amy.Hodler@neo4j.com
@amyhodler
Jennifer Reif
Labs Engineer, Neo4j
Jennifer.Reif@neo4j.com
@JMHReif
Native Graph Platforms are Designed for Connected Data
• TRADITIONAL PLATFORMS: Store and retrieve data. Real-time storage & retrieval.
• BIG DATA TECHNOLOGY: Aggregate and filter data. Long-running queries, aggregation & filtering.
• Native graph platforms: Connections in data. Real-Time Connected Insights.
Max # of hops: ~3 (traditional platforms), Millions (native graph)
“Our Neo4j solution is literally thousands of
times faster than the prior MySQL solution, with
queries that require 10-100 times less code”
Volker Pacher, Senior Developer
Graph Databases Surging in Popularity
Trends since 2013
DB-Engines.com
Graph in AI Research is Taking Off
[Figure: Research Papers on Graph-Related AI, 2010 to 2018. Y-axis: mentions in the Dimension Knowledge System (0 to 4,000). Terms tracked: graph neural network, graph convolutional, graph embedding, graph learning, graph attention, graph kernel, graph completion. Source: Dimension Knowledge System]
Machine Learning Eats A Lot of Data
Machine Learning uses algorithms to
train software using specific examples
and progressive improvements
Algorithms iterate, continually adjusting
to get closer to an objective goal, such as
error reduction
This learning requires feeding a lot of data to a model and enabling it to learn how
to process and incorporate that information
More Accurate Predictions with the Data You Already Have
• Many data science models ignore network structure & complex relationships
• Graphs add highly predictive features to existing ML models
• Otherwise unattainable predictions based on relationships
[Figure: Machine Learning Pipeline]
Graph Data Science Applications: EXAMPLES
• Financial Crimes
• Recommendations
• Cybersecurity
• Predictive Maintenance
• Customer Segmentation
• Churn Prediction
• Search & MDM
• Drug Discovery
Graph Data Science Gives Us
• Better Decisions: Knowledge Graphs
• Higher Accuracy: Connected Feature Engineering
• More Trust and Applicability: Graph Native Learning
Connected Features
What Are Connected Features?
Connection-related metrics about our graph,
such as the number of relationships going
into or out of nodes, a count of potential
triangles, or neighbors in common.
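As a minimal sketch of one such connection-related metric, the Python below counts the relationships going into and out of each node. The edge list and the function name `degree_features` are hypothetical, invented for illustration; in practice these counts would more likely come from a Cypher query or a graph algorithms library.

```python
from collections import Counter

def degree_features(edges):
    """Return (in_degree, out_degree) Counters for a directed edge list."""
    in_degree, out_degree = Counter(), Counter()
    for source, target in edges:
        out_degree[source] += 1  # relationship going out of `source`
        in_degree[target] += 1   # relationship going into `target`
    return in_degree, out_degree

# Hypothetical directed relationships as (source, target) pairs
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("d", "c")]
in_degree, out_degree = degree_features(edges)
print(in_degree["c"], out_degree["a"])
```

Each count becomes one column in a feature table keyed by node.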
Deriving Connected Features
• Local Patterns: Query (e.g. Cypher). Real-time, local decisioning and pattern matching. You know what you’re looking for and are making a decision.
• Global Computation: Graph Algorithms Libraries. Global analysis and iterations. You’re learning the overall structure of a network, updating data, and predicting.
Graph Feature Engineering
Feature Engineering is how we combine and process the data to
create new, more meaningful features, such as clustering or
connectivity metrics.
Add More Descriptive Features:
- Influence
- Relationships
- Communities
Extraction
Graph Feature Categories & Algorithms
• Pathfinding & Search: Finds the optimal paths or evaluates route availability and quality
• Centrality / Importance: Determines the importance of distinct nodes in the network
• Community Detection: Detects group clustering or partition options
• Heuristic Link Prediction: Estimates the likelihood of nodes forming a relationship
• Similarity: Evaluates how alike nodes are
• Embeddings: Learned representations of connectivity or topology
Link Prediction
Can we infer new interactions in the future?
Which unobserved facts are we missing?
Example: het.io
• 50+ years of biomedical data integrated in a knowledge graph
• Predicting new uses for drugs by using the graph structure to create features for link prediction
Using Graph Algorithms
• Explore, Plan, Measure: Find significant patterns and plan for optimal structures. Score outcomes and set a threshold value for a prediction.
• Feature Engineering for Machine Learning: Use the measures as features to train a model.

1st Node | 2nd Node | Common Neighbors | Preferential Attachment | label
1        | 2        | 4                | 15                      | 1
3        | 4        | 7                | 12                      | 1
5        | 6        | 1                | 1                       | 0
Example:
Predicting Collaboration
Predicting Collaboration with a Graph Enhanced ML Model
• Citation Network Dataset - Research Dataset
– “ArnetMiner: Extraction and Mining of Academic Social Networks”, by J. Tang et al.
– Used a subset with 52K papers, 80K authors, 140K author relationships and 29K citation relationships
• Neo4j
– Create a co-authorship graph and engineer connected features
• Spark and MLlib
– Train and test our model using a random forest classifier
Our Link Prediction Workflow
1. Extract Data & Store as Graph
2. Explore, Clean, Modify
3. Prepare for Machine Learning
4. Train Models
5. Evaluate Results
6. Productionize
Our Link Prediction Workflow: Extract Data & Store as Graph
• Import Data
• Create Co-Author Graph
Our Link Prediction Workflow: Explore, Clean, Modify
• Identify sparse feature areas
• Feature Engineering: New graphy features
Graph Algorithms Used for Feature Engineering (a few examples)
• Preferential Attachment multiplies the number of neighbors for pairs of nodes
• Common Neighbors measures the number of neighbors that a pair of nodes shares (triadic closure)
Illustration: be.amazd.com/link-prediction/
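The two measures above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code used in the talk: the co-authorship pairs and the helper names `build_adjacency`, `common_neighbors`, and `preferential_attachment` are hypothetical.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Build an undirected adjacency map: node -> set of neighbors."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency

def common_neighbors(adjacency, u, v):
    """Number of nodes that are neighbors of both u and v."""
    return len(adjacency[u] & adjacency[v])

def preferential_attachment(adjacency, u, v):
    """Product of the neighbor counts of u and v."""
    return len(adjacency[u]) * len(adjacency[v])

# Hypothetical co-authorship pairs
edges = [(1, 3), (1, 4), (1, 5), (2, 3), (2, 4), (2, 6)]
adjacency = build_adjacency(edges)
cn = common_neighbors(adjacency, 1, 2)         # authors 3 and 4 are shared
pa = preferential_attachment(adjacency, 1, 2)  # 3 neighbors * 3 neighbors
print(cn, pa)
```

Computed for every candidate pair, the two scores fill the "Common Neighbors" and "Preferential Attachment" columns of the feature table.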
Graph Algorithms Used for Feature Engineering (a few examples, continued)
• Triangle counting and clustering coefficients measure the density of connections around nodes
• Louvain Modularity identifies interacting communities and hierarchies
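To make the triangle and clustering measures concrete, here is a small pure-Python sketch. The edge list and the function name `triangles_and_clustering` are hypothetical; a graph algorithms library would normally do this at scale. The local clustering coefficient is taken as the number of triangles through a node divided by the number of possible pairs of its neighbors.

```python
from collections import defaultdict
from itertools import combinations

def triangles_and_clustering(edges):
    """Return {node: (triangle_count, local_clustering_coefficient)}."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)

    result = {}
    for node, neighbors in adjacency.items():
        # A triangle through `node` is a pair of its neighbors that are also linked.
        triangles = sum(
            1 for a, b in combinations(neighbors, 2) if b in adjacency[a]
        )
        possible = len(neighbors) * (len(neighbors) - 1) // 2
        coefficient = triangles / possible if possible else 0.0
        result[node] = (triangles, coefficient)
    return result

# Hypothetical undirected graph: one triangle (1, 2, 3) plus a pendant node 4
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
features = triangles_and_clustering(edges)
print(features[1], features[4])
```

For a pair of nodes, the minimum and maximum of these per-node values can then serve as pair-level features.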
Our Link Prediction Workflow: Prepare for Machine Learning
• Train / Test Split
• Resample: Downsampled for proportional representation
Test/Train Split

1st Node | 2nd Node | Common Neighbors | Preferential Attachment | label
1        | 2        | 4                | 15                      | 1
3        | 4        | 7                | 12                      | 1
5        | 6        | 1                | 1                       | 0
2        | 12       | 3                | 3                       | 0
4        | 9        | 4                | 8                       | 1
7        | 10       | 12               | 36                      | 1
8        | 11       | 2                | 3                       | 0

The rows of this single table are then divided into a Train set and a Test set.
OMG I’m Good!
Data Leakage!
Graph metric computation for the train
set touches data from the test set.
Did you get really high accuracy on your
first run without tuning?
Train and Test Graphs: Time Based Split

Train (< 2006)
1st Node | 2nd Node | Common Neighbors | Preferential Attachment | label
1        | 2        | 4                | 15                      | 1
3        | 4        | 7                | 12                      | 1
5        | 6        | 1                | 1                       | 0

Test (>= 2006)
1st Node | 2nd Node | Common Neighbors | Preferential Attachment | label
2        | 12       | 3                | 3                       | 0
4        | 9        | 4                | 8                       | 1
7        | 10       | 12               | 36                      | 1
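A time-based split like this one can be sketched as follows. The record layout `(author_a, author_b, year)`, the sample data, and the function name `time_based_split` are assumptions made for illustration; only the 2006 cutoff comes from the slide.

```python
def time_based_split(coauthorships, cutoff_year=2006):
    """Split (author_a, author_b, year) records into train (< cutoff) and test (>= cutoff)."""
    train = [record for record in coauthorships if record[2] < cutoff_year]
    test = [record for record in coauthorships if record[2] >= cutoff_year]
    return train, test

# Hypothetical co-authorship records with the year of first collaboration
coauthorships = [(1, 2, 2001), (3, 4, 2005), (2, 12, 2006), (4, 9, 2010)]
train_edges, test_edges = time_based_split(coauthorships)
print(len(train_edges), len(test_edges))
```

The point of the split is that graph features for training rows are computed only on the earlier graph, so metric computation for the train set never touches test-period relationships.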
Class Imbalance
[Figure: Negative Examples vs. Positive Examples]
A model could achieve very high accuracy simply by
predicting that every pair of nodes is not linked.
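The imbalance problem, and one common remedy, can be sketched in plain Python. The data below is synthetic and the function name `downsample_negatives` is hypothetical; matching the negative count to the positive count is an assumption about what "downsampled" means here.

```python
import random

def downsample_negatives(rows, seed=42):
    """Randomly drop negative rows (label 0) until they match the positive count."""
    positives = [row for row in rows if row["label"] == 1]
    negatives = [row for row in rows if row["label"] == 0]
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    kept = rng.sample(negatives, min(len(positives), len(negatives)))
    return positives + kept

# Synthetic pairs: 2 linked, 98 not linked
rows = [{"pair": (i, i + 1), "label": 1 if i < 2 else 0} for i in range(100)]

# A "model" that always predicts "not linked" is right on every negative row.
always_negative_accuracy = sum(row["label"] == 0 for row in rows) / len(rows)

balanced = downsample_negatives(rows)
print(always_negative_accuracy, len(balanced))
```

On the balanced sample the always-negative predictor drops to 50% accuracy, which makes the accuracy figure meaningful again.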
Our Link Prediction Workflow: Train Models
• Model Selection: Random Forest
• Ensemble method
Picking a Classifier
Training Our Model
This is one decision tree in
our Random Forest used as a
binary classifier to learn how
to classify a pair: predicting
either linked or not linked.
4 Layered Models Trained
Multiple graph features used to train the models:
• Common Authors Model: Common Authors
• “Graphy” Model adds: Pref. Attachment, Total Neighbors
• Triangles Model adds: Min & Max Triangles, Min & Max Clustering Coefficient
• Community Model adds: Label Propagation, Louvain Modularity
Our Link Prediction Workflow: Evaluate Results
• Precision, Accuracy, Recall
• ROC Curve & AUC
Measures
• Accuracy: Proportion of total predictions that are correct. Beware of skewed data!
• Precision: Proportion of positive predictions that are correct. Low score = more false positives
• Recall / True Positive Rate: Proportion of actual positives that are predicted correctly. Low score = more false negatives
• False Positive Rate: Proportion of actual negatives that are incorrectly predicted as positive
• ROC Curve & AUC: X-Y chart mapping the above 2 metrics (TPR and FPR) with area under curve
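These measures follow directly from the four confusion-matrix counts. The sketch below uses made-up labels and predictions, and the function name `classification_measures` is hypothetical; an ML library's evaluators would normally report these.

```python
def classification_measures(labels, predictions):
    """Compute accuracy, precision, recall (TPR), and false positive rate."""
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # true positives
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # true negatives
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # false positives
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # false negatives
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }

# Made-up ground truth and predictions (1 = linked, 0 = not linked)
labels = [1, 1, 1, 0, 0, 0, 0, 0]
predictions = [1, 1, 0, 1, 0, 0, 0, 0]
measures = classification_measures(labels, predictions)
print(measures)
```

An ROC curve plots recall against the false positive rate as the decision threshold varies; this sketch evaluates a single threshold only.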
Result: First Model ROC & AUC
[Figure: ROC curve for Model 1 (Common Authors), annotated "False Positives!" and "False Negatives!"]
Result: All Models
[Figure: ROC curves for all models, from Model 1 (Common Authors) to Model 4 (Community)]
Iteration & Tuning: Feature Influence
For feature importance, the
Spark random forest averages
the reduction in impurity
across all trees in the forest
Feature rankings are in
comparison to the group of
features evaluated
Also try PageRank!
Try removing different features
(LabelPropagation)
Graph Machine Learning Workflow
1. Extract Data & Store as Graph: Data aggregation; Create and store graphs
2. Explore, Clean, Modify: Identify uninteresting features; Cleanse (outliers+); Feature engineering/extraction
3. Prepare for Machine Learning: Train / Test split; Resample for meaningful representation (proportional, etc.)
4. Train Models: Cross-validation; Model & variable selection; Hyperparameter tuning; Ensemble methods
5. Evaluate Results: Precision, accuracy, recall (ROC curve & AUC); SME Review
6. Productionize
Resources
neo4j.com
• /sandbox
• /developer/graph-algorithms/
• /graphacademy/online-training/
Data & Code:
• This example from O’Reilly book
bit.ly/2FPgGVV (ML Folder)
neo4j.com/graph-algorithms-book
Jennifer.Reif@neo4j.com @JMHReif
Amy.Hodler@neo4j.com @amyhodler
