Entity Resolution In Graph Data
An Active Learning Approach
William Lyon
@lyonwj
lyonwj.com
Sept 2017
Developer Relations Engineer @neo4j
will@neo4j.com
https://neo4j.com/graph-database-data-journalism-accelerator-program/
Entity Resolution
Deduplication
?
Example: Campaign Finance
contributions
committees
candidates
http://www.fec.gov/finance/disclosure/ftpdet.shtml
Datamodel
http://bit.ly/neo4jire
candidates
committees
contributions
Unique id???
Synthetic id on concat(name, zip)
https://github.com/johnymontana/neo4j-datasets/tree/master/us-fec-elections-2016
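A minimal sketch of the synthetic id idea, assuming each contribution row has `name` and `zip` fields (hypothetical field names; the slides do not show the FEC column names). The normalization steps and the `|` separator are also assumptions added for illustration.

```python
def synthetic_id(name: str, zip_code: str) -> str:
    """Build a contributor key by concatenating normalized name and ZIP."""
    # Collapse whitespace and uppercase so trivial variants share a key.
    normalized_name = " ".join(name.upper().split())
    # Assumption: ZIP+4 values should collapse to the 5-digit prefix.
    normalized_zip = zip_code.strip()[:5]
    return f"{normalized_name}|{normalized_zip}"


# Hypothetical sample rows, not taken from the FEC files.
rows = [
    {"name": "LOBLAW, BOB", "zip": "94401"},
    {"name": "Loblaw,  Bob", "zip": "944011234"},
    {"name": "LOBLAW, ROBERT", "zip": "94401"},
]
ids = [synthetic_id(r["name"], r["zip"]) for r in rows]
```

The first two rows collapse to one id, but "Bob" and "Robert" still get different ids, which is the gap entity resolution is meant to close.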
Sample data
Entity Resolution In Graph Data
?
1) Inferred relationships
2) Joining datasets
3) Aggregating traversals
1) Inferred relationships
LIVES_WITH
https://offshoreleaks.icij.org/pages/database
2) Joining datasets
http://www.lyonwj.com/2017/01/30/trumpworld-us-contracting-data-neo4j/
http://www.cnn.com/2017/05/19/politics/private-prisons/index.html
https://www.nytimes.com/2017/02/24/opinion/under-mr-trump
http://www.cnn.com/2017/08/18/politics/private-prison-department-of-justice/index.html
3) Aggregating traversals
Entity Resolution
An Active Learning Approach
Are these the same entity?
Name            Address                        Phone          Email
Bob Loblaw      111 E 5th Ave. San Mateo, CA   (none)         bob@neo4j.com
Robert Loblaw   111 5th Ave                    855-636-4532   bob@neo4j.com
Probably, but how to quantify?
Are these the same entity?
Name
Bob Loblaw
Robert Loblaw
String distance (similarity) metric
Edit distance
Edits required to convert “Bob Loblaw” to “Robert Loblaw”: 4
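The edit distance of 4 for the two names can be checked with a short Levenshtein implementation. This is a sketch that assumes unit cost for insertions, deletions and substitutions and case-sensitive comparison; the slides do not say which variant was used.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


name_distance = edit_distance("Bob Loblaw", "Robert Loblaw")
```

Here "B" is replaced by "R" and "e", "r", "t" are inserted, giving 4 edits.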
TF/IDF
• Term based
  • set of words
  • Order doesn’t matter
  • Words weighted by frequency (rare words count more)
• Pro:
  • takes advantage of frequency
  • Order doesn’t matter
    • William Cohen -vs- Cohen, William
• Con:
  • Spelling errors / abbreviations
  • Order doesn’t matter
    • City National Bank -vs- National City Bank
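The pros and cons above can be seen in a small term-based similarity sketch. It assumes a smoothed IDF and cosine similarity over TF/IDF vectors, which is one common formulation and not necessarily the one behind the slides. The tiny corpus is only the example strings from this deck.

```python
import math
import re
from collections import Counter


def tokens(text: str) -> list:
    """Lowercase word tokens; punctuation and word order are discarded."""
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_weights(corpus: list) -> dict:
    """Smoothed inverse document frequency: rarer terms weigh more."""
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(tokens(doc)))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf_cosine(a: str, b: str, idf: dict) -> float:
    """Cosine similarity between the TF/IDF vectors of two strings."""
    vec_a = {t: c * idf.get(t, 1.0) for t, c in Counter(tokens(a)).items()}
    vec_b = {t: c * idf.get(t, 1.0) for t, c in Counter(tokens(b)).items()}
    dot = sum(w * vec_b.get(t, 0.0) for t, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


corpus = ["William Cohen", "Cohen, William", "City National Bank",
          "National City Bank", "Bob Loblaw", "Robert Loblaw"]
idf = idf_weights(corpus)

reordered = tfidf_cosine("William Cohen", "Cohen, William", idf)
swapped = tfidf_cosine("City National Bank", "National City Bank", idf)
nickname = tfidf_cosine("Bob Loblaw", "Robert Loblaw", idf)
```

Both reordered pairs score as identical, which helps for "Cohen, William" but wrongly merges the two banks. "Bob" and "Robert" share only the "loblaw" token, so the nickname pair scores well below 1.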
Edit distance probabilistic extensions
• Gap distance
• Edit distance + HMM
Soundex
• Phonetic indexing scheme
• Genealogy
• Soundex code
• Hash
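A sketch of the Soundex code as a hash of how a name sounds, assuming the standard American Soundex rules: keep the first letter, map consonants to digits, collapse runs of the same digit, and pad to four characters.

```python
SOUNDEX_CODES = {
    **dict.fromkeys("BFPV", "1"),
    **dict.fromkeys("CGJKQSXZ", "2"),
    **dict.fromkeys("DT", "3"),
    "L": "4",
    **dict.fromkeys("MN", "5"),
    "R": "6",
}


def soundex(name: str) -> str:
    """American Soundex: first letter plus three digits, e.g. 'R163'."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0]
    previous = SOUNDEX_CODES.get(letters[0], "")
    for letter in letters[1:]:
        digit = SOUNDEX_CODES.get(letter, "")
        # Emit a digit only when it differs from the previous coded sound.
        if digit and digit != previous:
            code += digit
        # H and W do not separate equal codes; vowels and Y reset the run.
        if letter not in "HW":
            previous = digit
    return (code + "000")[:4]


robert_code = soundex("Robert")
bob_code = soundex("Bob")
```

"Robert" and "Rupert" hash to the same code, but "Bob" and "Robert" do not, so a phonetic code alone would not link the two example records.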
Are these the same entity?
         Name            Address                        Phone          Email
         Bob Loblaw      111 E 5th Ave. San Mateo, CA   (none)         bob@neo4j.com
         Robert Loblaw   111 5th Ave                    855-636-4532   bob@neo4j.com
Dist     4               17                             Null           0
Weight   0.2             0.03                           0.02           0.75
Weighted distance (Null counted as 0): (4*0.2)+(17*0.03)+(0*0.02)+(0*0.75) = 1.31
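The same arithmetic as a short sketch, using the distances and weights from the table. Treating a missing (Null) distance as 0 follows the calculation above; the function and variable names are illustrative.

```python
# Per-field distances and weights from the slide; None stands for Null.
distances = {"name": 4, "address": 17, "phone": None, "email": 0}
weights = {"name": 0.2, "address": 0.03, "phone": 0.02, "email": 0.75}


def weighted_distance(distances: dict, weights: dict) -> float:
    """Sum of weight * distance per field; a missing distance counts as 0."""
    total = 0.0
    for field, weight in weights.items():
        distance = distances.get(field)
        total += weight * (0 if distance is None else distance)
    return total


score = weighted_distance(distances, weights)  # approximately 1.31
```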
Where did the weights come from?
Active Learning
• Goal: learn weights for distance per field
• Example pairs w/ labels
• Minimize human labeling time
• Present borderline pairs to human for labeling
• Relearn weights
• Iterate
Name            Address                        Phone          Email
Bob Loblaw      111 E 5th Ave. San Mateo, CA   (none)         bob@neo4j.com
Robert Loblaw   111 5th Ave                    855-636-4532   bob@neo4j.com
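A sketch of the "present borderline pairs" step only, assuming a logistic model over per-field distances. The field weights reuse the example values from this deck; the bias and the extra candidate pairs are hypothetical. Labeling and relearning the weights are not shown.

```python
import math


def nonduplicate_probability(distances, weights, bias):
    """Logistic model over per-field distances (None counts as 0)."""
    z = bias + sum(w * (d or 0) for w, d in zip(weights, distances))
    return 1.0 / (1.0 + math.exp(-z))


def most_uncertain(candidates, weights, bias):
    """Pick the pair whose predicted probability is closest to 0.5."""
    return min(
        candidates,
        key=lambda c: abs(nonduplicate_probability(c["distances"], weights, bias) - 0.5),
    )


weights = [0.2, 0.03, 0.02, 0.75]  # name, address, phone, email
bias = -1.5                        # hypothetical intercept

# Distances for the first pair come from the deck; the others are made up.
candidates = [
    {"pair": ("Bob Loblaw", "Robert Loblaw"), "distances": [4, 17, None, 0]},
    {"pair": ("Bob Loblaw", "Bob Loblaw"), "distances": [0, 0, 0, 0]},
    {"pair": ("Bob Loblaw", "Alice Smith"), "distances": [10, 20, 12, 15]},
]
to_label = most_uncertain(candidates, weights, bias)
```

With these values the Bob/Robert pair is the one sent to a human, since the model is least sure about it.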
Logistic regression
• Categorical dependent variable
• binary (0,1)
• Classification
• Fit logistic function
https://en.wikipedia.org/wiki/Logistic_function
Logistic regression - example
Distance       0.50 0.75 1.00 1.25 1.50 1.75 1.75 2.00 2.25 2.50 2.75 3.00 3.25 3.50 4.00 4.25 4.50 4.75 5.00 5.50
Nonduplicate   0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1
[Plot: probability of non-duplicate (y-axis) vs edit distance (x-axis)]
            Coefficient   Std Err
Intercept   -4.0777       1.7610
Distance    1.5046        0.6287
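Plugging the fitted coefficients into the logistic function gives the probability of a non-duplicate at any distance. This sketch uses the intercept and distance coefficient from the table above; the function name is illustrative.

```python
import math

INTERCEPT = -4.0777  # intercept from the coefficient table
SLOPE = 1.5046       # coefficient on distance


def p_nonduplicate(distance: float) -> float:
    """Logistic function applied to intercept + slope * distance."""
    return 1.0 / (1.0 + math.exp(-(INTERCEPT + SLOPE * distance)))


# Distance at which the predicted probability crosses 0.5.
boundary = -INTERCEPT / SLOPE

p_at_2 = p_nonduplicate(2.0)
p_at_4 = p_nonduplicate(4.0)
```

The boundary falls near a distance of 2.71: pairs closer than that are predicted as likely duplicates, pairs further apart as likely non-duplicates.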
Blocking - making smart comparisons
• Problem: All-pairs comparison is expensive
  • (1000*999) / 2 = 499,500 pairs
• Duplicates are rare
  • almost all pairs are not duplicates
• Blocking functions limit record pairs to be compared
  • “Somewhat similar” records
Blocking - making smart comparisons
• Predicate blocks
  • whole field
  • token field
  • common integer
  • same three char start
  • common n gram
• Index block
  • inverted index
  • find similar records close to each other in the index
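A sketch of predicate blocking with the "same three char start" predicate, assuming records are dicts with a `name` field. The record set is hypothetical and the helper names are illustrative, not the Dedupe API.

```python
from collections import defaultdict
from itertools import combinations


def all_pairs_count(n: int) -> int:
    """Number of unordered pairs among n records."""
    return n * (n - 1) // 2


def three_char_start(record: dict) -> str:
    """Predicate: first three characters of the name, lowercased."""
    return record["name"].lower()[:3]


def candidate_pairs(records: dict, predicate) -> set:
    """Compare only records that fall into the same block."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        blocks[predicate(record)].append(record_id)
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs


# Hypothetical records for illustration.
records = {
    1: {"name": "Robert Loblaw"},
    2: {"name": "Robbie Loblaw"},
    3: {"name": "Bob Loblaw"},
    4: {"name": "Bobby Loblaw"},
    5: {"name": "Alice Smith"},
}
pairs = candidate_pairs(records, three_char_start)
```

Here 10 possible pairs shrink to 2 candidates. The Robert/Bob pair is missed by this single predicate, which suggests why several predicates are usually combined.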
Code samples
Dedupe by DataMade
• Entity resolution Python library
• Active learning
• Logistic regression
• Blocking
• Generate CSV or query database directly
• Cypher, Python driver for Neo4j
https://github.com/dedupeio
Training
Active Learning
Writing results
Resources
Neo4j Sandbox
neo4jsandbox.com
GraphConnect
http://graphconnect.com/ Promo code: INTUIT50
(you)-[:HAVE]->(?)
(?)<-[:ANSWERS]-(will)
