SlideShare a Scribd company logo
Document Classification with Neo4j 
(graphs)-[:are]->(everywhere) 
© All Rights Reserved 2014 | Neo Technology, Inc. 
@kennybastani 
Neo4j Developer Evangelist
© All Rights Reserved 2014 | Neo Technology, Inc. 
Agenda 
• Introduction to Neo4j 
• Introduction to Graph-based Document Classification 
• Graph-based Hierarchical Pattern Recognition 
• Generating a Vector Space Model for Recommendations 
• Graphify for Neo4j 
• U.S. Presidential Speech Transcript Analysis 
2
Introduction to Neo4j 
© All Rights Reserved 2014 | Neo Technology, Inc. 
3
The Property Graph Data Model 
© All Rights Reserved 2014 | Neo Technology, Inc. 
4
© All Rights Reserved 2014 | Neo Technology, Inc. 
John 
Sally 
Graph Databases 
Book 
5
© All Rights Reserved 2014 | Neo Technology, Inc. 
name: John 
age: 27 
name: Sally 
age: 32 
FRIEND_OF 
since: 01/09/2013 
title: Graph Databases 
authors: Ian Robinson, 
Jim Webber 
HAS_READ 
on: 2/03/2013 
rating: 5 
HAS_READ 
on: 02/09/2013 
rating: 4 
FRIEND_OF 
since: 01/09/2013 
6
The Relational Table Model 
© All Rights Reserved 2014 | Neo Technology, Inc. 
7
Customers Customer_Accounts Accounts 
© All Rights Reserved 2014 | Neo Technology, Inc. 
8
The Neo4j Browser 
© All Rights Reserved 2014 | Neo Technology, Inc. 
9
Neo4j Browser - finding help 
© All Rights Reserved 2014 | Neo Technology, Inc. 
http://localhost:7474/ 
10
Execute Cypher, Visualize 
© All Rights Reserved 2014 | Neo Technology, Inc. 
11
Introduction to Document Classification 
© All Rights Reserved 2014 | Neo Technology, Inc. 
12
© All Rights Reserved 2014 | Neo Technology, Inc. 
Document Classification 
Automatically assign a document to one or more classes 
Documents may be classified according to their subjects or 
according to other attributes 
Automatically classify unlabeled documents to a set of relevant 
classes using labeled training data 
13
Example Use Cases for Document 
© All Rights Reserved 2014 | Neo Technology, Inc. 
Classification 
14
Sentiment Analysis for Movie Reviews 
Scenario: A movie website allows users to submit reviews describing what they 
either liked or disliked about a particular movie. 
© All Rights Reserved 2014 | Neo Technology, Inc. 
Problem: The user reviews are unstructured text. 
How do I automatically generate a score indicating whether the review was 
positive or negative? 
Solution: Train a natural language parsing model on a dataset that has been 
labeled in previous reviews as either positive or negative. 
15
Recommend Relevant Tags 
Scenario: A Q/A website allows users to submit questions and receive answers 
from other users. 
Problem: Users sometime do not know what tags to apply to their questions in 
order to increase discoverability for receiving answers. 
Solution: Automatically recommend the most relevant tags for questions by 
classifying the text from training on previous questions. 
© All Rights Reserved 2014 | Neo Technology, Inc. 
16
Recommend Similar Articles 
Scenario: A news website provides hundreds of new articles a day to users on a 
broad range of topics. 
Problem: The site needs to increase user engagement and time spent on the site. 
Solution: Train natural language parsing models for daily articles in order to 
provide recommendations for highly relevant articles at the bottom of each page. 
© All Rights Reserved 2014 | Neo Technology, Inc. 
17
How Automated Document Classification Works 
© All Rights Reserved 2014 | Neo Technology, Inc. 
18
Label 
© All Rights Reserved 2014 | Neo Technology, Inc. 
X Y 
Document 
Document 
Document 
Document 
Label Label 
Assign a set of labels that describes the 
document’s text 
Supervised Learning 
Step 1: Create a Training Dataset 
Z 
19
Step 2: Train a Natural Language Parsing Model 
p 
X Y 
= State Machine 
© All Rights Reserved 2014 | Neo Technology, Inc. 
Deep feature representations are selected and 
learned using an evolutionary algorithm 
State machines represent predicates that evaluate to 
0 or 1 for a text match 
State machines map to classes of document labels 
that matched text during training 
Deep Learning 
p p 
p p p 
Class 
Class 
Z 
Class 
20
cos(θ) 
© All Rights Reserved 2014 | Neo Technology, Inc. 
Unlabeled Document 
The natural language parsing model is 
used to classify other unlabeled 
documents 
X 
Class 
Y 
Class 
Z 
Class 
0.99 
0.67 
0.01 
cos(θ) 
cos(θ) 
Step 3: Classify Unlabeled Documents 
21
Hierarchical Pattern Recognition 
© All Rights Reserved 2014 | Neo Technology, Inc. 
(HPR) 
22
What is Hierarchical Pattern Recognition (HPR)? 
HPR is a graph-based deep learning algorithm I 
created that learns deep feature representations in 
linear time — 
I created the algorithm to do graph-based traversals 
using a hierarchy of finite state machines (FSM). 
Designed for scalable performance in P time: 
© All Rights Reserved 2014 | Neo Technology, Inc. 
23
Influences & Inspirations 
+ = 
p 
p p 
p p p 
X Y Z 
© All Rights Reserved 2014 | Neo Technology, Inc. 
24 
Ray Kurzweil 
(Pattern Recognition Theory of Mind) 
Jeff Hawkins 
(Hierarchical Temporal Memory) 
Hierarchical Pattern Recognition
How does feature extraction work? 
p 
© All Rights Reserved 2014 | Neo Technology, Inc. 
25 
Hierarchical Pattern Recognition 
“Deep” feature representations are learned and associated 
with labels that are mapped to documents that the feature 
was discovered in. 
The feature hierarchy is translated into a Vector Space Model 
for classification on feature vectors generated from unlabeled 
text. 
p p 
p p p 
X Y Z 
HPR uses a probabilistic model in combination with an 
evolutionary algorithm to generate hierarchies of deep feature 
representations.
Graph-based feature learning 
© All Rights Reserved 2014 | Neo Technology, Inc. 
26
Learning new features from 
matches on training data 
© All Rights Reserved 2014 | Neo Technology, Inc. 
27
Cost Function for the Generations of Features 
Reproduction occurs after a threshold of matches has been 
exceeded for a feature. 
After replication the cost function is applied to increase that 
threshold every time the feature reproduces. 
is the current threshold on the feature node. 
is the minimum threshold, which I chose as 5 for new features. 
© All Rights Reserved 2014 | Neo Technology, Inc. 
Cost function: 
28
© All Rights 29 Reserved 2014 | Neo Technology, Inc.
Vector Space Model 
© All Rights Reserved 2014 | Neo Technology, Inc. 
30
Generating Feature Vectors 
The natural language parsing model created during training can be 
turned into a global feature index. 
This global feature index is a list of Neo4j internal IDs for every feature 
in the hierarchy. 
Using that global feature index, a multi-dimensional vector space is 
created with a length equal to the number of features in the hierarchy. 
© All Rights Reserved 2014 | Neo Technology, Inc. 
31
Relevance Rankings 
“Relevance rankings of documents in a keyword search can be 
calculated, using the assumptions of document similarities theory, by 
comparing the deviation of angles between each document vector and 
the original query vector where the query is represented as the same 
kind of vector as the documents.” - Wikipedia 
© All Rights Reserved 2014 | Neo Technology, Inc. 
32
Vector-based Cosine Similarity Measure 
In practice, it is easier to calculate the cosine of the angle between the 
vectors, instead of the angle itself: 
© All Rights Reserved 2014 | Neo Technology, Inc. 
33
Cosine Similarity & Vector Space Model 
© All Rights Reserved 2014 | Neo Technology, Inc. 
34
Vector-based Cosine Similarity Measure 
“The resulting similarity ranges from -1 meaning exactly opposite, to 1 
meaning exactly the same, with 0 usually indicating independence, 
and in-between values indicating intermediate similarity or 
dissimilarity.” 
© All Rights Reserved 2014 | Neo Technology, Inc. 
via Wikipedia 
35
Graphify for Neo4j 
© All Rights Reserved 2014 | Neo Technology, Inc. 
36
Graphify for Neo4j 
Graphify is a Neo4j unmanaged extension used for 
document and text classification using graph-based 
hierarchical pattern recognition. 
© All Rights Reserved 2014 | Neo Technology, Inc. 
https://github.com/kbastani/graphify 
37
Example Project 
Head over to the GitHub project page and clone it to your 
local machine. 
Follow the directions listed in the README.md to install the 
extension. 
Navigate to the /examples directory of the project. 
© All Rights Reserved 2014 | Neo Technology, Inc. 
Run: 
examples/graphify-examples-author/src/java/org/neo4j/nlp/examples/author/main.java 
38
U.S. Presidential Speech 
Transcript Analysis 
© All Rights Reserved 2014 | Neo Technology, Inc. 
39
Identify the Political Affiliation of a Presidential Speech 
This example ingests a set of texts from presidential speeches with 
labels from the author of that speech in training phase. After building 
the training models, unlabeled presidential speeches are classified in 
the test phase. 
© All Rights Reserved 2014 | Neo Technology, Inc. 
40
The Presidents 
© All Rights Reserved 2014 | Neo Technology, Inc. 
• Ronald Reagan 
• labels: liberal, republican, ronald-reagan 
• George H.W. Bush 
• labels: conservative, republican, bush41 
• Bill Clinton 
• labels: liberal, democrat, bill-clinton 
• George W. Bush 
• labels: conservative, republican, bush43 
• Barack Obama 
• labels: liberal, democrat, barack-obama 
41
© All Rights Reserved 2014 | Neo Technology, Inc. 
Training 
Each of the presidents in the example have 6 speeches to analyze. 
4 of the speeches are used to build a natural language parsing model. 
2 of the speeches are used to test the validity of that model. 
42
Get Similar Labels/Classes 
© All Rights Reserved 2014 | Neo Technology, Inc. 
43
Ronald Reagan 
republican 0.7182046285385341 
liberal 0.644281223102398 
democrat 0.4854114595950056 
conservative 0.4133639188595147 
bill-clinton 0.4057969121945167 
barack-obama 0.323947855372623 
bush41 0.3222644898334092 
bush43 0.3161309849153592 
© All Rights Reserved 2014 | Neo Technology, Inc. 
Class Similarity 
44
George H.W. Bush 
conservative 0.7032274806766954 
republican 0.6047256274615608 
liberal 0.4439742461594541 
democrat 0.39114918238853674 
bill-clinton 0.3234223107986785 
ronald-reagan 0.3222644898334092 
barack-obama 0.2929260544514002 
bush43 0.29106733975087984 
© All Rights Reserved 2014 | Neo Technology, Inc. 
Class Similarity 
45
democrat 0.8375678825642422 
liberal 0.7847858060182163 
republican 0.5561860529059708 
conservative 0.45365774896422445 
barack-obama 0.4507676679770066 
ronald-reagan 0.4057969121945167 
bush43 0.365042482383354 
bush41 0.3234223107986785 
© All Rights Reserved 2014 | Neo Technology, Inc. 
Bill Clinton 
Class Similarity 
46
George W. Bush 
conservative 0.820636570272315 
republican 0.7056890956512284 
liberal 0.5075788396061254 
democrat 0.4505424322086937 
bill-clinton 0.365042482383354 
barack-obama 0.33801949243378965 
ronald-reagan 0.3161309849153592 
bush41 0.29106733975087984 
© All Rights Reserved 2014 | Neo Technology, Inc. 
Class Similarity 
47
Barack Obama 
democrat 0.7668017370739147 
liberal 0.7184792203867296 
republican 0.4847680475425114 
bill-clinton 0.4507676679770066 
conservative 0.4149264161292232 
bush43 0.33801949243378965 
ronald-reagan 0.323947855372623 
bush41 0.2929260544514002 
© All Rights Reserved 2014 | Neo Technology, Inc. 
Class Similarity 
48
Get involved in the Neo4j community 
© All Rights Reserved 2014 | Neo Technology, Inc. 
49
http://stackoverflow.com/questions/tagged/neo4j 
© All Rights Reserved 2014 | Neo Technology, Inc. 
50
http://groups.google.com/group/neo4j 
© All Rights Reserved 2014 | Neo Technology, Inc. 
51
https://github.com/neo4j/neo4j/issues 
© All Rights Reserved 2014 | Neo Technology, Inc. 
52
http://neo4j.meetup.com/ 
© All Rights Reserved 2014 | Neo Technology, Inc. 
53
© All Rights Reserved 2014 | Neo Technology, Inc. 
(Thank You) 
54
Twitter www.twitter.com/kennybastani 
LinkedIn www.linkedin.com/in/kennybastani 
GitHub www.github.com/kbastani 
© All Rights Reserved 2014 | Neo Technology, Inc. 
Get in touch 
55

More Related Content

PPTX
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
PPTX
Intro to Neo4j
PDF
Intro to Neo4j and Graph Databases
PPTX
論文要約:AUGMIX: A SIMPLE DATA PROCESSING METHOD TO IMPROVE ROBUSTNESS AND UNCERT...
PPTX
03 hive query language (hql)
PDF
Knowledge graphs, meet Deep Learning
PPT
Date warehousing concepts
PPTX
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Hadoop Tutorial For Beginners | Apache Hadoop Tutorial For Beginners | Hadoop...
Intro to Neo4j
Intro to Neo4j and Graph Databases
論文要約:AUGMIX: A SIMPLE DATA PROCESSING METHOD TO IMPROVE ROBUSTNESS AND UNCERT...
03 hive query language (hql)
Knowledge graphs, meet Deep Learning
Date warehousing concepts
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...

What's hot (20)

PPTX
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
PPTX
Microsoft Data Integration Pipelines: Azure Data Factory and SSIS
PDF
Neo4j Jesus Barrasa The Art of the Possible with Graph
PDF
Apache Spark Crash Course
PDF
Graphs for Finance - AML with Neo4j Graph Data Science
PPTX
Introduction to Graph Databases
PDF
Intepretability / Explainable AI for Deep Neural Networks
PDF
Unlocking Geospatial Analytics Use Cases with CARTO and Databricks
ZIP
NoSQL databases
PDF
【学会発表】U-Net++とSE-Netを統合した画像セグメンテーションのための転移学習モデル【IBIS2020】
PPTX
Big Data & Hadoop Introduction
PPTX
CNNの可視化手法Grad-CAMの紹介~CNNさん、あなたはどこを見ているの?~ | OHS勉強会#6
ODP
Neo4j Spatial - Backing a GIS with a true graph database
PDF
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
PDF
FIBO in Neo4j: Applying Knowledge Graphs in the Financial Industry
PDF
SSII2022 [SS1] ニューラル3D表現の最新動向〜 ニューラルネットでなんでも表せる?? 〜​
PPTX
Hadoop technology
PDF
Intro to Graphs and Neo4j
PDF
Building Applications with a Graph Database
PDF
Discovery of Linear Acyclic Models Using Independent Component Analysis
Introduction To Hadoop | What Is Hadoop And Big Data | Hadoop Tutorial For Be...
Microsoft Data Integration Pipelines: Azure Data Factory and SSIS
Neo4j Jesus Barrasa The Art of the Possible with Graph
Apache Spark Crash Course
Graphs for Finance - AML with Neo4j Graph Data Science
Introduction to Graph Databases
Intepretability / Explainable AI for Deep Neural Networks
Unlocking Geospatial Analytics Use Cases with CARTO and Databricks
NoSQL databases
【学会発表】U-Net++とSE-Netを統合した画像セグメンテーションのための転移学習モデル【IBIS2020】
Big Data & Hadoop Introduction
CNNの可視化手法Grad-CAMの紹介~CNNさん、あなたはどこを見ているの?~ | OHS勉強会#6
Neo4j Spatial - Backing a GIS with a true graph database
Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...
FIBO in Neo4j: Applying Knowledge Graphs in the Financial Industry
SSII2022 [SS1] ニューラル3D表現の最新動向〜 ニューラルネットでなんでも表せる?? 〜​
Hadoop technology
Intro to Graphs and Neo4j
Building Applications with a Graph Database
Discovery of Linear Acyclic Models Using Independent Component Analysis
Ad

Viewers also liked (19)

PPT
Natural language search using Neo4j
PDF
Natural Language Processing with Graph Databases and Neo4j
PPT
Natural Language Processing with Neo4j
PDF
Building a Graph-based Analytics Platform
PDF
Data Modeling with Neo4j
PDF
Open Source Big Graph Analytics on Neo4j with Apache Spark
PPT
Big Graph Analytics on Neo4j with Apache Spark
PDF
Graph database Use Cases
PPTX
Neo4J Open Source Graph Database
PDF
20141216 graph database prototyping ams meetup
PPT
Dnc Day 4 – Obama Speech
PDF
The impact of language planning, terminology planning, and arabicization, on ...
PDF
Meryl streep took a stand against donald trump
PDF
AP Invoice Processing for JD Edwards_Bottomline Technologies
KEY
Document Classification In PHP
PPT
The war on terrorism
PDF
M893 & m894 seahawks contest
PPTX
Visual Resume
PPTX
Adivina de _quienes_son_las_siguientes_cansiones[1]
Natural language search using Neo4j
Natural Language Processing with Graph Databases and Neo4j
Natural Language Processing with Neo4j
Building a Graph-based Analytics Platform
Data Modeling with Neo4j
Open Source Big Graph Analytics on Neo4j with Apache Spark
Big Graph Analytics on Neo4j with Apache Spark
Graph database Use Cases
Neo4J Open Source Graph Database
20141216 graph database prototyping ams meetup
Dnc Day 4 – Obama Speech
The impact of language planning, terminology planning, and arabicization, on ...
Meryl streep took a stand against donald trump
AP Invoice Processing for JD Edwards_Bottomline Technologies
Document Classification In PHP
The war on terrorism
M893 & m894 seahawks contest
Visual Resume
Adivina de _quienes_son_las_siguientes_cansiones[1]
Ad

Similar to Document Classification with Neo4j (20)

PDF
Remote Desktop Manager Enterprise 2024.3.29
PDF
Capcut Pro Crack For PC Latest 2025 Version
PDF
LDPlayer Free Download (Latest version 2025)
PDF
Apple Logic Pro X for MacOS Free Download
PDF
Neo4j Generative AI workshop at GraphSummit London 14 Nov 2023.pdf
PDF
IA Generativa y Grafos de Neo4j: RAG time
PDF
Neo4j y GenAI
PDF
Webinar - IA generativa e grafi Neo4j: RAG time!
PPTX
Demystifying Graph Neural Networks
PDF
YouTube Downloader v3.4.9 APK Download
PDF
Wondershare UniConverter for MacOS Download
PDF
Minitab Free crack Download (Latest 2025)
PDF
TunesKit Video Repair 2.0.0.11 Free Download
PPTX
GraphSummit Milan & Stockholm - Neo4j: The Art of the Possible with Graph
PDF
Transforming AI with Graphs: Real World Examples using Spark and Neo4j
PDF
Transforming AI with Graphs: Real World Examples using Spark and Neo4j
PDF
Graph Algorithms for Developers
PDF
3. Relationships Matter: Using Connected Data for Better Machine Learning
PDF
GPT and Graph Data Science to power your Knowledge Graph
PPTX
Knowledge Graphs and Generative AI_GraphSummit Minneapolis Sept 20.pptx
Remote Desktop Manager Enterprise 2024.3.29
Capcut Pro Crack For PC Latest 2025 Version
LDPlayer Free Download (Latest version 2025)
Apple Logic Pro X for MacOS Free Download
Neo4j Generative AI workshop at GraphSummit London 14 Nov 2023.pdf
IA Generativa y Grafos de Neo4j: RAG time
Neo4j y GenAI
Webinar - IA generativa e grafi Neo4j: RAG time!
Demystifying Graph Neural Networks
YouTube Downloader v3.4.9 APK Download
Wondershare UniConverter for MacOS Download
Minitab Free crack Download (Latest 2025)
TunesKit Video Repair 2.0.0.11 Free Download
GraphSummit Milan & Stockholm - Neo4j: The Art of the Possible with Graph
Transforming AI with Graphs: Real World Examples using Spark and Neo4j
Transforming AI with Graphs: Real World Examples using Spark and Neo4j
Graph Algorithms for Developers
3. Relationships Matter: Using Connected Data for Better Machine Learning
GPT and Graph Data Science to power your Knowledge Graph
Knowledge Graphs and Generative AI_GraphSummit Minneapolis Sept 20.pptx

More from Kenny Bastani (9)

PDF
In the Eventual Consistency of Succeeding at Microservices
PDF
Building Cloud Native Architectures with Spring
PDF
Extending the Platform with Spring Boot and Cloud Foundry
PDF
Back your app with MySQL and Redis on Cloud Foundry
PDF
Using Docker, Neo4j, and Spring Cloud for Developing Microservices
PDF
Cloud Native Java Microservices
PPTX
Building REST APIs with Spring Boot and Spring Cloud
PDF
Neo4j Graph Data Modeling
PDF
Building Killer Apps with Neo4j 2.0
In the Eventual Consistency of Succeeding at Microservices
Building Cloud Native Architectures with Spring
Extending the Platform with Spring Boot and Cloud Foundry
Back your app with MySQL and Redis on Cloud Foundry
Using Docker, Neo4j, and Spring Cloud for Developing Microservices
Cloud Native Java Microservices
Building REST APIs with Spring Boot and Spring Cloud
Neo4j Graph Data Modeling
Building Killer Apps with Neo4j 2.0

Recently uploaded (20)

PDF
Agricultural_Statistics_at_a_Glance_2022_0.pdf
PDF
Network Security Unit 5.pdf for BCA BBA.
PPTX
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
PPT
Teaching material agriculture food technology
PDF
Chapter 3 Spatial Domain Image Processing.pdf
PDF
Review of recent advances in non-invasive hemoglobin estimation
PDF
Diabetes mellitus diagnosis method based random forest with bat algorithm
PDF
Mobile App Security Testing_ A Comprehensive Guide.pdf
PPTX
sap open course for s4hana steps from ECC to s4
PDF
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
PPTX
MYSQL Presentation for SQL database connectivity
PDF
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
PDF
cuic standard and advanced reporting.pdf
PDF
Electronic commerce courselecture one. Pdf
PDF
Empathic Computing: Creating Shared Understanding
PDF
Encapsulation theory and applications.pdf
PDF
Dropbox Q2 2025 Financial Results & Investor Presentation
PDF
Machine learning based COVID-19 study performance prediction
PDF
Advanced methodologies resolving dimensionality complications for autism neur...
PDF
gpt5_lecture_notes_comprehensive_20250812015547.pdf
Agricultural_Statistics_at_a_Glance_2022_0.pdf
Network Security Unit 5.pdf for BCA BBA.
VMware vSphere Foundation How to Sell Presentation-Ver1.4-2-14-2024.pptx
Teaching material agriculture food technology
Chapter 3 Spatial Domain Image Processing.pdf
Review of recent advances in non-invasive hemoglobin estimation
Diabetes mellitus diagnosis method based random forest with bat algorithm
Mobile App Security Testing_ A Comprehensive Guide.pdf
sap open course for s4hana steps from ECC to s4
Blue Purple Modern Animated Computer Science Presentation.pdf.pdf
MYSQL Presentation for SQL database connectivity
Profit Center Accounting in SAP S/4HANA, S4F28 Col11
cuic standard and advanced reporting.pdf
Electronic commerce courselecture one. Pdf
Empathic Computing: Creating Shared Understanding
Encapsulation theory and applications.pdf
Dropbox Q2 2025 Financial Results & Investor Presentation
Machine learning based COVID-19 study performance prediction
Advanced methodologies resolving dimensionality complications for autism neur...
gpt5_lecture_notes_comprehensive_20250812015547.pdf

Document Classification with Neo4j

  • 1. Document Classification with Neo4j (graphs)-[:are]->(everywhere) © All Rights Reserved 2014 | Neo Technology, Inc. @kennybastani Neo4j Developer Evangelist
  • 2. © All Rights Reserved 2014 | Neo Technology, Inc. Agenda • Introduction to Neo4j • Introduction to Graph-based Document Classification • Graph-based Hierarchical Pattern Recognition • Generating a Vector Space Model for Recommendations • Graphify for Neo4j • U.S. Presidential Speech Transcript Analysis 2
  • 3. Introduction to Neo4j © All Rights Reserved 2014 | Neo Technology, Inc. 3
  • 4. The Property Graph Data Model © All Rights Reserved 2014 | Neo Technology, Inc. 4
  • 5. © All Rights Reserved 2014 | Neo Technology, Inc. John Sally Graph Databases Book 5
  • 6. © All Rights Reserved 2014 | Neo Technology, Inc. name: John age: 27 name: Sally age: 32 FRIEND_OF since: 01/09/2013 title: Graph Databases authors: Ian Robinson, Jim Webber HAS_READ on: 2/03/2013 rating: 5 HAS_READ on: 02/09/2013 rating: 4 FRIEND_OF since: 01/09/2013 6
  • 7. The Relational Table Model © All Rights Reserved 2014 | Neo Technology, Inc. 7
  • 8. Customers Customer_Accounts Accounts © All Rights Reserved 2014 | Neo Technology, Inc. 8
  • 9. The Neo4j Browser © All Rights Reserved 2014 | Neo Technology, Inc. 9
  • 10. Neo4j Browser - finding help © All Rights Reserved 2014 | Neo Technology, Inc. http://localhost:7474/ 10
  • 11. Execute Cypher, Visualize © All Rights Reserved 2014 | Neo Technology, Inc. 11
  • 12. Introduction to Document Classification © All Rights Reserved 2014 | Neo Technology, Inc. 12
  • 13. © All Rights Reserved 2014 | Neo Technology, Inc. Document Classification Automatically assign a document to one or more classes Documents may be classified according to their subjects or according to other attributes Automatically classify unlabeled documents to a set of relevant classes using labeled training data 13
  • 14. Example Use Cases for Document © All Rights Reserved 2014 | Neo Technology, Inc. Classification 14
  • 15. Sentiment Analysis for Movie Reviews Scenario: A movie website allows users to submit reviews describing what they either liked or disliked about a particular movie. © All Rights Reserved 2014 | Neo Technology, Inc. Problem: The user reviews are unstructured text. How do I automatically generate a score indicating whether the review was positive or negative? Solution: Train a natural language parsing model on a dataset that has been labeled in previous reviews as either positive or negative. 15
  • 16. Recommend Relevant Tags Scenario: A Q/A website allows users to submit questions and receive answers from other users. Problem: Users sometime do not know what tags to apply to their questions in order to increase discoverability for receiving answers. Solution: Automatically recommend the most relevant tags for questions by classifying the text from training on previous questions. © All Rights Reserved 2014 | Neo Technology, Inc. 16
  • 17. Recommend Similar Articles Scenario: A news website provides hundreds of new articles a day to users on a broad range of topics. Problem: The site needs to increase user engagement and time spent on the site. Solution: Train natural language parsing models for daily articles in order to provide recommendations for highly relevant articles at the bottom of each page. © All Rights Reserved 2014 | Neo Technology, Inc. 17
  • 18. How Automated Document Classification Works © All Rights Reserved 2014 | Neo Technology, Inc. 18
  • 19. Label © All Rights Reserved 2014 | Neo Technology, Inc. X Y Document Document Document Document Label Label Assign a set of labels that describes the document’s text Supervised Learning Step 1: Create a Training Dataset Z 19
  • 20. Step 2: Train a Natural Language Parsing Model p X Y = State Machine © All Rights Reserved 2014 | Neo Technology, Inc. Deep feature representations are selected and learned using an evolutionary algorithm State machines represent predicates that evaluate to 0 or 1 for a text match State machines map to classes of document labels that matched text during training Deep Learning p p p p p Class Class Z Class 20
  • 21. cos(θ) © All Rights Reserved 2014 | Neo Technology, Inc. Unlabeled Document The natural language parsing model is used to classify other unlabeled documents X Class Y Class Z Class 0.99 0.67 0.01 cos(θ) cos(θ) Step 3: Classify Unlabeled Documents 21
  • 22. Hierarchical Pattern Recognition © All Rights Reserved 2014 | Neo Technology, Inc. (HPR) 22
  • 23. What is Hierarchical Pattern Recognition (HPR)? HPR is a graph-based deep learning algorithm I created that learns deep feature representations in linear time — I created the algorithm to do graph-based traversals using a hierarchy of finite state machines (FSM). Designed for scalable performance in P time: © All Rights Reserved 2014 | Neo Technology, Inc. 23
  • 24. Influences & Inspirations + = p p p p p p X Y Z © All Rights Reserved 2014 | Neo Technology, Inc. 24 Ray Kurzweil (Pattern Recognition Theory of Mind) Jeff Hawkins (Hierarchical Temporal Memory) Hierarchical Pattern Recognition
  • 25. How does feature extraction work? p © All Rights Reserved 2014 | Neo Technology, Inc. 25 Hierarchical Pattern Recognition “Deep” feature representations are learned and associated with labels that are mapped to documents that the feature was discovered in. The feature hierarchy is translated into a Vector Space Model for classification on feature vectors generated from unlabeled text. p p p p p X Y Z HPR uses a probabilistic model in combination with an evolutionary algorithm to generate hierarchies of deep feature representations.
  • 26. Graph-based feature learning © All Rights Reserved 2014 | Neo Technology, Inc. 26
  • 27. Learning new features from matches on training data © All Rights Reserved 2014 | Neo Technology, Inc. 27
  • 28. Cost Function for the Generations of Features Reproduction occurs after a threshold of matches has been exceeded for a feature. After replication the cost function is applied to increase that threshold every time the feature reproduces. is the current threshold on the feature node. is the minimum threshold, which I chose as 5 for new features. © All Rights Reserved 2014 | Neo Technology, Inc. Cost function: 28
  • 29. © All Rights 29 Reserved 2014 | Neo Technology, Inc.
  • 30. Vector Space Model © All Rights Reserved 2014 | Neo Technology, Inc. 30
  • 31. Generating Feature Vectors The natural language parsing model created during training can be turned into a global feature index. This global feature index is a list of Neo4j internal IDs for every feature in the hierarchy. Using that global feature index, a multi-dimensional vector space is created with a length equal to the number of features in the hierarchy. © All Rights Reserved 2014 | Neo Technology, Inc. 31
  • 32. Relevance Rankings “Relevance rankings of documents in a keyword search can be calculated, using the assumptions of document similarities theory, by comparing the deviation of angles between each document vector and the original query vector where the query is represented as the same kind of vector as the documents.” - Wikipedia © All Rights Reserved 2014 | Neo Technology, Inc. 32
  • 33. Vector-based Cosine Similarity Measure In practice, it is easier to calculate the cosine of the angle between the vectors, instead of the angle itself: © All Rights Reserved 2014 | Neo Technology, Inc. 33
  • 34. Cosine Similarity & Vector Space Model © All Rights Reserved 2014 | Neo Technology, Inc. 34
  • 35. Vector-based Cosine Similarity Measure “The resulting similarity ranges from -1 meaning exactly opposite, to 1 meaning exactly the same, with 0 usually indicating independence, and in-between values indicating intermediate similarity or dissimilarity.” © All Rights Reserved 2014 | Neo Technology, Inc. via Wikipedia 35
  • 36. Graphify for Neo4j © All Rights Reserved 2014 | Neo Technology, Inc. 36
  • 37. Graphify for Neo4j Graphify is a Neo4j unmanaged extension used for document and text classification using graph-based hierarchical pattern recognition. © All Rights Reserved 2014 | Neo Technology, Inc. https://github.com/kbastani/graphify 37
  • 38. Example Project Head over to the GitHub project page and clone it to your local machine. Follow the directions listed in the README.md to install the extension. Navigate to the /examples directory of the project. © All Rights Reserved 2014 | Neo Technology, Inc. Run: examples/graphify-examples-author/src/java/org/neo4j/nlp/examples/author/main.java 38
  • 39. U.S. Presidential Speech Transcript Analysis © All Rights Reserved 2014 | Neo Technology, Inc. 39
  • 40. Identify the Political Affiliation of a Presidential Speech This example ingests a set of texts from presidential speeches with labels from the author of that speech in training phase. After building the training models, unlabeled presidential speeches are classified in the test phase. © All Rights Reserved 2014 | Neo Technology, Inc. 40
  • 41. The Presidents © All Rights Reserved 2014 | Neo Technology, Inc. • Ronald Reagan • labels: liberal, republican, ronald-reagan • George H.W. Bush • labels: conservative, republican, bush41 • Bill Clinton • labels: liberal, democrat, bill-clinton • George W. Bush • labels: conservative, republican, bush43 • Barack Obama • labels: liberal, democrat, barack-obama 41
  • 42. © All Rights Reserved 2014 | Neo Technology, Inc. Training Each of the presidents in the example have 6 speeches to analyze. 4 of the speeches are used to build a natural language parsing model. 2 of the speeches are used to test the validity of that model. 42
  • 43. Get Similar Labels/Classes © All Rights Reserved 2014 | Neo Technology, Inc. 43
  • 44. Ronald Reagan republican 0.7182046285385341 liberal 0.644281223102398 democrat 0.4854114595950056 conservative 0.4133639188595147 bill-clinton 0.4057969121945167 barack-obama 0.323947855372623 bush41 0.3222644898334092 bush43 0.3161309849153592 © All Rights Reserved 2014 | Neo Technology, Inc. Class Similarity 44
  • 45. George H.W. Bush conservative 0.7032274806766954 republican 0.6047256274615608 liberal 0.4439742461594541 democrat 0.39114918238853674 bill-clinton 0.3234223107986785 ronald-reagan 0.3222644898334092 barack-obama 0.2929260544514002 bush43 0.29106733975087984 © All Rights Reserved 2014 | Neo Technology, Inc. Class Similarity 45
  • 46. democrat 0.8375678825642422 liberal 0.7847858060182163 republican 0.5561860529059708 conservative 0.45365774896422445 barack-obama 0.4507676679770066 ronald-reagan 0.4057969121945167 bush43 0.365042482383354 bush41 0.3234223107986785 © All Rights Reserved 2014 | Neo Technology, Inc. Bill Clinton Class Similarity 46
  • 47. George W. Bush conservative 0.820636570272315 republican 0.7056890956512284 liberal 0.5075788396061254 democrat 0.4505424322086937 bill-clinton 0.365042482383354 barack-obama 0.33801949243378965 ronald-reagan 0.3161309849153592 bush41 0.29106733975087984 © All Rights Reserved 2014 | Neo Technology, Inc. Class Similarity 47
  • 48. Barack Obama democrat 0.7668017370739147 liberal 0.7184792203867296 republican 0.4847680475425114 bill-clinton 0.4507676679770066 conservative 0.4149264161292232 bush43 0.33801949243378965 ronald-reagan 0.323947855372623 bush41 0.2929260544514002 © All Rights Reserved 2014 | Neo Technology, Inc. Class Similarity 48
  • 49. Get involved in the Neo4j community © All Rights Reserved 2014 | Neo Technology, Inc. 49
  • 50. http://stackoverflow.com/questions/tagged/neo4j © All Rights Reserved 2014 | Neo Technology, Inc. 50
  • 51. http://groups.google.com/group/neo4j © All Rights Reserved 2014 | Neo Technology, Inc. 51
  • 52. https://github.com/neo4j/neo4j/issues © All Rights Reserved 2014 | Neo Technology, Inc. 52
  • 53. http://neo4j.meetup.com/ © All Rights Reserved 2014 | Neo Technology, Inc. 53
  • 54. © All Rights Reserved 2014 | Neo Technology, Inc. (Thank You) 54
  • 55. Twitter www.twitter.com/kennybastani LinkedIn www.linkedin.com/in/kennybastani GitHub www.github.com/kbastani © All Rights Reserved 2014 | Neo Technology, Inc. Get in touch 55

Editor's Notes

  • #6: When we think about data, we tend to think about how things are connected. This is a natural part of how we talk about things, and also of the graph model. “This is also a graph, but with some data attached. Here: we’ve attached names to the nodes and described the type of the relationships.”
  • #7: “We can take this further, and attach arbitrary key/value pairs” This is the Property Graph Model, which has the following characteristics: It contains Nodes and Relationships, both of which can contain properties (key-value pairs). Relationships are always between exactly 2 nodes. They have a type, and they are directed. “There are other graph models, however everyone in the industry has converged on the idea that this model is the most obvious and the most useful for real humans and the application we’re building”
  • #8: Let’s review the relational table model, to see the difference from the graph property model
  • #9: Start with Customers and Accounts “We have a customer, Alice.” “She’s got 3 accounts” “To keep track of which accounts Alice owns, we need a 3rd table, to store the mapping. Typically called a join table.”
  • #11: Dashboard, for monitoring of key stats Node, Relationship and Property “counts” are just estimates (actually represent the allocated ID space for each graph entity)
  • #12: “The Console is where you can run graph queries, written in Cypher.” We’ll be using this starting... now.
  • #24: Disclaimer: This is a graph-based approach to text classification and pattern recognition. This can be done in many different ways, including SVM, bayesian networks, belief networks, and many other approaches. I chose to create this on top of Neo4j because first its a database and second its already formatted as a network. This gives me the advantage of not worrying about data storage.
  • #27: Explain how the genetic algorithm works.
  • #41: I chose this example project because it’s easy to get presidential speeches online and it seemed like a good example to get others going with Graphify.
  • #50: “Get involved with the community, attend meetups, browse our open source code libraries, including Neo4j, by visiting us on GitHub.”
  • #51: “Visit stackoverflow.com with the tag Neo4j to get fast answers to your questions. We have a very active community of contributors that provide thorough answers 24/7. If you get stuck, make sure you head there.”
  • #52: “The same goes for Google groups, if you prefer that format over Stackoverflow.”
  • #53: “You can visit us on GitHub to submit or browse issues.”
  • #54: “Finally, I urge you to check out our website’s meetup page to find out where meetups are happening all around the world. Also we encourage you to share your experience with Neo4j, your applications, and your use cases by speaking at a local meetup. If you’re interested, please reach out to me, my contact details are in the next slide.”
  • #55: “Thank you for spending some time with me and learning about Neo4j and Cypher.”
  • #56: “Get in touch with me about meetups and Neo4j community events happening around the world.” “I’ll now open up the floor to questions.”