ETL into Neo4j
    Max De Marzi
About Me
    Built the Neography Gem (Ruby
    Wrapper to the Neo4j REST API)
    Playing with Neo4j since 10/2009


•   My Blog: http://maxdemarzi.com
•   Find me on Twitter: @maxdemarzi
•   Email me: maxdemarzi@gmail.com
•   GitHub: http://github.com/maxdemarzi
Agenda
•   ETL your mind
•   ETL with Batch and the REST API
•   ETL with Gremlin and Groovy
•   ETL with the Batch Importer
•   ETL from SQL
ETL your Mind

You have to start there
More Relational than Relational
Stop thinking about how tables are related
Start thinking about relationships
Objects like to mingle
Optimized for “trees” of data
Optimized for seeing the forest and the trees, and the branches, and the trunks
SELECT skills.*, user_skill.*
FROM users
JOIN user_skill ON users.id = user_skill.user_id
JOIN skills ON user_skill.skill_id = skills.id
WHERE users.id = 1
START user = node(1)
MATCH user -[user_skill]-> skill
RETURN skill, user_skill
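The difference between the two queries above can be sketched in a few lines of Python. This is only an illustration with made-up rows (the names, the `level` property and the helper functions are assumptions, not part of the talk): the join scans the link table, while the traversal starts at one node and follows the relationships it already holds.

```python
from collections import defaultdict

# Toy rows mirroring the users / user_skill / skills tables (values are made up).
skills = {10: {"name": "ruby"}, 11: {"name": "neo4j"}}
user_skill = [
    {"user_id": 1, "skill_id": 10, "level": "expert"},
    {"user_id": 1, "skill_id": 11, "level": "novice"},
    {"user_id": 2, "skill_id": 10, "level": "novice"},
]


def skills_via_join(user_id):
    """Relational style: scan the whole join table, then look up each skill by key."""
    return [(skills[row["skill_id"]], row["level"])
            for row in user_skill if row["user_id"] == user_id]


def build_adjacency(join_rows):
    """Graph style setup: each user keeps a list of its own outgoing relationships."""
    outgoing = defaultdict(list)
    for row in join_rows:
        outgoing[row["user_id"]].append((row["level"], row["skill_id"]))
    return outgoing


def skills_via_traversal(outgoing, user_id):
    """Start at one node and follow only its relationships; no table scan."""
    return [(skills[skill_id], level) for level, skill_id in outgoing[user_id]]


adjacency = build_adjacency(user_skill)
```

Both functions return the same pairs for user 1; the traversal simply never looks at rows that belong to other users.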
Property Graph
Relational tables:

Language            LanguageCountry     Country
language_code       language_code       country_code
language_name       country_code        country_name
word_count          primary             flag_uri

Property graph:

Language            IS_SPOKEN_IN        Country
name                as_primary          name
code                                    code
word_count                              flag_uri
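As a rough sketch of the mapping shown above, the join table becomes a relationship and its extra column becomes a relationship property. The sample rows, the `to_property_graph` helper and the in-memory node and relationship shapes are assumptions made for illustration only.

```python
# Made-up sample rows using the column names from the slide.
languages = [
    {"language_code": "en", "language_name": "English", "word_count": 1000},
    {"language_code": "fr", "language_name": "French", "word_count": 900},
]
countries = [
    {"country_code": "CA", "country_name": "Canada", "flag_uri": "ca.png"},
]
language_country = [
    {"language_code": "en", "country_code": "CA", "primary": True},
    {"language_code": "fr", "country_code": "CA", "primary": False},
]


def to_property_graph(languages, countries, language_country):
    """Turn entity rows into nodes and join rows into IS_SPOKEN_IN relationships."""
    nodes = {}
    for row in languages:
        nodes[("Language", row["language_code"])] = {
            "name": row["language_name"],
            "code": row["language_code"],
            "word_count": row["word_count"],
        }
    for row in countries:
        nodes[("Country", row["country_code"])] = {
            "name": row["country_name"],
            "code": row["country_code"],
            "flag_uri": row["flag_uri"],
        }
    relationships = [
        {
            "start": ("Language", row["language_code"]),
            "type": "IS_SPOKEN_IN",
            "end": ("Country", row["country_code"]),
            # The join table's extra column moves onto the relationship.
            "properties": {"as_primary": row["primary"]},
        }
        for row in language_country
    ]
    return nodes, relationships


nodes, relationships = to_property_graph(languages, countries, language_country)
```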
name: “Canada”
languages_spoken: “[ 'English', 'French' ]”

language: “English”    spoken_in    name: “USA”
                                    name: “Canada”
language: “French”     spoken_in    name: “France”
Country
name
flag_uri
language_name
number_of_words
yes_in_language
no_in_language
currency_code
currency_name

Country             SPEAKS              Language
name                                    name
flag_uri                                number_of_words
                                        yes
                                        no

Currency
code
name
ETL with Batch and the REST API
Batch command from REST API




Great for importing Facebook/Twitter friends

Keep each request under 10k commands

Preferably send a request every 2k to 5k commands
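One way to follow the sizing advice above is to split the command list into chunks before sending. The sketch below only builds the JSON payloads and sends nothing. The job shape (`method`, `to`, `body`, `id`) is assumed to follow the legacy Neo4j REST batch format, and the helper names and the default chunk size of 2,000 are illustrative choices, so check them against your server's documentation.

```python
import json


def create_node_command(job_id, properties):
    """One batch job that creates a node (job shape assumed from the legacy REST batch API)."""
    return {"method": "POST", "to": "/node", "body": properties, "id": job_id}


def chunked(commands, size=2000):
    """Yield consecutive slices of at most `size` commands."""
    if size <= 0:
        raise ValueError("size must be positive")
    for start in range(0, len(commands), size):
        yield commands[start:start + size]


def batch_payloads(commands, size=2000):
    """Serialize each chunk as one JSON array, ready to POST as a single batch request."""
    return [json.dumps(chunk) for chunk in chunked(commands, size)]


# 4,500 made-up friends become three requests of 2,000, 2,000 and 500 commands.
commands = [create_node_command(i, {"name": "friend %d" % i}) for i in range(4500)]
payloads = batch_payloads(commands)
```

Note that a job can only refer to another job's result inside the same request, so commands that depend on each other should land in the same chunk.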
Using Batch from Neography
Why Batch
 Transactional: if any command fails, nothing is committed.

 Ordered: responses are guaranteed to come back in the same order as sent.

 Continuous: keep loading or updating nodes and relationships, in spurts or as a stream.
ETL with Gremlin and Groovy
Commit every 1,000 changes or so, and make sure to stop the transaction at the very end so that the last few changes are committed too.
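The commit pattern described above, sketched in Python rather than Gremlin. `RecordingGraph` is a hypothetical stand-in that only counts commits; it is not the Gremlin or Neo4j API, and `commit_every=1000` simply mirrors the rule of thumb from the slide.

```python
class RecordingGraph:
    """Hypothetical stand-in for a transactional graph; it only counts what happens."""

    def __init__(self):
        self.pending = 0      # changes made since the last commit
        self.committed = 0    # changes made durable so far
        self.commits = 0      # number of commits issued

    def add_vertex(self, properties):
        self.pending += 1

    def commit(self):
        self.committed += self.pending
        self.pending = 0
        self.commits += 1


def load(graph, rows, commit_every=1000):
    """Add one vertex per row, committing every `commit_every` changes."""
    for count, row in enumerate(rows, start=1):
        graph.add_vertex(row)
        if count % commit_every == 0:
            graph.commit()
    # Easy to forget: the last partial batch still needs a commit.
    if graph.pending:
        graph.commit()


graph = RecordingGraph()
load(graph, ({"id": i} for i in range(2500)))
```

With 2,500 rows this issues three commits: two full batches of 1,000 and one final commit for the remaining 500.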


Look into auto-indexing to make life easier.

Auto-indexing is disabled by default. See the docs for a trick to make it a full-text index instead of an exact index.

http://docs.neo4j.org/chunked/milestone/auto-indexing.html
Crazy Format is ok

Id :: Title :: Genre|Genre|Genre

But it’s preferable to steer clear of characters that need escaping, like “|”.

The location of the data file is given as a string and converted to a URL, then the file is processed one line at a time.
For each line, a movie vertex is created, a genre vertex is created unless it already exists (index lookup), and an edge from the movie to the genre is created.

Full walk-through on http://maxdemarzi.com/2012/01/13/neo4j-on-heroku-part-one/
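A small Python sketch of the per-line logic just described, using plain dictionaries in place of the graph and its index. The `hasGenre` edge label, the function names and the sample lines are assumptions for illustration; the original walk-through uses Gremlin and Groovy.

```python
def parse_movie_line(line):
    """Split 'Id :: Title :: Genre|Genre|Genre' into its three parts."""
    movie_id, title, genres = [part.strip() for part in line.split("::")]
    # str.split takes the separator literally, so "|" needs no escaping here.
    return int(movie_id), title, [g for g in genres.split("|") if g]


def load_movies(lines):
    """Create a movie per line, get-or-create each genre, and link them."""
    movies = {}        # movie id -> properties
    genre_index = {}   # genre name -> properties (plays the role of the index lookup)
    edges = []         # (movie id, label, genre name)
    for line in lines:
        if not line.strip():
            continue
        movie_id, title, genres = parse_movie_line(line)
        movies[movie_id] = {"title": title}
        for name in genres:
            genre_index.setdefault(name, {"name": name})  # create only if missing
            edges.append((movie_id, "hasGenre", name))
    return movies, genre_index, edges


sample = [
    "1::Toy Story (1995)::Animation|Comedy",
    "2::Jumanji (1995)::Adventure|Comedy",
]
movies, genre_index, edges = load_movies(sample)
```

The genre "Comedy" appears in both lines but is created once, which is the role the index lookup plays in the real import.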
ETL with the Batch Importer
Installation Walk-Through
Testing it




7.5M nodes, 42M relationships in just over 3 minutes on a laptop.
Loading it into Neo4j




Full walk-through on http://maxdemarzi.com/2012/02/28/batch-importer-part-1/
When to use the Batch Importer?

           • 1st time loading or
             periodic reloading

           • When you need Speed

           • When you don’t mind a
             little Java
ETL from SQL
Identities who vouched for each other




row_number() and INTO are our friends
The “term” vouched for will serve as our relationship type; status becomes a relationship property.
Notice there are no node ids.
These are assigned automatically; clkao is node 1.
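The idea behind these slides, sketched in Python instead of SQL: number each identity in order of first appearance, much like row_number() would, then emit one relationship row per vouch with the term as the type and the status as a property. The `voucher` and `vouchee` field names and all sample values except clkao are made up for illustration.

```python
def build_import_rows(vouches):
    """Return (ordered node names, relationship rows) for a list of vouch records."""
    node_ids = {}
    for vouch in vouches:
        for name in (vouch["voucher"], vouch["vouchee"]):
            # Ids start at 1 and follow the order in which names first appear.
            node_ids.setdefault(name, len(node_ids) + 1)
    nodes = sorted(node_ids, key=node_ids.get)
    rels = [
        (node_ids[v["voucher"]], node_ids[v["vouchee"]], v["term"], v["status"])
        for v in vouches
    ]
    return nodes, rels


vouches = [
    {"voucher": "clkao", "vouchee": "alice", "term": "perl", "status": "confirmed"},
    {"voucher": "alice", "vouchee": "clkao", "term": "ruby", "status": "pending"},
]
nodes, rels = build_import_rows(vouches)
```

The node rows carry no id column: the position in `nodes` is the id, which is why clkao, listed first, ends up as node 1.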
No time to get coffee   >8-[
What about multiple types of nodes?
No problem: just add the MAX(node_id) from the first table to the ids of the second.
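A minimal sketch of that offset, assuming each table numbers its own rows from 1. The `offset_ids` helper is hypothetical and written in Python for illustration; in SQL the same shift would be applied inside the export query.

```python
def offset_ids(first_table_ids, second_table_ids):
    """Map each id of the second table to a global node id that follows the first table."""
    offset = max(first_table_ids, default=0)  # plays the role of MAX(node_id)
    return {local_id: local_id + offset for local_id in second_table_ids}


# Three nodes in the first table, so the second table starts at node 4.
global_ids = offset_ids([1, 2, 3], [1, 2])
```

Relationships that point at the second kind of node must use the shifted ids as well.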




Full walk-through at:
http://maxdemarzi.com/2012/02/28/batch-importer-part-2/

Need help? E-mail me, or catch me on Google chat or Skype.

Please don’t be shy… and read my blog:
http://maxdemarzi.com
Thank you!
 http://maxdemarzi.com
