July, 2020
Shobhna Srivastava
Enhancing Search results with Graph
Neo4j/Elsevier
Context
■ Elsevier is a global information & analytics business specializing in science & health
■ Scopus – “Expertly curated abstract & citations database”
■ https://www.scopus.com/
IN PRODUCT
Problem definition
The old pipeline doesn't allow documents to be changed or enriched with new data points
The processing is fragile
The solution is costly
Hardware used
• 90-node Solr indexing cluster (separate from the live search cluster)
• Redshift
• EC2 instances for processing
Old document enrichment pipeline
• Index is created in Solr
• Redshift is updated from Solr
• New counts are calculated and diffed against the old Solr index
• The updates are applied to the Solr index
• Finally, the live Solr cluster is updated
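The count-and-diff step of the old pipeline can be sketched roughly as follows. This is a minimal illustration, not Elsevier's code: the function name `diff_counts`, the document ids, and the plain dictionaries standing in for Solr and Redshift are all assumptions.

```python
from typing import Dict


def diff_counts(old_index: Dict[str, int], new_counts: Dict[str, int]) -> Dict[str, int]:
    """Return only the documents whose count differs from the old index.

    Hypothetical stand-in for the "diffed against the old Solr index" step:
    both arguments map a document id to a count (for example a citation count).
    """
    return {
        doc_id: count
        for doc_id, count in new_counts.items()
        if old_index.get(doc_id) != count
    }


# Illustrative data only; the ids and counts are made up.
old_index = {"doc-1": 10, "doc-2": 3, "doc-3": 7}
new_counts = {"doc-1": 12, "doc-2": 3, "doc-3": 7, "doc-4": 1}

updates = diff_counts(old_index, new_counts)
print(updates)  # {'doc-1': 12, 'doc-4': 1}
```

Only the changed entries would then be applied to the index before the live cluster is refreshed.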
Bounded context
Runtime system – performance is important
Aware of starting node or nodes
Depth-first or breadth-first traversal
Metrics generation
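To make the two traversal orders concrete, here is a small sketch over an in-memory adjacency list. The graph, the node names, and the `max_depth` parameter are hypothetical; a graph database would run the equivalent traversal inside its own engine.

```python
from collections import deque
from typing import Dict, List

# Toy adjacency list; node names are hypothetical.
GRAPH: Dict[str, List[str]] = {
    "org:A": ["person:1", "person:2"],
    "person:1": ["work:x", "work:y"],
    "person:2": ["work:y"],
    "work:x": [],
    "work:y": [],
}


def bfs(graph: Dict[str, List[str]], start: str, max_depth: int) -> List[str]:
    """Breadth-first traversal from a known start node, limited to max_depth hops."""
    seen = {start}
    order = [start]
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand nodes at the depth limit
        for neighbour in graph.get(node, []):
            if neighbour not in seen:
                seen.add(neighbour)
                order.append(neighbour)
                queue.append((neighbour, depth + 1))
    return order


def dfs(graph: Dict[str, List[str]], start: str) -> List[str]:
    """Iterative depth-first traversal from a known start node."""
    seen = set()
    order = []
    stack = [start]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        order.append(node)
        # Reverse so neighbours are visited in their listed order.
        stack.extend(reversed(graph.get(node, [])))
    return order


print(bfs(GRAPH, "org:A", max_depth=1))  # ['org:A', 'person:1', 'person:2']
print(dfs(GRAPH, "org:A"))  # ['org:A', 'person:1', 'work:x', 'work:y', 'person:2']
```

Knowing the start node keeps either traversal local to the part of the graph that matters for the query.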
Why graph?
Classic multi-level graph traversals
Many-to-many relations on input data
Non-trivial & multi-level joins
Most enrichment is done on relationships and how data are connected to each other
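As an illustration of a multi-level join over many-to-many relations, the sketch below counts distinct works per organisation by following organisation -> person -> work. The relationship names `AFFILIATED_WITH` and `AUTHORED` and all data are assumptions made for the example; the slides do not show the actual data model.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

# Hypothetical relationship lists; each pair is one edge in the graph.
AFFILIATED_WITH: List[Tuple[str, str]] = [  # (person, organisation)
    ("alice", "org-1"),
    ("alice", "org-2"),
    ("bob", "org-1"),
]
AUTHORED: List[Tuple[str, str]] = [  # (person, work)
    ("alice", "work-a"),
    ("alice", "work-b"),
    ("bob", "work-b"),
    ("bob", "work-c"),
]


def works_per_organisation(
    affiliations: List[Tuple[str, str]],
    authorships: List[Tuple[str, str]],
) -> Dict[str, int]:
    """Count distinct works reachable from each organisation via its people."""
    works_by_person: Dict[str, Set[str]] = defaultdict(set)
    for person, work in authorships:
        works_by_person[person].add(work)

    works_by_org: Dict[str, Set[str]] = defaultdict(set)
    for person, org in affiliations:
        # Follow organisation -> person -> work; the set drops co-authored duplicates.
        works_by_org[org] |= works_by_person.get(person, set())

    return {org: len(works) for org, works in works_by_org.items()}


print(works_per_organisation(AFFILIATED_WITH, AUTHORED))
# {'org-1': 3, 'org-2': 2}
```

In a relational store the same question needs two joins plus de-duplication; in a graph it is a two-hop traversal from each organisation.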
Technology choice: Neo4j vs Neptune
Meets QPS
– Neo4j: ✓
– Neptune: ⚠ Much slower on queries that require longer traversals (e.g. "rolled-up" count queries per organisation: 7 ms on Neo4j vs 7 seconds on Neptune)
Scalability
– Neo4j: ⚠ Tested with a graph size that fits into cache; with a larger graph some smarter caching should be implemented
– Neptune: ⚠ Works fast on larger instances (supposedly because of the cache size), so with a larger graph some application-level optimisations might be required. A bit trickier than Neo4j because cache settings are not visible/configurable
Indexing
– Neo4j: ✓
– Neptune: ⚠ Indexes are not configurable
Transaction management
– Neo4j: ✓
– Neptune: ⚠ Every traversal is a single transaction; manual commit/rollback is not supported
Ease of cluster management
– Neo4j: ✓ Out-of-the-box clustering with an enterprise licence. Unless an enterprise licence is purchased, clustering and data replication should be handled by us
– Neptune: ✓ Easy out-of-the-box data replication, immediate consistency
Cost
– Neo4j: 2 r4.4xlarge instances + LB ~ 1800 USD/month
– Neptune: 2 r4.4xlarge instances + 250 GB storage (estimated based on test data) ~ 2015 USD/month + 0.2 USD/1 million I/O requests (1,600 million requests made only during testing)
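The cost figures can be worked through with a little arithmetic. The sketch below uses only the numbers from the slide; the function name `neptune_cost` is hypothetical, and treating the test traffic as if it were one month of I/O is an assumption, since the slide does not say over what period the 1,600 million requests were made.

```python
NEO4J_MONTHLY_USD = 1800.0         # 2 r4.4xlarge instances + LB (from the slide)
NEPTUNE_BASE_MONTHLY_USD = 2015.0  # 2 r4.4xlarge instances + 250 GB storage (from the slide)
NEPTUNE_USD_PER_MILLION_IO = 0.2   # from the slide
TEST_IO_REQUESTS_MILLIONS = 1600   # I/O requests made during testing (from the slide)


def neptune_cost(base_usd: float, io_millions: float, usd_per_million: float) -> float:
    """Base charge plus the per-request I/O charge."""
    return base_usd + io_millions * usd_per_million


io_charge = TEST_IO_REQUESTS_MILLIONS * NEPTUNE_USD_PER_MILLION_IO
# Assumption: the test I/O volume is counted against a single month.
total = neptune_cost(
    NEPTUNE_BASE_MONTHLY_USD, TEST_IO_REQUESTS_MILLIONS, NEPTUNE_USD_PER_MILLION_IO
)

print(f"I/O charge for the test traffic: {io_charge:.0f} USD")
print(f"Neptune base + test I/O: {total:.0f} USD vs Neo4j {NEO4J_MONTHLY_USD:.0f} USD/month")
```

With the slide's figures the I/O charge for the test traffic alone comes to 320 USD on top of the Neptune base price.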
ARCHITECTURE COMPONENTS
Relations update example
Result
■ ~300,000,000 nodes
– Work (articles, books, chapters) – 268,419,884
– Person (author) – 40,633,203
– Organisation – 13,044,870
– Journal – 227,747
■ ~1,000,000,000 relations
■ ~1,000,000 updates a day
■ Hardware used (From ~90+ to ~9 nodes)
– 3 nodes (r4.4xlarge)
– 3 nodes data processing
– 3 nodes for API
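A quick sanity check of the figures above, as a sketch: it sums the per-label node counts and converts the daily update volume into an average rate. The even spread of updates over the day is an assumption; the slide gives only the daily total.

```python
NODE_COUNTS = {
    "Work": 268_419_884,
    "Person": 40_633_203,
    "Organisation": 13_044_870,
    "Journal": 227_747,
}
UPDATES_PER_DAY = 1_000_000  # approximate, from the slide
SECONDS_PER_DAY = 24 * 60 * 60

total_nodes = sum(NODE_COUNTS.values())
# Assumption: updates arrive evenly through the day.
avg_updates_per_second = UPDATES_PER_DAY / SECONDS_PER_DAY

print(f"Total nodes: {total_nodes:,}")  # Total nodes: 322,325,704
print(f"Average update rate: {avg_updates_per_second:.1f} per second")  # 11.6 per second
```

The four labels add up to about 322 million nodes, consistent with the rounded ~300,000,000 headline figure.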
Future work
Weighted ranking
Guided navigation
Related entities Suggestion
New links Associations
