ONDEX & GRAPH DBS
MARCO BRANDIZI, 16/10/2017
GOALS
• Evaluate graph databases (GDBs), frameworks, etc. in relation to ONDEX needs
• Assess GDBs as kNetMiner/ONDEX backends
• Evaluate a new architecture where raw data access is entirely based on a GDB
• Evaluate a new data exchange format, possibly integrated with one of the GDBs
• and hence, evaluate the data models too
• Assess data query/manipulation languages (expressivity, ease of use, speed)
• Assess whether performance fits ONDEX needs
TEST DATA
Trait Ontology (TO): 1,500 nodes, is-a and part-of relations (i.e., mostly a tree)
Gene Ontology (GO): tree with 46k nodes
AraCyc/BioPAX: heterogeneous net, 23k nodes, 40k relations
Ara-kNet: heterogeneous net, 350k nodes, 1.15M relations
TEST SETTINGS (RDF)
TEST SETTINGS (NEO4J)
RDF
RDF / LINKED DATA ESSENTIALS
• Simple, Fine-Grained Data Model: Property/Value Pairs & Typed Links
• Designed for Data Integration:
• Universal Identifiers, W3C Standards
• Strong (even too much) emphasis on knowledge modelling via schemas/ontologies
• Designed for the Web: Resolvable URIs, Web APIs
RDF / LINKED DATA ESSENTIALS
Integration as a native citizen; strong emphasis on knowledge modelling, schemas, ontologies
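The "property/value pairs & typed links" idea can be illustrated with a tiny in-memory sketch. This is not how a triple store is implemented; it only shows that attributes and links share one shape (subject, predicate, object) and that identifiers are URIs. The `http://example.org/` namespace, the sample data and the `match` helper are all hypothetical.

```python
# Minimal in-memory sketch of RDF-style data: every fact is a
# (subject, predicate, object) triple and every identifier is a URI.
# All URIs below are hypothetical examples, not the real ONDEX ones.
EX = "http://example.org/"

triples = {
    (EX + "concept/prot1", EX + "type", EX + "conceptClass/Protein"),
    (EX + "concept/prot1", EX + "conceptName", "P53"),
    (EX + "concept/prot1", EX + "relationType/cs_by", EX + "concept/react1"),
    (EX + "concept/react1", EX + "type", EX + "conceptClass/Reaction"),
}


def match(pattern, data):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in data
        if all(p is None or p == v for p, v in zip(pattern, t))
    )


# Property/value pairs and typed links are queried in exactly the same way.
proteins = match((None, EX + "type", EX + "conceptClass/Protein"), triples)
links = match((EX + "concept/prot1", EX + "relationType/cs_by", None), triples)
```

Because everything is a triple, merging two datasets amounts to a set union, which is one reason the model suits data integration.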
DATA MODEL: ONDEX IN RDF
EXAMPLE QUERIES
Count concepts (classes) in Trait Ontology:
select (count(distinct ?c) as ?count) WHERE {
  ?c a odxcc:TO_TERM.
}
Parts of membrane (transitively):
select distinct ?csup ?supName ?c ?name
WHERE {
  ?csup odx:conceptName ?supName.
  FILTER ( ?supName = "cellular membrane" )
  ?c odxrt:part_of* ?csup.
  ?c odx:conceptName ?name.
}
LIMIT 1000
Proteins related to pathways (triple patterns listed in optimised order; '|' expresses alternatives in property paths):
select distinct ?prot ?pway {
  ?prot odxrt:pd_by|odxrt:cs_by ?react;
    a odxcc:Protein.
  ?react a odxcc:Reaction.
  ?react odxrt:part_of ?pway.
  ?pway a odxcc:Path.
}
LIMIT 1000
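What the `part_of*` property path computes can be sketched in plain Python: it matches zero or more `part_of` hops, i.e., the reflexive transitive closure. The edge data below is hypothetical and the breadth-first search only illustrates the semantics, not how Fuseki or Virtuoso evaluate paths.

```python
from collections import deque

# Hypothetical part_of edges (child -> parents), for illustration only.
part_of = {
    "inner membrane": ["cellular membrane"],
    "membrane raft": ["inner membrane"],
    "nucleus": ["cell"],
}


def parts_of(target, edges):
    """Return every node that reaches `target` via zero or more part_of hops."""
    # Invert the edges so we can walk from the target down to its parts.
    children = {}
    for child, parents in edges.items():
        for parent in parents:
            children.setdefault(parent, []).append(child)

    seen = {target}  # zero hops: '*' also matches the target itself
    queue = deque([target])
    while queue:
        node = queue.popleft()
        for child in children.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen


result = parts_of("cellular membrane", part_of)
```

Using `+` instead of `*` in the SPARQL path would exclude the starting node, which here corresponds to removing `target` from the result.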
EXAMPLE QUERIES
Proteins related to pathways:
prefix odx: <http://ondex.sourceforge.net/ondex-core#>
prefix odxcc: <http://www.ondex.org/ex/conceptClass/>
prefix odxc: <http://www.ondex.org/ex/concept/>
prefix odxrt: <http://www.ondex.org/ex/relationType/>
prefix odxr: <http://www.ondex.org/ex/relation/>
prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
select distinct ?prot ?pway
where {
  {
    # Branch 1
    ?prot odxrt:pd_by|odxrt:cs_by ?react.
    ?prot a odxcc:Protein.
    ?react a odxcc:Reaction.
    ?react odxrt:part_of ?pway.
    ?pway a odxcc:Path.
  }
  union {
    # Branch 2
    ?prot ^odxrt:ac_by|odxrt:is_a ?enz.
    ?prot a odxcc:Protein.
    ?enz a odxcc:Enzyme.
    {
      # Branch 2.1
      ?enz odxrt:ac_by|odxrt:in_by ?comp.
      ?comp a odxcc:Compound.
      ?comp odxrt:cs_by|odxrt:pd_by ?trns.
      ?trns a odxcc:Transport
    }
    union {
      # Branch 2.2
      ?enz ^odxrt:ca_by ?trns.
      ?trns a odxcc:Transport
    }
    ?trns odxrt:part_of ?pway.
    ?pway a odxcc:Path.
  }
} LIMIT 1000
RDF PERFORMANCE
Simple, common queries (Fuseki)
RDF PERFORMANCE
Queries over ONDEX paths (Fuseki)
RDF PERFORMANCE
Queries over ONDEX paths (Virtuoso)
NEO4J
NEO4J ESSENTIALS
• Designed to back applications
• Much less about standards or Web-based sharing
• Very little to manage schemas (more later)
• No native data format (except Cypher; support for GraphML, RDF)
• Initially based on API only, now Cypher available
• Compact, easy, no URIs (can be used as strings)
• Very performant
• Doesn't have much for clustering/federation, but Cypher can be used in TinkerPop
• More commercial (not necessarily good)
• Cool management interface
• Probably easier to use for the average Java developer
Image credits: https://goo.gl/YLhCXG
NEO4J DATA MODEL
Both nodes and relations can have attributes
Nodes have labels and relations have a type (i.e., string-based types)
Cool management interface
(SPARQL version might be a student project)
CYPHER QUERY/DM LANGUAGE
Proteins->Reactions->Pathways:
// chain of paths, node selection via property (exploits indices)
MATCH (prot:Protein) - [csby:consumed_by] -> (:Reaction) - [:part_of] -> (pway:Path { title: 'apoptosis' })
// further conditions, but often not performant
WHERE prot.name =~ '(?i)^DNA.+'
// Usual projection and post-selection operators
RETURN prot.name, pway
// Relations can have properties
ORDER BY csby.pvalue
LIMIT 1000
Single-path (or same-direction branching) easy to write:
MATCH (prot:Protein) - [:pd_by|cs_by] -> (:Reaction) - [:part_of*1..3] -> (pway:Path)
RETURN ID(prot), ID(pway) LIMIT 1000
// Very compact forms available, depending on the data
MATCH (prot:Protein) -- (pway:Path) RETURN pway
CYPHER QUERY/DM LANGUAGE
DML features:
MATCH (prot:Protein { name: 'P53' }), (pway:Path { title: 'apoptosis' })
CREATE (prot) - [:participates_in] -> (pway)
DML features, embeddable in Java/Python/etc:
UNWIND $rows AS row // $rows set by the invoker, programmatically
MATCH (prot:Protein { id: row.protId }), (pway:Path { id: row.pathId })
CREATE (prot) - [relation:participates_in] -> (pway)
SET relation = row.relationAttributes
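How the invoker might prepare `$rows` can be sketched as follows. The statement text is the one from the slide; `build_rows`, `batches` and the sample data are hypothetical helpers, and the batch size is an arbitrary assumption. The block only builds parameters and does not contact a database.

```python
# The parametric statement from the slide: one string, reused for any batch.
BULK_CREATE = """
UNWIND $rows AS row
MATCH (prot:Protein { id: row.protId }), (pway:Path { id: row.pathId })
CREATE (prot) - [relation:participates_in] -> (pway)
SET relation = row.relationAttributes
"""


def build_rows(links):
    """Turn (protein id, pathway id, attributes) tuples into the $rows parameter."""
    return [
        {"protId": prot_id, "pathId": path_id, "relationAttributes": dict(attrs)}
        for prot_id, path_id, attrs in links
    ]


def batches(rows, size):
    """Split rows into fixed-size chunks, one chunk per statement execution."""
    return [rows[i:i + size] for i in range(0, len(rows), size)]


# Hypothetical sample data.
links = [("p1", "w1", {"pvalue": 0.01}), ("p2", "w1", {}), ("p3", "w2", {"pvalue": 0.2})]
rows = build_rows(links)
chunks = batches(rows, 2)
# With the Neo4j Python driver, each chunk would be sent with something like
# session.run(BULK_CREATE, rows=chunk); that call is not executed here.
```

The point of the pattern is that the query string never changes, so the server can plan it once and each execution carries many rows.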
CYPHER/NEO4J PERFORMANCE
Simple, common queries
CYPHER/NEO4J PERFORMANCE
Path Queries
SOUNDS GOOD, BUT…
select distinct ?prot ?pway
where {
  {
    # Branch 1
    …
  }
  union {
    # Branch 2
    …
    {
      # Branch 2.1
    }
    union {
      # Branch 2.2
    }
    …
  }
}
• In Cypher?!
• I couldn't find a decent way, although it might be possible (https://goo.gl/Rpa9SM)
• Partially possible in a straightforward way, but redundantly, e.g., Branch 2 followed by Branch 2.2:
MATCH (prot:Protein) <- [:ac_by] - (:Enzyme) <- [:ca_by] - (:Transport) - [:part_of] -> (pway:Path)
RETURN prot, pway LIMIT 100
UNION
MATCH (prot:Protein) - [:is_a] -> (:Enzyme) <- [:ca_by] - (:Transport) - [:part_of] -> (pway:Path)
RETURN prot, pway LIMIT 100
ADDENDUM
select distinct ?prot ?pway
where {
  {
    # Branch 1
    …
  }
  union {
    # Branch 2
    …
    {
      # Branch 2.1
    }
    union {
      # Branch 2.2
    }
    …
  }
}
• In Cypher?!
Unions+branches partially possible by means of paths in WHERE:
// Branch 2
MATCH (prot:Protein), (enz:Enzyme), (tns:Transport) - [:part_of] -> (path:Path)
WHERE ( (enz) - [:ac_by|:in_by] -> (:Comp) - [:pd_by|:cs_by] -> (tns) // Branch 2.1
  OR (tns) - [:ca_by] -> (enz) ) // Branch 2.2
  AND ( (prot) - [:is_a] -> (enz) OR (prot) <- [:ac_by] - (enz) ) // Branch 2 (protein-enzyme link)
RETURN prot, path LIMIT 30
UNION
// Branch 1
MATCH (prot:Protein) - [:pd_by|:cs_by] -> (:Reaction) - [:part_of] -> (path:Path)
RETURN prot, path LIMIT 30
• However,
• 41,249 ms to execute against wheat net.
• It generates Cartesian products and can easily explode
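The Cartesian-product problem can be made concrete with a little arithmetic. The MATCH clause above lists three disconnected patterns (proteins, enzymes, transports), so in the worst case the engine considers every combination before the WHERE filters apply. The node counts and fan-outs below are made-up assumptions, not measurements from the test datasets.

```python
from math import prod

# Hypothetical node counts per label; real figures depend on the dataset.
label_counts = {"Protein": 20_000, "Enzyme": 5_000, "Transport": 1_000}


def candidate_tuples(counts):
    """Upper bound on combinations when MATCH lists disconnected node patterns."""
    return prod(counts.values())


def candidate_tuples_chained(start_count, avg_fanouts):
    """Rough estimate when each pattern step follows relations from the previous one."""
    return start_count * prod(avg_fanouts)


unconstrained = candidate_tuples(label_counts)
# Assumed average of 2 matching relations per step, purely illustrative.
chained = candidate_tuples_chained(label_counts["Protein"], [2, 2])
```

With these assumed figures the disconnected form has 100 billion candidate tuples against 80 thousand for a chained path, which is the kind of gap that makes the WHERE-based rewrite fragile.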
SOUNDS GOOD, BUT…
• What about schemas/metadata/ontologies?
• Nodes can only have labels attached (and relations a single type), which are just strings. Rich schema operations are not so easy:
• Select any kind of protein, including enzymes, cytokines
• Select any type of 'interacts with', including 'catalysed by', 'consumed by', 'produced by' (might require 'inverse of')
• Basically, it has a relational-oriented view of schemas
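"Select any kind of protein" means expanding a class into all of its subclasses before matching labels. A minimal sketch of that expansion, assuming a hypothetical `is_a` hierarchy kept outside the database, could look like this; the generated WHERE fragment uses a hypothetical variable `n`.

```python
# Hypothetical is_a hierarchy (subclass -> direct superclasses).
is_a = {
    "Enzyme": ["Protein"],
    "Cytokine": ["Protein"],
    "Kinase": ["Enzyme"],
    "Protein": ["Molecule"],
}


def subclasses_of(root, hierarchy):
    """Return `root` plus every class that is a direct or indirect subclass of it."""
    result = {root}
    changed = True
    while changed:
        changed = False
        for sub, supers in hierarchy.items():
            if sub not in result and any(s in result for s in supers):
                result.add(sub)
                changed = True
    return result


protein_labels = subclasses_of("Protein", is_a)
# A label-based query then has to enumerate each label explicitly.
where_clause = " OR ".join(f"n:{label}" for label in sorted(protein_labels))
```

Since labels are plain strings, this expansion has to happen either in application code, as here, or through metadata modelled as a graph.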
SOUNDS GOOD, BUT…
• Basically, it has a relational-oriented view of schemas
• We might still be OK with metadata modelled as graphs, however:
• MATCH (molecule:Molecule),
  (molType:Class) - [:is_a*] -> (:Class { name: 'Protein' })
  WHERE molType.label IN labels(molecule)
• It's expensive to compute (doesn't exploit indexes)
• MATCH (molecule:Molecule:$additionalLabel) CREATE …
• Parameterising on labels is not possible
• Requires a non-parametric Cypher string => UNWIND-based bulk loading impossible
• => bad performance
• A programmatic approach is possible, but there are a lot of problems with things like Lucene version mismatches (one reason being that ONDEX would require a review and a proper plug-in architecture)
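Why label parameterisation matters can be shown with a short sketch. Because a label cannot be passed as a parameter, it has to be spliced into the statement text, so nodes with different labels need different query strings. `create_statement` and the sample nodes are hypothetical; the identifier check is an added safety assumption, not something taken from the slides.

```python
import re

# Labels cannot be Cypher parameters, so they must be spliced into the text.
# Restricting them to simple identifiers avoids injecting arbitrary Cypher.
_LABEL = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def create_statement(extra_label):
    """Build a CREATE statement for Molecule nodes carrying one extra label."""
    if not _LABEL.fullmatch(extra_label):
        raise ValueError(f"unsafe label: {extra_label!r}")
    return f"CREATE (m:Molecule:{extra_label}) SET m = $props"


# Hypothetical input: each node comes with its own additional label.
nodes = [("Protein", {"id": "1"}), ("Enzyme", {"id": "2"}), ("Protein", {"id": "3"})]
statements = [create_statement(label) for label, _ in nodes]
distinct_statements = set(statements)
```

Every distinct label yields a distinct statement, so a single UNWIND over mixed-label rows is not available and the loader falls back to many smaller statements.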
FLAT, RDF-LIKE MODEL
Code for both converters:
github:/marco-brandizi/odx_neo4j_converter_test
FLAT MODEL IMPACT ON CYPHER
Structured model:
MATCH (prot:Protein { id: '250169' }) - [:cs_by] -> (react:Reaction) - [:part_of] -> (pway:Path)
RETURN * LIMIT 100
Flat model:
MATCH (prot:Concept { id: '250169', ccName: 'Protein' })
  <- [:from] - (csby:Relation { name: 'cs_by' })
  - [:to] -> (react:Concept { ccName: 'Reaction' })
  <- [:from] - (partof:Relation { name: 'part_of' })
  - [:to] -> (pway:Concept { ccName: 'Path' })
RETURN * LIMIT 100
Rich schema-based queries:
MATCH (mol:Concept) <- [:conceptClass] - (cc:ConceptClass),
  (cc) <- [:specializationOf*] - (:ConceptClass { name: 'Protein' })
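The step from the structured to the flat model can be sketched as a small conversion: each relation becomes a node of its own, linked to its endpoints by `from` and `to` edges, as in the flat-model query above. This is only an illustration under assumed data structures, not the code of the converters mentioned earlier; the `rel0`, `rel1` identifier scheme is hypothetical.

```python
def to_flat(edges):
    """Convert (source id, relation name, target id) edges to the flat model.

    Each original relation becomes a Relation node linked to its endpoints
    through 'from' and 'to' edges.
    """
    nodes, links = [], []
    for i, (src, rel_name, dst) in enumerate(edges):
        rel_id = f"rel{i}"  # hypothetical identifier scheme
        nodes.append({"id": rel_id, "label": "Relation", "name": rel_name})
        links.append((rel_id, "from", src))
        links.append((rel_id, "to", dst))
    return nodes, links


# Hypothetical structured edges: protein -> reaction -> pathway.
structured = [("250169", "cs_by", "react1"), ("react1", "part_of", "pway1")]
rel_nodes, flat_links = to_flat(structured)
```

Every logical hop turns into two physical hops, which is why the flat Cypher query is roughly twice as long as the structured one.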
FLAT MODEL PERFORMANCE
Simple, common queries
FLAT MODEL PERFORMANCE
Typical ONDEX Graph Queries
IMPACT ON CYPHER
Rich schema-based queries
From:
MATCH (molecule:Molecule), (molType:Class) - [:is_a*] -> (:Class { name: 'Protein' })
WHERE molType.label IN labels(molecule)
To:
MATCH (mol:Concept) <- [:conceptClass] - (cc:ConceptClass),
  (cc) <- [:specializationOf*] - (:ConceptClass { name: 'Protein' })
Now it's efficient enough (especially with length restrictions)
However…
From:
MATCH (react:Reaction) - [:part_of] -> (pway:Path)
To:
MATCH (react:Concept { ccName: 'Reaction' })
  <- [:from] - (partof:Relation { name: 'part_of' })
  - [:to] -> (pway:Concept { ccName: 'Path' })
What if we want variable-length part_of?
Not currently possible in Cypher (nor in SPARQL), maybe in future (https://github.com/neo4j/neo4j/issues/88)
=> Having both models, redundantly, would probably be worthwhile
=> That makes it not so different from RDF
OTHER ISSUES
• Data exchange format?
• None, except Cypher
• DML not so performant
• In particular, no standard data exchange format
• Could be combined with RDF
• Is Neo4j Open Source?
• Produced by a company, only the Community Edition is OSS
• OpenCypher is available
• Cypher backed by Gremlin/TinkerPop
• Apache project, more reliable OSS-wise
• Performance comparable with Neo4j (https://goo.gl/NK1tn2)
• More choice of implementations
• Alternative QL, but more complicated IMHO (Cypher supported)
Image credits: https://goo.gl/ysBFF2
CONCLUSIONS
Data exchange format
  Neo4J/GraphDBs: -
  Virtuoso/Triple Stores: +
Data model
  Neo4J/GraphDBs: + Relations with properties; - Metadata management
  Virtuoso/Triple Stores: - Relations cannot have properties (req. reification); + Metadata as first citizen
Performance
  Neo4J/GraphDBs: +
  Virtuoso/Triple Stores: - (comparable)
QL
  Neo4J/GraphDBs: + Easier (e.g., compact, omissions)?; - Expressivity for some patterns (unions, DML)
  Virtuoso/Triple Stores: - Harder? (URIs, namespaces, verbosity); + More expressive
Standardisation, openness
  Neo4J/GraphDBs: -
  Virtuoso/Triple Stores: +
Scalability, big data
  Neo4J/GraphDBs: - TinkerPop probably better
  Virtuoso/Triple Stores: LB/Cluster solutions; over TinkerPop (via SAIL implementation)
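The "req. reification" remark can be illustrated with the standard RDF reification vocabulary (rdf:Statement, rdf:subject, rdf:predicate, rdf:object). The sketch below shows how one relation with a property turns into several triples; the `http://example.org/` namespace, the identifiers and the `reify` helper are hypothetical, and this is not necessarily the modelling used in the ONDEX RDF conversion.

```python
RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
EX = "http://example.org/"  # hypothetical namespace


def reify(stmt_id, subject, predicate, obj, properties):
    """Express one relation plus its properties using standard RDF reification."""
    stmt = EX + stmt_id
    triples = [
        (stmt, RDF + "type", RDF + "Statement"),
        (stmt, RDF + "subject", subject),
        (stmt, RDF + "predicate", predicate),
        (stmt, RDF + "object", obj),
    ]
    # Relation properties hang off the statement node, not the original link.
    triples += [(stmt, EX + key, value) for key, value in sorted(properties.items())]
    return triples


triples = reify("stmt1", EX + "prot1", EX + "cs_by", EX + "react1", {"pvalue": 0.01})
```

A single Neo4j relation with one property therefore costs five triples here, which is the data-model trade-off the comparison points at.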
WHY?
• Graph + APIs
• Clearer architecture, open to more applications, not only kNetMiner
• QL makes it easier to develop further components/analyses/applications
• Standard data model and format
• Don't reinvent the wheel
• Data sharing
• Data and app integration

A Preliminary survey of RDF/Neo4j as backends for KnetMiner
