Datomic R-trees

James Sofra
@sofra
https://github.com/jsofra/datomic-rtree

Summary
● Motivations
● Datomic overview
● Datomic R-tree implementation
● Hilbert Curves
● Bulk loading (via Hilbert Curves)
● Future plans

Motivations
● I have an interest in geospatial applications
  – e.g. Thunderstorm probability application (THESPA)
● Datomic is an interesting database that makes different trade-offs to other databases
  – Wonder how far we can take the ability to describe arbitrary structures in Datomic
Why don't we have both?

Datomic Overview
● Immutable database
● Time-based facts (stored as entities)
● ACID transactions
● Expressive queries using Datalog
● Pluggable storage
● Flexible enough to act as a row, column or graph database
● Schema that describes attributes that can be attached to entities
  – Attributes have a type: String, Long, Double, Inst, Ref etc.
● Database functions
  – Stored in the database; they see the in-transaction value of the database
Datomic Overview Architecture

Datomic Motivations
● Things that make Datomic appealing for spatial data
  – The time-based nature of Datomic is useful for time series data, which we often have
  – No need to add spatial operations (union, intersection, etc.) to the database; they can be handled by libraries in the peers
  – Spatial indexes can be stored as regular data, which allows a lot of freedom over the choice of index and supports multiple indexes over subsets of the data in space and time
  – Flexible entity structures are useful because spatial data frequently does not fit nicely in a table
  – Immutability is surprisingly useful in lots of different applications!

R-trees
● Efficient query of multi-dimensional data
● Groups nearby objects
● Balanced (all leaf nodes at the same level)
● Aims for nodes that minimise empty space coverage and overlap
● Designed for storage on disk (as used in databases)
● "R-Trees: A Dynamic Index Structure for Spatial Searching"
  – Guttman, A. (1984)

R-trees - Insertions
● Choose a leaf node to insert into
● Insert the entry into the leaf node and enlarge the node
● If the node has more than the max number of children, split the node and propagate enlargement and splits up the tree
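
The first two insertion steps can be sketched in Python as follows. This is an illustration only, not the code from datomic-rtree: the dict-based node layout, the tuple `(min_x, min_y, max_x, max_y)` for boxes, and the function names are all assumptions made for the example. The leaf is chosen by least enlargement (ties broken by smaller area), as in Guttman's paper, and the split itself is left out.

```python
# Boxes are tuples (min_x, min_y, max_x, max_y); this layout is an assumption.

def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def enlargement(node_box, entry_box):
    """Extra area node_box needs in order to also cover entry_box."""
    return area(union(node_box, entry_box)) - area(node_box)

def choose_leaf_path(root, entry_box):
    """Walk from the root to a leaf, always taking the child that needs
    the least enlargement (ties: the child with the smaller area)."""
    path = [root]
    while not path[-1]["leaf"]:
        children = path[-1]["children"]
        path.append(min(
            children,
            key=lambda c: (enlargement(c["bbox"], entry_box), area(c["bbox"]))))
    return path

def insert(root, entry_box, max_children):
    """Add the entry to the chosen leaf and enlarge every node on the path.
    Returns True when the leaf now overflows and would need a split."""
    path = choose_leaf_path(root, entry_box)
    leaf = path[-1]
    leaf["children"].append({"bbox": entry_box})
    for node in path:
        node["bbox"] = union(node["bbox"], entry_box)
    return len(leaf["children"]) > max_children
```

A caller would run the split step whenever `insert` returns True.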

Datomic R-tree - Schema
:rtree/root          :db.type/ref
:rtree/max-children  :db.type/long
:rtree/min-children  :db.type/long
:node/children       :db.type/ref
:node/is-leaf?       :db.type/boolean
:node/entry          :db.type/ref
:bbox/min-x          :db.type/double
:bbox/min-y          :db.type/double
:bbox/max-x          :db.type/double
:bbox/max-y          :db.type/double
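
The four `:bbox` attributes describe an axis-aligned bounding rectangle. As a rough illustration of the operations an R-tree needs on such a rectangle, here is a small Python sketch. The class and method names are hypothetical and are not taken from datomic-rtree; only the field names mirror the schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BBox:
    # Field names mirror the :bbox/* attributes; the class is illustrative.
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def area(self) -> float:
        return (self.max_x - self.min_x) * (self.max_y - self.min_y)

    def union(self, other: "BBox") -> "BBox":
        """Smallest box covering both boxes."""
        return BBox(min(self.min_x, other.min_x), min(self.min_y, other.min_y),
                    max(self.max_x, other.max_x), max(self.max_y, other.max_y))

    def intersects(self, other: "BBox") -> bool:
        """True when the boxes overlap or touch."""
        return (self.min_x <= other.max_x and other.min_x <= self.max_x and
                self.min_y <= other.max_y and other.min_y <= self.max_y)

    def enlargement(self, other: "BBox") -> float:
        """Extra area this box needs in order to also cover `other`."""
        return self.union(other).area() - self.area()
```

A node's box is the union of its children's boxes, so a search can skip a whole subtree when the node's box does not intersect the search box.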
Datomic R-tree choose-leaf
Datomic R-tree split-node
Datomic R-tree pick-seeds
Datomic R-tree - pick-next
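
The names choose-leaf, split-node, pick-seeds and pick-next match the steps of Guttman's insertion algorithm. As an illustration only (this is not the code from datomic-rtree), Guttman's quadratic split can be sketched in Python like this. The tuple box layout `(min_x, min_y, max_x, max_y)`, the index-based return values and the exact tie-breaking are assumptions made for the example.

```python
from itertools import combinations

def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def union(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def enlargement(box, extra):
    return area(union(box, extra)) - area(box)

def pick_seeds(boxes):
    """Indices of the pair that would waste the most area if kept together."""
    def waste(pair):
        i, j = pair
        return area(union(boxes[i], boxes[j])) - area(boxes[i]) - area(boxes[j])
    return max(combinations(range(len(boxes)), 2), key=waste)

def pick_next(remaining, boxes, box_a, box_b):
    """Index of the entry with the strongest preference for one group."""
    return max(remaining, key=lambda k: abs(enlargement(box_a, boxes[k]) -
                                            enlargement(box_b, boxes[k])))

def split_node(boxes, min_children):
    """Split entries into two groups of indices (quadratic-cost split)."""
    i, j = pick_seeds(boxes)
    group_a, group_b = [i], [j]
    box_a, box_b = boxes[i], boxes[j]
    remaining = [k for k in range(len(boxes)) if k not in (i, j)]
    while remaining:
        # If a group needs every remaining entry to reach min_children,
        # hand them all over and stop.
        if len(group_a) + len(remaining) == min_children:
            group_a += remaining
            break
        if len(group_b) + len(remaining) == min_children:
            group_b += remaining
            break
        k = pick_next(remaining, boxes, box_a, box_b)
        remaining.remove(k)
        cost_a = (enlargement(box_a, boxes[k]), area(box_a), len(group_a))
        cost_b = (enlargement(box_b, boxes[k]), area(box_b), len(group_b))
        # Prefer least enlargement, then smaller area, then fewer entries.
        if cost_a <= cost_b:
            group_a.append(k)
            box_a = union(box_a, boxes[k])
        else:
            group_b.append(k)
            box_b = union(box_b, boxes[k])
    return group_a, group_b
```

Each returned group would become the children of one of the two nodes that replace the overflowing node.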

Datomic R-tree – regular transaction
Transaction for adding new entry, calls database function
Database function
New entry with new ID
Add new entry as child to leaf node

Datomic R-tree – split transaction
New entry
Remove root
Create new leaf nodes
Add new root

Bulk loading
● Issues with single insertion loading of R-tree
  – Becomes slow with many insertions
  – The resulting tree is not always as efficient as it could be
● Bulk loading builds a tree once from a number of entities
● Two basic approaches: top-down and bottom-up
● Bulk loading does not imply bulk insertion

Bulk loading – sort based loading
● Aims for better R-tree performance
● Bottom-up approach
● Sorts all entities in an order that aims to preserve locality
● Partitions the entities into clusters that are (hopefully) spatially collocated
● Recursively apply partitioning to build up the tree
● "Sort-based Query-adaptive Loading of R-trees"
  – D. Achakeev; B. Seeger; P. Widmayer (2012)
● "Sort-based parallel loading of R-trees"
  – D. Achakeev; M. Seidemann; M. Schmidt; B. Seeger (2012)
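
The sort, partition and recurse steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the algorithm from the cited papers: entries are cut into fixed-size runs of `max_children` (so the last node of a level may be underfull), boxes are tuples `(min_x, min_y, max_x, max_y)`, and the sort key is passed in by the caller.

```python
def union_all(boxes):
    """Smallest box covering every box in the list."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def pack_level(nodes, max_children, leaf):
    """Group consecutive nodes into parents of at most max_children."""
    parents = []
    for i in range(0, len(nodes), max_children):
        group = nodes[i:i + max_children]
        parents.append({"bbox": union_all([n["bbox"] for n in group]),
                        "leaf": leaf,
                        "children": group})
    return parents

def bulk_load(entry_boxes, sort_key, max_children):
    """Build a tree bottom-up from entries sorted by sort_key."""
    if not entry_boxes:
        raise ValueError("need at least one entry")
    level = [{"bbox": b} for b in sorted(entry_boxes, key=sort_key)]
    leaf = True
    while True:
        level = pack_level(level, max_children, leaf)
        leaf = False
        if len(level) == 1:
            return level[0]  # the single remaining node is the root
```

For example, `bulk_load(boxes, lambda b: (b[0] + b[2]) / 2, 20)` sorts by the x coordinate of each box centre. A key that preserves locality in both dimensions, such as a Hilbert value, generally gives tighter nodes.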

Hilbert Curves
● A continuous fractal space-filling curve
● First described by mathematician David Hilbert in 1891
● Useful because it enables mapping from 2D to 1D, preserving some notion of locality
● Other options are the Peano curve and the Z-order curve (aka Morton curve)
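
To make the 2D to 1D mapping concrete, here is a Python sketch of the widely used iterative conversion from grid coordinates to a distance along the Hilbert curve. It assumes integer cell coordinates on an `n` by `n` grid where `n` is a power of two; the function name is illustrative, and mapping real-valued coordinates onto grid cells is left to the caller.

```python
def hilbert_index(n, x, y):
    """Distance along the Hilbert curve of cell (x, y) on an n x n grid.

    n must be a power of two and 0 <= x, y < n.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        # Each quadrant at this scale contributes s*s cells to the distance.
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented consistently.
        if ry == 0:
            if rx == 1:
                x = n - 1 - x
                y = n - 1 - y
            x, y = y, x
        s //= 2
    return d
```

Sorting cells by this value visits every cell exactly once, and consecutive cells in the order are always grid neighbours, which is the locality property the bulk loader relies on.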

Bulk loading – Hilbert sort based
● Better Hilbert partitioning

Bulk loading via Hilbert curves
● Insert all entities into Datomic (or use existing entities)
● Entities include an indexed Hilbert value attribute
● Obtain a seq of the entities using the :avet index with the Hilbert value
● Perform partitioning

Bulk - hilbert-ents
Takes advantage of the Datomic index API to get direct access to the Hilbert index

Bulk - min-cost-index
List of options for the next partition point
Must be at least min-children in the partition
Bulk - cost-partition
Bulk - p-cost-partition
Bulk - dyn-cost-partition
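
One way to choose partition points by cost is sketched below in Python. It is an assumption-laden illustration, not the deck's cost-partition, p-cost-partition or dyn-cost-partition code: the cost of a partition is taken to be the area of its bounding box, every partition must hold between `min_children` and `max_children` entries, and dynamic programming finds the cuts with the lowest total cost over the already sorted sequence.

```python
def area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def union_all(boxes):
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def dyn_cost_partition(boxes, min_children, max_children):
    """Cut the sorted boxes into consecutive runs of allowed size so that
    the summed bounding-box area of the runs is minimal."""
    n = len(boxes)
    inf = float("inf")
    best = [inf] * (n + 1)  # best[i]: lowest total cost for boxes[:i]
    cut = [0] * (n + 1)     # cut[i]: start of the last run in that optimum
    best[0] = 0.0
    for end in range(1, n + 1):
        for size in range(min_children, max_children + 1):
            start = end - size
            if start < 0:
                break
            if best[start] == inf:
                continue  # boxes[:start] cannot be partitioned validly
            cost = best[start] + area(union_all(boxes[start:end]))
            if cost < best[end]:
                best[end], cut[end] = cost, start
    if best[n] == inf:
        raise ValueError("no valid partitioning for these size limits")
    parts, end = [], n
    while end > 0:  # follow the recorded cuts back to the start
        parts.append(boxes[cut[end]:end])
        end = cut[end]
    return parts[::-1]
```

A purely greedy choice of the cheapest next cut would always pick the smallest allowed run, because a bounding box never shrinks when entries are added. That is why this sketch minimises the total cost instead.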

Conclusions
● It works!
(install-single-insertions conn 50000 20 10)
  – "Elapsed time: 119114.342783 msecs"
(install-and-bulk-load conn 50000 20 10)
  – "Elapsed time: 6511.543299 msecs"
(time (naive-intersecting all-entries search-box))
  – "Elapsed time: 870.575802 msecs"
(time (intersecting root search-box))
  – "Elapsed time: 2.927883 msecs"
* Note: these times should be regarded with suspicion since they only use the in-memory database
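
The difference between the last two timings comes from pruning: a tree search only descends into nodes whose bounding box intersects the search box. The Python sketch below illustrates the idea. The function names echo `naive-intersecting` and `intersecting` from the timings above, but the bodies, the dict-based node layout and the tuple boxes `(min_x, min_y, max_x, max_y)` are assumptions, not the datomic-rtree implementation.

```python
def intersects(a, b):
    """True when two boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def naive_intersecting(entry_boxes, search_box):
    """Check every entry: cost grows linearly with the number of entries."""
    return [b for b in entry_boxes if intersects(b, search_box)]

def intersecting(node, search_box):
    """Descend only into subtrees whose box intersects the search box."""
    if not intersects(node["bbox"], search_box):
        return []  # the whole subtree is skipped
    if node["leaf"]:
        return [c["bbox"] for c in node["children"]
                if intersects(c["bbox"], search_box)]
    found = []
    for child in node["children"]:
        found.extend(intersecting(child, search_box))
    return found
```

Both functions return the same entries. The tree version simply avoids looking at most of them when the search box is small relative to the data.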

Future plans
● Retractions and updates
● Bulk insertions
● More search and query support
● Schema for supporting Meridian Shapes and Features
● Investigate other R-trees: R* tree, R+ tree

Questions?
Thank you! Any questions?
James Sofra
@sofra

Other Interesting Resources
● "The R*-tree: an efficient and robust access method for points and rectangles"
● "OMT: Overlap Minimizing Top-down Bulk Loading Algorithm for R-tree"
  – T. Lee; S. Lee (2003)
● "The Priority R-Tree: A Practically Efficient and Worst-Case Optimal R-Tree"
  – L. Arge; M. de Berg; K. Yi (2004)
● "Compact Hilbert Indices"
  – C. Hamilton (2006)
● "R-Trees: Theory and Applications"
  – Y. Manolopoulos; A. Nanopoulos; A. N. Papadopoulos; Y. Theodoridis (2006)
