TileDB webinars
The TileDB Embedded
Storage Engine
Founder & CEO of TileDB, Inc.
Dr. Stavros Papadopoulos
Who is this webinar for?
Those wanting to learn about data storage fundamentals
Layout, compression, IO, etc.
Those looking to efficiently store/access any kind of data to/from anywhere
Dataframes, genomics, LiDAR, SAR, weather, and more, with a single engine
Those tired of managing custom, inefficient data formats
Formats not supporting fast updates, indexing, versioning, cloud performance
Disclaimer
I am the exclusive recipient of complaints
Email me at: stavros@tiledb.com
All the credit for our amazing work goes to our powerful team
Check it out at https://tiledb.com/about
Deep roots at the intersection of HPC, databases and data science
Traction with telecoms, pharmas, hospitals and other scientific organizations
40 members with expertise across all applications and domains
Who we are
TileDB was spun out of MIT and Intel Labs in 2017
WHERE IT ALL STARTED
Raised over $20M, we are very well capitalized
INVESTORS
What is TileDB Embedded?
An embeddable C++ library that stores and accesses multi-dimensional arrays
Dense array Sparse array
It implements very fast array slicing across dimensions
Superior
performance
Built in C++
Fully-parallelized
Columnar format
Multiple compressors
R-trees for sparse arrays
TileDB Embedded at a Glance
https://github.com/TileDB-Inc/TileDB
Open source:
Rapid updates
& data versioning
Immutable writes
Lock-free
Parallel reader / writer model
Time traveling
Schema evolution
TileDB Embedded at a Glance
https://github.com/TileDB-Inc/TileDB
Open source:
Extreme
interoperability
Numerous APIs
Numerous integrations
All backends
Optimized
for the cloud
Immutable writes
Parallel IO
Minimization of requests
TileDB Embedded at a Glance
APIs & tool integrations with zero-copy where possible
TileDB Embedded
Open-source interoperable
storage with a universal
open-spec array format
● Parallel IO, rapid reads & writes
● Columnar, cloud-optimized
● Data versioning & time traveling
Why arrays?
The basics
Advanced internal mechanics
Examples
Work in progress
Agenda
Comparison to other formats and engines
Docs at docs.tiledb.com
Byte 0 1 ...
Regardless of what kind of data you have, it is laid out in a 1D storage medium
Why Arrays?
Regardless of what kind of algorithm you run, it involves a set of slices
Algorithm as a task graph, where each task may slice
Why Arrays?
Byte 0 1 ...
Byte 0 1 ...
Performance is absolutely dictated by the slice result locality on the 1D medium
Why Arrays?
Arrays provide a flexible way to map/slice any-dimensional (ND) data to/from a 1D layout
Giving different “importance” to different dimensions (order and tiling)
Choosing whether dimension coordinates should be materialized or not (dense vs. sparse)
Considering compression, encryption and other filters (tiling)
Abstracting all the engineering magic that it takes to make everything very fast (engine)
Unifying the data model for all application domains! (universality)
Building indices for fast search (e.g., R-trees)
Arrays Subsume Dataframes
Sparse array
Dataframe
Dense vector
Arrays Are Universal
What else can be modeled as an array?
LiDAR 3D sparse)
SAR 2D or 3D dense)
Population genomics (3D sparse)
Single-cell genomics (2D dense or sparse)
Biomedical imaging (2D or 3D dense)
Even flat files! (1D dense)
Time series (ND dense or sparse)
Weather (2D or 3D dense)
Graphs (2D sparse)
Video (3D dense)
Key-values (1D or ND sparse)
The Basics
dense_array1
├── __t2_t2_uuid1_v
│ ├── __fragment_metadata.tdb
│ └── a0.tdb
├── __t2_t2_uuid1.ok
├── __lock.tdb
├── __meta
└── __schema
└── __t1_t1_uuid2
A Simple 2D Dense Array
1 2 3 4
5 6 7 8
9 10 11 12
13 14 15 16
(In the directory tree above: __t2_t2_uuid1_v is a fragment, __schema holds the array schema, and a0.tdb holds the attribute data)
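As a rough sketch of how such an array is created, written and sliced, here is what this could look like with the Python API (tiledb-py is assumed; the array name and values mirror the figure above, and exact slicing semantics may differ slightly across versions):

```python
import numpy as np
import tiledb

# 4x4 dense array with 2x2 space tiles and a single int32 attribute "a0".
dom = tiledb.Domain(
    tiledb.Dim(name="rows", domain=(1, 4), tile=2, dtype=np.int32),
    tiledb.Dim(name="cols", domain=(1, 4), tile=2, dtype=np.int32),
)
schema = tiledb.ArraySchema(
    domain=dom, sparse=False, attrs=[tiledb.Attr(name="a0", dtype=np.int32)]
)
tiledb.Array.create("dense_array1", schema)

# Each write creates a timestamped fragment directory like __t2_t2_uuid1_v above.
with tiledb.open("dense_array1", mode="w") as A:
    A[:] = np.arange(1, 17, dtype=np.int32).reshape(4, 4)

# Slice rows 2-3 and columns 1-2.
with tiledb.open("dense_array1", mode="r") as A:
    print(A[2:4, 1:3]["a0"])
```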
A Simple 2D Sparse Array
sparse_array1
├── __t2_t2_uuid2_v
│ ├── __fragment_metadata.tdb
│ ├── a0.tdb
│ ├── d0.tdb
│ └── d1.tdb
├── __t2_t2_uuid2_v.ok
├── __lock.tdb
├── __meta
└── __schema
└── __t1_t2_uuid1
1 2
3
4 5
6
(In the directory tree above: __t2_t2_uuid2_v is a fragment, __schema holds the array schema, a0.tdb holds the attribute data, and d0.tdb / d1.tdb hold the coordinates)
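A corresponding sketch for a 2D sparse array (again assuming tiledb-py); only the non-empty cells and their explicit coordinates are written:

```python
import numpy as np
import tiledb

# 4x4 sparse array; only non-empty cells are stored, along with their coordinates.
dom = tiledb.Domain(
    tiledb.Dim(name="d0", domain=(1, 4), tile=2, dtype=np.int32),
    tiledb.Dim(name="d1", domain=(1, 4), tile=2, dtype=np.int32),
)
schema = tiledb.ArraySchema(
    domain=dom, sparse=True, attrs=[tiledb.Attr(name="a0", dtype=np.int32)]
)
tiledb.Array.create("sparse_array1", schema)

# Coordinates go to d0.tdb / d1.tdb, attribute values to a0.tdb.
with tiledb.open("sparse_array1", mode="w") as A:
    A[[1, 2, 4], [1, 4, 3]] = np.array([1, 2, 6], dtype=np.int32)

# Reading a slice returns the attribute values plus the coordinates of the non-empty cells.
with tiledb.open("sparse_array1", mode="r") as A:
    print(A[1:3, 1:5])
```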
Groups
dense_group
├── __tiledb_group.tdb
└── nested_group
├── __tiledb_group.tdb
└── dense_array1
├── __lock.tdb
├── __meta
└── __schema
Groups provide an easy way to hierarchically organize arrays
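A minimal sketch of creating such a hierarchy with the Python API (the classic tiledb.group_create call is assumed; newer versions also offer a tiledb.Group class):

```python
import tiledb

# Create a group and a nested group; arrays can then be created inside them.
tiledb.group_create("dense_group")
tiledb.group_create("dense_group/nested_group")
# e.g. tiledb.Array.create("dense_group/nested_group/dense_array1", schema)
```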
Array Metadata
dense_array1
├── __t2_t2_uuid2_v
│ ├── __fragment_metadata.tdb
│ └── a0.tdb
├── __t2_t2_uuid2_v.ok
├── __lock.tdb
├── __meta
│ └── __t3_t3_uuid3
└── __schema
└── __t1_t2_uuid1
You can attach any number of (key, value) pairs to an array
The key must be a string, and the value can be anything
(Array metadata is stored under the __meta directory)
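A small sketch of attaching and reading back array metadata (assuming the dense_array1 created earlier):

```python
import tiledb

# Attach (key, value) pairs; keys are strings, values can be numbers, strings, etc.
with tiledb.open("dense_array1", mode="w") as A:
    A.meta["description"] = "example dense array"
    A.meta["version"] = 3

with tiledb.open("dense_array1", mode="r") as A:
    print(A.meta["description"], A.meta["version"])
```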
Multiple Attributes
dense_array1
├── __t2_t2_uuid1_v
│ ├── __fragment_metadata.tdb
│ └── a0.tdb
│ └── a1.tdb
├── __t2_t2_uuid1.ok
├── __lock.tdb
├── __meta
└── __schema
└── __t1_t1_uuid2
1,a 2,b 3,c 4,d
5,e 6,f 7,g 8,h
9,i 10,j 11,k 12,l
13,m 14,n 15,o 16,p
You can store more than one value in each cell, even of different types
TileDB has a “columnar” format that allows you to efficiently subselect on attributes
(a0.tdb and a1.tdb hold the data of the two attributes)
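A sketch of a two-attribute array and an attribute subselection (tiledb-py assumed; a float attribute is used here instead of the characters in the figure, to keep the dtypes simple):

```python
import numpy as np
import tiledb

dom = tiledb.Domain(
    tiledb.Dim(name="rows", domain=(1, 4), tile=2, dtype=np.int32),
    tiledb.Dim(name="cols", domain=(1, 4), tile=2, dtype=np.int32),
)
schema = tiledb.ArraySchema(
    domain=dom, sparse=False,
    attrs=[tiledb.Attr(name="a0", dtype=np.int32),     # stored in a0.tdb
           tiledb.Attr(name="a1", dtype=np.float64)],  # stored in a1.tdb
)
tiledb.Array.create("dense_array_multi", schema)

# Write both attributes in a single write.
with tiledb.open("dense_array_multi", mode="w") as A:
    A[:] = {"a0": np.arange(1, 17, dtype=np.int32).reshape(4, 4),
            "a1": np.linspace(0.1, 1.6, 16).reshape(4, 4)}

# Columnar format: read only "a1" without touching a0.tdb.
with tiledb.open("dense_array_multi", mode="r") as A:
    print(A.query(attrs=["a1"])[1:3, 1:3])
```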
Var-length Attributes
dense_array3
├── __t2_t2_uuid1_v
│ ├── __fragment_metadata.tdb
│ └── a0.tdb
│ └── a0_var.tdb
├── __t2_t2_uuid1.ok
├── __lock.tdb
├── __meta
└── __schema
└── __t1_t1_uuid2
TileDB supports storing variable-length values in a cell (of any data type)
a bb ccc dddd
e ff ggg hhhh
i jj kkk lll
m nn ooo pppp
(a0.tdb holds the offsets and a0_var.tdb the var-length data)
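A sketch of a var-length (string) attribute, mirroring the a0.tdb / a0_var.tdb split above (tiledb-py assumed; a flexible bytes dtype is used for the variable-length values):

```python
import numpy as np
import tiledb

dom = tiledb.Domain(tiledb.Dim(name="d", domain=(1, 4), tile=2, dtype=np.int32))
# A flexible bytes dtype yields a var-length attribute (offsets + data files).
schema = tiledb.ArraySchema(
    domain=dom, sparse=False, attrs=[tiledb.Attr(name="a0", dtype=np.bytes_)]
)
tiledb.Array.create("dense_array3", schema)

with tiledb.open("dense_array3", mode="w") as A:
    A[:] = np.array([b"a", b"bb", b"ccc", b"dddd"], dtype=object)

with tiledb.open("dense_array3", mode="r") as A:
    print(A[1:3]["a0"])
```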
Var-length Dimensions
sparse_array4
├── __t2_t2_uuid1_v
│ ├── __fragment_metadata.tdb
│ └── a0.tdb
│ └── d0.tdb
│ └── d0_var.tdb
├── __t2_t2_uuid1.ok
├── __lock.tdb
├── __meta
└── __schema
└── __t1_t1_uuid2
You can also have var-length dimensions and slice naturally using string ranges
Applicable only to sparse arrays
(d0.tdb holds the offsets and d0_var.tdb the var-length string coordinates, e.g., a, bb, ccc, dddd, e, ff; the domain is unbounded and may contain infinite gaps)
Heterogeneous Dimensions
Sparse arrays allow you to have dimensions of different types
For example, a 2D sparse array with an infinite (unbounded) string dimension and an infinite float32 dimension allows efficient slicing on both
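A sketch combining a var-length string dimension with a float32 dimension and slicing on both (tiledb-py assumed; the finite float domain and the multi_index string-range syntax are illustrative and may differ slightly across versions):

```python
import numpy as np
import tiledb

dom = tiledb.Domain(
    # Var-length string dimension (unbounded domain).
    tiledb.Dim(name="name", domain=(None, None), tile=None, dtype="ascii"),
    # float32 dimension (a finite domain is used here for simplicity).
    tiledb.Dim(name="x", domain=(0.0, 1.0), tile=0.5, dtype=np.float32),
)
schema = tiledb.ArraySchema(
    domain=dom, sparse=True, attrs=[tiledb.Attr(name="a0", dtype=np.int32)]
)
tiledb.Array.create("sparse_hetero", schema)

with tiledb.open("sparse_hetero", mode="w") as A:
    A[["a", "bb", "dddd"], [0.0, 0.4, 1.0]] = np.array([1, 2, 4], dtype=np.int32)

# Slice with a string range on one dimension and a float range on the other.
with tiledb.open("sparse_hetero", mode="r") as A:
    print(A.multi_index["a":"cc", 0.0:0.5])
```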
Arrays as Dataframes
An array is essentially a dataframe
where dimensions are special (they are “indexed”)
What About Cloud Object Stores?
array_name → {s3,azure,gcs,tiledb}://path/array_name
Everything demonstrated works as-is on the cloud
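For example, with an S3 bucket (bucket name and region are illustrative; credentials are picked up from the usual AWS configuration):

```python
import tiledb

# Point the same code at an object store simply by changing the URI.
cfg = tiledb.Config({"vfs.s3.region": "us-east-1"})
ctx = tiledb.Ctx(cfg)

with tiledb.open("s3://my-bucket/dense_array1", mode="r", ctx=ctx) as A:
    print(A[2:4, 1:3]["a0"])
```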
Tiling & Layout
Tiling | Dense Arrays
Without space tiling (a single tile covering the array), even a small slice fetches the whole array from storage
With space tiles, defined by the space tile extents, the same slice fetches only a portion of the array (a tile)
A space tile is the atomic unit of IO
(Figure: the same 4×4 array with values 1-16, without and with 2×2 space tiles)
Cell Layout | Dense Arrays
Three parameters define the layout of the values on storage, called the global order:
Space tile extents
Tile order/layout (row-major or column-major)
Cell order/layout (row-major or column-major)
(Figure: three example global orders: row-major tile order with row-major cell order on 2×2 space tiles; col-major tile order with row-major cell order on 2×2 space tiles; row-major tile order with col-major cell order on 4×2 space tiles)
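A sketch of setting all three parameters in the array schema (tiledb-py assumed; this reproduces the second example above, col-major tile order with row-major cell order on 2×2 space tiles):

```python
import numpy as np
import tiledb

dom = tiledb.Domain(
    tiledb.Dim(name="rows", domain=(1, 4), tile=2, dtype=np.int32),  # space tile extent 2
    tiledb.Dim(name="cols", domain=(1, 4), tile=2, dtype=np.int32),  # space tile extent 2
)
schema = tiledb.ArraySchema(
    domain=dom, sparse=False,
    tile_order="col-major", cell_order="row-major",
    attrs=[tiledb.Attr(name="a0", dtype=np.int32)],
)
tiledb.Array.create("dense_layout_example", schema)
```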
Tiling & Cell Layout | Sparse Arrays
Sparse arrays store only non-empty cells
Grouping non-empty cells with space tiles would be inefficient (due to potential skew)
The atomic unit of IO in sparse arrays is the data tile, of fixed (user-defined) capacity
First impose a global order similar to dense arrays, then group based on capacity
(Figure: with col-major tile order, row-major cell order and 2×2 space tiles, the non-empty cells are sorted in the global order and then grouped into data tiles of capacity 2 or capacity 4)
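A sketch of the sparse counterpart, where the space tiles define the global order and the capacity defines the data tile size (tiledb-py assumed):

```python
import numpy as np
import tiledb

dom = tiledb.Domain(
    tiledb.Dim(name="d0", domain=(1, 4), tile=2, dtype=np.int32),
    tiledb.Dim(name="d1", domain=(1, 4), tile=2, dtype=np.int32),
)
schema = tiledb.ArraySchema(
    domain=dom, sparse=True,
    tile_order="col-major", cell_order="row-major",
    capacity=2,  # each data tile (the sparse IO unit) holds 2 non-empty cells
    attrs=[tiledb.Attr(name="a0", dtype=np.int32)],
)
tiledb.Array.create("sparse_capacity_example", schema)
```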
Hilbert Order | Sparse Arrays
Space tiles greatly affect the cell layout in sparse arrays
Sometimes it is very difficult to define a good space tiling (especially with floats and strings)
For such cases, the Hilbert order is ideal (no tile extents or tile order needed)
For floats we discretize the domain into buckets
based on the number of dimensions
For strings we assign a number of bits per dimension
and then use the string prefixes as numbers
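A sketch of a sparse schema using the Hilbert cell order, where no tile extents or tile order need to be specified (tiledb-py assumed):

```python
import numpy as np
import tiledb

dom = tiledb.Domain(
    tiledb.Dim(name="x", domain=(0.0, 1.0), dtype=np.float32),
    tiledb.Dim(name="y", domain=(0.0, 1.0), dtype=np.float32),
)
schema = tiledb.ArraySchema(
    domain=dom, sparse=True,
    cell_order="hilbert",  # cells are sorted along a Hilbert curve
    capacity=1000,         # only the data tile capacity is needed
    attrs=[tiledb.Attr(name="a0", dtype=np.float64)],
)
tiledb.Array.create("sparse_hilbert_example", schema)
```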
Tile Filters
TileDB allows a wide range of filters to be applied to each tile prior to its storage
Compressors (gzip, zstd, bzip2, …)
Checksums
Encryption
The atomic unit of filtering is the chunk (typically equal to the L1 cache size)
TileDB applies the filters across chunks in parallel in a pipeline
(Figure: a tile is divided into chunks, and each chunk passes through the filter pipeline, e.g., zstd compression followed by AES-256 encryption)
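A sketch of attaching a per-attribute filter pipeline (tiledb-py assumed; the exact filter class names may vary slightly across versions):

```python
import numpy as np
import tiledb

# Each tile of "a0" is chunked; the chunks pass through zstd, then a checksum filter.
filters = tiledb.FilterList(
    [tiledb.ZstdFilter(level=5), tiledb.ChecksumSHA256Filter()]
)

dom = tiledb.Domain(tiledb.Dim(name="d", domain=(0, 999), tile=100, dtype=np.int64))
schema = tiledb.ArraySchema(
    domain=dom, sparse=False,
    attrs=[tiledb.Attr(name="a0", dtype=np.float64, filters=filters)],
)
tiledb.Array.create("filtered_array_example", schema)
```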
Advanced
Internal Mechanics
Versioning and Time Traveling
In TileDB, every write is immutable
Each (batch) write creates a timestamped fragment
With fragments, TileDB implements
versioning and time traveling
Versioning and Time Traveling | Dense Arrays
(Figure: the full 4×4 array with values 1-16 is written at t1, and a second write at t2 updates a few cells with the values 100, 200, 500 and 600. Reading at [0, t1] returns the original array, reading at [0, t2] returns the array with the t2 updates applied, and reading at (t1, t2] returns only the cells written at t2, with all other cells empty)
Versioning and Time Traveling | Sparse Arrays
(Figure: six non-empty cells with values 1-6 are written at t1, and a second write at t2 updates two of them with the values 100 and 40. When duplicates are not allowed, reading at [0, t1] returns the original cells, reading at [0, t2] returns the cells with the t2 values replacing the older ones, and reading at (t1, t2] returns only the cells written at t2. When duplicates are allowed, reading at [0, t2] returns both the old and the new values for the updated cells)
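A sketch of time traveling on the dense_array1 created earlier (tiledb-py assumed; timestamps are taken on the client side here for illustration, in milliseconds):

```python
import time
import numpy as np
import tiledb

# Two immutable writes, each producing a timestamped fragment.
with tiledb.open("dense_array1", mode="w") as A:
    A[1:2, 1:5] = np.array([[100, 200, 300, 400]], dtype=np.int32)
t1 = int(time.time() * 1000)

with tiledb.open("dense_array1", mode="w") as A:
    A[2:3, 1:5] = np.array([[500, 600, 700, 800]], dtype=np.int32)
t2 = int(time.time() * 1000)

# Read the array as of t1 (interval [0, t1]); the second write is invisible.
with tiledb.open("dense_array1", mode="r", timestamp=t1) as A:
    print(A[:]["a0"])

# Read only the fragments written in (t1, t2].
with tiledb.open("dense_array1", mode="r", timestamp=(t1 + 1, t2)) as A:
    print(A[:]["a0"])
```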
Indexing
TileDB has a three-level indexing approach
Fragment timestamps (in the fragment names) for time traveling
Non-empty domain in each fragment’s metadata
Either simple offset arithmetic (dense) or R-trees (sparse)
Algorithm
1. Get the list of fragment names (those with a .ok file), e.g.:
t1_t1_uuid1_v
t2_t2_uuid2_v
...
2. Ignore fragments with a timestamp not in the time traveling interval
3. Ignore fragments whose non-empty domain (stored in __fragment_metadata.tdb) does not overlap the slice
4a. Ignore dense tiles via implicit positional indexing, or
4b. Ignore sparse tiles from the R-tree that do not overlap the slice
Indexing
Given the non-empty domain, the space tile extents and the tile order, we can easily find which space tiles a slice overlaps in a dense fragment (e.g., the second and fourth tile in the example)
For a sparse fragment, a slicing query traverses the R-tree (stored in the fragment metadata) top-down, visiting only the nodes/MBRs that intersect the slice
(Figure: a dense 4×4 array with row-major tile order and 2×2 space tiles, and a sparse array with col-major tile order, row-major cell order, 2×2 space tiles and capacity 2, whose data tile MBRs MBR1-MBR4 form the leaves of the R-tree)
Consolidation & Vacuuming
Numerous fragments can lead to performance degradation (loss of locality, expensive listing)
TileDB supports two levels of consolidation
Fragment metadata (group the non-empty domains in a single place)
Fragments (better preserve data locality)
Old fragments are preserved after consolidation (for time traveling)
TileDB can vacuum old fragments to save space and boost listing
Time traveling will not work on vacuumed fragments
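A sketch of the two consolidation levels and vacuuming (tiledb-py assumed; config key names follow the TileDB "sm.consolidation.*" convention):

```python
import tiledb

# Consolidate only the fragment metadata (cheap; very useful on object stores).
tiledb.consolidate("dense_array1",
                   config=tiledb.Config({"sm.consolidation.mode": "fragment_meta"}))

# Consolidate the fragments themselves to restore data locality.
tiledb.consolidate("dense_array1",
                   config=tiledb.Config({"sm.consolidation.mode": "fragments"}))

# Old fragments remain available for time traveling until vacuumed.
tiledb.vacuum("dense_array1")
```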
Attribute Filter Push-Down
TileDB supports pushing attribute filter conditions down to the engine
That typically boosts performance
Much less data gets copied around
More L1-cache conscious
More opportunities for parallelism and vectorization
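A sketch of pushing an attribute condition down to the engine (tiledb-py assumed; the cond parameter accepts a condition string in recent versions, while older versions use tiledb.QueryCondition):

```python
import tiledb

# The condition is evaluated inside the engine, so only matching cells are returned.
with tiledb.open("sparse_array1", mode="r") as A:
    result = A.query(cond="a0 > 3", attrs=["a0"])[:]
    print(result)
```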
Schema Evolution
TileDB supports schema evolution (since v2.4)
Adding an attribute
Dropping an attribute
More schema evolution features are coming up
Full versioning and time traveling are supported
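A sketch of evolving an existing schema in place (tiledb-py assumed; method names may differ slightly across versions):

```python
import numpy as np
import tiledb

se = tiledb.ArraySchemaEvolution()
se.add_attribute(tiledb.Attr(name="a2", dtype=np.float64))  # add a new attribute
se.drop_attribute("a1")                                     # drop an existing one
se.array_evolve("dense_array_multi")
```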
Notes on Writing
Lots of flexibility: writing in different orders, to different domain subarrays, etc.
Support for lock-free, parallel writing
Tips for performance:
Each tile should be 100KB - 1MB
Each fragment should be 1 - 2GB
Fragments should not “interleave”
Run fragment metadata consolidation (especially on cloud object stores)
No support for deletions and updates yet (coming up soon)
Notes on Reading
TileDB is eventually consistent
Support for parallel writers, parallel readers (all lock-free)
Support for reads in different layouts
Support for “streaming reads” (incomplete queries)
Tips for performance:
Allocate sufficient space for the result buffers (minimize incomplete queries)
Tune written layout based on the read layout (application dependent)
Push down coordinate and attribute filter conditions
Work In Progress
Coming Up
More schema evolution features
Support for deletes and updates
Git-like versioning
ACID via modularizing locking
More tile filters (e.g., sum, min, max)
RLE and dictionary compression on strings
Computations on compressed data
Linear Algebra operations
More SQL push down (e.g., group by)
Graph algorithms
TileDB vs. Others
High-level Comparisons
vs. HDF5
TileDB is cloud-native
TileDB has support for sparse arrays
TileDB has support for versioning and time traveling
vs. Zarr
TileDB is built in C++ and is more interoperable
TileDB has support for sparse arrays
TileDB has support for versioning and time traveling
High-level Comparisons
vs. Parquet
TileDB is multi-dimensional and supports more flexible layouts
TileDB has support for dense arrays
TileDB has support for versioning and time traveling
vs. Delta Lake
TileDB does not rely on Spark, Presto or any other subsystem
TileDB is natively multi-dimensional and supports more flexible layouts
TileDB has support for dense arrays
TileDB does not support deletes, updates and full ACID (yet)
The Universal Database
Thank you
WE ARE HIRING
Apply at tiledb.workable.com