20121024 mongodb-boston (1)

MongoDB and Fractal Tree® Indexes

Tim Callaghan*
VP/Engineering, Tokutek
tim@tokutek.com

MongoDB Boston 2012

* not [yet] a MongoDB expert


B-tree Definition

In computer science, a B-tree is a tree data
structure that keeps data sorted and allows searches,
sequential access, insertions, and deletions
in logarithmic time.

http://en.wikipedia.org/wiki/B-tree

B-tree Overview

I will use a simple single-pivot example throughout this presentation.

Basic B-tree

Pivots
Pointers
Internal Nodes - Path to data
Leaf Nodes - Actual Data


B-tree example

Root pivot: 22
Internal pivots: 10 (left), 99 (right)
Leaves: [2, 3, 4] [10, 20] [22, 25] [99]

* Pivot Rule is >=
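The lookup implied by the example tree and its ">=" pivot rule can be sketched in JavaScript, the language of this deck's MongoDB shell snippets. This is a hypothetical in-memory model: the node shape (`pivot`/`children` for internal nodes, `keys` for leaves) and the `search` helper are illustrative assumptions, not MongoDB's B-tree code. Searching for 25, as on the "Find 25" slide, follows pivot 22, then pivot 99, and ends in leaf [22, 25].

```javascript
// Hypothetical model of the example tree (not MongoDB's implementation).
// Internal nodes: one pivot, two children. Leaves: sorted keys.
const tree = {
  pivot: 22,
  children: [
    { pivot: 10, children: [{ keys: [2, 3, 4] }, { keys: [10, 20] }] },
    { pivot: 99, children: [{ keys: [22, 25] }, { keys: [99] }] },
  ],
};

// Pivot rule ">=": keys greater than or equal to the pivot go to the right child.
function childFor(node, key) {
  return key >= node.pivot ? node.children[1] : node.children[0];
}

// Walk from the root to a leaf, recording the pivots visited on the way.
function search(root, key) {
  const path = [];
  let node = root;
  while (node.children) {
    path.push(node.pivot);
    node = childFor(node, key);
  }
  return { found: node.keys.includes(key), path, leaf: node.keys };
}

console.log(search(tree, 25)); // found: true, path [22, 99], leaf [22, 25]
```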

B-tree - insert

“Insert 15”

Root pivot: 22
Internal pivots: 10 (left), 99 (right)
Leaves: [2, 3, 4] [10, 15, 20] [22, 25] [99]

Value stored in leaf node
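A matching sketch of "Insert 15" under the same hypothetical model: descend with the ">=" rule, then place the key in sorted position within the leaf. Node splits are deliberately omitted, so this only illustrates the case where the leaf has room.

```javascript
// Hypothetical model of the example tree (not MongoDB's implementation).
const tree = {
  pivot: 22,
  children: [
    { pivot: 10, children: [{ keys: [2, 3, 4] }, { keys: [10, 20] }] },
    { pivot: 99, children: [{ keys: [22, 25] }, { keys: [99] }] },
  ],
};

// Descend with the ">=" pivot rule, then insert into the leaf in sorted order.
// Splitting a full leaf is omitted to keep the sketch short.
function insert(root, key) {
  let node = root;
  while (node.children) {
    node = key >= node.pivot ? node.children[1] : node.children[0];
  }
  let i = 0;
  while (i < node.keys.length && node.keys[i] < key) i++;
  node.keys.splice(i, 0, key);
  return node.keys;
}

console.log(insert(tree, 15)); // [ 10, 15, 20 ]
```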

B-tree - search

“Find 25”

Root pivot: 22
Internal pivots: 10 (left), 99 (right)
Leaves: [2, 3, 4] [10, 20] [22, 25] [99]

B-tree - storage

Performance is IO-limited when the tree is bigger than RAM:
try to fit all internal nodes and some leaf nodes in RAM

Internal nodes (RAM): 22; 10, 99
Leaf nodes (DISK, some in RAM): [2, 3, 4] [10, 20] [22, 25] [99]

B-tree – serial insertions

Serial insertion workloads stay in memory; think of MongoDB’s “_id” index, where each new key lands in the rightmost leaf.

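The difference between serial and random insertion can be illustrated with a small cache simulation. It is a simplified model under stated assumptions, not a measurement of MongoDB: internal nodes are always in RAM, each leaf covers a fixed range of 100 keys (no splits), RAM holds 10 leaves with LRU eviction, and a "leaf read" is counted whenever an insert touches a leaf that is not cached. The constants and the Park-Miller generator used for random keys are arbitrary choices.

```javascript
const KEYS_PER_LEAF = 100; // keys covered by each leaf (assumed fixed, no splits)
const CACHE_LEAVES = 10;   // leaves that fit in RAM (assumed)
const N = 100000;          // inserts over a key space of N keys, i.e. 1000 leaves

// Count leaf reads (cache misses) for a sequence of inserted keys, with LRU eviction.
function diskReads(keys) {
  const lru = new Map(); // Map iterates in insertion order: first entry is least recently used
  let misses = 0;
  for (const key of keys) {
    const leaf = Math.floor(key / KEYS_PER_LEAF);
    if (lru.has(leaf)) {
      lru.delete(leaf); // re-inserted below to mark it most recently used
    } else {
      misses++;
      if (lru.size === CACHE_LEAVES) lru.delete(lru.keys().next().value); // evict LRU leaf
    }
    lru.set(leaf, true);
  }
  return misses;
}

// Serial keys, like an ever-increasing _id: always the rightmost leaf.
const serial = Array.from({ length: N }, (_, i) => i);

// Random keys from a deterministic Park-Miller generator.
let seed = 42;
const random = Array.from({ length: N }, () => {
  seed = (seed * 16807) % 2147483647;
  return seed % N;
});

console.log("serial inserts, leaf reads:", diskReads(serial)); // 1000
console.log("random inserts, leaf reads:", diskReads(random));
```

With serial keys each leaf misses once and then stays hot; with random keys almost every insert misses, which is the working-set-exceeds-RAM effect described above.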

Fractal Tree Indexes

All internal nodes have message buffers.

similar to B-trees
- store data in leaf nodes
- use PK for ordering

different from B-trees
- message buffer in all internal nodes
- doesn’t need to update leaf node immediately
- much larger nodes (4MB vs. 8KB*)

Fractal Tree Indexes – “insert 15”

Root pivot: 22, buffer: insert(15)
Internal pivots: 10 (left), 99 (right)
Leaves: [2, 3, 4] [10, 20] [22, 25] [99]

No IO is required; all internal nodes usually fit in RAM.
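The buffered insert can be sketched as follows. This is a hypothetical model, not Tokutek's code: each internal node carries a `buffer` array of messages, `insert` only appends to the root's buffer, and `find` checks every buffer on the root-to-leaf path before looking in the leaf, because a message may not have reached its leaf yet. It assumes the usual buffered-tree invariant that, for a given key, a message higher in the tree is newer than one below it.

```javascript
// Hypothetical Fractal Tree model: internal nodes have a pivot, two children,
// and a message buffer; leaves hold sorted keys.
const tree = {
  pivot: 22, buffer: [],
  children: [
    { pivot: 10, buffer: [], children: [{ keys: [2, 3, 4] }, { keys: [10, 20] }] },
    { pivot: 99, buffer: [], children: [{ keys: [22, 25] }, { keys: [99] }] },
  ],
};

// insert() appends a message to the root buffer; no leaf is read or written.
function insert(root, key) {
  root.buffer.push({ op: "insert", key });
}

// find() checks each buffer on the path, newest message first, then the leaf.
function find(root, key) {
  let node = root;
  while (node.children) {
    for (let i = node.buffer.length - 1; i >= 0; i--) {
      if (node.buffer[i].key === key) return node.buffer[i].op === "insert";
    }
    node = key >= node.pivot ? node.children[1] : node.children[0];
  }
  return node.keys.includes(key);
}

insert(tree, 15);
console.log(find(tree, 15));                     // true: answered from the root buffer
console.log(tree.children[0].children[1].keys); // [ 10, 20 ]: leaf not updated yet
```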


Fractal Tree Indexes – “find 25”

Root pivot: 22
Internal pivots: 10 (left), 99 (right)
Buffered messages: insert(15), insert(20), insert(25), delete(3)
Leaves: [2, 3, 4] [10] [22, 25] [99]

Buffer is full, push messages down to next level.
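Pushing messages down when a buffer fills can be sketched like this. The buffer capacity of 3 and the message list are hypothetical; the point is that a flush partitions the buffered messages by the node's pivot (using the same ">=" rule) and appends them, in arrival order, to the children's buffers, without touching any leaf yet.

```javascript
// Hypothetical flush sketch (capacity, node shape, and messages are assumptions).
const CAPACITY = 3; // messages a buffer may hold before it must be flushed

// If over capacity, move every message, oldest first, to the buffer of the
// child chosen by the ">=" pivot rule, then empty this node's buffer.
function flushIfFull(node) {
  if (node.buffer.length <= CAPACITY) return false;
  for (const msg of node.buffer) {
    const child = msg.key >= node.pivot ? node.children[1] : node.children[0];
    child.buffer.push(msg);
  }
  node.buffer = [];
  return true;
}

const root = {
  pivot: 22,
  buffer: [
    { op: "insert", key: 20 },
    { op: "insert", key: 25 },
    { op: "delete", key: 3 },
    { op: "insert", key: 8 },
  ],
  children: [
    { pivot: 10, buffer: [], children: [{ keys: [2, 3, 4] }, { keys: [10] }] },
    { pivot: 99, buffer: [], children: [{ keys: [22, 25] }, { keys: [99] }] },
  ],
};

const show = (node) => node.buffer.map((m) => `${m.op}(${m.key})`);
flushIfFull(root);
console.log(show(root.children[0])); // [ 'insert(20)', 'delete(3)', 'insert(8)' ]
console.log(show(root.children[1])); // [ 'insert(25)' ]
```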



Root pivot: 22, buffer: insert(15)
Internal pivots: 10 (left), 99 (right)
Leaves: [2, 4, 8] [10, 20, 25] [22, 25] [99]

Inserted 8, 20, 25. Deleted 3.
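When messages finally reach a leaf, they are applied in the order they arrived. A minimal sketch, assuming a simple `{op, key}` message format (hypothetical, not Tokutek's), reproduces the left leaf above: [2, 3, 4] with insert(8) and delete(3) becomes [2, 4, 8].

```javascript
// Apply buffered messages to a leaf in arrival order (hypothetical message format).
function applyMessages(leafKeys, messages) {
  const keys = new Set(leafKeys);
  for (const { op, key } of messages) {
    if (op === "insert") keys.add(key);
    else if (op === "delete") keys.delete(key);
  }
  return [...keys].sort((a, b) => a - b); // leaves keep their keys sorted
}

console.log(applyMessages([2, 3, 4], [
  { op: "insert", key: 8 },
  { op: "delete", key: 3 },
])); // [ 2, 4, 8 ]
```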


Fractal Tree Indexes – compression

•  Large node size (4MB) leads to high compression ratios.
•  Supports zlib, quicklz, and lzma compression algorithms.
•  Compression is generally 5x to 25x, similar to what gzip and 7z can do to your data.
•  Significantly less disk space needed
•  Fewer writes, bigger writes
– Both of which are great for SSDs
•  Reads are highly compressed, more data per IO
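The effect of node size on compression can be tried with Node.js's built-in `zlib` module. This sketch uses deflate only (not quicklz or lzma), synthetic JSON-like documents with hypothetical field names, and compares compressing the same data as independent 8KB blocks versus one 4MB block; real ratios depend entirely on the data.

```javascript
const zlib = require("zlib");

// Synthetic, repetitive JSON-like documents (hypothetical fields), well under 4MB in total.
const docs = [];
for (let i = 0; i < 10000; i++) {
  docs.push(JSON.stringify({ _id: i, uri: `http://example.com/page/${i % 500}`, origin: "crawler" }));
}
const data = Buffer.from(docs.join("\n"));

// Compress the buffer as independent blocks of blockSize bytes; return total compressed bytes.
function compressedSize(buf, blockSize) {
  let total = 0;
  for (let off = 0; off < buf.length; off += blockSize) {
    total += zlib.deflateSync(buf.subarray(off, off + blockSize)).length;
  }
  return total;
}

const small = compressedSize(data, 8 * 1024);        // many 8KB blocks
const large = compressedSize(data, 4 * 1024 * 1024); // one 4MB block holds all the data
console.log(`8KB blocks: ${(data.length / small).toFixed(1)}x, 4MB block: ${(data.length / large).toFixed(1)}x`);
```

Deflate only looks back 32KB, so most of the gain here comes from not restarting the dictionary and paying per-stream overhead every 8KB; codecs with larger windows, such as lzma, can exploit large nodes further.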


So what does this have to do with
MongoDB?

* Watch Tyler Brock’s presentation “Indexing
and Query Optimization”


MongoDB Storage

db.test.insert({foo:55})

db.test.ensureIndex({foo:1})

PK index (_id + pointer): pivots 25 (root), 10 and 99; leaf entries (2,ptr2), (4,ptr4), (10,ptr10), (25,ptr25), (98,ptr98), (101,ptr101)

Secondary Index (foo + pointer): pivots 85 (root), 40 and 120; leaf entries (2,ptr10), (35,ptr101), (55,ptr4), (90,ptr2), (2599,ptr98)

The “pointer” tells MongoDB where to look in the data files for the actual document data.
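The pointer-based layout can be modeled in a few lines. In this hypothetical sketch, a JavaScript array stands in for the data files, a pointer is an array offset, and `Map`s stand in for the two B-trees; it is not MongoDB's storage format. The point is the access pattern: a query through the secondary index costs one index lookup plus one data-file read per matching document.

```javascript
// Hypothetical model (not MongoDB's on-disk format): an array stands in for
// the data files, and a pointer is an offset into it.
const dataFile = [];
const pkIndex = new Map();  // _id -> pointer
const fooIndex = new Map(); // foo -> list of pointers (foo values need not be unique)

function insertDoc(doc) {
  const ptr = dataFile.push(doc) - 1; // store the document, remember where it went
  pkIndex.set(doc._id, ptr);
  if (!fooIndex.has(doc.foo)) fooIndex.set(doc.foo, []);
  fooIndex.get(doc.foo).push(ptr);
}

// Query by foo: one secondary-index lookup, then one data-file read per pointer.
function findByFoo(value) {
  const ptrs = fooIndex.get(value) || [];
  return { docs: ptrs.map((ptr) => dataFile[ptr]), dataFileReads: ptrs.length };
}

insertDoc({ _id: 4, foo: 55 });
insertDoc({ _id: 10, foo: 2 });
console.log(findByFoo(55)); // one matching document, fetched with one data-file read
```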

MongoDB Storage

B-trees: both the PK index and the secondary index are B-trees.

Who is Tokutek and what have we done?

•  Tokutek’s Fractal Tree Index implementations:
– MySQL Storage Engine (TokuDB)
– BerkeleyDB API
– File System (TokuFS)
•  Recently added Fractal Tree Indexes to MongoDB 2.2
– Existing indexes are still supported
•  Source changes are available via our blog at www.tokutek.com/tokuview
•  This is a work in progress (see roadmap slides)


MongoDB and Fractal Tree Indexes

as simple as

db.test.ensureIndex({foo:1}, {v:2})


Indexing Options #1

db.test.ensureIndex({foo:1}, {v:2,
                              blocksize:4194304,
                              basementsize:131072,
                              compression:"quicklz",
                              clustering:false})

•  blocksize: node size, defaults to 4MB (4194304 bytes).


Indexing Options #2

basementsize:131072

•  Basement node size, defaults to 128KB (131072 bytes).
•  Smallest retrievable unit of a leaf node, which makes point queries efficient
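Worked arithmetic for the two size defaults, assuming a point query only needs to read the one basement node that holds its key (uncompressed sizes; compression shrinks both):

```javascript
// Default sizes from the ensureIndex options above, in bytes.
const blocksize = 4194304;   // node size (4MB)
const basementsize = 131072; // basement node size (128KB)

// Assumption: a point query reads only the basement node holding its key.
const basementsPerLeaf = blocksize / basementsize;
console.log(basementsPerLeaf); // 32
console.log(`point query reads ${basementsize / 1024}KB instead of ${blocksize / 1024}KB`);
// point query reads 128KB instead of 4096KB
```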


Indexing Options #3

compression:"quicklz"

•  Compression algorithm, defaults to quicklz.
•  Supports quicklz, lzma, zlib, and none.
•  LZMA provides 40% additional compression beyond quicklz but needs more CPU.
•  Decompression speeds of quicklz and lzma are similar.

Indexing Options #4

clustering:false

•  Clustering indexes store data by key and include the entire document as the payload (rather than a pointer to the document)
•  They always “cover” a query, so there is no need to retrieve the document data separately
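The difference between a regular secondary index and a clustering index can be sketched for the range query used in the benchmarks (URI greater than or equal to a value). The arrays below are hypothetical stand-ins for the data files and the two indexes, and the linear `filter` replaces a real index seek plus forward scan; only the count of extra data-file reads is meant to be illustrative.

```javascript
// Hypothetical stand-ins (not the Fractal Tree implementation): a data file of
// documents, plus two indexes on URI, both kept in key order.
const dataFile = [
  { _id: 1, URI: "a", name: "x" },
  { _id: 2, URI: "b", name: "y" },
  { _id: 3, URI: "c", name: "z" },
];
const regularIndex = dataFile.map((doc, ptr) => ({ key: doc.URI, ptr })); // key -> pointer
const clusteringIndex = dataFile.map((doc) => ({ key: doc.URI, doc }));  // key -> whole document

// URI >= start via the regular index: every hit needs a data-file read.
function rangeRegular(start) {
  const hits = regularIndex.filter((e) => e.key >= start);
  return { docs: hits.map((e) => dataFile[e.ptr]), extraReads: hits.length };
}

// Same query via the clustering index: the document is the payload.
function rangeClustering(start) {
  const hits = clusteringIndex.filter((e) => e.key >= start);
  return { docs: hits.map((e) => e.doc), extraReads: 0 };
}

console.log(rangeRegular("b").extraReads);    // 2
console.log(rangeClustering("b").extraReads); // 0
```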


How well does it perform?

Three Benchmarks
•  Benchmark 1 : Raw insertion performance
•  Benchmark 2 : Insertion plus queries
•  Benchmark 3 : Covered indexes vs. clustering
indexes


Benchmarks…

Race Results
•  First Place = John
•  Second Place = Tim
•  Third Place = Frank

Frank can say the following:
“I finished third, but Tim was second to last.”

Understand benchmark specifics and review all results.


Benchmark 1 : Overview

•  Measure single-threaded insertion performance
•  Each document contains URI (character), name (character), origin (character), creation date (timestamp), and expiration date (timestamp)
•  Secondary indexes on URI, name, origin, expiration
•  Machine specifics:
– Sun x4150, (2) Xeon 5460, 8GB RAM, StorageTek
Controller (256MB, write-back), 4x10K SAS/RAID 0
– Ubuntu 10.04 Server (64-bit), ext4 filesystem
– MongoDB v2.2.RC0


Benchmark 1 : Without Journaling


Benchmark 1 : With Journaling


Benchmark 1 : Observations

•  Fractal Tree Indexing insertion performance is 8x
better than standard MongoDB indexing with
journaling, and 11x without journaling
•  Fractal Tree Indexing insertion performance
reaches steady state, even at 200 million
insertions. MongoDB insertion performance seems
to be in continual decline at only 50 million
insertions
•  B-tree performance is great until the working data set exceeds RAM

Benchmark 2 : Overview

•  Measure single-threaded insertion performance while querying, once every 60 seconds, for 1000 documents with a URI greater than or equal to a randomly selected value
•  Documents are the same as in benchmark 1
•  Secondary indexes on URI, name, origin, expiration
•  The Fractal Tree Index on URI is a clustering index
– clustering indexes store the entire document inline
– compression controls disk usage
– no need to get document data from elsewhere
– db.tokubench.ensureIndex({URI:1}, {v:2, clustering:true})

•  Same hardware as benchmark 1

Benchmark 2 : Insertion Performance


Benchmark 2 : Query Latency

Benchmark 2 : Observations

•  Fractal Tree Indexing insertion performance is 10x better than standard MongoDB indexing
•  Fractal Tree Indexing query latency is 268x better than standard MongoDB indexing
•  B-tree performance is great until the working data set exceeds RAM
•  Random lookups into the data files are expensive

...but what about MongoDB’s covered indexes?

Benchmark 3 : Overview

•  Same workload and hardware as benchmark 2
•  Create a MongoDB covered index, a compound index led by URI, to eliminate lookups in the data files.
– db.tokubench.ensureIndex({URI:1,creation:1,name:1,origin:1})


Benchmark 3 : Insertion Performance


Benchmark 3 : Query Latency

Benchmark 3 : Observations

•  Fractal Tree Indexing insertion performance is still 3.7x better than standard MongoDB indexing
•  Fractal Tree Indexing query latency is 3.2x better than standard MongoDB indexing (although MongoDB’s performance is highly variable)
•  B-tree performance is great until the working data set exceeds RAM
•  MongoDB’s covered indexes can help a lot
– But what happens when I add new fields to my document?
o Do I drop and re-create the index to include my new field?
o Do I live without it?
– Clustering Fractal Tree Indexes keep on covering your queries!
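The covered-index question above can be made concrete with a small check. This hypothetical sketch only models which fields each index stores: the compound index from benchmark 3 covers a projection only if every requested field is in the index, while a clustering index stores the whole document. It ignores other conditions MongoDB places on covered queries (for example, excluding `_id` from the projection), and the field `rating` is an invented example of a newly added field.

```javascript
// Fields stored in the benchmark 3 compound index.
const coveredFields = ["URI", "creation", "name", "origin"];

// A covered index answers a query alone only if it stores every requested field.
function coveredIndexCovers(fields) {
  return fields.every((f) => coveredFields.includes(f));
}

// A clustering index stores the whole document, so any field is available.
function clusteringIndexCovers(_fields) {
  return true;
}

const before = ["URI", "name", "origin"];
const after = ["URI", "name", "rating"]; // "rating": hypothetical field added later
console.log(coveredIndexCovers(before), clusteringIndexCovers(before)); // true true
console.log(coveredIndexCovers(after), clusteringIndexCovers(after));   // false true
```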

Roadmap : Continuing the Implementation

•  Optimize Indexing Insert/Update/Delete Operations
– Each of our secondary indexes currently creates and commits a transaction for each operation
– A single transaction envelope around all of them will improve performance



•  Add Support for Parallel Array Indexes
– MongoDB does not support a compound index over two array fields in the same document, such as:
o {a: [1, 2], b: [1, 2]}
– “it could get out of hand”
– Ticketed on 3/24/2010,
jira.mongodb.org/browse/SERVER-826
– Benchmark coming soon…
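One way to see why parallel arrays “could get out of hand”: a compound multikey index would need an entry for every combination of elements from the two arrays, i.e. their Cartesian product. The helper below is a hypothetical illustration of that growth, not MongoDB's key generator.

```javascript
// Hypothetical illustration: entries a compound multikey index on (a, b) would need.
function compoundIndexKeys(doc, fieldA, fieldB) {
  const aVals = Array.isArray(doc[fieldA]) ? doc[fieldA] : [doc[fieldA]];
  const bVals = Array.isArray(doc[fieldB]) ? doc[fieldB] : [doc[fieldB]];
  const keys = [];
  for (const a of aVals) {
    for (const b of bVals) keys.push([a, b]); // one entry per (a, b) combination
  }
  return keys;
}

console.log(compoundIndexKeys({ a: [1, 2], b: [1, 2] }, "a", "b"));
// [ [ 1, 1 ], [ 1, 2 ], [ 2, 1 ], [ 2, 2 ] ]

const range = (n) => Array.from({ length: n }, (_, i) => i);
console.log(compoundIndexKeys({ a: range(100), b: range(100) }, "a", "b").length); // 10000
```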



•  Add Crash Safety
– Our implementation is not [yet] crash-safe with the MongoDB PK/heap storage mechanism.
– The MongoDB journal is separate from the Fractal Tree Index logs.
– We need to create a transactional envelope around both of them



•  Replace MongoDB data store and PK index
– A clustering index on _id eliminates the need for two
storage systems
– Compression greatly reduces disk footprint
– This is a large task


We are looking for evaluators!

Email me at tim@tokutek.com

See me after the presentation


Questions?

Tim Callaghan
tim@tokutek.com
@tmcallaghan

More detailed benchmark information is in my blog at
www.tokutek.com/tokuview
