Introduction to Memoria
The Application Specific Data Structures Toolkit
Motivation I
●   Memory is cheap, but fast memory will always be
    expensive and limited in size.
●   Random access to DRAM is relatively slow
    and hasn't improved significantly for
    decades (it takes about 50 ns).
●   Sequential access is 10 to 1000 times faster.
●   Memory is hierarchical: the faster the memory,
    the smaller its size.
●   And this limitation is physical (the speed of
    light is finite).
●   We need to fit as much information into the
    limited memory as possible.
●   And we need data structures that exploit fast
    sequential access instead of slow random access.
Motivation II
●   Most main-memory data structures haven't changed for decades. They are still
    based on directly mapped linked structures where links are represented with
    memory pointers.
●   Pointer operations have O(1) theoretical complexity in the plain RAM model, but
    this no longer holds in hierarchical memory.
●   Moreover, pointers consume too much memory, especially on 64-bit architectures.
    STL std::set<BigInt> consumes 40 bytes for every tree node (Linux GCC 4.6), which
    is 5 times the size of the data (sizeof(BigInt) = 8 bytes). An std::set<BigInt>
    containing 1M elements will consume 40 MiB of memory (see the sketch below).
●   The situation is even worse for the in-memory representation of structured
    documents (XML, HTML, ODF etc) where the document representation consumes
    100+ times more memory than the raw data (check the Firefox memory usage).
●   Linked-list-based data structures are simple and flexible; they just do their job. But
    how well do they perform on modern hardware? Let's compare the performance of
    two different implementations of a balanced search tree – one of the most
    fundamental data structures.
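To see where the 40 bytes come from, here is a minimal sketch of the node layout a typical 64-bit libstdc++ std::set allocates per element (field names simplified; the real type is std::_Rb_tree_node with the same five fields):

```cpp
#include <cstdint>
#include <iostream>

using BigInt = std::int64_t;

// Simplified stand-in for the node type libstdc++'s std::set uses.
struct RBTreeNode {
    int         color;   // red/black flag, padded to 8 bytes on x86-64
    RBTreeNode* parent;  // 8 bytes
    RBTreeNode* left;    // 8 bytes
    RBTreeNode* right;   // 8 bytes
    BigInt      key;     // the actual payload: 8 bytes
};

int main() {
    // Prints 40 on a typical 64-bit Linux/GCC build:
    // 32 bytes of bookkeeping per 8 bytes of data.
    std::cout << sizeof(RBTreeNode) << "\n";
}
```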
Balanced Search Trees
●   Let's consider a balanced partial sums tree
    representing an ordered sequence of
    numbers (this structure is used in Memoria).
●   And instead of implementing it with the
    well-known linked nodes, let's pack it into an
    array: PackedTree and PackedSet (a search
    sketch follows below).
●   We are not limiting ourselves to the binary
    case: PackedTree is a multiary tree.
●   Let's check its performance under
    various conditions and compare it with
    std::set<> (which is based on a binary
    red-black tree).
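A minimal sketch of rank search in such an array-packed K-ary partial sums tree, assuming a complete layout where node n stores its subtree sum and its children sit at indexes n*K+1 .. n*K+K; the layout and names are illustrative, not Memoria's actual PackedTree code:

```cpp
#include <cstdint>
#include <vector>

// Node n stores the sum of its subtree; its K children sit at indexes
// n*K+1 .. n*K+K, so the descent needs no pointers at all.
// Precondition: target is less than the total sum stored at the root.
template <int K>
std::size_t find_leaf(const std::vector<std::int64_t>& sums,
                      std::size_t leaf_start,   // index of the first leaf
                      std::int64_t target)      // prefix sum to locate
{
    std::size_t node = 0;                       // start at the root
    while (node < leaf_start) {
        std::size_t child = node * K + 1;
        // Sequential scan over the K children: cache friendly, since
        // all their sums are adjacent in one array.
        while (target >= sums[child]) {
            target -= sums[child];
            ++child;
        }
        node = child;
    }
    return node - leaf_start;                   // ordinal of the found leaf
}
```

The inner loop over children is a linear scan of adjacent array cells – exactly the fast sequential access pattern from Motivation I.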
PackedTree Performance Analysis
●   The benchmark PC has an Intel Q9400 @ 3 GHz with 2x3 MB L2 cache, 8 GB
    DDR2 RAM @ 750 MHz, running Fedora Core 16 with GCC 4.6.3.
●   When the packed tree fits into the CPU cache, trees with low fanout perform better.
●   But when the data is in main memory, trees with high fanout perform better (except
    for the 64-children one).
●   This is because a balanced tree with high fanout has fewer levels, which means
    fewer random-access operations. And the search for the next child within a node is
    sequential and fast.
●   DRAM has high latency, and the CPU loses hundreds of cycles on every cache
    miss. These lost cycles are a hidden reserve of performance for data structures.
●   The linked-list-based STL std::set<> is much faster than PackedTree if the data
    structure fits into the CPU cache, but PackedTree is much faster otherwise.
●   This is despite the fact that the packed tree performs many more arithmetic
    operations to find one element of a sequence than the pointer-based RB-tree inside
    std::set<>.
Motivation III
●   The situation is even worse with external memory, which is so slow at random
    access that every external IO operation must be taken into account.
●   No one size fits all here. Efficient data structures for external memory are always
    application specific or workload specific (think NoSQL).
●   This is true not only for external memory. Intelligent data processing requires
    complex data structures such as searchable sequences and bit vectors.
●   Giant volumes of hot structured data need succinct and compressed versions of
    data structures.
●   This is the reason why application specific data structures are the black magic of
    computer science. Each of them incorporates a significant amount of hardcore
    knowledge.
●   The only way to accelerate progress in this area is solution sharing. To make
    this possible we need a development framework or toolkit for data structures in
    generalized hierarchical memory (from CPU registers to data grids).
What is Memoria?
●   Memoria for data structures is like Qt for GUI and like LLVM for program code. It
    separates logical data representation in the form of data structures from its physical
    representation in hierarchical memory. The only universal way to achieve high
    performance is to implicitly reorganize physical data layout according to workload-
    specific access patterns.
●   The word is Latin, and can be translated as "memory". Memoria was the term for
    aspects involving memory in Western classical rhetoric discourse.
●   Memoria is written in C++ and relies heavily on template metaprogramming for
    constructing data structures from basic building blocks. Memoria provides an STL-
    like API for client code. GCC 4.6 and Clang 3.0 are supported.
●   It uses a modular design with separation of concerns. Physical memory block
    management is isolated behind the Allocator interface (see the sketch below).
●   The default implementation of Allocator (SmallInMemoryAllocator) provides
    serialization of the allocator's state to a stream and copy-on-write based
    transactions.
●   The project core provides a templated balanced search tree as well as several
    basic data structures such as Map, Set, Vector and VectorMap built over it.
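A minimal sketch of this separation of concerns; all class names and signatures here are illustrative assumptions, not Memoria's actual declarations:

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <vector>

using BlockID = std::int64_t;

// The isolation boundary: containers see memory blocks only through
// this interface, never a particular physical memory arrangement.
struct Allocator {
    virtual std::vector<std::uint8_t>& get(BlockID id) = 0;
    virtual BlockID allocate(std::size_t size) = 0;
    virtual void commit() = 0;   // transaction boundary (copy-on-write)
    virtual ~Allocator() = default;
};

// A toy counterpart of SmallInMemoryAllocator: blocks live in a map,
// and the whole state could be serialized to a stream in one pass.
struct ToyInMemoryAllocator : Allocator {
    std::map<BlockID, std::vector<std::uint8_t>> blocks;
    BlockID next_id = 0;

    std::vector<std::uint8_t>& get(BlockID id) override {
        return blocks.at(id);
    }
    BlockID allocate(std::size_t size) override {
        blocks[next_id] = std::vector<std::uint8_t>(size);
        return next_id++;
    }
    void commit() override { /* copy-on-write bookkeeping goes here */ }
};
```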
The Hidden Reserve of Performance
●   The fundamental data structure of Memoria is a
    balanced tree of array-packed trees of limited size –
    something like a B-tree, but with many differences...
●   Memoria's balanced tree is generalized by design and
    can be customized into various types of balanced
    search trees.
●   It is transactional and abstracted from the physical
    memory manager (Allocator).
●   So a huge amount of computational work is performed
    when the balanced tree is read or written.
●   But, as benchmarks show, Memoria Set<> is only
    about 3 times slower than the lightweight PackedSet
    when most of the tree is in RAM.
●   And it is even faster than STL set<> in this case.
●   Even for a linear ordered scan. Check it...
Memoria Update Performance
●   A cool feature of the Memoria balanced search tree is
    batch update operations.
●   If multiple updates are performed in a batch, some of
    the computational work can be shared between the
    individual update operations (see the sketch below).
●   There are two kinds of batching: batch
    insertions/deletions and transactional grouping.
●   There is no upper limit on batch or transaction size.
●   In our benchmarks the random insertion rate went from
    about 450K/sec for single inserts to 20M/sec for
    128-element batches, and finally reached more than
    80M/sec for batches of 10K+ elements.
●   Placing several updates in one transaction does not
    bring such a giant performance improvement – only 3-5
    times. Check it...
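A sketch of why batching pays off at the block level, under the assumption that a node is a packed array (illustrative code, not Memoria's actual node layout): inserting n adjacent elements costs one tail shift instead of the n separate shifts that n individual inserts would pay.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// One shift plus one sequential copy, amortized over the whole batch.
void batch_insert(std::vector<std::int64_t>& block, std::size_t pos,
                  const std::int64_t* src, std::size_t n)
{
    const std::size_t old_size = block.size();
    block.resize(old_size + n);
    // One shift of the tail, shared by all n new elements...
    std::memmove(block.data() + pos + n, block.data() + pos,
                 (old_size - pos) * sizeof(std::int64_t));
    // ...then one straight sequential copy of the whole batch.
    std::memcpy(block.data() + pos, src, n * sizeof(std::int64_t));
}
```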
Update Rate Limits
●   How fast can updates be? Memoria uses 4K memory
    blocks for search tree nodes by default; variable block
    sizes are supported.
●   Let's memmove() data in randomly selected 4K blocks at
    randomly selected offsets, emulating insertions into
    blocks (see the sketch below).
●   For arrays that don't fit in the CPU cache we get about
    2M moves/sec and about 4 GB/sec of memory throughput
    (on our Q9400 PC).
●   So 2M insertions per second is an upper bound on the
    random individual insertion rate of our balanced tree.
●   From the previous benchmark, 450K insertions/sec is not
    very far from this limit.
●   The Memoria core data structure does not introduce
    significant performance overhead over the hardware
    memory level.
●   And there is room for improvement.
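A sketch of this experiment; the block count, shift width, and timing loop are our assumptions, not the original benchmark code:

```cpp
#include <chrono>
#include <cstdint>
#include <cstring>
#include <iostream>
#include <random>
#include <vector>

int main() {
    constexpr std::size_t kBlockSize = 4096;   // default node size
    constexpr std::size_t kBlocks = 1 << 18;   // ~1 GiB, far beyond cache
    std::vector<std::uint8_t> mem(kBlocks * kBlockSize);

    std::mt19937_64 rng(42);
    constexpr std::size_t kMoves = 1000000;

    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < kMoves; ++i) {
        std::uint8_t* block = mem.data() + (rng() % kBlocks) * kBlockSize;
        std::size_t at = rng() % (kBlockSize - 8);
        // Shift the tail of the block by 8 bytes: one emulated insert.
        std::memmove(block + at + 8, block + at, kBlockSize - at - 8);
    }
    double secs = std::chrono::duration<double>(
                      std::chrono::steady_clock::now() - start).count();
    std::cout << kMoves / secs << " moves/sec\n";
}
```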
Dynamic Vector
●   Dynamic Vector is based on a partial sums key-value
    map where the key is the offset of a data block in the
    vector's index space and the value is the ID of that
    block (see the sketch below).
●   The partial sums tree provides insert/remove/access
    operations with O(log N) worst-case complexity.
●   It is not a replacement for std::vector<> if fast
    random access is required.
●   But sequential read throughput is limited only by
    main memory. The benchmark shows that it comes
    quite close to the limit (4 GB/sec) even for 4K
    blocks.
●   Random access throughput for 4K blocks comes
    close to 1.5 GB/sec, which is very high in absolute
    numbers :)
●   Random access read performance for 128-byte
    blocks is about 850K op/sec. Check it...
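A flattened sketch of the addressing scheme, with a sorted array of block start offsets standing in for the partial sums tree: lookup stays O(log N) as in the real structure, but inserts would be O(N) here, which is what the tree fixes. Names and layout are illustrative.

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct BlockRef {
    std::int64_t start;     // offset of the block in the vector's index space
    std::int64_t block_id;  // ID of the block holding those bytes
};

// Map a logical position to (block ID, offset inside the block).
// Precondition: index is sorted by start, index[0].start == 0, and
// pos is less than the total vector size.
std::pair<std::int64_t, std::int64_t>
locate(const std::vector<BlockRef>& index, std::int64_t pos)
{
    auto it = std::upper_bound(
        index.begin(), index.end(), pos,
        [](std::int64_t p, const BlockRef& b) { return p < b.start; });
    --it;  // the last block whose start <= pos
    return { it->block_id, pos - it->start };
}
```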
Vector Update Performance
●   Insertions are slower than reads, mainly because it is
    necessary to move data inside data blocks and to update
    the index tree for the new data.
●   In absolute numbers, while sequential read comes close to
    4 GB/sec, sequential append reaches only 1.2 GB/sec on
    our PC.
●   Random insert memory throughput for 16K blocks comes
    close to 600 MB/sec, which is much higher than current
    HDDs/SSDs are able to consume.
●   Random insert performance for 128-byte blocks is about
    550K writes/sec, which is not far from the 2M/sec practical
    limit.
●   Vector does not introduce significant overhead over
    hardware memory for random insertions.
●   Check it...
VectorMap
●   VectorMap is a mapping from BigInt (64
    bits) to a dynamic vector region.
●   It is a combination of a dynamic vector and
    a set of (Key, Offset) pairs represented with
    a two-key partial sums tree (see the sketch
    below).
●   It is relatively succinct – only 8 or 16 bytes
    per entry (not counting internal search tree
    nodes). For 256-byte values the total
    overhead is less than 10%.
●   300K random reads and 200K random
    writes per second for 128-byte values.
●   3.8/1.2 GB/sec memory read/write
    throughput for 256 KB values.
●   Up to 2/0.3 GB/sec sequential read/write
    throughput for 256-byte values.
●   It also supports batch updates.
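To make the space accounting concrete, here is an illustrative entry layout consistent with the numbers above (not Memoria's actual structures):

```cpp
#include <cstdint>

// Each stored value is addressed by one such entry in the two-key
// partial sums tree; the bytes themselves live in the dynamic vector.
struct VectorMapEntry {
    std::int64_t key;     // BigInt identifier: 8 bytes
    std::int64_t offset;  // position of the value in the vector: 8 bytes
};
// 16 bytes per entry; for 256-byte values that is 16/256 = 6.25%
// overhead, consistent with the "less than 10%" figure above
// (internal search tree nodes excluded, as in the original count).
```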
Roadmap
●   Better alignment with modern theoretical results for balanced trees
    (cache-oblivious etc).
●   Multithreading with MVCC-like conflict resolution.
●   Native integration with various virtual machines built on top of LLVM JIT.
●   Extending core support for external memory.
●   Shared memory support for allocators, to share data structures between
    processes.
●   Variable block size support.
●   Dynamic bit vector with rank()/select() operations.
●   LOUDS/DFUDS succinct trees.
●   Generic searchable sequences of 1- to 8-bit symbols on top of
    Vector/VectorMap.
●   Multiary Wavelet Tree and searchable sequences over arbitrary alphabets.
●   Full-text search indexes for NLP and other applications.
●   Succinct graph representation.
●   Etc...
Memoria
is still in active development,
but your contributions are welcome.

If you are interested, please follow us on Bitbucket:
               http://bitbucket.org/vsmirnov/memoria

              For any questions feel free to contact me
                Victor Smirnov <aist11@gmail.com>
