Sridhar Valaguru
(c) Copyright 2013. Contact: extenddb@gmail.com
 Motivation
 Example use cases
 eXTend DB
   Design
   Extensibility
   Limitations
 Morph DB – key-value pair store
   Design and implementation
   Block management design
   Implementation
   Caches
 A unique approach towards the database
 NoSQL document databases like MongoDB are steadily becoming popular.
 MongoDB offers features that suit a wide variety of applications better than traditional SQL databases:
   JSON-style documents with dynamic schemas offer simplicity and power.
   Rich, document-based queries.
   Indexes on any attribute.
   Fast in-place updates and atomic modifiers.
 Features like replication, sharding, high availability, map-reduce, etc. are not applicable in this context.
 The features mentioned previously also apply to stand-alone applications installed and running on user machines.
 There are a few problems with using MongoDB in such applications:
   External dependency on MongoDB.
   The user needs to install it separately.
   The user has to manage MongoDB for the application to work.
   Possibility of namespace collisions among unrelated applications.
   Unnecessary client-server communication hurts performance.
 So there is a need for a document database embedded into the application with features similar to MongoDB; basically the SQLite equivalent of MongoDB.
 An extensible database is a plus.
 Logging library –
   Each log entry could be an object in the database.
   Indexes could be created at a later point in time to analyze log files with rich querying.
 File tagging application –
   Each file's information could be stored as an object in the DB.
   Tags could be attached and removed dynamically.
   Indexed data could extend the object with new fields.
   Querying/searching based on tags or indexed data.
 Single-node user-space NFS server –
   Stores all metadata in the database.
   Maps file handles to object/file attributes.
   Objects are accessed by file handle and/or by parent file handle and name.
   File data is stored separately, outside the database, using an object-id based namespace.
 Any other stand-alone application.
 NoSQL document database.
 Stores BSON documents.
 Embedded into the application process.
 MongoDB-like querying interface.
 Extensible.
 Each database collection is stored as a set of files in a user-specified directory.
[Architecture diagram: the application calls the database API (query and management calls); the engine consists of a query optimizer and an extensible query module on top of a storage layer, which can be backed by Tokyo Cabinet, Morph DB, or an in-memory key-value DB.]
 Data is stored in three types of files, backed by the storage layer's key-value database.
 Descriptor DB –
   Holds information about the list of indexes in the database.
 Main DB –
   Stores all documents, with generated BSON object ids as keys.
   The BSON object id uniquely identifies an object in the collection.
 Index DB –
   Stores references to objects: a particular field value is the key and the value is the list of matching object ids.
 Simple weight-based query optimizer.
 The index that yields the fewest candidate objects is chosen (sketch below).
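A minimal sketch in C of this selection rule, assuming a hypothetical index_candidate structure that pairs each usable index with the number of object ids it would return (names are illustrative, not the real eXTend DB types):

```c
#include <stddef.h>

struct index_candidate {
    const char *name;     /* index name, e.g. an index on field "a" */
    size_t      nobjects; /* number of object ids this index would return */
};

/* Returns the candidate with the fewest objects, or NULL if there are none. */
static const struct index_candidate *
pick_index(const struct index_candidate *c, size_t n)
{
    const struct index_candidate *best = NULL;
    for (size_t i = 0; i < n; i++)
        if (best == NULL || c[i].nobjects < best->nobjects)
            best = &c[i];
    return best;
}
```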
 Provides two services for the database engine (prototypes sketched below):
   Given a query in BSON object format, returns a list of indexes that can be used for that particular query.
    ▪ This is in turn used by the query optimizer to find the best index to use.
   Given a BSON object and a query BSON object, returns whether the object matches the query.
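A hedged sketch of how these two entry points might look in C; bson_t and index_list are stand-in types, not the actual eXTend DB API:

```c
#include <stdbool.h>

struct bson_t;       /* a BSON document */
struct index_list;   /* a list of candidate indexes */

/* 1. Given a query document, return the indexes usable for it; the query
 *    optimizer then picks the best one among them. */
struct index_list *usable_indexes(const struct bson_t *query);

/* 2. Decide whether a stored document satisfies a query document. */
bool document_matches(const struct bson_t *doc, const struct bson_t *query);
```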
 The query module implements comparison operators between two BSON elements.
 It has no knowledge of the storage layer; it operates only on the given BSON objects.
 It can be overridden by users, who may register their own comparison operators.
   This can be very useful for custom binary data stored in the database.
 Various query operators are implemented in the module to provide complex querying.
 Operators let an object be selected in ways other than simply checking that its value equals the value in the query.
 E.g.
   {'a': 3} matches all documents whose field a has the value 3. This is a simple query.
   But if we want all objects whose values are greater than 3, a simple query cannot express that.
   {'a': {'$gt': 3}} is the query that matches all documents where the value is greater than 3.
   Here the operator '$gt' is given the meaning "greater than".
 Any field name starting with "$" is treated as an operator, and the rest of the name gives the operator's name.
 The querying function looks the operator up in the registered list and invokes its handler to check whether the field matches the criterion in the query (see the sketch below).
 By default, operators such as $lt, $lte, $nin, $all, $in and $exists are implemented.
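A simplified sketch in C of the matching step for numeric fields, assuming a reduced model where a query value is either a plain number (equality) or an operator applied to a number; this is illustrative only, not the real BSON-based implementation:

```c
#include <stdbool.h>
#include <string.h>

enum qkind { Q_EQ, Q_OP };

struct qvalue {
    enum qkind  kind;
    const char *op;   /* operator name without the '$', e.g. "gt"; Q_OP only */
    double      num;  /* comparison value */
};

/* Does the stored field value match the query value? */
static bool field_matches(double stored, const struct qvalue *q)
{
    if (q->kind == Q_EQ)
        return stored == q->num;   /* {'a': 3}          */
    if (strcmp(q->op, "gt") == 0)
        return stored > q->num;    /* {'a': {'$gt': 3}} */
    if (strcmp(q->op, "lt") == 0)
        return stored < q->num;    /* {'a': {'$lt': 3}} */
    return false;                  /* unknown operator  */
}
```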
 Custom operators can be registered with the query module (sketch below).
 When a query uses such an operator, the corresponding user callback is invoked.
 The callback takes the value of the field as one parameter and the value from the query as the other, and returns a boolean.
 This way the query language of eXTend DB can be extended without editing the database code or waiting for the developer to implement the feature.
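A hedged sketch of what such an extension hook could look like in C; the registration function, the handler signature and the "$prefix" operator are all illustrative assumptions, not the real eXTend DB API:

```c
#include <stdbool.h>
#include <string.h>

/* A custom operator is a name plus a callback that receives the stored
 * field value and the value given in the query. */
typedef bool (*op_handler)(const void *field_value, const void *query_value);

struct op_entry { const char *name; op_handler handler; };

#define MAX_OPS 32
static struct op_entry op_registry[MAX_OPS];
static int op_count;

static int register_operator(const char *name, op_handler handler)
{
    if (op_count == MAX_OPS)
        return -1;
    op_registry[op_count].name = name;
    op_registry[op_count].handler = handler;
    return op_count++;
}

/* Illustrative custom operator "$prefix": match string fields that start
 * with the string given in the query. */
static bool prefix_handler(const void *field_value, const void *query_value)
{
    const char *field = field_value, *prefix = query_value;
    return strncmp(field, prefix, strlen(prefix)) == 0;
}

/* register_operator("prefix", prefix_handler);
 * would enable queries like {'name': {'$prefix': 'log-'}}. */
```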
 An abstraction layer that provides key-value storage (interface sketched below).
 Isolates data storage from the rest of the database engine.
 The only place where data is stored.
 The backend can be any key-value database, e.g.:
   Tokyo Cabinet
   Morph DB
   An in-memory key-value store
 Currently Tokyo Cabinet is the default key-value backend; it stores all the data in files.
 A Morph DB backend is also almost complete.
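A minimal sketch of the storage-layer contract, assuming a function-pointer table (the struct and field names are hypothetical): any backend that can store and fetch opaque key/value pairs can sit below the database engine.

```c
#include <stddef.h>

struct kv_backend {
    void *ctx;  /* backend-private state (Tokyo Cabinet handle, tree root, ...) */
    int  (*put)(void *ctx, const void *key, size_t klen,
                const void *val, size_t vlen);
    int  (*get)(void *ctx, const void *key, size_t klen,
                void **val_out, size_t *vlen_out);
    int  (*del)(void *ctx, const void *key, size_t klen);
    int  (*sync)(void *ctx);
};
```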
 Different backends can be chosen depending on the type of data stored.
 E.g.
   Index databases can be kept entirely in memory, which provides fast access.
   The main DB could be stored using the Tokyo Cabinet backend.
   For persistent indexes, Morph DB could be used.
 An easy-to-use, MongoDB-like embedded database.
 Extensible storage backends.
 Extensible query language.
 Completely customizable query behavior.
 Tokyo Cabinet updates are not in-place:
   Every time an object is expanded, its old space in the file is discarded and new space is found.
   This is a serious problem for update-heavy workloads.
 Tokyo Cabinet by default writes to memory; a sync is needed to flush the data to the file.
   If the application crashes without a sync, data is lost.
   Sync calls are costly.
   If sync is called after every insert, performance is very low.
 Morph DB is a key-value database aimed at solving the limitations of Tokyo Cabinet.
 Aims of Morph DB –
   Fast in-place updates / object expansion.
   A fast block management layer that can reuse the storage of deleted objects.
   Reads of already-written data should not be slowed down by the block management layer.
   All data is written directly to the file while maintaining performance.
 A B+ tree implementation on top of the block management layer.
 Provides generation-based cursors.
   Cursors keep working while the DB is being modified.
 Can search for values in a range of keys.
 Provides two basic functions (interface sketched below):
   Data write –
    ▪ Finds and allocates resources in the file.
    ▪ Writes the data to suitable location(s).
    ▪ Returns the address where the data was written.
    ▪ The upper layer must store this reference to read the data back.
    ▪ The data is not interpreted.
   Data read –
    ▪ Given the address returned earlier by the data write, reads the data from that offset or chain of offsets.
    ▪ Verifies the checksum of each piece.
    ▪ Returns the stitched object to the caller.
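A hedged sketch of what these two entry points might look like in C; the names and signatures are assumptions, not the actual Morph DB API:

```c
#include <stddef.h>
#include <stdint.h>

struct block_mgr;                /* opaque handle to the block manager */
typedef uint64_t disk_addr_t;    /* address handed back by a write     */

/* Allocates space, writes len bytes of buf (possibly across linked blocks)
 * and returns the address of the first block; the caller must store this
 * reference to be able to read the data back. */
disk_addr_t bm_write(struct block_mgr *bm, const void *buf, size_t len);

/* Follows the chain starting at addr, verifies the checksum of each piece
 * and returns the stitched object in *buf_out (length in *len_out). */
int bm_read(struct block_mgr *bm, disk_addr_t addr,
            void **buf_out, size_t *len_out);
```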
 File storage is managed in terms of resource clusters.
 Each resource cluster contains some header information followed by the resources themselves.
 A distinctive property of resources is that they come in a range of sizes, rather than one fixed block size as in many other solutions.
   Individual resource (block) sizes range from 128 bytes to 4 MB (size classes sketched below).
 This range of block sizes makes the layer suitable for data of various sizes, from very small values up to 16 MB.
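A small sketch of rounding a requested size up to a resource size class, assuming power-of-two classes from 128 bytes to 4 MB (an assumption consistent with the 4-bit type field described later: 0 for 128 B, 1 for 256 B, and so on):

```c
#include <stdint.h>

#define MIN_RESOURCE  128u
#define MAX_RESOURCE  (4u * 1024 * 1024)

static uint32_t resource_size_for(uint32_t want)
{
    uint32_t size = MIN_RESOURCE;
    while (size < want && size < MAX_RESOURCE)
        size <<= 1;           /* 128, 256, 512, ..., 4 MB */
    return size;              /* larger values are stored as linked blocks */
}
```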
 Clusters are allocated on demand for a particular type of resource.
 Cluster sizes start at 128 KB, and each subsequent cluster is double the size of the previous one, capped at 32 MB (sketch below).
 Growing cluster sizes keep the database file small initially and let it grow along with the data.
 With small clusters the header information could be of significant size compared to the resources.
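The growth rule as a small helper, a sketch assuming a 0-based cluster index per resource type (the function name is hypothetical):

```c
#include <stdint.h>

/* Size of the nth cluster allocated for a resource type:
 * 128 KB, 256 KB, 512 KB, ... capped at 32 MB. */
static uint64_t cluster_size(unsigned nth)
{
    uint64_t size = 128u * 1024;               /* first cluster: 128 KB */
    const uint64_t cap = 32u * 1024 * 1024;    /* cap: 32 MB */
    while (nth-- > 0 && size < cap)
        size *= 2;
    return size;
}
```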
 Data is stored as a list of blocks; each block stores a reference to the next block in the list (layout sketched below).
 Each chunk stores the checksum of the entire data.
   This helps identify corrupt or partially updated chains.
 When data is expanded, a block suited to the additional data size is allocated and linked.
 There is a cap on the link count: there can be at most 4 links.
 Once data spreads across 4 links, it is automatically defragmented: a single bigger block suitable for the entire value is found, which reduces the number of links.
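A hedged sketch of what one piece of such a linked value could look like on disk; the field names and exact layout are assumptions, only the next-link and checksum described above are taken from the slides:

```c
#include <stdint.h>

typedef uint64_t disk_addr_t;

struct chunk_header {
    disk_addr_t next;      /* address of the next block, or 0 for the last  */
    uint32_t    data_len;  /* bytes of payload stored in this block         */
    uint32_t    checksum;  /* checksum of the entire value, for detecting
                            * corrupt or partially updated chains           */
};
/* A value may span at most 4 such blocks; beyond that it is rewritten into
 * a single larger block (defragmentation). */
```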
 Block allocation takes a block size parameter.
 A free block of the specified size is found in the bitmap residing in the cluster header, and its address is returned.
 The DiskAddr structure identifying a resource is a 64-bit bit-field structure (decode sketch below).
   A 56-bit component directly gives the address of the resource.
    ▪ So there is no address translation in the IO path.
   A 4-bit type field indicates the resource size: 0 for 128 bytes, 1 for 256, and so on.
   The type field helps identify the resource when freeing it.
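A sketch of decoding such a 64-bit address; the exact bit positions (offset in the low 56 bits, type above it) are an assumption, but the 56-bit address and the 4-bit type-to-size mapping follow the description above:

```c
#include <stdint.h>

#define ADDR_OFFSET_BITS  56
#define ADDR_OFFSET_MASK  ((UINT64_C(1) << ADDR_OFFSET_BITS) - 1)

static inline uint64_t disk_addr_offset(uint64_t addr)
{
    return addr & ADDR_OFFSET_MASK;     /* used directly in the IO path */
}

static inline unsigned disk_addr_type(uint64_t addr)
{
    return (unsigned)((addr >> ADDR_OFFSET_BITS) & 0xF);
}

static inline uint32_t disk_addr_resource_size(uint64_t addr)
{
    /* type 0 -> 128 B, 1 -> 256 B, ..., 15 -> 4 MB */
    return 128u << disk_addr_type(addr);
}
```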
 Block allocation needs to be extremely fast.
 Caches are used to remember the last cluster from which data was allocated.
   One such cache exists for each resource type.
 The cache state makes allocation O(1) for a series of allocations.
 Freeing a resource resets the cache state to point to the lowest-offset free resource.
 The search always continues into the following clusters.
 System calls (mostly pread/pwrite were used) are very fast on some machines (Core i3 processors); doing a large number of small writes was not a problem there.
 On other machines (Core 2 Duo), system calls were significantly slower and a huge percentage of the time was spent in them.
 Memory-mapped IO was significantly faster.
 Mapping the entire file has a few problems:
   File sizes can grow.
    ▪ On 32-bit machines this would limit the database size.
   Unused regions could be mapped, and the kernel could choose to evict the wrong set of pages.
 To avoid these drawbacks, a list of mmapped regions is used (read path sketched below).
   The number of mappings is limited to 10 to bound virtual address usage.
   The least recently used mmapped region is removed when a new region has to be mapped.
   Whenever a cluster is allocated, the whole cluster is mmapped.
 For each IO the list is checked; on a hit a simple memcpy is done, otherwise it falls back to the old system call.
 This improved performance by almost 50% on the slower machines.
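A minimal sketch of the read path just described, assuming a small table of mapped clusters (struct names and layout are illustrative): on a hit the data is copied from the mapping, otherwise the code falls back to pread().

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#define MAX_MAPPINGS 10

struct mapping {
    uint64_t file_off;   /* start offset of the mapped cluster */
    size_t   len;        /* length of the mapping              */
    void    *addr;       /* address returned by mmap()         */
};

struct map_cache {
    struct mapping maps[MAX_MAPPINGS];
    int            nmaps;
    int            fd;
};

static ssize_t cached_read(struct map_cache *mc, uint64_t off,
                           void *buf, size_t len)
{
    for (int i = 0; i < mc->nmaps; i++) {
        struct mapping *m = &mc->maps[i];
        if (off >= m->file_off && off + len <= m->file_off + m->len) {
            memcpy(buf, (char *)m->addr + (off - m->file_off), len);
            return (ssize_t)len;                 /* hit: plain memcpy */
        }
    }
    return pread(mc->fd, buf, len, (off_t)off);  /* miss: system call */
}
```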
 The B+ tree uses the block management layer to store its internal nodes and data.
 The block manager has no information about how the blocks are going to be used.
   It provides a slot for the upper layer to store a reference to its superblock.
 Internal nodes store all the keys of the node and references to the corresponding child nodes/values.
 Parent pointers are not maintained on disk; this makes splitting nodes fast.
 The parent-child relationship is established during the search.
 All the nodes being modified are in memory.
   Nodes are pinned in the cache.
 After each modification, the node is written back to the file.
 Concurrent modifications can be allowed by taking a write lock on the root of the sub-tree that could be modified by the insert/delete.
 An insert in a B+ tree may modify anywhere from a few nodes to all nodes on the path from the root to the leaf.
 The highest level that will be modified can be determined by checking whether the child could overflow because of the insert.
   If the child overflows, the parent will be modified as well.
 So instead of locking the root, we only need to lock the subtree whose root is the topmost node that could be modified (sketch below).
 Similar speculation can be done for deletes.
 All the nodes from the root down to the first node that could be modified are locked for read.
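A simplified sketch of the rule for finding the subtree to write-lock for an insert. The node layout and the child-selection step (child_for) are assumptions; the idea follows the slide: a split can only propagate upward through full nodes, so the deepest non-full node on the search path bounds what the insert can touch.

```c
#include <stdbool.h>

struct bpt_node {
    int  nkeys;
    int  max_keys;
    bool is_leaf;
};

/* Provided elsewhere: the child the given key descends into. */
struct bpt_node *child_for(struct bpt_node *n, long key);

static struct bpt_node *lock_root_for_insert(struct bpt_node *root, long key)
{
    struct bpt_node *lock_root = root;        /* worst case: lock from the root */
    for (struct bpt_node *n = root; ; n = child_for(n, key)) {
        if (n->nkeys < n->max_keys)
            lock_root = n;                    /* split propagation stops here */
        if (n->is_leaf)
            break;
    }
    return lock_root;
}
```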
 Tokyo Cabinet – http://fallabs.com/tokyocabinet/spex-en.html
 MongoDB – http://www.mongodb.org/