Schema Agnostic Indexing with Azure DocumentDB

@dharmashukla, DocumentDB
Presented at VLDB 2015
Sudipta Sengupta, Justin Levandoski, David Lomet
Microsoft Research
Dharma Shukla, Shireesh Thota, Karthik Raman, Madhan Gajendran, Ankur Shah, Sergii Ziuzin, Krishnan Sundaram, Miguel Gonzalez Guajardo, Anna Wawrzyniak, Samer Boshra, Renato Ferreira, Mohamed Nassar, Michael Koltachev, Ji Huang
Microsoft Corporation

Outline
• Overview of DocumentDB
• Schema Agnostic Indexing
• Logical Index Organization
• Physical Index Organization
• Summary

What is DocumentDB?
• Fully managed, multi-tenant, geo-distributed document database service on Azure
• Born out of the needs of internal Microsoft applications; GA since April 2015
• Built from the ground up with resource governance
• Provisioned throughput, performance isolation, OPEX efficiency
• Well defined consistency levels with predictable performance: Strong, Bounded Staleness, Session, Eventual
• Database engine built for JSON & JavaScript
• Automatic indexing of JSON values and rich (SQL and JavaScript) query
• JavaScript language integrated transactions and query directly inside the database engine

Architecture
[Figure: resource model diagram with the labels Account, Database, Collection, Document, User, and Permission.]

Schema-agnostic indexing
JSON document as tree
JavaScript object literals: JSON-serializable values (aka JSON Infoset)
{
  "locations":
  [
    { "country": "Germany", "city": "Berlin" },
    { "country": "France", "city": "Paris" }
  ],
  "headquarter": "Belgium",
  "exports": [{ "city": "Moscow" }, { "city": "Athens" }]
}
[Figure: the document above drawn as a tree. The root has the children locations, headquarter, and exports; array elements appear as nodes labeled 0 and 1; the leaves hold the values Germany, Berlin, France, Paris, Belgium, Moscow, and Athens.]
• Automatic indexing of document trees without requiring schema or secondary indices
• SQL and JavaScript query processing on the trees
• Lazy materialization of JavaScript values from the instances of trees
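To make the "document as tree" view concrete, here is a minimal Python sketch that walks the example document and prints every root-to-leaf path. The function name `json_paths`, the `$` root label, and the choice to render values as strings are assumptions made for illustration; they are not taken from the DocumentDB engine.

```python
import json

def json_paths(node, prefix=("$",)):
    """Yield each root-to-leaf path of a JSON value as a tuple of labels."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from json_paths(child, prefix + (key,))
    elif isinstance(node, list):
        # Array positions become intermediate nodes labeled 0, 1, ...
        for position, child in enumerate(node):
            yield from json_paths(child, prefix + (str(position),))
    else:
        # A scalar is a leaf; its value is the last label on the path.
        yield prefix + (str(node),)

doc = json.loads("""
{
  "locations": [
    { "country": "Germany", "city": "Berlin" },
    { "country": "France", "city": "Paris" }
  ],
  "headquarter": "Belgium",
  "exports": [{ "city": "Moscow" }, { "city": "Athens" }]
}
""")

paths = ["/".join(path) for path in json_paths(doc)]
for path in paths:
    print(path)
```

Each printed line, such as `$/locations/0/country/Germany`, corresponds to one leaf of the tree, so no schema is needed to enumerate what could be indexed.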

Logical Index Organization
• Index is a union of all the document trees
[Figure: the document trees merged into a single tree, with the shared part labeled "Common structure".]
• Structural information and instance values are normalized into a unifying concept of JSON-Path
Terms                  Postings List/Values
$/location/0/          1, 2
location/0/country/    1, 2
location/0/city/       1, 2
0/country/Germany      1, 2
1/country/France       2
…                      …
0/city/Moscow          2
0/dealers/0            2
[Figure: path encodings built from labels such as location, 0, country, Germany, and coordinates, in panels titled "Range (>, <, !=) & ORDERBY queries", "Wildcard queries", and "Spatial queries".]
• Dynamic encoding of postings list (E-WAH/differential)
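The term and postings table can be approximated with a small inverted index. The sketch below is only an illustration: it assumes that terms are overlapping windows of three path labels (inferred from the shape of the terms in the table), it uses the two company documents from the query example in this deck, and the names `json_paths`, `terms`, `build_index`, and `lookup` are hypothetical.

```python
from collections import defaultdict

def json_paths(node, prefix=("$",)):
    """Yield each root-to-leaf path of a JSON value as a tuple of labels."""
    if isinstance(node, dict):
        for key, child in node.items():
            yield from json_paths(child, prefix + (key,))
    elif isinstance(node, list):
        for position, child in enumerate(node):
            yield from json_paths(child, prefix + (str(position),))
    else:
        yield prefix + (str(node),)

def terms(path, width=3):
    """Split one path into overlapping windows of `width` labels (assumed width)."""
    if len(path) <= width:
        return ["/".join(path)]
    return ["/".join(path[i:i + width]) for i in range(len(path) - width + 1)]

def build_index(docs):
    """Map every term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, doc in docs.items():
        for path in json_paths(doc):
            for term in terms(path):
                index[term].add(doc_id)
    return index

def lookup(index, term):
    """Return the sorted postings list for a term, or an empty list."""
    return sorted(index.get(term, set()))

docs = {
    1: {"locations": [{"country": "Germany", "city": "Berlin"},
                      {"country": "France", "city": "Paris"}],
        "headquarter": "Belgium",
        "exports": [{"city": "Moscow"}, {"city": "Athens"}]},
    2: {"locations": [{"country": "Germany", "city": "Bonn", "revenue": 200}],
        "headquarter": "Italy",
        "exports": [{"city": "Berlin", "dealers": [{"name": "Hans"}]},
                    {"city": "Athens"}]},
}

index = build_index(docs)
print(lookup(index, "0/country/Germany"))
print(lookup(index, "$/headquarter/Belgium"))
```

Because both documents share the path ending in `0/country/Germany`, that term's postings list holds both document ids, which is the "union of all the document trees" idea in miniature. An equality predicate such as headquarter = "Belgium" reduces to a single term lookup.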

Query
Input documents
{ "locations":
  [ { "country": "Germany", "city": "Berlin" },
    { "country": "France", "city": "Paris" }
  ],
  "headquarter": "Belgium",
  "exports": [{ "city": "Moscow" }, { "city": "Athens" }]
}
{ "locations": [{ "country": "Germany", "city": "Bonn", "revenue": 200 }],
  "headquarter": "Italy",
  "exports": [{ "city": "Berlin", "dealers": [{ "name": "Hans" }] }, { "city": "Athens" }]
}
[Figure: both input documents and the query result drawn as trees.]
SQL
SELECT C.locations
FROM company C
WHERE C.headquarter = "Belgium"
JavaScript
function businessLogic() {
  var country = "Belgium";
  // Keep only the documents whose headquarter matches.
  __.filter(function (x) { return x.headquarter === country; });
}
Query result
{
  "results":
  [
    {
      "locations":
      [
        { "country": "Germany", "city": "Berlin" },
        { "country": "France", "city": "Paris" }
      ]
    }
  ]
}

System model for writes and queries
Writes: path/posting list updates for doc_id = 5
key: "age/22"        payload: +doc5
key: "age/21"        payload: -doc5
key: "city/seattle"  payload: +doc5
key: "zip/98103"     payload: +doc5
…
Queries: the query processor runs an index scan (> "age/30", < "age/32") against the index, returning doc1, doc5, doc7
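The posting updates listed for doc 5 can be read as the difference between the terms of the old and new versions of the document. The following Python sketch shows one way to derive and apply such updates; the flat "field/value" term format, the example document versions, and the function names are assumptions chosen to reproduce the keys and payloads on the slide.

```python
def doc_terms(doc):
    """Flatten a flat document into "field/value" index terms."""
    return {f"{field}/{value}" for field, value in doc.items()}

def posting_updates(doc_id, old_doc, new_doc):
    """Return (term, payload) pairs: "+id" for new terms, "-id" for stale ones."""
    old_terms, new_terms = doc_terms(old_doc), doc_terms(new_doc)
    updates = [(term, f"+{doc_id}") for term in sorted(new_terms - old_terms)]
    updates += [(term, f"-{doc_id}") for term in sorted(old_terms - new_terms)]
    return updates

def apply_updates(index, updates):
    """Apply signed payloads to a term -> set-of-doc-ids index."""
    for term, payload in updates:
        sign, doc_id = payload[0], payload[1:]
        if sign == "+":
            index.setdefault(term, set()).add(doc_id)
        else:
            index.get(term, set()).discard(doc_id)

# Hypothetical old and new versions of the document with id doc5.
old_version = {"age": 21}
new_version = {"age": 22, "city": "seattle", "zip": "98103"}

index = {"age/21": {"doc5"}}
updates = posting_updates("doc5", old_version, new_version)
apply_updates(index, updates)

for term, payload in updates:
    print(term, payload)
```

Terms that are unchanged between versions produce no update at all, so the write cost tracks how much of the document changed rather than its total size.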

Index Maintenance Requirements
[Figure: layered architecture with the labels B-Tree, Cache, and Log Structured Store.]
• Support sustained volume of rapid writes without any term locality
• Queries should honor various consistency levels
• Index maintenance must operate within frugal resource budget
• Low write, read and space amplification

Highly concurrent index updates
[Figure: a mapping table maps each page ID to a physical address. Updates to Page P are recorded as delta records (Δ: Insert record 50, Δ: Delete record 48, Δ: Update record 35, Δ: Insert record 60) and later merged into Consolidated Page P.]
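The delta records and the consolidated page can be sketched as a chain hanging off a mapping-table entry. This is a single-threaded illustration only: latch-free designs such as the Bw-tree install each new chain head with an atomic compare-and-swap on the mapping table, whereas here a plain dictionary assignment stands in for that step. The class and function names, and the record values, are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Union

@dataclass
class Page:
    records: dict  # base page: record key -> value

@dataclass
class Delta:
    op: str                       # "insert", "update", or "delete"
    key: int
    value: Optional[str]
    next_node: Union["Delta", Page]

# Mapping table: page id -> current head of the page's delta chain.
mapping_table = {"P": Page({35: "old", 48: "x"})}

def post_delta(page_id, op, key, value=None):
    """Prepend a delta; the base page itself is never modified in place."""
    mapping_table[page_id] = Delta(op, key, value, mapping_table[page_id])

def read_records(page_id):
    """Walk the chain down to the base page, then replay deltas oldest first."""
    node, deltas = mapping_table[page_id], []
    while isinstance(node, Delta):
        deltas.append(node)
        node = node.next_node
    records = dict(node.records)
    for delta in reversed(deltas):
        if delta.op == "delete":
            records.pop(delta.key, None)
        else:
            records[delta.key] = delta.value
    return records

def consolidate(page_id):
    """Replace the chain with one new base page holding the merged records."""
    mapping_table[page_id] = Page(read_records(page_id))

post_delta("P", "insert", 50, "a")
post_delta("P", "delete", 48)
post_delta("P", "update", 35, "new")
post_delta("P", "insert", 60, "b")

before = read_records("P")
consolidate("P")
after = read_records("P")
print(sorted(after))
```

Readers see the same logical page before and after consolidation; only the physical representation behind the mapping-table entry changes.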

Write optimized storage organization
[Figure: in RAM, the mapping table points to base pages and Δ-records; a latch-free flush buffer (8 MB) writes them to the log-structured store on SSD, where base pages and Δ-records are laid out in write order in the log.]

Lack of term locality
• Little to no term locality on index write path
• Unable to keep “hot set” of leaf pages cached in memory
• Performing read to modify each leaf node leads to very high I/O overhead
• Requires method to maintain efficient write path for sustained term ingestion with predictable performance
Example write stream:
update term t1
delete term t58
insert term t109
update term t179
update term t568
delete term t732

Blind updates and value merge
[Figure, part 1: a regular update. Adding doc5 to the posting list for term T requires a read I/O to bring Page P in from the Log Structured Store (LSS) through the mapping table, turning T → {doc1, doc2, doc3} into T → {doc1, doc2, doc3, doc5}.]
[Figure, part 2: a blind update for term T. The deltas Term T → +doc5 and Term T → -doc2 are attached to a page stub without reading Page P. On a term lookup or full page consolidation, {doc1, doc2, doc3}, {+doc5}, and {-doc2} are merged into Consolidated Page P.]
[Chart: number of I/Os (0 to 18,000) versus index size in MB (0 to 10,000), comparing Update and Blind Update.]
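The difference between a regular update and a blind update can be sketched by counting reads of the base page. In this illustration the writes append signed deltas to an in-memory stub and never touch storage; the base posting list is read once, at lookup time, and merged with the pending deltas. The names `blind_update`, `merge_postings`, and `lookup`, and the read counter, are hypothetical.

```python
# Posting lists "on storage"; every access through read_base_page counts as a read I/O.
stored_pages = {"T": {"doc1", "doc2", "doc3"}}
read_ios = 0

def read_base_page(term):
    """Simulate fetching the base posting list from the log-structured store."""
    global read_ios
    read_ios += 1
    return stored_pages.get(term, set())

def blind_update(stub, term, delta):
    """Record a signed delta such as "+doc5" without reading the base page."""
    stub.setdefault(term, []).append(delta)

def merge_postings(base, deltas):
    """Value merge: apply the deltas in arrival order to a copy of the base list."""
    postings = set(base)
    for delta in deltas:
        sign, doc_id = delta[0], delta[1:]
        if sign == "+":
            postings.add(doc_id)
        else:
            postings.discard(doc_id)
    return postings

def lookup(stub, term):
    """Read the base list once and merge it with the pending deltas."""
    return merge_postings(read_base_page(term), stub.get(term, []))

stub = {}
blind_update(stub, "T", "+doc5")
blind_update(stub, "T", "-doc2")
reads_after_writes = read_ios

result = lookup(stub, "T")
print(sorted(result), "reads:", read_ios)
```

Both writes complete with zero read I/Os, and the cost of reading the page is deferred to the first lookup or consolidation that needs the merged value.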
