Azure Cosmos DB
Microsoft’s globally distributed database service
Agenda
1. Introduction
2. System model
3. Global distribution
4. Resource governance
5. Conclusion
1. Introduction
Introduction
• Started in 2010 as Project Florence
• Generally available since 2017
• Ring-0, foundational Azure service, operating in all (54+) Azure regions and all
Azure clouds (DoD, Sovereign, etc.), serving tens of trillions of requests per day
• Became ubiquitous inside Microsoft and is one of the fastest growing Azure
services
Database designed for the cloud
1. Global distribution
2. Elastic and unlimited scalability – from hundreds of transactions/sec and gigabytes of data to millions of transactions/sec and petabytes of data
3. Cost efficiencies with fine-grained multi-tenancy
Design goals
1. Elastically scale throughput on-demand across any number of Azure regions
around the world, within 5s at the 99th percentile
2. Deliver <10ms end-to-end client-server read/write latencies at the 99th
percentile, in any Azure region
3. Offer 99.999% read/write high availability
4. Provide tunable consistency models for developers
5. Operate at a very low cost
6. Provide strict performance isolation between transactional and analytical
workloads
7. Build a schema-agnostic engine to support unbounded schema versioning
8. Support multiple data models and multiple popular OSS database APIs, all
operating on the same underlying data
Azure Cosmos DB – globally distributed, multi-model database service
2. System model
Logical system model
• Using their Azure subscription, tenants create
one or more Cosmos accounts
• Customers can associate one or more Azure
regions with their Cosmos account and specify
the API for interacting with their data.
• Currently supported APIs – MongoDB,
Cassandra, SQL, Gremlin, Etcd, Spark, etc.
• Cosmos account manages one or more
databases
• Depending on the API, a database manages
one or more tables (or collections or graphs)
• Tables, Collections and Graphs get translated
to Containers
• Containers are schema-agnostic repositories of
Items
• Containers are horizontally-partitioned based
on throughput and storage
• A partition consists of a replica-set and is strongly consistent and highly available
• A partition-set consists of partitions globally distributed across multiple Azure regions
(A minimal client-side sketch of the account → database → container → item hierarchy follows.)
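As a hedged illustration of that hierarchy, here is a minimal sketch using the azure-cosmos Python SDK (SQL/Core API). The endpoint, key, and the database/container/item names are placeholders, not values from the talk; other APIs (MongoDB, Cassandra, Gremlin, etc.) expose the same containers through their own drivers.

```python
# pip install azure-cosmos
from azure.cosmos import CosmosClient, PartitionKey

# Placeholder endpoint and key; real values come from the Azure portal or Key Vault.
client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<primary-key>")

# Account -> database -> container (table / collection / graph) -> items
database = client.create_database_if_not_exists(id="inventory")
container = database.create_container_if_not_exists(
    id="products",
    partition_key=PartitionKey(path="/category"),  # horizontal partitioning key
    offer_throughput=400,                          # provisioned throughput in RU/s
)

# Items are schema-agnostic JSON documents; no schema is declared up front.
container.upsert_item({"id": "1", "category": "books", "name": "Designing Data-Intensive Applications"})
```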
Physical system model
• Cosmos DB service is
bootstrapped in a DC via the
CCP (Cosmic Control Plane)
• CCP is fully decentralized, with its state replicated using another internal instance of Cosmos DB
• Each DC hosts multiple Cosmos compute and storage clusters
• Each cluster is spread across
10-20 fault domains and 3
AZs
• Continuous capacity
management with dynamic,
predictive x-cluster, x-DC load
balancers
• Each machine hosts replicas
belonging to 200-500 tenants
• Each replica hosts an instance
of Cosmos database engine
Database Engine – key ideas
1. Index is a union of materialized trees → automatic indexing
2. Materializing ingested content as a tree → blurs the boundary between schema and instance values
3. Schemas are speculatively/dynamically inferred from the content
(Figure omitted; labels include Avro and Parquet.)
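To make these ideas concrete, here is an illustrative sketch (not the engine's actual implementation) that flattens a JSON item into path/value index terms; the union of such terms across items plays the role of the automatically maintained index, with no declared schema.

```python
from typing import Any, Iterator

def index_terms(node: Any, path: str = "$") -> Iterator[tuple[str, Any]]:
    """Flatten a JSON item into (path, value) terms; the union of these
    terms across items acts as an automatically maintained index."""
    if isinstance(node, dict):
        for key, value in node.items():
            yield from index_terms(value, f"{path}/{key}")
    elif isinstance(node, list):
        for i, value in enumerate(node):
            yield from index_terms(value, f"{path}/{i}")  # array positions become path segments
    else:
        yield (path, node)

doc = {"id": "1", "address": {"city": "Seattle", "zip": "98052"}, "tags": ["db", "cloud"]}
print(list(index_terms(doc)))
# [('$/id', '1'), ('$/address/city', 'Seattle'), ('$/address/zip', '98052'),
#  ('$/tags/0', 'db'), ('$/tags/1', 'cloud')]
```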
Database Engine – key ideas (continued)
4. Index is decoupled from the content; content is stored separately in row & column stores; all fully resource-governed, log-structured and latch-free
5. Schema-agnostic and version-resilient storage encoding and data model for content → multi-model engine (the same item data model underlies JSON, BSON, CQL, SQL, ProtoBuf, Thrift, Parquet)
6. User-specified TTL policies for each store; strict performance isolation between OLTP & analytical workloads
Tiering: committed updates are stored on local SSDs (row store), while a column-compressed, append-only redo-log plus invalidation bitmaps is stored on inexpensive, off-cluster storage (column store).
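As one possible illustration of idea 6, a hedged sketch of per-store retention using the azure-cosmos Python SDK: `default_ttl` governs the transactional (row) store, and `analytical_storage_ttl` the analytical (column) store. The parameter names reflect recent SDK versions and assume the analytical store is enabled on the account; endpoint, key and names are placeholders.

```python
from azure.cosmos import CosmosClient, PartitionKey

client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<primary-key>")
database = client.create_database_if_not_exists(id="telemetry")

# Separate retention for the transactional (row) store and the analytical (column) store.
container = database.create_container_if_not_exists(
    id="events",
    partition_key=PartitionKey(path="/deviceId"),
    default_ttl=7 * 24 * 3600,   # row store: items expire after 7 days
    analytical_storage_ttl=-1,   # column store: retain indefinitely
                                 # (requires the analytical store on the account; kwarg name assumed)
)
```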
Database Engine – key ideas (continued)
7. Multiple APIs get mapped onto a common Query IL
8. Blind writes via commutative Merge
9. Logical snapshots over the redo-log, for both time-travel queries and PITR
10. Fully resource governed; capable of operating within frugal resource budgets
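A toy illustration of idea 9: logical snapshots over an append-only redo-log let a reader materialize state "as of" an earlier timestamp, which is the basis for time-travel queries and point-in-time restore. This is a sketch, not the engine's actual log format; the `RedoLog` class and its entry layout are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class RedoLog:
    """Append-only redo-log; entries are appended in timestamp order."""
    entries: list[tuple[int, str, object]] = field(default_factory=list)  # (timestamp, key, value)

    def append(self, ts: int, key: str, value: object) -> None:
        self.entries.append((ts, key, value))

    def snapshot(self, as_of: int) -> dict[str, object]:
        """Materialize the logical state as of a timestamp (time-travel / PITR)."""
        state: dict[str, object] = {}
        for ts, key, value in self.entries:
            if ts > as_of:
                break
            state[key] = value
        return state

log = RedoLog()
log.append(1, "k1", "v1"); log.append(2, "k2", "v2"); log.append(3, "k1", "v1-updated")
print(log.snapshot(as_of=2))  # {'k1': 'v1', 'k2': 'v2'}
print(log.snapshot(as_of=3))  # {'k1': 'v1-updated', 'k2': 'v2'}
```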
3. Global distribution
Global distribution is turnkey
Partition as a Lego block for coordination
• Partition is used as a strongly consistent, highly
available, resource-governed, reusable building block
to solve many coordination problems in the system
• Global replication
• Inter-cluster partition load balancing
• Partition management operations including split, merge,
clone
• On-demand streaming
• Nested consensus
• Two reconfiguration state machines with failure detectors
composing with each other
• Dynamically self-adjusting membership and quorums
• Built-in resource governance
• Leaders and followers are allotted different budgets
• Leaders are unburdened from serving reads
Global distribution
• The distribution protocol propagates writes within a partition-set, which typically spans multiple regions
• Instead of stretching a replica-set across all regions, the protocol maintains a nested
replica-set in each region
• Significantly improves fault tolerance and avoids in-cast bottlenecks due to hierarchical
aggregation of acks
• Massive, dynamic connected graph of partitions stretching across Azure datacenters
worldwide
• Control-plane ensures strong routing consistency despite continuous addition and
removal of regions, partition load balancing, splits, merges, etc.
• Topology state is made highly available directly inside the data plane
• Guaranteed low latency and high availability reads and writes with multi-master
replication
• Reads and writes are always local to the region
• Guaranteed RPO and RTO of 0
• Multi-homing with location aware routing, programmatic support to add/remove
regions, simulate regional disaster, trigger failover etc.
• Five precisely defined consistency levels: strong, bounded-staleness, session, consistent-prefix, and eventual
• The selected consistency level determines the read quorum size within the replica-set in the local region
• The distribution protocol touches each write only once, committing it immediately within the region upon receipt
• The protocol remains scale-independent and latency-insensitive as the number of regions (N) grows (see the client configuration sketch below)
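A hedged sketch of the client-side knobs with the azure-cosmos Python SDK: `preferred_locations` drives location-aware, multi-homed routing and `consistency_level` selects one of the five levels per client. Multi-region writes themselves are enabled at the account level (portal or management SDK), and the endpoint/key/region names here are placeholders.

```python
from azure.cosmos import CosmosClient

# Multi-homed client: requests are routed to the nearest healthy region in the
# preference list, and the account-level consistency can be relaxed per client.
client = CosmosClient(
    "https://myaccount.documents.azure.com:443/",    # placeholder endpoint
    credential="<primary-key>",
    preferred_locations=["West Europe", "East US"],  # location-aware routing
    consistency_level="Session",                     # one of the five levels
)
```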
Consensus within a replica set
1. Leader: receive request and start txn; validate & analyze content, generate schemas, generate index terms
2. Leader: update index store, update content store
3. Leader: append redo-log
4. Leader: replicate to followers
5. Follower / xp-leader: start txn, apply updates to index and content stores, update redo-log, commit txn; xp-leader replicates to remote partitions
6. Follower: send ack to the leader
7. Leader: receive the quorum ack
8. Leader: commit txn
9. Leader: columnar compression of redo records, apply invalidations, group commit to column store
10. Leader: send response to the client
(These steps execute within a single partition, i.e. its local replica-set; a simplified sketch follows.)
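The sketch below is a deliberately simplified, synchronous rendering of that write path, not the actual protocol: there is no failure detection, reconfiguration, or cross-partition replication, and index-term generation is faked; the `Replica` type and helper names are invented for the example.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    content: dict = field(default_factory=dict)
    index: dict = field(default_factory=dict)
    redo_log: list = field(default_factory=list)

    def apply(self, key: str, doc: dict, terms: list[str]) -> None:
        self.index.update({t: key for t in terms})  # update index store
        self.content[key] = doc                     # update content store
        self.redo_log.append((key, doc))            # append redo-log

def leader_write(leader: Replica, followers: list[Replica], key: str, doc: dict) -> bool:
    """Apply on the leader, replicate to followers, commit once a majority acks."""
    terms = [f"{k}={v}" for k, v in doc.items()]    # stand-in for index-term generation
    leader.apply(key, doc, terms)
    acks = 0
    for follower in followers:                      # replicate to followers
        follower.apply(key, doc, terms)
        acks += 1                                   # follower ack (no failures modeled)
    quorum = (len(followers) + 1) // 2 + 1          # majority of the replica-set
    return acks + 1 >= quorum                       # leader's own write counts toward the quorum

leader, followers = Replica(), [Replica(), Replica(), Replica()]
print(leader_write(leader, followers, "item-1", {"city": "Seattle"}))  # True: committed
```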
Consensus within a partition set
• All writes are durably committed in the local quorum
• Writes are visible immediately in the region
• A change log of tentative writes is shipped for merge via anti-entropy
• Each region maintains a sequence number for its tentative writes
• A version vector clock captures the global progress
• Record-level conflicts are detected using vector-clock comparison (see the sketch below)
• Conflict resolution is performed at a dynamically picked arbiter
• Results of conflict resolution are shipped to all regions to ensure guaranteed convergence

Example with three regional partitions P1, P2, P3:
1. P1 writes [1,0,0], P2 writes [0,a,0], P3 writes [0,0,x]
2. P1 applies [1,0,0] locally and xp-replicates [1,0,0] to P2 and P3
3. P2 applies [0,a,0][1,0,0] locally; P3 applies [0,0,x][1,0,0] locally
4. P2 merges [0,a,0]; P3 merges [0,0,x]
5. P1 merges [0,a,0][0,0,x]
6. P1 performs conflict resolution, yielding [1,a,x]
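A minimal vector-clock sketch of that conflict detection, written for illustration only: the per-region payloads (1, a, x) are abstracted to a sequence number of 1 in each region, and the actual service tracks clocks per record with last-writer-wins or custom resolution policies.

```python
def dominates(a: list[int], b: list[int]) -> bool:
    """Clock `a` has seen everything clock `b` has seen."""
    return all(x >= y for x, y in zip(a, b))

def concurrent(a: list[int], b: list[int]) -> bool:
    """Neither clock dominates the other -> the writes conflict."""
    return not dominates(a, b) and not dominates(b, a)

def merge(a: list[int], b: list[int]) -> list[int]:
    """Element-wise max: the clock of the converged, resolved record."""
    return [max(x, y) for x, y in zip(a, b)]

# The three concurrent single-region writes from the example above:
p1, p2, p3 = [1, 0, 0], [0, 1, 0], [0, 0, 1]
print(concurrent(p1, p2))        # True -> record-level conflict, resolved at the arbiter
print(merge(merge(p1, p2), p3))  # [1, 1, 1] -> converged result shipped to all regions
```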
Precisely defined consistency levels
Strong Bounded-staleness Session Consistent-prefix Eventual
Continuous consistency checking over live data
• A Consistency Checker Service consumes request logs from each Cosmos DB cluster and builds a dependency graph of customer operations (blue vertices are writes, green vertices are reads), checking it against a linearizable history
• Consistency metrics are pushed to a metrics DB and surfaced in the Azure Portal
• Observed consistency is reported to customers along with the Probabilistically Bounded Staleness (PBS) metric, as sketched below
(Chart: P(consistency) vs. time after commit in ms, for US West, US East and North Europe.)
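In the spirit of PBS (Bailis et al., see References), a toy empirical estimate of P(consistency) as a function of time after commit, computed from observed replication lags. The lag samples are hypothetical and this is not the checker service's actual algorithm.

```python
def p_consistent(replication_lags_ms: list[float], time_after_commit_ms: float) -> float:
    """Empirical probability that a read issued this long after a commit
    observes that commit, given measured replication lags."""
    if not replication_lags_ms:
        return 1.0
    hits = sum(lag <= time_after_commit_ms for lag in replication_lags_ms)
    return hits / len(replication_lags_ms)

lags = [12, 40, 95, 250, 900, 1800]            # hypothetical lag samples (ms) from request logs
for t in (0, 100, 500, 2000):
    print(t, round(p_consistent(lags, t), 2))  # probability rises toward 1.0 as t grows
```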
4. Resource governance
Fully resource governed stack
• Stateless communication protocols everywhere, with a fixed upper bound on processing latency
• Capacity management, COGS and SLAs all depend on stringent resource governance across the entire stack
• Request Unit (RU)
  • Rate-based currency (per second, minute, or hour)
  • Normalized across the various database operations
  • An ML pipeline calculates query charges across different datasets and query patterns
  • RU charges need to remain consistent across hardware generations
• Automated performance and resource-governance (RG) tracking every four hours to detect code regressions
• All engine micro-operations are finely calibrated to live within fixed budgets of system resources (see the rate-limiting sketch below)
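An illustrative token-bucket sketch of RU-based rate governance, not the service's actual governor: the provisioned RU/s budget refills continuously, and a request whose normalized charge exceeds the remaining budget is throttled (the service surfaces this as HTTP 429 with a retry-after hint). The class and its parameters are invented for the example.

```python
import time

class RUBudget:
    """Toy token bucket enforcing a provisioned RU/s budget."""

    def __init__(self, ru_per_second: float):
        self.capacity = ru_per_second
        self.tokens = ru_per_second
        self.last = time.monotonic()

    def try_charge(self, request_units: float) -> bool:
        now = time.monotonic()
        # Refill proportionally to elapsed time, capped at the provisioned rate.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.capacity)
        self.last = now
        if request_units <= self.tokens:
            self.tokens -= request_units
            return True
        return False  # caller should back off and retry once the budget refills

budget = RUBudget(ru_per_second=400)
print(budget.try_charge(10.5))  # e.g. a point read plus a small query
```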
3 trillion transactions in 3 days across 20 Azure regions — Microsoft internal workloads, 9/7/2017–9/10/2017; the highest throughput consumption during this period was in Central US.
(Chart: consumed RUs (sum_TotalRequestCharge), log scale from 10^8 to 10^12, broken down by Azure region.)
Zooming into one of the clusters in Central US: node N-102 is serving more RUs.
(Chart: tenants on N-102 in Central US.)
Multi-tenancy and horizontal partitioning
Multi-tenancy and global distribution
5. Conclusion
Conclusion
• Cosmos is Microsoft’s globally distributed database service
• Foundational Azure service on which other Azure services depend
• Battle-tested by mission-critical, globally distributed workloads at massive scale
• Multi-master replication, global distribution, fine-grained resource
governance and responsive partition management are central to its
architecture
• Multiple precisely defined consistency levels with clear trade-offs
• Comprehensive and stringent SLAs spanning consistency, latency,
throughput and availability
6. References
References
• Azure Cosmos DB, http://cosmosdb.com
• Consistency Levels in Azure Cosmos DB, https://docs.microsoft.com/en-us/azure/cosmos-db/consistency-levels
• Schema-Agnostic Indexing with Azure DocumentDB. PVLDB 8(12): 1668–1679, 2015.
• D. Terry. Replicated Data Consistency Explained Through Baseball. Microsoft Technical Report MSR-TR-2011-137, October 2011; also in Communications of the ACM 56(12), December 2013.
• L. Glendenning, I. Beschastnikh, A. Krishnamurthy, T. Anderson. Scalable Consistency in Scatter. In Proceedings of the 23rd ACM Symposium on Operating Systems Principles (SOSP 2011), Cascais, Portugal, October 2011.
• M. Naor, A. Wool. The Load, Capacity, and Availability of Quorum Systems. SIAM Journal on Computing 27(2): 423–447, April 1998.
• D. B. Terry, A. J. Demers, K. Petersen, M. Spreitzer, M. Theimer, B. B. Welch. Session Guarantees for Weakly Consistent Replicated Data. In Parallel and Distributed Information Systems (PDIS), 140–149, 1994.
• P. Bailis, S. Venkataraman, M. J. Franklin, J. M. Hellerstein, I. Stoica. Probabilistically Bounded Staleness for Practical Partial Quorums. Proceedings of the VLDB Endowment 5(8): 776–787, April 2012.
http://joincosmosdb.com