Omid: Scalable and Highly Available
Transaction Processing for Phoenix

Ohad Shacham, Edward Bortnikov ⎪ PhoenixCon, Jun 13, 2017
Let’s Get Started …
Our Yahoo Journey with Transactions over HBase
Omid for Users: Semantics, API, Integration with Phoenix
Omid for Programmers: Architecture and Use Cases
Omid, Advanced: Scalability, HA, Low-Latency
Transaction Processing in NoSQL @Yahoo
Motivation: Data Pipelines (Search, Mail, etc.)
Stream Processing: a Popular Pattern
Compute tasks process data items that arrive in real time
Intermediate artifacts are stored in NoSQL (key-value) storage
Extensive Use of Hadoop Technologies (Storm, HBase)
Scale: Thousands of Hadoop Nodes
Content Indexing for Search
[Figure: content-indexing pipeline built on STORM and HBase; labeled components: Crawl, Docproc, Link Analysis, Stream, Crawl schedule, Content, Queue, Links]
Zooming in on Tasks
Document processing (one transaction: begin … commit)
  Read page content from the store
  Compute search index features
  Update computed features
Link processing (one transaction: begin … commit)
  Read outgoing links for a page
  Update references for all linked-to pages
Transaction Processing: ACID 101
Multiple data accesses in a single logical operation
Atomic: “all or nothing”; no partial effect is observable
Consistent: the DB transitions from one valid state to another
Isolated: transactions appear to execute in isolation
Durable: committed data cannot disappear
Omid (امید)
Transaction Processing Service for Apache HBase
2011: Incepted @Yahoo Research (“Omid1”)
2014: Large-Scale Deployment @Yahoo
2014/5: Major Re-Design for Scalability & HA (“Omid2”)
2016: Apache Incubator
2017: Prototype Integration with Phoenix
Contributors
Ohad Shacham (Yahoo Research)
Francisco Perez Sorrosal (Yahoo)
Edward Bortnikov (Yahoo Research)
Eshcar Hillel (Yahoo Research)
Idit Keidar (Yahoo, Technion)
Ivan Kelly (Midokura)
Sameer Paranjpye (Databricks)
Matthieu Morel (Skyscanner)
Igor Katkov (Atlassian)
Yonatan Gottesman (Yahoo Research)
Omid 101
Client Library + Runtime Service
Database Agnostic (can work with other backends)
Snapshot Isolation consistency
Very Scalable (>380K peak tps) and Highly Available
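Omid Programming Example
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.omid.transaction.HBaseTransactionManager;
import org.apache.omid.transaction.TTable;
import org.apache.omid.transaction.Transaction;
import org.apache.omid.transaction.TransactionManager;

public class OmidExample {
    public static void main(String[] args) throws Exception {
        byte[] family = Bytes.toBytes("MY_CF");     // example column family
        byte[] qualifier = Bytes.toBytes("MY_Q");   // example column qualifier

        try (TransactionManager tm = HBaseTransactionManager.newInstance();
             TTable txTable = new TTable("MY_TX_TABLE")) {

            Transaction tx = tm.begin(); // Control path: the TM assigns the start timestamp

            Put row1 = new Put(Bytes.toBytes("EXAMPLE_ROW1"));
            row1.addColumn(family, qualifier, Bytes.toBytes("val1"));
            txTable.put(tx, row1); // Data path: written directly to HBase

            Put row2 = new Put(Bytes.toBytes("EXAMPLE_ROW2"));
            row2.addColumn(family, qualifier, Bytes.toBytes("val2"));
            txTable.put(tx, row2); // Data path

            tm.commit(tx); // Control path: conflict detection, then commit
        }
    }
}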
Snapshot Isolation (SI) Semantics
Each transaction has distinct read (snapshot) and write (commit) points
No write-write conflicts allowed: concurrent transactions may not write the same key
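The two SI rules above can be captured in a few lines. This is a toy model under stated assumptions, not Omid code: the class `SnapshotIsolationRules`, the `Txn` type, and defining "concurrent" as overlapping [start, commit] intervals are all illustrative choices.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Toy model of the two SI rules (hypothetical names; not Omid's API).
class SnapshotIsolationRules {

    // A committed transaction: snapshot (start) timestamp, commit timestamp, keys written.
    static final class Txn {
        final long startTs;
        final long commitTs;
        final Set<String> writeSet;

        Txn(long startTs, long commitTs, String... keys) {
            this.startTs = startTs;
            this.commitTs = commitTs;
            this.writeSet = new HashSet<>(Arrays.asList(keys));
        }
    }

    // Read rule: a version is visible iff its writer committed before the reader's snapshot.
    static boolean isVisible(long versionCommitTs, long readerStartTs) {
        return versionCommitTs < readerStartTs;
    }

    // Two transactions are concurrent if their [start, commit] intervals overlap.
    static boolean concurrent(Txn a, Txn b) {
        return a.startTs < b.commitTs && b.startTs < a.commitTs;
    }

    // Write rule: concurrent transactions must not write a common key.
    static boolean writeWriteConflict(Txn a, Txn b) {
        return concurrent(a, b) && !Collections.disjoint(a.writeSet, b.writeSet);
    }

    public static void main(String[] args) {
        Txn t1 = new Txn(1, 4, "k1", "k2");
        Txn t2 = new Txn(2, 5, "k2");
        Txn t3 = new Txn(6, 7, "k1");
        System.out.println(writeWriteConflict(t1, t2));         // true: overlap in time, both write k2
        System.out.println(writeWriteConflict(t1, t3));         // false: t3 starts after t1 commits
        System.out.println(isVisible(t1.commitTs, t3.startTs)); // true: t1's writes are in t3's snapshot
    }
}
```

Note that t1 and t3 write the same key `k1` without conflicting: under SI only overlapping transactions are checked, which is what lets the system avoid locks.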
Tephra: Sibling Technology
Transaction Processing technology for HBase
SI Semantics. Design Similar to Omid1
Apache Incubator since 2016
Integrated with Phoenix to provide ACID semantics (BETA)
Implements some Phoenix-specific scenarios
Phoenix-Omid Integration
Work in Progress under JIRA PHOENIX-3623
Backward Compatible: Configurable Transaction Processing (TP) Provider Choice
Current Options: Tephra and Omid
How?
Internal Transaction Abstraction Layer (TAL) API
Multiple Implementations, Configurable Instantiation
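The "one API, multiple implementations, configurable instantiation" idea can be sketched as an interface plus a registry keyed by a configuration value. Everything here is hypothetical: `TransactionProvider`, `TephraProvider`, `OmidProvider`, and `fromConfig` are illustrative names, not Phoenix's actual TAL classes, and the fallback to Tephra is only an assumption about how backward compatibility might be preserved.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical sketch of a transaction abstraction layer; none of these names
// are Phoenix's real classes.
class TransactionAbstractionLayerSketch {

    // The narrow contract that Phoenix-side code would program against.
    interface TransactionProvider {
        String name();
        long begin();              // returns a transaction id
        void commit(long txId);
    }

    // Stand-ins for adapters that would wrap the Tephra and Omid clients.
    static final class TephraProvider implements TransactionProvider {
        private long nextId = 0;
        public String name() { return "TEPHRA"; }
        public long begin() { return ++nextId; }
        public void commit(long txId) { /* would delegate to the Tephra client */ }
    }

    static final class OmidProvider implements TransactionProvider {
        private long nextId = 0;
        public String name() { return "OMID"; }
        public long begin() { return ++nextId; }
        public void commit(long txId) { /* would delegate to the Omid client */ }
    }

    // Configuration value -> provider factory.
    private static final Map<String, Supplier<TransactionProvider>> REGISTRY = new HashMap<>();
    static {
        REGISTRY.put("TEPHRA", TephraProvider::new);
        REGISTRY.put("OMID", OmidProvider::new);
    }

    // Instantiates the configured provider. Falling back to Tephra when nothing is
    // configured is an assumption about how backward compatibility might be kept.
    static TransactionProvider fromConfig(String configured) {
        String key = (configured == null) ? "TEPHRA" : configured.trim().toUpperCase(Locale.ROOT);
        Supplier<TransactionProvider> factory = REGISTRY.get(key);
        if (factory == null) {
            throw new IllegalArgumentException("Unknown transaction provider: " + configured);
        }
        return factory.get();
    }

    public static void main(String[] args) {
        TransactionProvider tp = fromConfig("omid");
        long tx = tp.begin();
        tp.commit(tx);
        System.out.println(tp.name() + " " + tx); // OMID 1
    }
}
```

Adding a third backend would mean one new adapter class and one registry entry; callers of `TransactionProvider` would not change.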
Transaction Processing, Refactored
[Figure: Refactor. Before: Phoenix calls the Tephra Client directly. After: Phoenix calls the Transaction Abstraction Layer, which is implemented by both the Tephra Client and the Omid Client.]
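How Omid Works
[Figure: the Client calls the Transaction Manager (TSO) for Begin/Commit; the TSO performs Conflict Detection and persists commits to the Commit Table; the Client reads and writes the Data tables directly and verifies commits against the Commit Table.]
Lock-Free SI Implementation. Exploits Built-in MVCC.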
Execution Example
[Figure: Begin. The Transaction Manager assigns the read timestamp tr = t1. The Client writes (k1, v1, t1) and (k2, v2, t1) to the Data tables; each Read (k’) returns the last version committed at some t’ < t1.]
Execution Example
[Figure: Commit. The Client sends Commit: t1, {k1, k2} to the Transaction Manager, which assigns the commit timestamp tc = t2 and writes (t1, t2) to the Commit Table.]
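Execution Example
[Figure: a later transaction with tr = t3 issues Read (k1, t3). It finds (k1, v1, t1) in the Data tables and must Read (t1) in the Commit Table, which returns (t1, t2); since t2 < t3, v1 is visible. Bottleneck! Every such read goes to the Commit Table.]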
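The execution example can be replayed in a single-process toy. This is a sketch under stated assumptions, not Omid's implementation: `OmidExecutionSketch`, its in-memory maps, and the per-key `lastCommit` table used for conflict detection are all hypothetical stand-ins for the TSO, the data tables, and the Commit Table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Single-process toy model of the execution example (hypothetical names, no HBase).
class OmidExecutionSketch {

    // One version of a key, tagged with the writer's start timestamp (tr).
    static final class Version {
        final long startTs;
        final String value;
        Version(long startTs, String value) { this.startTs = startTs; this.value = value; }
    }

    final Map<String, List<Version>> dataTable = new HashMap<>();
    final Map<Long, Long> commitTable = new HashMap<>();          // tr -> tc
    private final Map<String, Long> lastCommit = new HashMap<>(); // TM conflict-detection state
    private long clock = 0;                                       // TM timestamp oracle

    // Begin (control path): the TM hands out the next timestamp as tr.
    long begin() { return ++clock; }

    // Write (data path): the client writes versions directly to the data table.
    void write(long startTs, String key, String value) {
        dataTable.computeIfAbsent(key, k -> new ArrayList<>()).add(new Version(startTs, value));
    }

    // Commit (control path): abort if another transaction committed any of these keys
    // after tr; otherwise assign tc and persist (tr, tc). Returns tc, or -1 on abort.
    // Aborted versions stay in the data table and are simply ignored by readers here.
    long commit(long startTs, List<String> writeSet) {
        for (String key : writeSet) {
            Long c = lastCommit.get(key);
            if (c != null && c > startTs) return -1;
        }
        long commitTs = ++clock;
        for (String key : writeSet) lastCommit.put(key, commitTs);
        commitTable.put(startTs, commitTs);
        return commitTs;
    }

    // Read (key, tr): the newest version whose writer committed before tr.
    // Each candidate version costs a commit-table lookup (the bottleneck).
    String read(String key, long readerStartTs) {
        String result = null;
        long newest = -1;
        for (Version v : dataTable.getOrDefault(key, Collections.emptyList())) {
            Long commitTs = commitTable.get(v.startTs);
            if (commitTs != null && commitTs < readerStartTs && commitTs > newest) {
                newest = commitTs;
                result = v.value;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        OmidExecutionSketch omid = new OmidExecutionSketch();
        long t1 = omid.begin();                               // tr = t1
        omid.write(t1, "k1", "v1");
        omid.write(t1, "k2", "v2");
        long t2 = omid.commit(t1, Arrays.asList("k1", "k2")); // tc = t2; (t1, t2) persisted
        long t3 = omid.begin();                               // a later reader: tr = t3
        System.out.println(t2 + " " + omid.read("k1", t3));   // 2 v1
    }
}
```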
Post-Commit Timestamp Replication
[Figure: after the commit (tr = t1, tc = t2), the Client updates the commit cells, turning (k1, v1, t1) and (k2, v2, t1) into (k1,v1,t1,t2) and (k2,v2,t1,t2), and issues Delete(t1) to remove (t1, t2) from the Commit Table.]
Using Commit Cells
[Figure: a transaction with tr = t3 issues Read (k1, t3) and finds (k1,v1,t1,t2); the commit timestamp t2 is read from the commit cell itself, so no Commit Table lookup is needed.]
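Post-commit replication changes the read path: a reader uses the commit cell when it is present and only falls back to the Commit Table otherwise. The sketch below is a hypothetical, single-threaded model (`CommitCellSketch`, `Cell`, and the lookup counter are illustrative); a concurrent implementation would also have to handle a Commit Table entry being deleted between the reader's two lookups, which this sketch ignores.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of post-commit timestamp replication (hypothetical names, no HBase).
class CommitCellSketch {

    // A data cell: value, writer's start timestamp, and a commit cell (null until replicated).
    static final class Cell {
        final String value;
        final long startTs;
        Long commitTs;
        Cell(String value, long startTs) { this.value = value; this.startTs = startTs; }
    }

    final Map<String, Cell> data = new HashMap<>();      // one version per key, for brevity
    final Map<Long, Long> commitTable = new HashMap<>(); // tr -> tc
    int commitTableReads = 0;                            // how often readers needed the CT

    // After commit: copy tc into every written cell, then delete the (tr, tc) entry.
    void replicateCommitTimestamp(long startTs, long commitTs, String... keys) {
        for (String key : keys) {
            data.get(key).commitTs = commitTs;
        }
        commitTable.remove(startTs);
    }

    // Read (key, tr): prefer the commit cell; fall back to the commit table.
    String read(String key, long readerStartTs) {
        Cell cell = data.get(key);
        if (cell == null) return null;
        Long commitTs = cell.commitTs;
        if (commitTs == null) {
            commitTableReads++;
            commitTs = commitTable.get(cell.startTs);
        }
        return (commitTs != null && commitTs < readerStartTs) ? cell.value : null;
    }

    public static void main(String[] args) {
        CommitCellSketch s = new CommitCellSketch();
        s.data.put("k1", new Cell("v1", 1));
        s.data.put("k2", new Cell("v2", 1));
        s.commitTable.put(1L, 2L);                                      // (t1, t2) written by the TM
        System.out.println(s.read("k1", 3) + " " + s.commitTableReads); // v1 1
        s.replicateCommitTimestamp(1, 2, "k1", "k2");                   // commit cells, then Delete(t1)
        System.out.println(s.read("k1", 3) + " " + s.commitTableReads); // v1 1
    }
}
```

The counter stays at 1 after replication: the second read is answered from the data cell alone, which is what removes the Commit Table from the common read path.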
Phoenix – New Scenarios for Omid
Secondary Indexes
  On-the-Fly Index Creation
  Atomic Updates
  Query by Secondary Key
Extended Snapshot Isolation
  Read-Your-Own-Writes Queries
On-the-Fly Secondary Index Creation
CREATE INDEX (CI) runs in parallel with writes to the base table
How? Distinguish between the pre-CI and post-CI data
The time the CREATE INDEX command is issued defines a snapshot timestamp
1. All data committed before the snapshot: scanned and bulk-inserted into the index
2. All data generated after the snapshot: triggers random (per-row) updates of the index
3. All transactions in flight at snapshot time: aborted (FENCE)
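The three cases above amount to classifying each transaction against the fence timestamp. This is a hypothetical sketch: `IndexFenceSketch`, the `Handling` enum, and representing "not yet committed" as a `null` commit timestamp are illustrative assumptions, not Phoenix or Omid code.

```java
// Hypothetical sketch of the CREATE INDEX fence rule; names and types are illustrative.
class IndexFenceSketch {

    enum Handling { BULK_INSERTED, ABORTED, INDEXED_ON_WRITE }

    // fenceTs: timestamp taken when CREATE INDEX is issued.
    // commitTs == null means the transaction had not committed when it is classified.
    static Handling classify(long startTs, Long commitTs, long fenceTs) {
        if (commitTs != null && commitTs < fenceTs) {
            return Handling.BULK_INSERTED;    // committed before the snapshot: picked up by the scan
        }
        if (startTs < fenceTs) {
            return Handling.ABORTED;          // in flight at snapshot time: fenced, fails at commit
        }
        return Handling.INDEXED_ON_WRITE;     // started after the snapshot: its writes update the index
    }

    public static void main(String[] args) {
        long fenceTs = 10;
        System.out.println(classify(3, 5L, fenceTs));    // BULK_INSERTED
        System.out.println(classify(7, null, fenceTs));  // ABORTED
        System.out.println(classify(12, null, fenceTs)); // INDEXED_ON_WRITE
    }
}
```

Aborting the in-flight group is what keeps the index consistent without locks: those transactions may have written rows that the bulk scan missed and that no per-row index update will cover.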
Secondary Index: Creation and Maintenance
[Figure: transactions T1 to T6 on a timeline marked "CREATE INDEX started" and "CREATE INDEX complete". Annotations: "Bulk-Inserted into index", "Abort (enforced upon commit)", "Added by a coprocessor" (twice), and "Index update (stored procedure)".]
Extended Snapshot Isolation
CREATE TABLE T (ID INT);
BEGIN;
1: INSERT INTO T SELECT ID+10 FROM T;
2: INSERT INTO T SELECT ID+100 FROM T;
COMMIT;
Traditional SI: Read-Your-Writes
Challenge: Circular Dependency (statement 1 reads the rows it is itself inserting, so it runs in an infinite loop)
Solution: Moving Snapshot (a series of checkpoint snapshots, one per statement)
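Why plain read-your-writes breaks statement 1 can be shown with a list standing in for table T: if the scan also sees rows the same statement has just inserted, it never reaches the end. This is a hypothetical illustration (`ReadYourWritesLoop` and its safety cap are not from the talk); the cap only exists so the demo terminates.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of the circular dependency; a plain list stands in for table T.
class ReadYourWritesLoop {

    // INSERT INTO T SELECT ID + delta FROM T, where the scan also sees the rows
    // this very statement has inserted. Returns the number of rows inserted,
    // or -1 if the safety cap was reached (the statement would never finish).
    static int insertSelectSeeingOwnWrites(List<Integer> table, int delta, int cap) {
        int inserted = 0;
        for (int i = 0; i < table.size(); i++) {   // size() grows with every insert
            if (inserted == cap) return -1;
            table.add(table.get(i) + delta);
            inserted++;
        }
        return inserted;
    }

    public static void main(String[] args) {
        List<Integer> t = new ArrayList<>(Arrays.asList(1));          // T starts with one row, ID = 1
        System.out.println(insertSelectSeeingOwnWrites(t, 10, 1000)); // -1
        System.out.println(t.subList(0, 4));                          // [1, 11, 21, 31]
    }
}
```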
Moving Snapshot Implementation
[Figure: a timestamp line showing the checkpoint for Statement 1, the checkpoint for Statement 2, and the writes by Statement 1.]
Timestamps are allocated by the TM in blocks.
The client promotes the checkpoint.
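One way to read "timestamps allocated in blocks, client promotes the checkpoint" is sketched below. The details are assumptions, not Omid's actual scheme: the class `CheckpointSketch`, the `checkpoint`/`writeTs` pair, and the rule "own writes are visible if their timestamp is at most the checkpoint" are illustrative. Each statement first promotes the checkpoint to the previous write timestamp, then stamps its own writes with a fresh timestamp from the block, so it sees earlier statements' writes but never its own.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a moving snapshot inside one transaction (names are illustrative).
class CheckpointSketch {

    // A row of table T: its ID, the timestamp it was written at, and whether this
    // transaction wrote it (otherwise it was committed before the transaction began).
    static final class Row {
        final int id;
        final long ts;
        final boolean ownWrite;
        Row(int id, long ts, boolean ownWrite) { this.id = id; this.ts = ts; this.ownWrite = ownWrite; }
    }

    private final long blockEnd;   // exclusive end of the timestamp block obtained from the TM
    private long checkpoint;       // own writes with ts <= checkpoint are visible
    private long writeTs;          // timestamp for the current statement's writes
    final List<Row> rows = new ArrayList<>();

    CheckpointSketch(long blockStart, int blockSize) {
        this.blockEnd = blockStart + blockSize;
        this.checkpoint = blockStart;
        this.writeTs = blockStart;
    }

    // Client-side promotion before each statement: previous writes become visible and
    // new writes get a fresh timestamp from the block, with no TM round trip.
    void promote() {
        if (writeTs + 1 >= blockEnd) throw new IllegalStateException("timestamp block exhausted");
        checkpoint = writeTs;
        writeTs++;
    }

    // Visible IDs: rows committed before the transaction, plus own writes up to the checkpoint.
    List<Integer> scan() {
        List<Integer> ids = new ArrayList<>();
        for (Row r : rows) {
            if (!r.ownWrite || r.ts <= checkpoint) ids.add(r.id);
        }
        return ids;
    }

    // One statement: INSERT INTO T SELECT ID + delta FROM T.
    void insertSelect(int delta) {
        promote();
        for (int id : scan()) {
            // Rows written here carry writeTs > checkpoint, so this statement cannot see them.
            rows.add(new Row(id + delta, writeTs, true));
        }
    }

    public static void main(String[] args) {
        CheckpointSketch tx = new CheckpointSketch(100, 10);
        tx.rows.add(new Row(1, 50, false)); // ID = 1, committed before the transaction began
        tx.insertSelect(10);                // statement 1
        tx.insertSelect(100);               // statement 2 sees statement 1's rows
        List<Integer> all = new ArrayList<>();
        for (Row r : tx.rows) all.add(r.id);
        System.out.println(all);            // [1, 11, 101, 111]
    }
}
```

When the block runs out, this sketch simply throws; a real client would presumably have to obtain a new block from the TM, which the slide does not detail.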
Omid Scalability
Extremely lean Client-Transaction Manager protocol (by contrast, Omid1 and Tephra replicate the entire state to the client side upon BEGIN)
Aggressive batching of writes to the Commit Table (CT) in the Transaction Manager
Concurrent conflict detection (experimental)
HA algorithm incurs zero overhead in the common (failure-free) case
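Batching Commit Table writes trades a little latency for far fewer storage writes. The sketch below is hypothetical (`CommitTableBatcher`, `CommitRecord`, and the size-only flush trigger are illustrative, not Omid's code); its one essential rule is that a batch is persisted before any commit in it is acknowledged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Hypothetical sketch of batched commit-table writes at the TM (not Omid's code).
class CommitTableBatcher {

    static final class CommitRecord {
        final long startTs;
        final long commitTs;
        CommitRecord(long startTs, long commitTs) { this.startTs = startTs; this.commitTs = commitTs; }
    }

    private final int maxBatch;
    private final Consumer<List<CommitRecord>> persistBatch; // one storage write per batch
    private final List<CommitRecord> pending = new ArrayList<>();
    final List<Long> acknowledged = new ArrayList<>();        // tr values replied to clients
    int flushes = 0;

    CommitTableBatcher(int maxBatch, Consumer<List<CommitRecord>> persistBatch) {
        this.maxBatch = maxBatch;
        this.persistBatch = persistBatch;
    }

    // Called once conflict detection has decided to commit (tr, tc).
    void onCommitDecided(long startTs, long commitTs) {
        pending.add(new CommitRecord(startTs, commitTs));
        if (pending.size() >= maxBatch) flush();
    }

    // Persist the batch first, then acknowledge, so no client learns of a commit
    // that is not yet durable. A practical batcher would also flush on a timer.
    void flush() {
        if (pending.isEmpty()) return;
        List<CommitRecord> batch = new ArrayList<>(pending);
        pending.clear();
        persistBatch.accept(batch);
        flushes++;
        for (CommitRecord r : batch) acknowledged.add(r.startTs);
    }

    public static void main(String[] args) {
        List<CommitRecord> store = new ArrayList<>();
        CommitTableBatcher tm = new CommitTableBatcher(3, store::addAll);
        for (long tr = 1; tr <= 7; tr++) tm.onCommitDecided(tr, tr + 100);
        System.out.println(tm.flushes + " " + tm.acknowledged.size()); // 2 6
        tm.flush();                                                     // drain the tail
        System.out.println(tm.flushes + " " + store.size());          // 3 7
    }
}
```

The seventh commit waits in `pending` until the next flush; that waiting time is the latency cost the "Low-Latency Omid" slide refers to.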
Throughput Benchmark
[Figure: throughput (tps × 10^3) of Omid1, Omid1 Non Durable, Omid, and Omid Non Durable. Setup: YCSB workload driver, 12-core Transaction Manager, 1G network.]
Overhead in Production: Web Search Indexing
[Figure: task latency (ms) for document inversion, duplicate detection, out-link processing, in-link processing, and stream to runtime, broken down into Begin, Read, Compute, Update, and Commit + CT update.]
Low-Latency Omid (Experimental)
Original Design: Built with Throughput-Oriented Applications in Mind
Sometimes this comes at the expense of latency
Example: writes to the Commit Table are batched at the Transaction Manager
Key: Dissolve the Transaction Manager I/O Bottleneck
Distribute the Commit Table and the writes to it
How?
The client, rather than the TM, persists the Commit Timestamp (CTS)
The CTS is embedded in the first row written by the transaction
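A minimal sketch of "CTS embedded in the first row" follows, assuming each version carries a pointer to its transaction's first row and the commit timestamp is stored in a column of that row. The class `DistributedCommitSketch`, the `ctsColumn` layout, and the pointer scheme are hypothetical; the slide does not specify how readers locate the CTS.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a distributed commit table (names and layout are illustrative).
class DistributedCommitSketch {

    // A data version: value, writer's tr, and the key of the writer's first row.
    static final class Version {
        final String value;
        final long startTs;
        final String firstRowKey;
        Version(String value, long startTs, String firstRowKey) {
            this.value = value;
            this.startTs = startTs;
            this.firstRowKey = firstRowKey;
        }
    }

    final Map<String, Version> data = new HashMap<>();             // one version per key
    final Map<String, Map<Long, Long>> ctsColumn = new HashMap<>(); // row key -> (tr -> CTS)

    // Client write; the transaction's first written key serves as its anchor row.
    void write(long startTs, String firstRowKey, String key, String value) {
        data.put(key, new Version(value, startTs, firstRowKey));
    }

    // Client commit: once the TM has issued the CTS (conflict detection not modeled),
    // the client itself persists it in the anchor row; there is no central commit table.
    void persistCommit(long startTs, String firstRowKey, long cts) {
        ctsColumn.computeIfAbsent(firstRowKey, k -> new HashMap<>()).put(startTs, cts);
    }

    // Reader: follow the version's pointer to the anchor row to find the CTS.
    String read(String key, long readerStartTs) {
        Version v = data.get(key);
        if (v == null) return null;
        Long cts = ctsColumn.getOrDefault(v.firstRowKey, Collections.emptyMap()).get(v.startTs);
        return (cts != null && cts < readerStartTs) ? v.value : null;
    }

    public static void main(String[] args) {
        DistributedCommitSketch db = new DistributedCommitSketch();
        db.write(1, "rowA", "rowA", "x");       // first row written by tr = 1
        db.write(1, "rowA", "rowB", "y");       // later rows point back to rowA
        System.out.println(db.read("rowB", 5)); // null: not committed yet
        db.persistCommit(1, "rowA", 2);
        System.out.println(db.read("rowB", 5)); // y
    }
}
```

Because each transaction's commit record lives in one of its own data rows, commit-record writes spread across the data tables instead of funneling through the TM.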
Benchmark: Single-Write Transaction Workload
[Figure: latency (msec) vs. throughput (tps × 10^3) for Omid and Low-latency Omid.]
Summary
Scalable, Highly Available Open Source Transaction Processing
Battle-Tested, Ready for Public Cloud
Integration with Apache Phoenix Underway (GA in 2017)
Thanks to Our Partners for Being Awesome

Backup

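Architecture, Recapped
[Figure: the Client calls the Transaction Manager (TSO) for Begin/Commit and reads/writes the Data tables directly; the TSO persists commits to the Commit Table, against which clients verify commits. The Transaction Manager is marked SPoF (single point of failure).]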
HA: Primary-Backup Transaction Manager
[Figure: a Primary and a Backup Transaction Manager (TSO), with recovery state stored in ZooKeeper (ZK); the Client accesses the Data tables and the Commit Table.]
Split Brain
[Figure: the Client, the Commit Table, and both the Primary and the Backup Transaction Manager (TSO) active at the same time.]
Race Conditions Violate SI
Take I: Fence the CT upon every write (slow!)
HA Algorithm – Key Ideas
Old and new primaries may write conflicting commit records
No Locks!
The client detects inconsistencies and invalidates problematic records
Lease-Based Leader Election
Optimization: local lease check before and after writing to the CT
Zero Overhead in Non-Recovery Scenarios
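The key ideas can be combined in a small model: commit records and client invalidations race through an atomic put-if-absent on the Commit Table, and the TM checks its lease locally before and after its write. The use of put-if-absent as the arbitration primitive, the `INVALID` marker, and the class `HaCommitSketch` are assumptions made for illustration; what happens to a record written just before the lease expired is beyond this sketch.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch of lease checks plus client invalidation (not Omid's code).
class HaCommitSketch {

    static final long INVALID = -1L;

    // Commit table: tr -> tc, or INVALID. putIfAbsent makes the first writer win.
    final ConcurrentMap<Long, Long> commitTable = new ConcurrentHashMap<>();

    // TM commit path with a local lease check before and after the CT write.
    // Times are passed in explicitly so the example is deterministic.
    // Returns true only if the TM may acknowledge the commit to the client.
    boolean tmCommit(long startTs, long commitTs, long leaseExpiry, long nowBefore, long nowAfter) {
        if (nowBefore >= leaseExpiry) return false;             // lease lost: do not write
        Long prev = commitTable.putIfAbsent(startTs, commitTs); // the CT write
        if (prev != null) return false;                         // a client invalidated tr first
        return nowAfter < leaseExpiry;                          // lease may have expired mid-write
    }

    // Client path: mark tr invalid unless a commit record already exists.
    // Returns true if tr is now (or already was) invalid.
    boolean clientInvalidate(long startTs) {
        Long prev = commitTable.putIfAbsent(startTs, INVALID);
        return prev == null || prev == INVALID;
    }

    public static void main(String[] args) {
        HaCommitSketch ha = new HaCommitSketch();
        System.out.println(ha.tmCommit(1, 2, 100, 10, 20)); // true: lease valid throughout
        System.out.println(ha.clientInvalidate(3));         // true: tr = 3 had no commit record
        System.out.println(ha.tmCommit(3, 4, 100, 30, 40)); // false: the client won the race
        System.out.println(ha.clientInvalidate(1));         // false: tr = 1 is committed
    }
}
```

In the failure-free path the TM pays only two local clock comparisons per commit, which matches the "zero overhead in non-recovery scenarios" claim; clients invalidate records only when they suspect a problem.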

