In-memory OLTP storage with
persistence and transaction support
Alexander Korotkov
Postgres Professional
October 25, 2017
Alexander Korotkov In-memory OLTP storage with persistence and transaction support 1 / 34
Disclaimer
▶ This talk is not about something production-ready. Don’t hold
your breath waiting to use any of the functionality discussed
here in production. When this functionality becomes
available and production-ready, it might look
dramatically different.
▶ This talk is about some intermediate results achieved during
development. These results are presented for discussion and
brainstorming in order to improve further development.
What is this talk about?
▶ We (Postgres Pro) have a prototype of in-memory OLTP storage
implemented using FDW.
▶ It’s a proof of concept of the opportunities for in-memory OLTP in
PostgreSQL (it was debatable whether there are any).
▶ It’s yet another example of alternative storage implemented
using the FDW interface before we have native pluggable storages.
So it’s a waypoint to verify where we are on pluggable storages.
Why pluggable storages?
▶ Lack of pluggable storage support is seen as a
limitation.
PostgreSQL has always been positioned as a highly extensible DBMS, while
the lack of pluggable storage support is a large gap in this area.
▶ Rising interest in PostgreSQL from enterprises.
Postgres-centric companies have enough resources to support
multiple storage engines. Enterprises are also interested in using
PostgreSQL for non-OLTP tasks. Alternative storages might improve
OLTP too (UNDO log for better update performance).
Use cases for pluggable storages
▶ Different MVCC implementations: mostly variations of an UNDO
log, but not only.
▶ Data compression: row-level, page-level, etc.
▶ Non-disk-oriented storage: in-memory, SSD-optimized,
NVRAM-optimized.
▶ Non-heap-like row layout: index-organized table (IOT),
including LSM.
▶ Non-row-oriented data layout: column or Parquet
layouts.
Current state of pluggable storage
https://www.postgresql.org/message-id/flat/20160812231527.GA690404%40alvherre.pgsql
▶ Started as a fairly mechanical separation of heap_* methods into
a storage AM interface.
▶ The boundary of the storage layer shifted significantly during
discussions.
▶ Still a lot of work to do.
Current view on pluggable storage
properties
▶ All storages should share the same transaction model (or have
no transactions at all).
▶ All storages should write to the same WAL stream.
▶ Tuples have to be identified by TIDs (further improvement is
possible).
▶ Storages should share some of the index access methods.
▶ The index access method interface should be extended with new
functions (at least retail tuple delete).
▶ Storages may have completely different MVCC implementations.
Why FDW for prototyping
Pros:
▶ An FDW is completely free in how it scans and modifies the
foreign table.
▶ This approach is already used in cstore_fdw [1] and vops [2].
Cons:
▶ Lack of control over associated resources,
▶ Lack of DDL support.
[1] https://github.com/citusdata/cstore_fdw
[2] https://github.com/postgrespro/vops
Why in-memory?
▶ No extra mapping layer (buffer manager) to go through when moving from one page
to another.
▶ Row-level WAL takes less space (no page-level information, no
explicit index logging), but is slower to apply.
▶ Better IO utilization (both snapshots and WAL are written
sequentially).
What is this particular in-memory engine?
▶ An index-organized table where the index is an in-memory B-tree.
▶ This B-tree supports transactions and MVCC using an UNDO log,
which is a circular buffer in memory containing both row-level
and page-level records.
▶ It writes full data snapshots on checkpoints, plus row-level WAL.
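The persistence model in the last bullet (a full snapshot at each checkpoint plus a row-level log of later changes) can be illustrated with a toy sketch. Everything in it is hypothetical: the `ToyStore` class, its record format, and its method names are assumptions made for illustration and are not taken from the prototype.

```python
from dataclasses import dataclass, field


@dataclass
class ToyStore:
    """Toy key-value store: full snapshot at checkpoint plus a row-level log.

    Hypothetical illustration only; not the prototype's data structures.
    """
    rows: dict = field(default_factory=dict)      # live in-memory data
    snapshot: dict = field(default_factory=dict)  # last full snapshot ("on disk")
    wal: list = field(default_factory=list)       # row-level records since snapshot

    def put(self, key, value):
        # Log the logical row change first, then apply it in memory.
        self.wal.append(("put", key, value))
        self.rows[key] = value

    def delete(self, key):
        self.wal.append(("delete", key, None))
        self.rows.pop(key, None)

    def checkpoint(self):
        # Write a full copy of the data; earlier log records are no longer needed.
        self.snapshot = dict(self.rows)
        self.wal.clear()

    def recover(self):
        # Crash recovery: start from the snapshot, then replay the row-level log.
        rows = dict(self.snapshot)
        for op, key, value in self.wal:
            if op == "put":
                rows[key] = value
            else:
                rows.pop(key, None)
        return rows


store = ToyStore()
store.put(1, "val1")
store.put(2, "val2")
store.checkpoint()
store.put(3, "val3")
store.delete(1)
store.rows = {}  # simulate losing main memory in a crash
recovered = store.recover()  # {2: "val2", 3: "val3"}
```

In this sketch the log holds only logical row changes, so recovery has to re-execute each of them against the snapshot rather than copy page images back.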
Why our in-memory engine is a good
example of pluggable storage
Because it does things in quite a different way.
▶ It stores data in main memory with quite a different
model of persistence: full data snapshots plus
row-level WAL.
▶ It doesn’t have a heap-like layout.
▶ It uses a very different MVCC implementation: a
combination of row-level and page-level undo logs.
Why our in-memory engine is a bad
example of pluggable storage
▶ It uses the CSN snapshot model, which is far from getting
committed.
▶ Tuples aren’t identified by TIDs.
▶ Persistence is implemented using a set of hacks.
Configuration parameters
▶ in_memory_engine.shared_pool_size – size of the
separate pool of 1k pages for in-memory tables.
▶ in_memory_engine.undo_size – size of the circular
buffer for undo records that supports transactions and
MVCC.
Usage: defining an in-memory table and
inserting data
CREATE EXTENSION in_memory;
CREATE FOREIGN TABLE im_test
(
    id int8 NOT NULL,
    val text NOT NULL
) SERVER in_memory OPTIONS (indices 'unique (id)',
                            persistent 'true');
INSERT INTO im_test
(SELECT id, 'val' || id FROM generate_series(1, 1000000) id);
Usage: querying a single key
# EXPLAIN ANALYZE SELECT * FROM im_test WHERE id = 50000;
QUERY PLAN
----------------------------------------------------------------
Foreign Scan on im_test (cost=0.06..4.52 rows=357 width=40)
(actual time=0.190..0.191 rows=1 loops=1)
Pk conds: (id = 50000)
Planning time: 0.635 ms
Execution time: 0.260 ms
(4 rows)
Usage: querying a key range
# EXPLAIN ANALYZE SELECT * FROM im_test
WHERE id >= 10000 AND id <= 20000;
QUERY PLAN
----------------------------------------------------------------
Foreign Scan on im_test (cost=0.06..5.41 rows=357 width=40)
(actual time=0.045..4.194 rows=10001 loops=1)
Pk conds: (id >= 10000 AND id <= 20000)
Planning time: 0.075 ms
Execution time: 4.915 ms
(4 rows)
Usage: querying by a non-key condition
# EXPLAIN ANALYZE SELECT * FROM im_test WHERE val LIKE '%1111%';
QUERY PLAN
----------------------------------------------------------------
Foreign Scan on im_test (cost=0.06..891.83 rows=571 width=40)
(actual time=0.345..187.325 rows=280 loops=1)
Filter: (val ~~ '%1111%'::text)
Rows Removed by Filter: 999720
Planning time: 0.046 ms
Execution time: 187.375 ms
(5 rows)
Usage: non-persistent tables are writable on
standby
*** Master ***
# CREATE FOREIGN TABLE im_test (id int8 NOT NULL, val text NOT NULL)
SERVER in_memory OPTIONS (indices 'unique (id)',
persistent 'false');
# INSERT INTO im_test
(SELECT id, 'val' || id FROM generate_series(1, 1000000) id);
INSERT 0 1000000
*** Standby ***
# SELECT * FROM im_test;
id | val
----+-----
(0 rows)
# INSERT INTO im_test VALUES (1, 'foo'), (2, 'bar');
INSERT 0 2
# SELECT * FROM im_test;
 id | val
----+-----
  1 | foo
  2 | bar
(2 rows)
Limitations
▶ Only a B-tree with limited functionality is supported.
▶ Secondary indexes aren’t supported yet.
▶ Out-of-line storage for tuples isn’t supported yet.
▶ The undo log must not wrap around within a single transaction (such a
transaction is automatically aborted).
▶ If a required undo record has already been overwritten, a “snapshot’s
too old” error is emitted.
▶ The serializable isolation level isn’t supported.
▶ Replication isn’t supported yet.
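The two undo-log limitations follow from the undo log being a fixed-size circular buffer. A minimal sketch of that behaviour, assuming records are addressed by an ever-increasing logical position; the `UndoRing` class, the `SnapshotTooOld` exception, and the `would_wrap` check are hypothetical names, not the prototype's API:

```python
class SnapshotTooOld(Exception):
    """Raised when a needed undo record was already overwritten."""


class UndoRing:
    """Fixed-size circular buffer of undo records (hypothetical sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = [None] * capacity
        self.next_pos = 0  # logical position of the next record; never wraps

    def append(self, record):
        pos = self.next_pos
        self.slots[pos % self.capacity] = record  # reuses the oldest slot
        self.next_pos += 1
        return pos

    def oldest_pos(self):
        # Smallest logical position whose record is still in the buffer.
        return max(0, self.next_pos - self.capacity)

    def read(self, pos):
        if pos < self.oldest_pos():
            raise SnapshotTooOld(f"undo record {pos} was overwritten")
        return self.slots[pos % self.capacity]

    def would_wrap(self, txn_first_pos):
        # True if the next append would overwrite this transaction's own
        # first undo record; in this sketch such a transaction must abort.
        return self.next_pos - txn_first_pos >= self.capacity


ring = UndoRing(capacity=4)
first = ring.append("undo for row 1")      # position 0
for i in range(4):
    ring.append(f"undo for row {i + 2}")   # positions 1..4; position 4 reuses slot 0

try:
    ring.read(first)
    too_old = False
except SnapshotTooOld:
    too_old = True

still_readable = ring.read(1)
```

A reader whose snapshot needs position 0 gets the error, while position 1 is still available. A transaction that started at position 1 would have to abort before writing another record.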
Read-only benchmark
[Figure: QPS (y-axis) vs. # clients (x-axis); series: builtin, in-memory]
pgbench -s 1000 -j $n -c $n -M prepared -S on 4 x 18 cores Intel Xeon E7-8890 processors
mean of 3 3-minute runs with shared_buffers = 32GB, max_connections = 300
Why is there no win?
Storage is only one of the layers participating in query
execution. There are also:
▶ Network layer,
▶ Executor,
▶ Parser (plus analyze & rewrite if not prepared),
▶ Transaction management (including snapshot
acquisition),
▶ ...
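One way to reason about this is Amdahl's law: if storage access is only a fraction of the per-query time, making it faster bounds the overall gain. The sketch below is a generic calculation with hypothetical numbers; the fractions and speedups are assumptions, not measurements from these benchmarks.

```python
def overall_speedup(storage_fraction, storage_speedup):
    """Amdahl's law: whole-query speedup when only the storage layer gets faster."""
    if not 0.0 <= storage_fraction <= 1.0:
        raise ValueError("storage_fraction must be between 0 and 1")
    if storage_speedup <= 0:
        raise ValueError("storage_speedup must be positive")
    return 1.0 / ((1.0 - storage_fraction) + storage_fraction / storage_speedup)


# Hypothetical numbers, chosen only to show the shape of the effect.
small_share = overall_speedup(storage_fraction=0.1, storage_speedup=10.0)
large_share = overall_speedup(storage_fraction=0.8, storage_speedup=10.0)
```

With storage at 10% of query time, a tenfold faster storage layer yields roughly a 1.1x overall gain; at 80% the same improvement yields roughly 3.6x.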
Measuring overheads
[Figure: QPS (y-axis) vs. # clients (x-axis); series: read-only, "SELECT 1;", ";"]
pgbench -s 1000 -j $n -c $n -M prepared -S on 4 x 18 cores Intel Xeon E7-8890 processors
mean of 3 3-minute runs with shared_buffers = 32GB, max_connections = 300
Read-only benchmark:
fetch 9 values per single query
\set aid1 random(1, 100000 * :scale)
\set aid2 random(1, 100000 * :scale)
\set aid3 random(1, 100000 * :scale)
\set aid4 random(1, 100000 * :scale)
\set aid5 random(1, 100000 * :scale)
\set aid6 random(1, 100000 * :scale)
\set aid7 random(1, 100000 * :scale)
\set aid8 random(1, 100000 * :scale)
\set aid9 random(1, 100000 * :scale)
SELECT abalance FROM pgbench_accounts WHERE
    aid IN (:aid1, :aid2, :aid3, :aid4, :aid5, :aid6, :aid7, :aid8, :aid9);
Read-only benchmark:
fetch 9 values per single query
[Figure: QPS (y-axis) vs. # clients (x-axis); series: builtin, in-memory]
pgbench -s 1000 -j $n -c $n -M prepared -f ro9.sql on 4 x 18 cores Intel Xeon E7-8890 processors
mean of 3 3-minute runs with shared_buffers = 32GB, max_connections = 300
Read-only benchmark:
compare values-per-second
[Figure: VPS (values per second, y-axis) vs. # clients (x-axis); series: builtin-1, builtin-9, in_memory-1, in_memory-9]
pgbench -s 1000 -j $n -c $n -M prepared -S on 4 x 18 cores Intel Xeon E7-8890 processors
mean of 3 3-minute runs with shared_buffers = 32GB, max_connections = 300
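The values-per-second metric puts the 1-value and 9-value workloads on a common scale. A minimal sketch of the conversion, assuming VPS is simply QPS multiplied by the number of values fetched per query; the figures used are hypothetical, not read off the chart:

```python
def values_per_second(qps, values_per_query):
    """Convert queries per second into fetched values per second."""
    return qps * values_per_query


# Hypothetical throughput figures, for illustration only.
single = values_per_second(qps=1_000_000, values_per_query=1)
batched = values_per_second(qps=400_000, values_per_query=9)
```

With these made-up numbers the 9-value workload runs fewer queries per second yet fetches 3.6 times as many values, which is the kind of comparison the chart makes.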
Read-write benchmark without persistence
(async commit)
[Figure: TPS (y-axis) vs. # clients (x-axis); series: unlogged table, in_memory]
pgbench -s 1000 -j $n -c $n -M prepared on 4 x 18 cores Intel Xeon E7-8890 processors
mean of 3 3-minute runs with shared_buffers = 32GB, max_connections = 300
Read-write benchmark with persistence
(async commit)
[Figure: TPS (y-axis) vs. # clients (x-axis); series: builtin, in-memory]
pgbench -s 1000 -j $n -c $n -M prepared on 4 x 18 cores Intel Xeon E7-8890 processors
mean of 3 3-minute runs with shared_buffers = 32GB, max_connections = 300
Read-write benchmark:
do the transaction in a single statement
-- The whole pgbench read-write transaction wrapped in one function.
CREATE OR REPLACE FUNCTION tcpb_trx(_aid int, _bid int, _tid int, _delta int)
RETURNS void AS $$
BEGIN
    UPDATE pgbench_accounts SET abalance = abalance + _delta WHERE aid = _aid;
    PERFORM abalance FROM pgbench_accounts WHERE aid = _aid;
    UPDATE pgbench_tellers SET tbalance = tbalance + _delta WHERE tid = _tid;
    UPDATE pgbench_branches SET bbalance = bbalance + _delta WHERE bid = _bid;
    INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES
        (_tid, _bid, _aid, _delta, CURRENT_TIMESTAMP);
END;
$$ LANGUAGE plpgsql;
-- pgbench script: each transaction is a single function call.
\set aid random(1, 100000 * :scale)
\set bid random(1, 1 * :scale)
\set tid random(1, 10 * :scale)
\set delta random(-5000, 5000)
SELECT tcpb_trx(:aid, :bid, :tid, :delta);
Read-write benchmark with persistence
(async commit, function vs. interactive)
[Figure: TPS (y-axis) vs. # clients (x-axis); series: builtin, builtin-func, in_memory, in_memory-func]
pgbench -s 1000 -j $n -c $n -M prepared on 4 x 18 cores Intel Xeon E7-8890 processors
mean of 3 3-minute runs with shared_buffers = 32GB, max_connections = 300
Hacks used in the implementation
▶ Minimalistic CSN implementation
CSNs are assigned, but neither used nor written to SLRUs.
The in-memory engine doesn’t need SLRU.
▶ Checkpoint hook
The in-memory engine writes a full data snapshot on checkpoint.
▶ Generic logical message hook
Used to implement custom recovery/replication. This is an awful
hack.
▶ TRUNCATE using a utility command hook
TRUNCATE isn’t supported by FDW directly.
▶ DROP support using an event trigger
Used to free the resources occupied by an in-memory table.
Recovery problem
▶ Row-level WAL is compact, but it requires meta-information to
apply. That is, we need to be able to read the system catalog while
applying WAL, including during recovery.
▶ We can’t access the system catalog during recovery, because the
whole database isn’t accessible until it has been recovered.
Recovery problem solution:
2-phase recovery
In the second phase we have a consistent system catalog.
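The slide's diagram did not survive extraction, so the following is only one plausible reading of a two-phase scheme, sketched under explicit assumptions: phase 1 replays the records that rebuild the catalog and skips row-level records, and phase 2 re-reads the log and decodes the row-level records against the now-consistent catalog. The record format, the `two_phase_recovery` function, and the use of the first column as the key are all hypothetical.

```python
def two_phase_recovery(wal):
    """Toy two-phase replay (hypothetical sketch, not the prototype's code).

    wal is a list of dicts; 'kind' is 'catalog' (table definition) or
    'row' (row-level change that needs the catalog to be decoded).
    """
    catalog = {}  # table name -> list of column names
    tables = {}   # table name -> {key: decoded row}

    # Phase 1: replay only what can be applied without reading the catalog.
    # Row-level records are skipped because they cannot be decoded yet.
    for rec in wal:
        if rec["kind"] == "catalog":
            catalog[rec["table"]] = rec["columns"]

    # Phase 2: the catalog is consistent now, so row-level records
    # can be decoded and applied. Assumption: first column is the key.
    for rec in wal:
        if rec["kind"] == "row":
            columns = catalog[rec["table"]]
            row = dict(zip(columns, rec["values"]))
            tables.setdefault(rec["table"], {})[rec["values"][0]] = row
    return tables


wal = [
    {"kind": "catalog", "table": "im_test", "columns": ["id", "val"]},
    {"kind": "row", "table": "im_test", "values": (1, "foo")},
    {"kind": "row", "table": "im_test", "values": (2, "bar")},
]
recovered_tables = two_phase_recovery(wal)
```

The sketch makes two passes over the same log, so row-level records never need the catalog before phase 1 has finished.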
Future roadmap
Integrate in-memory as pluggable storage:
▶ In-memory B-tree as an index access method.
▶ Implement storage for in-memory tables in one of the following
ways:
  ▶ Write some kind of «in-memory heap», and/or
  ▶ Write a storage wrapper implementing an index-organized
  table.
Thank you for your attention!
In-memory OLTP storage with persistence and transaction support

  • 1. In-memory OLTP storage with persistence and transac on support Alexander Korotkov Postgres Professional October 25, 2017 Alexander Korotkov In-memory OLTP storage with persistence and transac on support 1 / 34
  • 2. Disclaimer ▶ This talk is not about something produc on ready. Don’t hold your breath while wai ng for use some of considered func onality in produc on. When this func onality will be available and produc on-ready, it might become something drama cally different. ▶ This talk is about some intermediate results achieved during development. These results are presented for discussion and brainstorming in order to make further development be er. Alexander Korotkov In-memory OLTP storage with persistence and transac on support 2 / 34
  • 3. What this talk is about? ▶ We (Postgres Pro) have a prototype of in-memory OLTP storage implemented using FDW. ▶ It’s proof of concept of opportuni es for in-memory OLTP in PostgreSQL (it was debatable that there are any). ▶ It’s yet another example of alterna ve storage implemented using FDW interface before we’ve na ve pluggable storages. So, it’s waypoint to verify where we are on pluggable storages. Alexander Korotkov In-memory OLTP storage with persistence and transac on support 3 / 34
  • 4. Why pluggable storages? ▶ Lack of pluggable storages support is understood as limita on. PostgreSQL was always posi oned as highly extendable DBMS while lack of pluggable storages support is large gap in this area. ▶ Rising interest in PostgreSQL from enterprises. Postgres-centric companies have enough of resources to support mul ple storage engines. Enterprises are also interested in using PostgreSQL for non-OLTP tasks. Alterna ve storages might improve OLTP too (UNDO log for be er update performance). Alexander Korotkov In-memory OLTP storage with persistence and transac on support 4 / 34
  • 5. Use cases for pluggable storages ▶ Different MVCC implementa on: mostly varia ons of UNDO log, but not only. ▶ Data compression: either row-level, page-level etc... ▶ Non disk-oriented storage: in-memory, SSD-op mized, NVRAM-op mized. ▶ Non heap-like rows layout: index-organized table (IOT) including LSM. ▶ Non row-oriented data layout: either column or parquet layouts. Alexander Korotkov In-memory OLTP storage with persistence and transac on support 5 / 34
  • 6. Current state of pluggable storage https://www.postgresql.org/message-id/flat/ 20160812231527.GA690404%40alvherre.pgsql ▶ Started as quite mechanical separa on of heap_* methods into storage AM interface. ▶ Boundary of storage layer was significantly shi ed during discussions. ▶ S ll a lot of work to do. Alexander Korotkov In-memory OLTP storage with persistence and transac on support 6 / 34
  • 7. Current view on pluggable storage proper es ▶ All the storages should share same transac on model (or have no transac ons at all). ▶ All the storages should write same WAL stream. ▶ Tuples have to be iden fied by TIDs (further improvement is possible). ▶ Storages should share some of index access methods. ▶ Index access method interface should be expanded with new func ons (at least retail tuple delete). ▶ Storages may have completely different MVCC implementa on. Alexander Korotkov In-memory OLTP storage with persistence and transac on support 7 / 34
  • 8. Why FDW for prototyping Pro: ▶ FDW is completely free in the way it scans and modifies the foreign table. ▶ This approach is already used in cstore_fdw1 , vops2 . Cons: ▶ Lack of control on associated resources, ▶ Lack of DDL support. 1 https://github.com/citusdata/cstore_fdw 2 https://github.com/postgrespro/vops Alexander Korotkov In-memory OLTP storage with persistence and transac on support 8 / 34
  • 9. Why in-memory? ▶ No extra mapping layer (buffer manager) to traverse from one page to another. ▶ Row-level WAL takes less space (no page-level informa on, no explicit index logging), but slower to apply. ▶ Be er IO u liza on (write both snapshots and WAL are wri en sequen ally). Alexander Korotkov In-memory OLTP storage with persistence and transac on support 9 / 34
  • 10. What this par cular in-memory engine is? ▶ Index organized table where index is in-memory B-tree. ▶ This B-tree supports transac ons and MVCC using UNDO log which is circular buffer in memory containing both row-level and page-level records. ▶ It writes full data snapshots on checkpoints and row-level WAL. Alexander Korotkov In-memory OLTP storage with persistence and transac on support 10 / 34
  • 11. Why our in-memory engine is a good example of pluggable storage Because it does the things in a quite different way. ▶ It stores data in main memory with quite different model of persistence: full data snapshots plus row-level WAL. ▶ It doesn’t have heap-like layout. ▶ It uses very different MVCC implementa on: combina on of row-level and page-level undo logs. Alexander Korotkov In-memory OLTP storage with persistence and transac on support 11 / 34
  • 12. Why our in-memory engine is a bad example of pluggable storage ▶ It uses CSN snapshot model which is far from ge ng commi ed. ▶ Tuples aren’t iden fied by TIDs. ▶ Persistence is implemented using set of hacks. Alexander Korotkov In-memory OLTP storage with persistence and transac on support 12 / 34
  • 13. Configura on parameters ▶ in_memory_engine.shared_pool_size – size of separate pool of 1k pages for in-memory tables. ▶ in_memory_engine.undo_size – size of circular buffer for undo records to support transac ons and MVCC. Alexander Korotkov In-memory OLTP storage with persistence and transac on support 13 / 34
  • 14. Usage: defining in-memory table and inser ng data CREATE EXTENSION in_memory; CREATE FOREIGN TABLE im_test ( id int8 NOT NULL, val text NOT NULL ) SERVER in_memory OPTIONS (indices ’unique (id)’, persistent ’true’); INSERT INTO im_test (SELECT id, ’val’ || id FROM generate_series(1, 1000000) id); Alexander Korotkov In-memory OLTP storage with persistence and transac on support 14 / 34
  • 15. Usage: querying a single key # EXPLAIN ANALYZE SELECT * FROM im_test WHERE id = 50000; QUERY PLAN ---------------------------------------------------------------- Foreign Scan on im_test (cost=0.06..4.52 rows=357 width=40) (actual time=0.190..0.191 rows=1 loops=1) Pk conds: (id = 50000) Planning time: 0.635 ms Execution time: 0.260 ms (4 rows) Alexander Korotkov In-memory OLTP storage with persistence and transac on support 15 / 34
  • 16. Usage: querying a key range # EXPLAIN ANALYZE SELECT * FROM im_test WHERE id >= 10000 AND id <= 20000; QUERY PLAN ---------------------------------------------------------------- Foreign Scan on im_test (cost=0.06..5.41 rows=357 width=40) (actual time=0.045..4.194 rows=10001 loops=1) Pk conds: (id >= 10000 AND id <= 20000) Planning time: 0.075 ms Execution time: 4.915 ms (4 rows) Alexander Korotkov In-memory OLTP storage with persistence and transac on support 16 / 34
  • 17. Usage: querying for non-key condi on # EXPLAIN ANALYZE SELECT * FROM im_test WHERE val LIKE ’%1111%’; QUERY PLAN ---------------------------------------------------------------- Foreign Scan on im_test (cost=0.06..891.83 rows=571 width=40) (actual time=0.345..187.325 rows=280 loops=1) Filter: (val ~~ ’%1111%’::text) Rows Removed by Filter: 999720 Planning time: 0.046 ms Execution time: 187.375 ms (5 rows) Alexander Korotkov In-memory OLTP storage with persistence and transac on support 17 / 34
  • 18. Usage: non-persistent tables are writable on standby *** Master *** # CREATE FOREIGN TABLE im_test (id int8 NOT NULL, val text NOT NULL) SERVER in_memory OPTIONS (indices ’unique (id)’, persistent ’false’); # INSERT INTO im_test (SELECT id, ’val’ || id FROM generate_series(1, 1000000) id); INSERT 0 1000000 *** Standby *** # SELECT * FROM im_test; id | val ----+----- (0 rows) # INSERT INTO im_test VALUES (1, ’foo’), (2, ’bar’); INSERT 0 2 # SELECT * FROM im_test; id | val ----+----- 1 | foo 2 | bar (2 rows) Alexander Korotkov In-memory OLTP storage with persistence and transac on support 18 / 34
  • 19. Limita ons ▶ Only B-tree with limited func onality is supported. ▶ No secondary indexes are supported yet. ▶ No out-of-line storage are supported for tuples yet. ▶ Undo log shouldn’t wraparound during single transac on (that transac on is automa cally aborted). ▶ If required undo record is already overflowed then “snapshot’s too old” error is emi ed. ▶ Serializable isola on level isn’t supported. ▶ Replica on isn’t supported yet. Alexander Korotkov In-memory OLTP storage with persistence and transac on support 19 / 34
  • 20. Read-only benchmark 0 50 100 150 200 250 # Clients 0 200000 400000 600000 800000 1000000 1200000 1400000 1600000 QPS pgbench -s 1000 -j $n -c $n -M prepared -S on 4 x 18 cores Intel Xeon E7-8890 processors mean of 3 3-minute runs with shared_buffers = 32GB, max_connections = 300 builtin in-memory Alexander Korotkov In-memory OLTP storage with persistence and transac on support 20 / 34
  • 21. Why there is no win? Storage is only one layer par cipa ng in query execu on. There are also: ▶ Network layer, ▶ Executor, ▶ Parser (analyze & rewrite if not prepared), ▶ Transac on management (including snapshot acquirement), ▶ ... Alexander Korotkov In-memory OLTP storage with persistence and transac on support 21 / 34
  • 22. Measuring overheads 0 50 100 150 200 250 # Clients 0 500000 1000000 1500000 2000000 2500000 3000000 3500000 QPS pgbench -s 1000 -j $n -c $n -M prepared -S on 4 x 18 cores Intel Xeon E7-8890 processors mean of 3 3-minute runs with shared_buffers = 32GB, max_connections = 300 read-only SELECT 1; ; Alexander Korotkov In-memory OLTP storage with persistence and transac on support 22 / 34
  • 23. Read-only benchmark: fetch 9 values per single query
    \set aid1 random(1, 100000 * :scale)
    \set aid2 random(1, 100000 * :scale)
    \set aid3 random(1, 100000 * :scale)
    \set aid4 random(1, 100000 * :scale)
    \set aid5 random(1, 100000 * :scale)
    \set aid6 random(1, 100000 * :scale)
    \set aid7 random(1, 100000 * :scale)
    \set aid8 random(1, 100000 * :scale)
    \set aid9 random(1, 100000 * :scale)
    SELECT abalance FROM pgbench_accounts
      WHERE aid IN (:aid1, :aid2, :aid3, :aid4, :aid5, :aid6, :aid7, :aid8, :aid9);
  • 24. Read-only benchmark: fetch 9 values per single query
    [Figure: QPS vs. number of clients; series: builtin, in-memory]
    pgbench -s 1000 -j $n -c $n -M prepared -f ro9.sql on 4 x 18 cores Intel Xeon E7-8890 processors; mean of 3 3-minute runs with shared_buffers = 32GB, max_connections = 300
  • 25. Read-only benchmark: compare values-per-second
    [Figure: VPS (values per second) vs. number of clients; series: builtin-1, builtin-9, in_memory-1, in_memory-9]
    pgbench -s 1000 -j $n -c $n -M prepared -S on 4 x 18 cores Intel Xeon E7-8890 processors; mean of 3 3-minute runs with shared_buffers = 32GB, max_connections = 300
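The values-per-second metric puts the 1-value and 9-value workloads on one scale. A minimal sketch, assuming VPS is simply QPS multiplied by the number of values each query fetches; the function name and the sample figures are hypothetical, not results from the benchmark.

```python
def values_per_second(qps, values_per_query):
    """Convert queries per second into fetched values per second."""
    if qps < 0 or values_per_query < 1:
        raise ValueError("qps must be >= 0 and values_per_query >= 1")
    return qps * values_per_query


# Made-up figures, only to show how the two workloads are compared.
single = values_per_second(1_000_000, 1)  # one value per query
batch = values_per_second(500_000, 9)     # nine values per query
print(single, batch)
```

A workload that runs fewer queries per second can still deliver more values per second, which is the point of the comparison.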
  • 26. Read-write benchmark without persistence (async commit)
    [Figure: TPS vs. number of clients; series: unlogged table, in_memory]
    pgbench -s 1000 -j $n -c $n -M prepared on 4 x 18 cores Intel Xeon E7-8890 processors; mean of 3 3-minute runs with shared_buffers = 32GB, max_connections = 300
  • 27. Read-write benchmark with persistence (async commit)
    [Figure: TPS vs. number of clients; series: builtin, in-memory]
    pgbench -s 1000 -j $n -c $n -M prepared on 4 x 18 cores Intel Xeon E7-8890 processors; mean of 3 3-minute runs with shared_buffers = 32GB, max_connections = 300
  • 28. Read-write benchmark: do the transaction in a single statement
    CREATE OR REPLACE FUNCTION tcpb_trx(_aid int, _bid int, _tid int, _delta int)
    RETURNS void AS $$
    BEGIN
        UPDATE pgbench_accounts SET abalance = abalance + _delta WHERE aid = _aid;
        PERFORM abalance FROM pgbench_accounts WHERE aid = _aid;
        UPDATE pgbench_tellers SET tbalance = tbalance + _delta WHERE tid = _tid;
        UPDATE pgbench_branches SET bbalance = bbalance + _delta WHERE bid = _bid;
        INSERT INTO pgbench_history (tid, bid, aid, delta, mtime)
            VALUES (_tid, _bid, _aid, _delta, CURRENT_TIMESTAMP);
    END;
    $$ LANGUAGE plpgsql;

    \set aid random(1, 100000 * :scale)
    \set bid random(1, 1 * :scale)
    \set tid random(1, 10 * :scale)
    \set delta random(-5000, 5000)
    SELECT tcpb_trx(:aid, :bid, :tid, :delta);
  • 29. Read-write benchmark with persistence (async commit, function vs. interactive)
    [Figure: TPS vs. number of clients; series: builtin, builtin-func, in_memory, in_memory-func]
    pgbench -s 1000 -j $n -c $n -M prepared on 4 x 18 cores Intel Xeon E7-8890 processors; mean of 3 3-minute runs with shared_buffers = 32GB, max_connections = 300
  • 30. Hacks used in implementation
    ▶ Minimalistic CSN implementation
      CSNs are assigned, but they are neither used nor written to SLRUs. The in-memory engine doesn’t need SLRU.
    ▶ Checkpoint hook
      The in-memory engine writes a full data snapshot on checkpoint.
    ▶ Generic logical message hook
      Used to implement custom recovery/replication. This is an awful hack.
    ▶ TRUNCATE using utility command hook
      TRUNCATE isn’t supported by FDW directly.
    ▶ DROP support using event trigger
      Used to free the resources occupied by an in-memory table.
  • 31. Recovery problem
    ▶ Row-level WAL is compact, but applying it requires meta-information. That is, we need to be able to read the system catalog while applying WAL, including during recovery.
    ▶ We can’t access the system catalog during recovery, because the whole database is inaccessible until it has been recovered.
  • 32. Recovery problem solution: 2-phase recovery
    In the second phase we have a consistent system catalog.
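The idea can be sketched with a toy model. This is one possible reading of the slide, not the prototype's actual mechanism: the first phase replays only catalog changes and sets row-level records aside, and the second phase decodes those records against the now-consistent catalog. The record layout and the function name `two_phase_recovery` are hypothetical.

```python
def two_phase_recovery(wal):
    """Toy two-phase replay of a WAL given as a list of dict records.

    Catalog record:   {"kind": "catalog", "table": name, "columns": [...]}
    Row-level record: {"kind": "row", "table": name, "values": [...]}
    """
    catalog = {}   # table name -> column names (stands in for the system catalog)
    deferred = []  # row-level records that cannot be decoded yet

    # Phase 1: recover the catalog only; row-level records are postponed
    # because decoding them needs meta-information.
    for record in wal:
        if record["kind"] == "catalog":
            catalog[record["table"]] = record["columns"]
        else:
            deferred.append(record)

    # Phase 2: the catalog is consistent now, so compact row-level
    # records can be turned back into named columns.
    tables = {}
    for record in deferred:
        columns = catalog[record["table"]]
        row = dict(zip(columns, record["values"]))
        tables.setdefault(record["table"], []).append(row)
    return tables


wal = [
    {"kind": "catalog", "table": "im_test", "columns": ["id", "val"]},
    {"kind": "row", "table": "im_test", "values": [1, "foo"]},
    {"kind": "row", "table": "im_test", "values": [2, "bar"]},
]
print(two_phase_recovery(wal))
```

The row-level records carry only values, which keeps them compact; the column names come from the catalog recovered in the first pass.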
  • 33. Future roadmap
    Integrate in-memory as pluggable storage:
    ▶ In-memory B-tree as an index access method.
    ▶ Implement storage for in-memory tables in one of the following ways:
      ▶ Write some kind of «in-memory heap» OR/AND
      ▶ Write a storage wrapper implementing an index-organized table.
  • 34. Thank you for your attention!