Learn how zheap works

HOW ZHEAP WORKS
REINVENTING POSTGRESQL STORAGE
BY HANS-JÜRGEN SCHÖNIG

ABOUT ME AND MY COMPANY
■ Who is the guy?
■ Who is CYBERTEC?

HANS-JÜRGEN SCHÖNIG
CEO & SENIOR DATABASE CONSULTANT
■ PostgreSQL since 1999
■ author of various database books
MAIL hs@cybertec.at
PHONE +43 2622 930 22-2
WEB www.cybertec-postgresql.com

DATABASE SERVICES
DATA Science
▪ Artificial Intelligence
▪ Machine Learning
▪ Big Data
▪ Business Intelligence
▪ Data Mining
▪ etc.
POSTGRESQL Services
▪ 24/7 Support
▪ Training
▪ Consulting
▪ Performance Tuning
▪ Clustering
▪ etc.

CLIENT SECTORS
▪ ICT
▪ University
▪ Government
▪ Automotive
▪ Industry
▪ Trade
▪ Finance
▪ etc.

AGENDA
■ traditional tables
■ table bloat and VACUUM
■ Why a new storage system?
■ zheap: the goal
■ zheap: basic architecture
■ zheap: transaction slots, etc.
■ performance impacts
■ roadmap

HEAP: STANDARD TABLES
■ Data structure looks as follows:


HEAP AND TRANSACTIONS
UPDATES AND VISIBILITY

MAIN ISSUE: TABLE BLOAT
test=# CREATE TABLE a (aid int) WITH (autovacuum_enabled = off);
CREATE TABLE
test=# INSERT INTO a SELECT * FROM generate_series(1, 1000000);
INSERT 0 1000000
test=# SELECT pg_size_pretty(pg_relation_size('a'));
pg_size_pretty
----------------
35 MB
(1 row)

test=# UPDATE a SET aid = aid + 1;
UPDATE 1000000
test=# SELECT pg_size_pretty(pg_relation_size('a'));
pg_size_pretty
----------------
69 MB
(1 row)

test=# VACUUM VERBOSE a;
INFO: vacuuming "public.a"
INFO: "a": removed 1000000 row versions in 4425 pages
INFO: "a": found 1000000 removable, 1000000 nonremovable row versions in 8850 out of 8850 pages
DETAIL: 0 dead row versions cannot be removed yet, oldest xmin: 539
...
VACUUM
test=# SELECT pg_size_pretty(pg_relation_size('a'));
pg_size_pretty
----------------
69 MB
(1 row)
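
The sizes in this session can be checked by hand: assuming PostgreSQL's default 8 kB block size, 4425 pages are about 35 MB and 8850 pages are about 69 MB. The Python sketch below is a toy model, not PostgreSQL code, of why VACUUM does not shrink the file here. The class `ToyHeap` and its methods are hypothetical names introduced for illustration; the only rules modelled are that an UPDATE writes a new row version into free space, and that VACUUM frees dead slots but can only give back completely empty pages at the end of the file.

```python
BLOCK_SIZE = 8192  # assumption: PostgreSQL's default block size in bytes


def size_mb(pages):
    """Relation size in MB (1 MB = 1024 * 1024 bytes), rounded to the nearest MB."""
    return round(pages * BLOCK_SIZE / (1024 * 1024))


class ToyHeap:
    """Toy heap: every page has a fixed number of slots (live, dead or free)."""

    def __init__(self, rows_per_page):
        self.rows_per_page = rows_per_page
        self.pages = []  # each page is a list of "live", "dead" or None (free)

    def _place(self):
        # Reuse the first free slot; extend the file only if there is none.
        for page in self.pages:
            if None in page:
                page[page.index(None)] = "live"
                return
        self.pages.append(["live"] + [None] * (self.rows_per_page - 1))

    def insert(self, count):
        for _ in range(count):
            self._place()

    def update_all(self):
        # Every UPDATE leaves the old version behind as dead and writes a new one.
        live = [(p, s) for p, page in enumerate(self.pages)
                for s, state in enumerate(page) if state == "live"]
        for p, s in live:
            self.pages[p][s] = "dead"
            self._place()

    def vacuum(self):
        # Dead slots become free again ...
        for page in self.pages:
            for s, state in enumerate(page):
                if state == "dead":
                    page[s] = None
        # ... but only empty pages at the END of the file can be cut off.
        while self.pages and all(state is None for state in self.pages[-1]):
            self.pages.pop()


heap = ToyHeap(rows_per_page=10)
heap.insert(200)      # 20 pages
heap.update_all()     # 40 pages: new versions are appended
heap.vacuum()         # still 40 pages: the live rows sit at the end of the file
```

In this model the freed space in the first half of the file is reusable by later writes, but the file itself stays at its doubled size, which mirrors the 69 MB reported after VACUUM above.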

ONE WORD ABOUT VACUUM
■ VACUUM is not always allowed to reclaim dead rows
■ A row must be REALLY dead (invisible to every open transaction) for VACUUM to do its job
■ Long transactions can be an enemy
→ Once you are in pain, it tends not to go away
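
The rule behind "REALLY dead" can be sketched in a few lines of Python. This is a simplified model, assuming plain integer transaction ids without wraparound; the function name `removable_rows` and its parameters are hypothetical and not part of PostgreSQL. A dead row version is only reclaimable if the transaction that deleted it is older than the oldest horizon (xmin) still held by any open transaction.

```python
def removable_rows(dead_rows, running_xmins, next_xid):
    """Return the ids of dead rows that VACUUM could reclaim in this toy model.

    dead_rows:     list of (row_id, deleting_xid); the deleter has committed
    running_xmins: xmin horizon of every open transaction or snapshot
    next_xid:      next transaction id to be assigned (used if nothing is open)
    """
    # With no open transactions, everything deleted so far is invisible to all.
    oldest_xmin = min(running_xmins, default=next_xid)
    return [row_id for row_id, deleting_xid in dead_rows
            if deleting_xid < oldest_xmin]


dead = [("a", 100), ("b", 200)]

# No open transactions: both row versions can go.
print(removable_rows(dead, running_xmins=[], next_xid=300))

# One long-running transaction with xmin 150: row "b" has to stay.
print(removable_rows(dead, running_xmins=[150], next_xid=300))
```

A single old transaction therefore holds back cleanup for every table it could still read from, which is why long transactions make bloat hard to get rid of.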

WAYS OUT
■ VACUUM FULL: needs a table lock
■ pg_squeeze:
  ▪ Shrinking tables with less locking
  ▪ Move between tablespaces
  ▪ Index organize tables
HINT: Try to avoid bloat in the first place!

ZHEAP: DESIGN GOALS
■ Perform UPDATE in place
■ Have smaller tables
  ▪ smaller tuple headers
  ▪ improved alignment
■ Reduce writes as much as possible
  ▪ avoid dirtying pages unless data is modified
  ▪ normal heaps dirty pages in some cases during reads (for example, when hint bits are set)
■ Reuse space more quickly
■ Get rid of VACUUM

ZHEAP: TUPLE HEADERS
■ Heap: 20+ bytes per row
■ Zheap: 5 bytes per row
How can this be achieved?
■ The tuple header controls “visibility”
■ “Normalize tuple header”
■ Move visibility info to the page level
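
To get a feeling for what the smaller header means, here is a rough back-of-the-envelope calculation in Python. It uses the two figures from this slide (20 bytes as a lower bound for heap, 5 bytes for zheap) and a hypothetical table of 10 million rows; alignment padding, line pointers and the page-level transaction slots are deliberately ignored, so treat the result as an estimate only.

```python
HEAP_HEADER_BYTES = 20   # lower bound from the slide ("20+ bytes")
ZHEAP_HEADER_BYTES = 5   # figure from the slide
ROWS = 10_000_000        # assumption: example table size


def header_overhead_mib(rows, header_bytes):
    """Total tuple header overhead in MiB for the given number of rows."""
    return rows * header_bytes / (1024 * 1024)


heap_mib = header_overhead_mib(ROWS, HEAP_HEADER_BYTES)
zheap_mib = header_overhead_mib(ROWS, ZHEAP_HEADER_BYTES)

# At least 15 bytes saved per row, i.e. roughly 143 MiB for 10 million rows.
print(f"heap:  {heap_mib:.0f} MiB of tuple headers")
print(f"zheap: {zheap_mib:.0f} MiB of tuple headers")
print(f"saved: {heap_mib - zheap_mib:.0f} MiB")
```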

ZHEAP: TRANSACTION SLOTS
Transaction slots hold transactional visibility

ZHEAP: TRANSACTION SLOTS
Each transaction slot:
■ takes 16 bytes of storage
■ contains the following information:
  ▪ transaction id
  ▪ epoch
  ▪ latest undo record pointer of that transaction
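
One way to picture such a slot is as a fixed 16-byte record. The Python sketch below packs the three fields with the `struct` module. The field widths (32-bit transaction id, 32-bit epoch, 64-bit undo record pointer) and the little-endian byte order are assumptions chosen so that the total comes out at 16 bytes; the slide only gives the total size and the field names, and the helper names `pack_slot` and `unpack_slot` are hypothetical.

```python
import struct

# Assumed layout: uint32 xid, uint32 epoch, uint64 undo record pointer.
SLOT_FORMAT = "<IIQ"
SLOT_SIZE = struct.calcsize(SLOT_FORMAT)  # 4 + 4 + 8 = 16 bytes


def pack_slot(xid, epoch, undo_ptr):
    """Serialize one transaction slot into its 16-byte form."""
    return struct.pack(SLOT_FORMAT, xid, epoch, undo_ptr)


def unpack_slot(raw):
    """Turn 16 raw bytes back into the three slot fields."""
    xid, epoch, undo_ptr = struct.unpack(SLOT_FORMAT, raw)
    return {"xid": xid, "epoch": epoch, "undo_ptr": undo_ptr}


# Example values are made up for illustration.
raw = pack_slot(xid=539, epoch=0, undo_ptr=0x3EC00000)
print(len(raw), unpack_slot(raw))
```

Because a row only has to point at a slot instead of carrying its own transaction ids, the visibility information is stored once per transaction and page rather than once per row.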
What if we need more slots?

ZHEAP: TPD PAGES
■ TPD: stores additional transaction slots if the 4 slots on a page are not enough
■ TPD pages are interleaved with normal pages
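
The overflow idea can be sketched as follows. This is a toy model in Python, assuming 4 slots per page as on the slide; the class `ToyPage`, its `tpd` list (a stand-in for a TPD page) and the method `allocate_slot` are hypothetical names. Reuse of slots that belong to finished transactions is left out to keep the sketch short.

```python
SLOTS_PER_PAGE = 4  # the "4" mentioned on the slide


class ToyPage:
    """Toy data page with a fixed slot array and an overflow area."""

    def __init__(self):
        self.slots = [None] * SLOTS_PER_PAGE  # one xid per slot, None = free
        self.tpd = []                         # overflow slots (stand-in for TPD)

    def allocate_slot(self, xid):
        """Return ("page", n) or ("tpd", n) for the slot used by this xid."""
        # A transaction that already owns a slot keeps using it.
        if xid in self.slots:
            return ("page", self.slots.index(xid))
        if xid in self.tpd:
            return ("tpd", self.tpd.index(xid))
        # Otherwise take a free slot on the page itself ...
        if None in self.slots:
            n = self.slots.index(None)
            self.slots[n] = xid
            return ("page", n)
        # ... and fall back to the overflow area when the page is full.
        self.tpd.append(xid)
        return ("tpd", len(self.tpd) - 1)


page = ToyPage()
for xid in (101, 102, 103, 104, 105):
    print(xid, page.allocate_slot(xid))
```

The fifth concurrent transaction touching the same page is the first one that ends up in the overflow area in this model.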

OPERATION: INSERT
■ Allocate a transaction slot
■ Emit an undo entry to fix things on error
■ Space can be reclaimed instantly after a ROLLBACK
→ The simplest operation

OPERATION: UPDATE
■ More complicated, there are two cases:
  ▪ The new row fits into the old space
  ▪ The new row does not fit into the old space

OPERATION: UPDATE FITS
■ If the row is shorter:
  ▪ We can overwrite it
  ▪ Emit undo record
In short: We hold the new row in zheap and a copy of the old row in undo so that we can copy it back to the old structure in case it is needed.

OPERATION: UPDATE DOESN’T FIT
■ This case is more expensive:
  ▪ DELETE old row
  ▪ INSERT new row in a different place
  ▪ Less efficient
Space can instantly be reclaimed in the following cases:
■ When updating a row to a shorter version
■ When non-inplace UPDATEs are performed
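
The two UPDATE paths can be put side by side in a small Python sketch. It is a toy model under stated assumptions: a page is a plain list of byte strings, the undo log is a list of tuples, and a new row "fits" when it is not longer than the old one (the slides only mention the shorter case). The function name `update_row` and the record layout are hypothetical.

```python
def update_row(page, undo_log, slot, new_value):
    """Update page[slot] and return which path was taken."""
    old_value = page[slot]

    if len(new_value) <= len(old_value):
        # Fits: remember the old row in undo, then overwrite in place.
        undo_log.append(("update", slot, old_value))
        page[slot] = new_value
        return "in-place"

    # Does not fit: DELETE the old row and INSERT the new one elsewhere.
    undo_log.append(("delete", slot, old_value))
    page[slot] = None
    page.append(new_value)
    undo_log.append(("insert", len(page) - 1))
    return "delete+insert"


page = [b"hello", b"world"]
undo_log = []

print(update_row(page, undo_log, 0, b"hi"))                 # new row is shorter
print(update_row(page, undo_log, 1, b"a much longer row"))  # new row is longer
print(page)
print(undo_log)
```

The second path writes two undo records and touches two places on the page, which is one way to see why it is the less efficient case.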

OPERATION: DELETE
■ How it works:
  ▪ Emit undo record
  ▪ DELETE row from zheap
Old row can be moved back into zheap during ROLLBACK.

ROLLBACK
■ In case a ROLLBACK happens:
  ▪ Undo has to make sure that the old state of the table is restored
  ▪ Old rows have to be copied back
  ▪ ROLLBACK takes longer than on a standard heap!
Undo itself can be removed in three cases:
■ as soon as no transaction can see the data anymore
■ as soon as all undo actions have been completed
■ for committed transactions, once their changes are all-visible
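
Why ROLLBACK has real work to do can be seen in a minimal Python sketch. It is a toy model, assuming a page is a dictionary from slot number to row and the undo log is a list of tuples; the function name `rollback` and the record kinds are hypothetical. Undo records are applied newest first until the log is empty.

```python
def rollback(page, undo_log):
    """Restore the page by applying all undo records in reverse order."""
    while undo_log:
        record = undo_log.pop()       # newest record first
        if record[0] == "insert":
            _, slot = record
            del page[slot]            # undo an INSERT: remove the new row
        else:
            _, slot, old_value = record
            page[slot] = old_value    # undo UPDATE/DELETE: copy the old row back


page = {1: "alice", 2: "bob"}
undo_log = []

# One transaction: UPDATE slot 1, DELETE slot 2, INSERT into slot 3.
undo_log.append(("update", 1, page[1]))
page[1] = "al"
undo_log.append(("delete", 2, page.pop(2)))
page[3] = "carol"
undo_log.append(("insert", 3))

rollback(page, undo_log)
print(page)      # the two original rows are back
print(undo_log)  # empty: all undo actions have been completed
```

Every change made by the transaction costs one step during ROLLBACK, so the runtime grows with the amount of data that was modified.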

UNDO WORKERS
■ Discarding the undo logs is performed by the discard worker
■ The undo launcher checks the rollback_hash_table periodically
■ It spawns new undo workers to perform the rollback
■ Each spawned undo worker processes the rollback requests for a particular database

PREPARING DATA
■ Creating some random data
test=# SET temp_buffers TO '1 GB';
SET
test=# CREATE TEMP TABLE raw AS
           SELECT id,
                  hashtext(id::text) AS name,
                  random() * 10000 AS n, true AS b
           FROM generate_series(1, 10000000) AS id;
SELECT 10000000

LOADING A HEAP
■ Populating a normal table
test=# \timing
Timing is on.
test=# CREATE TABLE h1 (LIKE raw) USING heap;
CREATE TABLE
Time: 7.836 ms
test=# INSERT INTO h1 SELECT * FROM raw;
INSERT 0 10000000
Time: 7495.798 ms (00:07.496)

LOADING A ZHEAP
■ Mind the runtime
test=# CREATE TABLE z1 (LIKE raw) USING zheap;
CREATE TABLE
Time: 8.045 ms
test=# INSERT INTO z1 SELECT * FROM raw;
INSERT 0 10000000
Time: 27947.516 ms (00:27.948)

ZHEAP IN ACTION
test=# BEGIN;
BEGIN
test=*# SELECT pg_size_pretty(pg_relation_size('z1'));
pg_size_pretty
----------------
251 MB
(1 row)
test=*# UPDATE z1 SET id = id + 1;
UPDATE 10000000
test=*# SELECT pg_size_pretty(pg_relation_size('z1'));
pg_size_pretty
----------------
251 MB
(1 row)

UNDO IN ACTION
[hs@hs-MS-7817 undo]$ pwd
/home/hs/db13/base/undo
[hs@hs-MS-7817 undo]$ ls -l | tail -n 10
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003EC00000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003ED00000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003EE00000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003EF00000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003F000000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003F100000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003F200000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003F300000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003F400000
-rw-------. 1 hs hs 1048576 Oct 8 12:08 000001.003F500000

WHAT WE ARE WORKING ON
■ agree on final design issues
■ fix bugs in current code
  ▪ large code base
  ▪ not easy to handle
■ preparing a patch to move “undo” to core
  ▪ “undo” is core infrastructure
We hope to bring this into core some day.

QUESTIONS?
Feel free to contact me!
MAIL hs@cybertec.at
PHONE +43 2622 930 22-2
TWITTER @postgresql_007
