Chicago PostgreSQL User Group - October 20, 2021 Jonathan S. Katz
Let's Build a Complex, Real-Time Data Management Application
• VP, Platform Engineering @ Crunchy Data
• Previously: Engineering Leadership @ Startups
• Longtime PostgreSQL community contributor
• Core Team Member
• Various Governance Committees
• Conference Organizer / Speaker
• @jkatz05
About Me
• Leading Team in Postgres – 10 contributors
• Certified Open Source PostgreSQL Distribution
• Leader in Postgres Technology for Kubernetes
• Crunchy Bridge: Fully managed cloud service
Crunchy Data
Your partner in deploying
open source PostgreSQL
throughout your enterprise.
CPSM Provider Plugin
This talk introduces many different tools and techniques available
in PostgreSQL for building applications.
It introduces different features and where to find out more
information.
We have a lot of material to cover in a short time; the slides and
demonstrations will be made available.
How to Approach This Talk
Imagine we are managing virtual rooms for an event platform.
We have a set of operating hours in which the rooms can be
booked.
Only one booking can occur in a virtual room at a given time.
The Problem
For Example
We need to know...
- All the rooms that are available to book
- When the rooms are available to be booked (operating hours)
- When the rooms have been booked
And...
The system needs to be able to CRUD fast
(Create, Read, Update, Delete. Fast).
Specifications
🤔
Interlude:
Finding Availability
Availability can be thought about in three ways:
Closed
Available
Unavailable (or "booked")
Our ultimate "calendar tuple" is (room, status, range)
Managing Availability
PostgreSQL 9.2 introduced "range types" that included the ability to store and
efficiently search over ranges of data.
Built-in:
Date, Timestamps
Integer, Numeric
Lookups (e.g. overlaps) can be sped up using GiST indexes
Postgres Range Types
SELECT tstzrange('2021-10-28 09:30'::timestamptz, '2021-10-28 10:30'::timestamptz);
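To make the overlap semantics concrete, here is a minimal Python sketch. It is not part of the talk's demo code. It assumes PostgreSQL's default range bounds (inclusive lower, exclusive upper), and `TimeRange` and `is_double_booked` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List


@dataclass(frozen=True)
class TimeRange:
    """Half-open interval [lower, upper), like a default tstzrange."""
    lower: datetime
    upper: datetime

    def overlaps(self, other: "TimeRange") -> bool:
        # Two half-open ranges overlap when each starts before the other ends.
        return self.lower < other.upper and other.lower < self.upper


def is_double_booked(existing: List[TimeRange], candidate: TimeRange) -> bool:
    """Return True if the candidate overlaps any existing booking."""
    return any(candidate.overlaps(booking) for booking in existing)


booked = [TimeRange(datetime(2021, 10, 28, 9, 30), datetime(2021, 10, 28, 10, 30))]
# Starts exactly when the existing booking ends, so the ranges do not overlap.
back_to_back = TimeRange(datetime(2021, 10, 28, 10, 30), datetime(2021, 10, 28, 11, 0))
clash = TimeRange(datetime(2021, 10, 28, 10, 0), datetime(2021, 10, 28, 11, 0))
print(is_double_booked(booked, back_to_back))
print(is_double_booked(booked, clash))
```

Because the upper bound is exclusive, back-to-back bookings do not collide, which is why half-open ranges suit calendars well.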
Availability
Availability
SELECT *
FROM (
    VALUES
        ('closed', tstzrange('2021-10-28 0:00', '2021-10-28 8:00')),
        ('available', tstzrange('2021-10-28 08:00', '2021-10-28 09:30')),
        ('unavailable', tstzrange('2021-10-28 09:30', '2021-10-28 10:30')),
        ('available', tstzrange('2021-10-28 10:30', '2021-10-28 16:30')),
        ('unavailable', tstzrange('2021-10-28 16:30', '2021-10-28 18:30')),
        ('available', tstzrange('2021-10-28 18:30', '2021-10-28 20:00')),
        ('closed', tstzrange('2021-10-28 20:00', '2021-10-29 0:00'))
) x(status, calendar_range)
ORDER BY lower(x.calendar_range);
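The rows above can be derived rather than typed by hand. The following Python sketch shows one way to compute a day's calendar from the operating hours and the bookings. It is an illustration only, not the demo's implementation: `build_calendar` and `at` are hypothetical helpers, time zones are ignored, and bookings are assumed to be non-overlapping and inside the operating hours.

```python
from datetime import datetime


def build_calendar(day_start, day_end, open_range, bookings):
    """Return ordered (status, lower, upper) tuples covering the whole day.

    open_range and each booking are (lower, upper) pairs treated as
    half-open intervals.
    """
    calendar = []
    open_lower, open_upper = open_range
    if day_start < open_lower:
        calendar.append(("closed", day_start, open_lower))
    position = open_lower
    for lower, upper in sorted(bookings):
        if position < lower:
            # Gap before this booking is still bookable.
            calendar.append(("available", position, lower))
        calendar.append(("unavailable", lower, upper))
        position = upper
    if position < open_upper:
        calendar.append(("available", position, open_upper))
    if open_upper < day_end:
        calendar.append(("closed", open_upper, day_end))
    return calendar


def at(hour, minute=0, day=28):
    """Shorthand for a timestamp in October 2021."""
    return datetime(2021, 10, day, hour, minute)


calendar = build_calendar(
    day_start=at(0),
    day_end=at(0, day=29),
    open_range=(at(8), at(20)),
    bookings=[(at(16, 30), at(18, 30)), (at(9, 30), at(10, 30))],
)
for status, lower, upper in calendar:
    print(status, lower, upper)
```

With operating hours of 08:00 to 20:00 and the two bookings shown, this yields the same seven (status, range) rows as the query.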
Easy, Right?
Inserting new ranges and dividing them up
PostgreSQL did not work well with noncontiguous ranges…until PostgreSQL 14
Availability
Just for one day - what about other days?
What happens with data in the past?
What happens with data in the future?
Unavailability
Ensure no double-bookings
Overlapping Events?
Handling multiple spaces
But…
Managing Availability
availability_rule
    id <serial> PRIMARY KEY
    room_id <int> REFERENCES (room)
    days_of_week <int[]>
    start_time <time>
    end_time <time>
    generate_weeks_into_future <int> DEFAULT 52
room
    id <serial> PRIMARY KEY
    name <text>
availability
    id <serial> PRIMARY KEY
    room_id <int> REFERENCES (room)
    availability_rule_id <int> REFERENCES (availability_rule)
    available_date <date>
    available_range <tstzrange>
unavailability
    id <serial> PRIMARY KEY
    room_id <int> REFERENCES (room)
    unavailable_date <date>
    unavailable_range <tstzrange>
calendar
    id <serial> PRIMARY KEY
    room_id <int> REFERENCES (room)
    status <text> DOMAIN: {available, unavailable, closed}
    calendar_date <date>
    calendar_range <tstzrange>
We can now store data, but what about:
Generating initial calendar?
Generating availability based on rules?
Generating unavailability?
Sounds like we need to build an application
Managing Availability
To build our application, there are a few topics we will need to explore first:
generate_series
Recursive queries
Ranges and Multiranges
SQL Functions
Set returning functions
PL/pgsql
Triggers
Managing Availability
generate_series is a "set returning" function, i.e. a function that can return
multiple rows of data.
generate_series can return:
A set of numbers (int, bigint, numeric) incremented by 1 or by another
step value
A set of timestamps incremented by a time interval(!!)
generate_series:
More Than Just For Test Data
SELECT x::date
FROM generate_series(
    '2021-01-01'::date, '2021-12-31'::date, '1 day'::interval
) x;
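The same "generate a series of days" idea is what turns an availability rule into concrete ranges. As a rough Python sketch (not the demo's SQL), the hypothetical `expand_rule` below walks a date series and keeps the days named in `days_of_week`. The ISO day numbering (1 = Monday) and the naive, time-zone-free timestamps are assumptions of this sketch.

```python
from datetime import date, datetime, time, timedelta


def expand_rule(first_day, days_of_week, start_time, end_time, weeks):
    """Yield (available_date, start, end) for each matching day.

    days_of_week uses ISO numbering (1 = Monday ... 7 = Sunday); that
    numbering is an assumption of this sketch.
    """
    for offset in range(weeks * 7):
        day = first_day + timedelta(days=offset)
        if day.isoweekday() in days_of_week:
            yield (
                day,
                datetime.combine(day, start_time),
                datetime.combine(day, end_time),
            )


# Hypothetical rule: open Monday to Friday, 08:00 to 20:00, two weeks ahead.
slots = list(
    expand_rule(date(2021, 10, 18), {1, 2, 3, 4, 5}, time(8), time(20), weeks=2)
)
print(len(slots))
print(slots[0][0], slots[-1][0])
```

Starting on Monday 2021-10-18, two weeks of a weekday-only rule produce ten slots, one per working day.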
PostgreSQL 8.4 introduced the "WITH" syntax and with it also introduced the
ability to perform recursive queries
WITH RECURSIVE ... AS ()
Base case vs. recursive case
UNION vs. UNION ALL
CAN HIT INFINITE LOOPS
Recursion in SQL?
Recursion in SQL?
WITH RECURSIVE fac AS (
    SELECT
        1::numeric AS n,
        1::numeric AS i
    UNION
    SELECT
        fac.n * (fac.i + 1),
        fac.i + 1 AS i
    FROM fac
)
SELECT fac.n, fac.i
FROM fac;
Infinite Recursion
Recursion in SQL?
WITH RECURSIVE fac AS (
    SELECT
        1::numeric AS n,
        1::numeric AS i
    UNION
    SELECT
        fac.n * (fac.i + 1),
        fac.i + 1 AS i
    FROM fac
)
SELECT fac.n, fac.i
FROM fac
LIMIT 100;
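Despite the name, a recursive query is evaluated iteratively: the base case seeds a working set, and the recursive case is applied to that set until it produces no new rows. The Python sketch below models that loop to show why an unbounded recursive case never stops on its own. `evaluate_recursive` is a hypothetical helper and a simplification of what the database actually does.

```python
def evaluate_recursive(base_rows, recursive_step, limit=None):
    """Evaluate a UNION-style recursive query iteratively.

    base_rows: rows produced by the base case.
    recursive_step: function mapping one row to a list of new rows.
    limit: optional cap on returned rows, standing in for LIMIT.
    """
    seen = set()
    result = []
    working = []
    for row in base_rows:
        if row not in seen:
            seen.add(row)
            result.append(row)
            working.append(row)
    while working:
        if limit is not None and len(result) >= limit:
            return result[:limit]
        next_working = []
        for row in working:
            for new_row in recursive_step(row):
                # UNION discards rows that were already produced.
                if new_row not in seen:
                    seen.add(new_row)
                    result.append(new_row)
                    next_working.append(new_row)
        working = next_working
    return result if limit is None else result[:limit]


def bounded_step(max_i):
    """Recursive case with a bound, like a WHERE clause on i."""
    def step(row):
        n, i = row
        return [(n * (i + 1), i + 1)] if i + 1 <= max_i else []
    return step


def unbounded_step(row):
    """Recursive case with no bound: always produces one more row."""
    n, i = row
    return [(n * (i + 1), i + 1)]


bounded = evaluate_recursive([(1, 1)], bounded_step(5))
limited = evaluate_recursive([(1, 1)], unbounded_step, limit=3)
print(bounded)
print(limited)
```

The bounded version stops when the recursive case returns nothing; the unbounded version only stops because the caller imposes a limit, which is the role `LIMIT 100` plays in the query.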
Postgres 14 introduces multirange types
Ordered list of ranges
Can be noncontiguous
Adds the range_agg aggregate (ranges into a multirange) and unnest (multirange back into ranges)
Multirange Types
SELECT
    datemultirange(
        daterange(CURRENT_DATE, CURRENT_DATE + 1),
        daterange(CURRENT_DATE + 5, CURRENT_DATE + 8),
        daterange(CURRENT_DATE + 15, CURRENT_DATE + 22)
    );
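A multirange is kept as an ordered list of non-overlapping ranges, so aggregating ranges into one means merging any that overlap or touch. The Python sketch below illustrates that normalization on half-open date ranges. `merge_ranges` is a hypothetical helper for illustration, not how PostgreSQL implements it.

```python
from datetime import date


def merge_ranges(ranges):
    """Union half-open (lower, upper) ranges into a sorted list, keeping gaps."""
    merged = []
    for lower, upper in sorted(ranges):
        if merged and lower <= merged[-1][1]:
            # Overlapping or adjacent: extend the previous range.
            merged[-1] = (merged[-1][0], max(merged[-1][1], upper))
        else:
            merged.append((lower, upper))
    return merged


ranges = [
    (date(2021, 10, 25), date(2021, 10, 28)),
    (date(2021, 10, 20), date(2021, 10, 21)),
    (date(2021, 10, 27), date(2021, 10, 30)),
    (date(2021, 10, 30), date(2021, 11, 2)),
]
merged = merge_ranges(ranges)
print(merged)
```

The three ranges from October 25 onward collapse into one, while the gap after October 21 is preserved, which is what "noncontiguous" means here.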
PostgreSQL provides the ability to write functions to help encapsulate
repeated behavior
PostgreSQL 11 introduced stored procedures, which enable you to
embed transaction control! PostgreSQL 14 adds the ability to get output from
stored procedures (OUT parameters)!
SQL functions have many properties, including:
Input / output
Volatility (IMMUTABLE, STABLE, VOLATILE) (default VOLATILE)
Parallel safety (default PARALLEL UNSAFE)
LEAKPROOF; SECURITY DEFINER
Execution Cost
Language type (more on this later)
Functions
Functions
CREATE OR REPLACE FUNCTION chipug_fac(n int)
RETURNS numeric
AS $$
    WITH RECURSIVE fac AS (
        SELECT
            1::numeric AS n,
            1::numeric AS i
        UNION
        SELECT
            fac.n * (fac.i + 1),
            fac.i + 1 AS i
        FROM fac
        WHERE i + 1 <= $1
    )
    SELECT max(fac.n)
    FROM fac;
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;
Functions
CREATE OR REPLACE FUNCTION chipug_fac_set(n int)
RETURNS SETOF numeric
AS $$
    WITH RECURSIVE fac AS (
        SELECT
            1::numeric AS n,
            1::numeric AS i
        UNION
        SELECT
            fac.n * (fac.i + 1),
            fac.i + 1 AS i
        FROM fac
        WHERE i + 1 <= $1
    )
    SELECT fac.n
    FROM fac
    ORDER BY fac.n;
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;
Functions
CREATE OR REPLACE FUNCTION chipug_fac_table(n int)
RETURNS TABLE(n numeric)
AS $$
    WITH RECURSIVE fac AS (
        SELECT
            1::numeric AS n,
            1::numeric AS i
        UNION
        SELECT
            fac.n * (fac.i + 1),
            fac.i + 1 AS i
        FROM fac
        WHERE i + 1 <= $1
    )
    SELECT fac.n
    FROM fac
    ORDER BY fac.n;
$$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;
PostgreSQL has the ability to load in procedural languages ("PL") and execute
code in them beyond SQL.
Built-in: pgSQL, Python, Perl, Tcl
Others: Javascript, R, Java, C, JVM, Container, LOLCODE, Ruby, PHP, Lua,
pgPSM, Scheme
Procedural Languages
PL/pgSQL
CREATE EXTENSION IF NOT EXISTS plpgsql;
CREATE OR REPLACE FUNCTION chipug_fac_plpgsql(n int)
RETURNS numeric
AS $$
DECLARE
    fac numeric;
    i int;
BEGIN
    fac := 1;
    FOR i IN 1..n LOOP
        fac := fac * i;
    END LOOP;
    RETURN fac;
END;
$$ LANGUAGE plpgsql IMMUTABLE PARALLEL SAFE;
Triggers are functions that can be called before, after, or instead of an operation or event
Data changes (INSERT/UPDATE/DELETE)
Events (DDL, DCL, etc. changes)
Atomic: a trigger runs in the same transaction as the statement that fired it
Must be declared as returning "trigger" or "event_trigger"
(Return NULL from a BEFORE row-level trigger if you want to skip the operation for that row)
(Gotcha: OLD is not set on INSERT and NEW is not set on DELETE, so avoid RETURN OLD on INSERT and RETURN NEW on DELETE)
Execute once per modified row or once per SQL statement
Multiple triggers on the same event execute in alphabetical order by trigger name
Can be written in any procedural language that defines a trigger interface
Triggers
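Two of these rules, name ordering and "return NULL to skip", are easier to see in motion. The Python sketch below is a toy simulation of BEFORE row-level triggers, not PostgreSQL code: `fire_before_row_triggers` and the two trigger functions are hypothetical, and `None` stands in for NULL.

```python
def fire_before_row_triggers(triggers, new_row):
    """Run BEFORE ROW triggers in name order; None means "skip this row".

    triggers maps a trigger name to a function that takes a row dict and
    returns a row dict or None. Each trigger receives the row returned by
    the previous one.
    """
    row = dict(new_row)
    for name in sorted(triggers):
        row = triggers[name](row)
        if row is None:
            # The operation is skipped and later triggers do not fire.
            return None
    return row


def normalize_name(row):
    """Trim whitespace from the room name."""
    return {**row, "name": row["name"].strip()}


def reject_blank_name(row):
    """Skip the operation when the name is empty."""
    return row if row["name"] else None


# Names are chosen so that normalization sorts before the blank check.
triggers = {"b_reject_blank": reject_blank_name, "a_normalize": normalize_name}
kept = fire_before_row_triggers(triggers, {"name": "  Room 1 "})
skipped = fire_before_row_triggers(triggers, {"name": "   "})
print(kept)
print(skipped)
```

Because execution order follows the trigger name, prefixing names (as with `a_` and `b_` here) is a common way to make the order explicit.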
Building a
Synchronized System
We'll Scan the Code
It's Available for Download 😉
The Test
[Test your live demos before running them, and you will have much
success!]
availability_rule inserts took some time, > 350ms
availability: INSERT 52
calendar: INSERT 52 from nontrivial function
Updates on individual availability / unavailability are not too painful
Lookups are faaaaaaaast
Lessons of the Test
How About At (Web) Scale?
Recursive CTE 😢
Even with only 100 more rooms, each with a few sets of rules, rule
generation time increased significantly
Multirange Types
These are still pretty fast and are handling scaling up well.
May still be slow for a web transaction.
Lookups are still lightning fast!
Web Scale
Added in PostgreSQL 9.4
Replays all logical changes made to the database
Create a logical replication slot in your database
Only one receiver can consume changes from one slot at a time
Slot keeps track of last change that was read by a receiver
If receiver disconnects, slot will ensure database holds changes until
receiver reconnects
By default, the old row of an UPDATE or DELETE is identified by the table's primary key
Without a primary key, you can set a "REPLICA IDENTITY" (available since PostgreSQL 9.4)
using an index on UNIQUE, NOT NULL, non-deferrable, non-partial column(s)
Basis for Logical Replication
Logical Decoding
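The slot behaviour described above (retain changes until the receiver confirms them) can be modelled in a few lines of Python. This is a toy model for intuition only: `ReplicationSlotModel` is hypothetical, positions are plain integers rather than real LSNs, and nothing here talks to PostgreSQL.

```python
class ReplicationSlotModel:
    """Toy model of how a slot retains changes until they are confirmed."""

    def __init__(self):
        self.retained = []      # (position, change) pairs not yet confirmed
        self.next_position = 1
        self.confirmed = 0

    def write(self, change):
        """Record a change; it stays retained until a receiver confirms it."""
        self.retained.append((self.next_position, change))
        self.next_position += 1

    def read(self):
        """Return every change after the last confirmed position."""
        return list(self.retained)

    def confirm(self, position):
        """Acknowledge everything up to position so it can be discarded."""
        self.confirmed = max(self.confirmed, position)
        self.retained = [
            (pos, change) for pos, change in self.retained if pos > self.confirmed
        ]


slot = ReplicationSlotModel()
for change in ["insert room 1", "insert room 2", "update room 1"]:
    slot.write(change)

# The receiver processes and confirms only the first change...
first_position, _ = slot.read()[0]
slot.confirm(first_position)

# ...then disconnects while the database keeps changing.
slot.write("delete room 2")
print(len(slot.retained))
print(slot.read()[0])
```

Everything after the confirmed position is still there when the receiver reconnects, which is also why an idle or slow receiver makes the retained backlog grow.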
A logical replication slot has a name and an output plugin
PostgreSQL comes with the "test_decoding" output plugin
Have to write a custom parser to read changes from the test_decoding output plugin
Several output plugins and libraries available
wal2json: https://github.com/eulerto/wal2json
jsoncdc: https://github.com/instructure/jsoncdc
Debezium: http://debezium.io/
(Test: https://www.postgresql.org/docs/current/static/test-decoding.html)
Logical Replication (pgoutput)
Every data change in the database is streamed
Need to be aware of the logical decoding format
Logical Decoding Out of the Box
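As an example of being aware of the decoding format, the Python sketch below parses a payload shaped like wal2json's format version 1 output. The payload is made up for illustration and `change_to_params` is a hypothetical helper; check the plugin's documentation for the exact fields your version emits.

```python
import json

# Illustrative payload in the shape of wal2json format version 1; the
# table and values are invented for this sketch.
SAMPLE = json.dumps({
    "change": [
        {
            "kind": "insert",
            "schema": "public",
            "table": "room",
            "columnnames": ["id", "name"],
            "columntypes": ["integer", "text"],
            "columnvalues": [1, "Room 1"],
        },
        {
            "kind": "delete",
            "schema": "public",
            "table": "room",
            "oldkeys": {
                "keynames": ["id"],
                "keytypes": ["integer"],
                "keyvalues": [1],
            },
        },
    ]
})


def change_to_params(change):
    """Flatten one change into (kind, table, params)."""
    if "columnnames" in change:
        params = dict(zip(change["columnnames"], change["columnvalues"]))
    else:
        # Deletes carry only the old key values, not the full row.
        old = change.get("oldkeys", {})
        params = dict(zip(old.get("keynames", []), old.get("keyvalues", [])))
    return change["kind"], change["table"], params


parsed = [change_to_params(c) for c in json.loads(SAMPLE)["change"]]
print(parsed)
```

Note that a delete has no `columnnames`, so code that assumes every change carries them needs a fallback like the one shown.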
C: libpq
pg_recvlogical
PostgreSQL functions
Python: psycopg2 - version 2.7
JDBC: version 42
Go: pgx
JavaScript: node-postgres (pg-logical-replication)
Driver Support
Using Logical Decoding
We know it takes time to regenerate calendar
Want to ensure changes always propagate but want to ensure all users
(managers, calendar searchers) have good experience
Thoughts🤔
Will use the same data model as before as well as the same helper
functions, but without the triggers
We will have a Python script that reads from a logical replication
slot and if it detects a relevant change, take an action
Similar to what we did with triggers, but this moves the work to
OUTSIDE the transaction
BUT...we can confirm whether or not the work is completed, thus if
the program fails, we can restart from last acknowledged
transaction ID
Replacing Triggers
Reviewing the Code
A consumer of the logical stream can only read one change at a time
If our processing of a change takes a lot of time, it will create a backlog
of changes
Backlog means the PostgreSQL server needs to retain more WAL logs
Retaining too many WAL logs can lead to running out of disk space
Running out of disk space can lead to...rough times.
The Consumer Bottleneck
🌤
🌥
☁
🌩
Eliminating the Bottleneck
Can use a durable message queueing system to store any WAL changes
that need post-processing
Ensure the changes are worked on in order
"Divide-and-conquer" workload - have multiple workers acting on
different "topics"
Remove WAL bloat
Shifting the Workload
Durable message processing and distribution system
Streams
Supports parallelization of consumers
Multiple consumers, partitions
Highly-available, distributed architecture
Acknowledgement of receiving, processing messages; can replay (sounds like
WAL?)
Can also accomplish this with Debezium, which interfaces with Kafka +
Postgres
Apache Kafka
Architecture
WAL Consumer
import json
import sys

from kafka import KafkaProducer
from kafka.errors import KafkaError
import psycopg2
import psycopg2.extras

# tables whose changes are forwarded to Kafka
TABLES = set([
    'availability', 'availability_rule', 'room', 'unavailability',
])


class WALConsumer(object):
    def __init__(self):
        # replication connection used to read from the logical slot
        self.connection = psycopg2.connect(
            "dbname=realtime",
            connection_factory=psycopg2.extras.LogicalReplicationConnection,
        )
        self.producer = KafkaProducer(
            bootstrap_servers=['localhost:9092'],
            value_serializer=lambda m: json.dumps(m).encode('ascii'),
        )

    def __call__(self, msg):
        payload = json.loads(msg.payload, strict=False)
        print(payload)
        # determine if the payload should be passed on to a consumer
        # listening to the Kafka queue
        for data in payload['change']:
            if data.get('table') in TABLES:
                self.producer.send(data.get('table'), data)
        # ensure everything is sent; call flush at this point
        self.producer.flush()
        # acknowledge that the change has been read - tells PostgreSQL to
        # stop holding onto this log file
        msg.cursor.send_feedback(flush_lsn=msg.data_start)


reader = WALConsumer()
cursor = reader.connection.cursor()
cursor.start_replication(slot_name='schedule', decode=True)
try:
    cursor.consume_stream(reader)
except KeyboardInterrupt:
    print("Stopping reader...")
finally:
    cursor.close()
    reader.connection.close()
    print("Exiting reader")
Kafka Consumer
import json

from kafka import KafkaConsumer
from kafka.structs import OffsetAndMetadata, TopicPartition
import psycopg2


class Worker(object):
    """Base class to perform any post-processing on changes"""
    OPERATIONS = set([])  # override with "insert", "update", "delete"

    def __init__(self, topic):
        # connect to the PostgreSQL database
        self.connection = psycopg2.connect("dbname=realtime")
        # connect to Kafka
        self.consumer = KafkaConsumer(
            bootstrap_servers=['localhost:9092'],
            value_deserializer=lambda m: json.loads(m.decode('utf8')),
            auto_offset_reset="earliest",
            group_id='1')
        # subscribe to the topic(s)
        self.consumer.subscribe(topic if isinstance(topic, list) else [topic])

    def run(self):
        """Function that runs ad infinitum"""
        # loop through the payloads from the consumer
        # determine if there are any follow-up actions based on the kind of
        # operation, and if so, act upon it
        # always commit when done.
        for msg in self.consumer:
            print(msg)
            # load the data from the message
            data = msg.value
            # determine if there are any follow-up operations to perform
            if data['kind'] in self.OPERATIONS:
                # open up a cursor for interacting with PostgreSQL
                cursor = self.connection.cursor()
                # put the parameters in an easy to digest format
                params = dict(zip(data['columnnames'], data['columnvalues']))
                # call the function named after the kind of operation
                getattr(self, data['kind'])(cursor, params)
                # commit any work that has been done, and close the cursor
                self.connection.commit()
                cursor.close()
            # acknowledge the message has been handled; Kafka expects the
            # committed offset to be the next offset to read
            tp = TopicPartition(msg.topic, msg.partition)
            offsets = {tp: OffsetAndMetadata(msg.offset + 1, None)}
            self.consumer.commit(offsets=offsets)

    # override with the appropriate post-processing code
    def insert(self, cursor, params):
        """Override with any post-processing to be done on an ``INSERT``"""
        raise NotImplementedError()

    def update(self, cursor, params):
        """Override with any post-processing to be done on an ``UPDATE``"""
        raise NotImplementedError()

    def delete(self, cursor, params):
        """Override with any post-processing to be done on a ``DELETE``"""
        raise NotImplementedError()
Testing the Application
Logical decoding allows the bulk inserts to occur significantly faster from a
transactional view
Potential bottleneck for long running execution, but bottlenecks are isolated to
specific queues
Newer versions of PostgreSQL have features that make it easier to build
and scale applications
Lessons
PostgreSQL is robust.
Triggers will keep your data in sync but can have significant
performance overhead
Utilizing a logical replication slot can eliminate trigger overhead
and transfer the computational load elsewhere
Not a panacea: still need to use good architectural patterns!
Conclusion
Thank You
jonathan.katz@crunchydata.com
@jkatz05
https://github.com/CrunchyData/postgres-realtime-demo

Build a Complex, Realtime Data Management App with Postgres 14!

  • 1. Chicago PostgreSQL User Group - October 20, 2021 Jonathan S. Katz Let's Build a Complex, Real- Time Data Management Application
  • 2. • VP, Platform Engineering @ Crunchy Data • Previously: Engineering Leadership @ Startups • Longtime PostgreSQL community contributor • Core Team Member • Various Governance Committees • Conference Organizer / Speaker • @jkatz05 About Me
  • 3. • Leading Team in Postgres – 10 contributors • Certified Open Source PostgreSQL Distribution • Leader in Postgres Technology for Kubernetes • Crunchy Bridge: Fully managed cloud service Crunchy Data Your partner in deploying open source PostgreSQL throughout your enterprise.
  • 4. CPSM Provider Plugin This talk introduces many different tools and techniques available in PostgreSQL for building applications. It introduces different features and where to find out more information. We have a lot of material to cover in a short time - the slides and demonstrations will be made available How to Approach This Talk
  • 5. CPSM Provider Plugin Imagine we are managing virtual rooms for an event platform. We have a set of operating hours in which the rooms can be booked. Only one booking can occur in a virtual room at a given time. The Problem
  • 7. CPSM Provider Plugin We need to know... - All the rooms that are available to book - When the rooms are available to be booked (operating hours) - When the rooms have been booked And... The system needs to be able to CRUD fast (Create, Read, Update, Delete. Fast). Specifications
  • 10. CPSM Provider Plugin Availability can be thought about in three ways: Closed Available Unavailable (or "booked") Our ultimate "calendar tuple" is (room, status, range) Managing Availability
  • 11. CPSM Provider Plugin PostgreSQL 9.2 introduced "range types" that included the ability to store and efficiently search over ranges of data. Built-in: Date, Timestamps Integer, Numeric Lookups (e.g. overlaps) can be sped up using GiST indexes Postgres Range Types SELECT tstzrange('2021-10-28 09:30'::timestamptz, '2021-10-28 10:30'::timestamptz);
  • 13. Availability SELECT * FROM ( VALUES ('closed', tstzrange('2021-10-28 0:00', '2021-10-28 8:00')), ('available', tstzrange('2021-10-28 08:00', '2021-10-28 09:30')), ('unavailable', tstzrange('2021-10-28 09:30', '2021-10-28 10:30')), ('available', tstzrange('2021-10-28 10:30', '2021-10-28 16:30')), ('unavailable', tstzrange('2021-10-28 16:30', '2021-10-28 18:30')), ('available', tstzrange('2021-10-28 18:30', '2021-10-28 20:00')), ('closed', tstzrange('2021-10-28 20:00', '2021-10-29 0:00')) ) x(status, calendar_range) ORDER BY lower(x.calendar_range);
  • 15. CPSM Provider Plugin Insert new ranges and dividing them up PostgreSQL did not work well with noncontiguous ranges…until PostgreSQL 14 Availability Just for one day - what about other days? What happens with data in the past? What happens with data in the future? Unavailability Ensure no double-bookings Overlapping Events? Handling multiple spaces But…
  • 16. Managing Availability availability_rule id <serial> PRIMARY KEY room_id <int> REFERENCES (room) days_of_week <int[]> start_time <time> end_time <time> generate_weeks_into_future <int> DEFAULT 52 room id <serial> PRIMARY KEY name <text> availability id <serial> PRIMARY KEY room_id <int> REFERENCES (room) availability_rule_id <int> REFERENCES (availabilityrule) available_date <date> available_range <tstzrange> unavailability id <serial> PRIMARY KEY room_id <int> REFERENCES (room) unavailable_date <date> unavailable_range <tstzrange> calendar id <serial> PRIMARY KEY room_id <int> REFERENCES (room) status <text> DOMAIN: {available, unavailable, closed} calendar_date <date> calendar_range <tstzrange>
  • 17. CPSM Provider Plugin We can now store data, but what about: Generating initial calendar? Generating availability based on rules? Generating unavailability? Sounds like we need to build an application Managing Availability
  • 18. CPSM Provider Plugin To build our application, there are a few topics we will need to explore first: generate_series Recursive queries Ranges and Multiranges SQL Functions Set returning functions PL/pgsql Triggers Managing Availability
  • 19. CPSM Provider Plugin Generate series is a "set returning" function, i.e. a function that can return multiple rows of data. Generate series can return: A set of numbers (int, bigint, numeric) either incremented by 1 or some other integer interval A set of timestamps incremented by a time interval(!!) generate_series: More Than Just For Test Data SELECT x::date FROM generate_series( '2021-01-01'::date, '2021-12-31'::date, '1 day'::interval ) x;
  • 20. CPSM Provider Plugin PostgreSQL 8.4 introduced the "WITH" syntax and with it also introduced the ability to perform recursive queries WITH RECURSIVE ... AS () Base case vs. recursive case UNION vs. UNION ALL CAN HIT INFINITE LOOPS Recursion in SQL?
  • 21. CPSM Provider Plugin Recursion in SQL? WITH RECURSIVE fac AS ( SELECT 1::numeric AS n, 1::numeric AS i UNION SELECT fac.n * (fac.i + 1), fac.i + 1 AS i FROM fac ) SELECT fac.n, fac.i FROM fac; Infinite Recursion
  • 22. CPSM Provider Plugin Recursion in SQL? WITH RECURSIVE fac AS ( SELECT 1::numeric AS n, 1::numeric AS i UNION SELECT fac.n * (fac.i + 1), fac.i + 1 AS i FROM fac ) SELECT fac.n, fac.i FROM fac LIMIT 100;
  • 23. Postgres 14 introduces multirange types Ordered list of ranges Can be noncontiguous Adds range aggregates: range_agg and unnest Multirange Types SELECT datemultirange( daterange(CURRENT_DATE, CURRENT_DATE + 1), daterange(CURRENT_DATE + 5, CURRENT_DATE + 8), daterange(CURRENT_DATE + 15, CURRENT_DATE + 22) );
  • 24. CPSM Provider Plugin PostgreSQL provides the ability to write functions to help encapsulate repeated behavior PostgreSQL 11 introduces stored procedures which enables you to embed transactions! PostgreSQL 14 adds the ability to get output from stored procedures! SQL functions have many properties, including: Input / output Volatility (IMMUTABLE, STABLE, VOLATILE) (default VOLATILE) Parallel safety (default PARALLEL UNSAFE) LEAKPROOF; SECURITY DEFINER Execution Cost Language type (more on this later) Functions
  • 25. CPSM Provider Plugin Functions CREATE OR REPLACE FUNCTION chipug_fac(n int) RETURNS numeric AS $$ WITH RECURSIVE fac AS ( SELECT 1::numeric AS n, 1::numeric AS i UNION SELECT fac.n * (fac.i + 1), fac.i + 1 AS i FROM fac WHERE i + 1 <= $1 ) SELECT max(fac.n) FROM fac; $$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;
  • 26. CPSM Provider Plugin Functions CREATE OR REPLACE FUNCTION chipug_fac_set(n int) RETURNS SETOF numeric AS $$ WITH RECURSIVE fac AS ( SELECT 1::numeric AS n, 1::numeric AS i UNION SELECT fac.n * (fac.i + 1), fac.i + 1 AS i FROM fac WHERE i + 1 <= $1 ) SELECT fac.n FROM fac ORDER BY fac.n; $$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;
  • 27. Functions
        CREATE OR REPLACE FUNCTION chipug_fac_table(n int)
        RETURNS TABLE(n numeric)
        AS $$
            WITH RECURSIVE fac AS (
                SELECT 1::numeric AS n, 1::numeric AS i
                UNION
                SELECT fac.n * (fac.i + 1), fac.i + 1 AS i
                FROM fac
                WHERE i + 1 <= $1
            )
            SELECT fac.n FROM fac ORDER BY fac.n;
        $$ LANGUAGE SQL IMMUTABLE PARALLEL SAFE;
  • 28. Procedural Languages
      - PostgreSQL has the ability to load in procedural languages ("PL") and execute code in them beyond SQL.
      - Built-in: pgSQL, Python, Perl, Tcl
      - Others: Javascript, R, Java, C, JVM, Container, LOLCODE, Ruby, PHP, Lua, pgPSM, Scheme
  • 29. PL/pgSQL
        CREATE EXTENSION IF NOT EXISTS plpgsql;

        CREATE OR REPLACE FUNCTION chipug_fac_plpgsql(n int)
        RETURNS numeric
        AS $$
        DECLARE
            fac numeric;
            i int;
        BEGIN
            fac := 1;
            FOR i IN 1..n LOOP
                fac := fac * i;
            END LOOP;
            RETURN fac;
        END;
        $$ LANGUAGE plpgsql IMMUTABLE PARALLEL SAFE;
  • 30. Triggers
      - Triggers are functions that can be called before/after/instead of an operation or event
      - Data changes (INSERT/UPDATE/DELETE)
      - Events (DDL, DCL, etc. changes)
      - Atomic
      - Must return "trigger" or "event_trigger"
      - (Return "NULL" in a BEFORE row-level trigger if you want to skip the operation)
      - (Gotcha: OLD is not set in an INSERT trigger and NEW is not set in a DELETE trigger, so use RETURN NEW for INSERT and RETURN OLD for DELETE)
      - Execute once per modified row or once per SQL statement
      - Multiple triggers on the same event will execute in alphabetical order
      - Writable in any procedural language that defines a trigger interface
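The ordering and "return NULL to skip" rules can be pictured with a small Python model. This is only a sketch of the behavior described on the slide, assuming BEFORE row-level triggers; the trigger names `a_validate` and `b_normalize` and the row layout are hypothetical, and Python's `sorted` is only an approximation of ordering by trigger name.

```python
def fire_before_row_triggers(triggers, new_row):
    """Run triggers in name order; a None result skips the operation (toy model)."""
    fired = []
    row = new_row
    for name in sorted(triggers):  # triggers on the same event fire in name order
        fired.append(name)
        row = triggers[name](row)  # each trigger sees the row returned by the previous one
        if row is None:
            break  # operation is skipped for this row; later triggers do not fire
    return row, fired


def a_validate(row):
    """Reject bookings whose end is not after their start."""
    return row if row["ends"] > row["starts"] else None


def b_normalize(row):
    """Return a modified copy of the row, like assigning to NEW and returning it."""
    return {**row, "room": row["room"].strip().lower()}


triggers = {"b_normalize": b_normalize, "a_validate": a_validate}

ok_row, ok_fired = fire_before_row_triggers(
    triggers, {"room": " Main Hall ", "starts": 9, "ends": 10}
)
skipped_row, skipped_fired = fire_before_row_triggers(
    triggers, {"room": "Main Hall", "starts": 10, "ends": 9}
)
print(ok_row, ok_fired)
print(skipped_row, skipped_fired)
```

Because the name decides the order, prefixing trigger names is a common way to make the intended sequence explicit.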
  • 32. We'll Scan the Code. It's Available for Download 😉
  • 34. Lessons of the Test
      - [Test your live demos before running them, and you will have much success!]
      - availability_rule inserts took some time, > 350ms
      - availability: INSERT 52
      - calendar: INSERT 52 from nontrivial function
      - Updates on individual availability / unavailability are not too painful
      - Lookups are faaaaaaaast
  • 35. How About At (Web) Scale?
  • 36. Web Scale
      - Recursive CTE 😢
      - Even with only 100 more rooms and a few sets of rules, rule generation time increased significantly
      - Multirange Types
      - These are still pretty fast and are handling scaling up well.
      - May still be slow for a web transaction.
      - Lookups are still lightning fast!
  • 37. Logical Decoding
      - Added in PostgreSQL 9.4
      - Replays all logical changes made to the database
      - Create a logical replication slot in your database
      - Only one receiver can consume changes from one slot at a time
      - Slot keeps track of the last change that was read by a receiver
      - If the receiver disconnects, the slot will ensure the database holds changes until the receiver reconnects
      - Only changes from tables with primary keys are relayed
      - Alternatively, you can set a "REPLICA IDENTITY" using a unique index on NOT NULL, non-deferrable, non-partial column(s)
      - Basis for Logical Replication
  • 38. Logical Decoding Out of the Box
      - A logical replication slot has a name and an output plugin
      - PostgreSQL comes with the "test_decoding" output plugin
      - Have to write a custom parser to read changes from the test output plugin
      - Several output plugins and libraries available
      - wal2json: https://github.com/eulerto/wal2json
      - jsoncdc: https://github.com/instructure/jsoncdc
      - Debezium: http://debezium.io/
      - (Test: https://www.postgresql.org/docs/current/static/test-decoding.html)
      - Logical Replication (pgoutput)
      - Every data change in the database is streamed
      - Need to be aware of the logical decoding format
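"Being aware of the logical decoding format" in practice means knowing which keys each kind of change carries. The Python sketch below parses a hand-written sample shaped like wal2json's version 1 JSON output as I understand it (a top-level "change" list, "columnnames"/"columnvalues" for inserts and updates, "oldkeys" for deletes). The sample payload and the helper name `change_to_params` are assumptions for illustration; check the plugin's README for the authoritative format.

```python
import json

# Hand-written sample, modeled on wal2json format version 1 (an assumption).
SAMPLE = """
{"change": [
  {"kind": "insert", "schema": "public", "table": "room",
   "columnnames": ["id", "name"],
   "columntypes": ["integer", "text"],
   "columnvalues": [1, "Room A"]},
  {"kind": "delete", "schema": "public", "table": "room",
   "oldkeys": {"keynames": ["id"], "keytypes": ["integer"], "keyvalues": [1]}}
]}
"""


def change_to_params(change):
    """Turn one decoded change into a {column: value} dict."""
    if "columnnames" in change:
        # Inserts and updates carry the new row as parallel name/value lists.
        return dict(zip(change["columnnames"], change["columnvalues"]))
    # Deletes only carry the identifying key columns of the removed row.
    old = change.get("oldkeys", {})
    return dict(zip(old.get("keynames", []), old.get("keyvalues", [])))


payload = json.loads(SAMPLE)
parsed = [(c["kind"], c["table"], change_to_params(c)) for c in payload["change"]]
for kind, table, params in parsed:
    print(kind, table, params)
```

A consumer that assumes every change has "columnnames" would fail on the delete entry in this sample, so branching on the available keys is worth doing early.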
  • 39. Driver Support
      - C: libpq
      - pg_recvlogical
      - PostgreSQL functions
      - Python: psycopg2 - version 2.7
      - JDBC: version 42
      - Go: pgx
      - JavaScript: node-postgres (pg-logical-replication)
  • 40. Using Logical Decoding
  • 41. Thoughts🤔
      - We know it takes time to regenerate the calendar
      - We want changes to always propagate, but we also want all users (managers, calendar searchers) to have a good experience
  • 42. Replacing Triggers
      - Will use the same data model as before as well as the same helper functions, but without the triggers
      - We will have a Python script that reads from a logical replication slot and, if it detects a relevant change, takes an action
      - Similar to what we did with triggers, but this moves the work to OUTSIDE the transaction
      - BUT...we can confirm whether or not the work is completed, so if the program fails, we can restart from the last acknowledged transaction ID
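The "acknowledge, then restart from the last acknowledged position" idea can be sketched without a database. The `ToySlot` class below is a hypothetical in-memory stand-in for a replication slot, using small integers where PostgreSQL would track a position in the WAL; `consume` and `handler` are likewise illustrative names, and the crash is simulated.

```python
class ToySlot:
    """In-memory stand-in for a replication slot (illustration only)."""

    def __init__(self):
        self.changes = []    # retained (position, change) pairs
        self.confirmed = 0   # last position the consumer acknowledged
        self.next_pos = 1

    def write(self, change):
        self.changes.append((self.next_pos, change))
        self.next_pos += 1

    def read(self):
        """Return every retained change after the acknowledged position."""
        return [c for c in self.changes if c[0] > self.confirmed]

    def acknowledge(self, pos):
        """Record progress and release everything up to and including pos."""
        self.confirmed = max(self.confirmed, pos)
        self.changes = [c for c in self.changes if c[0] > self.confirmed]


def consume(slot, handler):
    for pos, change in slot.read():
        handler(change)        # do the post-processing work first
        slot.acknowledge(pos)  # only then confirm it


processed = []
failures = {"b": 1}  # make the first attempt at change "b" fail


def handler(change):
    if failures.get(change, 0) > 0:
        failures[change] -= 1
        raise RuntimeError("simulated crash")
    processed.append(change)


slot = ToySlot()
for change in ["a", "b", "c"]:
    slot.write(change)

try:
    consume(slot, handler)   # handles "a", then crashes on "b"
except RuntimeError:
    pass
retained_after_crash = [change for _, change in slot.changes]
consume(slot, handler)       # restart: resumes at "b", nothing is lost
print(processed, retained_after_crash, slot.changes)
```

Acknowledging only after the work succeeds means a change may be retried after a crash, so the post-processing step should be safe to run more than once.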
  • 44. The Consumer Bottleneck
      - A consumer of the logical stream can only read one change at a time
      - If our processing of a change takes a lot of time, it will create a backlog of changes
      - Backlog means the PostgreSQL server needs to retain more WAL
      - Retaining too much WAL can lead to running out of disk space
      - Running out of disk space can lead to...rough times.
      - 🌤 🌥 ☁ 🌩
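A back-of-the-envelope model shows how quickly a backlog builds when each change takes longer to process than the interval between arriving changes. The function `backlog_over_time` and the rates used below are made-up illustrations, not measurements from the demo.

```python
def backlog_over_time(changes_per_sec, seconds_per_change, duration_sec):
    """Return the number of unprocessed changes at the end of each second."""
    capacity = 1.0 / seconds_per_change  # changes a single consumer handles per second
    backlog = 0.0
    history = []
    for _ in range(duration_sec):
        # Arrivals add to the backlog, processing drains it, never below zero.
        backlog = max(0.0, backlog + changes_per_sec - capacity)
        history.append(backlog)
    return history


# Hypothetical rates: 10 changes/s arriving, 0.5 s of work per change.
falling_behind = backlog_over_time(10, 0.5, 3)
# Hypothetical rates: 1 change/s arriving, same consumer.
keeping_up = backlog_over_time(1, 0.5, 3)
print(falling_behind, keeping_up)
```

Whenever the arrival rate exceeds the single consumer's capacity, the backlog grows linearly, and with it the amount of WAL the server has to retain.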
  • 46. Shifting the Workload
      - Can utilize a durable message queueing system to store any WAL changes that need post-processing
      - Ensure the changes are worked on in order
      - "Divide-and-conquer" workload - have multiple workers acting on different "topics"
      - Remove WAL bloat
  • 47. Apache Kafka
      - Durable message processing and distribution system
      - Streams
      - Supports parallelization of consumers
      - Multiple consumers, partitions
      - Highly-available, distributed architecture
      - Acknowledgement of receiving, processing messages; can replay (sounds like WAL?)
      - Can also accomplish this with Debezium, which interfaces with Kafka + Postgres
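The properties listed above (ordered topics, independent consumers, acknowledgement, replay) can be illustrated with a tiny in-memory log. `ToyLog` is a hypothetical teaching model and not the Kafka client API; it keeps one ordered list per topic and one committed offset per (group, topic), where the committed offset is the position of the next message to read.

```python
from collections import defaultdict


class ToyLog:
    """In-memory model of topics with per-group committed offsets (not Kafka)."""

    def __init__(self):
        self.topics = defaultdict(list)     # topic -> ordered messages
        self.committed = defaultdict(int)   # (group, topic) -> next offset to read

    def send(self, topic, message):
        self.topics[topic].append(message)

    def poll(self, group, topic):
        """Return (offset, message) pairs the group has not committed yet."""
        start = self.committed[(group, topic)]
        return list(enumerate(self.topics[topic][start:], start))

    def commit(self, group, topic, next_offset):
        self.committed[(group, topic)] = next_offset


log = ToyLog()
log.send("room", {"kind": "insert", "id": 1})
log.send("availability_rule", {"kind": "insert", "id": 7})
log.send("room", {"kind": "update", "id": 1})

# A worker dedicated to the "room" topic sees only its own changes, in order.
room_batch = log.poll("room-workers", "room")
for offset, message in room_batch:
    log.commit("room-workers", "room", offset + 1)  # commit the NEXT offset to read

after_commit = log.poll("room-workers", "room")
# A different group starts from the beginning, which is what makes replay possible.
replay = log.poll("audit", "room")
# The other topic is untouched, so a slow worker there does not block "room".
pending_rules = log.poll("rule-workers", "availability_rule")
print(room_batch, after_commit, replay, pending_rules)
```

Splitting changes by topic is what isolates a slow post-processing step: only the queue for that topic backs up.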
  • 49. WAL Consumer
        import json, sys
        from kafka import KafkaProducer
        from kafka.errors import KafkaError
        import psycopg2
        import psycopg2.extras

        TABLES = set([
            'availability', 'availability_rule', 'room', 'unavailability',
        ])

        reader = WALConsumer()
        cursor = reader.connection.cursor()
        cursor.start_replication(slot_name='schedule', decode=True)
        try:
            cursor.consume_stream(reader)
        except KeyboardInterrupt:
            print("Stopping reader...")
        finally:
            cursor.close()
            reader.connection.close()
            print("Exiting reader")
  • 50.
        class WALConsumer(object):
            def __init__(self):
                self.connection = psycopg2.connect("dbname=realtime",
                    connection_factory=psycopg2.extras.LogicalReplicationConnection,
                )
                self.producer = KafkaProducer(
                    bootstrap_servers=['localhost:9092'],
                    value_serializer=lambda m: json.dumps(m).encode('ascii'),
                )

            def __call__(self, msg):
                payload = json.loads(msg.payload, strict=False)
                print(payload)
                # determine if the payload should be passed on to a consumer listening
                # to the Kafka queue
                for data in payload['change']:
                    if data.get('table') in TABLES:
                        self.producer.send(data.get('table'), data)
                # ensure everything is sent; call flush at this point
                self.producer.flush()
                # acknowledge that the change has been read - tells PostgreSQL to stop
                # holding onto this log file
                msg.cursor.send_feedback(flush_lsn=msg.data_start)
  • 51. Kafka Consumer
        import json
        from kafka import KafkaConsumer
        from kafka.structs import OffsetAndMetadata, TopicPartition
        import psycopg2

        class Worker(object):
            """Base class to perform any post-processing on changes"""
            OPERATIONS = set([])  # override with "insert", "update", "delete"

            def __init__(self, topic):
                # connect to the PostgreSQL database
                self.connection = psycopg2.connect("dbname=realtime")
                # connect to Kafka
                self.consumer = KafkaConsumer(
                    bootstrap_servers=['localhost:9092'],
                    value_deserializer=lambda m: json.loads(m.decode('utf8')),
                    auto_offset_reset="earliest",
                    group_id='1')
                # subscribe to the topic(s)
                self.consumer.subscribe(topic if isinstance(topic, list) else [topic])
  • 52. Kafka Consumer
            def run(self):
                """Function that runs ad-infinitum"""
                # loop through the payloads from the consumer
                # determine if there are any follow-up actions based on the kind of
                # operation, and if so, act upon it
                # always commit when done.
                for msg in self.consumer:
                    print(msg)
                    # load the data from the message
                    data = msg.value
                    # determine if there are any follow-up operations to perform
                    if data['kind'] in self.OPERATIONS:
                        # open up a cursor for interacting with PostgreSQL
                        cursor = self.connection.cursor()
                        # put the parameters in an easy to digest format
                        params = dict(zip(data['columnnames'], data['columnvalues']))
                        # call the function
                        getattr(self, data['kind'])(cursor, params)
                        # commit any work that has been done, and close the cursor
                        self.connection.commit()
                        cursor.close()
                    # acknowledge the message has been handled; the committed offset
                    # is the next message to consume, hence msg.offset + 1
                    tp = TopicPartition(msg.topic, msg.partition)
                    offsets = {tp: OffsetAndMetadata(msg.offset + 1, None)}
                    self.consumer.commit(offsets=offsets)
  • 53. Kafka Consumer
            # override with the appropriate post-processing code
            def insert(self, cursor, params):
                """Override with any post-processing to be done on an ``INSERT``"""
                raise NotImplementedError()

            def update(self, cursor, params):
                """Override with any post-processing to be done on an ``UPDATE``"""
                raise NotImplementedError()

            def delete(self, cursor, params):
                """Override with any post-processing to be done on a ``DELETE``"""
                raise NotImplementedError()
  • 55. Lessons
      - Logical decoding allows the bulk inserts to occur significantly faster from a transactional view
      - Potential bottleneck for long-running execution, but bottlenecks are isolated to specific queues
      - Newer versions of PostgreSQL have features that make it easier to build applications and scale
  • 56. Conclusion
      - PostgreSQL is robust.
      - Triggers will keep your data in sync but can have significant performance overhead
      - Utilizing a logical replication slot can eliminate trigger overhead and transfer the computational load elsewhere
      - Not a panacea: still need to use good architectural patterns!