Real time analytics at any scale | PostgreSQL User Group NL | Marco Slot

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
Real-time Analytics on
PostgreSQL at any Scale
Marco Slot <marco@citusdata.com>

Marco Slot | Citus Data | PostgreSQL Meetup Amsterdam: November 2018
You offer a product or service (e.g. SaaS, IoT platform, network telemetry, …)
that generates large volumes of time series data.
How to build an analytical dashboard for your customers that:
• Supports a large number of concurrent users
• Reflects new data within minutes
• Has subsecond response times
• Supports advanced analytics
What is real-time analytics?
2
(Heap Analytics)

Pipeline of Collect - Aggregate - Query:
Real-time analytics architecture
3
Event source
Event source
Event source
Event source
Storage
(Database) Aggregate
Rollups
(Database)
Dashboard
(App)
Collect
Queries

Pipeline of Collect - Aggregate - Query:
Real-time analytics architecture
4
Event source
Event source
Event source
Event source
Storage
(Database) Aggregate
Rollups
(Database)
Dashboard
(App)
Collect
Queries
Postgres/Citus

Define a table for storing raw events:
CREATE TABLE events (
event_id bigserial,
event_time timestamptz default now(),
customer_id bigint,
event_type text,
…
event_details jsonb
);
Raw data table
5

COPY is by far the fastest way of loading data.
COPY events (customer_id, event_time, … ) FROM STDIN;
A few parallel COPY streams can load hundreds of thousands of events per
second!
Load data
6

To achieve fast data loading:
• Use COPY
• Don’t use indexes
To achieve fast reading of new events for aggregation:
• Use an index
Fast data loading
7

To achieve fast data loading:
• Use COPY
• Don’t use large indexes
To achieve fast reading of new events for aggregation:
• Use an index
Block-range index is suitable for ordered columns:
CREATE INDEX event_time_idx ON events USING BRIN (event_time);
Fast data loading
8

Pre-computed aggregates for a period and set of (group by) dimensions.
Can be further filtered and aggregated to generate charts.
What is a rollup?
9
Period Customer Country Site Hit Count
SELECT…

Append new data to a raw events table (avoid indexes!):
COPY events FROM ...
Periodically aggregate events into rollup table (index away!):
INSERT INTO rollup SELECT … FROM events … GROUP BY …
Application queries the rollup table:
SELECT … FROM rollup WHERE customer_id = 1238 …
Postgres recipe for real-time analytics
10

Keep your data sorted into buckets
Partitioning
11

Partitioning keeps indexes small by dividing tables into partitions:
Benefits:
• Avoid fragmentation
• Smaller indexes
• Partition pruning for queries that filter by partition column
• Drop old data quickly, without bloat/fragmentation
Partitioning
12
COPY COPY

Defining a partitioned table:
CREATE TABLE events (...) PARTITION BY (event_time);
Setting up hourly partitioning with pg_partman:
SELECT partman.create_parent('public.events', 'event_time',
'native', 'hourly');
https://www.citusdata.com/blog/2018/01/24/citus-and-pg-partman-creating-a-sca
lable-time-series-database-on-PostgreSQL/
CREATE EXTENSION pg_partman
13

If you’re using partitioning, pg_partman can drop old partitions:
UPDATE partman.part_config
SET retention_keep_table = false, retention = '1 month'
WHERE parent_table = 'public.events';
Periodically run maintenance:
SELECT partman.run_maintenance();
Expiring old data in a partitioned table
14

Run pg_partman maintenance every hour using pg_cron:
SELECT cron.schedule('3 * * * *', $$
SELECT partman.run_maintenance()
$$);
https://github.com/citusdata/pg_cron
Periodic partitioning maintenance
15

High vs. Low Cardinality
Designing Rollup Tables
16

Define rollup tables containing aggregates:
CREATE TABLE rollup_by_period_and_dimensions (
<period>
<dimensions>
<aggregates>
primary key (<dimensions>,<period>)
);
Primary key index covers many queries, can also add additional indices:
CREATE INDEX usc_idx ON rollup (customer_id, site_id);
Rollup table
17

Two rollups are smaller than one: A*B + A*C < A*B*C
But… up to 2x more aggregation work.
Choosing granularity and dimensions
18
Time Customer Country Aggregates
Time Customer Site Aggregates
Time Customer Country Site Aggregates
~100 rows per period/customer
~20 rows per period/customer
~20*100=2000 rows per period/customer

Find balance between query performance and table management.
1. Identify dimensions, metrics (aggregates)
2. Try rollup with all dimensions:
3. Test compression/performance (goal is >5x smaller)
4. If too slow / too big, split rollup table based on query patterns
5. Go to 3
Usually ends up with 5-10 rollup tables
Guidelines for designing rollups
19

Append-only vs. Incremental
Running Aggregations
20

Use INSERT INTO rollup SELECT … FROM events … to populate rollup table.
Append-only aggregation (insert):
Supports all aggregates, including exact distinct, percentiles
Harder to handle late data
Incremental aggregation (upsert):
Supports late data
Cannot handle all aggregates (though can approximate using HLL, TopN)
Append-only vs. Incremental
21

Aggregate events for a particular time period and append them to the rollup
table, once all the data for the period is available.
INSERT INTO rollup
SELECT period, dimensions, aggregates
FROM events
WHERE event_time::date = '2018-09-04'
GROUP BY period, dimensions;
Should keep track of which periods have been aggregated.
Append-only Aggregation
22

Aggregate new events and upsert into rollup table.
INSERT INTO rollup
SELECT period, dimensions, aggregates
FROM events
WHERE event_id BETWEEN s AND e
GROUP BY period, dimensions
ON CONFLICT (dimensions, period) DO UPDATE
SET aggregates = aggregates + EXCLUDED.aggregates;
Need to be able to incrementally build aggregates.
Need to keep track of which events have been aggregated.
Incremental Aggregation
23

Incremental aggregation
Technique for incremental aggregation using a sequence number shown on
the Citus Data blog.
Incrementally approximate distinct count:
HyperLogLog extension
Incrementally approximate top N:
TopN extension
24

CREATE EXTENSION Citus
Scaling out your analytics pipeline
25

Citus is an open source extension to Postgres (9.6, 10, 11) for transparently
distributing tables across many Postgres servers.
CREATE EXTENSION citus
26
Coordinator
create_distributed_table('events', 'customer_id');events

Multi-tenancy
Tenant ID provides a natural sharding dimension for many applications.
Citus automatically co-locates event and rollup data for the same
SELECT create_distributed_table('events', 'tenant_id');
SELECT create_distributed_table('rollup', 'tenant_id');
Aggregations can be done locally, without network traffic:
INSERT INTO rollup SELECT tenant_id, … FROM events …
Dashboard queries are always for a particular tenant:
SELECT … FROM rollup WHERE tenant_id = 1238 …
27

COPY asynchronously scatters rows to different shards
Data loading in Citus
28
Coordinator
COPYevents

INSERT … SELECT can be parallelised across shards.
Aggregation in Citus
29
Coordinator
events
create_distributed_table('rollup', 'customer_id');
INSERT INTO rollup
SELECT … FROM events
GROUP BY customer_id, …rollup
INSERT INTO rollup_102182
SELECT … FROM events_102010
GROUP BY …
INSERT INTO rollup_102180
GROUP BY …

SELECT on rollup for a particular customer (from the dashboard) can be
routed to the appropriate shard.
Querying rollups in Citus
30
Coordinator
events SELECT … FROM rollup
WHERE customer_id = 12834 …
…rollup
WHERE customer_id = 1283 …
…

You should use:
• COPY to load raw data into a table
• BRIN index to find new events during aggregation
• Partitioning with pg_partman to expire old data
• Rollup tables built from raw event data
• Append-only aggregation if you need exact percentile/distinct count
• Incremental aggregation if you can have late data
• HLL to incrementally approximate distinct count
• TopN to incrementally approximate heavy hitters
• Citus to scale out
Summary
31

marco@citusdata.com
Q&A
32

Real time analytics at any scale | PostgreSQL User Group NL | Marco Slot

More Related Content

Similar to Real time analytics at any scale | PostgreSQL User Group NL | Marco Slot (20)

More from Citus Data (20)

Recently uploaded (20)

Real time analytics at any scale | PostgreSQL User Group NL | Marco Slot