10 Reasons to Start Your Analytics Project with PostgreSQL

Satoshi Nagayasu
@snaga
HKOSCon 2016

Agenda
• Collecting Data / Database Federation
• Building Data Warehouse and Data Mart
• Writing Queries / SQL Features
• Performance
• In-Database Analytics

Collecting Data / Database Federation
• Foreign Data Wrapper
• Unlogged Table

Foreign Data Wrapper
• Connects external data sources (RDBMS, NoSQL, files, etc.) to the PostgreSQL executor.
• Allows SELECT/INSERT/UPDATE/DELETE operations on foreign tables.
(Diagram: PostgreSQL connected to Oracle, MySQL and HDFS.)
https://wiki.postgresql.org/wiki/Foreign_data_wrappers

Unlogged Table
• Does not write WAL (XLOG) records.
• Writes faster than a regular table.
• Is truncated after crash recovery.
http://pgsnaga.blogspot.jp/2011/10/data-loading-into-unlogged-tables-and.html

Building Data Warehouse and Data Mart
• Materialized Views
• Transactional DDLs

Materialized View
• Defines a view whose result rows are stored (cached).
• Avoids re-running complicated queries and aggregations every time.
• The cache must be refreshed explicitly by the user (REFRESH MATERIALIZED VIEW).
(Diagram: a query on a regular view reads the underlying tables every time; a query on a materialized view reads the cache.)
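The caching behaviour on this slide can be pictured with a small Python sketch. It is only a toy model, not how PostgreSQL implements materialized views; the MaterializedView class and the orders list are hypothetical names invented for illustration.

```python
class MaterializedView:
    """Toy model: stores the result of a query until refresh() is called."""

    def __init__(self, query):
        self._query = query
        self._rows = query()  # the result is computed once, at creation

    def rows(self):
        # Reading the view returns the cached result; the query is not re-run.
        return self._rows

    def refresh(self):
        # Stand-in for REFRESH MATERIALIZED VIEW: the user triggers the update.
        self._rows = self._query()


# Hypothetical base table and aggregation, invented for this example.
orders = [100, 250, 75]
total_sales = MaterializedView(lambda: sum(orders))

orders.append(500)             # the base table changes...
stale = total_sales.rows()     # ...but the view still holds the old result
total_sales.refresh()
fresh = total_sales.rows()
print(stale, fresh)
```

Until refresh() is called, readers keep seeing the old total, which is the trade-off the third bullet describes.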

Transactional DDLs
• Most DDL statements can be run inside a transaction in PostgreSQL.
• The schema can be modified atomically, even online: the changes are either committed or rolled back as a whole.
• Transactional DDLs help DBAs manage their schemas more easily.

Writing Queries / SQL Features
• Rich SQL features
• Compatibility with SQL standard

Writing Queries / SQL Features
• Rich SQL features
– Subqueries
– WITH clauses (Common Table Expressions, CTEs)
– Many aggregation functions
– Window functions
• JSON support
• Compatibility with the SQL standard

WITH clause
• Defines a temporary named result set for a single query.
• May perform better than writing the same subquery more than once.
WITH foo AS (
SELECT ... FROM ... GROUP BY ...
)
SELECT ... FROM foo WHERE ...
UNION ALL
SELECT ... FROM foo WHERE ...;
https://www.postgresql.org/docs/9.5/static/queries-with.html

Many Aggregations
• New in 9.4
– percentile_cont()
– percentile_disc()
– mode()
– rank()
– dense_rank()
– percent_rank()
– cume_dist()
• New in 9.5
– ROLLUP()
– CUBE()
– GROUPING SETS()
https://www.postgresql.org/docs/9.5/static/functions-aggregate.html

ROLLUP
• Calculates total/subtotal values

CUBE
• Calculates aggregates for all combinations of the specified columns

GROUPING SETS
• Runs multiple GROUP BY queries at once
– For example, two GROUP BYs at once.
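As a rough illustration of how ROLLUP and CUBE relate to GROUPING SETS, the Python sketch below lists the grouping sets each one expands to and then aggregates over them in one pass. The function names and the sales rows are hypothetical, and the code only mimics the semantics; it says nothing about how PostgreSQL executes these clauses.

```python
from itertools import combinations


def rollup_sets(columns):
    """Grouping sets implied by ROLLUP: every prefix of the column list."""
    return [tuple(columns[:n]) for n in range(len(columns), -1, -1)]


def cube_sets(columns):
    """Grouping sets implied by CUBE: every subset of the column list."""
    sets = []
    for n in range(len(columns), -1, -1):
        sets.extend(combinations(columns, n))
    return sets


def grouping_sets_sum(rows, sets, value):
    """Compute one GROUP BY per grouping set in a single pass over the rows."""
    totals = {}
    for row in rows:
        for gs in sets:
            # The key records which grouping set it belongs to and its values.
            key = (gs, tuple(row[c] for c in gs))
            totals[key] = totals.get(key, 0) + row[value]
    return totals


# Hypothetical sales rows, invented for illustration.
sales = [
    {"region": "asia", "year": 2015, "amount": 10},
    {"region": "asia", "year": 2016, "amount": 20},
    {"region": "europe", "year": 2016, "amount": 5},
]

print(rollup_sets(["region", "year"]))
print(cube_sets(["region", "year"]))
totals = grouping_sets_sum(sales, rollup_sets(["region", "year"]), "amount")
print(totals[((), ())])  # grand total over the empty grouping set
```

The empty grouping set () is the grand total, the ("region",) set holds the subtotals, and CUBE additionally includes the ("year",) set.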

JSON data type
testdb=# create table t1 ( j jsonb );
CREATE TABLE
testdb=# insert into t1 values ('{ "key1": "value1", "key2":
"value2" }');
INSERT 0 1
testdb=# select * from t1;
j
--------------------------------------
{"key1": "value1", "key2": "value2"}
(1 row)
testdb=# select j->>'key2' key2 from t1;
key2
--------
value2
(1 row)

JSON data type
testdb=# select n_nationkey,n_name from nation where
n_nationkey = 12;
n_nationkey | n_name
-------------+---------------------------
12 | JAPAN
(1 row)
testdb=# select jsonb_build_object('n_nationkey', n_nationkey,
'n_name', n_name) from nation where n_nationkey = 12;
jsonb_build_object
------------------------------------------------------------
{"n_name": "JAPAN ", "n_nationkey": 12}
(1 row)

JSON data type
Operator   Description
New in 9.4
->         Get an element by key as a JSON object
->>        Get an element by key as a text object
#>         Get an element by path as a JSON object
#>>        Get an element by path as a text object
<@, @>     Evaluate whether a JSON object contains a key/value pair
?          Evaluate whether a JSON object contains a key or a value
?|         Evaluate whether a JSON object contains ANY of keys or values
?&         Evaluate whether a JSON object contains ALL of keys or values
New in 9.5
||         Insert or Update an element to a JSON object
-          Delete an element by key from a JSON object
#-         Delete an element by path from a JSON object
http://www.postgresql.org/docs/9.5/static/functions-json.html

JSON data type
• Allows collecting data without defining a schema up front.
• “Schema-less”, “Schema on Read” or “Schema-later”.
• Still accessible with SQL.
(Diagram: Apps → Fluentd (pg-Json plugin) → JSON data type → View (schema).)
Performance
3 types of Join
Full text search (n-gram)
Table Partition
BRIN Index
Table Sample
Parallel Queries

3 types of Join
• Nested Loop (NL) Join
– Works well when joining a small number of records between tables with indexes.
• Merge Join
• Hash Join
– Works better than NL when joining a large number of records between large tables.
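To make the difference between the strategies concrete, here is a minimal Python sketch of a naive nested loop join and a hash join. It is a simplification under stated assumptions: the function names and the sample rows are hypothetical, and a real nested loop in PostgreSQL would typically use an index lookup for the inner side instead of rescanning it.

```python
def nested_loop_join(outer, inner, key):
    """Compare every outer row with every inner row: O(len(outer) * len(inner))."""
    return [(o, i) for o in outer for i in inner if o[key] == i[key]]


def hash_join(left, right, key):
    """Build a hash table on one input, then probe it once per row of the other."""
    buckets = {}
    for row in right:
        buckets.setdefault(row[key], []).append(row)
    return [(lrow, rrow) for lrow in left for rrow in buckets.get(lrow[key], [])]


# Hypothetical rows, invented for illustration.
nations = [{"nationkey": 12, "name": "JAPAN"}, {"nationkey": 18, "name": "CHINA"}]
customers = [
    {"custkey": 1, "nationkey": 12},
    {"custkey": 2, "nationkey": 18},
    {"custkey": 3, "nationkey": 12},
]

nl = nested_loop_join(customers, nations, "nationkey")
hj = hash_join(customers, nations, "nationkey")
print(nl == hj)
```

Both functions return the same pairs; the hash join pays a one-time build cost and then avoids rescanning the inner input, which is why it tends to win on large inputs.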

Full-text search (n-gram)
• Splits text into N-char tokens and builds an index on them.
– Pg_trgm: Tri-gram (3-char)
– Pg_bigm: Bi-gram (2-char)
• CJK text has lots of 2-char words, so Bi-gram may be more useful than Tri-gram.
– CJK: Chinese, Japanese and Korean.
Pg_trgm: https://www.postgresql.org/docs/9.5/static/pgtrgm.html
Pg_bigm: http://pgbigm.osdn.jp/index_en.html
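The idea of an n-gram index can be sketched in a few lines of Python. This is a simplified model, not the actual Pg_trgm or Pg_bigm implementation (for instance, it ignores the padding and normalization those modules apply); the function names are hypothetical and the last title in the list is invented for illustration.

```python
def ngrams(text, n):
    """Return the sliding-window n-character tokens of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(titles, n):
    """Map each n-gram to the set of row ids whose title contains it."""
    index = {}
    for row_id, title in enumerate(titles):
        for token in set(ngrams(title, n)):
            index.setdefault(token, set()).add(row_id)
    return index


def search(index, titles, keyword, n):
    """Emulate LIKE '%keyword%': intersect posting lists, then recheck rows."""
    tokens = ngrams(keyword, n)
    if not tokens:
        # The keyword is shorter than n, so the index cannot narrow the search.
        candidates = set(range(len(titles)))
    else:
        candidates = set.intersection(*(index.get(t, set()) for t in tokens))
    return sorted(i for i in candidates if keyword in titles[i])


titles = ["香港中文大學", "香港大学列表", "北區 (香港)", "維基百科"]
bigram_index = build_index(titles, 2)
print(search(bigram_index, titles, "香港", 2))
print(ngrams("香港", 3))
```

A 2-character keyword such as 香港 produces one bi-gram but no tri-gram at all in this model, which is the reason the slide suggests bi-grams for CJK text.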

Pg_bigm performance
• Wikipedia title data (2,789,266 records)
– https://dumps.wikimedia.org/zhwiki/20160601/
– zhwiki-20160601-pages-articles-multistream-index.txt.bz2
zhwikidb=> select * from zhwiki_index where title like '%香港%';
id1 | id2 | title
----------+-------+----------------------------------------
5693863 | 2087 | 香港特別行政區基本法第二十三條
11393231 | 4323 | 香港特别行政区
12830042 | 5085 | 香港大学列表
14349335 | 6088 | 香港行政区划
14349335 | 6090 | 香港行政區劃
14349335 | 6091 | 香港十八区
14349335 | 6092 | 香港十八區
16084672 | 7168 | 香港兒童文學作家
18110426 | 8206 | 北區 (香港)
18110426 | 8236 | 東區 (香港)
19537078 | 9528 | 香港專業教育學院
19537078 | 9567 | 香港中文大學

Pg_bigm performance
Aggregate (actual time=481.512..481.541 rows=1 loops=1)
-> Seq Scan on zhwiki_index (actual time=1.458..478.326 rows=317 loops=1)
Filter: (title ~~ '%香港電影%'::text)
Rows Removed by Filter: 2788949
Planning time: 0.125 ms
Execution time: 481.654 ms
(6 rows)
select count(*) from zhwiki_index
where title like '%香港電影%';

Pg_bigm performance
Aggregate (actual time=1.790..1.792 rows=1 loops=1)
-> Bitmap Heap Scan on zhwiki_index (actual time=0.299..1.225 rows=317 loops=1)
Recheck Cond: (title ~~ '%香港電影%'::text)
Rows Removed by Index Recheck: 1
Heap Blocks: exact=191
-> Bitmap Index Scan on zhwiki_index_title_idx (actual time=0.258..0.258 rows=318 loops=1)
Index Cond: (title ~~ '%香港電影%'::text)
Planning time: 0.103 ms
Execution time: 1.833 ms
(9 rows)
select count(*) from zhwiki_index
where title like '%香港電影%';
481.7 ms → 1.8 ms.
More than 200x faster (about 260x) than a regular LIKE.

Table Partition
• Table partitioning by range or list
• Does not scan unnecessary partitions
– Called “Constraint Exclusion”
– Determined by the constraints on each partition.
• Can eliminate full table scans on large tables.
https://www.postgresql.org/docs/9.5/static/ddl-partitioning.html
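The pruning decision behind constraint exclusion can be sketched as a simple range-overlap test. The Python below is only a model of the idea, assuming range partitions described by a lower and an upper bound; the partition names and the function name are hypothetical.

```python
from datetime import date

# Hypothetical monthly partitions: (name, inclusive lower bound, exclusive upper bound).
partitions = [
    ("orders_1992_01", date(1992, 1, 1), date(1992, 2, 1)),
    ("orders_1992_02", date(1992, 2, 1), date(1992, 3, 1)),
    ("orders_1992_03", date(1992, 3, 1), date(1992, 4, 1)),
]


def partitions_to_scan(partitions, lo, hi):
    """Keep only partitions whose range can overlap the predicate lo <= d < hi."""
    return [name for name, p_lo, p_hi in partitions if p_lo < hi and lo < p_hi]


# A query restricted to mid-February needs a single partition.
print(partitions_to_scan(partitions, date(1992, 2, 10), date(1992, 2, 20)))
```

Partitions whose constraint contradicts the query predicate are skipped without reading any of their rows.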

BRIN Index
• Block Range INdex (New in 9.5)
– Holds “summary” data, instead of raw data.
– Reduces index size tremendously.
– Also reduces creation/maintenance cost.
– Needs extra tuple fetch to get the exact record.
(Charts, Btree vs. BRIN: Index Creation, elapsed time (ms); Index Size, number of blocks; Select 1 record, elapsed time (ms).)
https://gist.github.com/snaga/82173bd49749ccf0fa6c

BRIN Index
• Structure of BRIN Index
(Diagram: the table file is divided into Block Range 1 (128 blocks), Block Range 2, Block Range 3, ...)
Block Range   Min. Value   Max. Value
1             1992-01-02   1992-01-28
2             1992-01-27   1992-02-08
3             1992-02-08   1992-02-16
…             …            …
Holds only min/max values for “Block Ranges”, 128 blocks each (in the case of a date column).
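The min/max summary idea can be modelled with a short Python sketch. It is a simplification under stated assumptions: each list element stands for one block holding a single value, and the function names are hypothetical; the real index stores its summaries on disk and works on heap pages.

```python
def build_brin(values, pages_per_range):
    """Summarize each block range as (start, min, max)."""
    summary = []
    for start in range(0, len(values), pages_per_range):
        chunk = values[start:start + pages_per_range]
        summary.append((start, min(chunk), max(chunk)))
    return summary


def brin_lookup(values, summary, pages_per_range, target):
    """Scan only ranges whose [min, max] may contain target, then recheck."""
    hits = []
    ranges_scanned = 0
    for start, lo, hi in summary:
        if lo <= target <= hi:
            ranges_scanned += 1
            end = min(start + pages_per_range, len(values))
            for pos in range(start, end):
                # The summary is lossy, so every candidate must be rechecked.
                if values[pos] == target:
                    hits.append(pos)
    return hits, ranges_scanned


values = list(range(1000))           # well-ordered data, the best case for BRIN
summary = build_brin(values, 128)    # 128 blocks per range, as on the slide
print(len(summary))
print(brin_lookup(values, summary, 128, 500))
```

The summary has one entry per 128 blocks instead of one entry per row, which is why the index is so small, and the recheck loop corresponds to the extra tuple fetch mentioned on the previous slide.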

TABLESAMPLE
• Returns approximate results for aggregations by sampling.
• BERNOULLI
– More accurate
– Samples tuple by tuple
• SYSTEM
– Faster
– Samples block by block
http://blog.2ndquadrant.com/tablesample-in-postgresql-9-5-2/
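The difference between the two sampling methods can be sketched in Python. This is a simulation on synthetic data, not PostgreSQL's implementation; the function names, block size and values are all assumptions made for illustration.

```python
import random


def bernoulli_sample(blocks, fraction, rng):
    """Visit every tuple and keep each one independently with the given probability."""
    return [t for block in blocks for t in block if rng.random() < fraction]


def system_sample(blocks, fraction, rng):
    """Pick whole blocks with the given probability and keep all of their tuples."""
    return [t for block in blocks if rng.random() < fraction for t in block]


def average(values):
    return sum(values) / len(values)


rng = random.Random(2016)  # fixed seed so the run is repeatable
BLOCK_SIZE = 100
# Synthetic "table": 1000 blocks of 100 tuples each, values invented for the demo.
blocks = [[rng.uniform(0, 1000) for _ in range(BLOCK_SIZE)] for _ in range(1000)]
all_rows = [t for block in blocks for t in block]

by_tuple = bernoulli_sample(blocks, 0.1, rng)
by_block = system_sample(blocks, 0.1, rng)
print(average(all_rows), average(by_tuple), average(by_block))
```

BERNOULLI has to look at every tuple, while SYSTEM decides once per block and skips the rest, which matches the accuracy versus speed trade-off listed above.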

TABLESAMPLE
• Calculating the average of total price.
– The actual value and the approximate ones

TABLESAMPLE
• Without TABLESAMPLE: 1787 ms
• SYSTEM sampling: 22 ms
• BERNOULLI sampling: 405 ms

Parallel Queries
• A leader process cooperates with worker processes for:
– Sequential scan
– Joins (Nested Loop & Hash)
– Aggregations
• Will be shipped with 9.6
– 9.6 is beta2 as of today
(Diagram: the client sends a query to the leader; the leader launches workers, which read and examine the data; the leader gathers their output and returns the result to the client.)

Parallel Aggregation Performance & Scalability
• count(*) on 30M rows
– Shows good parallel scalability
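The leader/worker pattern behind a parallel count(*) can be sketched in Python as partial aggregation followed by a final combine step. This is a loose analogy: threads stand in for PostgreSQL's worker processes, the function names are hypothetical, and the sketch makes no claim about the speed-up measured on the slide.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_count(rows, predicate):
    """What one worker does: count matching rows in its share of the data."""
    return sum(1 for r in rows if predicate(r))


def parallel_count(rows, predicate, workers=4):
    """What the leader does: split the work, launch workers, gather and combine."""
    chunk = max(1, (len(rows) + workers - 1) // workers)
    parts = [rows[i:i + chunk] for i in range(0, len(rows), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(lambda part: partial_count(part, predicate), parts))
    return sum(partials)  # final aggregation of the partial counts


def is_multiple_of_3(n):
    return n % 3 == 0


rows = list(range(100000))  # synthetic rows, far fewer than the 30M on the slide
serial = partial_count(rows, is_multiple_of_3)
parallel = parallel_count(rows, is_multiple_of_3)
print(serial == parallel)
```

Counting splits cleanly into partial counts that are summed at the end, which is one reason this kind of aggregation scales well with the number of workers.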

In-Database Analytics
User Defined Functions
Apache MADlib

• In-Database Analytics?
– Performs analytics workloads in the database without pulling the data out of the server.
• Advantages of In-Database Analytics
– No need to move “BigData” between server and client for analytics.
– The server has higher-performance hardware resources (CPU, memory, storage) than client PCs.

• User defined functions
– PL/Python, PL/R, PL/v8, ... or C.
– Allow you to run (almost) any logic within the database.
• Apache MADlib
– Machine Learning Library for PostgreSQL

UDF by Python
CREATE OR REPLACE FUNCTION dumpenv(OUT text, OUT text)
RETURNS SETOF record
AS $$
import os
# Report each environment variable as a NOTICE and return it as a (name, value) row.
for e in os.environ:
    plpy.notice(str(e) + ": " + os.environ[e])
    yield (e, os.environ[e])
$$ LANGUAGE plpythonu;
testdb=# select * from dumpenv() order by 1 limit 10;
column1 | column2
--------------------+-----------------------
G_BROKEN_FILENAMES | 1
HISTCONTROL | ignoredups
HISTSIZE | 1000
HOME | /home/snaga
HOSTNAME | localhost.localdomain
LANG | ja_JP.UTF-8
LC_COLLATE | C
LC_CTYPE | C
LC_MESSAGES | C
LC_MONETARY | C
(10 rows)

Apache MADlib
• An Open Source Machine Learning Library
– Can run in PostgreSQL, Greenplum Database and
Apache HAWQ.
– Supports many ML algorithms.
http://madlib.incubator.apache.org/

Others
• Strict type checking and constraints.
• Industry Standard Interface (for BI tools)

Others
• Strict type checking and constraints.
– Avoid “Garbage in, garbage out.”
• Industry Standard Interface (for BI tools)
– ODBC, JDBC

Summary
• PostgreSQL already has many features that help your analytics project
– In terms of productivity and performance.
• More “BigData” features are coming in future releases.
– Parallel query should be a big one.
• Start your analytics project with PostgreSQL and join our community.
– PostgreSQL 9.6 beta2 is available now!

Resources
• http://www.postgresql.org
• http://wiki.postgresql.org
• http://planet.postgresql.org
• http://pgcon.org

pgDay Asia 2016
• pgDay Asia 2016 / FOSSASIA 2016
– March 17-19 in Singapore
• Speakers:
– 19+ speakers from 9 countries
• Sessions:
– 19 Regular Sessions.
– Plus, lightning talks
• Attendees:
– Around 100 attendees

pgDay Asia 2017
• FOSSASIA 2017 (March, 2017)
– Probably, the same format, in the same season, in the
same region.
• Do not miss the next one!
– Will be better and bigger.
• Join us at:
– http://pgday.asia
– https://www.facebook.com/pgdayasia
