Creating PostgreSQL-as-a-Service at Scale
Care and Feeding of Elephants in a Zoo
2015-09-18
Background on Cats
• Groupon has sold more than 850 million units to date,
including 53 million in Q2 alone [1][4].
• Nearly 1 million merchants are connected by our suite
of tools to more than 110 million people that have
downloaded our mobile apps.
• 90% of merchants agree that Groupon brought in new
customers [2].
• Groupon helps small businesses — 91% of the
businesses Groupon works with have 20 employees or
fewer [2].
• 81% of customers have referred someone to the
business — Groupon customers are “influencers” who
spread the word in their peer groups [3].
1) Units reflect vouchers and products sold before cancellations and refunds.
2) AbsolutData, Q2 2015 Merchant Business Partnership Survey, June 2015 (conducted by Groupon).
3) ForeSee Groupon Customer Satisfaction Study, June 2015 (commissioned by Groupon)
4) Information on this slide is current as of Q2 2015
SOA Vogue and Acquisitions
• Four acquisitions in 2015 and six acquisitions in 2014
• Internally many services and teams
SOA Consequences
SOA is a fancy way of saying lots of apps
talk to lots of database instances.
Building Database Systems
• Ogres are like onions: they have layers.
• Databases are like onions: they have layers, too.
• Databases do not operate in a vacuum*.
* No pun intended, I promise.
Database Functionality
Typical web stack:
• Browser
• CDN
• Load Balancer
• App Tier
• API Tier
• Database
Where are databases in most web stacks?
Typical stack:
• Browser
• CDN
• Load Balancer
• App Tier
• API Tier
• Database
• Wouldn't it be nice if something was here?
Macro Components of a Database
Typical stack:
• Browser
• CDN
• Load Balancer
• App Tier
• API Tier
• Database
- Query && Query Plans
- CPU
- Locking
- Shared Buffers
- Filesystem Buffers
- Disk IO
- Disk Capacity
- Slave Replication
http://www.brendangregg.com/Perf/freebsd_observability_tools.png
Risk Management
It's Friday afternoon, let's have some fun:
# postgresql.conf
#fsync = on
synchronous_commit = off
zfs set sync=disabled tank/foo
vfs.zfs.txg.timeout: 5
What cost are you willing to accept for losing up to 5 seconds of data?
Discuss.
Mandatory Disclaimer: we don't do this everywhere, but we do
by default.
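The question on this slide is arithmetic at heart, so here is a minimal sketch of it. It assumes the loss window is the 5-second `vfs.zfs.txg.timeout` shown above; the commit rate and the value per commit are hypothetical inputs, not figures from the talk.

```python
def commits_at_risk(commits_per_second: float, window_seconds: float) -> float:
    """Upper bound on acknowledged commits a crash could discard."""
    return commits_per_second * window_seconds


def exposure(commits_per_second: float, window_seconds: float,
             value_per_commit: float) -> float:
    """Worst-case value of the commits that fall inside the loss window."""
    return commits_at_risk(commits_per_second, window_seconds) * value_per_commit


# Hypothetical workload: 200 commits/s, a 5 s window, 0.50 units of value per commit.
print(commits_at_risk(200, 5))
print(exposure(200, 5, 0.50))
```

If that worst case is cheaper than the latency of synchronous writes, the trade may be acceptable for that database; if not, it is a candidate for an exception to the default.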
Real Talk: What are the components of a Database?
• Query Engine - SQL
• Serialization Layer - MVCC
• Caching - shared_buffers
• Storage - pages (checksums to detect block corruption)
• Proxy - FDW
Database Service Layers
[Diagram, built up across several slides; components in the final build:]
• L2VIP, LB, DNSVIP
• PostgreSQL, pgbouncer, and PITR (two of each)
• WAN Replication
• Backups
Provisioning
•No fewer than 5x components just to get a
basic database service provisioned.
•Times how many combinations?
Plug: giving a talk on automation and provisioning
at HashiConf in 2wks
Provisioning Checklist
VIPs (DNS, LB, L2, etc)          *= # VIPs
PostgreSQL instance              *= initdb + config
Slaves (LAN, OLAP, & WAN)        *= number of slaves
Backups                          *= # backup targets
pgbouncer                        *= # pgbouncers
PITR                             *= # PG instances
Stats Collection and Reporting   *= # DBs && Tables
Graphing                         *= # relevant graphs
Alerting                         *= # Thresholds
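To make the multipliers above concrete, here is a rough sketch that counts provisioning steps for a single database service. The `ServiceShape` class, its field names, and the sample numbers are all hypothetical; counting "initdb + config" as two steps per instance is an assumption as well.

```python
from dataclasses import dataclass


@dataclass
class ServiceShape:
    """Hypothetical per-service counts, one field per checklist multiplier."""
    vips: int
    pg_instances: int
    slaves: int
    backup_targets: int
    pgbouncers: int
    dbs_and_tables: int
    graphs: int
    thresholds: int


def provisioning_steps(s: ServiceShape) -> int:
    """Count discrete provisioning steps for one database service."""
    return (
        s.vips                  # VIPs (DNS, LB, L2, ...)
        + 2 * s.pg_instances    # initdb + config, assumed two steps per instance
        + s.slaves              # LAN, OLAP, and WAN replicas
        + s.backup_targets      # backups
        + s.pgbouncers          # connection poolers
        + s.pg_instances        # PITR, one per PG instance
        + s.dbs_and_tables      # stats collection and reporting
        + s.graphs              # graphing
        + s.thresholds          # alerting
    )


shape = ServiceShape(vips=2, pg_instances=4, slaves=3, backup_targets=1,
                     pgbouncers=2, dbs_and_tables=10, graphs=12, thresholds=8)
print(provisioning_steps(shape))        # steps for one service
print(provisioning_steps(shape) * 20)   # the same shape across 20 services
```

Even with modest per-service counts the total grows quickly across a fleet, which is the case for automating it.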
Provisioning Checklist
• Known per-user limits
• Inheriting existing applications
• Different workloads
• Different compliance and regulatory requirements
Provisioning
•Automate
•Find a solution that provides a coherent view of the
world (e.g. Ansible)
•Idempotent Execution (regardless of how quickly or
slowly)
•Immutable Provisioning
•Changes requiring a restart are forbidden by
automation: provision new things and fail over.
•Get a DBA to do restart-like activity
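As an illustration of two of the rules above, idempotent execution and refusing changes that need a restart, here is a minimal sketch. `ensure_setting` and `RestartRequired` are hypothetical names rather than part of Ansible or PostgreSQL, and `RESTART_REQUIRED` lists only a few of the settings that take effect at server start.

```python
# Illustrative subset of settings PostgreSQL only picks up at server start.
RESTART_REQUIRED = {"shared_buffers", "max_connections", "wal_level"}


class RestartRequired(Exception):
    """Raised instead of applying a change that would need a restart."""


def ensure_setting(config: dict, name: str, value: str) -> bool:
    """Converge config[name] to value; return True only if something changed."""
    if config.get(name) == value:
        return False  # already converged, so re-running is a no-op
    if name in RESTART_REQUIRED and name in config:
        # Policy: never change these in place; provision new and fail over.
        raise RestartRequired(f"{name}: provision a new instance and fail over")
    config[name] = value
    return True


cfg = {"shared_buffers": "8GB"}
print(ensure_setting(cfg, "work_mem", "64MB"))  # first run applies the change
print(ensure_setting(cfg, "work_mem", "64MB"))  # second run changes nothing
```

Returning a changed/unchanged flag lets a caller run the same step any number of times, fast or slow, and still end in the same state.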
Efficacy vs Efficiency
•Cost justify automation and efficiency.
•Happens only once every 12mo?
• Do it by hand.
• Document it.
• Don't spend 3x man months automating some
process for the sake of efficiency.
•100% automation is a good goal, but don't forget
about the ROI.
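The ROI check above can be written down as a back-of-the-envelope calculation. The sketch below is one way to do it; the hours, run frequencies, the three-year horizon, and the conversion of three person-months to 480 hours (160 hours per month) are assumptions for illustration.

```python
def automation_net_hours(manual_hours_per_run: float, runs_per_year: float,
                         automation_hours: float, horizon_years: float) -> float:
    """Hours saved (positive) or lost (negative) by automating over the horizon."""
    saved = manual_hours_per_run * runs_per_year * horizon_years
    return saved - automation_hours


# A once-a-year, 8-hour task versus three person-months (assumed 480 h) of automation.
print(automation_net_hours(8, 1, 480, 3))
# A 1-hour task done 260 times a year, with the same automation cost.
print(automation_net_hours(1, 260, 480, 3))
```

A negative result is the "do it by hand and document it" case; a positive one is where automation pays for itself within the horizon.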
Connection Management: pgbouncer
•Databases support unlimited
connections, am i rite?
•More connections == faster
Connection Management: pgbouncer
Clients -> pgbouncer: ~1.5K frontend connections
pgbouncer -> PostgreSQL: ~10 backend connections
Rule of Thumb:
M connections == N cores * some K value
(K = approx. ratio of CPU vs off CPU, e.g. disk IO)
pgbouncer: JDBC edition
pgbouncer <1.6: ?prepareThreshold=0

pgbouncer >=1.6: ???
pgbouncer: Starting Advice
•Limit connections per user to backend by number
of active cores per user.
•M backend connections = N cores * K
•K = approx. ratio of CPU vs queued disk IO
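The rule of thumb above can be sketched as a small sizing helper. The core count and the value of K below are made-up inputs, and mapping the result onto pgbouncer's `default_pool_size` and `max_client_conn` settings is one possible interpretation, not the configuration used in the talk.

```python
import math


def backend_connections(cores: int, k: float) -> int:
    """M = N cores * K, rounded up, never below one connection."""
    return max(1, math.ceil(cores * k))


def pgbouncer_settings(cores: int, k: float, frontend_clients: int) -> dict:
    """Map the rule of thumb onto two pgbouncer.ini settings."""
    return {
        "max_client_conn": frontend_clients,                  # cheap frontend connections
        "default_pool_size": backend_connections(cores, k),   # real PostgreSQL backends
    }


# Hypothetical host: 8 cores, K = 1.25, about 1.5K application clients.
print(pgbouncer_settings(8, 1.25, 1500))
```

K is workload-specific: the more time queries spend waiting off CPU, the more backends per core can be kept busy, so treat the first value as a starting point to measure against.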
Backups
•Slaves aren't backups
•Replication is not a backup
•Replication + Snapshots? Debatable, depends on
retention, and failure domain.
-- Dev or DBA "Oops" Moment
DROP DATABASE bar;

DROP TABLE foo;
TRUNCATE foo;
Remote User Controls
•DROP DATABASE or DROP TABLE happen
•Automated schema migrations gone wrong
•Accidentally pointed dev host at prod database
•Create and own DBs using the superuser account
•Give teams ownership over a schema with a
"DBA account"
•Give teams one or more "App Accounts"
(??!!??!?! @#%@#!)
Remote User Controls: pg_hba.conf
• DBA account:
# TYPE  DATABASE  USER          ADDRESS          METHOD
host    foo_prod  foo_prod_dba  100.64.1.25/32   md5
host    foo_prod  foo_prod_dba  100.66.42.89/32  md5
ALTER ROLE foo_prod_dba CONNECTION LIMIT 2;
• App Account:
# TYPE  DATABASE  USER           ADDRESS         METHOD
host    foo_prod  foo_prod_app1  10.23.45.67/32  md5
ALTER ROLE foo_prod_app1 CONNECTION LIMIT 10;
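To see how rules like the ones above confine each account to known hosts, here is a small sketch that mimics first-match evaluation of `host` lines. It is a simplification: it handles only literal database names, user names, and IPv4 CIDR addresses, not keywords such as `all`, and the helper names are hypothetical.

```python
import ipaddress
from typing import NamedTuple, Optional


class HbaRule(NamedTuple):
    database: str
    user: str
    network: ipaddress.IPv4Network
    method: str


def parse_hba(text: str) -> list:
    """Parse `host DATABASE USER ADDRESS METHOD` lines; skip everything else."""
    rules = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 5 or fields[0] != "host":
            continue
        _, database, user, address, method = fields
        rules.append(HbaRule(database, user, ipaddress.ip_network(address), method))
    return rules


def first_match(rules, database: str, user: str, client_ip: str) -> Optional[HbaRule]:
    """Return the first matching rule, or None when no rule admits the client."""
    ip = ipaddress.ip_address(client_ip)
    for rule in rules:
        if rule.database == database and rule.user == user and ip in rule.network:
            return rule
    return None


HBA = """
# TYPE  DATABASE  USER           ADDRESS          METHOD
host    foo_prod  foo_prod_dba   100.64.1.25/32   md5
host    foo_prod  foo_prod_dba   100.66.42.89/32  md5
host    foo_prod  foo_prod_app1  10.23.45.67/32   md5
"""
rules = parse_hba(HBA)
print(first_match(rules, "foo_prod", "foo_prod_dba", "100.64.1.25"))
print(first_match(rules, "foo_prod", "foo_prod_app1", "100.64.1.25"))
```

The second lookup finds no rule: the app account is only admitted from its own host, so a dev machine pointed at prod by accident would be rejected before authentication.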
Incident Response
• Develop playbooks
• Develop checklists
• DTrace scripts
Locking
-- Find the blocking PID:
SELECT bl.pid     AS Blocked_PID,
       a.usename  AS Blocked_User,
       kl.pid     AS Blocking_PID,
       ka.usename AS Blocking_User,
       to_char(age(now(), a.query_start), 'HH24h:MIm:SSs') AS Age
FROM (pg_catalog.pg_locks bl
      JOIN pg_catalog.pg_stat_activity a ON bl.pid = a.pid)
JOIN (pg_catalog.pg_locks kl
      JOIN pg_catalog.pg_stat_activity ka ON kl.pid = ka.pid)
  ON bl.locktype = kl.locktype
 AND bl.database      IS NOT DISTINCT FROM kl.database
 AND bl.relation      IS NOT DISTINCT FROM kl.relation
 AND bl.page          IS NOT DISTINCT FROM kl.page
 AND bl.tuple         IS NOT DISTINCT FROM kl.tuple
 AND bl.virtualxid    IS NOT DISTINCT FROM kl.virtualxid
 AND bl.transactionid IS NOT DISTINCT FROM kl.transactionid
 AND bl.classid       IS NOT DISTINCT FROM kl.classid
 AND bl.objid         IS NOT DISTINCT FROM kl.objid
 AND bl.objsubid      IS NOT DISTINCT FROM kl.objsubid
 AND bl.pid != kl.pid
WHERE kl.granted AND NOT bl.granted
ORDER BY age DESC;
Index Bloat
WITH btree_index_atts AS (
SELECT nspname, relname, reltuples, relpages, indrelid, relam,
regexp_split_to_table(indkey::text, ' ')::smallint AS attnum,
indexrelid as index_oid
FROM pg_index
JOIN pg_class ON pg_class.oid=pg_index.indexrelid
JOIN pg_namespace ON pg_namespace.oid = pg_class.relnamespace
JOIN pg_am ON pg_class.relam = pg_am.oid
WHERE pg_am.amname = 'btree'
),
index_item_sizes AS (
SELECT
i.nspname, i.relname, i.reltuples, i.relpages, i.relam,
s.starelid, a.attrelid AS table_oid, index_oid,
current_setting('block_size')::numeric AS bs,
/* MAXALIGN: 4 on 32bits, 8 on 64bits (and mingw32 ?) */
CASE
WHEN version() ~ 'mingw32' OR version() ~ '64-bit' THEN 8
ELSE 4
END AS maxalign,
24 AS pagehdr,
/* per tuple header: add index_attribute_bm if some cols are null-able */
CASE WHEN max(coalesce(s.stanullfrac,0)) = 0
THEN 2
ELSE 6
END AS index_tuple_hdr,
/* data len: null values take no space, so scale each column's width by its non-null fraction from stats */
sum( (1-coalesce(s.stanullfrac, 0)) * coalesce(s.stawidth, 2048) ) AS nulldatawidth
FROM pg_attribute AS a
JOIN pg_statistic AS s ON s.starelid=a.attrelid AND s.staattnum = a.attnum
JOIN btree_index_atts AS i ON i.indrelid = a.attrelid AND a.attnum = i.attnum
WHERE a.attnum > 0
GROUP BY 1, 2, 3, 4, 5, 6, 7, 8, 9
),
index_aligned AS (
SELECT maxalign, bs, nspname, relname AS index_name, reltuples,
relpages, relam, table_oid, index_oid,
( 2 +
maxalign - CASE /* Add padding to the index tuple header to align on MAXALIGN */
WHEN index_tuple_hdr%maxalign = 0 THEN maxalign
ELSE index_tuple_hdr%maxalign
END
+ nulldatawidth + maxalign - CASE /* Add padding to the data to align on MAXALIGN */
WHEN nulldatawidth::integer%maxalign = 0 THEN maxalign
ELSE nulldatawidth::integer%maxalign
END
)::numeric AS nulldatahdrwidth, pagehdr
FROM index_item_sizes AS s1
),
otta_calc AS (
SELECT bs, nspname, table_oid, index_oid, index_name, relpages, coalesce(
ceil((reltuples*(4+nulldatahdrwidth))/(bs-pagehdr::float)) +
CASE WHEN am.amname IN ('hash','btree') THEN 1 ELSE 0 END , 0 -- btree and hash have a metadata reserved block
) AS otta
FROM index_aligned AS s2
LEFT JOIN pg_am am ON s2.relam = am.oid
),
raw_bloat AS (
SELECT current_database() as dbname, nspname, c.relname AS table_name, index_name,
bs*(sub.relpages)::bigint AS totalbytes,
CASE
WHEN sub.relpages <= otta THEN 0
ELSE bs*(sub.relpages-otta)::bigint END
AS wastedbytes,
CASE
WHEN sub.relpages <= otta
THEN 0 ELSE bs*(sub.relpages-otta)::bigint * 100 / (bs*(sub.relpages)::bigint) END
AS realbloat,
pg_relation_size(sub.table_oid) as table_bytes,
stat.idx_scan as index_scans
FROM otta_calc AS sub
JOIN pg_class AS c ON c.oid=sub.table_oid
JOIN pg_stat_user_indexes AS stat ON sub.index_oid = stat.indexrelid
)
SELECT dbname as database_name, nspname as schema_name, table_name, index_name,
round(realbloat, 1) as bloat_pct,
wastedbytes as bloat_bytes, pg_size_pretty(wastedbytes::bigint) as bloat_size,
totalbytes as index_bytes, pg_size_pretty(totalbytes::bigint) as index_size,
table_bytes, pg_size_pretty(table_bytes) as table_size,
index_scans
FROM raw_bloat
WHERE ( realbloat > 50 and wastedbytes > 50000000 )
ORDER BY wastedbytes DESC;
Go here instead:
https://gist.github.com/jberkus/992394
Duplicate Indexes
-- Detect duplicate indexes
SELECT ss.tbl::REGCLASS AS table_name,
       pg_size_pretty(SUM(pg_relation_size(idx))::bigint) AS size,
       (array_agg(idx))[1] AS idx1,
       (array_agg(idx))[2] AS idx2,
       (array_agg(idx))[3] AS idx3,
       (array_agg(idx))[4] AS idx4
FROM (
    -- Build one key per index definition; E'\n' separates the parts.
    SELECT indrelid AS tbl, indexrelid::regclass AS idx,
           (indrelid::text || E'\n' || indclass::text || E'\n' || indkey::text
            || E'\n' || coalesce(indexprs::text, '') || E'\n'
            || coalesce(indpred::text, '')) AS KEY
    FROM pg_index
) AS ss
GROUP BY ss.tbl, KEY HAVING count(*) > 1
ORDER BY SUM(pg_relation_size(idx)) DESC;
Frequently Used Queries
•Top Queries:
• Sorted by average ms per call
• CPU hog
• number of callers
•Locks blocking queries
•Table Bloat
•Unused Indexes
•Sequences close to max values
•Find tables with sequences
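For the "sequences close to max values" check listed above, the comparison itself is simple once a sequence's last value and maximum have been read from the database. This sketch shows only that arithmetic; the 75% threshold and the function names are assumptions, and it covers ascending sequences only.

```python
INT4_MAX = 2**31 - 1  # upper bound for a sequence feeding a 4-byte integer column


def sequence_used_fraction(last_value: int, max_value: int = INT4_MAX,
                           min_value: int = 1) -> float:
    """Fraction of an ascending sequence's range already consumed."""
    return (last_value - min_value + 1) / (max_value - min_value + 1)


def near_exhaustion(last_value: int, max_value: int = INT4_MAX,
                    threshold: float = 0.75) -> bool:
    """True when the sequence has used at least `threshold` of its range."""
    return sequence_used_fraction(last_value, max_value) >= threshold


print(near_exhaustion(2_000_000_000))  # well past 75% of the int4 range
print(near_exhaustion(1_000_000))      # plenty of headroom left
```

Alerting on a fraction rather than an absolute value lets the same check cover both 4-byte and 8-byte sequences.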
Thank you! Questions?

We're hiring DBAs and DBEs.
Sean Chittenden

seanc@groupon.com
