NANCY CLI:
a unified way to manage Database Experiments in clouds
Postgres.AI
Nikolay Samokhvalov
twitter: @postgresmen
email: ru@postgresql.org
About me
Postgres experience: 12+ years (database systems: 17+)
Founder and CTO of 3 startups (total 30M+ users), all based on Postgres
Founder of #RuPostgres (1700+ members on Meetup.com, 2nd largest globally)
Re-launched consulting practice in the SF Bay Area http://PostgreSQL.support
Founder of Postgres.AI – the Postgres platform to automate what is not yet automated
Twitter: @postgresmen
Email: ru@postgresql.org
Part 0. Pre-story
Pre-story. Finding the largest tables in a database
How many times did you google things like this?
Finding the largest tables, a semi-automated way
postgres_dba – The missing set of useful tools for Postgres https://github.com/NikolayS/postgres_dba
Finding the largest tables, a semi-automated way
Report #2: sizes of the tables in the current database
Installation of postgres_dba
Installation is trivial:
Important: psql version 10 is needed.
(install postgresql-client-10 package, see README)
Server version may be older
(Use ssh tunnel to connect to remote servers, see README)
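The commands themselves were shown on the slide as a screenshot; a minimal sketch based on the project README (adjust the path):

    git clone https://github.com/NikolayS/postgres_dba.git
    # add this line to ~/.psqlrc (with your actual path):
    #   \set dba '\\i /path/to/postgres_dba/start.psql'
    # then type :dba inside a psql v10+ session to open the menu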
Part 1. Why do we need
automated DB experiments?
How do we do performance improvements nowadays?
- Let’s do default_statistics_target = 1000!
- Let’s do random_page_cost = 1!
- I’ve heard that setting shared_buffers to ¼ of RAM doesn’t rock anymore,
¾ is much better!
- Let’s add this index here!
- Let’s use partitioning for this table!
- Let’s not allow >100,000 dead tuples in a table!
- Etc, etc, etc...
The Whole Truth
For each of these proposals, the real questions are:
● Is this value the best for our database & workload?
● Does it give better (or at least not worse) performance for all queries?
● Does it give a real gain for our database & workload?
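Any of these questions can at least be sanity-checked by hand on a database clone before automating anything. A rough sketch (table and predicate are placeholders):

    psql -d test_clone <<'SQL'
    SET default_statistics_target = 1000;   -- session-level "what-if"
    ANALYZE my_table;                       -- note: the new stats persist, so use a clone!
    EXPLAIN (ANALYZE, BUFFERS)
      SELECT * FROM my_table WHERE created_at > now() - interval '1 day';
    SQL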
...and more
- postgres_dba shows that the bloat level is 35% for this
1B-rows table. Is it good or bad? How bad? (Or how good?)
- When do I need to add more RAM to my database server to
keep good performance characteristics?
- What will happen when more users come to use my app?
- Will i3.xlarge handle 3,000 UPDATEs per second?
- Etc, etc, etc
How do we make changes now?
Option #1
Oh, we see (from monitoring, pg_stat_statements, pgBadger, etc) that this
query is slow on production. Let’s fix that!
⇒ the problem is already there...
Option #2
DBA: well, this will be slow. Add this index! And we can do even better, let’s use
a partial index here!
⇒ good if the DBA is really experienced and/or has verified the ideas on a DB
clone. But how often is that the case? And how many queries were checked?
Towards the better future
Option 1: we’re going to change something. Let’s verify *all* query groups from
pg_stat_statements and see how performance changes – using some
“what-if” API
Option 2: a human or an artificial DBA has an idea for improvement.
Let’s verify it! Again, with the same “what-if” API
⇒ continuous database administration
Bonus: let’s use the same “what-if” API to get more knowledge for the AI we are
building (to train ML models)
So, why do we need to automate DB experiments?
If we see that our change is good for one or several queries, it doesn’t mean
that it is so for all queries.
Without performing deep SQL query analysis (analyzing “all” query groups
in pg_stat_statements’ Top-N) we are blind.
And database administration remains black magic.
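For reference, the per-query-group data such an analysis starts from; a sketch assuming pg_stat_statements is installed (column names as in Postgres 10):

    psql -d mydb <<'SQL'
    -- Top-20 query groups by total execution time
    SELECT queryid, calls, total_time, mean_time,
           left(query, 60) AS query_sample
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 20;
    SQL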
Part 2. Existing works and tools
Existing works and solutions
Andy Pavlo, CMU:
● “What is a Self-Driving Database Management System?” – a great overview
of the history and existing works, an entry point to learn what has been done
(good research papers, Microsoft works, etc)
● PelotonDB, ottertune – great research projects
Oracle’s RAT (Real Application Testing)
– Database Replay + SQL Performance Analyzer:
● Real Application Testing, Oracle 18c
● Sample report
● “Oracle Real Application Testing Delivers 224% ROI”
DIY automated pipeline for DB optimization
How to automate database optimization using ecosystem tools and AWS?
Analyze:
● pg_stat_statements
● auto_explain
● pgBadger to parse logs, use JSON output
● pg_query to group queries better
● pg_stat_kcache to analyze FS-level ops
Configuration:
● annotated.conf, pgtune, pgconfigurator, postgresqlco.nf
● ottertune
Suggested indexes (internal “what-if” API w/o actual execution):
● (useful: pgHero, POWA, HypoPG, dexter, plantuner)
Conduct experiments:
● pgreplay to replay logs (it expects a different log_line_prefix – you need to handle that)
● EC2 spot instances
Machine learning:
● MADlib
This pipeline is the basis for Nancy.
Caveats. pgBadger:
● grouping queries could be implemented better (see pg_query)
● makes all queries lower-cased (hurts "camelCased" names)
● doesn’t really support plans (auto_explain)
Also, pgreplay and pgBadger are not friends –
they require different log formats.
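To make the log-format friction concrete, a sketch of the two invocations (flag spellings from the tools’ documentation of that era – verify with --help):

    # pgBadger can parse CSV logs and emit its extended report as JSON
    pgbadger -f csv -x json -o report.json postgresql.csv
    # pgreplay also reads CSV logs, but expects a different logging configuration;
    # -f = parse only, writing a replay file instead of replaying immediately
    pgreplay -f -c -o workload.bin postgresql.csv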
Part 3. Postgres.ai basics
Already automated:
● Setup/tune hardware, OS, FS
● Provision Postgres instances
● Create replicas
● High Availability:
detect failures and switch to replicas
● Create backups
● Basic monitoring
Little to zero level of automation:
● Postgres parameters tuning
● Query analysis and optimization
● Index set optimization
● Detailed monitoring
● Verify optimization ideas
● Benchmarks
● Regression & performance CI-like testing
⇒ these can be done with Database Experiments
What is a Database Experiment?
A Database Experiment is a set of actions
to perform deep SQL query analysis
for a specified database
against a specified workload
in a specified environment
with an optional change of the database and environment (called a “delta”).
An experiment may consist of one or more experimental runs.
To analyze the impact of some delta, we need at least two runs, one of them
being a “clean run” (without the delta).
What is a Database Experiment?
The input of an experimental run:
● Environment
○ Location (on-premise or GCP or AWS), hardware (CPU, RAM, disks)
○ System (OS, file system)
○ Postgres version
○ Postgres configuration
● Database snapshot. Can be:
○ A dump (regular or in directory format)
○ A physical archive (pg_basebackup or pgBackRest/WAL-E/WAL-G/…)
○ A replica promoted for experiments
○ Some synthetic one, a generated database (“create table as …”, pgbench -i, etc)
● Workload. Can be:
○ Synthetic (custom SQL), single-threaded
○ Synthetic (custom SQL), multi-threaded (with pgbench)
○ “Real workload” (based on logs)
● [Optional] Delta:
○ Configuration change(s) (e.g.: shared_buffers = 16GB)
○ Some DDL (e.g.: `create index …`). The corresponding “undo” DDL (e.g.: `drop index …`) is required in this
case to enable serialization of experiments
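Looking ahead to Nancy CLI (Part 4), one experimental run maps onto a single CLI call. A sketch with option names as they appeared in the Nancy README at the time (paths and values are placeholders; check `nancy run --help`):

    nancy run \
      --run-on aws --aws-ec2-type i3.large \
      --db-dump /path/to/dump.sql.gz \
      --workload-real /path/to/workload.bin \
      --delta-config "shared_buffers = 16GB" \
      --delta-ddl-do "create index i_hypothesis on t1 (col1)" \
      --delta-ddl-undo "drop index i_hypothesis"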
What is a Database Experiment?
The output of an experimental run:
● the contents of basic pg_stat_*** (e.g. pg_stat_user_tables)
● the contents of pg_stat_statements
● the contents of pg_stat_kcache
● the PostgreSQL detailed log (with auto_explain turned on)
● the pgBadger extended report in JSON format
● the Postgres config at the time of applying the workload
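Most of these artifacts can be captured with plain psql once the workload finishes; a sketch:

    psql -d mydb -c "COPY (SELECT * FROM pg_stat_statements) TO STDOUT CSV HEADER" > pg_stat_statements.csv
    psql -d mydb -c "COPY (SELECT * FROM pg_stat_user_tables) TO STDOUT CSV HEADER" > pg_stat_user_tables.csv
    psql -d mydb -c "SHOW ALL" > postgres_config.txt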
AI-based cloud-friendly platform to automate database administration
Steve
AI-based expert in capacity planning and
database tuning
Joe
AI-based expert in query optimization and
Postgres indexes
Nancy
AI-based expert in database experiments.
Conducts experiments and presents
results to human and artificial DBAs
Sign up for early access:
http://Postgres.ai
Demo 1
Postgres.AI GUI live demonstration
Postgres.AI architecture (diagram):
● Metastorage + GUI
● Databases being observed
● AWS S3 (dump/backup)
● nancy prepare-workload
● Nancy, Steve, Joe – on AWS EC2 docker machines
● Nancy CLI
A human engineer can use:
● GUI
● CLI
● Chat
Part 4. Nancy CLI
Demo 2
Nancy CLI live demonstration (local + AWS)
Meet Nancy CLI (open source)
Nancy CLI https://github.com/postgres-ai/nancy
● custom docker image (Postgres with extensions & tools)
● nancy prepare-workload to convert Postgres logs (currently .csv only)
into a binary workload file
● nancy run to run experiments
● able to run locally (on any machine) or in an EC2 spot instance (low price!),
including i3.*** instances (with NVMe)
● fully automated management of EC2 spots
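On the source server, the .csv log that nancy prepare-workload consumes can be produced with standard Postgres settings; a sketch (logging every statement is expensive – do it on a clone/replica or for a limited time window):

    psql -c "ALTER SYSTEM SET log_destination = 'csvlog'"
    psql -c "ALTER SYSTEM SET logging_collector = on"          # requires a restart
    psql -c "ALTER SYSTEM SET log_min_duration_statement = 0"  # log every statement
    psql -c "SELECT pg_reload_conf()"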
What’s inside the docker container?
Source: https://github.com/postgres-ai/nancy/tree/master/docker
Image: https://hub.docker.com/r/postgresmen/postgres-with-stuff/
Inside:
● Ubuntu 16.04
● Postgres (now 9.6 or 10)
● postgres_dba (for manual debugging)
● pg_stat_statements enabled
● auto_explain enabled (all queries, with timing)
● pgreplay
● pgBadger
● pg_stat_kcache (soon)
● additional utilities
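The “all queries, with timing” part corresponds to settings like these in the image’s postgresql.conf; a sketch, not the image’s exact file:

    cat >> "$PGDATA/postgresql.conf" <<'EOF'
    shared_preload_libraries = 'pg_stat_statements, auto_explain'
    auto_explain.log_min_duration = 0   # log the plan of every statement
    auto_explain.log_analyze = on       # EXPLAIN ANALYZE, not just EXPLAIN
    auto_explain.log_timing = on        # include per-node timing
    EOF
    # shared_preload_libraries requires a server restart to take effect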
Part 5. The future of Nancy CLI
Various ways to create an experimental database
● plain text pg_dump
○ restoration is very slow (only 1 vCPU utilized)
○ “logical” – physical structure is lost (cannot experiment with bloat, etc)
○ small (if compressed)
○ “snapshot” only
● pg_dump with either -Fd (“directory”) or -Fc (“custom”):
○ restoration is faster (multiple vCPUs, -j option; see the sketch after this list)
○ “logical” (again: bloat, physical layout is “lost”)
○ small (because compressed)
○ “snapshot” only
● pg_basebackup + WALs, point-in-time recovery (PITR), possibly with help from WAL-E, WAL-G, pgBackRest
○ less reliable, sometimes there are issues (especially if 3rd-party tools are involved – e.g. WAL-E & WAL-G don’t
support tablespaces, there are bugs sometimes, etc)
○ “physical”: bloat and physical structure is preserved
○ not small – ~ size of the DB
○ can “walk in time” (PITR)
○ requires warm-up procedure (data is not in the memory!)
● AWS RDS: create a replica + promote it
○ no Spots :-/
○ Lazy Load is tricky (it looks like the DB is there but it’s very slow – warm-up is needed)
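A sketch of the second option, usually the best speed/size trade-off when a “snapshot only” logical copy is acceptable:

    pg_dump -Fd -j 8 -f /backups/mydb.dump mydb             # parallel dump, directory format
    createdb mydb_experiment
    pg_restore -j 8 -d mydb_experiment /backups/mydb.dump   # parallel restore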
How can we speed up experimental runs?
● Prepare the EC2 instance(s) in advance and keep them
● Prepare EBS volume(s) only (perhaps using an instance of a different
type) and keep them ready. When attached to a new instance, do a warm-up
(see the pg_prewarm sketch below)
● Resource re-usage:
○ reuse the docker container
○ reuse the EC2 instance
○ serialize experimental runs (DDL Do/Undo; VACUUM FULL; cleanup)
● Partial database snapshots (dump/restore only needed tables)
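One way to do the warm-up mentioned above is the pg_prewarm contrib module; a sketch that loads all ordinary tables and indexes of the public schema into the cache:

    psql -d mydb <<'SQL'
    CREATE EXTENSION IF NOT EXISTS pg_prewarm;
    SELECT pg_prewarm(c.oid::regclass)
    FROM pg_class c
    JOIN pg_namespace n ON n.oid = c.relnamespace
    WHERE n.nspname = 'public' AND c.relkind IN ('r', 'i');
    SQL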
The future development of Nancy CLI
● Speed up DB creation
● Support GCP
● More artifacts delivered: pg_stat_kcache, etc
● nancy see-report to print the summary + top-30 queries
● nancy compare-reports to print the “diff” for 2+ reports (the summary + numbers for
top-30 queries, ordered by total time based on the 1st report)
● Postgres 11
● pgbench -i for database initialization
● pgbench to generate multithreaded synthetic workload
● Workload analysis: automatically detect “N+1 SELECT” when running workload
● Better support for the serialization of experimental runs
● Better support for multiple runs:
○ interval with step
○ gradient descent
● Provide cost estimation (time + money)
● Rewrite in Python or Go
Contributions welcome!
Thank you!
Nikolay Samokhvalov
ru@postgresql.org
twitter: @postgresmen
Postgres.ai
https://github.com/postgres-ai/nancy