Think Your Postgres Backups
And Disaster Recovery Are
Safe? Let's talk.
Payal Singh
Database Administrator
OmniTI
1
Who am I?
DBA@OmniTI
Github: payals
Blog: http://penningpence.blogspot.com
Twitter: @pallureshu
Email: payal@omniti.com
2
Agenda
Types of backups
Validation
Backups management
Automation
Filesystem
Third party tools
In-house solution
3
Types of Backups
4
Business Continuity Objectives
Recovery Point Objective (RPO)
Recovery Time Objective (RTO)
5
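As a rough illustration of how an RPO can be checked against a backup schedule, the sketch below lists the gaps between consecutive backups that exceed a target RPO. The function name `rpo_violations` and the sample timestamps are assumptions made for illustration, not part of the talk.

```python
from datetime import datetime, timedelta

def rpo_violations(backup_times, rpo):
    """Return (previous, next, gap) for every gap between consecutive backups longer than rpo."""
    ordered = sorted(backup_times)
    violations = []
    for prev, nxt in zip(ordered, ordered[1:]):
        gap = nxt - prev
        if gap > rpo:
            violations.append((prev, nxt, gap))
    return violations

if __name__ == "__main__":
    # Hypothetical schedule: nightly backups with one missed run on 2015-03-03.
    times = [datetime(2015, 3, 1, 2), datetime(2015, 3, 2, 2), datetime(2015, 3, 4, 2)]
    for prev, nxt, gap in rpo_violations(times, timedelta(hours=24)):
        print(f"RPO exceeded between {prev} and {nxt}: {gap}")
```

With backups alone (no WAL archiving), the worst-case data loss is the longest gap between two usable backups, which is why a missed run matters.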
Types of Backups
● Logical
  ○ pg_dump
  ○ pg_dumpall
● Physical (File-system Level)
  ○ Online
    ▪ Most commonly used
    ▪ No downtime
  ○ Offline
    ▪ Rarely used
    ▪ Shut down database for backup
6
Logical Backups
Advantages:
Granularity
Fine-tuned restores
Multiple in-built compression types
Ease of use, no extra setup required
Disadvantages:
Relatively slower
Frozen snapshots in time, i.e. no PITR
Locks
Data spread in time
7
pg_dumpall
Pros:
• Globals
Cons:
● Requires superuser privileges
● No granularity while restoring
● Plaintext only
● Cannot take advantage of faster/parallel restore with pg_restore
8
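One common way to work around the plaintext-only limitation is to use pg_dumpall only for globals and pg_dump's custom format for each database, so that pg_restore can restore selectively or in parallel. The sketch below only builds the command lines; the helper names and file names are hypothetical, and the `postgres` user mirrors the other examples in this deck.

```python
import shlex

def globals_dump_cmd(outfile):
    """pg_dumpall --globals-only dumps roles and tablespaces, not database contents."""
    return ["pg_dumpall", "-U", "postgres", "--globals-only", "-f", outfile]

def database_dump_cmd(dbname, outfile):
    """-Fc selects pg_dump's custom format, which pg_restore can read selectively."""
    return ["pg_dump", "-U", "postgres", "-Fc", "-f", outfile, dbname]

def restore_cmd(dbname, dumpfile, jobs=4, table=None):
    """Restore a custom-format dump with parallel jobs, optionally limited to one table."""
    cmd = ["pg_restore", "-U", "postgres", "-d", dbname, "-j", str(jobs)]
    if table is not None:
        cmd += ["-t", table]
    return cmd + [dumpfile]

if __name__ == "__main__":
    for cmd in (globals_dump_cmd("globals.sql"),
                database_dump_cmd("appdb", "appdb.dump"),
                restore_cmd("appdb", "appdb.dump", table="orders")):
        print(shlex.join(cmd))
```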
Physical Backups
Advantages:
Faster
Incremental Backups
Point In Time Recovery
Compression by default on certain file systems
Disadvantages:
Lacks granularity
9
Why do you need both?
The more the merrier:
Database wide restores - Filesystem restore is faster!
Someone dropped a table? - Dump backups are faster!
10
pg_basebackup
Advantage:
• In core postgres
• No explicit backup mode required
• Multiple instances can run concurrently
• Backups can be made on master as well as standby
Disadvantage:
• Slower when compared to snapshot file-system technologies
• Backups are always of the entire cluster
11
pg_basebackup Requirements
• A superuser or a user with the REPLICATION privilege can be used
• archive_mode should be set to on (archive_command itself is a shell command, not a boolean)
• max_wal_senders should be at least 1
12
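A pre-flight check along these lines can catch a misconfigured server before a backup job runs. This is a minimal sketch: the function name and the plain `settings` dictionary (for example, filled from `SHOW` output) are assumptions, and the `wal_level` check is an addition beyond the slide, based on PostgreSQL refusing online backups when `wal_level` is `minimal`.

```python
def basebackup_problems(settings):
    """Return a list of human-readable problems; an empty list means the checks passed."""
    problems = []
    if int(settings.get("max_wal_senders", 0)) < 1:
        problems.append("max_wal_senders should be at least 1")
    if settings.get("wal_level", "minimal") == "minimal":
        problems.append("wal_level must be higher than minimal for an online backup")
    return problems

if __name__ == "__main__":
    # Hypothetical values, as strings the way SHOW would return them.
    for problem in basebackup_problems({"max_wal_senders": "0", "wal_level": "minimal"}):
        print("WARNING:", problem)
```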
Delayed Replicas as Backups
• Useful when accidental damage is noticed within a short period of time,
before the delayed replica has replayed it.
• Recovery may still lose some data; how much that matters depends on the
importance of the table, the objects that depend on it, etc.
13
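Whether a delayed replica can help comes down to simple arithmetic: the damage must be noticed (and replay paused) before the configured delay has elapsed. The sketch below assumes the delay is set with `recovery_min_apply_delay`, available since PostgreSQL 9.4; the function name and timestamps are hypothetical.

```python
from datetime import datetime, timedelta

def replica_still_clean(damage_at, noticed_at, apply_delay):
    """True if the delayed replica has not yet replayed the damaging change."""
    return noticed_at - damage_at < apply_delay

if __name__ == "__main__":
    delay = timedelta(hours=1)          # e.g. recovery_min_apply_delay = '1h'
    damage = datetime(2015, 3, 1, 12, 0)
    print(replica_still_clean(damage, datetime(2015, 3, 1, 12, 40), delay))
    print(replica_still_clean(damage, datetime(2015, 3, 1, 13, 30), delay))
```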
Version Controlled DDL
• Committing daily changes to schema design, functions, triggers etc. into
source control
• Can go back to any state
• pg_extractor (https://github.com/omniti-labs/pg_extractor) facilitates
this by creating files for each database object, followed by integration
with git, svn, etc.
14
Validation
15
Continuous Restores
Validation is important
Estimates / Expectations
Procedure / External factors
Does it even work?
Development Database
Routine refresh + restore testing
Reporting Databases:
Overnight restores for reporting databases refreshed daily. Great
candidate for daily validation.
16
RTO
Sample Validation Log
2015-03-01 10:00:03 : Testing backup for 2015-02-28 from
/data/backup/hot_backup/+/var/log/test/bin/test_backups.sql
2015-03-01 10:00:03 : Starting decompress of
/data/backup/hot_backup/test.local-data-2015-02-28.tar.gz
2015-03-01 10:00:03 : Starting decompress of
/data/backup/hot_backup/test.local-xlog-2015-02-28.tar.gz
2015-03-01 10:00:03 : Waiting for both decompressing processes to finish
2015-03-01 14:36:06 : Decompressing worked, generated directory test with
size: 963G
2015-03-01 14:36:06 : Starting PostgreSQL
server starting
2015-03-01 14:36:07.372 EST @ 3282 LOG: loaded library
"pg_scoreboard.so"
2015-03-01 15:52:36 : PostgreSQL started
2015-03-01 15:52:36 : Validating Database
starting Vacuum
2015-03-02 08:17:56 : Test result: OK
2015-03-02 08:18:11 : All OK.
17
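A validation log like the one above doubles as an RTO measurement. The sketch below parses a few of its timestamped lines and reports how long each phase took; the function names are hypothetical, and lines in other formats (such as the PostgreSQL server log line) are skipped.

```python
from datetime import datetime

# Excerpt of the sample validation log, one milestone per line.
LOG = """\
2015-03-01 10:00:03 : Waiting for both decompressing processes to finish
2015-03-01 14:36:06 : Starting PostgreSQL
2015-03-01 15:52:36 : Validating Database
2015-03-02 08:18:11 : All OK.
"""

def parse_events(text):
    """Return (timestamp, message) for lines shaped like 'YYYY-MM-DD HH:MM:SS : message'."""
    events = []
    for line in text.splitlines():
        stamp, sep, message = line.partition(" : ")
        if not sep:
            continue
        try:
            events.append((datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), message))
        except ValueError:
            continue  # timestamp in a different format, not one of our milestones
    return events

def phase_durations(events):
    """Duration of each phase, measured as the time until the next logged event."""
    return [(msg, nxt - ts) for (ts, msg), (nxt, _) in zip(events, events[1:])]

if __name__ == "__main__":
    events = parse_events(LOG)
    for message, duration in phase_durations(events):
        print(f"{duration}  {message}")
    print("total:", events[-1][0] - events[0][0])
```

For this log, decompression takes a little over four and a half hours and the whole test a little over 22 hours, which is the kind of figure an RTO estimate should be based on.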
Where did GitLab go wrong?
18
No PITR, no alerts for missing backups, no documentation for backup
location, no restore testing
“…backups…taken once per 24 hours…not yet been able to figure
out where they are stored…don’t appear to be working...”
No retention policy
“Fog gem may have cleaned out older backups”
No monitoring
“Our backups to S3 apparently don’t work either: the bucket is
empty”
No RPO
“…Unless we can pull these from the past 24 hours they will be
lost”
Sobering Thought of the Day
19
A chain is only as strong as the weakest link
Backup Management
20
Retention Period
Remove all older than x days
VS
Remove all but latest x backups
Off-server (short term)
VS
Off-site retention period (long term)
21
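The two retention rules behave very differently when backups silently stop. The sketch below implements both on a list of backup dates; the function names and dates are made up for illustration.

```python
from datetime import date, timedelta

def expired_by_age(backup_dates, today, max_age_days):
    """'Remove all older than x days': everything before the cutoff is expired."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(d for d in backup_dates if d < cutoff)

def expired_by_count(backup_dates, keep_latest):
    """'Remove all but latest x backups': everything except the newest keep_latest."""
    ordered = sorted(backup_dates)
    return ordered[:-keep_latest] if keep_latest > 0 else ordered

if __name__ == "__main__":
    # Hypothetical: the last successful backup was taken on 2015-02-12.
    backups = [date(2015, 2, 10), date(2015, 2, 11), date(2015, 2, 12)]
    today = date(2015, 2, 28)
    print("age-based removes:  ", expired_by_age(backups, today, 7))
    print("count-based removes:", expired_by_count(backups, 2))
```

In this example the age-based rule would delete every remaining backup, while the count-based rule still keeps the two newest ones. An age-based policy is therefore safest when paired with an alert for missing backups.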
Security
Transfer
One way passwordless SSH access
HTTPS for cloud uploads (e.g. s3cmd)
Storage
Encryption
Access control
PCI Compliance
Private keys
Logical backups preferred
Multi-Tenancy Environment
Backup Server → Client
Client → Backup server
22
Monitoring and Alerts
Alert at runtime
• To detect errors at runtime within script, or change in system/user
environments
• Immediate
• Cron emails
Alert on storage/retention:
• Error in retention logic/implementation
• Secondary/Delayed
• Cron’d script, external monitoring
Alert on backup validation:
• On routine refreshes/restores
23
S3 Backups
Ideal for long term backups:
• Cheap
• Ample space
• Transfer in parts, can pick up where you left off in case of errors
• Speed depends on bandwidth
Secure:
• Buckets, Users, Policies
Communication:
• AWS cli / s3cmd
• Access keys
• PGP encryption
24
S3 Backups Contd.
Sample Bucket Policy:
{
  "Version": "2012-10-17",
  "Id": "S3-Account-Permissions",
  "Statement": [
    {
      "Sid": "1",
      "Effect": "Allow",
      "Principal": {
        "AWS": "arn:aws:iam::<account_number>:user/omniti_testing"
      },
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::omniti_testing",
        "arn:aws:s3:::omniti_testing/*"
      ]
    }
  ]
}
25
S3 Backups Contd.
Sample User Policy:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": "s3:*",
      "Resource": [
        "arn:aws:s3:::omniti_testing",
        "arn:aws:s3:::omniti_testing/*"
      ]
    }
  ]
}
26
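The sample policies grant `s3:*`. For a host that only uploads backups, a narrower policy is one possible hardening step, so that the uploading credentials cannot issue deletes. The sketch below generates such a policy; the function name is hypothetical, and the exact set of actions a given upload tool needs is an assumption that should be verified against that tool.

```python
import json

def upload_only_policy(bucket):
    """Build an IAM user policy that allows uploading to and listing one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/*"],
            },
            {
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
        ],
    }

if __name__ == "__main__":
    print(json.dumps(upload_only_policy("omniti_testing"), indent=2))
```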
Automation
27
Why Automate?
Reduced Chance of Human Error
pg_dump backups
filesystem level backups
retention scripts
backup monitoring and alerts
off-site and off-server transfer as well as storage
access channels
security keys
cronjobs…
Less work, faster
Move scripts, swap crontabs, ensure all accesses exist
Uniformity, Reliability
28
What to Automate?
Scripts
Crontabs
Accesses
Validation/restores
29
Automation Example In Chef
payal@payal-ThinkPad-T520$ ls -l s3_backup/
drwxr-xr-x 2 payal payal 4096 Dec 2 11:03 attributes
drwxr-xr-x 3 payal payal 4096 Dec 2 11:03 files
-rw-r--r-- 1 payal payal 1351 Dec 2 11:03 README.md
drwxr-xr-x 2 payal payal 4096 Dec 2 11:03 recipes
drwxr-xr-x 3 payal payal 4096 Dec 2 11:03 templates
payal@payal-ThinkPad-T520 $ ls -l s3_backup/templates/default/
-rwxr-xr-x 1 payal payal 3950 Dec 2 11:03 backup.sh.erb
-rw-r--r-- 1 payal payal 1001 Jan 12 12:40 check_offsite_backups.sh.erb
-rw-r--r-- 1 payal payal 37 Dec 2 11:03 pg_dump_backup.conf.erb
-rw-r--r-- 1 payal payal 19 Dec 2 11:03 pg_dump.conf.erb
-rw-r--r-- 1 payal payal 6319 Dec 2 11:03 pg_dump_upload.py.erb
-rw-r--r-- 1 payal payal 1442 Dec 2 11:03 pg_start_backup.sh.erb
-rw-r--r-- 1 payal payal 1631 Dec 2 11:03 s3cfg.erb
30
Automation Example in Ansible
31
- name: "Use pg_dump for logical backup"
  shell: "pg_dump -U postgres -Fc -f {{ db_name }}_{{ today }}.bak {{ db_name }} > backup.log 2>&1"
  args:
    chdir: "{{ backup_dir }}"
  when: not only_upload_backup

- name: "upload backup file"
  s3: region="ap-southeast-1" bucket="sample" object="/DB_bucket/{{ db_name }}_{{ today }}.bak" src="{{ backup_dir }}/{{ db_name }}_{{ today }}.bak" mode=put
https://github.com/payals/postgresql_automation/tree/master/backups
Filesystem
32
Can Filesystems enhance DR?
33
ZFS
• Available on Solaris, Illumos. ZFS on Linux is new but improving!
• Built in compression
• Protections against data corruption
• Snapshots
• Copy-on-Write
How can ZFS help you with DR?
34
Scenario 1: You're using pg_upgrade for a
major upgrade with the hard link option (-k),
and it fails partway through. Because you used
hard links, you cannot start the old cluster
again. What do you do?
ZFS Rollback – instantaneously roll a
dataset back to a previous state from a snapshot
sudo zfs rollback <snapshot_name>
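The rollback only helps if a snapshot was taken before the upgrade started. A minimal sketch of that sequence follows, building the `zfs` commands and printing them in dry-run mode; the dataset name `tank/pgdata`, the snapshot label, and the helper names are assumptions.

```python
import subprocess

def snapshot_cmd(dataset, label):
    """Take a snapshot named dataset@label (do this before running pg_upgrade)."""
    return ["zfs", "snapshot", f"{dataset}@{label}"]

def rollback_cmd(dataset, label):
    """Roll the dataset back to dataset@label (PostgreSQL should be stopped first)."""
    return ["zfs", "rollback", f"{dataset}@{label}"]

def run(cmd, dry_run=True):
    """Print the command in dry-run mode, otherwise execute it and fail loudly on error."""
    if dry_run:
        print(" ".join(cmd))
        return 0
    return subprocess.run(cmd, check=True).returncode

if __name__ == "__main__":
    run(snapshot_cmd("tank/pgdata", "pre_upgrade"))
    # ... pg_upgrade -k fails here ...
    run(rollback_cmd("tank/pgdata", "pre_upgrade"))
```

Note that `zfs rollback` without `-r` only accepts the most recent snapshot of the dataset; rolling back further requires `-r`, which destroys the newer snapshots.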
How can ZFS help you with DR?
35
Scenario 2: You've accidentally deleted a large
table in a very large database where taking full
backups is infeasible. Is there a faster
alternative to a filesystem restore? ZFS rollback
would overwrite pgdata, so you cannot use that.
ZFS Clone – Copy-on-Write. Instantaneously
clone a dataset from a snapshot without any
additional space consumption as long as you
don't write to it.
sudo zfs clone <snapshot_name> <new_dataset_name>
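One way to use the clone is to start a second PostgreSQL instance on it and copy the lost table back into the live database. The sketch below only builds the commands involved; the dataset names, mountpoint, port, and helper names are all hypothetical, and starting the second instance on the clone is left out.

```python
def clone_cmd(snapshot, clone_dataset, mountpoint):
    """zfs clone takes a snapshot and a new dataset name; -o sets the clone's mountpoint."""
    return ["zfs", "clone", "-o", f"mountpoint={mountpoint}", snapshot, clone_dataset]

def copy_table_cmds(table, dbname, clone_port):
    """Dump one table from the instance running on the clone, to be piped into psql."""
    dump = ["pg_dump", "-p", str(clone_port), "-t", table, dbname]
    load = ["psql", "-d", dbname]
    return dump, load

def destroy_cmd(clone_dataset):
    """Remove the clone once the table has been recovered."""
    return ["zfs", "destroy", clone_dataset]

if __name__ == "__main__":
    print(" ".join(clone_cmd("tank/pgdata@nightly", "tank/pgdata_recover", "/recover/pgdata")))
    dump, load = copy_table_cmds("orders", "appdb", 5433)
    print(" ".join(dump), "|", " ".join(load))
    print(" ".join(destroy_cmd("tank/pgdata_recover")))
```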
External Tools
36
OmniPITR
37
PITR backups and archives
Backups off of master or standby
Built in backup integrity checks
Quick setup, Minimal requirements
Comprehensive logging
Built in retention
Ideal for infrastructures of all sizes
Barman
38
PITR backups and restores
Automatic retention, continuous recovery
Requires its own server
More suitable for larger infrastructures to manage multiple clusters and
multiple locations
Simplicity
Minimal knowledge of PITR inner workings required
Wrappers for recovery process
Supports backups from both primary and standby
wal-e
39
PITR backups and restores
Automatic retention, continuous recovery
Minimal setup
Cloud integration:
AWS, Azure, Google, Swift
In House Solution
40
Logical
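import os

# config_file, pg_dump_path, dump_file_path and start_time are set elsewhere in the script
def take_dump():
    try:
        with open(config_file, 'r') as f:
            for db in f:
                db = db.strip()
                if not db:
                    continue
                name = db.split()[-1]
                base = dump_file_path + name + "_" + start_time
                # -Fc writes pg_dump's custom format; stderr is appended to a per-database log
                dump_command = (pg_dump_path + " -U postgres -v -Fc -f " + base + ".sql"
                                + " " + db + " 2>> " + base + ".log")
                # os.system does not raise on failure, so check the exit status
                if os.system(dump_command) != 0:
                    print('ERROR: bash command did not execute properly for ' + name)
                else:
                    print('backup of ' + name + ' completed successfully')
    except OSError:
        print('ERROR: could not read ' + config_file)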
41
Physical
pg_basebackup:
pg_basebackup -D pgdata -F format --xlogdir -Xs -c fast -p …
ZFS snapshots:
ZFS restore from snapshot
ZFS rollback to snapshot after failed upgrade or maintenance task
…
read_params "$@"
if [[ -z ${OFFLINE} ]]
then
    postgres_start_backup
    zfs_snap
    postgres_stop_backup
else
    zfs_snap
fi
backup
zfs_clear_snaps
42
S3 Backups
Basic steps:
check_lock()
take_dump()
gpg_encrypt()
s3_upload()
move_files()
cleanup()
43
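The basic steps can be wired together with a small runner that takes a lock, executes each step in order, and stops at the first failure so that a broken dump is never encrypted or uploaded. This is a sketch under assumptions: `run_steps` and the lock-file approach are illustrative, and the step functions themselves are stand-ins for the ones named on the slide.

```python
import os

def run_steps(steps, lock_path):
    """Run (name, callable) steps in order; stop at the first failure and report what ran."""
    try:
        # O_EXCL makes creation fail if the lock file already exists (another run is active).
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return ["check_lock: another run is in progress"]
    os.close(fd)
    log = []
    try:
        for name, step in steps:
            try:
                step()
                log.append(f"{name}: ok")
            except Exception as exc:
                log.append(f"{name}: failed ({exc})")
                break
    finally:
        os.remove(lock_path)  # always release the lock, even after a failure
    return log

if __name__ == "__main__":
    # Stand-in steps; real ones would call pg_dump, gpg and the S3 client.
    steps = [(name, lambda: None) for name in
             ("take_dump", "gpg_encrypt", "s3_upload", "move_files", "cleanup")]
    for line in run_steps(steps, "s3_backup.lock"):
        print(line)
```

The returned log is a natural input for the runtime alerts discussed earlier: anything other than a full list of "ok" entries is worth an email.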
Redundancy is Good
“That Sunday, Thomas deleted remotely stored backups and turned off the automated
backup system. He made some changes to VPN authentication that basically locked
everybody out, and turned off the automatic restart. He deleted internal IT wiki pages,
removed users from a mailing list, deactivated the company's pager notification system,
and a number of other things that basically created a huge mess that the company spent
the whole of Monday sorting out (it turned out there were local copies of the deleted
backups).”
https://www.theregister.co.uk/2017/02/23/michael_thomas_appeals_conviction/
44
S3 Backups Monitoring
Sample:
# Get date of the latest file in the S3 bucket
latest_backup=$(s3cmd ls s3://omniti_client/ | awk '{print $1}' | sort | tail -1)
today=$(date --date=today "+%Y-%m-%d")
yesterday=$(date --date=yesterday "+%Y-%m-%d")
# If the latest backup is older than yesterday, email the DBAs
# (dates are strings, so compare with != rather than the integer test -ne)
if [ "$yesterday" != "$latest_backup" ] && [ "$today" != "$latest_backup" ]
then
    echo "Offsite Backups Missing for client" | mailx -s "Offsite Backup Missing for Client" dba@omniti.com
fi
45
Backup Documentation
• Detailed backup information - types, locations, retention periods
• Procedures to setup new machines
• Analysis and estimation time for recovery
• Ways to recover
46
References
http://www.postgresql.org/docs/9.4/static
https://github.com/omniti-labs/omnipitr
http://docs.pgbarman.org
http://docs.chef.io/
https://github.com/omniti-labs/pgtreats
https://github.com/omniti-labs/pg_extractor
http://www.druva.com/blog/understanding-rpo-and-rto
http://evol-monkey.blogspot.co.uk/2013/08/postgresql-backup-and-recovery.html
47
Questions?
48
Payal Singh
payal@omniti.com
OmniTI Computer Consulting Inc.
