SlideShare a Scribd company logo
GitLab PostgresMortem:
Lessons Learned
Alexey Lesovsky
alexey.lesovsky@dataegret.com
dataegret.com
31 January Events
Failure's key points
Preventative measures
https://goo.gl/GO5rYJ
02
03
01
31 January
events
01
17:20 - an LVM snapshot of the production db was taken.
19:00 - database load increased due to spam.
23:00 - secondary's replication process started to lag behind.
23:30 - PostgreSQL database directory was wiped.
31 January events01
dataegret.com
Failure's
key points
02
1.LVM snapshots and staging provisioning.
2.When a replica start to lag.
3.Do pg_basebackup properly – part 1.
4.max_wal_senders was exceeded, but how?
5.max_connections = 8000.
6.pg_basebackup «stuck» – do pg_basebackup properly – part 2.
7.strace: good thing in wrong place.
8.rm or not rm?
9.A bit about backup.
10.Different PG versions on the production.
11.Broken mail.
31 January events02
dataegret.com
Snapshot impact on underlying storage.
Provisioning from backup.
Staging based on LVM snapshots02
dataegret.com
Re-initialize the standby.
Monitoring with pg_stat_replication.
Use wal_keep_segments while troubleshooting.
Use WAL archive.
When a replica started to lag02
dataegret.com
Do pg_basebackup into clean directory.
Remove «unnecessary» directory.
Use mv instead of rm.
Do pg_basebackup properly. Part 102
dataegret.com
There was only one standby (which was failed).
Increase max_wal_senders.
Check who has stolen connections.
The limit was exceeded by concurrent pg_basebackups.
max_wal_senders was exceeded.02
dataegret.com
More than 500 is bad idea.
Use pgbouncer to reduce the number of server connections.
max_connections = 800002
dataegret.com
Don't run more than one pg_basebackups.
It didn't stuck, it waited for the checkpoint.
Use «-c» option to make fast checkpoint.
Do pg_basebackup properly. Part 202
dataegret.com
Strace isn't a good tool in that case.
Use strace for system errors tracing.
Check stack trace from /proc/<pid>/stack or GDB.
Good things in wrong place.02
dataegret.com
Data directory was cleaned with rm.
Use mv instead of rm.
rm or not rm02
dataegret.com
Daily pg_dump.
Daily LVM snapshot.
Daily Azure snapshot.
PostgreSQL streaming replication.
Basebackup with WAL archive.
A bit about backup02
dataegret.com
Clean out old packages after major upgrade.
Different versions on a production02
dataegret.com
Setup cron, but forgot notifications.
Use reliable notification systems.
Different versions on a production02
dataegret.com
Preventative
measures
03
1. Update PS1 across all hosts to more clearly differentiate between hosts and environments.
2. Prometheus monitoring for backups.
3. Set PostgreSQL's max_connections to a sane value.
4. Investigate Point in time recovery & continuous archiving for PostgreSQL.
5. Hourly LVM snapshots of the production databases.
6. Azure disk snapshots of production databases.
7. Move staging to the ARM environment.
8. Recover production replica(s).
9. Automated testing of recovering PostgreSQL database backups.
10.Improve PostgreSQL replication documentation/runbooks.
11.Investigate pgbarman for creating PostgreSQL backups.
12.Investigate using WAL-E as a means of Database Backup and Realtime Replication.
13.Build Streaming Database Restore.
14.Assign an owner for data durability.
Different versions on a production03
dataegret.com
1. Update PS1 across all hosts.
Looks OK.
2. Prometheus monitoring for backups.
Size, number, age and recovery status.
3. Set PostgreSQL's max_connections to a sane value.
Better use pgbouncer.
4. Investigate PITR & continuous archiving for PostgreSQL.
Yes, as the part of the backup.
Preventative measures03
dataegret.com
5. Hourly LVM snapshots of the production databases.
Looks unnecessary.
6. Azure disk snapshots of production databases.
Looks unnecessary.
7. Move staging to the ARM environment.
Very and very suspicious.
8. Recover production replica(s).
Do that asap.
Preventative measures03
dataegret.com
9. Automated testing of recovering database backups.
YES!
10. Improve documentation/runbooks.
You need a bureaucrat.
11. Investigate pgbarman.
Looks OK, Barman is stable and reliable.
12. Investigate using WAL-E.
Looks OK, WAL-E is the «setup and forget».
Preventative measures03
dataegret.com
13. Build Streaming Database Restore.
Corresponds with p.9.
14. Assign an owner for data durability.
Hire a DBA.
Preventative measures03
dataegret.com
Check and monitor backups.
Create an emergency instructions.
Learn to use tools properly.
Lessons learned03
dataegret.com
Postmortem of database outage of January 31
https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/
PostgreSQL Statistics Collector: pg_stat_replication view
https://www.postgresql.org/docs/current/static/monitoring-stats.html#PG-STAT-REPLICATION-VIEW
pg_basebackup utility
https://www.postgresql.org/docs/current/static/app-pgbasebackup.html
PostgreSQL Replication
https://www.postgresql.org/docs/9.6/static/runtime-config-replication.html
PgBouncer
https://pgbouncer.github.io/
https://wiki.postgresql.org/wiki/PgBouncer
Barman
http://www.pgbarman.org/
Links03
dataegret.com
Thanks for watching!
dataegret.com alexey.lesovsky@dataegret.com

More Related Content

What's hot (20)

Как PostgreSQL работает с диском
Как PostgreSQL работает с дискомКак PostgreSQL работает с диском
Как PostgreSQL работает с диском
PostgreSQL-Consulting
 
Managing PostgreSQL with PgCenter
Managing PostgreSQL with PgCenterManaging PostgreSQL with PgCenter
Managing PostgreSQL with PgCenter
Alexey Lesovsky
 
Postgresql database administration volume 1
Postgresql database administration volume 1Postgresql database administration volume 1
Postgresql database administration volume 1
Federico Campoli
 
Advanced Postgres Monitoring
Advanced Postgres MonitoringAdvanced Postgres Monitoring
Advanced Postgres Monitoring
Denish Patel
 
Troubleshooting PostgreSQL Streaming Replication
Troubleshooting PostgreSQL Streaming ReplicationTroubleshooting PostgreSQL Streaming Replication
Troubleshooting PostgreSQL Streaming Replication
Alexey Lesovsky
 
Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.
Alexey Lesovsky
 
Mastering PostgreSQL Administration
Mastering PostgreSQL AdministrationMastering PostgreSQL Administration
Mastering PostgreSQL Administration
EDB
 
Autovacuum, explained for engineers, new improved version PGConf.eu 2015 Vienna
Autovacuum, explained for engineers, new improved version PGConf.eu 2015 ViennaAutovacuum, explained for engineers, new improved version PGConf.eu 2015 Vienna
Autovacuum, explained for engineers, new improved version PGConf.eu 2015 Vienna
PostgreSQL-Consulting
 
Pgcenter overview
Pgcenter overviewPgcenter overview
Pgcenter overview
Alexey Lesovsky
 
Peeking into the Black Hole Called PL/PGSQL - the New PL Profiler / Jan Wieck...
Peeking into the Black Hole Called PL/PGSQL - the New PL Profiler / Jan Wieck...Peeking into the Black Hole Called PL/PGSQL - the New PL Profiler / Jan Wieck...
Peeking into the Black Hole Called PL/PGSQL - the New PL Profiler / Jan Wieck...
Ontico
 
PostgreSQL and RAM usage
PostgreSQL and RAM usagePostgreSQL and RAM usage
PostgreSQL and RAM usage
Alexey Bashtanov
 
Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.
Alexey Lesovsky
 
GUC Tutorial Package (9.0)
GUC Tutorial Package (9.0)GUC Tutorial Package (9.0)
GUC Tutorial Package (9.0)
PostgreSQL Experts, Inc.
 
Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.
Alexey Lesovsky
 
Logical replication with pglogical
Logical replication with pglogicalLogical replication with pglogical
Logical replication with pglogical
Umair Shahid
 
Streaming Replication Made Easy in v9.3
Streaming Replication Made Easy in v9.3Streaming Replication Made Easy in v9.3
Streaming Replication Made Easy in v9.3
Sameer Kumar
 
pgDay Asia 2016 - Swapping Pacemaker-Corosync for repmgr (1)
pgDay Asia 2016 - Swapping Pacemaker-Corosync for repmgr (1)pgDay Asia 2016 - Swapping Pacemaker-Corosync for repmgr (1)
pgDay Asia 2016 - Swapping Pacemaker-Corosync for repmgr (1)
Wei Shan Ang
 
Advanced backup methods (Postgres@CERN)
Advanced backup methods (Postgres@CERN)Advanced backup methods (Postgres@CERN)
Advanced backup methods (Postgres@CERN)
Anastasia Lubennikova
 
pg / shardman: шардинг в PostgreSQL на основе postgres / fdw, pg / pathman и ...
pg / shardman: шардинг в PostgreSQL на основе postgres / fdw, pg / pathman и ...pg / shardman: шардинг в PostgreSQL на основе postgres / fdw, pg / pathman и ...
pg / shardman: шардинг в PostgreSQL на основе postgres / fdw, pg / pathman и ...
Ontico
 
Out of the Box Replication in Postgres 9.4(pgconfsf)
Out of the Box Replication in Postgres 9.4(pgconfsf)Out of the Box Replication in Postgres 9.4(pgconfsf)
Out of the Box Replication in Postgres 9.4(pgconfsf)
Denish Patel
 
Как PostgreSQL работает с диском
Как PostgreSQL работает с дискомКак PostgreSQL работает с диском
Как PostgreSQL работает с диском
PostgreSQL-Consulting
 
Managing PostgreSQL with PgCenter
Managing PostgreSQL with PgCenterManaging PostgreSQL with PgCenter
Managing PostgreSQL with PgCenter
Alexey Lesovsky
 
Postgresql database administration volume 1
Postgresql database administration volume 1Postgresql database administration volume 1
Postgresql database administration volume 1
Federico Campoli
 
Advanced Postgres Monitoring
Advanced Postgres MonitoringAdvanced Postgres Monitoring
Advanced Postgres Monitoring
Denish Patel
 
Troubleshooting PostgreSQL Streaming Replication
Troubleshooting PostgreSQL Streaming ReplicationTroubleshooting PostgreSQL Streaming Replication
Troubleshooting PostgreSQL Streaming Replication
Alexey Lesovsky
 
Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.
Alexey Lesovsky
 
Mastering PostgreSQL Administration
Mastering PostgreSQL AdministrationMastering PostgreSQL Administration
Mastering PostgreSQL Administration
EDB
 
Autovacuum, explained for engineers, new improved version PGConf.eu 2015 Vienna
Autovacuum, explained for engineers, new improved version PGConf.eu 2015 ViennaAutovacuum, explained for engineers, new improved version PGConf.eu 2015 Vienna
Autovacuum, explained for engineers, new improved version PGConf.eu 2015 Vienna
PostgreSQL-Consulting
 
Peeking into the Black Hole Called PL/PGSQL - the New PL Profiler / Jan Wieck...
Peeking into the Black Hole Called PL/PGSQL - the New PL Profiler / Jan Wieck...Peeking into the Black Hole Called PL/PGSQL - the New PL Profiler / Jan Wieck...
Peeking into the Black Hole Called PL/PGSQL - the New PL Profiler / Jan Wieck...
Ontico
 
Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.
Alexey Lesovsky
 
Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.Deep dive into PostgreSQL statistics.
Deep dive into PostgreSQL statistics.
Alexey Lesovsky
 
Logical replication with pglogical
Logical replication with pglogicalLogical replication with pglogical
Logical replication with pglogical
Umair Shahid
 
Streaming Replication Made Easy in v9.3
Streaming Replication Made Easy in v9.3Streaming Replication Made Easy in v9.3
Streaming Replication Made Easy in v9.3
Sameer Kumar
 
pgDay Asia 2016 - Swapping Pacemaker-Corosync for repmgr (1)
pgDay Asia 2016 - Swapping Pacemaker-Corosync for repmgr (1)pgDay Asia 2016 - Swapping Pacemaker-Corosync for repmgr (1)
pgDay Asia 2016 - Swapping Pacemaker-Corosync for repmgr (1)
Wei Shan Ang
 
Advanced backup methods (Postgres@CERN)
Advanced backup methods (Postgres@CERN)Advanced backup methods (Postgres@CERN)
Advanced backup methods (Postgres@CERN)
Anastasia Lubennikova
 
pg / shardman: шардинг в PostgreSQL на основе postgres / fdw, pg / pathman и ...
pg / shardman: шардинг в PostgreSQL на основе postgres / fdw, pg / pathman и ...pg / shardman: шардинг в PostgreSQL на основе postgres / fdw, pg / pathman и ...
pg / shardman: шардинг в PostgreSQL на основе postgres / fdw, pg / pathman и ...
Ontico
 
Out of the Box Replication in Postgres 9.4(pgconfsf)
Out of the Box Replication in Postgres 9.4(pgconfsf)Out of the Box Replication in Postgres 9.4(pgconfsf)
Out of the Box Replication in Postgres 9.4(pgconfsf)
Denish Patel
 

Viewers also liked (19)

PostgreSQL Vacuum: Nine Circles of Hell
PostgreSQL Vacuum: Nine Circles of HellPostgreSQL Vacuum: Nine Circles of Hell
PostgreSQL Vacuum: Nine Circles of Hell
Alexey Lesovsky
 
Linux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performanceLinux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performance
PostgreSQL-Consulting
 
Linux tuning for PostgreSQL at Secon 2015
Linux tuning for PostgreSQL at Secon 2015Linux tuning for PostgreSQL at Secon 2015
Linux tuning for PostgreSQL at Secon 2015
Alexey Lesovsky
 
Tuning Linux for your database FLOSSUK 2016
Tuning Linux for your database FLOSSUK 2016Tuning Linux for your database FLOSSUK 2016
Tuning Linux for your database FLOSSUK 2016
Colin Charles
 
Best Practices for Becoming an Exceptional Postgres DBA
Best Practices for Becoming an Exceptional Postgres DBA Best Practices for Becoming an Exceptional Postgres DBA
Best Practices for Becoming an Exceptional Postgres DBA
EDB
 
Tuning Linux for Databases.
Tuning Linux for Databases.Tuning Linux for Databases.
Tuning Linux for Databases.
Alexey Lesovsky
 
Streaming replication in practice
Streaming replication in practiceStreaming replication in practice
Streaming replication in practice
Alexey Lesovsky
 
PostgreSQL Administration for System Administrators
PostgreSQL Administration for System AdministratorsPostgreSQL Administration for System Administrators
PostgreSQL Administration for System Administrators
Command Prompt., Inc
 
Use Case: PostGIS and Agribotics
Use Case: PostGIS and AgriboticsUse Case: PostGIS and Agribotics
Use Case: PostGIS and Agribotics
PGConf APAC
 
Best Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized EnvironmentsBest Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized Environments
Jignesh Shah
 
5 Steps to PostgreSQL Performance
5 Steps to PostgreSQL Performance5 Steps to PostgreSQL Performance
5 Steps to PostgreSQL Performance
Command Prompt., Inc
 
Opslag van long tail producten in e-warehouse
Opslag van long tail producten in e-warehouseOpslag van long tail producten in e-warehouse
Opslag van long tail producten in e-warehouse
Storeganizer
 
Расширяемость PostgreSQL для хакеров и архитекторов / Олег Бартунов, Александ...
Расширяемость PostgreSQL для хакеров и архитекторов / Олег Бартунов, Александ...Расширяемость PostgreSQL для хакеров и архитекторов / Олег Бартунов, Александ...
Расширяемость PostgreSQL для хакеров и архитекторов / Олег Бартунов, Александ...
Ontico
 
5 мифов о производительности баз данных и Python
5 мифов о производительности баз данных и Python5 мифов о производительности баз данных и Python
5 мифов о производительности баз данных и Python
Max Klymyshyn
 
Python и высокая нагрузка
Python и высокая нагрузкаPython и высокая нагрузка
Python и высокая нагрузка
Alexander Shigin
 
Big Data aggregation techniques
Big Data aggregation techniquesBig Data aggregation techniques
Big Data aggregation techniques
Valentin Logvinskiy
 
Gtd Dev Labs2010 Part Ii
Gtd Dev Labs2010 Part IiGtd Dev Labs2010 Part Ii
Gtd Dev Labs2010 Part Ii
Maxim Dorofeev
 
Как HeadHunter удалось безопасно нарушить RFC 793 (TCP) и обойти сетевые лову...
Как HeadHunter удалось безопасно нарушить RFC 793 (TCP) и обойти сетевые лову...Как HeadHunter удалось безопасно нарушить RFC 793 (TCP) и обойти сетевые лову...
Как HeadHunter удалось безопасно нарушить RFC 793 (TCP) и обойти сетевые лову...
Андрей Шорин
 
Gtd Dev Labs2010 Part I
Gtd Dev Labs2010 Part IGtd Dev Labs2010 Part I
Gtd Dev Labs2010 Part I
Maxim Dorofeev
 
PostgreSQL Vacuum: Nine Circles of Hell
PostgreSQL Vacuum: Nine Circles of HellPostgreSQL Vacuum: Nine Circles of Hell
PostgreSQL Vacuum: Nine Circles of Hell
Alexey Lesovsky
 
Linux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performanceLinux tuning to improve PostgreSQL performance
Linux tuning to improve PostgreSQL performance
PostgreSQL-Consulting
 
Linux tuning for PostgreSQL at Secon 2015
Linux tuning for PostgreSQL at Secon 2015Linux tuning for PostgreSQL at Secon 2015
Linux tuning for PostgreSQL at Secon 2015
Alexey Lesovsky
 
Tuning Linux for your database FLOSSUK 2016
Tuning Linux for your database FLOSSUK 2016Tuning Linux for your database FLOSSUK 2016
Tuning Linux for your database FLOSSUK 2016
Colin Charles
 
Best Practices for Becoming an Exceptional Postgres DBA
Best Practices for Becoming an Exceptional Postgres DBA Best Practices for Becoming an Exceptional Postgres DBA
Best Practices for Becoming an Exceptional Postgres DBA
EDB
 
Tuning Linux for Databases.
Tuning Linux for Databases.Tuning Linux for Databases.
Tuning Linux for Databases.
Alexey Lesovsky
 
Streaming replication in practice
Streaming replication in practiceStreaming replication in practice
Streaming replication in practice
Alexey Lesovsky
 
PostgreSQL Administration for System Administrators
PostgreSQL Administration for System AdministratorsPostgreSQL Administration for System Administrators
PostgreSQL Administration for System Administrators
Command Prompt., Inc
 
Use Case: PostGIS and Agribotics
Use Case: PostGIS and AgriboticsUse Case: PostGIS and Agribotics
Use Case: PostGIS and Agribotics
PGConf APAC
 
Best Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized EnvironmentsBest Practices of HA and Replication of PostgreSQL in Virtualized Environments
Best Practices of HA and Replication of PostgreSQL in Virtualized Environments
Jignesh Shah
 
Opslag van long tail producten in e-warehouse
Opslag van long tail producten in e-warehouseOpslag van long tail producten in e-warehouse
Opslag van long tail producten in e-warehouse
Storeganizer
 
Расширяемость PostgreSQL для хакеров и архитекторов / Олег Бартунов, Александ...
Расширяемость PostgreSQL для хакеров и архитекторов / Олег Бартунов, Александ...Расширяемость PostgreSQL для хакеров и архитекторов / Олег Бартунов, Александ...
Расширяемость PostgreSQL для хакеров и архитекторов / Олег Бартунов, Александ...
Ontico
 
5 мифов о производительности баз данных и Python
5 мифов о производительности баз данных и Python5 мифов о производительности баз данных и Python
5 мифов о производительности баз данных и Python
Max Klymyshyn
 
Python и высокая нагрузка
Python и высокая нагрузкаPython и высокая нагрузка
Python и высокая нагрузка
Alexander Shigin
 
Gtd Dev Labs2010 Part Ii
Gtd Dev Labs2010 Part IiGtd Dev Labs2010 Part Ii
Gtd Dev Labs2010 Part Ii
Maxim Dorofeev
 
Как HeadHunter удалось безопасно нарушить RFC 793 (TCP) и обойти сетевые лову...
Как HeadHunter удалось безопасно нарушить RFC 793 (TCP) и обойти сетевые лову...Как HeadHunter удалось безопасно нарушить RFC 793 (TCP) и обойти сетевые лову...
Как HeadHunter удалось безопасно нарушить RFC 793 (TCP) и обойти сетевые лову...
Андрей Шорин
 
Gtd Dev Labs2010 Part I
Gtd Dev Labs2010 Part IGtd Dev Labs2010 Part I
Gtd Dev Labs2010 Part I
Maxim Dorofeev
 
Ad

Similar to GitLab PostgresMortem: Lessons Learned (20)

On The Building Of A PostgreSQL Cluster
On The Building Of A PostgreSQL ClusterOn The Building Of A PostgreSQL Cluster
On The Building Of A PostgreSQL Cluster
Srihari Sriraman
 
PostgreSQL continuous backup and PITR with Barman
 PostgreSQL continuous backup and PITR with Barman PostgreSQL continuous backup and PITR with Barman
PostgreSQL continuous backup and PITR with Barman
EDB
 
Backups
BackupsBackups
Backups
Payal Singh
 
Fail over fail_back
Fail over fail_backFail over fail_back
Fail over fail_back
PostgreSQL Experts, Inc.
 
PyConline AU 2021 - Things might go wrong in a data-intensive application
PyConline AU 2021 - Things might go wrong in a data-intensive applicationPyConline AU 2021 - Things might go wrong in a data-intensive application
PyConline AU 2021 - Things might go wrong in a data-intensive application
Hua Chu
 
PGDay.Amsterdam 2018 - Stefan Fercot - Save your data with pgBackRest
PGDay.Amsterdam 2018 - Stefan Fercot - Save your data with pgBackRestPGDay.Amsterdam 2018 - Stefan Fercot - Save your data with pgBackRest
PGDay.Amsterdam 2018 - Stefan Fercot - Save your data with pgBackRest
PGDay.Amsterdam
 
From 0 to ~100: Business Continuity with PostgreSQL
From 0 to ~100: Business Continuity with PostgreSQLFrom 0 to ~100: Business Continuity with PostgreSQL
From 0 to ~100: Business Continuity with PostgreSQL
Gabriele Bartolini
 
PGConf.ASIA 2019 Bali - Your Business Continuity Matrix and PostgreSQL's Disa...
PGConf.ASIA 2019 Bali - Your Business Continuity Matrix and PostgreSQL's Disa...PGConf.ASIA 2019 Bali - Your Business Continuity Matrix and PostgreSQL's Disa...
PGConf.ASIA 2019 Bali - Your Business Continuity Matrix and PostgreSQL's Disa...
Equnix Business Solutions
 
PGConf APAC 2018 - Tale from Trenches
PGConf APAC 2018 - Tale from TrenchesPGConf APAC 2018 - Tale from Trenches
PGConf APAC 2018 - Tale from Trenches
PGConf APAC
 
Benchmarking for postgresql workloads in kubernetes
Benchmarking for postgresql workloads in kubernetesBenchmarking for postgresql workloads in kubernetes
Benchmarking for postgresql workloads in kubernetes
DoKC
 
How would you recover? Lesson's from 2016's most interesting disasters
How would you recover? Lesson's from 2016's most interesting disastersHow would you recover? Lesson's from 2016's most interesting disasters
How would you recover? Lesson's from 2016's most interesting disasters
Databarracks
 
Database Dumps and Backups
Database Dumps and BackupsDatabase Dumps and Backups
Database Dumps and Backups
EDB
 
Pro PostgreSQL, OSCon 2008
Pro PostgreSQL, OSCon 2008Pro PostgreSQL, OSCon 2008
Pro PostgreSQL, OSCon 2008
Robert Treat
 
Data corruption
Data corruptionData corruption
Data corruption
Tomas Vondra
 
Think_your_Postgres_backups_and_recovery_are_safe_lets_talk.pptx
Think_your_Postgres_backups_and_recovery_are_safe_lets_talk.pptxThink_your_Postgres_backups_and_recovery_are_safe_lets_talk.pptx
Think_your_Postgres_backups_and_recovery_are_safe_lets_talk.pptx
Payal Singh
 
The Challenges of Distributing Postgres: A Citus Story
The Challenges of Distributing Postgres: A Citus StoryThe Challenges of Distributing Postgres: A Citus Story
The Challenges of Distributing Postgres: A Citus Story
Hanna Kelman
 
The Challenges of Distributing Postgres: A Citus Story | DataEngConf NYC 2017...
The Challenges of Distributing Postgres: A Citus Story | DataEngConf NYC 2017...The Challenges of Distributing Postgres: A Citus Story | DataEngConf NYC 2017...
The Challenges of Distributing Postgres: A Citus Story | DataEngConf NYC 2017...
Citus Data
 
Postgre sql best_practices
Postgre sql best_practicesPostgre sql best_practices
Postgre sql best_practices
Jacques Kostic
 
How to Design for Database High Availability
How to Design for Database High AvailabilityHow to Design for Database High Availability
How to Design for Database High Availability
EDB
 
7 Ways To Crash Postgres
7 Ways To Crash Postgres7 Ways To Crash Postgres
7 Ways To Crash Postgres
PostgreSQL Experts, Inc.
 
On The Building Of A PostgreSQL Cluster
On The Building Of A PostgreSQL ClusterOn The Building Of A PostgreSQL Cluster
On The Building Of A PostgreSQL Cluster
Srihari Sriraman
 
PostgreSQL continuous backup and PITR with Barman
 PostgreSQL continuous backup and PITR with Barman PostgreSQL continuous backup and PITR with Barman
PostgreSQL continuous backup and PITR with Barman
EDB
 
PyConline AU 2021 - Things might go wrong in a data-intensive application
PyConline AU 2021 - Things might go wrong in a data-intensive applicationPyConline AU 2021 - Things might go wrong in a data-intensive application
PyConline AU 2021 - Things might go wrong in a data-intensive application
Hua Chu
 
PGDay.Amsterdam 2018 - Stefan Fercot - Save your data with pgBackRest
PGDay.Amsterdam 2018 - Stefan Fercot - Save your data with pgBackRestPGDay.Amsterdam 2018 - Stefan Fercot - Save your data with pgBackRest
PGDay.Amsterdam 2018 - Stefan Fercot - Save your data with pgBackRest
PGDay.Amsterdam
 
From 0 to ~100: Business Continuity with PostgreSQL
From 0 to ~100: Business Continuity with PostgreSQLFrom 0 to ~100: Business Continuity with PostgreSQL
From 0 to ~100: Business Continuity with PostgreSQL
Gabriele Bartolini
 
PGConf.ASIA 2019 Bali - Your Business Continuity Matrix and PostgreSQL's Disa...
PGConf.ASIA 2019 Bali - Your Business Continuity Matrix and PostgreSQL's Disa...PGConf.ASIA 2019 Bali - Your Business Continuity Matrix and PostgreSQL's Disa...
PGConf.ASIA 2019 Bali - Your Business Continuity Matrix and PostgreSQL's Disa...
Equnix Business Solutions
 
PGConf APAC 2018 - Tale from Trenches
PGConf APAC 2018 - Tale from TrenchesPGConf APAC 2018 - Tale from Trenches
PGConf APAC 2018 - Tale from Trenches
PGConf APAC
 
Benchmarking for postgresql workloads in kubernetes
Benchmarking for postgresql workloads in kubernetesBenchmarking for postgresql workloads in kubernetes
Benchmarking for postgresql workloads in kubernetes
DoKC
 
How would you recover? Lesson's from 2016's most interesting disasters
How would you recover? Lesson's from 2016's most interesting disastersHow would you recover? Lesson's from 2016's most interesting disasters
How would you recover? Lesson's from 2016's most interesting disasters
Databarracks
 
Database Dumps and Backups
Database Dumps and BackupsDatabase Dumps and Backups
Database Dumps and Backups
EDB
 
Pro PostgreSQL, OSCon 2008
Pro PostgreSQL, OSCon 2008Pro PostgreSQL, OSCon 2008
Pro PostgreSQL, OSCon 2008
Robert Treat
 
Think_your_Postgres_backups_and_recovery_are_safe_lets_talk.pptx
Think_your_Postgres_backups_and_recovery_are_safe_lets_talk.pptxThink_your_Postgres_backups_and_recovery_are_safe_lets_talk.pptx
Think_your_Postgres_backups_and_recovery_are_safe_lets_talk.pptx
Payal Singh
 
The Challenges of Distributing Postgres: A Citus Story
The Challenges of Distributing Postgres: A Citus StoryThe Challenges of Distributing Postgres: A Citus Story
The Challenges of Distributing Postgres: A Citus Story
Hanna Kelman
 
The Challenges of Distributing Postgres: A Citus Story | DataEngConf NYC 2017...
The Challenges of Distributing Postgres: A Citus Story | DataEngConf NYC 2017...The Challenges of Distributing Postgres: A Citus Story | DataEngConf NYC 2017...
The Challenges of Distributing Postgres: A Citus Story | DataEngConf NYC 2017...
Citus Data
 
Postgre sql best_practices
Postgre sql best_practicesPostgre sql best_practices
Postgre sql best_practices
Jacques Kostic
 
How to Design for Database High Availability
How to Design for Database High AvailabilityHow to Design for Database High Availability
How to Design for Database High Availability
EDB
 
Ad

More from Alexey Lesovsky (8)

Отладка и устранение проблем в PostgreSQL Streaming Replication.
Отладка и устранение проблем в PostgreSQL Streaming Replication.Отладка и устранение проблем в PostgreSQL Streaming Replication.
Отладка и устранение проблем в PostgreSQL Streaming Replication.
Alexey Lesovsky
 
Call of Postgres: Advanced Operations (part 5)
Call of Postgres: Advanced Operations (part 5)Call of Postgres: Advanced Operations (part 5)
Call of Postgres: Advanced Operations (part 5)
Alexey Lesovsky
 
Call of Postgres: Advanced Operations (part 4)
Call of Postgres: Advanced Operations (part 4)Call of Postgres: Advanced Operations (part 4)
Call of Postgres: Advanced Operations (part 4)
Alexey Lesovsky
 
Call of Postgres: Advanced Operations (part 3)
Call of Postgres: Advanced Operations (part 3)Call of Postgres: Advanced Operations (part 3)
Call of Postgres: Advanced Operations (part 3)
Alexey Lesovsky
 
Call of Postgres: Advanced Operations (part 2)
Call of Postgres: Advanced Operations (part 2)Call of Postgres: Advanced Operations (part 2)
Call of Postgres: Advanced Operations (part 2)
Alexey Lesovsky
 
Call of Postgres: Advanced Operations (part 1)
Call of Postgres: Advanced Operations (part 1)Call of Postgres: Advanced Operations (part 1)
Call of Postgres: Advanced Operations (part 1)
Alexey Lesovsky
 
PostgreSQL Streaming Replication
PostgreSQL Streaming ReplicationPostgreSQL Streaming Replication
PostgreSQL Streaming Replication
Alexey Lesovsky
 
Highload 2014. PostgreSQL: ups, DevOps.
Highload 2014. PostgreSQL: ups, DevOps.Highload 2014. PostgreSQL: ups, DevOps.
Highload 2014. PostgreSQL: ups, DevOps.
Alexey Lesovsky
 
Отладка и устранение проблем в PostgreSQL Streaming Replication.
Отладка и устранение проблем в PostgreSQL Streaming Replication.Отладка и устранение проблем в PostgreSQL Streaming Replication.
Отладка и устранение проблем в PostgreSQL Streaming Replication.
Alexey Lesovsky
 
Call of Postgres: Advanced Operations (part 5)
Call of Postgres: Advanced Operations (part 5)Call of Postgres: Advanced Operations (part 5)
Call of Postgres: Advanced Operations (part 5)
Alexey Lesovsky
 
Call of Postgres: Advanced Operations (part 4)
Call of Postgres: Advanced Operations (part 4)Call of Postgres: Advanced Operations (part 4)
Call of Postgres: Advanced Operations (part 4)
Alexey Lesovsky
 
Call of Postgres: Advanced Operations (part 3)
Call of Postgres: Advanced Operations (part 3)Call of Postgres: Advanced Operations (part 3)
Call of Postgres: Advanced Operations (part 3)
Alexey Lesovsky
 
Call of Postgres: Advanced Operations (part 2)
Call of Postgres: Advanced Operations (part 2)Call of Postgres: Advanced Operations (part 2)
Call of Postgres: Advanced Operations (part 2)
Alexey Lesovsky
 
Call of Postgres: Advanced Operations (part 1)
Call of Postgres: Advanced Operations (part 1)Call of Postgres: Advanced Operations (part 1)
Call of Postgres: Advanced Operations (part 1)
Alexey Lesovsky
 
PostgreSQL Streaming Replication
PostgreSQL Streaming ReplicationPostgreSQL Streaming Replication
PostgreSQL Streaming Replication
Alexey Lesovsky
 
Highload 2014. PostgreSQL: ups, DevOps.
Highload 2014. PostgreSQL: ups, DevOps.Highload 2014. PostgreSQL: ups, DevOps.
Highload 2014. PostgreSQL: ups, DevOps.
Alexey Lesovsky
 

Recently uploaded (20)

Pavement and its types, Application of rigid and Flexible Pavements
Pavement and its types, Application of rigid and Flexible PavementsPavement and its types, Application of rigid and Flexible Pavements
Pavement and its types, Application of rigid and Flexible Pavements
Sakthivel M
 
SEW make Brake BE05 – BE30 Brake – Repair Kit
SEW make Brake BE05 – BE30 Brake – Repair KitSEW make Brake BE05 – BE30 Brake – Repair Kit
SEW make Brake BE05 – BE30 Brake – Repair Kit
projectultramechanix
 
Week 6- PC HARDWARE AND MAINTENANCE-THEORY.pptx
Week 6- PC HARDWARE AND MAINTENANCE-THEORY.pptxWeek 6- PC HARDWARE AND MAINTENANCE-THEORY.pptx
Week 6- PC HARDWARE AND MAINTENANCE-THEORY.pptx
dayananda54
 
Rearchitecturing a 9-year-old legacy Laravel application.pdf
Rearchitecturing a 9-year-old legacy Laravel application.pdfRearchitecturing a 9-year-old legacy Laravel application.pdf
Rearchitecturing a 9-year-old legacy Laravel application.pdf
Takumi Amitani
 
OCS Group SG - HPHT Well Design and Operation - SN.pdf
OCS Group SG - HPHT Well Design and Operation - SN.pdfOCS Group SG - HPHT Well Design and Operation - SN.pdf
OCS Group SG - HPHT Well Design and Operation - SN.pdf
Muanisa Waras
 
Structural Design for Residential-to-Restaurant Conversion
Structural Design for Residential-to-Restaurant ConversionStructural Design for Residential-to-Restaurant Conversion
Structural Design for Residential-to-Restaurant Conversion
DanielRoman285499
 
22PCOAM16 _ML_Unit 3 Notes & Question bank
22PCOAM16 _ML_Unit 3 Notes & Question bank22PCOAM16 _ML_Unit 3 Notes & Question bank
22PCOAM16 _ML_Unit 3 Notes & Question bank
Guru Nanak Technical Institutions
 
Impurities of Water and their Significance.pptx
Impurities of Water and their Significance.pptxImpurities of Water and their Significance.pptx
Impurities of Water and their Significance.pptx
dhanashree78
 
TEA2016AAT 160 W TV application design example
TEA2016AAT 160 W TV application design exampleTEA2016AAT 160 W TV application design example
TEA2016AAT 160 W TV application design example
ssuser1be9ce
 
Decoding Kotlin - Your Guide to Solving the Mysterious in Kotlin - Devoxx PL ...
Decoding Kotlin - Your Guide to Solving the Mysterious in Kotlin - Devoxx PL ...Decoding Kotlin - Your Guide to Solving the Mysterious in Kotlin - Devoxx PL ...
Decoding Kotlin - Your Guide to Solving the Mysterious in Kotlin - Devoxx PL ...
João Esperancinha
 
362 Alec Data Center Solutions-Slysium Data Center-AUH-Adaptaflex.pdf
362 Alec Data Center Solutions-Slysium Data Center-AUH-Adaptaflex.pdf362 Alec Data Center Solutions-Slysium Data Center-AUH-Adaptaflex.pdf
362 Alec Data Center Solutions-Slysium Data Center-AUH-Adaptaflex.pdf
djiceramil
 
COMPOSITE COLUMN IN STEEL CONCRETE COMPOSITES.ppt
COMPOSITE COLUMN IN STEEL CONCRETE COMPOSITES.pptCOMPOSITE COLUMN IN STEEL CONCRETE COMPOSITES.ppt
COMPOSITE COLUMN IN STEEL CONCRETE COMPOSITES.ppt
ravicivil
 
Water demand - Types , variations and WDS
Water demand - Types , variations and WDSWater demand - Types , variations and WDS
Water demand - Types , variations and WDS
dhanashree78
 
362 Alec Data Center Solutions-Slysium Data Center-AUH-Adaptaflex.pdf
362 Alec Data Center Solutions-Slysium Data Center-AUH-Adaptaflex.pdf362 Alec Data Center Solutions-Slysium Data Center-AUH-Adaptaflex.pdf
362 Alec Data Center Solutions-Slysium Data Center-AUH-Adaptaflex.pdf
djiceramil
 
Engineering Mechanics Introduction and its Application
Engineering Mechanics Introduction and its ApplicationEngineering Mechanics Introduction and its Application
Engineering Mechanics Introduction and its Application
Sakthivel M
 
Fundamentals of Digital Design_Class_12th April.pptx
Fundamentals of Digital Design_Class_12th April.pptxFundamentals of Digital Design_Class_12th April.pptx
Fundamentals of Digital Design_Class_12th April.pptx
drdebarshi1993
 
Rigor, ethics, wellbeing and resilience in the ICT doctoral journey
Rigor, ethics, wellbeing and resilience in the ICT doctoral journeyRigor, ethics, wellbeing and resilience in the ICT doctoral journey
Rigor, ethics, wellbeing and resilience in the ICT doctoral journey
Yannis
 
Tree_Traversals.pptbbbbbbbbbbbbbbbbbbbbbbbbb
Tree_Traversals.pptbbbbbbbbbbbbbbbbbbbbbbbbbTree_Traversals.pptbbbbbbbbbbbbbbbbbbbbbbbbb
Tree_Traversals.pptbbbbbbbbbbbbbbbbbbbbbbbbb
RATNANITINPATIL
 
362 Alec Data Center Solutions-Slysium Data Center-AUH-ABB Furse.pdf
362 Alec Data Center Solutions-Slysium Data Center-AUH-ABB Furse.pdf362 Alec Data Center Solutions-Slysium Data Center-AUH-ABB Furse.pdf
362 Alec Data Center Solutions-Slysium Data Center-AUH-ABB Furse.pdf
djiceramil
 
最新版美国圣莫尼卡学院毕业证(SMC毕业证书)原版定制
最新版美国圣莫尼卡学院毕业证(SMC毕业证书)原版定制最新版美国圣莫尼卡学院毕业证(SMC毕业证书)原版定制
最新版美国圣莫尼卡学院毕业证(SMC毕业证书)原版定制
Taqyea
 
Pavement and its types, Application of rigid and Flexible Pavements
Pavement and its types, Application of rigid and Flexible PavementsPavement and its types, Application of rigid and Flexible Pavements
Pavement and its types, Application of rigid and Flexible Pavements
Sakthivel M
 
SEW make Brake BE05 – BE30 Brake – Repair Kit
SEW make Brake BE05 – BE30 Brake – Repair KitSEW make Brake BE05 – BE30 Brake – Repair Kit
SEW make Brake BE05 – BE30 Brake – Repair Kit
projectultramechanix
 
Week 6- PC HARDWARE AND MAINTENANCE-THEORY.pptx
Week 6- PC HARDWARE AND MAINTENANCE-THEORY.pptxWeek 6- PC HARDWARE AND MAINTENANCE-THEORY.pptx
Week 6- PC HARDWARE AND MAINTENANCE-THEORY.pptx
dayananda54
 
Rearchitecturing a 9-year-old legacy Laravel application.pdf
Rearchitecturing a 9-year-old legacy Laravel application.pdfRearchitecturing a 9-year-old legacy Laravel application.pdf
Rearchitecturing a 9-year-old legacy Laravel application.pdf
Takumi Amitani
 
OCS Group SG - HPHT Well Design and Operation - SN.pdf
OCS Group SG - HPHT Well Design and Operation - SN.pdfOCS Group SG - HPHT Well Design and Operation - SN.pdf
OCS Group SG - HPHT Well Design and Operation - SN.pdf
Muanisa Waras
 
Structural Design for Residential-to-Restaurant Conversion
Structural Design for Residential-to-Restaurant ConversionStructural Design for Residential-to-Restaurant Conversion
Structural Design for Residential-to-Restaurant Conversion
DanielRoman285499
 
Impurities of Water and their Significance.pptx
Impurities of Water and their Significance.pptxImpurities of Water and their Significance.pptx
Impurities of Water and their Significance.pptx
dhanashree78
 
TEA2016AAT 160 W TV application design example
TEA2016AAT 160 W TV application design exampleTEA2016AAT 160 W TV application design example
TEA2016AAT 160 W TV application design example
ssuser1be9ce
 
Decoding Kotlin - Your Guide to Solving the Mysterious in Kotlin - Devoxx PL ...
Decoding Kotlin - Your Guide to Solving the Mysterious in Kotlin - Devoxx PL ...Decoding Kotlin - Your Guide to Solving the Mysterious in Kotlin - Devoxx PL ...
Decoding Kotlin - Your Guide to Solving the Mysterious in Kotlin - Devoxx PL ...
João Esperancinha
 
362 Alec Data Center Solutions-Slysium Data Center-AUH-Adaptaflex.pdf
362 Alec Data Center Solutions-Slysium Data Center-AUH-Adaptaflex.pdf362 Alec Data Center Solutions-Slysium Data Center-AUH-Adaptaflex.pdf
362 Alec Data Center Solutions-Slysium Data Center-AUH-Adaptaflex.pdf
djiceramil
 
COMPOSITE COLUMN IN STEEL CONCRETE COMPOSITES.ppt
COMPOSITE COLUMN IN STEEL CONCRETE COMPOSITES.pptCOMPOSITE COLUMN IN STEEL CONCRETE COMPOSITES.ppt
COMPOSITE COLUMN IN STEEL CONCRETE COMPOSITES.ppt
ravicivil
 
Water demand - Types , variations and WDS
Water demand - Types , variations and WDSWater demand - Types , variations and WDS
Water demand - Types , variations and WDS
dhanashree78
 
362 Alec Data Center Solutions-Slysium Data Center-AUH-Adaptaflex.pdf
362 Alec Data Center Solutions-Slysium Data Center-AUH-Adaptaflex.pdf362 Alec Data Center Solutions-Slysium Data Center-AUH-Adaptaflex.pdf
362 Alec Data Center Solutions-Slysium Data Center-AUH-Adaptaflex.pdf
djiceramil
 
Engineering Mechanics Introduction and its Application
Engineering Mechanics Introduction and its ApplicationEngineering Mechanics Introduction and its Application
Engineering Mechanics Introduction and its Application
Sakthivel M
 
Fundamentals of Digital Design_Class_12th April.pptx
Fundamentals of Digital Design_Class_12th April.pptxFundamentals of Digital Design_Class_12th April.pptx
Fundamentals of Digital Design_Class_12th April.pptx
drdebarshi1993
 
Rigor, ethics, wellbeing and resilience in the ICT doctoral journey
Rigor, ethics, wellbeing and resilience in the ICT doctoral journeyRigor, ethics, wellbeing and resilience in the ICT doctoral journey
Rigor, ethics, wellbeing and resilience in the ICT doctoral journey
Yannis
 
Tree_Traversals.pptbbbbbbbbbbbbbbbbbbbbbbbbb
Tree_Traversals.pptbbbbbbbbbbbbbbbbbbbbbbbbbTree_Traversals.pptbbbbbbbbbbbbbbbbbbbbbbbbb
Tree_Traversals.pptbbbbbbbbbbbbbbbbbbbbbbbbb
RATNANITINPATIL
 
362 Alec Data Center Solutions-Slysium Data Center-AUH-ABB Furse.pdf
362 Alec Data Center Solutions-Slysium Data Center-AUH-ABB Furse.pdf362 Alec Data Center Solutions-Slysium Data Center-AUH-ABB Furse.pdf
362 Alec Data Center Solutions-Slysium Data Center-AUH-ABB Furse.pdf
djiceramil
 
最新版美国圣莫尼卡学院毕业证(SMC毕业证书)原版定制
最新版美国圣莫尼卡学院毕业证(SMC毕业证书)原版定制最新版美国圣莫尼卡学院毕业证(SMC毕业证书)原版定制
最新版美国圣莫尼卡学院毕业证(SMC毕业证书)原版定制
Taqyea
 

GitLab PostgresMortem: Lessons Learned

  • 2. dataegret.com 31 January Events Failure's key points Preventative measures https://goo.gl/GO5rYJ 02 03 01
  • 4. 17:20 - an LVM snapshot of the production db was taken. 19:00 - database load increased due to spam. 23:00 - secondary's replication process started to lag behind. 23:30 - PostgreSQL database directory was wiped. 31 January events01 dataegret.com
  • 6. 1.LVM snapshots and staging provisioning. 2.When a replica start to lag. 3.Do pg_basebackup properly – part 1. 4.max_wal_senders was exceeded, but how? 5.max_connections = 8000. 6.pg_basebackup «stuck» – do pg_basebackup properly – part 2. 7.strace: good thing in wrong place. 8.rm or not rm? 9.A bit about backup. 10.Different PG versions on the production. 11.Broken mail. 31 January events02 dataegret.com
  • 7. Snapshot impact on underlying storage. Provisioning from backup. Staging based on LVM snapshots02 dataegret.com
  • 8. Re-initialize the standby. Monitoring with pg_stat_replication. Use wal_keep_segments while troubleshooting. Use WAL archive. When a replica started to lag02 dataegret.com
  • 9. Do pg_basebackup into clean directory. Remove «unnecessary» directory. Use mv instead of rm. Do pg_basebackup properly. Part 102 dataegret.com
  • 10. There was only one standby (which was failed). Increase max_wal_senders. Check who has stolen connections. The limit was exceeded by concurrent pg_basebackups. max_wal_senders was exceeded.02 dataegret.com
  • 11. More than 500 is bad idea. Use pgbouncer to reduce the number of server connections. max_connections = 800002 dataegret.com
  • 12. Don't run more than one pg_basebackups. It didn't stuck, it waited for the checkpoint. Use «-c» option to make fast checkpoint. Do pg_basebackup properly. Part 202 dataegret.com
  • 13. Strace isn't a good tool in that case. Use strace for system errors tracing. Check stack trace from /proc/<pid>/stack or GDB. Good things in wrong place.02 dataegret.com
  • 14. Data directory was cleaned with rm. Use mv instead of rm. rm or not rm02 dataegret.com
  • 15. Daily pg_dump. Daily LVM snapshot. Daily Azure snapshot. PostgreSQL streaming replication. Basebackup with WAL archive. A bit about backup02 dataegret.com
  • 16. Clean out old packages after major upgrade. Different versions on a production02 dataegret.com
  • 17. Setup cron, but forgot notifications. Use reliable notification systems. Different versions on a production02 dataegret.com
  • 19. 1. Update PS1 across all hosts to more clearly differentiate between hosts and environments. 2. Prometheus monitoring for backups. 3. Set PostgreSQL's max_connections to a sane value. 4. Investigate Point in time recovery & continuous archiving for PostgreSQL. 5. Hourly LVM snapshots of the production databases. 6. Azure disk snapshots of production databases. 7. Move staging to the ARM environment. 8. Recover production replica(s). 9. Automated testing of recovering PostgreSQL database backups. 10.Improve PostgreSQL replication documentation/runbooks. 11.Investigate pgbarman for creating PostgreSQL backups. 12.Investigate using WAL-E as a means of Database Backup and Realtime Replication. 13.Build Streaming Database Restore. 14.Assign an owner for data durability. Different versions on a production03 dataegret.com
  • 20. 1. Update PS1 across all hosts. Looks OK. 2. Prometheus monitoring for backups. Size, number, age and recovery status. 3. Set PostgreSQL's max_connections to a sane value. Better use pgbouncer. 4. Investigate PITR & continuous archiving for PostgreSQL. Yes, as the part of the backup. Preventative measures03 dataegret.com
  • 21. 5. Hourly LVM snapshots of the production databases. Looks unnecessary. 6. Azure disk snapshots of production databases. Looks unnecessary. 7. Move staging to the ARM environment. Very and very suspicious. 8. Recover production replica(s). Do that asap. Preventative measures03 dataegret.com
  • 22. 9. Automated testing of recovering database backups. YES! 10. Improve documentation/runbooks. You need a bureaucrat. 11. Investigate pgbarman. Looks OK, Barman is stable and reliable. 12. Investigate using WAL-E. Looks OK, WAL-E is the «setup and forget». Preventative measures03 dataegret.com
  • 23. 13. Build Streaming Database Restore. Corresponds with p.9. 14. Assign an owner for data durability. Hire a DBA. Preventative measures03 dataegret.com
  • 24. Check and monitor backups. Create an emergency instructions. Learn to use tools properly. Lessons learned03 dataegret.com
  • 25. Postmortem of database outage of January 31 https://about.gitlab.com/2017/02/10/postmortem-of-database-outage-of-january-31/ PostgreSQL Statistics Collector: pg_stat_replication view https://www.postgresql.org/docs/current/static/monitoring-stats.html#PG-STAT-REPLICATION-VIEW pg_basebackup utility https://www.postgresql.org/docs/current/static/app-pgbasebackup.html PostgreSQL Replication https://www.postgresql.org/docs/9.6/static/runtime-config-replication.html PgBouncer https://pgbouncer.github.io/ https://wiki.postgresql.org/wiki/PgBouncer Barman http://www.pgbarman.org/ Links03 dataegret.com