More Aim,
Less Blame!
How do you feel after a
conference?
Inspired!
Empowered!
Blame
Feeling Down
Guilt
About Me - @dvkanchev
Chief Enterprise Architect @ SiteGround
DevOps Engineer/SRE
Adrenaline Junkie (snowboarding, sailing, parenting)
Safety Enthusiast
About This Talk
Focuses on culture and not technology
Based on examples of technical failures
But valid for handling all types of failures
What Is A
Post-Mortem?
“A postmortem is a written record of an
incident, its impact, the actions taken to
mitigate or resolve it, the root cause(s),
and the follow-up actions to prevent the
incident from recurring.”
Website
Downtime
Site Is Broken.
What Do You Do?
1. I just fix it - 82%
2. Fix it + postmortem - 12%
3. Someone else fixes such problems for me - 6%
“Successful Software
Never Gets Simpler”
Blame, Sanctions And Accountability
Issue (root cause): “A backup script was run against the production
database. It locked all tables and caused downtime.”

BAD!

Issue (root cause): “I (Daniel) ran a backup script against the production
database. It locked all tables and caused downtime. Why is this script not
configured to use --single-transaction for InnoDB tables?”

BETTER!
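
The question in the second write-up points at a concrete fix. As a minimal sketch, assuming the backup script is a thin wrapper around mysqldump (the function name and the database name "shop" are hypothetical, not taken from the talk), the wrapper could add --single-transaction whenever the tables are InnoDB, so the dump reads from a consistent snapshot instead of locking tables:

```python
from typing import List


def build_dump_command(database: str, innodb: bool = True) -> List[str]:
    """Build the argument list for a hypothetical mysqldump-based backup."""
    cmd = ["mysqldump"]
    if innodb:
        # For InnoDB, --single-transaction dumps from a consistent snapshot
        # rather than holding table locks for the whole dump.
        cmd.append("--single-transaction")
    # For other storage engines the mysqldump defaults are left untouched.
    cmd.append(database)
    return cmd


if __name__ == "__main__":
    # "shop" is a made-up database name used only for illustration.
    print(" ".join(build_dump_command("shop")))
```

The sketch only builds the command line; running it against a real server, credentials, and output handling are left out on purpose.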
“Safety requires prevention, prevention requires
honesty, honesty requires absence of fear.”
A Pinch Of
Blameless
“Focus on the situational aspects of a failure’s
mechanism AND the decision-making process of
individuals proximate to the failure.”
Post-mortem template. The obvious stuff.
1. Describe the incident and the impact.
2. How was it solved?
3. Complete timeline of events.
Table Timeline Example
03:45 AM  Monitoring system detected high rate of 5xx errors
03:46 AM  Monitoring system paged engineer on call
03:47 AM  Incident was confirmed
03:53 AM  Graphs were checked and a tenfold increase in traffic towards Redis was observed
04:25 AM  Issue was escalated to a senior engineer
04:52 AM  WordPress plugin was downgraded to fix the issue
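
A timeline like this is easier to reason about once the gaps between entries are visible. The following is a rough sketch, assuming all entries fall on the same day and using the times from the example above; the helper names are illustrative only:

```python
from datetime import datetime

# Entries copied from the example timeline.
TIMELINE = [
    ("03:45 AM", "Monitoring system detected high rate of 5xx errors"),
    ("03:46 AM", "Monitoring system paged engineer on call"),
    ("03:47 AM", "Incident was confirmed"),
    ("03:53 AM", "Graphs were checked and a tenfold increase in traffic towards Redis was observed"),
    ("04:25 AM", "Issue was escalated to a senior engineer"),
    ("04:52 AM", "WordPress plugin was downgraded to fix the issue"),
]


def minutes_between(start: str, end: str) -> int:
    """Whole minutes between two 12-hour clock times on the same day."""
    fmt = "%I:%M %p"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def summarize(timeline):
    """Return total duration and the longest gap before any single entry."""
    gaps = [
        (minutes_between(prev[0], cur[0]), cur[1])
        for prev, cur in zip(timeline, timeline[1:])
    ]
    return {
        "time_to_resolve": minutes_between(timeline[0][0], timeline[-1][0]),
        "longest_gap": max(gaps),
    }


if __name__ == "__main__":
    print(summarize(TIMELINE))
```

In this example the longest quiet period is the one before the escalation, which is the kind of gap a post-mortem discussion may want to look at.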
Post-mortem template. The obvious stuff.
1. Describe the incident and the impact.
2. How was it solved?
3. Complete timeline of events.
4. Root cause(s) analysis.
5. Lessons learned.
6. Action item list.
7. Post-mortem review and approval.
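
The seven sections above can be sketched as a small record type, which makes it easy to spot a post-mortem that is still incomplete. This is one possible shape, not the template used in the talk; the class and field names are assumptions chosen to mirror the list:

```python
from dataclasses import dataclass, field, fields
from typing import List


@dataclass
class Postmortem:
    """One field per section of the template, in the same order."""
    incident_and_impact: str = ""
    resolution: str = ""
    timeline: List[str] = field(default_factory=list)
    root_causes: List[str] = field(default_factory=list)
    lessons_learned: List[str] = field(default_factory=list)
    action_items: List[str] = field(default_factory=list)
    reviewed_by: List[str] = field(default_factory=list)

    def missing_sections(self) -> List[str]:
        """Names of sections that are still empty, in template order."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


if __name__ == "__main__":
    draft = Postmortem(
        incident_and_impact="Site returned 5xx errors for about an hour.",
        resolution="WordPress plugin was downgraded.",
    )
    print(draft.missing_sections())
```

A check like `missing_sections()` could gate the review and approval step, so a draft is not signed off while sections such as action items are still blank.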
Post-mortem template. The hidden gems.
1. Different Triggers/Contributors.
2. Mitigators.
3. Additions to the Timeline of Events.
[Figure: per-role timeline of the incident. Axis: Time. Lanes: escalations, on-call dev, DBA, customer service, network engineer, security engineer. Annotations: “Called for assistance”; “On poor conference wifi”; “Starts checking backups and preparing for restore”; “Unrelated alerts for connected systems”; “Working on a theory related to load balancing as new data is obtained”.]
Post-mortem template. The hidden gems.
1. Different Triggers/Contributors.
2. Mitigators.
3. Additions to the Timeline of Events.
4. Islands of Knowledge.
5. Open discussions.
Step Back
Example Time
“The best way to find out if you can
trust somebody is to trust them!”
Ernest Hemingway
Resources
github.com/dkanchev
Questions
Thank You!
@dvkanchev

More Aim, Less Blame: How to use postmortems to turn failures into something valuable for your team
