Top 10 Practices of Highly Successful DevOps Incident Management Teams

Matthew Boeckman
Developer Advocate
@matthewboeckman
Background
● 18 years on-call Ops
● 15 years w/software teams
● Startup junkie
● DevOps enthusiast

What is VictorOps?
VictorOps ingests all of your alerts from your current monitoring tools and becomes the logical
layer between your alerts and the people who receive them.

Five Phases of Incident Management
Detection
● monitoring
● metrics
● thresholds
Response
● alerting
● on-call
● escalation
Remediation
● fixes
● tickets
● deployment
Analysis
● postmortem
● how or why
● understand
Readiness
● improve
● game days
● learning

MTTR - Our Core Metric
Detection Response Remediation Analysis Readiness
Time to Repair (MTTR)
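As a rough illustration of the metric, MTTR can be computed as the average time between detection and repair across a set of incidents. This sketch is not from the talk: the `Incident` record, its field names, and the sample timestamps are hypothetical, and it assumes the clock runs from detection to the end of remediation.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record: when the problem was detected and when it was repaired.
    detected_at: datetime
    resolved_at: datetime


def mean_time_to_repair(incidents):
    """Average of (resolved_at - detected_at) over all incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.resolved_at - i.detected_at for i in incidents), timedelta())
    return total / len(incidents)


# Illustrative data only: one 30-minute and one 90-minute incident.
incidents = [
    Incident(datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 10, 30)),
    Incident(datetime(2024, 1, 2, 22, 0), datetime(2024, 1, 2, 23, 30)),
]
mttr = mean_time_to_repair(incidents)
print(mttr)  # 1:00:00
```

Tracking this number over time, rather than per incident, is what makes it useful as a team-level metric.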

Detection - Response - Remediation - Analysis - Readiness
Detection

#1 Blended Approach to Detection
● Synthetic Testing
● Time-series data
● Application Monitoring
● Log Analytics

Detection - Synthetic Testing
Synthetic monitoring leverages scripted web interactions to
validate critical path user interactions (or system interactions).
[Diagram: user flow through Landing Page, Existing User?, Registration, Welcome Stream, Login, User Home, Interaction]
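A scripted check of a critical path might look roughly like the sketch below. It is an illustration, not the speaker's tooling: the step names, URLs, and expected page text are hypothetical, and the fetcher is injectable so the example can run without a network.

```python
import urllib.request


def http_fetch(url, timeout=10):
    """Default fetcher: return (status_code, body_text) for a GET request."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


def run_synthetic_check(steps, fetch=http_fetch):
    """Walk a scripted path and return a list of (step_name, ok) results.

    Each step is (name, url, expected_text). A step passes when the response
    status is 200 and the expected text appears in the body.
    """
    results = []
    for name, url, expected_text in steps:
        try:
            status, body = fetch(url)
            ok = status == 200 and expected_text in body
        except OSError:
            ok = False  # network and HTTP errors count as a failed step
        results.append((name, ok))
        if not ok:
            break  # later steps depend on earlier ones
    return results


# Hypothetical critical path, exercised with a fake fetcher.
steps = [
    ("landing", "https://example.test/", "Sign up"),
    ("login", "https://example.test/login", "Welcome back"),
    ("home", "https://example.test/home", "Your stream"),
]
pages = {
    "https://example.test/": (200, "<a>Sign up</a>"),
    "https://example.test/login": (500, "error"),
}


def fake_fetch(url):
    return pages.get(url, (404, ""))


results = run_synthetic_check(steps, fetch=fake_fetch)
print(results)  # [('landing', True), ('login', False)]
```

Stopping at the first failed step keeps the result focused on where the user journey actually breaks.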

Detection - Time Series
Systems are not static.
Why treat measurements as if they were?

Detection - Rate of Change
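One way to read this slide: alert on how fast a measurement is moving, not only on a fixed threshold. The sketch below is a minimal illustration under that reading; the function names, sample values, and the `max_rate` limit are hypothetical.

```python
def rate_of_change(samples):
    """Per-second rate between consecutive (timestamp_seconds, value) samples."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        if t1 <= t0:
            raise ValueError("timestamps must be strictly increasing")
        rates.append((v1 - v0) / (t1 - t0))
    return rates


def exceeds_rate(samples, max_rate):
    """True if any step rises faster than max_rate units per second."""
    return any(r > max_rate for r in rate_of_change(samples))


# Illustrative disk-usage samples: (seconds, percent used).
# A static 90% threshold would stay silent here; the last step jumps 6 points in a minute.
disk = [(0, 70.0), (60, 70.5), (120, 71.0), (180, 77.0)]
print(exceeds_rate(disk, max_rate=0.05))      # True
print(exceeds_rate(disk[:3], max_rate=0.05))  # False
```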

Detection - Application Performance (APM)
APM monitors complex application behaviors
and reports or alerts on deviation from norms.
[Diagram columns: UX, Runtime, Tracing, Internals. Labels: RUM, Transactions, Page Performance, Thread Profiling, Application internals, Timings and Counters, Microservices, Transactional path, 3rd party calls, Dependency Management, Circuit Breakers]
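To make "timings and counters" and "deviation from norms" concrete, here is a small sketch: a timing helper that records durations per operation, plus a check that flags a value well above the historical mean. The names, the three-sigma rule, and the baseline numbers are assumptions for illustration, not a description of any particular APM product.

```python
import statistics
import time
from contextlib import contextmanager

timings = {}  # operation name -> list of durations in seconds


@contextmanager
def timed(name):
    """Record how long the wrapped block takes under the given name."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings.setdefault(name, []).append(time.perf_counter() - start)


def deviates_from_norm(history, latest, n_sigma=3.0):
    """True if latest is more than n_sigma standard deviations above the mean."""
    if len(history) < 2:
        return False  # not enough data to define a norm
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return latest > mean + n_sigma * stdev


with timed("checkout"):
    sum(range(1000))  # stand-in for real application work

baseline = [0.10, 0.12, 0.11, 0.09, 0.10]  # illustrative latencies in seconds
print(deviates_from_norm(baseline, 0.11))  # False
print(deviates_from_norm(baseline, 0.50))  # True
```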

Detection - Log Analytics
Log analysis opens new detection avenues, offering insight
particularly into security and compliance concerns.
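As a simple security-flavored example, a log scan can surface sources with repeated failed logins. The log format, regular expression, and threshold below are hypothetical; real log formats vary by system.

```python
import re
from collections import Counter

# Hypothetical log line format, assumed for this sketch.
FAILED_LOGIN = re.compile(r"Failed login for user=(\S+) from ip=(\S+)")


def suspicious_sources(lines, threshold=3):
    """Return {ip: count} for IPs with at least `threshold` failed logins."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(2)] += 1
    return {ip: n for ip, n in counts.items() if n >= threshold}


log_lines = [
    "2024-01-01T10:00:01 Failed login for user=alice from ip=10.0.0.5",
    "2024-01-01T10:00:02 Failed login for user=bob from ip=10.0.0.5",
    "2024-01-01T10:00:03 Login ok for user=carol from ip=10.0.0.9",
    "2024-01-01T10:00:04 Failed login for user=root from ip=10.0.0.5",
    "2024-01-01T10:00:05 Failed login for user=dave from ip=10.0.0.7",
]
print(suspicious_sources(log_lines))  # {'10.0.0.5': 3}
```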

#2 Focus on Business Outcomes
Business Activity Monitoring maps key metrics
to data or flows within the IT environment.
These datapoints become detectable conditions
for actionable alerting.

Business Objectives
● Revenue (SDLW, YoY)
● Transactions
● User Registration
● Conversion
● NPS Scores
● Social Media Reviews
● Email Deliverability
● SEM/Referral Volume
● Channel Activity
[Diagram categories: KPI, 3rd Party, Indirect]
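A business datapoint can be turned into a detectable condition by comparing it with a reference period. The sketch below reads "SDLW" as "same day last week", which is an assumption, and the metric names, values, and 20% limit are hypothetical.

```python
def percent_change(current, reference):
    """Relative change of current versus reference, as a percentage."""
    if reference == 0:
        raise ValueError("reference value must be non-zero")
    return (current - reference) / reference * 100.0


def business_alert(metric, current, same_day_last_week, max_drop_pct=20.0):
    """Return an alert message if the metric dropped more than max_drop_pct, else None."""
    change = percent_change(current, same_day_last_week)
    if change < -max_drop_pct:
        return f"{metric} is down {abs(change):.1f}% vs same day last week"
    return None


print(business_alert("registrations", current=140, same_day_last_week=200))
# registrations is down 30.0% vs same day last week
print(business_alert("revenue", current=9800.0, same_day_last_week=10000.0))
# None
```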

Response

#3 Alerts are Actionable

Alert Fatigue is a Leading Cause of Burnout
“I get paged for issues that I
can’t resolve; most of my time is
either researching a problem
that is transient or non
reproducible, or contacting a
vendor.”
“I’m on call for everything from
infrastructure issues to, much
more often, application issues for
software I didn’t write and don’t
own.”

Actionability Exists on Two Dimensions
The alert must be actionable, and differentiated
from something that’s merely informational.

Alerts must route to someone who has the access,
permission, and skills to adequately perform said
action.
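The two dimensions can be sketched as a routing decision: first, is the alert actionable at all; second, who owns the affected service. The `Alert` fields, the ownership map, and the team names below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Alert:
    service: str
    summary: str
    actionable: bool  # False means informational only


# Hypothetical ownership map: service -> team with the access, permission, and skills.
OWNERS = {"checkout-api": "payments-oncall", "postgres": "dba-oncall"}


def route(alert, owners=OWNERS, fallback="ops-triage"):
    """Return (channel, target): page the owning team, or only log the alert."""
    if not alert.actionable:
        return ("log", None)  # informational: record it, wake no one
    return ("page", owners.get(alert.service, fallback))


print(route(Alert("postgres", "replication lag above limit", True)))
# ('page', 'dba-oncall')
print(route(Alert("checkout-api", "nightly deploy finished", False)))
# ('log', None)
print(route(Alert("unknown-svc", "disk full", True)))
# ('page', 'ops-triage')
```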

Alert Fidelity
Ensure you’re measuring the right thing,
at the proper precision.

#4 ChatOps is an Area of Focus

Providing a Central Collaboration Channel

Integrating with Information and Workflow

Remediation

#5 Runbooks are Central to Remediation Efforts

Great Runbooks...
● provide clear explanations of system metrics and alert thresholds
● clearly identify dependencies
● clearly identify an SME
● track incident history
● list known failure conditions
● are accessible to all
● are routinely updated
#6 Configuration and Infrastructure as Code

Infrastructure as Code
● Empower Dev and Ops to
manage systems
● Create adaptable patterns
● Enable quicker iterations
● Ensure auditable change
management
● Common deployment pipelines
● Instrument everything!
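The core idea behind auditable change management can be sketched as a diff between a desired state kept in version control and the actual state of a system. This is a toy illustration only; the setting names and values are hypothetical, and real tools do far more than this.

```python
def plan_changes(desired, actual):
    """Compare desired and actual settings; return change records sorted by key.

    Each record is (action, key, old_value, new_value).
    """
    changes = []
    for key in sorted(set(desired) | set(actual)):
        if key not in actual:
            changes.append(("add", key, None, desired[key]))
        elif key not in desired:
            changes.append(("remove", key, actual[key], None))
        elif desired[key] != actual[key]:
            changes.append(("update", key, actual[key], desired[key]))
    return changes


# Hypothetical settings: 'desired' would live in the repository, 'actual' on the host.
desired = {"instance_count": 4, "log_level": "info", "metrics_enabled": True}
actual = {"instance_count": 2, "log_level": "info", "debug_port": 9000}

plan = plan_changes(desired, actual)
for change in plan:
    print(change)
# ('remove', 'debug_port', 9000, None)
# ('update', 'instance_count', 2, 4)
# ('add', 'metrics_enabled', None, True)
```

Because the plan is plain data, it can be reviewed and logged before anything is applied, which is where the audit trail comes from.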

Analysis

#7 Data Drives Investigations

Analysis of Systems
*ArchonMagnus Wikipedia

#8 Postmortems are Blameless

Great Postmortems...
Focus on learning
Focus on data
Require objectivity
Encourage participation

Readiness

#9 Postmortems Populate Backlogs
● Call Postmortem
● Overview of Incident
● Review timeline
● Review Remediation
● Review Response
● Discuss improvements
● Add to Backlog
● Involve PM

#10 Organize Team Responses

Organization of the Swarm
Formal escalation policies including a catch-all escalation
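A formal escalation policy with a catch-all can be sketched as an ordered list of steps, each with a wait time, plus a final target used once every step has timed out. The step names and wait times below are hypothetical.

```python
def current_target(steps, minutes_unacknowledged, catch_all):
    """Return who should be paged after the given unacknowledged time.

    steps is an ordered list of (target, wait_minutes). When every step's
    window has elapsed, the catch-all target is returned.
    """
    elapsed = 0
    for target, wait_minutes in steps:
        elapsed += wait_minutes
        if minutes_unacknowledged < elapsed:
            return target
    return catch_all


# Hypothetical policy: 10 minutes each for primary and secondary, then 15 for the lead.
policy = [("primary on-call", 10), ("secondary on-call", 10), ("team lead", 15)]

print(current_target(policy, 0, "all-engineering"))   # primary on-call
print(current_target(policy, 12, "all-engineering"))  # secondary on-call
print(current_target(policy, 25, "all-engineering"))  # team lead
print(current_target(policy, 40, "all-engineering"))  # all-engineering
```

The catch-all guarantees that an unacknowledged alert always lands on someone, even when the named responders are unreachable.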

Who Ya Gonna Call?
Formal communication plan

Defined Roles
Incident Commander (Quarterback) - directs and
dedupes team efforts. Calls the ball on action
Incident Communicator (Scribe) - handles all async
communication, records team actions


Conclusion
Detection Response Remediation Analysis Readiness
Time to Repair (MTTR)
Time to Learn (TTL)

THANK YOU!
@matthewboeckman
Slides on devops.com & slideshare.com
