Matthew Boeckman
Developer Advocate
@matthewboeckman
Background
● 18 years on-call Ops
● 15 years w/software teams
● Startup junkie
● DevOps enthusiast
3
What is VictorOps?
VictorOps ingests all of your alerts from your current monitoring tools and becomes the logical
layer between your alerts and the people who receive them.
4
Five Phases of Incident Management
Detection
● monitoring
● metrics
● thresholds
Response
● alerting
● on-call
● escalation
Remediation
● fixes
● tickets
● deployment
Analysis
● postmortem
● how or why
● understand
Readiness
● improve
● game days
● learning
5
Standard Incident Lifecycle
6
MTTR - Our Core Metric
Detection - Response - Remediation - Analysis - Readiness
Time to Repair (MTTR)
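As a rough illustration of the metric, MTTR can be computed as the mean of per-incident repair durations. This is a minimal sketch that assumes the clock starts at detection and stops at resolution; the `Incident` record and its field names are hypothetical and do not come from the slides.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    # Hypothetical record: when the problem was detected and when it was fixed.
    detected_at: datetime
    resolved_at: datetime


def mean_time_to_repair(incidents: list[Incident]) -> timedelta:
    """Average of (resolved_at - detected_at) across all incidents."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((i.resolved_at - i.detected_at for i in incidents), timedelta())
    return total / len(incidents)
```

Tracking this value per iteration is one way to see whether changes to any phase are shortening the overall lifecycle.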
7
#0 Iteration!
8
Detection - Response - Remediation - Analysis - Readiness
Detection
9
#1 Blended Approach to Detection
10
#1 Blended Approach to Detection
● Synthetic Testing
● Time-series data
● Application Monitoring
● Log Analytics
11
Detection - Synthetic Testing
Synthetic monitoring leverages scripted web interactions to validate critical path user interactions (or system interactions).
Diagram nodes: Landing Page, Existing User?, Registration, Welcome Stream, Login, User Home, Interaction
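A scripted critical-path check can be sketched as below. This is an illustration only: the `Step` type, the `must_contain` marker text, and the injectable `fetch` function are assumptions made for the example, and a real synthetic monitor would usually drive a browser or a vendor tool rather than plain HTTP requests.

```python
from typing import Callable, NamedTuple
from urllib.request import urlopen


class Step(NamedTuple):
    name: str          # e.g. "Landing Page"
    url: str           # hypothetical URL for that step
    must_contain: str  # text expected in a healthy response


def http_fetch(url: str) -> tuple[int, str]:
    """Default fetcher built on the standard library."""
    with urlopen(url, timeout=10) as resp:
        return resp.status, resp.read().decode("utf-8", errors="replace")


def run_synthetic_check(
    steps: list[Step],
    fetch: Callable[[str], tuple[int, str]] = http_fetch,
) -> list[str]:
    """Walk the scripted path in order and return the names of failed steps."""
    failures = []
    for step in steps:
        try:
            status, body = fetch(step.url)
            ok = status == 200 and step.must_contain in body
        except Exception:
            ok = False  # network errors and HTTP errors count as failures
        if not ok:
            failures.append(step.name)
    return failures
```

Passing a fake `fetch` makes the check easy to exercise without a network; a non-empty result is the detectable condition to alert on.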
12
Detection - Time Series
Systems are not static.
Why treat measurements as if they were?
13
Detection - Rate of Change
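One way to act on rate of change rather than a static threshold is to compare consecutive samples. The function below is a minimal sketch; the sample format (timestamp in seconds, value) and the `max_rate_per_sec` parameter are assumptions for illustration.

```python
def rate_of_change_alerts(samples, max_rate_per_sec):
    """Flag points where the value changes faster than max_rate_per_sec.

    samples: list of (timestamp_seconds, value) pairs, sorted by time.
    Returns a list of (timestamp, rate) for each offending interval.
    """
    alerts = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order samples
        rate = (v1 - v0) / dt
        if abs(rate) > max_rate_per_sec:
            alerts.append((t1, rate))
    return alerts
```

For example, a disk at 90% would not trip a static 95% threshold, but a jump from 52% to 90% within a minute is caught by the rate check.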
14
Detection - Application Performance (APM)
APM monitors complex application behaviors and reports or alerts on deviation from norms.
Areas covered: UX, Runtime, Tracing, Internals
● RUM
● Transactions
● Page Performance
● Thread Profiling
● Application internals
● Timings and Counters
● Microservices
● Transactional path
● 3rd party calls
● Dependency Management
● Circuit Breakers
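"Deviation from norms" can be approximated with a simple baseline comparison. The sketch below uses a standard-deviation test, which is one common choice and not necessarily what any particular APM product does; the function name and the default of three standard deviations are assumptions.

```python
from statistics import mean, stdev


def deviates_from_norm(baseline, observed, max_sigma=3.0):
    """Return True when `observed` is more than max_sigma standard
    deviations away from the mean of the baseline measurements."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline measurements")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        return observed != mu  # flat baseline: any change is a deviation
    return abs(observed - mu) / sigma > max_sigma
```

With recent response times as the baseline, a new measurement far outside the usual spread becomes an alert candidate.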
15
Detection - Log Analytics
Log Analysis opens new detection avenues with insight
particularly into security and compliance concerns.
16
#2 Focus on Business Outcomes
These datapoints become Detectable conditions
for actionable alerting.
17
#2 Focus on Business Outcomes
Business Activity Monitoring maps key metrics
to data or flows within the IT environment.
● Email Deliverability
● SEM/Referral Volume
● Channel Activity
18
Business Objectives
Categories: KPI, 3rd Party, Indirect
● Revenue (SDLW, YoY)
● Transactions
● User Registration
● Conversion
● NPS Scores
● Social Media Reviews
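A business metric such as revenue can be turned into a detectable condition by comparing it against a reference period. The sketch below reads "SDLW" as "same day last week", which is an assumption about the abbreviation; the function name, data shape, and 20% default drop are also hypothetical.

```python
from datetime import date, timedelta


def sdlw_drop_alert(daily_revenue: dict[date, float], day: date,
                    max_drop: float = 0.2) -> bool:
    """True when revenue on `day` is more than `max_drop` (as a fraction)
    below revenue on the same weekday one week earlier."""
    previous = daily_revenue.get(day - timedelta(days=7))
    current = daily_revenue.get(day)
    if previous is None or current is None or previous <= 0:
        return False  # not enough data to compare
    return (previous - current) / previous > max_drop
```

The same comparison works for transactions, registrations, or conversion once those values are collected per day.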
19
Poll #1
20
Detection - Response - Remediation - Analysis - Readiness
Response
21
#3 Alerts are Actionable
22
Alert Fatigue is a Leading Cause of Burnout
“I get paged for issues that I can’t resolve; most of my time is either researching a problem that is transient or non-reproducible, or contacting a vendor.”
“I’m on call for everything from infrastructure issues to, much more often, application issues for software I didn’t write and don’t own.”
23
Actionability Exists on Two Dimensions
The alert must be actionable, and differentiated
from something that’s merely informational.
Alerts must route to someone who has the access,
permission, and skills to adequately perform said
action.
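The two dimensions can be sketched as a routing decision: first separate actionable alerts from informational ones, then send actionable alerts to the team that owns the service. The `Alert` type, the ownership map, and the action labels below are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Alert:
    service: str
    summary: str
    actionable: bool  # False means the alert is informational only


def route(alert: Alert, owners: dict[str, str]) -> tuple[str, Optional[str]]:
    """Decide what to do with an alert.

    owners maps a service name to the team with the access, permission,
    and skills to act on it (a hypothetical ownership table).
    """
    if not alert.actionable:
        return ("record", None)   # keep for context, page nobody
    team = owners.get(alert.service)
    if team is None:
        return ("triage", None)   # actionable but unowned: fix the mapping
    return ("page", team)
```

Alerts that land in "triage" are a signal that ownership is undefined, which is the situation the quotes above describe.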
24
Actionability Exists on Two Dimensions
Ensure you’re measuring the right thing,
at the proper precision.
25
Alert Fidelity
26
#4 ChatOps is an Area of Focus
27
Providing a Central Collaboration Channel
28
Integrating with Information and Workflow
29
Managing Everything
30
Poll #2
31
Detection - Response - Remediation - Analysis - Readiness
Remediation
32
#5 Runbooks are Central to Remediation Efforts
33
Great Runbooks...
...provide clear explanations of system metrics and alert thresholds
34
Great Runbooks...
...clearly identify dependencies
...clearly identify an SME
35
Great Runbooks...
...track incident history
...list known failure conditions
36
Great Runbooks...
...are accessible to all
...are routinely updated
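The runbook qualities listed on these slides can be captured as a small record plus a completeness check. This is a sketch under assumptions: the field names are invented for illustration, and the 90-day staleness window is an arbitrary example value, not a recommendation from the talk.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Runbook:
    service: str
    threshold_notes: str       # explanation of metrics and alert thresholds
    dependencies: list[str]
    sme: str                   # subject matter expert to contact
    last_updated: date
    known_failures: list[str] = field(default_factory=list)
    incident_history: list[str] = field(default_factory=list)


def runbook_gaps(rb: Runbook, today: date, max_age_days: int = 90) -> list[str]:
    """List which of the 'great runbook' qualities are missing."""
    gaps = []
    if not rb.threshold_notes.strip():
        gaps.append("threshold explanations")
    if not rb.dependencies:
        gaps.append("dependencies")
    if not rb.sme.strip():
        gaps.append("SME")
    if not rb.known_failures:
        gaps.append("known failure conditions")
    if (today - rb.last_updated).days > max_age_days:
        gaps.append("stale")
    return gaps
```

Running a check like this across all runbooks gives a simple backlog of documentation work.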
37
#6 Configuration and Infrastructure as Code
38
Infrastructure as Code
39
Infrastructure as Code
● Empower Dev and Ops to manage systems
● Create adaptable patterns
● Enable quicker iterations
● Ensure auditable change management
● Common deployment pipelines
● Instrument everything!
40
Detection - Response - Remediation - Analysis - Readiness
Analysis
41
#7 Data Drives Investigations
42
Analysis of Systems
*ArchonMagnus Wikipedia
43
Analysis of People
44
There is no Root Cause
45
Poll #3
46
#8 Postmortems are Blameless
47
Great Postmortems...
Focus on learning
Focus on data
Require objectivity
Encourage participation
48
Detection - Response - Remediation - Analysis - Readiness
Readiness
49
#9 Postmortems Populate Backlogs
50
#9 Postmortems Populate Backlogs
● Call Postmortem
● Overview of Incident
● Review timeline
● Review Remediation
● Review Response
● Discuss improvements
● Add to Backlog
● Involve PM
51
#10 Organize Team Responses
52
Organization of the Swarm
Formal escalation policies including a catch-all escalation
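An escalation policy with a catch-all can be sketched as an ordered list of steps plus a final fallback. The step structure, the target names, and the timings below are hypothetical; real paging tools express this through their own configuration.

```python
from typing import NamedTuple


class EscalationStep(NamedTuple):
    after_minutes: int  # minutes an alert may stay unacknowledged
    target: str         # who gets paged at that point


def who_to_page(policy: list[EscalationStep], catch_all: EscalationStep,
                minutes_unacked: int) -> str:
    """Return the target for an alert unacknowledged for the given time.

    The catch-all takes over once its delay has passed, and also covers
    the case where no regular step applies, so an alert is never unowned.
    """
    due = [s for s in policy if s.after_minutes <= minutes_unacked]
    if not due or minutes_unacked >= catch_all.after_minutes:
        return catch_all.target
    return max(due, key=lambda s: s.after_minutes).target
```

For example, with steps at 0 and 15 minutes and a catch-all at 30 minutes, an alert ignored for 45 minutes reaches the catch-all target.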
53
Who Ya Gonna Call?
Formal communication plan
54
Defined Roles
Incident Commander (Quarterback) - directs and dedupes team efforts. Calls the ball on action.
55
Defined Roles
Incident Communicator (Scribe) - handles all async communication, records team actions.
56
MTTR - Our Core Metric
Detection - Response - Remediation - Analysis - Readiness
Time to Repair (MTTR)
57
Conclusion
Detection - Response - Remediation - Analysis - Readiness
Time to Repair (MTTR)
Time to Learn (TTL)
THANK YOU!
@matthewboeckman
Slides on devops.com & slideshare.com

Top 10 Practices of Highly Successful DevOps Incident Management Teams