incident & problem
management
LIGHTWEIGHT ITIL.
Berlin, November 2015
Agenda
1. why?
2. overview
3. example
Problem — too frequent incidents
in the live product
Let’s take a look into ITIL.
Wow! It has exactly what we need!
Let’s take the best out of
Problem and Incident Management!
https://www.flickr.com/photos/parkstreetparrot/9764446493
incident
— unplanned interruption or a serious
reduction in the service quality
why incident management?
○ restore service asap
○ avoid unnecessary involvement
○ avoid mistakes
how to manage incidents?
identify: is it really an incident?
handle: use the defined procedure
close: add an incident record
https://www.flickr.com/photos/parkstreetparrot/9764364985
https://www.flickr.com/photos/parkstreetparrot/9764378536
https://www.flickr.com/photos/parkstreetparrot/9764364706
EXAMPLE
The goal — to minimize the amount
and severity of incidents in
live online games
We don’t treat every bug as an incident.
Incident criteria were defined.
identify
Incident only when:
➔ the game becomes unavailable, or
➔ game revenue drops by more than €XXX, or
➔ there are severe issues with servers, or
➔ the issue can't wait for the next planned deployment
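The four criteria above can be sketched as a small check. This is only an illustration: the `Issue` fields and the `is_incident` function are hypothetical names, and the revenue threshold is passed in as a parameter because the slides elide the real value (€XXX).

```python
from dataclasses import dataclass


@dataclass
class Issue:
    """A reported issue on a live game (all field names are illustrative)."""
    game_unavailable: bool = False
    revenue_drop_eur: float = 0.0
    severe_server_issue: bool = False
    can_wait_for_next_deployment: bool = True


def is_incident(issue: Issue, revenue_threshold_eur: float) -> bool:
    """Return True if at least one of the four criteria holds."""
    return (
        issue.game_unavailable
        or issue.revenue_drop_eur > revenue_threshold_eur
        or issue.severe_server_issue
        or not issue.can_wait_for_next_deployment
    )


# Placeholder only: the real threshold is not given in the slides.
PLACEHOLDER_THRESHOLD_EUR = 1000.0

cosmetic_bug = Issue()
outage = Issue(game_unavailable=True)
print(is_incident(cosmetic_bug, PLACEHOLDER_THRESHOLD_EUR))  # False
print(is_incident(outage, PLACEHOLDER_THRESHOLD_EUR))        # True
```

Keeping the threshold as an argument means the rule can be tuned per game without touching the check itself.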
We don’t panic when the incident
occurs. We follow the process:
handle
➔ Elect a SWAT team
➔ Plan Communication
➔ Kick-off
➔ Check the Knowledge Base
➔ Create an IM chat group
➔ Send email notifications to stakeholders on every update
➔ Follow defined policies and guidelines
We act smartly after
the incident is resolved:
close
➔ prevent recurrences
➔ update stakeholders
➔ submit the Incident record
➔ update the Knowledge Base if necessary
➔ propose process improvements
Outcomes
★ incident resolved or worked around
★ incidents database updated
problem
management
why problem management?
○ recognize problems
○ permanent solutions
○ fewer emergencies
identify problems from incident records
submit problems (problems can also be added directly)
implement permanent solutions
we ask questions:
could any of the incidents be prevented?
can we detect incident symptoms?
are there any patterns?
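The last question can be answered with a simple count over the incident records. The sketch below is an assumption about how such a check might look: the record fields (`game`, `category`) and the sample data are invented for illustration and are not taken from the actual incident spreadsheet.

```python
from collections import Counter

# Hypothetical incident records; field names and values are illustrative.
incidents = [
    {"id": 1, "game": "game-a", "category": "database"},
    {"id": 2, "game": "game-b", "category": "payment"},
    {"id": 3, "game": "game-a", "category": "database"},
    {"id": 4, "game": "game-a", "category": "deployment"},
]


def recurring_patterns(records, min_count=2):
    """Return (game, category) pairs seen in at least min_count incidents."""
    counts = Counter((r["game"], r["category"]) for r in records)
    return {key: n for key, n in counts.items() if n >= min_count}


print(recurring_patterns(incidents))  # {('game-a', 'database'): 2}
```

Any pair that shows up repeatedly is a candidate problem to raise in the next problem management meeting.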
Summary
“the aim of incident management is to restore the service as quickly as possible, often through a workaround, rather than through trying to find a permanent solution which is the aim of problem management.”
Appendices
incident management process

IDENTIFICATION
‣ Receive data regarding the incident and ensure it is complete and clear.
‣ Qualify the issue as an Incident.
DELIVERABLES
➨ Incident process is triggered

HANDLING
‣ Elect a SWAT team to fix the incident.
‣ Decide on a War Room for the Huddle.
‣ Huddle and lay down an Action Plan.
‣ Send out an email notification to all stakeholders. No one is allowed to disturb the SWAT team while they actively investigate or resolve the Incident.
‣ Huddle regularly to update the Action Plan.
‣ Send out updates to all stakeholders.
‣ If DevOps support is necessary, follow the “Emergency IT Support Policy”.
‣ Follow the “Live Actions Guidelines”.
DELIVERABLES
➨ Resolved incident (possibly via a workaround)
➨ Incident report(s) sent to XXX email

CLOSURE
After the incident has been solved, make sure to:
‣ Communicate the results to relevant stakeholders by sending mail following the 'Issue on Live' closure procedure, as per the template.
‣ Take corrective actions to prevent the issue from happening again. Create JIRA tickets where possible.
‣ Evaluate possible procedure updates that can be made in the teams in the pipeline.
‣ Submit the “Incident Login” form.
DELIVERABLES
➨ Report sent to XXX email
➨ Related JIRA tickets submitted
➨ “Incident Login” form submitted
[Flowchart: incident management process]
IDENTIFICATION
‣ Incident detected
‣ Send email (keep one thread)
HANDLING
‣ Action plan (five-minute huddle of the SWAT team in a war room)
‣ Create / Update JIRAs (contact OPS if necessary)
‣ Fix (first QA, then Live)
‣ Send email
‣ Resolved? (yes / no)
‣ Open > 3 days? (yes / no)
‣ Add / Update Incident record (via Incident Login form)
CLOSURE
‣ Create JIRAs for fixing root cause or other related issues if possible
‣ Add / Update Incident record (via Incident Login form)
problem management process
Stages: PROBLEM DETECTION, ROOT CAUSE IDENTIFICATION, SOLUTION DEFINITION, PRIORITISATION, PROBLEM LOGGING, IMPLEMENTATION & CLOSURE

PROBLEM DETECTION
ACTIVITIES
‣ Define the problem
‣ Receive data regarding the problem from incident management
‣ Ensure the collected data is complete and clear
‣ Define which teams or departments are affected
‣ Gather other data from the day of the incident
‣ Analyze symptoms
‣ Analyze the data collected from various sources relating to the major incident
‣ Analyze historical data to see whether the problem has occurred before
DELIVERABLES
➨ Analyzed problem
➨ Updated incident record
ROOT CAUSE IDENTIFICATION
ACTIVITIES
Problem investigation and diagnosis (requires tech experts)
‣ Conduct root cause analysis, using various techniques if necessary:
• Make a sketch
• Draw an Ishikawa (fishbone) diagram
• Kepner-Tregoe
• Flow diagrams
• etc.
‣ Determine workarounds
‣ Think of potential solutions
‣ Assess the problem and the recommended actions to resolve it
DELIVERABLES
➨ Updated problem record
➨ Root cause detected
➨ Workaround(s) identified
SOLUTION DEFINITION
ACTIVITIES
‣ Identify the team for solution development
‣ Determine possible resolutions
‣ Choose the best approach
‣ Make sure the solution can effectively prevent recurrence
DELIVERABLES
➨ Updated problem record
➨ Other tasks in JIRA
➨ Updated incident records
➨ Defined resources that are necessary for implementation
PRIORITISATION
ACTIVITIES
‣ Identify the urgency and impact of this task
‣ Define a priority in the Problem management queue
‣ Identify who is responsible for the implementation
‣ Decide how this problem should be prioritized among the team's other tasks
DELIVERABLES
➨ Each task has a priority
➨ The team leads are aware of the task and can plan it in their sprints
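The slides do not say how urgency and impact are combined into a priority. One common approach, shown here purely as an assumption, is to score each on a small scale and order the queue by the product. The level names, the `priority_score` and `order_queue` functions, and the sample problems are all hypothetical.

```python
# Assumed three-level scale; the slides do not define one.
LEVELS = {"low": 1, "medium": 2, "high": 3}


def priority_score(urgency: str, impact: str) -> int:
    """Combine urgency and impact into a single score (higher = more pressing)."""
    return LEVELS[urgency] * LEVELS[impact]


def order_queue(problems):
    """Return problems sorted from highest to lowest priority score."""
    return sorted(
        problems,
        key=lambda p: priority_score(p["urgency"], p["impact"]),
        reverse=True,
    )


queue = [
    {"id": "P-1", "urgency": "low", "impact": "high"},
    {"id": "P-2", "urgency": "high", "impact": "high"},
    {"id": "P-3", "urgency": "medium", "impact": "medium"},
]

print([p["id"] for p in order_queue(queue)])  # ['P-2', 'P-3', 'P-1']
```

A score like this only proposes an order; the final placement among a team's other tasks is still decided with the team leads, as the activities above describe.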
PROBLEM LOGGING
ACTIVITIES
‣ Create a new JIRA record or update the old one:
• Unique ID, timestamp
• Name of submitter
• Links to associated problem records (with hierarchy if applicable)
• Links to associated incident records
• Problem description
• Problem category
• Status
• Severity and Impact
• Responsible person, team
• Affected game
• Associated JIRA records
• History of all actions taken
• Workaround
• Permanent solution (if already known)
DELIVERABLES
➨ Created/updated problem record
➨ Analyzed and updated incident data
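The fields listed above map naturally onto a simple record type. The sketch below is one possible shape, not the actual JIRA schema used by the team: the class name, attribute names, default values, and sample identifiers are assumptions chosen to mirror the bullet list.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class ProblemRecord:
    """Illustrative problem record mirroring the fields on the slide."""
    record_id: str
    submitter: str
    description: str
    category: str
    affected_game: str
    severity: str
    impact: str
    responsible: str  # responsible person or team
    status: str = "open"
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    linked_problem_ids: List[str] = field(default_factory=list)
    linked_incident_ids: List[str] = field(default_factory=list)
    jira_keys: List[str] = field(default_factory=list)
    history: List[str] = field(default_factory=list)
    workaround: Optional[str] = None
    permanent_solution: Optional[str] = None  # filled in once known

    def log_action(self, note: str) -> None:
        """Append an entry to the history of actions taken."""
        self.history.append(note)


record = ProblemRecord(
    record_id="PRB-1",
    submitter="problem manager",
    description="Repeated database timeouts during peak hours",
    category="database",
    affected_game="game-a",
    severity="high",
    impact="high",
    responsible="backend team",
    linked_incident_ids=["INC-1", "INC-3"],
)
record.log_action("Workaround applied: connection pool size increased")
print(record.status, len(record.history))  # open 1
```

Linking incident IDs on the problem record is what later allows the associated incidents to be closed together when the permanent fix ships.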
IMPLEMENTATION, CLOSURE
ACTIVITIES
‣ Carry out the work to implement the fix for the problem
‣ Verify that the solution is appropriate and close the problem record
‣ Submit a record to the Error Knowledge Base if applicable
‣ Share lessons learned via email if reasonable
‣ Ensure that all the associated incidents are closed with a proper fix or resolution
DELIVERABLES
➨ Updated incident record
➨ Updated problem record
➨ Updated Known Errors Knowledge Base spreadsheet
➨ Lessons learned shared
➨ Report sent
[Flowchart: problem management process]
DETECTION & LOGGING
‣ Choose the problem area
‣ Analyze related incident data (symptoms, relations, historical data)
‣ Request missing data (symptoms, relations, historical data)
‣ Create new Problem Record / update existing (JIRA)
ROOT CAUSE IDENTIFICATION & SOLUTION DEFINITION
‣ Identify root cause, workarounds
‣ Determine work for identified solutions (and choose implementation team)
‣ Prioritize
IMPLEMENTATION & CLOSURE
‣ Implement (by defined implementation team)
‣ Close (update knowledge base, submit lessons learned, send email)
Records updated along the way: Incident record(s), Problem record, Known Errors
problem management process simplified

Prepare Meeting
Who: Problem Manager
Process summary:
- ensure the quality of the incident spreadsheet
- select follow-ups
- prefill the problem management spreadsheet
Effort: 3-5 mh
When: no later than 3 days before the meeting

Run Meeting
Chairperson: Problem Manager
Participants: PMs, APs, OPS representative
Frequency: monthly
Activities:
- identify problems
- prioritize & agree on actions
- define responsible teams
When: at the end of the month. In case of holidays or an emergency, moved to the next working day.
Outcomes: Assigned tasks

Take Actions
Who: a particular person is responsible for every problem, as defined in the meeting
Process summary:
- implement
- verify
- update all records
Outcomes: Updated incident and problem records
thank you!
Valentyn Barmak
ask for more:
http://www.linkedin.com/in/valentineb
https://www.xing.com/profile/Valentyn_Barmak
www.barmak.de
