The document provides an overview of incident and problem management based on ITIL practices, focusing on minimizing the frequency and severity of incidents in live online games. It details the processes of identifying, handling, and resolving incidents, alongside a structured approach to problem management aimed at preventing future incidents. The content also emphasizes communication with stakeholders and the continuous improvement of operational procedures.
Overview of incident and problem management concepts, introducing the lightweight ITIL framework.
Overview of incidents; importance of incident management to restore services quickly and avoid unnecessary involvement.
Steps to identify, handle, and close incidents using defined procedures to minimize disruption. Example demonstrating incident identification criteria and actions taken to reduce incidents in online games.
Definition of problem management and its objective to recognize problems to implement permanent solutions.
Process of identifying issues from incident records, submitting problems for permanent solutions.
Critical questions to explore regarding incident prevention and detection of symptoms and patterns.
Comparison of incident management focusing on quick service restoration versus problem management.
Detailed workflow of incident management process including identification, handling, and closure.
Closing remarks and acknowledgment from the presenter with contact information for further inquiries.
identify
Incident only when:
➔ game becomes unavailable, or
➔ game revenue drops by more than €XXX, or
➔ severe issues with servers, or
➔ it can't wait for the next planned deployment
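The identification criteria above can be sketched as a single predicate. This is a minimal illustration, not the team's actual tooling: `IssueReport`, its field names, and `is_incident` are hypothetical, and the revenue threshold (shown as €XXX on the slide) is deliberately left as a value the caller must supply.

```python
from dataclasses import dataclass


@dataclass
class IssueReport:
    """Hypothetical summary of a reported issue (field names are assumptions)."""
    game_unavailable: bool
    revenue_drop_eur: float
    severe_server_issue: bool
    can_wait_for_next_deployment: bool


def is_incident(report: IssueReport, revenue_threshold_eur: float) -> bool:
    """Return True when at least one of the four slide criteria holds."""
    return (
        report.game_unavailable
        or report.revenue_drop_eur > revenue_threshold_eur
        or report.severe_server_issue
        or not report.can_wait_for_next_deployment
    )
```

Anything that fails all four checks would stay out of the incident process and be handled through the normal deployment pipeline.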
We don’t panic when the incident occurs. We follow the process:
➔ Elect a SWAT team
➔ Plan Communication
➔ Kick-off
➔ Check the Knowledge Base
➔ Create an IM chat group
➔ Send email notifications to stakeholders on every update
➔ Follow defined policies and guidelines
handle
we ask questions:
could any of the incidents be prevented?
can we detect incident symptoms?
are there any patterns?
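One way to approach the last question, whether there are patterns, is to count how often incidents share the same game and category. The sketch below assumes incidents are available as `(game, category)` tuples; that layout, the function name `recurring_patterns`, and the `min_occurrences` default are illustrative assumptions, not the format of the real incident record.

```python
from collections import Counter


def recurring_patterns(incidents, min_occurrences=2):
    """Return (game, category) pairs seen at least `min_occurrences` times.

    `incidents` is an iterable of (game, category) tuples; this layout is an
    assumption made for illustration only.
    """
    counts = Counter(incidents)
    return {key: n for key, n in counts.items() if n >= min_occurrences}


# Invented sample data, for illustration only.
sample = [
    ("game-a", "payment"),
    ("game-b", "login"),
    ("game-a", "payment"),
    ("game-a", "login"),
]
print(recurring_patterns(sample))
```

Pairs that keep recurring are natural candidates to raise as problems in the monthly review.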
“the aim of incident management is to restore the service as
quickly as possible, often through a workaround, rather than
through trying to find a permanent solution which is the aim
of problem management.”
Summary
incident management process
IDENTIFICATION
‣ Receive data regarding the incident and ensure it is full and clear.
‣ Qualify the issue as an Incident.
DELIVERABLES
➨ Incident process is triggered
HANDLING
‣ Elect a SWAT team to fix the incident issue.
‣ Decide on a War Room for the Huddle.
‣ Huddle and lay down an Action Plan.
‣ Send out an email notification to all stakeholders. No one is allowed to disturb the SWAT team while they actively investigate or resolve the Incident.
‣ Huddle regularly to update the Action Plan.
‣ Send out updates to all stakeholders.
‣ If DevOps is necessary, follow the “Emergency IT Support Policy”.
‣ Follow the “Live Actions Guidelines”.
DELIVERABLES
➨ Resolved incident (possibly through a workaround)
➨ Sent Incident report(s) to XXX email
CLOSURE
After the incident has been solved, make sure to:
‣ Communicate the results to relevant stakeholders by sending mail following the 'Issue on Live' closure procedure as per the template.
‣ Take corrective actions to prevent the issue from happening again. Create JIRA tickets where possible.
‣ Evaluate possible procedure updates that can be made in the teams in the pipeline.
‣ Submit the “Incident Login” form.
DELIVERABLES
➨ Sent report to XXX email
➨ Submitted related JIRA tickets
➨ Submitted “Incident Login” form
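The closure deliverables listed above lend themselves to a simple checklist. The following is only a sketch: the identifiers in `CLOSURE_DELIVERABLES` and both function names are hypothetical labels for the three deliverables, and the slide's "where possible" caveat for JIRA tickets is not modelled.

```python
# Hypothetical identifiers for the three closure deliverables on the slide.
CLOSURE_DELIVERABLES = (
    "report_sent",
    "jira_tickets_submitted",
    "incident_login_form_submitted",
)


def missing_closure_deliverables(completed):
    """List the deliverables not yet in `completed`, in slide order."""
    return [item for item in CLOSURE_DELIVERABLES if item not in completed]


def can_close(completed):
    """An incident is ready to close once nothing is missing."""
    return not missing_closure_deliverables(completed)
```

For example, `missing_closure_deliverables({"report_sent"})` would list the two remaining items, which can be pasted into the incident thread as a reminder.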
[Flowchart: incident management process across the IDENTIFICATION, HANDLING, and CLOSURE stages]
‣ Incident detected
‣ Send email (keep one thread)
‣ Action plan (five-minute huddle of the SWAT team in a war room)
‣ Create / Update JIRAs (contact OPS if necessary)
‣ Fix (first QA, then Live)
‣ Send email
‣ Decision: Resolved? (yes / no)
‣ Decision: Open > 3 days? (yes / no)
‣ Create JIRAs for fixing root cause or other related issues if possible
‣ Add / Update Incident record (via Incident Login form), shown twice in the flowchart
problem management process
PROBLEM DETECTION | ROOT CAUSE IDENTIFICATION | SOLUTION DEFINITION | PRIORITISATION | PROBLEM LOGGING | IMPLEMENTATION, CLOSURE
PROBLEM DETECTION
ACTIVITIES
‣ Define the problem
‣ Receive data regarding the problem from incident management
‣ Ensure the collected data is full and clear
‣ Define which teams or departments are affected
‣ Gather other data from the day of the incident
‣ Analyze symptoms
‣ Analyze the data collected from various sources relating to the major incident
‣ Analyze historical data to see whether such a problem has occurred before
DELIVERABLES
➨ Analyzed problem
➨ Updated incident record
ROOT CAUSE IDENTIFICATION
ACTIVITIES
Problem investigation and diagnosis (requires tech experts)
‣ Conduct root cause analysis, using various techniques if necessary:
• Make a sketch
• Draw an Ishikawa (fishbone) diagram
• Kepner-Tregoe
• Flow diagrams
• etc.
‣ Determine workarounds
‣ Think of potential solutions
‣ Assess the problem and the recommended actions to resolve it
DELIVERABLES
➨ Updated problem record
➨ Root cause detected
➨ Workaround(s) identified
SOLUTION DEFINITION
ACTIVITIES
‣ Identify the team for solution development
‣ Determine possible resolutions
‣ Choose the best approach
‣ Make sure the solution can effectively prevent recurrence
DELIVERABLES
➨ Updated problem record
➨ Other tasks in JIRA
➨ Updated incident records
➨ Defined resources that are necessary for implementation
PRIORITISATION
ACTIVITIES
‣ Identify the urgency and impact of this task
‣ Define a priority in the Problem management queue
‣ Identify who is responsible for the implementation
‣ Decide how this problem should be prioritized among other tasks of the team
DELIVERABLES
➨ The task(s) has a priority
➨ The team leads are aware of the task and can plan it in their sprints
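Urgency and impact are often combined into one priority value so that the problem queue can be sorted. The sketch below is one possible scheme, not the one used by the presenter: the three-level `Level` scale, the `priority` formula, and the sample queue entries are all assumptions made for illustration.

```python
from enum import IntEnum


class Level(IntEnum):
    """Assumed three-step scale for both urgency and impact."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


def priority(urgency: Level, impact: Level) -> int:
    """Map urgency and impact to a priority from 1 (highest) to 5 (lowest).

    Illustrative rule: the higher the combined level, the smaller the number.
    """
    return 7 - (urgency + impact)


# Invented queue entries: (problem id, urgency, impact).
queue = [
    ("P-1", Level.LOW, Level.HIGH),
    ("P-2", Level.HIGH, Level.HIGH),
    ("P-3", Level.LOW, Level.LOW),
]
ordered = sorted(queue, key=lambda item: priority(item[1], item[2]))
```

Sorting by the computed value puts the most urgent, highest-impact problems at the front of the queue, which is the order team leads would plan them into sprints.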
PROBLEM LOGGING
ACTIVITIES
‣ Create a new JIRA record or update the old one:
• Unique ID, timestamp
• Name of submitter
• Links to associated problem records (with hierarchy if applicable)
• Links to associated incident records
• Problem description
• Problem category
• Status
• Severity and Impact
• Responsible person, team
• Affected game
• Associated JIRA records
• History of all taken actions
• Workaround
• Permanent solution (if already known)
DELIVERABLES
➨ Created/updated problem record
➨ Analyzed and updated incident data
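The field list above maps naturally onto a small record type. The following sketch mirrors those fields in a Python dataclass; the class name `ProblemRecord`, the attribute names, the string types, and the `"open"` default status are assumptions, and the real record lives in JIRA rather than in code.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class ProblemRecord:
    """Illustrative mirror of the problem record fields (names are assumed)."""
    record_id: str
    submitter: str
    description: str
    category: str
    affected_game: str
    severity: str
    impact: str
    responsible_person: str
    responsible_team: str
    status: str = "open"  # assumed default, not taken from the slides
    created_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )
    linked_problem_ids: List[str] = field(default_factory=list)
    linked_incident_ids: List[str] = field(default_factory=list)
    jira_keys: List[str] = field(default_factory=list)
    history: List[str] = field(default_factory=list)
    workaround: Optional[str] = None
    permanent_solution: Optional[str] = None

    def log_action(self, note: str) -> None:
        """Append an entry to the history of all taken actions."""
        self.history.append(note)
```

Keeping the workaround and the permanent solution as separate optional fields reflects the slide's distinction: a workaround may be known long before a permanent fix is.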
IMPLEMENTATION, CLOSURE
ACTIVITIES
‣ Conduct activities to implement the fix to the problem
‣ Verify that the solution is appropriate and close the problem record
‣ Submit a record to the Error Knowledge Base if applicable
‣ Share lessons learned via email if reasonable
‣ Ensure that all the associated incidents are closed with a proper fix or resolution
DELIVERABLES
➨ Updated incident record
➨ Updated problem record
➨ Updated Known Errors Knowledge Base spreadsheet
➨ Lessons learned shared
➨ Report is sent
[Flowchart: problem management process]
DETECTION & LOGGING
‣ Choose the problem area
‣ Analyze related incident data (symptoms, relations, historical data)
‣ Request missing data (symptoms, relations, historical data)
‣ Create new Problem Record / update existing (JIRA)
ROOT CAUSE IDENTIFICATION & SOLUTION DEFINITION
‣ Identify root cause, workarounds
‣ Determine work for identified solutions (and choose implementation team)
‣ Prioritize
IMPLEMENTATION & CLOSURE
‣ Implement (by defined implementation team)
‣ Close (update knowledge base, submit lessons learned, send email)
Record updates shown in the flowchart: Incident record(s), Problem record, Known Errors
problem management process simplified
Prepare Meeting
Who: Problem Manager
Process summary:
- ensure the quality of the incident spreadsheet
- select follow-ups
- prefill the problem management spreadsheet
Efforts: 3-5 mh
When: no later than 3 days before the meeting
Run Meeting
Chairperson: Problem Manager
Participants: PMs, APs, OPS representative
Frequency: monthly
Activities:
- identify problems
- prioritize & agree on actions
- define responsible teams
When: at the end of the month. In case of holidays or an emergency, moved to the next working day.
Outcomes: Assigned tasks
Take Actions
Who: a particular person is responsible for every problem, as defined in the meeting
Process summary:
- implement
- verify
- update all records
Outcomes: Updated incident and problem records
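The scheduling rule for the monthly meeting can be worked out mechanically. This sketch makes several assumptions that the slides do not state: "end of the month" is read as the last calendar day, Saturday and Sunday count as non-working days, holidays are passed in by the caller, and the "3 days before" preparation deadline is counted in calendar days. All function names are hypothetical.

```python
import calendar
from datetime import date, timedelta


def next_working_day(day, holidays=frozenset()):
    """Return `day` itself, or the first later weekday not in `holidays`."""
    while day.weekday() >= 5 or day in holidays:
        day += timedelta(days=1)
    return day


def meeting_date(year, month, holidays=frozenset()):
    """Last calendar day of the month, moved forward if it is not a working day."""
    last_day = date(year, month, calendar.monthrange(year, month)[1])
    return next_working_day(last_day, holidays)


def preparation_deadline(meeting):
    """Latest date for meeting preparation: 3 calendar days before the meeting."""
    return meeting - timedelta(days=3)
```

For example, 31 August 2024 falls on a Saturday, so under these assumptions `meeting_date(2024, 8)` moves the meeting to Monday, 2 September 2024.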