Site Reliability Engineering
Organizational & Operational Framework
(a.k.a. “Work Mode”)
May 2019
Olaf Reitmaier Veracierta
Agenda
Introduction
Weaknesses / Strengths
• Culture of Service
• Bi-Modality Awareness
• Ways of Working
• Organization
• Innovation
• Communication
• Planning
Changes
Trust & Visibility
Round Robin (Q&A)
Introduction
Purpose
• Clarify expectations from company leaders
• Align people first, then address technology issues
• Allow collaborators to share concerns (at the end, take notes)
• Propose a set of criteria used to follow-up teamwork
• Opportunity to be honest, positive and share ideas
• Put everyone on the same (new) page, no excuses
• Avoid repeating this information several times
Site Reliability Engineering
Vision
• Be recognized as an expert and reliable group of people that
offers best-in-class knowledge and support for the core IT
infrastructure.
Mission:
• Ensure the continuous and secure operation and support of the
network, compute, storage, backup, messaging, logging and
monitoring platforms of the core IT infrastructure enabling the
critical business applications and services.
Site Reliability Engineering
https://sre.google/books/
“Reliability”
Weaknesses
& Strengths
Strengths
KNOWLEDGE / PASSION
COMMITMENT
++ Assertiveness
++ Resolution
Weaknesses
ACCOUNTABILITY
(TRACEABILITY)
COMMUNICATION
(WORKFLOWS)
VERSATILITY
(STRATEGY)
What
When
Whom
Where
Why
Who
-- Scalability
Changes
The Company / Business Culture
Vicious Cycle: I would really like a different future, but
why change if everything is OK the way it is? The
(other) people have to change.
Micro
Macro
Safety
Circle
Blindness?
Start with You
Every:
• Business
• Customer
• Organization
• Employee
Has:
1. Opportunities
2. Values
3. Risks
4. Rules / Standards
• Everybody has a movie
• Criteria (dis-)alignment
Change
CULTURE
(OF SERVICE)
BI-MODAL
(VERSATILITY)
WAYS OF WORKING
(ME / YOU / ALL)
LEGACY
(HR / IT)
• Incremental
• Iterative
• Agreed
Change
|
Culture
of Service
User Experience => Every Interaction Counts
Customers ⬄ Team / Group / Member / Partner (Internal & External)
Service Delivery Orientation
Efficient
Kind
Inefficient
Unkind
Inefficient
Kind
Efficient
Unkind
1. Developers
2. DevOps
3. SysOps
Emotional
Intelligence
KEY FOR PEOPLE
RELATIONS
KEY FOR WORK
RELATIONS
TRIUNE BRAIN / SYSTEM 1 & 2
Conflict <=> Negotiation
Tension <=> Nothing
Conflict != Fight
Triune Brain
(Thinking/(Re-)Acting)
Over time, habits (good or bad) become a reptilian
bias; without noticing, you lose reasoning skills
and the ability to change at all.
System 1 and 2
(Thinking/(Re-)Acting)
Stimulation -> Time -> Reaction
“People gather with people
who make the work easier.”
Empathy
“Put yourself in the other person’s shoes”
“Ask for reciprocity about it”
Resilience
Self-Esteem / Empathy / Positivity
Goals / Challenges / Opportunities
Capacity to recover from, or thrive despite,
any kind of difficulty
in work and personal life
Difficulties
Change
|
Bi-Modality
Awareness
Code
Infrastructure
Sys/Ops Dev
Business
IT
Maintenance,
Support
IaC (Infrastructure as Code), Cloud, DevOps
Innovation
Mr. No Mr. Yes
Infrastructure vs. Code
Infrastructure                     | Code
Hardware / Software                | Software
Static / Immutable / Un-versioned  | Dynamic / Mutable / Versioned
Moore’s Law (Faster)               | Wirth’s Law (Slower)
Extrinsic Documentation            | Intrinsic Documentation
Administrator Driven (Indirect)    | User Driven (Direct)
Software Defined Anything
Virtualization → Cloud Native Solutions → Serverless
Still slow, really slow,
transition…
Changes
|
Way of Work
Changes
|
Way of Work
|
Organization
Technician vs.
Engineer vs.
Manager
Solves different problems
From technical to the business language
Doing, thinking or organizing people
(Rarely done in parallel by the same person)
From nothing to a book of knowledge
Managers
(a.k.a. Coaches)
Deliver:
• Help for team members
• Drive for the team to improve
• Support for the team in good and bad times
• Organization, planning and
auditing of work progress
Expect:
• Proactive progress reporting
• Active participation in solutions
• Answers to follow-up
questions
• The job to be done on time
Performance
Classification
Autonomy
Seniority
Delegation
Operation
Infrastructure
Code
Legacy
Documentation
Training
(Re-)usability
Auditability
Autonomy
(Freedom to innovate)
You work within a team
(not alone), in good & bad times
You implement and
support highly available
solutions (whenever
possible)
You foster automation
instead of manual tasks
(wherever possible)
You look for strategic approval
(especially for new projects /
scope changes)
Seniority
Years doing the same stuff → Changing on a daily basis
Just being different → Making a difference
Smart person → Smart team
A lot of Power → Responsibilities
Learning → Teaching (New senior engineers will need it!)
Delegation
Knowing what your
responsibilities are is important
We are here because
somebody let us make
mistakes
Having time for innovation
requires delegating to, and teaching,
someone else so the team can scale
Senior
Complex Tasks
Semi-Senior
Medium Tasks
Entry
Basic Tasks
2nd in Command (Backup)
RESPONSIBILITIES MATRIX
PARTNER OF YOUR CHOICE
(HIRE A NEW MEMBER?)
PROMOTES SHARING OF
CRITICAL DOCUMENTATION
AND KNOWLEDGE
Update Responsibility Matrix
Infrastructure:
• Network
• Compute
• Storage
• Management Systems
• Security/Identity Devices/Systems
• Backup System and Disaster Recovery
• Monitoring / Alerting (NPM, ITM, APM)
• Supporting Business Applications
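A responsibility matrix with a "2nd in command" per area can be kept as plain data and checked automatically. The following is only a minimal sketch: the area names come from the slide, while the engineer names and the `Ownership` structure are hypothetical, invented for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Ownership:
    primary: str                  # engineer mainly responsible for the area
    backup: Optional[str] = None  # "2nd in command"; None means no backup yet


# Hypothetical matrix: areas are from the slide, names are made up.
MATRIX = {
    "Network": Ownership("alice", "bob"),
    "Compute": Ownership("bob", "carol"),
    "Storage": Ownership("carol"),
    "Backup System and Disaster Recovery": Ownership("dave", "dave"),
}


def areas_without_backup(matrix: dict[str, Ownership]) -> list[str]:
    """Return areas whose backup is missing or is the same person as the primary."""
    return sorted(
        area
        for area, owner in matrix.items()
        if owner.backup is None or owner.backup == owner.primary
    )


print(areas_without_backup(MATRIX))
```

Running a check like this whenever the matrix is updated would flag areas that still depend on a single person.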
Documentation
(The Critical One)
Alive (always changing)
Challenging for everyone (for a lot of reasons)
Boring because it is not for you but for others
Needed for delegation to others (e.g. newcomers, or whoever
covers during vacation, sickness or resignation).
Important as companies
grow over time
Reduce tribal knowledge
Raise quality levels
Allow auditing
Documentation: Layered / Trimmed / Useful
Abstraction
Update -> Critical
Archive -> Obsolete
Focused on Training
Discoverable (Searchable)
KPI
Documentation
(Workshops)
Summary: purpose, technology/vendor websites,
external articles/references.
Architecture: high-level (at least) visual representation
of the platform or system (e.g. draw.io, dot/graphviz).
Assets: resource inventory labeling (not naming), links
(URLs) and credentials (e.g. Vault, SSO).
How-To’s: how to (re-)install and configure, with a focus
on critical and tricky in-house customizations.
Basic administration and troubleshooting: standard
procedures, known errors with a brief explanation of the solution,
references to articles, tickets and similar.
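The five-part layout above could be enforced with a small completeness check during a documentation workshop. This is a sketch under assumptions: it treats the first section (written "Resume" on the slide) as a summary, assumes each section starts a line with its title, and the `missing_sections` helper is hypothetical, not part of any wiki tooling.

```python
# Section titles taken from the slide; matching is by case-insensitive line prefix.
REQUIRED_SECTIONS = (
    "Summary",
    "Architecture",
    "Assets",
    "How-To",
    "Basic administration",
)


def missing_sections(page_text: str) -> list[str]:
    """Return the required section titles that do not start any line of the page."""
    lines = [line.strip().lower() for line in page_text.splitlines()]
    return [
        title
        for title in REQUIRED_SECTIONS
        if not any(line.startswith(title.lower()) for line in lines)
    ]


# Hypothetical wiki page used only to show the check.
sample_page = """Summary: central logging platform
Architecture: see attached diagram
How-To's: reinstall the log shipper
"""
print(missing_sections(sample_page))
```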
Changes
|
Way of Work
|
Innovation
Legacy
• There was, there is and there will be
• It’s a matter of time, but it is important
• It is neither bad nor good; it is just legacy
• Innovation is needed, but so is maintenance
99% of things are legacy from the moment they go live
Innovation
(& Investigation)
Must be focused
Value
(Customer)
Should be planned
(End/Start)
Trackable
(Timeframe)
Must be a process
(Success/Failure)
Measured
(Deliverable)
Changes
|
Way of Work
|
Communication
Communication
• Notify:
  • Customers about live changes, well in advance
  • Absences in advance (book and/or block your calendar)
  • Team members when you will arrive late at, or leave early from, the office
  • When you are working and applying changes on a weekend
• Avoid:
  • Making unplanned changes unrelated to live issues on “Fridays”
  • Implementing new features just before you go on vacation (freeze)
  • Mixing informal and formal communication (e.g. me/x-team, misunderstandings, rumors)
  • Overlapping vacations with your 2nd in command (a.k.a. your backup)
Estimated Time Ahead (ETA)
BROADCAST
INFORMATION
START, PROGRESS,
END, EVIDENCE
REPORT
CONTINUOUSLY
DON’T WAIT TO
BE FOLLOWED UP
ANSWER
QUESTIONS
TO MAKE
FOLLOW-UP EASY
APPLIES TO
EVERYTHING:
INCIDENTS,
PROBLEMS,
REQUIREMENTS.
Communication
(Too many
channels?)
@Email
• Formal communications and calendaring
• Requests made in chats will be converted to tickets (as needed)
#Slack
• Live issues and news
• Daily/Weekly Slack-Up trial (instead of the Daily Stand-Up)
JIRA (Tickets/Wiki)
• Kanban board: tasks/requirements (internal/external)
• Wiki: for publishing critical infrastructure information
Phone Calls / WhatsApp
• Only for emergencies
Mute Generation / Work Comm. Channel
Favorite Comm.
Channel (IM)
Daily Comm.
Channel (IM)
Family Comm.
Channel (IM)
So let’s Slack-Up!
• (Trial) Broadcast to a Slack channel instead of a verbal stand-up.
• Daily/weekly, at the start of the day before 10 A.M.
• Reminder of achievements/contributions to OKRs:
  • Maintenance
  • Innovation & Optimization
  • Support (only if not already a JIRA ticket)
  • Lowest, Low, Medium, High, Highest.
  • Undesired/unplanned requests (e.g. C-level urgencies, last-minute requests)
• Helps you easily track your progress and risks (red flags).
• Promotes team awareness of issues and progress.
What does it look like? https://slack.com/intl/en-de/slack-tips/run-daily-standups-or-check-ins
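To make the Slack-Up format concrete, here is a minimal sketch of how one update could be assembled as text, grouped by the categories listed above. The `SlackUp` class, the "Unplanned" label and the sample entries are hypothetical; actually posting the text to a channel is deliberately left out.

```python
from dataclasses import dataclass, field

# Categories from the slide; "Unplanned" stands for undesired/unplanned requests.
CATEGORIES = ("Maintenance", "Innovation & Optimization", "Support", "Unplanned")


@dataclass
class SlackUp:
    author: str
    items: dict[str, list[str]] = field(default_factory=dict)
    red_flags: list[str] = field(default_factory=list)

    def add(self, category: str, text: str) -> None:
        """Record one achievement or contribution under a known category."""
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        self.items.setdefault(category, []).append(text)

    def render(self) -> str:
        """Build the message text, ordered by category, with red flags last."""
        lines = [f"Slack-Up from {self.author}"]
        for category in CATEGORIES:
            for text in self.items.get(category, []):
                lines.append(f"- [{category}] {text}")
        for flag in self.red_flags:
            lines.append(f"- [Red flag] {flag}")
        return "\n".join(lines)


# Hypothetical update, only to show the resulting layout.
update = SlackUp("alice")
update.add("Support", "Restored VPN access for the sales team")
update.add("Maintenance", "Patched the backup servers")
update.red_flags.append("Storage array at 85% capacity")
print(update.render())
```

Keeping the category order fixed makes updates from different people easy to scan side by side.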
Changes
|
Way of Work
|
Planning
Planning
Objectives and
Key Results (OKRs)
Planning
(OKR)
Maintenance
Projects (30%)
Maintenance plans
Preventive monitoring
Optimization
& Innovation
Projects (20%)
Cost-efficiency driven plans
Security oriented driven plans
New technologies / features
try-out / deployment plans
Support
Projects (50%)
New requirements
Reactive incident response
• Known errors handling
• Problems troubleshooting
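The 30% / 20% / 50% split can be turned into a rough time budget. The sketch below assumes a 40-hour week purely for illustration; the `weekly_hours` helper is hypothetical and the percentages are the ones on the slide.

```python
# Percentages per OKR project type, as stated on the slide.
OKR_SPLIT = {
    "Maintenance": 30,
    "Optimization & Innovation": 20,
    "Support": 50,
}


def weekly_hours(total_hours: float, split: dict[str, int] = OKR_SPLIT) -> dict[str, float]:
    """Distribute the available hours across project types by percentage."""
    if sum(split.values()) != 100:
        raise ValueError("percentages must add up to 100")
    return {name: total_hours * pct / 100 for name, pct in split.items()}


# Assumption: a 40-hour working week.
print(weekly_hours(40))
```

With these numbers, half of the week goes to support, which leaves eight hours for optimization and innovation.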
Initiatives: pick a selection to improve operations?
URGENT / NOT URGENT
IMPORTANT
DO
• Live issue
• Slack until fixed
• Write the post-mortem
documentation
• OKRs
• Slack-Up w/Manager
DECIDE
• Non-live issue
• Ask for / open a ticket
• Slack-Up w/Manager
• Clarify expectations with third
parties (e.g. ETA, due date)
• Move to Initiatives
NOT IMPORTANT
DELEGATE
• Teach them to fish
• Assign to a less senior member
• 2nd in Cmd. (Backup)?
• Manager
DELETE
• Don’t confuse it with
Innovation in the OKRs
• Stop thinking about it
Individual Planning / Report-Up
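The urgent/important grid above is the usual Eisenhower-style matrix, and its decision rule is small enough to write down. This is a minimal sketch; the `Action` enum and the `triage` function are illustrative names, not part of any team tooling.

```python
from enum import Enum


class Action(Enum):
    DO = "do"
    DECIDE = "decide"
    DELEGATE = "delegate"
    DELETE = "delete"


def triage(urgent: bool, important: bool) -> Action:
    """Map a task to its quadrant: important tasks are done or scheduled,
    unimportant ones are delegated or dropped."""
    if important:
        return Action.DO if urgent else Action.DECIDE
    return Action.DELEGATE if urgent else Action.DELETE


# A live issue is both urgent and important.
print(triage(urgent=True, important=True).name)
```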
Estimated Time Ahead (ETA)
Never Ending Story
Unexpected
Additions/Drops
Important
for critical
infrastructure
A challenge for most
tech teams
Different for
innovation,
maintenance
and support
IT Incident Response Plan (Overview)
Alerts: Person, Call, E-Mail,
#Slack, SMS, without
standard classification
(severity).
Follow-up on the Slack
Operational (OPS) channel
(chat).
Directly Responsible Individual
(DRI) takes care of his/her
alerts.
Meeting with everyone involved in
the Situation Room.
SLA defined in terms of
maximum downtime in hours
per system / application.
Spiral Escalation to Eng. Mgr.,
Head/Director, CTO, CEO.
<45 minutes
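An SLA expressed as maximum downtime in hours can be converted to an availability percentage, and back. The sketch below assumes a 30-day (720-hour) measurement period, which the slide does not specify; both helper functions are hypothetical.

```python
HOURS_PER_PERIOD = 30 * 24  # assumption: SLA measured over a 30-day month


def availability(max_downtime_hours: float, period_hours: float = HOURS_PER_PERIOD) -> float:
    """Availability (0..1) implied by a maximum downtime within the period."""
    if not 0 <= max_downtime_hours <= period_hours:
        raise ValueError("downtime must be between 0 and the period length")
    return 1 - max_downtime_hours / period_hours


def allowed_downtime_hours(target: float, period_hours: float = HOURS_PER_PERIOD) -> float:
    """Maximum downtime in hours that still meets an availability target (0..1)."""
    if not 0 <= target <= 1:
        raise ValueError("target must be between 0 and 1")
    return period_hours * (1 - target)


# Example: a system allowed 4 hours of downtime per 30-day month.
print(f"{availability(4):.3%}")
print(f"{allowed_downtime_hours(0.999):.2f} hours")
```

Stating both figures per system or application makes it easier to compare SLAs that were written in different units.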
IT Incident Response Insights
24/7 Emergency Handling - Contacts
Technology Team Cloud Support - Contacts
On-Premise Datacenter Emergency - Contacts
Formal Mailing Lists for the Business / Technology Teams
Report +
Escalate
Pro vs.
Re-Act
Account
+ Adjust
Incident Response Plan - Responsibilities
DevOps & Site Reliability Teams – Responsibilities
Physical / IT Security & Legal Teams - Responsibilities
(Local) IT Support Teams - Responsibilities
Calls, Chats, Alerts, Tickets
Incident / Problems / Changes (ITIL’s way)
Stability & Post-Mortem Meetings (Agile way)
SLI
(Observe)
SLO
(Oversee)
SLA
(Own)
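The relationship between the three terms can be illustrated numerically: an SLI is what is measured, an SLO is the internal target for it, and the SLA is the commitment made to customers. The sketch below uses the common request-based SLI and the error-budget idea from general SRE practice rather than from this deck; the function names and sample numbers are assumptions.

```python
def sli(good_events: int, total_events: int) -> float:
    """Service level indicator: fraction of events that were good."""
    if total_events <= 0 or not 0 <= good_events <= total_events:
        raise ValueError("need 0 <= good_events <= total_events and total_events > 0")
    return good_events / total_events


def error_budget_remaining(sli_value: float, slo_target: float) -> float:
    """Fraction of the error budget left: 1.0 untouched, 0.0 spent, negative overspent."""
    budget = 1 - slo_target
    if budget <= 0:
        raise ValueError("SLO target must be below 1")
    return 1 - (1 - sli_value) / budget


# Hypothetical numbers: 9,995 good requests out of 10,000 against a 99.9% SLO.
measured = sli(9_995, 10_000)
print(f"SLI = {measured:.4f}, budget left = {error_budget_remaining(measured, 0.999):.0%}")
```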
Progress (Follow-Up/Report-Up)
Tickets New/Completed (JIRA -> Slack channel)
Daily/Weekly Slack-Up (You -> Slack channel)
Individual Daily Pin Pointing (On Your Desk)
Bi-Weekly One-To-One (Room / Walk / Lunch)
Monthly Retrospective (Last Friday of the Month Afternoon)
Trust &
Visibility
Trust &
Visibility
Trust ring mitigates business hijacking
Critical credentials and access levels should be
shared (e.g. via OneLogin) with key team
members and C-Level (break-glass access)
Access should cover all infrastructure assets,
platforms and systems
Alerts from monitoring tools and central consoles
for infrastructure must be broadcast to
communication channels (e.g. Slack, e-mail)
