SRE Organizational Framework

Site Reliability Engineering
Organizational & Operational Framework
(a.k.a. “Work Mode”)
May 2019
Olaf Reitmaier Veracierta

Agenda
Introduction
Weaknesses / Strengths
• Culture of Service
• Bi-Modality Awareness
• Ways of Working
• Organization
• Innovation
• Communication
• Planning
Changes
Trust & Visibility
Round Robin (Q&A)

Purpose
• Clarify expectations from company leaders
• Align all the people, then address technology issues
• Allow collaborators to share concerns (at the end, take notes)
• Propose a set of criteria used to follow up on teamwork
• Opportunity to be honest, positive and share ideas
• Put everyone on the same (new) page, no excuses
• Avoid repeating this information several times

Vision
• Be recognized as an expert and reliable group of people that offers best-in-class knowledge and support for the core IT infrastructure.
Mission
• Ensure the continuous and secure operation and support of the network, compute, storage, backup, messaging, logging and monitoring platforms of the core IT infrastructure, enabling the critical business applications and services.

https://sre.google/books/
“Reliability”

Strengths
KNOWLEDGE PASSIONATE
COMPROMISE
++ Assertiveness
++ Resolution

Weaknesses
ACCOUNTABILITY
(TRACEABILITY)
COMMUNICATION
(WORKFLOWS)
VERSATILITY
(STRATEGY)
What
When
Whom
Where
Why
Who
-- Scalability

The Company / Business Culture
Vicious Cycle: I would really like a different future, but why change if everything is OK the way it is? The (other) people have to change.
Micro
Macro
Safety
Circle
Blindness?
Start with You

Every:
• Business
• Customer
• Organization
• Employee
Has:
1. Opportunities
2. Values
3. Risks
4. Rules / Standards
• Everybody has a movie
• Criteria (dis-)alignment

Change
CULTURE
(OF SERVICE)
BI-MODAL
(VERSATILITY)
WAYS OF WORKING
(ME / YOU / ALL)
LEGACY
(HR / IT)
• Incremental
• Iterative
• Agreed

User Experience => Every Interaction Counts
Customers ⬄ Team / Group / Member / Partner (Internal & External)

Service Delivery Orientation
Efficient / Kind
Inefficient / Unkind
Inefficient / Kind
Efficient / Unkind
1. Developers
2. DevOps
3. SysOps

Emotional
Intelligence
KEY FOR PEOPLE
RELATIONS
KEY FOR WORK
RELATIONS
TRIUNE BRAIN / SYSTEM 1 AND 2
Conflict <=> Negotiation
Tension <=> Nothing
Conflict != Fight

Triune Brain
(Thinking/(Re-)Acting)
With time, habits (good/bad) become a reptilian bias; without noticing, you lose reasoning skills and the ability to change at all.

System 1 and 2
(Thinking/(Re-)Acting)
Stimulation -> Time -> Reaction
“People gather together with people
that make the work easier”.

Empathy
“Put yourself in the other person’s shoes”
“Ask for reciprocity about it”

Resilience
Self-Esteem / Empathy / Positivity
Goals / Challenges / Opportunities
Capacity to recover or thrive
from any kind of difficulties
in work and personal life
Difficulties

Change
|
Bi-Modality
Awareness

Code
Infrastructure
Sys/Ops Dev
Business
IT
Maintenance,
Support
IaaCode, Cloud, DevOps
Innovation
Mr. No Mr. Yes

Infrastructure vs. Code
Infrastructure | Code
Hardware / Software | Software
Static / Immutable / Un-versioned | Dynamic / Mutable / Versioned
Moore’s Law (Faster) | Wirth’s Law (Slower)
Extrinsic Documentation | Intrinsic Documentation
Administrator Driven (Indirect) | User Driven (Direct)
Software Defined Anything
Virtualization → Cloud Native Solutions → Serverless
Still slow, really slow, transition…

Changes
|
Way of Work
|
Organization

Technician vs.
Engineer vs.
Manager
Solve different problems
From technical to the business language
Doing, thinking or organizing people
(Hardly done in parallel by the same person)
From nothing to a book of knowledge

Managers
(a.k.a. Coaches)
Deliver:
• Help for team members
• Drive for the team to improve
• Support for the team in good and bad times
• Organization, planning and auditing of work progress
Expect:
• Proactive progress reporting
• Active participation in solutions
• Answers to follow-up questions
• The job to be done on time

Performance
Classification
Autonomy
Seniority
Delegation
Operation
Infrastructure
Code
Legacy
Documentation
Training
(Re-)usability
Auditability

Autonomy
(Freedom to innovate)
You work within a team (not alone), for good and bad
You implement and support highly available solutions (whenever possible)
You foster automation instead of manual tasks (wherever possible)
You look for strategic approval (especially for new projects / scope changes)

Seniority
Years doing same stuff → Changing on a daily basis
Just being different → Making the difference
Smart person → Smart team
A lot of Power → Responsibilities
Learning → Teaching (New senior engineers will need it!)

Delegation
Knowing what your responsibilities are is important
We are here because somebody let us make mistakes
Having time for innovation requires delegating to and teaching someone else, so that you can scale
Senior
Complex Tasks
Semi-Senior
Medium Tasks
Entry
Basic Tasks

2nd in Command (Backup)
RESPONSIBILITIES MATRIX
PARTNER OF YOUR CHOICE (HIRE A NEW MEMBER?)
PROMOTES SHARING OF CRITICAL DOCUMENTATION AND KNOWLEDGE

Update Responsibility Matrix
Infrastructure:
• Network
• Compute
• Storage
• Management Systems
• Security/Identity Devices/Systems
• Backup System and Disaster Recovery
• Monitoring / Alerting (NPM, ITM, APM)
• Supporting Business Applications
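One possible way to keep such a matrix checkable is to store it as data. The sketch below is only an illustration: the `Assignment` type, the `find_gaps` helper and every person name are hypothetical, and only the area names come from the list above. It flags areas where the 2nd in command is missing or is the same person as the primary.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Assignment:
    primary: str  # person accountable for the area
    backup: str   # 2nd in command for the same area


# Area names come from the slide; the people are placeholders.
MATRIX = {
    "Network": Assignment("alice", "bob"),
    "Compute": Assignment("bob", "carol"),
    "Storage": Assignment("carol", "alice"),
    "Backup System and Disaster Recovery": Assignment("dave", "dave"),
}


def find_gaps(matrix):
    """Return areas whose backup is missing or identical to the primary."""
    return sorted(
        area
        for area, assignment in matrix.items()
        if not assignment.backup or assignment.backup == assignment.primary
    )


print(find_gaps(MATRIX))  # ['Backup System and Disaster Recovery']
```

Running a check like this whenever the matrix is updated would surface areas that still depend on a single person.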

Documentation
(The Critical One)
Alive (always changing)
Challenging for everyone (for a lot of reasons)
Boring because it is not for you but for others
Needed for delegation to others (e.g. newcomers, or whoever covers during vacation, sickness or resignation).
Important as companies grow over time
Reduces tribal knowledge
Raises quality levels
Allows auditing

Documentation: Layered / Trimmed / Useful
Abstraction
Update -> Critical
Archive -> Obsolete
Focused on Training
Discoverable (Searchable)
KPI

Documentation
(Workshops)
Summary: purpose, technology/vendor websites, external articles/references.
Architecture: (at least) a high-level visual representation of the platform or system (e.g. draw.io, dot/graphviz).
Assets: resource inventory labeling (not naming), links (URLs) and credentials (e.g. Vault, SSO).
How-Tos: how to (re-)install and configure, with a focus on critical and tricky in-house customizations.
Basic administration and troubleshooting: standard procedures, known errors with a brief explanation of the solution, references to articles/tickets and similar.

Changes
|
Way of Work
|
Innovation

Legacy
• There was, there is and there will be
• It’s a matter of time, but it is important
• It is neither bad nor good; it is just legacy
• Innovation is needed, but so is maintenance
99% of things are legacy from the moment they go live

Innovation
(& Investigation)
Must be focused
Value
(Customer)
Should be planned
(End/Start)
Trackable
(Timeframe)
Must be a process
(Success/Failure)
Measured
(Deliverable)

Changes
|
Way of Work
|
Communication

Communication
• Notify:
  • Customers about live changes, well in advance
  • Absences in advance (book and/or block your calendar)
  • Team members when you arrive late at or leave early from the office
  • When you are working and applying changes on a weekend
• Avoid:
  • Unplanned changes not related to live issues on “Fridays”
  • Implementing new features just before your vacation (freeze)
  • Mixing informal and formal communication (e.g. me/x-team, misunderstandings, rumors)
  • Overlapping vacations with your 2nd in command (a.k.a. my backup)
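The two scheduling rules in the "Avoid" list (no unplanned Friday changes, a freeze before vacation) could be checked mechanically. This is a minimal sketch under stated assumptions: the function name `change_concerns` is hypothetical, the two-day freeze length is an invented placeholder because the slide gives no number, and treating live issues as exempt from both rules is an interpretation.

```python
from datetime import date, timedelta

# Hypothetical value: the slide does not say how long the freeze should be.
FREEZE_DAYS_BEFORE_VACATION = 2


def change_concerns(change_day, is_live_issue, vacation_start=None):
    """Return reasons to reconsider a change planned for change_day."""
    concerns = []
    if is_live_issue:
        # Assumption: fixes for live issues are exempt from both rules.
        return concerns
    if change_day.weekday() == 4:  # Monday is 0, so 4 is Friday
        concerns.append("unplanned change on a Friday")
    if vacation_start is not None:
        freeze_start = vacation_start - timedelta(days=FREEZE_DAYS_BEFORE_VACATION)
        if freeze_start <= change_day < vacation_start:
            concerns.append("too close to your vacation (freeze)")
    return concerns


# 3 May 2019 was a Friday.
print(change_concerns(date(2019, 5, 3), is_live_issue=False))
```

An empty list means neither rule objects; the function only advises and does not block anything.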

Estimated Time Ahead (ETA)
BROADCAST INFORMATION: START, PROGRESS, END, EVIDENCE
REPORT CONTINUOUSLY: DON’T WAIT TO BE FOLLOWED UP
ANSWER QUESTIONS: TO MAKE FOLLOW-UP EASY TO CLARIFY
APPLY TO EVERYTHING: INCIDENTS, PROBLEMS, REQUIREMENTS.

Communication
(Too many channels?)
@Email
• Many formal communications and calendaring
• Requests made in chats will be converted to tickets (as needed)
#Slack
• Live Issues and News
• Daily/Weekly Slack-Up Trial (instead of Daily Stand-Up)
JIRA (Tickets/Wiki)
• Kanban Board: Tasks/Requirements (Internal/External)
• Wiki: For publishing critical infrastructure information
Phone Calls / WhatsApp
• Only for emergencies

Mute Generation / Work Comm. Channel
Favorite Comm.
Channel (IM)
Daily Comm.
Channel (IM)
Family Comm.
Channel (IM)

So let’s Slack-Up!
• (Trial) Broadcast to a Slack channel instead of a verbal stand-up.
• On a daily/weekly basis, at the start of the day, before 10 A.M.
• Reminder of achievements/contributions to OKRs:
  • Maintenance
  • Innovation & Optimization
  • Support (only if not already a JIRA ticket)
  • Lowest, Low, Medium, High, Highest.
  • Undesired/unplanned requests (e.g. C-Level urgencies, last-minute requests)
• Helps to easily track your progress and risks (red flags).
• Promotes team awareness of issues and progress.
What does it look like? https://slack.com/intl/en-de/slack-tips/run-daily-standups-or-check-ins
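As an illustration of what a Slack-Up entry might contain, the sketch below formats a plain-text check-in grouped by the categories listed above. The function name `build_slack_up`, the message layout and the sample entries are assumptions, not something the slides prescribe, and the text is only built here, not posted to Slack.

```python
from collections import defaultdict

# Categories taken from the slide; "Unplanned" abbreviates
# "Undesired/Unplanned requests".
CATEGORIES = ["Maintenance", "Innovation & Optimization", "Support", "Unplanned"]


def build_slack_up(author, entries):
    """Format (category, text) pairs as a plain-text check-in message."""
    grouped = defaultdict(list)
    for category, text in entries:
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        grouped[category].append(text)

    lines = [f"Slack-Up from {author}"]
    for category in CATEGORIES:  # fixed order keeps messages comparable
        for text in grouped[category]:
            lines.append(f"[{category}] {text}")
    return "\n".join(lines)


print(build_slack_up("olaf", [
    ("Support", "Restored a mailbox for finance"),
    ("Maintenance", "Patched the backup servers"),
]))
```

The resulting text can be pasted into the team channel before 10 A.M.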

Changes
|
Way of Work
|
Planning

Planning
Objectives and
Key Results (OKRs)

Planning
(OKR)
Maintenance Projects (30%)
• Maintenance plans
• Preventive monitoring
Optimization & Innovation Projects (20%)
• Cost-efficiency driven plans
• Security-driven plans
• New technologies / features try-out / deployment plans
Support Projects (50%)
• New requirements
• Reactive incident response
  • Known errors handling
  • Problems troubleshooting
Initiatives: pick a selection to improve operations?

Individual Planning / Report-Up
URGENT and IMPORTANT: DO
• Live Issue
• Slack until fixed
• Do the post-mortem documentation
• OKRs
• Slack-Up w/Manager
NOT URGENT and IMPORTANT: DECIDE
• Non-Live Issue
• Ask for / open a ticket
• Slack-Up w/Manager
• Clarify expectations with third parties (e.g. ETA, deadline)
• Move to Initiatives
URGENT and NOT IMPORTANT: DELEGATE
• Teach how to fish
• Assign to a less senior member
• 2nd in Cmd. (Backup)?
• Manager
NOT URGENT and NOT IMPORTANT: DELETE
• Don’t confuse it with Innovation in the OKRs
• Stop thinking about it
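The urgent/important matrix reduces to a two-question decision. A minimal sketch, where the function name `eisenhower` is hypothetical and the four actions are the ones named in the matrix:

```python
def eisenhower(urgent, important):
    """Map the two yes/no questions to the action named in the matrix."""
    if important:
        return "DO" if urgent else "DECIDE"
    return "DELEGATE" if urgent else "DELETE"


# A live issue is both urgent and important.
print(eisenhower(urgent=True, important=True))    # DO
# A last-minute request that is urgent but not important.
print(eisenhower(urgent=True, important=False))   # DELEGATE
```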

Estimated Time Ahead (ETA)
Never Ending Story
Unexpected Additions/Drops
Important for critical infrastructure
A challenge for most tech teams
Different for innovation, maintenance and support

IT Incident Response Plan (Overview)
Alerts: Person, Call, E-Mail, #Slack, SMS, without standard classification (severity).
Follow-up on the Slack Operational (OPS) channel (chat).
Directly Responsible Individual (DRI) takes care of his/her alerts.
Meeting of everyone involved in the Situation Room.
SLA defined in terms of maximum downtime in hours per system / application.
Spiral escalation to Eng. Mgr., Head/Director, CTO, CEO.
<45 minutes
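The escalation order and the downtime-based SLA can be sketched as two small helpers. Both function names are hypothetical, and `step_minutes` (how long each level has before the next one is involved) is an assumed parameter: the slides do not define it, and the "<45 minutes" figure above is deliberately not interpreted here.

```python
# The DRI comes first, followed by the escalation order from the slide.
ESCALATION_CHAIN = ["DRI", "Eng. Mgr.", "Head/Director", "CTO", "CEO"]


def escalation_level(elapsed_minutes, step_minutes):
    """Return who should be involved after elapsed_minutes without resolution."""
    if step_minutes <= 0:
        raise ValueError("step_minutes must be positive")
    index = min(int(elapsed_minutes // step_minutes), len(ESCALATION_CHAIN) - 1)
    return ESCALATION_CHAIN[index]


def sla_breached(downtime_hours, max_downtime_hours):
    """True when accumulated downtime exceeds the SLA maximum for a system."""
    return downtime_hours > max_downtime_hours


# Assumed 15-minute steps, for illustration only.
print(escalation_level(20, step_minutes=15))   # Eng. Mgr.
print(sla_breached(5.0, max_downtime_hours=4))  # True
```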

IT Incident Response Insights
24/7 Emergency Handling - Contacts
Technology Team Cloud Support - Contacts
On-Premise Datacenter Emergency - Contacts
Formal Mailing Lists for the Business / Technology Teams
Report +
Escalate
Pro vs.
Re-Act
Account
+ Adjust
Incident Response Plan - Responsibilities
DevOps & Site Reliability Teams – Responsibilities
Physical / IT Security & Legal Teams - Responsibilities
(Local) IT Support Teams - Responsibilities
Calls, Chats, Alerts, Tickets
Incident / Problems / Changes (ITIL’s way)
Stability & Post-Mortem Meetings (Agile way)
SLI
(Observe)
SLO
(Oversee)
SLA
(Own)

Progress (Follow-Up/Report-Up)
Tickets New/Completed (JIRA -> slack channel)
Daily/Weekly Slack-Up (You -> slack channel)
Individual Daily Pin Pointing (On Your Desk)
Bi-Weekly One-To-One (Room / Walk / Lunch)
Monthly Retrospective (Last Friday of the Month Afternoon)

Trust &
Visibility
Trust ring mitigates business hijacking
Critical credentials and access levels should be shared (e.g. OneLogin) with key team members and C-Level (break-glass access)
Access should cover all infrastructure assets, platforms and systems
Alerts from monitoring tools and central consoles for infrastructure must be broadcast in communication channels (e.g. Slack, e-mail)
