SRE (service reliability engineer) on big DevOps platform running on the cloud by Pavlo Serdiuk at Cloud focused 76th DevClub.lv

DEVCLUB.LV 20/06/2019
SRE
(service reliability engineer)
on big DevOps
platform
running on the
cloud
Copyright © 2019 Accenture. All rights reserved.

“RELIABILITY IS THE MOST
IMPORTANT FEATURE OF ANY
APPLICATION
BUT IS OFTEN THE LEAST
WELL DEFINED.
WHY ARE WE HERE?
WE NEED TO USE SRE TO HELP
OUR CLIENTS TO CHANGE
THAT AND KEEP THEIR
FUTURES BRIGHT!”

https://cloudplatform.googleblog.com/2018/05/SRE-vs-DevOps-competing-standards-or-close-friends.html
SITE RELIABILITY ENGINEERING (SRE)
• Proclaimed by Google as how they
do IT Operations
• Invented by them in 2003
• First book published in 2016 (30
essays)
• Read for free online:
https://landing.google.com/sre/book/index.html
• Workbook much more applied

As per Wikipedia, SRE can be defined as:
“a discipline that incorporates aspects
of software engineering and applies
that to operations whose goals are to
create ultra-scalable and highly
reliable software systems.”
DEFINITION FROM GOOGLE
SO IT’S A DISCIPLINE

SRE: WHEN
OPERATIONS IS
DESIGNED
BY SOFTWARE
ENGINEERS
Modern Product Development requires more
functionality introduced more frequently,
creating more complexity and more support
activities.
Site reliability engineering (SRE) is part of the
solution: a discipline that incorporates aspects of
software engineering and applies that to
operations whose goals are to create ultra-
scalable and highly reliable software systems.
Applying principles of computer
science and engineering to the design
and development of highly available
systems
Proactively finding ways to make
systems more scalable, reliable, and
efficient until systems reach “desired
reliability targets”
Spanning a broad portfolio of
software (applications, databases,
cloud services) and hardware
(network, data-center) assets
SREs are engineers ... focused on reliability ... while operating services

“THERE IS NO SUCH THING AS A NEW IDEA. IT
IS IMPOSSIBLE. WE SIMPLY TAKE A LOT OF
OLD IDEAS AND PUT THEM INTO A SORT OF
MENTAL KALEIDOSCOPE. WE GIVE THEM A
TURN AND THEY MAKE NEW AND CURIOUS
COMBINATIONS.
WE KEEP ON TURNING AND MAKING NEW
COMBINATIONS INDEFINITELY; BUT THEY
ARE THE SAME OLD PIECES OF COLORED
GLASS THAT HAVE BEEN IN USE THROUGH
ALL THE AGES.”
INSIGHT FROM
MARK TWAIN

MEASURING UP
against other movements
DevOps
shouldn’t be like
DevOps

SRE versus DEVOPS
Dev | Ops (wall of confusion)
DevOps
In practice typically leads to:
• CI/CD pipelines
• Infra-code at least for test environments
• DevOps Team
• Better quality engineering (hopefully)
SRE
In practice will hopefully lead to:
• Higher reliability after code deploy
• Better operability
• Better life for Ops team
• The right balance of speed vs safety
But… does DevOps always implement all of DevOps in reality?

class SRE implements DevOps
SRE focuses on a prescriptive way of measuring and achieving
reliability through engineering and operations work

DevOps | SRE
Reduce organization silos | Share ownership with developers by using the same tools and techniques across the stack
Accept failure as normal | Have a formula for balancing accidents and failures against new releases
Implement gradual change | Encourage moving quickly by reducing costs of failure
Leverage tooling & automation | Encourages "automating this year's job away" and minimizing manual systems work to focus on efforts that bring long-term value to the system
Measure everything | Believes that operations is a software problem, and defines prescriptive ways for measuring availability, uptime, outages, toil, etc.
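The slide does not spell out the "formula for balancing accidents and failures" with new releases. One common reading in SRE practice is an error budget: the share of requests that may fail before the SLO is broken. The sketch below assumes that reading; the function names, the 99.9% target and the request counts are illustrative and not taken from the talk.

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed_failures = (1.0 - slo_target) * total
    if allowed_failures <= 0:
        raise ValueError("no error budget: SLO is 100% or there was no traffic")
    return 1.0 - failed / allowed_failures


def can_release(slo_target: float, total: int, failed: int) -> bool:
    """Hypothetical release gate: keep shipping only while budget remains."""
    return error_budget_remaining(slo_target, total, failed) > 0.0


# Assumed numbers: a 99.9% SLO over 1,000,000 requests allows about 1,000
# failures; 400 failures so far leaves roughly 60% of the budget.
remaining = error_budget_remaining(0.999, 1_000_000, 400)
print(f"budget remaining: {remaining:.0%}")
print("release allowed:", can_release(0.999, 1_000_000, 400))
```

While budget remains, the team can move quickly; once it is spent, the same number argues for slowing releases and working on reliability.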

Remediation Plugins
FBAR API
Facebook Operations API
Monitoring API
Hardware Power Control API
Service Configuration API
Site Operations Repair API
• Sifts through 3.37 billion notifications from
network devices each month
• Filtering out noise down to roughly 750,000
alarms that need action
• Of those, FBAR resolves 99.6 percent of the
alarms without human intervention
• Developed and maintained by two full time
engineers
• Doing the work of ~200 full time systems
administrators
SRE IMPACT EXAMPLE
FACEBOOK AUTO-REMEDIATION SYSTEM (FBAR)
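To put the FBAR figures in proportion, the arithmetic can be worked through directly. This is only a back-of-the-envelope sketch using the numbers quoted on the slide; the variable names are made up for illustration.

```python
# Figures quoted on the slide (per month).
NOTIFICATIONS_PER_MONTH = 3_370_000_000
ACTIONABLE_ALARMS = 750_000
AUTO_RESOLVED_PERMILLE = 996  # 99.6 percent, kept as an integer to avoid rounding

# Integer arithmetic: 750,000 * 996 / 1000.
auto_resolved = ACTIONABLE_ALARMS * AUTO_RESOLVED_PERMILLE // 1000
needs_human = ACTIONABLE_ALARMS - auto_resolved
actionable_share = ACTIONABLE_ALARMS / NOTIFICATIONS_PER_MONTH

print(f"actionable share of notifications: {actionable_share:.4%}")
print(f"alarms resolved automatically:     {auto_resolved:,}")
print(f"alarms left for humans per month:  {needs_human:,}")
```

Under these figures only about 3,000 alarms a month still need a person, which is what makes the comparison with roughly 200 administrators plausible.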

Defining TOIL
“Toil is the kind of work tied to running a
production service that tends to be manual,
repetitive, automatable, tactical, devoid of
enduring value, and that scales linearly as a
service grows. …one or more of the
following…”:
• Manual
• Repetitive
• Automatable
• Tactical
• No enduring value
• O(n) with service growth
https://landing.google.com/sre/book/chapters/eliminating-toil.html
All types of SRE work:
• Software engineering
• Systems engineering
• Toil
• Overhead

SRE Principles
INNOVATING (50%)
• Features
• Scaling
• Automation
• Software background
OPERATING (50%)
• Issues
• On-Call
• Manual Intervention
• Systems background
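The 50/50 split can be checked against a team's logged time. The sketch below is a minimal illustration, assuming hours are tracked per work type and that issues, on-call and manual intervention count as operating work; the labels and the sample week are hypothetical.

```python
# Assumed labels for work that counts as "operating" rather than "innovating".
OPERATING_WORK = {"issues", "on_call", "manual_intervention"}


def operating_share(hours_by_type: dict[str, float]) -> float:
    """Share of logged hours spent on operating work."""
    total = sum(hours_by_type.values())
    if total <= 0:
        raise ValueError("no hours logged")
    operating = sum(h for kind, h in hours_by_type.items() if kind in OPERATING_WORK)
    return operating / total


# Hypothetical week for one engineer (hours).
week = {"features": 12, "automation": 6, "on_call": 10, "manual_intervention": 12}
share = operating_share(week)
over_cap = share > 0.5  # the 50% ceiling from the slide

print(f"operating share: {share:.0%}, over the 50% cap: {over_cap}")
```

A share above 50% is a signal to push work back to automation or to the development team, not a target to hit exactly.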

• Service Level Agreement
(SLA)
• Defines the service availability for
a customer and the penalties for
breaking that availability
• Question: what happens if the
SLOs aren’t met?
SRE usually not involved
• Service Level Indicator (SLI)
• Metrics over time which inform about
the health of a service
• Examples:
Request latency
Error rate
System throughput
Availability
• Service Level Objective (SLO)
• Agreed upon bounds for how
often SLIs must be met
• Examples:
SLI ≤ target
lower bound ≤ SLI ≤ upper bound
SRE MEASUREMENTS
• SLIs and SLOs are the prescriptive way in which SRE practices the
DevOps principle of "measure everything". Implementing SLOs
also forces collaboration between product owners and systems
operators, adhering to the DevOps principle of "break down
organizational barriers".
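The two SLO shapes above (SLI ≤ target, and lower bound ≤ SLI ≤ upper bound) together with "how often" the bound must hold can be sketched as a small data type. This is an illustrative sketch only; the class name, field names and sample values are assumptions, not part of the talk.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class SLO:
    name: str
    lower: Optional[float] = None    # omit for a one-sided "SLI <= target"
    upper: Optional[float] = None
    required_fraction: float = 1.0   # how often the bound must hold

    def holds(self, sli: float) -> bool:
        """True when a single SLI sample is inside the agreed bounds."""
        if self.lower is not None and sli < self.lower:
            return False
        if self.upper is not None and sli > self.upper:
            return False
        return True

    def is_met(self, samples: Sequence[float]) -> bool:
        """True when enough samples fall inside the bounds."""
        if not samples:
            raise ValueError("no SLI samples")
        good = sum(1 for s in samples if self.holds(s))
        return good / len(samples) >= self.required_fraction


# Hypothetical objective: latency at or below 300 ms in at least 75% of windows.
latency_slo = SLO("request latency (ms)", upper=300.0, required_fraction=0.75)
samples = [120.0, 250.0, 310.0, 180.0]
print(latency_slo.is_met(samples))  # 3 of 4 samples are within bounds
```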

MEASURE, MANAGE RISK
DEVELOPMENT OPERATIONS
SLI
• Request Latency
• Batch Throughput
• Failures per Request
SLO
• Binding targets for a collection of SLIs
• 99th percentile latency of requests received in the last 5 mins < 300 ms
• Ratio of errors to total requests received in the last 5 mins < 1%
SLA
• Agreement between a customer and a service provider, typically based on SLOs
• Total amount of downtime over a year, more or less than the ‘Objective’ of the service
SLIs drive SLOs which inform SLAs
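The two example SLOs on this slide (99th percentile latency under 300 ms and error ratio under 1%, both over the last 5 minutes) can be evaluated from raw request records. The sketch below assumes a simple in-memory list of requests and uses the nearest-rank percentile method; the `Request` type and the sample data are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    timestamp: float   # seconds since epoch
    latency_ms: float
    is_error: bool


WINDOW_SECONDS = 5 * 60


def percentile(values, p: int) -> float:
    """Nearest-rank percentile for an integer p in 1..100."""
    ordered = sorted(values)
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


def check_slos(requests, now: float) -> dict:
    """Compute both SLIs over the last 5 minutes and compare with the targets."""
    recent = [r for r in requests if now - WINDOW_SECONDS <= r.timestamp <= now]
    if not recent:
        raise ValueError("no requests in the window")
    p99 = percentile([r.latency_ms for r in recent], 99)
    error_ratio = sum(r.is_error for r in recent) / len(recent)
    return {
        "p99_ms": p99,
        "error_ratio": error_ratio,
        "latency_ok": p99 < 300,
        "errors_ok": error_ratio < 0.01,
    }


# Hypothetical traffic: 100 recent requests with latencies of 1..100 ms,
# plus one slow, failed request that is too old to count.
now = 1_000_000.0
requests = [Request(now - i, float(i + 1), False) for i in range(100)]
requests.append(Request(now - 3600, 10_000.0, True))
result = check_slos(requests, now)
print(result)
```

In production these values would normally come from a monitoring system rather than a Python list; the point here is only how the SLI is derived and compared with its target.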

SRE Effectiveness?
• SLA Compliance
• System Compliance Profile
• MTTR
• Problem or Bug Age
• Incident to Unique Root-
Cause Ratio
• Toil to Overall Effort Ratio
• Service Performance (e.g.
Page Load Times, Network
Latency, etc.)
• Infrastructure & Cloud
Efficiency
• Service Provisioning Cycle
Time
• Service Automation Ratio
Given their breadth of scope, it becomes important to define performance and success metrics upon which the SRE is
evaluated
EXAMPLES
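Two of the metrics listed above, MTTR and the incident to unique root-cause ratio, are simple to compute once incidents are recorded consistently. The sketch below assumes MTTR is measured from detection to resolution and that each incident carries a root-cause label; the `Incident` type and the sample data are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Incident:
    detected_min: float   # minutes on some shared clock
    resolved_min: float
    root_cause: str


def mttr_minutes(incidents) -> float:
    """Mean time from detection to resolution, in minutes."""
    return mean(i.resolved_min - i.detected_min for i in incidents)


def incidents_per_root_cause(incidents) -> float:
    """Values well above 1 suggest the same problems keep recurring."""
    return len(incidents) / len({i.root_cause for i in incidents})


incidents = [
    Incident(0, 30, "disk full"),
    Incident(100, 160, "disk full"),
    Incident(300, 390, "bad deploy"),
    Incident(500, 520, "dns"),
]
print("MTTR (min):", mttr_minutes(incidents))
print("incidents per root cause:", round(incidents_per_root_cause(incidents), 2))
```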

SRE Getting Started
Think big… …start small… …scale fast.
Talent, Organization and Culture
• Align on the portfolio of Product
Development services available and identify
health indicators
• Web app, MW services, SAP HR, etc.
• Identify product owner & end-customer
• What reliability expectations do they have?
(availability, latency, etc.)
• What indicators and mechanisms do you use
to measure health today?
• Identify value potential & path forward
• Dependency on other services
• Are reliability expectations realistic?
• Does the team have the right telemetry in
place to measure E2E health?
• Is the team empowered and skilled to make
changes to improve reliability?
Set the Strategy Begin Implementation Transform the Organization
• Select Product Development services that
have the greatest need for reliability to the
business – availability, stability, and
performance
• Consider agility and viability constraints
• Service criticality (Maintenance state/EOL,
critical to core, strategic)
• Telemetry state, productivity tools
• Organizational considerations –
Consolidating multi-tiered groups into a
single multi-modal group
• Human capital strategy
• Dependency on other services
• Assemble pilot SRE team for identified
Product Development service, define
operating model, productivity measures,
start running
• Reflect on pilot team achievements across health and
reliability metrics – MTTR, Availability, Performance,
Incident ratios
• Trend data over 30, 60, 90 days
• Analyze SRE backlog – big rock projects
• Talk about toil (work humans don’t wish to do)
• Fine-tune, continue to improve
• Initiate broader assessment and selection of Product
Development services based on business need and
viability (functional and technical)
• Strategize SRE specialties based on nature of Product
Development service portfolio to scale while
remaining lean
• SRE for custom web applications
• SRE for storage infrastructure
• SRE for all packaged back-office solutions
CONSIDER IMPLEMENTATION OF SRE AS A CULTURAL JOURNEY

You can mobilize your ADOP toolset in less than 48 hours with 3 easy steps
through our self-service portal
What ADOP Can Do for you
DevOps processes on the ADOP integrated tooling environment have been
known to reduce delivery costs substantially
The platform supports projects of all sizes, both enterprise-scale and smaller,
with a low-cost, flexible subscription model
ADOP includes ready-to-go pipelines and infrastructure automation branded
cartridges for hundreds of technologies
ADOP supports both Agile and Waterfall projects by driving higher
productivity and quality at lower risk
ADOP: ACCENTURE DEVOPS PLATFORM

WHAT CAN YOU DO WITH ADOP?
The platform is designed around technology extensions and re-usable components called cartridges, which further accelerate DevOps enablement.
Document and
Manage Project
Scope
Track Project
Progress
Build Code
Artifacts and
Products
Deploy your
Code to Any
Environment
Test your
System
Enforce
Security Policy
“Install”
Accenture Best
Practices

2014
2015
2016
2017- 2018
ADOP/Enterprise History
Launched Managed Jira within
ALM Factory in Hoff Data Centre
Re-platforming to AWS cloud
ADLM merges with ADOP
ADOP CI/CD Offering
Accenture DevOps Platform
Projects using
CI/CD
215+ on
300+ Masters
560+
Clients supported by
ADOP SaaS
Confluence
21.5K in last 3 months
Jira
45K+ Total
Users
11K+ Active in
last 3 months
27M+
LOC Analysis total
17K+
Jenkins Job
weekly
Cloud
4 Clouds Account
330 EC2
1000 Containers
300TB data
500+ Security groups
Accenture Security Compliant
600+
Tickets processed monthly
PaaS capabilities
Self Service Capabilities
BY THE NUMBERS….
