DEVCLUB.LV 20/06/2019
SRE
(service reliability engineer)
on big DevOps
platform
running on the
cloud
Copyright © 2019 Accenture. All rights reserved.
“RELIABILITY IS THE MOST
IMPORTANT FEATURE OF ANY
APPLICATION
BUT IS OFTEN THE LEAST
WELL DEFINED.
WHY ARE WE HERE?
WE NEED TO USE SRE TO HELP
OUR CLIENTS TO CHANGE
THAT AND KEEP THEIR
FUTURES BRIGHT!”
https://cloudplatform.googleblog.com/2018/05/SRE-vs-DevOps-competing-standards-or-close-friends.html
SITE RELIABILITY ENGINEERING (SRE)
• Proclaimed by Google as how they
do IT Operations
• Invented by them in 2003
• First book published in 2016 (30
essays)
• Read for free online:
https://landing.google.com/sre/book/index.html
• Workbook much more applied
As per Wikipedia, SRE can be defined as:
“a discipline that incorporates aspects
of software engineering and applies
that to operations whose goals are to
create ultra-scalable and highly
reliable software systems.”
DEFINITION FROM GOOGLE
SO IT’S A DISCIPLINE
SRE: WHEN
OPERATIONS IS
DESIGNED
BY SOFTWARE
ENGINEERS
Modern Product Development requires more
functionality introduced more frequently,
creating more complexity and more support
activities.
Site reliability engineering (SRE) is part of the
solution: a discipline that incorporates aspects of
software engineering and applies that to
operations whose goals are to create ultra-
scalable and highly reliable software systems.
SREs are engineers ...
Applying principles of computer science and engineering to the design and development of highly available systems
... focused on reliability ...
Proactively finding ways to make systems more scalable, reliable, and efficient until systems reach “desired reliability targets”
... while operating services
Spanning a broad portfolio of software (applications, databases, cloud services) and hardware (network, data-center) assets
“THERE IS NO SUCH THING AS A NEW IDEA. IT
IS IMPOSSIBLE. WE SIMPLY TAKE A LOT OF
OLD IDEAS AND PUT THEM INTO A SORT OF
MENTAL KALEIDOSCOPE. WE GIVE THEM A
TURN AND THEY MAKE NEW AND CURIOUS
COMBINATIONS.
WE KEEP ON TURNING AND MAKING NEW
COMBINATIONS INDEFINITELY; BUT THEY
ARE THE SAME OLD PIECES OF COLORED
GLASS THAT HAVE BEEN IN USE THROUGH
ALL THE AGES.”
INSIGHT FROM
MARK TWAIN
MEASURING UP
against other movements
DevOps
shouldn’t be like
DevOps
SRE versus DEVOPS
Dev | Ops (wall of confusion)
DevOps
In practice typically leads to:
• CI/CD pipelines
• Infra-code at least for test environments
• DevOps Team
• Better quality engineering (hopefully)
SRE
In practice will hopefully lead to:
• Higher reliability after code deploy
• Better operability
• Better life for Ops team
• The right balance of speed vs safety
But… does devOps always implement all of DevOps in reality?
class SRE implements DevOps
SREs are focused on a prescriptive way of measuring and achieving
reliability through engineering and operations work
DevOps | SRE
Reduce organization silos | Share ownership with developers by using the same tools and techniques across the stack
Accept failure as normal | Have a formula for balancing accidents and failures against new releases
Implement gradual change | Encourage moving quickly by reducing costs of failure
Leverage tooling & automation | Encourages "automating this year's job away" and minimizing manual systems work to focus on efforts that bring long-term value to the system
Measure everything | Believes that operations is a software problem, and defines prescriptive ways for measuring availability, uptime, outages, toil, etc.
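The "formula for balancing accidents and failures" row is commonly realized as an error budget: whatever unreliability the availability target leaves over may be spent on releases. The sketch below is illustrative only; the function name, the 99.9% target, and the request counts are assumptions, not figures from the talk.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict:
    """Return the failures an availability SLO tolerates and how many remain."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    # The budget is the share of requests that is allowed to fail.
    allowed_failures = (1.0 - slo_target) * total_requests
    remaining = allowed_failures - failed_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining": remaining,
        # Hypothetical policy: keep releasing only while budget is left.
        "can_release": remaining > 0,
    }


# Hypothetical month: 1,000,000 requests, 400 failures, 99.9% target.
budget = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
print(budget)
```

Once the remaining budget reaches zero, a team following this policy would pause feature releases and spend its time on reliability work instead.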
SRE IMPACT EXAMPLE
FACEBOOK AUTO-REMEDIATION SYSTEM (FBAR)
Diagram labels: Remediation Plugins, FBAR API, Facebook Operations API, Monitoring API, Hardware Power Control API, Service Configuration API, Site Operations Repair API
• Sifts through 3.37 billion notifications from network devices each month
• Filters out noise, down to roughly 750,000 alarms that need action
• Resolves 99.6 percent of those alarms without human intervention
• Developed and maintained by two full-time engineers
• Does the work of ~200 full-time systems administrators
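To put the FBAR figures in proportion, the arithmetic can be worked through directly. This snippet only restates the numbers quoted on the slide; the variable names are illustrative.

```python
notifications_per_month = 3_370_000_000  # raw notifications, from the slide
alarms_per_month = 750_000               # actionable alarms after noise filtering
auto_resolved_per_mille = 996            # 99.6 percent, as an integer to avoid float rounding

auto_resolved = alarms_per_month * auto_resolved_per_mille // 1000
needs_human = alarms_per_month - auto_resolved
notifications_per_alarm = notifications_per_month / alarms_per_month

print(auto_resolved)   # → 747000
print(needs_human)     # → 3000
print(round(notifications_per_alarm))
```

About 3,000 alarms a month are left for people to handle, which is what makes the two-engineer figure plausible.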
Defining TOIL
“Toil is the kind of work tied to running a
production service that tends to be manual,
repetitive, automatable, tactical, devoid of
enduring value, and that scales linearly as a
service grows. …one or more of the
following…”:
• Manual
• Repetitive
• Automatable
• Tactical
• No enduring value
• O(n) with service growth
https://landing.google.com/sre/book/chapters/eliminating-toil.html
All types of SRE work:
• Software engineering
• Systems engineering
• Toil
• Overhead
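Since toil is meant to be named and measured, one simple way to track it is as a share of logged hours across the four work types above. The sketch below is a minimal illustration: the category keys, the sample hours, and the use of a 50% ceiling as the threshold are assumptions.

```python
from typing import Dict

TOIL_CAP = 0.5  # assumption: at most half of SRE time should go to toil


def toil_share(hours_by_category: Dict[str, float]) -> float:
    """Return toil hours as a fraction of all logged hours."""
    total = sum(hours_by_category.values())
    if total <= 0:
        raise ValueError("no hours logged")
    return hours_by_category.get("toil", 0.0) / total


# Hypothetical week for one engineer, in hours.
week = {
    "software_engineering": 14.0,
    "systems_engineering": 8.0,
    "toil": 12.0,
    "overhead": 6.0,
}
share = toil_share(week)
print(f"toil share: {share:.0%}, within cap: {share <= TOIL_CAP}")
```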
SRE Principles
INNOVATING (50%): Features, Scaling, Automation, Software background
OPERATING (50%): Issues, On-Call, Manual Intervention, Systems background
SRE MEASUREMENTS
• Service Level Agreement (SLA)
  • Defines the service availability for a customer and the penalties for breaking that availability
  • Question: what happens if the SLOs aren’t met?
  • SRE usually not involved
• Service Level Indicator (SLI)
  • Metrics over time which inform about the health of a service
  • Examples: Request latency, Error rate, System throughput, Availability
• Service Level Objective (SLO)
  • Agreed-upon bounds for how often SLIs must be met
  • Examples:
    SLI ≤ target
    lower bound ≤ SLI ≤ upper bound
• SLIs and SLOs are the prescriptive way in which SRE practices the DevOps principle of "measure everything". Implementing SLOs also forces collaboration between product owners and systems operators, adhering to the DevOps principle of "break down organizational barriers".
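The two SLO shapes above (an upper target, or a lower and upper bound) can be checked by counting how often an SLI sample falls inside the bound. This is a minimal sketch; the function name, the sample latencies, and the 90% objective are assumptions chosen for illustration.

```python
from typing import Iterable, Optional


def slo_compliance(samples: Iterable[float], upper: float,
                   lower: Optional[float] = None) -> float:
    """Fraction of SLI samples satisfying lower <= SLI <= upper (lower is optional)."""
    values = list(samples)
    if not values:
        raise ValueError("no SLI samples")
    good = sum(1 for v in values if (lower is None or lower <= v) and v <= upper)
    return good / len(values)


# Hypothetical request latencies in milliseconds.
latencies_ms = [120, 180, 250, 310, 90, 140, 200, 275, 160, 130]
compliance = slo_compliance(latencies_ms, upper=300)

# Hypothetical objective: at least 90% of requests complete within 300 ms.
slo_met = compliance >= 0.9
print(compliance, slo_met)
```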
MEASURE, MANAGE RISK
DEVELOPMENT | OPERATIONS
SLI
• Request Latency
• Batch Throughput
• Failures per Request
SLO
• Binding targets for a collection of SLIs
• 99th percentile latency of requests received in the last 5 mins < 300 ms
• Ratio of errors to total requests received in the last 5 mins < 1%
SLA
• Agreement between a customer and a service provider, typically based on SLOs
• Total amount of downtime over a year, more or less than the ‘Objective’ of the service
SLIs drive SLOs, which inform SLAs
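The two example objectives on this slide (99th percentile latency under 300 ms and error ratio under 1%, both over the last 5 minutes) can be evaluated from raw request records. The sketch below assumes a hypothetical `Request` record and uses the nearest-rank percentile method; a real monitoring system would typically compute this from histograms instead.

```python
import math
from dataclasses import dataclass
from typing import List

WINDOW_SECONDS = 5 * 60  # "last 5 mins" from the slide


@dataclass
class Request:  # hypothetical record shape
    timestamp: float   # seconds since epoch
    latency_ms: float
    is_error: bool


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty list."""
    if not values:
        raise ValueError("no values")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def evaluate_slos(requests: List[Request], now: float) -> dict:
    """Compute both SLIs over the window and compare them to the slide's targets."""
    recent = [r for r in requests if now - WINDOW_SECONDS <= r.timestamp <= now]
    if not recent:
        raise ValueError("no requests in the window")
    p99 = percentile([r.latency_ms for r in recent], 99)
    error_ratio = sum(r.is_error for r in recent) / len(recent)
    return {
        "p99_latency_ms": p99,
        "error_ratio": error_ratio,
        "latency_slo_met": p99 < 300,
        "error_slo_met": error_ratio < 0.01,
    }


# Hypothetical traffic: 200 recent requests plus one stale request that is ignored.
now = 1_000_000.0
sample = (
    [Request(now - 10, 100.0, False) for _ in range(197)]
    + [Request(now - 10, 100.0, True)]
    + [Request(now - 10, 500.0, False) for _ in range(2)]
    + [Request(now - 3600, 9999.0, True)]  # outside the 5-minute window
)
result = evaluate_slos(sample, now)
print(result)
```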
SRE Effectiveness?
Given their breadth of scope, it becomes important to define performance and success metrics upon which the SRE is evaluated
EXAMPLES
• SLA Compliance
• System Compliance Profile
• MTTR
• Problem or Bug Age
• Incident to Unique Root-Cause Ratio
• Toil to Overall Effort Ratio
• Service Performance (e.g. Page Load Times, Network Latency, etc.)
• Infrastructure & Cloud Efficiency
• Service Provisioning Cycle Time
• Service Automation Ratio
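Two of these metrics are straightforward to derive from an incident log. The sketch below takes MTTR as the mean time from an incident being opened to its resolution and computes the incident to unique root-cause ratio; the `Incident` shape, the root-cause labels, and the sample data are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple


class Incident(NamedTuple):  # hypothetical record shape
    opened: datetime
    resolved: datetime
    root_cause: str


def mttr(incidents: List[Incident]) -> timedelta:
    """Mean time from open to resolution."""
    if not incidents:
        raise ValueError("no incidents")
    total = sum((i.resolved - i.opened for i in incidents), timedelta())
    return total / len(incidents)


def incidents_per_root_cause(incidents: List[Incident]) -> float:
    """Incidents divided by distinct root causes; above 1.0 means causes recur."""
    if not incidents:
        raise ValueError("no incidents")
    return len(incidents) / len({i.root_cause for i in incidents})


start = datetime(2019, 6, 1, 9, 0)
log = [
    Incident(start, start + timedelta(minutes=30), "disk-full"),
    Incident(start, start + timedelta(minutes=60), "disk-full"),
    Incident(start, start + timedelta(minutes=90), "bad-deploy"),
]
print(mttr(log), incidents_per_root_cause(log))
```

A ratio well above 1.0 suggests the same root causes keep recurring, which points to missing permanent fixes rather than slow response.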
SRE Getting Started
CONSIDER IMPLEMENTATION OF SRE AS A CULTURAL JOURNEY
Talent, Organization and Culture
Think big… (Set the Strategy)
• Align on the portfolio of Product Development services available and identify health indicators
• Web app, MW services, SAP HR, etc.
• Identify product owner & end-customer
• What reliability expectations do they have? (availability, latency, etc.)
• What indicators and mechanisms do you use to measure health today?
• Identify value potential & path forward
• Dependency on other services
• Are reliability expectations realistic?
• Does the team have the right telemetry in place to measure E2E health?
• Is the team empowered and skilled to make changes to improve reliability?
…start small… (Begin Implementation)
• Select Product Development services that have the greatest need for reliability to the business: availability, stability, and performance
• Consider agility and viability constraints
• Service criticality (Maintenance state/EOL, critical to core, strategic)
• Telemetry state, productivity tools
• Organizational considerations: consolidating multi-tiered groups into a single multi-modal group
• Human capital strategy
• Dependency on other services
• Assemble pilot SRE team for identified Product Development service, define operating model, productivity measures, start running
…scale fast. (Transform the Organization)
• Reflect on pilot team achievements across health and reliability metrics: MTTR, Availability, Performance, Incident ratios
• Trend data over 30, 60, 90 days
• Analyze SRE backlog (big rock projects)
• Talk about toil (work humans don’t wish to do)
• Fine-tune, continue to improve
• Initiate broader assessment and selection of Product Development services based on business need and viability (functional and technical)
• Strategize SRE specialties based on nature of Product Development service portfolio to scale while remaining lean
• SRE for custom web applications
• SRE for storage infrastructure
• SRE for all packaged back-office solutions
ADOP
STATE OF THE
UNION
ADOP: ACCENTURE DEVOPS PLATFORM
What ADOP Can Do for You
• You can mobilize your ADOP toolset in less than 48 hours with 3 easy steps through our self-service portal
• DevOps processes on the ADOP integrated tooling environment have been known to reduce delivery costs substantially
• The platform supports projects of all sizes, both enterprise-scale and smaller projects, at a low cost and with a flexible subscription model
• ADOP includes ready-to-go pipelines and infrastructure automation branded cartridges for hundreds of technologies
• ADOP supports both Agile and Waterfall projects by driving increased productivity, quality, and lower risk
WHAT CAN YOU DO WITH ADOP?
The platform is designed around technology extensions and re-usable components called cartridges, which further accelerate DevOps enablement.
• Document and Manage Project Scope
• Track Project Progress
• Build Code Artifacts and Products
• Deploy your Code to Any Environment
• Test your System
• Enforce Security Policy
• “Install” Accenture Best Practices
BY THE NUMBERS….
ADOP/Enterprise History (2014, 2015, 2016, 2017-2018)
• Launched Managed Jira within ALM Factory in Hoff Data Centre
• Re-platforming to AWS cloud
• ADLM merges with ADOP
• ADOP CI/CD Offering
• Accenture DevOps Platform
Projects using CI/CD: 215+ on 300+ Masters
560+ Clients supported by ADOP SaaS
Confluence 21.5K in last 3 months
Jira 45K+ Total Users
11K+ Active in last 3 months
27M+ LOC Analysis total
17K+ Jenkins Job weekly
Cloud: 4 Clouds Account, 330 EC2, 1000 Containers, 300TB data, 500+ Security groups
Accenture Security Compliant
600+ Tickets processed monthly
PaaS capabilities
Self Service Capabilities
SRE (service reliability engineer) on big DevOps platform running on the cloud by Pavlo Serdiuk at Cloud focused 76th DevClub.lv


Editor's Notes

  • #5 If you’ve not heard the term Site Reliability Engineering (SRE), it’s worth exploring.  The term originates from Google who actually invented it as a role name around 14(!) years ago as part of reinventing their approach to IT Operation
  • #7 There is no such thing as a new idea. It is impossible. We simply take a lot of old ideas and put them into a sort of mental kaleidoscope. We give them a turn and they make new and curious combinations. We keep on turning and making new combinations indefinitely; but they are the same old pieces of colored glass that have been in use through all the ages. - Mark Twain, a Biography 
  • #8 In truth, we emphasize the Dev part of the lifecycle. This is another crack at it. Renaissance of Ops Arch. Like New IT: not all of it is that new, but it’s helpful to have a reason to change to adopt it and avoid “why weren’t you doing this before?”
  • #9 DevOps a loose generic set of principles and SRE an advanced explicit implementation. Andrew CS
  • #13 Named, defined, actually measured Quarterly surveys of Google’s SREs show that the average time spent toiling is about 33%, so we do much better than our overall target of 50%.
  • #15 https://www.youtube.com/watch?v=tEylFyxbDLE&index=2&list=PLIivdWyY5sqJrKl7D2u-gmis8h9K66qoj
  • #17 https://www.youtube.com/watch?v=tEylFyxbDLE&index=2&list=PLIivdWyY5sqJrKl7D2u-gmis8h9K66qoj
  • #18 https://landing.google.com/sre/sre-book/chapters/part3/
  • #20 Best in class agile tools Mobilisation expertise 24/7 support ADOP/E consulting/solutioning offering Solutioning service offering Most valuable product – we
  • #22 There is nothing in Accenture that comes close to ADOP. Grassroots internal business – roots that are watered by the basics – good ideas – great ideas – hard work and people who are just really damn good at what they do. Be proud – be very proud of what you are. You are the best-in-class, cutting-edge central engine that generates millions of USD in revenue across the globe continuously and sustainably. You are the team that gets thousands of our people in hundreds of cities across the globe to do better work and maybe even lets them get home on time to live their lives. When we, Accenture, do better work – that means the whole world does better work. People in the medical industry searching for and delivering cures, people in media – getting better information to people in farther reaches, people in the resources industry – powering the nations, people in governments, people in groceries, people all over the world doing everything. Make no mistake, it is true that DevOps and modern engineering is more than technology – it is a new paradigm. We are the pioneers showing the rest of the world the right way to get work done in the modern computing age. Our excellence is their excellence. Be proud, and be wary: like it or not, every one of us in this room must rise to this occasion and this opportunity. We must embody the progressive spirit of New IT – teamwork, collaboration, relentless dedication to improvement. That is what this week is about. How do we come together as a team and rise to this expectation? I want to set the expectation now that all of you actively engage this week with open yet critical minds, as well as mutual respect. So to get things started, let's all introduce ourselves. I know many of you already know many of you, but let's give everyone their moment here and maybe we learn something new.