TITLE SPONSORS
TRACK SPONSORS
HEADLINE SPONSORS
PARTNER SPONSORS
MEMBER SPONSORS
Guiceworks
Cooley
Bridgepoint Education
Full Contact
General Assembly
Dripjoy
Lyft
OnDeck
Connect for Health
Wazee Digital
Officescapes
Jake Jabs Center for Entrepreneurship
Denver Office of Economic Development
Alchemy Security
Ayla Networks
Edgelink
Swiftpage
Taxnologi
Spotx
Davis Graham & Stubbs
Documoto
Rightpoint
Name.com
The Denver Foundation
Boomtown
Six Actual
Maker Source
Slider Smith & Frampton
Netsuite
Logistical Meetings & Events
Rewriting DevOps
Matthew Boeckman
VP - Infrastructure
Craftsy
@matthewboeckman
This is not a DevOps definition
● Common Tooling
● Organizational Empathy
● Shared Responsibility
Why Rewrite?
1. Support new business initiatives
2. Scale and resilience
3. Quicker iterations
#30 on Forbes' 2015 list of Most Promising Companies
10+MM registered members, 11+MM enrolled courses
350 course enrollments/hour
DevOps 1.0
● Some Ops dev’d, and a few Devs Ops’d
● Great cross team culture, still separate teams
● Shared Oncall but heavy Ops burden
● Limited common tooling
DevOps 2.0 goals
● Integrated DevOps team and workflows
● Common tools
● Shared Oncall
Common Tooling
Common Tools
Jenkins (build, deploy, ETL, scheduled tasks)
Terraform (infrastructure configuration)
Splunk (data intelligence)
AWS (all infrastructure)
Backend
Ops
Frontend
Organizational Empathy
Site
Reliability
Engineering
*not DevOps
"Fundamentally, it's what happens when you ask a
software engineer to design an operations function."
Ben Treynor Sloss, Vice President, Google Engineering, founder of Google SRE
SRE Phase 1 (Feb-May)
● Determine tooling
○ Nagios, Graphite, Splunk, Confluence
● SWAG at reliability metrics
○ Errors; response time
● Runbooks
● Blameless Postmortem every outage
● Iterate
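The two starting reliability metrics above (errors and response time) can be sketched in a few lines. This is a minimal illustration, not the tooling used in the talk: the `Request` record, the choice of 5xx responses as "errors", and the nearest-rank percentile are all assumptions made for the example.

```python
from dataclasses import dataclass
from math import ceil


@dataclass
class Request:
    """Hypothetical request record; field names are assumptions."""
    status: int         # HTTP status code
    duration_ms: float  # response time in milliseconds


def error_rate(requests):
    """Fraction of requests that ended in a 5xx response."""
    if not requests:
        return 0.0
    errors = sum(1 for r in requests if r.status >= 500)
    return errors / len(requests)


def percentile(values, pct):
    """Nearest-rank percentile; pct is in (0, 100]."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


# Made-up sample data, only to show the calls.
sample = [
    Request(200, 120.0),
    Request(200, 95.0),
    Request(500, 30.0),
    Request(200, 410.0),
]
print(error_rate(sample))
print(percentile([r.duration_ms for r in sample], 95))
```

A rough first guess at thresholds for these two numbers is enough to start; later phases can tune them.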
The primary hurdle to DevOps
and SRE adoption is
The Skill Gap
Runbooks:
● System overview
● Escalation path
● Alert descriptions
● Common failure conditions
● Known recovery procedures
● Incident history
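One way to keep runbooks consistent is to treat the six sections above as a fixed structure and flag the ones still empty. The sketch below assumes a hypothetical `Runbook` class and a made-up "checkout" service; the field types are illustrative choices, not something prescribed by the talk.

```python
from dataclasses import dataclass, field, fields


@dataclass
class Runbook:
    """Hypothetical runbook record with one field per section on the slide."""
    service: str
    system_overview: str = ""
    escalation_path: list = field(default_factory=list)
    alert_descriptions: dict = field(default_factory=dict)
    common_failure_conditions: list = field(default_factory=list)
    known_recovery_procedures: dict = field(default_factory=dict)
    incident_history: list = field(default_factory=list)


def missing_sections(runbook):
    """Names of sections that are still empty, in slide order."""
    return [
        f.name
        for f in fields(runbook)
        if f.name != "service" and not getattr(runbook, f.name)
    ]


# Made-up example: only two sections have been written so far.
draft = Runbook(
    service="checkout",
    system_overview="Node.js frontend calling a Java backend.",
    escalation_path=["primary on-call", "team lead"],
)
print(missing_sections(draft))
```

A check like this could run in a build job so that an incomplete runbook is visible before the service it describes goes on call.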
Postmortem - 7 W’s and an H
1. What (happened)
2. What (systems were impacted)
3. When (did it occur)
4. Who (was involved)
5. How (did we discover the issue)
6. Why (did it go explody)
7. What (will we do to remedy it)
8. When (will that remedy be actioned)
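The eight questions above map naturally onto a fixed template. The following is a sketch under assumptions: the `Postmortem` class, its attribute names, and the sample answers are hypothetical, and the wording of question 6 is paraphrased.

```python
from dataclasses import dataclass

# (attribute name, prompt) pairs in the order the slide lists them.
QUESTIONS = [
    ("what_happened", "What happened?"),
    ("systems_impacted", "What systems were impacted?"),
    ("when_occurred", "When did it occur?"),
    ("who_involved", "Who was involved?"),
    ("how_discovered", "How did we discover the issue?"),
    ("why_failed", "Why did it fail?"),
    ("remedy", "What will we do to remedy it?"),
    ("remedy_due", "When will that remedy be actioned?"),
]


@dataclass
class Postmortem:
    """Hypothetical postmortem record; one string answer per question."""
    what_happened: str = ""
    systems_impacted: str = ""
    when_occurred: str = ""
    who_involved: str = ""
    how_discovered: str = ""
    why_failed: str = ""
    remedy: str = ""
    remedy_due: str = ""


def unanswered(pm):
    """Prompts that still have no answer."""
    return [q for attr, q in QUESTIONS if not getattr(pm, attr).strip()]


def render(pm):
    """Numbered plain-text report, marking gaps explicitly."""
    lines = []
    for number, (attr, question) in enumerate(QUESTIONS, start=1):
        answer = getattr(pm, attr).strip() or "(unanswered)"
        lines.append(f"{number}. {question} {answer}")
    return "\n".join(lines)


# Made-up incident, only to show the calls.
pm = Postmortem(
    what_happened="Checkout requests timed out.",
    systems_impacted="Checkout service",
    how_discovered="Response-time alert",
)
print(render(pm))
print(unanswered(pm))
```

Note that "who was involved" records participants for the timeline, not for blame; the template has no field for fault.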
Having a “blameless” Post-Mortem process means that engineers whose actions have contributed to an
accident can give a detailed account of:
what actions they took at what time,
what effects they observed,
expectations they had,
assumptions they had made,
and their understanding of timeline of events as they occurred.
…and that they can give this detailed account without fear of punishment or retribution
*John Allspaw, CTO - Etsy
https://codeascraft.com/2012/05/22/blameless-postmortems/
Shared Responsibility
Empathy drives action
Common tools and Runbooks bridge the skills gap
Postmortems direct iterations
Incident
Post-Mortem
Tools
Runbook
Reward
SRE Phase 2 (May-> … forever)
● Build a production environment
● Tune reliability metrics
● Load tests
● Resilience tests
● Recovery tests
● Blameless Postmortem every outage
● Runbooks
● Iterate
Fastly - Content Delivery
F5 & ELB - load balancing
FE - Node.js
BE - Java
Packer - AMIs
Consul - service discovery
Terraform - Infrastructure
Postgres/RDS - database
SQS/SNS/Lambda/S3 - everything else
SRE - Two metrics
Mean Time to Identify
Mean Time to Resolve
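Both metrics can be computed from three timestamps per incident. This is a minimal sketch with assumptions stated up front: the `Incident` record is hypothetical, and both means are measured from the start of the incident (some teams measure time to resolve from the moment of identification instead).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are assumptions."""
    started: datetime     # when the problem began
    identified: datetime  # when the cause was identified
    resolved: datetime    # when service was restored


def _mean(deltas):
    """Average of a non-empty list of timedeltas."""
    if not deltas:
        raise ValueError("no incidents")
    return sum(deltas, timedelta()) / len(deltas)


def mean_time_to_identify(incidents):
    return _mean([i.identified - i.started for i in incidents])


def mean_time_to_resolve(incidents):
    # Assumption: measured from incident start, not from identification.
    return _mean([i.resolved - i.started for i in incidents])


# Made-up incidents, only to show the arithmetic.
incidents = [
    Incident(datetime(2016, 1, 5, 10, 0), datetime(2016, 1, 5, 10, 10),
             datetime(2016, 1, 5, 11, 0)),
    Incident(datetime(2016, 1, 9, 14, 0), datetime(2016, 1, 9, 14, 20),
             datetime(2016, 1, 9, 14, 40)),
]
print(mean_time_to_identify(incidents))  # 0:15:00
print(mean_time_to_resolve(incidents))   # 0:50:00
```

Tracking the two numbers separately shows whether time is being lost in finding the problem or in fixing it.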
DevOps + SRE
T-18 days
3 hours
…
This is not a DevOps definition approach
● Common Tooling
● Organizational Empathy
● Shared Responsibility
● Land and expand
● Start with pre-prod and grow
Thank you!
Questions?
@matthewboeckman
