CHAOS ENGINEERING
THE FINE ART OF BREAKING STUFF IN
PRODUCTION ON PURPOSE
GEERT VAN DER CRUIJSEN
@GEERTVDC
CLOUD NATIVE ARCHITECT
#DOEPICSHIT
FULL CYCLE DEVELOPER
DEVOPS COACH
WHY DO WE NEED
CHAOS ENGINEERING?
“IN A COMPLEX LANDSCAPE
YOUR APPLICATION IS
NEVER FULLY UP”
TRADITIONAL MONITORING
TOOLS ARE DEAD!
MEASURE
USER IMPACT
RELIABILITY
AVAILABILITY
LATENCY
THROUGHPUT
CORRECTNESS
FRESHNESS
COVERAGE
QUALITY
DURABILITY
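
To make "measure user impact" concrete, here is a minimal sketch of two of the indicators listed above, availability and latency, computed from a batch of request records. The `Request` type and its field names are hypothetical and only serve the illustration; a real system would read these values from its telemetry.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    ok: bool            # True if the user received a correct response
    latency_ms: float   # how long the user waited


def availability(requests):
    """Fraction of requests that succeeded from the user's point of view."""
    return sum(r.ok for r in requests) / len(requests)


def latency_percentile(requests, pct):
    """Nearest-rank percentile of latency, e.g. pct=95 for p95."""
    ordered = sorted(r.latency_ms for r in requests)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


sample = [
    Request(True, 120),
    Request(True, 90),
    Request(False, 900),
    Request(True, 150),
]
print(availability(sample), latency_percentile(sample, 95))
```

Percentiles are used here instead of averages because a handful of very slow requests is exactly the kind of user impact an average tends to hide.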
RESILIENT APPLICATIONS
INFRASTRUCTURE
NETWORK
APPLICATION
PEOPLE
GRACEFUL DEGRADATION
FAIL OPEN
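
A small sketch of both ideas, assuming a hypothetical recommendation service and a hypothetical quota service (neither name comes from the talk). The first function degrades gracefully by serving a generic fallback; the second fails open by letting the request through when the check itself is unavailable. Whether failing open is acceptable is a decision to make together with the business.

```python
def recommendations_for(user_id, fetch, fallback=()):
    """Graceful degradation: return generic items if personalisation fails."""
    try:
        return list(fetch(user_id))
    except Exception:
        # The page still renders, just without personalised content.
        return list(fallback)


def is_allowed(user_id, check_quota):
    """Fail open: if the quota service is down, let the request through."""
    try:
        return bool(check_quota(user_id))
    except Exception:
        return True


def broken_dependency(user_id):
    # Stand-in for a dependency that is currently unavailable.
    raise TimeoutError("dependency unavailable")


print(recommendations_for(1, broken_dependency, fallback=["top-10"]))
print(is_allowed(1, broken_dependency))
```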
BUT WE DO TESTS?
UNIT A
INPUT OUTPUT
UNIT TESTS
COMPONENT / SERVICE A
INPUT OUTPUT
COMPONENT / SERVICE B
INTEGRATION TESTS
WHAT IS
CHAOS ENGINEERING?
CHAOS ENGINEERING
IS NOT
RANDOMLY BREAKING
STUFF IN PRODUCTION
CHAOS ENGINEERING
“Chaos Engineering is the discipline of
experimenting on a distributed system
in order to build confidence in the
system’s capability to withstand
turbulent conditions in production.”
https://principlesofchaos.org
SERVICE
INPUT OUTPUT
SERVICE
CHAOS ENGINEERING EXPERIMENTS
HOST FAILURE
RESOURCE CAPACITY ATTACKS
APPLICATION FAILURE
NETWORK ATTACKS
BRENT ATTACK
CHAOS ENGINEERING
ONLY IN PRODUCTION?
YOUR FIRST EXPERIMENT
HOW TO START
GAME DAY
INCIDENT RESPONSE LEARNING
OUTAGE
NORMAL
DETECT & ANALYSIS
FIX
LEARN
IMPROVE
CHAOS GAME DAY
CHAOS EXPERIMENT
NORMAL
DETECT & ANALYSIS
FIX
LEARN
IMPROVE
CHAOS EXPERIMENT PHASES
STEADY STATE
DEFINE HYPOTHESIS
DESIGN & EXECUTE
LEARN
FIX
EMBED
STEADY STATE
MEASURE BUSINESS METRICS
100 ms of extra load time reportedly drops Amazon’s sales by 1%
STEADY STATE
SERVICE
UNDER TEST
ROUTING SERVICE B
CONTROL
SERVICE
EXPERIMENT
SERVICE
98%
1%
1%
ALWAYS BE ABLE TO ABORT
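
The sketch below shows one way to read the diagram above, assuming the percentages mean that 1% of traffic goes to an experiment group, 1% to a control group, and the rest is untouched. The function names and the `max_delta` threshold are hypothetical. Users are assigned deterministically, and the abort check compares the error rate of the experiment group against the control group.

```python
import hashlib


def assign_group(user_id, experiment_pct=1, control_pct=1):
    """Deterministically map a user to 'experiment', 'control' or 'normal'."""
    digest = hashlib.sha256(str(user_id).encode()).hexdigest()
    bucket = int(digest, 16) % 100
    if bucket < experiment_pct:
        return "experiment"
    if bucket < experiment_pct + control_pct:
        return "control"
    return "normal"


def should_abort(control_errors, control_total,
                 experiment_errors, experiment_total, max_delta=0.05):
    """Abort when the experiment error rate exceeds control by more than max_delta."""
    if control_total == 0 or experiment_total == 0:
        return False  # not enough data to compare yet
    control_rate = control_errors / control_total
    experiment_rate = experiment_errors / experiment_total
    return experiment_rate - control_rate > max_delta


print(assign_group("user-42"))
print(should_abort(1, 100, 20, 100))
```

Hashing the user id keeps a user in the same group for the whole experiment, so the control and experiment groups stay comparable.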
DEFINE HYPOTHESIS
BRAINSTORM WHAT CAN GO WRONG
BRING EVERYONE
DEVELOPERS
SRE /OPERATIONS
NETWORKS
BUSINESS
INFRASTRUCTURE
TESTERS
WHAT CAN GO WRONG?
WHAT IF THE DATABASE IS DOWN?
WHAT IF A SERVICE RESPONDS SLOWER?
WHAT IF MY CACHE RESPONDS SLOWLY?
WHAT IF A POD DIES?
WHAT IF THE LOAD BALANCER STOPS?
WHAT IF….?
STOP IF YOU KNOW THE
EXPERIMENT WILL BREAK
DESIGN & EXECUTE EXPERIMENT
START SMALL
NOTIFY PEOPLE INVOLVED
SLOWLY INCREASE BLAST RADIUS
TOOLS:
GREMLIN.COM
CHAOSTOOLKIT.ORG
GITHUB.COM/NETFLIX/SIMIANARMY
GITHUB.COM/ASOBTI/KUBE-MONKEY
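
Independent of any of the tools listed, the "start small, slowly increase blast radius" advice can be sketched as a loop. This is only an illustration under assumptions: `steady_state_ok`, `inject` and `rollback` are hypothetical callables you would supply, and the blast radii are arbitrary percentages.

```python
def run_experiment(steady_state_ok, inject, rollback, blast_radii=(1, 5, 25)):
    """Inject a fault at growing blast radii; stop at the first deviation."""
    if not steady_state_ok():
        # Never start an experiment on a system that is already unhealthy.
        return {"status": "skipped", "reason": "system not in steady state"}
    for radius in blast_radii:
        inject(radius)
        try:
            healthy = steady_state_ok()
        finally:
            rollback()  # always undo the fault, even if the check raises
        if not healthy:
            return {"status": "hypothesis_rejected", "radius": radius}
    return {"status": "hypothesis_held", "radius": blast_radii[-1]}


# Toy system that stops being healthy once 5% or more is affected.
state = {"affected_pct": 0}
result = run_experiment(
    steady_state_ok=lambda: state["affected_pct"] < 5,
    inject=lambda pct: state.update(affected_pct=pct),
    rollback=lambda: state.update(affected_pct=0),
)
print(result)
```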
LEARN
HOW FAST DID WE RECOVER?
HOW FAST DID WE DETECT?
DO NOT BLAME!
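
Both questions can be answered from three timestamps recorded during the experiment. A minimal sketch, assuming you note when the fault was injected, when someone (or something) detected it, and when the steady state returned; the function and key names are hypothetical.

```python
from datetime import datetime


def timeline_metrics(injected_at, detected_at, recovered_at):
    """Return time to detect and time to recover, in seconds."""
    return {
        "time_to_detect_s": (detected_at - injected_at).total_seconds(),
        "time_to_recover_s": (recovered_at - injected_at).total_seconds(),
    }


metrics = timeline_metrics(
    injected_at=datetime(2020, 1, 1, 10, 0, 0),
    detected_at=datetime(2020, 1, 1, 10, 2, 30),
    recovered_at=datetime(2020, 1, 1, 10, 10, 0),
)
print(metrics)
```

Tracking these numbers across repeated runs of the same experiment shows whether detection and recovery are actually getting faster.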
FIX
IMPLEMENT FIX
RERUN EXPERIMENT
EMBED
ONBOARDING
CONTINUOUS CHAOS
EMBED IN CULTURE
PATTERNS
RESILIENT ARCHITECTURE
MULTI PARALLELISM
PARALLELISM   AVAILABILITY   DOWNTIME PER YEAR
1             99%            3 DAYS 16 HOURS
2             99.99%         53 MINUTES
3             99.9999%       32 SECONDS
HOW PARALLEL IS YOUR CLOUD COMPONENT?
REGIONS
AVAILABILITY ZONES
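
The figures in the table follow from a simple formula, assuming each parallel instance is 99% available and instances fail independently (an idealisation; shared dependencies make real failures correlated). With N instances, the combined availability is 1 - (1 - a)^N.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def combined_availability(single, replicas):
    """Availability of N independent parallel replicas: 1 - (1 - a)^N."""
    return 1 - (1 - single) ** replicas


def downtime_seconds_per_year(availability):
    """Expected downtime per year for a given availability."""
    return (1 - availability) * SECONDS_PER_YEAR


for n in (1, 2, 3):
    a = combined_availability(0.99, n)
    print(n, f"{a:.4%}", f"{downtime_seconds_per_year(a):.0f} s/year")
```

Converted to days, hours and minutes, the results match the table within rounding.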
ASYNC COMMUNICATION
SYNC REQUIRES A CONNECTION PER REQUEST
FOCUS ON MESSAGE BASED COMMUNICATION
DECOUPLING PUB SUB LISTENER
QUEUE BASED LOAD DISTRIBUTION
SERVICE BUS
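
As an in-process illustration of queue-based load distribution, the sketch below uses Python's standard `queue` and `threading` modules in place of a real service bus. The `run_workers` helper is hypothetical. Producers only put work on the queue; a fixed pool of workers pulls from it at their own pace, so a burst of requests lengthens the queue instead of overloading the workers.

```python
import queue
import threading


def run_workers(jobs, handler, worker_count=3):
    """Process jobs through a shared queue with a fixed number of workers."""
    work = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            job = work.get()
            if job is None:  # sentinel: no more work for this worker
                return
            outcome = handler(job)
            with lock:
                results.append(outcome)

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for job in jobs:
        work.put(job)
    for _ in threads:
        work.put(None)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return results


print(sorted(run_workers(range(10), lambda x: x * x)))
```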
IDEMPOTENT APIS
HTTP METHOD   IDEMPOTENCE   SAFETY
GET           YES           YES
HEAD          YES           YES
PUT           YES           NO
DELETE        YES           NO
POST          NO            NO
PATCH         NO            NO
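
POST is not idempotent by default, which matters as soon as clients retry. One common way to make a POST-style operation safe to retry is an idempotency key supplied by the client. The sketch below is an assumption-laden illustration, not something taken from the talk: `PaymentApi` and `create_charge` are hypothetical, and the in-memory dictionary stands in for durable storage.

```python
class PaymentApi:
    """Toy API where repeating a request with the same key has no extra effect."""

    def __init__(self):
        self._responses = {}  # idempotency key -> stored response
        self.charges = []     # side effects actually performed

    def create_charge(self, idempotency_key, amount):
        if idempotency_key in self._responses:
            # Retry of a request we already handled: replay the stored response.
            return self._responses[idempotency_key]
        self.charges.append(amount)
        response = {"charge_id": len(self.charges), "amount": amount}
        self._responses[idempotency_key] = response
        return response


api = PaymentApi()
first = api.create_charge("order-1", 10)
retry_response = api.create_charge("order-1", 10)
print(first == retry_response, len(api.charges))
```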
BULKHEAD PATTERN
ISOLATE WORKLOADS LIKE THE HULL OF A SHIP
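
Inside a single process, a bulkhead can be approximated with a bounded semaphore per dependency, so one slow dependency cannot occupy every thread. This is a minimal sketch; the `Bulkhead` class is hypothetical, and rejecting calls immediately when full is one possible policy among several.

```python
import threading


class Bulkhead:
    """Limit how many calls to one dependency may run at the same time."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        if not self._slots.acquire(blocking=False):
            # Compartment is full: reject instead of queueing up more work.
            raise RuntimeError("bulkhead full: call rejected")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()


reports = Bulkhead(max_concurrent=1)
print(reports.call(lambda: "report generated"))
```

Giving each dependency its own `Bulkhead` instance is what provides the isolation: exhausting one compartment leaves the others untouched.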
CIRCUIT BREAKER
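
A minimal circuit breaker sketch, assuming a simple policy: open after a number of consecutive failures, reject calls while open, and allow one trial call after a cool-down. The class, its parameters and the injectable `clock` are hypothetical; production libraries offer more states and options.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is skipped because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open: call skipped")
            # Cool-down elapsed: let one trial call through (half-open).
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, the caller gets an immediate error instead of waiting on a dependency that is known to be failing, which also gives that dependency room to recover.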
ADD JITTER TO RETRIES
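
Without jitter, clients that failed together retry together and hit the recovering service in synchronized waves. The sketch below uses exponential backoff with "full jitter", where each delay is drawn uniformly between zero and the capped exponential value. The helper names and default values are hypothetical.

```python
import random
import time


def backoff_delays(attempts, base_s=0.1, cap_s=5.0, rng=random.random):
    """Yield full-jitter delays: uniform in [0, min(cap_s, base_s * 2**attempt)]."""
    for attempt in range(attempts):
        yield rng() * min(cap_s, base_s * 2 ** attempt)


def retry(func, attempts=4, sleep=time.sleep, **backoff):
    """Call func up to `attempts` times, sleeping a jittered delay between tries."""
    delays = list(backoff_delays(attempts - 1, **backoff))
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(delays[attempt])
```

Retries like this assume the operation being retried is idempotent, and they pair naturally with a circuit breaker that stops retrying once the dependency is clearly down.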
SPLIT RESPONSIBILITIES
READ / WRITE SHARDING
CQRS
WRAP UP
BIG CULTURE CHANGE
FULL CYCLE DEVELOPERS
PRODUCTION ACCESS
START EXPERIMENTING
START SMALL
CHECK OUT TOOLS
OBSERVABILITY
“CHAOS ENGINEERING DOESN’T CAUSE
PROBLEMS, IT JUST REVEALS THEM”
NORA JONES – CHAOS ENGINEERING LEAD SLACK
GEERT VAN DER CRUIJSEN
@GEERTVDC
THANK YOU!
ALL PICTURES USED ARE FROM UNSPLASHED.COM
RESOURCES
BOOKS:
Chaos Engineering - O’Reilly
Chaos Engineering Observability - O’Reilly
TOOLS:
chaostoolkit.org
gremlin.com
github.com/netflix/simianarmy
github.com/asobti/kube-monkey
RESOURCES:
principlesofchaos.org
github.com/dastergon/awesome-chaos-engineering
docs.microsoft.com/en-us/azure/architecture/patterns/category/resiliency

Chaos engineering - The art of breaking stuff in production on purpose

Editor's Notes

  • #4 Why? Easy: you can put “chaos engineer” as the job title on your resume. Who has done a failover test to another data center?
  • #6 We’ve started using the cloud and building distributed applications. The Amazon “Death Star”.
  • #8 Who has done a failover test to another data center?
  • #9 How do we monitor this kind of stuff?
  • #10 Netflix SPS (stream starts per second)
  • #11 Netflix SPS. It has been reported that every 100 ms of latency costs Amazon 1% in sales.
  • #12 INFRA: the cloud provides this for us, right? NETWORK: is the network always reliable? We know it is not. How about switching over? How do we test that? It’s often one of the easiest. APPLICATION: how do applications hold up when errors occur? What if the database is not accessible? PEOPLE: human intervention. Does it make things worse? Fire drills: do we run fire drills for IT?
  • #13 Partial failure mode. You have to think TOGETHER with the business about the impact of failure. Fail-open example.
  • #14 Partial failure mode
  • #15 Partial failure mode
  • #16 Partial failure mode
  • #17 Chaos engineering is like vaccination. We add small amounts of harm to make the full system more immune to the effects
  • #19 Why? Easy you can put “chaos engineer” as function title on your resume
  • #20 Why? Easy you can put “chaos engineer” as function title on your resume
  • #21 Partial failure mode
  • #25 Incident-response learning
  • #26 Incident-response learning, MTTR
  • #27 Incident-response learning
  • #47 MULTI REGION, MULTI AVAILABILITY ZONE
  • #52 Idempotency: can I do the same thing again with the same effect? Safety: does it change the end state?
  • #53 Example: fixed CPU / memory reservations in Kubernetes. No single application can kill the others by using the maximum CPU or memory.
  • #54 Fuse box: when things go wrong, stop retrying.
  • #55 Polly
  • #56 Command and Query Responsibility Segregation 
  • #60 Chaos engineering is like vaccination. We add small amounts of harm to make the full system more immune to the effects
  • #61 Why? Easy you can put “chaos engineer” as function title on your resume