one chaos experiment a day.
keep the outages away.
Yan Cui, @theburningmonk
Chaos
Engineering?
MUST KILL SERVERS!
RAWR!!
RAWR!!
@theburningmonk theburningmonk.com
“the discipline of experimenting on a system in order to build confidence in the
system’s capability to withstand turbulent conditions in production”
principlesofchaos.org
microservices death stars circa 2015
resilience
/rɪˈzɪlɪəns/
noun
“the capacity to recover quickly from difficulties; toughness.”
it’s not about
preventing failures!
everything fails, all the time
“You don't choose the moment, the moment chooses you!
You only choose how prepared you are when it does.”
Fire Chief Mike Burtch
anything that can go wrong, will go wrong.
MURPHY’S LAW
identify weaknesses before they manifest in system-wide, aberrant behaviors
GOAL
learn about the system’s behavior by observing it during controlled experiments
HOW
game days
failure injection
Yan Cui
http://theburningmonk.com
@theburningmonk
AWS user for 10 years
http://bit.ly/yubl-serverless
Developer Advocate @
Independent Consultant
advise, training, delivery
theburningmonk.com/courses
realworldserverless.com
“using serverless reduces the blast radius”
www.buzzsprout.com/877747/4615985
serverless improves resilience
as the platform takes care of infrastructure failures
by Russ Miles @russmiles
source https://medium.com/russmiles/chaos-engineering-for-the-business-17b723f26361
Shared Responsibility Model
chaos monkey kills an EC2 instance
latency monkey induces artificial delay in APIs
chaos gorilla kills an AWS Availability Zone
chaos kong kills an entire AWS region
there are no servers to kill!
SERVERLESS
improperly tuned timeouts
missing error handling
missing fallbacks
STEP 1.
define steady state
i.e. “what does normal look like”
STEP 2.
hypothesize that steady state continues in both the control and experimental groups
e.g. “the system stays up if a server dies”
STEP 3.
inject realistic failures
e.g. “slow response from 3rd-party service”
STEP 4.
try to disprove hypothesis
i.e. “look for a difference between the control and experimental groups”
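
The four steps above can be sketched as a tiny experiment harness. This is only an illustration under stated assumptions: `successRate`, `runExperiment`, `callService`, `injectFailure`, `samples` and `tolerance` are hypothetical names, and steady state is reduced to a simple success rate.

```javascript
// Step 1 (assumption): steady state is measured as the success rate of a call.
async function successRate(call, samples) {
  let ok = 0;
  for (let i = 0; i < samples; i++) {
    try {
      await call();
      ok++;
    } catch (err) {
      // a failed call counts against the steady state
    }
  }
  return ok / samples;
}

// Steps 2-4: hypothesise that steady state holds in both groups, inject a
// failure into the experimental group only, then look for a difference.
async function runExperiment({ callService, injectFailure, samples = 20, tolerance = 0.05 }) {
  const control = await successRate(callService, samples);
  const experimental = await successRate(injectFailure(callService), samples);
  return {
    control,
    experimental,
    hypothesisHolds: control - experimental <= tolerance,
  };
}
```

If `hypothesisHolds` comes back false, the experiment has found a weakness worth fixing.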
latency: inject latency to function invocation
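
A minimal sketch of latency injection, assuming a Lambda-style async handler. `withLatency`, `sleep` and the option names are hypothetical, not the API of any particular library.

```javascript
// Hypothetical helper: resolves after `ms` milliseconds.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Wraps a handler and delays each invocation by a random amount between
// minLatencyMs and maxLatencyMs before calling the real handler.
function withLatency(handler, { minLatencyMs = 100, maxLatencyMs = 400, enabled = true } = {}) {
  return async (event, context) => {
    if (enabled) {
      const delay = minLatencyMs + Math.random() * (maxLatencyMs - minLatencyMs);
      await sleep(delay);
    }
    return handler(event, context);
  };
}
```

Keeping an `enabled` switch in the options makes it easy to turn the experiment off without redeploying the wrapper.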
“what if service X has elevated latency?”
API Gateway → Lambda → API Gateway → Lambda
hypothesis: the API call would time out, and our try-catch
would handle it and return a default response
result: function times out after 6s
(hypothesis is disproved)
API Gateway → Lambda → API Gateway → Lambda
502
200
API Gateway → Lambda → API Gateway → Lambda
3s timeout
6s timeout
API Gateway → Lambda → API Gateway → Lambda
API Gateway: max 29s integration timeout
Lambda: max 15 mins timeout
and then there’s
cold starts…
TIL: most HTTP client libraries have a default timeout of 60s.
API Gateway has an integration timeout of 29s.
Most Lambda functions default to a timeout of 3-6s.
Don’t forget about the cold starts!
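
One way to keep these timeouts aligned is to derive the downstream budget from the time the invocation has left. The sketch below assumes `context` is the Lambda context object, whose `getRemainingTimeInMillis()` method is provided by the Node.js runtime. `withTimeout`, `callWithBudget`, `reserveMs` and `fallback` are hypothetical names.

```javascript
// Rejects if `promise` does not settle within `ms` milliseconds.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

// Calls a downstream dependency with a budget derived from the remaining
// invocation time, keeping `reserveMs` back so the function can still respond.
// Returns `fallback` if the call fails or the budget is exceeded.
async function callWithBudget(callDownstream, context, { reserveMs = 500, fallback }) {
  const budgetMs = Math.max(context.getRemainingTimeInMillis() - reserveMs, 0);
  try {
    return await withTimeout(callDownstream(), budgetMs);
  } catch (err) {
    return fallback;
  }
}
```

Note that this only stops waiting for the downstream call; it does not cancel the underlying request.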
https://bit.ly/2Wvfort
outcome: a more resilient system
latency: injects latency to function invocation
exception: throws an exception
statuscode: returns an HTTP status code
diskspace: fills up the /tmp directory
denylist: loses network connectivity
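
As an illustration of two of these modes, here is a hypothetical wrapper for the exception and statuscode cases. `withFailure` and its options (`mode`, `rate`, `statusCode`, `message`) are assumed names, and the injection rate is an added assumption rather than something the slides specify.

```javascript
// Injects a failure into a Lambda-style handler at a given rate
// (0 = never, 1 = always). Any other mode falls through to the real handler.
function withFailure(handler, { mode, rate = 1, statusCode = 500, message = "injected failure" }) {
  return async (event, context) => {
    if (Math.random() < rate) {
      if (mode === "exception") {
        throw new Error(message);
      }
      if (mode === "statuscode") {
        return { statusCode, body: message };
      }
    }
    return handler(event, context);
  };
}
```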
“what if DynamoDB has an elevated error rate?”
API Gateway → Lambda → DynamoDB
hypothesis: the AWS SDK retries would handle it
result: function times out after 6s
(hypothesis is disproved)
TIL: the js DynamoDB client defaults to 10 retries
with base delay of 50ms
delay = Math.random() * (Math.pow(2, retryCount) * base)
this is Marc Brooker’s
fav formula!
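
To see why the defaults blow a 6s function timeout, the formula can be summed over all retries. This is a back-of-the-envelope sketch: `maxDelayMs` and `worstCaseBackoffMs` are hypothetical names, and it assumes `retryCount` runs from 0 to 9 for 10 retries.

```javascript
// Upper bound of the delay before a given retry: the formula above with
// Math.random() replaced by its maximum of 1.
const maxDelayMs = (retryCount, base) => Math.pow(2, retryCount) * base;

// Sums the worst-case delays over all retries.
// Assumption: retryCount runs from 0 to maxRetries - 1.
function worstCaseBackoffMs(maxRetries, base) {
  let total = 0;
  for (let retryCount = 0; retryCount < maxRetries; retryCount++) {
    total += maxDelayMs(retryCount, base);
  }
  return total;
}

console.log(worstCaseBackoffMs(10, 50)); // 51150
// Math.random() averages 0.5, so the expected total is half of that.
console.log(worstCaseBackoffMs(10, 50) / 2); // 25575
```

Under these assumptions the backoff alone averages about 25 seconds, before counting the time spent on the requests themselves.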
action: set max retry count + fallback
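
A generic sketch of a capped retry with a fallback, reusing the same jittered backoff formula. `retryWithFallback`, `pause` and the option names are hypothetical. With the AWS SDK for JavaScript v2 the retry cap itself is set through the client's `maxRetries` option rather than hand-rolled like this.

```javascript
// Hypothetical helper: resolves after `ms` milliseconds.
const pause = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Tries `operation` once plus up to `maxRetries` retries, backing off with
// jitter between attempts, and returns `fallback` if every attempt fails.
async function retryWithFallback(operation, { maxRetries = 3, base = 50, fallback }) {
  for (let retryCount = 0; retryCount <= maxRetries; retryCount++) {
    try {
      return await operation();
    } catch (err) {
      if (retryCount === maxRetries) break; // out of retries
      await pause(Math.random() * (Math.pow(2, retryCount) * base));
    }
  }
  return fallback;
}
```

With `maxRetries = 3` and `base = 50`, the worst-case total backoff is 350ms, which fits comfortably inside a short function timeout.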
outcome: a more resilient system
everything fails, all the time
https://theburningmonk.com/hire-me
Advise, Training, Delivery
“Fundamentally, Yan has improved our team by increasing our
ability to derive value from AWS and Lambda in particular.”
Nick Blair
Tech Lead
@theburningmonk
theburningmonk.com
github.com/theburningmonk

A chaos experiment a day, keeping the outage away