Serverless in production
an experience report
Yan Cui
What’s in this talk?
• how to responsibly run a serverless architecture (aka. how to do ops in serverless)
• testing, CI/CD
• logging, distributed tracing, monitoring
• config management, securing secrets
• coldstarts
• gotchas/limitations + workarounds/hacks
hi, I’m Yan Cui
AWS user since 2009
apr, 2016
Before
• hidden complexities and dependencies
• low utilisation to leave headroom for large spikes
• EC2 scaling is slow, so scale earlier
• paying for lots of unused resources
• up to 30 mins to deploy
• deployments required downtime
“lead time to someone saying thank you is the only reputation metric that matters.”
- Dan North
“what would good look like for us?”
Deployments should…
• be small
• be fast
• have zero downtime
• require no lock-step
Features should…
• be independently deployable
• be loosely-coupled
We want to…
• minimise cost of unused resources
• minimise ops effort
• reduce technical mess
• deliver visible improvements to users faster
nov, 2016
170 Lambda functions in prod
1.2 GB deployment packages in prod
95% cost saving vs EC2
15x no. of prod releases per month
time
is a good fit
1st function in prod!
?
Principles: what is good?
Practices: how to make it good?
Tools: with what?
Principles outlast Tools
ALERTING
CI / CD
TESTING
LOGGING
MONITORING
170 functions
WOOF!
CONFIG MANAGEMENT
SECURITY
DISTRIBUTED TRACING
evolving the platform
building a better search experience
Legacy Monolith Amazon Kinesis AWS Lambda
Amazon CloudSearch Amazon API Gateway AWS Lambda
building an analytics pipeline
Legacy Monolith Amazon Kinesis AWS Lambda
Google BigQuery
1 developer, 2 days
design to production
(his 1st serverless project)
“thank you, nothing ever got
done this fast at Skype!”
“lead time to someone saying thank you is the only reputation metric that matters.”
- Dan North
rebuilding the timeline feature
building better user recommendations
BigQuery
GrapheneDB
getting PRODUCTION READY
CHOOSE A DEPLOYMENT FRAMEWORK
http://serverless.com
https://github.com/awslabs/serverless-application-model
http://apex.run
https://apex.github.io/up
https://github.com/claudiajs/claudia
https://github.com/Miserlou/Zappa
http://gosparta.io/
TESTING
amzn.to/29Lxuzu
Levels of Testing
1. Unit
do our objects do the right thing?
are they easy to work with?
2. Integration
does our code work against code we can’t change?
test by invoking the handler
3. Acceptance
does the whole system work?
Level of Testing
unit
integration
acceptance
feedback
confidence
“…We find that tests that mock external libraries
often need to be complex to get the code into the
right state for the functionality we need to exercise.
The mess in such tests is telling us that the design
isn’t right but, instead of fixing the problem by
improving the code, we have to carry the extra
complexity in both code and test…”
Don’t Mock Types You Can’t Change
“…The second risk is that we have to be sure
that the behaviour we stub or mock matches
what the external library will actually do…
Even if we get it right once, we have to make
sure that the tests remain valid when we
upgrade the libraries…”
Don’t Mock Types You Can’t Change
Don’t Mock Services You Can’t Change
Paul Johnston
The serverless approach
to testing is different and
may actually be easier.
http://bit.ly/2t5viwK
API Gateway Lambda DynamoDB
Unit Tests
Mock/Stub
what unit tests will not tell you…
is our request correct?
is the request mapping set up?
are the API resources configured correctly?
are we assuming the correct schema?
is Lambda proxy configured correctly?
is IAM policy set up correctly?
is the table created?
most Lambda functions are simple and have a
single purpose; the risk of shipping broken
software has largely shifted to how they
integrate with external services
observation
But it slows down
my feedback loop…
IT’S NOT
ABOUT YOU!
me
test your system,
not (just) your code
API Gateway
IOT
Kinesis
SNS
ElastiCache
CloudWatch
DynamoDB
IAM
S3
Auth0
GrapheneDB
SES
Twilio
Google BigQuery
MongoLab
CloudSearch
APN
GCM
Lambda
EC2
…if a service can’t provide you with
a relatively easy way to test the
interface in reality, then you should
consider using another one.
Paul Johnston
“…Wherever possible, an acceptance test
should exercise the system end-to-end without
directly calling its internal code.
An end-to-end test interacts with the system
only from the outside: through its interface…”
Testing End-to-End
Legacy Monolith Amazon Kinesis AWS Lambda
Amazon CloudSearch Amazon API Gateway AWS Lambda
Test Input
Validate
integration tests exercise
system’s Integration with its
external dependencies
acceptance tests exercise
system End-to-End from
the outside
integration tests differ from
acceptance tests only in HOW the
Lambda functions are invoked
observation
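One way to act on this observation is to share the test case and swap only the invocation style. A rough sketch, assuming a hypothetical `handler`, an injected HTTP client `httpGet` that resolves to `{ statusCode, body }`, and a mode switch named here `'integration'` / `'acceptance'`:

```javascript
// Hypothetical handler; stands in for the function's real code.
const handler = async (event) => ({
  statusCode: 200,
  body: JSON.stringify({ ok: true }),
});

// Integration mode: call the handler in-process.
const viaHandler = (event) => handler(event);

// Acceptance mode: call the deployed endpoint from the outside.
// `httpGet` and `rootUrl` are injected assumptions, not a real library API.
const viaHttp = (httpGet, rootUrl) => (event) =>
  httpGet(`${rootUrl}${event.path}`);

// Pick the invocation style; everything else in the test stays the same.
const makeInvoker = (mode, deps = {}) =>
  mode === 'acceptance' ? viaHttp(deps.httpGet, deps.rootUrl) : viaHandler;

// The same test case runs in both modes.
const testCase = async (invoke) => {
  const res = await invoke({ path: '/search' });
  return res.statusCode === 200 && JSON.parse(res.body).ok === true;
};
```

The test runner would build the invoker once, for example from an environment variable, and pass it to every test case.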
CI/CD PIPELINE
“…We prefer to have the end-to-end tests
exercise both the system and the process
by which it’s built and deployed…
This sounds like a lot of effort (it is), but has
to be done anyway repeatedly during the
software’s lifetime…”
Testing End-to-End
me
Deployment scripts that only
live on the CI box are a disaster
waiting to happen.
Jenkins build config deploys and tests
unit + integration tests
deploy
acceptance tests
if [ "$1" = "deploy" ] && [ $# -eq 4 ]; then
  STAGE=$2
  REGION=$3
  PROFILE=$4

  # install dependencies, including the Serverless framework (a dev dependency)
  npm install
  # deploy with the locally installed sls, using the given AWS profile
  AWS_PROFILE="$PROFILE" ./node_modules/.bin/sls deploy -s "$STAGE" -r "$REGION"
…
install serverless framework as
dev dependency
can be run locally & on the CI box
auto auto manual
LOGGING
2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae
GOT is off air, what do I do now?
UTC Timestamp API Gateway Request Id
your log message
function name
date
function version
me
Logs are not easily searchable
in CloudWatch Logs.
LOG OVERLOAD
CENTRALISE LOGS
MAKE THEM EASILY SEARCHABLE
the ELK stack (Elasticsearch + Logstash + Kibana)
CloudWatch Logs AWS Lambda ELK stack
CloudWatch Events
CloudWatch Logs
http://bit.ly/2f3zxQG
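A sketch of the first step in a log-shipping function. CloudWatch Logs delivers subscription data to Lambda as base64-encoded, gzip-compressed JSON under `event.awslogs.data`; the output document shape below is an assumption, and the actual send to Logstash/Elasticsearch is left out.

```javascript
const zlib = require('zlib');

// Decode the CloudWatch Logs subscription payload (base64 + gzip + JSON).
const decodeLogEvent = (event) => {
  const compressed = Buffer.from(event.awslogs.data, 'base64');
  return JSON.parse(zlib.gunzipSync(compressed).toString('utf8'));
};

// Flatten each log event into a document ready to ship.
// Field names on the output are illustrative, not a required schema.
const toDocuments = (payload) =>
  payload.logEvents.map((e) => ({
    logGroup: payload.logGroup,
    logStream: payload.logStream,
    timestamp: new Date(e.timestamp).toISOString(),
    message: e.message.trim(),
  }));
```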
DISTRIBUTED TRACING
“my followers didn’t
receive my new post!”
- a user
where could the
problem be?
correlation IDs*
* eg. request-id, user-id, yubl-id, etc.
ROLL YOUR OWN
CLIENTS
kinesis client
http client
sns client
http://bit.ly/2k93hAj
kinesis
global.CONTEXT
log.info(…)
api-b
global.CONTEXT
global.CONTEXT
global.CONTEXT
x-correlation-id = …
x-correlation-xxx = …
API Gateway Kinesis
SNS
API Gateway
API Gateway api-a api-c
sns
headers["User-Agent"]
headers["Debug-Log-Enabled"]
MessageAttributes: {
  "x-correlation-id": …,
  "User-Agent": …,
  "Debug-Log-Enabled": …
}
global.CONTEXT
headers["User-Agent"]
headers["Debug-Log-Enabled"]
headers["x-correlation-id"]
data.__context
capture
forward
function
event
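A rough sketch of the capture and forward steps. The header names come from the slides; the function names, falling back to the Lambda request ID when no correlation ID arrives, and returning a context object instead of writing to `global.CONTEXT` are assumptions made to keep the example self-contained.

```javascript
// Capture: copy x-correlation-* headers, plus User-Agent and Debug-Log-Enabled,
// from an API Gateway event into a context object.
const captureContext = (event, awsRequestId) => {
  const ctx = {};
  const headers = event.headers || {};
  for (const [name, value] of Object.entries(headers)) {
    const key = name.toLowerCase();
    if (key.startsWith('x-correlation-')) ctx[key] = value;
  }
  // Assumption: start a new correlation ID when the caller did not send one.
  if (!ctx['x-correlation-id']) ctx['x-correlation-id'] = awsRequestId;
  for (const name of ['User-Agent', 'Debug-Log-Enabled']) {
    if (headers[name]) ctx[name] = headers[name];
  }
  return ctx;
};

// Forward over HTTP: merge the context into the outgoing request headers.
const toHttpHeaders = (ctx, headers = {}) => ({ ...headers, ...ctx });

// Forward over SNS: express the context as SNS message attributes.
const toSnsAttributes = (ctx) =>
  Object.fromEntries(
    Object.entries(ctx).map(([k, v]) => [
      k,
      { DataType: 'String', StringValue: String(v) },
    ]));
```

A custom HTTP or SNS client would call the matching `to…` helper on every outgoing request, so individual functions never forward the IDs by hand.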
ROLL YOUR OWN
CLIENTS
X-RAY
Amazon X-Ray
traces do not span over
API Gateway
MONITORING + ALERTING
“where do I install
monitoring agents?”
you can’t
• invocation Count
• error Count
• latency
• throttling
• granular to the minute
• support custom metrics
Why not IOPipe?
• pervasive access to your entire application
• adds latency for tracking
me
The only “background”
processing you get are the
capabilities the platform
provides out of the box.
“how do I batch up and
send logs/metrics in the
background?”
you can’t
(kinda)
console.log("hydrating yubls from db…");
console.log("fetching user info from user-api");
console.log("MONITORING|1489795335|27.4|latency|user-api-latency");
console.log("MONITORING|1489795335|8|count|yubls-served");
timestamp, metric value, metric type, metric name
metrics
logs
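On the receiving side, the log-processing function has to tell metric lines apart from ordinary logs. A minimal parser for the format above, assuming the timestamp is in epoch seconds (as the example value suggests); the function name and the returned shape are illustrative.

```javascript
// Parse a log message of the form MONITORING|timestamp|value|type|name.
// Returns null for ordinary log lines so they can be shipped as logs instead.
const parseMetric = (message) => {
  const parts = message.trim().split('|');
  if (parts.length !== 5 || parts[0] !== 'MONITORING') return null;
  const [, timestamp, value, type, name] = parts;
  return {
    timestamp: new Date(Number(timestamp) * 1000), // assumed epoch seconds
    value: Number(value),
    type,
    name,
  };
};
```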
CloudWatch Logs AWS Lambda
ELK stack
logs
metrics
CloudWatch
memory used
memory size
billed duration
http://bit.ly/2gGredx
http://bit.ly/2goFZ8F
DASHBOARDS
SET ALARMS
TRACK APP-LEVEL METRICS
Not Only CloudWatch
don’t put all your eggs in one basket
aka. you don’t want your monitoring system to
fail at the same time as the systems it monitors
CONFIG MANAGEMENT
Lambda
me
Environment variables make it
hard to share configurations
across functions.
me
Environment variables make it
hard to implement fine-grained
access to sensitive info.
http://bit.ly/2uQKABA
couples ability to deploy with access to sensitive
data, which often don’t overlap in a large
engineering team or in a regulated environment
CENTRALISED
CONFIG SERVICE
config service
goes here
Why not consul or etcd?
• multiple EC2 instances in multi-AZ for HA
• have to manage servers, patch OS, patch software, etc.
• learning curve for configuring the service
• learning curve for using the CLI tools
SSM Parameter Store
HTTPS
role-based access
encrypted in-flight
encrypt
encrypted at-rest
decrypt
CENTRALISED
CONFIG SERVICE
CLIENT LIBRARY
Requirements for client library
• standardise and encapsulate how you manage configs
• supports client-side caching (fetch & cache at coldstart)
• invalidate cache at interval
• invalidate cache explicitly when staleness is detected
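The caching requirements above can be sketched as follows. This is not the library from the talk: `fetchConfig` is a hypothetical async loader (for example a call to SSM Parameter Store), and `expiryMs` and `now` are assumed options.

```javascript
// Create a config client that caches what `fetchConfig` returns.
const createConfigClient = (
  fetchConfig,
  { expiryMs = 3 * 60 * 1000, now = Date.now } = {}
) => {
  let cache = null;
  let fetchedAt = 0;

  const get = async () => {
    const expired = now() - fetchedAt >= expiryMs;
    if (cache === null || expired) {
      cache = await fetchConfig(); // the first call happens at cold start
      fetchedAt = now();
    }
    return cache;
  };

  // Call this when staleness is detected, e.g. a downstream service
  // rejects a cached credential.
  const invalidate = () => {
    cache = null;
  };

  return { get, invalidate };
};
```

Creating the client at module scope lets warm invocations reuse the cached values.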
http://bit.ly/2yLUjwd
PRO TIPS
max 75 GB total deployment package size*
* limit is per AWS region
Janitor Monkey
Janitor Lambda
http://bit.ly/2xzVu4a
disable versionFunctions in
install Serverless framework as dev
dependency at project level
dev dependencies are excluded since 1.16.0
http://bit.ly/2vzBqhC
http://amzn.to/2vtUkDU
UNDERSTAND
COLDSTARTS
Amazon X-Ray
1st invocation
2nd invocation
cold start
source: http://bit.ly/2oBEbw2
http://bit.ly/2rtCCBz
C#
Java
NodeJs, Python
me
C# and Java experience ~100
times the cold start time of
Python and also suffer from
much higher standard deviation
me
memory size improves cold
start time linearly
AVOID
COLDSTARTS
CloudWatch Event AWS Lambda
ping
ping
ping
ping
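On the function side, a warm-up ping should return early and skip the real work. A minimal sketch, assuming the scheduled CloudWatch Event is configured to send a constant input such as `{ "source": "warmup-ping" }`; the marker value and the helper names are hypothetical.

```javascript
// Assumed marker placed in the scheduled event's input.
const isPing = (event) => Boolean(event) && event.source === 'warmup-ping';

// Placeholder for the function's actual logic.
const doRealWork = async (event) => ({ processed: true, event });

const handler = async (event) => {
  if (isPing(event)) {
    return 'pong'; // keeps the container warm without doing real work
  }
  return doRealWork(event);
};
```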
HEALTH CHECKS?
AWS Lambda
docs
Take advantage of container re-use to improve the
performance of your function. Make sure any
externalized configuration or dependencies that your
code retrieves are stored and referenced locally after initial
execution. Limit the re-initialization of variables/objects on
every invocation. Instead use static initialization/
constructor, global/static variables and singletons. Keep
alive and reuse connections (HTTP, database, etc.) that
were established during a previous invocation.
http://amzn.to/2jzLmkb
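In Node.js, the advice above usually means keeping expensive objects at module scope. A small sketch; `createClient` is a hypothetical stand-in for an expensive setup step such as opening a database connection, and `initCount` exists only to make the reuse visible.

```javascript
// Module scope is initialised once per container (at cold start) and
// survives across invocations that reuse that container.
let initCount = 0;
let cachedClient = null;

// Hypothetical expensive setup step.
const createClient = () => {
  initCount += 1;
  return { query: (id) => ({ id }) };
};

// Create the client lazily, then reuse it on later invocations.
const getClient = () => {
  if (cachedClient === null) cachedClient = createClient();
  return cachedClient;
};

const handler = async (event) => getClient().query(event.id);
```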
max 5 mins execution time
http://bit.ly/2w6ItdI
CONSIDER
PARTIAL
FAILURES
AWS Lambda
docs
AWS Lambda polls your stream and
invokes your Lambda function.
Therefore, if a Lambda function fails,
AWS Lambda attempts to process the
erring batch of records until the time
the data expires.
http://amzn.to/2vs2lIg
processing halts until failed
events are retried successfully/expired from stream
vs
prioritize realtime-ness,
retry failed events with best effort,
then skip
SNS
Kinesis
SQS
after 3 attempts
share processing logic
events are processed in
chronological order
failed events are retried out
of sequence
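The "retry with best effort, then skip" option can be sketched like this. `processRecord` and `sendToDlq` are hypothetical injected functions; `sendToDlq` might publish the failed record to an SQS queue, where the same processing logic retries it later.

```javascript
const MAX_ATTEMPTS = 3; // "after 3 attempts" on the slide

// Process a batch in order; never let one bad record fail the whole batch.
const processBatch = async (records, processRecord, sendToDlq) => {
  const skipped = [];
  for (const record of records) { // keeps chronological order
    let done = false;
    for (let attempt = 1; attempt <= MAX_ATTEMPTS && !done; attempt++) {
      try {
        await processRecord(record);
        done = true;
      } catch (err) {
        // swallow the error and retry
      }
    }
    if (!done) {
      await sendToDlq(record); // retried later, out of sequence
      skipped.push(record);
    }
  }
  return skipped;
};
```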
PROCESS SQS
WITH RECURSIVE
FUNCTIONS
http://bit.ly/2npomX6
AVOID HOT KINESIS STREAMS
AWS Lambda
docs
Each shard can support up to
5 transactions per second for
reads, up to a maximum total data
read rate of 2 MB per second.
http://amzn.to/2ubyaot
AWS Lambda
docs
If your stream has 100 active
shards, there will be 100 Lambda
functions running concurrently.
Then, each Lambda function
processes events on a shard in
the order that they arrive.
http://amzn.to/2ubyaot
when no. of processors goes up…
ReadProvisionedThroughputExceeded
can have too many Kinesis read operations…
ReadRecords.IteratorAge
unpredictable spikes in read ‘latency’…
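A back-of-the-envelope check of why this happens, under the assumption that every function subscribed to the stream polls each shard roughly once per second. The polling rate is an assumption, not something stated on the slides.

```javascript
const SHARD_READ_TPS_LIMIT = 5; // per-shard read limit quoted from the docs

// Reads hitting one shard per second, given N subscribed functions.
const readsPerShardPerSecond = (subscribedFunctions, pollsPerSecond = 1) =>
  subscribedFunctions * pollsPerSecond;

// True when the subscribers together go over the per-shard read limit.
const exceedsReadLimit = (subscribedFunctions, pollsPerSecond = 1) =>
  readsPerShardPerSecond(subscribedFunctions, pollsPerSecond) >
  SHARD_READ_TPS_LIMIT;
```

Under that assumption, a sixth subscriber on the same stream is where throttling would be expected to start.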
can kinda workaround…
http://bit.ly/2uv5LsH
clever, but costly
new tool, new problems
but they’re easier to deal with
@theburningmonk
theburningmonk.com
github.com/theburningmonk
http://bit.ly/2yQZj1H

Serverless in production (O'Reilly Software Architecture)