Serverless in production
an experience report
what you should know before you go to production
Yan Cui
http://theburningmonk.com
@theburningmonk
AWS user since 2009
Yan Cui
Server Architect
Principal Engineer
Lead Developer
Senior Developer
http://theburningmonk.com
@theburningmonk
Senior Developer
Diana Ionita
Senior Developer
Apr 2016
hidden complexities and dependencies
low utilisation to leave room for traffic spikes
EC2 scaling is slow, so scale earlier
lots of cost for unused resources
up to 30 mins for deployment
deployment required downtime
“lead time to someone saying thank you is the only reputation metric that matters.”
- Dan North
“what would good
look like for us?”
DEPLOYMENTS SHOULD...
be small
be fast
have zero downtime
have no lock-step
FEATURES SHOULD...
be deployable independently
be loosely-coupled
WE WANT TO...
minimise cost for unused resources
minimise ops effort
reduce tech mess
deliver visible improvements faster
Nov 2016
170 Lambda functions in prod
1.2 GB deployment packages in prod
95% cost saving vs EC2
15x no. of prod releases per month
[timeline: serverless “is a good fit” → 1st function in prod! → ?]
ALERTING
CI / CD
TESTING
LOGGING
MONITORING
Principles: what is good?
Practices: how to make it good?
Tools: with what?
Principles outlast Tools
170 functions
WOOF!
[the timeline again: with more functions in prod, more question marks]
SECURITY
DISTRIBUTED
TRACING
CONFIG
MANAGEMENT
evolving the PLATFORM
rebuilt search
Legacy Monolith → Amazon Kinesis → AWS Lambda → Amazon CloudSearch
search queries: Amazon API Gateway → AWS Lambda → Amazon CloudSearch
new analytics pipeline
Legacy Monolith → Amazon Kinesis → AWS Lambda → Google BigQuery
1 developer, 2 days from design to production
(his 1st serverless project)
“nothing ever got done
this fast at Skype!”
- Chris Twamley
“lead time to someone saying thank you is the only reputation metric that matters.”
- Dan North
[diagrams: more and more of the platform rebuilt with Lambda, backed by Google BigQuery and GrapheneDB]
getting PRODUCTION READY
CHOOSE A DEPLOYMENT FRAMEWORK
Serverless Framework: http://serverless.com
AWS SAM: https://github.com/awslabs/serverless-application-model
Apex: http://apex.run
Up: https://apex.github.io/up
Claudia.js: https://github.com/claudiajs/claudia
Zappa: https://github.com/Miserlou/Zappa
Sparta: http://gosparta.io/
TESTING
amzn.to/29Lxuzu
Level of Testing
1.Unit
do our objects do the right thing?
are they easy to work with?
Level of Testing
1.Unit
2.Integration
does our code work against code we
can’t change?
test by invoking the handler
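For example, a minimal integration-test sketch (the module path, field names and the async handler signature are assumptions, not from the talk):

// invoke the handler in-process, against real downstream resources
// (e.g. the dev stage's DynamoDB table), instead of mocking them
const { expect } = require('chai');
const { handler } = require('../functions/get-user'); // hypothetical handler module

describe('get-user', () => {
  it('fetches the user from the real table', async () => {
    // the same event shape API Gateway would pass to the function
    const event = { pathParameters: { id: 'test-user-1' } };
    const res = await handler(event, {});
    expect(res.statusCode).to.equal(200);
    expect(JSON.parse(res.body).id).to.equal('test-user-1');
  });
});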
Level of Testing
1.Unit
2.Integration
3.Acceptance
does the whole system work?
[diagram: unit → integration → acceptance; feedback is fastest at the unit end, confidence is highest at the acceptance end]
“…We find that tests that mock external
libraries often need to be complex to
get the code into the right state for the
functionality we need to exercise.
The mess in such tests is telling us that
the design isn’t right but, instead of
fixing the problem by improving the
code, we have to carry the extra
complexity in both code and test…”
Don’t Mock Types You Can’t Change
“…The second risk is that we have to be
sure that the behaviour we stub or mock
matches what the external library will
actually do…
Even if we get it right once, we have to
make sure that the tests remain valid
when we upgrade the libraries…”
Don’t Mock Types You Can’t Change
Don’t Mock Types You Can’t Change → Don’t Mock Services You Can’t Change
Paul Johnston
The serverless approach to
testing is different and may
actually be easier.
http://bit.ly/2t5viwK
API Gateway → Lambda → DynamoDB
Unit Tests cover the Lambda function’s own code; API Gateway and DynamoDB are mocked/stubbed
what unit tests will not tell you…
is our request correct?
is the request mapping set up correctly?
are the API resources configured correctly?
are we assuming the correct schema?
is the Lambda proxy configured correctly?
is the IAM policy set up correctly?
is the table created?
most Lambda functions are simple and have a single purpose; the risk of
shipping broken software has largely shifted to how they integrate with
external services
observation
But it slows down
my feedback loop…
IT’S NOT
ABOUT YOU!
…if a service can’t provide
you with a relatively easy
way to test the interface in
reality, then you should
consider using another one.
Paul Johnston
“…Wherever possible, an acceptance
test should exercise the system end-to-end without directly calling its internal code.
An end-to-end test interacts with the
system only from the outside: through
its interface…”
Testing End-to-End
[acceptance test: feed test input through Legacy Monolith → Amazon Kinesis → AWS Lambda → Amazon CloudSearch, then validate the results via Amazon API Gateway → AWS Lambda → Amazon CloudSearch]
integration tests exercise the system’s integration with its external dependencies
acceptance tests exercise the system end-to-end from the outside
integration tests differ from acceptance tests only in HOW the Lambda functions are invoked
observation
CI + CD PIPELINE
“the earlier you consider CI + CD, the
more time you save in the long run”
- me
“…We prefer to have the end-to-end
tests exercise both the system and the
process by which it’s built and
deployed…
This sounds like a lot of effort (it is), but
has to be done anyway repeatedly
during the software’s lifetime…”
Testing End-to-End
“deployment scripts
that only live on the CI
box are a disaster
waiting to happen”
- me
Jenkins build config deploys and tests
unit + integration tests
deploy
acceptance tests
if [ "$1" = "deploy" ] && [ $# -eq 4 ]; then
STAGE=$2
REGION=$3
PROFILE=$4
npm install
AWS_PROFILE=$PROFILE 'node_modules/.bin/sls' deploy -s $STAGE -r $REGION
elif [ "$1" = "int-test" ] && [ $# -eq 4 ]; then
STAGE=$2
REGION=$3
PROFILE=$4
npm install
AWS_PROFILE=$PROFILE npm run int-$STAGE
elif [ "$1" = "acceptance-test" ] && [ $# -eq 4 ]; then
STAGE=$2
REGION=$3
PROFILE=$4
npm install
AWS_PROFILE=$PROFILE npm run acceptance-$STAGE
else
usage
exit 1
fi
build.sh allows repeatable builds both locally & on CI
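the same script runs on a developer machine and in the Jenkins job, for example (the stage, region and profile names here are hypothetical):
./build.sh deploy dev eu-west-1 yubl-dev
./build.sh acceptance-test dev eu-west-1 yubl-dev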
[pipeline stages: Auto → Auto → Manual]
LOGGING
2016-07-12T12:24:37.571Z 994f18f9-482b-11e6-8668-53e4eab441ae GOT is off air, what do I do now?
UTC timestamp | API Gateway request Id | your log message
[CloudWatch Logs screenshot: log groups are named after the function, log streams are named with the date and function version]
“Logs are not easily searchable in CloudWatch Logs.”
- Yan
LOG OVERLOAD
CENTRALISE LOGS
MAKE THEM EASILY
SEARCHABLE
the ELK stack (Elasticsearch + Logstash + Kibana)
CloudWatch Logs → AWS Lambda → ELK stack
[diagram: a CloudWatch Events rule captures CreateLogGroup API calls; a subscribe-log-group function creates a subscription filter on each new log group, streaming its logs to the ship-to-ELK function]
http://bit.ly/2f3zxQG
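A rough sketch of what the subscribe-log-group function can look like (the filter name and the env var holding the destination ARN are assumptions):

// subscribe-log-group: triggered by a CloudWatch Events rule that captures
// CreateLogGroup API calls, it adds a subscription filter so the new log
// group streams its logs to the ship-to-ELK function
const AWS = require('aws-sdk');
const cloudWatchLogs = new AWS.CloudWatchLogs();

// ARN of the log-shipping function (hypothetical, taken from an env var here)
const destinationArn = process.env.SHIP_TO_ELK_FUNCTION_ARN;

module.exports.handler = async (event) => {
  // the new log group's name is in the CloudTrail-style event detail
  const logGroupName = event.detail.requestParameters.logGroupName;

  await cloudWatchLogs.putSubscriptionFilter({
    logGroupName,
    filterName: 'ship-to-elk',   // one filter per log group
    filterPattern: '',           // empty pattern = forward everything
    destinationArn
  }).promise();
};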
DISTRIBUTED TRACING
“my followers didn’t
receive my new post!”
- a user
where could the
problem be?
correlation IDs*
* e.g. request-id, user-id, yubl-id, etc.
ROLL YOUR OWN
CLIENTS
kinesis client
http client
sns client
http://bit.ly/2k93hAj
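As a sketch of the idea (the wrapper and the correlation-ids module are hypothetical): wrap the HTTP client so every outgoing request automatically carries the captured correlation IDs.

// hypothetical http client wrapper that forwards correlation IDs
const http = require('axios');                        // assumes axios is a project dependency
const correlationIds = require('./correlation-ids');  // hypothetical module that captures
                                                      // IDs for the current invocation

// every outgoing request gets the captured IDs (x-correlation-id, user-id, etc.)
// added as headers, so the next service can log and forward them too
function request(options) {
  const headers = Object.assign({}, options.headers, correlationIds.get());
  return http.request(Object.assign({}, options, { headers }));
}

module.exports = { request };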
ROLL YOUR OWN CLIENTS
X-RAY
AWS X-Ray
traces do not span API Gateway
MONITORING + ALERTING
“where do I install
monitoring agents?”
you can’t
out of the box, CloudWatch gives you:
• invocation count
• error count
• latency
• throttling
• granular to the minute
• support for custom metrics
third-party monitoring services (e.g. Datadog, linked below):
• same metrics as CloudWatch
• better dashboards
• support for custom metrics
https://www.datadoghq.com/blog/monitoring-lambda-functions-datadog/
[diagram: with Lambda there are no servers to install agents on; you hand over your code, press a button, and something happens]
“how do I batch up
and send metrics in
the background?”
you can’t
(kinda)
console.log("hydrating yubls from db…");
console.log("fetching user info from user-api");
console.log("MONITORING|1489795335|27.4|latency|user-api-latency");
console.log("MONITORING|1489795335|8|count|yubls-served");
format: MONITORING | timestamp | metric value | metric type | metric name
[diagram: CloudWatch Logs → AWS Lambda; log messages are shipped to the ELK stack, metrics are published to CloudWatch]
http://bit.ly/2gGredx
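A sketch of how that log-shipping function might treat these lines (the CloudWatch namespace and the unit mapping are assumptions; forwarding of ordinary log lines to ELK is not shown):

// parse a MONITORING| line and publish it as a CloudWatch custom metric
const AWS = require('aws-sdk');
const cloudWatch = new AWS.CloudWatch();

async function publishMetric(line) {
  // format: MONITORING|<timestamp>|<value>|<type>|<metric name>
  const [, timestamp, value, type, name] = line.split('|');

  await cloudWatch.putMetricData({
    Namespace: 'yubl-app',                      // hypothetical namespace
    MetricData: [{
      MetricName: name,
      Timestamp: new Date(parseInt(timestamp, 10) * 1000),
      Value: parseFloat(value),
      Unit: type === 'latency' ? 'Milliseconds' : 'Count'
    }]
  }).promise();
}

module.exports = { publishMetric };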
DASHBOARDS
SET ALARMS
TRACK APP-LEVEL METRICS
Not Only CloudWatch
“you really don't want
your monitoring
system to fail at the
same time as the
system it monitors”
- Yan
CONFIG MANAGEMENT
easily and quickly propagate
config changes
- me
Environment variables make it
hard to share configurations
across functions.
- me
Environment variables make it
hard to implement fine-grained
access to sensitive info.
CENTRALISED
CONFIG SERVICE
[diagram: the config service goes here]
SSM Parameter Store
sensitive data should be encrypted in-flight, and at rest
(credentials, connection strings, etc.)
SSM Parameter Store:
role-based access (IAM)
encrypted in-flight (HTTPS)
encrypted at-rest (KMS)
CENTRALISED
CONFIG SERVICE
CLIENT LIBRARY
fetch & cache at Cold Start
invalidate at interval + signal
http://bit.ly/2yLUjwd
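A simplified sketch of such a client library (the parameter names, cache TTL and error handling are assumptions):

// hypothetical config client: loads parameters from SSM Parameter Store on
// cold start, caches them, and reloads them after a fixed interval
const AWS = require('aws-sdk');
const ssm = new AWS.SSM();

const ONE_MINUTE = 60 * 1000;   // example cache TTL
let cache = { expiration: 0, values: {} };

async function loadConfigs(keys) {
  if (Date.now() < cache.expiration) {
    return cache.values;          // still fresh, reuse the cached values
  }

  const res = await ssm.getParameters({
    Names: keys,                  // e.g. ['/yubl/dev/mongodb-connstring'] (hypothetical)
    WithDecryption: true          // decrypt SecureString values
  }).promise();

  const values = {};
  for (const p of res.Parameters) {
    values[p.Name] = p.Value;
  }

  cache = { expiration: Date.now() + ONE_MINUTE, values };
  return values;
}

module.exports = { loadConfigs };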
PRO TIPS
max 75 GB total deployment package size*
* limit is per AWS region
Janitor Monkey
Janitor Lambda
http://bit.ly/2xzVu4a
disable versionFunctions in serverless.yml
install the Serverless framework as a dev dependency at the project level
(dev dependencies are excluded from the deployment package since Serverless 1.16.0)
http://bit.ly/2vzBqhC
http://amzn.to/2vtUkDU
UNDERSTAND COLD STARTS
[AWS X-Ray traces: the 1st invocation includes a cold start, the 2nd invocation does not]
source: http://bit.ly/2oBEbw2
[benchmark charts comparing cold start times for C#, Java, and NodeJs/Python]
http://bit.ly/2rtCCBz
EMBRACE
NODE.JS & PYTHON
what about type safety?
[diagram: a complexity axis, with the complexity ceiling of a Node.js app marked on it]
referential transparency
immutability as default
type inference
option types
union types
…
…are tools for managing complexity
[diagram: the complexity ceiling of a single Node.js Lambda function sits well below that of a Node.js app]
if you can limit the complexity
of your solution, maybe you
won’t need the tools for
managing that complexity.
- me
AVOID COLD STARTS
[diagram: a CloudWatch Events schedule triggers AWS Lambda to ping functions regularly, keeping containers warm]
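A minimal sketch of the receiving end, assuming the scheduled rule sends a { ping: true } payload (that payload shape is made up here):

// short-circuit scheduled keep-warm pings so they skip the real logic
module.exports.handler = async (event, context) => {
  if (event.ping) {
    return 'pong';                      // keep-warm invocation: return immediately
  }

  return await processRequest(event);   // otherwise run the real handler logic
};

async function processRequest(event) {
  // placeholder for the function's real work
  return { statusCode: 200, body: JSON.stringify({ ok: true }) };
}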
HEALTH CHECKS?
AVOID HARD
ASSUMPTIONS
ABOUT FUNCTION
LIFETIME
USE STATE
FOR
OPTIMISATION
max 5 mins execution time
USE RECURSION
FOR LONG
RUNNING TASKS
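A sketch of the pattern (the payload shape and the 10s safety buffer are assumptions): check context.getRemainingTimeInMillis() and, before the 5 minute limit hits, re-invoke the function asynchronously with the current position so the next invocation carries on.

// hypothetical long-running task that recurses before hitting the time limit
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda();

module.exports.handler = async (event, context) => {
  let position = event.position || 0;

  // leave a buffer (e.g. 10s) so there's time to recurse before the timeout
  while (context.getRemainingTimeInMillis() > 10000) {
    position = await processNextBatch(position);
    if (position === null) {
      return 'done';                     // finished all the work
    }
  }

  // time is nearly up: invoke ourselves asynchronously to continue from here
  await lambda.invoke({
    FunctionName: context.functionName,
    InvocationType: 'Event',             // async, don't wait for the result
    Payload: JSON.stringify({ position })
  }).promise();
};

async function processNextBatch(position) {
  // placeholder: process a chunk of work, return the next position or null when done
  return null;
}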
CONSIDER
PARTIAL
FAILURES
“AWS Lambda polls your stream and
invokes your Lambda function. Therefore, if
a Lambda function fails, AWS Lambda
attempts to process the erring batch of
records until the time the data expires…”
http://docs.aws.amazon.com/lambda/latest/dg/retries-on-errors.html
should the function fail on partial/any failures?
[diagram: SNS, Kinesis and SQS as event sources; SNS gives up after 3 attempts]
share processing logic
events are processed in
chronological order
failed events are retried out
of sequence
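One way to sketch the partial-failure decision for a Kinesis-triggered function (the error classification below is illustrative, not from the talk): rethrow only when a retry could help, so one bad record doesn't block the shard forever.

// hypothetical Kinesis handler: decide per record whether a failure should
// fail (and therefore retry) the whole batch, or be logged and skipped
module.exports.handler = async (event) => {
  for (const record of event.Records) {
    const payload = JSON.parse(Buffer.from(record.kinesis.data, 'base64').toString());

    try {
      await processEvent(payload);
    } catch (err) {
      if (isRetriable(err)) {
        throw err;   // fail the batch: Lambda will retry these records in order
      }
      console.error('skipping bad record', record.kinesis.sequenceNumber, err);
    }
  }
};

function isRetriable(err) {
  // illustrative: retry on throttling/timeouts, skip on validation errors
  return err.retryable === true;
}

async function processEvent(payload) {
  // placeholder for the real processing logic
}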
PROCESS SQS
WITH RECURSIVE
FUNCTIONS
http://bit.ly/2npomX6
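At the time SQS was not a native Lambda trigger, so the pattern looked roughly like this sketch (the queue URL, batch size and timing are assumptions): poll a batch, process and delete the messages, then invoke yourself again while there may be work left.

// hypothetical recursive SQS poller (pre-dates the native SQS trigger)
const AWS = require('aws-sdk');
const sqs = new AWS.SQS();
const lambda = new AWS.Lambda();

const QUEUE_URL = process.env.QUEUE_URL;   // hypothetical env var

module.exports.handler = async (event, context) => {
  const res = await sqs.receiveMessage({
    QueueUrl: QUEUE_URL,
    MaxNumberOfMessages: 10
  }).promise();

  const messages = res.Messages || [];
  for (const msg of messages) {
    await processMessage(JSON.parse(msg.Body));
    await sqs.deleteMessage({ QueueUrl: QUEUE_URL, ReceiptHandle: msg.ReceiptHandle }).promise();
  }

  // if the queue wasn't empty, recurse to keep draining it
  if (messages.length > 0) {
    await lambda.invoke({
      FunctionName: context.functionName,
      InvocationType: 'Event',
      Payload: JSON.stringify({})
    }).promise();
  }
};

async function processMessage(body) {
  // placeholder for the real processing logic
}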
AVOID HOT
KINESIS
STREAMS
“Each shard can support up to
5 transactions per second for
reads, up to a maximum total data
read rate of 2 MB per second.”
http://docs.aws.amazon.com/streams/latest/dev/service-sizes-and-limits.html
“If your stream has 100 active shards,
there will be 100 Lambda functions
running concurrently. Then, each
Lambda function processes events
on a shard in the order that they arrive.”
http://docs.aws.amazon.com/lambda/latest/dg/concurrent-executions.html
when the number of processors goes up…
you can have too many Kinesis read operations (ReadProvisionedThroughputExceeded)…
…and unpredictable spikes in read ‘latency’ (ReadRecords.IteratorAge)
you can kinda work around it…
http://bit.ly/2uv5LsH
clever, but costly
for subsystems that don’t have to be realtime, or are task-based
(i.e. order doesn’t matter), consider other triggers such as S3 or SNS.
- me
@theburningmonk
theburningmonk.com
github.com/theburningmonk
API Gateway and Kinesis
Authentication & authorisation (IAM, Cognito)
Testing
Running & Debugging functions locally
Log aggregation
Monitoring & Alerting
X-Ray
Correlation IDs
CI/CD
Performance and Cost optimisation
Error Handling
Configuration management
VPC
Security
Leading practices (API Gateway, Kinesis, Lambda)
http://bit.ly/2AA5zzk
get 40% off with: ytcui
Serverless in production, an experience report (London JS community)