APP306: Using AWS CloudFormation for Deployment and Management at Scale


Tom Cartwright and Yavor Atanasov, BBC
November 12, 2014 Las Vegas, Nevada

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Who are we?
Fifth largest site in UK, 55th globally
• Top 20 in News, Sport, Arts, Children's

Juggling depth of audience and breadth of services is a key challenge

Source: Alexa
What do our services do?
Deploy at scale
• > 300 deployments per day
• 60,000 deployments in first 18 months

Deploy robustly
• All key video transcoding and packaging for BBC iPlayer
• Pipeline delivering election results to BBC News
• Live text for all BBC Sport events
How are Yavor and I involved?
We build tools for the full development lifecycle

Develop Build Deploy Run


And what are we going to talk about?
• Part One – Where did we come from and how did we get
where we are?
• Part Two – What have we built and how do we use AWS
CloudFormation to keep it running?
The beginning
The beginning — 2012
• Olympics dominating our planning and capacity
• On-premises platforms running key BBC Online
properties
• Hard to get focus on other projects
Ops are a constrained resource
Devs can touch test, but Ops own live:
• “Jira-powered deployment”
• 40,000 change tickets since October ’09

Leading to:
• Greater delta between releases
• Longer feedback loops
• High stress around emergency changes
Infrastructure is a constrained resource

Physical infrastructure needs to be bought, racked, configured:
• Weeks of lead time on new hardware
• Limited supplies of existing hardware

Leading to:
• Inflexibility to changing requirements
• Shared tenancy of hardware, weak software isolation
Three emerging trends
Continuous delivery
– Can we build better quality things, faster?
Cloud
– Can we reduce our costs or increase our agility?
DevOps
– Can we strike a better balance of freedom and
responsibility for engineers?
The grappling hook
The grappling hook
• Take two teams: one product, one platform
• Product team takes advantage of features as they become available from platform and feeds requirements in
• Platform team builds features based on need but looks to make them scale to many users
• Get the learning in software, not slideware


Continuous delivery
• Automate everything
• Keep everything in source control
• Build your binaries once
• Use the same mechanism to deploy to
all environments
• If anything fails, stop the line

Think continuous improvement — direction not position


DevOps
The people that wrote it:
– Will fix problems fastest
– Know when it is sensible to deploy

So give them the access to do it and ask them to take responsibility for their actions
November 2012
• Spoke to others solving the same problems
• Began to focus on the underlying principles rather than
immediate problems
• Came home and mustered the Simian Army
Grappling hook — reflections
The good
• Infrastructure costs exactly as predicted
• Numerous platform features ready for further use
• Had a developing set of principles around good practice

The not-so-good
• We learned many lessons about how to build, fewer about why…
Storming the tower
The platform pendulum
• Restriction ↔ Freedom
• Predictability ↔ Chaos
• Slow ↔ Fast

Tools
Establishing principles
• Establish strong defaults for the way things get
built and create tooling for that
• Assume that there will be use cases where the
defaults don’t fit
Managing infrastructure at scale
• Repeatability
– Never “spin it up in the console and hope”
• Flexibility
– Teams are going to need that obscure service
• StackOverflow-ability
– If there is a well-known way of expressing it in the
world, use it
Managing deployment at scale
• Repeatability
– All instances should be identical
• Robustness
– Look for fail-safe mechanisms
• Resilience
– Minimize dependencies at instance startup
Handling support at scale
• Access
– Engineers should have access to the services they run
• Patterns
– Create patterns and templates for core infrastructure pieces
• Support
– Ask developers to take “the phone”
The rest is just software…
Inside the machine
How we deploy
Build binaries in a reproducible way; build them once; automate everything

[Pipeline diagram. Labels: Version Control (Pull, Commit, Push), Build, Jenkins, Repos, Register, Cosmos, Deploy, Test, Promote, Bake, Bakery, Live]
Infrastructure provisioning
Hardware is now software, embrace it and treat it that way!

• Build infrastructure in a reliable and reproducible way, just like you build software
Infrastructure as code and AWS CloudFormation

• Managed infrastructure dependencies
• AWS API interactions taken care of for you
• Reproducibility
• Versioning
What does that mean for my application?

• I can build identical copies of my app in different environments
• I can version my infrastructure templates with my code and reproduce the full stack at any point in time
So my application is not just software, it is software and infrastructure combined

[Figure: application versions v1, v2, v3]
Application infrastructure
Let’s look at what an application might look like and how we can define it with AWS CloudFormation

Application =
• Auto Scaling Group
• Security Groups
• IAM Roles and Policies
• Elastic Load Balancer
• Route 53 Record
+
• RDS database
• S3 bucket
• SQS Queue
• SNS Topic
…defined in CloudFormation stacks
Separate stateful and stateless resources into separate templates

service-0.1.0.json
• Auto Scaling Group
• Security Groups
• IAM Roles and Policies
• Elastic Load Balancer
• Route 53 Record

resources-0.1.0.json
• RDS database
• S3 bucket
• SQS Queue
• SNS Topic
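
The stateless/stateful split can be sketched in Python. This is a minimal illustration, not the BBC's tooling: the logical IDs, the `split_resources` and `as_template` helpers, and the choice of which resource types count as stateful are all assumptions, and resource properties are left out for brevity.

```python
import json

# Resource types treated as stateful in this sketch (an assumption that
# mirrors the resources-0.1.0.json column: RDS, S3, SQS, SNS).
STATEFUL_TYPES = {
    "AWS::RDS::DBInstance",
    "AWS::S3::Bucket",
    "AWS::SQS::Queue",
    "AWS::SNS::Topic",
}


def split_resources(resources):
    """Partition a {logical_id: resource} mapping into (stateless, stateful)."""
    stateless, stateful = {}, {}
    for logical_id, resource in resources.items():
        target = stateful if resource["Type"] in STATEFUL_TYPES else stateless
        target[logical_id] = resource
    return stateless, stateful


def as_template(resources):
    """Wrap resources in a minimal CloudFormation template skeleton."""
    return {"AWSTemplateFormatVersion": "2010-09-09", "Resources": resources}


# Hypothetical logical IDs; a real template also needs Properties.
app_resources = {
    "ServiceGroup": {"Type": "AWS::AutoScaling::AutoScalingGroup"},
    "ServiceSecurityGroup": {"Type": "AWS::EC2::SecurityGroup"},
    "ServiceRole": {"Type": "AWS::IAM::Role"},
    "ServiceLoadBalancer": {"Type": "AWS::ElasticLoadBalancing::LoadBalancer"},
    "ServiceDns": {"Type": "AWS::Route53::RecordSet"},
    "Database": {"Type": "AWS::RDS::DBInstance"},
    "Bucket": {"Type": "AWS::S3::Bucket"},
    "Queue": {"Type": "AWS::SQS::Queue"},
    "Topic": {"Type": "AWS::SNS::Topic"},
}

stateless, stateful = split_resources(app_resources)
service_json = json.dumps(as_template(stateless), indent=2, sort_keys=True)
resources_json = json.dumps(as_template(stateful), indent=2, sort_keys=True)
```

Keeping the two templates in separate stacks means the service stack can be updated on every deployment while the data-bearing resources are left untouched.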
The best way to form clouds

• JSON is great for defining infrastructure
• But if you find yourself repeating the same template over and over, consider abstracting it in code
• E.g., https://github.com/cloudtools/troposphere for Python
JSON vs code
Abstracting AWS CloudFormation allows us to create default service templates and provide them to teams in a concise way.

530 lines of JSON vs 5 lines of Python
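
As a rough idea of what "abstracting it in code" can look like, here is a sketch that uses only the Python standard library. It is not the troposphere API and not the BBC's default template: the `default_service_template` function, its parameters, the logical IDs, and the default instance type are all hypothetical.

```python
import json


def default_service_template(ami_id, instance_type="m3.medium",
                             min_size=1, max_size=2):
    """Return a small service template as a plain dict.

    Only a launch configuration and an Auto Scaling group are included;
    a real default template would define far more resources.
    """
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        "Resources": {
            "LaunchConfig": {
                "Type": "AWS::AutoScaling::LaunchConfiguration",
                "Properties": {
                    "ImageId": ami_id,
                    "InstanceType": instance_type,
                },
            },
            "ServiceGroup": {
                "Type": "AWS::AutoScaling::AutoScalingGroup",
                "Properties": {
                    # All Availability Zones of the stack's region.
                    "AvailabilityZones": {"Fn::GetAZs": ""},
                    "LaunchConfigurationName": {"Ref": "LaunchConfig"},
                    "MinSize": str(min_size),
                    "MaxSize": str(max_size),
                },
            },
        },
    }


# A team only supplies what differs from the defaults.
template = default_service_template("ami-12345678", max_size=5)
template_json = json.dumps(template, indent=2, sort_keys=True)
```

The call site stays a line or two long while the generated JSON grows with every default the platform team adds.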


AWS CloudFormation and deployments
How we deploy
Cosmos bakes an AMI and then updates the service stack…

[Pipeline diagram. Labels: Version Control (Pull, Commit, Push), Build, Jenkins, Repos, Register, Cosmos, Bake, Bakery, Updates service stack (Test), Updates service stack (Live)]
The Bakery

• Takes repository information, packages to install and environment-specific configuration
• Bakes AMIs using a two-step snapshot process: one snapshot just for the software and one for the software with the configuration
Building machines is like building software

• Build binaries once
• Build them in a reproducible way
What’s in a machine?

[Layer diagram, top to bottom: Environment configuration, Service, Software binary, Base OS]
2 step snapshotting

[Figure: snapshot labels snap-432jrse and snap-w3r153r]

Re-baking for different environments

[Figure: snapshot labels snap-456qwf and snap-w3r153r (repeated)]
How we deploy
Cosmos bakes an AMI and then updates the service stack…
…what actually happens

• Cosmos updates the ImageId property of the Auto Scaling Group’s LaunchConfiguration
• Based on the specified UpdatePolicy, the ASG starts refreshing the instances with new ones using the new AMI
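
The template change behind that update can be sketched as follows. How Cosmos actually performs it is not shown in the slides, so the `with_new_image` helper and the example template are assumptions; the sketch only rewrites the `ImageId` of every launch configuration in a template dict, leaving the submission of the stack update out.

```python
import copy

LAUNCH_CONFIG_TYPE = "AWS::AutoScaling::LaunchConfiguration"


def with_new_image(template, ami_id):
    """Return a copy of the template whose launch configurations use ami_id."""
    updated = copy.deepcopy(template)
    changed = 0
    for resource in updated.get("Resources", {}).values():
        if resource.get("Type") == LAUNCH_CONFIG_TYPE:
            resource.setdefault("Properties", {})["ImageId"] = ami_id
            changed += 1
    if changed == 0:
        raise ValueError("template has no launch configuration to update")
    return updated


# Hypothetical service template with a previously baked AMI.
current = {
    "Resources": {
        "LaunchConfig": {
            "Type": LAUNCH_CONFIG_TYPE,
            "Properties": {"ImageId": "ami-00000001", "InstanceType": "m3.medium"},
        },
        "ServiceGroup": {"Type": "AWS::AutoScaling::AutoScalingGroup"},
    }
}

new_template = with_new_image(current, "ami-00000002")
```

Because CloudFormation replaces a launch configuration when its `ImageId` changes, submitting `new_template` as a stack update is what triggers the ASG's UpdatePolicy.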
Optimizing the ASG UpdatePolicy

• On test environments you can optimize for speed and replace all instances at once
• Once live, you should update the ASG in batches making sure you don’t have downtime
…for example
For a service with an ASG with 5 instances…

TEST
"UpdatePolicy": {
  "AutoScalingRollingUpdate": {
    "PauseTime": "PT0S",
    "MaxBatchSize": "5",
    "MinInstancesInService": "0"
  }
}

LIVE
"UpdatePolicy": {
  "AutoScalingRollingUpdate": {
    "PauseTime": "PT15S",
    "MaxBatchSize": "2",
    "MinInstancesInService": "2"
  }
}
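
If templates are generated from code, the two policies can come from one helper. This is a sketch under that assumption; the `rolling_update_policy` function and the `POLICIES` mapping are hypothetical names, and the values are the ones from the TEST and LIVE examples.

```python
def rolling_update_policy(pause_seconds, max_batch_size, min_in_service):
    """Build an UpdatePolicy fragment for an Auto Scaling group resource."""
    return {
        "UpdatePolicy": {
            "AutoScalingRollingUpdate": {
                # PauseTime is an ISO 8601 duration, e.g. PT15S for 15 seconds.
                "PauseTime": "PT{}S".format(pause_seconds),
                "MaxBatchSize": str(max_batch_size),
                "MinInstancesInService": str(min_in_service),
            }
        }
    }


# Per-environment settings for a service with five instances.
POLICIES = {
    "test": rolling_update_policy(0, 5, 0),   # replace everything at once
    "live": rolling_update_policy(15, 2, 2),  # batches of two, two kept running
}
```

The fragment sits alongside `Type` and `Properties` on the Auto Scaling group resource, so it can be merged into that resource's dict with `update()`.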
Let’s see it in action!
Demo time
Let’s deploy one of our services and
see what happens…
AWS CloudFormation beyond the app
Defining our core infrastructure

• Provides the frame upon which services’ infrastructure is built
• Provides security and resilience through levels of isolation
Levels of isolation

• Network and instance access: be isolated by default
• Resource isolation: find all API limits and resource limits and avoid sharing those among your critical services; use different AWS accounts
Core infrastructure
Each AWS account is set up with an Amazon Virtual Private Cloud spanning the three Availability Zones; the VPC contains three private and three public subnets

Services’ ASGs are positioned in the private subnets and their load balancers go in the public ones

[Diagram: Availability Zones eu-west-1a, eu-west-1b and eu-west-1c, each with one private and one public subnet]
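
The subnet layout lends itself to generation in code. The sketch below is an illustration only: the VPC CIDR, the /24 subnet size, the `Vpc` logical ID, and the `subnet_resources` helper are assumptions, and the route tables and internet gateway that make a subnet public in practice are omitted.

```python
import ipaddress


def subnet_resources(vpc_cidr, zones, new_prefix=24):
    """Return one private and one public AWS::EC2::Subnet per zone."""
    blocks = ipaddress.ip_network(vpc_cidr).subnets(new_prefix=new_prefix)
    resources = {}
    for zone in zones:
        suffix = zone[-1].upper()  # "eu-west-1a" -> "A"
        for tier in ("Private", "Public"):
            resources["{}Subnet{}".format(tier, suffix)] = {
                "Type": "AWS::EC2::Subnet",
                "Properties": {
                    "VpcId": {"Ref": "Vpc"},
                    "AvailabilityZone": zone,
                    "CidrBlock": str(next(blocks)),
                    "MapPublicIpOnLaunch": tier == "Public",
                },
            }
    return resources


# Hypothetical address range; the zones are the ones named on the slide.
subnets = subnet_resources(
    "10.0.0.0/16", ["eu-west-1a", "eu-west-1b", "eu-west-1c"]
)
```

The same function can then produce the identical layout in every account, which is what keeps the development and production VPCs consistent.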
Environments
Development and production environments are built in separate accounts to bring full isolation from API and resource limits

All managed via AWS CloudFormation stacks

[Diagram: Production and Development accounts]
SSH access
SSH access is granted via Bastion machines positioned in a dedicated VPC, which is peered with the VPCs that should be accessed

[Diagram: Production and Development VPCs peered with the Bastions VPC]
In Closing…
Recapping
Scale
• > 300 deployments per day
• 50,000 deployments in first 18 months

Speed
• Time from laptop to live reduced from 2 days to 10 minutes

Commitment
• All key video transcoding and packaging for BBC iPlayer
• Pipeline delivering election results to BBC News
• Live text for all BBC Sport events
Want to know more?
• We’re starting to share our work: https://github.com/bbc
• Come and talk to us or our colleagues this week
• We’re hiring, in London and Salford, UK: http://www.bbc.co.uk/careers
Please give us your feedback on this session.
Complete session evaluations and earn re:Invent swag.

APP306 http://bit.ly/awsevals

Join the conversation on Twitter with #reinvent


© 2014 Amazon Web Services, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon Web Services, Inc.
