A Brief, Rapid History
of Scaling Instagram

(with a tiny team)
Mike Krieger
QConSF 2013
Hello!
Instagram
30 million with 2 eng
(2010-end 2012)
150 million with 6 eng
(2012-now)
How we scaled
What I would have
done differently
What tradeoffs you make when
scaling with that size team
(if you can help it, have a
bigger team)
Not a set of perfect solutions
(beware survivor bias)
Rather: our decision-making process
Core principles
Do the simplest thing first
Every infra moving part is another
“thread” your team has to manage
Test & Monitor
Everything
This talk
Early days
Year 1: Scaling Up
Year 2: Scaling Out
Year 3-present: Stability, Video, FB
Getting Started
2010
2 guys on a pier
no one <3s it
Focus
Mike iOS, Kevin Server
Early Stack
Django + Apache mod_wsgi
Postgres
Redis
Gearman
Memcached
Nginx
If starting today
Django + uWSGI
Postgres
Redis
Celery
Memcached
HAproxy
Three months later
Server planning night
before launch
Traction!
Year 1: Scaling Up
scaling.enable()
Single server in LA
infra newcomers
“What’s a load average?”
“Can we get another
server?”
Doritos &
Red Bull &
Animal Crackers &
Amazon EC2
Underwater on recruiting
2 total engineers
Scale "just enough" to get
back to working on app
Every weekend was an
accomplishment
“Infra is what happens when you’re busy
making other plans”
—Ops Lennon
Scaling up DB
First bottleneck: disk IO
on old Amazon EBS
At the time: ~400 IOPS
max
Simple thing first
Vertical partitioning
Django DB Routers
Partitions
Media
Likes
Comments
Everything else
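A minimal sketch of the Django DB router behind that split; the app labels and database aliases here are hypothetical, not Instagram's actual names:

PARTITIONS = {
    "media": "media_db",
    "likes": "likes_db",
    "comments": "comments_db",
}

class VerticalRouter(object):
    # Route each app's models to its own Postgres instance;
    # anything unlisted stays on the default database.
    def db_for_read(self, model, **hints):
        return PARTITIONS.get(model._meta.app_label, "default")

    def db_for_write(self, model, **hints):
        return PARTITIONS.get(model._meta.app_label, "default")

# settings.py: DATABASE_ROUTERS = ["routers.VerticalRouter"]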
PG Replication to
bootstrap nodes
Bought us some time
Almost no application logic changes
(other than some primary keys)
Today: SSD and provisioned
IOPS get you way further
Scaling up Redis
Purely RAM-bound
fork() and COW
Vertical partitioning by
data type
No easy migration story;
mostly double-writing
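A sketch of that double-write pattern, with hypothetical hosts and key names: write to both the old and new Redis while reads stay on the old node until the new one is backfilled.

import redis

old = redis.StrictRedis(host="redis-old")    # hypothetical hosts
new = redis.StrictRedis(host="redis-likes")  # new per-data-type node

def add_like(media_id, user_id):
    # double-write during the migration window
    old.sadd("media:%d:likes" % media_id, user_id)
    new.sadd("media:%d:likes" % media_id, user_id)

def get_likes(media_id):
    # reads flip to `new` once it has caught up
    return old.smembers("media:%d:likes" % media_id)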
Replicating + deleting
often leaves fragmentation
Chaining replication =
awesome
Scaling Memcached
Consistent hashing /
ketama
Mind that hash function
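"Mind that hash function" because key distribution lives or dies by it. A minimal ketama-style ring using md5 (real clients add many more virtual points per server):

import bisect
import hashlib

class HashRing(object):
    def __init__(self, nodes, points=100):
        # many virtual points per node smooth out the distribution
        self.ring = sorted(
            (self._hash("%s:%d" % (node, i)), node)
            for node in nodes for i in range(points)
        )
        self.hashes = [h for h, _ in self.ring]

    def _hash(self, key):
        # md5, not Python's hash(): stable across processes, well spread
        return int(hashlib.md5(key.encode("utf8")).hexdigest(), 16)

    def get_node(self, key):
        # first point on the ring at or after the key's hash
        idx = bisect.bisect(self.hashes, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["mc1", "mc2", "mc3"])  # hypothetical memcached hosts
ring.get_node("user:12345")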
Why not Redis for kv
caching?
Memcached's slab allocator
Config Management
& Deployment
fabric + parallel git pull
(sorry GitHub)
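A sketch of that deploy in Fabric 1.x terms; the paths and restart command are assumptions:

from fabric.api import env, parallel, run, task

env.hosts = ["app1", "app2", "app3"]  # hypothetical app hosts

@task
@parallel
def deploy():
    # every app server pulls at once (hence "sorry GitHub")
    run("cd /srv/app && git pull")
    run("sudo /etc/init.d/apache2 reload")  # assumed restart step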
All AMI based snapshots
for new instances
update_ami.sh
update_update_ami.sh
Should have done Chef
earlier
Munin monitoring
df, CPU, iowait
Ending the year
Infra going from 10% time
to 70%
Focus on client
Testing & monitoring kept
concurrent fires to a minimum
Several ticking time
bombs
Year 2: Scaling Out
App tier
Stateless, but plentiful
HAProxy
(Dead node detection)
Connection limits
everywhere
PGBouncer
Homegrown Redis pool
Hard to track down kernel
panics
Skip the rabbit hole; use EC2 instance
status to detect and restart
Database Scale Out
Out of IO again
(Pre SSDs)
Biggest mis-step
NoSQL?
Call our friends
and strangers
Theory: partitioning and rebalancing
are hard to get right,
let DB take care of it
MongoDB (1.2 at the
time)
Double write, shadow
reads
Stressing about Primary Key
Placed in prod
Data loss, segfaults
Could have made it
work…
…but it would have been
someone’s full time job
(and we still only had 3
people)
train + rapidly
approaching cliff
Sharding in Postgres
QCon to the rescue
Similar approach to FB
(infra foreshadowing?)
Logical partitioning, done
at application level
Simplest thing; skipped
abstractions & proxies
Pre-split
5000 partitions
note to self: pick a power
of 2 next time
Postgres "schemas"
database
  schema
    table
      columns
machineA:
  shard0  photos_by_user
  shard1  photos_by_user
  shard2  photos_by_user
  shard3  photos_by_user

replicate machineA → machineA’, so both
briefly hold shard0–shard3

then split the shards between
machineA and machineA’
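A sketch of the application-level mapping this implies; the mapping table and helper names are hypothetical:

SHARD_COUNT = 5000  # pre-split count; a power of 2 would have been nicer

def shard_for_user(user_id):
    return user_id % SHARD_COUNT

# shard id -> machine, kept in app config and updated
# when shards move (e.g. from machineA to machineA')
SHARD_TO_MACHINE = {0: "machineA", 1: "machineA", 2: "machineA'"}

def table_for_user(user_id):
    shard = shard_for_user(user_id)
    host = SHARD_TO_MACHINE[shard]
    # each shard is a Postgres schema, so queries hit
    # e.g. shard42.photos_by_user on the right host
    return host, "shard%d.photos_by_user" % shard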
Still how we scale PG
today
9.2 upgrade: Bucardo to
move schema by schema
ID generation
Requirements
No extra moving parts
64 bits max
Time ordered
Containing partition key
41 bits: time in millis (41 years of IDs)
13 bits: logical shard ID
10 bits: auto-incrementing sequence, modulo 1024
This means we can generate 1024 IDs
per shard, per table, per millisecond
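A Python sketch of that layout (Instagram's real version runs as a PL/pgSQL function inside each shard; the custom epoch below is the one from their engineering blog post):

import time

EPOCH_MS = 1314220021721  # custom epoch from Instagram's blog post

def make_id(shard_id, seq):
    # 41 bits of milliseconds | 13-bit shard id | 10-bit sequence
    ms = int(time.time() * 1000) - EPOCH_MS
    return (ms << 23) | ((shard_id % 8192) << 10) | (seq % 1024)

def shard_for_id(photo_id):
    # decoding works in reverse: the partition key rides in the ID
    return (photo_id >> 10) & 0x1FFF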
Lesson learned
A new db is a full time
commitment
Be thrifty with your
existing tech
= minimize moving parts
Scaling configs/host
discovery
ZooKeeper or DNS
server?
No team to maintain
/etc/hosts
ec2tag KnownAs
fab update_etc_hosts
(generates, deploys)
Limited: dead host
failover, etc
But zero additional infra, got
the job done, easy to debug
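A sketch of the "generates" half of fab update_etc_hosts using 2013-era boto; the region and tag filter are assumptions:

import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")  # assumed region
lines = []
for res in conn.get_all_instances(filters={"tag-key": "KnownAs"}):
    for inst in res.instances:
        if inst.state == "running":
            lines.append("%s  %s" % (inst.private_ip_address,
                                     inst.tags["KnownAs"]))
# fab then pushes the result into /etc/hosts on every box
print("\n".join(lines))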
Monitoring
Munin: too coarse, too
hard to add new stats
StatsD & Graphite
Simple tech
statsd.timer
statsd.incr
Step change in developer
attitude towards stats
<5 min from wanting to
measure, to having a graph
580 statsd counters
164 statsd timers
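The two calls on the slides map onto a client like the PyPI statsd package; the host and stat names here are made up:

import statsd

stats = statsd.StatsClient("stats-host", 8125)  # hypothetical host

stats.incr("photos.uploaded")     # statsd.incr: fire-and-forget counter

with stats.timer("feed.render"):  # statsd.timer: latency timing
    pass  # ...render the feed...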
Ending the year
Launched Android
(doubling all of our infra, most of
which was now horizontally scalable)
Doubled active users in
< 6 months
Finally, slowly, building up
team
Year 3+: Stability,
Video, FB
Scale tools to match
team
Deployment &
Config Management
Finally 100% on Chef
Simple thing first: knife
and chef-solo
Every new hire learns
Chef
Code deploys
Many rollouts a day
Continuous integration
But push still needs a
driver
"Ops Lock"
Humans are terrible
distributed locking systems
Sauron
Redis-enforced locks
Rollout / major config changes
/ live deployment tracking
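A minimal sketch of a Sauron-style Redis-enforced lock; the key name and TTL are assumptions, and a production lock would use a token-checked release (e.g. via Lua):

import redis

r = redis.StrictRedis(decode_responses=True)

def acquire_ops_lock(operator, ttl=600):
    # SET NX EX: only one deploy/rollout driver at a time;
    # the value records who holds it, the TTL guards against crashes
    return bool(r.set("ops_lock", operator, nx=True, ex=ttl))

def release_ops_lock(operator):
    # naive check-then-delete; fine for humans, racy in general
    if r.get("ops_lock") == operator:
        r.delete("ops_lock")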
Extracting approach
Hit issue
Develop manual approach
Build tools to improve manual / hands on
approach
Replace manual with automated system
Monitoring
Munin finally broke
Ganglia for graphing
Sensu for alerting
(http://sensuapp.org)
StatsD/Graphite still
chugging along
waittime: lightweight slow
component tracking
s = time.time()
# do work
statsd.incr("waittime.VIEWNAME.COMPONENT", time.time() - s)
Graphite's asPercent() to chart each component's share
Feeds and Inboxes
Redis
In memory requirement
Every churned or inactive user
Inbox moved to
Cassandra
1000:1 write/read
Prereq: having rbranson,
ex-DataStax
C* cluster is 20% of the
size of the Redis one
Main feed (timeline) still in
Redis
Knobs
Dynamic ramp-ups and
config
Previously: required
deploy
knobs.py
Only ints
Stored in Redis
Refreshed every 30s
knobs.get(feature_name,
default)
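A sketch of what knobs.py could look like under those constraints (ints only, stored in Redis, refreshed every 30s); the Redis key layout is an assumption:

import time
import redis

r = redis.StrictRedis(decode_responses=True)
_cache = {"fetched_at": 0, "values": {}}

def get(feature_name, default):
    # pull all knobs from a Redis hash at most once every 30s
    if time.time() - _cache["fetched_at"] > 30:
        _cache["values"] = r.hgetall("knobs")  # assumed key layout
        _cache["fetched_at"] = time.time()
    try:
        return int(_cache["values"][feature_name])
    except (KeyError, ValueError):
        return default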
Uses
Incremental feature rollouts
Dynamic page sizing (shedding load)
Feature killswitches
As more teams around
FB contribute
Decouple deploy from
feature rollout
Video
Launch a top-10 video site on day 1,
with a team of 6 engineers,
in less than 2 months
Reuse what we know
Avoid magic middleware
VXCode
Separate from main App
servers
Django-based
server-side transcoding
ZooKeeper ephemeral
nodes for detection
(finally worth it / doable to
deploy ZK)
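Worker registration via ZooKeeper ephemeral nodes, sketched with kazoo; the paths and addresses are hypothetical:

from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # assumed ensemble
zk.start()

# each transcode worker registers an ephemeral node; if the worker
# dies, the node vanishes and clients see the updated live set
zk.create("/transcoders/worker-1", b"10.0.0.5:9000",
          ephemeral=True, makepath=True)

live_workers = zk.get_children("/transcoders")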
EC2 autoscaling
Priority list for clients
Transcoding tier is
completely stateless
statsd waterfall
holding area for
debugging bad videos
5 million videos in first day
40h of video / hour
(other than perf improvements we’ve
basically not touched it since launch)
FB
Where can we skip a few
years?
(at our own pace)
Spam fighting
re.compile('f[o0][1l][o0]w')
Simplest thing did not last
Generic features +
machine learning
Hadoop + Hive + Presto
"I wonder how they..."
Two-way exchange
2010 vintage infra
#1 impact: recruiting
Backend team: >10
people now
Wrap up
Core principles
Do the simplest thing first
Every infra moving part is another
“thread” your team has to manage
Test & Monitor
Everything
Takeaways
Recruit way earlier than
you'd think
Simple doesn't always
imply hacky
Rocketship scaling has been
(somewhat) democratized
Huge thanks to IG
Eng Team
mikeyk@instagram.com
