A Brief, Rapid History
of Scaling Instagram

(with a tiny team)
Mike Krieger
QConSF 2013
Hello!
Instagram
30 million with 2 eng
(2010-end 2012)
150 million with 6 eng
(2012-now)
How we scaled
What I would have
done differently
What tradeoffs you make when
scaling with that size team
(if you can help it, have a
bigger team)
Not a set of perfect solutions
(beware survivor bias)
Rather: our decision-making process
Core principles
Do the simplest thing first
Every infra moving part is another
“thread” your team has to manage
Test & Monitor
Everything
This talk
Early days
Year 1: Scaling Up
Year 2: Scaling Out
Year 3-present: Stability, Video, FB
Getting Started
2010
2 guys on a pier
no one <3s it
Focus
Mike iOS, Kevin Server
Early Stack
Django + Apache mod_wsgi
Postgres
Redis
Gearman
Memcached
Nginx
If starting today
Django + uWSGI
Postgres
Redis
Celery
Memcached
HAproxy
Three months later
Server planning night
before launch
Traction!
Year 1: Scaling Up
scaling.enable()
Single server in LA
infra newcomers
“What’s a load average?”
“Can we get another
server?”
Doritos &
Red Bull &
Animal Crackers &
Amazon EC2
Underwater on recruiting
2 total engineers
Scale "just enough" to get
back to working on app
Every weekend was an
accomplishment
“Infra is what happens when you’re busy
making other plans”
—Ops Lennon
Scaling up DB
First bottleneck: disk IO
on old Amazon EBS
At the time: ~400 IOPS
max
Simple thing first
Vertical partitioning
Django DB Routers
Partitions
Media
Likes
Comments
Everything else
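A minimal sketch of the Django DB router behind that split; the app labels and database aliases here are hypothetical, not Instagram's actual names:

PARTITIONS = {
    "media": "media_db",
    "likes": "likes_db",
    "comments": "comments_db",
}

class VerticalRouter(object):
    # Route each app's models to its own Postgres instance;
    # anything unlisted stays on the default database.
    def db_for_read(self, model, **hints):
        return PARTITIONS.get(model._meta.app_label, "default")

    def db_for_write(self, model, **hints):
        return PARTITIONS.get(model._meta.app_label, "default")

# settings.py: DATABASE_ROUTERS = ["routers.VerticalRouter"]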
PG Replication to
bootstrap nodes
Bought us some time
Almost no application logic changes
(other than some primary keys)
Today: SSD and provisioned
IOPS get you way further
Scaling up Redis
Purely RAM-bound
fork() and COW
Vertical partitioning by
data type
No easy migration story;
mostly double-writing
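A sketch of that double-write pattern, with hypothetical hosts and key names: write to both the old and new Redis while reads stay on the old node until the new one is backfilled.

import redis

old = redis.StrictRedis(host="redis-old")    # hypothetical hosts
new = redis.StrictRedis(host="redis-likes")  # new per-data-type node

def add_like(media_id, user_id):
    # double-write during the migration window
    old.sadd("media:%d:likes" % media_id, user_id)
    new.sadd("media:%d:likes" % media_id, user_id)

def get_likes(media_id):
    # reads flip to `new` once it has caught up
    return old.smembers("media:%d:likes" % media_id)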
Replicating + deleting
often leaves fragmentation
Chaining replication =
awesome
Scaling Memcached
Consistent hashing /
ketama
Mind that hash function
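"Mind that hash function" because key distribution lives or dies by it. A minimal ketama-style ring using md5 (real clients add many more virtual points per server):

import bisect
import hashlib

class HashRing(object):
    def __init__(self, nodes, points=100):
        # many virtual points per node smooth out the distribution
        self.ring = sorted(
            (self._hash("%s:%d" % (node, i)), node)
            for node in nodes for i in range(points)
        )
        self.hashes = [h for h, _ in self.ring]

    def _hash(self, key):
        # md5, not Python's hash(): stable across processes, well spread
        return int(hashlib.md5(key.encode("utf8")).hexdigest(), 16)

    def get_node(self, key):
        # first point on the ring at or after the key's hash
        idx = bisect.bisect(self.hashes, self._hash(key)) % len(self.ring)
        return self.ring[idx][1]

ring = HashRing(["mc1", "mc2", "mc3"])  # hypothetical memcached hosts
ring.get_node("user:12345")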
Why not Redis for kv
caching?
Memcached's slab allocator
Config Management
& Deployment
fabric + parallel git pull
(sorry GitHub)
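A sketch of that deploy in Fabric 1.x terms; the paths and restart command are assumptions:

from fabric.api import env, parallel, run, task

env.hosts = ["app1", "app2", "app3"]  # hypothetical app hosts

@task
@parallel
def deploy():
    # every app server pulls at once (hence "sorry GitHub")
    run("cd /srv/app && git pull")
    run("sudo /etc/init.d/apache2 reload")  # assumed restart step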
All AMI based snapshots
for new instances
update_ami.sh
update_update_ami.sh
Should have done Chef
earlier
Munin monitoring
df, CPU, iowait
Ending the year
Infra going from 10% time
to 70%
Focus on client
Testing & monitoring kept
concurrent fires to a minimum
Several ticking time
bombs
Year 2: Scaling Out
App tier
Stateless, but plentiful
HAProxy
(Dead node detection)
Connection limits
everywhere
PGBouncer
Homegrown Redis pool
Hard to track down kernel
panics
Skip the rabbit hole; use EC2 instance
status to detect and restart
Database Scale Out
Out of IO again
(Pre SSDs)
Biggest mis-step
NoSQL?
Call our friends
and strangers
Theory: partitioning and rebalancing
are hard to get right,
let DB take care of it
MongoDB (1.2 at the
time)
Double write, shadow
reads
Stressing about Primary Key
Placed in prod
Data loss, segfaults
Could have made it
work…
…but it would have been
someone’s full time job
(and we still only had 3
people)
train + rapidly
approaching cliff
Sharding in Postgres
QCon to the rescue
Similar approach to FB
(infra foreshadowing?)
Logical partitioning, done
at application level
Simplest thing; skipped
abstractions & proxies
Pre-split
5000 partitions
note to self: pick a power
of 2 next time
Postgres "schemas"
database
  schema
    table
      columns
machineA:
  shard0  photos_by_user
  shard1  photos_by_user
  shard2  photos_by_user
  shard3  photos_by_user

replicate machineA → machineA’, so both
briefly hold shard0–shard3

then split the shards between
machineA and machineA’
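A sketch of the application-level mapping this implies; the mapping table and helper names are hypothetical:

SHARD_COUNT = 5000  # pre-split count; a power of 2 would have been nicer

def shard_for_user(user_id):
    return user_id % SHARD_COUNT

# shard id -> machine, kept in app config and updated
# when shards move (e.g. from machineA to machineA')
SHARD_TO_MACHINE = {0: "machineA", 1: "machineA", 2: "machineA'"}

def table_for_user(user_id):
    shard = shard_for_user(user_id)
    host = SHARD_TO_MACHINE[shard]
    # each shard is a Postgres schema, so queries hit
    # e.g. shard42.photos_by_user on the right host
    return host, "shard%d.photos_by_user" % shard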
Still how we scale PG
today
9.2 upgrade: Bucardo to
move schema by schema
ID generation
Requirements
No extra moving parts
64 bits max
Time ordered
Containing partition key
41 bits: time in millis (41 years of IDs)
13 bits: logical shard ID
10 bits: auto-incrementing sequence, modulo 1024
This means we can generate 1024 IDs
per shard, per table, per millisecond
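A Python sketch of that layout (Instagram's real version runs as a PL/pgSQL function inside each shard; the custom epoch below is the one from their engineering blog post):

import time

EPOCH_MS = 1314220021721  # custom epoch from Instagram's blog post

def make_id(shard_id, seq):
    # 41 bits of milliseconds | 13-bit shard id | 10-bit sequence
    ms = int(time.time() * 1000) - EPOCH_MS
    return (ms << 23) | ((shard_id % 8192) << 10) | (seq % 1024)

def shard_for_id(photo_id):
    # decoding works in reverse: the partition key rides in the ID
    return (photo_id >> 10) & 0x1FFF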
Lesson learned
A new db is a full time
commitment
Be thrifty with your
existing tech
= minimize moving parts
Scaling configs/host
discovery
ZooKeeper or DNS
server?
No team to maintain
/etc/hosts
ec2tag KnownAs
fab update_etc_hosts
(generates, deploys)
Limited: dead host
failover, etc
But zero additional infra, got
the job done, easy to debug
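A sketch of the "generates" half of fab update_etc_hosts using 2013-era boto; the region and tag filter are assumptions:

import boto.ec2

conn = boto.ec2.connect_to_region("us-east-1")  # assumed region
lines = []
for res in conn.get_all_instances(filters={"tag-key": "KnownAs"}):
    for inst in res.instances:
        if inst.state == "running":
            lines.append("%s  %s" % (inst.private_ip_address,
                                     inst.tags["KnownAs"]))
# fab then pushes the result into /etc/hosts on every box
print("\n".join(lines))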
Monitoring
Munin: too coarse, too
hard to add new stats
StatsD & Graphite
Simple tech
statsd.timer
statsd.incr
Step change in developer
attitude towards stats
<5 min from wanting to
measure, to having a graph
580 statsd counters
164 statsd timers
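The two calls on the slides map onto a client like the PyPI statsd package; the host and stat names here are made up:

import statsd

stats = statsd.StatsClient("stats-host", 8125)  # hypothetical host

stats.incr("photos.uploaded")     # statsd.incr: fire-and-forget counter

with stats.timer("feed.render"):  # statsd.timer: latency timing
    pass  # ...render the feed...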
Ending the year
Launched Android
(doubling all of our infra, most of
which was now horizontally scalable)
Doubled active users in
< 6 months
Finally, slowly, building up
team
Year 3+: Stability,
Video, FB
Scale tools to match
team
Deployment &
Config Management
Finally 100% on Chef
Simple thing first: knife
and chef-solo
Every new hire learns
Chef
Code deploys
Many rollouts a day
Continuous integration
But push still needs a
driver
"Ops Lock"
Humans are terrible
distributed locking systems
Sauron
Redis-enforced locks
Rollout / major config changes
/ live deployment tracking
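A minimal sketch of a Sauron-style Redis-enforced lock; the key name and TTL are assumptions, and a production lock would use a token-checked release (e.g. via Lua):

import redis

r = redis.StrictRedis(decode_responses=True)

def acquire_ops_lock(operator, ttl=600):
    # SET NX EX: only one deploy/rollout driver at a time;
    # the value records who holds it, the TTL guards against crashes
    return bool(r.set("ops_lock", operator, nx=True, ex=ttl))

def release_ops_lock(operator):
    # naive check-then-delete; fine for humans, racy in general
    if r.get("ops_lock") == operator:
        r.delete("ops_lock")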
Extracting approach
Hit issue
Develop manual approach
Build tools to improve manual / hands on
approach
Replace manual with automated system
Monitoring
Munin finally broke
Ganglia for graphing
Sensu for alerting
(http://sensuapp.org)
StatsD/Graphite still
chugging along
waittime: lightweight slow
component tracking
s = time.time()
# do work
statsd.incr("waittime.VIEWNAME.COMPONENT", time.time() - s)
Graphite's asPercent() to chart each component's share
Feeds and Inboxes
Redis
In memory requirement
Every churned or inactive user
Inbox moved to
Cassandra
1000:1 write/read
Prereq: having rbranson,
ex-DataStax
C* cluster is 20% of the
size of the Redis one
Main feed (timeline) still in
Redis
Knobs
Dynamic ramp-ups and
config
Previously: required
deploy
knobs.py
Only ints
Stored in Redis
Refreshed every 30s
knobs.get(feature_name,
default)
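A sketch of what knobs.py could look like under those constraints (ints only, stored in Redis, refreshed every 30s); the Redis key layout is an assumption:

import time
import redis

r = redis.StrictRedis(decode_responses=True)
_cache = {"fetched_at": 0, "values": {}}

def get(feature_name, default):
    # pull all knobs from a Redis hash at most once every 30s
    if time.time() - _cache["fetched_at"] > 30:
        _cache["values"] = r.hgetall("knobs")  # assumed key layout
        _cache["fetched_at"] = time.time()
    try:
        return int(_cache["values"][feature_name])
    except (KeyError, ValueError):
        return default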
Uses
Incremental feature rollouts
Dynamic page sizing (shedding load)
Feature killswitches
As more teams around
FB contribute
Decouple deploy from
feature rollout
Video
Launch a top-10 video site on day 1,
with a team of 6 engineers,
in less than 2 months
Reuse what we know
Avoid magic middleware
VXCode
Separate from main App
servers
Django-based
server-side transcoding
ZooKeeper ephemeral
nodes for detection
(finally worth it / doable to
deploy ZK)
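Worker registration via ZooKeeper ephemeral nodes, sketched with kazoo; the paths and addresses are hypothetical:

from kazoo.client import KazooClient

zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")  # assumed ensemble
zk.start()

# each transcode worker registers an ephemeral node; if the worker
# dies, the node vanishes and clients see the updated live set
zk.create("/transcoders/worker-1", b"10.0.0.5:9000",
          ephemeral=True, makepath=True)

live_workers = zk.get_children("/transcoders")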
EC2 autoscaling
Priority list for clients
Transcoding tier is
completely stateless
statsd waterfall
holding area for
debugging bad videos
5 million videos in first day
40h of video / hour
(other than perf improvements we’ve
basically not touched it since launch)
FB
Where can we skip a few
years?
(at our own pace)
Spam fighting
re.compile('f[o0][1l][o0]w')
Simplest thing did not last
Generic features +
machine learning
Hadoop + Hive + Presto
"I wonder how they..."
Two-way exchange
2010 vintage infra
#1 impact: recruiting
Backend team: >10
people now
Wrap up
Core principles
Do the simplest thing first
Every infra moving part is another
“thread” your team has to manage
Test & Monitor
Everything
Takeaways
Recruit way earlier than
you'd think
Simple doesn't always
imply hacky
Rocketship scaling has been
(somewhat) democratized
Huge thanks to IG
Eng Team
mikeyk@instagram.com
