Netflix's Transition to High-Availability Storage (QCon SF 2010)

Coordinates
Twitter @r39132
#netflixcloud
Blog http://practicalcloudcomputing.com
LinkedIn http://www.linkedin.com/in/siddharthanand

Why Are You Here?
“What I need is an exact list of specific unknown
problems we might encounter.”
-- anonymous

Motivation
 Circa late 2008, Netflix had a single data center
 Single-point-of-failure (a.k.a. SPOF)
 Approaching limits on cooling, power, space, traffic
capacity
 Alternatives
 Build more data centers
 Outsource the majority of our capacity planning and
scale out

Motivation
 Winner : Outsource the majority of our capacity planning and
scale out
 Leverage a leading Infrastructure-as-a-service provider
 Amazon Web Services
 Footnote : As it has taken us a while (i.e. ~2+ years) to realize
our vision of running on the cloud, we needed an interim solution
to handle growth
 We did build a second data center along the way
 We did outgrow it

Cloud Migration Strategy
 Components
 Applications and Software Infrastructure
 Data
 Migration Considerations
 Security
 PII and PCI DSS data stay in our DC; the rest can go to the cloud
 Scalability and Availability for Business Success

 Scalability and Availability for Business Success
 High Growth or High Traffic Growth Data
 Video starts, Personalized Video choosing
 High Traffic Growth Applications
 Same as above
 Log Processing
 Time-to-market Critical Batch Processing
 Video encoding
 Not Included
 DVD inventory and shipment
 We are a streaming company that also ships DVDs

Examples of Data that can be moved
 Video-centric data
 Critics’ reviews
 Metadata
 User-video-centric data – some of our largest data sets
 User-video queue
 Previously streamed and shipped video history
 Ratings (i.e. a 5-star rating system)
 Video streaming metadata (e.g. streaming bookmarks)

 High-level Requirements for our Site
 No big-bang migrations
 New functionality needs to launch in the cloud when
possible
 High-level Requirements for our Data
 Data needs to migrate before applications
 Data needs to be shared between applications running in
the cloud and our data center during the transition period

 Low-level Requirements for our Data
 Pick a (key-value) data store in the cloud
 Challenges
 Translate RDBMS concepts to KV store concepts
 Work-around Issues specific to the chosen KV store
 Create a bi-directional DC-Cloud data replication
pipeline

Pick a Data Store in the Cloud
An ideal storage solution should have the following features:
 Hosted
 Managed Distribution Model
 Works in AWS
 AP from CAP
 Handles a majority of use-cases accessing high-growth, high-traffic data
 Specifically, key access by customer id, movie id, or both

Pick a Data Store in the Cloud
 We picked SimpleDB and S3
 SimpleDB was targeted as the AP equivalent of our RDBMS
databases in our Data Center
 S3 was used for data sets where item or row data
exceeded SimpleDB limits and could be looked up purely
by a single key (i.e. does not require secondary indices and
complex query semantics)
 Video encodes
 Streaming device activity logs (i.e. CLOB, BLOB, etc…)
 Compression of old Rental History

Technology Overview : SimpleDB

Terminology

SimpleDB     Hash Table                Relational Databases
Domain       Hash Table                Table
Item         Entry                     Row
Item Name    Key                       Mandatory Primary Key
Attribute    Part of the Entry Value   Column

Soccer Players

Key           Value
ab12ocs12v9   First Name = Harold; Last Name = Kewell; Nickname = Wizard of Oz; Teams = Leeds United, Liverpool, Galatasaray
b24h3b3403b   First Name = Pavel; Last Name = Nedved; Nickname = Czech Cannon; Teams = Lazio, Juventus
cc89c9dc892   First Name = Cristiano; Last Name = Ronaldo; Teams = Sporting, Manchester United, Real Madrid

SimpleDB’s salient characteristics
• SimpleDB offers a range of consistency options
• SimpleDB domains are sparse and schema-less
• The Key and all Attributes are indexed
• Each item must have a unique Key
• An item contains a set of Attributes
• Each Attribute has a name
• Each Attribute has a set of values
• All data is stored as UTF-8 character strings (i.e. no support for types such as numbers or dates)
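
The data model in the bullets above can be pictured with plain Python dictionaries. This is only an in-memory illustration, not a SimpleDB client; the `domain` structure and the `put` helper are hypothetical names introduced for the sketch.

```python
from collections import defaultdict

# A domain maps item names (keys) to items; an item maps attribute names
# to a set of string values. In-memory illustration only.
domain = defaultdict(lambda: defaultdict(set))

def put(item_name, attribute, value):
    """Add one value to a (possibly multi-valued) attribute, stored as a string."""
    domain[item_name][attribute].add(str(value))

put("ab12ocs12v9", "First Name", "Harold")
put("ab12ocs12v9", "Teams", "Leeds United")
put("ab12ocs12v9", "Teams", "Liverpool")       # second value for the same attribute
put("cc89c9dc892", "First Name", "Cristiano")  # no Nickname: domains are sparse
```

Items in the same domain need not share attributes, and one attribute can hold several values, which is what "sparse and schema-less" means in practice.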

What does the API look like?
 Manage Domains
 CreateDomain
 DeleteDomain
 ListDomains
 DomainMetaData
 Access Data
 Retrieving Data
 GetAttributes – returns a single item
 Select – returns multiple items using SQL syntax
 Writing Data
 PutAttributes – put single item
 BatchPutAttributes – put multiple items
 Removing Data
 DeleteAttributes – delete single item
 BatchDeleteAttributes – delete multiple items

 Options available on reads and writes
 Consistent Read
 Read the most recently committed write
 May have lower throughput/higher latency/lower
availability
 Conditional Put/Delete
 i.e. Optimistic Locking
 Useful if you want to build a consistent multi-master data
store – you will still require your own anti-entropy
 We do not use this currently, so we don’t know how it
performs

Translate RDBMS Concepts to Key-Value Store
Concepts
 Relational Databases are known for relations
 First, a quick refresher on Normal forms

Normalization
NF1 : All occurrences of a record type must contain the same number of
fields - variable repeating fields and groups are not allowed
NF2 : Second normal form is violated when a non-key field is a fact about
a subset of a key
 Violated here : Part | Warehouse | Quantity | Warehouse-Address
 Fixed here : Part | Warehouse | Quantity
              Warehouse | Warehouse-Address

Normalization
 Issues
 Wastes Storage
 The warehouse address is repeated for every Part-WH pair
 Update Performance Suffers
 If the address of the warehouse changes, I must update
many Part-WH pairs
 Data inconsistencies possible
 I can update the warehouse address for one Part-WH pair
and miss Parts for the same WH
 Data Loss Possible
 If at some point in time there are no parts, the WH address
will be lost

Normalization
 RDBMS → KV Store migrations can’t simply accept
denormalization!
 Especially many-to-many and many-to-one entity relationships
 Instead, pick your data set candidates carefully!
 Keep relational data in RDBMS
 Move key-look-ups to KV stores
 Luckily for Netflix, most data is accessed by Customer, Video,
or both : i.e. Key Lookups

Translate RDBMS Concepts to Key-Value Store Concepts
 Aside from relations, relational databases typically
offer the following:
 Transactions
 Locks
 Sequences
 Triggers
 Clocks
 A structured query language (i.e. SQL)
 Database server-side coding constructs (i.e. PL/SQL)
 Constraints

Translate RDBMS Concepts to Key-Value Store Concepts
 Partial or no SQL support. Loosely-speaking, SimpleDB supports a
subset of SQL
 BEST PRACTICE
 Do GROUP BY and JOIN operations in the application layer
involving smallish data sets
 No relations between domains
 BEST PRACTICE
 Compose relations in the application layer
 No transactions
 BEST PRACTICE
 Use SimpleDB’s Optimistic Concurrency Control API: ConditionalPut
and ConditionalDelete
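
As a rough sketch of the first best practice above, a JOIN and a GROUP BY over small data sets can be done in application code once the items have been fetched. The `movies` and `ratings` data and all attribute names here are made up for illustration; values are strings because SimpleDB stores everything as strings.

```python
from collections import defaultdict

# Hypothetical items already fetched from two domains.
movies = {
    "m1": {"title": "Movie One"},
    "m2": {"title": "Movie Two"},
}
ratings = [
    {"customer_id": "c1", "movie_id": "m1", "stars": "5"},
    {"customer_id": "c2", "movie_id": "m1", "stars": "3"},
    {"customer_id": "c1", "movie_id": "m2", "stars": "4"},
]

# JOIN: attach the movie title to each rating by looking up movie_id.
joined = [{**r, "title": movies[r["movie_id"]]["title"]} for r in ratings]

# GROUP BY movie_id, then average the star ratings per group.
stars_by_movie = defaultdict(list)
for r in ratings:
    stars_by_movie[r["movie_id"]].append(int(r["stars"]))
average_stars = {m: sum(s) / len(s) for m, s in stars_by_movie.items()}
```

This only stays cheap while both sides fit comfortably in memory, which is why the slide restricts it to smallish data sets.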

Translate RDBMS Concepts to Key-Value Store Concepts
 No schema - This is non-obvious. A query for a misspelled attribute
name will not fail with an error
 BEST PRACTICE
 Implement a schema validator in a common data access layer
 No sequences
 BEST PRACTICE
 Sequences are often used as primary keys
 In this case, use a naturally occurring unique key
 If no naturally occurring unique key exists, use a UUID
 Sequences are also often used for ordering
 Use a distributed sequence generator
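
A minimal sketch of two of the practices above: a schema check in a shared data access layer, and UUIDs as item names when no natural key exists. The attribute set and function names are hypothetical; a real validator would be driven by whatever schema definition the application maintains.

```python
import uuid

# Hypothetical schema: the only attribute names this domain should ever see.
ALLOWED_ATTRIBUTES = {"customer_id", "movie_id", "stars"}

def validate_attributes(attributes):
    """Reject misspelled or mis-cased attribute names before they reach the store."""
    unknown = set(attributes) - ALLOWED_ATTRIBUTES
    if unknown:
        raise ValueError(f"unknown attribute names: {sorted(unknown)}")
    return attributes

def new_item_name():
    """Generate a key when no naturally occurring unique key exists."""
    return str(uuid.uuid4())
```

Because the store itself will not complain about `star` versus `stars`, failing fast in the access layer is the only place such a typo gets caught.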

Translate RDBMS Concepts to Key-Value Store Concepts
 No clock operations, PL/SQL, Triggers
 BEST PRACTICE
 Do without
 No constraints. Specifically,
 No uniqueness constraints
 No foreign key or referential constraints
 No integrity constraints
 BEST PRACTICE
 Read Repair and Anti-entropy processes using Conditional
Put/Delete

Work-around Issues specific to the chosen KV
store
 Missing / Strange Functionality
 No back-up and recovery
 No native support for types (e.g. Number, Float, Date, etc…)
 You cannot update one attribute and null out another one for an
item in a single API call
 Mis-cased or misspelled attribute names in operations fail silently.
Why is SimpleDB case-sensitive?
 Neglecting "limit N" returns a subset of information. Why does the
absence of an optional parameter not return all of the data?
 Users need to deal with data set partitioning
 Beware of Nulls
 Poor Performance

Work-around Issues specific to the chosen KV store
No Native Types – Sorting, Inequality Conditions,
etc…
 Since sorting is lexicographical, if you plan on sorting by certain
attributes, then
 zero-pad logically-numeric attributes
 e.g. –
 000000000000000111111  this is bigger
 000000000000000011111
 use Joda time to store logical dates
 e.g. –
 2010-02-10T01:15:32.864Z ← this is more recent
 2010-02-10T01:14:42.864Z
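
The encoding above can be sketched in Python. The slide uses Joda time (a Java library); this sketch swaps in Python's standard `datetime` to produce the same timestamp layout. The function names and the padding width of 20 are assumptions made for illustration.

```python
from datetime import datetime, timezone

def encode_number(n, width=20):
    """Zero-pad a non-negative integer so string order matches numeric order."""
    if n < 0:
        raise ValueError("negative numbers need an offset before padding")
    return str(n).zfill(width)

def encode_timestamp(dt):
    """Format a timezone-aware datetime as UTC with millisecond precision."""
    if dt.tzinfo is None:
        raise ValueError("timezone-aware datetime required")
    utc = dt.astimezone(timezone.utc)
    return utc.strftime("%Y-%m-%dT%H:%M:%S.") + f"{utc.microsecond // 1000:03d}Z"

# Unpadded strings sort incorrectly: "9" > "10" lexicographically.
# Padded strings sort correctly: encode_number(9) < encode_number(10).
```

Both encodings work because every value has the same length and the most significant part comes first, so lexicographic comparison agrees with numeric or chronological comparison.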

Work-around Issues specific to the chosen KV store
 Anti-pattern : Avoid the anti-pattern Select SOME_FIELD_1 from
MY_DOMAIN where SOME_FIELD_2 is null as this is a full domain
scan
 Nulls are not indexed in a sparse-table
 BEST PRACTICE
 Instead, replace this check with a (indexed) flag column
called IS_FIELD_2_NULL: Select SOME_FIELD_1 from
MY_DOMAIN where IS_FIELD_2_NULL = 'Y'
 Anti-pattern : When selecting data from a domain and sorting by an
attribute, items missing that attribute will not be returned
 In Oracle, rows with null columns are still returned
 BEST PRACTICE
 Use a flag column as shown previously
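
One way to maintain such a flag column is to set it in the write path, so every item carries an indexed 'Y' or 'N' value. This is a sketch under that assumption; `with_null_flag` is a hypothetical helper, and the query string simply repeats the select expression from the slide.

```python
def with_null_flag(attributes, field, flag_name):
    """Return a copy of attributes with a 'Y'/'N' flag recording whether field is absent."""
    flagged = dict(attributes)
    flagged[flag_name] = "Y" if attributes.get(field) is None else "N"
    return flagged

# Item without SOME_FIELD_2: the flag is set to 'Y' at write time.
item = with_null_flag({"SOME_FIELD_1": "abc"}, "SOME_FIELD_2", "IS_FIELD_2_NULL")

# The select can then use the indexed flag instead of an "is null" scan.
query = "select SOME_FIELD_1 from MY_DOMAIN where IS_FIELD_2_NULL = 'Y'"
```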

Work-around Issues specific to the chosen KV store
 BEST PRACTICE : Aim for high index selectivity when you formulate
your select expressions for best performance
 SimpleDB select performance is sensitive to index selectivity
 Index Selectivity
 Definition : # of distinct attribute values in specified attribute /
# of items in domain
 e.g. Good Index Selectivity (i.e. 1 is the best)
 If a table has 100 records and one of its indexed columns
has 88 distinct values, then the selectivity of this index is
88 / 100 = 0.88
 e.g. Bad Index Selectivity
 If an index on a table of 1000 records has only 5 distinct
values, then the index's selectivity is 5 / 1000 = 0.005
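
The definition above is easy to compute over a sample of attribute values. This helper is a sketch that assumes one value per item; `index_selectivity` is a name introduced here.

```python
def index_selectivity(values):
    """Distinct attribute values divided by the number of items (1.0 is best)."""
    values = list(values)
    if not values:
        raise ValueError("selectivity is undefined for an empty domain")
    return len(set(values)) / len(values)

# 88 distinct values across 100 items, as in the first example.
good = index_selectivity(list(range(88)) + [0] * 12)

# 5 distinct values across 1000 items, as in the second example.
bad = index_selectivity([i % 5 for i in range(1000)])
```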

Work-around Issues specific to the chosen KV store
Sharding Domains
 There are 2 reasons to shard domains
 You are trying to avoid running into one of the sizing limits
 e.g. 10GB of space or 1 Billion Attributes
 You are trying to scale your writes
 To scale your writes further, use BatchPutAttributes and
BatchDeleteAttributes where possible
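
A common way to shard is to hash the item key and pick one of N domains. The slides do not say which scheme Netflix used, so treat this as one possible approach; the `_<n>` naming convention, the choice of MD5, and the function name are assumptions.

```python
import hashlib

def shard_domain(base_name, key, shard_count):
    """Map a key to one of shard_count domains using a stable hash."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return f"{base_name}_{int(digest, 16) % shard_count}"

# The same key always lands in the same domain, so reads know where to look.
domain_name = shard_domain("QUEUE", "customer-42", 8)
```

A stable hash (rather than Python's built-in `hash`, which is randomized per process for strings) keeps the mapping consistent across application instances.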

Create a Bi-directional DC-Cloud Data
Replication Pipeline
 Home-grown Data Replication Framework known as IR for Item
Replication
 2 schemes in use currently
 Polls the main table (a.k.a. Simple IR)
 Doesn’t capture deletes but easy to implement
 Polls a journal table that is populated via a trigger on the
main table (a.k.a. Trigger-journaled IR)
 Captures every CRUD, but requires the development
of triggers

 How often do we poll Oracle?
 Every 5 seconds
 What does the poll query look like?
 select *
   from QLOG_0
   where LAST_UPDATE_TS > :CHECKPOINT     ← Get recent
   and LAST_UPDATE_TS < :NOW_MINUS_30s    ← Exclude most recent
   order by LAST_UPDATE_TS                ← Process in order
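
The checkpointed poll above can be sketched as follows. The real pipeline reads from Oracle; this example substitutes an in-memory SQLite table so it runs anywhere. The `ITEM_ID` column, integer epoch-second timestamps, and the `poll_once` function are assumptions, while `QLOG_0` and `LAST_UPDATE_TS` come from the query on the slide.

```python
import sqlite3

POLL_SQL = """
    select ITEM_ID, LAST_UPDATE_TS
    from QLOG_0
    where LAST_UPDATE_TS > :checkpoint
      and LAST_UPDATE_TS < :cutoff
    order by LAST_UPDATE_TS
"""

def poll_once(conn, checkpoint, now, lag_seconds=30):
    """Return (rows, new_checkpoint) for changes between checkpoint and now - lag."""
    rows = conn.execute(
        POLL_SQL, {"checkpoint": checkpoint, "cutoff": now - lag_seconds}
    ).fetchall()
    # Advance the checkpoint to the last row processed; keep it if nothing was found.
    new_checkpoint = rows[-1][1] if rows else checkpoint
    return rows, new_checkpoint

# Demo with an in-memory table standing in for the Oracle journal table.
conn = sqlite3.connect(":memory:")
conn.execute("create table QLOG_0 (ITEM_ID text, LAST_UPDATE_TS integer)")
conn.executemany("insert into QLOG_0 values (?, ?)", [("a", 100), ("b", 150), ("c", 190)])

first_rows, checkpoint = poll_once(conn, checkpoint=0, now=200)
second_rows, checkpoint = poll_once(conn, checkpoint=checkpoint, now=230)
```

In the demo, the row stamped 190 is skipped by the first poll because it falls inside the 30-second exclusion window, then picked up by the second poll. Presumably the lag gives in-flight transactions time to commit before their rows are read.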

 Data Replication Challenges & Best Practices
 SimpleDB throttles traffic aggressively via 503 HTTP Response
codes (“Service Unavailable”)
 With Singleton writes, I see 70-120 write TPS/domain
 IR
 Shard domains (i.e. partition data sets) to work-around these limits
 Employs Slow ramp up
 Uses BatchPutAttributes instead of (Singleton) PutAttributes call
 Exercises an exponential bounded-back-off algorithm
 Uses attribute-level replace=false when fork-lifting data
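
A bounded exponential back-off for throttled calls might look like the sketch below. It is not the IR implementation: `ServiceUnavailable` is a stand-in for an HTTP 503 response, and the attempt count and delay values are arbitrary defaults chosen for illustration.

```python
import time

class ServiceUnavailable(Exception):
    """Stand-in for an HTTP 503 ("Service Unavailable") response."""

def call_with_backoff(operation, max_attempts=6, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep):
    """Retry operation on 503s, doubling the wait each time up to max_delay."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ServiceUnavailable:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the error to the caller
            sleep(min(max_delay, base_delay * 2 ** attempt))
```

The `sleep` parameter exists so callers (and tests) can inject a fake; in production the default `time.sleep` is used. Adding random jitter to each delay is a common refinement when many writers are throttled at once.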

 Data Replication Challenges & Best Practices
 Implementing Multi-mastering and an Eventually-consistent
 SimpleDB offers optimistic concurrency control in the form of
conditional put (and deletes)
 For our data, it is ok to be “consistent, but not accurate”
 With this relaxation, we do not need to be concerned with
synchronizing logical clocks
 We just need to ensure that each conditional put writes a
strictly increasing value into the “version” column
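
The versioned conditional put described above can be illustrated with an in-memory stand-in. This does not call SimpleDB; `InMemoryDomain`, `VersionConflict`, and `write_with_version` are hypothetical names, and versions are plain integers here, whereas a real domain would store them as zero-padded strings.

```python
class VersionConflict(Exception):
    """Raised when the stored version does not match the expected one."""

class InMemoryDomain:
    """Toy store whose conditional_put mimics an expected-value check."""
    def __init__(self):
        self.items = {}

    def conditional_put(self, key, attributes, expected_version):
        current = self.items.get(key)
        current_version = current["version"] if current else None
        if current_version != expected_version:
            raise VersionConflict(f"expected {expected_version}, found {current_version}")
        self.items[key] = dict(attributes)

def write_with_version(domain, key, attributes, new_version):
    """Apply the write only if new_version is strictly greater than the stored one."""
    while True:
        current = domain.items.get(key)
        current_version = current["version"] if current else None
        if current_version is not None and new_version <= current_version:
            return False  # a newer (or equal) write already won; drop this one
        try:
            domain.conditional_put(
                key, {**attributes, "version": new_version}, current_version
            )
            return True
        except VersionConflict:
            continue  # another writer got in between the read and the put; re-check
```

With this rule every replica converges on the write carrying the highest version, regardless of arrival order, which matches the "consistent, but not accurate" relaxation: the winner is deterministic even if it is not strictly the latest in wall-clock terms.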
