Scalability Design Principles
Abstract. The challenge of building consistent, available, and scalable data man-
agement systems capable of serving petabytes of data for millions of users has
confronted the data management research community as well as large internet
enterprises. Currently proposed solutions to scalable data management, driven primarily by prevalent application requirements, limit consistent access to the granularity of single objects, rows, or keys, thereby trading off consistency for
high scalability and availability. But the growing popularity of “cloud comput-
ing”, the resulting shift of a large number of internet applications to the cloud, and
the quest towards providing data management services in the cloud, has opened
up the challenge for designing data management systems that provide consis-
tency guarantees at a granularity larger than single rows and keys. In this paper,
we analyze the design choices that allowed modern scalable data management
systems to achieve orders of magnitude higher levels of scalability compared to
traditional databases. With this understanding, we highlight some design princi-
ples for systems providing scalable and consistent data management as a service
in the cloud.
1 Introduction
Scalable and consistent data management is a challenge that has confronted the database
research community for more than two decades. Historically, distributed database sys-
tems [15, 16] were the first generic solution that dealt with data not bounded to the
confines of a single machine while ensuring global serializability [2, 19]. This design
was not sustainable beyond a few machines due to the crippling effect on performance
caused by partial failures and synchronization overhead. As a result, most of these sys-
tems were never extensively used in industry. Recent years have therefore seen the
emergence of a different class of scalable data management systems such as Google's
Bigtable [5], PNUTS [6] from Yahoo!, Amazon’s Dynamo [7] and other similar but
undocumented systems. All of these systems deal with petabytes of data, serve on-
line requests with stringent latency and availability requirements, accommodate erratic
workloads, and run on cluster computing architectures, staking claims to territory that used to be occupied by database systems.
⋆ This work is partially funded by NSF grant NSF IIS-0847925.
One of the major contributing factors towards the scalability of these modern systems is the data model they support: a collection of key-value pairs with consistent and atomic read and write operations only on single keys. Even
though a huge fraction of the present class of web-applications satisfy the constraints
of single-key access [7, 18], a large class of modern Web 2.0 applications, such as collaborative authoring, online multi-player games, and social networking sites, require consistent access beyond single-key semantics. As a result, these applications cannot be served by modern key-value stores and have to rely on traditional database technologies for storing their content, while the scalable key-value stores drive the in-house applications of the corporations that designed them.
With the growing popularity of the “cloud computing” paradigm, many applications
are moving to the cloud. The elastic nature of resources and the pay-as-you-go model
have broken the infrastructure barrier for new applications which can be easily tested
out without the need for huge upfront investments. The sporadic load characteristics of these applications, coupled with increasing demands for data storage, round-the-clock availability, and varying degrees of consistency, pose new challenges for data management in the cloud. These modern application demands
call for systems capable of providing scalable and consistent data management as a ser-
vice in the cloud. Amazon’s SimpleDB (http://aws.amazon.com/simpledb/) is a first step in
this direction, but it is designed along the lines of key-value stores like Bigtable and hence does not provide consistent access to multiple objects. On the other hand, relying on traditional databases running on commodity machine instances in the cloud results in a scalability bottleneck for these applications, thereby defeating the scalability and
elasticity benefits of the cloud. As a result, there is a huge demand for data manage-
ment systems that can bridge the gap between scalable key-value stores and traditional
database systems.
At a very generic level, the goal of a scalable data management system is to sustain
performance and availability over a large data set without significant over-provisioning.
Resource utilization requirements demand that the system be highly dynamic. In Sec-
tion 2, we discuss the salient features of three major systems from Google, Yahoo!, and
Amazon. The design of these systems is interesting not only from the point of view of
what concepts they use but also what concepts they eschew. Careful analysis of these
systems is necessary to facilitate future work. The goal of this paper is to carefully an-
alyze these systems to identify the main design choices that have lent high scalability
to these systems, and to lay the foundations for designing the next generation of data
management systems serving the next generation of applications in the cloud.
The design of a data management system is driven by the set of features that the system aims to support, and different systems provide varying trade-offs between these attributes. In most cases, high scalability and high availability are given higher priority. Early attempts to design distributed databases in the late eighties and early nineties made the design decision to treat both the system state and the application state as a cohesive whole in a distributed environment. We contend that the decoupling of these two states is the root cause of the high scalability of modern systems.
On the other hand, Amazon’s Dynamo [7] uses an approach similar to peer-to-peer
systems [17]. Partitioning of data is at a per-record granularity through consistent hash-
ing [13]. The key of a record is hashed to a space that forms a ring and is statically par-
titioned. Thus the location of a data item can be computed without storing any explicit
mapping of data to partitions. Replication is done at the nodes that are neighbors of the node to which a key hashes; this node also acts as the master (although Dynamo is multi-master, as we will see later). Thus, Dynamo does not maintain a dynamic system state with consistency guarantees, a design different from that of PNUTS and Bigtable.
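As a concrete illustration of this partitioning scheme, the following Python sketch shows consistent hashing with a small preference list of successor nodes. The class and node names are hypothetical, and virtual nodes, ring membership changes, and failure handling are all omitted; this is a minimal sketch, not Dynamo's actual implementation.

    import hashlib
    from bisect import bisect_right

    class ConsistentHashRing:
        """Keys and nodes hash onto the same circular space; a key is owned by
        the first node clockwise from its position, and replicas are placed on
        the next distinct successor nodes (the key's preference list)."""

        def __init__(self, nodes, replicas=3):
            self.replicas = replicas
            self.ring = sorted((self._hash(n), n) for n in nodes)

        @staticmethod
        def _hash(value):
            return int(hashlib.md5(value.encode()).hexdigest(), 16)

        def preference_list(self, key):
            positions = [pos for pos, _ in self.ring]
            start = bisect_right(positions, self._hash(key)) % len(self.ring)
            owners, i = [], start
            while len(owners) < min(self.replicas, len(self.ring)):
                node = self.ring[i][1]
                if node not in owners:
                    owners.append(node)
                i = (i + 1) % len(self.ring)
            return owners

    ring = ConsistentHashRing(["node-a", "node-b", "node-c", "node-d"])
    # The first entry is the coordinator; which nodes appear depends on the hashes.
    print(ring.preference_list("user:42"))

The key property is that the owner of a key can be computed locally from the ring membership, without consulting any explicit mapping of data to partitions.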
Even though it is not in the same vein as scalable data management systems, Sinfonia [1] is designed to provide an efficient platform for building distributed systems. Sinfonia can be used to efficiently design and implement systems such as distributed file systems. The system state of the file system (e.g., the inodes) needs to be maintained as well as manipulated in a distributed setting, and Sinfonia provides efficient means for guaranteeing the consistency of these critical operations. Sinfonia provides the minitransaction abstraction, a lightweight version of distributed transactions supporting only a small set of operations. The idea is to use a protocol similar to Two Phase Commit (2PC) [10] for committing a transaction, with the actions of the transaction piggybacked on the messages sent out during the first phase. The lightweight nature of minitransactions allows the system to scale to hundreds of nodes, but the cost paid is a reduced set of operations.
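To make the minitransaction idea more concrete, the sketch below (Python, hypothetical names) shows a 2PC-style coordinator that ships the transaction's compare, read, and write items inside the prepare messages themselves, so the protocol costs only the two commit phases. Item locking between the phases, logging, recovery, and the real key-to-node mapping are omitted; this is not Sinfonia's actual code.

    from dataclasses import dataclass, field

    @dataclass
    class Minitransaction:
        compares: dict = field(default_factory=dict)  # key -> expected value (guards)
        reads: list = field(default_factory=list)     # keys whose values are returned
        writes: dict = field(default_factory=dict)    # key -> new value

    class MemoryNode:
        """Participant holding one partition of the key space."""
        def __init__(self):
            self.store = {}

        def prepare(self, compares):
            # Vote yes only if every compare item on this node matches.
            return all(self.store.get(k) == v for k, v in compares.items())

        def commit(self, reads, writes):
            result = {k: self.store.get(k) for k in reads}
            self.store.update(writes)
            return result

    def owner(key, nodes):
        # Toy partitioning of keys across memory nodes.
        return nodes[hash(key) % len(nodes)]

    def execute(mt, nodes):
        """Phase 1 piggybacks the whole minitransaction on the prepare message;
        phase 2 simply commits (applies writes, returns reads) or aborts."""
        plan = {n: ({}, [], {}) for n in nodes}
        for k, v in mt.compares.items():
            plan[owner(k, nodes)][0][k] = v
        for k in mt.reads:
            plan[owner(k, nodes)][1].append(k)
        for k, v in mt.writes.items():
            plan[owner(k, nodes)][2][k] = v

        if not all(n.prepare(comp) for n, (comp, _, _) in plan.items()):
            return False, {}                      # some compare failed: abort
        results = {}
        for n, (_, reads, writes) in plan.items():
            results.update(n.commit(reads, writes))
        return True, results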
Thus, when it comes to critical system state, the designers of these scalable data
management systems rely on traditional mechanisms for ensuring consistency and fault-
tolerance, and are willing to compromise scalability. But this choice does not hurt the
system performance since this state is a very small fraction of the actual state (applica-
tion state comprises the majority of the state). In addition, another important distinction
of these systems is the number of nodes communicating to ensure the consistency of the
system state. In the case of Chubby and YMB, a commit for a general set of operations
is performed on a small set of participants (five and two respectively [3, 6]). On the
other hand, Sinfonia supports limited transactional semantics and hence can scale to a
larger number of nodes. This is in contrast to traditional distributed database systems, which tried to achieve both goals at once, i.e., to provide strong consistency guarantees for both system state and application state over any number of nodes.
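As a rough illustration of this design point, the sketch below models critical system state kept by a small, fixed replica group that accepts an update only when a majority of its members acknowledge it. The acknowledgment set is passed in directly, standing in for what would really be a Paxos or 2PC round; the class and member names are hypothetical.

    class SystemStateGroup:
        """Small, static replica group for critical metadata (system state):
        strong consistency stays cheap because the member set never grows."""

        def __init__(self, members):
            self.members = list(members)   # e.g. five Chubby-like replicas
            self.log = []                  # totally ordered metadata updates

        def propose(self, entry, acks):
            # Accept the entry only if a majority of the members acknowledged it.
            if len(set(acks) & set(self.members)) > len(self.members) // 2:
                self.log.append(entry)
                return True
            return False

    group = SystemStateGroup(["r1", "r2", "r3", "r4", "r5"])
    print(group.propose("tablet-7 -> server-B", acks={"r1", "r2", "r3"}))  # True
    print(group.propose("tablet-9 -> server-C", acks={"r4"}))              # False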
Distributed data management systems are designed to host large amounts of data for the
applications which these systems aim to support. We refer to this application specific
data as the application state. The application state is typically at least two to three
orders of magnitude larger than the system state, and the consistency, scalability, and
availability requirements vary based on the applications.
Data Model and its Implications. The distinguishing feature of the three main systems
we consider in this paper is their simple data model. The primary abstraction is a table
of items where each item is a key-value pair. The value can either be an uninterpreted
string (as in Dynamo), or can have structure (as in PNUTS and Bigtable). Atomicity
is supported at the granularity of a single item, i.e., atomic read/write and atomic read-modify-write operations apply to one key-value pair at a time; there is no atomic operation spanning multiple items.
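A minimal single-node sketch of this data model is shown below, using one lock per key to make the atomicity boundary explicit: individual items can be read, written, or read-modify-written atomically, but nothing ties two items together. The class is hypothetical and only illustrates the abstraction, not any of these systems' implementations.

    import threading

    class ItemStore:
        """Table of key-value items where atomicity stops at a single item."""

        def __init__(self):
            self._items = {}
            self._locks = {}
            self._table_lock = threading.Lock()

        def _lock_for(self, key):
            with self._table_lock:
                return self._locks.setdefault(key, threading.Lock())

        def get(self, key):
            with self._lock_for(key):
                return self._items.get(key)

        def put(self, key, value):
            with self._lock_for(key):
                self._items[key] = value

        def read_modify_write(self, key, fn):
            # Atomic update of one item, e.g. incrementing a counter field;
            # there is no corresponding operation that spans two keys.
            with self._lock_for(key):
                self._items[key] = fn(self._items.get(key))
                return self._items[key]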
In traditional distributed databases, the non-availability of a single component of the system could render the entire system unavailable. On the other hand, modern systems are loosely coupled, and the non-availability of certain portions of the system might not affect other parts of the system. For example, if a partition is not available, this does not affect the availability of the rest of the system, since all operations are single-object. Thus, even though overall system availability might be high, record-level availability might be lower in the presence of failures.
In Bigtable [5], a single node (referred to as tablet server) is assigned the responsibility
for part of the table (known as a tablet) and performs all accesses to the records assigned
to it. The application state is stored in the Google File System (GFS) [9] which provides
the abstraction of a scalable, consistent, fault-tolerant storage for user data. There is no
replication of user data inside Bigtable (all replication is handled at the GFS level); hence Bigtable is single-master by default. Bigtable also supports atomic read-modify-write
on single keys. Even though scans on a table are supported, they are best-effort without
providing any consistency guarantees.
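The following toy sketch (Python, hypothetical names) illustrates range-based tablet assignment: the row-key space of a table is split into contiguous ranges, each served by exactly one tablet server, and a request is routed by locating the range that contains its key. Bigtable's actual metadata hierarchy, tablet splits, and master election are abstracted away.

    import bisect

    class TabletDirectory:
        """Maps a table's key ranges (tablets) to tablet servers."""

        def __init__(self, split_keys, servers):
            # split_keys define the tablet boundaries, e.g. ["g", "p"] yields
            # three tablets (-inf,"g"), ["g","p"), ["p",+inf), one server each.
            assert len(servers) == len(split_keys) + 1
            self.split_keys = split_keys
            self.servers = servers

        def server_for(self, row_key):
            # All reads and writes for row_key go through this single server,
            # which is what makes single-row operations easy to keep atomic.
            return self.servers[bisect.bisect_right(self.split_keys, row_key)]

    directory = TabletDirectory(["g", "p"], ["ts-1", "ts-2", "ts-3"])
    print(directory.server_for("alice"))  # ts-1
    print(directory.server_for("nancy"))  # ts-2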
PNUTS [6] was developed with the goal of providing efficient read access to geographically distributed clients while providing serial single-key writes. PNUTS performs explicit replication to ensure fault tolerance. The replicas are often geographically distributed, which helps improve the performance of web applications attracting users from different parts of the world. As noted earlier in Section 2.1, the Yahoo! Message Broker (YMB), in addition to maintaining the system state, also aids in providing application-level guarantees by serializing all requests to the same key. PNUTS uses a single master per record, and the master processes updates only by publishing them to a single broker, thereby providing single-object timeline consistency: updates to a record are applied in the same order at all replicas [6]. Even though the system supports multi-object operations such as range queries, no consistency guarantees are provided for them. PNUTS allows clients to specify their consistency requirements for reads: a read that does not need the guaranteed latest version can be satisfied from a local copy and hence has low latency, while reads with a desired level of freshness (including reading the latest version) are also supported but might incur higher latency.
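The sketch below illustrates per-record timeline consistency and freshness-aware reads under simplifying assumptions: a per-record master assigns monotonically increasing version numbers and publishes updates in that order (standing in for YMB), and replicas apply them strictly in sequence. The class and function names are illustrative and do not correspond to the PNUTS API.

    import collections

    class Replica:
        """Applies updates strictly in the order published by the record master,
        so its state is always some prefix of the record's update timeline."""
        def __init__(self):
            self.value, self.version = None, 0
            self.pending = collections.deque()

        def deliver(self, version, value):
            self.pending.append((version, value))

        def catch_up(self):
            while self.pending:
                version, value = self.pending.popleft()
                assert version == self.version + 1
                self.version, self.value = version, value

    class RecordMaster:
        """All writes for one record funnel through its master, which assigns
        the next version number and publishes it to every replica."""
        def __init__(self, replicas):
            self.replicas, self.version = replicas, 0

        def write(self, value):
            self.version += 1
            for r in self.replicas:
                r.deliver(self.version, value)
            return self.version

    def read_any(replica):
        # Low-latency read: possibly stale, never out of timeline order.
        return replica.value, replica.version

    def read_critical(replica, min_version):
        # Read with a freshness requirement: answer only if the replica has
        # caught up to at least min_version; otherwise the caller should retry
        # against a fresher copy.
        replica.catch_up()
        if replica.version < min_version:
            raise RuntimeError("replica too stale for requested version")
        return replica.value, replica.version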
Dynamo [7] was designed to be a highly scalable key-value store that is highly available for reads and, in particular, for writes. The system is designed to make progress even in the presence of network partitions. The high write availability is achieved through an asynchronous replication mechanism that acknowledges a write as soon as a small number of replicas have written it; the write is eventually propagated to the remaining replicas. To further increase availability, there is no statically assigned coordinator (thereby making this a multi-master system), and thus single-object writes do not have a serial history. In the presence of failures, high availability is achieved at the cost of lower consistency. More precisely, Dynamo guarantees only eventual consistency, i.e., all updates are eventually delivered to all replicas, but in no guaranteed order. In addition, Dynamo allows multiple divergent versions of the same record and relies on application-level reconciliation based on vector clocks.
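The sketch below illustrates vector-clock-based reconciliation in the spirit of Dynamo: each write bumps the coordinating node's counter, a version that is strictly dominated by another can be discarded, and any versions that remain are concurrent and must be merged by the application (for example, taking the union of two shopping carts). The function names are hypothetical.

    def increment(clock, node):
        # New clock for a write coordinated by `node`.
        clock = dict(clock)
        clock[node] = clock.get(node, 0) + 1
        return clock

    def descends(a, b):
        # True if clock `a` has seen everything `b` has (component-wise >=).
        return all(a.get(node, 0) >= count for node, count in b.items())

    def reconcile(versions):
        """Drop versions strictly dominated by another; whatever survives is
        either a single latest version or a set of concurrent versions that
        the application has to merge."""
        survivors = []
        for value, clock in versions:
            dominated = any(descends(other, clock) and other != clock
                            for _, other in versions)
            if not dominated:
                survivors.append((value, clock))
        return survivors

    # Two writes from different coordinators after the same ancestor {"A": 1}:
    v1 = ("cart-with-book", increment({"A": 1}, "A"))  # clock {"A": 2}
    v2 = ("cart-with-pen",  increment({"A": 1}, "B"))  # clock {"A": 1, "B": 1}
    print(reconcile([v1, v2]))  # both survive: concurrent, application must merge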
The consistency that applications observe is, to a large extent, a property of the underlying replication mechanism. For systems with limited availability, allowing the application to specify freshness requirements for reads enables easy load spreading while still serving requests when fresh replicas are unreachable; this is the case in both PNUTS and Dynamo. In such settings, we think designers should strongly consider adding support for multi-versioning, similar to that supported in Bigtable. These versions are created anyway as part of the update process, and the design decision is merely whether to store them. Old versions are immutable, and when storage servers are decoupled as discussed above, retaining them allows analysis applications to efficiently pull data without interfering with the online system, while also enabling time-travel analysis.
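As an illustration of this multi-versioning principle, the following sketch keeps every write of a record as an immutable timestamped version: the newest version serves the online path, while older versions support snapshot ("as of") reads for analysis. The class name and in-memory representation are assumptions made for the example, and timestamps are assumed to arrive in nondecreasing order.

    import bisect
    import time

    class VersionedRecord:
        """Append-only record: writes add immutable (timestamp, value) versions
        instead of overwriting the previous value."""

        def __init__(self):
            self._timestamps = []   # kept in nondecreasing order
            self._values = []

        def put(self, value, timestamp=None):
            self._timestamps.append(time.time() if timestamp is None else timestamp)
            self._values.append(value)

        def latest(self):
            # Online read path: the most recent version.
            return self._values[-1] if self._values else None

        def as_of(self, timestamp):
            # Time-travel read: newest version written at or before `timestamp`,
            # usable by analysis jobs without touching the latest-version path.
            idx = bisect.bisect_right(self._timestamps, timestamp) - 1
            return self._values[idx] if idx >= 0 else None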
The design principles outlined above will form the basis for the next generation of scalable data stores: systems that bridge the gap between scalable key-value stores and traditional databases by providing varying degrees of consistency and scalability. In this paper, our goal was to lay the foundations of the design of such a system for managing "clouded data".
References
1. Aguilera, M.K., Merchant, A., Shah, M., Veitch, A., Karamanolis, C.: Sinfonia: a new
paradigm for building scalable distributed systems. In: SOSP. pp. 159–174 (2007)
2. Bernstein, P.A., Hadzilacos, V., Goodman, N.: Concurrency Control and Recovery in
Database Systems. Addison Wesley, Reading, Massachusetts (1987)
3. Burrows, M.: The Chubby Lock Service for Loosely-Coupled Distributed Systems. In:
OSDI. pp. 335–350 (2006)
4. Chandra, T.D., Griesemer, R., Redstone, J.: Paxos made live: an engineering perspective. In:
PODC. pp. 398–407 (2007)
5. Chang, F., Dean, J., Ghemawat, S., Hsieh, W.C., Wallach, D.A., Burrows, M., Chandra, T.,
Fikes, A., Gruber, R.E.: Bigtable: A Distributed Storage System for Structured Data. In:
OSDI. pp. 205–218 (2006)
6. Cooper, B.F., Ramakrishnan, R., Srivastava, U., Silberstein, A., Bohannon, P., Jacobsen,
H.A., Puz, N., Weaver, D., Yerneni, R.: PNUTS: Yahoo!’s hosted data serving platform.
Proc. VLDB Endow. 1(2), 1277–1288 (2008)
7. DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., Siva-
subramanian, S., Vosshall, P., Vogels, W.: Dynamo: amazon’s highly available key-value
store. In: SOSP. pp. 205–220 (2007)
8. von Eicken, T.: RightScale Blog: Animoto's Facebook Scale-up. http://blog.rightscale.com/2008/04/23/animoto-facebook-scale-up/ (April 2008)
9. Ghemawat, S., Gobioff, H., Leung, S.T.: The Google file system. In: SOSP. pp. 29–43 (2003)
10. Gray, J.: Notes on data base operating systems. In: Operating Systems, An Advanced Course.
pp. 393–481. Springer-Verlag, London, UK (1978)
11. Helland, P.: Life beyond distributed transactions: an apostate’s opinion. In: CIDR. pp. 132–
141 (2007)
12. Hirsch, A.: Cool Facebook Application Game – Scrabulous – Facebook's Scrabble. http://www.makeuseof.com/tag/best-facebook-application-game-scrabulous-facebooks-scrabble/ (2007)
13. Karger, D., Lehman, E., Leighton, T., Panigrahy, R., Levine, M., Lewin, D.: Consistent hash-
ing and random trees: distributed caching protocols for relieving hot spots on the world wide
web. In: STOC. pp. 654–663 (1997)
14. Lamport, L.: The part-time parliament. ACM Trans. Comput. Syst. 16(2), 133–169 (1998)
15. Lindsay, B.G., Haas, L.M., Mohan, C., Wilms, P.F., Yost, R.A.: Computation and communi-
cation in R*: a distributed database manager. ACM Trans. Comput. Syst. 2(1), 24–38 (1984)
16. Rothnie Jr., J.B., Bernstein, P.A., Fox, S., Goodman, N., Hammer, M., Landers, T.A., Reeve,
C.L., Shipman, D.W., Wong, E.: Introduction to a System for Distributed Databases (SDD-
1). ACM Trans. Database Syst. 5(1), 1–17 (1980)
17. Stoica, I., Morris, R., Karger, D., Kaashoek, M.F., Balakrishnan, H.: Chord: A scalable peer-
to-peer lookup service for internet applications. In: SIGCOMM. pp. 149–160 (2001)
18. Vogels, W.: Data access patterns in the amazon.com technology platform. In: VLDB. pp.
1–1. VLDB Endowment (2007)
19. Weikum, G., Vossen, G.: Transactional information systems: theory, algorithms, and the
practice of concurrency control and recovery. Morgan Kaufmann Publishers Inc. (2001)