Chapter Two
Functional Requirements
Computer systems are designed to meet one or more objectives. Without
considering such requirements, it is impossible to compare alternative designs.
When designing distributed systems, it can be helpful to distinguish between
functional requirements (FRs) and non-functional requirements (NFRs).
A functional requirement specifies something the system must be able to do. It
specifies a feature, expressed as a specific function or as a broadly defined
behavior. The system either has the feature, and thus provides the function or
behaves as specified, or not.
Examples of functional requirements of a multiplayer online game include:
1. The player can connect to the game servers and see who is connected to each
server.
2. The player can join their friends and exchange messages.
3. Players engaged in parkour against the clock can do so wherever they are in
the world, even if the clocks on their computers are not synchronized.
4. Specific actions where multiple players compete to retrieve the same object
are correctly resolved by the game, and, subsequently, each player sees the
same outcome.
5. Modifications made to an in-game object by the players are seen in the
correct order by all nearby players.
In contrast, a non-functional requirement focuses on how well the feature works,
defining quality attributes for the distributed system. Examples related to the same
online game include:
1. Players experience a reaction from the game to their actions (clicks) within
150 milliseconds. This lag is not only bounded but also stable.
2. Players can access the game services every time they try.
3. Each machine used by the gaming platform is utilized at more than 50% of its
capacity, each day.
4. The gaming platform consumes at most 1 MWh of electricity per day.
2.1 A Framework of Functional Requirements
Which Functional Properties Do We Expect?
• Naming: We expect that components of the same computer system find each
other, by identifier or name, and can communicate easily.
• Clock Synchronization: We also expect that the system provides a clock
and implicitly that the clock is synchronized for all system components; in
other words, acting 'at the same time' (synchronously) is trivial for
components in a computer system.
• Consensus: Similarly, components should trivially reach consensus, that is,
can easily agree on a single value, such as the maximum value across the
values stored by all components or which component has received the most
votes in the last election. (Surprisingly, real-world elections struggle with this
issue much like distributed computer systems do.)
• Consistency: Last, for this lecture, we expect the data to which multiple
components write to show consistency, meaning, simplified, that if some
components modify the data, afterward all components can read the result
and agree it is correct.
Distributed systems run counter to all these intuitions. Because the machines hosting
components of the system are physically distributed, the laws of physics have an
important impact: the real-world time it takes for information to get from one
component to another can be orders of magnitude higher than it normally takes in
the computers and smartphones we are used to. This real-world information delay
changes everything. Components in distributed systems cannot easily name or
communicate with other components. Distributed systems cannot easily achieve
clock synchronization, consensus, or consistency. Instead, all these functions
require specialized approaches in distributed systems.
2.2 Naming and Communication in Distributed Systems
Challenge of Communication in Distributed Systems
The ability to communicate is essential for systems. Even single machines are
constructed around the movement of data, from input devices, to memory and
persistent storage, to output devices. Although computers are increasingly
complex, this communication is well-understood. The typical functional
requirements, which modern single-machine systems already meet, include:
messages arrive correctly at the receiver; there is an upper limit on the amount
of time it takes to read or write a message; and developers know how much
data can be safely exchanged between applications at any point in time. In a
distributed system, none of this is true without additional effort.
Similar to applications running on a single machine, a distributed system can
only function if its components are able to communicate. However:
• The components in a distributed system are asynchronous: they run
independently from, and do not wait for, other components.
• Components must continue to function even though communication with
other components can start and stop at any time.
• The networks used for communication in distributed systems are unreliable.
Networks may drop, delay, or reorder messages arbitrarily, and components
need to take these possibilities into account.
a) Protocols for Computer Networking
To enable communication between computers, they need to speak the same
protocol. A protocol defines the rules of communication, including
• Which entity can speak,
• When, and
• How to represent the data that is communicated.
How a protocol is defined depends on the technology that underlies it. Protocols
that directly use the network’s transport layer need to define data fields as
sequences of bits or bytes. Defining a protocol on this level, however, has
multiple disadvantages: it is labor-intensive, the binary messages are challenging
to debug, and it is difficult to achieve backward compatibility.
When a protocol defines data fields on the level of bits and bytes, adding or
changing what data can be sent while still supporting older implementations is
difficult. For these and other reasons, distributed systems often define their
protocols on a higher layer of abstraction.
One of the simplest abstractions on top of byte streams is plain-text messages.
These are used widely in practice, especially in the older technologies that form the
core of the Internet. For example, all of the following protocols use plain-text
messages:
• The Domain Name System (DNS),
• The Hypertext Transfer Protocol (HTTP), and
• The Simple Mail Transfer Protocol (SMTP).
Instead of defining fields with specified bit or byte lengths, plain-text protocols are
typically line-based, meaning every message ends with a newline character (“\n”).
The advantages of such protocols are that they are easy to debug by both humans
and computers, and that they offer increased flexibility due to variable-length
fields. Text-based protocols can easily be changed into binary protocols without
losing their advantages, by compressing the data before it is sent over the network.
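As an illustrative sketch, the following Go snippet parses newline-terminated
messages from a byte stream; the commands shown are hypothetical, not part of
any real protocol:

    package main

    import (
        "bufio"
        "fmt"
        "strings"
    )

    func main() {
        // A hypothetical line-based exchange: each message is one line with
        // space-separated, variable-length fields.
        input := strings.NewReader("HELLO player42\nJOIN lobby-1\n")

        scanner := bufio.NewScanner(input) // splits on newline ("\n") by default
        for scanner.Scan() {
            fields := strings.Fields(scanner.Text())
            fmt.Printf("command=%s args=%v\n", fields[0], fields[1:])
        }
    }

The variable-length fields make it easy to extend a message with new data later,
without breaking older readers that only inspect the leading fields.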
Moving one more level up brings us to structured-text protocols. Such protocols
use specialized languages designed to represent data, and use them to define
messages. For example, REST APIs typically use JSON to exchange data. A
structured-text format comes with its own (more complex) rules on how to format
messages. Fortunately, many parser libraries exist for popular structured-text
formats such as XML and JSON, making it easier for distributed-system
developers to use these formats without writing the tools themselves.
Finally, structs and objects in programming languages can also be used as
messages. Typically, these structs are translated to and from structured-text or
binary representations with little or no work required from developers.
Translating a programming-language-specific data structure into a message
format is called marshaling; the reverse translation is called unmarshaling.
Marshaling libraries and tools take care of both. Examples of marshaling libraries
for structured-text protocols include Jackson for Java and the built-in JSON library
for Golang. Examples of marshaling libraries for binary formats include Java’s
built-in Serializable interface and Google’s Protocol Buffers.
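A minimal sketch using Go's built-in encoding/json library; the message type and
its fields are illustrative:

    package main

    import (
        "encoding/json"
        "fmt"
    )

    // JoinRequest is a hypothetical message type; the struct tags control
    // the JSON field names.
    type JoinRequest struct {
        Player string `json:"player"`
        Server int    `json:"server"`
    }

    func main() {
        // Marshaling: struct -> JSON bytes, ready to be sent over the network.
        msg, err := json.Marshal(JoinRequest{Player: "alice", Server: 42})
        if err != nil {
            panic(err)
        }
        fmt.Println(string(msg)) // {"player":"alice","server":42}

        // Unmarshaling: JSON bytes -> struct, on the receiving side.
        var req JoinRequest
        if err := json.Unmarshal(msg, &req); err != nil {
            panic(err)
        }
        fmt.Printf("%+v\n", req) // {Player:alice Server:42}
    }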
b) Communication Models for Message Passing
Message passing always occurs between a sender and a receiver. It requires
messages to traverse a possibly unreliable communication environment and can
start and end at synchronized or arbitrary moments. This leads to multiple ways in
which message-passing communication can occur, and thus multiple useful
models.
Depending on whether the message in transit through the communication
environment is stored (persisted) until it can be delivered or not, we distinguish
between transient and persistent communication:
• Transient communication only maintains the message while the sender and
the receiver are online, and only if no transmission error occurs. This model is
the easiest to implement and matches the behavior of typical Internet routers,
which are based on store-and-forward or cut-through technology. For example,
real-time games may occasionally drop updates and use local correction
mechanisms. This allows in many cases the use of relatively simple designs, but
for some game genres it can lead to the perception of lag or choppy movement
of the avatars and objects.
• Persistent communication requires the communication environment to store
the message until it is received. This is convenient for the programmer, but
much more complex to guarantee by the distributed system. Worse, this
typically leads to lower scalability than approaches based on transient
communication, due to the higher latency of the message broker storing
incoming messages on a persistent storage device, as well as potential limits on
the number of messages that can be persisted at the same time. An example of a persistent
communication system appears in the email system. Emails are sent and
received using SMTP and IMAP respectively. SMTP copies email from a client
or server to another server, and IMAP copies email from a server to a client.
The client can copy the email from their server repeatedly because the email is
persisted on the server.
Depending on whether the sender and/or the receiver has to wait (is blocked) in the
process of transmitting or receiving, we distinguish between asynchronous
communication and synchronous communication:
o In asynchronous communication, senders and receivers attempt to
transmit and receive, respectively, but continue with other activities
immediately after the attempt, regardless of its outcome. UDP-like
communication uses the (transient) asynchronous model.
o In synchronous communication, synchronous senders and receivers
block until their operation (request) is confirmed (synchronized). We
identify three useful synchronization points: (1) when the request is
submitted, that is, when the request has been acknowledged by the
communication environment; (2) when the request is dispatched
(message/operation delivery), that is, when the communication
environment acknowledges the request has been delivered for
execution to the other side of the communication; (3) when the
request is fully processed (operation completed), that is, when the
request has reached its destination and has been fully processed, but
before the result of the processing has been sent back (as another
message, possibly with the same communication model). In practice,
the Message Passing Interface (MPI) standard provides the
programmer with the flexibility of choosing between all these
approaches to synchronization.
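A minimal Go sketch contrasting the two models, with a channel standing in for
the communication environment; a real system would use a network transport:

    package main

    import "fmt"

    func main() {
        network := make(chan string, 1) // stand-in for the communication environment
        ack := make(chan struct{})

        // Receiver: processes the message, then acknowledges completion.
        go func() {
            msg := <-network
            fmt.Println("received:", msg)
            ack <- struct{}{}
        }()

        // Asynchronous send: hand the message to the environment and continue
        // immediately, regardless of the outcome.
        network <- "update #1"
        fmt.Println("async send done, continuing immediately")

        // Synchronous wait: block until the request has been fully processed
        // (synchronization point 3 in the text).
        <-ack
        fmt.Println("request fully processed")
    }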
Remote Procedure Calls
Due to the unavailability of shared memory in distributed systems, communication
is a means to achieve a common goal among multiple computers. Often, we
communicate because one machine, the caller, wants to ask (call on) another
machine, the callee, to have them perform some work. This pattern is very typical
for how people have used telecommunication systems since their inception, only
that here we are dealing with machine-to-machine communication. Remote
procedure calls (RPCs) are an abstraction of this communication-enabled process
in distributed computer systems. Internally, RPC is typically built on top of
message passing, and thus can occur according to any of the models introduced in
the previous section. RPC and its derivatives are still widely used today. For
instance, Google’s gRPC is a common
building block of large parts of the Google datacenter stack. Many uses of REST
also closely mimic RPC semantics, much to the disdain of purists who emphasize
that REST is supposed to be resource-centric and not procedure-centric. Modern
object-oriented variants of RPC such as RMI are often used in the Java community
for building distributed applications.
Implementation
Functionally, RPC maintains the illusion that the program calls a local
implementation of the service. Since caller and callee now reside on different
machines, they need to agree on a definition of what the procedure is: its name and
parameters. This information is often encoded in an interface written in an
Interface Definition Language (IDL).
In order for local programs to be able to call the service, a stub is created that
implements the interface but, instead of doing function execution locally, encodes
the name and the argument values in a message that is forwarded to the callee.
Since this is a mechanical, deterministic process, the stub can be compiled
automatically by a stub generator.
On the server side, the message is received and the arguments need to be
unmarshaled from the message so that the function can be invoked on behalf of
the client. This is again performed by an automatically generated stub, on this side
of the system often referred to as a skeleton (or server stub).
The dynamics of this operation are as follows. When the client calls the procedure
on the local stub, the client stub marshals both the procedure name and the
provided arguments into a message. This message is then sent to a server that
contains the requested procedure. Upon receipt, the receiving stub unmarshals the
message and calls the corresponding procedure with the provided arguments. The
returned value is then sent back to the client, using the same approach. RPC uses
transient synchronous communication to create an interface that is as close as
possible to regular procedure calls.
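A hedged sketch using Go's standard net/rpc package (the service, method, and
port are illustrative); it shows both sides of the call described above:

    package main

    import (
        "fmt"
        "net"
        "net/rpc"
    )

    // Args is the agreed-upon parameter type; both sides must know it.
    type Args struct{ A, B int }

    type Arith int

    // Multiply runs on the callee. net/rpc uses reflection instead of a stub
    // generator, but its dispatch plays the role of the server stub (skeleton).
    func (t *Arith) Multiply(args *Args, reply *int) error {
        *reply = args.A * args.B
        return nil
    }

    func main() {
        // Server side: register the service and accept connections.
        rpc.Register(new(Arith))
        ln, err := net.Listen("tcp", "127.0.0.1:4242")
        if err != nil {
            panic(err)
        }
        go rpc.Accept(ln)

        // Client side: Call marshals the procedure name "Arith.Multiply" and
        // the arguments into a message, sends it, and blocks for the reply:
        // transient synchronous communication that mimics a local call.
        client, err := rpc.Dial("tcp", "127.0.0.1:4242")
        if err != nil {
            panic(err)
        }
        var product int
        if err := client.Call("Arith.Multiply", &Args{A: 7, B: 6}, &product); err != nil {
            panic(err)
        }
        fmt.Println("7 * 6 =", product)
    }

In a real deployment the server and client would run in separate processes; they
are combined here only to keep the sketch self-contained.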
From Remote Procedures to Remote Objects
Since the 1960s, the programming language community has proposed and
developed programming models centered on objects, rather than merely on
operations and procedures for program control. Object-oriented programming
languages, such as Java (and Kotlin), Python, and C++, remain among the most
popular programming languages. It is thus meaningful to ask the question: Can
RPC be extended to (remote) objects?
An object-oriented equivalent of RPC is remote method invocation (RMI). RMI is
similar to RPC, but has to deal with the additional complexity of remote-object
state. In RMI, the object is located on the server, together with its methods
(equivalent of procedures for RPC). The client calls a function on a proxy, which
fulfills the same role as the client-stub in RPC. On the server side, the RMI
message is received by a skeleton, which executes the method call on the correct
object.
Communication Patterns
The messages that are sent between machines, be they sent as plain messages, or as
the underlying technology of RPC, show distinct patterns over time depending on
the properties of the system that uses them. Below we describe some of the most
prevalent communication patterns.
• Request-Reply Request-reply is a one-to-one communication pattern where
a sender starts by asking something from the receiver. The receiver then
takes an action and replies to the sender.
• Publish-Subscribe In publish-subscribe, or pub-sub, one or multiple
machines generate periodic updates. If another machine is interested in these
updates, they can subscribe to them at the sender. The sender then sends the
updates to all the machines that are subscribed. This can work well in games,
where players may indicate that they want to receive some updates, but not
others.
• Pipeline communication works with producers, consumers, and prosumers.
In this messaging pattern, a producer wants to send messages to a particular
type of receiver but does not care about which machine this is specifically.
This pattern allows long chains and easy load-balancing, by adding more
machines to the system with a particular role.
• Broadcast A broadcast is a message sent out by a source and addressed, or
received, by all other entities on the network. Broadcast messages are useful
for bootstrapping or communicating the global state. An example of
bootstrapping is the broadcast DHCP message sent out by a client requesting
an IP. An example of broadcasting the global state can be found in games,
where a server may inform all players when a new player has joined.
• Flooding is a pattern where a broadcast is repeated by its receivers. This
works well for fast information dissemination on non-star-topology
networks but also uses large amounts of network resources. Systems that use
flooding must also actively stop the flooding once all machines have
received the message. One way of doing this is to allow machines to
propagate the broadcast only once.
• Multicast In between one-to-one communication, such as request-reply, and
one-to-all communication, such as broadcast, sits multicast. Here a sender
wants to send messages to a particular set of receivers. In a game, the
receivers could be teammates, to whom the player sends data about themselves
that they do not want to share with the opposing team.
• Gossip For some systems the large number of resources required for
flooding is out of the question. To still do data dissemination, gossip
provides an alternative. In a gossiping exchange pattern, machines
periodically contact one random neighbor. They send messages to each
other, and then go back to sleep. This pattern allows information to
propagate through the entire network with high likelihood, without the
intensive resource utilization that flooding requires.
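A minimal, illustrative Go simulation of the gossip pattern; a real system would
exchange state over the network rather than flip flags in a shared slice:

    package main

    import (
        "fmt"
        "math/rand"
    )

    // gossip simulates rounds in which every informed node contacts one
    // random neighbor and shares the message with it.
    func gossip(nodes, rounds int) {
        informed := make([]bool, nodes)
        informed[0] = true // node 0 starts with the message

        for r := 1; r <= rounds; r++ {
            snapshot := make([]bool, nodes)
            copy(snapshot, informed) // nodes informed before this round gossip
            for i := 0; i < nodes; i++ {
                if snapshot[i] {
                    peer := rand.Intn(nodes) // pick one random neighbor
                    informed[peer] = true    // share the message
                }
            }
            count := 0
            for _, ok := range informed {
                if ok {
                    count++
                }
            }
            fmt.Printf("round %d: %d/%d nodes informed\n", r, count, nodes)
        }
    }

    func main() {
        gossip(100, 10) // with high likelihood, all 100 nodes learn the message
    }

Because each node sends only one message per round, the network load grows
linearly with the number of nodes, unlike flooding.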
Communication in Practice
Here are some examples of popular messaging systems:
• RabbitMQ is a message broker, a middleware system that facilitates
sending and receiving messages. It is similar to Kafka, but not specifically
built for stream-processing systems.
• Protocol Buffers are a cross-platform, cross-language method of
(de)marshaling data. This is useful for HTTP or other services that do not
come with their own data marshaling libraries.
• Netty is a framework for building networked event-driven systems. It
supports both custom and existing protocols such as HTTP and Protocol
Buffers.
• Akka HTTP is part of the Akka distributed-systems framework. It provides
both a client-and server-side framework for building HTTP-based services.
• gRPC is an RPC library that marshals messages using Protocol Buffers.
• CORBA is another popular, long-standing RPC technology, standardized with many implementations.
2.2.1 Naming in Distributed Systems
Distributed systems can consist of hundreds or even thousands of entities at any
given point. An entity can be a machine, a service, or a serverless function.
Depending on the workload, an entity may be active for any duration, from only a
few seconds to multiple years. Whatever their lifetime, these entities must
be able to communicate with each other. This is where naming comes in. Naming
enables the entities in the distributed system to identify each other and to set up
communication channels. Two elements are involved:
• Naming schemas: the techniques used to assign names to entities, and
• Naming services: systems that use the naming schema and other elements to
offer name-related services to the distributed system.
a) Naming Schema in Distributed Systems
Naming schemas are the rules by which names are given to individual
entities. There are infinitely many ways to ascribe names to entities.
In this section, we identify and discuss three categories of naming schema:
• Simple naming,
• Hierarchical naming, and
• Attribute-based naming.
Simple Naming: Focusing on uniquely identifying one entity among many is the
simplest way to name entities in a distributed system. Such a name contains no
information about the entity’s location or role.
Advantages: The main advantage of this approach is simplicity. The effort required
to assign a name is low—the only requirement is that the name is not already
taken. Various approaches can simplify even this verification step, at the cost of a
(very low) probability that the chosen name collides with another.
Disadvantages: A simple name contains no information that helps locate the
entity, shifting that complexity to the naming service.
Addressing the downside of simple naming, distributed systems can use rich
names. Such names not only uniquely identify an entity, but also contain additional
information, for example, about
the entity's location. This additional information can simplify the task of the
naming service. As examples of rich names, we discuss hierarchical naming and
attribute-based naming.
Hierarchical Naming: In hierarchical naming, names are allowed to contain other
names, creating a tree structure.
Namespaces are commonly used in practice. Examples include file systems, the
DNS, and package imports in Java and other languages. These names consist of a
concatenation of words separated by a special character such as “.” or “/”. The tree
structure forms a name hierarchy, which combines well with, but is not the same
as, a hierarchical name resolution approach. When a hierarchical naming scheme is
combined with hierarchical name resolution, a machine is typically responsible for
all names in one part of the hierarchy. For example, when using DNS to look up
the name rvu.com.et, we first contact one of the DNS root servers. These
forward us to the “et” servers, which forward us to the “com.et” servers, which in
turn know where to find “rvu.com.et”.
Figure 1. Simplified attribute-based naming for a Minecraft-like game. Steps 1-3
are discussed in the text.
Attribute-Based Naming names entities by concatenating the entity’s
distinguishing attributes. For example, a Minecraft server located in an EU
datacenter may be named “R=EU/G=Minecraft”, where “R” indicates the region
and “G” indicates the game.
Figure 1 illustrates how a player in the example might find a game of Minecraft
located in an EU datacenter. In step 1, the game client on the player's computer
automates this, by querying the naming service to
"search((R=“EU”)(G=“Minecraft”))". Because the entries in attribute-based
naming are key-value pairs, searches are easy to make, and also partial searches
can result in matches. In step 2, the naming service returns the information that
"server 42" is a server matching
the requested properties. In step 3, the game client resolves the name and connects
to the specific machine that runs the Minecraft server in the EU.
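An illustrative Go sketch of such an attribute-based directory; the entries,
attributes, and addresses are hypothetical:

    package main

    import "fmt"

    // Entry maps an entity's attributes to its resolvable address.
    type Entry struct {
        Attrs map[string]string // e.g., R=EU, G=Minecraft
        Addr  string
    }

    // search returns all entries whose attributes match every requested
    // key-value pair; partial attribute sets also produce matches.
    func search(dir []Entry, want map[string]string) []Entry {
        var out []Entry
        for _, e := range dir {
            match := true
            for k, v := range want {
                if e.Attrs[k] != v {
                    match = false
                    break
                }
            }
            if match {
                out = append(out, e)
            }
        }
        return out
    }

    func main() {
        dir := []Entry{
            {Attrs: map[string]string{"R": "EU", "G": "Minecraft"}, Addr: "server42.example:25565"},
            {Attrs: map[string]string{"R": "US", "G": "Minecraft"}, Addr: "server7.example:25565"},
        }
        // Step 1 in Figure 1: search((R="EU")(G="Minecraft")).
        for _, e := range search(dir, map[string]string{"R": "EU", "G": "Minecraft"}) {
            fmt.Println("match:", e.Addr) // step 2: the service returns server 42
        }
    }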
Naming schema in practice: The lightweight directory access protocol (LDAP) is
a name-resolution protocol that uses both hierarchical and attribute-based naming.
Names consist of attributes, and can be found by performing search operations.
The protocol returns all names with matching attributes. In addition to the
attributes, entries also have a distinguished name, which is a unique hierarchical
name similar to a file path. This name can change when, for example, the entry is
moved to a different server. Because LDAP is a protocol, multiple
implementations exist. ApacheDS is one of these implementations.
b) Naming Services in Distributed Systems
Once every entity in the system has a name, we would like to use those names to
address our messages. Networking approaches assume that we know, for each
entity, on which machine it is currently running. In distributed systems, we want to
break free of this limitation. Modern datacenter architectures often run systems
inside virtual machines that can be moved from one physical machine to the next
in seconds. Even if instances of entities are not moved, they may fail or be shut
down, while new instances of the same service are started on other machines.
Naming services address such complexity in distributed systems.
Name resolution: A subsystem is responsible for maintaining a mapping between
entity names and transport-layer addresses. Depending on the scalability
requirements of the system, this could be implemented on a single machine, as a
distributed database, etc.
Publish-Subscribe systems: The entities only have to indicate which messages
they are interested in receiving. In other words, they subscribe to certain messages
with the naming service. This subscription can be based on multiple properties.
Common properties for publish-subscribe systems include messages of a certain
topic, with certain content, or of a certain type. When an entity wants to send a
message, it sends it not to the interested entities, but to the naming service. The
naming service then proceeds by publishing the message to all subscribed entities.
The publish-subscribe service is reminiscent of the bus found in single-machine
systems. For this reason, the publish-subscribe service is often called the
“enterprise service bus”. A bus provides a single channel of communication to which all
components are connected. When one component sends a message, all others are
able to read it. It is then up to the recipients to decide if that message is of interest
to them. Publish-subscribe differs from this approach by centralizing the logic that
decides which messages are of interest to which entities.
Figure 2. Publish-subscribe example.
Figure 2 illustrates the operation of publish-subscribe systems, in practice. In step
1, user A announces their intention to receive all updates from user D; this forms
the subscription for all messages from user D. This request could be much more
complex, filtering only some messages. Also, the same user can make multiple
subscriptions. The system will ensure the request is enforced.
In step 2, users B and C send updates (messages) to the publish-subscribe system.
These are stored (published) and may be forwarded to users other than A.
In step 3, user D sends a new message to the publish-subscribe system. The system
analyzes this message and decides it fits the subscription made by user A.
Consequently, in step 4, user A will receive the new message.
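A minimal in-process Go sketch of this flow; a production broker would persist
messages and operate over the network, and all names here are illustrative:

    package main

    import "fmt"

    // Broker routes published messages to subscribers interested in the
    // sender; a real broker would also need synchronization and persistence.
    type Broker struct {
        subs map[string][]chan string // sender -> subscriber channels
    }

    func NewBroker() *Broker {
        return &Broker{subs: make(map[string][]chan string)}
    }

    // Subscribe registers interest in all messages from a given sender (step 1).
    func (b *Broker) Subscribe(sender string) <-chan string {
        ch := make(chan string, 8)
        b.subs[sender] = append(b.subs[sender], ch)
        return ch
    }

    // Publish forwards a message to all matching subscriptions (steps 2-4);
    // messages without subscribers are simply dropped in this sketch.
    func (b *Broker) Publish(sender, msg string) {
        for _, ch := range b.subs[sender] {
            ch <- msg
        }
    }

    func main() {
        b := NewBroker()
        fromD := b.Subscribe("D") // user A subscribes to user D

        b.Publish("B", "update from B") // no subscribers: not delivered to A
        b.Publish("D", "update from D") // matches A's subscription

        fmt.Println("A received:", <-fromD) // step 4
    }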
In practice: The publish-subscribe approach is widely deployed in production
systems. Apache Kafka is a publish-subscribe platform for message streams. It is
suitable for systems that produce and reliably process large quantities of messages.
Kafka is part of a larger ecosystem. It can be used with other systems such as
Hadoop, Spark, Storm, Flink, and others.
2.3 The Relation Between Clock Synchronization, Consensus, and Consistency
At first glance, clock synchronization, consensus, and consistency seem very
different from each other. Yet, they each try to get multiple components
in a distributed system to agree on a particular value. This forms the basis of our
conceptual framework for these functional requirements.
Figure 1. Comparison of functional requirements.
Clock synchronization focuses on an agreement between components about the
time, which is a single, numerical value, which changes continuously but
unidirectionally (it only increases if one counts the elapsed number of milli- or
microseconds since a commonly agreed start time, as computer systems do) and
monotonically (with the clock frequency). We also observe that a synchronized clock
enables establishing a happens-before relationship between events recorded in the
system; even without a physical clock, if we can otherwise establish this
relationship, we have the equivalent of a logical clock. Using the happens-before
relationship between any two events, we can create a total order of these events.
Consensus can focus on any value, which, unlike the clock's value, need not be
numerical. More importantly, the value subject to consensus does not need to
change as clocks do; in fact, it may not change at all. Consensus may focus on a
single value but, by reaching consensus repeatedly, can also enable a total ordering
of events; such an approach, however, is expensive in time and resources.
Consistency focuses on any value from the many included in a dataset, creating a
flexible order. Consistency protocols in distributed systems define the kind of
order that can be achieved, for example, total order, and, more loosely, when the
order will be achieved, for example, after each operation, after some guaranteed
maximum number of operations, or eventually. Ordering events in a form weaker
than total ordering, and even tolerating some discrepancies between how different
components see the values in the data store, are useful for different classes of
applications, because weaker guarantees can often be achieved much faster and
with much more scalable techniques.
2.4 Consensus
Coordination, with a Focus on Consensus
A simple program performs a series of operations to obtain some desired result. In
distributed systems, these operations are performed by multiple machines
communicating through unreliable network environments. While the system is
running, machines must coordinate to operate correctly. For example, the system
may need to agree on whether a certain action happened - to reach a consensus on
this question.
2.4.1 What is Consensus?
Consider a distributed key-value store that uses replication. Users submit read and
write operations to whichever process is closest to them, reducing latency. To give
the user the illusion of a single system, the processes must agree on the order to
perform the queries, and especially keep the results of writing (changing) data and
reading data in the correct order. Using clock synchronization techniques could
work for this, but the cost of having every machine in the distributed system ask
every other machine, for each operation, whether the operations it received imply
some other order is prohibitive in both resources and time. Instead, another
class of techniques focuses directly on the consensus problem.
In a distributed system,
Consensus is the ability to have all machines agree on a value. Consensus
protocols ensure this ability.
For a protocol to ensure consensus, three things have to happen.
• First, the system as a whole must agree on a decision.
• Second, machines do not apply decisions that are not agreed upon.
• Third, the system cannot change a decision.
Consensus protocols (distributed algorithms) can create a total order of operations
by repeatedly agreeing on what operation to perform next.
Why Consensus is impossible in some Circumstances
Theoretical computer science has considered for many decades the problem of
reaching consensus. When machine failures can occur, reaching consensus is
surprisingly difficult. If the delay of transmitting a message between machines is
left unbounded, it has been proved that, even when using reliable networks, no
distributed consensus protocol is guaranteed to complete. The proof is known as the FLP
proof, after the acronym of the family names of its creators. It can be found in the
aptly named article “Impossibility of Distributed Consensus with One Faulty
Process” [1].
[1] The FLP impossibility result, credited to Fischer, Lynch, and Paterson, proves
that in a fully asynchronous distributed system where even a single process may
have a crash failure, it is impossible to have a deterministic algorithm for
achieving consensus.
Suppose the claim is not true: there exists a consensus protocol, a
distributed algorithm that always reaches consensus in bounded time. For the
algorithm to be correct, all machines that decide on a value must decide on the
same value. This prevents the algorithm from simply letting the machines guess a
value. Instead, they need to communicate to decide which value to choose.
This communication is done by sending messages. Receiving, processing, and
sending messages makes the algorithm progress toward completion. At the start of
the algorithm, the system is in an undecided state. After exchanging a certain
number of messages, the algorithm decides. After a decision, the algorithm - and
the system - can no longer “change its mind.” The FLP proof shows that there is no
upper bound on the number of messages required to reach consensus.
General Consensus and an Approach to Reach It
To achieve consensus, consensus protocols must have two properties:
1. Safety, which guarantees "nothing incorrect can happen". The consensus
protocol must decide on a single value, and cannot decide on two values, or
more, at once.
2. Liveness, which guarantees "something correct will happen, even if only
slowly". The consensus protocol, left without hard cases to address (for
example, no failures for some amount of time), can and will reach its
decision on which value is correct.
Many protocols have been proposed to achieve consensus, differing in the
forms of failures and messaging delays they tolerate, among other capabilities.
Among the protocols that are used in practice, Paxos, multi-Paxos, and more
recently Raft seem to be very popular. For example, etcd is a distributed database
built on top of the Raft consensus algorithm. Its API is similar to that of Apache
ZooKeeper (a widely-used open-source coordination service), allowing users to
store data in a hierarchical data-structure. Etcd is used by Kubernetes and several
other widely-used systems to keep track of shared state.
We sketch here the operation of the Raft approach to reach consensus. Raft is a
consensus algorithm specifically designed to be easy to understand. Compared to
other consensus algorithms, it has a smaller state space (the number of
configurations the system can have), and fewer parts.
Figure 3. Raft overview.
Figure 3 gives a Raft overview. There are four main components:
1. Raft first elects a leader ("leader election" in Figure 3). The other machines
become followers. Once a leader has been elected, the algorithm can start
accepting new log entries (data operations).
2. The log (data) is replicated across all the machines in the system ("log
replication" in the figure).
3. Users send new entries only to the leader.
4. The leader asks every follower to confirm the new entry. If a majority
confirms, the entry is committed and the operation is performed.
We describe three key parts of Raft. These do not form the entirety of Raft, which
is indicative that even a consensus protocol designed to be easy to understand still
has many aspects to cover.
The Raft leader election: Having a leader simplifies decision-making. The leader
decides on the values. The other machines are followers, accepting all decisions
from the leader. Easy enough. But how do we elect a leader? All machines must
agree on who the leader is—leader election requires reaching consensus, and must
have safety and liveness properties.
In Raft, machines can try to become the new leader by starting an election. Doing
so changes their role to candidate. Leaders are appointed until they fail, and
followers only start an election if they believe the current leader to have failed. A
new leader is elected if a candidate receives the majority of votes. With one
exception, which we discuss in the section on safety below, followers always vote
in favor of the candidate.
Raft uses terms to guarantee that voting is only done for the current election, even
when messages can be delayed. The term is a counter shared between all machines.
It is incremented with each election. A machine can only vote once for every term.
If the election completes without selecting a new leader, the next candidate
increments the term number and starts a new election. This gives machines a new
vote, guaranteeing liveness. It also allows distinguishing old from new votes by
looking at the term number, guaranteeing safety.
An election is more likely to succeed if there are fewer concurrent candidates. To
this end, candidates wait a random amount of time after a failed election before
trying again.
Log replication: In Raft, users only submit new entries to the leader, and log
entries only move from the leader to the followers. Users that contact a follower
are redirected to the leader.
Figure 4. Log replication in Raft. The crown marks the leader.
New entries are decided, or “chosen,” once they are accepted by a majority of
machines. As Figure 4 illustrates, this happens in a single round-trip: (a) the leader
propagates the entries to the followers, and (b) the leader counts the votes and
accepts the entry only if a majority in the system voted positively.
Log replication is relatively simple because it uses a leader. Having a leader
means, for example, that there cannot be multiple log entries contending for the
same place in the log.
Safety in Raft: Electing a leader and then replicating new entries is not enough to
guarantee safety. For example, it is possible that a follower misses one or multiple
log entries from the leader, the leader fails, the follower becomes a candidate and
becomes the new leader, and finally overwrites these missed log entries.
(Sequences of events that can cause problems are a staple of consensus-protocol
analysis.) Raft solves this problem by setting restrictions on which machines may
be elected leader. Specifically, machines vote “yes” for a candidate only if that
candidate’s log is at least as up-to-date as theirs. This means two things must hold:
1. The candidate’s last log term must be at least as high as the follower’s, and
2. The candidate’s last log index must be at least as high as the follower’s.
When machines vote according to these rules, it cannot occur that an elected leader
overwrites chosen (voted upon) log entries. It turns out this is sufficient to
guarantee safety; additional information can be found in the original article.
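A minimal Go sketch of this voting rule (field names are illustrative; real Raft
implementations track more state): the comparison is on the term of the last log
entry first, then on its index.

    package main

    import "fmt"

    // LogPosition summarizes the last entry in a machine's log.
    type LogPosition struct {
        Term  int // term of the last log entry
        Index int // index of the last log entry
    }

    // upToDate reports whether the candidate's log is at least as up to date
    // as the voter's: compare last terms first, then last indices.
    func upToDate(candidate, voter LogPosition) bool {
        if candidate.Term != voter.Term {
            return candidate.Term >= voter.Term
        }
        return candidate.Index >= voter.Index
    }

    func main() {
        voter := LogPosition{Term: 3, Index: 7}
        fmt.Println(upToDate(LogPosition{Term: 3, Index: 9}, voter))  // true: vote "yes"
        fmt.Println(upToDate(LogPosition{Term: 2, Index: 42}, voter)) // false: stale term
    }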
2.5 Consistency in Distributed Systems
The Data Store
The essence of any discussion about consistency is the abstract notion of the data
store. Data stores can differ when servicing diverse applications, types of
operations, and kinds of transactions, but essentially a data store:
1. Stores data across multiple machines (replicas),
2. Receives a stream of various operations, of which the most common are Read
(query data) and Write (update data); it is common for data stores to support
only a few operations, sometimes only Reads and Writes,
3. Enforces that the operations execute correctly, that is, delivering consistent
results, and
4. In practice, also supports other, functional and non-functional, requirements.
Figure 1. Data store with a single primary user.
Many applications only have a single primary user. You are likely the only one
accessing your email, for business or leisure. You may have a private Dropbox
folder, which you may want to access at home, on the train, wherever you stay
long enough to want to store new photos, etc. Many mobile-first users recognize
these and similar applications. Figure 1 depicts the data store for the single primary
user. Here, the user can connect from one location (or device), write new
information - a new email, a new Dropbox file, then disconnect. After moving to a
new location (or device), and reconnecting, the user should be able to resume the
email and access the latest version of the file.
Other applications have multiple users, writing together information to the same
shared document, changing together the state of an online game, making together
transactions affecting many shared accounts in a large data management system,
etc. Here, the data store again has to manage the data-updates, and deliver correct
results when users query (read).
2.5.1 What is Consistency?
The main goal of consistency is:
Goal: achieving consistency, which means establishing and enforcing a mutual
agreement between the data store and its client or clients, on the expected effect of
system operations on data.
In a distributed system, achieving consistency falls upon the consistency model
and consistency (enforcing) mechanisms.
• Consistency models determine which data and operations are visible to a
user or process, and which kind of read and write operations are supported
on them.
• Consistency mechanisms update data between replicas to meet the
guarantees specified by the model.
Classes of consistency models: The consistency model offers guarantees, but
outside the guarantees, almost anything is allowed, even if it seems counter-
intuitive.
We identify two main classes of consistency models:
1. Strong consistency: an operation, particularly a query, can return only a
consistent state.
2. Weak consistency: an operation, particularly a query, can return an inconsistent
state, but there is an expectation that there will be a moment when a consistent
state is returned to the client. Sometimes, the model guarantees which moment,
or which (partial) state.
The strictest forms of consistency are so costly to maintain that, in practice, there
may be some tolerance for a bit of inconsistency after all. The CAP theorem
suggests availability may suffer under these strict models, and the PACELC
framework further suggests that performance also trades off against how strict the
consistency model is.
Weak Consistency Models
Many views on consistency models exist. Traditional results from theoretical
computer science and formal methods indicate
• What single-operation, single-object guarantees and
• What multi-operation, multi-object guarantees of consistency can be given for data
stores.
Notions of (i) linearizability and (ii) serializability emerged to indicate that Write
operations can appear instantaneous while a real-time or an arbitrary total order,
respectively, is enforced.
Building from these results,
(1) in operation-centric consistency models, a single client can access a single data
object,
(2) in transaction-centric consistency models, multiple clients can access any of the
multiple data objects, and
(3) in application-centric consistency models, specific applications can tolerate
some inconsistency or have special ways to avoid some of the costly update
operations.
Operation-Centric Consistency Models (single client, single data object, data
store with multiple replicas):
Several important models emerged in the past four decades, and more may
continue to emerge:
Sequential consistency: All replicas see the same order of operations as all other
replicas. This is desirable, but of course prohibitively expensive.
What other operation-centric consistency models can designers use?
Causal consistency weakens the guarantees of sequential consistency, but also its
costs of operation: as in sequential consistency, causally related operations
must still be observed in the same order by all replicas. However, for other
operations that are not causally related, different replicas may see a different order
of operations and thus of outcomes. Important cases of causal consistency, with
important applications, include (a code sketch follows the list):
1. Monotonic Reads: Subsequent reads by the same process always return a
value that is at least as recent as a previous read. Important applications
include calendars, inventories in online games, etc.
2. Monotonic Writes: Subsequent writes by the same process follow each other
in that order. Important applications include email, coding on multiple
machines, your bank account, bank accounts in online games, etc.
3. Read Your Writes: A client that writes a value, upon reading it will see a
version that is at least as recent as the version they wrote. Updating a
webpage should always, in our expectation, make the page refresh show the
update.
4. Writes Follow Reads: A client that first reads and then writes a value, will
write to the same, or a more recent, version of the value it read. Imagine you
want to post a reply on social media. You expect this reply to appear
following the post you read.
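A minimal Go sketch of enforcing one of these session guarantees, monotonic
reads, on the client side, assuming each replica exposes a version number
alongside the value (all names are illustrative):

    package main

    import "fmt"

    // Replica returns its current value together with a monotonically
    // increasing version number.
    type Replica struct {
        Value   string
        Version int
    }

    // Session tracks the most recent version this client has observed.
    type Session struct {
        lastSeen int
    }

    // Read enforces monotonic reads: a replica that is behind what the
    // session already saw is rejected, and the caller can try another one.
    func (s *Session) Read(r Replica) (string, bool) {
        if r.Version < s.lastSeen {
            return "", false // stale replica: would violate monotonic reads
        }
        s.lastSeen = r.Version
        return r.Value, true
    }

    func main() {
        s := &Session{}
        fresh := Replica{Value: "meeting at 10:00", Version: 5}
        stale := Replica{Value: "meeting at 9:00", Version: 3}

        if v, ok := s.Read(fresh); ok {
            fmt.Println("read:", v) // version 5 is recorded
        }
        if _, ok := s.Read(stale); !ok {
            fmt.Println("rejected stale replica (version 3 < 5)")
        }
    }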
Causal consistency is still remarkably difficult to ensure in practice. What could
designers use that is so lightweight it can scale to millions of clients or more? The
key difficulty in scaling causal consistency is that updates that require multiple
replicas to coordinate could hit the system concurrently, effectively slowing it
down to non-interactive response times and breaking scalability requirements. A
consistency model that can delay when replicas need to coordinate would be very
useful to achieve scale.
Eventual consistency: Absent new writes, all replicas will eventually have the same
contents. Here, the coordination required to achieve consistency can be delayed
until the system is less busy, which may mean indefinitely in a very crowded
system; in practice, many systems are not heavily overloaded much of the time,
and eventual consistency can achieve good consistency results in a matter of
minutes or hours.
Application-Centric Consistency Models: We only sketch the principle of
operation of these consistency models.
Under special circumstances, there is no need for the heavy, scalability-breaking
coordination that we saw operation-centric consistency models require to ensure
consistency (transaction-centric consistency models, intuitively, require even
heavier coordination). Identifying such circumstances in general has proven
very challenging, but good patterns have emerged for specific (classes of)
applications.
Applications where small inconsistencies can be tolerated include social media,
where for example information can often be a bit stale (but not too much!) without
much impact, online gaming, where slight inconsistencies between the positions of
in-game objects can be tolerated (but large inconsistencies cannot), and even
banking where inconsistent payments are tolerated as long as their sum does not
exceed the maximum amount allowed for the day. Consistency models where
limited inconsistency is allowed, but also tracked and not allowed to go beyond
known bounds, include conits (we discuss them in the next section).
Conflict-free Replicated Data Types (CRDTs) focus on efficiently reconciling
inconsistency situations. To this end, they are restricted to data types that only
allow monotonic operations, and whose replicas can always be correctly reconciled
by taking the union of operations across all replicas. For example, suppose the
data-object represents a set, to which items can only be added. In this case, it does
not matter in which order the objects are added, or which replica executes the
operation of addition. In the end, the correct set is obtained by executing every
addition operation in the system across all replicas. In this example, removal or
modification of an object would not be allowed because they are not monotonic.
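A minimal Go sketch of one of the simplest CRDTs, a grow-only set; the merge is
the union described above, and the element type is illustrative:

    package main

    import "fmt"

    // GSet is a grow-only set: Add is the only mutating operation, which
    // makes replicas trivially mergeable by taking their union.
    type GSet struct {
        elems map[string]bool
    }

    func NewGSet() *GSet { return &GSet{elems: make(map[string]bool)} }

    func (s *GSet) Add(e string) { s.elems[e] = true }

    func (s *GSet) Contains(e string) bool { return s.elems[e] }

    // Merge reconciles two replicas; the order of merging never matters,
    // because union is commutative, associative, and idempotent.
    func (s *GSet) Merge(other *GSet) {
        for e := range other.elems {
            s.elems[e] = true
        }
    }

    func main() {
        // Two replicas receive different additions concurrently.
        a, b := NewGSet(), NewGSet()
        a.Add("sword")
        b.Add("shield")

        // After exchanging state in any order, both converge to the same set.
        a.Merge(b)
        b.Merge(a)
        fmt.Println(a.Contains("shield"), b.Contains("sword")) // true true
    }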
We conclude by observing the designer of distributed systems and applications
must have at least a basic grasp of consistency, and of known classes of
consistency models with a proven record for exactly that kind of system or
application. This can be a challenging learning process, and mistakes are costly.
2.5.2 Consistency for Online Gaming, Virtual Environments, and the
Metaverse
Dead Reckoning
One of the earliest consistency techniques in games is dead reckoning. The
technique addresses the key problem that information arriving over the network
may be stale by the moment of arrival due to network latency. The main intuition
behind this technique is that many values in the game follow a predictable
trajectory, so updates to these values over time can largely be predicted. Thus, as a
latency-hiding technique, dead reckoning uses a predictive technique, which
estimates the next value and, without new information arriving over the network
from the other nodes in the distributed system, updates the value to match the
prediction.
Although players are not extremely sensitive to the accuracy of updates (as long
as the updated values seem to follow an intuitive trajectory, they experience the
game as smooth), they are sensitive to jumps in values. Thus, when the locally
predicted values and the values arriving over the network diverge, dead reckoning
cannot simply replace the local value with the newly arrived one; such an
approach would lead to value jumps that disturb the players. Instead, dead
reckoning interpolates the locally predicted and the newly arrived values, using a
convergence technique.
The interplay between the two techniques, the predictive and the convergence,
makes dead reckoning an eventually consistent technique, with continuous updates
and managed inconsistency.
Advantages: Although using two internal techniques may seem complex, dead
reckoning is a simple technique with excellent properties when used in distributed
systems. It is also mature, with many decades of practical experience already
available.
For many gaming applications, trajectories are bounded by limitations on allowed
operations, so the numerical inconsistency can be quantified as a function of the
staleness of information.
Drawbacks: As a significant drawback, dead reckoning works only for applications
where the two techniques, especially the predictive, can work with relatively low
overhead.
Figure 1. Example of dead reckoning.
Example: Figure 1 illustrates how dead reckoning works in practice. In this
example, an object is located in a 2D space (so, has two coordinates), in which it
moves with physical velocity (so, a 2D velocity vector expressed as a pair). The
game engine updates the position of each object after each time tick, so at time t=0,
t=1, t=2, etc. In the example, the local game engine receives an update about the
object, at t=0; this update positions the object at position (0,0), with velocity (2,2).
The dead reckoning predictor can easily predict the next positions the object will
take during the next time ticks: (2,2) at t=1, (4,4) at t=2, etc. If the local game
engine receives no further updates, this predictor can continue to update the object,
indefinitely.
However, the object, controlled remotely, moves differently, and the local game
engine receives at t=1 an update that the object is now located at (3,1), with the
new velocity (4,2). The game engine has already updated the local value, to (2,2).
For the next tick, t=2, the dead reckoning technique must interpolate between the
locally predicted value (4,4) and the value derived from the received update (7,3).
If it simply replaced the predicted value with the value derived from the received
update, the player would observe a sudden jump, because the predicted value
follows the intuitive path of the previous values the player observed locally,
whereas the value derived from the received update does not. Instead, the dead
reckoning technique computes interpolated values, which smoothly converge to
the correct value if no new information is received.
If the local game engine keeps receiving new information, dead reckoning ensures
a state of smooth inconsistency, which the players experience positively.
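A minimal Go sketch of the two cooperating techniques, following the numbers of
the example above (the linear motion model and the blending factor alpha are
illustrative choices):

    package main

    import "fmt"

    type Vec struct{ X, Y float64 }

    func add(a, b Vec) Vec { return Vec{a.X + b.X, a.Y + b.Y} }

    // predict extrapolates the next position from the last known position
    // and velocity: the predictive technique.
    func predict(pos, vel Vec) Vec { return add(pos, vel) }

    // converge blends the locally predicted position toward the position
    // derived from a network update, avoiding a visible jump: the
    // convergence technique. alpha in (0,1] controls how fast we converge.
    func converge(local, remote Vec, alpha float64) Vec {
        return Vec{
            X: local.X + alpha*(remote.X-local.X),
            Y: local.Y + alpha*(remote.Y-local.Y),
        }
    }

    func main() {
        // As in Figure 1: at t=0 the object is at (0,0) with velocity (2,2).
        pos, vel := Vec{0, 0}, Vec{2, 2}
        pos = predict(pos, vel) // t=1: predicted (2,2)

        // Update received at t=1: object actually at (3,1), velocity (4,2).
        remotePos, remoteVel := Vec{3, 1}, Vec{4, 2}

        // t=2: local prediction says (4,4); the update implies (7,3).
        local := predict(pos, vel)
        remote := predict(remotePos, remoteVel)
        pos = converge(local, remote, 0.5) // smooth step toward (7,3)
        fmt.Printf("t=2 shown position: (%.1f,%.1f)\n", pos.X, pos.Y) // (5.5,3.5)
    }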
Lock-step Consistency
Toward the end of 1997, multiplayer gaming was already commonplace, and
games like Age of Empires were launched with much acclaim and sold to millions.
The technical conditions were much improved over the humble beginnings of such
games, around the 1960s for small-scale online games and through the 1970s for
large-scale games with hundreds of concurrent players (for example, in the
PLATO metaverse). Players could connect with the main servers through high-
speed networks... of 28.8 Kbps, with connections established over dial-up (phone)
lines with modems. So, following a true Jevons' paradox, gaming companies
developing real-time strategy games focused on scaling up, from a few tens of
units to hundreds, per player.
Consequently, the network became a main bottleneck - sending around
information about hundreds to thousands of units (location, velocity, direction,
status, and other tracked variables), about 20 times per second as required in this
game genre at the time, would quickly exceed the limit of about 3,000 bytes per
second. To expert designers, these network conditions could support a couple of
hundred units, but not 1,000. In a game like Age of Empires, the target set by
designers was even higher: 1,500 units across 8 players. How to ensure consistency
under these circumstances? (Similar situations continue to occur: For each
significant advance in the speed of the network and the processing power of the
local gaming rig, game developers embark again on new games that quickly
exceed the new capabilities.)
One more ingredient is needed to have a game where the state of every unit -
location, appearance, activity, etc. - appears consistent across all players: the state
needs to be the same at the same moment because players are engaged in a
synchronous contest against each other. So, the missing ingredient is a
synchronized clock linked to the consistency process.
Lock-step consistency occurs when simulations progress at the same rate and
achieve the same status at the end (or start) of each step (time tick).
One approach to achieve lock-step consistency is for all the computers in the
distributed system running the game to synchronize their game clocks. Players
input their commands to their local game engines, which communicate them over
the network to all other game engines. Then, every game engine updates the local
status based on the input received, either
locally or over the network. The game moves in lock-step, and each step consists
of the sequence: input, communication, then local updates that include the input.
A main benefit of this approach is that it trades off communication for
local computation: the communication part is reduced to only the necessary
updates, such as player inputs, and the game engines recompute the state of the
game using dead reckoning and the inputs. The network bandwidth is therefore
sufficient for a game like Age of Empires with 1,500 moving units.
As a drawback, this approach uses a sequence of three operations, which is
prohibitive when the target is to complete all of them in under 50 milliseconds (to
enable updates 20 times per second, as described earlier in the section). Suppose
performance variability occurs in the distributed system, either in transferring data
over the Internet or in computing the updates, for any of the players. In this case,
the next step either cannot complete in time or has to wait for the slowest player to
complete (lock-step really means the step is locked until everyone completes it).
Another approach pipelines the communication and computation processes, that is,
it updates the state while receiving input from players. To prevent inconsistent
results, this approach again uses time ticks, and input received during one step is
always enforced two steps later, as sketched below.
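A minimal Go sketch of this input-delay scheme, in which commands issued
during turn T are executed at turn T+2 (the command format and the turn loop are
illustrative):

    package main

    import "fmt"

    const inputDelay = 2 // input from turn T is enforced at turn T+2

    func main() {
        // scheduled[t] holds the commands to execute at turn t.
        scheduled := map[int][]string{}

        for turn := 0; turn < 5; turn++ {
            // Receive this turn's local input and schedule it two turns
            // ahead; meanwhile it is sent to the other players, who will
            // schedule it for the same future turn.
            cmd := fmt.Sprintf("move-order@%d", turn)
            scheduled[turn+inputDelay] = append(scheduled[turn+inputDelay], cmd)

            // Execute the commands agreed upon two turns ago; every game
            // engine executes the same commands at the same turn.
            fmt.Printf("turn %d executes %v\n", turn, scheduled[turn])
        }
    }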
Advantages: Such an approach guarantees that performance variability in
communicating input between players can be tolerated, as long as it is not longer
than twice the step duration (so, 400 ms for the typical tick duration of 200 ms).
The tolerance threshold of 400 ms was not chosen lightly and corresponds to the
tolerance to latency players exhibit for this game genre, which has been reported
by many empirical studies and summarized by meta-studies such as [2]. In other
words, for this game genre players still enjoy their in-game experience, even when
a latency of 400 ms is added to their input, as long as the results are smooth and
consistent.
Disadvantages: As with the first approach, performance variability, predominantly
caused by the slowest computer in the distributed game or by the laggiest Internet
connection, can cause problems. The problems occur only when the performance
variability is extreme, closer to 1,000 ms than to 400 ms above the typical
performance at the time.
A third approach improves on the second by allowing turns to have variable
lengths and thus match performance variability. This approach works like the
second whenever performance is stable near normal levels: the turn length stays at
200 ms, with ticks for communication and computation set at 50 ms. Whenever
performance degrades, this approach provides a technique to lengthen the step
duration, typically up to 1,000 ms; beyond this value, empirical studies indicate
the game becomes much less enjoyable. Not only does the turn lengthen when
needed, but its allocation between computation and communication tasks,
alongside local updates and rendering, also adapts. This approach allocates, from
the total turn duration, more time for computation to accommodate a slower
computer among the players, or more time for communication to accommodate
slow Internet connections.
To make decisions on turn duration, and on its specific allocation to
communication and computation tasks, the system uses a distributed monitoring
technique: each player reports to the leader (the host in the typical Age of
Empires deployment, an elected leader in the peer-to-peer deployment), during
each turn, the duration of its local computation task and the latency it observed
when sending a ping message to each other player. The leader then computes the
maximum of the received values for the computation and communication tasks,
and makes appropriate decisions. A typical situation occurs when the Internet
latency increases, for example to 500 ms, with the turn length increasing
correspondingly. In another typical situation, some computation tasks take longer
than usual, for example, 95 ms; the turn then stays at the normal duration, 200 ms,
but inside it the computation tick is increased to 100 ms.
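The leader's decision could look like the following Java sketch; the sizing rule
and the record names are illustrative, loosely mirroring the constants above:

import java.util.Collection;

// Sketch of the leader's decision: take the worst reported computation time
// and ping latency across players, then size the turn and its ticks.
class TurnPlanner {
    static final long MAX_TURN_MS = 1_000;   // beyond this, enjoyment drops

    record Report(long computeMs, long worstPingMs) {}
    record Plan(long turnMs, long computeTickMs, long commTickMs) {}

    static Plan plan(Collection<Report> reports) {
        // The worst case across all players decides the bottleneck.
        long compute = reports.stream().mapToLong(Report::computeMs).max().orElse(50);
        long comm = reports.stream().mapToLong(Report::worstPingMs).max().orElse(50);
        // Round each tick up to a 50 ms multiple, keep the 200 ms turn when
        // possible, and never exceed the 1,000 ms empirical bound.
        long computeTick = roundUp50(compute);
        long commTick = roundUp50(comm);
        long turn = Math.min(MAX_TURN_MS, Math.max(200, computeTick + commTick + 100));
        return new Plan(turn, computeTick, commTick);
    }

    static long roundUp50(long ms) { return ((ms + 49) / 50) * 50; }
}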
Beyond mere lock-step consistency: Lock-step approaches still suffer from a
major drawback: when every player has to simulate every input locally, the
amount of computation can quickly overwhelm slower computers, especially when
game designers intend to scale well beyond 8 players, possibly to tens, hundreds,
or thousands for real-time strategy games.
To address this, we first partition the virtual world into areas, so that the game
engine can select only those of interest to each player. Second, the game engine
updates areas judiciously. Some areas receive no updates because no player is
interested in them. Areas interesting to only one player are updated on that
player's machine. Each area that is interesting to two or more players is updated
with lock-step or communication-only consistency protocols, depending on the
computation and communication capabilities of the players interested in the area.
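A sketch of this per-area selection, where Area and the update routines are
hypothetical placeholders:

import java.util.Set;

// Sketch of per-area update selection; Area and the routines are illustrative.
class AreaUpdater {
    record Area(Set<String> interestedPlayers) {}

    static void updateArea(Area area) {
        int n = area.interestedPlayers().size();
        if (n == 0) {
            return;                        // nobody cares: skip the update
        } else if (n == 1) {
            updateLocally(area);           // only that player's machine simulates
        } else {
            runConsistencyProtocol(area);  // lock-step or communication-only,
        }                                  // chosen by players' capabilities
    }
    static void updateLocally(Area a) { /* local simulation only */ }
    static void runConsistencyProtocol(Area a) { /* shared-area protocol */ }
}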
In summary: Trading off communication for computation is a typical problem for
online games, virtual environments, and metaverses. Lock-step consistency
provides a solution based on this trade-off, with many desirable properties. Still,
lock-step consistency is challenging when the system exhibits unstable behavior,
such as performance variability, and in production games must often cope with
unstable behavior. Monitoring the system carefully while it operates, and
conducting careful empirical studies of how players experience the game under
different levels of performance, is then essential to addressing the unstable
behavior satisfactorily.
Conit-based Consistency
Although lock-step consistency is useful, in games where many changes do not fit
local predictors, and for which dead reckoning and other computationally efficient
techniques are therefore difficult to find, it is better to allow some inconsistency
to occur when scaling the virtual world. In particular, games such as Minecraft
could benefit from this.
Conits, an abbreviation of consistency unit, have been designed to support
consistency approaches where inconsistency can occur but should be quantified
and managed. In the original design by Yu and Vahdat [1], conits quantify three
dimensions of inconsistency:
• Staleness - how old is this update?
• Numerical error - how large is the impact of this update?
• Order error - how does this update relate to other updates?
Any conit-based consistency protocol uses at least one conit to capture the
inconsistency in the system along these three dimensions. Elapsed time and data-
changing operations update the conit state, typically increasing the inconsistency
values along one or more dimensions. At runtime, when a limit on inconsistency
set by the system operators is exceeded, the system triggers a consistency-
enforcing protocol, and the conit is reset to (near-)zero inconsistency across all
dimensions.
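A minimal Java sketch of such a conit, tracking the three dimensions from Yu and
Vahdat [1]; the thresholds and update rules here are illustrative, not from the
original paper:

// Minimal sketch of a conit tracking the three inconsistency dimensions.
class Conit {
    private double staleness;      // time since the oldest unseen update
    private double numericalError; // magnitude of unapplied changes
    private int orderError;        // tentative, possibly reordered updates

    private final double maxStaleness, maxNumerical;
    private final int maxOrder;

    Conit(double maxStaleness, double maxNumerical, int maxOrder) {
        this.maxStaleness = maxStaleness;
        this.maxNumerical = maxNumerical;
        this.maxOrder = maxOrder;
    }

    // Time passing and data-changing operations grow the inconsistency.
    void onTimeElapsed(double seconds) { staleness += seconds; maybeEnforce(); }
    void onUpdate(double impact) {
        numericalError += impact;
        orderError += 1;
        maybeEnforce();
    }

    private void maybeEnforce() {
        if (staleness > maxStaleness || numericalError > maxNumerical
                || orderError > maxOrder) {
            enforceConsistency();                 // application-specific sync
            staleness = numericalError = 0;       // reset to (near-)zero
            orderError = 0;
        }
    }
    private void enforceConsistency() { /* run the consistency protocol */ }
}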
Conits provide a versatile base for consistency approaches. Still, they have so far
not been used much in practice, for two main reasons. First, not many applications
exist that would tolerate significant amounts of inconsistency. Second, setting the
thresholds beyond which consistency must be enforced is error-prone and
application-dependent.
2.6 Replication in Distributed Systems in More Detail
To provide a seamless experience to their users, distributed systems often rely on
data replication. Replication allows companies such as Amazon, Dropbox, Google,
and Netflix to move data close to their users, significantly improving non-
functional requirements such as latency and reliability.
We study in this section what replication is and what the main concerns are for
the designer when using replication. One of the main such concerns, the
consistency of data across replicas, relates to an important functional requirement
and will be the focus of the next sections in this module.
What is Replication?
The core idea of replication is to repeat essential operations by duplicating,
triplicating, and generally multiplying the same service or physical resource, thread
or virtual resource, or, at a finer granularity and with a higher level of abstraction,
data or code (computation).
Figure 1. Data and service replication, in time and in space.
Like resource sharing, replication can occur (i) in time, where multiple replicas
(instances) co-exist on the same machine (node), simultaneously, or (ii) in space,
where multiple instances exist on multiple machines. Figure 1 illustrates how data
or services could be replicated in time or in space. For example, data replication in
space (Figure 1, bottom-left quadrant) places copies of the data from Node 1 on
several other nodes, here, Node 2 through n. As another example, service
replication in time (Figure 1, top-right quadrant) launches copies of the service on
the same node, Node 1.
To clarify replication, we must further distinguish it from other operational
techniques that use copies of services, physical resources, threads, virtual
resources, data, computation, etc. Replication differs from other techniques in
many ways, including:
1. Unlike partitioning data or computation, replication makes copies of (and
then uses) entire sources, that is, entire datasets, entire compute tasks, etc.
(A more selective variant of replication makes copies of entire sources only
if the sources are considered important enough.)
2. Unlike load balancing, replication makes copies of the entire workload.
(Selective replication only makes copies of the part of the workload
considered important enough.)
3. Unlike data persistence, checkpointing, and backups, replication techniques
repeatedly act on the replicas, and access to the source replica is similar to
accessing the other replicas.
4. Unlike speculative execution, replication techniques typically consider
replicas as independent contributors to overall progress with the workload.
5. Unlike migration, replication techniques continue to use the source.
General benefits of replication
Replication can increase performance. When more replicas can service users, if
each can deliver roughly the performance of the source replica, the service
effectively increases its performance linearly with the number of replicas; in such
cases, replication also increases the scalability of the system. For example, grid and
cloud computing systems replicate their servers, thus allowing the system to scale
to many users with similar needs.
When replicating in space, because the many nodes are unlikely to all be affected
by the same performance issue when completing a share of the workload, the entire
system delivers relatively stable performance; in this case, replication also
decreases performance variability.
Geographical replication, where nodes can be placed close to users, can lead to
important performance gains, governed by the laws of physics, particularly the
speed of light.
Replication can lead to higher reliability and to what practice considers high
availability: in a system with more replicas, more of them need to fail before the
entire system becomes unavailable, relative to a system with only one replica. The
danger of a single point of failure (see also the discussion about scheduler
architectures, in Module 4) is alleviated.
Consider one of the largest outages in the past year, occurring at Microsoft's
Teams, Xbox Live, and Azure services, an outage so serious it made the news, as
reported by the BBC [1]. According to the BBC, "tens of thousands of users" self-
reported failures, and Microsoft admitted the failures touched "a subset of users".
However, the Microsoft approach of replicating their services, both in time and in
space, allowed most of its hundreds of millions of users to continue working, even
while the others were experiencing failures when accessing the replicas servicing
them. The many replicas of the same service prevent partial failures from
spreading system-wide. In one of our studies [2], we observed many of these
situations across the services provided by Microsoft, Amazon, Google, Apple, and
others; some of these failures are not considered important enough to be reported
by these companies.
General drawbacks of replication:
In high-availability services, replication is a common mechanism for keeping the
system available. Typically, a single replica provides the service, with the others
available if the primary replica fails. Such systems cost more to operate. (The
extra cost may be worthwhile. For example, many business-critical services run
with at least two replicas; although the cost of operating the IT infrastructure
effectively doubles, the higher likelihood that the services will be available when
needed prevents much more costly situations in which the service is needed but
not available. There is also a reputational cost at play, where the absence of
service may cause bad publicity costing well beyond the cost of the service.)
When multiple replicas can perform the same service concurrently, their local state
may become different, a consequence of the different operations performed by
each replica. In this situation, if the application cannot tolerate the inconsistency,
the distributed system must enforce a consistency protocol to resolve the
inconsistency, either immediately, at some point in time but with specific
guarantees, or eventually. As explained during the introduction, the CAP theorem
indicates consistency is one of the properties of distributed systems that cannot be
easily achieved, and in particular it presents trade-offs with availability (and
performance, as we will learn at the end of this module). So, this approach may
offset and even negate some of the benefits discussed earlier in this section.
Replication Approaches
In a small-scale distributed system, replication is typically achieved by executing
the incoming stream of tasks (requests in web and database applications, jobs in
Module 4) either (i) passively, where the execution happens on a single replica,
which then broadcasts the results to the others, or (ii) actively, where each replica
receives the input stream of tasks and executes it. However, many more
considerations appear as soon as the distributed system becomes larger than a few
nodes serving a few clients.
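The contrast can be sketched as follows in Java, with a hypothetical Replica
interface standing in for a real service:

import java.util.List;

// Hypothetical replica abstraction; the two methods are illustrative.
interface Replica {
    String execute(String task);          // run the task locally
    void installResult(String result);   // overwrite state with a received result
}

class ReplicationModes {
    // Passive: one replica executes, the others receive only the result.
    static void passive(String task, Replica primary, List<Replica> backups) {
        String result = primary.execute(task);
        backups.forEach(b -> b.installResult(result));
    }
    // Active: every replica executes the same input deterministically.
    static void active(String task, List<Replica> all) {
        all.forEach(r -> r.execute(task));
    }
}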
Many replication approaches have been developed and tested in practice in larger,
even global-scale, distributed systems. Depending on the scale and purpose of the
system, we consider here the principles of three aspects of the replication problem:
(i) replica-server location, (ii) replica placement, and (iii) replica updates. Many
more aspects exist, and in general replication is a rich problem that includes many
of the issues present in resource management and scheduling problems in
distributed systems (see Module 4): Who? What? When? For how long? etc.
Replica-server location: Like any data or compute task, replicas require physical
or virtual machines on which to run. Thus, the problem of placing these machines,
such that their locations provide the best possible service to the system and a good
trade-off with other considerations, is important. This problem is particularly
important for distributed systems with a highly decentralized administration, for
which decisions taken by the largely autonomous nodes can even interfere with
each other, and for distributed systems with highly volatile clients and particularly
those with high churn, where the presence of clients in one place or another can be
difficult to predict.
Replica-server location defines the conditions of a facility location problem, for
example, finding the best K locations out of the N possible, subject to many
performance, cost, and other constraints, with many theoretical solutions from
Operations Research.
An interesting problem is how new replica-server locations should emerge. When
replica-servers are permanent, for example, as when game operators run their
shared sites or web operators mirror websites, all that is needed is to add a
statically configured machine. However, to prevent resource waste, it would be
better to allow replica-servers to be added or removed as needed, in relation to
(anticipated) load. (This is the essential case for many modern IT operations, and
underlies the need for cloud and serverless computing.) In such a situation,
derived from traditional systems considerations, who should trigger adding or
removing a replica-server, the distributed system or the client? A traditional
answer is both, which means that the designer must (1) consider whether to allow
the replica-server system to be elastic, adding and removing replicas as needed,
and (2) enable both system- and client-initiated elasticity.
Replica placement: Various techniques can help with placing replicas on available
replica-servers.
A simplistic approach is to place one replica on each available replica-server. The
main advantage of this approach is that each location, presumably enabled because
of its good capability to service clients, can actually do so. The main
disadvantages are that this approach does not enable replication in time, that is,
multiple replicas located in the same location when enough resources exist, and
that rapid changes in demand or system conditions cannot be accommodated.
(Longer-term changes can be accommodated by good management of replica-
server locations.)
Another approach is to use a multi-objective solver to place replicas in the replica-
server topology. The topological space can be divided, e.g., into Voronoi diagrams;
conditions such as poor connectivity between adjacent divisions can be taken into
account, etc. Online algorithms often use simplifications, such as partitioning the
topology only along the main axes, and greedy approaches, such as placing servers
first in the most densely populated areas.
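As a sketch of the greedy simplification, here is a Java snippet that picks the K
candidate locations with the most clients; the client-density map and the location
names are assumed inputs, not part of any specific system:

import java.util.List;
import java.util.Map;

class GreedyPlacement {
    // Greedy simplification of replica placement: choose the K candidate
    // locations that currently serve the most clients.
    static List<String> chooseLocations(Map<String, Integer> clientsPerLocation,
                                        int k) {
        return clientsPerLocation.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .toList();
    }
}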
Replica updates:
What to update? Replicas need to achieve a consistent state, but how they do so
can differ by system and, in dynamic systems, even by replica (e.g., as in [3] for
an online gaming application). Two main classes of approaches exist: (i) updating
from the result computed by one replica (the coordinating replica), and (ii)
updating from the stream of input operations that, applied identically, will lead to
the same outcome and thus to a consistent state across all replicas. (Note that (i)
corresponds to the passive replication described at the start of this section,
whereas (ii) corresponds to active replication.)
Passive replication typically consumes fewer compute resources per replica
receiving the result. Conversely, active replication typically consumes fewer
networking resources to send the update to all replicas. In Section 2.2.7,
Consistency for Online Gaming, Virtual Environments, and the Metaverse, we
saw how important these trade-offs are to manage for online games.
When to perform updates? With synchronous updates, all replicas perform the
same update, which has the advantage that the system will be in a consistent state
at the end of each update, but also the drawbacks of waiting for the slowest part of
the system to complete the operation and of having to update each replica even if
this is not immediately necessary.
With asynchronous updates, the source informs the other replicas of changes,
often only that a new operation has been performed or that enough time has
elapsed since the last update. The replicas then mark their local data as (possibly)
outdated. Each replica can decide if and when to perform the update, lazily.
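This lazy, invalidation-based behavior could look like the following sketch; the
class and its methods are hypothetical:

// Asynchronous, invalidation-based updating: the source only notifies that
// data changed; the replica refreshes lazily on the next read.
class LazyReplica {
    private String value;
    private boolean outdated;

    void onInvalidation() { outdated = true; }     // source pushed "changed"

    String read() {
        if (outdated) {
            value = fetchFromSource();             // update only when needed
            outdated = false;
        }
        return value;
    }
    private String fetchFromSource() { return "latest value"; }
}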
Whom? Who initiates the replica update is important.
With push-based protocols, the system propagates modifications to the replicas,
informing the clients when replica updates must occur. This means the system
must be stateful, able to determine the need to propagate modifications by
inspecting the previous state. This approach is useful for applications with a high
ratio of operations that do not change the state to operations that do (e.g., high
read:write ratios). With this approach, the system is more expensive to operate,
and typically less scalable, than when it does not have to maintain state.
With pull-based protocols, clients ask for updates. Different approaches exist:
clients could poll the system to check for updates, but if the polling frequency is
too high the system can get overloaded, and if it is too low, (i) the client may read
stale information from its local state, or (ii) the client may have to wait a
relatively long time before obtaining the updated information from the system,
leading to low performance.
As is common in distributed systems, a hybrid approach could work better. Leases,
where push-based protocols are used while the lease is active, and pull-based
protocols are used outside the scope of the lease, are such a hybrid approach.
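A sketch of the lease idea, assuming a hypothetical client-side cache; the names
are ours:

import java.time.Instant;

// While the lease is valid the server pushes updates, so the local copy is
// fresh; after expiry the client falls back to pulling explicitly.
class Lease {
    private final Instant expiry;
    Lease(Instant expiry) { this.expiry = expiry; }
    boolean active() { return Instant.now().isBefore(expiry); }
}

class HybridClient {
    private Lease lease;

    String read(String key) {
        if (lease != null && lease.active()) {
            return localCopy(key);        // push keeps the cache fresh
        }
        return pullFromServer(key);       // lease expired: poll explicitly
    }
    String localCopy(String key) { return "cached:" + key; }
    String pullFromServer(String key) { return "fresh:" + key; }
}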
Exercises for Chapter Two
1) Using the naming schema and services concepts, describe how users and names
are stored in Active Directory and accessed using LDAP for authentication and
other services.
2) Read about etcd to understand what it is and how it works. Then describe etcd
in your own words, using examples, graphs, or other means.
3) The Raft consensus algorithm and the direct democratic election process are
closely related. Describe the similarities, the differences, and the gaps of each.
4) List different consistency techniques and describe them. Use a clear example
for each technique listed and described.
5) List the types of data replication and discuss them.
Programming Assignment One
a) Write a fully working remote procedure call (RPC) program in Java, with an
example.
b) Write a fully working remote method invocation (RMI) program in Java, with
an example.
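For orientation, here is a minimal, self-contained Java RMI sketch in the spirit of
assignment (b); the Greeter interface and all names are our own, not prescribed by
the assignment:

import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// Remote interface: every remote method must declare RemoteException.
interface Greeter extends Remote {
    String greet(String name) throws RemoteException;
}

// Server-side implementation of the remote interface.
class GreeterImpl implements Greeter {
    public String greet(String name) { return "Hello, " + name; }
}

public class RmiDemo {
    public static void main(String[] args) throws Exception {
        // Export the servant and register it under a well-known name.
        Greeter stub = (Greeter) UnicastRemoteObject.exportObject(new GreeterImpl(), 0);
        Registry registry = LocateRegistry.createRegistry(1099);
        registry.rebind("Greeter", stub);

        // Client side (same JVM here for brevity): look up and invoke remotely.
        Greeter remote = (Greeter) LocateRegistry.getRegistry("localhost", 1099)
                                                 .lookup("Greeter");
        System.out.println(remote.greet("world"));
    }
}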
References:
[1] The BBC, Microsoft says services have recovered after widespread outage,
Jan 2022.
[2] Sacheendra Talluri, Empirical Characterization of User Reports about Cloud
Failures, 2021.
[3] More on Consistency and Replication