Notes NoSQL Module 2 Lesson 5


Chapter 5. Consistency

Relational databases aim to exhibit strong consistency by avoiding the various kinds of inconsistency discussed in this chapter.

5.1. Update Consistency

We’ll begin by considering updating a telephone number. Coincidentally, Martin and Pramod are looking
at the company website and notice that the phone number is out of date. They both have update
access, so they both go in at the same time to update the number. We’ll assume they update it using a
slightly different format. This issue is called a write-write conflict: two people updating the same data
item at the same time.
When the writes reach the server, the server will serialize them—decide to apply one, then the other. Let’s assume it uses alphabetical order and picks Martin’s update first, then Pramod’s. Without any concurrency control, Martin’s update would be applied and immediately overwritten by Pramod’s. In this case Martin’s is a lost update. We see this as a failure of consistency because Pramod’s update was based on the state before Martin’s update, yet was applied after it.
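To make the lost update concrete, here is a toy illustration in Python, using a plain dict as a stand-in for the data store (the phone number formats are made up):

```python
store = {"phone": "510-555-0123"}  # hypothetical shared record

# Both Martin and Pramod read the stale value before either writes.
martin_sees = store["phone"]
pramod_sees = store["phone"]

store["phone"] = "(510) 555-0123"  # Martin's update is applied first...
store["phone"] = "510.555.0123"    # ...then Pramod's overwrites it unseen

print(store["phone"])  # Martin's update is lost
```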

Approaches for maintaining consistency in the face of concurrency: pessimistic or optimistic


A pessimistic approach works by preventing conflicts from occurring; an optimistic approach lets conflicts occur, but detects them and takes action to sort them out.
For update conflicts, the most common pessimistic approach is to have write locks, so that in order to change a value you need to acquire a lock, and the system ensures that only one client can get a lock at a time.
E.g., Martin and Pramod would both attempt to acquire the write lock, but only Martin (the first one) would succeed. Pramod would then see the result of Martin’s writing before deciding whether to make his own update.
A common optimistic approach is a conditional update, where any client that does an update tests the value just before updating it to see if it has changed since his last read. In this case, Martin’s update would succeed but Pramod’s would fail. The error would let Pramod know that he should look at the value again and decide whether to attempt a further update. (Both approaches are sketched in the code below.)
Both approaches work in a single-server environment. In a multiple-server environment, such as peer-to-peer replication, two nodes might apply the updates in a different order, resulting in a different value for the telephone number on each peer.
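Here is a minimal sketch of both approaches in Python. It uses a hypothetical in-memory VersionedStore rather than any real database’s API: the pessimistic path serializes writers with a lock, and the optimistic path is a conditional update (a compare-and-set keyed on a version number).

```python
import threading

# A minimal sketch of both approaches against a hypothetical in-memory
# store (not any real database's API). The version counter is what the
# optimistic, conditional update checks against.

class VersionedStore:
    def __init__(self, value):
        self.value = value
        self.version = 0                  # bumped on every successful write
        self._lock = threading.Lock()     # the server's internal serializer

    # Pessimistic: a client must hold the write lock for the whole update,
    # so only one writer can proceed at a time.
    def update_with_lock(self, new_value):
        with self._lock:
            self.value = new_value
            self.version += 1

    # Optimistic: a conditional update (compare-and-set). It succeeds only
    # if nobody has written since the version the client read.
    def conditional_update(self, expected_version, new_value):
        with self._lock:                  # the check-and-write is still atomic
            if self.version != expected_version:
                return False              # conflict detected; caller re-reads
            self.value = new_value
            self.version += 1
            return True

store = VersionedStore("510-555-0123")
seen = store.version                      # Martin and Pramod both read v0
print(store.conditional_update(seen, "(510) 555-0123"))  # Martin: True
print(store.conditional_update(seen, "510.555.0123"))    # Pramod: False
```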

Another approach often used for concurrency in distributed systems is sequential consistency: ensuring that all nodes apply operations in the same order.

This approach is familiar to many programmers from version control systems, particularly distributed version control systems that by their nature will often have conflicting commits.
The next step again follows from version control: You have to merge the two updates somehow.
Users may update the merged information, or the computer may be able to perform the merge itself; if it was a phone formatting issue, it may be able to realize that and apply the new number with the standard format.
Any automated merge of write-write conflicts is highly domain-specific and needs to be programmed for each particular case.

Often, when people first encounter these issues, their reaction is to prefer pessimistic concurrency because they are determined to avoid conflicts. While in some cases this is the right answer, there is always a tradeoff.
Concurrent programming involves a fundamental tradeoff between safety (avoiding errors such as update conflicts) and liveness (responding quickly to clients).
Disadvantages of the pessimistic approach
Pessimistic approaches often severely degrade the responsiveness of a system to the degree that it becomes unfit for its purpose. This problem is made worse by the danger of errors—pessimistic concurrency often leads to deadlocks, which are hard to prevent and debug.

Replication makes it much more likely to run into write-write conflicts. If different nodes have different
copies of some data which can be independently updated, then you’ll get conflicts unless you take
specific measures to avoid them. Using a single node as the target for all writes for some data makes it
much easier to maintain update consistency. Of the distribution models we discussed earlier, all but
peer-to-peer replication do this.

5.2. Read Consistency

Having a data store that maintains update consistency is one thing, but it doesn’t guarantee that readers of that data store will always get consistent responses to their requests.
Let’s imagine we have an order with line items and a shipping charge. The shipping charge is calculated based on the line items in the order. If we add a line item, we thus also need to recalculate and update the shipping charge. In a relational database, the shipping charge and line items will be in separate tables.
The danger of inconsistency is that Martin adds a line item to his order, Pramod then reads the line items and shipping charge, and then Martin updates the shipping charge. This is an inconsistent read or read-write conflict: Pramod has done a read in the middle of Martin’s write.

We refer to this type of consistency as logical consistency: ensuring that different data items make sense together.
To avoid a logically inconsistent read-write conflict, relational databases support the notion of transactions.
Providing Martin wraps his two writes in a transaction, the system guarantees that Pramod will either read both data items before the update or both after the update.
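As an illustration, here is a sketch using SQLite from Python; the schema and values are invented stand-ins for the order’s line items and shipping charge.

```python
import sqlite3

# A sketch of the transactional fix, using SQLite. The tables and values
# are made up for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE line_items (order_id INT, description TEXT, price REAL)")
conn.execute("CREATE TABLE shipping (order_id INT, charge REAL)")
conn.execute("INSERT INTO shipping VALUES (1, 5.00)")

# Martin's two writes commit atomically: a reader sees both or neither.
with conn:  # commits on success, rolls back on an exception
    conn.execute("INSERT INTO line_items VALUES (1, 'book', 20.00)")
    conn.execute("UPDATE shipping SET charge = 7.50 WHERE order_id = 1")

print(conn.execute("SELECT charge FROM shipping WHERE order_id = 1").fetchone())
```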
A common claim we hear is that NoSQL databases don’t support transactions and thus can’t be consistent. Such a claim is mostly wrong because:

Our first clarification is that any statement about lack of transactions usually only applies to some NoSQL databases, in particular the aggregate-oriented ones. In contrast, graph databases tend to support ACID transactions just the same as relational databases.
Secondly, aggregate-oriented databases do support atomic updates, but only within a single aggregate. This means that you will have logical consistency within an aggregate but not between aggregates.
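For example, in a document database such as MongoDB the order can be modeled as one aggregate, so both changes land in a single atomic update. A sketch using pymongo, assuming a server on the default local port and a hypothetical orders collection:

```python
from pymongo import MongoClient

# A sketch assuming MongoDB via pymongo; the "shop" database and "orders"
# collection are hypothetical. Each order document is one aggregate, with
# line items and shipping charge embedded together.
orders = MongoClient()["shop"]["orders"]

# One update to one document is applied atomically, so a reader never
# sees the new line item without the recalculated shipping charge.
orders.update_one(
    {"_id": 1},
    {
        "$push": {"line_items": {"description": "book", "price": 20.00}},
        "$set": {"shipping_charge": 7.50},
    },
)
```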
Of course not all data can be put in the same aggregate, so any update that affects multiple aggregates leaves open a
time when clients could perform an inconsistent read. The length of time an inconsistency is present is called the
inconsistency window.

Relaxing Consistency

Consistency is a Good Thing—but it comes with sacrifices.

It is always possible to design a system to avoid inconsistencies, but often impossible to do so without making unbearable sacrifices in other characteristics of the system.

As a result, we often have to trade off consistency for something else.

While some architects see this as a disaster, we see it as part of the inevitable tradeoffs involved in system design.

Furthermore, different domains have different tolerances for inconsistency, and we need to take this tolerance into account as we make our decisions.

Trading off consistency is a familiar concept even in single-server relational database systems. Here, our principal tool to enforce consistency is the transaction, and transactions can provide strong consistency guarantees.

However, transaction systems usually come with the ability to relax isolation levels, allowing queries to read data that hasn’t been committed yet.

In practice we see most applications relax consistency down from the highest isolation level (serializable) in order to get effective performance.

We most commonly see people using the read-committed transaction level, which eliminates some read-write conflicts but allows others.
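As a sketch of what this looks like in practice, assuming PostgreSQL via psycopg2 with made-up connection details and table:

```python
import psycopg2

# A sketch assuming PostgreSQL and psycopg2; the connection string and
# table are hypothetical. We explicitly run this session at read committed
# rather than serializable, trading some consistency for performance.
conn = psycopg2.connect("dbname=shop user=app")
conn.set_session(isolation_level="READ COMMITTED")

with conn, conn.cursor() as cur:  # the connection context manages the transaction
    cur.execute("SELECT charge FROM shipping WHERE order_id = %s", (1,))
    print(cur.fetchone())
```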

Many systems forgo transactions entirely because the performance impact of transactions is too high.

On a small scale, we saw the popularity of MySQL during the days when it didn’t support transactions. Many websites liked the high speed of MySQL and were prepared to live without transactions.
At the other end of the scale, some very large websites, such as eBay, have had to forgo transactions in order to perform acceptably—this is particularly true when you need to introduce sharding.

Even without these constraints, many application builders need to interact with remote systems that are outside a transaction boundary, so updating outside of transactions is a quite common occurrence for enterprise applications.

The CAP Theorem

In the NoSQL world it’s common to refer to the CAP theorem as the reason why you may need to relax consistency. It was originally proposed by Eric Brewer in 2000 [Brewer] and given a formal proof by Seth Gilbert and Nancy Lynch [Lynch and Gilbert] a couple of years later.

The basic statement of the CAP theorem is that, given the three properties of
Consistency, Availability, and Partition tolerance, you can only get two.

Obviously, this depends very much on how you define these three properties, and differing
opinions have led to several debates on what the real consequences of the CAP theorem are.

Consistency in database systems refers to the requirement that any given database transaction must
change affected data only in allowed ways. For a database to be consistent, data written to the database
must be valid according to all defined rules.

Consistency does not guarantee correctness of the transaction in all ways an application programmer
might expect (that is the responsibility of application-level code). Instead, consistency merely
guarantees that programming errors cannot result in the violation of any defined database constraints.

Availability has a particular meaning in the context of CAP—it means that if you can talk to a node in the cluster, it can read and write data.

Partition tolerance means that the cluster can survive communication breakages in the cluster that separate the cluster into multiple partitions unable to communicate with each other.

A single-server system is the obvious example of a CA system—a system that has Consistency and Availability but not
Partition tolerance. A single machine can’t partition, so it does not have to worry about partition tolerance.

It is theoretically possible to have a CA cluster. However, this would mean that if a partition ever occurs in the cluster, all
the nodes in the cluster would go down so that no client can talk to a node. By the usual definition of “available,” this
would mean a lack of availability, but this is where CAP’s special usage of “availability” gets confusing. CAP defines
“availability” to mean “every request received by a nonfailing node in the system must result in a response”.

An example should help illustrate this. Martin and Pramod are both trying to book the last hotel room on a system that
uses peer-to-peer distribution with two nodes (London for Martin and Mumbai for Pramod). If we want to ensure
consistency, then when Martin tries to book his room on the London node, that node must communicate with the
Mumbai node before confirming the booking. Essentially, both nodes must agree on the serialization of their requests.
This gives us consistency—but should the network link break, then neither system can book any hotel room, sacrificing
availability.

One way to improve availability is to designate one node as the master for a particular hotel and ensure all bookings are
processed by that master. Should that master be Mumbai, then Mumbai can still process hotel bookings for that hotel
and Pramod will get the last room. If we use master-slave replication, London users can see the inconsistent room
information but cannot make a booking and thus cause an update inconsistency. However, users expect that it could
happen in this situation—so, again, the compromise works for this particular use case.

Relaxing Durability

So far we’ve talked about consistency, which is most of what people mean when they talk about the ACID properties of
database transactions. The key to Consistency is serializing requests by forming Atomic, Isolated work units. But most
people would scoff at relaxing durability—after all, what is the point of a data store if it can lose updates?

As it turns out, there are cases where you may want to trade off some durability for higher performance. If a database
can run mostly in memory, apply updates to its in-memory representation, and periodically flush changes to disk, then it
may be able to provide substantially higher responsiveness to requests. The cost is that, should the server crash, any
updates since the last flush will be lost.
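A toy sketch of this tradeoff in Python (not a real database): writes hit an in-memory dict immediately, and a background timer flushes to disk every few seconds, so a crash loses at most the writes since the last flush. The file name and interval are arbitrary.

```python
import json
import threading

# A toy sketch of trading durability for responsiveness. Writes are
# memory-only until the periodic flush; a crash loses anything written
# since the last flush.
class MostlyInMemoryStore:
    def __init__(self, path="store.json", flush_interval=5.0):
        self.path = path
        self.flush_interval = flush_interval
        self.data = {}
        self._lock = threading.Lock()
        self._schedule_flush()

    def put(self, key, value):
        with self._lock:
            self.data[key] = value        # fast: memory only, not yet durable

    def _schedule_flush(self):
        timer = threading.Timer(self.flush_interval, self._flush)
        timer.daemon = True               # don't keep the process alive
        timer.start()

    def _flush(self):
        with self._lock:
            with open(self.path, "w") as f:
                json.dump(self.data, f)   # durable up to this point
        self._schedule_flush()

store = MostlyInMemoryStore()
store.put("session:u42", {"cart": ["book"]})
```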

One example of where this tradeoff may be worthwhile is storing user-session state. A big website may have many users
and keep temporary information about what each user is doing in some kind of session state. There’s a lot of activity on
this state, creating lots of demand, which affects the responsiveness of the website. The vital point is that losing the
session data isn’t too much of a tragedy—it will create some annoyance, but maybe less than a slower website would
cause. This makes it a good candidate for nondurable writes. Often, you can specify the durability needs on a call-by-call
basis, so that more important updates can force a flush to disk. Another example of relaxing durability is capturing
telemetric data from physical devices. It may be that you’d rather capture data at a faster rate, at the cost of missing the
last updates should the server go down.
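As an illustration of call-by-call durability, here is a sketch assuming MongoDB via pymongo, where the journaling flag of the write concern is chosen per collection (the database and collections are hypothetical):

```python
from pymongo import MongoClient, WriteConcern

# A sketch of per-call durability, assuming MongoDB via pymongo; the
# "site" database and its collections are made up.
db = MongoClient()["site"]

# Session-state writes: acknowledged without waiting for the on-disk
# journal (faster, but lost if the server crashes at the wrong moment).
sessions = db.get_collection("sessions",
                             write_concern=WriteConcern(w=1, j=False))
sessions.update_one({"_id": "u42"}, {"$set": {"cart": ["book"]}}, upsert=True)

# Important updates: require the write to reach the journal before the
# acknowledgment, forcing a flush to disk.
orders = db.get_collection("orders",
                           write_concern=WriteConcern(w=1, j=True))
orders.insert_one({"order_id": 1, "total": 27.50})
```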

Another class of durability tradeoffs comes up with replicated data. A failure of replication durability occurs when a
node processes an update but fails before that update is replicated to the other nodes. A simple case of this may happen
if you have a master-slave distribution model where the slaves appoint a new master automatically should the existing
master fail. If a master does fail, any writes not passed on to the replicas will effectively become lost. Should the master
come back online, those updates will conflict with updates that have happened since. We think of this as a durability
problem because you think your update has succeeded since the master acknowledged it, but a master node failure
caused it to be lost.

Quorums

When you’re trading off consistency or durability, it’s not an all-or-nothing proposition. The more nodes you involve in a request, the higher the chance of avoiding an inconsistency. This naturally leads to the question: How many nodes need to be involved to get strong consistency?

Imagine some data replicated over three nodes. You don’t need all nodes to acknowledge a write to ensure strong consistency; all you need is two of them—a majority. If you have conflicting writes, only one can get a majority. This is referred to as a write quorum and expressed in a slightly pretentious inequality of W > N/2, meaning the number of nodes participating in the write (W) must be more than half the number of nodes involved in replication (N). The number of replicas is often called the replication factor.

Similarly to the write quorum, there is the notion of read quorum: How many nodes you need to contact to be sure you
have the most up-to-date change. The read quorum is a bit more complicated because it depends on how many nodes
need to confirm a write.

Let’s consider a replication factor of 3. If all writes need two nodes to confirm (W = 2) then we need to contact at least
two nodes to be sure we’ll get the latest data. If, however, writes are only confirmed by a single node (W = 1) we need
to talk to all three nodes to be sure we have the latest updates. In this case, since we don’t have a write quorum, we
may have an update conflict, but by contacting enough readers we can be sure to detect it. Thus we can get strongly
consistent reads even if we don’t have strong consistency on our writes.
This relationship between the number of nodes you need to contact for a read (R), those confirming a write (W), and the
replication factor (N) can be captured in an inequality: You can have a strongly consistent read if R + W > N.
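The two inequalities are easy to capture in code. Small Python helpers with illustrative node counts:

```python
# Helpers capturing the two inequalities from the text: a write quorum
# needs W > N/2, and a read is strongly consistent when R + W > N.

def has_write_quorum(w: int, n: int) -> bool:
    return w > n / 2

def read_is_strongly_consistent(r: int, w: int, n: int) -> bool:
    return r + w > n

N = 3  # replication factor
print(has_write_quorum(w=2, n=N))                  # True: 2 > 1.5
print(read_is_strongly_consistent(r=2, w=2, n=N))  # True: 2 + 2 > 3
print(read_is_strongly_consistent(r=3, w=1, n=N))  # True: W = 1 forces R = N
print(read_is_strongly_consistent(r=1, w=2, n=N))  # False: may miss the latest write
```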

The number of nodes participating in an operation can vary with the operation. When writing, we might require quorum
for some types of updates but not others, depending on how much we value consistency and availability. Similarly, a
read that needs speed but can tolerate staleness should contact fewer nodes. Often you may need to take both into
account. If you need fast, strongly consistent reads, you could require writes to be acknowledged by all the nodes, thus
allowing reads to contact only one (N = 3, W = 3, R = 1). That would mean that your writes are slow, since they have to
contact all three nodes, and you would not be able to tolerate losing a node. But in some circumstances that may be the
tradeoff to make.
