Showing posts with label ConcurrentLinkedQueue. Show all posts

Thursday, February 2, 2017

Nitsan Wakart on Lock-Free Queues

About a year ago Nitsan Wakart gave a very nice presentation about lock-free queues.
Here is the link to Vimeo: https://vimeo.com/131688512

Nitsan works for Azul Systems and he has a blog named Psychosomatic, Lobotomy, Saw.

If you're into non-blocking queues then make sure not to miss this video, because it is full of interesting facts.
Some of the topics he covers are:
  • The difficulty of running proper benchmarks for lock-free queues. IMO the hardest thing to obtain is determinism.
  • What does throughput mean for a queue: single-enqueue-single-dequeue or bursts?
  • Throughput vs Latency.
  • Benchmarking empty queues vs non-empty queues.
  • Warmup on benchmarks.
  • Memory Bounded (circular array or ring buffer) vs Memory Unbounded (singly-linked list).
  • When can we discard the volatile keyword in Java queues (hint: very rarely); even some of the masters have made this mistake.
  • False Sharing.

Friday, October 7, 2016

Self-linking and Latency + Life of a Twitter jvm engineer

Today we're going to take a look at the effects of doing self-linking in singly-linked-list-based data structures in Java. More specifically, how it changes tail latency in singly-linked-list-based queues like the one by Alex Kogan and Erez Petrank.

Back in 2011 Alex Kogan and Erez Petrank showed a multi-producer-multi-consumer wait-free queue.
http://www.cs.technion.ac.il/~erez/Papers/wfquque-ppopp.pdf
To the best of our knowledge, they provided the only reasonably useful and correctly implemented memory-unbounded MPMC wait-free queue (that's a mouthful).
Yes, there are generic wait-free techniques, but using them on a queue is just too slow, and yes, there are two other MPMC wait-free algorithms, but their implementations have bugs and are CPU-specific.
Even the algorithm by Kogan and Petrank relies on having an automatic Garbage Collector (GC), so they provided their implementation in Java. As the authors themselves mention in their paper, there is currently no wait-free GC, neither in production nor in the literature, which means that this queue is never truly wait-free, but let's leave that for another post and just focus on latency.

How good is this "wait-free" queue when it comes to tail latency?

Pretty good as it turns out. We showed some plots in a previous post:
http://www.concurrencyfreaks.com/2016/08/throughput-vs-latency-and-lock-free-vs.html

Unfortunately, as soon as you start "hammering on it" with lots of contention, a problem appears.
The problem is that the GC will do pauses, really really long pauses.

This queue is based on a singly-linked list, and like all singly-linked-list-based data structures used with a GC, the unused nodes can't just be unlinked from the list; they need to be self-linked by pointing their next pointer to the node itself.
The reason why this is important is related to how GCs work, and there is a very good explanation in this presentation, starting at minute 23:
Life of a Twitter jvm engineer - The garbage keeps coming... by Tony Printezis
http://www.youtube.com/watch?v=M9o1LVfGp2A&t=22m45s

which by the way I recommend watching in its entirety, but for the purpose of this post, if you don't know what this self-linking stuff is all about, then go watch ten minutes of it starting at minute 23 and come back and read the rest of this post.

Do you understand now why self-linking is important?
This is really a non-obvious detail of GCs, and the first time I learned about this was several years ago from Doug Lea.


The following code is pretty much what was written in the original paper but with some sun.misc.Unsafe added to it:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/Java/com/concurrencyfreaks/queues/KoganPetrankNoSLQueue.java
and in our benchmarks it has GC pauses that go up to 70 seconds.
Yes, that's right, seventy seconds during which our 32-core machine is stuck doing nothing except running the GC :-O
Imagine you are reading this blog post on your laptop/tablet and then you click on a link and everything freezes for 70 seconds due to the GC... does that sound reasonable in any way?
You're going to say that this is the GC's fault and that using a JVM with a decent GC like Zing would fix it. I don't know, because I don't have access to a JVM with Zing, and I believe that using another GC would improve things, but I seriously doubt it would be as effective as doing self-linking.
For a deep-dive on this topic, check out the paper named "A Performance Study of Java Garbage Collectors on Multicore Architectures", particularly figure 1 and table 3:
http://www-public.tem-tsp.eu/~thomas_g/research/biblio/2015/carpen-amarie15pmam-gcanalysis.pdf


The trick to help the GC is to self-link the nodes after dequeueing. However, we can't self-link the node where your value is; instead, we self-link the node that points to it, because the node holding the value may still be accessible through head, and we need its next to be valid since it will be used by the next thread calling deq() to obtain its "value". Here is the variant we did, with self-linking of unused nodes:
https://github.com/pramalhe/ConcurrencyFreaks/blob/master/Java/com/concurrencyfreaks/queues/KoganPetrankQueue.java
and in our benchmarks it has GC pauses that go up to 3 milliseconds.
Keep in mind that they are exactly the same algorithm, apart from an extra check in help_finish_enq() and the self-linking in deq(), and yet they have completely different latency profiles.
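
To make the idea concrete, here is a minimal single-threaded sketch of which node gets self-linked after a dequeue. This is not the Kogan-Petrank code, just an illustration; the class and field names are made up for this example.

public class SelfLinkingQueue<E> {
    static final class Node<E> {
        E value;
        Node<E> next;
        Node(E value) { this.value = value; }
    }

    private Node<E> head = new Node<>(null);   // sentinel: head.next holds the first value
    private Node<E> tail = head;

    public void enq(E value) {
        Node<E> node = new Node<>(value);
        tail.next = node;
        tail = node;
    }

    public E deq() {
        Node<E> first = head.next;
        if (first == null) return null;        // empty queue
        E value = first.value;
        first.value = null;                    // 'first' becomes the new sentinel
        Node<E> oldHead = head;
        head = first;
        // Self-link the node we just unlinked, so that a stale reference to it
        // (e.g. from an old generation) does not keep the rest of the list alive:
        oldHead.next = oldHead;
        return value;
    }
}

The key point is that oldHead, the node that used to point to the dequeued node, is the one that gets self-linked; the node holding the value stays reachable through head, exactly as described above.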

Below is one example we got with GC logs. In this example the JVM ran out of memory during the benchmark and had to run a full GC, which took 17 seconds, and then another that took 42 seconds. At the end of the benchmark we call System.gc() and sleep for a while to trigger another GC run, and that one took more than 45 seconds.
These 45 seconds are not accounted for in the benchmark, which is unfair because the GC is cleaning up the garbage that the queue is producing, so that work should be taken into account when benchmarking as well; but anyway, we don't care much about throughput, it's more about latency for us:
##### KPNoSLQueue                          ##### 
Starting run...

[GC (Allocation Failure)  3932481K->221057K(15073280K), 17.3218925 secs]
[GC (Allocation Failure)  4153217K->547969K(15073280K), 42.6884304 secs]

Ending run
Number of enqueues/dequeues per sec = 2235
[GC (System.gc())  1324375K->601121K(15073280K), 45.8331662 secs]
[Full GC (System.gc())  601121K->8434K(15073280K), 0.0763649 secs]



When we run the same benchmark on the self-linking version, there is still one GC allocation failure during the benchmark, but it takes only about 1 millisecond to complete, and the throughput is significantly higher because there is very little waiting for the GC:

##### KPQueue                              #####
Starting run...
  
[GC (Allocation Failure)  3932485K->805K(15073280K), 0.0013195 secs]
Ending run

Number of enqueues/dequeues per sec = 3216
[GC (System.gc())  2541778K->453K(15073280K), 0.0012654 secs]
[Full GC (System.gc())  453K->327K(15073280K), 0.0142511 secs]



This is the problem with not reclaiming unused memory (nodes) and leaving it up to the GC: once you go down that path, you need to understand what the GC is doing and somehow make friends with it so that it behaves the way you want it to.


Long story short, if you're going to implement a singly-linked-list-based queue in Java, you'd better do self-linking.

Saturday, August 6, 2016

Throughput vs Latency and Lock-Free vs Wait-Free

In the previous post we showed an excellent presentation by Cliff Click, and one of the things he mentions is that there is performance for throughput and performance for latency, which really resonated with me.

When people talk about performance, they're usually just interested in throughput. This is fine, but throughput is just one of the vectors of performance; it's just a number (for a particular workload).
Having a number is nice, it means we can quantify, we can measure, and therefore, we can compare.

You know what is much better than having a number?
Having a distribution, namely, the latency distribution.

How do you know if a certain technique or algorithm X is better than algorithm Y?
Typically, you measure, and typically, you do it by measuring the throughput of X and Y in different scenarios or workloads.
If you're really sophisticated, then you might even do several runs, and choose the median of those runs, and/or compute some error bars based on the standard deviation or the min/max of the sample of runs.
This is fine (it's already one step ahead of just guessing which algorithm is better), but we can do more:
Instead of showing a number (plus or minus an error) we can show a distribution. And when it comes to concurrent algorithms, the best is to show the latency distribution.

And why is that, you may ask?
Because the inverse of the mean of the latency distribution is...
... I'll give you a few seconds to think about it...
... yes you guessed it, the throughput!
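
(To make the arithmetic concrete: if the mean latency of an operation is, say, 2 microseconds on a given thread, then that thread completes roughly 500,000 operations per second, and the total throughput is that value summed over all the threads hammering on the queue.)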

Looking at a latency distribution gives way more information than looking at the throughput, the same way that looking at a statistical distribution gives way more information than looking at its mean, which is just a single number.
You know the old saying "one picture is worth a thousand words"? Well then, one latency distribution plot is worth a thousand throughput measurements, and this last statement is literally true if the plot is a histogram with 1000 entries... lol

Why is this important?
For two reasons:

First, if you care about throughput, and only about throughput, then you care about the latency distribution in a very specific way, namely, you want to know only the mean of the distribution. That's fine, but then how do you compare two algorithms? You are going to do several measurements, right?
What is the error in this measurement?
How to compute error bars?
Do you use a standard deviation or something else?
Are the error bars assuming a Normal distribution?
Is it safe to assume a Normal distribution?
Why not just show the latency distribution?


Second, there is typically a tradeoff when it comes to throughput vs latency (not always, but typically). Going back to our example, perhaps algorithm X provides better throughput than algorithm Y, but Y has a better tail latency than X.
Which one is "better"?
There is no right answer, it depends on how you define better: If you just want raw throughput and you don't care that 1% of the cases will take an enormous amount of time to complete, then X is better. If you're willing to sacrifice throughput to have more fairness and better (lower) latency at the tail of the latency distribution, then algorithm Y is better.

One concrete example is when you look at a lock-free queue (like the one by Maged Michael and Michael Scott) vs a wait-free queue (like the one by Alex Kogan and Erez Petrank):

[Plots: enqueue() latency for the Michael-Scott and Kogan-Petrank queues, at the median and at the 99.9 percentile]

In these plots, lower is better.
The plot for the median (this is the median, not the mean) shows that the Michael-Scott queue is better than Kogan-Petrank (about 10x better), but when it comes to the tail latency, namely at the 99.9%, the Kogan-Petrank is better.
By 99.9% we mean that 99.9% of the calls to enqueue() take x microseconds or less to complete, and therefore, 0.1% of the calls take more than x microseconds to complete, where x is the number on the vertical axis of the plots.
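
For reference, here is a minimal sketch of how such a percentile can be computed from a set of recorded enqueue() latencies, using the simple nearest-rank method. This is just an illustration, not the benchmark harness that produced the plots above; the class and method names are made up for this example.

import java.util.Arrays;

public class LatencyPercentile {
    // Returns the latency value below which 'percentile' (e.g. 99.9) percent
    // of the samples fall, using the nearest-rank method.
    static long percentile(long[] latenciesNanos, double percentile) {
        long[] sorted = latenciesNanos.clone();
        Arrays.sort(sorted);
        int rank = (int) Math.ceil((percentile / 100.0) * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    public static void main(String[] args) {
        long[] samples = { 100, 120, 95, 110, 5000, 130, 105, 98, 115, 102 };
        System.out.println("median = " + percentile(samples, 50.0) + " ns");
        System.out.println("p99.9  = " + percentile(samples, 99.9) + " ns");
    }
}

For millions of samples, a histogram-based recorder such as HdrHistogram is more practical than sorting a full array, but the idea is the same.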


Just like it doesn't make sense to say that Throughput is better than (tail) Latency, it is also meaningless to say that a Lock-Free algorithm with higher throughput than a similar Wait-Free algorithm is better. The only exception is if they both have the same throughput, or the lock-free algorithm has a lower throughput than the wait-free one; in that case, by definition, the wait-free one is better, because it is better in all metrics.


Fedor Pikus is going to talk about lock-free again this year at CppCon, so let's see what he has to say about this topic.
I talked about latency at the beginning of my presentation at CppCon last year in case you want to learn more.

Monday, November 10, 2014

Relaxed atomics, out-of-thin-air results, and ConcurrentLinkedQueueRelaxed

In today's post we have a quiz on Relaxed Atomics.

Before you go any further, I recommend reading this paper by Hans Boehm and Brian Demsky entitled "Outlawing Ghosts: Avoiding Out-of-Thin-Air Results", if you haven't done so already. Without reading this paper, you may be missing some of the subtleties of relaxed atomics.

Done already?
Ok, let's start.

Suppose we have three slightly different programs named Program A, Program B, and Program C, each with two threads. The code that each thread executes for each program is described below, and I'll use Java notation, but this is also true for C++1x, C11, D, and for Scala (not for Go, because it doesn't have relaxed atomics from what I could tell):

Program A

volatile int x = 0;
volatile int y = 0;

Thread 1:
if (x == 1) y = 1;

Thread 2:
if (y == 1) x = 1;

Program B

int x = 0;
int y = 0;

Thread 1:
if (x == 1) y = 1;

Thread 2:
if (y == 1) x = 1;


Program C

int x = 0;
int y = 0;

Thread 1:
if (x == 42) y = 1;

Thread 2:
if (y == 1) x = 1;


Question 1:
What are the possible outcomes of each of these programs?

Question 2:
Which (if any) of these three programs generate sequentially consistent results?

I've taken the liberty of emphasizing the differences between the three programs in bold, so as to make it easier to analyze the code.
Program A is the only one where the variables x and y are sequentially consistent atomics. Programs B and C use relaxed atomics for all their operations (in the JVM, all loads and stores of an int are guaranteed to be atomic). Program C compares x with 42 instead of comparing it with 1.
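
Since Java has no top-level variable declarations, here is a sketch of how Program A could be written as an actual compilable class with the two threads spelled out; the class name and the choice of static fields are just for this illustration. Program B is identical except that the volatile keyword is removed.

public class ProgramA {
    static volatile int x = 0;
    static volatile int y = 0;

    public static void main(String[] args) throws InterruptedException {
        Thread t1 = new Thread(() -> { if (x == 1) y = 1; });   // Thread 1
        Thread t2 = new Thread(() -> { if (y == 1) x = 1; });   // Thread 2
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        System.out.println("x=" + x + ", y=" + y);   // prints the final outcome
    }
}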

If these programs seem somewhat familiar, it is because Program B is one of the examples in the paper (see section 2) recommended at the beginning of this post.

So have you thought about it for a while?
What are the possible outcomes for each one of these three programs?
Scroll down to look at the solutions.










a little bit further....











For Program A, the only possible outcome is {x=0, y=0}.
There is no other possible outcome, and I hope this is obvious. If it is not, then I recommend watching Herb Sutter's presentation on atomics because I don't even know how to explain it. But basically, no re-ordering is possible due to implicit acquire barriers on the loads, and release barriers on the stores.
As for the consistency model, Program A provides at least sequential consistency because, well, it's constant/immutable, in the sense that there is only one possible state and it never changes.

For Program B, there are two possible outcomes: {x=0, y=0} or {x=1, y=1}. To understand why the {1,1} state is possible, the best thing to do is to read the paper by Hans Boehm and Brian Demsky.
There is no sequentially consistent history for Program B that can result in {x=1,y=1}, so Program B is not sequentially consistent.

For Program C, there is only one possible outcome, which is {x=0, y=0}, mainly because we check against a value that x can never take (42). Unlike in Program B, in Program C there is no way branch prediction can speculate that x is 42 and then have it actually become 42, because there is no code path in which x is set to 42.
Program C has only one state and it's constant/immutable, so yes, it is sequentially consistent (if it even makes sense to say so).


So why should we care about this?

The reason for spending time thinking about these scenarios is related to why the optimization on item in the ConcurrentLinkedQueueRelaxed can be done (mentioned in the previous post). Whenever we read the variable item with a relaxed load, one of three things may happen (a sketch of this re-load pattern follows the list):
- The value of item is seen as being the item we're searching for: it could happen that it has been removed in the meantime (CAS'ed to null), so we have to re-load using an acquire barrier (i.e. a volatile load);
- The value of item is seen as being a non-null value but not the item we're searching for: it may actually be null by now, but we don't care about that item because we're looking for a different one;
- The value of item is seen as being null: it may be null, but it may also (in very special circumstances) be the item we're looking for, and because of this, we need to re-load with an acquire barrier (volatile load).
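
Here is a rough sketch of that relaxed-load-then-volatile-reload pattern. This is not the ConcurrentLinkedQueueRelaxed source; the node layout and the contains() method are invented for this illustration, and VarHandle getOpaque() is used as a stand-in for the relaxed load (the actual implementation details differ).

import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;

public class RelaxedItemSearch {
    static final class Node<E> {
        volatile E item;
        volatile Node<E> next;
    }

    private static final VarHandle ITEM;
    static {
        try {
            ITEM = MethodHandles.lookup().findVarHandle(Node.class, "item", Object.class);
        } catch (ReflectiveOperationException e) {
            throw new ExceptionInInitializerError(e);
        }
    }

    // Returns true if 'searched' is found in the list starting at 'head'.
    static <E> boolean contains(Node<E> head, E searched) {
        for (Node<E> node = head; node != null; node = node.next) {
            @SuppressWarnings("unchecked")
            E seen = (E) ITEM.getOpaque(node);   // cheap, relaxed-ish load
            if (seen == searched || seen == null) {
                // First and third cases above: re-load with a volatile
                // (acquire) load before trusting what we saw.
                seen = node.item;
            }
            if (seen == searched) return true;
        }
        return false;
    }
}

In the second case (a non-null value that is not the one we're searching for), the cheap load is enough and we just move on to the next node.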

Here's a diagram that shows the possible observable states for item, depending on the possible "real" state of item: