
Wednesday, 24 July 2013

Java Concurrency Torture Update: Don't Stress

A piece of late news: the concurrency suite is back. And so is my little test on top of it.
A few months ago I posted on yet another piece of joyful brilliance crafted by Master Shipilev, the java-concurrency-suite. I used the same framework to demonstrate some valid concerns regarding unaligned access to memory, remember children:
  • Unaligned access is not atomic
  • Unaligned access is potentially slow, in particular:
    • On older processors
    • If you cross the cache line
  • Unaligned access can lead to a segmentation fault on non-Intel architectures; the result might be a severe performance hit (if the OS traps and patches up the access) or your process crashing...
It was so much fun that someone had to put a stop to it, and indeed they did.
The java-concurrency-suite had to be removed from github, and the Master Inquisitor asked me to stop fucking around and remove my fork too, if it's not too much trouble... so I did.
Time flew by and at some point the powers that be decided the tool could return, but torture was too strong a word for those corporate types, and so it has been rebranded JCStress. JC has nothing to do with the Jewish Community, as some might assume; it is the same Java Concurrency, but now it's not torture (it's sanctioned after all), it's stress. Aleksey is simply stressing your JVM, he is not being excessively and creatively sadistic, that would be too much!
Following the article I had a short discussion with Mr T and Gil T with regard to unaligned access in the comments to Martin's blog post on off-heap tuple-like storage. The full conversation is long, so I won't bore you, but the questions asked were:
  1. Is unaligned access a performance only, or also a correctness issue?
  2. Are 'volatile' write/reads safer than normal writes/reads?
The answers to the best of my understanding are:
  1. Unaligned access is not atomic[Correction 23/09/2013: On later Intel processors unaligned access within the cache line is atomic, but access across the line is not. See update below and related later post], and therefore can lead to the sort of trouble you will not usually experience in your programs (i.e. half-written long/int/short values). This is a problem even if you use memory barriers correctly. It adds a 'happened-badly' eventuality to the usual happens before/after reasoning and special care must be taken to not get slapped in the face with a telephone pole.
  2. Volatile read/writes are NOT special. In particular doing a CAS[Correction 23/09/2013: CAS is atomic, all others are not. See update below and related later post]/putOrdered/putVolatile write to an unaligned location such that the value is written across the cache line is NOT ATOMIC.
This came up in a discussion recently, which prompted me to go and fork JCStress onto Bitbucket and rerun the experiments on both a Core2Duo and a Xeon processor, re-confirming the result. This is a problem for folks considering the Atomics for ByteBuffer suggestion discussed on the concurrency-interest mailing list. There is nothing in the ByteBuffer interface to stop you from doing unaligned access...
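To make that last point concrete, here's a minimal sketch (mine, not from the original exchange) of how easily an unaligned long write slips through the ByteBuffer API:

import java.nio.ByteBuffer;

public class UnalignedByteBufferWrite {
    public static void main(String[] args) {
        // Direct buffers are only guaranteed to start 8-byte aligned;
        // the API happily accepts any index, aligned or not.
        ByteBuffer buffer = ByteBuffer.allocateDirect(128);
        int index = 61; // not a multiple of 8; depending on where the buffer lands it may also straddle a cache line
        buffer.putLong(index, -1L);
        System.out.println(Long.toHexString(buffer.getLong(index)));
    }
}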

UPDATE (1/08/2013): Shipilev's slides from JVMLS lightning talk.

UPDATE (23/09/2013): I've discussed the issue further with Gil and Martin, which led to the following post. Some corrections to the above observations have been made, which I have now inlined above.

Sunday, 7 April 2013

135 Million messages a second between processes in pure Java

{This post is part of a long running series on lock free queues, checkout the full index to get more context here}
Porting an existing single producer/single consumer concurrent queue into an IPC mechanism via memory mapped files and getting 135 million messages throughput in pure Java.
In my previous post I covered a single producer/consumer queue developed and shared by Martin Thompson capable of delivering an amazing 130M messages per second. The queue he delivered is a great tool for communicating between threads, but sometimes communicating between threads is not enough. Sometimes you need to leave your JVM and go out of process. Inter Process Communication (IPC) is a different problem from inter-thread communication; can it be cracked by the same approach?

IPC, what's the problem?

Inter Process Communication is an old problem and there are many ways to solve it (which I will not discuss here). There are several attractions to specialized IPC solutions for Java:
  • Faster than socket communication.
  • An out of process integration option with applications written in other languages.
  • A means of splitting large VMs into smaller ones, improving performance by allowing GC and JIT specialization.
For IPC to be attractive it has to be fast, otherwise you may as well go for network based solutions which would extend beyond your local machine uniformly. I attended an Informatica conference a while back and got talking to Todd Montgomery about the Disruptor and mechanical sympathy, and he suggested that IPC should be able to perform as well as inter-thread messaging. I found the idea interesting and originally meant to port the Disruptor, but Martin's queue is simpler (and quicker) so I went for that instead. Starting with a good algorithm/data structure helps a great deal; now I just needed to bridge the gap and see if I could maintain the benefits.

 

Off the heap we go!

To do IPC we must go off-heap. This has several implications for the queue, most importantly that references are not supported. Also note that data has to be copied into and out of the queue, though one could extend my implementation to support a zero-copy interaction where a struct is acquired, written and committed instead of using the offer method, and similarly acquired, read and finally released instead of the poll method. I plan to make several flavours of this queue to test out these ideas in the near future.
My IPC queue uses a memory mapped file as a means of acquiring a chunk of shared memory; there is no intention to use the persisted values, though further development in that direction may prove interesting to some. So now that I had some shared memory, I had to put the queue in it.
I started by laying out the queue counters and cached counters. After realizing the counters need to be aligned to work properly I learnt how to align memory in Java. I went on to verify that aligned memory offers the guarantees required for concurrent access. Quick summary:
  • aligned access means writing data types to addresses which are a multiple of their size.
  • unaligned access is not atomic, which is bad for concurrency :(
  • unaligned access is slow, which is bad for performance :(
  • unaligned access may not work, depending on OS and architecture. Not working is very bad :(
Sorting out alignment is not such a big deal once you know how it works. One of the nice things about going off-heap is that solving false sharing becomes far more straightforward. Move your pointer and you're in the next cache line, job done. This left me rather frustrated with the tricks required to control memory layout in Java. Going back to the original implementation you will notice the Padded classes whose role it is to offer false sharing protection. They are glorious hacks (with all due respect) made necessary by this lack of control. The @Contended annotation coming in JDK 8 will hopefully remove the need for this.
This is how the memory layout worked out. To illustrate in glorious ASCII graphics (each - is a byte), this is what it looks like when broken into cache lines:
|--------|--------|--------|head....|--------|--------|--------|--------|
|--------|--------|--------|tailCach|--------|--------|--------|--------|
|--------|--------|--------|tail----|--------|--------|--------|--------|
|--------|--------|--------|headCach|--------|--------|--------|--------|
|int1int2|int3int4|int5int6|etcetcet|cetcetce|tcetcetc|etcetcet|cetcetce|
...



I played around with mixing off-heap counters with an on-heap buffer, but in the interest of brevity I'll summarize and say the JVM does not like that very much, and the end result performance is not as good as the all-on-heap/all-off-heap solutions. The code is available with everything else.
Once alignment and memory layout were sorted I had to give up the flexibility of having reference pointers and settle for writing my data (an integer) directly into the memory. This leaves my queue very restrictive in its current form. I intend to revisit it and see what I can do to offer a more extendable API on top of it.
Let me summarize the recipe at this point:
  • Create a memory mapped file large enough to hold:
    • 4 cache lines for counters/cached counters.
    • 4 bytes (per integer) * queue capacity (which must be a power of 2).
    • 1 spare cache line to ensure you can align the above to the cache line.
  • Get a mapped byte buffer, which is a direct byte buffer on top of the mapped memory.
  • Steal the address and get the contained aligned byte buffer.
  • Set up pointers to the counters and to the beginning of the buffer.
  • Replace use of natural counters with off heap counters accessed via Unsafe using the pointers.
  • Replace use of array with use of offset pointers into buffer and Unsafe access.
  • Test and debug until you work out the kinks...
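A minimal sketch of the setup steps (the file name, capacity and the use of the internal sun.nio.ch.DirectBuffer API to get the mapping address are my own illustrative choices, not the original code):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class IpcQueueLayoutSketch {
    private static final int CACHE_LINE = 64;

    public static void main(String[] args) throws Exception {
        int capacity = 1 << 15;                                    // queue capacity, power of 2
        long size = 4L * CACHE_LINE                                // counters + cached counters
                  + 4L * capacity                                  // 4 bytes per integer slot
                  + CACHE_LINE;                                    // slack so we can align to the line
        try (RandomAccessFile file = new RandomAccessFile("ipc-queue.map", "rw");
             FileChannel channel = file.getChannel()) {
            MappedByteBuffer mapped = channel.map(FileChannel.MapMode.READ_WRITE, 0, size);
            // Internal API (JDK 6/7 era): the native address of the mapped region.
            long address = ((sun.nio.ch.DirectBuffer) mapped).address();
            // Round up to the next cache line; the 4 counter lines come first
            // (head, tail-cache, tail, head-cache), the integer buffer follows.
            long aligned = (address + CACHE_LINE - 1) & ~(CACHE_LINE - 1L);
            long headAddress = aligned;
            long tailAddress = aligned + 2L * CACHE_LINE;
            long bufferAddress = aligned + 4L * CACHE_LINE;
            System.out.printf("head=%x tail=%x buffer=%x%n", headAddress, tailAddress, bufferAddress);
            // From here, Unsafe.putOrderedLong/getLongVolatile on the counter addresses and
            // Unsafe.putInt/getIntVolatile at bufferAddress + (index << 2) replace the on-heap fields.
        }
    }
}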
The above code should give you a fair idea of how it works out, and the rest is here. This queue can work in-process and out of process, as demonstrated in the tests included in the repository. Now that it works (for the limited use case, and with room for further improvement... but works), is it fast enough? not so fast? is it...<gasp> ... FASTER????!?!?!

Smithers, release the hounds

Here are the numbers for using the different implementations in process:

Implementation / Affinity      Same core   Cross core   Cross socket
P1C1QueueOriginal3             110M        130M         19M
P1C1OffHeapQueue               130M        220M         200M
P1C1QueueOriginalPrimitive     124M        220M         215M


Confused? Let me explain. The first line shows the measurements taken for the original queue, similar to what was presented in the previous post, though I saw a slight improvement in the results after increasing the compile threshold to 100000.
The second line is my off-heap implementation of the same algorithm. It is significantly faster. This is not IPC yet, this is in-process. The reason it is faster is that the data is inlined in the queue, which means that by loading an entry in the queue we get the data itself as opposed to a reference to the data. A reference is what you get when you have an Object[] array: the array holds the references and the data is elsewhere, and that indirection seems to become more painful as we get further from the producer.
The last entry is a mutation of P1C1QueueOriginal3 into a primitive-array-backed queue, to compare performance like for like. As you can see it displays very similar results to the off-heap implementation, supporting the theory that data inlining is behind the observed performance boost.
The lesson here is an old one, namely that pointer chasing is an expensive business, further amplified by the distance between the producing CPU and the consuming CPU.
The off-heap queue can offer an alternative to native code integration as the consuming thread may interact directly with the off-heap queue and write results back to a different off-heap queue.
Running a similar benchmark adapted to use a memory mapped file as the backing DirectByteBuffer for the off-heap queue we get:
    same core      - ops/sec=135M
    across cores   - ops/sec=98M
    across sockets - ops/sec=25M



JOY! A pure Java IPC mechanism that gives you 135M messages per second is more throughput than you'd get with most commercial products out there. This is still not as fast as the same queue in-process, and I admit I'm not sure what the source of the performance difference is. Still, I am quite happy with it.
A few notes/observations from the experimentation process:
  1. I got a variety of results, stabilizing around different average throughputs. I chose the best for the above summary and plan to go into detail about the results in the near future.
  2. The JVM was launched with: -XX:+UseCondCardMark -XX:CompileThreshold=100000
  3. Removing the Thread.yield from the producer/consumer loops improved performance when running on the same core, but made it worse otherwise.
  4. Moving the queue allocation into the test loop changes the performance profile dramatically.
  5. I've not had time to fully explore the size of the queue as a variable in the experiment, but the little I've done suggests it makes a difference; choose the right size for your application.
I realize this post is rather less accessible than the previous one, so if you have any questions please ask.

Tuesday, 12 February 2013

Alignment, Concurrency and Torture (x86)

Summary: Exploring the effects of unaligned/aligned concurrent access in Java for off/on heap memory, how to test for correctness, torture and cushions. 

Concurrent access to memory is a pain at the best of times. The JMM offers Java programmers a way to reason about concurrency in a platform independent manner, and while not without fault it's certainly better than having no memory model at all. Other languages are catching up of course, but the focus of this article is not comparative concurrency, so look it up yourselves ;).
When we go off heap via Unsafe or Direct Byte Buffers, or when we utilize Unsafe to gain direct access to on heap memory we are on our own. Direct memory access strips away the JVM guarantees and leaves you exposed to the underlying OS/architecture memory access rules and the interpretation of the JVM runtime.
Should you be worried?
Silly question.
If you are not the worried type, follow this link with my blessings.
Worried? Here are 2 guarantees, now gone, to worry about:
  • Atomicity - memory access on the JVM is guaranteed to be atomic (for long/double this only holds for volatile fields), even when the underlying system does not support it. No similar guarantees are available for ByteBuffers. In fact ByteBuffers are explicitly not thread safe, so atomicity is not guaranteed at all. Unsafe has no guarantees for anything, so caution is advised.
  • Word tearing - this means concurrent updates to adjacent bytes in memory do not result in corruption of state. Some processors perform memory writes in units of a word, which can be 2 or 4 bytes. This can result in the unmolested parts of the word overwriting bytes written to by other threads. While the JLS is binding for 'plain' memory access (via assignment) the guarantee does not necessarily extend to off heap memory.
The problem with verifying these behaviours is that concurrency guarantees and issues are by definition timing dependent. This means that while most of the time there is no issue, that is not to say that throwing a few extra processors in, or changing memory layout, or maybe just waiting a bit longer will not trigger the issue...
So how do you prove your code is thread safe once you have stepped off-heap and into the wild? I recently answered a question on stack overflow regarding testing for correctness under multi-threaded access of a data structure. My answer was accepted, so obviously it's correct; I cut out the niceties to keep this short:
 1.  Formally prove the correctness of your program in terms of the JMM.

 2.  Construct test cases which demonstrate the intended behaviour by using count down latches or other means of suspending threads of execution at specific points in your program to force contention (a minimal sketch of this follows the list).

 3.  Statistically demonstrate correctness by exercising the code over a sufficient period of time from multiple threads.
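For illustration, a bare-bones example of option 2's latch trick (my own sketch, not from the original answer):

import java.util.concurrent.CountDownLatch;

public class ForcedContentionExample {
    static int shared = 0;

    public static void main(String[] args) throws InterruptedException {
        final CountDownLatch start = new CountDownLatch(1);
        Runnable increment = new Runnable() {
            public void run() {
                try {
                    start.await();   // park both threads on the same point...
                } catch (InterruptedException e) {
                    return;
                }
                shared++;            // ...then release them onto the operation under test
            }
        };
        Thread t1 = new Thread(increment);
        Thread t2 = new Thread(increment);
        t1.start();
        t2.start();
        start.countDown();           // release both as close together as possible
        t1.join();
        t2.join();
        System.out.println("shared = " + shared + " (a lost update shows up as 1)");
    }
}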
This is all fine for when you write 'safe' Java, but option 1, and therefore option 2, go out the window when you are straying off-spec. This is an important observation to make: if the rules are not well defined then testing the edge cases for those rules is not that meaningful.
Which leaves us with number 3.

Science is torture

It is a fact well known to those who know it well that a scientific experiment can only prove your theory is either definitely wrong or so far right. This (limited) power of induction from past experiences is what empiricism is all about. To prove JVM implementations implement the JMM correctly Mr. Shipilev (Benchmarking Tzar) has put together the Java Concurrency Torture Suite, into which he drags JVMs and tries to break them on any number of architectures (not to worry, he tries it on cute bunnies first). It's a great project for anyone trying to reason about the JMM, as it gives you examples in code of edge cases and the corresponding range of expected behaviours on all architectures. It also means that if you have questionable behaviours you want to explore on a particular architecture, Aleksey just saved you a lot of work.
The JCTS project is a very professional piece of work, so I'll try not to mince words over what you can read in the readme. The general approach is that if you leave it (the concurrent behaviour) long enough the edge cases will come crawling out of the woodwork. Here's one bug found using it; the discussion in the ticket is quite interesting.
The test runs also report how many times each state was detected, which has the added benefit of educating us on the distribution of occurrences for the different cases. The next version may support ASCII bar charts too. Instead of yapping on about it, let's see how atomicity is tackled and all will become clear.

 

Biggles! Fetch...THE CUSHIONS!

The JCTS already has atomicity tests, I just had to add my own to target unaligned access. I added 2 tests, one for unaligned access within the line and one for crossing the cache line; for brevity this is just the long test:
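The harness test itself is not reproduced here; the following stand-alone sketch (my own, outside the harness) captures the same idea: one thread keeps writing 0L and -1L as a long at an unaligned offset while another samples the value and counts anything that is neither:

import java.nio.ByteBuffer;

public class UnalignedLongAtomicityCheck {
    public static void main(String[] args) throws InterruptedException {
        final ByteBuffer buffer = ByteBuffer.allocateDirect(256);
        // Internal API: find the native address so we can pick an index that is
        // unaligned (not a multiple of 8) but stays inside a single 64-byte line.
        // Shifting the index so the write spans a line boundary (e.g. +60 instead
        // of +1) gives the cache-straddling variant.
        long address = ((sun.nio.ch.DirectBuffer) buffer).address();
        final int index = (int) (64 - (address & 63)) + 1;
        Thread writer = new Thread(new Runnable() {
            public void run() {
                for (long i = 0; i < 100_000_000L; i++) {
                    buffer.putLong(index, (i & 1) == 0 ? 0L : -1L);
                }
            }
        });
        writer.start();
        long torn = 0;
        while (writer.isAlive()) {
            long observed = buffer.getLong(index);
            if (observed != 0L && observed != -1L) {
                torn++; // neither written value: we caught a half-written long
            }
        }
        writer.join();
        System.out.println("torn observations: " + torn);
    }
}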
The method for allocating aligned direct memory byte buffers is described in this post. The cross cache line case is very similar with the position being set intentionally to cause cache straddled access.
I expected the straddled line access to be non-atomic and for once I was right (read on, it doesn't last)!!!
Note the fine printed output, a joy! The test runs several times and we can see how in the majority of cases things are fine, the observed result is either 0 or -1. But there are other values which should not be there (when our mother is not watching); those observed values are the result of the lack of atomicity. Reading the value in some undefined trashed state gives us... mmm... trash. As I suspected! Patting myself on the shoulder I moved on to the unaligned non-straddling tests.
The results suggested that unaligned access inside the cache line is atomic on my machine. I admit it was not what I thought. This is a bit surprising. Good old Intel, they can really make this work!


Our chief weapon is surprise

Going through these experiments and results reminded me of a comment made by Gil Tene (Azul CTO) on Martin Thompson's blog:
Cache alignment is only a performance issue. Never a correctness issue. Cache lines do not add or remove atomicity. Specifically, non atomic updates within a single cache line can still be seen non-atomically by other threads. The Java spec only guarantees atomic treatment (as in the bits in the field are read and written all-or-nothing) for boolean, byte, char, short, int, and float field types. The larger types (long and double) *may* be read and written using two separate operations (although most JVMs will still perform these as atomic, all-or-nothing operations).
BTW, all x86 variants DO support both unaligned data types in memory, as well as LOCK operations on such types. This means that on an x86, a LOCK operations that spans boundary between two cache lines will still be atomic (this little bit of historical compatibility probably has modern x86 implementors cursing each time).
I read the above to suggest Gil was thinking atomicity is a given across the cache line, so I sent him an e-mail to discuss my findings. Now Gil has been under the JVM hood more than most people, and this turned out to be quite enlightening:
I'm not saying that cross cache line boundaries cannot induce non-atomicity to occur more often. I *am* saying that whenever you see non-atomicity on accesses that cross cache line boundaries, that same non-atomicity potential was there [for the same code] even within a single cache line, and is usually there regardless of whether or not the in-same-cache-line access is aligned to the type size boundary. 
He went on to point out that experimentation as a means of proving correctness is not possible (I told you it's well known) and that where the spec doesn't promise you anything, you should expect nothing:
... there is no way to verify atomicity by experimentation. You can only disprove it, but cannot prove it.
The only way you can control atomicity on a known x86 arch. for off heap access would be to control ALL x86 instructions used to access your data, where knowing exactly which instructions are used and what the layout restrictions you impose would allow you to prove atomicity through known architectural qualities.

 

I didn't expect a kind of Spanish Inquisition

Mmmm... but how would I know? Given there are no guarantees on the one hand and the limitations of experimentation on the other, we are back to square one... And while musing on Gil's wise words I ran the unaligned access test again, and what do you know:

Bugger. Bugger. Bugger. It seems like a special value sneaks in every once in a while. This happens quite infrequently and as things worked out it failed to happen until my little exchange with Gil was over, leading me to believe Gil possesses that power of the experienced to have reality demonstrate for them on cue. On the other hand, now I know it doesn't work. Certainty at last.
What about aligned memory access? I gave it a good run and saw no ill effects, but given Gil's warnings this may not be enough. The JCTS already had a test for aligned access and there are no particular expectations there. This is important: it is OK for a JVM implementation to give you no atomicity on putLong/Int/... (via Unsafe or DirectByteBuffer). Aligned access on x86 is atomic, but as Gil points out we don't have full control on every level so that may not be a strong enough guarantee...

Conclusion? 

The takeaway from the above is that off-heap storage should be treated as if atomicity is not guaranteed, unless read/write alignment is properly handled. Is that the end of the world? Certainly not, but it does mean precautions must be taken if data is to be written and read concurrently. Note that mutating shared data is always a risk, but you may go further without noticing issues with a POJO approach. Also note that some of the tools used for proper object publication are not available, in particular final fields.


Many thanks to Aleksey Shipilev and Gil Tene for their feedback and guidance along the way. I leave word tearing as an exercise to the interested reader.

Update 14/02/2013:
Martin Thompson got back to me asking if the same atomicity issue is still there for putOrdered*/put*Volatile when the cache line is crossed. I expected it would be (so did he), wrote some more tests, and it turned out we were both correct. It makes no difference if you use put*/putOrdered*/put*Volatile/compareAndSwap* [correction: CAS is atomic, see update 23/09/2013 below]; if you cross the cache line there is no atomicity. Similarly I would expect unaligned access within the cache line to lack atomicity, but have not had time to add the tests. Code is on GitHub for your pleasure.
Thanks Martin.

Update 24/07/2013:
JCT has gone into the OpenJDK toolbox and is now public again as JCStress (I suggested JangBang, but Shipilev wouldn't listen :-(). I 'forked' it and added my tests to the new project, it's all available here. The tests are under the org.openjdk.jcstress.tests.unsafe.unaligned package.

Update 23/09/2013:
I've written a further post on JCStress here. And a further clarification on the atomicity of unaligned writes using different methods here. The bottom line being that the only method to perform guaranteed atomic unaligned writes is by using CAS. Volatile reads are not atomic however, so this method is of limited use.

Sunday, 6 January 2013

Direct Memory Alignment in Java

Summary: First in a quick (hopefully) series of posts on memory alignment in Java. This post introduces memory alignment, shows how to get memory aligned blocks, and offers an experiment and some results concerning unaligned memory performance.

Since the first days of Java, one of the things you normally didn't need to worry about was memory. This was a good thing for all involved, who cared little if some extra memory was used, where it was allocated, and when it would be collected. It's one of the great things about Java really. On occasion however we do require a block of memory. A common cause for that is doing IO, but more recently it has been used as a means of having off-heap memory, which allows some great performance benefits for the right use case. Even more recently there has been talk of bringing back C-style structs to Java, to give programmers better control over memory layout. Having stayed away from direct memory manipulation for long (or having never had to deal with it) you may want to have a read of the most excellent series of articles "What Every Programmer Should Know About Memory" by Ulrich Drepper. I'll try and cover the required material as I go along.

 

Getting it straight

Memory alignment is mostly invisible to us when using Java, so you can be forgiven for scratching your head. Going back to basic definitions here:
"A memory address a, is said to be n-byte aligned when n is a power of two and a is a multiple of n bytes.
A memory access is said to be aligned when the datum being accessed is n bytes long and the datum address is n-byte aligned. When a memory access is not aligned, it is said to be misaligned. Note that by definition byte memory accesses are always aligned."
In other words, for n which is a power of 2:
boolean nAligned = (address%n) == 0;
Memory alignments of consequence are:
  1. Type alignment (1,2,4,8) - Certain CPUs require aligned access to memory and as such would get upset (atomicity loss) if you attempt misaligned access to your data, e.g. if you try to MOV a long to/from an address which is not 8-byte aligned.
  2. Cache line alignment (normally 64, can be 32/128/other) - The cache line is the atomic unit of transport between the main memory and the different CPU caches. As such, packing relevant data to that alignment and within those boundaries can make a difference to your memory access performance. Other considerations apply here due to the rules of interaction between the CPU and the cache lines (false sharing, word tearing, atomicity).
  3. Page size alignment (normally 4096, can be configured by the OS) - A page is the smallest chunk of memory transferable to an IO device. As such, aligning your data to the page size and using the page size as your allocation unit can make a difference to interactions when writing to disk/network devices.
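In code, all three checks boil down to the same power-of-2 test (the address below is hypothetical, purely for illustration):

public class AlignmentChecks {
    // Equivalent to (address % n) == 0 when n is a power of 2
    static boolean isAligned(long address, long n) {
        return (address & (n - 1)) == 0;
    }

    public static void main(String[] args) {
        long address = 0x7f2a10040L; // hypothetical address, for illustration only
        System.out.println("type (8-byte) aligned:   " + isAligned(address, 8));
        System.out.println("cache line (64) aligned: " + isAligned(address, 64));
        System.out.println("page (4096) aligned:     " + isAligned(address, 4096));
    }
}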
Memory allocated in the form of objects (anything produced by 'new') and primitives is always type aligned and I will not be looking much into it. If you want to know more about what the JVM makes of your objects, this excellent blog post should help.
As the title suggests I am more concerned with direct memory alignment. There are several ways to get direct memory access in Java; they boil down to 2 methods:
  1. JNI libraries managing memory --> not worried about those for the moment
  2. Direct/MappedByteBuffer/Unsafe which are all the same thing really. The Unsafe class is at the bottom of this gang and is the key to official and semi-official access to memory in Java.
The memory allocated via Unsafe.allocateMemory will be 8-byte aligned. Memory acquired via ByteBuffer.allocateDirect will ultimately go through Unsafe.allocateMemory but differs in a few important ways:

  1. ByteBuffer memory will be zero-ed out.
  2. Up until JDK6 direct byte buffers were page aligned, at the cost of allocating an extra page of memory for every byte buffer allocated. If you allocated lots of direct byte buffers this got pretty expensive. People complained, and as of JDK7 this is no longer the case. This means that direct ByteBuffers are now only 8-byte aligned.
  3. Less crucial, but convenient: memory allocated via ByteBuffer.allocateDirect is freed as part of the ByteBuffer object GC. This mechanism is only half comforting as it depends on the finalizer being called[UPDATE: things are a tad more complex than this, but this is not the topic of this post... DirectByteBuffer cleanup is done by a sun.misc.Cleaner, which is a PhantomReference to the buffer and will trigger deallocation when the PhantomReference is processed, see the code for more details], and that is not much to depend on.

Allocating an aligned ByteBuffer

No point beating about the bush, you'll find it's very similar to how page alignment works/used to work for ByteBuffers, here's the code:
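The following is a sketch of that approach (assuming the internal sun.nio.ch.DirectBuffer API to obtain the native address; a reflective read of the Buffer.address field would do just as well):

import java.nio.ByteBuffer;

public class AlignedDirectBufferAllocator {
    // Allocate a direct ByteBuffer whose position 0 is aligned to 'align' bytes.
    // 'align' must be a power of 2. We over-allocate by 'align' bytes so that an
    // aligned slice of the requested capacity is guaranteed to fit.
    public static ByteBuffer allocateAligned(int capacity, int align) {
        if (Integer.bitCount(align) != 1) {
            throw new IllegalArgumentException("alignment must be a power of 2: " + align);
        }
        ByteBuffer buffer = ByteBuffer.allocateDirect(capacity + align);
        long address = ((sun.nio.ch.DirectBuffer) buffer).address(); // internal API
        int pad = (int) (align - (address & (align - 1))) & (align - 1);
        buffer.position(pad);
        buffer.limit(pad + capacity);
        return buffer.slice();
    }

    public static void main(String[] args) {
        ByteBuffer aligned = allocateAligned(4096, 64);
        long base = ((sun.nio.ch.DirectBuffer) aligned).address();
        System.out.println("aligned to 64 bytes: " + ((base & 63) == 0));
    }
}

The same over-allocate-and-slice trick is what the old page-alignment behaviour amounted to, just with 4096 as the alignment.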

To summarize the above, you allocate as much extra memory as the required alignment to ensure you can position a large enough memory slice in the middle of it.

 

Comparing Aligned/Unaligned access performance

Now that we can have an aligned block of memory to play with, lets test drive some theories. I've used Caliper as the benchmark framework for this post (it's great, I'll write a separate post on my experience/learning curve) [UPDATE: This post is from the dark pre-JMH days, these days you should NOT use Caliper, use JMH instead] and intend to convert previous experiments to use it when I have the time. The theories I wanted to test out in this post are:
Theorem I: Type aligned access provides better performance than unaligned access.
Theorem II: Memory access that spans 2 cache lines has far worse performance than aligned mid-cache line access.
Theorem III: Cache line access performance changes based on cache line location.
To test the above theorems I put together a small experiment/benchmark, with an algorithm as follows:
  1. Allocate a page aligned memory buffer of set sizes: 1,2,4,8,16,32,64,128,256,512,1024,2048 pages. Allocation time is not measured but done up front. The different sizes will highlight performance variance which stems from the cache hierarchy.
  2. We will write a single long per cache line (iterate through the whole memory block), then go back and read the written values (iterate through the whole memory block). There are usually 64 cache lines in a page (4096b/64b case), but we can only write 63 longs which will 'straddle' 2 lines, so the non-straddled offset will skip the first line to have the same number of operations.
  3. Repeat step 2 (2048/(size in pages)) times. The point is to do the same number of memory access operations per experiment.
Here is the code for the experiment:
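The benchmark itself is not reproduced here; stripped of the Caliper harness, the core of each experiment looks roughly like this (sizes and offsets as described above; note the original uses a page-aligned buffer allocated as shown earlier):

import java.nio.ByteBuffer;

public class OffsetAccessExperiment {
    static final int PAGE_SIZE = 4096;
    static final int CACHE_LINE = 64;

    // One long written per cache line at the given offset, then read back.
    // The first and last lines are skipped so aligned and straddled runs do the same number of ops.
    static long pass(ByteBuffer buffer, int offset) {
        long sum = 0;
        int last = buffer.capacity() - 2 * CACHE_LINE;
        for (int i = CACHE_LINE; i <= last; i += CACHE_LINE) {
            buffer.putLong(i + offset, i);
        }
        for (int i = CACHE_LINE; i <= last; i += CACHE_LINE) {
            sum += buffer.getLong(i + offset);
        }
        return sum;
    }

    public static void main(String[] args) {
        int pages = 64;   // 1..2048 in the actual experiment
        int offset = 60;  // 0-7 for Theorem I, 60 for Theorem II, 0,8,...,56 for Theorem III
        // NOTE: allocateDirect is only 8-byte aligned; the real experiment uses the
        // page-aligned allocation technique from the previous section.
        ByteBuffer buffer = ByteBuffer.allocateDirect(pages * PAGE_SIZE);
        int repetitions = Math.max(1, 2048 / pages); // same number of accesses per run
        long start = System.nanoTime();
        long sum = 0;
        for (int r = 0; r < repetitions; r++) {
            sum += pass(buffer, offset);
        }
        System.out.println((System.nanoTime() - start) / 1000 + "us (sum=" + sum + ")");
    }
}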


Before moving on to the results please note that this test is very likely to yield different results on different hardware, so your mileage is very likely to vary. If you find the time to run the test on your own hardware I'd be very curious to compare notes. I used my laptop, which has an i5-2540M processor(Sandy Bridge, dual core, hyper threaded, 2.6Ghz). The L1 Cache is 32K, L2 is 256K and L3 is 3M. This is important in the context of the changes in behaviour we would expect to see as the data set exceeds the capacity of each cache. Should you run this experiment on a machine with different cache allocations, expect your results to vary in accordance.
This experiment demonstrates a very simple use case (read/write a long). There is no attempt to exploit the many CPU features which may indeed require stronger alignment than simple reading and writing. As such, take the results to be limited in scope to this use case alone.
Finally, note that the charts reflect performance relative to offset-0 memory access. As the performance also varied as a function of memory buffer size, I felt this helps highlight the difference in behaviour which is related to offset rather than size. The effect of the size of the memory buffer is reflected in this chart, showing time as a function of size for offset 0:
The chart shows how the performance is determined by the cache the memory buffer fits into. The cache sizes in pages are 8 for L1, 64 for L2, and 768 for L3, which correspond to the steps (200us, 400us, 600us) we see above, with a final step into main memory (1500us). While not the topic at hand, it is worth noting the effect a reduced data set size can have on your application performance, and by extension the importance of realistic data sets in benchmarking.

 

Theorem I: Type aligned access provides better performance than unaligned access - NOT ESTABLISHED (for this particular use case)

To test Theorem I I set offset to values 0 to 7. Had a whirl and to my surprise found it made no bloody difference! I double and triple checked and still no difference. Here's a chart reflecting the relative cost of memory access:

What seems odd is the difference in performance for the first 4 bytes, which are significantly less performant than the rest for memory buffer sizes which fit in the L1 cache (32k = 4k * 8 = 8 pages). The rest show relatively little variation, and for the larger sizes there is hardly any difference at all. I have been unable to verify the source of the performance difference...
What is this about? Why would Mr. Drepper and many others on the good web lie to me? As it turns out, Intel in their wisdom have sorted this little nugget out for us recently with the Nehalem generation of CPUs, such that unaligned type access has no performance impact. This may not be the case on other non-Intel CPUs, or older Intel CPUs. I ran the same experiment on my i5 laptop and my Core2 desktop, but could not see a significant difference. I have to confess I find this result a bit alarming given how it flies in the face of accepted good practice. A more careful read of Intel's Performance Optimization Guide reveals that the recommendation to align your types stands, but it is not clear what adverse effects ignoring it may have for the simple use case we have here. I will perhaps attempt to construct a benchmark aimed at uncovering those issues in a further post. For now it may comfort you to know that breaking the alignment rule seems to not be the big issue it once was on x86 processors. Note that behaviour on non-x86 processors may be wildly different.

 

Theorem II: Memory access that spans 2 cache lines has far worse performance than aligned mid-cache line access - CONFIRMED

To test Theorem II I set the offset to 60. The result was a significant increase in the run time of the experiment. The difference was far more pronounced on the Core2 (x3-6 the cost of normal access) than on the i5 (roughly x1.5 the cost of normal access), and we can look forward to it being less significant as time goes by. This result should be of interest to anybody using direct memory access, as the penalty is quite pronounced. Here's a graph comparing offsets 0, 32 and 60:

The other side effect of cache line straddling is loss of update atomicity, which is a concern for anyone attempting concurrent direct access to memory. I will go back to the concurrent aspects of aligned/unaligned access in a later post.

 

Theorem III: Cache line access performance changes based on cache line location - NOT ESTABLISHED

This one is based on Mr. Drepper's article:
3.5.2 Critical Word Load

Memory is transferred from the main memory into the caches in blocks which are smaller than the cache line size. Today 64 bits are transferred at once and the cache line size is 64 or 128 bytes. This means 8 or 16 transfers per cache line are needed.
The DRAM chips can transfer those 64-bit blocks in burst mode. This can fill the cache line without any further commands from the memory controller and the possibly associated delays. If the processor prefetches cache lines this is probably the best way to operate.
If a program's cache access of the data or instruction caches misses (that means, it is a compulsory cache miss, because the data is used for the first time, or a capacity cache miss, because the limited cache size requires eviction of the cache line) the situation is different. The word inside the cache line which is required for the program to continue might not be the first word in the cache line. Even in burst mode and with double data rate transfer the individual 64-bit blocks arrive at noticeably different times. Each block arrives 4 CPU cycles or more later than the previous one. If the word the program needs to continue is the eighth of the cache line, the program has to wait an additional 30 cycles or more after the first word arrives
To test Theorem III I set the offset values to the different type aligned locations on a cache line for a long: 0,8,16,24,32,40,48,56. My expectation here was to see slightly better performance for the 0 location, as explained in Mr. Drepper's paper, with degrading performance as you get further from the beginning of the cache line, at least for the large buffer sizes where data is always fetched from main memory. Here's the result:

The results surprised me. As long as the buffer fit in my L1 cache, writing/reading from the first 4 bytes was significantly more expensive than elsewhere on the line, as discussed in Theorem I, but the other locations did not have measurable differences in cost. I pinged Martin Thompson, who pointed me at the Critical Word First feature/optimization explained here:
Critical word first takes advantage of the fact that the processor probably only wants a single word by signaling that it wants the missed word to appear first in the block.  We receive the first word in the block (the one we actually need) and pass it on immediately to the CPU which continues execution.  Again, the block continues to fill up in the "background".
I was unable to determine when exactly Intel introduced the feature to their CPUs, and I'm not sure which processors fail to supply it.

Conclusions/Takeaways

Of the 3 concerns stated, only cache line straddling proved to be a significant performance consideration. Cache line straddling impact is diminishing in later Intel processors, but is unlikely to disappear altogether. As such it should feature in the design of any direct memory access implementation and may be of interest to serializing/de-serializing code.
The other significant factor is memory buffer size, which is obvious when one considers the cache hierarchy. This should prompt us to make an effort towards more compact memory structures. In light of the negligible cost of type mis-alignment we may want to skip type alignment driven padding of structures when implementing off-heap memory stores.
Finally, this made me realize just how fast-moving the pace of hardware development is, and how quickly it invalidates past conventional wisdom/assumptions. So this is yet another piece of evidence validating the call to 'Measure, Measure, Measure' all things performance.

Running the benchmarks yourself

  1. Code is on GitHub
  2. Run 'ant run-unaligned-experiments'
  3. Wait... it takes a while to finish
  4. Send me the output?
To get the most deterministic results you should run the experiments pinned to a core which runs nothing else, that would reduce noise from other processes 'polluting' the cache.

Many thanks to Martin, Aleksey Shipilev and Darach Ennis for proof reading and providing feedback on initial drafts, any errors found are completely their fault :-).

[UPDATE 22/06/2015: Experiments were repeated in JMH form, arriving at similar conclusions with great insight and cross platform testing by Aleksey Shipilev.]

Second part is here: Alignment, Concurrency and Torture