Avoiding retpolines with static calls
Indirect calls happen when the address of a function to be called is not known at compile time; instead, that address is stored in a pointer variable and used at run time. These indirect calls, as it turns out, are readily exploited by speculative-execution attacks. Retpolines defeat these attacks by turning an indirect call into a rather more complex (and expensive) code sequence that cannot be executed speculatively.
Retpolines solved the problem, but they also slow down the kernel, so developers have been keenly interested in finding ways to avoid them. A number of approaches have been tried, a few of which were covered here in late 2018. While some of those techniques have been merged, static calls have remained outside of the mainline. They have recently returned in the form of this patch set posted by Peter Zijlstra; it contains the work of others as well, in particular Josh Poimboeuf, who posted the original static-call implementation.
An indirect call works from a location in writable memory where the destination of the jump can be found. Changing the destination of the call is a matter of storing a new address in that location. Static calls, instead, use a location in executable memory containing a jump instruction that points to the target function. Actually executing a static call requires "calling" to this special location, which will immediately jump to the real target. The static-call location is, in other words, a classic code trampoline. Since both jumps are direct — the target address is found directly in the executable code itself — no retpolines are needed and execution is fast.
Static calls must be declared before they can be used; there are two macros that can do that:
    #include <linux/static_call.h>

    DEFINE_STATIC_CALL(name, target);
    DECLARE_STATIC_CALL(name, target);
DEFINE_STATIC_CALL() creates a new static call with the given name that initially points at the function target(). DECLARE_STATIC_CALL(), instead, declares the existence of a static call that is defined elsewhere; in that case, target() is only used for type checking the calls.
Actually calling a static call is done with:
    static_call(name)(args...);

where name is the name used to define the call. This will cause a jump through the trampoline to the target function; if that function returns a value, static_call() will also return that value.
The target of a static call can be changed with:
    static_call_update(name, target2);

where target2() is the new target for the static call. Changing the target of a static call requires patching the code of the running kernel, which is an expensive operation. That implies that static calls are only appropriate for settings where the target will change rarely.
One such setting can be found in the patch set: tracepoints. Activating a tracepoint itself requires code patching. Once that is done, the kernel responds to a hit on a tracepoint by iterating through a linked list of callback functions that have been attached there. In almost every case, though, there will only be one such function. This patch in the series optimizes that case by using a static call for the single-function case. Since the intent behind tracepoints is to minimize their overhead to the greatest extent possible, use of static calls makes sense there.
This patch set also contains a further optimization not found in the original. Jumping through the trampoline is much faster than using a retpoline, but it is still one more jump than is strictly necessary. So this patch causes static calls to store the target address directly into the call site(s), eliminating the need for the trampoline entirely. Doing so may require changing multiple call sites, but most static calls are unlikely to have many of those. It also requires support in the objtool tool to locate those call sites during the kernel build process.
The end result of this work appears to be a significant reduction in the cost of the Spectre mitigations when using tracepoints — a slowdown of just over 4% drops to about 1.6%. It has been through a number of revisions, as well as some improvements to the underlying text-patching code, and appears to be about ready. Chances are that static calls will go upstream in the near future.
Posted Mar 27, 2020 4:53 UTC (Fri)
by alison (subscriber, #63752)
Posted Mar 27, 2020 6:50 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
This won't work if you make calls, for example, into a filesystem driver from the VFS layer.
Posted Mar 27, 2020 15:21 UTC (Fri)
by alison (subscriber, #63752)
In other words, if the calling code wants one function from library A and another from library B, we just change the code.
> This won't work if you make calls, for example, into a filesystem driver from the VFS layer.
That's pretty much the case I was considering: resolution of a base class pointer. In your example, the VFS has a default implementation of whatever action, and individual filesystems may override it. Compiler Explorer shows explicitly that the base class pointer resolution is in fact an indirect branch. AFAICT, this new mechanism offers no hope to eliminate this branch. Some type of "reflection" would be needed to resolve the problem, perhaps with _THIS_IP_ and _RET_IP_ in include/linux/kernel.h.
Posted Mar 27, 2020 17:48 UTC (Fri)
by Cyberax (✭ supporter ✭, #52523)
> That's pretty much the case I was considering: resolution of a base class pointer.
This actually can work faster than indirects if you only have a couple of implementations (which is normally the case).
Posted Mar 29, 2020 10:07 UTC (Sun)
by Homer512 (subscriber, #85295)
Posted Mar 29, 2020 14:28 UTC (Sun)
by anton (subscriber, #25547)
Inline caching corresponds pretty much with the "further optimization" of the article: it works with an inline call, and guards it with an inline check that the object has the correct class (in VFS, that the file has the correct file system type); it was useful if the class rarely changes for a call site.
Polymorphic inline caching has several inline calls, selected by several inline checks. It was useful if a call site gets passed objects of a few classes alternatingly, and may be appropriate for the VFS case.
These techniques fell out of fashion when correctly predicted indirect branches became as fast as direct branches and indirect-branch prediction became pretty good. Now, with Spectre they may see a renaissance (hopefully a short one curtailed by the introduction of properly fixed hardware).
Posted Mar 29, 2020 18:30 UTC (Sun)
by NYKevin (subscriber, #129325)
As I currently understand things, this is at best optimistic.
Spectre is not a simple hardware flaw that you can fix by rearranging a few transistors. It is a design flaw in speculative execution itself. Spectre has been endemic to every general-purpose microprocessor that has been manufactured in the past twenty or more years (and probably most special-purpose microprocessors too). To "fix" it, you would have to completely redesign, not just one or two architectures, but the entire concept of what a modern microprocessor looks like.
I am open to being corrected, of course. It's entirely possible that some manufacturers have gone down that "redesign everything from the ground up" road that I described, or perhaps they found a simpler path forward. But my current understanding is that workarounds, usually at higher layers of the system, are all we've got right now and for the foreseeable future.
Posted Mar 29, 2020 18:47 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
Posted Mar 29, 2020 20:56 UTC (Sun)
by anton (subscriber, #25547)
Another avenue is to make it hard to train the predictor for a Spectre v2 exploit, e.g., by scrambling the input and/or the output of the predictor with a per-thread secret that changes now and then. You can combine this avenue with the other one.
Posted Mar 29, 2020 21:01 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
Posted Mar 30, 2020 1:52 UTC (Mon)
by NYKevin (subscriber, #129325)
The problem is that performance is a form of observable behavior. So you have to completely disentangle performance variations from every piece of data in main memory, cache, and registers, or else you have to exhaustively prove that the code under execution has access to that data, in every security model that the system cares about. The latter would extend to, for example, determining that the code under execution lives inside of a Javascript sandbox and should not be allowed to access the state of the rest of the web browser. I don't see how the processor can be reasonably expected to do that, so we're back to the first option: The processor's performance must not vary in response to any data at all. But then you can't have a branch predictor.
I would really like someone to convince me that I'm wrong, because if true, this is rather terrifying. But as far as I can see, there is no catch-all mitigation for this problem.
Posted Mar 30, 2020 2:32 UTC (Mon)
by Cyberax (✭ supporter ✭, #52523)
Posted Apr 2, 2020 8:35 UTC (Thu)
by wahern (subscriber, #37304)
I'd be surprised if the bulk of instances were package isolated. The Xeon Platinum 8175M used for M5 instances has 24 cores per package, 48 threads, and therefore 48 vCPUs. AFAIU, EC2 doesn't share cores, but a vCPU is still the equivalent of a logical SMT-based core, so any instance type using less than 47 vCPUs would be leaving at least an entire core unused. AWS offers 2-, 8-, 16-, 32-, 48-, 64-, and 96-vCPU M5 instances. I'd bet a large number and probably a majority of customers are utilizing 2- to 32-vCPU instances, that they invariably share packages, and thus share L3 cache. And I'd also bet that 64-vCPU instances share one of their packages with other instances.
Posted Apr 2, 2020 16:37 UTC (Thu)
by Cyberax (✭ supporter ✭, #52523)
> I'd keep in mind for high-value assets absent details from AWS.
Posted Mar 30, 2020 9:50 UTC (Mon)
by anton (subscriber, #25547)
If you refer to the scrambling of the branch predictor inputs and outputs, note that the secret changes regularly (with an interval designed such that only a small part of the secret can be extracted).
Concerning attacks from within the same thread (as in JavaScript engines), assuming good-enough scrambling, the attacker cannot predict which other branches a branch is aliased with in the branch predictor, and also cannot get the prediction to branch to the gadget the attacker wants to be speculatively executed.
Concerning observable behaviour, lots of things are theoretically observable (e.g., through power consumption or electromagnetic emissions), and have been used practically on other components (e.g., emissions from CRTs or screen cables to watch what you are watching), but the danger is that you get into a "the sky is falling" mode of thinking that does not differentiate how realistically possible an attack is. I have seen hardware and software providers blamed for waiting for the demonstration of an attack before they start acting, but OTOH, if they acted on any far-fetched theoretical weakness (not even attack scenario) that somebody worries about, they would not have a product.
Posted Apr 2, 2020 9:32 UTC (Thu)
by ras (subscriber, #33059)
The more general problem is privilege separation. This is just a case of lower-privilege code being able to deduce what higher-privilege code left in its cache. I wonder how long it will be before someone figures out how to exploit spectre via a javascript JIT, or wasm. The only solution is hardware protecting the browser memory from the javascript, but the hardware doesn't provide anything lightweight enough. If we did have something lightweight enough micro kernels would be more viable.
Posted Apr 2, 2020 12:04 UTC (Thu)
by anton (subscriber, #25547)
If the access becomes non-speculative (the vast majority of cases in most code), the caches will be updated a number of cycles later. The performance impact of that is minimal.
If the access is on a mispredicted path, the caches will not be updated. If the line is then not accessed before it would be evicted, this is a win. If the line is accessed, we incur the cost of another access to an outer level or main memory. A somewhat common case is when you have a short if, and then a load after it has ended; if the if is mispredicted, and the load misses the cache, the speculative execution will first perform an outer access, and then it will be reexecuted on the corrected path and will perform the outer access again; the cost here is the smaller of the cache miss cost and the branch misprediction cost. And the combination is relatively rare.
Overall, not updating caches on speculative memory accesses reduces the usefulness of caches by very little, if at all (there are also cases where it increases the usefulness of caches).
A Spectre-based exploit using JavaScript has been published pretty early. No, MMU-style memory protection is not the only solution, Spectre-safe hardware as outlined above would also be a solution.
Posted Apr 2, 2020 11:58 UTC (Thu)
by davecb (subscriber, #1574)
I suspect that the cpu family used in the T5 from (the late, lamented) Sun Microsystems dodged this: its claim to fame was running other, non-blocked code whenever a thread was blocked on a memory fetch, as described in https://en.wikipedia.org/wiki/UltraSPARC_T1
Posted Apr 2, 2020 12:27 UTC (Thu)
by anton (subscriber, #25547)
The UltraSPARC T1 through SPARC T3 are in-order, but the SPARC T4, T5, M7, and M8 are OoO. Performing other threads on a cache miss does not protect against Spectre: the cache or other microarchitectural state can still be updated from a speculative load.
Recent in-order CPUs include Cortex A53 and A55, and Bonnell, the first generation of Intel's Atom line.
Posted Apr 2, 2020 12:47 UTC (Thu)
by davecb (subscriber, #1574)
Interestingly, the kind of side-channel attack that Spectre and friends are variants of was studied back in the mainframe days, well before anyone got around to inventing them (;-))
--dave
Posted Apr 2, 2020 13:27 UTC (Thu)
by farnz (subscriber, #17727)
Note, though, that even in-order CPUs can be affected. The core of Spectre family attacks is that the CPU speculatively executes code that changes micro-architectural state, and thus any CPU with speculation can be affected. There exist demonstrations that suggest that the Cortex A53, for example, is partially affected by Spectre, but the resulting side channel is too limited to be practical with the currently known exploits.
In theory, even a branch predictor can carry enough state to be attacked!
Posted Apr 2, 2020 15:12 UTC (Thu)
by davecb (subscriber, #1574)
Posted Apr 2, 2020 15:37 UTC (Thu)
by anton (subscriber, #25547)
A Spectre-style attack consists of a load of the secret, and some action that provides a side channel, with a data flow from the load to the action (otherwise you don't side-channel the secret). An OoO processor can do that, but an in-order processor as outlined above cannot AFAICS. If you can provide a reference to the A53 demonstration, I would be grateful.
Posted Apr 2, 2020 18:18 UTC (Thu)
by farnz (subscriber, #17727)
Even something as apparently trivial as using the branch predictor to decide which instructions to fetch is enough to make a timing difference that Spectre-type exploits can predict. "All" you need to do is ensure that the branch predictor correctly predicts for a single value of the secret and mispredicts for all others (or vice-versa), and that I have a timer that lets me determine whether or not the branch predictor correctly predicted (made easier if I can control L1I$ contents), and I have a side-channel. Not a very good side channel, but enough that I can extract secrets slowly.
And there's no point having a branch predictor if you then insert a pipeline bubble that means that there's no timing difference between a correct prediction and a mispredict; it's that timing difference that creates the side-channel, though, so you need to somehow prevent the attacker from measuring a timing difference there to be completely Spectre-free.
The demonstration was something in-house, using exactly the components I've described above; it was a slow enough channel to be effectively useless, but it existed.
Posted Apr 3, 2020 7:59 UTC (Fri)
by anton (subscriber, #25547)
But for a Spectre-type attack the branch predictor needs to be updated based on the speculatively loaded value. This does not happen in in-order processors (because the execution does not get that far), and it will not happen in a Spectre-resistant OoO CPU (e.g., by only updating the branch predictor based on committed branches).
Note that not all side-channel attacks are Spectre attacks. We can prevent some side channels through hardware design. E.g., there are Rowhammer-immune DRAMs; I also dimly remember some CPU-resource side channel (IIRC a limitation on the cache ports) that was present in one generation of Intel CPUs, but not in the next generation, which had more resources. Closing other side channels in the non-speculative case appears to be too expensive, e.g., cache and branch predictor side channels, as you explain.
There are established software mitigations for protecting critical secrets (branchless code, data-independent memory accesses for code that handles these secrets) such as encryption keys from non-Spectre/Meltdown side channels. But these do not help against Spectre, because Spectre can use all code in the process to extract the secret.
From your description, you read about a non-Spectre side-channel attack through the branch predictor of the A53.
Posted Apr 3, 2020 14:30 UTC (Fri)
by excors (subscriber, #95769)
I don't see why an in-order processor couldn't do that, in principle. They're still going to do branch prediction, and speculatively push some instructions from the predicted target into the start of the pipeline. Once they detect a mispredict they'll flush the pipeline and start again with the correct target. As long as the flush happens before the mispredicted instructions reach a pipeline stage that writes to memory or to registers, that should be safe (ignoring Spectre). But if the mispredicted instructions read from a memory address that depends on a register (which may contain secret data), and the read modifies the TLB state or cache state, then it would be vulnerable to Spectre-like attacks.
(I'm thinking of an artificial case like "insecure_debug_mode = 0; int x = secret; if (insecure_debug_mode) x = array[x];", where mispredicting the 'if' means the very next instruction will leak the secret data via the cache. It doesn't need a long sequence of speculative instructions. That could be a case where the programmer has carefully analysed the expected code path to avoid data-dependent memory accesses etc so their code is safe from side-channels attacks according to the advertised behaviour of the processor, but the processor is speculatively executing instructions that the programmer did not expect to be executed, so I think that counts as a Spectre vulnerability. As I see it, the fundamental issue with Spectre is that it doesn't matter how careful you are about side-channels when writing code, because the processor can (speculatively) execute an arbitrarily different piece of code that is almost impossible for you to analyse.)
In practice, in-order processors seem to typically have short enough pipelines that the misprediction is detected within a few cycles, before the mispredicted instructions have got far enough through the pipeline to have any side effects. But that seems more by luck than by design, and maybe some aren't quite lucky enough. OoO processors aren't more dangerous because they're OoO, they're more dangerous because they typically have much longer pipelines and mispredicted instructions can progress far enough to have many side effects.
Posted Apr 3, 2020 16:10 UTC (Fri)
by anton (subscriber, #25547)
The length of the pipeline of an OoO processor is not the decisive factor. Unlike an in-order processor, speculatively executed instructions on an OoO do not just enter the pipeline, they produce results that other speculatively executed instructions can use as inputs. Essentially the in-order front end (instruction fetch and decoding) is decoupled from the in-order commit part (where the results become architecturally visible) by the OoO engine, and the front end can run far ahead (by hundreds of instructions and many cycles) of the commit part.
Posted Mar 30, 2020 9:32 UTC (Mon)
by pbonzini (subscriber, #60935)
> These techniques fell out of fashion when correctly predicted indirect branches became as fast as direct branches and indirect-branch prediction became pretty good. Now, with Spectre they may see a renaissance (hopefully a short one curtailed by the introduction of properly fixed hardware).
Inline caching and polymorphic inline caching are not only useful to avoid indirect branches, but also (and especially) to avoid method lookups in dynamic languages. The code for an inline cache would be more like this
Posted Mar 30, 2020 17:35 UTC (Mon)
by anton (subscriber, #25547)
But why did they patch the code rather than using separate variables as your C code does? The paper states
Posted Mar 29, 2020 0:39 UTC (Sun)
by ndesaulniers (subscriber, #110768)
Posted Mar 27, 2020 7:36 UTC (Fri)
by josh (subscriber, #17465)
(This would also allow eliminating a type of callback entirely if no non-NULL function is ever stored in that function pointer. That could optimize various kinds of function pointers that don't always get used in a given struct of pointers.)
Posted Mar 27, 2020 9:07 UTC (Fri)
by epa (subscriber, #39769)
Posted Mar 27, 2020 9:31 UTC (Fri)
by gus3 (guest, #61103)
Posted Mar 27, 2020 9:39 UTC (Fri)
by epa (subscriber, #39769)
Posted Mar 28, 2020 10:41 UTC (Sat)
by ncm (guest, #165)
Fetishism around object-oriented conventions is worse than pointless. Virtual calling is a language mechanism that has various uses, implementing polymorphic runtime binding being one among them.
Sometimes virtual calls perform better than calling through a function pointer, because there is less choice. But sometimes the function pointer call is faster, particularly if the pointer rarely or never changes. On modern cores, both are hardly slower than a direct call except in pathological cases.
A virtual call, by offering fewer choices for what could happen, is often easier for a reader to understand than a call through a function pointer. So, it is often a good idea to cobble up a base class just to provide a virtual call, even where there is logically no meaningful object involved. Having cobbled one up, though, it often becomes useful to park things there. Having parked things there, it may be worth formalizing a class with invariants. Or it might not.
Posted Mar 28, 2020 19:12 UTC (Sat)
by nivedita76 (subscriber, #121790)
Posted Mar 29, 2020 3:18 UTC (Sun)
by ncm (guest, #165)
But of course C++ code is hard to get into the kernel. Linus's deliberate choice to use C++ keywords in kernel headers is one barrier to overcome.
Posted Mar 29, 2020 4:06 UTC (Sun)
by Cyberax (✭ supporter ✭, #52523)
Posted Mar 29, 2020 10:41 UTC (Sun)
by ballombe (subscriber, #9523)
Posted Mar 29, 2020 12:28 UTC (Sun)
by mpr22 (subscriber, #60784)
Posted Apr 3, 2020 18:07 UTC (Fri)
by adobriyan (subscriber, #30858)
let foo: u32 = 0;
Posted Apr 4, 2020 17:18 UTC (Sat)
by nix (subscriber, #2304)
Posted Apr 6, 2020 15:03 UTC (Mon)
by mathstuf (subscriber, #69389)
Posted Mar 29, 2020 15:11 UTC (Sun)
by nivedita76 (subscriber, #121790)
Posted Mar 29, 2020 15:07 UTC (Sun)
by nivedita76 (subscriber, #121790)
Avoiding retpolines with static calls
Yep.
This can also be done, but not with this approach. Basically, you rewrite indirect calls into "if" chains:
    if (this->class == ClassA) return ClassA_Method(this, ...);
    if (this->class == ClassB) return ClassB_Method(this, ...);
Avoiding retpolines with static calls
Actually the literature on implementing object-oriented languages discusses optimizations called "inline caching" and "polymorphic inline caching", which are targeted at exactly this problem.
Avoiding retpolines with static calls
Yes. E.g., do not update the caches from speculative loads. Instead, keep the load results in a private buffer and only update the caches if the loads become non-speculative, or when they are committed.
Avoiding retpolines with static calls
Pretty much all major cloud computing providers assign CPUs exclusively to customers, except maybe for the cheapest offerings (like T2/T3 instances on AWS).
Avoiding retpolines with static calls
All major cloud providers also have dedicated instances that won't be shared across customers. In case of AWS you can even run your own T2/T3 instances on top of the dedicated hardware nodes.
It's not clear what attack scenario you have in mind. Can you give an example?
Avoiding retpolines with static calls
The question is how much a proper hardware mitigation costs compared to the current mitigations. So what are the costs of not updating caches on speculative memory accesses?
Avoiding retpolines with static calls
Generally, in-order implementations are not affected, and existing out-of-order implementations are affected; not because it's not possible to make Spectre-immune OoO implementations, but because they did not think about how to do it when they designed these CPUs (Spectre was discovered later).
Avoiding retpolines with static calls
Yes, in-order CPUs perform branch prediction and use it to run the first predicted instructions through the early stages of the pipeline, but normally they do not do anything that depends on the result of one of the predicted instructions; they would need some kind of register-renaming functionality for that, and if, as a hardware designer, you go there, you can just as well do a full OoO design.
Avoiding retpolines with static calls
Yes, the branch predictor can be used as the side channel; in a Spectre-type attack the action part would then be a branch.
Avoiding retpolines with static calls
So you are thinking about an architecturally visible (i.e., non-speculative) secret, and using speculation only for the side channel. One difference from the classical Spectre attacks is that, like for non-speculative side-channel attacks, software developers only need to inspect the code that deals with the secret and avoid such ifs there; but yes, it's an additional rule they have to observe in such code.
Avoiding retpolines with static calls
    if (class != inline_cache->last_class) {
        inline_cache->last_class = class;
        inline_cache->last_method = hashtable_lookup(class, method_name);
    }
    inline_cache->last_method(...);

where possibly inline_cache->last_method is not an indirect call but a patched (self-modifying) direct call.
Yes, inline caching was originally introduced in a Smalltalk implementation, which had a hash-table-based cache as an intermediate-speed path and the slow path was not even a hash table, but a linear search through the chain of superclasses (I don't remember what was used for looking up methods within a class). I think that one reason for that was that the RAM sizes of the time discouraged speeding up the method lookup with complete hash tables or full class*methods matrices.
Avoiding retpolines with static calls
Note that this is a kind of dynamic code modification, which is generally condemned in modern practice. The n[ative]-method address can just as well be placed out-of-line and accessed indirectly; code modification is more efficient, and we are using it in a well-confined way.
Why was it more efficient? Apart from the relative speed of direct vs. indirect calls at the hardware of the time there were the following factors:
One other difference from the C code you posted is that the actual check for the correct class was in the called method:
The entry code of an n-code method checks the stored receiver class from the point of call against the actual receiver class. If they do not match, relinking must occur, just as if the call had not yet been linked.
Avoiding retpolines with static calls
Stroustrup's deliberate choice to break C compatibility by adding new keywords to C++ is one barrier to overcome.