Improve allocation throughput by outlining cache miss code path
Previously, GCC 11 on x86-64 inlined the heavy weight logic for
potentially triggering GC into newobj_alloc(). This slowed down
the hotter code path where the ractor cache hits, causing a degradation
to allocation throughput.
Outline the logic into a separate function and have it never inlined.
This restores allocation throughput to the same level as
98eeadc ("Development of 3.4.0 started.").
To evaluate, instrument miniruby so it allocates a bunch of objects and
then exits:
With 3.3-equiv being 98eeadc, and pre being f2728c3393d
and post being this commit, I have:
$ hyperfine -L buildtag post,pre,3.3-equiv '/ruby/build-{buildtag}/miniruby'
Benchmark 1: /ruby/build-post/miniruby
Time (mean ± σ): 873.4 ms ± 2.8 ms [User: 377.6 ms, System: 490.2 ms]
Range (min … max): 868.3 ms … 877.8 ms 10 runs
Benchmark 2: /ruby/build-pre/miniruby
Time (mean ± σ): 960.1 ms ± 2.8 ms [User: 430.8 ms, System: 523.9 ms]
Range (min … max): 955.5 ms … 964.2 ms 10 runs
Benchmark 3: /ruby/build-3.3-equiv/miniruby
Time (mean ± σ): 886.9 ms ± 2.8 ms [User: 379.5 ms, System: 501.0 ms]
Range (min … max): 883.0 ms … 890.8 ms 10 runs
Summary
'/ruby/build-post/miniruby' ran
1.02 ± 0.00 times faster than '/ruby/build-3.3-equiv/miniruby'
1.10 ± 0.00 times faster than '/ruby/build-pre/miniruby'
These results are from a Skylake server with GCC 11.
Improve allocation throughput by outlining cache miss code path
Previously, GCC 11 on x86-64 inlined the heavy weight logic for
potentially triggering GC into newobj_alloc(). This slowed down
the hotter code path where the ractor cache hits, causing a degradation
to allocation throughput.
Outline the logic into a separate function and have it never inlined.
This restores allocation throughput to the same level as
98eeadc ("Development of 3.4.0 started.").
To evaluate, instrument miniruby so it allocates a bunch of objects and
then exits:
With
3.3-equiv
being 98eeadc, andpre
being f2728c3393dand
post
being this commit, I have:These results are from a Skylake server with GCC 11.