Revision 158177e3

Added by alanwu (Alan Wu) 11 months ago

Improve allocation throughput by outlining cache miss code path

Previously, GCC 11 on x86-64 inlined the heavyweight logic for
potentially triggering GC into newobj_alloc(). This slowed down
the hotter code path where the ractor cache hits, degrading
allocation throughput.

Outline the logic into a separate function and mark it as never inlined.

This restores allocation throughput to the same level as
98eeadc ("Development of 3.4.0 started.").
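
The outlining pattern described above can be sketched in isolation. This is
a minimal illustration, not the code in gc.c: the names alloc_cache,
cache_alloc(), cache_alloc_slowpath(), and REFILL_BATCH are hypothetical,
and the slow path simply refills from malloc() instead of doing anything
GC-related. It assumes GCC or Clang for the noinline attribute and
__builtin_expect.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical sketch; none of these names come from Ruby's gc.c. */
#if defined(__GNUC__) || defined(__clang__)
# define NOINLINE __attribute__((noinline))
# define UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
# define NOINLINE
# define UNLIKELY(x) (x)
#endif

struct slot { struct slot *next; };

struct alloc_cache {
    struct slot *freelist; /* slots ready to hand out */
    size_t refills;        /* number of times the slow path ran */
};

enum { REFILL_BATCH = 64 };

/* Cold path: kept out of line so that its code size, register pressure,
 * and stack usage do not leak into cache_alloc(). The batch is never
 * freed; that is acceptable only because this is a sketch. */
NOINLINE static struct slot *
cache_alloc_slowpath(struct alloc_cache *cache)
{
    struct slot *batch = malloc(sizeof(struct slot) * REFILL_BATCH);
    if (batch == NULL) return NULL;

    /* Thread slots 1..REFILL_BATCH-1 onto the freelist. */
    for (size_t i = 1; i + 1 < REFILL_BATCH; i++) {
        batch[i].next = &batch[i + 1];
    }
    batch[REFILL_BATCH - 1].next = NULL;

    cache->freelist = &batch[1];
    cache->refills++;
    return &batch[0]; /* hand out slot 0 directly */
}

/* Hot path: a load, a predictable branch, and a store on a cache hit. */
static inline struct slot *
cache_alloc(struct alloc_cache *cache)
{
    struct slot *s = cache->freelist;
    if (UNLIKELY(s == NULL)) {
        return cache_alloc_slowpath(cache);
    }
    cache->freelist = s->next;
    return s;
}
```

With the slow path out of line, the compiler is free to inline cache_alloc()
into callers without dragging the refill logic along. Whether this helps in
a given build depends on the compiler and target, so it should be measured
rather than assumed.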

To evaluate, instrument miniruby so it allocates a bunch of objects and
then exits:

diff --git a/eval.c b/eval.c
--- a/eval.c
+++ b/eval.c
@@ -92,6 +92,15 @@ ruby_setup(void)
     }
     EC_POP_TAG();

+rb_gc_disable();
+rb_execution_context_t *ec = GET_EC();
+long const n = 20000000;
+for (long i = 0; i < n; ++i) {
+    rb_wb_protected_newobj_of(ec, 0, T_OBJECT, 40);
+}
+printf("alloc %ld\n", n);
+exit(0);
+
     return state;
 }

With 3.3-equiv being 98eeadc, pre being f2728c3393d,
and post being this commit, I get:

$ hyperfine -L buildtag post,pre,3.3-equiv '/ruby/build-{buildtag}/miniruby'
Benchmark 1: /ruby/build-post/miniruby
  Time (mean ± σ):     873.4 ms ±   2.8 ms    [User: 377.6 ms, System: 490.2 ms]
  Range (min … max):   868.3 ms … 877.8 ms    10 runs

Benchmark 2: /ruby/build-pre/miniruby
  Time (mean ± σ):     960.1 ms ±   2.8 ms    [User: 430.8 ms, System: 523.9 ms]
  Range (min … max):   955.5 ms … 964.2 ms    10 runs

Benchmark 3: /ruby/build-3.3-equiv/miniruby
  Time (mean ± σ):     886.9 ms ±   2.8 ms    [User: 379.5 ms, System: 501.0 ms]
  Range (min … max):   883.0 ms … 890.8 ms    10 runs

Summary
  '/ruby/build-post/miniruby' ran
    1.02 ± 0.00 times faster than '/ruby/build-3.3-equiv/miniruby'
    1.10 ± 0.00 times faster than '/ruby/build-pre/miniruby'

These results are from a Skylake server with GCC 11.