From: "kjtsanaktsidis (KJ Tsanaktsidis) via ruby-core" <ruby-core@...>
Date: 2024-01-10T23:39:08+00:00
Subject: [ruby-core:116164] [Ruby master Bug#20169] `GC.compact` can raises `EFAULT` on IO

Issue #20169 has been updated by kjtsanaktsidis (KJ Tsanaktsidis).


I did a bit of experimentation with the `userfaultfd(2)` system call here: https://gist.github.com/KJTsanaktsidis/40e2a8e23012bf16af823db9ff9a890e

The SIGBUS/SIGSEGV handling we currently do only traps userspace accesses to memory, but with userfaultfd it seems possible to trap kernel accesses to memory too: my example above works both when accessing `page` directly and when `write(2)`'ing it into a memfd.

Instead of read-protecting pages with `mprotect(2)` and then handling the trap, we can do the following:

* Register pages with userfaultfd when we allocate them
* When we would read-protect a page, instead remap it somewhere else with `mremap(2)` using `MREMAP_MAYMOVE | MREMAP_DONTUNMAP`, which leaves a faulting region behind at the old address.
* When someone tries to read that page, we'll get the fault in the userfaultfd thread.
* The userfaultfd thread can then remap the page back into its original position with `mremap(2)` and `MREMAP_FIXED`, and let the faulting access be retried.
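
As a rough illustration of the two `mremap(2)` calls in the steps above, here is a minimal sketch. The helper names `evacuate_pages` and `restore_pages` are made up for this example, and it deliberately leaves out the userfaultfd registration and the fault-handling thread, so without a registered handler the kernel simply zero-fills the old range on access. It assumes a private anonymous mapping and a kernel that supports `MREMAP_DONTUNMAP`.

```c
#define _GNU_SOURCE /* for mremap(2) and the MREMAP_* flags */
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MREMAP_DONTUNMAP
#define MREMAP_DONTUNMAP 4 /* value from the Linux UAPI headers */
#endif

/* Hypothetical helper: move the pages at `addr` to a kernel-chosen address,
 * leaving the old range mapped but empty so the next access to it faults.
 * Returns the new location, or NULL if the kernel refuses the request
 * (for example because MREMAP_DONTUNMAP is not supported). */
static void *evacuate_pages(void *addr, size_t len)
{
    assert(len > 0);
    void *moved = mremap(addr, len, len, MREMAP_MAYMOVE | MREMAP_DONTUNMAP);
    return moved == MAP_FAILED ? NULL : moved;
}

/* Hypothetical helper: move the pages back over their original address.
 * MREMAP_FIXED replaces whatever is currently mapped at `orig`.
 * Returns 0 on success, -1 on failure. */
static int restore_pages(void *moved, void *orig, size_t len)
{
    assert(len > 0);
    void *p = mremap(moved, len, len, MREMAP_MAYMOVE | MREMAP_FIXED, orig);
    return p == orig ? 0 : -1;
}
```

In a real implementation the `restore_pages` step would run from the userfaultfd handler thread once it is notified of a fault in the evacuated range; whether the range then needs to be registered with userfaultfd again is something this sketch does not address.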

This pattern is actually documented in the Linux manpage for `mremap(2)`: https://man7.org/linux/man-pages/man2/mremap.2.html

> Garbage collection: `MREMAP_DONTUNMAP` can be used in conjunction with `userfaultfd(2)` to implement garbage collection algorithms (e.g., in a Java virtual machine). Such an implementation can be cheaper (and simpler) than conventional garbage collection techniques that involve marking pages with protection `PROT_NONE` in conjunction with the use of a `SIGSEGV` handler to catch accesses to those pages.

So that's the good part. The bad parts are:

* This will, of course, only work on Linux.
* It will only work on Linux kernels >= 5.7.
* It requires that the calling process either have `CAP_SYS_PTRACE` or that `/proc/sys/vm/unprivileged_userfaultfd` be set to 1. It seems common distributions default this to 0 :(
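
To make those conditions concrete, a runtime check could look roughly like the sketch below. All function names are invented for this example, and it only covers the kernel version and the sysctl; it does not check for `CAP_SYS_PTRACE`, so a privileged process would be reported as unsupported.

```c
#include <stdio.h>
#include <sys/utsname.h>

/* Returns 1 if a release string such as "5.15.0-91-generic" is at least
 * major.minor, and 0 otherwise (including when it cannot be parsed). */
static int release_at_least(const char *release, int major, int minor)
{
    int maj = 0, min = 0;
    if (sscanf(release, "%d.%d", &maj, &min) != 2)
        return 0;
    return maj > major || (maj == major && min >= minor);
}

/* Returns 1 if the sysctl allows unprivileged userfaultfd use, 0 if it
 * does not or if the file cannot be read. */
static int unprivileged_userfaultfd_enabled(void)
{
    FILE *f = fopen("/proc/sys/vm/unprivileged_userfaultfd", "r");
    int value = 0;
    if (f == NULL)
        return 0;
    if (fscanf(f, "%d", &value) != 1)
        value = 0;
    fclose(f);
    return value == 1;
}

/* Hypothetical gate for a userfaultfd-based compaction path.
 * Assumption: the CAP_SYS_PTRACE alternative is ignored here. */
static int userfaultfd_compaction_possible(void)
{
    struct utsname u;
    if (uname(&u) != 0)
        return 0;
    return release_at_least(u.release, 5, 7)
        && unprivileged_userfaultfd_enabled();
}
```

Even when such a check passes, actually calling `userfaultfd(2)` can still fail, so a real implementation would presumably also need a fallback at that point.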

So... if we did want to go down the userfaultfd handling path, we would need to either:

* Only support GC compaction when the above conditions are met?
* Or, have both a userfaultfd implementation and the current signal-based implementation, and just accept that GC compaction can cause rare crashes on non-Linux platforms? (this sounds bad)


The only other, more portable, option I can think of is to define symbols for `read`, `write`, and other system calls that take userspace buffers, overriding and wrapping the libc versions. The wrappers would handle `EFAULT` return values and invoke the "cancel GC compaction" logic if the faulting address is in the Ruby heap. I'm open to other bright ideas anybody might have...

----------------------------------------
Bug #20169: `GC.compact` can raises `EFAULT` on IO
https://bugs.ruby-lang.org/issues/20169#change-106169

* Author: ko1 (Koichi Sasada)
* Status: Open
* Priority: Normal
* Backport: 3.0: UNKNOWN, 3.1: UNKNOWN, 3.2: UNKNOWN, 3.3: UNKNOWN
----------------------------------------
1. `GC.compact` introduces read barriers to detect read accesses to the pages.
2. I/O operations release the GVL while they execute, so another thread can call `GC.compact` in the meantime (or trigger auto-compaction, I guess, but I have not checked that yet).
3. Calling `write(ptr)` can return `EFAULT` while `GC.compact` is running, because `ptr` can point into read-barrier-protected pages (embedded strings).

Steps to reproduce:

Apply the following patch to make the failure more likely:

```patch
diff --git a/io.c b/io.c
index f6cd2c1a56..83d67ba2dc 100644
--- a/io.c
+++ b/io.c
@@ -1212,8 +1212,12 @@ internal_write_func(void *ptr)
         }
     }

+    int cnt = 0;
   retry:
-    do_write_retry(write(iis->fd, iis->buf, iis->capa));
+    for (; cnt < 1000; cnt++) {
+        do_write_retry(write(iis->fd, iis->buf, iis->capa));
+        if (result <= 0) break;
+    }

     if (result < 0 && !iis->nonblock) {
         int e = errno;
```

Run the following code:

```ruby
t1 = Thread.new{ 10_000.times.map{"#{_1}"}; GC.compact while true }
t2 = Thread.new{
  i=0
  $stdout.write "<#{i+=1}>" while true
}
t2.join
```

This produces:

```
$ make run
(snip)
4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4><4>#<Thread:0x00007fa61b4dd758 ../../src/trunk/test.rb:3 run> terminated with exception (report_on_exception is true):
../../src/trunk/test.rb:5:in `write': Bad address @ io_write - <STDOUT> (Errno::EFAULT)
        from ../../src/trunk/test.rb:5:in `block in <main>'
../../src/trunk/test.rb:5:in `write': Bad address @ io_write - <STDOUT> (Errno::EFAULT)
        from ../../src/trunk/test.rb:5:in `block in <main>'
make: *** [uncommon.mk:1383: run] Error 1
```

I think this is why we get `EFAULT` on CI. Running many busy processes alongside (`ruby -e 'loop{}'`, for example) makes the failure more likely, and CI environments happen to have such busy processes.

