ruby-core

Issue #7693 has been reported by apoikos (Apollon Oikonomopoulos).

----------------------------------------
Bug #7693: Thread deadlock (gvl.wait_yield is not (re-)initialized)
https://bugs.ruby-lang.org/issues/7693

Author: apoikos (Apollon Oikonomopoulos)
Status: Open
Priority: Normal
Assignee: 
Category: core
Target version: 
ruby -v: ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux]


We are running ruby 1.9.3p194 on Debian and are experiencing a deadlock in the
following scenario:

We have a bunch of resque[1] workers equiped with the New Relic agent[2]. The
New Relic agent runs on a separate thread and simply collects process metrics
from /proc/self and submits them over the network.

The workers fork() a child for every job they process. Sporadically, one or two
of the forked children will hang, all of them in the following state:

(gdb) info threads
  3 Thread 32111  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
  2 Thread 32594  0x00007feac7c233e3 in select () at ../sysdeps/unix/syscall-template.S:82
* 1 Thread 32126  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:211

Thread 1 is the New Relic agent thread, which runs as expected (i.e. wakes up
every minute and does work). The weird thing is the state of thread 3 (main
thread) in this case:

(gdb) thread apply 3 bt

Thread 3 (Thread 32111):
#0  pthread_cond_wait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:162
#1  0x00007feac8af11a9 in native_cond_wait (cond=0x108cfd4, mutex=0x80) at thread_pthread.c:305
#2  0x00007feac8af47f9 in gvl_yield (limits_us=<value optimized out>) at thread_pthread.c:109
#3  rb_thread_schedule_limits (limits_us=<value optimized out>) at thread.c:1025
#4  0x00007feac8af48d0 in rb_threadptr_execute_interrupts_common (th=0x108d520) at thread.c:1327
#5  0x00007feac8ae63ed in vm_exec_core (th=0x108d520, initial=128) at insns.def:1184
#6  0x00007feac8ae7029 in vm_exec (th=<value optimized out>) at vm.c:1220
#7  0x00007feac8ae8f03 in invoke_block_from_c (th=<value optimized out>, proc=<value optimized out>, self=<value optimized out>, argc=<value optimized out>, argv=<value optimized out>, blockptr=<value optimized out>) at vm.c:624
#8  rb_vm_invoke_proc (th=<value optimized out>, proc=<value optimized out>, self=<value optimized out>, argc=<value optimized out>, argv=<value optimized out>, blockptr=<value optimized out>) at vm.c:670
#9  0x00007feac89fdd1a in proc_call (argc=1, argv=0x27e3d70, procval=<value optimized out>) at proc.c:562
#10 0x00007feac8ae9433 in vm_call0 (th=0x108d520, recv=<value optimized out>, id=<value optimized out>, argc=1, argv=0x27e3d70, me=0x1191f60) at vm_eval.c:79
#11 0x00007feac8aea2d3 in rb_eval_cmd (cmd=<value optimized out>, arg=<value optimized out>, level=<value optimized out>) at vm_eval.c:1196
#12 0x00007feac8af48e2 in rb_threadptr_execute_interrupts_common (th=0x108d520) at thread.c:1290
#13 0x00007feac8af4c1c in rb_threadptr_execute_interrupts (th=0x108d520, func=0x7feac8af5980 <ubf_select>, arg=0x108d520, old=0x7fffb5541f28) at thread.c:1335
#14 set_unblock_function (th=0x108d520, func=0x7feac8af5980 <ubf_select>, arg=0x108d520, old=0x7fffb5541f28) at thread.c:269
#15 0x00007feac8af7e84 in rb_thread_blocking_region (func=0x7feac8a5d070 <rb_waitpid_blocking>, data1=0x7fffb5541f80, ubf=0x80, data2=0x1) at thread.c:1129
#16 0x00007feac8a5d174 in rb_waitpid (pid=32593, st=0x7fffb5541fdc, flags=0) at process.c:655
#17 0x00007feac8a5d3c4 in rb_syswait (pid=32593) at process.c:3018
#18 0x00007feac8a1b365 in rb_io_close (io=<value optimized out>) at io.c:3838
#19 0x00007feac8a1f0b1 in rb_f_backquote (obj=<value optimized out>, str=41828360) at io.c:7519
...

Frame #2 reveals that it is stuck waiting in gvl_yield, in the following
native_cond_wait (thread_pthread.c:109):

    /* An another thread is processing GVL yield. */
    if (UNLIKELY(vm->gvl.wait_yield)) {
        while (vm->gvl.wait_yield)
            native_cond_wait(&vm->gvl.switch_wait_cond, &vm->gvl.lock);
        goto acquire;
    }

Now, by reading the code of gvl_yield(), this would normally only be possible if
we had another thread in native_cond_wait(), which is *not* the case. For
completeness, here are the backtraces of the other threads:

Thread 1 (New Relic agent)
#0  pthread_cond_timedwait@@GLIBC_2.3.2 () at ../nptl/sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:211
#1  0x00007feac8af2f8e in native_cond_timedwait (cond=0xca1a580, mutex=0xca1a5e8, ts=<value optimized out>)
    at thread_pthread.c:323
#2  0x00007feac8af45e0 in native_sleep (th=0xca1a4f0, timeout_tv=0x7feabdfe4900) at thread_pthread.c:908
#3  0x00007feac8af4a82 in sleep_timeval (th=0xca1a4f0, tv=...) at thread.c:895
#4  0x00007feac8a5a842 in rb_f_sleep (argc=1, argv=0x7feabde64108) at process.c:3472
#5  0x00007feac8adfc3b in vm_call_cfunc (th=0xca1a4f0, cfp=0x7feabdf63ca0, num=1, blockptr=<value optimized out>, 
    flag=<value optimized out>, id=140646186633392, me=0x1185560, recv=72328920) at vm_insnhelper.c:404
#6  vm_call_method (th=0xca1a4f0, cfp=0x7feabdf63ca0, num=1, blockptr=<value optimized out>, 
    flag=<value optimized out>, id=140646186633392, me=0x1185560, recv=72328920) at vm_insnhelper.c:534
#7  0x00007feac8ae2462 in vm_exec_core (th=0xca1a4f0, initial=<value optimized out>) at insns.def:1015
#8  0x00007feac8ae7029 in vm_exec (th=<value optimized out>) at vm.c:1220
#9  0x00007feac8ae8f03 in invoke_block_from_c (th=<value optimized out>, proc=<value optimized out>, 
    self=<value optimized out>, argc=<value optimized out>, argv=<value optimized out>, 
    blockptr=<value optimized out>) at vm.c:624
#10 rb_vm_invoke_proc (th=<value optimized out>, proc=<value optimized out>, self=<value optimized out>, 
    argc=<value optimized out>, argv=<value optimized out>, blockptr=<value optimized out>) at vm.c:670
#11 0x00007feac8af6898 in thread_start_func_2 (th=<value optimized out>, stack_start=0x7feabdfe6000) at thread.c:453
#12 0x00007feac8af68f0 in thread_start_func_1 (th_ptr=0xca1a4f0) at thread_pthread.c:651
#13 0x00007feac87878ca in start_thread (arg=<value optimized out>) at pthread_create.c:300
#14 0x00007feac7c29b6d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#15 0x0000000000000000 in ?? ()

Thread 2 (Timer):
#0  0x00007feac7c233e3 in select () at ../sysdeps/unix/syscall-template.S:82
#1  0x00007feac8af3c92 in thread_timer (p=<value optimized out>) at thread_pthread.c:1156
#2  0x00007feac87878ca in start_thread (arg=<value optimized out>) at pthread_create.c:300
#3  0x00007feac7c29b6d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:112
#4  0x0000000000000000 in ?? ()

We also see the following GVL state:

(gdb) frame
#3  rb_thread_schedule_limits (limits_us=<value optimized out>) at thread.c:1025
(gdb) p th->vm->gvl.wait_yield
$1 = 1
(gdb) p th->vm->gvl.need_yield
$2 = 0
(gdb) p th->vm->gvl.waiting
$3 = 0

Searching for gvl.wait_yield, we find out that it is *never* initialized in
gvl_init(), along with the other GVL flags. Since it is only fork()ed children
from multi-threaded parents that sporadically deadlock, and since gvl_init() is
also called at fork time, we suspect that the child ends up with an
inconsistent GVL state when a fork() happens while a thread is waiting in
gvl_yield() (I'm not sure if this is possible though). The child then inherits
a gvl.wait_yield = 1 which is not reset and leads to a deadlock the first time
a gvl_yield() is processed in the new process.

Setting wait_yield to 0 in gvl_init() (see attached patch) seems to fix the issue
for us, I'm not sure it's the proper fix though.

Regards,
Apollon

P.S.: The same issue should affect trunk as well, judging from the code.

[1] https://github.com/defunkt/resque
[2] https://github.com/newrelic/rpm



-- 
http://bugs.ruby-lang.org/

Thread

Prev Next

In This Thread

Prev Next