Reliable compiler deadlock in odin build and odin check #4615

Open
gfaster opened this issue Dec 22, 2024 · 7 comments

@gfaster
Contributor

gfaster commented Dec 22, 2024

Context

Please provide any relevant information about your setup. This is important in case the issue is not reproducible except for under certain conditions.

	Odin:    dev-2024-12
	OS:      NixOS 25.05 (Warbler), Linux 6.11.11
	CPU:     Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz
	RAM:     32025 MiB
	Backend: LLVM 18.1.8

I also reproduced the error on debug builds at commit 597fba7.

The following code (somewhat) reliably deadlocks the compiler:

package main

Struct1 :: struct { }
Struct2 :: struct { }

StructArr :: struct($C: typeid) {
    field: [dynamic]struct { c: C }
}

arr_for :: proc($T: typeid) -> ^StructArr(T) { }

iterate :: proc($C: typeid) {
    arr := arr_for(C).field
    for c in arr { }
}

iterate_all :: proc() {
    iterate(Struct1)
    iterate(Struct2)
}

arr_init :: proc() {
    clear(&arr_for(Struct1).field)
    clear(&arr_for(Struct2).field)
}

main :: proc() { }

Failure Information (for bugs)

Running odin check . on the above resulted in 71 deadlocks out of 1000 runs. Duplicating the above code (adding Struct3 through Struct7, along with the matching clear and iterate calls) seemed to make failures more frequent, with 116 deadlocks out of 1000 trials.
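
Because the deadlock is intermittent, a handful of clean runs says little about whether it is gone. The sketch below is a rough aid for that, assuming each run deadlocks independently with a fixed probability p (an assumption, not something measured here). The function names are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Probability that n independent runs all finish without deadlocking,
// given a per-run deadlock rate p. Independence is an assumption.
double prob_no_deadlock(double p, int n) {
    assert(p >= 0.0 && p <= 1.0 && n >= 0);
    return std::pow(1.0 - p, n);
}

// Smallest n for which at least one deadlock shows up with probability
// >= confidence, i.e. the smallest n with (1 - p)^n <= 1 - confidence.
int runs_needed(double p, double confidence) {
    assert(p > 0.0 && p < 1.0);
    assert(confidence > 0.0 && confidence < 1.0);
    double n = std::log(1.0 - confidence) / std::log(1.0 - p);
    return static_cast<int>(std::ceil(n));
}
```

With the observed rate of 71/1000, runs_needed(0.071, 0.99) comes out at 63, so under these assumptions roughly 63 consecutive clean runs would be needed before a fix looks convincing at the 99% level.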

I don't know if it fails at the same spot every time, but I'm seeing thread 1 blocking trying to lock a mutex:

(gdb) thr 1
[Switching to thread 1 (Thread 0x7ffff7e6d180 (LWP 98711))]
#0  0x00007fffeef1513d in syscall () from /nix/store/wn7v2vhyyyi6clcyn0s9ixvl7d4d87ic-glibc-2.40-36/lib/libc.so.6
(gdb) bt
#0  0x00007fffeef1513d in syscall () from /nix/store/wn7v2vhyyyi6clcyn0s9ixvl7d4d87ic-glibc-2.40-36/lib/libc.so.6
#1  0x00005555555acd2d in futex_wait (addr=addr@entry=0x7fffe7b327f0, val=2) at src/threading.cpp:698
#2  0x00005555555acccd in mutex_lock_slow (m=m@entry=0x7fffe7b327f0, curr_state=<optimized out>) at src/threading.cpp:355
#3  0x00005555555d5043 in mutex_lock (m=0x7fffe7b327f0) at src/threading.cpp:364
#4  type_set_offsets (t=t@entry=0x7fffe7b32770) at src/types.cpp:3932
#5  0x00005555555d497e in type_align_of_internal (t=t@entry=0x7fffe7b32770, path=<optimized out>, path@entry=0x7fffffff05d0)
    at src/types.cpp:3826
#6  0x00005555555d5ea5 in type_align_of (t=0x7fffe7b32770) at src/types.cpp:3709
#7  0x00005555555946ab in check_parsed_files (c=0x7fffe9ac89b0) at src/checker.cpp:6535
#8  0x000055555557d356 in main (arg_count=<optimized out>, arg_ptr=<optimized out>) at src/main.cpp:3543

While every other thread is blocking here:

(gdb) thr 2
[Switching to thread 2 (Thread 0x7fffee4676c0 (LWP 98712))]
#0  0x00007fffeef1513d in syscall () from /nix/store/wn7v2vhyyyi6clcyn0s9ixvl7d4d87ic-glibc-2.40-36/lib/libc.so.6
(gdb) bt
#0  0x00007fffeef1513d in syscall () from /nix/store/wn7v2vhyyyi6clcyn0s9ixvl7d4d87ic-glibc-2.40-36/lib/libc.so.6
#1  0x00005555555acd2d in futex_wait (addr=addr@entry=0x555555815414 <global_thread_pool+36>, val=22915) at src/threading.cpp:698
#2  0x00005555555d0465 in thread_pool_thread_proc (thread=<optimized out>) at src/thread_pool.cpp:235
#3  internal_thread_proc (arg=<optimized out>) at src/threading.cpp:564
#4  0x00007fffeee97d02 in start_thread () from /nix/store/wn7v2vhyyyi6clcyn0s9ixvl7d4d87ic-glibc-2.40-36/lib/libc.so.6
#5  0x00007fffeef173ac in __clone3 () from /nix/store/wn7v2vhyyyi6clcyn0s9ixvl7d4d87ic-glibc-2.40-36/lib/libc.so.6

Steps to Reproduce

  1. Run odin check . or odin build . on the above program until it deadlocks (the program is expected to fail the check, but the deadlock also occurs on well-formed programs).

@gfaster
Contributor Author

gfaster commented Dec 23, 2024

After instrumenting BlockingMutex, it looks like a lock was copied while in either a locked or a waiting state, since the lock call that blocks is the very first time a mutex is seen at that address.
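
To illustrate why a copy taken in the locked state would produce exactly this symptom, here is a minimal sketch. ToyMutex is a hypothetical stand-in with a single state word; it is not the compiler's BlockingMutex, and it uses try_lock so the example reports the problem instead of hanging.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for a state-word mutex; NOT the compiler's BlockingMutex.
struct ToyMutex {
    std::atomic<int> state{0}; // 0 = unlocked, 1 = locked

    ToyMutex() = default;

    // Copies the raw state word, as a plain struct copy of the owner would.
    ToyMutex(ToyMutex const &other)
        : state(other.state.load(std::memory_order_acquire)) {}

    bool try_lock() {
        int expected = 0;
        return state.compare_exchange_strong(expected, 1,
                                             std::memory_order_acquire);
    }

    void unlock() { state.store(0, std::memory_order_release); }
};

// Returns true when a copy made while the original was locked stays locked
// even after the original has been released.
bool copy_of_locked_mutex_is_stuck() {
    ToyMutex original;
    bool locked = original.try_lock();
    assert(locked);

    ToyMutex copy = original; // the copy inherits state == 1
    original.unlock();        // releases the original only

    bool original_free = original.try_lock();
    bool copy_free = copy.try_lock(); // nobody will ever unlock the copy
    return original_free && !copy_free;
}
```

A blocking lock on such a copy would wait forever, and it would be the first lock operation ever seen at the copy's address. Deleting the copy constructor on the mutex type would turn this kind of copy into a compile error.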

@cg-jl

cg-jl commented Dec 23, 2024

Reproduced (with exactly 2 structs) on all three configs (debug, release-native and release builds).

The minimal thread count that reproduces it is -thread-count=2, which gives 3 threads in total. One blocks on the lock in type_set_offsets; the other two are waiting for a task.

The mutexes are copied correctly.

Instrumented BlockingMutex with a copy constructor in this way:

    BlockingMutex(i32 state) : state_(state) {
        printf("%s %p\n", names[state], this);
    }
    constexpr BlockingMutex() : state_{} {}
    BlockingMutex(BlockingMutex const &other)
        : state_(other.state().load(std::memory_order_acquire)) {
        printf("%s %p %p\n", names[state_], this, &other);
    }

Built with and without CXXFLAGS=-fno-elide-constructors, and both deadlock.
The printf may cause interference with timings but the deadlock still happens in the same place.
All of the prints say unlocked, which means every site that copies or initializes a mutex does so in the unlocked state.

Context of repro

odin report

	Odin:    dev-2024-12:ad99d20d2
	OS:      EndeavourOS, Linux 6.12.6-arch1-1
	CPU:     AMD Ryzen 7 5825U with Radeon Graphics         
	RAM:     14803 MiB
	Backend: LLVM 18.1.8

I'm using a chain of scripts to launch the binary multiple times. I initially ran multiple processes in parallel, but with multithreading inside each process that just added noise, so now I run the binary again and again, sequentially, until it deadlocks. The timeout sends the IOT signal (SIGABRT on Linux) to make the binary generate a core dump.

# run_sequential.bash <logs directory>
mkdir -p $1
i=0
while true; do
	(( i += 1 ))
	echo -n .
	if ! timeout -s IOT 5 bash run_one.bash $1; then
		echo $1 :: $i
		break
	fi
done

# run_one.bash <logs directory>
# NOTE: `rr record` does not work, it's too slow (apparently) to find this deadlock.

~/Odin/odin check . 2> $1/err.txt 1> $1/log.txt
exit 0

These points may be of use, or may just be noise:

  • Did not reproduce (on either build configuration) with >=18 structs.
  • Did not reproduce on release when pinning the execution to a single core (via taskset)
  • Tried using rr record when running the binary, does not seem to deadlock in ~20s of trying.

@github-actions github-actions bot added the stale label Apr 23, 2025
@gfaster
Contributor Author

gfaster commented Apr 28, 2025

The bot marked it as stale, but it's not fixed. It's just a cursed issue (still broken as of d463aba)

@Kelimion Kelimion removed the stale label Apr 28, 2025
@Feoramund
Contributor

Building the Odin compiler with -fsanitize=thread brings up between 7 and 14 data races when I check the code given in the original report. No need to run multiple times; this happens with every invocation of the compiler.

The smallest Odin program that can be built produces slightly fewer, but still around 10 data races.

package main

main::proc(){}

Worse yet, odin check examples/demo brings up around 300 ThreadSanitizer warnings.
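
For readers unfamiliar with what ThreadSanitizer flags, here is a minimal sketch of the pattern, unrelated to the compiler source. The function name and the counter are hypothetical. With a plain int counter and ++counter in the workers, this would be a data race and a -fsanitize=thread build would report it; routing the shared access through std::atomic removes the race.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Illustrative only. Several threads increment one shared counter.
// A plain `int counter` here would be a data race (undefined behaviour);
// std::atomic makes every access to the shared word well defined.
int count_with_atomics(int thread_count, int increments_per_thread) {
    assert(thread_count >= 0 && increments_per_thread >= 0);

    std::atomic<int> counter{0};
    std::vector<std::thread> workers;
    workers.reserve(static_cast<std::size_t>(thread_count));

    for (int t = 0; t < thread_count; ++t) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (std::thread &w : workers) {
        w.join(); // join also synchronizes with the final load below
    }
    return counter.load(std::memory_order_relaxed);
}
```

Relaxed ordering is enough in this sketch because only the counter's own value matters; code that publishes other data through a flag would need acquire/release instead.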

@Kelimion
Member

> Worse yet, odin check examples/demo brings up around 300 ThreadSanitizer warnings.

For me odin build examples/demo results in approximately 450 ThreadSanitizer warnings.
The output of odin build examples/demo 2>&1 | tee tsan.log is attached.

tsan.log

@Kelimion
Member

It's a bit variable. I've seen odin check examples/demo go as high as 550 TSan warnings, but there doesn't seem to be a meaningful difference between odin check and odin build for the number of warnings. That suggests that most of the races are in the frontend.

@gingerBill
Member

@Feoramund I do wonder how many of those data races are false positives. The reason is that the MPMCQueue, which is used by the parser, is fundamentally designed to rely on data races to work.

I am not doubting that we have other data races, though. We do need to fix quite a bit.

Also, that example does a lot more than just "nothing". It type checks the entire base:runtime package too.
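
For comparison, here is a minimal sketch of one well-known bounded MPMC design (per-cell sequence numbers, after Dmitry Vyukov's queue) in which every cross-thread handoff goes through std::atomic with acquire/release ordering. BoundedQueue is hypothetical and is not the compiler's MPMCQueue; it only shows what such a queue looks like when all shared accesses are visible to a race detector.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>

// Hypothetical bounded MPMC queue; NOT the Odin compiler's MPMCQueue.
template <typename T>
class BoundedQueue {
public:
    // capacity must be a power of two so that `pos & mask_` wraps correctly.
    explicit BoundedQueue(std::size_t capacity)
        : cells_(new Cell[capacity]), mask_(capacity - 1) {
        assert(capacity >= 2 && (capacity & (capacity - 1)) == 0);
        for (std::size_t i = 0; i < capacity; ++i) {
            cells_[i].seq.store(i, std::memory_order_relaxed);
        }
    }

    bool try_push(T const &value) {
        std::size_t pos = tail_.load(std::memory_order_relaxed);
        for (;;) {
            Cell &cell = cells_[pos & mask_];
            std::size_t seq = cell.seq.load(std::memory_order_acquire);
            auto diff = static_cast<std::ptrdiff_t>(seq) -
                        static_cast<std::ptrdiff_t>(pos);
            if (diff == 0) { // cell is free for this ticket
                if (tail_.compare_exchange_weak(pos, pos + 1,
                                                std::memory_order_relaxed)) {
                    cell.value = value;
                    // Publish the value to consumers.
                    cell.seq.store(pos + 1, std::memory_order_release);
                    return true;
                }
            } else if (diff < 0) {
                return false; // queue is full
            } else {
                pos = tail_.load(std::memory_order_relaxed);
            }
        }
    }

    bool try_pop(T &out) {
        std::size_t pos = head_.load(std::memory_order_relaxed);
        for (;;) {
            Cell &cell = cells_[pos & mask_];
            std::size_t seq = cell.seq.load(std::memory_order_acquire);
            auto diff = static_cast<std::ptrdiff_t>(seq) -
                        static_cast<std::ptrdiff_t>(pos + 1);
            if (diff == 0) { // cell holds a published value
                if (head_.compare_exchange_weak(pos, pos + 1,
                                                std::memory_order_relaxed)) {
                    out = cell.value;
                    // Hand the cell back to producers one lap later.
                    cell.seq.store(pos + mask_ + 1, std::memory_order_release);
                    return true;
                }
            } else if (diff < 0) {
                return false; // queue is empty
            } else {
                pos = head_.load(std::memory_order_relaxed);
            }
        }
    }

private:
    struct Cell {
        std::atomic<std::size_t> seq;
        T value;
    };
    std::unique_ptr<Cell[]> cells_;
    std::size_t mask_;
    std::atomic<std::size_t> head_{0};
    std::atomic<std::size_t> tail_{0};
};
```

In this design the plain read and write of cell.value are ordered by the acquire load and release store on seq, so they are not data races in the C++ sense. A queue that instead reads shared fields without atomics would be reported by TSan, whether or not the algorithm tolerates it in practice.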
