[Gc] Possible race condition / dead lock in garbage collector thread handling?

Burkhard Linke blinke at cebitec.uni-bielefeld.de
Fri Feb 6 04:51:37 PST 2009


Hi,

I'm using Mono 2.2 under Solaris 10/x86 with the latest libgc release (7.1).

A heavily multi-threaded Mono application deadlocks under conditions I cannot 
reproduce reliably. According to the stack traces of its threads (taken with 
the Solaris pstack tool), most threads are blocked in sigsuspend, and one 
thread is starting a garbage collection and is blocked in sem_wait:

mono main thread:
-----------------  lwp# 1 / thread# 1  --------------------
 fec65587 sigsuspend (fef7e6f0)
 fef68aa3 GC_suspend_handler_inner (2f, 8046d20, 0, fec5252f, fec8e000, fefb2000) + 9f
 fef68b89 GC_suspend_handler (2f, 8046f20, 8046d20) + 2d
 fec64a4f __sighndlr (2f, 8046f20, 8046d20, fef68b5c) + f
 fec5ae72 call_user_handler (2f, 8046f20, 8046d20) + 22b
 fec5aff2 sigacthandler (2f, 8046f20, 8046d20) + bb
 --- called from signal handler with signal 47 (SIGRTMAX-1) ---
 fec649ab __lwp_park (834adc8, 834adb0, 8047050) + b
 fec5f172 cond_wait_queue (834adc8, 834adb0, 8047050, 0) + 3b
 fec5f512 cond_wait_common (834adc8, 834adb0, 8047050) + 1df
 fec5f746 _cond_timedwait (834adc8, 834adb0, 80470c4) + 51
 fec5f7b1 cond_timedwait (834adc8, 834adb0, 80470c4) + 24
 fec5f7ed pthread_cond_timedwait (834adc8, 834adb0, 80470c4, ee6b280) + 1e
 081ee39e timedwait_signal_poll_cond (1, fd33f97d, 8, 81f09d6) + 6a
 081f0b0e _wapi_handle_timedwait_signal_handle (109, 0, 1, 82d2a50) + 146
 081f0b78 _wapi_handle_wait_signal_handle (109, 1, 8047198, 8207d29, 0, 0) + 20
 08207d83 WaitForSingleObjectEx (109, ffffffff, 1, 81a35d1, 109, 0) + 3a7
 081a35ec ves_icall_System_Threading_WaitHandle_WaitOne_internal (86aaf60, 109, ffffffff, 0) + 2c
 fddf0740 ???????? (86aaf60, 109, ffffffff, 0)
 fd33f985 ???????? (86aaf60, 85ef0d8, 86c43c0, 80472c8)
 fd33f88f ???????? (86c43c0, 86aaf78, 893410c, 80472d0)
 fd33f66c ???????? (86b9e40, 83315ac, 8608678, 8047408)
 fd33b3de ???????? (85ef0d8, 831ae80, 8047408, fea8e690)
 fea8f0e3 ???????? (8318f60, fea8e1c0, 8047458, 80acff9)
 fea8e203 ???????? (0, 8047488, 0, fea8e5b8)
 0817082a mono_runtime_exec_main (83315ac, 8318f60, 0, 8047744, 2, 8318f60) + c6
 08171af5 mono_runtime_run_main (83315ac, 3, 8047738, 0) + 1a1
 080fed8b mono_main (4, 8047734, 82d2a50, 80476f0) + 1287
 08074b10 main     (8074990, 4, 8047734) + 28
 08074990 _start   (4, 804784c, 8047851, 804785e, 8047883, 0) + 80

garbage collecting thread:
-----------------  lwp# 2 / thread# 2  --------------------
 fec64c57 nanosleep (fe7d0fc4, 0)
 fedf2b67 nanosleep (fe7d0fc4, 0, 0, 81eb8de, 0, 0) + 1b
...skipping...
 fef68878 GC_stop_world (0, 8eda838, f691ba98, fef6145b, 8e1fa80, fef7b06c) + e8
 fef5a001 GC_stopped_mark (fef594ac, 0, 0, 0, fef59533, fef7b06c) + 2d
 fef5a2a9 GC_try_to_collect_inner (fef594ac, f9a8c800, 0, fef7b06c, fef7b06c, 0) + 9d
 fef5a4b4 GC_collect_or_expand (6, 0, 0, f691bb68, fec5e5f1, fef7b914) + d8
 fef5edb9 GC_alloc_large (5690, 0, 0, 15a4, 5690, 6) + a9
 fef5f145 GC_generic_malloc (568e, 0, 0, 0, 0, 0) + 11d
 fef5f3f6 GC_core_malloc_atomic (568e, 832ffa0, 0, 0, 0, 1) + ce
 fef670cd GC_malloc_atomic (568e, 833f3ec, 0, 0, 0, 2b40) + d9
 0817025d mono_string_new_size (8314e40, 2b40, f691bc58, 81b4d43) + 35
 081b4d4f ves_icall_System_String_InternalAllocateStr (2b40, 830e9c8, 2b40, f691bc8c) + 17
 fe72bfc8 ???????? (2b40, f691bd84, 1, ef9d8d0)
 fe02c4a7 ???????? (ecb9220, 15a9, f691bcf8, 81b4d43)
 fe031803 ???????? (ecb9220, a471800, 4, 1)

(The stack frames shown with ???????? as the entry point are probably Mono 
JIT-compiled methods.)

The other threads, which are neither blocked nor suspended, are used for 
asynchronous I/O or Mono-internal work and should not be part of the problem.

One thread, however, has a different stack trace:
-----------------  lwp# 46 / thread# 46  --------------------
 fec649ab lwp_park (0, 0, 0)
 fec5e224 slow_lock (fe79ac00, fef7b914, 0) + 3d
 fec5e31a mutex_lock_impl (fef7b914, 0) + ec
 fec5e426 mutex_lock (fef7b914, 0, fec8e000, fef7b06c, 0, fef7e500) + 1a
 fef6759f GC_generic_lock (fef7b914, fef7b06c, fb0cdf18, fef68263, fef7b914, fb0cdf28) + 9f
 fef6764e GC_lock  (fef7b914, fb0cdf28, 8204ef6, fef7b06c, 0, fb0cdf7c) + 4a
 fef68263 GC_unregister_my_thread (fe79ac00, fec8e000, fb0cdf98, fec64bbc, 0, fb0cdf54) + af
 fef68280 GC_thread_exit_proc (0) + 18
 fec64bbc _ex_clnup_handler (fb0cdfb0, 8367f80, 0, 0, fb0cdfb4, 0) + c
 fef62c3c GC_call_with_stack_base (fef6789c, 8367f80, fe79ac00, fec8e000, fec62f8b) + 1c
 fef677f0 GC_start_routine (8367f80) + 28
 fec64662 _thr_setup (fe79ac00) + 4e
 fec64950 _lwp_start (fe79ac00, 0, 0, fb0cdff8, fec64950, fe79ac00)

This thread is about to exit and is blocked trying to take the global garbage 
collector lock, which is probably held by the garbage collecting thread. But 
this thread does not receive the SUSPEND signal sent by the GC thread. Since 
it is still registered, the GC thread is waiting for it to post the 
semaphore, which will never happen. Deadlock.
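The "does not receive the signal" part can be illustrated in isolation. The 
following is only a minimal sketch using plain POSIX calls, not libgc code: 
the function name blocked_signal_stays_pending and the choice of signal are 
assumptions made for the illustration. It shows that a signal sent to a 
thread that has it blocked merely stays pending, so its handler (and 
anything the handler was supposed to acknowledge) never runs.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <signal.h>

static volatile sig_atomic_t handler_ran = 0;

/* Stand-in for a suspend handler; it only records that it was invoked. */
static void on_signal(int sig)
{
    (void)sig;
    handler_ran = 1;
}

/* Hypothetical helper: returns 1 if `sig`, sent to the calling thread while
 * blocked, stayed pending without running the handler; 0 if not; -1 on error. */
int blocked_signal_stays_pending(int sig)
{
    struct sigaction sa, old_sa;
    sigset_t block, old_mask, pending;
    int result = -1, consumed = 0;

    assert(sig > 0);

    sa.sa_handler = on_signal;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    if (sigaction(sig, &sa, &old_sa) != 0)
        return -1;

    /* Block the signal in this thread, as the exiting thread is suspected to do. */
    sigemptyset(&block);
    sigaddset(&block, sig);
    if (pthread_sigmask(SIG_BLOCK, &block, &old_mask) != 0) {
        sigaction(sig, &old_sa, NULL);
        return -1;
    }

    handler_ran = 0;
    if (pthread_kill(pthread_self(), sig) == 0) {
        if (sigpending(&pending) == 0)
            result = (!handler_ran && sigismember(&pending, sig) == 1);
        /* Consume the pending signal so it is not delivered after cleanup. */
        sigwait(&block, &consumed);
    }

    /* Restore the previous mask and signal disposition. */
    pthread_sigmask(SIG_SETMASK, &old_mask, NULL);
    sigaction(sig, &old_sa, NULL);
    return result;
}
```

Calling blocked_signal_stays_pending(SIGUSR1) from any thread exercises the 
same situation: the sender believes the signal has gone out, while the target 
never enters its handler.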

I suspect that the thread is using a signal mask that blocks the SUSPEND 
signal. I would propose to unblock the SUSPEND signal in the GC thread exit 
handler (GC_thread_exit_proc) prior to unregistering the thread, since the 
collector currently relies on the thread not blocking the suspend signal.
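A rough sketch of what such a change might look like, using only standard 
POSIX calls. This is not a patch against libgc: the helper names 
unblock_signal_for_this_thread, signal_is_blocked and example_exit_hook are 
hypothetical, and the signal number is passed in as a parameter because the 
collector's actual suspend signal (signal 47 in the traces above) depends on 
how libgc was built.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <signal.h>

/* Remove `sig` from the calling thread's signal mask.
 * Returns 0 on success, otherwise an error number. */
int unblock_signal_for_this_thread(int sig)
{
    sigset_t set;

    if (sigemptyset(&set) != 0 || sigaddset(&set, sig) != 0)
        return EINVAL;
    return pthread_sigmask(SIG_UNBLOCK, &set, NULL);
}

/* Returns 1 if `sig` is blocked in the calling thread, 0 if not, -1 on error. */
int signal_is_blocked(int sig)
{
    sigset_t current;

    /* With a NULL new set, pthread_sigmask only reports the current mask. */
    if (pthread_sigmask(SIG_SETMASK, NULL, &current) != 0)
        return -1;
    return sigismember(&current, sig);
}

/* Hypothetical stand-in for the start of a thread exit handler: make sure
 * the suspend signal can be delivered before waiting on the collector lock. */
void example_exit_hook(int suspend_sig)
{
    int rc = unblock_signal_for_this_thread(suspend_sig);

    assert(rc == 0);
    /* ...the real handler would now take the lock and unregister the thread... */
}
```

The idea is that the unblock happens before the thread starts waiting for the 
collector lock, so a suspend request arriving during that wait can still be 
acknowledged.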

Since this bug is timing-dependent and hard to reproduce, it will be 
difficult for me to create a regression test, create a patch and verify that 
the patch fixes the problem. I'll try to write the patch (it has been some 
years since my last C program...) and monitor the application in question to 
see whether the problem still occurs.

If there are better solutions to the problem, or if I'm completely mistaken 
in my analysis of the error, please send a reply to the mailing list.

With best regards,
Burkhard Linke


