[Gc] segfault with CACAO on OpenSolaris

Boehm, Hans hans.boehm at hp.com
Fri Aug 29 17:08:46 PDT 2008



> -----Original Message-----
> From: Christian Thalinger [mailto:twisti at complang.tuwien.ac.at]
> Sent: Wednesday, August 27, 2008 2:21 PM
> To: Boehm, Hans
> Cc: gc ml
> Subject: RE: [Gc] segfault with CACAO on OpenSolaris
>
> On Tue, 2008-08-19 at 23:03 +0200, Christian Thalinger wrote:
> > On Tue, 2008-08-19 at 20:19 +0000, Boehm, Hans wrote:
> > > The CVS version contains a recent bug fix to call
> > > GC_init_thread_local from GC_register_my_thread.  (See
> > > http://bdwgc.cvs.sourceforge.net/bdwgc/bdwgc/pthread_support.c?r1=1.13&r2=1.14
> > > around line 1036.)  Might this be the problem here?
> >
> > Right, that seems to fix it.  I'll test further.  Thanks so far.
>
> I have another problem.  I think an object is being
> collected although it's still in use.  I'll try to explain.
>
> When an object's lock is contended we allocate a lock record
> for this object.  To clean these lock records up when the
> corresponding object is collected, we register a special
> finalizer which calls the Java finalizer (if any) and
> frees the lock record.
>
> The crash I'm now seeing is like this:
>
> LOG: [0x0000000000000003] [finalizer lockrecord: o=5d7020 p=0 class=Harness$TimeoutWatcher SYNCHRONIZED]
> LOG: [0x0000000000000003] [lock_record_free  : lr=e658e0]
>
> The object at 0x5d7020 gets collected and the lock record at
> 0xe658e0 is freed.  But later another thread that is waiting
> on 0x5d7020 (or is trying to enter the lock, I'm not
> completely sure yet) wakes up and segfaults because the lock
> record is no longer around:
>
>   ---- called from signal handler with signal 11 (SIGSEGV) ------
>   [16] mutex_lock_impl(0xa5a5a5a5a5a5a5a5, 0x0, 0x5d7028, 0x200, 0x2, 0x14), at 0xfffffd7fff03d580
>   [17] mutex_lock(0x0, 0x0, 0x0, 0x0, 0x0, 0x0), at 0xfffffd7fff03d70c
>   [18] Mutex::lock(this = 0xa5a5a5a5a5a5a5a5), line 127 in "mutex-posix.hpp"
>   [19] Mutex_lock(mutex = 0xa5a5a5a5a5a5a5a5), line 36 in "removeme.cpp"
>   [20] lock_record_enter(t = 0x462f00, lr = 0xe658e0), line 744 in "lock.c"
>   [21] lock_monitor_enter(o = 0x5d7020), line 1028 in "lock.c"

Is this by any chance called from another finalizer?  Or might the object previously have been reachable only from finalizers and then "resurrected"?

Getting JVM implementations right in this area is quite tricky.  In order for this kind of finalization to work, you really need the collector's normal "topologically ordered" finalization semantics, so that a lock record is not deallocated while it is reachable from other finalization-enabled objects.  But those are unfortunately not the finalization semantics required by Java.  I think there is gcj-inspired code (a.k.a. ugly hack) in the collector to make this sort of thing work (see GC_register_finalizer_unreachable() in gc.h).

If that's not the problem, and the CACAO code is correct, your best bet is to apply the standard premature deallocation debugging techniques from the web site, and to see why the parent object is not getting marked in the prior collection, and hence gets finalized.
>
> So my question is, is it possible that there is a bug
> somewhere in the Solaris marking code?

That's always a possibility.  It's really only the root finding code that's OS and machine specific, though.  The fact that you need finalizers to trigger the bug makes me suspicious that this is something else.

Hans
>
> - twisti
>
>


