[Gc] Re: GC 6.4 vs Irix w/ threads

Boehm, Hans hans.boehm at hp.com
Fri Apr 22 11:47:23 PDT 2005


Dan -

This is presumably on a cluster with a fast interconnect that's
being used for MPI?

I believe that on Irix, the collector finds potential roots by
looking for mappings, not by traversing lists of dynamic libraries.
I would guess it's trying to scan memory that's somehow mapped
uncached to some hardware device, and the machine doesn't like
it.
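
For reference, "looking for mappings" on an old-style /proc system like Irix amounts
to roughly the following (an illustrative sketch only, not the collector's actual code;
it assumes the SVR4-style <sys/procfs.h> interface with prmap_t and the
PIOCNMAP/PIOCMAP ioctls):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/procfs.h>   /* prmap_t, PIOCNMAP, PIOCMAP (old-style /proc) */
#include <unistd.h>

/* Print this process's readable+writable mappings - roughly the set a
   conservative collector would consider as potential root segments.   */
int main(void)
{
    char name[40];
    int fd, n, i;
    prmap_t *maps;

    sprintf(name, "/proc/%d", (int)getpid());
    fd = open(name, O_RDONLY);
    if (fd < 0) return 1;

    if (ioctl(fd, PIOCNMAP, &n) < 0) return 1;       /* number of mappings  */
    maps = malloc((n + 1) * sizeof(prmap_t));
    if (ioctl(fd, PIOCMAP, maps) < 0) return 1;      /* the mappings proper */

    for (i = 0; i < n; i++) {
        if ((maps[i].pr_mflags & (MA_READ | MA_WRITE)) == (MA_READ | MA_WRITE))
            printf("%p .. %p\n", (void *)maps[i].pr_vaddr,
                   (void *)((char *)maps[i].pr_vaddr + maps[i].pr_size));
    }
    close(fd);
    return 0;
}

Anything that shows up readable and writable in such a list - including a segment a
library has mapped uncached for a device - would get scanned by the marker unless it
is explicitly excluded.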

If you can identify the offending memory region and call
GC_exclude_static_roots on it, things should be OK.  Depending
on your application, it's sometimes also possible to turn off
scanning for everything but the main program, and manually
register any other data areas that actually need to be traced.
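
For example, the two approaches might look roughly like this (a sketch only - the
region bounds and the traced_data symbols below are hypothetical placeholders for the
offending region and for whatever areas really do hold pointers into the GC heap):

#include "gc.h"

/* Hypothetical placeholders: bounds of the device-mapped (uncached) region,
   and of a data area that genuinely holds pointers into the GC heap.       */
extern char device_region_start[], device_region_end[];
extern char traced_data_start[], traced_data_end[];

void setup_gc_roots(void)
{
    /* Option 1: keep normal root scanning, but carve the offending
       region out of the root set before the first collection.       */
    GC_exclude_static_roots((void *)device_region_start,
                            (void *)device_region_end);

    /* Option 2: ask the collector not to register dynamic-library data
       segments at all (set GC_no_dls before it initializes), then
       register by hand only the areas that actually need to be traced. */
    GC_no_dls = 1;
    GC_add_roots(traced_data_start, traced_data_end);
}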

Hans

> -----Original Message-----
> From: Dan Bonachea [mailto:bonachea at cs.berkeley.edu] 
> Sent: Sunday, April 17, 2005 8:30 AM
> To: Boehm, Hans
> Cc: gc at napali.hpl.hp.com; Boehm, Hans
> Subject: Re: [Gc] Re: GC 6.4 vs Irix w/ threads
> 
> 
> At 08:04 AM 4/12/2005, Hans Boehm wrote:
> >I think this patch helps for the Irix 64-bit case.  This just seems to 
> >be a case of generating a SIGBUS where a SIGSEGV was expected. I tested 
> >only very superficially.
> >
> >I have no idea whether this helps the MPI problem.  Unfortunately, I 
> >also don't have a way to test.
> 
> Thanks Hans - with the addition of this patch, both 32 and 64-bit IRIX 
> gctest appear to be working properly.
> 
> However, I'm still having trouble in programs mixing the GC with MPI on 
> IRIX.  When collection is enabled (GC_dont_gc == 0), all MPI programs 
> using the GC crash with:
> 
> MPI: MPI_COMM_WORLD rank 0 has terminated without calling MPI_Finalize()
> MPI: aborting job
> MPI: Received signal 10
> 
> and writes entries to /var/adm/SYSLOG like:
> 
> Apr 17 06:47:39 4A:lou unix: |$(0xb6b)WARNING: /hw/module/001c21/node/cpubus/0/b: 
>   Uncached Partial Read Error on MSPEC access, physaddr 0x844dcc3f0, 
>   process [arrayCopyTest] pid 1324873
> Apr 17 06:47:39 5A:lou unix: |$(0xb5c)NOTICE: /hw/module/001c21/node/cpubus/0/b: 
>   User Data Bus error in Mspec space at physical address 0x844dcc3f0 
>   /hw/module/001c21/node/memory/dimm_bank/1 (EPC 0x1025ecc0)
> 
> Here's a crash stack:
> 
>  >  0 GC_mark_from(mark_stack_top = 0x10808040, mark_stack = 0x10808000, mark_stack_limit = 0x10810000) 
>        ["/home/ece/bonachea/Ti/src/runtime/gc/mark.c":769, 0x10389674]
>     1 GC_mark_some(cold_gc_frame = 0x7fff2b38 = "") 
>        ["/home/ece/bonachea/Ti/src/runtime/gc/mark.c":361, 0x103889d0]
>     2 GC_stopped_mark(stop_func = 0x103843a0) 
>        ["/home/ece/bonachea/Ti/src/runtime/gc/alloc.c":519, 0x10385268]
>     3 GC_try_to_collect_inner(stop_func = 0x103843a0) 
>        ["/home/ece/bonachea/Ti/src/runtime/gc/alloc.c":366, 0x10384c90]
>     4 GC_init_inner() 
>        ["/home/ece/bonachea/Ti/src/runtime/gc/misc.c":782, 0x103832f4]
>     5 GC_generic_malloc_inner(lb = 1, k = 1) 
>        ["/home/ece/bonachea/Ti/src/runtime/gc/malloc.c":123, 0x1038c954]
>     6 GC_generic_malloc(lb = 1, k = 1) 
>        ["/home/ece/bonachea/Ti/src/runtime/gc/malloc.c":192, 0x1038cc14]
>     7 GC_malloc(lb = 1) 
>        ["/home/ece/bonachea/Ti/src/runtime/gc/malloc.c":297, 0x1038d2d8]
>     8 real_main(argc = 1, argv = 0x7fff2ec4, envp = 0x7fff2ecc) 
>        ["/home/ece/bonachea/Ti/src/runtime/backend/mpi-cluster-smp/main.c":887, 0x1036b3e4]
>     9 main(argc = 1, argv = 0x7fff2ec4, envp = 0x7fff2ecc) 
>        ["/home/ece/bonachea/Ti/src/runtime/backend/mpi-cluster-smp/main.c":864, 0x1036b2b4]
>     10 __start() 
>        ["/xlv55/kudzu-apr12/work/irix/lib/libc/libc_n32_M4/csu/crt1text.s":177, 0x1004e3d
> 
> Non-MPI programs and MPI programs with collection disabled (GC_dont_gc == 1) 
> all appear to work properly.
> 
> >I have no idea whether this helps the MPI problem.  Unfortunately, I 
> >also don't have a way to test.
> 
> Hans - here are instructions on how you can reproduce the problem on lou:
> 
> Compile with the following command:
> 
> $ ~bonachea/.tc-dist/dist-debug/bin/tcbuild -v --keep --backend mpi-cluster-uniprocess ~bonachea/.tc-dist/arrayCopyTest.ti
> 
> This will build a Titanium/MPI program for you called arrayCopyTest, 
> which links in the GC.
> Run the program with:
> 
> $ mpirun -np 2 ./arrayCopyTest
> 
> You can disable collection by setting the environment variable TI_NOGC 
> before running. If you want to try a new GC lib, you can link it in by 
> copy-and-pasting the final link line from tcbuild and replacing 
> -lgc-uniproc with the path to your own libgc.a.
> 
> Dan
> 
> 
> >Hans
> >
> >--- os_dep.c.orig       Tue Apr 12 04:22:53 2005
> >+++ os_dep.c    Tue Apr 12 06:32:59 2005
> >@@ -698,7 +698,7 @@
> >  #   if defined(SUNOS5SIGS) || defined(IRIX5) || defined(OSF1) \
> >      || defined(HURD) || defined(NETBSD)
> >         static struct sigaction old_segv_act;
> >-#      if defined(_sigargs) /* !Irix6.x */ || defined(HPUX) \
> >+#      if defined(IRIX5) || defined(HPUX) \
> >         || defined(HURD) || defined(NETBSD)
> >             static struct sigaction old_bus_act;
> >  #      endif
> >@@ -731,9 +731,11 @@
> >                 /* and setting a handler at the same time.      */
> >                 (void) sigaction(SIGSEGV, 0, &old_segv_act);
> >                 (void) sigaction(SIGSEGV, &act, 0);
> >+               (void) sigaction(SIGBUS, 0, &old_bus_act);
> >+               (void) sigaction(SIGBUS, &act, 0);
> >  #        else
> >                 (void) sigaction(SIGSEGV, &act, &old_segv_act);
> >-#              if defined(IRIX5) && defined(_sigargs) /* Irix 5.x, not 6.x */ \
> >+#              if defined(IRIX5) \
> >                    || defined(HPUX) || defined(HURD) || defined(NETBSD)
> >                     /* Under Irix 5.x or HP/UX, we may get SIGBUS.      */
> >                     /* Pthreads doesn't exist under Irix 5.x, so we     */
> >@@ -772,7 +774,7 @@
> >  #       if defined(SUNOS5SIGS) || defined(IRIX5) \
> >            || defined(OSF1) || defined(HURD) || defined(NETBSD)
> >           (void) sigaction(SIGSEGV, &old_segv_act, 0);
> >-#        if defined(IRIX5) && defined(_sigargs) /* Irix 5.x, not 6.x */ \
> >+#        if defined(IRIX5) \
> >              || defined(HPUX) || defined(HURD) || defined(NETBSD)
> >               (void) sigaction(SIGBUS, &old_bus_act, 0);
> >  #        endif
> >
> >On Tue, 12 Apr 2005, Dan Bonachea wrote:
> >
> > > At 04:39 AM 4/11/2005, you wrote:
> > > >GC6.4 apparently no longer worked on Irix with threads.  Apparently 
> > > >a bug in aix_irix_threads.c was no longer hidden by a very lenient 
> > > >pthread_attr_getdetachstate.
> > > >
> > > >The following patch should solve the problem.
> > >
> > > Hi Hans - Thanks for the patch.
> > >
> > > 32-bit gctest seems to be working now, although I'm still seeing 
> > > some bus errors using the IRIX GC in Titanium applications on 
> > > MPI-based backends, so I believe there are some other IRIX-GC 
> > > issues remaining - I suspect the problem is the shared libraries 
> > > which MPI loads (eg /usr/lib32/libmpi.so), but I don't have proof 
> > > of that yet. Do you have any small GC correctness tests that test 
> > > the use of MPI and/or shared libraries that allocate non-trivial 
> > > memory?
> > >
> > > In any case, 64-bit GC still seems to be completely broken on lou, 
> > > both with and without pthreads. If you try configuring 6.4, including 
> > > your patch below with:
> > >    setenv CC "/usr/bin/cc -64"
> > > then gctest should give you a bus error in GC_find_limit:
> > >
> > > #0  0x000000001001559c in GC_find_limit (p=0xfffffffab50 "", up=1) at os_dep.c:811
> > > #1  0x000000001001561c in GC_get_stack_base () at os_dep.c:1038
> > > #2  0x000000001000fdfc in GC_init_inner () at misc.c:676
> > > #3  0x000000001001d9c4 in GC_generic_malloc_inner (lb=7, k=1) at malloc.c:123
> > > #4  0x000000001001dc4c in GC_generic_malloc (lb=7, k=1) at malloc.c:192
> > > #5  0x000000001001dfa8 in GC_malloc (lb=7) at malloc.c:297
> > > #6  0x000000001000af44 in run_one_test () at test.c:1218
> > > #7  0x000000001000bdac in main () at test.c:1517
> > >
> > > This is the signal handler problem I originally reported that 
> > > apparently still remains. It's also possible this signal issue is the 
> > > same problem MPI is having - perhaps it registers some SIGBUS handlers 
> > > for its own uses (eg parallel job shutdown) that interfere with the 
> > > GC_find_limit scan. I think perhaps we need a more robust way to find 
> > > the stack base on IRIX...
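
For context, the probing that GC_find_limit relies on can be pictured with a
stripped-down sketch like this (illustration only - the real os_dep.c code handles
both probe directions, more platforms, and blocked-signal details):

#include <setjmp.h>
#include <signal.h>

#define PROBE_STEP 4096           /* assumed page-size granularity for the sketch */

static sigjmp_buf probe_env;

/* Temporary fault handler: a SIGSEGV or SIGBUS raised by a probe just
   aborts the probe loop instead of killing the process.               */
static void probe_fault(int sig)
{
    (void)sig;
    siglongjmp(probe_env, 1);
}

/* Walk upward from p one step at a time until a read faults; the address
   where the probe faults approximates the limit (for the stack, its base). */
static char *find_limit_up(char *p)
{
    struct sigaction act, old_segv, old_bus;
    char * volatile q = p;        /* volatile so the value survives the longjmp */

    act.sa_handler = probe_fault;
    sigemptyset(&act.sa_mask);
    act.sa_flags = 0;
    sigaction(SIGSEGV, &act, &old_segv);
    sigaction(SIGBUS,  &act, &old_bus);   /* Irix can deliver SIGBUS here */

    if (sigsetjmp(probe_env, 1) == 0) {
        for (;;) {
            (void)*(volatile char *)q;    /* probe; eventually faults */
            q += PROBE_STEP;
        }
    }

    /* Restore whatever handlers were installed before (MPI's included),
       so the rest of the run sees its own signal handling unchanged.   */
    sigaction(SIGSEGV, &old_segv, 0);
    sigaction(SIGBUS,  &old_bus,  0);
    return q;
}

If some other layer's SIGBUS handler happened to be the one installed when a probe
(or the marker) faults, the fault could turn into the kind of fatal "Received signal
10" abort shown above instead of being recovered from.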
> > >
> > > Dan
> > >
> > > PS - lou lacks gdb, but I have it installed here: 
> > > ~bonachea/bin/gnu/bin/gdb (note you'll need to link the static 
> > > libgc.a to use gdb)
> > >
> > > >Hans
> > > >
> > > >--- aix_irix_threads.c.orig     Sat Apr  9 20:37:22 2005
> > > >+++ aix_irix_threads.c  Sat Apr  9 20:38:17 2005
> > > >@@ -580,7 +580,11 @@
> > > >      si -> start_routine = start_routine;
> > > >      si -> arg = arg;
> > > >
> > > >-    pthread_attr_getdetachstate(attr, &detachstate);
> > > >+    if (NULL == attr) {
> > > >+       detachstate = PTHREAD_CREATE_JOINABLE;
> > > >+    } else {
> > > >+        pthread_attr_getdetachstate(attr, &detachstate);
> > > >+    }
> > > >      if (PTHREAD_CREATE_DETACHED == detachstate) my_flags |= DETACHED;
> > > >      si -> flags = my_flags;
> > > >      result = pthread_create(new_thread, attr, GC_start_routine, si);
> > >
> > >
> >_______________________________________________
> >Gc mailing list
> >Gc at linux.hpl.hp.com 
> >http://www.hpl.hp.com/hosted/linux/mail-archives/gc/
> 
> 


