[Gc] Extra info in mark-bytes?

Jan Wielemaker J.Wielemaker at vu.nl
Tue Jan 3 02:30:36 PST 2012


Hi Bruce,

Thanks. I'll mostly use your message to get my arguments straight though
:-) I know this technique. I use it in SWI-Prolog, using symbols as
proxies for `foreign' resources such as streams. It can solve this
problem too, but at a rather high cost in terms of development and (more
importantly) proxy (de-)allocation and locking, while my proposal to
change the class of the object at runtime makes life easy:

[Scenario 1] is basically a (multi) linked list. It is processed from
Prolog, which scans through it and, if it finds a match, returns the
match and creates a `choice-point', pointing to the next cell to resume
execution there if required. This happens from multiple threads.  If I
do as I propose, all I need to do is

    - If an object needs to be deleted, set a bit D (as you propose) and
      make my previous point to my next.  Whether the cell is
      involved in a search (involves references from C variables) or
      is referenced from a choice-point doesn't matter.  Both will check
      the D bit and proceed to the next if this bit it set.  This
      requires locking only for writers.

Note that it is not uncommon for a single Prolog query to returns many
thousands of objects on `backtracking', which implies that, when using
the proxy apporach, many thousands of proxies would need to be created
and will be destroyed on the next GC. All this work happens also in the
case where the DB is completely static, while the direct GC approach
as outlined would not involve any (de-)allocation or locking.

[Scenario 2] concerns Prolog clauses, which are represented as an array
of virtual machine instructions. These things get deleted sometimes,
where a similar story holds: they may be executing, either suspended in
a choice, actively in some thread or being processed by one of the
reflective predicates. Currently I keep reference counts, but this is
pretty complicated and has a significant impact on performance. If I can
use GC for this, I can simply unlink them and leave things to GC. But,
again, there may be many millions of them that are 99% of their time
uninteresting to GC. Worse, currently I keep the reference counts on
predicates (= set of clauses) to reduce synchronization overhead, and
actually collect if the reference count drops to zero and there is a
sufficient amount of garbage. But, some (dynamic) predicates involved in
heavy multi-threaded communication rarely reach refcount zero :-(

If I read http://www.hpl.hp.com/personal/Hans_Boehm/gc/issues.html, two
sections stand out to me: "Development effort" and "CPU-time
performance".  I think your approach looses on both when compared to
the adapted GC approach.  Agree?

My tests are progressing well. `Migration' of objects from `managed' to
`unmanaged' and back seems to work fine. Timing indicates that the
overhead of GC_MALLOC(), followed by `unmanaging it' is only slightly
(20%) more than using GC_MALLOC_ATOMIC_UNCOLLECTABLE(), while subsequent
GC times are practically the same (and very short). Impact on the
collectors performance seems neglectable (quick imprecise measurements).
The header has one additional field, representing the number of
unmanaged objects.

There are still some things to do, but the approach seems feasible. My
initial approach was to be able to control atomic and uncollectable
independently, but considering that marked objects are not scheduled for
scanning for pointers, the atomic flag has no impact. This is fine for
the above use-case scenarios though.

I have two questions:

   * Do people agree with my analysis?
   * I don't want my private copy of the collector.  Can these changes
     make it into the official version?  If so, using what interface?

One interface I see is this:

	void * GC_MALLOC_UNMANAGED(size_t size)
	void * GC_MALLOC_ATOMIC_UNMANAGED(size_t size)
	void   GC_MANAGE(void* ptr)

The other is

	void   GC_UNMANAGE(void *ptr)
	void   GC_MANAGE(void *ptr)

These APIs would apply to GC_MALLOC() and GC_MALLOC_ATOMIC(). It would
be an error to call this on the result of GC_MALLOC_UNCOLLECTABLE(). The
2nd is more general, but the first can be faster as it needs no
additional lock.

	Regards --- Jan

On 01/02/2012 11:18 PM, Bruce Hoult wrote:
> This seems to me to be entirely the wrong way to go about this.
>
> If the proportion of objects involved in the multi-threaded
> computation is small then I think you should make those computations
> refer to them via a GC-able proxy object with a finalizer.
>
> What you'll need:
>
> - your DB class X
>
> - a bit D in X saying whether it has been unlinked from the main
> structure and should be deleted
>
> - a garbage collected proxy object P for each X that is currently
> involved in the computation. This simply contains a pointer to its X.
>
> - a way to find the unique P (if any) for each X. This can be a
> pointer within X if X is big enough that the space overhead is low, or
> a (non GC) hash table if time overhead is more acceptable than space
> overhead.
>
> When an X is used in the computation its P should be located, or one
> created if it doesn't have one, and the P passed to the computation
> instead of the X itself.
>
> When an X is deleted from the DB check if it has a P. If not then just
> delete it. Otherwise unlink it and set its D.
>
> The finalizer for P should check whether the D in its X is set and if
> so then delete the X. It should in any case remove itself from the
> hash table (if that's how an X finds its P), or null the pointer to
> itself (if that's how an X finds its P).
>
>
> On Tue, Jan 3, 2012 at 2:53 AM, Jan Wielemaker<J.Wielemaker at vu.nl>  wrote:
>> Hi,
>>
>> PROBLEM
>> =======
>>
>> I'm trying to figure out whether I can use bdwgc for the following
>> use-case:
>>
>>   * A huge main-memory DB of mostly small objects.  Should scale to
>>     memory available in server HW (say 100Gb at this moment).
>>   * Typically, most objects are nicely linked from a shared view on
>>     the data.  I.e., there is nothing to do for GC.  I'll call this
>>     the `permanent' heap.
>>   * In addition, there are short-lived objects involved in (multi-
>>     threaded) computation.  Using GC greatly simplifies life here.
>>
>> So far, so good: just use GC for the latter category.  But ...
>>
>>   * These temporary objects (and C variables) tend to reference the
>>     `permanent ones'.  `Permanent' between quotes, because they may be
>>     deleted.  Deleting here means disconnecting from other `permanent'
>>     objects. They must be kept alive as long as they are referenced.
>>
>> This requires GC to be used on the `permanent' objects, but marking the
>> zillions of objects in the `permanent' heap is too slow :-(
>>
>> EXPERIMENT
>> ==========
>>
>> I did an experiment by creating my own kind with marking procedure and
>> have a bit on the object that was set before the object was unlinked
>> from the permanent heap. The custom marking procedure only marks
>> referenced pointers if this bit is set. That gets the marking time down
>> dramatically, but of course the first GC now wipes most permanent
>> objects :-(
>>
>> NEXT STEP
>> =========
>>
>> We need a way to protect objects. That is `uncollectable', but there is no
>> way to move objects from uncollectable to collectable. Digging into
>> the source, I discovered that the mark bit is actually a mark byte. That
>> offered possibilities. I am planning to use two of the bits here. One to
>> state that the object is uncollectable and the other that it is atomic
>> (which avoids the need for my custom marker for most of the use-cases).
>>
>> I added an experimental GC_set_flags(ptr, flags) and GC_clear_flags(ptr,
>> flags), such that I can do
>>
>>         obj = GC_malloc(size);
>>         <initialise and link to permanent heap>
>>         GC_set_flags(obj, GC_FLAG_UNCOLLECTABLE|GC_FLAG_ATOMIC);
>>
>> and, to get rid of the object:
>>
>>         GC_clear_flags(obj, GC_FLAG_UNCOLLECTABLE|GC_FLAG_ATOMIC);
>>         <unlink from permanent heap>
>>
>> (There is some room for optimizing here; e.g. clearing the marks is more
>>   expensive, etc., but this seems feasible, e.g., by keeping a count of  the
>> number of objects with GC_FLAG_UNCOLLECTABLE in a block; often this  will be
>> all objects in the block, removing the need to clear and scan for unmarked
>> objects during the sweep).
>>
>> The idea is that objects are not collected if either the MARK bit or the
>> UNCOLLECTABLE bit is set.  Marking will be skipped if GC_FLAG_ATOMIC is
>> set (unimplemented).
>>
>> The attached patch is my current state.
>>
>> QUESTIONS
>> =========
>>
>>   * The comments say that using a byte for marker avoids the need for
>>     interlocked bitwise or.  Is this true?  Isn't setting a byte effectively
>>     fetching a word, changing it and writing it back?  That would explain
>> some
>>     crashes I've seen in my tests ...
>>
>>   * As soon as I set the above bits, I see that GC_malloc() no longer uses
>>     the free list, but calls GC_generic_malloc_many() for most calls.  It
>>     seems that playing with these bits affect the free list!?  Does someone
>>     have an idea why?
>>
>>         Thanks --- Jan
>>
>> --
>> This message has been scanned for viruses and
>> dangerous content by MailScanner, and is
>> believed to be clean.
>>
>>
>> _______________________________________________
>> Gc mailing list
>> Gc at linux.hpl.hp.com
>> http://www.hpl.hp.com/hosted/linux/mail-archives/gc/



More information about the Gc mailing list