[Gc] allocator and false sharing

Alexander Herz alexander.herz at mytum.de
Mon Aug 20 08:23:24 PDT 2012


Ok, never mind.

The thread library I'm using does some lazy initialization, which 
apparently induced the effect.
If I force the initialization before taking the measurements, everything is fine.

Sorry,
Alex
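
A minimal sketch of forcing one-time setup before a measurement. It assumes std::thread and std::chrono as stand-ins, since the thread library is not named in this message. The helpers time_two_threads and time_after_warmup are hypothetical names, not part of any library mentioned in this thread.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical helper: runs `work` on two threads in parallel and
// returns the elapsed wall-clock time in seconds.
double time_two_threads(const std::function<void()>& work) {
    auto start = std::chrono::steady_clock::now();
    std::thread a(work);
    std::thread b(work);
    a.join();
    b.join();
    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count();
}

// Hypothetical helper: performs one untimed run first, so that any lazy
// one-time setup triggered by starting threads happens before the
// measured run rather than inside it.
double time_after_warmup(const std::function<void()>& work) {
    time_two_threads(work);         // warm-up run, result discarded
    return time_two_threads(work);  // measured run
}
```

Calling time_after_warmup with the benchmark body returns only the second run's time, so whatever the first run had to initialize is not counted.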


On 20.08.2012 14:11, Alexander Herz wrote:
> I just verified:
>
> GC_malloc inside thread_local_alloc.c is used and the results are 
> still as bad.
>
> Alex
>
> On 20.08.2012 13:43, Bruce Hoult wrote:
>> I'd hope and expect that using the thread-local free list facility
>> would prevent that.
>>
>> Did you build with -DTHREAD_LOCAL_ALLOC and follow the other 
>> instructions at:
>>
>>   http://www.hpl.hp.com/personal/Hans_Boehm/gc/scale.html
>>
>> On Mon, Aug 20, 2012 at 11:34 PM, Alexander Herz
>> <alexander.herz at mytum.de> wrote:
>>> Hi,
>>>
>>> While benchmarking all kinds of parallel code I came across another serious
>>> scalability problem with the Boehm allocator:
>>>
>>> executing this in 2 parallel threads:
>>>
>>> struct counter : virtual gc
>>> {
>>>      inline counter()
>>>      {
>>>          i=0;
>>>      }
>>>
>>>      tbb::atomic<int> i;
>>>
>>>      inline void up()
>>>      {
>>>          i.fetch_and_add(1); //this boils down to a locked add
>>>      }
>>> };
>>>
>>> int loop(long v, counter* c)
>>> {
>>>      LOOP_LABEL:
>>>      if (v <= 0)
>>>      {
>>>          return 0;
>>>      }
>>>      else
>>>      {
>>>          c->up();
>>>      }
>>>      v=v - 1;
>>>      goto LOOP_LABEL;
>>> }
>>>
>>> //following thread is executed 2 times (in parallel)
>>> thread
>>> {
>>>      counter* c=new (GC_MALLOC(sizeof(counter))) counter();
>>>      loop(5000000,c);
>>> }
>>>
>>> then I execute the same code sequentially and look at the speedup.
>>>
>>> I get a speedup of 0.6 (parallel version is slower than sequential).
>>> Looking at some profiling data, I get a lot of false sharing (different data
>>> in the same cache line accessed by different threads).
>>>
>>> If I replace GC_MALLOC(sizeof(counter)) by GC_MALLOC(64) (cache line size of
>>> my CPU) then I get the expected speedup of almost 2.0.
>>>
>>> So apparently the Boehm allocator does not take care to avoid false sharing
>>> (unlike tbb::scalable_malloc).
>>>
>>> Again it would be nice to be able to substitute a different allocator.
>>>
>>> Regards,
>>> Alex
>>>
>>>
>>>
>>> _______________________________________________
>>> Gc mailing list
>>> Gc at linux.hpl.hp.com
>>> http://www.hpl.hp.com/hosted/linux/mail-archives/gc/
>>>

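For reference, the padding idea tried in the original message (one counter per 64-byte cache line) can be sketched in standard C++. This is only an illustration under stated assumptions: it uses std::atomic and std::thread instead of tbb::atomic and the thread library from the thread, it does not involve the collector at all, and the 64-byte line size is taken from the message and may differ on other hardware. PaddedCounter and run_two_counters are hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>

// Assumed cache-line size; 64 bytes matches the CPU described in the
// original message, other hardware may use a different value.
constexpr std::size_t kCacheLine = 64;

// Hypothetical padded counter: alignas rounds sizeof up to a multiple of
// the alignment, so two adjacent objects never share a cache line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<int> i{0};
    void up() { i.fetch_add(1); }  // atomic read-modify-write increment
};

static_assert(sizeof(PaddedCounter) == kCacheLine,
              "each counter occupies exactly one cache line");

// Hypothetical driver: increments two separate counters from two threads
// and returns the sum of both counters after the threads have finished.
long run_two_counters(long iterations) {
    PaddedCounter counters[2];  // adjacent in memory, but on separate lines
    auto work = [iterations](PaddedCounter* c) {
        for (long v = iterations; v > 0; --v) {
            c->up();
        }
    };
    std::thread t0(work, &counters[0]);
    std::thread t1(work, &counters[1]);
    t0.join();
    t1.join();
    return static_cast<long>(counters[0].i.load()) + counters[1].i.load();
}
```

Removing the alignas specifier makes the two counters 4 bytes each and adjacent, which is the layout where false sharing between the two threads becomes possible.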

