[Gc] Alignment, Executable memory and MMX registers
hans.boehm at hp.com
Wed Oct 20 10:09:42 PDT 2004
> -----Original Message-----
> From: Ben Hutchings [mailto:ben.hutchings at businesswebsoftware.com]
> I'm not interested in anything that only works with single threads,
> though I realize Matt might be.
I'm actually not sure if/how broken it is with threads. I'll turn it on
the next time I test on Windows.
> What I was talking about doing was implementing virtual dirty bits
> using the GetWriteWatch API present in recent versions of Windows,
> which is explicitly intended for use by GCs. I didn't realise that
> there was already an implementation using VirtualProtect and
> exception-handling!
I wasn't aware of that API. That's probably a much better way to do
things. Unfortunately it wasn't available at the time the current
implementation was developed.
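
For readers who have not seen the API, here is a minimal sketch of how
write-watch dirty bits can be queried. It is not the collector's code.
The function name sketch_count_dirty_pages, the 16-page region, and the
two test writes are assumptions made for illustration. The region has to
be allocated with MEM_WRITE_WATCH for GetWriteWatch to track it, and on
non-Windows builds the sketch just reports -1.

```c
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>

/* Hypothetical helper: allocate a write-watched region, dirty two pages,
 * and return how many pages the OS reports as written (-1 on failure). */
static long sketch_count_dirty_pages(void)
{
    SYSTEM_INFO si;
    void *dirty[16];          /* receives the addresses of written pages */
    ULONG_PTR count = 16;     /* in: capacity of dirty[], out: entries used */
    DWORD granularity = 0;    /* out: tracking granularity in bytes */
    SIZE_T region_size;
    char *heap;
    UINT rc;

    GetSystemInfo(&si);
    region_size = (SIZE_T)si.dwPageSize * 16;

    /* Write tracking must be requested when the region is allocated. */
    heap = (char *)VirtualAlloc(NULL, region_size,
                                MEM_RESERVE | MEM_COMMIT | MEM_WRITE_WATCH,
                                PAGE_READWRITE);
    if (heap == NULL)
        return -1;

    /* Touch two distinct pages, standing in for mutator writes. */
    heap[0] = 1;
    heap[5 * (size_t)si.dwPageSize] = 1;

    /* Fetch the dirty pages and reset tracking in the same call. */
    rc = GetWriteWatch(WRITE_WATCH_FLAG_RESET, heap, region_size,
                       dirty, &count, &granularity);
    VirtualFree(heap, 0, MEM_RELEASE);
    return rc == 0 ? (long)count : -1;
}
#else
/* Assumption for portability of the sketch: no write watch off Windows. */
static long sketch_count_dirty_pages(void) { return -1; }
#endif
```

An incremental collector would presumably call something like this at
the start of each collection increment and rescan only the reported
pages, instead of taking a protection fault on the first write to each
page as the VirtualProtect approach does.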
> > Having said that, incremental GC currently rarely improves
> > and probably costs more on Windows than Linux.
> With a very large heap (a gigabyte or so) a full GC takes several
> seconds so it's well worth spreading out the cost somehow.
Understood. I completely agree.
I've been looking at tracing rates in the upcoming 7.0alpha1. The
good news is that I think I can make it go a little faster. The bad news
is that I'm reasonably convinced I can't make it go a lot faster.
Other semi-orthogonal things that are worth trying in this area:
- If you know the exact target architecture, try turning on prefetching.
If that helps a lot, someone might become motivated to provide for
different near-clones of the mark routine, so that we can decide dynamically
how to prefetch, based on processor type. It doesn't help for GCBench
on a Pentium 4, but I suspect that's because the benchmark is too regular
to foil the hardware prefetcher. That probably doesn't apply to real
code. (You may need to implement the PREFETCH macros for Windows.
Presumably there is a suitable compiler intrinsic.)
- Port the parallel collector to Windows. (7.0 will have it reconfigured
to work better on Pentium 4s. Currently it suffers since the atomic mark
bit update is too slow on hyperthreaded P4s.)
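
On the prefetching item above, a rough sketch of what a prefetch macro
that also builds on Windows might look like. SKETCH_PREFETCH and
sketch_sum_with_prefetch are hypothetical names, not the collector's
own macros or mark routine, and the prefetch distance of 4 is an
arbitrary assumption. The MSVC branch uses the _mm_prefetch intrinsic
and the GCC/Clang branch uses __builtin_prefetch.

```c
#include <stddef.h>

#if defined(_MSC_VER) && (defined(_M_IX86) || defined(_M_X64))
#  include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */
#  define SKETCH_PREFETCH(p) _mm_prefetch((const char *)(p), _MM_HINT_T0)
#elif defined(__GNUC__)
#  define SKETCH_PREFETCH(p) __builtin_prefetch((const void *)(p))
#else
#  define SKETCH_PREFETCH(p) ((void)0)   /* no known intrinsic: do nothing */
#endif

/* Hypothetical stand-in for a mark loop: walk an array of pointers and
 * read each target, prefetching the target a few entries ahead. */
static long sketch_sum_with_prefetch(long *const *ptrs, size_t n)
{
    enum { AHEAD = 4 };   /* assumed prefetch distance, would need tuning */
    long total = 0;
    size_t i;

    for (i = 0; i < n; i++) {
        if (i + AHEAD < n)
            SKETCH_PREFETCH(ptrs[i + AHEAD]);
        total += *ptrs[i];
    }
    return total;
}
```

Prefetching is only a hint, so the result is the same with or without
it. Whether it pays off depends on the processor and on how irregular
the pointer targets are, which is why it would need measuring per
target.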