
Summary
With the O(1) scheduler, affinity overrules all other aspects of
the scheduler. Not only that, but a task that has never run
on a CPU may still have affinity for that CPU! The net effect is that
applications calling
sched_yield() at a certain minimal frequency may end up getting
all their tasks running on a single CPU, even when there are other
CPUs that are idle. This is not a temporary condition, but one that
can last indefinitely, e.g., keeping one CPU busy with a load-level of
10 while other CPUs are sitting by idly.
Description
The O(1) scheduler has a very simple notion of affinity: any task
that has run on a particular CPU within the last N milliseconds
is considered to be CPU-affine and will not be considered for
migration to another CPU. The exact value of N depends on
the CPU architecture, cache size, memory bandwidth, and potentially
other factors. On all platforms, a value of N is chosen at
boot time. Typical values are in the 1-10ms range. (In the kernel
sources, N is stored in a variable called
cache_decay_ticks. The unit of this value is "jiffies", i.e.,
clock ticks.)
To understand where this notion of CPU affinity may cause a problem,
consider an application that has parallelized a loop to execute on
M threads. The application might do something along the lines
of the following:
- spawn M threads
- let each thread iterate over 1/Mth of the total number of loop iterations
- collect the (partial) results of every thread into a single global result
A real parallel application normally consists of multiple such
parallelized loops.
For such barrier-style parallel applications, it is important for all
threads to finish their portion of the loop at roughly the same rate,
so that they reach step 3 at the same time. Otherwise, a single
thread that is substantially slower than the others would cause all
but one CPU to go idle and the program would be prevented from
completing step 3 and entering the next parallel phase, thus causing
undesirable idle time.
Now, if all loop iterations took exactly the same amount of time to
execute, simultaneous arrival could be guaranteed trivially, but if
the amount of work differs a bit from one iteration to the next, there
is a real issue with threads completing their work at very different
times. To mitigate this effect, the application may want to call
sched_yield() periodically (e.g., after completing 10
iterations). The goal of doing this is effectively to ensure that
each thread completes the same number of iterations per second, even
if some threads require more CPU time than others. In other words, in
such applications, "fairness" is measured here in "number of completed
iterations", not "number of CPU cycles spent".
So what's the problem? There is no problem if each loop iteration
takes long enough that sched_yield() is called only
infrequently, but as soon as sched_yield() is called at least
once every N milliseconds, we have a problem: all of the
M threads may be considered affine ("cache hot") on one and the
same CPU and they will therefore not be considered for migration.
Thus, we may end up with a "parallel" application that ends up keeping
just one CPU busy!
NOTE! Of course, it would be tempting to dictate that
a program should never invoke sched_yield() at a high
frequency. Doing so clearly costs some performance due to additional
system-call and (potentially) context-switch overhead. However,
things are not quite so simple in practice. First, keep in mind that
a cache-affinity timeout of around 10ms is actually quite large for
modern CPUs (and a large timeout value is often a Good Thing, because
it keeps communicating threads together on the same CPU). Second, if
sched_yield() is to be an effective tool to encourage
application-level fairness, it needs to be called at least once per
time-slice. Otherwise, the normal scheduling policy will get "in the
way" and interfere with reaching the goal of completing loop
iterations at the same rate. So this puts an lower bound on the
frequency on which sched_yield() needs to be called. Third,
consider that it is not always easy to estimate just how fast a
loop-iteration will execute. For example, the same code might easily
complete one iteration in 5ms when all data is in the cache, but take
more than 1s when the data has to be read from memory. So, depending
on the input data of the application, execution time can vary widely.
Of course, the application could try to be "smarter" and measure loop
iteration times, but that's not free either and may cause other
problems. Even if you don't believe in
sched_yield() as a means to encourage application-level
fairness, the same problem could arise for a group of communicating
tasks. If the tasks complete a request/response roundtrip in less
than Nms and impose a load average of more than 1, the same
problem of keeping only one CPU busy would arise. If this still
doesn't convince, consider that there are real (and important)
parallel applications out there which work just fine with the original
Linux scheduler, but run substantially slower with the O(1) scheduler
precisely because of this problem. Maintaining compatibility for
existing applications generally is considered to be one of the
hall-marks of the Linux kernel.
Test-case to reproduce the problem
The problem can be demonstrated easily with the artificial test program below. It should be invoked like this:
$ /usr/bin/time ./test-busy
On a dual-CPU system, the above
test case might produce the following results:
- without O(1) scheduler (Linux v2.4.19):
$ /usr/bin/time ./test-yield
40.83user 37.51system 0:39.18elapsed 199%CPU
0inputs+0outputs (32major+12minor)pagefaults 0swaps
- with O(1) scheduler (Linux v2.5.69):
$ /usr/bin/time ./test-yield
38.46user 164.06system 3:22.44elapsed 100%CPU
0inputs+0outputs (32major+15minor)pagefaults 0swaps
The important part here is the difference in CPU-utilization: without
the O(1) scheduler, the test-program keeps both CPUs busy (199%
utilization). With the O(1) scheduler, only one CPU is ever busy
(even though the load average is 2 for more than 3 minutes!). Note
that the difference in elapsed time is not particularly relevant
because the test-program is a highly simplified version of a
real-world test case.
Status
Fixed: Recent 2.5/2.6
kernels break affinity if there are any idle CPUs. This fixes this
particular problem, but of course has the unfortunate effect of
destroying the nice "initial affinity" property which made the O(1)
scheduler work so well for communicating processes, such as LMbench's
lat_udp. Given that v2.6.0 is just about to be released,
this will probably be the final behavior for the 2.6 series of the
kernel.
- Analysis: The problem is due to a design-flaw in the O(1)
scheduler. We believe that the fundamental problem is that the O(1)
scheduler lets affinity override all other concerns. Over short
periods of time, that is acceptable, but over longer periods of time
the O(1) scheduler should notice that one CPU has a load average of 2
(or more) while other CPU(s) sit idle. When such a situation arises,
the scheduler should invoke the load balancer.
- Scope:
believed to affect all O(1)-based kernels. Problem has been confirmed to
exist on:
- 4-way 200MHz Pentium Pro, kernel v2.4.20-9smp, Red Hat 9
- 2-way 1000MHz Itanium 2, kernel v2.5.69, Debian 3.0
- 4-way 1000MHz Itanium 2, kernel v2.5.69, Debian 3.0
- Workarounds:
- Running a kernel without the O(1) scheduler would be the safest solution
for applications affected by this problem
-
For dynamically linked applications that call sched_yield()
frequently, a possible workaround is the replace the system call with
a dummy function that does nothing. The romp script available here does that transparently. To use
romp, extract the tar ball, type "make install" as root, then
run the application in question with:
$ romp progname arguments
The downside of this workaround is that all sched_yield() calls
become no-ops, which has a (small) potential of causing other undesirable
effects.
- If complete source code for the application is available, it might
be possible to reduce the frequency of sched_yield() calls or
to use the
sched_setaffinity() system call to pin each thread to a particular
CPU.
Other potential fixes and why they may not be a good idea
As alluded to in the previous section, we believe the fundamental
problem here is the lack of load-balancing, despite a persistent
load-average difference of 2 or more. However, depending on the
view-point, different solutions may seem obvious. Here are some of
them and why we do not believe they address the real issue:
- Let tasks earn CPU affinity
- We could argue that the problem is due to the fact that the O(1)
scheduler considers a task affine ("cache-hot") on the CPU
that created it, even if the task has never executed on that
CPU so far. This argument makes some sense but has a serious
flaw: if tasks were to start out without initial affinity,
communicating tasks would often end up being load-balanced
across CPUs, even if they could execute much faster if they
were kept on the same CPU. This effect is readily visible in
the latency-results of the LMBench microbenchmark suite (e.g.,
lat_udp suffers tremendously without the initial
affinity).
- Let sched_yield() look for idle CPUs
- When a task calls sched_yield(), it's effectively saying
"well, I could continue to run, but I want to be fair and
give others a chance to run". When there are idle CPUs in the
system, this could be taken as a signal to forget about CPU
affinity and let the calling task migrate to one of the idle
CPUs. This solution would be quite safe and should work well
in practice. However, it does nothing about fixing the
problem for frequently communicating tasks which cause a
load-average of significantly more than 1. Thus, we consider
a sched_yield() change more of a stop-gap solution than
as something addressing the fundamental problem.
Credits
Thanks to Rudolf Morf of PSI in Switzerland for reporting this
problem and for providing a test-case.
click here to download the test program
#include <pthread.h>
#include <sched.h>
#include <unistd.h>
void *
worker (void *arg)
{
long i;
for (i = 0; i < 100000000; ++i)
sched_yield (); /* give other's a chance to run */
return NULL;
}
int
main (int argc, char **argv)
{
int i, nthreads = sysconf (_SC_NPROCESSORS_ONLN);
pthread_t tid[nthreads];
for (i = 0; i < nthreads; ++i)
pthread_create (&tid[i], NULL, worker, (void *) (intptr_t) i);
for (i = 0; i < nthreads; ++i)
pthread_join (tid[i], NULL);
return 0;
}
|