Jump to content United States-English
HP.com Home Products and Services Support and Drivers Solutions How to Buy
» Contact HP

hp.com home


Affinity über alles

» 

HP Labs

» Research
» News and events
» Technical reports
» About HP Labs
» Careers @ HP Labs
» People
» Worldwide sites
» Downloads
Content starts here


Author: David Mosberger, HP
Created: 20 May 2003
Last update:14 Oct 2003

Summary

With the O(1) scheduler, affinity overrules all other aspects of the scheduler. Not only that, but a task that has never run on a CPU may still have affinity for that CPU! The net effect is that applications calling sched_yield() at a certain minimal frequency may end up getting all their tasks running on a single CPU, even when there are other CPUs that are idle. This is not a temporary condition, but one that can last indefinitely, e.g., keeping one CPU busy with a load-level of 10 while other CPUs are sitting by idly.

Description

The O(1) scheduler has a very simple notion of affinity: any task that has run on a particular CPU within the last N milliseconds is considered to be CPU-affine and will not be considered for migration to another CPU. The exact value of N depends on the CPU architecture, cache size, memory bandwidth, and potentially other factors. On all platforms, a value of N is chosen at boot time. Typical values are in the 1-10ms range. (In the kernel sources, N is stored in a variable called cache_decay_ticks. The unit of this value is "jiffies", i.e., clock ticks.)

To understand where this notion of CPU affinity may cause a problem, consider an application that has parallelized a loop to execute on M threads. The application might do something along the lines of the following:

  1. spawn M threads
  2. let each thread iterate over 1/Mth of the total number of loop iterations
  3. collect the (partial) results of every thread into a single global result
A real parallel application normally consists of multiple such parallelized loops.

For such barrier-style parallel applications, it is important for all threads to finish their portion of the loop at roughly the same rate, so that they reach step 3 at the same time. Otherwise, a single thread that is substantially slower than the others would cause all but one CPU to go idle and the program would be prevented from completing step 3 and entering the next parallel phase, thus causing undesirable idle time.

Now, if all loop iterations took exactly the same amount of time to execute, simultaneous arrival could be guaranteed trivially, but if the amount of work differs a bit from one iteration to the next, there is a real issue with threads completing their work at very different times. To mitigate this effect, the application may want to call sched_yield() periodically (e.g., after completing 10 iterations). The goal of doing this is effectively to ensure that each thread completes the same number of iterations per second, even if some threads require more CPU time than others. In other words, in such applications, "fairness" is measured here in "number of completed iterations", not "number of CPU cycles spent".

So what's the problem? There is no problem if each loop iteration takes long enough that sched_yield() is called only infrequently, but as soon as sched_yield() is called at least once every N milliseconds, we have a problem: all of the M threads may be considered affine ("cache hot") on one and the same CPU and they will therefore not be considered for migration. Thus, we may end up with a "parallel" application that ends up keeping just one CPU busy!

NOTE! Of course, it would be tempting to dictate that a program should never invoke sched_yield() at a high frequency. Doing so clearly costs some performance due to additional system-call and (potentially) context-switch overhead. However, things are not quite so simple in practice. First, keep in mind that a cache-affinity timeout of around 10ms is actually quite large for modern CPUs (and a large timeout value is often a Good Thing, because it keeps communicating threads together on the same CPU). Second, if sched_yield() is to be an effective tool to encourage application-level fairness, it needs to be called at least once per time-slice. Otherwise, the normal scheduling policy will get "in the way" and interfere with reaching the goal of completing loop iterations at the same rate. So this puts an lower bound on the frequency on which sched_yield() needs to be called. Third, consider that it is not always easy to estimate just how fast a loop-iteration will execute. For example, the same code might easily complete one iteration in 5ms when all data is in the cache, but take more than 1s when the data has to be read from memory. So, depending on the input data of the application, execution time can vary widely. Of course, the application could try to be "smarter" and measure loop iteration times, but that's not free either and may cause other problems. Even if you don't believe in sched_yield() as a means to encourage application-level fairness, the same problem could arise for a group of communicating tasks. If the tasks complete a request/response roundtrip in less than Nms and impose a load average of more than 1, the same problem of keeping only one CPU busy would arise. If this still doesn't convince, consider that there are real (and important) parallel applications out there which work just fine with the original Linux scheduler, but run substantially slower with the O(1) scheduler precisely because of this problem. Maintaining compatibility for existing applications generally is considered to be one of the hall-marks of the Linux kernel.

Test-case to reproduce the problem

The problem can be demonstrated easily with the artificial test program below. It should be invoked like this:

$ /usr/bin/time ./test-busy
On a dual-CPU system, the above test case might produce the following results:
  • without O(1) scheduler (Linux v2.4.19):
    $ /usr/bin/time ./test-yield
    40.83user 37.51system 0:39.18elapsed 199%CPU
    0inputs+0outputs (32major+12minor)pagefaults 0swaps
    
  • with O(1) scheduler (Linux v2.5.69):
    $ /usr/bin/time ./test-yield
    38.46user 164.06system 3:22.44elapsed 100%CPU
    0inputs+0outputs (32major+15minor)pagefaults 0swaps
    

The important part here is the difference in CPU-utilization: without the O(1) scheduler, the test-program keeps both CPUs busy (199% utilization). With the O(1) scheduler, only one CPU is ever busy (even though the load average is 2 for more than 3 minutes!). Note that the difference in elapsed time is not particularly relevant because the test-program is a highly simplified version of a real-world test case.

Status

Fixed: Recent 2.5/2.6 kernels break affinity if there are any idle CPUs. This fixes this particular problem, but of course has the unfortunate effect of destroying the nice "initial affinity" property which made the O(1) scheduler work so well for communicating processes, such as LMbench's lat_udp. Given that v2.6.0 is just about to be released, this will probably be the final behavior for the 2.6 series of the kernel.

  • Analysis: The problem is due to a design-flaw in the O(1) scheduler. We believe that the fundamental problem is that the O(1) scheduler lets affinity override all other concerns. Over short periods of time, that is acceptable, but over longer periods of time the O(1) scheduler should notice that one CPU has a load average of 2 (or more) while other CPU(s) sit idle. When such a situation arises, the scheduler should invoke the load balancer.

  • Scope: believed to affect all O(1)-based kernels. Problem has been confirmed to exist on:
    • 4-way 200MHz Pentium Pro, kernel v2.4.20-9smp, Red Hat 9
    • 2-way 1000MHz Itanium 2, kernel v2.5.69, Debian 3.0
    • 4-way 1000MHz Itanium 2, kernel v2.5.69, Debian 3.0

  • Workarounds:
    • Running a kernel without the O(1) scheduler would be the safest solution for applications affected by this problem
    • For dynamically linked applications that call sched_yield() frequently, a possible workaround is the replace the system call with a dummy function that does nothing. The romp script available here does that transparently. To use romp, extract the tar ball, type "make install" as root, then run the application in question with:
      $ romp progname arguments
      
      The downside of this workaround is that all sched_yield() calls become no-ops, which has a (small) potential of causing other undesirable effects.
    • If complete source code for the application is available, it might be possible to reduce the frequency of sched_yield() calls or to use the sched_setaffinity() system call to pin each thread to a particular CPU.

Other potential fixes and why they may not be a good idea

As alluded to in the previous section, we believe the fundamental problem here is the lack of load-balancing, despite a persistent load-average difference of 2 or more. However, depending on the view-point, different solutions may seem obvious. Here are some of them and why we do not believe they address the real issue:

Let tasks earn CPU affinity
We could argue that the problem is due to the fact that the O(1) scheduler considers a task affine ("cache-hot") on the CPU that created it, even if the task has never executed on that CPU so far. This argument makes some sense but has a serious flaw: if tasks were to start out without initial affinity, communicating tasks would often end up being load-balanced across CPUs, even if they could execute much faster if they were kept on the same CPU. This effect is readily visible in the latency-results of the LMBench microbenchmark suite (e.g., lat_udp suffers tremendously without the initial affinity).
Let sched_yield() look for idle CPUs
When a task calls sched_yield(), it's effectively saying "well, I could continue to run, but I want to be fair and give others a chance to run". When there are idle CPUs in the system, this could be taken as a signal to forget about CPU affinity and let the calling task migrate to one of the idle CPUs. This solution would be quite safe and should work well in practice. However, it does nothing about fixing the problem for frequently communicating tasks which cause a load-average of significantly more than 1. Thus, we consider a sched_yield() change more of a stop-gap solution than as something addressing the fundamental problem.

Credits

Thanks to Rudolf Morf of PSI in Switzerland for reporting this problem and for providing a test-case.

Test program

click here to download the test program

#include <pthread.h>
#include <sched.h>
#include <unistd.h>

void *
worker (void *arg)
{
  long i;

  for (i = 0; i < 100000000; ++i)
    sched_yield ();	/* give other's a chance to run */
  return NULL;
}

int
main (int argc, char **argv)
{
  int i, nthreads = sysconf (_SC_NPROCESSORS_ONLN);
  pthread_t tid[nthreads];

  for (i = 0; i < nthreads; ++i)
    pthread_create (&tid[i], NULL, worker, (void *) (intptr_t) i);
  for (i = 0; i < nthreads; ++i)
    pthread_join (tid[i], NULL);
  return 0;
}
Printable version
Privacy statement Using this site means you accept its terms Feedback to HP Labs
© 2008 Hewlett-Packard Development Company, L.P.