The following list contains some of the most common questions about the kernel perfmon
interface:
- How can I use the PMU (and perfmon) to get fine grain timer interrupts?
The principle is rather simple. You need to use the CPU_CYCLES event and
set the counter value such that it will overflow after n cycles, i.e., 2^64 - n. You also
need to use the PFM_REGFL_OVFL_NOTIFY flag on the PMC holding the event. Then you must indicate
which task is to be notified on overflow in the ctx_notify_pid field during context creation.
A simple example of such a program has been written by Peter Chubb of UNSW; you can get the program
here.
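The start value mentioned above can be computed with plain unsigned arithmetic. The following is a minimal sketch, assuming the counter is treated as 64 bits wide (as the 2^64 in the answer implies); the function name overflow_start_value is hypothetical and not part of perfmon.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: value to load into a 64-bit counter so that it
 * overflows after exactly n increments, i.e. 2^64 - n modulo 2^64. */
uint64_t overflow_start_value(uint64_t n)
{
    assert(n > 0);           /* n == 0 has no meaningful start value */
    return (uint64_t)0 - n;  /* unsigned wraparound is well defined in C */
}
```

For example, overflow_start_value(1000) yields a value that reaches zero again after 1000 increments. The result would then be written to the PMD paired with the PMC that holds CPU_CYCLES.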
- How can I check that my kernel has the perfmon support?
This can be determined by looking at the kernel boot log via dmesg. You should
see a few lines related to perfmon. Something similar to:
...
perfmon: version 1.0 (sampling format v1.0) IRQ 238
perfmon: 47 bits counters
perfmon: 4 PMC/PMD pairs, 16 PMCs, 18 PMDs
...
Another way is to check for the existence of the /proc/perfmon file. A third
way is to look into /proc/interrupts to see if there is an interrupt (238)
registered to perfmon.
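The second check can also be done from a program. This is a minimal sketch using the standard access(2) call; the helper names path_exists and kernel_has_perfmon are hypothetical, and the presence of the file only suggests (it does not prove) that perfmon is usable.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if path exists, 0 otherwise. */
int path_exists(const char *path)
{
    assert(path != NULL);
    return access(path, F_OK) == 0;
}

/* Returns 1 if /proc/perfmon is present on the running kernel. */
int kernel_has_perfmon(void)
{
    return path_exists("/proc/perfmon");
}
```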
- Which version of perfmon contains randomization support for the sampling periods?
Randomization support was added in the 2.4.20 and 2.5.39 kernels and is present in all newer ones.
- Is it possible to monitor a multi-threaded application?
In the current 2.4 kernel this is not supported, even though several inheritance modes
are defined. In the case of a multi-threaded application monitored with pfmon, only the
first child task will be monitored.
- Does the perfmon interface support attaching/detaching from an already running process?
In the current 2.4 kernel, this is not supported.
- Is the perfmon subsystem in 2.5 identical to the one in 2.4?
At this point, they are more or less identical depending on how fast we propagate the modifications
from one kernel tree to the other. All the new features will be added to a future version of the 2.5
kernel.
- For a system-wide session, can I monitor more than one CPU from one task?
No, because perfmon needs one task per monitored CPU and that task cannot be allowed to migrate.
- For a system-wide session, do I have to use multiple threads to monitor multiple CPUs?
No, it is also possible to use separate processes. In Linux the distinction between threads and processes is
very fuzzy because of the way LinuxThreads is implemented. For system-wide sessions, it is
usually easier to have multiple threads than multiple processes because they most likely have to share some
state.
- For a system-wide session, do I have to monitor all CPUs?
No, each session only applies to one CPU at a time. You can monitor a subset of CPUs. In fact, different
users can be measuring on different CPUs at the same time. Each session can measure a different set of
events.
- Does a system-wide session monitor the idle task on a CPU?
When the privilege level mask of the events includes the kernel level (PFM_PLM0), all
activities inside the kernel for the designated CPU are monitored. Given that there is one idle
task per CPU, it is monitored as well.
- For a system-wide session, is it possible to exclude the idle task when monitoring at the kernel level?
Yes, if your kernel has perfmon-1.3 or higher. You need to specify the PFM_FL_EXCL_IDLE flag when
creating the context. This option is only relevant for system-wide monitoring.
- Does a system-wide session automatically monitor at the kernel level?
No, it is up to the application to specify a privilege level mask for each event.
- For a system-wide session, I cannot use the PFM_FL_NOTIFY_BLOCK flag, why?
It is not possible to use this flag because perfmon cannot block the task which caused a counter
overflow. This task may be a kernel daemon or the idle task, for instance, and stopping those would lock up
the system. In system-wide mode, we must keep the system running when an overflow notification is generated.
- For system-wide sessions, can the kernel aggregate the results?
No. Because system-wide sessions are independent of one another, the kernel treats them as separate.
This is true for simple counting and also for sampling sessions. It is up to the monitoring tool to aggregate
the results if necessary.
- For system-wide sessions, is it possible to share the sampling buffer between sessions?
No. Because system-wide sessions are independent of one another, each one uses a different sampling buffer.
The kernel sampling buffer (really an overflow cache) is allocated by the kernel when the perfmon context
is created. Each one is independent and there is no way to share them. It is up to the application to
aggregate the buffers.
- For sampling system-wide sessions, what process id is recorded in each sample?
The process id of the active task on the CPU when the sampling period expired.
- For system-wide sessions, is the pinning of the task maintained after the context is destroyed?
No, the previous set of allowed CPUs for the task is restored. Typically this set indicates that the task can run on any CPU.
- For system-wide sessions, does the kernel check for incompatible pinning when the context is created?
Yes, it does. Perfmon will refuse to pin a task onto a CPU on which it cannot currently run.
- How is a sampling session detected by the kernel?
Perfmon will set up a sampling session when it sees that the number of entries indicated in
ctx_smpl_entries is greater than 0.
- How do I get access to the sampling buffer in my application?
Perfmon automatically remaps the buffer into the address space of the creator of a context. The virtual address
of the first byte of the buffer is returned in ctx_smpl_vaddr.
- Can I ask for the buffer not to be remapped?
No, not at this point.
- How big a buffer can an application create?
Perfmon checks that the size of the buffer does not exceed the amount of memory that
a task can have locked (RLIMIT_MEMLOCK).
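An application can check a planned buffer size against this limit before creating the context. The sketch below uses the standard getrlimit(2) call; it assumes the soft limit is the relevant one and ignores the size of the buffer header, and the function name buffer_fits_memlock is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/resource.h>

/* Hypothetical helper: returns 1 if entries * entry_size bytes fit within
 * the soft RLIMIT_MEMLOCK limit, 0 if not, -1 if the limit cannot be read. */
int buffer_fits_memlock(size_t entries, size_t entry_size)
{
    struct rlimit rl;

    assert(entry_size > 0);
    if (entries > SIZE_MAX / entry_size)
        return 0;                     /* byte count would not fit in size_t */
    if (getrlimit(RLIMIT_MEMLOCK, &rl) != 0)
        return -1;
    if (rl.rlim_cur == RLIM_INFINITY)
        return 1;                     /* no limit configured */
    return (rlim_t)(entries * entry_size) <= rl.rlim_cur;
}
```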
- Can I have my own format for the buffer?
No, not at this point.
- How do I know when the buffer becomes full?
If the application requested a notification on overflow for certain events, then a SIGPROF
signal will be sent, at which point monitoring is stopped and the sampling buffer can safely be
processed. A notification is generated only when a counter overflows. Typically the notification is
requested for counters used to delimit a sampling period.
- I am setting the PFM_REGFL_OVFL_NOTIFY flag on the PMC but it is refused, why?
This flag only indicates that a notification must be sent; it does not say to which task. This error indicates
that the task to notify on overflow was not specified in the ctx_notify_pid field at the creation of the context.
- What happens when I use a sampling buffer but no notification?
When the buffer becomes full, perfmon automatically resets the buffer index to zero, i.e., recording will wrap
around overwriting older samples. In this case, monitoring is resumed right away.
- Is it possible to have more than one sampling period?
Yes, you can have as many sampling periods as there are counters.
- How do I specify which PMD to record in each sample?
You need to specify the PMDs to record in each sample via the ctx_smpl_regs bitfield
when the context is created.
- Is it possible to record something different from PMDs in each sample?
No, not at this point.
- Is it possible to record different sets of PMD in each sample?
No, not at this point.
- Where do I find the PMDs in each record?
Each sample begins with a fixed-size header and is followed by the set of PMDs to record, which can be empty.
The PMD registers are stored as unsigned long and in increasing order based on their index. For instance, if
both PMD4 and PMD5 are to be recorded, then the first PMD after the header is PMD4, followed by PMD5.
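Because the PMDs are stored in increasing index order, the position of a given PMD after the header is the number of recorded PMDs with a lower index. The following is a minimal sketch of that computation, assuming the recorded set is available as a single 64-bit mask in which bit i stands for PMDi; the function name pmd_slot_in_sample is hypothetical, and the header size is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: 0-based position of PMD `pmd` among the values that
 * follow the sample header, or -1 if that PMD is not recorded.
 * smpl_regs: bit i set means PMDi is recorded in each sample. */
int pmd_slot_in_sample(uint64_t smpl_regs, unsigned pmd)
{
    uint64_t lower;
    int slot = 0;

    assert(pmd < 64);
    if (!(smpl_regs & ((uint64_t)1 << pmd)))
        return -1;
    /* Recorded PMDs with a lower index are stored first: count them. */
    lower = smpl_regs & (((uint64_t)1 << pmd) - 1);
    while (lower) {
        lower &= lower - 1;   /* clear the lowest set bit */
        slot++;
    }
    return slot;
}
```

With a mask of 0x30 (PMD4 and PMD5), PMD4 is at position 0 and PMD5 at position 1, matching the example above.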
- Where do I find out how many valid samples are in the buffer?
At the beginning of the sampling buffer, there is a buffer header which contains information about
the sampling buffer, such as the size of each entry. The hdr_count field contains the number
of valid entries when monitoring is stopped.
- Can I write into the sampling buffer area mapped into my address space?
No, it is mapped read-only. Any write access will result in a segmentation violation.
- How do I figure out which counter overflowed when I get a notification via SIGPROF?
First of all, the application needs to check that the SIGPROF was generated by perfmon and not another subsystem.
This can be done by checking the si_code in the siginfo structure passed to the signal handler. The code must be
equal to PROF_OVFL which is defined in perfmon.h. In this case, the siginfo structure contains an extended
structure for SIGPROF. Because perfmon is still being developed this structure is not yet available in the
/usr/include/bits/siginfo.h provided by the C library. To alleviate the problem, the libpfm library comes
with a private siginfo header which can be used to access the information. The file to include
is /usr/include/perfmon/pfm_siginfo.h. This file defines a pfm_siginfo_t structure which can be laid over
the regular siginfo structure. In this structure the sy_pfm_ovfl field is a bitmap in which each bit set indicates a PMD
which overflowed. For instance, if the bitmask value is 0x10, it means that PMD4 overflowed. More
than one bit can be set.
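Decoding such a bitmap is a simple loop over its bits. This is a minimal sketch that works on a plain 64-bit value copied out of the overflow bitmap; it does not include the perfmon headers, and the function name overflowed_pmds is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: writes the indices of the set bits of `bitmap` into
 * out[] (room for 64 entries) in increasing order and returns how many
 * indices were written. */
size_t overflowed_pmds(uint64_t bitmap, unsigned out[64])
{
    size_t n = 0;
    unsigned i;

    assert(out != NULL);
    for (i = 0; i < 64; i++)
        if (bitmap & ((uint64_t)1 << i))
            out[n++] = i;
    return n;
}
```

A bitmap of 0x10 yields the single index 4 (PMD4), and 0x30 yields 4 and 5.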
- Does a per-process session monitoring at the kernel level exclude device interrupt
execution on behalf of other tasks?
No, if the current task is monitored and there is, let us say, a network interrupt,
then the execution of the interrupt handler will be part of the monitored execution. There is currently
no way of separating the kernel execution spent servicing actual requests from a task from asynchronous
interrupts, such as device I/O possibly for other tasks.