| Command name | Description |
|---|---|
| PFM_CREATE_CONTEXT | create a perfmon context |
| PFM_WRITE_PMCS | program PMC registers |
| PFM_WRITE_PMDS | program PMD registers |
| PFM_READ_PMDS | read PMD register values |
| PFM_START | activate monitoring |
| PFM_STOP | stop monitoring |
| PFM_LOAD_CONTEXT | attach perfmon context |
| PFM_UNLOAD_CONTEXT | detach perfmon context |
| PFM_RESTART | resume monitoring after notification |
| PFM_CREATE_EVTSETS | create or modify event sets |
| PFM_DELETE_EVTSETS | delete event sets |
| PFM_GETINFO_EVTSETS | get information about event sets |
The system call was preferred over the device driver model because it is built in, by construction, and offers better flexibility
for the type and number of parameters of the call compared to using ioctl() with the driver model. A system call makes it
easier for implementations to support the per-thread monitoring mode, which requires access to the thread context switch
code to save and restore the PMU machine state. The system call resembles ioctl(), with a file descriptor (fildes), a
command (cmd) to apply, and the argument (cmd_arg) to the command. The difference, though, is that cmd_arg can
be a vector of arguments to which the command must be applied. The number of elements in the vector is
indicated by the cmd_num_arg parameter. The set of commands defined by the interface is shown in the table above.
The entire PMU machine state is encapsulated by a software abstraction called the
perfmon context. The context, by itself, is never directly exposed to applications. However, each context
is uniquely identified and manipulated with a file descriptor obtained when the context is created using the
PFM_CREATE_CONTEXT command.
Figure 1: Itanium® 2 PMU register mappings
The PMU hardware interface is abstracted by considering that it is composed of two sets of registers: a set of control (PMC)
and a set of data (PMD) registers. The actual PMU registers are not exposed to applications. To simplify applications further,
the interface exports all PMU registers as 64-bit wide. In particular, counters are always manipulated as 64-bit
registers even though many PMU implementations provide far fewer bits. For example, Itanium® implements 32 bits
while Itanium® 2 implements only 47 bits. Not having to worry about the width of counters is important for sampling.
The interface does not know what each register does, how many there are, or how they are associated with each other. Each
implementation maps the logical PMC and PMD registers onto the actual PMU registers. In the case of IPF, the mapping is relatively
simple: for instance, PMC4 represents the actual PMC4 register. The debug registers, IBR0-IBR7 and DBR0-DBR7, are used
by the range restriction feature of the Itanium® 2 PMU. They are manipulated as PMC registers and have dedicated
mappings as shown in figure 1.
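The 64-bit view of a narrower counter can be pictured with the following minimal sketch. It assumes that the upper bits are kept in software and updated each time the hardware part wraps; the text does not describe the mechanism, so this scheme and the class name VirtualCounter are illustrative assumptions only.

```python
class VirtualCounter:
    """Software view of a narrow hardware counter, exposed as a 64-bit value."""

    def __init__(self, width):
        self.width = width              # number of bits implemented in hardware
        self.hw_mask = (1 << width) - 1
        self.hw_value = 0               # what the hardware register would hold
        self.soft_upper = 0             # upper bits maintained in software (assumption)

    def count(self, events):
        """Advance the hardware counter, folding each wrap into the software part."""
        total = self.hw_value + events
        self.soft_upper += total >> self.width      # number of wraps (overflows)
        self.hw_value = total & self.hw_mask

    def read64(self):
        """Return the value as a 64-bit quantity, as an application would see it."""
        full = (self.soft_upper << self.width) | self.hw_value
        return full & ((1 << 64) - 1)


counter = VirtualCounter(width=47)      # Itanium 2 counters implement 47 bits
counter.count((1 << 47) - 1)            # one event short of wrapping
counter.count(5)                        # wraps once, then counts 4 more
print(counter.read64() == (1 << 47) + 4)
```

With such a scheme a tool never needs to know whether the underlying counter is 32 or 47 bits wide.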
The PMC registers can be written with the
PFM_WRITE_PMCS command. The PMD registers can be read and written with the PFM_READ_PMDS and
PFM_WRITE_PMDS commands respectively. These commands accept vector arguments, which makes it possible to program
multiple PMC or PMD registers in a single system call. Registers are manipulated using a (number, value) pair representation.
To further isolate the interface from PMU-specific knowledge, all event-specific information is relegated to the user level
where it can fairly easily be encapsulated in a library.
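As a rough illustration of the vector-of-pairs idea, the toy model below applies one command to a whole list of (number, value) pairs. It is not the real system call: RegArg, ToyContext and the register values are hypothetical names and numbers chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class RegArg:
    """One element of a command's vector argument: a (number, value) pair."""
    reg_num: int
    reg_value: int = 0


class ToyContext:
    """Hypothetical stand-in for a perfmon context; holds logical PMC/PMD state."""

    def __init__(self):
        self.pmcs = {}
        self.pmds = {}

    def write_pmcs(self, args):
        for a in args:                      # the whole vector is handled in one call
            self.pmcs[a.reg_num] = a.reg_value

    def write_pmds(self, args):
        for a in args:
            self.pmds[a.reg_num] = a.reg_value

    def read_pmds(self, args):
        for a in args:                      # values are returned in place
            a.reg_value = self.pmds.get(a.reg_num, 0)
        return args


toy = ToyContext()
toy.write_pmcs([RegArg(4, 0x12), RegArg(5, 0x34)])   # made-up control values
toy.write_pmds([RegArg(4, 0), RegArg(5, 100)])
result = toy.read_pmds([RegArg(4), RegArg(5)])
```

Because the kernel side only sees register numbers and values, the meaning of each value can stay in a user-level library.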
Whenever possible, the interface exploits the existing infrastructure of the Linux kernel. In particular, we take full advantage
of the virtual file system layer. A context is destroyed by simply invoking the close() system call. Similarly, a context can
be manipulated by all threads with access to the file descriptor. File descriptor sharing on fork() is used to access a
context in a child process, when needed.
2.2 Per-thread monitoring
A context can be attached to a thread using the
PFM_LOAD_CONTEXT command. A thread can monitor another thread or itself, i.e., self-monitoring.
For multi-threaded processes, it is necessary to create one context per thread. It is possible to
attach to and detach from a thread that is already running.
2.3 System-wide monitoring
Figure 2: how to measure across multiple processor cores?
System-wide monitoring is set up using exactly the same series of calls as for a
per-thread session. The type of the context is indicated when the context is created.
The interface uses a CPU-wide approach to implement
system-wide monitoring. A context monitors all the threads running on one designated processor only. Coverage
for a multi-processor system is achieved by creating one context on each processor as shown in figure 2. A context
is bound to a processor during the call to PFM_LOAD_CONTEXT. The processor used by the calling thread is considered
the processor to monitor. The model is simple in that it does not try to invent yet another mechanism to set processor
affinity. In order to guarantee that the right processor is monitored, the thread must be pinned using the
sched_setaffinity() system call.
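The pinning step can be sketched with Python's wrapper around the same system call. The helper name pin_to_cpu is an assumption, and the point at which the context would be loaded is only marked by a comment since the perfmon call itself is not modeled here.

```python
import os


def pin_to_cpu(cpu):
    """Restrict the caller to a single CPU and return the resulting affinity set."""
    os.sched_setaffinity(0, {cpu})      # 0 designates the caller
    return os.sched_getaffinity(0)


if hasattr(os, "sched_setaffinity"):        # available on Linux only
    original = os.sched_getaffinity(0)
    target = min(original)                  # any CPU the caller may already use
    pinned = pin_to_cpu(target)
    # ... the context would be loaded here, binding it to `target` ...
    os.sched_setaffinity(0, original)       # restore the previous affinity
```

Covering a multi-processor system would repeat this for each processor, with one context per processor.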
The approach simplifies the kernel implementation because there is no need to propagate the PMU settings across all
processors. The interface scales much better to large, NUMA-style, configurations because of improved code and data
locality. This is especially important when sampling because the overhead of writing the samples quickly becomes a problem
at high sampling rates. In our model, there is one sampling buffer per processor.
2.4 Support for sampling
The interface currently supports Time-Based Sampling (TBS) at the user level. In this mode, the sampling period is
determined by a timeout. When the timeout expires, a sample is recorded. This mode can easily be implemented at the user
level using existing interfaces such as setitimer().
The interface also supports Event-Based Sampling (EBS) where the sampling period is expressed as a number
of occurrences of an event rather than by time. This mode requires that the PMU be capable of generating
an interrupt when a counter overflows, i.e., wraps from 2^w - 1 back to 0, where w is the bit width of the counter.
All modern PMUs provide this capability. Hence a sampling period p is expressed as the counter value 2^w - p.
After p events have been observed, the counter overflows and the PMU interrupts,
indicating that it is time to record a sample. Time-based sampling can fairly easily be emulated in this mode by
using an event with a strong correlation with time, such as the number of elapsed cycles, which many PMUs offer.
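The relation between the period and the counter value can be checked with a small sketch. The function names are illustrative, and the counter is simulated in software rather than read from a PMU.

```python
def initial_counter_value(period, width):
    """Value to load into a width-bit counter so it overflows after `period` events."""
    if not 0 < period <= (1 << width):
        raise ValueError("period must fit in the counter width")
    return ((1 << width) - period) & ((1 << width) - 1)


def events_until_overflow(start, width):
    """Count increments until a width-bit counter wraps from 2^w - 1 back to 0."""
    value, events = start, 0
    while True:
        value = (value + 1) & ((1 << width) - 1)
        events += 1
        if value == 0:              # wrap-around: the PMU would interrupt here
            return events


start = initial_counter_value(100, width=8)     # small width keeps the loop short
observed = events_until_overflow(start, width=8)
```

For an 8-bit counter and a period of 100, the counter starts at 156 and wraps after exactly 100 increments.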
The interface offers as many sampling periods as there are counters with overflow capability. Each period can be randomized
to avoid biased samples. Bias is fairly frequent when sampling on events that occur very often, such as branches. In
that case, certain branches may never be captured even though they are executed very frequently. This is explained by
the fact that the sampling period falls into lockstep with the execution of the monitored program.
Because the PMU is designed for statistical sampling, randomization can be used to improve the accuracy of the
samples.
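One possible way to randomize a period is to add a small pseudo-random offset after each sample. The mask-based scheme below is an assumption made for illustration; the text does not say how the interface randomizes periods, and next_period is a hypothetical helper.

```python
import random


def next_period(base_period, mask, rng):
    """Return base_period plus a pseudo-random offset confined to `mask`."""
    return base_period + (rng.getrandbits(32) & mask)


rng = random.Random(12345)              # fixed seed only to make the sketch repeatable
periods = [next_period(100000, 0xFF, rng) for _ in range(5)]
```

Each period then falls between 100000 and 100255 events, which breaks any fixed alignment between the sampling points and a loop in the monitored code.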
When a counter overflows, the interface can notify the monitoring thread via a message. Such notification can be selected on a
per-counter basis. Here again, the existing kernel infrastructure is leveraged. Each message is retrieved from the message
queue with a simple read() system call. Similarly, asynchronous notification with a signal (SIGIO) can be requested by
setting the O_ASYNC flag on the descriptor. Furthermore, a single thread may wait for notifications coming from multiple
contexts using the select() or poll() system call.
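Waiting on several contexts at once follows the usual descriptor pattern, sketched below. Ordinary pipes stand in for perfmon context descriptors, and the message content and size are made up; only select() and read() are used as the text describes.

```python
import os
import select


def wait_for_notifications(fds, timeout):
    """Wait until a descriptor is readable, then read one message from each ready one."""
    readable, _, _ = select.select(fds, [], [], timeout)
    return {fd: os.read(fd, 64) for fd in readable}


# Two pipes stand in for two perfmon context descriptors (assumption).
ctx_a_read, ctx_a_write = os.pipe()
ctx_b_read, ctx_b_write = os.pipe()
os.write(ctx_b_write, b"overflow")      # pretend context B posted a notification

messages = wait_for_notifications([ctx_a_read, ctx_b_read], timeout=1.0)
for fd in (ctx_a_read, ctx_a_write, ctx_b_read, ctx_b_write):
    os.close(fd)
```

Only the descriptor with a pending message is reported, so one thread can service many contexts without polling each of them.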
2.4.1 Kernel level sampling buffer
Figure 3: re-mapping of the kernel sampling buffer
When sampling, it is important to minimize the overhead as it impacts the sampling rate and may affect the accuracy of the
profiles. Although it is possible to sample completely from user level, the interface also provides a kernel level sampling
buffer which is allocated when the context is created. The idea is to amortize the cost of notification by
sending the message only when a lot of samples are available, i.e., the buffer is acting as a cache. Furthermore, to
minimize the cost of extracting the buffer from the kernel with potentially large memory copies, the buffer is
re-mapped into the user address space of the monitoring process using the mmap() system call, as shown in figure 3.
2.4.2 Custom sampling buffer formats
Figure 4: custom sampling buffer format architecture
The problem with having a kernel level sampling buffer is that there exist many ways of storing the samples into the buffer.
Some tools, such as VTUNE, want to keep all samples in sequential order; others, like OProfile, aggregate them; still others may
want to record non-PMU information such as the amount of free memory.
We realized early on that it was not possible to come up with a
universal buffer format. Instead, the interface provides a mechanism for users to plug in their own buffer
formats via kernel modules (DLKM). The perfmon core does not know anything about how samples are recorded. Figure 4
depicts the architecture of the formats.
Each module provides a set of callbacks which the perfmon core invokes on certain conditions. For instance, each format
provides a handler function which is called when a counter overflows. Each format is uniquely identified by a
128-bit unique identifier (UUID) which is passed to the kernel when a context is created. Each sampling format
controls:
- how the buffer is allocated. Formats are not required to use the perfmon core to allocate.
- how the buffer is exported to the user. Formats may export the buffer using a private interface, such as a device
driver interface.
- how samples are recorded in the buffer.
- what information is recorded. Formats may store PMU registers but also other kernel information.
- when an overflow notification is sent to the monitoring tool.
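The division of labor between the core and a format can be modeled in a few lines. This is a toy user-level model, not kernel code: SamplingFormat, ToyCore and sequential_handler are hypothetical names, and the convention that the handler's return value requests a notification is an assumption.

```python
import uuid


class SamplingFormat:
    """Hypothetical format: a UUID plus the callback the core would invoke."""

    def __init__(self, fmt_uuid, handler):
        self.uuid = fmt_uuid
        self.handler = handler          # called on each counter overflow


class ToyCore:
    """Knows nothing about sample layout; it only dispatches to the chosen format."""

    def __init__(self):
        self.formats = {}

    def register(self, fmt):
        self.formats[fmt.uuid] = fmt

    def create_context(self, fmt_uuid):
        return {"format": self.formats[fmt_uuid], "buffer": []}

    def on_overflow(self, context, pmd_values):
        # The return value tells the core whether to notify the monitoring tool.
        return context["format"].handler(context["buffer"], pmd_values)


def sequential_handler(buffer, pmd_values, capacity=3):
    """Append the sample; request a notification once the buffer is full."""
    buffer.append(dict(pmd_values))
    return len(buffer) >= capacity


core = ToyCore()
fmt_id = uuid.uuid4()                   # any 128-bit identifier
core.register(SamplingFormat(fmt_id, sequential_handler))
sampling_ctx = core.create_context(fmt_id)
notify = [core.on_overflow(sampling_ctx, {4: n}) for n in range(3)]
```

Here the format, not the core, decides what a sample contains and that the tool is notified only on the third overflow.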
A default format which stores samples in sequential order in a linear buffer is provided. We have successfully developed
other formats, such as one using n-way buffering, where the buffer is split into n sections. When one section fills up,
the application is notified but sampling continues in the other sections. This technique is used to minimize
blind spots. We have also developed a format which samples the kernel level call stack, i.e., the full call path.
Similarly, the OProfile team was able, in two days, to hook up their sample recording code to our interface.
They were able to reuse all of their kernel level code. The buffer is exported through their own device driver
interface.
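The n-way buffering idea mentioned above can be sketched as follows. The class is a simplified user-level model with assumed names; in particular, dropping samples when the next section has not been drained yet is a choice made for this sketch, not something the text specifies.

```python
class NWayBuffer:
    """Buffer split into n sections; recording continues while a full one is drained."""

    def __init__(self, n_sections, section_size):
        self.sections = [[] for _ in range(n_sections)]
        self.section_size = section_size
        self.current = 0
        self.notifications = []         # indices of sections reported as full
        self.dropped = 0

    def record(self, sample):
        section = self.sections[self.current]
        if len(section) == self.section_size:
            self.dropped += 1           # the tool has not drained this section yet
            return
        section.append(sample)
        if len(section) == self.section_size:
            self.notifications.append(self.current)     # tell the tool
            self.current = (self.current + 1) % len(self.sections)

    def drain(self, index):
        """Called by the tool to collect and empty a full section."""
        samples, self.sections[index] = self.sections[index], []
        return samples


buf = NWayBuffer(n_sections=2, section_size=2)
buf.record(0)
buf.record(1)                   # section 0 fills up, tool is notified
buf.record(2)                   # sampling continues in section 1
drained = buf.drain(0)
buf.record(3)                   # section 1 fills up
buf.record(4)                   # back in section 0, already drained
```

As long as the tool drains a section before the others fill up, no sample is lost and there is no blind spot.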
The custom sampling format mechanism constitutes an innovation and key strength of the interface because it opens up a lot
of possibilities for sampling tools. It lets developers focus on the added-value of the tools instead of the intricacies of how to
access the PMU or how to export a sampling buffer. For instance, the kernel call stack sampling format required about 300
lines of C code.
2.5 Event sets and multiplexing
The amount of hardware resources dedicated to the PMU is limited because this is not a critical part of the processor. That
explains why the number of counters may be limited and why there may be some constraints on how events can
be combined. For instance, it is common to have constraints such as event A cannot be measured with event B.
Because of such limitations, certain measurements may require multiple runs.
For instance, on Itanium® 2, a detailed cycle breakdown requires about a dozen events, yet there are only four counters.
Restarting the workload may not be convenient and may introduce fluctuations in the results making the interpretation
difficult.
To overcome these difficulties, the interface exports the abstraction of an
event set. The idea is to split events into sets of no more than m events if the PMU has m counters. Counters are then
multiplexed between the various sets. Simple scaling can be used to create the illusion that all counters were active
during the entire monitoring session.
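The simple scaling mentioned above amounts to a linear extrapolation. The sketch below assumes that the time each set was active is known; scale_count is an illustrative helper name.

```python
def scale_count(raw_count, time_active, time_total):
    """Estimate the count over the whole session from a partially active set."""
    if time_active <= 0:
        raise ValueError("the set was never active")
    return raw_count * time_total / time_active


# A set that was active for 1 time unit out of 4 and counted 250000 events.
estimate = scale_count(250_000, time_active=1, time_total=4)
```

The estimate is only as good as the assumption that the event rate was similar while the set was inactive, which is why short switching intervals help.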
Implementing multiplexing at the kernel level instead of the user level provides a significant
performance boost. This is especially visible for per-thread contexts because the switching
occurs in the context of the monitored thread instead of by context-switching back and forth to the monitoring tool. This is very important because if the overhead is
small, the multiplexing frequency can be increased, which minimizes blind spots and therefore increases accuracy of the
scaling.
Each set includes the full PMU state and is uniquely identified by a number. Sets can be created and deleted on demand. They
are stored in an ordered list based on their identification number. Switching follows the list in a round-robin fashion. Each set
can choose from two modes of switching:
- time-based: a timeout is specified per set; when the timeout expires, the next set is activated.
- overflow-based: switching occurs when selected counters, called triggers, overflow.
There can be as many triggers as there are counters. Each trigger has an overflow threshold which determines
after how many overflows switching occurs. This mechanism avoids dedicating a counter as a trigger and therefore maximizes
resource utilization. This is another innovation of the interface.
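Overflow-based switching over an ordered list of sets can be modeled as below. This is a simplified sketch with assumed names (EventSet, SetSwitcher); it uses one threshold per set rather than one per trigger to keep the example short.

```python
class EventSet:
    def __init__(self, set_id, switch_threshold):
        self.set_id = set_id
        self.switch_threshold = switch_threshold    # trigger overflows before switching
        self.overflows = 0


class SetSwitcher:
    """Keeps sets ordered by identifier and switches between them round-robin."""

    def __init__(self, sets):
        self.sets = sorted(sets, key=lambda s: s.set_id)
        self.active = 0

    def on_trigger_overflow(self):
        """Called when a trigger counter of the active set overflows."""
        current = self.sets[self.active]
        current.overflows += 1
        if current.overflows >= current.switch_threshold:
            current.overflows = 0
            self.active = (self.active + 1) % len(self.sets)
        return self.sets[self.active].set_id


switcher = SetSwitcher([EventSet(2, 1), EventSet(0, 2), EventSet(1, 1)])
trace = [switcher.on_trigger_overflow() for _ in range(4)]
```

Set 0 stays active for two overflows because of its threshold of 2, then the list is walked in identifier order and wraps back to set 0.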
Overflow-based switching can be used to implement counter cascading, where certain events are counted only after a certain
threshold is reached. The threshold is expressed as a number of occurrences of an event.
2.6 Security
The interface is designed to be implemented in the Linux kernel; therefore, it follows the same security guidelines. In
particular, it is not admissible to leak kernel information to users, or access information about other processes unless
authorized. The interface survives malicious usage, such as denial of service attacks. For instance, an application cannot
allocate an arbitrarily large amount of kernel memory for the sampling buffer. Similarly, the creation of a system-wide context
may be restricted to a certain group of trusted users as this can be used to potentially extract information from other
processes.
2.7 Current status
The interface has been implemented in the 2.6 kernel series
for Linux/ia64. Kernels from the earlier 2.4 series for Linux/ia64 implement the first generation of
the interface, which had strong limitations. The current interface is not backward compatible
with the first generation; a small porting effort is needed to adapt existing applications.
The full specification of the interface is available as
an HPLabs technical report.
Several open-source and commercial tools are already available
for Linux/ia64 and the perfmon2 interface. Among them is the
HP Caliper tool.
A new implementation of the interface is now available
as a standalone kernel patch. It follows the specification more closely. It also includes
support for the X86-64 and P6 processor families as well as preliminary support for PPC64.