overview of perfmon kernel interface


All modern processors have a hardware performance monitoring unit (PMU) which exports a set of counters to collect micro-architectural events such as the number of elapsed cycles or the number of L1 cache misses. Exploiting those counters to analyze the performance of key applications and operating systems is becoming common practice. No industry-standard benchmark result is published without going through rounds of performance analysis, which help pinpoint compiler, operating system, processor, or machine configuration problems.

In the context of the Itanium® Processor Family (IPF), performance monitoring is extremely important because performance is heavily dependent on code quality. If the code is poorly scheduled, it does not perform well. Compilers need performance feedback to adjust the type and level of optimizations they apply. On IPF, compilers can really benefit from Profile Based Optimization (PBO). With this technique, an application is instrumented at compile time, then run. The collected profile is finally fed back to the compiler for a second pass of optimization. Even though today PBO is mostly done through instrumentation, it is moving in the direction of using the hardware counters to gain access to lower-level information, e.g., from basic-block call counts to cache misses. Similarly, managed runtime environments (MREs), such as Java, which compile code on the fly, are also beginning to exploit hardware counters to tweak how they generate code, i.e., moving towards dynamic optimization. Finally, with the rise of multi-threaded processors, it may become useful for the scheduler of an operating system to understand the execution profile of a thread, especially its cache behavior, and to adjust its scheduling decisions accordingly, for instance to avoid cache thrashing between two threads.

Although the PMU can harvest very useful information, it is not always exploited to the fullest of its capability because of a lack of a good kernel interface. The PMU is normally accessed from the kernel because it requires special instructions which can only be executed at the most privileged level of execution. Until recently, PMUs have often been documented poorly. They were kept secret and were used internally by hardware vendors during processor bring up and for optimization of key proprietary applications. As such, there was no real need for software standardization.

With the rise of Linux as an open-source operating system that runs across all major hardware platforms, the issue of standardizing on a performance monitoring interface is becoming more pressing. Linux has historically been rather limited in its offering of good PMU-based performance tools, including on the leading IA-32 platform. There is the gprof profiler, which uses instrumentation and recompilation, and some PMU-based tools such as VTUNE™ Performance Analyzer or OProfile. The lack of a common interface has led to software fragmentation as each tool comes with its own specialized kernel interface. Just on the IA-32 platform, VTUNE uses its own driver, the PAPI toolkit is based on the Perfctr interface, and there is the OProfile kernel interface used by tools such as Prospect. The danger with this approach is that code is duplicated and the interfaces are not coordinated, yet they all share the same hardware resource. Until recently none of the interfaces were even part of the official kernel source tree; instead, users had to download a patch or an external kernel module and recompile. Such a situation is not very attractive for developers, and especially for ISVs, because they cannot find a suitable interface that is built in, stable, documented, and present on all Linux distributions. We believe this has slowed down the development or porting of interesting performance tools compared to other commercial operating systems.

By construction, the PMU is very specific to each processor implementation because it operates at the micro-architecture level. Even inside a processor family there can be huge variations. The case of IPF is quite interesting because this is the first time that the framework for the PMU is specified by the processor architecture documentation. Within the framework, each Itanium® processor can extend the basic functionality of the PMU. Differences exist already between Itanium® and Itanium® 2. For instance, the width of the counters and the number and encoding of events differ between the two models.

Although there are many variations, even inside the same processor architecture, we believe it is possible to exploit some common characteristics. The key point is that all modern PMUs use a register-based interface with simple read and write operations. More precisely, a PMU commonly exports a set of control registers used to configure what is to be measured and a set of data registers where results are collected. The IPF PMU follows this model with Performance Monitoring Configuration (PMC) and Performance Monitoring Data (PMD) registers. The IA-32 processor PMUs use Model Specific Registers (MSR). If the interface stays close to this basic hardware interface, we believe it can be made portable across platforms. Furthermore, if the interface focuses solely on providing access to the hardware resource and not on the specifics of programming each PMU, which requires knowledge of events and what they do, we believe that a standard can be established across all Linux hardware platforms and possibly for other operating systems.

During the development of Linux/ia64, we realized that we needed a kernel interface to access the PMU. We faced the lack of an industry-standard interface, and we decided to use the port of Linux to IPF as a test-bed to design one.

The work presented in this paper addresses the following challenge:

  • how to design a generic performance monitoring interface to access all PMU implementations that would also support a variety of performance tools?

That interface would have to be built in, secure, robust, and documented. It would have to support the diverse needs of performance tools. For instance, some tools simply collect counts of events while others sample to collect profiles. It must also support measurements applied to a single thread (per-thread) or to the entire system (system-wide). The interface must support all these modes of operation in a flexible and efficient manner in order to minimize the overhead, which may perturb measurements and lead to misinterpretation of the results. The interface must also allow existing tools to be ported without too much effort.

2 Our solution

We have designed a generic interface called perfmon2. We first implemented the interface for Linux on IPF. We also have ports to other architectures, such as IA-32, X86-64 and PPC64.

2.1 A new system call

The keystone of the interface consists of a new system call defined as follows:

int perfmonctl(int   fildes,
               int   cmd,
               void *cmd_arg,
               int   cmd_num_arg);

Command name          Description
PFM_CREATE_CONTEXT    create a perfmon context
PFM_WRITE_PMCS        program PMC registers
PFM_WRITE_PMDS        program PMD registers
PFM_READ_PMDS         read PMD register values
PFM_START             activate monitoring
PFM_STOP              stop monitoring
PFM_LOAD_CONTEXT      attach perfmon context
PFM_UNLOAD_CONTEXT    detach perfmon context
PFM_RESTART           resume monitoring after notification
PFM_CREATE_EVTSETS    create or modify event sets
PFM_DELETE_EVTSETS    delete event sets
PFM_GETINFO_EVTSETS   get information about event sets

The system call was preferred over the device driver model because it is built in, by construction, and offers better flexibility for the type and number of parameters of the call compared to using ioctl() with the driver model. A system call makes it easier for implementations to support the per-thread monitoring mode, which requires access to the thread context switch code to save and restore the PMU machine state. The system call resembles ioctl(), with a file descriptor (fildes), a command (cmd) to apply, and the argument (cmd_arg) to the command. The difference, though, is that cmd_arg can be a vector of arguments to which the command must be applied. The number of elements in the vector is indicated by the cmd_num_arg parameter. The set of commands defined by the interface is shown in the table above.

The entire PMU machine state is encapsulated by a software abstraction called the perfmon context. The context, by itself, is never directly exposed to applications. However, each context is uniquely identified and manipulated with a file descriptor obtained when the context is created using the PFM_CREATE_CONTEXT command.

Figure 1: Itanium® 2 PMU register mappings

The PMU hardware interface is abstracted by considering that it is composed of two sets of registers: a set of control (PMC) and a set of data (PMD) registers. The actual PMU registers are not exposed to applications. To simplify applications further, the interface exports all PMU registers as 64-bit wide. In particular, counters are always manipulated as 64-bit registers even though many PMU implementations provide far fewer bits. For example, Itanium® implements 32 bits while Itanium® 2 implements 47 bits. Not having to worry about the width of counters is important for sampling.
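One common way to present a narrow hardware counter as a 64-bit value is to keep the overflowed part in software and add 2^w each time the w-bit counter wraps. The following is a minimal sketch of that idea only; the structure and function names are hypothetical and are not taken from the perfmon2 implementation.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical software view of one counter; not a perfmon2 structure. */
struct virt_counter {
    unsigned width;      /* hardware counter width in bits, e.g. 32 or 47 */
    uint64_t soft_high;  /* events accounted for by past overflows */
};

/* To be called once per hardware overflow: the counter wrapped,
 * so 2^width events move into the software-maintained part. */
void virt_counter_overflow(struct virt_counter *c)
{
    assert(c->width > 0 && c->width < 64);
    c->soft_high += UINT64_C(1) << c->width;
}

/* Combine the software part with the current hardware value. */
uint64_t virt_counter_read(const struct virt_counter *c, uint64_t hw_value)
{
    uint64_t mask = (UINT64_C(1) << c->width) - 1;
    return c->soft_high + (hw_value & mask);
}
```

With a 47-bit counter, one overflow followed by a hardware reading of 5 yields 2^47 + 5, so a tool reading the 64-bit value never sees the wrap.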

The interface does not know what each register does, how many there are, or how they are associated with each other. Each implementation maps the logical PMC and PMD registers onto the actual PMU registers. In the case of IPF, the mapping is relatively simple: for instance, PMC4 represents the actual PMC4 register. The debug registers, IBR0-IBR7 and DBR0-DBR7, are used by the range restriction feature of the Itanium® 2 PMU. They are manipulated as PMC registers and have dedicated mappings as shown in figure 1.

The PMC registers can be written with the PFM_WRITE_PMCS command. The PMD registers can be read and written with the PFM_READ_PMDS and PFM_WRITE_PMDS commands respectively. All of these commands accept vector arguments, which makes it possible to program multiple PMC or PMD registers in a single system call. Registers are manipulated using a (number, value) pair representation.
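The vector-of-pairs idea can be illustrated without the real kernel structures, which this overview does not show. In the sketch below, struct reg_pair, NUM_LOGICAL_REGS and write_regs() are hypothetical stand-ins that emulate a register file in user memory; the choice to validate the whole vector before writing anything is also an assumption, not a documented property of the interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NUM_LOGICAL_REGS 16  /* assumed size of the emulated register file */

/* Hypothetical (number, value) pair; the real argument layout differs. */
struct reg_pair {
    unsigned num;    /* logical register number */
    uint64_t value;  /* value to program */
};

/* Apply a vector of pairs in one call, as a vector command would.
 * Returns 0 on success, -1 if any register number is out of range. */
int write_regs(uint64_t regs[NUM_LOGICAL_REGS],
               const struct reg_pair *vec, int count)
{
    assert(regs != NULL && vec != NULL);
    /* Validate first so that a bad entry leaves all registers untouched. */
    for (int i = 0; i < count; i++)
        if (vec[i].num >= NUM_LOGICAL_REGS)
            return -1;
    for (int i = 0; i < count; i++)
        regs[vec[i].num] = vec[i].value;
    return 0;
}
```

Passing two pairs programs two registers with a single call, which is the saving the vector argument is meant to provide.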

To further isolate the interface from PMU-specific knowledge, all event-specific information is relegated to the user level, where it can fairly easily be encapsulated in a library.

Whenever possible, the interface exploits the existing infrastructure of the Linux kernel. In particular we take full advantage of the virtual file system layer. A context is destroyed by simply invoking the close() system call. Similarly, a context can be manipulated by all threads with access to the file descriptor. File descriptor sharing on fork() is used to access a context in a child process, when needed.

2.2 Per-thread monitoring

A context can be attached to a thread using the PFM_LOAD_CONTEXT command. A thread can monitor another thread or itself, i.e., self-monitoring. For multi-threaded processes, it is necessary to create one context per thread. It is possible to attach to and detach from a thread that is already running.

2.3 System-wide monitoring

Figure 2: how to measure across multiple processor cores?

System-wide monitoring is set up using the exact same series of calls as for a per-thread session. The type of the context is indicated when the context is created.

The interface uses a CPU-wide approach to implement system-wide monitoring. A context can only monitor all the threads running on a designated processor. Coverage for a multi-processor system is achieved by creating one context on each processor as shown in figure 2. A context is bound to a processor during the call to PFM_LOAD_CONTEXT. The processor used by the calling thread is considered the processor to monitor. The model is simple in that it does not try to invent yet another mechanism to set processor affinity. In order to guarantee that the right processor is monitored, the thread must be pinned using the sched_setaffinity() system call.
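Pinning the calling thread before loading a system-wide context can be done with the standard Linux affinity calls. The sketch below uses only sched_getaffinity() and sched_setaffinity() from glibc; the helper names first_allowed_cpu() and pin_to_cpu() are ours, and the perfmon calls that would follow are deliberately left out.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Return the lowest-numbered CPU the calling thread may run on, or -1. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}

/* Pin the calling thread to one CPU.
 * Returns 0 on success, -1 on error with errno set. */
int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);  /* pid 0 = caller */
}
```

A tool covering a multi-processor machine would repeat this in one thread per processor, each thread pinning itself and then attaching its own context.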

The approach simplifies the kernel implementation because there is no need to propagate the PMU settings across all processors. The interface scales much better to large, NUMA-style, configurations because of improved code and data locality. This is especially important when sampling because the overhead of writing the samples quickly becomes a problem at high sampling rates. In our model, there is one sampling buffer per processor.

2.4 Support for sampling

The interface currently supports Time-Based Sampling (TBS) at the user level. In this mode, the sampling period is determined by a timeout. When the timeout expires, a sample is recorded. This mode can easily be implemented at the user level using existing interfaces such as setitimer().

The interface also supports Event-Based Sampling (EBS) where the sampling period is expressed as a number of occurrences of an event rather than by time. This mode requires that the PMU be capable of generating an interrupt when a counter overflows, i.e., wraps from $2^w - 1$ back to 0, where $w$ is the bit width of the counter. All modern PMUs provide this capability. Hence a sampling period $p$ is programmed by initializing the counter to $2^w - p$. After $p$ events have been observed, the counter overflows and the PMU interrupts, indicating that it is time to record a sample. Time-based sampling can fairly easily be emulated in this mode by using an event with a strong correlation with time, such as the number of elapsed cycles, which many PMUs offer.
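The relation between the sampling period and the value loaded into the counter can be checked with a small emulation. This is a sketch under the assumption of a plain w-bit up-counter that wraps to zero; the function names are illustrative and do not belong to the interface.

```c
#include <assert.h>
#include <stdint.h>

/* Value to load into a w-bit counter so that it overflows after p events. */
uint64_t ebs_initial_value(unsigned w, uint64_t p)
{
    assert(w > 0 && w < 64);
    assert(p > 0 && p <= (UINT64_C(1) << w));
    return (UINT64_C(1) << w) - p;
}

/* Emulate a w-bit counter starting at 'start' and return how many
 * events it takes before the counter wraps back to 0. */
uint64_t events_until_overflow(unsigned w, uint64_t start)
{
    uint64_t mask = (UINT64_C(1) << w) - 1;
    uint64_t value = start & mask;
    uint64_t events = 0;
    do {
        value = (value + 1) & mask;  /* one event increments the counter */
        events++;
    } while (value != 0);
    return events;
}
```

For a 47-bit counter and a period of 100000 events, the counter is loaded with 2^47 - 100000 and wraps after exactly 100000 increments.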

The interface offers as many sampling periods as there are counters with overflow capability. Each period can be randomized to avoid biased samples. Bias is fairly frequent when sampling on events that occur often, such as branches: certain branches may never be captured even though they are executed very frequently. This happens because the sampling period falls into lockstep with the execution of the monitored program. Because the PMU is designed for statistical sampling, randomization can be used to improve the accuracy of the samples.
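This overview does not say how the period is randomized. One simple scheme, assumed here purely for illustration, adds a masked pseudo-random offset to a base period each time the counter is reloaded. The xorshift64 generator and the next_period() helper are our own choices, not part of perfmon2.

```c
#include <assert.h>
#include <stdint.h>

/* One step of a 64-bit xorshift generator; the state must be non-zero. */
uint64_t xorshift64(uint64_t *state)
{
    uint64_t x = *state;
    assert(x != 0);
    x ^= x << 13;
    x ^= x >> 7;
    x ^= x << 17;
    *state = x;
    return x;
}

/* Next sampling period, drawn from [base, base + mask].
 * 'mask' is expected to be of the form 2^k - 1. */
uint64_t next_period(uint64_t base, uint64_t mask, uint64_t *state)
{
    return base + (xorshift64(state) & mask);
}
```

Varying the period by a few hundred events around a base of 100000 is usually enough to break lockstep with a loop whose iteration count divides the base period.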

When a counter overflows, the interface can notify the monitoring thread via a message. Such notification can be selected on a per-counter basis. Here again, the existing kernel infrastructure is leveraged. Each message is retrieved from the message queue with a simple read() system call. Similarly, asynchronous notification with a signal (SIGIO) can be requested by setting the O_ASYNC flag on the descriptor. Furthermore, a single thread may wait for notifications coming from multiple contexts using the select() or poll() system call.
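Waiting on several contexts from one thread follows the usual poll() pattern for any set of file descriptors. The sketch below uses only POSIX poll() and read(); the helper names and the MAX_CONTEXTS limit are hypothetical, and the layout of the notification message is not described here, so the buffer is treated as opaque bytes.

```c
#include <assert.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

#define MAX_CONTEXTS 16  /* assumed upper bound for this sketch */

/* Block until one of the descriptors is readable.
 * Returns its index in fds, or -1 on timeout or error. */
int wait_for_notification(const int *fds, int nfds, int timeout_ms)
{
    struct pollfd pfds[MAX_CONTEXTS];
    assert(fds != NULL && nfds > 0 && nfds <= MAX_CONTEXTS);
    for (int i = 0; i < nfds; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;
        pfds[i].revents = 0;
    }
    if (poll(pfds, (nfds_t)nfds, timeout_ms) <= 0)
        return -1;
    for (int i = 0; i < nfds; i++)
        if (pfds[i].revents & POLLIN)
            return i;
    return -1;
}

/* Retrieve one pending message from a descriptor with read(). */
ssize_t read_notification(int fd, void *buf, size_t len)
{
    return read(fd, buf, len);
}
```

Any readable descriptor can stand in for a context descriptor when trying this out, for example the read end of a pipe.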

2.4.1 Kernel level sampling buffer
Figure 3: re-mapping of the kernel sampling buffer

When sampling, it is important to minimize the overhead as it impacts the sampling rate and may affect the accuracy of the profiles. Although it is possible to sample completely from user level, the interface also provides a kernel level sampling buffer which is allocated when the context is created. The idea is to amortize the cost of notification by sending the message only when many samples are available, i.e., the buffer acts as a cache. Furthermore, to minimize the cost of extracting the buffer from the kernel with potentially large memory copies, the buffer is re-mapped into the user address space of the monitoring process using the mmap() system call as shown in figure 3.

2.4.2 Custom sampling buffer formats
Figure 4: custom sampling buffer format architecture

The problem with having a kernel level sampling buffer is that there exist many ways of storing the samples into the buffer. Some tools, such as VTUNE, want to keep all samples in sequential order. Others, like OProfile, aggregate them. Still others may want to record non-PMU information such as the amount of free memory.

We realized early on that it was not possible to come up with a universal buffer format. Instead, the interface provides a mechanism for users to plug in their own buffer formats via kernel modules (DLKM). The perfmon core does not know anything about how samples are recorded. Figure 4 depicts the architecture of the formats.

Each module provides a set of callbacks which the perfmon core invokes on certain conditions. For instance, each format provides a handler function which is called when a counter overflows. Each format is identified by a 128-bit universally unique identifier (UUID) which is passed to the kernel when a context is created. Each sampling format controls:

  • how the buffer is allocated. Formats are not required to use the perfmon core to allocate.
  • how the buffer is exported to the user. Formats may export the buffer using a private interface, such as a device driver interface for instance.
  • how samples are recorded in the buffer
  • what information is recorded. Formats may store PMU registers but also other kernel information.
  • when an overflow notification is sent to the monitoring tool.

A default format which stores samples in sequential order in a linear buffer is provided. We have successfully developed other formats such as one using n-way buffering, where the buffer is split into n sections. When one section fills up the application is notified but sampling continues in the other sections. This technique is used to minimize blind spots. We have also developed a format which samples the kernel level call stack, i.e., the full call path. Similarly, the OProfile team was able, in two days, to hook up their sample recording code to our interface. They were able to reuse all of their kernel level code. The buffer is exported through their own device driver interface.
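The callback idea behind a sampling format can be sketched in user-space C. Everything below is hypothetical: the struct sampling_format table, the sample layout, and the four-entry buffer are stand-ins chosen for illustration and do not reproduce the actual kernel module interface. The handler mimics a sequential linear format that asks for a notification only once the buffer is full.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define BUF_ENTRIES 4  /* tiny buffer so the example fills up quickly */

/* Hypothetical sample layout: where the overflow hit and one PMD value. */
struct sample {
    uint64_t ip;
    uint64_t pmd_value;
};

struct linear_buf {
    struct sample entries[BUF_ENTRIES];
    size_t count;
};

/* Hypothetical callback table a format would hand to the core. */
struct sampling_format {
    /* Called on counter overflow; returns 1 if the tool must be notified. */
    int (*on_overflow)(void *buf, uint64_t ip, uint64_t pmd_value);
};

/* Sequential recording: append the sample, notify when the buffer is full. */
int linear_on_overflow(void *buf, uint64_t ip, uint64_t pmd_value)
{
    struct linear_buf *b = buf;
    assert(b != NULL);
    if (b->count < BUF_ENTRIES) {
        b->entries[b->count].ip = ip;
        b->entries[b->count].pmd_value = pmd_value;
        b->count++;
    }
    return b->count == BUF_ENTRIES;
}

const struct sampling_format linear_format = { linear_on_overflow };
```

Because the core only sees the callback table, an aggregating or n-way format could replace linear_on_overflow() without any change to the caller.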

The custom sampling format mechanism constitutes an innovation and key strength of the interface because it opens up a lot of possibilities for sampling tools. It lets developers focus on the added-value of the tools instead of the intricacies of how to access the PMU or how to export a sampling buffer. For instance, the kernel call stack sampling format required about 300 lines of C code.

2.5 Event sets and multiplexing

The amount of hardware resources dedicated to the PMU is limited because this is not a critical part of the processor. That explains why the number of counters may be limited and why there may be some constraints on how events can be combined. For instance, it is frequent to have constraints such as event A cannot be measured with event B.

Because of such limitations, certain measurements may require multiple runs. For instance, on Itanium® 2, a detailed cycle breakdown requires about a dozen events, yet there are only four counters. Restarting the workload may not be convenient and may introduce fluctuations in the results making the interpretation difficult.

To overcome these difficulties, the interface exports the abstraction of an event set. The idea is to split events into sets of no more than m events if the PMU has m counters. Counters are then multiplexed between the various sets. Simple scaling can be used to create the illusion that all counters were active during the entire monitoring session.
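The scaling step is a simple proportion: a count observed while a set was active is multiplied by the ratio of total session time to the time that set was active. The sketch below assumes both times are expressed in the same unit; the function name and the rounding to the nearest integer are our own choices.

```c
#include <assert.h>
#include <stdint.h>

/* Estimate the count an event would have reached over the whole session,
 * given the count observed while its set was active. */
uint64_t scale_count(uint64_t raw_count, uint64_t active_time,
                     uint64_t total_time)
{
    assert(active_time > 0 && active_time <= total_time);
    /* Floating point avoids overflow in raw_count * total_time. */
    long double scaled = (long double)raw_count
                       * (long double)total_time
                       / (long double)active_time;
    return (uint64_t)(scaled + 0.5L);  /* round to nearest */
}
```

A set that was active for a quarter of the session and counted 1000 events is reported as roughly 4000 events. The estimate is only as good as the assumption that the event rate was similar while the set was inactive.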

Implementing multiplexing at the kernel level instead of the user level provides a significant performance boost. This is especially visible for per-thread contexts because the switching occurs in the context of the monitored thread instead of by context-switching back and forth to the monitoring tool. This is very important because if the overhead is small, the multiplexing frequency can be increased, which minimizes blind spots and therefore increases the accuracy of the scaling.

Each set includes the full PMU state and is uniquely identified by a number. Sets can be created and deleted on demand. They are stored in an ordered list based on their identification number. Switching follows the list in a round-robin fashion. Each set can choose from two modes of switching:

  • time-based: a timeout is specified per set; when the timeout expires, the next set is activated.
  • overflow-based: switching occurs when selected counters, called triggers, overflow. There can be as many triggers as there are counters. Each trigger has an overflow threshold which determines after how many overflows switching occurs. This mechanism avoids dedicating a counter as a trigger and therefore maximizes resource utilization. This is another innovation of the interface.

Overflow-based switching can be used to implement counter cascading, where certain events are counted only after a certain threshold is reached. The threshold is expressed as a number of occurrences of an event.
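Round-robin switching driven by an overflow threshold can be sketched as follows. This is a simplified model with a single trigger per set, whereas the interface allows as many triggers as there are counters; struct event_set and on_trigger_overflow() are hypothetical names, and the sets are assumed to be stored in an array already ordered by identification number.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified event set with one trigger counter. */
struct event_set {
    unsigned id;         /* identification number; array is sorted by id */
    unsigned threshold;  /* switch after this many trigger overflows */
    unsigned overflows;  /* overflows seen since the set became active */
};

/* Handle one overflow of the active set's trigger.
 * Returns the index of the set that is active afterwards. */
size_t on_trigger_overflow(struct event_set *sets, size_t nsets,
                           size_t active)
{
    assert(sets != NULL && nsets > 0 && active < nsets);
    struct event_set *s = &sets[active];
    if (++s->overflows < s->threshold)
        return active;             /* threshold not reached: stay */
    s->overflows = 0;              /* reset for the next activation */
    return (active + 1) % nsets;   /* round-robin to the next set */
}
```

With thresholds of 2 and 1 on two sets, the first set stays active for two overflows, the second for one, and the cycle then repeats.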

2.6 Security

The interface is designed to be implemented in the Linux kernel; therefore, it follows the same security guidelines. In particular, it is not admissible to leak kernel information to users, or to access information about other processes unless authorized. The interface survives malicious usage, such as denial of service attacks. For instance, an application cannot allocate an arbitrarily large amount of kernel memory for the sampling buffer. Similarly, the creation of a system-wide context may be restricted to a certain group of trusted users as this can potentially be used to extract information from other processes.

2.7 Current status

The interface has been implemented in the 2.6 kernel series for Linux/ia64. Earlier Linux/ia64 kernels in the 2.4 series implement the first generation of the interface, which had strong limitations. The current interface is not backward compatible with the first generation; some small porting effort is needed to adapt existing applications.

The full specification of the interface is available as an HP Labs technical report.

Several open-source and commercial tools are already available for Linux/ia64 and the perfmon2 interface. Among them is the HP Caliper tool.

A new implementation of the interface is now available as a standalone kernel patch. It follows the specification more closely. It also includes support for the X86-64 and P6 processor families as well as preliminary support for PPC64.



© 2009 Hewlett-Packard Development Company, L.P.