Basic sampling
- Principles
- Sampling output formats
- Sampling examples
- Sampling in system wide sessions
- Randomization of sampling periods
- Blocking on overflow notifications
1. Pfmon command line options
Pfmon supports the following list of options on any host system:
| -h, --help | display the list of supported options |
| -V, --version | display pfmon version information |
| -l[regex] | show list of supported events by host PMU |
| -i | get information about a particular event |
| -u | monitor at the user level for all events |
| -k | monitor at the kernel level for all events |
| -1 | monitor at the privilege level 1 for all events |
| -2 | monitor at the privilege level 2 for all events |
| -e ev1,ev2,... | select events to monitor |
| -I | list the supported PMU models |
| --verbose | print more information during execution |
| --outfile=filename | print counts in a file |
| --append | open outfile in append mode |
| --overflow-block | block when sample buffer is full |
| --system-wide | create a system wide monitoring session |
| --cpu-mask=mask | where to start system wide session |
| -S format | info about a sampling output format |
| -t secs | duration of the session in seconds |
| --smpl-outfile=filename | save sampling results in a file |
| --smpl-entries=n | entries in the sampling buffer |
| --long-smpl-periods=val1,... | event long sampling periods |
| --short-smpl-periods=val1,... | event short sampling periods |
| --with-header | put a description header with results |
| --aggregate-results | aggregate counts and sampling buffer outputs |
| --trigger-start-address=addr | start monitoring when execution reaches addr |
| --priv-levels=lvl1,... | set privilege level per event |
| --show-time | show real, user, system time for the executed command |
| --us-counter-format | print counters using commas (1,024) |
| --eu-counter-format | print counters using points (1.024) |
| --hex-counter-format | print counters in hexadecimal (0x400) |
| --smpl-output-format=fmt | select fmt as sampling output format |
| --symbol-file=filename | use the ELF archive filename to look for symbols |
| --sysmap-file=filename | use the System.map filename for kernel symbols |
| --smpl-periods-random=mask1:seed1,... | randomize sampling periods per event |
| --trigger-start-delay=secs | number of seconds before monitoring starts |
| --smpl-print-counts | print counter results when sampling |
| --exclude-idle | exclude idle task from system wide session |
2. Getting event information
The list of events supported by pfmon depends on the host PMU. You can get the list
of supported events using the following pfmon option:
% pfmon -l
CPU_CYCLES
IA64_INST_RETIRED
IA64_TAGGED_INST_RETIRED_PMC8
IA64_TAGGED_INST_RETIRED_PMC9
INST_DISPERSED
EXPL_STOPBITS
ALL_STOPS_DISPERSED
IA32_INST_RETIRED
ISA_TRANSITIONS
NOPS_RETIRED
....
If you specify an argument to the -l option (no space between l and the
argument), it is interpreted as a regular expression and all matching events
will be listed:
% pfmon -ll1d
L1D_READ_FORCED_MISSES_RETIRED
L1D_READ_MISSES_RETIRED
L1D_READS_RETIRED
PIPELINE_FLUSH_L1D_WAYMP_FLUSH
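The matching behaviour shown above can be sketched in a few lines of Python. This is only an illustration, not pfmon's implementation: the function name match_events and the EVENTS list are made up for the example (the names themselves are copied from the listings in this section), and Python's regular expression syntax is assumed to be close enough to the one pfmon uses for simple patterns.

```python
import re

def match_events(pattern, event_names):
    """Return the event names matched by pattern, ignoring case."""
    rx = re.compile(pattern, re.IGNORECASE)
    # search() matches anywhere in the name, so 'l1d' also selects
    # PIPELINE_FLUSH_L1D_WAYMP_FLUSH, as in the listing above.
    return [name for name in event_names if rx.search(name)]

# A few event names copied from the listings in this section.
EVENTS = [
    "CPU_CYCLES",
    "NOPS_RETIRED",
    "L1D_READ_FORCED_MISSES_RETIRED",
    "L1D_READ_MISSES_RETIRED",
    "L1D_READS_RETIRED",
    "PIPELINE_FLUSH_L1D_WAYMP_FLUSH",
    "L2_DATA_REFERENCES_WRITES",
]

print(match_events("l1d", EVENTS))
print(match_events("writes$", EVENTS))
```

The pattern is not anchored unless you anchor it yourself, which is why 'writes$' is needed to select only the names ending in WRITES.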
You can get more detailed information about each event using the following option:
% pfmon -i nops_retired
Name : NOPS_RETIRED
VCode : 0x30
Code : 0x30
PMD/PMC: [ 4 5 ]
EAR : No (N/A)
Umask : None
BTB : No
Thres : 6
Qual : [Instruction Address Range] [OpCode match]
Pfmon is case insensitive for event names. Here you see some details about the event.
The first 4 lines are generic and provided on all PMU models even though the codes may
vary:
- Code is the event code used by the PMU.
- VCode is a libpfm internal event code which encapsulates the event code and other
information describing the type of the event. For simple events, the two codes are
usually identical.
- PMD/PMC lists the counting monitors on which this event can be programmed. Not
all events can necessarily be programmed on all available counting
monitors. This constraint is taken care of by the libpfm library.
Here the remaining information is specific to the Itanium 2 PMU.
Even with the -i option, you can use a regular expression for the event:
% pfmon -i'writes$'
Name : L2_DATA_REFERENCES_WRITES
VCode : 0x20069
Code : 0x69
PMD/PMC: [ 4 5 6 7 ]
Umask : 0010
EAR : No (N/A)
BTB : No
MaxIncr: 2 (Threshold [0-1])
Qual : [Instruction Address Range] [OpCode Match] [Data Address Range]
On some PMU models (currently Itanium 2), the event information contains a
text description of the event.
Events can be specified using their code:
% pfmon -i 0x45
Name : L2_INST_PREFETCHES
VCode : 0x45
Code : 0x45
PMD/PMC: [ 4 5 6 7 ]
Umask : 0000
EAR : No (N/A)
BTB : No
MaxIncr: 1 (Threshold 0)
Qual : [Instruction Address Range]
Group : None
Set : None
Desc : L2 Instruction Prefetch Requests
Information about what each event measures can be found in the relevant CPU model specific
micro-architecture documentation. The architecture requires only two events to be defined by all PMUs:
- CPU_CYCLES : the number of elapsed CPU cycles.
- IA64_INST_RETIRED : the number of instructions retired.
Those two events are guaranteed to exist on all PMUs but their codes may vary. The PMU specific
event names may not be exactly the same; however, pfmon and especially the library it uses
(libpfm) ensure that those two events can always be called by the two names listed
above. As alluded to earlier, pfmon can support more than one PMU in a single binary. Pfmon
also incorporates a generic PMU model which provides only the features defined by the
architecture, including the two events. If pfmon does not have specific support for the
host PMU it will default to the so-called 'Generic' PMU support, if compiled in. You can find
out what PMU support is compiled into pfmon as follows:
% pfmon -I
detected host CPUs: 4-way 800MHz Itanium (Merced, C0)
supported PMU models: [itanium2] [itanium] [generic]
detected host PMU: itanium
supported sampling outputs: [detailed-itanium] [raw] [compact] [btb] [example]
pfmlib version: 2.0
kernel perfmon version: 1.0
It is possible to force pfmon to operate in generic mode even though it has support for the
host CPU using the pfmon_gen command:
% pfmon_gen -I
forced libpfm to generic support
detected host CPUs: 4-way 800MHz Itanium (Merced, C0)
supported PMU models: [itanium2] [itanium] [generic]
detected host PMU: generic
supported sampling outputs: [raw] [compact] [example]
pfmlib version: 2.0
kernel perfmon version: 1.0
% pfmon_gen -i CPU_CYCLES
forced libpfm to generic support
Name : CPU_CYCLES
VCode : 0x12
Code : 0x12
PMD/PMC: [ 4 5 6 7 ]
pfmon_gen is not a separate command but just a symlink to pfmon. In fact, pfmon always
checks the name it was invoked with. If this name is equal to 'pfmon_gen' and the generic
support is compiled in, then pfmon will operate in generic mode. This feature is useful when
moving pfmon to a PMU for which neither pfmon itself nor libpfm has support yet.
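The invocation-name check described above is a common technique. A minimal Python sketch of it follows; it is not pfmon's code, and the function select_pmu_mode and its parameters are hypothetical names chosen for the illustration.

```python
import os

def select_pmu_mode(argv0, host_pmu, generic_compiled_in=True):
    """Pick the PMU model from the name the program was invoked with."""
    # Only the last path component matters: a symlink named pfmon_gen
    # may live in any directory.
    name = os.path.basename(argv0)
    if name == "pfmon_gen" and generic_compiled_in:
        return "generic"
    # Otherwise keep the detected host PMU.
    return host_pmu

print(select_pmu_mode("/usr/bin/pfmon_gen", "itanium"))
print(select_pmu_mode("/usr/bin/pfmon", "itanium"))
```

In a real program, argv0 would be the first element of the argument vector (sys.argv[0] in Python).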
3. Basic counting
In generic mode, pfmon only supports the two architected events listed
above. For comparison, the Itanium PMU supports about 230 events and the
Itanium2 PMU about 470. No instrumentation of the program is required to monitor the
system or a single process.
3.1 Simple examples
To collect counts on a specific command, you just need to launch it via pfmon, just like
you would do with the time or strace command:
% pfmon ls -ial /dev/null
210135 crw-rw-rw- 1 root root 1, 3 Mar 24 2001 /dev/null
2910724 CPU_CYCLES
When invoked with no particular event, pfmon defaults to CPU_CYCLES. To monitor specific events,
you can type:
% pfmon -e cpu_cycles,IA64_inst_Retired -- ls -ial /dev/null
210135 crw-rw-rw- 1 root root 1, 3 Mar 24 2001 /dev/null
2984546 CPU_CYCLES
2666884 IA64_INST_RETIRED
As you can see, pfmon is not case sensitive with regard to event names. More than one event
can be measured at a time using a comma separated list of events. You MUST NOT put a space
after the comma.
If the command you want to run takes options, you can clearly distinguish the options of
pfmon from the options of your command using the '--' symbol:
% pfmon -e ia64_inst_retired -- ls -ial /dev/null
210135 crw-rw-rw- 1 root root 1, 3 Mar 24 2001 /dev/null
2709704 IA64_INST_RETIRED
Otherwise, pfmon stops parsing arguments as options at the first
argument which does not start with - or --.
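The same parsing rule can be observed with Python's standard getopt module, which also stops at the first non-option argument and treats '--' as an explicit separator. The sketch below is only an analogy: SHORT_OPTS and LONG_OPTS are a small hypothetical subset of options chosen for the example, not pfmon's actual option table.

```python
import getopt

# Hypothetical subset of options, for illustration only.
SHORT_OPTS = "e:ku"
LONG_OPTS = ["outfile=", "system-wide"]

def split_command_line(args):
    """Split args into (tool options, command to run)."""
    # getopt.getopt() stops at the first argument that is not an option
    # (or at '--'), and returns everything after it untouched.
    opts, command = getopt.getopt(args, SHORT_OPTS, LONG_OPTS)
    return opts, command

# With '--', the -ial option clearly belongs to ls.
print(split_command_line(
    ["-e", "cpu_cycles,IA64_inst_Retired", "--", "ls", "-ial", "/dev/null"]))

# Without '--', parsing stops at 'ls' because it does not start with '-'.
print(split_command_line(["-k", "ls", "-ial", "/dev/null"]))
```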
3.2 Specifying event privilege levels
By default, pfmon monitors only what is going on at the user level
(application level). This is true for both per-process and system wide
mode.
It is possible to monitor at any of the 4 privilege levels provided by IA-64.
It is also possible to monitor at several levels at the same time by specifying
more than one level. The levels can be specified for all events or on a per-event
basis. To affect all events, you can use any combination of -k (-0), -1, -2, -u (or -3).
To set the level for each event, the --priv-levels option must be used.
By default, pfmon only measures at the user level:
% pfmon -e nops_retired ls
counts the number of NOPS_RETIRED when ls is running at the user level only
(equivalent to specifying -u or -3).
% pfmon -k -e nops_retired ls
counts the number of NOPS_RETIRED when ls is running at the kernel level only.
% pfmon -k -u -e nops_retired ls
counts the number of NOPS_RETIRED when ls is running at the kernel level
or user level, i.e. all the time.
It is possible to refine the settings on a per event basis using the
--priv-levels option.
% pfmon -e loads_retired,nops_retired ls
Both events are measured at the user level only.
% pfmon --priv-levels=u,k -e loads_retired,nops_retired ls
LOADS_RETIRED is measured at the user level only, NOPS_RETIRED at the
kernel level only.
% pfmon --priv-levels=,uk -e loads_retired,nops_retired ls
LOADS_RETIRED is measured at the user level only, NOPS_RETIRED at the
user and kernel levels.
% pfmon -k --priv-levels=uk -e loads_retired,nops_retired ls
LOADS_RETIRED is measured at the user and kernel levels, NOPS_RETIRED at the
kernel level only.
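The way the global level and the per-event list combine in the examples above can be summarized with a short Python sketch. It assumes, based only on those examples, that an empty or missing entry in the per-event list falls back to the global setting; resolve_priv_levels is a hypothetical helper, not part of pfmon.

```python
def resolve_priv_levels(events, default_levels="u", per_event=None):
    """Return {event: set of levels} for a comma separated per-event list.

    default_levels stands for the global setting (user level unless
    changed with -k, -u, ...). An empty or missing per-event entry is
    assumed to fall back to that global setting.
    """
    specs = per_event.split(",") if per_event else []
    resolved = {}
    for index, event in enumerate(events):
        spec = specs[index] if index < len(specs) else ""
        resolved[event] = set(spec or default_levels)
    return resolved

events = ["loads_retired", "nops_retired"]
print(resolve_priv_levels(events, "u", "u,k"))   # user / kernel
print(resolve_priv_levels(events, "u", ",uk"))   # user / user+kernel
print(resolve_priv_levels(events, "k", "uk"))    # user+kernel / kernel
```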
3.3 Specifying counter output formats
Pfmon can display the final counts in various formats. There are 4 formats
defined. The default one is shown in the examples above. To make it easier
to read large numbers or to feed the numbers to other programs, pfmon
supports:
- --us-counter-format where the thousands, millions, billions are separated
with commas (US and UK style):
% pfmon --us-counter-format ls -l /dev/null
crw-rw-rw- 1 root root 1, 3 Mar 24 2001 /dev/null
2,292,689 CPU_CYCLES
- --eu-counter-format where the thousands, millions, billions are separated
with points (European style):
% pfmon --eu-counter-format ls -l /dev/null
crw-rw-rw- 1 root root 1, 3 Mar 24 2001 /dev/null
1.703.898 CPU_CYCLES
- --hex-counter-format where the counts are shown in hexadecimal format:
% pfmon --hex-counter-format ls -l /dev/null
crw-rw-rw- 1 root root 1, 3 Mar 24 2001 /dev/null
0x000000000019c164 CPU_CYCLES
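For readers who post-process counts, the three formats above are easy to reproduce. The following Python sketch is an illustration only; format_count and its style names are made up for the example, and the 16-digit zero padding of the hexadecimal form is inferred from the sample output.

```python
def format_count(value, style="plain"):
    """Format a counter value in one of the styles shown above."""
    if style == "us":
        return f"{value:,}"                    # 2,292,689
    if style == "eu":
        return f"{value:,}".replace(",", ".")  # 1.703.898
    if style == "hex":
        return f"0x{value:016x}"               # 64-bit value, zero padded
    return str(value)

for style in ("plain", "us", "eu", "hex"):
    print(style, format_count(1024, style))
```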
3.4 Saving results
By default, the counts are printed on the controlling tty. However it is
possible to save them in a file using the --outfile option:
% pfmon --outfile=b --hex-counter-format ls -l /dev/null
crw-rw-rw- 1 root root 1, 3 Mar 24 2001 /dev/null
% cat b
0x000000000016a8b1 CPU_CYCLES
It is possible to include a header with the results using the --with-header
option. It will be printed on the controlling tty or saved in the output
file. The header contains detailed information about the configuration of
the host machine and on the monitoring session:
% pfmon --with-header --outfile=b --hex-counter-format ls -l /dev/null
crw-rw-rw- 1 root root 1, 3 Mar 24 2001 /dev/null
% cat b
#
# date: Wed Nov 20 16:03:13 2002
#
# hostname: hpljumbo.hpl.hp.com
#
# kernel version: Linux 2.4.18 #2 SMP Tue Aug 6 11:54:56 PDT 2002
#
# pfmon version: 2.0
# kernel perfmon version: 1.0
#
#
#
# page size: 16384 bytes
# CLK_TCK: 1024 ticks/second
# CPU configured: 4
# CPU online: 4
# physical memory: 6827933696
# physical memory available: 5606391808
#
# host CPUs: 4-way 800MHz Itanium (Merced, C0)
# PAL_A: 6.6.23
# PAL_B: 7.7.28
# Cache levels: 3 Unique caches: 4
# L1D: 16384 bytes, line 32 bytes, load_lat 2, store_lat 0
# L1I: 16384 bytes, line 32 bytes, load_lat 2, store_lat 0
# L2 : 98304 bytes, line 64 bytes, load_lat 6, store_lat 6
# L3 : 4194304 bytes, line 64 bytes, load_lat 21, store_lat 21
#
#
# captured events:
# PMD4: CPU_CYCLES, user level(s)
#
# monitoring mode: per-process
#
#
# instruction sets:
# PMD4: CPU_CYCLES, ia32/ia64
#
#
# command: pfmon --with-header --outfile=b --hex-counter-format ls -l /dev/null
#
#
#
0x00000000001a8956 CPU_CYCLES
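Because every header line starts with '#', a saved result file is easy to read back from a script. The Python sketch below is one possible way to do it, assuming a per-process result file in the default or hexadecimal counter format (the comma and point formats are not handled); read_counts is a hypothetical helper.

```python
def read_counts(text):
    """Return {event name: count} from the text of a per-process outfile."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the '#' description header.
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        if len(fields) != 2:
            continue
        value, event = fields
        try:
            # Base 0 accepts both plain decimal and 0x-prefixed values.
            counts[event] = int(value, 0)
        except ValueError:
            continue
    return counts

SAMPLE = """#
# date: Wed Nov 20 16:03:13 2002
#
# monitoring mode: per-process
#
0x00000000001a8956 CPU_CYCLES
"""
print(read_counts(SAMPLE))
```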
3.5 Delaying monitoring
By default, pfmon will start monitoring at the first instruction of the
program, i.e., the entry point when the privilege level is limited to user
level. Even when kernel level monitoring is enabled nothing will be measured
until the process leaves the kernel for the first time, after fork.
Sometimes, it may be useful to delay the activation of monitoring until
a certain point in the execution. This is the case when the
initialization must not be included in the counts. Pfmon provides two different
ways to delay the point at which monitoring is turned on with the
--trigger-start-address and --trigger-start-delay options.
The --trigger-start-address option only applies to per-process sessions and is ignored for
system-wide. It uses a code address to trigger monitoring. Once execution reaches
the bundle address specified with the option, the monitoring will be turned on and
will remain on until the program terminates. The address can be specified in hexadecimal or
a code symbol name can be provided. It is not possible to specify a kernel address, pfmon
will reject any such address. When an address is explicitly used, pfmon will not try to
validate it except by checking it is not in the kernel. The delayed start mechanism will
be used only the first time the address is reached.
For instance, if the address of main() is 0x40000000000004a0, then we
can delay monitoring until main() is reached using:
% pfmon --trigger-start-address=0x40000000000004a0 -e loads_retired foo
74 LOADS_RETIRED
or using the symbol table:
% pfmon --trigger-start-address=main -e loads_retired foo
74 LOADS_RETIRED
IMPORTANT: Note that pfmon can ONLY lookup symbols in the
"main" program and NOT in any dynamically linked libraries. To allow complete coverage, the program
MUST be linked statically.
In contrast, the same program executed without the trigger address gives:
% pfmon -e loads_retired foo
1598 LOADS_RETIRED
This example shows that the code executed before main(), such as the libc initialization, accounts for 1598-74=1524 loads all by itself.
The --trigger-start-delay option uses time to delay monitoring. You simply specify a delay
in seconds. When the delay expires, monitoring will be turned on. This option works
for both per-process and system-wide monitoring. If the monitored process terminates before
the delay expires, then nothing gets measured. This applies to both per-process and
system wide sessions using a process to delimit the session. Note that the session effectively
starts when monitoring is turned on. Hence, the --session-timeout is only armed when monitoring
is turned on.
The following example will start monitoring 5 seconds into the execution of foo:
% pfmon --trigger-start-delay=5 -e loads_retired foo
The following example will start monitoring 5 seconds into the execution of foo and
monitor for 10 seconds after that point:
% pfmon --trigger-start-delay=5 --session-timeout=10 -e loads_retired foo
3.6 Getting timing information
It is possible to get a time breakdown of the execution of the monitored command for
both per-process and system-wide mode using the --show-time option. The output is similar
to the time(1) command. For instance:
% pfmon --show-time -e nops_retired ls /dev/null
/dev/null
real 0h00m00.098s user 0h00m00.000s sys 0h00m00.095s
247913 NOPS_RETIRED
3.7 Testing event combinations
Sometimes it is handy to check if some events can be measured simultaneously without actually
starting the monitoring session. The --check-events-only option of pfmon allows this mode of
operation. It will check that the combination is valid and then exit. If the combination is
invalid, it will print out the reason and return with an exit value of 1; otherwise the exit
value is 0. On Itanium 2, for instance, you can try:
% pfmon --check-events-only -e loads_retired,stores_retired
event loads_retired and stores_retired cannot be measured at the same time
% echo $?
1
Note that in this mode, you do not need to specify a command to execute.
4. System wide sessions
When the --system-wide flag is used, pfmon operates in system wide mode. This means that
it does not monitor a specific program anymore but instead all the processes that execute
on a specific set of CPUs. In this mode, you do not need to specify a command. You do not
need to be root to create a system wide session.
A system wide session cannot co-exist with any per-process sessions. But a system wide session
can run concurrently with other system wide sessions as long as they do not monitor the same
set of CPUs. Of course multiple per-process sessions are possible.
4.1 Selecting CPUs to monitor
The --cpu-mask option can be used to restrict monitoring to a specific set of CPUs. When this
option is not present, pfmon will automatically launch a system wide session on all available
CPUs as reported by /proc/cpuinfo.
So if the system has 2 available CPUs:
% pfmon --system-wide -u -e cpu_cycles,ia64_inst_retired
<Press ENTER to stop session>
CPU0 248793 CPU_CYCLES
CPU0 60710 IA64_INST_RETIRED
CPU1 26690 CPU_CYCLES
CPU1 7706 IA64_INST_RETIRED
A system wide session can monitor at any privilege level (kernel, user, or both).
If you want to restrict monitoring to a specific CPU, you can use the --cpu-mask option:
% pfmon --system-wide --cpu-mask=0x2 -u -e cpu_cycles,ia64_inst_retired
<Press ENTER to stop session>
CPU1 17841 CPU_CYCLES
CPU1 7577 IA64_INST_RETIRED
The CPU mask is a bitmask where each bit represents a CPU. CPUs are numbered starting at 0.
So bit 0 represents CPU0, bit 1 represents CPU1, and so on. Therefore the above command will only
monitor events happening on CPU1. More than one bit can be set in the mask. For instance,
with --cpu-mask=0x3, pfmon will monitor on CPU0 and CPU1 at the same time.
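If you need to compute or decode a mask, the bit arithmetic is straightforward. The Python helpers below are a small sketch; cpus_from_mask and mask_from_cpus are names invented for this example.

```python
def cpus_from_mask(mask):
    """Return the list of CPU numbers selected by a CPU bitmask."""
    # Bit n set in the mask means CPUn is monitored.
    return [bit for bit in range(mask.bit_length()) if mask & (1 << bit)]

def mask_from_cpus(cpus):
    """Build the bitmask selecting the given CPU numbers."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu
    return mask

print(cpus_from_mask(0x2))          # CPU1 only
print(cpus_from_mask(0x3))          # CPU0 and CPU1
print(hex(mask_from_cpus([0, 3])))  # value to pass to --cpu-mask
```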
4.2 Delimiting a system wide session
There are three ways to delimit a system wide session. By default, the
session will terminate when the user presses the ENTER key. It is also
possible to use a timeout expressed in seconds. Finally, the session can
also be delimited by the execution of a command. It will start when the
command starts and stop when it terminates. Here are some examples:
Monitor cpu_cycles and instructions retired on the first two CPUs at both
user and kernel levels and wait for a keypress to stop:
% pfmon --cpu-mask=0x3 --system-wide -u -k -e cpu_cycles,ia64_inst_retired
<Press ENTER to stop session>
CPU0 821818169 CPU_CYCLES
CPU0 1338893885 IA64_INST_RETIRED
CPU1 821813442 CPU_CYCLES
CPU1 1341176908 IA64_INST_RETIRED
Monitor cpu_cycles and instructions retired on the first two CPUs at both
user and kernel levels for 10 seconds:
% pfmon --session-timeout=10 --cpu-mask=0x3 --system-wide -u -k \
-e cpu_cycles,ia64_inst_retired
<Session to end in 10 seconds>
CPU0 8003156088 CPU_CYCLES
CPU0 12800683300 IA64_INST_RETIRED
CPU1 8003106584 CPU_CYCLES
CPU1 12899764561 IA64_INST_RETIRED
Monitor cpu_cycles and instructions retired on the first two CPUs at the
user level only during the execution of the ls command (here obviously run
on CPU0):
% pfmon --cpu-mask=0x3 --system-wide -u \
-e cpu_cycles,ia64_inst_retired -- ls -l /dev/null
crw-rw-rw- 1 root root 1, 3 Mar 24 2001 /dev/null
CPU0 46560 CPU_CYCLES
CPU0 26839 IA64_INST_RETIRED
CPU1 7514 CPU_CYCLES
CPU1 1184 IA64_INST_RETIRED
4.3 Results aggregation
It is possible to aggregate counts when monitoring more than one CPU:
% pfmon --aggregate-results --system-wide -k -e cpu_cycles,ia64_inst_retired
<Press ENTER to stop session>
852331455 CPU_CYCLES
1387206797 IA64_INST_RETIRED
In this case, the per-CPU results are summed. Pfmon does not allow different
events to be monitored on different CPUs. For this you can run separate instances of pfmon
with a different CPU mask, using a command line similar to:
% pfmon --session-timeout=10 --cpu-mask=0x1 --system-wide -k -e cpu_cycles &
% pfmon --session-timeout=10 --cpu-mask=0x2 --system-wide -k -e ia64_inst_retired &
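If you already have per-CPU results and want the aggregated view afterwards, the sum can be recomputed from the text output. The Python sketch below assumes the three-column "CPUn count EVENT" layout of the system wide examples in the default counter format; aggregate_counts is a hypothetical helper, and PER_CPU reuses the counts from the keypress example in section 4.2.

```python
from collections import defaultdict

def aggregate_counts(lines):
    """Sum per-CPU counts into one total per event."""
    totals = defaultdict(int)
    for line in lines:
        fields = line.split()
        # Keep only 'CPUn <count> <EVENT>' lines; prompts are ignored.
        if len(fields) != 3 or not fields[0].startswith("CPU"):
            continue
        _cpu, value, event = fields
        totals[event] += int(value)
    return dict(totals)

PER_CPU = [
    "<Press ENTER to stop session>",
    "CPU0 821818169 CPU_CYCLES",
    "CPU0 1338893885 IA64_INST_RETIRED",
    "CPU1 821813442 CPU_CYCLES",
    "CPU1 1341176908 IA64_INST_RETIRED",
]
for event, total in aggregate_counts(PER_CPU).items():
    print(total, event)
```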
4.4 Excluding idle task
Pfmon now allows the user to exclude the idle tasks from a system wide monitoring
session. This only works with a kernel that has perfmon 1.3 or higher. Pfmon
checks the kernel version and may abort in case the wrong version is detected.
Linux has one idle task per CPU. This task is run when nothing else can.
The idle task is a kernel only task with a pid of 0. The pid 0 is used for
ALL idle tasks. They do not show up in ps or top.
When running a system wide session, it may be useful to stop monitoring
when the idle task is running; this way we monitor only the USEFUL execution.
Of course, excluding the idle task only matters when monitoring is active
at the kernel privilege level, i.e., when using the -k or -0 option of pfmon.
When monitoring only at the user level, excluding the idle task has no effect.
Similarly, excluding the idle task for a per-process session has no effect.
For instance, here is what we get without exclusion:
% pfmon -k --session-timeout=10 --system-wide
8003084826 CPU_CYCLES
This is run on an 800MHz Itanium CPU, so 10s is 8 billion cycles.
But if we run with exclusion:
% pfmon --exclude-idle -k --session-timeout=10 --system-wide
259663 CPU_CYCLES
These are the useful cycles for the 10s period.
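The arithmetic behind these two results can be checked quickly. The Python sketch below uses only the numbers quoted above; the function names are made up for the illustration.

```python
def expected_cycles(clock_hz, seconds):
    """Cycles elapsed on one CPU running at clock_hz for seconds."""
    return clock_hz * seconds

def busy_fraction(useful_cycles, total_cycles):
    """Share of the cycles that were not spent in the idle task."""
    return useful_cycles / total_cycles

# 800MHz for 10 seconds: close to the 8003084826 cycles measured.
print(expected_cycles(800_000_000, 10))

# Useful cycles (with --exclude-idle) over all cycles (without it).
print(f"{busy_fraction(259663, 8003084826):.6%}")
```

In this run the machine was almost entirely idle: only a tiny fraction of the kernel level cycles was spent outside the idle task.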
5. Dealing with symbols
Whenever an option takes an address (code or data) as argument, it is
possible to directly use a symbol name rather than use its address.
For instance, this is true for the --trigger-start-address option. The user
has two ways to indicate where to find the symbol table. Pfmon
can extract the symbol table using an ELF image directly. This is
for instance what is done implicitly in per-process mode. Pfmon also
understands the System.map format which is typically used to save the
symbol table of the kernel.
There are a couple of restrictions concerning the symbols. Pfmon cannot
extract symbol information that is coming from dynamically linked
libraries or modules. To avoid this problem, the program must be statically
linked and should not explicitly use dlopen().
If the symbol table has been stripped, pfmon will not find any symbol.
In case the option requires a code address, pfmon will only look for matching
code symbols. Conversely, if the option requires a data address, pfmon will only
look for matching data symbols.
By default, the symbols are automatically extracted from the command being
run. This is true in per process mode but also in system wide mode when
a command is specified. In case symbols must be extracted from an
alternative ELF archive, the user must use the --symbol-file option.
The filename specified there must be an ELF/ia64 binary.
Note that the Linux/ia64 kernel is also an ELF/ia64 archive; however,
for most distributions the kernel image found in /boot/efi is often
compressed. The compression scheme used for Linux/ia64 is different
from the one used on Linux/ia32. The compressed image is simply
the ELF/ia64 image compressed with gzip. So it is possible to decompress
it to get the original ELF archive. The main caveat is that most of the
time the compressed image is stripped. Therefore the user must rely on
the corresponding System.map file usually placed in /boot/efi. In this
case, the user must explicitly specify the location of the System.map
file via the --sysmap-file option.
Here are a few examples on Itanium:
- Count the number of times main() is called in the noploop program:
% file noploop
noploop: ELF 64-bit LSB executable, IA-64, version 1, \
statically linked, not stripped
% pfmon --checkpoint-func=main -e ia64_inst_retired noploop 10000
Here the symbol information for main() is directly extracted from noploop
itself.
- Count the number of times main() is called in the noploop-s program:
% file noploop-s
noploop-s: ELF 64-bit LSB executable, IA-64, version 1, \
statically linked, stripped
% pfmon --symbol-file=noploop --checkpoint-func=main \
-e ia64_inst_retired noploop-s 1000
Here noploop and noploop-s are the same program except that the latter does not have
the symbol table anymore.
- Count the number of times sys_getpid() is called during the execution of noploop:
% pfmon -k --symbol-file=/boot/efi/vmlinux-nostrip --checkpoint-func=sys_getpid \
-e ia64_inst_retired noploop 1000
Here we assume that the kernel file vmlinux was not stripped. If the
kernel has been stripped, then we can use the System.map instead:
% pfmon -k --sysmap-file=/boot/efi/System.map --checkpoint-func=sys_getpid \
-e ia64_inst_retired noploop 1000
6. Basic sampling
Pfmon has support for sampling on any events or combination of events. Samples are collected
into a buffer which can then be written to a file or simply on the screen.
6.1 Principles
Each sample is composed of two parts: a fixed size header which contains information about
the sample and a variable size body which consists of a set of 64-bit values, each one holding
the value of a PMD register for one of the other events being monitored. All samples record the same set
of PMDs; this set is determined by pfmon based on what is being measured.
The sampling buffer is controlled by the kernel but its size is configurable. By default
pfmon uses a buffer with 2048 entries. This can be changed using the --smpl-entries option.
The sampling works as follows:
- Step 1: the user specifies which events are to be recorded in each sample.
- Step 2: the user specifies the sampling period (via an event) and optional randomization parameters.
- Step 3: at the end of a period, a sample is recorded into the sampling buffer by the kernel.
- Step 4: if the sampling buffer is not full, a new sampling period is reloaded and execution/monitoring resumes. We go back to step 3.
- Step 5: if the sampling buffer becomes full, pfmon is notified.
- Step 6: pfmon processes the buffer, i.e., prints and/or saves the buffer.
- Step 7: pfmon then notifies the kernel that it is done.
- Step 8: the kernel reloads a new sampling period and execution/monitoring resumes. We go back to step 3.
Pfmon (and the kernel) uses two sampling periods instead of just one. The first one is called
short-smpl-period and the second is called long-smpl-period. The short-smpl-period is used
in step 4, i.e., when the sampling buffer is not full after writing the sample. The
long-smpl-period is used in step 8, when the reload occurs after the buffer became full.
But why do we need 2 periods?
As you might imagine there is some overhead in recording a sample. This overhead is
increased even more when pfmon needs to get involved to drain the buffer. This operation
can take some time and will inevitably introduce some noise in the measurements in the form
of TLB and/or cache pollution. To try and hide this noise, it is sometimes beneficial to
adjust the sampling period, i.e., make it larger to ensure that the next sample will not
record an event that is the consequence of the overhead generated by the monitoring but rather
a normal event occurring in the program/system being monitored. So it is expected that
long-smpl-period >= short-smpl-period. Of course if the two are equal, this is equivalent to
having only one sampling period. Note that the long-smpl-period is only used to set the
distance to the first sample recorded after the buffer is marked as empty again (step 7).
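To make the interplay between the two periods concrete, here is a small Python simulation of the reload decision taken after each sample. It is a simplified model of the steps described above, not the kernel's code: reload_periods is a hypothetical function, the buffer is reduced to a counter of used entries, and the very first period of the session is left out because the text does not say which of the two values it uses.

```python
def reload_periods(num_samples, buffer_entries, short_period, long_period):
    """Return the period reloaded after each recorded sample."""
    reloads = []
    used = 0
    for _ in range(num_samples):
        used += 1                          # step 3: a sample is recorded
        if used < buffer_entries:
            reloads.append(short_period)   # step 4: buffer not full
        else:
            used = 0                       # steps 5-7: pfmon drains the buffer
            reloads.append(long_period)    # step 8: reload after the drain
    return reloads

# Tiny 3-entry buffer so that the pattern is visible.
print(reload_periods(7, 3, 50000, 80000))
```

With a 3-entry buffer, every third reload uses the long period; with the default 2048 entries, the long period would be used once every 2048 samples.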
6.2 Sampling output formats
There are many ways in which the samples can be saved or printed on the
screen. Pfmon has support for custom formats. Note that at this point, the
kernel sampling buffer format is fixed. Here the customization happens in
the tool. Pfmon comes with a set of output formats. Some of them can be
used with any PMU models, others are specific to the Itanium or Itanium 2
PMUs. While all PMDs on all PMUs are 64 bits, what they contain can vary
from one PMU to the other.
You can figure out which formats are available for the host PMU by typing:
% pfmon -I
supported PMU models: [itanium2] [itanium] [generic]
detected host PMU: itanium
supported sampling outputs: [detailed-itanium] [raw] [compact] [btb] [example]
You can get a short description of what each format does by using the -S
option:
% pfmon -S detailed-itanium
Name : detailed-itanium
Description : Details each event in clear text
PMU models : [itanium]
Some formats are supported on all PMU models, in which case they are listed
as generic:
% pfmon -S compact
Name : compact
Description : Column-style raw values
PMU models : [generic]
For instance, the compact format works on Itanium and Itanium 2:
% pfmon --smpl-output-format=compact --long-smpl-periods=100000 ls
0 14130 0 0x2000000000015771 0x0000582a9cf18e79 0x0010 100000
1 14130 0 0x2000000000015851 0x0000582a9cf34a40 0x0010 100000
2 14130 0 0x2000000000015941 0x0000582a9cf4e5e8 0x0010 100000
3 14130 0 0x2000000000023da0 0x0000582a9cf69db7 0x0010 100000
....
For more information about the various formats please refer to the source
code :-<
6.3 Sampling examples
Suppose you want to record how many instructions are retired every 50000 cycles, i.e.,
you want to sample based on CPU_CYCLES and record the value of IA64_INST_RETIRED in
each sample. This can be done as follows:
% pfmon --smpl-output-format=detailed-itanium \
--short-smpl-periods=50000 --long-smpl-periods=50000 \
-e cpu_cycles,ia64_inst_retired -- ls /dev/null
The two periods are identical in this example because the number of instructions executed
by the ls command is not influenced by the fact that we monitor. The syntax is such that
the 50000 value of the short period applies to the first event specified in the event list.
The same rule applies to the long period.
With pfmon it is possible to use more than one event as the 'sampling event'. You
can also specify a sampling period for IA64_INST_RETIRED, in which case we take a sample
whenever the first OR second period expires:
% pfmon --smpl-output-format=detailed-itanium --short-smpl-periods=50000,10000 \
--long-smpl-periods=50000,10000 -e cpu_cycles,ia64_inst_retired ls
Here a sample will be recorded every 50000 cpu cycles OR each time 10000 instructions have
been retired.
You do not necessarily need to specify both periods. If you specify one, then pfmon will use the value to
initialize the other one. In other words, as soon as you specify only one period, the unspecified one will
get the same value.
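The positional assignment of periods to events and the defaulting rule can be sketched as follows. This Python code is an illustration under stated assumptions: resolve_periods is a hypothetical helper, the defaulting is applied event by event, and an event without any period is assumed to be recorded in the samples without triggering them.

```python
def resolve_periods(events, short_spec=None, long_spec=None):
    """Return {event: (short period, long period)} for sampling events."""
    def parse(spec):
        return [int(v) for v in spec.split(",")] if spec else []

    shorts, longs = parse(short_spec), parse(long_spec)
    periods = {}
    for index, event in enumerate(events):
        # The n-th value applies to the n-th event of the event list.
        short = shorts[index] if index < len(shorts) else None
        long_ = longs[index] if index < len(longs) else None
        if short is None and long_ is None:
            continue  # no period: the event is only recorded in samples
        # A single specified period initializes the other one.
        if short is None:
            short = long_
        if long_ is None:
            long_ = short
        periods[event] = (short, long_)
    return periods

events = ["cpu_cycles", "ia64_inst_retired"]
print(resolve_periods(events, "50000", "50000"))
print(resolve_periods(events, short_spec="50000,10000"))
```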
Let us look at the information in the sampling buffer for the detailed-itanium format. For the first
example above, we get something like this printed on the screen:
/dev/null
Entry 0 PID:1490 CPU:3 STAMP:0x39e28c5cf782 IIP:0x2000000000004c70
OVFL: 4
PMD5 : 0x0000000000004708
Entry 1 PID:1490 CPU:3 STAMP:0x39e28c5f8e0a IIP:0x2000000000026ee0
OVFL: 4 LAST_VAL: 5000
PMD5 : 0x0000000000007310
Entry 2 PID:1490 CPU:3 STAMP:0x39e28c6273d2 IIP:0x2000000000025e40
OVFL: 4 LAST_VAL: 5000
PMD5 : 0x000000000000b5e6
Entry 3 PID:1490 CPU:3 STAMP:0x39e28c63ef1b IIP:0x2000000000018490
OVFL: 4 LAST_VAL: 5000
PMD5 : 0x000000000001137f
Entry 4 PID:1490 CPU:3 STAMP:0x39e28c64c6f5 IIP:0x2000000000024f60
OVFL: 4 LAST_VAL: 5000
PMD5 : 0x0000000000018a73
Entry 5 PID:1490 CPU:3 STAMP:0x39e28c6596cb IIP:0x2000000000018490
OVFL: 4 LAST_VAL: 5000
PMD5 : 0x00000000000222df
.....
The first line is the output from the ls command. Next you see the entries extracted from the sampling buffer.
Entry 0 is the first entry recorded in this monitoring session. The first line of each sample (entry) shows
the fixed header. The fields are as follows:
- PID: the identity of the process that generated the event
- CPU: the CPU number on which the event occurred
- STAMP: a time stamp guaranteed to be unique in time per CPU.
- IIP: the value of the IP when the event occurred (DANGER, see note below)
- OVFL: the counter that triggered the recording of the sample (more than one possible).
- LAST_VAL: the last value loaded into the first counter which overflowed
VERY IMPORTANT NOTE: users are advised NOT TO TRUST the value reported in IIP. Samples get recorded by forcing
a counter overflow, which then triggers an interrupt that causes the kernel to record the information. Because of the
parallel nature of the architecture and its implementations, it is very likely that by the time the PMU realizes
that there was a counter overflow and generates the interrupt, program execution has progressed well beyond
the instruction that caused the event, leading to a skewed IIP. At best, IIP points to the next bundle, given
that interrupts can only be delivered at bundle boundaries.
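If you want to post-process the detailed output, the fields listed above can be extracted with a small
script. The following Python sketch is not part of pfmon. It assumes the textual layout shown in the
sample output (a header line, an OVFL line, then one line per recorded PMD), and the function name
parse_detailed is made up for the example.

```python
import re

# Fixed header line of a sample, as printed in the detailed output.
HEADER_RE = re.compile(
    r"^entry\s+(?P<entry>\d+)\s+PID:(?P<pid>\d+)\s+CPU:(?P<cpu>\d+)\s+"
    r"STAMP:(?P<stamp>0x[0-9a-fA-F]+)\s+IIP:(?P<iip>0x[0-9a-fA-F]+)",
    re.IGNORECASE,
)
# Overflow line; LAST_VAL is optional (the first entry above has none).
OVFL_RE = re.compile(r"OVFL:\s*(\d+)(?:\s+LAST_VAL:\s*(\d+))?")
# One recorded PMD value per line, printed in hexadecimal.
PMD_RE = re.compile(r"^(PMD\d+)\s*:\s*(0x[0-9a-fA-F]+)")


def parse_detailed(lines):
    """Parse detailed-format sample text into a list of dicts."""
    samples = []
    for line in lines:
        line = line.strip()
        m = HEADER_RE.match(line)
        if m:
            samples.append({
                "entry": int(m["entry"]),
                "pid": int(m["pid"]),
                "cpu": int(m["cpu"]),
                "stamp": int(m["stamp"], 16),
                "iip": int(m["iip"], 16),   # see the note above: do not trust it
                "ovfl": None,
                "last_val": None,
                "pmds": {},
            })
            continue
        if not samples:
            continue  # text before the first entry, e.g. the program's own output
        m = OVFL_RE.search(line)
        if m:
            samples[-1]["ovfl"] = int(m[1])
            if m[2] is not None:
                samples[-1]["last_val"] = int(m[2])
            continue
        m = PMD_RE.match(line)
        if m:
            samples[-1]["pmds"][m[1]] = int(m[2], 16)
    return samples


text = """\
/dev/null
Entry 0 PID:1490 CPU:3 STAMP:0x39e28c5cf782 IIP:0x2000000000004c70
OVFL: 4
PMD5 : 0x0000000000004708
Entry 1 PID:1490 CPU:3 STAMP:0x39e28c5f8e0a IIP:0x2000000000026ee0
OVFL: 4 LAST_VAL: 5000
PMD5 : 0x0000000000007310
"""
samples = parse_detailed(text.splitlines())
for s in samples:
    print(s["entry"], s["cpu"], hex(s["iip"]), s["last_val"], s["pmds"])
```

Since STAMP is only guaranteed to be unique per CPU, any time difference computed from two samples
should use samples taken on the same CPU.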
After the header, you get the value of PMD5. This register contains the number of instructions retired for our
example. The second event specified by the user DOES NOT necessarily end up in PMD5. To figure out how the
events were dispatched among the various PMDs, you can use the --with-header option (described earlier).
The header contains a detailed machine and session description. In our case it would look as follows:
#
# date: Wed Nov 20 17:00:43 2002
#
# hostname: hpljumbo.hpl.hp.com
#
# kernel version: Linux 2.4.18 #2 SMP Tue Aug 6 11:54:56 PDT 2002
#
# pfmon version: 2.0
# kernel perfmon version: 1.0
#
#
#
# page size: 16384 bytes
# CLK_TCK: 1024 ticks/second
# CPU configured: 4
# CPU online: 4
# physical memory: 6827933696
# physical memory available: 5598134272
#
# host CPUs: 4-way 800MHz Itanium (Merced, C0)
# PAL_A: 6.6.23
# PAL_B: 7.7.28
# Cache levels: 3 Unique caches: 4
# L1D: 16384 bytes, line 32 bytes, load_lat 2, store_lat 0
# L1I: 16384 bytes, line 32 bytes, load_lat 2, store_lat 0
# L2 : 98304 bytes, line 64 bytes, load_lat 6, store_lat 6
# L3 : 4194304 bytes, line 64 bytes, load_lat 21, store_lat 21
#
#
# captured events:
# PMD4: CPU_CYCLES, user level(s)
# PMD5: IA64_INST_RETIRED, user level(s)
#
# monitoring mode: per-process
#
#
# instruction sets:
# PMD4: CPU_CYCLES, ia32/ia64
# PMD5: IA64_INST_RETIRED, ia32/ia64
#
#
# command: ./pfmon --with-header --smpl-output-format=detailed-itanium ...
#
#
#
#
# kernel sampling format: 1.0
# sampling entry size: 56
#
# recorded PMDs: PMD5
# sampling buffer entries: 2048
#
# short sampling rates (base/mask/seed):
# CPU_CYCLES 50000
# IA64_INST_RETIRED none
#
# long sampling rates (base/mask/seed):
# CPU_CYCLES 50000
# IA64_INST_RETIRED none
#
#
Near the end of the header, the "recorded PMDs" line lists PMD5, which the "captured events" section
identifies as IA64_INST_RETIRED. Pfmon records the value of every PMD whose event has no sampling period
defined. For our first example, this means that it records the value of the PMD counting the number of
instructions retired. Let us look at a more complicated example using some of the Itanium specific events:
% pfmon --with-header --short-smpl-periods=50000 --long-smpl-periods=50000 \
-e cpu_cycles,ia64_inst_retired,l2_misses,cpu_cpl_changes -- ls /dev/null
Here cpu_cycles controls the sampling period and each sample includes the values of the PMDs counting
the number of instructions retired (IA64_INST_RETIRED), the number of L2 misses (L2_MISSES) and the
number of CPU privilege level changes (CPU_CPL_CHANGES):
entry 0 PID:18723 CPU:3 STAMP:0x23b06dc011261 IIP:0x2000000000024d40
PMD OVFL: 4
PMD5 : 0x00000000000017d7
PMD6 : 0x00000000000001de
PMD7 : 0x0000000000000008
Where the assignments were:
# captured events:
# PMD4: CPU_CYCLES, user level(s)
# PMD5: IA64_INST_RETIRED, user level(s)
# PMD6: L2_MISSES, user level(s)
# PMD7: CPU_CPL_CHANGES, user level(s)
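To label the PMD values of a sample with event names, a script can read the "captured events" section
of the header. The Python sketch below is an illustration only. It assumes the header layout shown above,
and the function name captured_events is made up for the example.

```python
import re

# A line such as "# PMD4: CPU_CYCLES, user level(s)".
CAPTURED_RE = re.compile(r"^#\s*(PMD\d+):\s*([A-Z0-9_]+),")


def captured_events(header_lines):
    """Map PMD names to event names using the 'captured events' section."""
    mapping = {}
    in_section = False
    for line in header_lines:
        text = line.strip()
        if text.startswith("# captured events:"):
            in_section = True
            continue
        if in_section:
            m = CAPTURED_RE.match(text)
            if m:
                mapping[m[1]] = m[2]
            elif mapping:
                break  # first other line after the entries ends the section
    return mapping


header = """\
# captured events:
# PMD4: CPU_CYCLES, user level(s)
# PMD5: IA64_INST_RETIRED, user level(s)
# PMD6: L2_MISSES, user level(s)
# PMD7: CPU_CPL_CHANGES, user level(s)
#
""".splitlines()

events = captured_events(header)
# Values of entry 0 in the example above.
sample = {"PMD5": 0x17d7, "PMD6": 0x1de, "PMD7": 0x8}
for pmd, value in sample.items():
    print(f"{events[pmd]:<20} {value}")
```

Restricting the match to the "captured events" section matters because the "instruction sets" section of
the full header uses lines with the same PMDn prefix.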
Using the compact format instead of the detailed one, you get results formatted so that they can be
easily parsed by other tools. The header contains the description of every
column:
# column 1: entry number
# column 2: process id
# column 3: cpu number
# column 4: instruction pointer
# column 5: unique timestamp
# column 6: bitmask of PMDs which overflowed
# column 7: initial value of PMD which overflowed
# column 8: PMD5
# column 9: PMD6
# column 10: PMD7
and the data is formatted as follows:
When sampling, the counts printed at the end of the session are not very useful, especially for
the counters used as sampling periods. Those should be discarded and they are NOT saved in the
sampling result file.
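As an illustration of how another tool might consume the compact format, the following Python sketch
builds the column names from the header description. It is not part of pfmon, and it rests on two
assumptions that you should check against your own output: that each sample occupies one line with
whitespace-separated columns, and that bit n of the overflow bitmask stands for PMDn. All function names
are made up for the example.

```python
import re

# A header line such as "# column 4: instruction pointer".
COLUMN_RE = re.compile(r"^#\s*column\s+(\d+):\s*(.+?)\s*$")


def column_names(header_lines):
    """Return the column descriptions from the header, ordered by column number."""
    found = {}
    for line in header_lines:
        m = COLUMN_RE.match(line.strip())
        if m:
            found[int(m[1])] = m[2]
    return [found[i] for i in sorted(found)]


def overflowed_pmds(mask):
    """Assumption: bit n set in the bitmask means PMDn overflowed."""
    return [n for n in range(mask.bit_length()) if mask & (1 << n)]


def label_row(names, row):
    """Assumption: one sample per line, columns separated by whitespace."""
    return dict(zip(names, row.split()))


header = """\
# column 1: entry number
# column 2: process id
# column 3: cpu number
# column 4: instruction pointer
# column 5: unique timestamp
# column 6: bitmask of PMDs which overflowed
# column 7: initial value of PMD which overflowed
# column 8: PMD5
# column 9: PMD6
# column 10: PMD7
""".splitlines()

names = column_names(header)
print(names)
print(overflowed_pmds(0x10))   # [4]
```

Each data line of the result file can then be passed to label_row(names, line) to obtain a dictionary
keyed by column description.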
6.4 Sampling in system wide sessions
Sampling is possible in the same manner for system wide sessions. By default, the buffer is printed on the
controlling tty. When sampling on more than one CPU at a time, samples for each CPU will be printed. When
sampling results are redirected into a file, then you get one file per CPU. If the file is called
'myresults', then 'myresults.cpu0' contains the samples captured on CPU0, 'myresults.cpu1' the ones from CPU1,
and so on.
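A script that post-processes such a session has to collect the per-CPU files. The Python sketch below
only illustrates the naming scheme described above. The helper names per_cpu_file and load_per_cpu are
made up for the example.

```python
from pathlib import Path


def per_cpu_file(base, cpu):
    """Name of the result file holding the samples captured on one CPU."""
    return Path(f"{base}.cpu{cpu}")


def load_per_cpu(base, cpus):
    """Read the per-CPU files that exist; return {cpu: list of lines}."""
    results = {}
    for cpu in cpus:
        path = per_cpu_file(base, cpu)
        if path.exists():
            results[cpu] = path.read_text().splitlines()
    return results


print([str(per_cpu_file("myresults", cpu)) for cpu in range(4)])
# ['myresults.cpu0', 'myresults.cpu1', 'myresults.cpu2', 'myresults.cpu3']
```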
The --aggregate-results option also influences the way samples are saved to files. When this option is used,
samples are merged into a single file. In our example, they would go into 'myresults'. If you don't use
the --smpl-no-entry-header option, every sample will have the CPU information.
6.5 Randomization of sampling periods
Pfmon supports randomization of both sampling periods. The user must supply a bitmask and a seed value
using the --smpl-periods-random option. The same mask and seed applies to both the long and short period
for each event. Each event can have a different mask and seed. Two separate invocations of pfmon using
the same seed and mask arguments are guaranteed to generate the same "pseudo-random" series of numbers,
allowing reproducibility.
For each sample, the sampling buffer reports the randomized sampling period that produced it: in the
LAST_VAL field for the detailed output format, or in one of the columns for the compact format.
In the following command, the long (and short) sampling periods are initially set to 100000 and
we activate randomization using a seed of 5. The mask indicates that we allow the value to vary
between 100000 and 100255 (inclusive):
% pfmon --smpl-periods-random=0xff:5 --long-smpl-periods=100000 \
-e cpu_cycles -- noploop 1000000000
entry 0 PID:509 CPU:0 STAMP:0xa9b83faf28 IIP:0x4000000000000400
OVFL: 4 LAST_VAL: 100000
entry 1 PID:509 CPU:0 STAMP:0xa9b8413a4d IIP:0x4000000000000400
OVFL: 4 LAST_VAL: 100005
entry 2 PID:509 CPU:0 STAMP:0xa9b842c532 IIP:0x4000000000000400
OVFL: 4 LAST_VAL: 100067
entry 3 PID:509 CPU:0 STAMP:0xa9b8445077 IIP:0x4000000000000400
OVFL: 4 LAST_VAL: 100181
entry 4 PID:509 CPU:0 STAMP:0xa9b845db4e IIP:0x4000000000000400
OVFL: 4 LAST_VAL: 100064
entry 5 PID:509 CPU:0 STAMP:0xa9b84766b5 IIP:0x4000000000000400
OVFL: 4 LAST_VAL: 100212
entry 6 PID:509 CPU:0 STAMP:0xa9b848f1d5 IIP:0x4000000000000400
OVFL: 4 LAST_VAL: 100140
The randomization is shown in the LAST_VAL field which gives the value loaded into PMD4
(the PMD which overflowed) for each sample. Hence, 100181 is the number of cycles elapsed
between entry 2 and entry 3.
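The effect of the mask and the seed can be sketched in Python as follows. This is an illustration of the
principle only. It uses Python's own random.Random as a stand-in generator, NOT the one used by pfmon, so
the numbers it prints differ from the LAST_VAL values shown above. The function name randomized_periods
is made up for the example.

```python
import random


def randomized_periods(base, mask, seed, count):
    """Yield `count` sampling periods in the range [base, base + mask].

    Stand-in generator: random.Random is an assumption for illustration,
    not the pseudo-random generator used by pfmon.
    """
    rng = random.Random(seed)          # same seed -> same series
    for _ in range(count):
        # Only the bits allowed by the mask are added to the base period.
        yield base + (rng.getrandbits(32) & mask)


first = list(randomized_periods(100000, 0xff, 5, 6))
second = list(randomized_periods(100000, 0xff, 5, 6))
print(first)
print(first == second)   # True: same seed and mask reproduce the series
```

With a mask of 0xff, at most 255 is added to the base period, which gives the 100000 to 100255 range
mentioned above.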
Randomization is important when sampling to avoid getting in lockstep with the execution
and thereby collecting biased results.
6.6 Blocking on overflow notifications
Whenever the sampling buffer becomes full and pfmon is notified, you have
the option of either letting the monitored program continue or blocking it. In both cases, monitoring
is off during the processing of the sampling buffer. By default, pfmon lets the program continue
its execution. It is possible to block the program using the --overflow-block option. Blocking
the program ensures pfmon sees the entire execution. Keeping the program running ensures that
the caches and TLB are kept somewhat warm, i.e., with some state belonging to the running process,
especially on SMP systems.
