pfmon-2.0 common features


Table of contents
  1. Command line options
  2. Getting event information
  3. Basic counting
    1. Simple examples
    2. Privilege levels
    3. Counter output formats
    4. Saving results
    5. Delaying monitoring
    6. Getting timing information
    7. Testing event combinations
  4. System wide sessions
    1. Selecting CPUs to monitor
    2. Delimiting a session
    3. Results aggregation
    4. Excluding idle tasks
  5. Dealing with symbols
  6. Basic sampling
    1. Principles
    2. Sampling output formats
    3. Sampling examples
    4. Sampling in system wide sessions
    5. Randomization of sampling periods
    6. Blocking on overflow notifications

    1. Pfmon command line options

    Pfmon supports the following list of options on any host system:

    -h, --help                             display the list of supported options
    -V, --version                          display pfmon version information
    -l[regex]                              show list of supported events by host PMU
    -i                                     get information about a particular event
    -u                                     monitor at the user level for all events
    -k                                     monitor at the kernel level for all events
    -1                                     monitor at the privilege level 1 for all events
    -2                                     monitor at the privilege level 2 for all events
    -e ev1,ev2,...                         select events to monitor
    -I                                     list the supported PMU models
    --verbose                              print more information during execution
    --outfile=filename                     print counts in a file
    --append                               open outfile in append mode
    --overflow-block                       block when sample buffer is full
    --system-wide                          create a system wide monitoring session
    --cpu-mask=mask                        where to start system wide session
    -S format                              info about a sampling output format
    -t secs                                duration of the session in seconds
    --smpl-outfile=filename                save sampling results in a file
    --smpl-entries=n                       entries in the sampling buffer
    --long-smpl-periods=val1,...           event long sampling periods
    --short-smpl-periods=val1,...          event short sampling periods
    --with-header                          put a description header with results
    --aggregate-results                    aggregate counts and sampling buffer outputs
    --trigger-start-address=addr           start monitoring when execution reaches addr
    --priv-levels=lvl1,...                 set privilege level per event
    --show-time                            show real, user, system time for the executed command
    --us-counter-format                    print counters using commas (1,024)
    --eu-counter-format                    print counters using points (1.024)
    --hex-counter-format                   print counters in hexadecimal (0x400)
    --smpl-output-format=fmt               select fmt as sampling output format
    --symbol-file=filename                 use the ELF archive filename to look for symbols
    --sysmap-file=filename                 use the System.map filename for kernel symbols
    --smpl-periods-random=mask1:seed1,...  randomize sampling periods per event
    --trigger-start-delay=secs             number of seconds before monitoring starts
    --smpl-print-counts                    print counter results when sampling
    --exclude-idle                         exclude idle task from system wide session

    2. Getting event information

    The list of events supported by pfmon depends on the host PMU. You can get the list of supported events using the following pfmon option:

       % pfmon -l
       CPU_CYCLES
       IA64_INST_RETIRED
       IA64_TAGGED_INST_RETIRED_PMC8
       IA64_TAGGED_INST_RETIRED_PMC9
       INST_DISPERSED
       EXPL_STOPBITS
       ALL_STOPS_DISPERSED
       IA32_INST_RETIRED
       ISA_TRANSITIONS
       NOPS_RETIRED
       ....
    

    If you specify an argument to the -l option (no space between l and the argument), it is interpreted as a regular expression and all matching events will be listed:

       % pfmon -ll1d
       L1D_READ_FORCED_MISSES_RETIRED
       L1D_READ_MISSES_RETIRED
       L1D_READS_RETIRED
       PIPELINE_FLUSH_L1D_WAYMP_FLUSH
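
    The filtering done by the -l option can be approximated with a few lines of Python. This is only a sketch: the event list is a small subset copied from the listings above, and the matching rule (unanchored, case insensitive) is an assumption that is consistent with the -ll1d output but is not taken from the pfmon source.

```python
import re

# Event names copied from the listings above; a real PMU exposes many more.
EVENTS = [
    "CPU_CYCLES",
    "IA64_INST_RETIRED",
    "NOPS_RETIRED",
    "L1D_READ_FORCED_MISSES_RETIRED",
    "L1D_READ_MISSES_RETIRED",
    "L1D_READS_RETIRED",
    "PIPELINE_FLUSH_L1D_WAYMP_FLUSH",
]


def match_events(pattern, events=EVENTS):
    """Return the events whose name contains a match for pattern.

    Assumption: matching is unanchored and case insensitive.
    """
    rx = re.compile(pattern, re.IGNORECASE)
    return [name for name in events if rx.search(name)]


print(match_events("l1d"))
```

    Because the match is unanchored, PIPELINE_FLUSH_L1D_WAYMP_FLUSH is selected even though L1D appears in the middle of its name.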
    

    You can get more detailed information about each event using the following option:

       % pfmon -i nops_retired
       Name   : NOPS_RETIRED
       VCode  : 0x30
       Code   : 0x30
       PMD/PMC: [ 4 5 ]
       EAR    : No (N/A)
       Umask  : None
       BTB    : No
       Thres  : 6
       Qual   : [Instruction Address Range] [OpCode match]
    

    Pfmon is case insensitive for event names. Here you see some details about the event. The first 4 lines are generic and provided on all PMU models even though the codes may vary:

    • Code is the event code used by the PMU.
    • VCode is a libpfm internal event code which encapsulates the event code and other information describing the type of the event. For simple events, the two codes are usually identical.
    • PMD/PMC lists the counting monitors on which this event can be programmed. Not all events can necessarily be programmed on all available counting monitors. This constraint is taken care of by the libpfm library.

    Here the remaining information is specific to the Itanium 2 PMU.

    Even with the -i option, you can use a regular expression for the event:

       % pfmon -i'writes$'
       Name   : L2_DATA_REFERENCES_WRITES
       VCode  : 0x20069
       Code   : 0x69
       PMD/PMC: [ 4 5 6 7 ]
       Umask  : 0010
       EAR    : No (N/A)
       BTB    : No
       MaxIncr: 2  (Threshold [0-1])
       Qual   : [Instruction Address Range] [OpCode Match] [Data Address Range] 
    

    On some PMU models (currently Itanium 2), the event information contains a text description of the event.

    Events can be specified using their code:

       % pfmon -i 0x45
       Name   : L2_INST_PREFETCHES
       VCode  : 0x45
       Code   : 0x45
       PMD/PMC: [ 4 5 6 7 ]
       Umask  : 0000
       EAR    : No (N/A)
       BTB    : No
       MaxIncr: 1  (Threshold 0)
       Qual   : [Instruction Address Range] 
       Group  : None
       Set    : None
       Desc   : L2 Instruction Prefetch Requests
    

    Information about what each event measures can be found in the relevant CPU model specific micro-architecture documentation. The architecture requires only two events to be defined by all PMUs:

    • CPU_CYCLES : the number of elapsed CPU cycles.
    • IA64_INST_RETIRED : the number of instructions retired.

    Those two events are guaranteed to exist on all PMUs, but their codes may vary. The PMU specific event names may not be exactly the same; however, pfmon, and especially the library it uses (libpfm), ensures that those two events can always be called by the two names listed above. Pfmon can support more than one PMU in a single binary. It also incorporates a generic PMU model which provides only the features defined by the architecture, including the two events. If pfmon does not have specific support for the host PMU, it will default to the so-called 'Generic' PMU support, if compiled in. You can find out what PMU support is compiled into pfmon as follows:

       % pfmon -I
       detected host CPUs:  4-way 800MHz Itanium (Merced, C0)
       supported PMU models: [itanium2] [itanium] [generic] 
       detected host PMU: itanium
       supported sampling outputs: [detailed-itanium] [raw] [compact] [btb] [example] 
       pfmlib version: 2.0
       kernel perfmon version: 1.0
    

    It is possible to force pfmon to operate in generic mode even though it has support for the host CPU using the pfmon_gen command:

       % pfmon_gen -I
       forced libpfm to generic support
       detected host CPUs:  4-way 800MHz Itanium (Merced, C0)
       supported PMU models: [itanium2] [itanium] [generic] 
       detected host PMU: generic
       supported sampling outputs: [raw] [compact] [example] 
       pfmlib version: 2.0
       kernel perfmon version: 1.0
    
       % pfmon_gen -i CPU_CYCLES
       forced libpfm to generic support
       Name   : CPU_CYCLES
       VCode  : 0x12
       Code   : 0x12
       PMD/PMC: [ 4 5 6 7 ]
    

    pfmon_gen is not a separate command but just a symlink to pfmon. Pfmon always checks the name it was invoked with. If this name is equal to 'pfmon_gen' and the generic support is compiled in, then pfmon will operate in generic mode. This feature is useful when moving pfmon to a PMU for which neither pfmon itself nor libpfm has support yet.
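
    Selecting a mode from the invocation name is a common trick for symlinked tools. The following sketch shows the idea in Python; the function name and its parameters are hypothetical and do not come from pfmon, which is not written in Python.

```python
import os


def select_pmu_mode(argv0, host_pmu, generic_compiled_in=True):
    """Pick the PMU model from the name the program was invoked with.

    argv0 is the invocation name (sys.argv[0] in a real program).
    host_pmu is the model that would be used without forcing.
    """
    name = os.path.basename(argv0)
    if name == "pfmon_gen" and generic_compiled_in:
        return "generic"
    return host_pmu


print(select_pmu_mode("/usr/bin/pfmon_gen", "itanium"))
print(select_pmu_mode("/usr/bin/pfmon", "itanium"))
```

    Only the last path component is compared, so the symlink can live in any directory.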


    3. Basic counting

    In generic mode, pfmon only supports the two architected events listed above. For comparison, the Itanium PMU supports about 230 events and the Itanium2 PMU about 470. No instrumentation of the program is required to monitor the system or a single process.

    3.1 Simple examples

    To collect counts on a specific command, you just need to launch it via pfmon, just like you would do with the time or strace command:

       % pfmon ls -ial /dev/null
       210135 crw-rw-rw-    1 root     root       1,   3 Mar 24  2001 /dev/null
                 2910724 CPU_CYCLES
    

    When invoked with no particular event, pfmon defaults to CPU_CYCLES. To monitor specific events, you can type:

       % pfmon -e cpu_cycles,IA64_inst_Retired -- ls -ial /dev/null
       210135 crw-rw-rw-    1 root     root       1,   3 Mar 24  2001 /dev/null
                  2984546 CPU_CYCLES
                  2666884 IA64_INST_RETIRED
    

    As you can see, pfmon is not case sensitive with regard to event names. More than one event can be measured at a time using a comma separated list of events. You MUST NOT put a space after the comma.

    If the command you want to run takes options, you can clearly distinguish the options of pfmon from the options of your command using the '--' symbol:

       % pfmon -e ia64_inst_retired -- ls -ial /dev/null
       210135 crw-rw-rw-    1 root     root       1,   3 Mar 24  2001 /dev/null
                  2709704 IA64_INST_RETIRED
    

    Otherwise, pfmon will stop parsing arguments as options at the first argument which does not start with a - or --.
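
    The same splitting rule can be tried out with the getopt module of Python's standard library, which also stops at the first non-option argument and treats '--' as an explicit separator. This is a sketch of the rule only: it declares just a handful of the options listed earlier and makes no claim about how pfmon parses its command line internally.

```python
import getopt


def split_command_line(args):
    """Split args into (tool options, command to run).

    "e:" declares that -e takes an argument; u and k are plain flags.
    Only a small, illustrative subset of the options is declared.
    """
    opts, command = getopt.getopt(args, "e:uk", ["us-counter-format"])
    return opts, command


print(split_command_line(["-e", "ia64_inst_retired", "--", "ls", "-ial", "/dev/null"]))
print(split_command_line(["--us-counter-format", "ls", "-l", "/dev/null"]))
```

    In both calls the options of ls end up in the command part and are never interpreted as options of the monitoring tool.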

    3.2 Specifying event privilege levels

    By default, pfmon monitors only what is going on at the user level (application level). This is true for both per-process and system wide mode.

    It is possible to monitor at any of the 4 privilege levels provided by IA-64. It is also possible to monitor at several levels at the same time by specifying more than one level. The levels can be specified for all events or on a per-event basis. To affect all events, you can use any combination of -k (-0), -1, -2, -u (or -3). To set the level for each event, the --priv-levels option must be used.

    By default, pfmon only measures at the user level:

       % pfmon -e nops_retired ls
    

    counts the number of NOPS_RETIRED when ls is running at the user level only (equivalent to specifying -u or -3).

       % pfmon -k -e nops_retired ls
    

    counts the number of NOPS_RETIRED when ls is running at the kernel level only.

       % pfmon -k -u -e nops_retired ls
    

    counts the number of NOPS_RETIRED when ls is running at the kernel level or user level, i.e. all the time.

    It is possible to refine the settings on a per event basis using the --priv-levels option.

       % pfmon -e loads_retired,nops_retired ls
    

    Both events are measured at the user level only.

       % pfmon --priv-levels=u,k -e loads_retired,nops_retired ls
    

    LOADS_RETIRED is measured at the user level only, NOPS_RETIRED at the kernel level only.

       % pfmon --priv-levels=,uk -e loads_retired,nops_retired ls
    

    LOADS_RETIRED is measured at the user level only, NOPS_RETIRED at the user and kernel levels.

       % pfmon -k --priv-levels=uk -e loads_retired,nops_retired ls
    

    LOADS_RETIRED is measured at the user and kernel levels, NOPS_RETIRED at the kernel level only.
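
    The way the global flags and the per-event list combine in the four examples above can be summarized in a short sketch. The rule it encodes (an empty or missing per-event entry falls back to the global setting, which itself defaults to user level) is inferred from those examples; the function name is hypothetical.

```python
def resolve_priv_levels(events, global_levels="", per_event=""):
    """Return {event: levels} for a list of events.

    global_levels: levels set for all events, e.g. "k" for -k ("" if none).
    per_event: the comma separated per-event list, e.g. ",uk".
    """
    default = global_levels or "u"  # user level when nothing is specified
    entries = per_event.split(",") if per_event else []
    resolved = {}
    for index, event in enumerate(events):
        entry = entries[index] if index < len(entries) else ""
        resolved[event] = entry or default
    return resolved


events = ["loads_retired", "nops_retired"]
print(resolve_priv_levels(events, per_event=",uk"))
print(resolve_priv_levels(events, global_levels="k", per_event="uk"))
```

    The entries are positional: the first entry applies to the first event of the -e list, the second entry to the second event, and so on.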

    3.3 Specifying counter output formats

    Pfmon can display the final counts in various formats. There are 4 formats defined. The default one is shown in the examples above. To make it easier to read large numbers or to feed the numbers to other programs, pfmon supports:

    • --us-counter-format where the thousands, millions, billions are separated with commas (US and UK style):
         % pfmon --us-counter-format ls -l /dev/null
         crw-rw-rw-    1 root     root       1,   3 Mar 24  2001 /dev/null
                        2,292,689 CPU_CYCLES
      
    • --eu-counter-format where the thousands, millions, billions are separated with points (European style):
         % pfmon --eu-counter-format ls -l /dev/null
         crw-rw-rw-    1 root     root       1,   3 Mar 24  2001 /dev/null
                        1.703.898 CPU_CYCLES
      
    • --hex-counter-format where the counts are shown in hexadecimal format:
         % pfmon --hex-counter-format ls -l /dev/null
         crw-rw-rw-    1 root     root       1,   3 Mar 24  2001 /dev/null
               0x000000000019c164 CPU_CYCLES
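
    For readers who post-process counts, the three formats are easy to reproduce. The sketch below mirrors the outputs shown above in Python; the 16-digit zero padding of the hexadecimal form is taken from the sample output, and the function name is hypothetical.

```python
def format_count(value, style="plain"):
    """Format a counter value in one of the styles shown above."""
    if style == "us":
        return f"{value:,}"                    # 2,292,689
    if style == "eu":
        return f"{value:,}".replace(",", ".")  # 1.703.898
    if style == "hex":
        return f"0x{value:016x}"               # 64-bit value, zero padded
    return str(value)


for style in ("plain", "us", "eu", "hex"):
    print(style, format_count(2292689, style))
```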
      
    3.4 Saving results

    By default, the counts are printed on the controlling tty. However it is possible to save them in a file using the --outfile option:

       % pfmon --outfile=b --hex-counter-format ls -l /dev/null
       crw-rw-rw-    1 root     root       1,   3 Mar 24  2001 /dev/null
       % cat b
       0x000000000016a8b1 CPU_CYCLES
    

    It is possible to include a header with the results using the --with-header option. It will be printed on the controlling tty or saved in the output file. The header contains detailed information about the configuration of the host machine and on the monitoring session:

       % pfmon --with-header --outfile=b --hex-counter-format ls -l /dev/null
       crw-rw-rw-    1 root     root       1,   3 Mar 24  2001 /dev/null
       % cat b
       #
       # date: Wed Nov 20 16:03:13 2002
       #
       # hostname: hpljumbo.hpl.hp.com
       #
       # kernel version: Linux 2.4.18 #2 SMP Tue Aug 6 11:54:56 PDT 2002
       #
       # pfmon version: 2.0
       # kernel perfmon version: 1.0
       #
       #
       #
       # page size: 16384 bytes
       # CLK_TCK: 1024 ticks/second
       # CPU configured: 4
       # CPU online: 4
       # physical memory: 6827933696
       # physical memory available: 5606391808
       #
       # host CPUs:  4-way 800MHz Itanium (Merced, C0)
       #	PAL_A: 6.6.23
       #	PAL_B: 7.7.28
       #	Cache levels: 3 Unique caches: 4
       #	L1D:    16384 bytes, line  32 bytes, load_lat   2, store_lat   0
       #	L1I:    16384 bytes, line  32 bytes, load_lat   2, store_lat   0
       #	L2 :    98304 bytes, line  64 bytes, load_lat   6, store_lat   6
       #	L3 :  4194304 bytes, line  64 bytes, load_lat  21, store_lat  21
       #
       #
       # captured events:
       #	PMD4: CPU_CYCLES, user level(s)
       #
       # monitoring mode: per-process
       #
       #
       # instruction sets:
       #	PMD4: CPU_CYCLES, ia32/ia64
       #
       #
       # command: pfmon --with-header --outfile=b --hex-counter-format ls -l /dev/null
       #
       #
       #
                0x00000000001a8956 CPU_CYCLES
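
    A results file saved this way is simple to read back from a script. The following sketch assumes the layout shown above: header lines start with '#', and every other non-empty line holds a value followed by an event name. It handles the default and hexadecimal counter formats only; the function name is hypothetical.

```python
def parse_counts(text):
    """Return {event: count} from the contents of an --outfile file."""
    counts = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and the --with-header block
        value, event = line.split()
        counts[event] = int(value, 0)  # base 0 accepts decimal and 0x... input
    return counts


sample = """\
#
# pfmon version: 2.0
#
         0x00000000001a8956 CPU_CYCLES
"""
print(parse_counts(sample))
```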
    
    3.5 Delaying monitoring

    By default, pfmon will start monitoring at the first instruction of the program, i.e., the entry point when the privilege level is limited to user level. Even when kernel level monitoring is enabled nothing will be measured until the process leaves the kernel for the first time, after fork.

    Sometimes, it may be useful to delay the activation of monitoring until a certain point in the execution. This is the case when the initialization must not be included in the counts. Pfmon provides two different ways to delay the point at which monitoring is turned on with the --trigger-start-address and --trigger-start-delay options.

    The --trigger-start-address option only applies to per-process sessions and is ignored for system-wide. It uses a code address to trigger monitoring. Once execution reaches the bundle address specified with the option, the monitoring will be turned on and will remain on until the program terminates. The address can be specified in hexadecimal or a code symbol name can be provided. It is not possible to specify a kernel address; pfmon will reject any such address. When an address is explicitly used, pfmon will not try to validate it except by checking it is not in the kernel. The delayed start mechanism will be used only the first time the address is reached.

    For instance, if the address of main() is 0x40000000000004a0, then we can delay monitoring until main() is reached using:

       % pfmon --trigger-start-address=0x40000000000004a0 -e loads_retired foo
          74 LOADS_RETIRED
    

    or using the symbol table:

       % pfmon --trigger-start-address=main -e loads_retired foo
          74 LOADS_RETIRED
    

    IMPORTANT: Note that pfmon can ONLY look up symbols in the "main" program and NOT in any dynamically linked libraries. To allow complete coverage, the program MUST be linked statically.

    The same program executed without the trigger address gives:

       % pfmon -e loads_retired foo
          1598 LOADS_RETIRED
    

    This example shows that the libc initialization used 1598-74=1524 loads all by itself.

    The --trigger-start-delay option uses time to delay monitoring. You simply specify a delay in seconds. When the delay expires, monitoring will be turned on. This option works for both per-process and system-wide monitoring. If the monitored process terminates before the delay expires, then nothing gets measured. This applies to both per-process sessions and system wide sessions that use a process to delimit the session. Note that the session effectively starts when monitoring is turned on. Hence, the --session-timeout is only armed when monitoring is turned on.

    The following example will start monitoring 5 seconds into the execution of foo:

       % pfmon --trigger-start-delay=5 -e loads_retired foo
    

    The following example will start monitoring 5 seconds into the execution of foo and keep monitoring for 10 seconds after that point:

       % pfmon --trigger-start-delay=5 --session-timeout=10 -e loads_retired foo
    
    3.6 Getting timing information

    It is possible to get a time breakdown of the execution of the monitored command for both per-process and system-wide mode using the --show-time option. The output is similar to the time(1) command. For instance:

       % pfmon --show-time -e nops_retired ls /dev/null
       /dev/null
       real 0h00m00.098s user 0h00m00.000s sys 0h00m00.095s
                         247913 NOPS_RETIRED
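
    If the timing line needs to be consumed by a script, it can be converted to seconds with a regular expression. This sketch assumes the exact layout of the line shown above (hours, minutes, then fractional seconds for each of real, user and sys); the function name is hypothetical.

```python
import re

# One field looks like: real 0h00m00.098s
TIME_FIELD = re.compile(r"(real|user|sys) (\d+)h(\d+)m(\d+\.\d+)s")


def parse_show_time(line):
    """Return {"real": s, "user": s, "sys": s} with times in seconds."""
    times = {}
    for label, hours, minutes, seconds in TIME_FIELD.findall(line):
        times[label] = int(hours) * 3600 + int(minutes) * 60 + float(seconds)
    return times


print(parse_show_time("real 0h00m00.098s user 0h00m00.000s sys 0h00m00.095s"))
```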
    
    3.7 Testing event combinations

    Sometimes it is handy to check if some events can be measured simultaneously without actually starting the monitoring session. The --check-events-only option of pfmon allows this mode of operation. It will check that the combination is valid and then exit. If the combination is invalid, it will print out the reason and return with an exit value of 1; otherwise the exit value is 0. On Itanium 2, for instance, you can try:

       % pfmon --check-events-only -e loads_retired,stores_retired
       event loads_retired and stores_retired cannot be measured at the same time
       % echo $?
       1
    

    Note that in this mode, you do not need to specify a command to execute.
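
    Because the result is reported through the exit value, the check is easy to drive from a script. The sketch below builds the command line used above and maps the exit status to a boolean. The helper names are hypothetical, and the two stand-in commands exist only so that the example can run on a machine where pfmon is not installed.

```python
import subprocess
import sys


def check_events_command(events, pfmon="pfmon"):
    """Build the argument list for a --check-events-only run."""
    return [pfmon, "--check-events-only", "-e", ",".join(events)]


def combination_is_valid(argv):
    """Run argv and return True if it exits with status 0."""
    result = subprocess.run(
        argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    return result.returncode == 0


# Stand-ins that imitate the two possible exit values without pfmon.
STAND_IN_OK = [sys.executable, "-c", "raise SystemExit(0)"]
STAND_IN_FAIL = [sys.executable, "-c", "raise SystemExit(1)"]

print(check_events_command(["loads_retired", "stores_retired"]))
print(combination_is_valid(STAND_IN_OK), combination_is_valid(STAND_IN_FAIL))
```

    On a host with pfmon installed, the list returned by check_events_command() would be passed to combination_is_valid() instead of the stand-ins.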


    4. System wide sessions

    When the --system-wide flag is used, pfmon operates in system wide mode. This means that it does not monitor a specific program anymore but instead all the processes that execute on a specific set of CPUs. In this mode, you do not need to specify a command. You do not need to be root to create a system wide session.

    A system wide session cannot co-exist with any per-process sessions. But a system wide session can run concurrently with other system wide sessions as long as they do not monitor the same set of CPUs. Of course multiple per-process sessions are possible.

    4.1 Selecting CPUs to monitor

    The --cpu-mask option can be used to restrict monitoring to a specific set of CPUs. When this option is not present, pfmon will automatically launch a system wide session on all available CPUs as reported by /proc/cpuinfo.

    So if the system has 2 available CPUs:

       % pfmon --system-wide -u -e cpu_cycles,ia64_inst_retired
       <Press ENTER to stop session>
       CPU0                248793 CPU_CYCLES
       CPU0                 60710 IA64_INST_RETIRED
       CPU1                 26690 CPU_CYCLES
       CPU1                  7706 IA64_INST_RETIRED
    

    A system wide session can monitor at any privilege level (kernel, user, or both).

    If you want to restrict monitoring to a specific CPU, you can use the --cpu-mask option:

       % pfmon --system-wide --cpu-mask=0x2 -u -e cpu_cycles,ia64_inst_retired
       <Press ENTER to stop session>
       CPU1                 17841 CPU_CYCLES
       CPU1                  7577 IA64_INST_RETIRED
    

    The CPU mask is a bitmask where each bit represents a CPU. CPUs are numbered starting at 0. So bit 0 represents CPU0, bit 1 represents CPU1, and so on. Therefore the above command will only monitor events happening on CPU1. More than one bit can be set in the mask. For instance, with --cpu-mask=0x3, pfmon will monitor on CPU0 and CPU1 at the same time.
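
    Computing the mask by hand becomes error prone on larger machines. The following sketch converts between a list of CPU numbers and the value to pass to --cpu-mask; both helper names are hypothetical.

```python
def cpu_mask(cpus):
    """Return the bitmask with one bit set per CPU number."""
    mask = 0
    for cpu in cpus:
        mask |= 1 << cpu  # bit N stands for CPU N
    return mask


def cpus_in_mask(mask):
    """Return the sorted CPU numbers selected by mask."""
    return [bit for bit in range(mask.bit_length()) if (mask >> bit) & 1]


print(hex(cpu_mask([1])))     # mask selecting CPU1 only
print(hex(cpu_mask([0, 1])))  # mask selecting CPU0 and CPU1
print(cpus_in_mask(0x3))
```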

    4.2 Delimiting a system wide session

    There are three ways to delimit a system wide session. By default, the session will terminate when the user presses the ENTER key. It is also possible to use a timeout expressed in seconds. Finally, the session can also be delimited by the execution of a command. It will start when the command starts and stop when it terminates. Here are some examples:

    Monitor cpu_cycles and instructions retired on the first two CPUs at both user and kernel levels and wait for a keypress to stop:

       % pfmon --cpu-mask=0x3 --system-wide -u -k -e cpu_cycles,ia64_inst_retired
       <Press ENTER to stop session>
       CPU0                   821818169 CPU_CYCLES
       CPU0                  1338893885 IA64_INST_RETIRED
       CPU1                   821813442 CPU_CYCLES
       CPU1                  1341176908 IA64_INST_RETIRED
    

    Monitor cpu_cycles and instructions retired on the first two CPUs at both user and kernel levels for 10 seconds:

       % pfmon --session-timeout=10 --cpu-mask=0x3 --system-wide -u -k \
         -e cpu_cycles,ia64_inst_retired
       <Session to end in 10 seconds>
       CPU0                  8003156088 CPU_CYCLES
       CPU0                 12800683300 IA64_INST_RETIRED
       CPU1                  8003106584 CPU_CYCLES
       CPU1                 12899764561 IA64_INST_RETIRED
    

    Monitor cpu_cycles and instructions retired on the first two CPUs at the user level only during the execution of the ls command (here obviously run on CPU0):

       % pfmon --cpu-mask=0x3 --system-wide -u \
         -e cpu_cycles,ia64_inst_retired -- ls -l /dev/null
       crw-rw-rw-    1 root     root       1,   3 Mar 24  2001 /dev/null
       CPU0                       46560 CPU_CYCLES
       CPU0                       26839 IA64_INST_RETIRED
       CPU1                        7514 CPU_CYCLES
       CPU1                        1184 IA64_INST_RETIRED
    
    4.3 Results aggregation

    It is possible to aggregate counts when monitoring more than one CPU:

       % pfmon --aggregate-results --system-wide -k -e cpu_cycles,ia64_inst_retired
       <Press ENTER to stop session>
                852331455 CPU_CYCLES
               1387206797 IA64_INST_RETIRED
    

    In that case, the per-CPU results are summed. Pfmon does not allow different events to be monitored on different CPUs. For this, you can run separate instances of pfmon with different CPU masks, using command lines similar to:

      % pfmon --session-timeout=10 --cpu-mask=0x1 --system-wide -k -e cpu_cycles &
      % pfmon --session-timeout=10 --cpu-mask=0x2 --system-wide -k -e ia64_inst_retired &
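
    The summing performed by --aggregate-results can also be applied after the fact to per-CPU output, for instance when the results of several runs are merged. The sketch below assumes the three-column layout of the per-CPU listings in section 4.2 and uses those numbers as sample input; the function name is hypothetical.

```python
from collections import defaultdict

# Per-CPU output copied from the keypress example in section 4.2.
SAMPLE = """\
CPU0                   821818169 CPU_CYCLES
CPU0                  1338893885 IA64_INST_RETIRED
CPU1                   821813442 CPU_CYCLES
CPU1                  1341176908 IA64_INST_RETIRED
"""


def aggregate(per_cpu_output):
    """Sum the counts of each event over all CPUs."""
    totals = defaultdict(int)
    for line in per_cpu_output.splitlines():
        fields = line.split()
        if len(fields) != 3 or not fields[0].startswith("CPU"):
            continue  # ignore prompts and anything that is not a count line
        _cpu, value, event = fields
        totals[event] += int(value)
    return dict(totals)


print(aggregate(SAMPLE))
```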
    
    4.4 Excluding idle task

    Pfmon now allows the user to exclude the idle tasks from a system wide monitoring session. This only works with a kernel that has perfmon 1.3 or higher. Pfmon checks the kernel version and may abort in case the wrong version is detected.

    Linux has one idle task per CPU. This task is run when nothing else can. The idle task is a kernel only task with a pid of 0. The pid 0 is used for ALL idle tasks. They do not show up in ps or top.

    When running a system wide session, it may be useful to stop monitoring when the idle task is running; this way we monitor only the USEFUL execution. Of course, monitoring the idle task or not implies that monitoring is active at the kernel privilege level, i.e., when using the -k or -0 option of pfmon. When monitoring only at the user level, excluding the idle task has no effect. Similarly, excluding the idle task for a per-process session has no effect.

    For instance, here is what we get without exclusion:

       % pfmon -k --session-timeout=10 --system-wide
                     8003084826 CPU_CYCLES
    

    This is run on an 800MHz Itanium CPU, so 10s is 8 billion cycles. But if we run with exclusion:

       % pfmon --exclude-idle -k --session-timeout=10 --system-wide
                         259663 CPU_CYCLES
    

    These are the useful cycles for the 10s period.
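
    The arithmetic behind these two measurements can be written down as a small worked example. The 800MHz clock and the two counts are the ones quoted above; the helper names are hypothetical.

```python
def expected_cycles(frequency_hz, seconds):
    """Cycles elapsed on one CPU running at frequency_hz for seconds."""
    return frequency_hz * seconds


def busy_fraction(useful_cycles, total_cycles):
    """Share of the cycles that was not spent in the idle task."""
    return useful_cycles / total_cycles


print(expected_cycles(800_000_000, 10))  # cycles in a 10s session
print(f"{busy_fraction(259663, 8003084826):.6%}")
```

    With these numbers, far less than 0.01% of the kernel-level cycles were spent outside the idle task during the session.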

    5. Dealing with symbols

    Whenever an option takes an address (code or data) as argument, it is possible to directly use a symbol name rather than its address. For instance, this is true for the --trigger-start-address option. The user has two ways to indicate where to find the symbol table. Pfmon can extract the symbol table using an ELF image directly. This is for instance what is done implicitly in per-process mode. Pfmon also understands the System.map format which is typically used to save the symbol table of the kernel.

    There are a couple of restrictions concerning the symbols. Pfmon cannot extract symbol information that is coming from dynamically linked libraries or modules. To avoid this problem, the program must be statically linked and should not explicitly use dlopen().

    If the symbol table has been stripped, pfmon will not find any symbol. In case the option requires a code address, pfmon will only look for matching code symbols. Conversely, if the option requires a data address, pfmon will only look for matching data symbols.
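
    The distinction between code and data symbols can be illustrated with a lookup over System.map style text, where each line holds an address, a one-letter type and a name. This is a sketch under stated assumptions: the mapping of type letters to code (t, T) and data (d, D, b, B, r, R) is a simplification, the sample addresses are made up, and nothing here is taken from the pfmon source.

```python
CODE_TYPES = set("tT")      # text (code) symbols
DATA_TYPES = set("dDbBrR")  # initialized, bss and read-only data symbols

# Made-up entries in "address type name" layout, for illustration only.
SAMPLE = """\
0000000000001000 T sys_getpid
0000000000002000 D some_data
"""


def lookup(sysmap_text, name, want="code"):
    """Return the address of name if it is a symbol of the wanted kind."""
    wanted = CODE_TYPES if want == "code" else DATA_TYPES
    for line in sysmap_text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue
        address, kind, symbol = fields
        if symbol == name and kind in wanted:
            return int(address, 16)
    return None  # unknown symbol, or symbol of the wrong kind


print(lookup(SAMPLE, "sys_getpid", "code"))
print(lookup(SAMPLE, "sys_getpid", "data"))
```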

    By default, the symbols are automatically extracted from the command being run. This is true in per-process mode but also in system wide mode when a command is specified. In cases where symbols must be extracted from an alternative ELF archive, the user must use the --symbol-file option. The filename specified there must be an ELF/ia64 binary.

    Note that the Linux/ia64 kernel is also an ELF/ia64 archive; however, for most distributions the kernel image found in /boot/efi is often compressed. The compression scheme used for Linux/ia64 is different from the one used on Linux/ia32. The compressed image is simply the ELF/ia64 image compressed with gzip. So it is possible to decompress it to get the original ELF archive. The main caveat is that most of the time the compressed image is stripped. Therefore the user must rely on the corresponding System.map file, usually placed in /boot/efi. In this case, the user must explicitly specify the location of the System.map file via the --sysmap-file option.
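
    Recovering the ELF image from a gzip-compressed kernel takes only a few lines. The sketch below decompresses a byte string and verifies the ELF magic number; the function name is hypothetical and the example feeds it synthetic data rather than a real kernel image.

```python
import gzip

ELF_MAGIC = b"\x7fELF"  # first four bytes of every ELF file


def decompress_kernel(compressed_bytes):
    """Return the ELF image contained in gzip-compressed data."""
    image = gzip.decompress(compressed_bytes)
    if not image.startswith(ELF_MAGIC):
        raise ValueError("decompressed data is not an ELF image")
    return image


# Synthetic stand-in for a compressed kernel image.
fake_image = ELF_MAGIC + b"\x00" * 16
print(decompress_kernel(gzip.compress(fake_image))[:4])
```

    On a real system, the input would be read from the compressed kernel file and the result written to a new file, which could then be given to --symbol-file if it still contains a symbol table.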

    Here are a few examples on Itanium:

    • Count the number of times main() is called in the noploop program:
          % file noploop
          noploop: ELF 64-bit LSB executable, IA-64, version 1, \
          statically linked, not stripped
          % pfmon --checkpoint-func=main -e ia64_inst_retired noploop 10000
      

      Here the symbol information for main() is directly extracted from noploop itself.

    • Count the number of times main() is called in the noploop-s program:
          % file noploop-s
          noploop-s: ELF 64-bit LSB executable, IA-64, version 1, \
          statically linked, stripped
          % pfmon --symbol-file=noploop --checkpoint-func=main \
            -e ia64_inst_retired noploop-s 1000
      

      Here noploop and noploop-s are the same program except that the latter does not have the symbol table anymore.

    • Count the number of times sys_getpid() is called during the execution of noploop:
          % pfmon -k --symbol-file=/boot/efi/vmlinux-nostrip --checkpoint-func=sys_getpid \
            -e ia64_inst_retired noploop 1000
      

      Here we assume that the kernel file vmlinux was not stripped. If the kernel has been stripped, then we can use the System.map instead:
          % pfmon -k --sysmap-file=/boot/efi/System.map --checkpoint-func=sys_getpid \
            -e ia64_inst_retired noploop 1000

    6. Basic sampling

    Pfmon has support for sampling on any event or combination of events. Samples are collected into a buffer which can then be written to a file or simply printed on the screen.

    6.1 Principles

    Each sample is composed of two parts: a fixed size header which contains information about the sample, and a variable size body which consists of a set of 64-bit values, each one holding the value of a PMD register for one of the other events being monitored. All samples record the same set of PMDs; this set is determined by pfmon based on what is being measured.

    The sampling buffer is controlled by the kernel but its size is configurable. By default pfmon uses a buffer with 2048 entries. This can be changed using the --smpl-entries option.

    The sampling works as follows:

    1. the user specifies which events are to be recorded in each sample.
    2. the user specifies the sampling period (via an event) and optional randomization parameters.
    3. at the end of a period, a sample is recorded into the sampling buffer by the kernel.
    4. if the sampling buffer is not full, a new sampling period is reloaded and execution/monitoring resumes. we go back to step 3.
    5. if the sampling buffer becomes full, pfmon is notified.
    6. pfmon processes the buffer, i.e., prints and/or saves the buffer.
    7. pfmon then notifies the kernel that it is done.
    8. the kernel reloads a new sampling period and execution/monitoring resumes. we go back to step 3.

    Pfmon (and the kernel) uses two sampling periods instead of just one. The first one is called short-smpl-period and the second is called long-smpl-period. The short-smpl-period is used in step 4, this is when the sampling buffer is not full after writing the sample. The long-smpl-period is used in step 8 when the reload occurs after the buffer became full.
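
    The interplay between the buffer and the two periods can be made concrete with a small simulation that follows the steps listed above. It is a sketch, not the kernel's implementation: positions are expressed as counts of the sampling event, and the period used before the very first sample is assumed to be the short one, which the text does not specify.

```python
def simulate_sampling(total_events, entries, short_period, long_period):
    """Return (positions, drains) for a run of total_events events.

    positions: event counts at which a sample is recorded (step 3).
    drains: number of times the full buffer had to be processed (steps 5-7).
    """
    positions = []
    drains = 0
    filled = 0
    next_at = short_period  # assumption: the first period is the short one
    while next_at <= total_events:
        positions.append(next_at)    # step 3: record a sample
        filled += 1
        if filled < entries:         # step 4: buffer not full
            next_at += short_period
        else:                        # steps 5 to 8: buffer full, drain it
            drains += 1
            filled = 0
            next_at += long_period
    return positions, drains


print(simulate_sampling(100, 3, 10, 25))
```

    With a 3-entry buffer, every third sample is followed by the longer gap of 25 events instead of 10.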

    But why do we need 2 periods?

    As you might imagine, there is some overhead in recording a sample. This overhead is increased even more when pfmon needs to get involved to drain the buffer. This operation can take some time and will inevitably introduce some noise in the measurements in the form of TLB and/or cache pollution. To try and hide this noise, it is sometimes beneficial to adjust the sampling period, i.e., make it larger to ensure that the next sample will not record an event that is the consequence of the overhead generated by the monitoring but rather a normal event occurring in the program/system being monitored. So it is expected that long-smpl-period >= short-smpl-period. Of course, if the two are equal, this is equivalent to having only one sampling period. Note that the long-smpl-period is only used to set the distance to the first sample recorded after the buffer is marked as empty again (step 7).

    6.2 Sampling output formats

    There are many ways in which the samples can be saved or printed on the screen. Pfmon has support for custom formats. Note that at this point, the kernel sampling buffer format is fixed. Here the customization happens in the tool. Pfmon comes with a set of output formats. Some of them can be used with any PMU model, others are specific to the Itanium or Itanium 2 PMUs. While all PMDs on all PMUs are 64 bits, what they contain can vary from one PMU to the other.

    You can figure out which formats are available for the host PMU by typing:

       % pfmon -I
       supported PMU models: [itanium2] [itanium] [generic] 
       detected host PMU: itanium
       supported sampling outputs: [detailed-itanium] [raw] [compact] [btb] [example] 
    

    You can get a short description of what each format does by using the -S option:

       % pfmon -S detailed-itanium
       Name        : detailed-itanium
       Description : Details each event in clear text
       PMU models  : [itanium] 
    

    Some formats are supported on all PMU models, in which case they are listed as generic:

       % pfmon -S compact
       Name        : compact
       Description : Column-style raw values
       PMU models  : [generic]
    

    For instance, the compact format works on Itanium and Itanium 2:

       % pfmon --smpl-output-format=compact --long-smpl-periods=100000 ls
       0        14130    0  0x2000000000015771 0x0000582a9cf18e79 0x0010 100000 
       1        14130    0  0x2000000000015851 0x0000582a9cf34a40 0x0010 100000 
       2        14130    0  0x2000000000015941 0x0000582a9cf4e5e8 0x0010 100000 
       3        14130    0  0x2000000000023da0 0x0000582a9cf69db7 0x0010 100000 
       ....
    

    For more information about the various formats please refer to the source code :-<
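
    Because the compact format is column-based, a line such as the ones shown above can be split into named fields by a small script. The sketch below assumes the column order that pfmon documents in the header of compact output (entry number, process id, cpu number, instruction pointer, timestamp, bitmask of overflowed PMDs, initial value, then any recorded PMDs). The function name parse_compact_line is hypothetical, and reading bit n of the mask as PMDn is an assumption, although it agrees with 0x0010 and PMD4 in the examples.

```python
def parse_compact_line(line):
    """Split one compact-format sample line into named fields."""
    fields = line.split()
    ovfl_mask = int(fields[5], 16)
    return {
        "entry": int(fields[0]),
        "pid": int(fields[1]),
        "cpu": int(fields[2]),
        "ip": int(fields[3], 16),
        "stamp": int(fields[4], 16),
        # assumption: bit n set in the mask means PMDn overflowed
        "ovfl_pmds": [n for n in range(64) if (ovfl_mask >> n) & 1],
        "initial_value": int(fields[6]),
        # remaining columns are the recorded PMDs; kept as text because
        # their radix is not shown in this document
        "pmds": fields[7:],
    }


sample = "0        14130    0  0x2000000000015771 0x0000582a9cf18e79 0x0010 100000"
print(parse_compact_line(sample))
```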

    6.3 Sampling examples

    Suppose you want to record how many instructions are retired every 50000 cycles, i.e., you want to sample based on CPU_CYCLES and record the value of IA64_INST_RETIRED in each sample. This can be done as follows:

    	% pfmon --smpl-output-format=detailed-itanium \
    	  --short-smpl-period=50000 --long-smpl-period=50000 \
              -e cpu_cycles,ia64_inst_retired -- ls /dev/null
    

    The two periods are identical in this example because the number of instructions executed by the ls command is not influenced by the fact that we monitor. The syntax is such that the value 50000 given to --short-smpl-period applies to the first event specified in the event list. The same rule applies to --long-smpl-period.

    With pfmon it is possible to use more than one event as the 'sampling event'. You can also specify a sampling period for IA64_INST_RETIRED, in which case we take a sample whenever the first OR second period expires:

    	% pfmon --smpl-output-format=detailed-itanium --short-smpl-period=50000,10000 \
    	  --long-smpl-period=50000,10000 -e cpu_cycles,ia64_inst_retired ls
    

    Here a sample will be recorded every 50000 cpu cycles OR each time 10000 instructions have been retired.

    You do not necessarily need to specify both periods. If you specify only one, pfmon uses that value to initialize the other, so both periods end up with the same value.
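
    The positional rule and the defaulting rule can be summarized with a short Python sketch. It is an illustration of the behaviour described above, not pfmon's option parser: the name assign_periods is invented, and copying a whole comma-separated list from one period kind to the other is an assumption about how the defaulting extends to several events.

```python
def assign_periods(events, short=None, long=None):
    """Map each event to its (short, long) periods, or None if it has none."""
    def split(spec):
        return [int(v) for v in spec.split(",")] if spec else []

    short_vals, long_vals = split(short), split(long)
    # A period kind that was not given takes the values of the other kind.
    if not short_vals:
        short_vals = long_vals
    if not long_vals:
        long_vals = short_vals

    plan = {}
    for i, event in enumerate(events):
        if i < len(short_vals) and i < len(long_vals):
            # the i-th value applies to the i-th event of the list
            plan[event] = (short_vals[i], long_vals[i])
        else:
            plan[event] = None  # no period: not used as a sampling event
    return plan


print(assign_periods(["cpu_cycles", "ia64_inst_retired"], short="50000"))
print(assign_periods(["cpu_cycles", "ia64_inst_retired"],
                     short="50000,10000", long="50000,10000"))
```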

    Let us look at the information in the sampling buffer for the detailed-itanium format. For the first example above, we get something like this printed on the screen:

    /dev/null
    Entry 0 PID:1490 CPU:3 STAMP:0x39e28c5cf782 IIP:0x2000000000004c70
    OVFL: 4 
    PMD5  : 0x0000000000004708
    Entry 1 PID:1490 CPU:3 STAMP:0x39e28c5f8e0a IIP:0x2000000000026ee0
    OVFL: 4  LAST_VAL: 5000
    PMD5  : 0x0000000000007310
    Entry 2 PID:1490 CPU:3 STAMP:0x39e28c6273d2 IIP:0x2000000000025e40
    OVFL: 4   LAST_VAL: 5000
    PMD5  : 0x000000000000b5e6
    Entry 3 PID:1490 CPU:3 STAMP:0x39e28c63ef1b IIP:0x2000000000018490
    OVFL: 4   LAST_VAL: 5000
    PMD5  : 0x000000000001137f
    Entry 4 PID:1490 CPU:3 STAMP:0x39e28c64c6f5 IIP:0x2000000000024f60
    OVFL: 4   LAST_VAL: 5000
    PMD5  : 0x0000000000018a73
    Entry 5 PID:1490 CPU:3 STAMP:0x39e28c6596cb IIP:0x2000000000018490
    OVFL: 4   LAST_VAL: 5000
    PMD5  : 0x00000000000222df
    .....
    

    The first line is the output from the ls command. Next you see the entries extracted from the sampling buffer. Entry 0 is the first entry recorded in this monitoring session. The first line of each sample (entry) shows the fixed header. The fields are as follows:

    • PID: the identity of the process that generated the event
    • CPU: the CPU number on which the event occurred
    • STAMP: a time stamp guaranteed to be unique in time per CPU.
    • IIP: the value of the IP when the event occurred (DANGER, see note below)
    • OVFL: the counter that triggered the recording of the sample (more than one possible).
    • LAST_VAL: the last value loaded into the first counter which overflowed

    VERY IMPORTANT NOTE: users are advised NOT TO TRUST the value reported in IIP. Samples get recorded by forcing a counter overflow, which triggers an interrupt, which in turn causes the kernel to record the information. Because of the parallel nature of the architecture and its implementations, it is very likely that by the time the PMU realizes that there was a counter overflow and generates the interrupt, program execution has progressed well beyond the instruction that caused the event, leading to a skewed IIP. At best, IIP points to the next bundle, given that interrupts can only be delivered at bundle boundaries.
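
    For post-processing, the detailed output shown above can be read back with two regular expressions. The following Python sketch is based only on the sample lines printed in this section; parse_detailed is a hypothetical helper, and other formats or pfmon versions may lay the text out differently.

```python
import re

HEADER_RE = re.compile(
    r"[Ee]ntry\s+(?P<entry>\d+)\s+PID:(?P<pid>\d+)\s+CPU:(?P<cpu>\d+)\s+"
    r"STAMP:(?P<stamp>0x[0-9a-fA-F]+)\s+IIP:(?P<iip>0x[0-9a-fA-F]+)"
)
PMD_RE = re.compile(r"PMD(?P<num>\d+)\s*:\s*(?P<value>0x[0-9a-fA-F]+)")


def parse_detailed(text):
    """Return one dict per sample found in detailed-format output."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        m = HEADER_RE.match(line)
        if m:
            samples.append({
                "entry": int(m["entry"]),
                "pid": int(m["pid"]),
                "cpu": int(m["cpu"]),
                "stamp": int(m["stamp"], 16),
                "iip": int(m["iip"], 16),  # approximate, see the note above
                "pmds": {},
            })
            continue
        m = PMD_RE.match(line)
        if m and samples:
            # recorded PMD values belong to the most recent entry
            samples[-1]["pmds"][int(m["num"])] = int(m["value"], 16)
    return samples


text = """
Entry 0 PID:1490 CPU:3 STAMP:0x39e28c5cf782 IIP:0x2000000000004c70
OVFL: 4
PMD5  : 0x0000000000004708
Entry 1 PID:1490 CPU:3 STAMP:0x39e28c5f8e0a IIP:0x2000000000026ee0
OVFL: 4  LAST_VAL: 5000
PMD5  : 0x0000000000007310
"""
samples = parse_detailed(text)
print(samples[1]["pmds"][5] - samples[0]["pmds"][5])
```

    If PMD5 is read as a running count, which the increasing values in the listing suggest, the difference between two consecutive entries is the number of instructions retired during that sampling interval.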

    After the header, you get the value of PMD5. In our example, this register contains the number of instructions retired. The second event specified by the user DOES NOT necessarily end up in PMD5. To figure out how the events were dispatched among the various PMDs, you can use the --with-header option (described earlier). The header contains a detailed description of the machine and the session. In our case it would look as follows:

    #
    # date: Wed Nov 20 17:00:43 2002
    #
    # hostname: hpljumbo.hpl.hp.com
    #
    # kernel version: Linux 2.4.18 #2 SMP Tue Aug 6 11:54:56 PDT 2002
    #
    # pfmon version: 2.0
    # kernel perfmon version: 1.0
    #
    #
    #
    # page size: 16384 bytes
    # CLK_TCK: 1024 ticks/second
    # CPU configured: 4
    # CPU online: 4
    # physical memory: 6827933696
    # physical memory available: 5598134272
    #
    # host CPUs:  4-way 800MHz Itanium (Merced, C0)
    #       PAL_A: 6.6.23
    #       PAL_B: 7.7.28
    #       Cache levels: 3 Unique caches: 4
    #       L1D:    16384 bytes, line  32 bytes, load_lat   2, store_lat   0
    #       L1I:    16384 bytes, line  32 bytes, load_lat   2, store_lat   0
    #       L2 :    98304 bytes, line  64 bytes, load_lat   6, store_lat   6
    #       L3 :  4194304 bytes, line  64 bytes, load_lat  21, store_lat  21
    #
    #
    # captured events:
    #       PMD4: CPU_CYCLES, user level(s)
    #       PMD5: IA64_INST_RETIRED, user level(s)
    #
    # monitoring mode: per-process
    #
    #
    # instruction sets:
    #       PMD4: CPU_CYCLES, ia32/ia64
    #       PMD5: IA64_INST_RETIRED, ia32/ia64
    #
    #
    # command: ./pfmon --with-header --smpl-output-format=detailed-itanium ...
    #
    #
    #
    #
    # kernel sampling format: 1.0
    # sampling entry size: 56
    #
    # recorded PMDs: PMD5 
    # sampling buffer entries: 2048
    #
    # short sampling rates (base/mask/seed):
    #       CPU_CYCLES 50000
    #       IA64_INST_RETIRED none
    #
    # long sampling rates (base/mask/seed):
    #       CPU_CYCLES 50000
    #       IA64_INST_RETIRED none
    #
    #
    

    In the "captured events" section of the header, you see PMD5: IA64_INST_RETIRED, which confirms that PMD5 holds the count of retired instructions.
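
    When results are processed by a script, the same lookup can be automated. The Python sketch below extracts the PMD-to-event mapping from the "captured events" section of a header like the one above. It relies only on the layout shown in this document, and captured_events is a made-up name.

```python
import re


def captured_events(header_text):
    """Return {pmd_number: event_name} from the 'captured events' section."""
    mapping = {}
    in_section = False
    for line in header_text.splitlines():
        body = line.strip().lstrip("#").strip()
        if body == "captured events:":
            in_section = True
            continue
        if in_section:
            m = re.match(r"PMD(\d+):\s*([A-Z0-9_]+)", body)
            if not m:
                break  # first non-PMD line ends the section
            mapping[int(m.group(1))] = m.group(2)
    return mapping


header = """
# captured events:
#       PMD4: CPU_CYCLES, user level(s)
#       PMD5: IA64_INST_RETIRED, user level(s)
#
# monitoring mode: per-process
"""
print(captured_events(header))
```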

    Pfmon will record the value of the PMD for which the event has no sampling period defined. For our first example, it means that it will record the value of the PMD counting the number of instructions retired. Let us look at a more complicated example using some of the Itanium specific events:

        % pfmon --with-header --short-smpl-periods=50000 --long-smpl-periods=50000 \
          -e cpu_cycles,ia64_inst_retired,l2_misses,cpu_cpl_changes -- ls /dev/null
    

    Here cpu_cycles controls the sampling period, and each sample includes the values of the PMDs counting the other three events: instructions retired (IA64_INST_RETIRED), L2 misses (L2_MISSES), and CPU privilege level changes (CPU_CPL_CHANGES):

    entry 0 PID:18723 CPU:3 STAMP:0x23b06dc011261 IIP:0x2000000000024d40
      PMD OVFL: 4 
      PMD5 : 0x00000000000017d7
      PMD6 : 0x00000000000001de
      PMD7 : 0x0000000000000008
    

    Where the assignments were:

    # captured events:
    #       PMD4: CPU_CYCLES, user level(s)
    #       PMD5: IA64_INST_RETIRED, user level(s)
    #       PMD6: L2_MISSES, user level(s)
    #       PMD7: CPU_CPL_CHANGES, user level(s)
    

    Using the compact format instead of the detailed one, you get results that are formatted such that they can be easily parsed by other tools. The header contains the description of every column:

    # column  1: entry number
    # column  2: process id
    # column  3: cpu number
    # column  4: instruction pointer
    # column  5: unique timestamp
    # column  6: bitmask of PMDs which overflowed
    # column  7: initial value of PMD which overflowed
    # column  8: PMD5
    # column  9: PMD6
    # column 10: PMD7
    

    and the data is formatted as follows:

    When sampling, the counts printed at the end of the session are not very useful, especially for the counters used as sampling periods. Those should be discarded and they are NOT saved in the sampling result file.

    6.4 Sampling in system wide sessions

    Sampling is possible in the same manner for system wide sessions. By default, the buffer is printed on the controlling tty. When sampling on more than one CPU at a time, samples for each CPU will be printed. When sampling results are redirected into a file, then you get one file per CPU. If the file is called 'myresults', then 'myresults.cpu0' contains the samples captured on CPU0, 'myresults.cpu1' the ones from CPU1, and so on.

    The --aggregate-results option also influences the way samples are saved to files. When this option is used, samples are merged into a single file. In our example, they would go into 'myresults'. Unless you use the --smpl-no-entry-header option, every sample will include the CPU information.
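
    A script that collects the results has to know which files to expect. The naming rule described above can be written as a small Python helper; output_files is an invented name and the function only restates the convention from this section.

```python
def output_files(basename, cpus, aggregate=False):
    """List the sampling result files expected for a system wide session."""
    if aggregate:
        # with --aggregate-results, all samples are merged into one file
        return [basename]
    # otherwise there is one file per monitored CPU
    return [f"{basename}.cpu{cpu}" for cpu in cpus]


print(output_files("myresults", [0, 1]))
print(output_files("myresults", [0, 1], aggregate=True))
```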

    6.5 Randomization of sampling periods

    Pfmon supports randomization of both sampling periods. The user must supply a bitmask and a seed value using the --smpl-periods-random option. The same mask and seed apply to both the long and short period for each event. Each event can have a different mask and seed. Two separate invocations of pfmon using the same seed and mask arguments are guaranteed to generate the same "pseudo-random" series of numbers, allowing reproducibility.

    The random value used as the sampling period for each sample is reported in the LAST_VAL field of the detailed output format; in the compact format it appears in one of the columns.

    In the following command, the long (and short) sampling periods are initially set to 100000 and we activate randomization using a seed of 5. The mask indicates that we allow the value to vary between 100000 and 100255 (inclusive):

       % pfmon --smpl-periods-random=0xff:5 --long-smpl-period=100000 \
         -e cpu_cycles -- noploop 1000000000
    
       entry 0 PID:509 CPU:0 STAMP:0xa9b83faf28 IIP:0x4000000000000400
            OVFL: 4  LAST_VAL: 100000
       entry 1 PID:509 CPU:0 STAMP:0xa9b8413a4d IIP:0x4000000000000400
            OVFL: 4  LAST_VAL: 100005
       entry 2 PID:509 CPU:0 STAMP:0xa9b842c532 IIP:0x4000000000000400
            OVFL: 4  LAST_VAL: 100067
       entry 3 PID:509 CPU:0 STAMP:0xa9b8445077 IIP:0x4000000000000400
            OVFL: 4  LAST_VAL: 100181
       entry 4 PID:509 CPU:0 STAMP:0xa9b845db4e IIP:0x4000000000000400
            OVFL: 4  LAST_VAL: 100064
       entry 5 PID:509 CPU:0 STAMP:0xa9b84766b5 IIP:0x4000000000000400
            OVFL: 4  LAST_VAL: 100212
       entry 6 PID:509 CPU:0 STAMP:0xa9b848f1d5 IIP:0x4000000000000400
            OVFL: 4  LAST_VAL: 100140
    

    The randomization is shown in the LAST_VAL field which gives the value loaded into PMD4 (the PMD which overflowed) for each sample. Hence, 100181 is the number of cycles elapsed between entry 2 and entry 3.

    Randomization is important when sampling to avoid getting in lockstep with the execution and thereby collecting biased results.
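
    The effect of the mask and the seed can be illustrated with a Python sketch. It assumes that the period is computed as base + (random & mask), which matches the 100000 to 100255 range stated above, and it uses Python's random.Random as a stand-in generator. It is not the generator used by pfmon or the kernel, so the values will not match the listing above. The name randomized_periods is hypothetical.

```python
import random


def randomized_periods(base, mask, seed, count):
    """Generate `count` periods in [base, base + mask] from a seeded generator."""
    rng = random.Random(seed)  # stand-in generator, not pfmon's
    return [base + (rng.getrandbits(32) & mask) for _ in range(count)]


first = randomized_periods(100000, 0xff, 5, 8)
second = randomized_periods(100000, 0xff, 5, 8)
print(first)
print(first == second)  # same seed and mask give the same series
```

    Masking keeps only the low bits of each random number, so a mask of 0xff can add at most 255 to the base period.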

    6.6 Blocking on overflow notifications

    Whenever the sampling buffer becomes full and pfmon is notified, you have the option of either letting the monitored program continue or blocking it. In both cases, monitoring is off during the processing of the sampling buffer. By default, pfmon lets the program continue its execution. It is possible to block the program using the --overflow-block option. Blocking the program ensures pfmon sees the entire execution. Keeping the program running ensures that the caches and TLB are kept somewhat warm, i.e., with some state belonging to the running process, especially on SMP systems.



© 2009 Hewlett-Packard Development Company, L.P.