Jump to content United States-English
HP.com Home Products and Services Support and Drivers Solutions How to Buy
» Contact HP

hp.com home


pfmon-3.0 common features

» 

HP Labs

» Research
» News and events
» Technical reports
» About HP Labs
» Careers @ HP Labs
» People
» Worldwide sites
» Downloads
Content starts here

Table of contents
  1. Command line options
  2. Getting event information
  3. Basic counting
    1. Simple examples
    2. Privilege levels
    3. Counter formats
    4. Saving counts
    5. Hiding program output
    6. Getting timing information
    7. Testing event combinations
  4. Monitoring running tasks
  5. System wide sessions
    1. Selecting CPUs to monitor
    2. Delimiting a session
    3. Results aggregation
    4. Delayed start
    5. Saving results in files
  6. Monitoring multiple processes/threads
    1. Basic monitoring
    2. Saving results in files
    3. Separating execution between fork()/vfork() and exec()
    4. Selectively monitoring processes
    5. Aggregating results from multiple processes/threads
  7. Triggering monitoring at specific locations
    1. Basic code triggers
    2. Data triggers
    3. Repeating triggers
    4. Inheritance of triggers
    5. Dynamic code stop triggers
  8. Dealing with symbols
    1. Symbol table sources
    2. Symbol lookups
    3. Symbols and the --follow-* options
    4. Symbols resolution and sampling
    5. Symbols resolution examples
  9. Sampling with pfmon
    1. Principles
    2. Sampling periods
    3. Sampling modules
    4. Basic sampling examples
    5. Writing samples to a file
    6. Resetting non-sampling counters
    7. Blocking on overflow notifications
    8. Sampling and the --follow-* options
    9. Sampling in system wide sessions
    10. Randomization of sampling periods
    11. Sampling and symbol resolution example
    12. Miscellaneous sampling options

    1. Pfmon command line options

    Pfmon supports the following list of options on any host system:

    -h, --helpdisplay the list of supported options
    -V, --versiondisplay pfmon version information
    -l[regex]show list of supported events by host PMU
    -i get information about a particular event
    -u, -3, --user-levelmonitor at the user level for all events
    -k, -0, --kernel-levelmonitor at the kernel level for all events
    -1monitor at the privilege level 1 for all events
    -2monitor at the privilege level 2 for all events
    -eev1,ev2,...select events to monitor
    -I, --infoshow pfmon information and detected host CPU
    --verboseprint more information during execution
    --debugenable debug prints
    --outfile=filenamesave counts in a file
    --appendopen outfile in append mode
    --overflow-blockblock when sample buffer is full
    --system-widecreate a system wide monitoring session
    --cpu-listk=x,y-z,...list cpu (number) to include in system wide session
    -S format, --smpl-module-info=formatinfo about a sampling module
    -t secs, --session-timeout=secsduration of the session in seconds
    --smpl-outfile=filenamewrite samples in a file
    --long-smpl-periods=val1,...event long sampling periods
    --short-smpl-periods=val1,...event short sampling periods
    --smpl-periods-random=mask1:seed1,...randomize sampling periods per event
    --smpl-print-countsprint counter results when sampling
    --reset-non-smpl-periodsreset counters not used as sampling periods for each sample
    --saturate-smpl-buffercollect samples only up to the limit of the buffer
    --with-headerput a description header with counts and samples
    --aggregate-resultsaggregate counts and sampling buffer outputs
    --priv-levels=lvl1,...set privilege level per event
    --show-timeshow real, user, system time for the executed command
    --us-counter-formatprint counters using commas (1,024)
    --eu-counter-formatprint counters using points (1.024)
    --hex-counter-formatprint counters in hexadecimal (0x400)
    --smpl-module=nameselect a sampling module
    --symbol-file=filenameuse the ELF archive filename to look for symbols
    --sysmap-file=filenameuse the System.map filename for kernel symbols
    --check-events-onlycheck event combination is valid and exit
    --resolve-addressesattempt to resolve addresses to symbols
    --no-cmd-outputredirect executed command output to /dev/null
    --attach-task=pidattach monitoring session to already running task identified by pid
    --trigger-code-start=addr,...start monitoring when execution reaches any of the addr
    --trigger-code-stop=addr,...stop monitoring when execution reaches any of the addr
    --trigger-data-start=addr,...start monitoring when any data location of the addr is accessed
    --trigger-data-stop=addr,...stop monitoring when any data location of the addr is accessed
    --trigger-code-repeatrepeat code trigger start/stop
    --trigger-data-repeatrepeat data trigger start/stop
    --trigger-code-followcode trigger continues across fork,vfork,pthread_create
    --trigger-data-followdata trigger continues across fork,vfork,pthread_create
    --trigger-data-rodata triggers activated on read-only access
    --trigger-data-wodata triggers activated on write-only access
    --follow-forkmonitoring continues across fork
    --follow-vforkmonitoring continues across vfork
    --follow-pthreadmonitoring continues across pthread_create
    --follow-allunion of --follow-fork,--follow-vfork,--follow-pthread, and --follow-exec
    --follow-exec[=regex]monitoring continues across exec, only is program cmdline matches regex when provided
    --follow-exec-exclude[=regex]similar to --follow-exec, except that monitoring stops on match
    --exec-split-resultsdifferentiate monitoring sessions before and after exec

    2. Getting event information

    The list of events supported by pfmon depends on the host PMU. You can get the list of supported events using the following pfmon option:

       % pfmon -l
       CPU_CYCLES
       IA64_INST_RETIRED
       IA64_TAGGED_INST_RETIRED_PMC8
       IA64_TAGGED_INST_RETIRED_PMC9
       INST_DISPERSED
       EXPL_STOPBITS
       ALL_STOPS_DISPERSED
       IA32_INST_RETIRED
       ISA_TRANSITIONS
       NOPS_RETIRED
       ....
    

    If you specify an argument to the -l option (no space between l and the argument), it is interpreted as a regular expression and all matching events will be listed:

       % pfmon -ll1d
       L1D_READ_FORCED_MISSES_RETIRED
       L1D_READ_MISSES_RETIRED
       L1D_READS_RETIRED
       PIPELINE_FLUSH_L1D_WAYMP_FLUSH
    

    You can get more detailed information about each event using the following option (shown for Itanium2):

       % pfmon -i nops_retired
       Name   : NOPS_RETIRED
       VCode  : 0x50
       Code   : 0x50
       PMD/PMC: [ 4 5 6 7 ]
       Umask  : 0000
       EAR    : No (N/A)
       BTB    : No
       MaxIncr: 6  (Threshold [0-5])
       Qual   : [Instruction Address Range] [OpCode Match] 
       Group  : None
       Set    : None
       Desc   : Retired NOP Instructions
    

    Pfmon is case insensitive for event names. Here you see some details about the event. The first 4 lines are generic and provided on all PMU models even though the codes may vary:

    • Code is the event code used by the PMU.
    • Vcode is a libpfm internal event code which encapsulates the event code and other information describing the type of the event. For simple events, the two codes are usually identical.
    • PMD/PMC: list the counting monitors on which this event can be programmed. Not all events can necessarily be programmed on all available counting monitors. This constraint is taken care of by the libpfm library.

    The remaining information is specific to the Itanium 2 PMU.

    Even with the -i option, you can use a regular expression for the event:

       % pfmon -i'writes$'
       Name   : L2_DATA_REFERENCES_L2_DATA_WRITES
       VCode  : 0x200b2
       Code   : 0xb2
       PMD/PMC: [ 4 5 6 7 ]
       Umask  : 0010
       EAR    : No (N/A)
       BTB    : No
       MaxIncr: 4  (Threshold [0-3])
       Qual   : [Instruction Address Range] [OpCode Match] [Data Address Range] 
       Group  : L2 Cache
       Set    : 1
       Desc   : Data Read/Write Access to L2 -- count only data write and semaphore operations
    

    On some PMU models (currently Itanium2), the events information contains a text description of the event.

    Events can be specified using their code:

       % pfmon -i 0x45
       Name   : L2_INST_PREFETCHES
       VCode  : 0x45
       Code   : 0x45
       PMD/PMC: [ 4 5 6 7 ]
       Umask  : 0000
       EAR    : No (N/A)
       BTB    : No
       MaxIncr: 1  (Threshold 0)
       Qual   : [Instruction Address Range] 
       Group  : None
       Set    : None
       Desc   : L2 Instruction Prefetch Requests
    

    Information about what each event measures can be found in the relevant CPU model specific micro-architecture documentation.

    The architecture requires that only two events be defined by all PMUs:

    • CPU_CYCLES : the number of elapsed CPU cycles.
    • IA64_INST_RETIRED : the number of instructions retired.

    Those two events are guaranteed to exist on all PMU but their codes may vary. The PMU specific event names may not be exactly the same, however, pfmon and especially the library it uses (libpfm) will always ensure that those two events can always be called by the two names listed above. As alluded to earlier, pfmon can support more than one PMU in a single binary. Pfmon also incorporates a generic PMU model which provides only the features defined by the architecture, this includes the two events. If pfmon does not have specific support for the host PMU it will default to the so called 'Generic' PMU support, if compiled in. You can find out what PMU support is compiled into pfmon as follows:

       % pfmon -I
       detected host CPUs:  2-way 1000MHz/3.0MB Itanium 2 (McKinley, B3)
       supported PMU models: [itanium2] [itanium] [generic IA-64] 
       detected host PMU: itanium2
       supported sampling modules: [detailed-itanium2] [raw-ia64] [compact-ia64] [btb] [example-ia64] 
       pfmlib version: 3.0
       kernel perfmon version: 2.0
    

    It is possible to force pfmon to operate in generic mode even though it has support for the host CPU using the pfmon_gen command:

       % pfmon_gen -I
       detected host CPUs:  2-way 1000MHz/3.0MB Itanium 2 (McKinley, B3)
       supported PMU models: [itanium2] [itanium] [generic IA-64] 
       detected host PMU: generic IA-64
       supported sampling modules: [raw-ia64] [compact-ia64] [example-ia64] 
       pfmlib version: 3.0
       kernel perfmon version: 2.0
    
       % pfmon_gen -i CPU_CYCLES
       Name   : CPU_CYCLES
       VCode  : 0x12
       Code   : 0x12
       PMD/PMC: [ 4 5 6 7 ]
    

    The pfmon_gen is not a separate command but just a symlink to pfmon. In fact, pfmon always checks the name it was invoked with. If this name is equal to 'pfmon_gen' and the generic support is compiled in, then pfmon will operate in generic mode. Such feature is useful when moving pfmon to a PMU for which neither pfmon itself nor libpfm have support yet.


    3. Basic counting with pfmon

    In generic mode, pfmon only supports the two architected events listed above. For comparison, the Itanium PMU supports about 230 events and the Itanium2 PMU about 500.

    No instrumentation of the program is required to monitor the system or a single process.

    3.1 Simple examples

    To collect counts on a specific command, you just need to launch it via pfmon, just like you would do with the time or strace command:

       % pfmon ls /dev/null
       /dev/null
           620458 CPU_CYCLES
    

    When invoked with no particular event, pfmon default to CPU_CYCLES. To monitor specific events, you can type:

       % pfmon -e cpu_cycles,IA64_inst_Retired ls /dev/null
       /dev/null
                        436368 CPU_CYCLES       
                        513437 IA64_INST_RETIRED
    

    As you can see, pfmon is not case sensitive with regards to event names. More than one event can be measured at a time using a comma separated list of events. You MUST not have space after the comma.

    If the command you want to run takes options, you can clearly distinguish the options of pfmon from the options of your command using the '--' separator:

       % pfmon -e ia64_inst_retired -- ls -ial /dev/null
       3679352 crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
                        557392 IA64_INST_RETIRED
    

    Otherwise, pfmon will interpret your command's options as pfmon options.

    3.2 Privilege levels

    By default, pfmon monitors only what is going at the user level (application level). This is true for both per-process and system wide mode.

    It is possible to monitor at any of the 4 privilege levels provided by IA-64. It is also possible to monitor at several levels at the same time by specifying more than one level. The levels can be specified for all events or on a per-event basis.

    To affect all events, you can use any combinations of -k (-0), -1, -2, -u (or -3). To set the level for each event, the --priv-levels option must be used.

    By default, pfmon only measures at the user level:

       % pfmon -e nops_retired ls /dev/null
    

    counts the number of NOPS_RETIRED when ls is running at the user level only (equivalent to specifying -u or -3).

       % pfmon -k -e nops_retired ls /dev/null
    

    counts the number of NOPS_RETIRED when ls is running at the kernel level only.

       % pfmon -k -u -e nops_retired ls /dev/null
    

    counts the number of NOPS_RETIRED when ls is running at the kernel level or user level, i.e. all the time.

    It is possible to refine the settings on a per event basis using the --priv-levels option.

       % pfmon -e loads_retired,nops_retired ls /dev/null
    

    Both events are measured at the user level only.

       % pfmon --priv-level=u,k -e loads_retired,nops_retired ls /dev/null
    

    LOADS_RETIRED is measured at the user level only, NOPS_RETIRED at the kernel level only.

       % pfmon --priv-level=,uk -e loads_retired,nops_retired ls /dev/null
    

    LOADS_RETIRED is measured at the user level only, NOPS_RETIRED at the user and kernel levels.

       % pfmon -k --priv-level=uk -e loads_retired,nops_retired ls /dev/null
    

    LOADS_RETIRED is measured at the user and kernel levels, NOPS_RETIRED at the kernel level only.

    3.3 Counter formats

    Pfmon can display the final counts in various formats. There are 4 formats defined. The default one is shown in the example above. To make is easier to read large numbers or to feed the number to other programs.

    By default, you the raw number:

       % pfmon ls -l /dev/null
       crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
                       724882 CPU_CYCLES
    

    --us-counter-format where the thousands, millions, billions are separated with commas (US and UK style):

       % pfmon --us-counter-format ls -l /dev/null
       crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
                       724,882 CPU_CYCLES
    

    --eu-counter-format where the thousands, millions, billions are separated with points (European style):

       % pfmon --eu-counter-format ls -l /dev/null
       crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
                       724.822 CPU_CYCLES
    

    --hex-counter-format where the counts are shown in hexadecimal format:

       % pfmon --hex-counter-format ls -l /dev/null
       crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
                        0x00000000000b0f56 CPU_CYCLES
    
    3.4 Saving counts

    By default, the counts are printed on the controlling tty. However it is possible to save them in a file using the --outfile option:

       % pfmon --outfile=b --hex-counter-format ls -l /dev/null
       crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
       % cat b
            0x00000000000b7cad CPU_CYCLES
    

    It is possible to include a header with the results using the --with-header option. It will be printed on the controlling tty or saved in the output file. The header contains detailed information about the configuration of the host machine and on the monitoring session:

       % pfmon --with-header --outfile=b --hex-counter-format ls -l /dev/null
       crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
       % cat b
       #
       # date: Thu Oct 30 14:45:16 2003
       #
       # hostname: tuff.hpl.hp.com
       #
       # kernel version: Linux 2.6.0-test7 #200 SMP Fri Oct 24 14:11:06 PDT 2003
       #
       # pfmon version: 3.0
       # kernel perfmon version: 2.0
       #
       #
       # page size: 16384 bytes
       #
       # CLK_TCK: 1024 ticks/second
       #
       # CPU configured: 0
       # CPU online    : 2
       #
       # physical memory          : 4232495104 bytes (4036.4 MB)
       # physical memory available: 3303964672 bytes (3150.9 MB)
       #
       # host CPUs:  2-way 1000MHz/3.0MB Itanium 2 (McKinley, B3)
       #	PAL_A: 0.7.31
       #	PAL_B: 0.7.40
       #	Cache levels: 3 Unique caches: 4
       #	L1D:    16384 bytes, line  64 bytes, load_lat   1, store_lat   3
       #	L1I:    16384 bytes, line  64 bytes, load_lat   1, store_lat   0
       #	L2 :   262144 bytes, line 128 bytes, load_lat   5, store_lat   7
       #	L3 :  3145728 bytes, line 128 bytes, load_lat  12, store_lat   7
       #
       #
       # captured events:
       #	PMD4: CPU_CYCLES
       #
       # privilege levels:
       #	PMD4: CPU_CYCLES = user
       #
       # monitoring mode: per-process
       #
       # instruction sets:
       #	PMD4: CPU_CYCLES = ia32/ia64
       #
       #
       # command: pfmon --with-header --outfile=b --hex-counter-format ...
       #
       # command line     : ls -l /dev/null 
       #
       # process id       : 10672
       # thread  id       : 10672
       # parent process id: -1
       # parent thread  id: -1
       # exec count       : 1
       #
       #
       #
            0x00000000000b9176 CPU_CYCLES
    
    3.5 Hiding program output

    It is possible to redirect the output of the programs you execute under pfmon to /dev/null. This is desirable if you are interested in keeping a clean output and are not using an output file. By default, programs share the same terminal output as pfmon. You can use the --no-cmd-output option to eliminate the output.

    Normal output, we see the result of the "ls" command:

       % pfmon --us-counter-format ls -l /dev/null
       crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
                       724,882 CPU_CYCLES
    

    Using the option, you get:

       % pfmon --no-cmd-output --us-counter-format ls -l /dev/null
                       724,882 CPU_CYCLES
    
    3.6 Getting timing information

    It is possible to get a time breakdown of the execution of the monitored command for both per-process and system-wide mode using the --show-time option. The output is similar to the time(1) command. For instance:

       % pfmon --show-time -e nops_retired ls /dev/null
       /dev/null
                        145158 NOPS_RETIRED
       real 0h00m00.001s user 0h00m00.000s sys 0h00m00.000s
    
    3.7 Testing event combinations

    Sometimes it is handy to check if some events can be measured simultaneously without actually starting the monitoring session. The --check-events-only option of pfmon allows this mode of operation. It checks that the combination is valid and then exits. If the combination is invalid, it prints out the reason and pfmon exits with a return code of 1, otherwise a value of 0 is returned. On Itanium2, for instance, you can try:

       % pfmon --check-events-only -e loads_retired,stores_retired
       event LOADS_RETIRED and STORES_RETIRED cannot be measured at the same time
       % echo $?
       1
    

    Note that in this mode, you do not need to specify a command to execute.


    4. Monitoring running tasks

    This version of pfmon can "attach" to an already running process, monitor for a while and eventually "detach" from the process. This is very useful when monitoring system daemons or workload that are very long to start. The process to attach to is specified using the --attach-task option with a process id as reported by the ps command. You must have permission to access the process for this option to work. It is not possible to attach to kernel threads, such as kswapd, migration, khubd, kjournald even if you are root.

    Let us look at some examples. Suppose we have this long running program called noploop and we want to monitor it once it is started:

       % noploop 1000000000000000 & 
       % ps ax | fgrep noploop
       22731 pts/7    R      1:33 nueh/noploop 1000000000000000
       % pfmon --attach-task=22731
       session interrupted by user
                  120106020603 CPU_CYCLES
    

    This pfmon command attaches to process 22731 and collects CPU_CYCLES in user level until the process terminates. It is possible to interrupt if you want to stop before the process terminates by simply pressing CTRL-C (control C) on your keyboard. This prints the counts so far and detaches from the program which will continue to run.

    It is possible to use a session timeout to delimit the duration of such measurement using the --session-timeout option:

       % pfmon --session-timeout=5 --attach-task=22731
       session timed out
                    4988897963 CPU_CYCLES
    

    It is not possible to use the --session-timeout option with any of the -follow-* option is used. But the --attach-task option can certainely be combined with any of the --follow-* options making it very easy to monitor daemons that fork off dedicated processes such as xinetd.


    5. System wide sessions

    When the --system-wide option is used, pfmon operates in system wide mode. This means that it does not monitor a specific program anymore but instead all the processes that execute on a specific set of CPUs. In this mode, you do no need to specify a command. You do NOT need to be root to create a system wide session.

    A system wide session cannot co-exist with any per-process sessions. But a system wide session can run concurrently with other system wide sessions as long as they do not monitor the same set of CPUs.

    5.1 Selecting CPUs to monitor

    This version of pfmon supports up to 1024 processors. By default, pfmon monitors on ALL online CPUs. It is possible to restrict the set of CPUS using the --cpu-list option.

    If the system has 2 available CPUS:

       % pfmon --system-wide -u -e cpu_cycles,ia64_inst_retired
       <Press ENTER to stop session>
       CPU0                       55797 CPU_CYCLES       
       CPU0                       61656 IA64_INST_RETIRED
       CPU1                       10577 CPU_CYCLES       
       CPU1                       10776 IA64_INST_RETIRED
    

    A system wide session can monitor at any privilege level (kernel, user, or both).

    If you want to restrict to a specific CPU, you can use the --cpu-list command:

       % pfmon --system-wide --cpu-list=1 -u -e cpu_cycles,ia64_inst_retired
       <Press ENTER to stop session>
       CPU1                        1869 CPU_CYCLES       
       CPU1                         512 IA64_INST_RETIRED
    

    The CPU are numbered starting at 0. You can specify multiple CPUs using a comma separated list without spaces:

       % pfmon --system-wide --cpu-list=0,1 -u -e cpu_cycles,ia64_inst_retired
       <Press ENTER to stop session>
       CPU0                       48569 CPU_CYCLES       
       CPU0                       61619 IA64_INST_RETIRED
       CPU1                       10444 CPU_CYCLES       
       CPU1                       10776 IA64_INST_RETIRED
    

    It is also possible to specify ranges of CPUs:

       % pfmon --system-wide --cpu-list=0-1 -u -e cpu_cycles,ia64_inst_retired
       <Press ENTER to stop session>
       CPU0                       48569 CPU_CYCLES       
       CPU0                       61619 IA64_INST_RETIRED
       CPU1                       10444 CPU_CYCLES       
       CPU1                       10776 IA64_INST_RETIRED
    

    Ranges and individual specification can be combined in a single comma-separated list:

       % pfmon --system-wide --cpu-list=0,2-3 -u -e cpu_cycles,ia64_inst_retired
    

    This above example monitors on CPU0, CPU2, and CPU3.

    5.2 Delimiting a system wide session

    There are three ways to delimit a system wide session. By default, the session will terminate when the user presses the <Enter> key. It is also possible to use a timeout expressed in seconds. Finally, the session can also be delimited by the execution of a command. It will start when the command starts and stops when it terminates. Here are some examples:

    Monitor cpu_cycles and instruction retired on the first two CPUs at both user and kernel levels and wait for a keypress to stop:

       % pfmon --cpu-list=0-1 --system-wide -u -k -e cpu_cycles,ia64_inst_retired
       <Press ENTER to stop session>
       CPU0                   325728352 CPU_CYCLES       
       CPU0                   853076156 IA64_INST_RETIRED
       CPU1                   325869036 CPU_CYCLES       
       CPU1                   853618095 IA64_INST_RETIRED
    

    Monitor cpu_cycles and instruction retired on the first two CPUs at both user and kernel levels for 10 seconds, use US counter format:

       % pfmon --us-c --session-timeout=10 --cpu-list=0-1 --system-wide -u -k \
         -ecpu_cycles,ia64_inst_retired
       <session to end in 10 seconds>
       CPU0              10,000,683,919 CPU_CYCLES       
       CPU0              26,189,007,084 IA64_INST_RETIRED
       CPU1              10,000,799,060 CPU_CYCLES       
       CPU1              26,204,014,060 IA64_INST_RETIRED
    

    Monitor cpu_cycles and instruction retired on the first two CPUs at the user level only during the execution of the ls command (here obviously run on CPU0):

       % pfmon --cpu-list=0-1 --system-wide -u -e cpu_cycles,ia64_inst_retired \
         -- ls -l /dev/null
       crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
       CPU0                      802960 CPU_CYCLES       
       CPU0                      581288 IA64_INST_RETIRED
       CPU1                       44363 CPU_CYCLES       
       CPU1                       10666 IA64_INST_RETIRED
    
    5.3 Results aggregation

    It is possible to aggregate counts when monitoring more than one CPU:

       % pfmon --aggregate-results --system-wide -k -e cpu_cycles,ia64_inst_retired
       <Press ENTER to stop session>
                    4577585811 CPU_CYCLES       
                   11991877430 IA64_INST_RETIRED
    

    In which case, the per CPU results are summed. Pfmon does not allow different events to be monitored on different CPUs. For this you can run separate instances of pfmon with a different CPU list, using a command lines similar to:

       % pfmon --session-timeout=10 --cpu-list=0 --system-wide -k -e cpu_cycles &
       % pfmon --session-timeout=10 --cpu-list=1 --system-wide -k -e ia64_inst_retired &
    
    5.4 Delayed start

    Sometimes, it may be useful to delay the activation of monitoring until a certain point in the execution is reached. This is the case when the initialization must not be included in the counts.

    In system-wide mode only, pfmon provides the --trigger-delay-start option which can be used to specify a delay in second before monitoring is activated:

       % pfmon --system-wide --trigger-start-delay=5 
    

    Wait for 5 seconds before acitvating monitoring on all CPUs. You see the line:

       	<press ENTER to stop session>
    

    when it is actually started.

    The --trigger-start-delay option is not support in per-process mode.

    5.5 Saving results in files

    Results can be saved into file(s) for system-wide session using the same --outfile option as for per-process sessions. When results are not aggregated you get one file per cpu. The filename provided via the option is prepended to .cpuX where X is the CPU number:

       % pfmon --outfile=b --cpu-list=0-1 --system-wide -u \
         -e cpu_cycles,ia64_inst_retired -- ls -l /dev/null
       crw-rw-rw-    1 root     root       1,   3 Dec  6  2002 /dev/null
    
       % cat b.cpu0
                           774233 CPU_CYCLES       
                        557904 IA64_INST_RETIRED
       % cat b.cpu1
                            54799 CPU_CYCLES       
                         32728 IA64_INST_RETIRED
    

    You can add a header in each file by specifying the --with-header option.


    6. Monitoring multiple processes/threads

    This feature of pfmon-3.0 is only meaningful in per-process mode.

    6.1 Basic monitoring of multiple processes/threads

    When monitoring complicated workloads it is useful to be able to monitor more than just the first process or thread which typically only do the initialization the bulk of the work being carried out by worker processes/threads. To illustrate the various options offered by pfmon we use the example of the GNU C compiler gcc invoked to compile a trivial "hello.c" program. Several processes are created as part of the compilation:

       % gcc -v hello.c -o hello -O2 -static
       Reading specs from /usr/lib/gcc-lib/ia64-linux/2.96/specs
       gcc version 2.96 20000731 (Debian GNU/Linux IA64 experimental)
       ...
       /usr/lib/gcc-lib/ia64-linux/2.96/cpp0 -lang-c -v -D__GNUC__=2 -D__GNUC_MINOR__=96...
       ...
       /usr/lib/gcc-lib/ia64-linux/2.96/cc1 /tmp/ccM0VFYv.i -mb-step -quiet -dumpbase...
       ...
       as -x -o /tmp/ccMo7exq.o /tmp/ccas6Y5Q.s
       ...
       /usr/lib/gcc-lib/ia64-linux/2.96/collect2 -static -o hello /usr/lib/gcc-lib/...
    

    The edited output reveals that gcc uses a total of 5 processes to compile hello.c.

    Now, let's look at how pfmon can monitor all or some of the processes involved here. By default only the first process is monitored:

       % pfmon --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
                     1,123,406 IA64_INST_RETIRED
    

    How do we know that only one process is monitored? You can enable the verbose mode of pfmon:

       % pfmon --verbose --us-c --no-cmd-out -eia64_inst_retired \
         -- gcc hello.c -o hello -O2 -static
       symbol file for gcc not found
       measuring at user privilege level ONLY
       ...
       [10950] started task: gcc hello.c -o hello -O2 -static 
       follow_exec=n follow_vfork=n follow_fork=n follow_pthread=n
       [10950] monitoring gcc hello.c -o hello -O2 -static ...
       [10950] activating monitoring
       measurements started at Thu Oct 30 17:10:58 2003
       
       [10950] task exited
                        1,123,406 IA64_INST_RETIRED
       [10950] detached
       created tasks        : 1
       maximum tasks        : 1
       maximum active tasks : 1
       measurements completed at Thu Oct 30 17:10:59 2003
    

    Now if we want to monitor all processes, we can use some of the --follow-* option that pfmon offers:

    • --follow-fork : pfmon monitors the child process created via a regular fork(). Pfmon creates a distinct "session" dedicated to the newly created process. By default pfmon does not monitor the child process.
    • --follow-vfork : pfmon monitors the child process created via a vfork(). Pfmon creates a distinct "session" dedicated to the newly created process. By default pfmon does not monitor the child process.
    • --follow-pthread: pfmon monitors the a new pthread is created via pthread_create(). This works for both the old LinuxThreads library and the New Posix Thread Library (NPTL). Pfmon creates a distinct "session" dedicated to the newly created process. By default pfmon does not monitor the new thread.
    • --follow-exec : pfmon monitors the process across the execve() system call. No distinct "session" is created. By default, pfmon does not monitoring across exec().
    • --follow-all : combines all of the above options.

    In case you are not sure which variation of the fork to use, it is best to use --follow-all. Now we run our gcc test case with the --follow-all option:

       % pfmon --follow-all --us-c --no-cmd-out -eia64_inst_retired \
         -- gcc hello.c -o hello -O2 -static
             619,470 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/cpp0 (10973,10973,10971)
          20,312,608 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/cc1 (10974,10974,10971)
           5,498,296 IA64_INST_RETIRED as (10975,10975,10971)
         228,937,523 IA64_INST_RETIRED /usr/bin/ld (10977,10977,10976)
             468,328 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/collect2 (10976,10976,10971)
           1,123,415 IA64_INST_RETIRED gcc (10971,10971,-1)
    

    This time, pfmon reports 6 different results for 6 different processes. But we said there was 5 only? The collect2 process forks the linker ld and this is not reported by the -v option of gcc.

    Let's examine a line of the output:

                     5,498,296 IA64_INST_RETIRED as (10975,10975,10971)
    		 ^         ^                 ^   ^     ^     ^
                     |         |                 |   |     |     |
    		 |         |                 |   |     |     parent process id
    		 |         |                 |   |     thread id (tid)
    		 |         |                 |   process id (pid)
    		 |         |                 command name
    		 |         event name
    		 total count for the process
    

    A parent process id of -1 indicates the first process.

    It is important to keep in mind that for child processes, by default, the "total count" does include execution up to the initial fork, i.e., before the exec:

        m   | ->     vfork() ----> parent goes to wait()
        o   |          |
        n   |          |
        i   |          |
        t   |        execve()
        o   |          |
        r   |        execution of the command
        e   |          |
        d   |->       exit()
    

    The pid vs. tid distinction is only relevant for threads AND when using NPTL. Otherwise both numbers should be identical. The pid, tid, parent pid can be used to re-create the process tree involved with the execution of gcc. Results are reported in order of termination of the process. This is why "gcc" is reported last. The entire command line is not reported, only the first argument, which is the command name.

    You can trace what is going on with the --verbose option:

       % pfmon --verbose --follow-all --us-c --no-cmd-out -eia64_inst_retired \
         -- gcc hello.c -o hello -O2 -static
       symbol file for gcc not found
       measuring at user privilege level ONLY
       ...
       exec-pattern=*
       [11009] started task: gcc hello.c -o hello -O2 -static 
       follow_exec=y follow_vfork=y follow_fork=y follow_pthread=y
       [11009] monitoring gcc hello.c -o hello -O2 -static ...
       [11009] activating monitoring
       measurements started at Thu Oct 30 17:41:49 2003
       
       [11009] vfork [11011]
       [11011] monitoring gcc hello.c -o hello -O2 -static ...
       [11011] activating monitoring
       [11011] exec /usr/lib/gcc-lib/ia64-linux/2.96/cpp0 -lang-c -D__GNUC__=2 -D__G...
       [11011] monitoring /usr/lib/gcc-lib/ia64-linux/2.96/cpp0 -lang-c -D__GNUC__=2...
       [11011] task exited
       619,469 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/cpp0 (11011,11011,11009)
       [11011] detached
       [11009] vfork [11012]
       [11012] monitoring gcc hello.c -o hello -O2 -static ...
       [11012] activating monitoring
       [11012] exec /usr/lib/gcc-lib/ia64-linux/2.96/cc1 /tmp/ccOmxEc2.i -mb-step -q...
       [11012] monitoring /usr/lib/gcc-lib/ia64-linux/2.96/cc1 /tmp/ccOmxEc2.i -mb-s...
       [11012] task exited
        20,312,608 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/cc1 (11012,11012,11009)
       [11012] detached
       [11009] vfork [11013]
       [11013] monitoring gcc hello.c -o hello -O2 -static ...
       [11013] activating monitoring
       [11013] exec as -x -o /tmp/ccE7nXIP.o /tmp/cciqWksW.s ...
       [11013] monitoring as -x -o /tmp/ccE7nXIP.o /tmp/cciqWksW.s ...
       [11013] task exited
       5,498,296 IA64_INST_RETIRED as (11013,11013,11009)
       [11013] detached
       [11009] vfork [11014]
       [11014] monitoring gcc hello.c -o hello -O2 -static ...
       [11014] activating monitoring
       [11014] exec /usr/lib/gcc-lib/ia64-linux/2.96/collect2 -static -o hello /usr/...
       [11014] monitoring /usr/lib/gcc-lib/ia64-linux/2.96/collect2 -static -o hello...
       [11014] vfork [11015]
       [11015] monitoring /usr/lib/gcc-lib/ia64-linux/2.96/collect2 -static -o hello...
       [11015] activating monitoring
       [11015] exec /usr/bin/ld -static -o hello /usr/lib/gcc-lib/ia64-linux/2.96/.....
       [11015] monitoring /usr/bin/ld -static -o hello /usr/lib/gcc-lib/ia64-linux/2...
       [11015] task exited
       228,937,523 IA64_INST_RETIRED /usr/bin/ld (11015,11015,11014)
       [11015] detached
       [11014] task exited
       468,328 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/collect2 (11014,11014,11009)
       [11014] detached
       [11009] task exited
       1,123,406 IA64_INST_RETIRED gcc (11009,11009,-1)
       [11009] detached
       created tasks        : 6
       maximum tasks        : 3
       maximum active tasks : 3
       measurements completed at Thu Oct 30 17:41:49 2003
    

    The trace reveals that vfork() is used to create the new processes, therefore we could have used --follow-vfork and --follow-exec to obtain the same results.

    6.2 Saving results in files

    It is possible to use the --outfile option when you are using any of the --follow-* options. There will be one file per process/thread, really per task in Linux terms unless you decide to aggregate the results using the --aggregate-results options.

    Using our gcc compilation example, we can now do:

       % pfmon --outfile=result --follow-all --us-c --no-cmd-out -eia64_inst_retired \
         -- gcc hello.c -o hello -O2 -static
       $ ls -l result.*
       -rw-r--r--    1 eranian  users          45 Nov  4 15:28 result.1931.1931
       -rw-r--r--    1 eranian  users          45 Nov  4 15:28 result.1933.1933
       -rw-r--r--    1 eranian  users          45 Nov  4 15:28 result.1934.1934
       -rw-r--r--    1 eranian  users          45 Nov  4 15:28 result.1935.1935
       -rw-r--r--    1 eranian  users          45 Nov  4 15:28 result.1936.1936
       -rw-r--r--    1 eranian  users          45 Nov  4 15:28 result.1937.1937
    

    There is one file per process created. The naming convention used by pfmon is as follows:

           result.1937.1937
           ^      ^    ^
           |      |    |
           |      |    thread id
           |      process id
           base filename
    

    Unless you are using threads (LinuxThreads or NPTL) both the process id and the thread id should be identical. You can use the --with-header option to get detailed header in each output file.

       % pfmon --with-header --outfile=result --follow-all --us-c --no-cmd-out \
         -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
       % ls -l result.*
       -rw-r--r--    1 eranian  users        1416 Nov  4 15:32 result.2791.2791
       -rw-r--r--    1 eranian  users        1816 Nov  4 15:32 result.2793.2793
       -rw-r--r--    1 eranian  users        1497 Nov  4 15:32 result.2794.2794
       -rw-r--r--    1 eranian  users        1428 Nov  4 15:32 result.2795.2795
       -rw-r--r--    1 eranian  users        1790 Nov  4 15:32 result.2796.2796
       -rw-r--r--    1 eranian  users        1760 Nov  4 15:32 result.2797.2797
       % cat result.2797.2797
       ...
       #
       # command line     : /usr/bin/ld -static -o hello /usr/lib/...
       #
       # process id       : 2797
       # thread  id       : 2797
       # parent process id: 2796
       # parent thread  id: 2796
       # exec count       : 1
    

    The "command line" line shows the program for this task with represents the invocation of the linker, ld. The following set of lines shows the pid/tid and parent information.

    6.3 Separating execution between fork()/vfork() and exec()

    Pfmon can differentiate the execution that happens between the fork/vfork and exec using the --exec-split-result option.

    If we use our gcc example from the previous section, here is what happens without the option:

       % pfmon --follow-all --us-c --no-cmd-out -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
             619,470 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/cpp0 (10973,10973,10971)
          20,312,608 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/cc1 (10974,10974,10971)
           5,498,296 IA64_INST_RETIRED as (10975,10975,10971)
         228,937,523 IA64_INST_RETIRED /usr/bin/ld (10977,10977,10976)
             468,328 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/collect2 (10976,10976,10971)
           1,123,415 IA64_INST_RETIRED gcc (10971,10971,-1)
    

    With the option:

       % pfmon --exec-split-result --follow-all --us-c --no-cmd-out -eia64_inst_retired \
         -- gcc hello.c -o hello -O2 -static
               94 IA64_INST_RETIRED gcc (11041,11041,11039)
          619,376 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/cpp0 (11041,11041,11039)
               94 IA64_INST_RETIRED gcc (11042,11042,11039)
       20,312,514 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/cc1 (11042,11042,11039)
            2,203 IA64_INST_RETIRED gcc (11043,11043,11039)
        5,496,093 IA64_INST_RETIRED as (11043,11043,11039)
               94 IA64_INST_RETIRED gcc (11044,11044,11039)
              154 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/collect2 (11045,11045,11044)
      228,937,369 IA64_INST_RETIRED /usr/bin/ld (11045,11045,11044)
          468,234 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/collect2 (11044,11044,11039)
        1,123,406 IA64_INST_RETIRED gcc (11039,11039,-1)
    

    This time we have more counts reported, but that does not mean we had more processes created just because of our measurement. It simply shows a finer grain breakdown of the same measurement. It is fairly easy to rebuild the process tree:

                     gcc (11039,11039,-1)
           vfork ->        gcc (11041,11041,11039)
           exec  ->        /usr/lib/gcc-lib/ia64-linux/2.96/cpp0 (11041,11041,11039)
    
           vfork ->        gcc (11042,11042,11039)
           exec  ->        /usr/lib/gcc-lib/ia64-linux/2.96/cc1 (11042,11042,11039)
    
           vfork ->        gcc (11043,11043,11039)
           exec  ->        as (11043,11043,11039)
    
           vfork ->        gcc (11044,11044,11039)
           exec  ->        /usr/lib/gcc-lib/ia64-linux/2.96/collect2 (11044,11044,11039)
           vfork ->            /usr/lib/gcc-lib/ia64-linux/2.96/collect2 (11045,11045,11044)
           exec ->        	   /usr/bin/ld (11045,11045,11044)
    

    This measurement reveals that gcc does not do anything between vfork and exec for the pre-processor, and cc1 but does more before invoking as.

    When results are saved into a file, pfmon automatically adds an extra field to the file name when the --exec-split-results option is used. This is necessary to distinguish between the various parts of the execution of the task because the pid and the tid remain the same across exec:

       % pfmon --exec-split-result --outfile=qq --follow-all --us-c --no-cmd-out \
         -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
       % ls -l qq.*
       -rw-r--r--    1 eranian  users          45 Nov  4 17:10 qq.20602.20602.1
       -rw-r--r--    1 eranian  users          45 Nov  4 17:10 qq.20604.20604.0
       -rw-r--r--    1 eranian  users          45 Nov  4 17:10 qq.20604.20604.1
       -rw-r--r--    1 eranian  users          45 Nov  4 17:10 qq.20605.20605.0
       -rw-r--r--    1 eranian  users          45 Nov  4 17:10 qq.20605.20605.1
       -rw-r--r--    1 eranian  users          45 Nov  4 17:10 qq.20606.20606.0
       -rw-r--r--    1 eranian  users          45 Nov  4 17:10 qq.20606.20606.1
       -rw-r--r--    1 eranian  users          45 Nov  4 17:10 qq.20607.20607.0
       -rw-r--r--    1 eranian  users          45 Nov  4 17:10 qq.20607.20607.1
       -rw-r--r--    1 eranian  users          45 Nov  4 17:10 qq.20608.20608.0
       -rw-r--r--    1 eranian  users          45 Nov  4 17:10 qq.20608.20608.1
    

    Notice, the third field in the filenames:

           qq.20604.20604.1
           ^  ^     ^     ^
           |  |     |     |
           |  |     |     exec number
           |  |     thread id
           |  process id
           base filename
    

    The process 20604 did exec once. The qq.20604.20604.0 contains the results for the initial portion of the execution of the process, i.e., before the first exec. The file qq.20604.20604.1 contains the results for the portion of execution after the first exec.

    As for the initial process, gcc in this case, it shows up as qq.20602.20602.1. The exec count is always 1 for this program because we count the initial exec from the pfmon tool (after the vfork).

    It is very helpful to use the --with-header option to record detailed information about the results. For instance, in the example above:

       % cat qq.20604.20604.0
       ...
       # command line     : gcc hello.c -o hello -O2 -static
       #
       # process id       : 20604
       # thread  id       : 20604
       # parent process id: 20602
       # parent thread  id: 20602
       # exec count       : 0
       #
       ...
    

    This represents the counts accumulated between the moment the parent gcc forked off the child and the moment the child called exec. Looking at the other file:

       % cat qq.20604.20604.1
       ...
       # command line     : /usr/lib/gcc-lib/ia64-linux/2.96/cpp0 -lang-c ...
       #
       # process id       : 20604
       # thread  id       : 20604
       # parent process id: 20602
       # parent thread  id: 20602
       # exec count       : 1
       ...
    

    This corresponds to the execution of the pre-processor, cpp0, with is exec'ed by the child of gcc. Notice the exec count is now 1.

    6.4 Selectively monitoring processes

    When using the --follow-exec option it is possible to specify a regular expression pattern to filter out some part of execution. This is best explain with an example. If we use our gcc example once again and suppose we only care about the results for the assembler, "as", then we can use the following:

       % pfmon --follow-all  --follow-exec="^as" --us-c --no-cmd-out \
         -eia64_inst_retired -- gcc hello.c -o hello -O2 -static
               5,496,093 IA64_INST_RETIRED as (11065,11065,11061)
    

    The "^" before as is to ensure that we match the pattern "as" only at the beginning of the command line.

    Pfmon only measures and reports execution of the "as process". Of course, here one has to be careful by what is meant by "as process". As we saw in the previous section, the "as process" starts its life as a "gcc process" that eventually execs the assembler, as. In this particular example, pfmon did not have monitoring enabled at the time the "gcc process" is created, therefore monitoring ONLY starts at the exec.

    If you want to include more than one command and cannot easily express this with a regular expression, you can use the opposite command to --follow-exec which is --follow-exec-exclude. Commands that match are not monitored. In our gcc example, suppose we want to measure everything except gcc:

     
       % pfmon --follow-all  --follow-exec-exclude="^gcc" --us-c --no-cmd-out -eia64_inst_retired \
         -- gcc hello.c -o hello -O2 -static
           619,376 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/cpp0 (11081,11081,11079)
        20,312,513 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/cc1 (11082,11082,11079)
         5,496,093 IA64_INST_RETIRED as (11083,11083,11079)
       228,939,036 IA64_INST_RETIRED /usr/bin/ld (11085,11085,11084)
           468,234 IA64_INST_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/collect2 (11084,11084,11079)
    

    These options may seem frivolous at first because a simple grep of the full output would yield the same results but they become particularly useful when pfmon is used for sampling. Also by not monitoring when not necessary we limit the overhead.

    6.5 Aggregating results from multiple processes/threads

    By default, pfmon produces output for every actively monitored task followed. You may want to only get the aggregate results. In this case, you simply need to use the --aggregate-results and pfmon will add up all the counts per event for all tasks.

    Let us take our compilation example and use the LOADS_RETIRED event to gain some more stability in the counts:

       % pfmon --follow-all -eloads_retired -- gcc hello.c -o hello -O2 -static
               80598 LOADS_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/cpp0 (17979,17979,17977)
             1818537 LOADS_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/cc1 (17980,17980,17977)
              535503 LOADS_RETIRED as (17981,17981,17977)
            25453153 LOADS_RETIRED /usr/bin/ld (17983,17983,17982)
               61333 LOADS_RETIRED /usr/lib/gcc-lib/ia64-linux/2.96/collect2 (17982,17982,17977)
              115971 LOADS_RETIRED gcc (17977,17977,-1)
    

    Now with aggregation:

       % pfmon --aggregate-results --follow-all -eloads_retired -- gcc hello.c -o hello -O2 -static
                      28065993 LOADS_RETIRED
    

    7. Triggering monitoring at specific locations

    Pfmon allows you to control where monitoring is started and stopped using code or data trigger addresses. By default pfmon starts monitoring at the first instruction of a program and stop when the programs terminates, i.e., the process exits. This feature exists on all platform and does not use the PMU hardware but the debug registers. As such it introduces more overhead and should be use carefully, i.e. the section of the execution where monitoring is active must be long enough compared to the overhead.

    This ability to control where monitoring starts/stops is only available in per-process mode.

    7.1 Basic code triggers

    You can start or stop monitoring when execution reaches a certain instruction in your program. The instruction is identified by its address. For code, the address must be aligned on a bundle boundary (16-byte aligned). There is no particular checks made on the address, pfmon does not try to validate the address inside the process user level address space. However pfmon rejects all addresses referring to kernel code. This feature cannot be used to trigger monitoring for kernel level execution. This limitation is due to the fact that triggers are implemented using the debug registers and it is illegal to set a breakpoint in the kernel.

    You can set up to four code trigger start and stop. For instance you can have two start triggers and two stop triggers, but they do not need to be evenly balanced, i.e., one start and three stops is perfectly legal. You specify the start triggers using --trigger-code-start followed by a comma separated list of addresses or symbol names. Do not use space between the element of the list. Similarly stops are expressed using --trigger-code-stop as a comma-separated list of addresses or symbols.

    Supposing we set the start trigger at the beginning of a function, and if this function calls others, monitoring will stay active until we come across a stop trigger or the process terminates.

    By default start and stop triggers are "executed" only once.

    Let us look at a simple example with a stupid noploop program:

       % nm noploop | fgrep "T main"
       40000000000004c0 T main
    

    Count the number of load retired instruction for the entire execution:

       % pfmon --no-cmd-out -eloads_retired noploop 100000
                          1340 LOADS_RETIRED
    

    Count the number of load retired instruction starting at main():

       % pfmon --trigger-code-start=0x40000000000004c0 --no-cmd-out \
         -eloads_retired noploop 100000
                           552 LOADS_RETIRED
    

    Monitoring start at the first instruction in main() and stop when noploop executes its last user level instruction (we only monitor at the user level here). From the two runs, we can figure out the number of loads executed by the startup code: 1340-552 = 788. We can verify this by simply stopping when we reach main():

       % pfmon --trigger-code-stop=0x40000000000004c0 --no-cmd-out -eloads_retired noploop 100000
                           788 LOADS_RETIRED
    

    As you can see here, you do not need to match a start with a stop.

    Instead of using addresses, you can use symbols names given that the symbol table is still available in the binary. So our last example can be expressed as follows:

       % pfmon --trigger-code-stop=main --no-cmd-out -eloads_retired noploop 100000
                           788 LOADS_RETIRED
    
    7.2 Data triggers

    Data triggers work very much like ode triggers but they apply only to data accesses, i.e., load, store, synchronization instructions. By default the trigger operates on read or write access. The start triggers are specified with the --trigger-data-start option and the stop triggers with the --trigger-data-stop option. Both are comma-separated lists of data addresses or symbols. The trigger is activated when the byte pointed to by the address is accessed as part of any size of access, it could be a ld1, ld2, ld4, ld8 for instance.

    When using symbols, the names refer to global variables in the main program when linked dynamically or for the entirety of the program when linked statically, i.e., no dynamic symbol lookup is performed.

    It is possible to refine the type of access that we active the triggers. By default they are activated on read OR write. You can restrict this with the following options:

    • --trigger-data-ro : activate on read access only
    • --trigger-data-wo : activate on write access only
    7.3 Repeating triggers

    In certain cases it is useful to repeat the start/stop trigger sequence each time the program comes across the trigger points. In the case of a code trigger, that is how you could accumulate a measurement specific to a function, for instance. By default pfmon only uses each trigger once. You can repeat the code triggers using --trigger-code-repeat and ALL code start/stop trigger will be repeated. Similarly using --trigger-data-repeat will keep the data trigger installed throughout the execution.

    7.4 Inheritance of triggers

    By default, triggers are not inherited across fork, vfork, pthread_create. Triggers are NEVER inherited across exec(). By inheritance, we mean are not reinstalled into the newly created task. Trigger inheritance is possible using the --trigger-code-follow and --trigger-data-follow options.

    7.5 Dynamic code stop triggers

    Many times, code trigger are used to measure only one function and the one it calls. However many function can be called from difference places and it is often hard to locate the return point where the stop trigger should be placed. Pfmon can dynamically figure out the return point of a function. When the code start and stop trigger are identical, pfmon interpret the options as a request to monitor when execution comes into a function until it leaves the function, i.e., return from the function. Pfmon does this using a best-effort approach that is not 100% guaranteed. Pfmon does not check that you place the start trigger at the beginning of a function. To determine the location where to place the stop trigger, pfmon uses the value store in the return pointer (rp) at the time execution reaches the start trigger. Hence, if the start trigger is placed at the first instruction of the function, then the rp contains the place when the function will eventually return to. Note that this location is not necessarily one bundle after where it was called, especially for leave procedures. This feature is not 100% guaranteed and as such is reserved to the experts.

    Here is a simple example using our noploop program:

       % pfmon --trigger-code-start=noploop --trigger-code-stop=noploop \
         -e loads_retired noploop 100000000
                                0 LOADS_RETIRED
    

    The assembly code for the noploop function inside our noploop program is as follows:

       4000000000000390 :
       4000000000000390:       [MII]       nop.m 0x0
       4000000000000396:                   mov.i r2=ar.lc
       400000000000039c:                   mov.i ar.lc=r32;;
       40000000000003a0:       [MIB]       adds r32=-1,r32
       40000000000003a6:                   nop.i 0x0
       40000000000003ac:                   br.cloop.sptk.few 40000000000003a0 ;;
       40000000000003b0:       [MIB]       mov r8=r32
       40000000000003b6:                   mov.i ar.lc=r2
       40000000000003bc:                   br.ret.sptk.many b0;;
    

    There is no load instruction in this function and notice that there is no alloc instruction either. noploop() is called from main(). To figure out what is going on, you can turn on verbose mode:

       % pfmon --verb --no-cmd-out --trigger-code-start=noploop --trigger-code-stop=noploop \
         -e loads_retired noploop 100000000
       loaded 1353 symbols from ELF file noploop
       measuring at user privilege level ONLY
       start code trigger @0x4000000000000390
       stop  code trigger @0x4000000000000390
       ...
       exec-pattern=*
       [12760] started task: noploop 100000000 
       follow_exec=n follow_vfork=n follow_fork=n follow_pthread=n
       [12760] monitoring noploop 100000000 ...
       [12760] installed start code breakpoint at 0x4000000000000390
       measurements started at Fri Oct 31 17:18:52 2003
    
       [12760] reached start code breakpoint @0x4000000000000390
       [12760] install br.ret breakpoint @0x40000000000004e0
       [12760] activating monitoring at trigger start
       [12760] clearing code breakpoint @0x4000000000000390
       [12760] reached stop  code breakpoint @0x40000000000004e0
       [12760] stopping monitoring at trigger stop
       [12760] clearing code breakpoint @0x40000000000004e0
       [12760] task exited
                             0 LOADS_RETIRED
       [12760] detached
       created tasks        : 1
       maximum tasks        : 1
       maximum active tasks : 1
       measurements completed at Fri Oct 31 17:18:52 2003
    

    The key line in the trace is the one that shows the dynamically extracted stop trigger:

          [12760] install br.ret breakpoint @0x40000000000004e0
    

    When the start trigger is reached at 0x4000000000000390, i.e., the beginning of noploop(), pfmon queries the kernel to get the return pointer (rp), and it gets 0x40000000000004e0 which is the address in main() where noploop() will return to:

       40000000000004cc:                   mov b6=r15
       40000000000004d0:       [MIB]       ld8 r1=[r14]
       40000000000004d6:                   nop.i 0x0
       40000000000004dc:                   br.call.sptk.many b0=b6;;
       40000000000004e0:       [MII]       mov r1=r38
    

    In this program noploop() is called via an indirect function call.

    Of course, this feature relies upon having a valid return pointer at the time we reach the start trigger. Pfmon does not verify this at all, so be careful when using this feature!


    8. Dealing with symbols

    Whenever an option takes an address (code or data) as argument, it is possible to directly use a symbol name rather than an address. For instance, this is true for the --trigger-code-*, --trigger-data-* options. Other examples include the Itanium2 options, --drange, --irange options.

    8.1 Symbol table sources

    Pfmon can extract the symbol table from an ELF image directly. Pfmon also understands the System.map format which is typically used to save the symbol table of the kernel. In addition pfmon now understands the format of the kernel symbols exported via /proc/kallsyms which is standard in the 2.6 kernel series. Note however, that this file does not list the kernel data symbols but it has the advantage of always being available and it is always matching the host kernel.

    8.2 Symbol lookups

    In per-process mode, the symbols are automatically extracted from the command being run. When the command does not have the symbol table it is possible to specify an alternate ELF file to look for the symbol, using the --symbol-file command. This other ELF file must be the non-stripped version of the command being run, of course!

    There are a few restrictions concerning symbols in per-process mode. Pfmon cannot extract symbol information coming from dynamically linked libraries or modules. To work around this problem, the program must be statically linked and should not explicitly use dl_open().

    If the symbol table has been stripped, pfmon will not find any symbol and you should revert to using a numerical address.

    In case the option requires a code address, pfmon will only look for matching code symbols. Conversely, if the option requires a data address, pfmon will only look for matching data symbols. The kernel symbols exported via /proc/kallsyms do no include data symbols, use System.map for this instead.

    For the kernel, pfmon automatically load the content of /proc/kallsyms, unless the user specified the --sysmap-file option which supersedes it. The option must point to a valid System.map file that matches the host kernel.

    8.3 Symbols and the --follow-* options

    In per-process mode, symbol lookups are done before the command is actually launched therefore it pertains only to the initial program. Hence, it is not possible to refer to the name of a symbol from a program that will eventually execute. For instance, it is not possible to name a symbol from the assembler "as" when monitoring a compilation with GNU c (gcc).

    8.4 Symbol resolution when sampling

    It is possible to resolve all addresses present in samples when using the --resolve-addresses option AND when the sampling module honors the option. The [detailed-itanium] and [detailed-itanium2] built-in sampling modules do honor the option but the [compact-ia64] does not.

    With this option, pfmon will try to resolve addresses it finds in each sample. This includes the IIP but possibly other addresses contained in model-specific PMDs. For instance, pfmon resolves the address contained in PMD17 for D-EARS on Itanium and Itanium2. Resolution occurs for kernel level symbols and the user application if a symbol table was found. When no matching symbol exists, the raw address is printed instead. The location of the symbol, either kernel or application (app) is prefixed to the symbol name.

    In system-wide, only kernel symbols are resolved.

    8.5 Symbol resolution examples

    Here are a few examples on Itanium or Itanium2:

    Count the number of time main() is called in the noploop program:

        % file noploop
        noploop: ELF 64-bit LSB executable, IA-64, version 1, statically linked, not stripped
        % pfmon --no-cmd-out --checkpoint-func=main -e ia64_inst_retired noploop 10000
                             3 IA64_INST_RETIRED
    

    Here the symbol information for main() is directly extracted from noploop itself. You must divide by 3 (because of 3 were executed in the first bundle) to get the actual result.

    Count the number of time main() is called in the noploop-stripped program:

        % file noploop-stripped
        noploop-stripped: ELF 64-bit LSB executable, IA-64, version 1, statically linked, stripped
    
        % pfmon --no-cmd-out --checkpoint-func=main \
          -eia64_inst_retired noploop-=stripped 10000
        symbol not found: main
    

    Let us provide the non-stripped version for the symbol table:

        % pfmon --no-cmd-out --symbol-file=noploop --checkpoint-func=main \
          -e ia64_inst_retired noploop-stripped 1000
                                 3 IA64_INST_RETIRED
    

    Here noploop and noploop-stripped are the same programs except that the latter does not have the symbol table anymore.

    Count the number of times sys_getpid() is called during the execution of the noploop program:

        % pfmon -k --checkpoint-func=sys_getpid -e ia64_inst_retired noploop 1000
                                 0 IA64_INST_RETIRED
    

    The -k is required to make sure we monitor at the kernel level where sys_getpid() is located. The results shows that noploop never calls getpid(). We can easily verify that this is correct using the strace program:

        % strace noploop 1000
        execve("./noploop", ["noploop", "100000"], [/* 24 vars */]) = 0
        uname({sys="Linux", node="tuff.hpl.hp.com", ...}) = 0
        brk(0)                                  = 0x6000000000010000
        brk(0x6000000000034000)                 = 0x6000000000034000
        fstat(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 1), ...}) = 0
        mmap(NULL, 65536, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x2000000000000000
        write(1, "143663 cycles, 1.44 cycles/loop\n", 32143663 cycles, 1.44 cycles/loop) = 32
        munmap(0x2000000000000000, 65536)       = 0
        exit(0)                                 = ?
    

    But if we use a program that does getpid() in a loop:

        % pfmon --no-cmd-out -k --checkpoint-func=sys_getpid -e ia64_inst_retired getpid 1000
                          3003 IA64_INST_RETIRED
    

    This one does 1000+1 (3003/3) calls, obviously a 1000 from the loop plus an extra one somewhere else.


    9. Sampling with pfmon

    Pfmon has support for sampling on any events or combination of events. Samples are collected into a buffer which can then be written to a file or simply on the screen. Sampling is supported for both system-wide mode and per-process mode including when following fork, vfork and pthread_create.

    Pfmon can sample is many different ways depending on the underlying CPU or what is being measured. For a given sampling measurement, pfmon can present the information differently. For instance it can be presented in clear text with decoding of certain timings, or it can be dump as a table with the values uninterpreted. What control the output is called a pfmon sampling module and we detail what they do in the next sections, but first we need to introduce what sampling with pfmon and Linux/ia64 perfmon-2 means.

    9.1 Principles

    Sampling is a technique to record certain information at some interval. The information is recorded into a sample which is stored into a sampling buffer. The interval can be time-based or more likely event-based and the length can be fixed or can vary. A typical example is the case where you want to record where you are (IP) each time you have executed one million instructions.

    Each sample is composed of two parts, a fixed size header which contains general information about the sample and a variable size body which consists of a set of 64-bit values each one corresponding to a PMD register representing the other events being monitored. All samples record the same set of PMDs, this set is determined by pfmon based on what is requested by the user.

    The sampling buffer is controlled by the kernel but its size is configurable. Depending on the sampling module used in pfmon, the default size of the buffer can vary. The common module use a size of 2048 entries. Each sampling module may offer options to change that size.

    Sampling works as follows:

    1. the user specifies what needs to be recorded.
    2. the user specifies the sampling period and optional randomization parameters.
    3. at the end of a period, a sample is recorded into the buffer by the kernel.
    4. a new sampling period is reloaded and execution/monitoring resumes. we go back to step 3.
    5. if the sampling buffer becomes full, pfmon is notified.
    6. pfmon processes the buffer, i.e., prints and/or saves the buffer depending on the sampling module.
    7. pfmon then notifies the kernel that it is done.
    8. the kernel reload new sampling periods and execution/monitoring resumes. we go back to 3.

    Pfmon and the kernel perfmon-2.x subsystem allow monitoring sessions with multiple sampling periods. In case where more than one sampling periods expire at the same time, there is one sample per expired period recorded, thereby multiple samples recorded, subject to the space left in the buffer.

    9.2 Sampling periods

    Pfmon (and the perfmon-2.x ) uses two sampling periods instead of just one. The first one is called short-smpl-period and the second is called long-smpl-period. The short-smpl-period is used in step 4, this is when the sampling buffer is not full after writing the sample. The long-smpl-period is used in step 8 when the reload occurs after the buffer became full.

    But why do we need 2 periods?

    As you might imagine there is some overhead when recording a sample. This overhead is increased even more when pfmon needs to get involved to drain the buffer. This operation can take some time and will inevitably introduce some "noise" in the measurements in the form of TLB and/or cache pollution. To try and hide this "noise", it is sometimes beneficial to adjust the sampling period, i.e., make it bigger to ensure that the next sample will not record an event that is the consequence of the overhead generated by the monitoring but rather a normal event for the program/system being monitored. It is expect that the long-smpl-period >= short-smpl-period. Of course if the two are equal, this is equivalent to having only one sampling period. Note that the long-smpl-period is only used for the first sample recorded after the buffer is marked as empty again (step 7). Pfmon also use the long period first the initial period, i.e., for the first sample of the measurement.

    The long sampling period is specified via the --long-smpl-periods options and similarly --short-smpl-period is used for the short sampling period.

       % pfmon --long-smpl-periods=5000 --short-smpl-period=2000 -e l3_misses ....
    

    In the command line above, we take a sample every 2000 L3 misses at the user level. When the buffer becomes full, pfmon processes the data and restart monitoring. The next sample will come after 5000 L3 misses, subsequent samples will be 2000 misses apart.

    Pfmon supports multiple sampling periods. Each time one period elapse a sample is recorded. Hence with multiple sampling periods you can, in effect, create an OR condition: after N occurrences of event A OR M occurrences of event B record a sample. The number of sampling period is limited by the number of counting monitors supported by the underlying PMU. On Itanium and Itanium2 the limit is 4 counters, thus 4 sampling periods.

    When multiple sampling periods are used, they must be specify as a comma-separated list (no space). They apply in order tto the events specified with the -e option:

       % pfmon --long-smpl-periods=4000,3000 --short-smpl-periods=2000,1000 -el2_misses,l3_misses ...
    

    Here, the long and short periods and l2_misses are respectively 4000 and 2000. For l3_misses they are 300 and 1000 respectively.

    At a minimum, You must specify either one of the two types of sampling periods. Pfmon automatically adjusts the other one to be equal in that case. Hence the following:

       % pfmon --long-smpl-periods=4000 -el2_misses ...
    

    implies that the short period for l2_misses is 4000. You can see this with the --verbose option:

       % pfmon --verbose --long-smpl-periods=4000 -el2_misses ...
       loaded 10806 symbols from /proc/kallsyms
       loaded 1353 symbols from ELF file noploop
       measuring at user privilege level ONLY
       using detailed-itanium2 sampling module
       long  sampling periods(val/mask/seed): 2000/0x0/0
       short sampling periods(val/mask/seed): 2000/0x0/0
       ...
    
    9.3 Sampling modules

    There are many ways in which the samples can be saved or printed on the screen. Pfmon offers a choice of sampling module depending on the underlying CPU and what is being measured. There a few built-in modules but you are free to add you own should you have specific needs. Pfmon includes a template module call example-ia64.

    To find out the list of built-in modules, simply use the -I option:

       % pfmon -I
       detected host CPUs:  2-way 1000MHz/3.0MB Itanium 2 (McKinley, B3)
       supported PMU models: [itanium2] [itanium] [generic IA-64] 
       detected host PMU: itanium2
       supported sampling modules: [detailed-itanium2] [raw-ia64] [compact-ia64] [btb] [example-ia64] 
       pfmlib version: 3.0
       kernel perfmon version: 2.0
    

    Because some modules are dependent on the underlying CPU, i.e., PMU model, the list of choices may vary from one processor to another. The above example is for Itanium2 but on Itanium you would get:

       % pfmon -I
       detected host CPUs:  2-way 733MHz/2.0MB Itanium (Merced, B3)
       supported PMU models: [itanium2] [itanium] [generic IA-64] 
       detected host PMU: itanium
       supported sampling modules: [detailed-itanium] [raw-ia64] [compact-ia64] [btb] [example-ia64] 
       pfmlib version: 3.0
       kernel perfmon version: 2.0
    

    Notice that "detailed-itanium2" has been replaced by "detailed-itanium" which is another module specific to Itanium (1).

    A sampling module is called when the sampling buffer becomes full and needs to be drained (dumped). Each module is free to process the data in any way it wants. Some of them can be specialized in ONLY a certain type of measurements.

    But in general the built-in formats accept any kind of measurements. They simply differ in the way they present the information:

    • [detailed-itanium2] : prints detailed information about each sample. The header and body are decoded and are directly meaningful to user. This module is Itanium 2 specific.
    • [detailed-itanium] : prints detailed information about each sample. The header and body are decoded and are directly meaningful to user. This module is Itanium specific.
    • [compact-ia64] : prints the header and body with one line per sample and just raw values. No interpretation is attempted. This module works for all CPU models.
    • [raw-ia64] : the buffer is dumped as is into a file. This module works for all CPU models.
    • [btb] : prints the Branch Trace Buffer (BTB) with one branch per line. Value are arranged in column-style with no interpretation. You must use BRANCH_EVENT as the sampling period. This module works for Itanium and Itanium2 only.
    • [example-ia64] : a simple template module for people who want to write their own sampling module.

    By default, pfmon auto-detect the host CPU and picks the first matching sampling module. On Itanium2, the default is [detailed-itanium2]. You can force a specific module by using the --smpl-module option:

       % pfmon no-cmd-output --smpl-module=compact-ia64 --long-smpl-periods=1000000 \
         -ecpu_cycles noploop 10000000
       0        18222    0  0x40000000000003a0 0x000312854d7e7f27   4 1000000 
       1        18222    0  0x40000000000003a0 0x000312854d8dff32   4 1000000 
       2        18222    0  0x40000000000003a0 0x000312854d9d5351   4 1000000 
       3        18222    0  0x40000000000003a0 0x000312854daca5bf   4 1000000 
       4        18222    0  0x40000000000003a0 0x000312854dbbf839   4 1000000 
       5        18222    0  0x40000000000003a0 0x000312854dcb4915   4 1000000 
       6        18222    0  0x40000000000003a0 0x000312854dda9e7a   4 1000000 
       7        18222    0  0x40000000000003a0 0x000312854de9ef8b   4 1000000 
       8        18222    0  0x40000000000003a0 0x000312854df94035   4 1000000 
       9        18222    0  0x40000000000003a0 0x000312854e089090   4 1000000 
       10       18222    0  0x40000000000003a0 0x000312854e17e0b8   4 1000000 
       11       18222    0  0x40000000000003a0 0x000312854e273102   4 1000000 
       12       18222    0  0x40000000000003a0 0x000312854e368116   4 1000000 
    

    You can find out more about each module by using the -S or --smpl-module-info option:

       % pfmon -S btb
       name        : btb
       description : IA-64 column-style BTB raw values. Must use BRANCH_EVENT event only.
       PMU models  : [itanium] [itanium2] 
       options     :
               --smpl-entries=val              set number of entries for sampling buffer
    

    The name of the module along with a short description is printed. Then the list of supported PMU models. Finally, optional module-specific options are listed.

    Each module can have module-specific options. Those options become known to pfmon ONLY once the module is known. That means that if you do not choose the default sampling module, then you MUST specify the module name, via --smpl-module, BEFORE you can use the module-specific options.

       % pfmon --no-cmd-out --smpl-module=btb --smpl-entries=1024 --long-smpl-periods=8192 \
         -ebranch_event noploop 10000
    

    Here --smpl-entries appears AFTER the --smpl-module selection.

    You can list all the modules with all their options simply by typing:

       % pfmon --help
       ....
       options for "detailed-itanium2" sampling format:
            --smpl-entries=val              set number of entries for sampling buffer (default 2048)
       options for "raw-ia64" sampling format:
            --smpl-entries=val              set number of entries for sampling buffer (default 2048)
       options for "compact-ia64" sampling format:
            --smpl-entries=val              set number of entries for sampling buffer (default 2048)
       options for "btb" sampling format:
            --smpl-entries=val              set number of entries for sampling buffer (default 2048)
       options for "example-ia64" sampling format:
            --smpl-entries=val              set number of entries for sampling buffer (default 2048)
    

    Notice that the default buffer size is printed.

    9.4 Basic sampling examples

    Suppose you want to record how many instructions are retired every 50000 cycles, i.e., you want to sample based on CPU_CYCLES and record the value of IA64_INST_RETIRED in each sample. This can be done as follows:

       % pfmon --no-cmd-out --short-smpl-period=50000 --long-smpl-period=50000 \
         -e cpu_cycles,ia64_inst_retired -- ls /dev/null
    

    The two periods are identical in this example because the number of instructions executed by the ls command is not influenced by the fact that we monitor.

    Next, we take a sample every 50000 cpu cycles OR each time 10000 instructions have been retired:

       % pfmon --no-cmd-out --short-smpl-period=50000,10000 --long-smpl-period=50000,10000 \
         -e cpu_cycles,ia64_inst_retired ls
    

    Let us look at the information in the sampling buffer for the detailed-itanium2 sampling module. For the first example above, we get something like this printed on the screen:

     
    	entry 0 PID:18291 CPU:0 STAMP:0x31444f4ed7297 IIP:0x2000000000023ec0
            OVFL: 4 LAST_VAL: 50000
            PMD5 : 0x0000000000004c96
            entry 1 PID:18291 CPU:0 STAMP:0x31444f4f0aebc IIP:0x2000000000021370
            OVFL: 4 LAST_VAL: 50000
            PMD5 : 0x000000000000f4f0
            entry 2 PID:18291 CPU:0 STAMP:0x31444f4f1a9b2 IIP:0x2000000000014ab0
            OVFL: 4 LAST_VAL: 50000
            PMD5 : 0x0000000000015bc7
            entry 3 PID:18291 CPU:0 STAMP:0x31444f4f27e8e IIP:0x2000000000014a81
            OVFL: 4 LAST_VAL: 50000
            PMD5 : 0x000000000001b799
            entry 4 PID:18291 CPU:0 STAMP:0x31444f4f34f65 IIP:0x2000000000014c11
            OVFL: 4 LAST_VAL: 50000
            PMD5 : 0x0000000000024d63
            entry 5 PID:18291 CPU:0 STAMP:0x31444f4f4201c IIP:0x2000000000021601
            OVFL: 4 LAST_VAL: 50000
            PMD5 : 0x0000000000033113
            .....
    

    You see the entries extracted from the sampling buffer. Entry 0 is the first entry recorded in this monitoring session. The first line of each sample (entry) shows the fixed header. The fields are as follows:

    • PID : the identity of the process that generated the event
    • CPU : the CPU number on which the event occurred
    • STAMP : a time stamp guaranteed to be unique in time per CPU.
    • IIP : the value of the IP when the PMU interrupt occurred (DANGER, see note below)
    • OVFL : the counter that triggered the recording of the sample.
    • LAST_VAL: the last value loaded into the counter which overflowed

    VERY IMPORTANT NOTE: Users are advised NOT TO TRUST the value reported in IIP. Samples get recorded by forcing a counter overflow which then triggers an interrupt which will cause the kernel to record the information. Because of the parallel nature of the architecture and its implementations, it is very likely that by the time the PMU realizes that there was a counter overflow and generates the interrupt, the program execution has progressed beyond the instruction that caused the event. This very often leads to the a skewed IIP. At best IIP points to the next bundle given that interrupts can only be delivered at bundle boundaries.

    After the header, you get the value of PMD5. This register contains the number of instructions retired for our example. The second event specified by the user DOES NOT necessarily end up in PMD5. To figure out how the events were dispatched among the various PMDs, you can use the --with-header option (described earlier). The header contains detailed machine and session descriptions. In our case it would like as follows:

       #
       # date: Mon Nov  3 14:21:18 2003
       #
       # hostname: tuff.hpl.hp.com
       #
       # kernel version: Linux 2.6.0-test7 #200 SMP Fri Oct 24 14:11:06 PDT 2003
       #
       # pfmon version: 3.0
       # kernel perfmon version: 2.0
       #
       #
       # page size: 16384 bytes
       #
       # CLK_TCK: 1024 ticks/second
       #
       # CPU configured: 0
       # CPU online    : 2
       #
       # physical memory          : 4232495104 bytes (4036.4 MB)
       # physical memory available: 3248488448 bytes (3098.0 MB)
       #
       # host CPUs:  2-way 1000MHz/3.0MB Itanium 2 (McKinley, B3)
       #       PAL_A: 0.7.31
       #       PAL_B: 0.7.40
       #       Cache levels: 3 Unique caches: 4
       #       L1D:    16384 bytes, line  64 bytes, load_lat   1, store_lat   3
       #       L1I:    16384 bytes, line  64 bytes, load_lat   1, store_lat   0
       #       L2 :   262144 bytes, line 128 bytes, load_lat   5, store_lat   7
       #       L3 :  3145728 bytes, line 128 bytes, load_lat  12, store_lat   7
       #
       #
       # captured events:
       #       PMD4: CPU_CYCLES
       #       PMD5: IA64_INST_RETIRED
       #
       # privilege levels:
       #       PMD4: CPU_CYCLES        = user
       #       PMD5: IA64_INST_RETIRED = user
       #
       # monitoring mode: per-process
       #
       # instruction sets:
       #       PMD4: CPU_CYCLES        = ia32/ia64
       #       PMD5: IA64_INST_RETIRED = ia32/ia64
       #
       #
       # command: pfmon --no-cmd-out --with-header --short-smpl-period=50000 ...
       #
       # command line     : ls /dev/null 
       #
       # process id       : 18315
       # thread  id       : 18315
       # parent process id: -1
       # parent thread  id: -1
       # exec count       : 1
       #
       #
       #
       # recorded PMDs when CPU_CYCLES overflows: PMD5 
       #
       # short sampling rates (base/mask/seed):
       #       CPU_CYCLES 50000
       #       IA64_INST_RETIRED none
       #
       # long sampling rates (base/mask/seed):
       #       CPU_CYCLES 50000
       #       IA64_INST_RETIRED none
       #
       #
       # min sampling buffer entries: 2048
       #
       #
    

    Near the middle of the header, you see in the "captured events" section: PMD5: IA64_INST_RETIRED. Later on you see the line "recorded PMDs when..." showing which PMD gets recorded in the body of each sample.

    Pfmon records the value of the PMD for which the event has no sampling period defined. For our first example, it means that it will record the value of the PMD counting the number of instructions retired. Let us look at a more complicated example using some of the Itanium2 specific events:

       % pfmon --with-header --short-smpl-periods=50000 --long-smpl-periods=50000 \
          -e cpu_cycles,ia64_inst_retired,l2_misses,cpu_cpl_changes -- ls /dev/null
    

    Here cpu_cycles is controlling the sampling period and each sample will include value of the PMDs counting the number of L2 misses (L2_MISSES) and the number of CPU privilege level changes (CPU_CPL_CHANGES):

       ...
       entry 1 PID:18326 CPU:0 STAMP:0x3153ce9a31d98 IIP:0x2000000000014e01
            OVFL: 4 LAST_VAL: 50000
            PMD5 : 0x000000000000f0c5
            PMD6 : 0x00000000000004da
            PMD7 : 0x0000000000000044
       ...
    

    Where the assignments were:

       # captured events:
       #       PMD4: CPU_CYCLES
       #       PMD5: IA64_INST_RETIRED
       #       PMD6: L2_MISSES
       #       PMD7: CPU_CPL_CHANGES
       ...
       # recorded PMDs when CPU_CYCLES overflows: PMD5 PMD6 PMD7 
    

    Using the compact-ia64 module instead of the detailed-itanium2, you get results that are formatted such that they can be easily parsed by other tools. The header contains the description of every column:

       % pfmon --smpl-module=compact-ia64 --no-cmd-out --with-header --short-smpl-periods=50000 \
        --long-smpl-periods=50000 -e cpu_cycles,ia64_inst_retired,l2_misses,cpu_cpl_changes \
        -- ls /dev/null
    
       0 18336 0  0x2000000000022bc0 0x00031564d69054aa 4 50000 0x000000000000551d 0x27b 0x26 
       1 18336 0  0x2000000000013710 0x00031564d6931eb3 4 50000 0x0000000000010e2f 0x536 0x47 
       2 18336 0  0x2000000000014a90 0x00031564d6941a54 4 50000 0x00000000000180ab 0x6b7 0x4e 
    

    The header contains the information to figure the content of each column:

       # description of columns:
       #       column  1: entry number
       #       column  2: process id
       #       column  3: cpu number
       #       column  4: instruction pointer
       #       column  5: unique timestamp
       #       column  6: overflowed PMD index
       #       column  7: initial value of first overflowed PMD
       #
       #       when PMD4(CPU_CYCLES) overflows: 
       #               column  8: PMD5
       #               column  9: PMD6
       #               column 10: PMD7
    
    9.5 Writing samples to a file

    By default, all sampling module output to the controlling terminal. This may be overwhelming at times. Therefore it is strongly recommended that samples be redirected to a file using the --smpl-outfile option. This option is different from the --outfile option. The latter is where the final counts are saved, whereas the former is where the samples are saved. By default, no counts are printed in sampling mode.

    The filename specified for the --smpl-outfile is in fact just the basename. It is appended by a suffix which depends on the monitoring sessions type. For system-wide session, you get one file per cpu and the .cpuXX suffix is added with XX represent the CPU number, starting at zero. If aggregation is used then nothing is appended.

    In per-process mode, the filename is also the base in case some of the follow-* options are used and you get one file per task monitored. See section g/ below for more details.

    9.6 Resetting non-sampling counters

    Oftentimes you want to know how many occurrences of an event during specific period. We saw that sampling gives you this kind of information. For instance, in our first example we want to know how many instruction retired for every 50000 cycles. The instructions retired count is recorded in every sample but this the accrued count not the count since last sample. It is fairly easy to correct this by substracting the count for the previous sample but this requires a pass over the entire buffer. Pfmon and the kernel perfmon-2.x subsystem support an option to force a reset of the non-sampling event each time a sample is recorded. For pfmon this is the --reset-non-smpl-periods options.

    Let us examine the effect of the --reset-non-smpl-periods options. goping back to our first example:

       % pfmon --no-cmd-out --short-smpl-period=50000 -e cpu_cycles,ia64_inst_retired ls
    
    	entry 0 PID:18291 CPU:0 STAMP:0x31444f4ed7297 IIP:0x2000000000023ec0
               OVFL: 4 LAST_VAL: 50000
               PMD5 : 0x0000000000004c96
            entry 1 PID:18291 CPU:0 STAMP:0x31444f4f0aebc IIP:0x2000000000021370
               OVFL: 4 LAST_VAL: 50000
               PMD5 : 0x000000000000f4f0
            entry 2 PID:18291 CPU:0 STAMP:0x31444f4f1a9b2 IIP:0x2000000000014ab0
               OVFL: 4 LAST_VAL: 50000
               PMD5 : 0x0000000000015bc7
            entry 3 PID:18291 CPU:0 STAMP:0x31444f4f27e8e IIP:0x2000000000014a81
               OVFL: 4 LAST_VAL: 50000
               PMD5 : 0x000000000001b799
            entry 4 PID:18291 CPU:0 STAMP:0x31444f4f34f65 IIP:0x2000000000014c11
               OVFL: 4 LAST_VAL: 50000
               PMD5 : 0x0000000000024d63
            entry 5 PID:18291 CPU:0 STAMP:0x31444f4f4201c IIP:0x2000000000021601
               OVFL: 4 LAST_VAL: 50000
               PMD5 : 0x0000000000033113
    

    Here the value of PMD5 is the accrued value since the beginning of the session. Therefore to figure out how many instructions were retired between entry 2 and entry 3, one must do 0x000000000001b799 - 0x0000000000015bc7 = 0x5bd2 = 23506

    Now with the option:

       % pfmon --no-cmd-out --short-smpl-period=50000 --reset-non-smpl-periods \
         -e cpu_cycles,ia64_inst_retired ls
    
       entry 0 PID:18353 CPU:0 STAMP:0x315e7c8493a68 IIP:0x200000000000fd90
            OVFL: 4 LAST_VAL: 50000
            PMD5 : 0x0000000000005b74
       entry 1 PID:18353 CPU:0 STAMP:0x315e7c84c0902 IIP:0x20000000000136e0
            OVFL: 4 LAST_VAL: 50000
            PMD5 : 0x000000000000b3c2
       entry 2 PID:18353 CPU:0 STAMP:0x315e7c84cf6ea IIP:0x2000000000014a72
            OVFL: 4 LAST_VAL: 50000
            PMD5 : 0x0000000000007768
       entry 3 PID:18353 CPU:0 STAMP:0x315e7c84dcbcf IIP:0x2000000000014ba0
            OVFL: 4 LAST_VAL: 50000
            PMD5 : 0x0000000000007b74
       entry 4 PID:18353 CPU:0 STAMP:0x315e7c84e9b57 IIP:0x20000000000137f0
            OVFL: 4 LAST_VAL: 50000
            PMD5 : 0x0000000000010095
       entry 5 PID:18353 CPU:0 STAMP:0x315e7c84f65a9 IIP:0x20000000000231b0
            OVFL: 4 LAST_VAL: 50000
            PMD5 : 0x000000000000b345
    

    You easily notice that the count for PMD5 is reset between two samples. No further calculation is needed to answer our initial question.

    9.7 Blocking on overflow notifications

    Whenever the sampling buffer becomes full and pfmon is notified you have the option of either letting the monitored program continue or block it. In both cases, monitoring is off during the processing of the sampling buffer. By default, pfmon lets the program continue its execution. It is possible to block the program using the --overflow-block option. Blocking the program ensures pfmon sees the entire execution. Keeping the program running ensures that the caches and TLB are kept somewhat warm, i.e., with some state belonging to the running process, especially on SMP systems. The option is support in all types of per-process sessions including the ones which are following execution through fork(), vfork(), and pthread_create().

    This options is not supported in system-wide mode.

    9.8 Sampling and the --follow-* options

    Sampling is supported across fork(),vfork(), and pthread_create(). It is also supported when sampling across exec(). The samples are collected per-process, i.e., really per-task in Linux term. In a multithreaded application, you get samples for each thread.

    By default all sampling modules write to the controlling terminal, hence it may be difficult to distinguish which samples come from which task and output may be interleaved. However, pfmon does serialize terminal output. This means that samples output in batched per task,i.e., no interleaving possible during a dump of a buffer.

    The best way to handle multi process/thread sampling sessions is to redirect the output to a file using the --smpl-outfile option. The filename passed to the option serves as a basename very much like what is done for the --outfile option that we described here.

    Let us look at our compilation example:

       % pfmon --follow-all --no-cmd-out --smpl-outfile=gcc-smpl --long-smpl-periods=1000000 \
         -- gcc hello.c -o hello -O2 -static
       % ls -l gcc-smpl.*
       2174981 -rw-r--r--    1 eranian  users           0 Nov  4 17:31 gcc-smpl.24410.24410
       2175304 -rw-r--r--    1 eranian  users           0 Nov  4 17:31 gcc-smpl.24411.24411
       2175313 -rw-r--r--    1 eranian  users        1046 Nov  4 17:31 gcc-smpl.24412.24412
       2175323 -rw-r--r--    1 eranian  users         380 Nov  4 17:31 gcc-smpl.24413.24413
       2175327 -rw-r--r--    1 eranian  users           0 Nov  4 17:31 gcc-smpl.24414.24414
       2175337 -rw-r--r--    1 eranian  users       14634 Nov  4 17:31 gcc-smpl.24415.24415
    

    Just like for --outfile, the process id and the thread id are appended to the provide filename. The thread id differs from the process id only when following threads. For processes, the two numbers are identical:

     
           gcc-smpl.24410.24410
           ^        ^     ^
           |        |     |
           |        |     thread id
           |        process id
           base filename
    

    You can get detailed information included in each file using the --with-header option, just like for --outfile.

    The --exec-split-results and --follow-exec-* options have the same effects as --outfile on the generation of the sampling output files that they do for the counts. When --exec-split-results is used, the exec number if appended to the sampling output file:

       % pfmon -exec-split --follow-all --no-cmd-out --smpl-outfile=gcc-smpl \
         --long-smpl-periods=1000000 -- gcc hello.c -o hello -O2 -static
       % ls -l gcc-smpl.*
       2174981 -rw-r--r--    1 eranian  users        1683 Nov  4 17:38 gcc-smpl.25727.25727.1
       2175304 -rw-r--r--    1 eranian  users        1594 Nov  4 17:38 gcc-smpl.25728.25728.0
       2175312 -rw-r--r--    1 eranian  users        1990 Nov  4 17:38 gcc-smpl.25728.25728.1
       2175317 -rw-r--r--    1 eranian  users        1594 Nov  4 17:38 gcc-smpl.25729.25729.0
       2175323 -rw-r--r--    1 eranian  users        2717 Nov  4 17:38 gcc-smpl.25729.25729.1
       2175332 -rw-r--r--    1 eranian  users        1594 Nov  4 17:38 gcc-smpl.25730.25730.0
       2175333 -rw-r--r--    1 eranian  users        1982 Nov  4 17:38 gcc-smpl.25730.25730.1
       2175336 -rw-r--r--    1 eranian  users        1594 Nov  4 17:38 gcc-smpl.25731.25731.0
       2175337 -rw-r--r--    1 eranian  users        1964 Nov  4 17:38 gcc-smpl.25731.25731.1
       2175341 -rw-r--r--    1 eranian  users        1964 Nov  4 17:38 gcc-smpl.25732.25732.0
       2175342 -rw-r--r--    1 eranian  users       16568 Nov  4 17:38 gcc-smpl.25732.25732.1
    

    If we look at the header of the largest file, we find out that this is the linker, which is exec'ed from the collect2 process:

       # command line     : /usr/bin/ld -static -o hello /usr/lib/gcc-lib/...
       #
       # process id       : 25732
       # thread  id       : 25732
       # parent process id: 25731
       # parent thread  id: 25731
       # exec count       : 1
    

    Some of the files can be empty (in case --with-header is not specified) or just with a header which means that no samples was collected for this portion of the execution.

    9.9 Sampling in system wide mode

    Sampling is possible in the same manner for system wide sessions. By default, the buffer is printed on the controlling tty. When sampling on more than one CPU at a time, samples for each CPU will be printed. When sampling results are redirected into a file with the --smpl-outfile option, you get one file per CPU. If the file is called 'myresults', then 'myresults.cpu0' contains the samples captured on CPU0, 'myresults.cpu1' the ones from CPU1, and so on:

       % pfmon --system-wide --smpl-outfile=sysout --long-smpl-periods=1000000 \
         --session-timeout=3
       <session to end in 3 seconds>
       % ls -l sysout*
       sysout.cpu0  sysout.cpu1
       % ls -l sysout.cpu*
       -rw-r--r--    1 eranian  users           0 Nov  4 12:22 sysout.cpu0
       -rw-r--r--    1 eranian  users           0 Nov  4 12:22 sysout.cpu1
    

    The --aggregate-results options influences the way samples are saved to files. When this option is used, the samples are merged into a single file. In our example, they would go into 'myresults'.

    9.10 Randomization of sampling periods

    Pfmon supports randomization of both sampling periods. The user must supply a bitmask and a seed value using the --smpl-periods-random option. The same mask and seed applies to both the long and short period for each event. Each event (sampling period) can have a different mask and seed. Two separate invocations of pfmon using the same seed and mask arguments are guaranteed to generate to same "pseudo-random" series of numbers allowing reproducibility.

    Most built-in sampling modules, report the last sampling period value, i.e., the value that was loaded into the sampling period to record a sample.

    In the following example, we use the [detailed-itanium2] module. The long (and short) sampling period are initially set to 100000 and we activate randomization using a seed of 5. The mask indicates that we allow the value to vary between 100000 and 100255 (inclusive):

       % pfmon --smpl-periods-random=0xff:5 --long-smpl-period=100000 \
         -e cpu_cycles -- noploop 1000000000
    
       entry 0 PID:509 CPU:0 STAMP:0xa9b83faf28 IIP:0x4000000000000400
            OVFL: 4  LAST_VAL: 100000
       entry 1 PID:509 CPU:0 STAMP:0xa9b8413a4d IIP:0x4000000000000400
            OVFL: 4  LAST_VAL: 100005
       entry 2 PID:509 CPU:0 STAMP:0xa9b842c532 IIP:0x4000000000000400
            OVFL: 4  LAST_VAL: 100067
       entry 3 PID:509 CPU:0 STAMP:0xa9b8445077 IIP:0x4000000000000400
            OVFL: 4  LAST_VAL: 100181
       entry 4 PID:509 CPU:0 STAMP:0xa9b845db4e IIP:0x4000000000000400
            OVFL: 4  LAST_VAL: 100064
       entry 5 PID:509 CPU:0 STAMP:0xa9b84766b5 IIP:0x4000000000000400
            OVFL: 4  LAST_VAL: 100212
       entry 6 PID:509 CPU:0 STAMP:0xa9b848f1d5 IIP:0x4000000000000400
            OVFL: 4  LAST_VAL: 100140
    

    The randomization is shown in the LAST_VAL field. Here is corresponds to the value loaded into PMD4 (the PMD which overflowed) for each sample. Hence, 100181 is the number of elapsed cycles between entry 2 and entry 3.

    Randomization is important when sampling to avoid getting in lockstep with the execution and thereby collecting biased results.

    9.11 Sampling and symbol resolution example

    Let us use an example which stresses the application and the kernel to show how both the application and the kernel addresses are resolved. We use a stupid getpid() loop program:

     
       % pfmon -u -k --no-cmd-out --resolve-addresses --long-smpl-period=1000000 -- getpid 100000
    

    We monitor at both the kernel and user level to record samples at both levels. The sample we are getting look as follows:

       entry 0 PID:16661 CPU:0 STAMP:0xdc1e3b599f02 IIP:kernel:break_fault+0x120
            OVFL: 4 LAST_VAL: 1000000
       entry 1 PID:16661 CPU:0 STAMP:0xdc1e3b68eff9 IIP:kernel:break_fault+0x120
            OVFL: 4 LAST_VAL: 1000000
       entry 2 PID:16661 CPU:0 STAMP:0xdc1e3b7836ba IIP:app:getpid+0x2
            OVFL: 4 LAST_VAL: 1000000
       entry 3 PID:16661 CPU:0 STAMP:0xdc1e3b877d3b IIP:app:getpid+0x2
            OVFL: 4 LAST_VAL: 1000000
       entry 4 PID:16661 CPU:0 STAMP:0xdc1e3b96c324 IIP:app:getpid+0x2
            OVFL: 4 LAST_VAL: 1000000
       entry 5 PID:16661 CPU:0 STAMP:0xdc1e3ba608df IIP:kernel:break_fault+0x120
            OVFL: 4 LAST_VAL: 1000000
       entry 6 PID:16661 CPU:0 STAMP:0xdc1e3bb54ebd IIP:kernel:break_fault+0x120
            OVFL: 4 LAST_VAL: 1000000
       entry 7 PID:16661 CPU:0 STAMP:0xdc1e3bc4944c IIP:kernel:break_fault+0x120
            OVFL: 4 LAST_VAL: 1000000
    

    You see that the interrupted IP (IIP), i.e., where we were when the PMU interrupted was generated is resolved to a symbol name. Sometimes we are in the kernel such as at kernel:break_fault+0x12. The +0x12 is the offset from the base address of the symbol. If we look at the code for the symbol app:getpid+0x2, we get:

       400000000000df00 <__getpid>:
       400000000000df00:       [MII]       mov r15=1041
       400000000000df06:                   break.i 0x100000;;
       400000000000df0c:                   nop.i 0x0
       400000000000df10:       [MFB]       nop.m 0x0
       400000000000df16:                   nop.f 0x0
       400000000000df1c:                   br.ret.sptk.few b0;;
    

    The +0x2 offset is a byte offset, meaning we are in the first bundle, most likely at the break.i instruction.

    9.12 Miscellaneous sampling options

    When sampling, the counts printed at the end of the session are not very useful, especially for the counters used as sampling periods. By default, pfmon does not print them at the end of a sampling session. You can change this by using the --smpl-print-counts option.

    For certain measurements, it is interesting to put the sampling buffer in "saturation" mode. This means that monitoring stops as soon as the buffer fill up. Pfmon is not notified at all. In the end, we collect the first N samples, where N is determine by the size of the sampling buffer. The processes and session continues until normal termination, but no monitoring is active. You can turn on this mode using the --saturate-smpl-buffer.



perfmon project links

» project home
» perfmon overview
» libpfm overview
» pfmon overview
» mailing list
» downloads
» bibliography
» presentations

pfmon links

» FAQ
» documentation
Printable version
Privacy statement Using this site means you accept its terms Feedback to HP Labs
© 2009 Hewlett-Packard Development Company, L.P.