Performance Inspector

The Performance Inspector (PI) is a suite of tools that identifies performance problems and characteristics. PI is distributed with a kernel patch and the source to build the device driver and tools. Procedures are also included to automate building and installing the tools. Support is provided for Intel 32- and 64-bit platforms, PowerPC 32- and 64-bit systems, S390 32- and 64-bit systems, and the AMD Hammer processor.

All versions of the Performance Inspector include the following tools:

Above Idle. Shows how idle, process, and interrupt times are distributed over the processor(s) on the running system. Above Idle is a phase 1 tool that identifies hot spots with respect to processor time and interrupt time spent in the system. This tool can also be used on SMP systems to see how well the active processes are being spread over multiple CPUs.

Per-Thread Time. Hooks into the process dispatch and interrupt code to maintain information on the total amount of time spent within a process, the time spent handling interrupts, and the amount of idle time. Summary information is provided on a per-CPU basis and for the system as a whole. APIs are provided to allow a developer to include calls within an application to measure functions and code snippets.

System Trace Data Collection. Trace hooks are added to the kernel to collect data on what is happening within the system. Hooks are provided for process dispatches, process timeslices, interrupt entry and exits, process forks and clones, process startup execution, system timer interrupts, and memory mapping of code segments. It is also possible for an application to write its own trace records.

Command files, libraries, and post processors are included to allow automatic tracing functions and reports. The most significant of these functions allow the measurement and reporting of jitted methods within Java applications.
This function is very useful in identifying what methods are running and how much time they are using.

JPROF. Performs Java execution profiling. This tool includes the capability to obtain detailed information on jitted methods. To use this tool, use a version of the Java SDK that includes JVMPI support.

Java Lock Monitor (JLM). Reports hold-time accounting and contention statistics for Java applications. The IBM Java 1.4 SDK is required to use this tool.

In addition to these common measurement functions, an additional set of functionality is provided on Intel 32-bit systems. This functionality includes the following:

Performance counter support. Supports starting and stopping the counters, reporting counter contents, and displaying counter settings. The Trace Data Collection and Per-Thread Time functions can base their measurements on the performance counters instead of the system clock. Instead of seeing how much time is spent within a process, you see how many instructions are executed, or how many branches or jumps are performed.

Instruction tracing. Instruction tracing records all the branches taken while the measurement is active. Branches include calls, jumps, and any other execution path change. A post-processor tool is provided to report an instruction trace, including the number of instructions executed and where in the code the execution occurs. This reporting can also be done for Java jitted methods.

Dynamic kernel patch. For certain releases of Red Hat and SUSE distributions, a version of the tools is provided that does not require a patch to the Linux kernel. Instead, the device driver dynamically patches the required hooks into the running system and runs the Performance Inspector tool suite.

The Performance Inspector installation requires you to apply a kernel patch and rebuild the kernel. After that step is complete, the various tools are built.
Because some of the tools are sensitive to the version of Java that is being used, the tool build process must be redone when a new version of the Java support is used. However, this requirement does not affect the kernel. The kernel needs to be patched and built only once.

An exception to the installation procedure is provided for the Intel 32-bit support. A special version of the Performance Inspector is provided that does not require the kernel to be patched as long as the kernel being used is the default kernel shipped with supported versions of Red Hat and SUSE. The kernel source needs to be installed and the tools built as before, but the kernel does not have to be rebuilt. The PI driver dynamically patches the kernel at the appropriate places. The kernel patches are removed when the driver is uninstalled.

When none of the PI performance probes in the kernel are active, there is minimal impact on system performance. At most, a compare and a short branch instruction are added to the code path where the probes are located. For the dynamic version of the IA32 PI, no extra overhead is added when the driver is not loaded.

When performing a trace, there is usually a 2% to 3% overhead on the system. Instruction tracing adds more overhead, but it is intended for use in a debug environment. Normal tracing can be done in a working production environment without any major impact on performance or throughput.

Most of the PI functions can be controlled from APIs issued from the user's program as well as from the PI's own command files. When using the API interface, either from C code or Java, you can fine-tune when, where, and what information is collected. All of the source is provided and can be used as a coding sample.

Refer to http://www-124.ibm.com/developerworks/oss/pi/indexl for more information and to obtain a copy of the Performance Inspector.

The remainder of this chapter examines in detail the various features of the Performance Inspector.
Above Idle

Above Idle works by hooking into the process timeslice logic and interrupt handler within the kernel. Above Idle keeps track of the amount of time that a processor spends busy, idle, or handling interrupts. It is useful for identifying processor overload, high interrupt activity, and poor distribution of work within a multiprocessor system. When active, Above Idle gathers information over a user-specified measurement period. The default is 1 second. When using the default, Above Idle determines the amount of time the system spent idle, active, and processing interrupts during each second. The percentage of processor idle, active, and interrupt processing time is then calculated and, on a multiprocessor system, displayed for each processor. Parameters are used to define how many measurements to make and the time interval between measurements.

To start the Above Idle measurement, enter swtrace ai, which uses the defaults to report system usage every second until the measurement is manually stopped.

The following example shows a sample Above Idle output. The example was run on an eight-way SMP machine and shows data collected over 3 seconds.

Per-Thread Time

When active, per-thread time (PTT) accumulates the amount of time spent by each process within the system, as well as idle time and interrupt time. If you are running on an Intel IA32 system, you can use a performance event counter instead of the real-time clock. For example, a measurement can be made that identifies the number of instructions executed by each process, the idle process, and the interrupts. The totals for process time, IRQ time, and idle time are provided for each processor in the system, as well as a system total of all processors.

PTT is activated via an API call, PTTInit, or an external program called ptt. In both cases, you specify the measurement medium as system timer or performance counter.
After it is activated, you can run a program called pttstats to display the current values for idle, process, and interrupt times. You can run the pttstats program any number of times. A terminate command stops the measurement. The last counts before the terminate command is issued remain stored in the buffers until another init command is received.

The following shows a sample PTT report:

APIs are provided to allow applications to start this measurement and to obtain the total amount of time spent in the process being measured. These APIs allow an application to be instrumented to measure the amount of time spent between specific operations or positions within the application. On Intel 32-bit systems, measurements such as instructions retired can be used as a performance counter measurement.

Trace Profiling

Trace profiling works by hooking the kernel timer interrupt. When the timer interrupt is processed, the profiling code creates a trace record that contains the address where the processor was executing when the timer interrupt occurred. This trace record is then written into a trace buffer. When the profiling is complete, these trace records can be dumped out to a file where a post-processing program produces various reports.

You can run a trace profile to identify hot spots within the system by identifying the applications that consume the most time and the functions within those applications that account for that time. This includes jitted methods within Java applications.

Use the run.tprof command to perform the trace and produce the default reports.

By looking at the run.tprof command file, you can see the specific steps you need to follow to take a profile trace and produce the reports.

The following example shows a sample trace report for a small C application.

When profiling applications, the level of optimization used when the application is compiled can have an effect on the trace output.
With optimization on, functions that execute the same base code are all rolled up into the application instead of being reported separately by symbol name. Therefore, it is a good idea to make the first few profiling runs on applications that have not been compiled with optimization turned on.

Instruction Tracing

A special version of trace profiling that sets up the system to use the hardware's instruction trace capabilities is available on Intel IA32 systems. Instruction tracing starts from the branch instruction signal and then creates trace records every time a code branch, jump, or call is performed. Instruction tracing provides a detailed trace of the execution path in the system without the need to have a debugger installed or code compiled with the debug option enabled.

Use instruction tracing only for very short, controlled measurement periods, because it can produce a significant number of trace records. Instruction tracing is primarily useful for determining the exact execution path through the entire system for the function being executed.

A small command file, run.itrace, is provided to simplify the taking of an instruction trace profile.

The following example shows a small output of run.itrace.

This report shows the assembler instructions that were executed between the entry and the branch instruction. The report provides a detailed view of what is going on in the system. To relate this back to the actual C code, you would have to compile the C code with the assembler option and calculate the offsets to the assembler instructions to find them in the listing.

If you add the -c option to the post-processing report generator, the output would look similar to the following.
Java Profiler

JPROF is a Java profiling agent that dynamically responds to JVMPI events based on options passed at Java invocation. The profiler is generally referred to as JPROF but is implemented as a shared library called libjprof.so. JPROF provides JIT address-to-name resolution to support tprof and ITrace data reduction. Other functions include Java Lock Monitor and Java HeapDump. The profiler has been implemented for IBM JDK 1.2.2 and later and is based on the support in the JDK for the JVMPI interface.

The following example shows a report generated by the trace profiler on a Java application.
Java Lock Monitor

Java Lock Monitor (JLM) support is provided with version 1.4.0 of the IBM JDK. JLM provides monitor hold time accounting and contention statistics on monitors used in Java applications and the JVM itself. JLM provides support for the following counters.

Counters associated with contended locks:

Total number of successful acquires.
Recursive acquires. Number of times the monitor was requested and was already owned by the requesting thread.
Number of times the requesting thread was blocked waiting on the monitor because the monitor was already owned by another thread.
Cumulative time the monitor was held.

The following statistics are also collected on platforms that support 3-Tier Spin Locking (x86 SMP):

Number of times the requesting thread went through the inner (spin) loop while attempting to acquire the monitor.
Number of times the requesting thread went through the outer (thread yield) loop while attempting to acquire the monitor.

Garbage collection (GC) time is removed from hold times for all monitors held across a GC cycle.

A monitor can be acquired either recursively, when the requesting thread already owns it, or nonrecursively, when the requesting thread does not already own it. Nonrecursive acquires can be further divided into fast and slow. Fast is when the requested monitor is not already owned and the requesting thread gains ownership immediately. On platforms that implement 3-Tier Spin Locking, any monitor acquired while spinning is considered a fast acquire, regardless of the number of iterations in each tier.
Slow is when the requested monitor is already owned by another thread and the requesting thread is blocked.

Java Lock Monitor Report
Version_4.26 (05.01.2002) Built : ( Wed May 1 14:44:28 CDT 2002 )
JLM_Interval_Time 34021158156

System (Registered) Monitors
%MISS  GETS NONREC SLOW   REC  TIER2 TIER3 %UTIL AVER-HTM  MON-NAME
   87  5273   5273 4572     0 710708 18487     1    95408  ITC Global_Compile lock
    9  6870   6869  631     1 113420  2976     0    11807  Heap lock
    5  1123   1123   51     0  11098   286     1   248385  Binclass lock
    0  1153   1147    5     6   1307    33     0    47974  Monitor Cache lock
    0 46149  45877  134   272  36961   877     1     6558  ITC CHA lock
    0 33734  23483   19 10251   6544   150     1    17083  Thread queue lock
    0     5      5    0     0      0     0     0  9309689  JNI Global Reference lock
    0     5      5    0     0      0     0     0  9283000  JNI Pinning lock
    0     5      5    0     0      0     0     0  9442968  Sleep lock
    0     1      1    0     0      0     0     0        0  Monitor Registry lock
    0     0      0    0     0      0     0     0        0  Evacuation Region lock
    0     0      0    0     0      0     0     0        0  Method trace lock
    0     0      0    0     0      0     0     0        0  Classloader lock
    0     0      0    0     0      0     0     0        0  Heap Promotion lock

Java (Inflated) Monitors
%MISS  GETS NONREC SLOW   REC  TIER2 TIER3 %UTIL AVER-HTM  MON-NAME
   15    68     68   10     0   2204    56     2 11936405  test.lock.testlock1@A09410
    2    42     42    1     0    186     5     0   300478  test.lock.testlock2@D31358
    0    70     70    0     0     41     1     0     7617  Java.lang.ref

LEGEND:
%MISS   : 100 * SLOW / NONREC
GETS    : Lock Entries
NONREC  : Non Recursive Gets
SLOW    : Non Recursives that Wait
REC     : Recursive Gets
TIER2   : SMP Wait Hierarchy
TIER3   : SMP Wait Hierarchy
%UTIL   : 100 * Hold-Time / Total-Time
AVER-HT : Hold-Time / NONREC

Descriptions of the report's fields are as follows:

JLM_Interval_Time. Time interval between the start and end of measurement. Time is expressed in the units appropriate for the hardware platform: cycles for x86, IA64, and S390, and time-based ticks for PPC.

%MISS. Percentage of the total GETS (acquires) where the requesting thread was blocked waiting on the monitor.

GETS. Total number of successful acquires.

NONREC. Total number of nonrecursive acquires. This number includes SLOW gets.

SLOW.
Total number of nonrecursive acquires that caused the requesting thread to block waiting for the monitor to no longer be owned. This number is included in NONREC.

To calculate the number of nonrecursive acquires in which the requesting thread obtained ownership immediately (FAST), subtract SLOW from NONREC. On platforms that support 3-Tier Spin Locking, monitors acquired while spinning are considered FAST acquires.

REC. Total number of recursive acquires. A recursive acquire is one where the requesting thread already owns the monitor.

TIER2. Total number of Tier 2 (inner spin loop) iterations on platforms that support 3-Tier Spin Locking.

TIER3. Total number of Tier 3 (outer thread yield loop) iterations on platforms that support 3-Tier Spin Locking.

%UTIL. Monitor hold time divided by JLM_Interval_Time, expressed as a percentage. Hold time accounting must be turned on.

AVER-HTM. Average amount of time the monitor was held. Recursive acquires are not included because the monitor is already owned when acquired recursively.

MON-NAME. Monitor name or NULL (blank) if the name is not known.

Performance Inspector Executable Tools

The following tools, which are shipped with the Performance Inspector, provide support for the PI functions.

swtrace

swtrace is a software tracing mechanism that runs on Linux. swtrace is normally run from a command prompt by issuing the swtrace command with the appropriate arguments.

swtrace uses software trace hooks to collect data. Trace hooks are identified by both a major code and a minor code. Trace data is collected in a trace buffer that is allocated when swtrace is initialized or turned on. The size of the trace buffer can be set when swtrace is initialized. The swtrace command allows the user to select which major codes are traced, when tracing starts, when tracing stops, when data is transferred from the trace buffer to disk, and how the trace data is formatted.

The major parameters supported by swtrace are as follows:

init.
Tells the trace profiler to allocate the trace buffers and initialize the system for tracing. With init, the size of the trace buffer to be allocated and the performance counter to use for taking the trace can be optionally specified.

enable. Enables the trace hooks within the Linux kernel. This controls what information is placed in the trace buffer.

disable. Keeps the specified trace hooks from being measured.

on. Starts the trace. The trace information is gathered until the swtrace off command is given.

get. Dumps the contents of the trace buffers to a file for processing by the report generator program.

it_install. Initializes the instruction trace facility.

it_remove. Resets the instruction trace functionality.

Other parameters display information about the Performance Inspector and control the rate of profiling. The command file run.tprof is generated when PI is installed in the system. This command file contains all the steps necessary to take a profiling trace and produce a report.

post

post produces various reports based on the trace profiling data. When a trace profiling report is produced, it is written to a file called tprof.out. When you install the Performance Inspector, you identify the directory where this file will be saved. If you want to keep the current tprof.out file, you must rename it before running another run.tprof command.

One option supported by the post command is -show. -show creates a file called post.show, which is a dump of all the trace records in a readable format. When all the trace hooks are enabled, post.show gives a detailed look at the sequence of events that occurred in the system, from execs to dispatches to interrupts.

pipcntr

pipcntr controls and displays the performance counters when PI is running on Intel IA32 platforms. This program can start and stop counters and display the contents of the counter registers and the counter control register settings.
You can use this utility to start a performance counter and then run the per-thread time utility using this counter. The same holds true for the trace profiling function.

ptt

ptt starts and stops the per-thread time measurement. When starting, you can also specify what metric to use to perform the measurement.

pttstats

pttstats displays the per-thread time of every process in the system. On Intel IA32 systems, the measurement metric can be either the real-time clock or a performance monitor counter.
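
To make the per-thread time accounting reported by pttstats more concrete, the following Python sketch shows one way such totals could be accumulated from dispatch and interrupt events. It is only an illustration of the idea, not the Performance Inspector's implementation: the event tuple layout, the event names (dispatch, irq_enter, irq_exit), the function names, and the use of pid 0 for the idle task are all assumptions made for this example. The timestamps can be read as clock ticks or as performance counter values, since the arithmetic is the same for either metric.

```python
from collections import defaultdict

IDLE_PID = 0  # assumption for this sketch: pid 0 represents the idle task


def accumulate(events):
    """Charge elapsed time to process, interrupt, or idle buckets.

    events: iterable of (timestamp, cpu, kind, pid) tuples sorted by timestamp.
    kind is "dispatch" (pid starts running on cpu), "irq_enter", or "irq_exit";
    pid is ignored for the two interrupt kinds. This layout is hypothetical.
    Returns (per_pid, per_cpu).
    """
    per_pid = defaultdict(int)
    per_cpu = defaultdict(lambda: {"process": 0, "irq": 0, "idle": 0})
    state = {}  # cpu -> (last timestamp, current pid, inside interrupt?)

    for ts, cpu, kind, pid in events:
        if cpu in state:
            last_ts, cur_pid, in_irq = state[cpu]
            elapsed = ts - last_ts
            # Charge the interval to whatever was running before this event.
            if in_irq:
                per_cpu[cpu]["irq"] += elapsed
            elif cur_pid == IDLE_PID:
                per_cpu[cpu]["idle"] += elapsed
            else:
                per_cpu[cpu]["process"] += elapsed
                per_pid[cur_pid] += elapsed
        else:
            # First event seen on this CPU: nothing to charge yet.
            cur_pid, in_irq = IDLE_PID, False

        if kind == "dispatch":
            cur_pid = pid
        elif kind == "irq_enter":
            in_irq = True
        elif kind == "irq_exit":
            in_irq = False
        state[cpu] = (ts, cur_pid, in_irq)

    return dict(per_pid), {cpu: dict(counts) for cpu, counts in per_cpu.items()}


def system_totals(per_cpu):
    """Sum the per-CPU buckets into one system-wide total."""
    totals = {"process": 0, "irq": 0, "idle": 0}
    for counts in per_cpu.values():
        for bucket, value in counts.items():
            totals[bucket] += value
    return totals


# Made-up events for a two-CPU system over 100 ticks.
events = [
    (0, 0, "dispatch", 101),
    (0, 1, "dispatch", IDLE_PID),
    (40, 0, "irq_enter", None),
    (50, 0, "irq_exit", None),
    (70, 0, "dispatch", 102),
    (100, 0, "dispatch", IDLE_PID),
    (100, 1, "dispatch", IDLE_PID),
]

per_pid, per_cpu = accumulate(events)
print(per_pid)
print(per_cpu)
print(system_totals(per_cpu))
```

In this sketch, time spent inside an interrupt is charged to the interrupt bucket rather than to the interrupted process, which mirrors the separate process, IRQ, and idle totals described for PTT. Whether the real driver handles nested interrupts or partial intervals the same way is not something this example claims.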