Operating System Performance Monitoring

1.   Three distinct activities are involved in improving performance: performance monitoring, performance profiling, and performance tuning.

   a)   Performance monitoring is an act of non-intrusively collecting or observing performance data from an operating or running application. Monitoring is usually a preventative or proactive type of action. Monitoring is also usually the first step in a reactive situation where an application stakeholder has reported a performance issue but has not provided sufficient information or clues as to a potential root cause. In this situation, performance profiling likely follows performance monitoring.

 

    b)   Performance profiling, in contrast to performance monitoring, is an act of collecting performance data from an operating or running application that may be intrusive on application responsiveness or throughput. Performance profiling tends to be a reactive activity, undertaken in response to a stakeholder reporting a performance issue, and usually has a narrower focus than performance monitoring. Profiling is rarely done in production environments; it is typically done in qualification, testing, or development environments and often follows a monitoring activity that indicates some kind of performance issue.

 

    c)   Performance tuning, in contrast to performance monitoring and performance profiling, is an act of changing tunables, source code, or configuration attributes for the purpose of improving application responsiveness or throughput. Performance tuning often follows performance monitoring or performance profiling activities.

 

2.   CPU utilization on most operating systems is reported in both user CPU utilization and kernel or system (sys) CPU utilization. User CPU utilization is the percent of time the application spends in application code. In contrast, kernel or system CPU utilization is the percent of time the application spends executing operating system kernel code on behalf of the application. High kernel or system CPU utilization can be an indication of shared resource contention or a large number of interactions between I/O devices. The ideal situation for maximum application performance and scalability is to have 0% kernel or system CPU utilization since CPU cycles spent executing in operating system kernel code are CPU cycles that could be utilized by application code.

 

3.   On compute-intensive systems, further monitoring of the number of CPU instructions per CPU clock cycle (also known as IPC, instructions per clock) or the number of CPU clock cycles per CPU instruction (also known as CPI, cycles per instruction) may be required.

 

4.   The operating system tools report a CPU as being utilized even though the CPU may be waiting for data to be fetched from memory. This scenario is commonly referred to as a stall. Stalls occur any time the CPU executes an instruction and the data being operated on is not readily available in a CPU register or cache. When this occurs, the CPU wastes clock cycles because it must wait for the data to be loaded from memory into a CPU register before the instruction can execute on it. Thus, a strategy for increasing the performance of a compute-intensive application is to reduce the number of stalls or improve the CPU’s cache utilization so fewer clock cycles are wasted waiting for data to be fetched from memory.

 

5.   The commonly used CPU utilization monitoring tools on Windows are Task Manager and Performance Monitor. A running history of CPU utilization for each processor is displayed in the CPU Usage History panel on the upper right of the Performance tab of Task Manager. The upper, green line indicates the combined user and kernel or system CPU utilization. The lower, red line indicates the percentage of kernel or system CPU usage. Note that to view kernel or system CPU utilization in Windows Task Manager, the Show Kernel Times option must be enabled in Task Manager's View menu.


 

6.   The Windows Performance Monitor (perfmon) uses a concept of performance objects. Performance objects are categorized into areas such as network, memory, processor, thread, process, network interface, logical disk, and many others. Within each of these categories are specific performance attributes, or counters, that can be selected as performance statistics to monitor. User and kernel or system CPU utilization can be monitored by selecting the Processor performance object, selecting both the % User Time and % Privileged Time counters, and clicking the Add button. Windows uses the term “Privileged Time” to represent kernel or system CPU utilization.


 

 

7.   Windows typeperf is a command line tool that can be used to collect operating system performance statistics. You specify the performance statistics you want to collect using Microsoft performance counter names, which are the same names used in the Performance Monitor. For example, to collect user and kernel or system CPU utilization you can run:

typeperf "\Processor(_Total)\% Privileged Time" "\Processor(_Total)\% User Time"
 

You can also assemble a list of performance counters in a file and pass the name of the file to the typeperf command. For example, you can enter the following performance counters in a file named cpu-util.txt:

\Processor(_Total)\% Privileged Time

\Processor(_Total)\% User Time
 

 

Then, invoke the typeperf command with the -cf option:

typeperf -cf cpu-util.txt
 

Additional details on the typeperf command and its options can be found at http://www.microsoft.com/resources/documentation/windows/xp/all/proddocs/en-us/nt_command_typeperf.mspx?mfr=true.

 

8.   On Linux, CPU utilization can be monitored graphically with the GNOME System Monitor tool, which is launched with the gnome-system-monitor command. Its CPU History area draws a line for each virtual processor illustrating that processor’s CPU utilization over time. The number of virtual processors matches the number returned by the Java API Runtime.availableProcessors(). Another popular graphical tool to monitor CPU utilization on Linux is xosview.

 

9.   As the amount of free physical memory decreases, the system attempts to free up memory by locating pages that have not been used in a long time and paging them out to disk. This page scanning activity is reported as the scan rate. A high scan rate is an indicator of low physical memory, and monitoring the page scan rate is essential to identifying when a system is swapping.

 

10.   Linux and Solaris have vmstat, which shows combined CPU utilization across all virtual processors. If no reporting interval is given to vmstat, the reported output is a summary of all CPU utilization data collected since the system was last booted. When a reporting interval is specified, the first row of statistics is still that since-boot summary; subsequent rows report activity within each interval.
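
For example, to report CPU utilization every 5 seconds:

vmstat 5

In the output, the us, sy, and id columns report user, system or kernel, and idle CPU percentages.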

 

11.   Solaris and Linux also offer a tabular view of CPU utilization for each virtual processor using the command line tool mpstat. (Most Linux distributions require an installation of the sysstat package to use mpstat.)
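
For example, to report per-processor CPU utilization every 5 seconds:

mpstat -P ALL 5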

 

12.   We should identify whether an application has threads that tend to consume larger percentages of CPU cycles than other threads or whether application threads tend to utilize the same percentage of CPU cycles. The latter observed behavior usually suggests an application that may scale better.

 

13.   Linux top reports not only CPU utilization but also process statistics and memory utilization. To see per-thread statistics, press Shift+H; top briefly displays a “Show threads on” message and then reports a row per thread, including each thread’s CPU and memory utilization.
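
Thread-level reporting can also be enabled from the command line for a single process (the process id is a placeholder):

top -H -p <pid>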

 

14.   If you convert the thread id value to hexadecimal and use the JDK’s jstack command, you can find the Java thread that corresponds to the OS thread by searching for the “nid” label. The following output from the JDK’s jstack command is trimmed but shows that the Java thread with an nid of 0x2 is the “main” Java thread and is executing a Java NIO Selector.select() method.

"main" prio=3 tid=0x0806f800 nid=0x2
 runnable [0xfe45b000..0xfe45bd38]
java.lang.Thread.State: RUNNABLE
at sun.nio.ch.DevPollArrayWrapper.poll0(Native Method)
at sun.nio.ch.DevPollArrayWrapper.poll(DevPollArrayWrapper.java:164)
at sun.nio.ch.DevPollSelectorImpl.doSelect(DevPollSelectorImpl.java:68)
at sun.nio.ch.SelectorImpl.lockAndDoSelect(SelectorImpl.java:69)
- locked <0xee809778> (a sun.nio.ch.Util$1)
- locked <0xee809768> (a java.util.Collections$UnmodifiableSet)
- locked <0xee802440> (a sun.nio.ch.DevPollSelectorImpl)
at sun.nio.ch.SelectorImpl.select(SelectorImpl.java:80)
at com.sun.grizzly.SelectorThread.doSelect(SelectorThread.java:1276)
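
As a sketch of the full workflow, with a hypothetical process id of 4321 and thread id of 1234: convert the thread id reported by top to hexadecimal, then search the jstack output for the matching nid:

printf "%x\n" 1234

This prints 4d2, so the corresponding Java thread can be located with:

jstack 4321 | grep "nid=0x4d2"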
 

 

15.   The CPU scheduler’s run queue is where lightweight processes that are ready to run are held while they wait for an available CPU on which to execute. A high run queue depth can be an indication a system is saturated with work. A system operating at a run queue depth equal to the number of virtual processors may not experience much user-visible performance degradation. The number of virtual processors is the number of hardware threads on the system; it is also the value returned by the Java API Runtime.availableProcessors(). In the event the run queue depth reaches four times the number of virtual processors or greater, the system will have observably sluggish responsiveness.
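
To illustrate these thresholds, here is a minimal Java sketch; the class name and the run queue depth value are hypothetical, and in practice the depth would come from a monitoring tool such as vmstat:

public class RunQueueCheck {
    public static void main(String[] args) {
        // Number of virtual processors (hardware threads) visible to the JVM.
        int vcpus = Runtime.getRuntime().availableProcessors();

        // Hypothetical observation; in practice read vmstat's "r" column.
        int runQueueDepth = 18;

        if (runQueueDepth >= 4 * vcpus) {
            System.out.println("Expect observably sluggish responsiveness.");
        } else if (runQueueDepth > vcpus) {
            System.out.println("More runnable threads than virtual processors.");
        } else {
            System.out.println("Run queue depth within processor count.");
        }
    }
}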

 

16.   There are generally two alternative resolutions to observing a high run queue depth: acquire additional CPUs and spread the load across them, or reduce the amount of load put on the available processors.

 

17.   The run queue depth on Windows is monitored using the \System\Processor Queue Length performance counter. Note that Performance Monitor applies a scale factor when charting this counter; if the scale factor is 10, a run queue depth of 1 is displayed on the chart as 10, 2 as 20, 3 as 30, and so on.

 

18.   The following typeperf command monitors run queue depth at a 5 second interval:

typeperf -si 5 "\System\Processor Queue Length"
 

19.   On Linux, a system’s run queue depth can be monitored using the vmstat command. The first column in vmstat, labeled “r”, reports the run queue depth. The number reported is the actual number of lightweight processes in the run queue.

 

20.   A Java application or JVM that is swapping or utilizing virtual memory experiences pronounced performance issues. Swapping occurs when there is more memory being consumed by applications running on the system than there is physical memory available. When a portion of an application is accessed that has been swapped out, that portion of the application must be paged in from the swap space on disk to memory. Swapping in from disk to memory can have a significant impact on an application’s responsiveness and throughput.

 

21.   A JVM’s garbage collector performs poorly on systems that are swapping because a large portion of memory is traversed by the garbage collector to reclaim space from objects that are unreachable. If part of the Java heap has been swapped out it must be paged into memory so its contents can be scanned for live objects by the garbage collector. The time it takes to page in any portion of the Java heap into memory can dramatically increase the duration of a garbage collection.

 

22.   On Windows systems that include the Performance Monitor, monitoring memory pages per second (\Memory\Pages/sec) and available memory bytes (\Memory\Available MBytes) can identify whether the system is swapping.
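
For example, following the typeperf pattern shown earlier, both counters can be collected at a 5 second interval:

typeperf -si 5 "\Memory\Pages/sec" "\Memory\Available MBytes"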

 

23.   Swapping activity can be monitored on Linux using vmstat. The columns to watch are the “si” and “so” columns, which report the amount of memory paged in and the amount of memory paged out. In addition, the “free” column reports the amount of available free memory. There are other ways to monitor for swap activity on Linux, such as using the top command or observing the contents of the file /proc/meminfo.
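
For example, the system-wide swap figures can be read directly from /proc/meminfo:

grep -i swap /proc/meminfo

This reports the SwapTotal and SwapFree values maintained by the kernel.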

 

24.   An application experiencing heavy lock contention exhibits a high number of voluntary context switches. A voluntary context switch is an expensive operation in terms of processor clock cycles.

 

25.   A general rule to follow is that any Java application spending 5% or more of its available clock cycles in voluntary context switches is likely suffering from lock contention. Even a 3% to 5% level is worthy of further investigation. An estimate of the fraction of clock cycles spent in voluntary context switching can be calculated by taking the number of voluntary thread context switches in an interval, multiplying it by 80,000 (an estimate of the cost of a context switch in clock cycles), and dividing by the total number of clock cycles available in that interval.
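
As a worked example with assumed numbers: a single 3 GHz virtual processor provides 3,000,000,000 clock cycles in a 1 second interval. If a monitoring tool reports 3,500 voluntary context switches for that interval, the estimate is 3,500 × 80,000 = 280,000,000 cycles, or roughly 9.3% of the available cycles, which is above the 5% guideline and suggests lock contention.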

 

26.   It is possible to monitor lock contention by observing thread context switches on Linux with the pidstat command from the sysstat package. However, for pidstat to report context switching activity, a Linux kernel version of 2.6.23 or later is required. pidstat -w reports voluntary context switches in a “cswch/s” column; the figure is the number of voluntary context switches per second across all virtual processors, not a cumulative sum.
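
For example, to report context switching activity for a single process every 5 seconds, and per thread with the -t option (the process id is a placeholder):

pidstat -w -p <pid> 5

pidstat -w -t -p <pid> 5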

 

27.   On Windows, the Performance Monitor and typeperf have the capability to monitor context switches. But the capability to distinguish between voluntary and involuntary context switching is not available via a performance counter. To monitor Java lock contention on Windows, tools outside the operating system are often used, such as Intel VTune or AMD CodeAnalyst.

 

28.   Involuntary thread context switches occur when a thread is taken off the CPU as a result of an expiring time quantum or has been preempted by a higher priority thread. High involuntary context switches are an indication there are more threads ready to run than there are virtual processors available to run them. It is common to observe a high run queue depth, high CPU utilization, and a high number of migrations in conjunction with a large number of involuntary context switches.

 

29.   Strategies to reduce involuntary context switches include creating processor sets for systems running multiple applications and assigning applications to specific processor sets, or reducing the number of application threads being run on the system.

 

30.   On Linux, creation of processor sets and assigning applications to those processor sets can be accomplished using the Linux taskset command.
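
For example, the following confines a Java application to the first four virtual processors; the CPU list and application name are illustrative:

taskset -c 0-3 java MyApp

An already running process can be re-assigned with taskset -cp 0-3 <pid>.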

 

31.   On Windows systems, applications can be assigned to a processor or set of processors by using Task Manager’s Processes tab. Select a target process, right-click, and select Set Affinity. Then choose the processors the selected process should execute on. An application can also be launched from the command line with start /affinity <affinity mask>, where <affinity mask> is the processor affinity mask in hexadecimal.
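
For example, an affinity mask of 3 (hexadecimal) is binary 11, selecting processors 0 and 1, so the following hypothetical invocation restricts a Java application to those two processors:

start /affinity 3 java MyApp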

 

32.   Most operating systems’ CPU schedulers attempt to keep a ready-to-run thread on the same virtual processor on which it last executed. If that virtual processor is busy, the scheduler may migrate the ready-to-run thread to some other available virtual processor. Thread migration can impact an application’s performance since the data, or state information, used by a ready-to-run thread may not be readily available in the new virtual processor’s cache. A strategy to reduce thread migrations is to create processor sets and assign Java applications to those processor sets.

 

33.   Neither the Linux nor Solaris implementation of netstat reports network utilization. Both provide statistics such as packets sent and packets received per second along with errors and collisions.

 

34.   A port of the Solaris nicstat monitoring tool for Linux is available. The source code can be downloaded from http://blogs.sun.com/roller/resources/timc/nicstat/nicstat-1.22.tar.gz .

 

35.   On Windows, the number of bytes transmitted across a network interface can be obtained using the “\Network Interface(*)\Bytes Total/sec” performance counter. The “*” wildcard reports the figure for all network interfaces on the system; you can replace the wildcard with the specific network interface you are interested in monitoring. The bandwidth of the network interface can be obtained using the “\Network Interface(*)\Current Bandwidth” performance counter, which reports bandwidth in bits per second. Network utilization can also be monitored in Windows using Task Manager by clicking on the Networking tab.
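
For example, both counters can be collected together at a 5 second interval:

typeperf -si 5 "\Network Interface(*)\Bytes Total/sec" "\Network Interface(*)\Current Bandwidth"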

 

36.   A strategy to reduce system or kernel CPU time spent on network reads and writes is to reduce the number of network read or write system calls. Additionally, the use of nonblocking Java NIO instead of blocking java.net.Socket may also improve an application’s performance by reducing the number of threads required to process incoming requests or send outbound replies. A strategy to follow when reading from a nonblocking socket is to read as much data as there is available per read call; likewise, when writing data to a socket, write as much data as possible per write call. There are Java NIO frameworks that incorporate such practices, such as Project Grizzly (https://grizzly.dev.java.net). Java NIO frameworks also tend to simplify the programming of client-server type applications.
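
A minimal Java sketch of the “read as much as is available” strategy on a nonblocking socket follows; the method and buffer handling are illustrative assumptions, not code from any particular framework:

import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;

public class DrainingReader {
    // Read repeatedly until the nonblocking channel has no more data,
    // so each selector wakeup consumes everything currently available.
    static int readAllAvailable(SocketChannel channel, ByteBuffer buffer)
            throws IOException {
        int total = 0;
        int n;
        while ((n = channel.read(buffer)) > 0) {
            total += n;
            if (!buffer.hasRemaining()) {
                break; // buffer full; process its contents before reading more
            }
        }
        if (n == -1 && total == 0) {
            return -1; // peer closed the connection and nothing was read
        }
        return total;
    }
}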

 

37.   Disk I/O utilization along with system or kernel CPU utilization can be monitored using iostat on Linux and Solaris. To use iostat on Linux, the optional sysstat package must be installed. To monitor disk utilization on Windows Server systems, the Performance Monitor has several performance counters available under its Logical Disk performance object.

 

38.   To monitor disk I/O utilization and system or kernel CPU utilization on Linux you can use iostat -xm.

 

39.   A pattern to look for when investigating potential I/O problems is repeated accesses to the same file, or the same disk block, by the same command, process id, and user id. It may be that the same information is being accessed multiple times. Rather than re-reading the data from disk each time, the application may be able to keep the data in memory, reuse it, and avoid re-reading and experiencing an expensive disk read. If the same data is not being accessed, it may be possible to read a larger block of data and reduce the number of disk accesses.

 

40.   At the hardware and operating system level any of the following may improve disk I/O utilization:

a) A faster storage device

b) Spreading file systems across multiple disks

c) Tuning the operating system to cache larger amounts of file system data structures

 

41.   At the application level, any strategy to minimize disk activity will help, such as reducing the number of read and write operations by using buffered input and output streams or integrating a caching data structure into the application to reduce or eliminate disk interaction. The use of buffered streams reduces the number of system calls to the operating system and consequently reduces system or kernel CPU utilization, as the sketch below illustrates.
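
A minimal sketch of the buffered stream approach follows; the file names are placeholders. Each buffered stream batches many single-byte reads and writes into far fewer, larger system calls:

import java.io.BufferedInputStream;
import java.io.BufferedOutputStream;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;

public class BufferedCopy {
    public static void main(String[] args) throws IOException {
        // The buffered wrappers turn byte-at-a-time reads and writes into
        // one system call per internal buffer fill or flush.
        try (BufferedInputStream in =
                 new BufferedInputStream(new FileInputStream("in.dat"));
             BufferedOutputStream out =
                 new BufferedOutputStream(new FileOutputStream("out.dat"))) {
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
            }
        }
    }
}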

 

42.   Some systems are configured and installed with the disk cache disabled. An enabled disk cache improves the performance of applications that rely heavily on disk I/O. However, enabling the disk cache may result in corrupted data in the event of an unexpected power failure.

 

43.   Many performance engineers and system administrators of Solaris or Linux systems use sar to collect performance statistics. With sar, you can select which data to collect, such as user CPU utilization, system or kernel CPU utilization, number of system calls, memory paging, and disk I/O statistics. Data collected by sar is usually examined after the fact, as opposed to while it is being collected.
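
For example, to collect CPU utilization 10 times at a 5 second interval:

sar -u 5 10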

 


Reposted from seanzhou.iteye.com/blog/1546931