Java Application Performance Tuning (Reprint, Part 2)

Suppose we have identified a performance problem in our application (e.g., the CPU usage is running high) and are ready to start optimizing. What pain points are we likely to run into along the way? Here are some of the most common ones:

  1. The optimization process itself is unclear. After finding an initial suspected bottleneck, we happily dive in, only to end up fixing a superficial symptom while never touching the real root cause;
  2. The approach to analyzing bottlenecks is fuzzy. With so many metrics for CPU, network, memory, and so on, which ones should we actually care about, and where should we start?
  3. Unfamiliarity with the tooling. When a problem occurs, it is not clear which tool to reach for, or what the metrics reported by a tool actually represent.


2. The performance optimization process

There is no strict, formal definition of a performance optimization workflow, but for the vast majority of optimization scenarios it can be abstracted into the following four-step process.

  1. Preparation phase: the main work is to run performance tests to get an overview of the application and the rough direction of its bottlenecks, and to set clear optimization goals;
  2. Analysis phase: use various tools and techniques to locate the initial performance bottleneck;
  3. Tuning phase: tune the application based on the bottleneck that was located;
  4. Testing phase: run performance tests against the tuned application and compare the metrics with those from the preparation phase. If the bottleneck has not been eliminated or performance still falls short of expectations, repeat steps 2 and 3.


The figure below is a schematic of this four-stage process.

 

 2.1 The general process in detail
Of the four steps above, steps 2 and 3 are the focus of the next two sections. First, let's look at what needs to be done in the preparation phase and the testing phase.
★ 2.1.1 Preparation phase
The preparation phase is a critical step and cannot be skipped.
First, we need a detailed understanding of the subject we are about to tune; as the saying goes, know yourself and know your enemy.

  1. Make a rough assessment of the problem and filter out performance issues caused by poorly written business logic. For example, an unreasonable log level in the online application can make CPU and disk load spike under high traffic; simply adjusting the log level fixes it;
  2. Understand the overall architecture of the application: which external interfaces and core modules it depends on, which components and frameworks it uses, which interfaces and modules see the highest usage, and what the upstream and downstream data links look like;
  3. Understand the servers the application runs on: the cluster it belongs to, the server's CPU and memory configuration, the installed Linux version, whether the server is a container or a virtual machine, and whether other applications co-located on the same host interfere with the current one.


Secondly, we need to obtain baseline data, and then combine that baseline with the current business metrics to define the final goal of the optimization.

  1. Use benchmarking tools to obtain fine-grained system metrics. A number of benchmarking tools (e.g., JMeter, ab, LoadRunner, wrk) can be used to produce performance reports for the file system, disk I/O, network, and so on. In addition, information such as GC behavior, web server metrics, and network traffic should also be recorded if necessary;
  2. Use a stress-testing tool or platform (if available) to load-test the application and obtain its current macro-level business metrics, such as response time, throughput, TPS, QPS, and consumption rate (for applications using MQ). The stress test can be skipped: combining current and historical monitoring data for the business is often enough to work out the current core business metrics, such as the TPS during the afternoon peak.


★ 2.1.2 Testing phase
By the time we reach this phase, we have tentatively identified where the application's performance bottleneck is and have done a first round of tuning. To verify whether the tuning is effective, we stress-test the application under simulated conditions. Note: because Java uses just-in-time (JIT) compilation, the application may need to be warmed up before the stress test.
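To illustrate the warm-up point, here is a minimal, self-contained sketch (not from the original article) of exercising a code path before timing it so the JIT has a chance to compile the hot method; the workload() method and the iteration counts are hypothetical placeholders.

```java
/**
 * Minimal warm-up sketch: run the code path enough times for the JIT to
 * compile it before taking measurements. The workload() method and the
 * iteration counts are placeholders, not part of the original article.
 */
public class WarmupSketch {

    // Hypothetical workload standing in for the code path under test.
    static long workload() {
        long sum = 0;
        for (int i = 0; i < 10_000; i++) {
            sum += Integer.toBinaryString(i).length();
        }
        return sum;
    }

    public static void main(String[] args) {
        // Warm-up phase: results are discarded, we only want JIT compilation to kick in.
        for (int i = 0; i < 20_000; i++) {
            workload();
        }

        // Measurement phase: by now the hot path should be compiled.
        long start = System.nanoTime();
        long result = workload();
        long elapsedMicros = (System.nanoTime() - start) / 1_000;
        System.out.println("result=" + result + ", elapsed=" + elapsedMicros + "us");
    }
}
```

For serious measurements, a harness such as JMH handles warm-up iterations and other JIT-related pitfalls automatically.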
If the stress-test results meet the tuning goal, or show a clear improvement over the baseline data, we can move on and use our tools to locate the next bottleneck; otherwise, we need to set this bottleneck aside for now and keep looking for the next variable.

2.2 Precautions
When doing performance optimization, keeping the following precautions in mind can save us some detours.

  1. Performance bottlenecks usually follow an 80/20 distribution: 80% of performance problems are caused by 20% of the bottleneck points. The 80/20 principle also implies that not every performance issue is worth optimizing;
  2. Performance optimization is a gradual, iterative process that has to be carried out step by step and dynamically. Record the baseline first, then change one variable at a time; introducing multiple variables at once interferes with our observations of the optimization process;
  3. Do not over-pursue single-machine performance: if a single node already performs well, think about the problem from the perspective of the system architecture instead. Likewise, do not over-pursue extreme optimization along a single dimension, for example chasing CPU performance while ignoring memory bottlenecks;
  4. Choosing the right performance optimization tools can yield a multiplier effect;
  5. Optimization of the whole application should be isolated from the live system, and newly released code should have a degradation (rollback) plan.


3. A toolbox for bottleneck analysis
Performance optimization is really about finding the application's bottlenecks and then trying to mitigate them through some tuning technique. Locating the bottleneck accurately is the hard part; to zero in on it quickly and directly, we need two things:

  1. The right tools;
  2. Some performance optimization experience.


To do a good job, one must first sharpen one's tools. So how do we choose the right tools? Which tools should we choose under different optimization scenarios?
Let's start with the famous "Linux Performance Tools (full)" diagram; many engineers will recognize it as the work of performance expert Brendan Gregg. Starting from the various subsystems of the Linux kernel, it lists the tools we can use when analyzing the performance of each subsystem, covering every aspect of performance optimization: monitoring, analysis, and tuning. Besides this panorama, Brendan Gregg also provides separate diagrams for benchmarking tools (Linux Performance Benchmark Tools) and observability tools (Linux Performance Observability Tools); see Brendan Gregg's website for the details.

 

 Source: http://www.brendangregg.com/linuxperf.html

The diagram above is a classic and an excellent reference for performance optimization work, but in practice you may find it is not always the most suitable starting point, mainly for two reasons:
1) It demands a lot of analysis experience. The diagram observes performance from the perspective of Linux system resources, which requires us to understand the function and principles of each Linux subsystem. For example, when a performance problem occurs, we do not take every subsystem's tools and try them one by one; in most cases we suspect a particular subsystem and then use the tools listed on the diagram for that subsystem to observe or verify our guess. That clearly places high demands on the engineer's experience;
2) Its applicability and completeness are limited. Analyzing a performance problem from the bottom of the system upward is inefficient; most of the time it is more effective to start from the application layer. The Linux Performance Tools diagram only gives a tool set from the system layer's perspective. If we start our analysis from the application layer, which tools can we use, and which points should we look at first?
Given the pain points above, a more practical "performance optimization tool map" is provided below. It lists, from the perspective of both the system layer and the application layer (including the component layer), the metrics we need to pay attention to first when analyzing a performance problem (the starred ones being the most important); these are the places where bottlenecks are most likely to appear. Note that some low-frequency metrics and tools are not included in the diagram, such as CPU interrupts, inodes, and I/O event tracing; troubleshooting these requires more involved reasoning and they are encountered relatively rarely, so here we focus on the most common ones.
Compared with the Linux Performance Tools diagram, the advantage of the diagram below is that it ties concrete tools to concrete performance metrics and, at the same time, describes the distribution of bottleneck points across the different layers, which makes it more actionable. The system-layer tools are divided into CPU, memory, disk (including the file system), and network, consistent with the Linux Performance Tools diagram. The component-layer and application-layer tools consist of: tools bundled with the JDK + trace tools + dump analysis tools + profiling tools.
The specific usage of these tools is not described here; the man command gives detailed usage instructions for each of them. There is another way to consult the manual as well: info, which can be thought of as a more detailed version of man. If the man output is hard to understand, refer to the corresponding info document. There are far too many commands to memorize, and there is no need to.

 

 How should the diagram above be used?
First, although the distribution of bottleneck points is described from the separate angles of system, component, and application, in practice these angles complement and influence each other. The system provides the runtime environment for the application; the essence of a performance problem is that system resources have hit their limit, which shows up at the application layer as degrading application/component metrics. Conversely, unreasonable use or design of the application/components accelerates the exhaustion of system resources. Therefore, when analyzing a bottleneck, we need to combine the results from the different angles, find what they have in common, and arrive at the final conclusion.
Secondly, it is recommended to start from the application layer, analyzing the high-frequency metrics highlighted in the diagram and catching the most important, most suspicious points that are most likely to cause the performance problem. After reaching a preliminary conclusion, go down to the system layer to verify it. The advantage of doing it this way is that many bottlenecks show up at the system level across several metrics at once. For example, an abnormal garbage collection (GC) metric at the application layer is easy to observe with the tools bundled with the JDK, but at the system level it surfaces as abnormal CPU utilization and abnormal memory metrics at the same time, which muddies the analysis.
Finally, if the bottleneck shows up across multiple metrics at both the application layer and the system layer, it is recommended to profile the application with tools such as ZProfiler or JProfiler to obtain comprehensive runtime performance information. (Note: profiling is an analysis method that studies an application's dynamic behavior by collecting runtime information while the application runs, using events (event-based), statistical sampling (sampling), or byte-code instrumentation.) For example, statistical sampling of the CPU, combined with the various symbol tables, yields the code hot spots in the application over a period of time.
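As a toy illustration of the statistical-sampling idea (not taken from the article, and not how ZProfiler or JProfiler are actually implemented), the sketch below periodically samples all thread stacks inside the current JVM and counts which top frames appear most often on runnable threads; the sample count and interval are arbitrary demo values.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Toy stack-sampling sketch illustrating the "statistical sampling" idea behind
 * CPU profilers: periodically capture thread stacks and count the top frames.
 * Real profilers use far more accurate, lower-overhead mechanisms.
 */
public class SamplingSketch {

    public static void main(String[] args) throws InterruptedException {
        Map<String, Integer> hotFrames = new HashMap<>();

        // Take 200 samples, one every 10 ms (parameters are arbitrary for the demo).
        for (int i = 0; i < 200; i++) {
            for (Map.Entry<Thread, StackTraceElement[]> e : Thread.getAllStackTraces().entrySet()) {
                StackTraceElement[] stack = e.getValue();
                if (e.getKey().getState() == Thread.State.RUNNABLE && stack.length > 0) {
                    String frame = stack[0].getClassName() + "." + stack[0].getMethodName();
                    hotFrames.merge(frame, 1, Integer::sum);
                }
            }
            Thread.sleep(10);
        }

        // The frames seen most often on runnable threads are the likely hot spots.
        hotFrames.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .limit(5)
                .forEach(en -> System.out.println(en.getValue() + "  " + en.getKey()));
    }
}
```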
The following sections cover, layer by layer, the core performance metrics we need to focus on and how to use them to make an initial judgment about whether the system or the application has a bottleneck. Confirming the bottleneck, finding its cause, and the corresponding tuning techniques are covered in the next part.
3.1 CPU && thread

The main CPU-related metrics are as follows. Commonly used tools include top, ps, uptime, vmstat, pidstat, and so on.

    1. CPU utilization (CPU Utilization)
    2. CPU load average (Load Average)
    3. Context switches (Context Switch)
top - 12:20:57 up 25 days, 20:49,  2 users,  load average: 0.93, 0.97, 0.79
Tasks:  51 total,   1 running,  50 sleeping,   0 stopped,   0 zombie
%Cpu(s):  1.6 us,  1.8 sy,  0.0 ni, 89.1 id,  0.1 wa,  0.0 hi,  0.1 si,  7.3 st
KiB Mem :  8388608 total,   476436 free,  5903224 used,  2008948 buff/cache
KiB Swap:        0 total,        0 free,        0 used.        0 avail Mem

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
119680 admin     20   0  600908  72332   5768 S   2.3  0.9  52:32.61 obproxy
 65877 root      20   0   93528   4936   2328 S   1.3  0.1 449:03.61 alisentry_cli

The first line of the output shows the current time, the system uptime, and the number of logged-in users. The three numbers after load average are the load averages over the past 1 minute, 5 minutes, and 15 minutes. The load average is the average number of processes per unit of time that are either runnable (using the CPU or waiting for it, state R) or in uninterruptible sleep (state D), i.e. the average number of active processes; the CPU load average and CPU utilization are not directly related.
The third line shows CPU utilization; the meaning of each column can be found with man. CPU utilization reflects how the CPU is used over a statistical period, expressed as a percentage. It is calculated as: CPU utilization = 1 - idle time / total CPU time. Note that the utilization reported by a performance tool is actually an average over its sampling interval. Also note: the per-process CPU utilization displayed by top is summed over all CPU cores, i.e. on an 8-core machine it can reach 800% (tools such as htop can be used instead of top for a friendlier display).
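To make the formula concrete, here is a small sketch (an illustration, not part of the original article) that computes overall CPU utilization from two samples of /proc/stat; it assumes a Linux host with the usual field layout "cpu user nice system idle iowait irq softirq steal ...".

```java
import java.nio.file.Files;
import java.nio.file.Paths;

/**
 * Minimal sketch: compute overall CPU utilization from two samples of
 * /proc/stat, following the formula "1 - idle time / total CPU time".
 * Assumes a Linux host; counters are in jiffies.
 */
public class CpuUtilization {

    // Returns {totalJiffies, idleJiffies} from the aggregated "cpu" line.
    static long[] sample() throws Exception {
        String cpuLine = Files.readAllLines(Paths.get("/proc/stat")).get(0);
        String[] f = cpuLine.trim().split("\\s+");
        long total = 0;
        for (int i = 1; i < f.length; i++) {
            total += Long.parseLong(f[i]);
        }
        long idle = Long.parseLong(f[4]); // the 4th counter is idle time
        return new long[]{total, idle};
    }

    public static void main(String[] args) throws Exception {
        long[] a = sample();
        Thread.sleep(1000);                      // sampling interval
        long[] b = sample();
        double idleRatio = (double) (b[1] - a[1]) / (b[0] - a[0]);
        System.out.printf("CPU utilization over 1s: %.1f%%%n", (1 - idleRatio) * 100);
    }
}
```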
Using the vmstat command, we can view the "context switch" metric; the output below prints one set of data per second:

$ vmstat 1
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 504804      0 1967508    0    0   644 33377    0    1  2  2 88  0  9

In the output, cs (context switch) is the number of context switches per second. Depending on the scenario, CPU context switches can be divided into interrupt context switches, thread context switches, and process context switches, but whichever kind it is, excessive context switching makes the CPU spend its time saving and restoring registers, kernel stacks, virtual memory, and other data, which shortens the time processes actually get to run and causes a substantial drop in overall system performance. The us and sy columns in the vmstat output are the user-mode and kernel-mode CPU utilization, and both are well worth watching.
vmstat only gives the system-wide context-switch numbers; to see the context-switch details of each process (such as voluntary and involuntary switches), pidstat is needed. This command can also show each process's user-mode and kernel-mode CPU utilization.

So what is the approach for analyzing CPU-related anomalies?
1) CPU utilization: if we observe that the CPU utilization of the system or of an application process stays high over a period of time (say, above 80% of a single core), it deserves our attention. For a Java application we can repeatedly dump the thread stacks with jstack to find the hot code; for non-Java applications, perf can be used directly to sample the CPU and analyze the sampled data offline to find the CPU hot spots (a Java application's stacks need symbol-table mapping, so perf results cannot be used directly). A sketch of ranking threads by CPU time from inside the JVM follows after this list;
2) CPU load average: when the load average is higher than 70% of the number of CPUs, the system has a bottleneck. There are many reasons why the load can rise, which are not expanded on here. Note that watching the trend of the load average in the monitoring system makes it easier to locate problems; a transient rise can be caused, for example, by loading a large file. If the 1-minute / 5-minute / 15-minute values differ little, the system load is stable; if they decrease in that order (the 1-minute value clearly higher than the 15-minute value), the load has been gradually increasing and overall performance needs attention;
3) CPU context switches: there is no recommended reference value for this metric (anywhere from the hundreds to tens of thousands can be normal); it depends on the CPU performance of the system itself and on what the application is currently doing. However, if the number of context switches in the system or the application grows by an order of magnitude, there is a high probability of a performance problem; for example, a sharp rise in involuntary context switches indicates that too many threads are competing for the CPU.
The three metrics above are closely related; for example, frequent context switching can drive the load average up. How to tune the application based on the relationship between them is covered in the next section.
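As a complement to jstack and perf, here is a hedged sketch of ranking the current JVM's own threads by consumed CPU time through the standard ThreadMXBean API; it assumes thread CPU-time measurement is supported on the platform and only sees the JVM's threads.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.Comparator;
import java.util.stream.LongStream;

/**
 * Sketch: rank the current JVM's threads by consumed CPU time using ThreadMXBean.
 * Useful as a first step before correlating hot threads with a jstack dump.
 */
public class HotThreads {

    public static void main(String[] args) {
        ThreadMXBean tmx = ManagementFactory.getThreadMXBean();
        if (tmx.isThreadCpuTimeSupported()) {
            tmx.setThreadCpuTimeEnabled(true);
        }

        LongStream.of(tmx.getAllThreadIds())
                .boxed()
                .sorted(Comparator.comparingLong(tmx::getThreadCpuTime).reversed())
                .limit(5)
                .forEach(id -> {
                    ThreadInfo info = tmx.getThreadInfo(id);
                    if (info != null) {
                        System.out.printf("%-40s cpu=%d ms%n",
                                info.getThreadName(), tmx.getThreadCpuTime(id) / 1_000_000);
                    }
                });
    }
}
```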
Some CPU anomalies can usually also be observed from the thread side, but note that thread problems are not all CPU-related. The main thread-related metrics are as follows (all of them can be obtained, directly or indirectly, with the jstack tool bundled with the JDK):

  1. The total number of threads in the application;
  2. The distribution of thread states in the application;
  3. Thread lock usage, such as deadlocks and lock distribution;


Regarding threads, the anomalies worth watching for are:
1) Is the total number of threads too high? Too many threads show up on the CPU as frequent context switches, and each thread also consumes memory; the appropriate total depends on the machine configuration and on the application itself;
2) Are there threads in abnormal states? Check whether there are too many WAITING/BLOCKED threads (caused by configuring too many threads or by fierce lock contention), and analyze this together with how the application uses locks internally;
3) Combined with CPU utilization, check whether any thread is consuming a large amount of CPU.
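The same checks can also be scripted in-process. Below is a minimal sketch, using the JDK's ThreadMXBean, that reports the thread count, the thread-state distribution, and any deadlocks, essentially the programmatic counterpart of reading a jstack dump by hand.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.EnumMap;
import java.util.Map;

/**
 * Sketch: summarize the current JVM's thread count, thread-state distribution,
 * and deadlocks via ThreadMXBean.
 */
public class ThreadHealthCheck {

    public static void main(String[] args) {
        ThreadMXBean tmx = ManagementFactory.getThreadMXBean();

        // 1) Total number of threads.
        System.out.println("Total threads: " + tmx.getThreadCount());

        // 2) Distribution of thread states (many WAITING/BLOCKED threads deserve a closer look).
        Map<Thread.State, Integer> byState = new EnumMap<>(Thread.State.class);
        for (ThreadInfo info : tmx.getThreadInfo(tmx.getAllThreadIds())) {
            if (info != null) {
                byState.merge(info.getThreadState(), 1, Integer::sum);
            }
        }
        byState.forEach((state, count) -> System.out.println(state + ": " + count));

        // 3) Deadlock detection: threads blocked on monitors or ownable synchronizers.
        long[] deadlocked = tmx.findDeadlockedThreads();
        if (deadlocked != null) {
            for (ThreadInfo info : tmx.getThreadInfo(deadlocked)) {
                System.out.println("DEADLOCKED: " + info.getThreadName()
                        + " waiting on " + info.getLockName());
            }
        }
    }
}
```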

3.2 Memory && Heap
The main memory-related metrics are as follows; commonly used analysis tools include top, free, vmstat, pidstat, and some of the tools bundled with the JDK.

  1. System memory usage, including free memory, used memory, available memory, and cache/buffers;
  2. The virtual memory, resident memory, and shared memory of processes (including the Java process);
  3. The number of page faults per process, including major and minor page faults;
  4. The amount of memory swapped in and out of the Swap partition, and the Swap parameters;
  5. JVM heap allocation and JVM startup parameters;
  6. JVM heap reclamation, i.e. GC behavior.


free can be used to view the system's memory usage and Swap partition usage, while top can drill down to each process; for example, top shows a Java process's resident memory size (RES). Combined, these two tools cover most of the memory metrics. Below is the output of the free command:

$free -h
              total        used        free      shared  buff/cache   available
Mem:           125G        6.8G         54G        2.5M         64G        118G
Swap:          2.0G        305M        1.7G

The meaning of each column in the output is not repeated here; it is fairly easy to understand. Let's focus on the swap and buff/cache metrics.
Swap means using a local file or a disk partition as memory, and involves both swapping in and swapping out. Swap requires disk reads and writes, so its performance is not high. In fact, most Java applications, including Elasticsearch and Hadoop, recommend turning Swap off, partly because memory has become cheap, and partly because of how the JVM does garbage collection: during GC the JVM traverses all the heap memory it uses, and if that memory has been swapped out, disk I/O is generated while traversing it. Increased Swap usage is usually strongly correlated with increased disk usage; a concrete analysis needs to combine the cache usage, the swappiness threshold, and the activity of anonymous pages and file pages.
buff/cache is the combined size of the buffers and the cache. The cache temporarily stores data when files are read from or written to disk; it is file-oriented. cachestat shows the read/write cache hit rate for the whole system, and cachetop shows the cache hit rate per process. The buffer temporarily stores data when writing to or reading from a disk device; it is block-device-oriented. In the output of free the two metrics are added together; the vmstat command can distinguish cache from buffer and also shows the amount of memory swapped in and out of the Swap partition.
Now that we know the common memory metrics, what are the common memory problems? They can be summarized as follows:

  1. Insufficient free/available system memory (a process takes up too much memory, or the system itself runs out of memory), out-of-memory conditions;
  2. Abnormal memory reclamation: memory leaks (the process's memory usage keeps rising over time), abnormal GC frequency;
  3. Excessive cache usage (reading or writing large files), low cache hit rate;
  4. Abnormally frequent page faults (frequent I/O reads);
  5. Abnormal Swap usage (excessive use).


Once a memory-related anomaly occurs, what is the analysis approach?

  1. Use free/top to get a global view of memory usage, such as overall system memory usage, Swap partition usage, and cache/buffer usage, and determine the initial direction of the memory problem: process memory, cache/buffers, or the Swap partition;
  2. Observe how memory usage changes over time, for example using vmstat to check whether used memory keeps growing; use jmap to periodically collect object memory statistics and determine whether there is a memory leak; use cachetop to find the root cause of rising buffer usage;
  3. Depending on the type of memory problem, analyze it in detail together with the application itself.


Example: free shows low cache/buffer usage, ruling out the cache/buffers -> use vmstat or sar to observe each process's memory usage trend -> discover that a process's memory usage keeps rising -> if it is a Java application, use jmap / VisualVM / heap dump analysis tools to observe object allocation, or use jstat to observe how the heap changes after GC -> combine with the business scenario to pin the problem down to a memory leak / unreasonable GC parameters / a defect in the business code.
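To complement jstat and jmap, here is a hedged sketch that logs heap usage and cumulative GC counts from inside the JVM via the standard MemoryMXBean and GarbageCollectorMXBean APIs; a used-heap figure that keeps climbing across GCs is the classic leak signal. The interval and iteration count are arbitrary demo values.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

/**
 * Sketch: periodically log heap usage and GC counts, an in-process counterpart
 * of "jstat -gcutil <pid> 1000". Used heap that keeps rising across GCs hints
 * at a memory leak.
 */
public class HeapTrendLogger {

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < 10; i++) {
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();

            long gcCount = 0;
            long gcTimeMs = 0;
            for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
                gcCount += gc.getCollectionCount();   // cumulative collections since JVM start
                gcTimeMs += gc.getCollectionTime();   // cumulative GC pause time in ms
            }

            System.out.printf("heap used=%d MB / committed=%d MB, gcCount=%d, gcTime=%d ms%n",
                    heap.getUsed() >> 20, heap.getCommitted() >> 20, gcCount, gcTimeMs);

            Thread.sleep(1_000); // sampling interval
        }
    }
}
```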

3.3 Disk && File System
When analyzing disk-related problems, the file system is usually considered together with the disk, and the two are not distinguished below. The main disk / file-system metrics are as follows; the common observation tools are iostat and pidstat: the former applies to the whole system, the latter can observe the I/O of a specific process.

  1. Disk I/O utilization: the percentage of time the disk spends processing I/O;
  2. Disk throughput: the amount of I/O per second, in KB;
  3. I/O response time: the interval from when an I/O request is issued until a response is received, including the actual processing time and the time spent waiting in the queue;
  4. IOPS (Input/Output Per Second): the number of I/O requests per second;
  5. I/O queue size: the average length of the I/O queue; the shorter the queue, the better.


The output of iostat looks like this:

$iostat -dx
Linux 3.10.0-327.ali2010.alios7.x86_64 (loginhost2.alipay.em14)     10/20/2019     _x86_64_    (32 CPU)

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.01    15.49    0.05    8.21     3.10   240.49    58.92     0.04    4.38    2.39    4.39   0.09   0.07

In the output above, %util is the disk I/O utilization; like CPU utilization, this value can exceed 100% (when I/O is issued in parallel). rkB/s and wkB/s are the amounts of data read from and written to the disk per second, i.e. the throughput, in KB. The disk I/O processing-time metrics are r_await and w_await, the response times for completed read and write requests; svctm represents the average time needed to process an I/O, but this metric has been deprecated and has no practical meaning. r/s + w/s is the IOPS metric, the numbers of read and write requests sent to the disk per second; avgqu-sz is the length of the waiting queue.

The output of pidstat is largely similar to that of iostat; the difference is that it can show each process's I/O in real time.
How do we judge that the disk metrics are abnormal?

  1. When disk I/O utilization stays above 80% for a long time, or the response time is too large (for SSDs, roughly in the 0.0x ms to 1.x ms range; for mechanical disks, typically 5 ms to 10 ms), there is usually a disk I/O performance bottleneck;
  2. If %util is large while rkB/s and wkB/s are small, there is generally a lot of random disk I/O, and random I/O is best optimized into sequential I/O (strace or blktrace can be used to observe whether the I/O is sequential; for random I/O the metric to watch is IOPS, for sequential I/O the throughput);
  3. If avgqu-sz is relatively large, many I/O requests are waiting in the queue. Generally, if a single disk's queue length stays above 2 for a sustained period, that disk is considered to have an I/O performance problem.

3.4 Network
The concept of "network" covers a wide range: the application layer, transport layer, network layer, and network interface layer all have different metrics. The "network" discussed here refers in particular to the application-layer network, and the commonly used metrics are as follows:

  1. Network bandwidth: the maximum transmission rate of the link;
  2. Network throughput: the amount of data successfully transmitted per unit of time;
  3. Network latency: the time from when a request is sent until a response is received from the remote end;
  4. The number of network connections and network errors.


In general, application-layer network bottlenecks fall into the following categories:

  1. The network bandwidth of the cluster or machine room is saturated, which prevents the application's QPS/TPS from being raised further;
  2. Abnormal network throughput, for example an interface transferring a large amount of data and consuming excessive bandwidth;
  3. Abnormal network connections or connection errors;
  4. A network partition occurs.


Bandwidth and network throughput are metrics we usually look at for the application as a whole, and they can be obtained directly from the monitoring system; if a metric clearly rises over time, there is a network performance bottleneck. For a single machine, sar can be used to obtain the network throughput of each network interface and each process.
ping or hping3 can be used to determine whether a network partition has occurred and to measure the specific network latency. For an application, we care more about the latency of the whole call chain, which can be obtained from the trace logs emitted by the instrumented middleware.
netstat, ss, and sar can be used to get the number of network connections or network errors. The overhead of too many network connections is significant: first, each one takes a file descriptor; second, they consume cache. So the number of network connections a system can support is limited.
3.5 Tools Summary
As can be seen, several tools appear with high frequency when analyzing CPU, memory, disk, and other metrics, such as top, vmstat, and pidstat. Here is a brief summary:

  1. CPU: top, vmstat, pidstat, sar, perf, jstack, jstat;
  2. Memory: top, free, vmstat, cachetop, cachestat, sar, jmap;
  3. Disk: top, iostat, vmstat, pidstat, du/df;
  4. Network: netstat, sar, dstat, tcpdump;
  5. Application: profilers, dump analysis.

 

Of the many tools above, most are used to view system-level metrics. At the application layer, besides the rich set of tools provided by the JDK, products such as gceasy.io (GC log analysis) and fastthread.io (thread-dump log analysis) are also quite good.

For troubleshooting online Java application anomalies or analyzing application code bottlenecks, you can use Alibaba's open-source Arthas. This tool is very powerful; a brief introduction follows.

 

Arthas is mainly oriented toward real-time diagnostics of online applications; it solves problems like "the online application is misbehaving and needs to be analyzed and located on the spot". Of course, the method-call tracing that Arthas provides is also very helpful when investigating problems such as slow queries. The main features Arthas provides are:

 

  1. Thread statistics, such as which threads hold locks and per-thread CPU utilization;
  2. Class-loading information, dynamic class loading, and method-loading information;
  3. Call-stack tracing and call time-consumption statistics;
  4. Inspection of method-call parameters and return values;
  5. System and application configuration information;
  6. Decompilation of loaded classes;
  7. ...


Finally, it should be noted that tools are only a means of solving performance problems. It is enough to understand the general usage of the common ones; there is no need to spend too much energy on learning the tools themselves.


Origin www.cnblogs.com/029zz010buct/p/12608303.html