Server performance evaluation (CPU, memory, disk I/O)

1. Factors affecting the performance of Linux servers

    1. Operating system level

        CPU
        Memory
        Disk I/O bandwidth
        Network I/O bandwidth

    2. Program application level

2. System performance evaluation criteria

    Judging criteria for the factors affecting performance:

    Factor    Good                  Bad                   Very bad
    CPU       user% + sys% < 70%    user% + sys% = 85%    user% + sys% >= 90%
    Memory    Swap In (si) = 0,     10 pages/s per CPU    More Swap In and Swap Out
              Swap Out (so) = 0
    Disk      iowait% < 20%         iowait% = 35%         iowait% >= 50%

        where:

         %user: the percentage of time the CPU spends in user mode.

         %sys: the percentage of time the CPU spends in system (kernel) mode.

         %iowait: the percentage of time the CPU is idle while waiting for I/O to complete.

         Swap in (si): pages of virtual memory brought in, that is, moved from the swap disk to RAM.

         Swap out (so): pages of virtual memory written out, that is, moved from RAM to the swap disk.
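
The CPU and disk criteria above can be sketched as a small classifier. This is only an illustration: the function names are hypothetical, and because the criteria give single reference points (85%, 35%) rather than ranges, the sketch assumes everything between the "good" and "very bad" limits counts as "bad".

```python
def rate_cpu(user_pct: float, sys_pct: float) -> str:
    """Rate CPU load from user% + sys% using the criteria above."""
    busy = user_pct + sys_pct
    if busy < 70:
        return "good"
    if busy < 90:          # assumption: the whole 70-90% band is "bad"
        return "bad"
    return "very bad"


def rate_iowait(iowait_pct: float) -> str:
    """Rate disk pressure from iowait% using the criteria above."""
    if iowait_pct < 20:
        return "good"
    if iowait_pct < 50:    # assumption: the whole 20-50% band is "bad"
        return "bad"
    return "very bad"


print(rate_cpu(0.88, 0.29), rate_iowait(0.0))
```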

3. System performance analysis tools

        1. Common system commands

         vmstat, sar, iostat, netstat, free, ps, top, etc.

        2. Commonly used combinations

         o Use vmstat, sar, and iostat to check whether the CPU is the bottleneck

         o Use free and vmstat to check whether memory is the bottleneck

         o Use iostat to check whether disk I/O is the bottleneck

         o Use netstat to check whether network bandwidth is the bottleneck

4. Linux performance evaluation and optimization

        1. Overall system performance evaluation (uptime command)

        [root@server ~]# uptime

         16:38:00 up 118 days, 3:01, 5 users, load average: 1.22, 1.02, 0.91

         The three load average values show the system load over the last 1, 5, and 15 minutes. As a rule, none of them should exceed the number of CPUs in the system. The system in this example has 8 CPUs: if all three values stay above 8 for a long time, the CPUs are very busy and the high load may affect system performance. An occasional reading above 8 is no cause for concern and generally does not affect performance. Conversely, a load average below the number of CPUs means there are still free CPU time slices. In this example the CPUs are nearly idle.
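
The rule above can be expressed as a short check. This is a minimal sketch with a hypothetical function name; it treats "greater than the CPU count for a long time" as "all three averages exceed the CPU count", which is an assumption. On a live Unix system the inputs could come from os.getloadavg() and os.cpu_count().

```python
def is_overloaded(load_averages, cpu_count):
    """Return True when the 1, 5 and 15 minute load averages all exceed the CPU count."""
    return all(load > cpu_count for load in load_averages)


# Values taken from the uptime output above, on the 8-CPU system.
sample = (1.22, 1.02, 0.91)
print(is_overloaded(sample, cpu_count=8))
```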

        2. CPU performance evaluation

        (1) Use the vmstat command to monitor the system CPU.

         This command can display brief information about the performance of various resources in the system. Here we mainly use it to see the CPU load.

         The following is the output of the vmstat command on a certain system:

        [root@node1 ~]# vmstat 2 3

         procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
          r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy  id wa st
          0  0      0 162240   8304  67032    0    0    13    21 1007   23  0  1  98  0  0
          0  0      0 162240   8304  67032    0    0     1     0 1010   20  0  1 100  0  0
          0  0      0 162240   8304  67032    0    0     1     1 1009   18  0  1  99  0  0

    Procs

        The r column shows the number of processes that are running or waiting for a CPU time slice. If this value stays above the number of CPUs in the system for a long time, CPU capacity is insufficient and more CPUs are needed.

         The b column shows the number of processes waiting for resources, for example waiting for I/O or for memory to be swapped.
    CPU

        The us column shows the percentage of CPU time consumed by user processes. A high us value means user processes are consuming a lot of CPU time; if it stays above 50% for a long time, consider optimizing the program or its algorithms.

         The sy column shows the percentage of CPU time consumed by the kernel. A high sy value means the kernel is consuming a lot of CPU resources.

         As a rule of thumb, the reference value for us+sy is 80%. If us+sy exceeds 80%, CPU resources may be insufficient.
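
The two CPU checks described above (r against the CPU count, and us+sy against 80%) can be applied to a vmstat data line as follows. This is a sketch only: the helper names are hypothetical, and the column list is copied from the output shown above, so it may not match the columns printed by other vmstat versions.

```python
# Column order as printed in the vmstat output above (an assumption for other versions).
VMSTAT_COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa st".split()


def parse_vmstat_line(line):
    """Turn one vmstat data line into a dict keyed by column name."""
    values = [int(v) for v in line.split()]
    if len(values) != len(VMSTAT_COLUMNS):
        raise ValueError("unexpected number of vmstat columns")
    return dict(zip(VMSTAT_COLUMNS, values))


def cpu_warnings(sample, cpu_count):
    """Apply the r and us+sy rules of thumb to one parsed sample."""
    warnings = []
    if sample["r"] > cpu_count:
        warnings.append("run queue longer than CPU count")
    if sample["us"] + sample["sy"] > 80:
        warnings.append("us+sy above 80%")
    return warnings


sample = parse_vmstat_line("0 0 0 162240 8304 67032 0 0 13 21 1007 23 0 1 98 0 0")
print(cpu_warnings(sample, cpu_count=1))
```

A single sample says little. The text stresses values that persist, so in practice the check would be run over many samples.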

        (2) Use the sar command to monitor the system CPU

        sar is very powerful and can collect statistics on each aspect of the system separately. Running sar adds some system overhead, but that overhead can be measured and does not significantly affect the statistics.

         The following is the CPU statistics output of the sar command for a certain system:

         [root@webserver ~]# sar -u 3 5

         Linux 2.6.9-42.ELsmp (webserver) 11/28/2008 _i686_ (8 CPU)

         11:41:24 AM CPU %user %nice %system %iowait %steal %idle

         11:41:27 AM all 0.88 0.00 0.29 0.00 0.00 98.83

         11:41:30 AM all 0.13 0.00 0.17 0.21 0.00 99.50

         11:41:33 AM all 0.04 0.00 0.04 0.00 0.00 99.92

         11:41:36 AM all 90.08 0.00 0.13 0.16 0.00 9.63

         11:41:39 AM all 0.38 0.00 0.17 0.04 0.00 99.41

         Average: all 0.34 0.00 0.16 0.05 0.00 99.45

        The columns in the output above are explained as follows:
    The %user column shows the percentage of CPU time consumed by user processes.
    The %nice column shows the percentage of CPU time consumed by user processes running with a modified (nice) priority.
    The %system column shows the percentage of CPU time consumed by the kernel.
    The %iowait column shows the percentage of time the CPU was idle while waiting for I/O.
    The %steal column shows the percentage of time a virtual CPU waited involuntarily while the hypervisor was servicing another virtual processor.
    The %idle column shows the percentage of time the CPU was idle.

        Question

         1. Have you ever seen a system where overall CPU utilization is low but the application is slow?

         This can happen on a multi-CPU system when the program is single-threaded. The single thread runs on only one CPU and drives it to 100% utilization, so it cannot handle further requests, while the other CPUs sit idle. The result is low overall CPU utilization together with a slow application.
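
One way to spot this pattern is to look at per-CPU figures (for example from sar -P ALL) instead of the "all" line. The sketch below is illustrative only: the function name and both thresholds are assumptions, not values from the text.

```python
def looks_single_thread_bound(per_cpu_busy, overall_low=30.0, saturated_at=95.0):
    """Return True when overall utilization is low but at least one CPU is saturated.

    per_cpu_busy: busy percentage (100 - %idle) for each CPU.
    overall_low and saturated_at are hypothetical thresholds.
    """
    if not per_cpu_busy:
        raise ValueError("need at least one CPU reading")
    overall = sum(per_cpu_busy) / len(per_cpu_busy)
    saturated = [i for i, busy in enumerate(per_cpu_busy) if busy >= saturated_at]
    return overall < overall_low and bool(saturated)


# Hypothetical readings for an 8-CPU machine: CPU 0 is pinned, the rest are idle.
print(looks_single_thread_bound([100.0, 1.0, 2.0, 0.0, 1.0, 0.0, 3.0, 1.0]))
```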

        3. Memory performance evaluation

         (1) Use the free command to monitor memory

         free is the most commonly used command for monitoring memory usage on Linux. Look at the following output:

         [root@webserver ~]# free -m

         total used free shared buffers cached

         Mem: 8111 7185 926 0 243 6299

         -/+ buffers/cache: 643 7468

         Swap: 8189 0 8189

         A common rule of thumb is based on the ratio of memory available to applications to physical memory. When the ratio is above 70%, memory is plentiful and does not affect system performance. When the ratio is below 20%, memory is in short supply and more memory should be added. When the ratio is between 20% and 70%, memory is basically enough for the applications and will not affect system performance for the time being.
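
As a worked example using the free -m output above, the memory available to applications is free + buffers + cached = 926 + 243 + 6299 = 7468 MB, which matches the "-/+ buffers/cache" line. The sketch below applies the rule of thumb; the function names and status labels are hypothetical.

```python
def available_ratio(total_mb, free_mb, buffers_mb, cached_mb):
    """Memory available to applications divided by physical memory."""
    return (free_mb + buffers_mb + cached_mb) / total_mb


def memory_status(ratio):
    """Map the ratio to the three bands of the rule of thumb."""
    if ratio > 0.70:
        return "sufficient"
    if ratio < 0.20:
        return "short"
    return "adequate for now"


# Values from the "Mem:" line of the free -m output above.
ratio = available_ratio(total_mb=8111, free_mb=926, buffers_mb=243, cached_mb=6299)
print(f"{ratio:.0%} {memory_status(ratio)}")
```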

        (2) Use vmstat command to monitor memory

        [root@node1 ~]# vmstat 2 3

         procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
          r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy  id wa st
          0  0      0 162240   8304  67032    0    0    13    21 1007   23  0  1  98  0  0
          0  0      0 162240   8304  67032    0    0     1     0 1010   20  0  1 100  0  0
          0  0      0 162240   8304  67032    0    0     1     1 1009   18  0  1  99  0  0

    Memory

        The swpd column shows the amount of memory that has been swapped out to the swap area (in KB). A swpd value that is nonzero, or even fairly large, is generally nothing to worry about as long as si and so stay at 0 over the long term; system performance is not affected.

         The free column shows the amount of free physical memory (in KB).

         The buff column shows the amount of memory used as buffers. Buffers are generally needed for reads and writes to block devices.

         The cache column shows the amount of memory used as page cache, which generally serves as the file system cache. Frequently accessed files are cached. A large cache value means that many files are cached; if bi in the io columns is small at the same time, the file system is working efficiently.
    Swap

        The si column shows the amount of memory swapped in from the swap area on disk per second.

         The so column shows the amount of memory swapped out to the swap area on disk per second.

         Normally both si and so are 0. If they stay above 0 for a long time, the system is short of memory and more memory should be added.
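
The "not 0 for a long time" condition can be checked over a series of vmstat samples. This is a minimal sketch: the function name and the min_samples parameter are assumptions, and how many samples count as "a long time" depends on the sampling interval.

```python
def sustained_swapping(samples, min_samples=3):
    """Return True when every (si, so) sample shows swap activity.

    samples: list of (si, so) pairs taken from consecutive vmstat lines.
    min_samples: hypothetical minimum number of samples needed to judge.
    """
    if len(samples) < min_samples:
        return False
    return all(si > 0 or so > 0 for si, so in samples)


# The three samples from the vmstat output above: no swapping at all.
print(sustained_swapping([(0, 0), (0, 0), (0, 0)]))
```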

        4. Disk I/O performance evaluation

         (1) Disk storage basics
    Be familiar with the RAID levels so that you can choose a RAID level suited to each application.
    Replace direct disk I/O with memory reads and writes wherever possible: keep frequently accessed files or data in memory, because memory reads and writes are thousands of times faster than direct disk reads and writes.
    Separate frequently read and written files from files that rarely change, and place them on different disk devices.
    For data that is written frequently, consider using raw devices instead of a file system.

        The advantages of using raw devices are:
    Data is read and written directly without going through the operating system cache, which saves memory and avoids contention for memory resources.
    File system maintenance overhead is avoided, such as maintaining the superblock, inodes, and so on.
    The operating system's cache read-ahead is bypassed, which reduces I/O requests.

        The disadvantage of using raw devices is:
    Data management and space management are inflexible and require highly skilled staff.

        (2) Use iostat to evaluate disk performance

         [root@webserver ~]# iostat -d 2 3

         Linux 2.6.9-42.ELsmp (webserver) 12/01/2008 _i686_ (8 CPU)

        Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn

         sda 1.87 2.58 114.12 6479462 286537372

        Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn

         sda 0.00 0.00 0.00 0 0

        Device: tps Blk_read/s Blk_wrtn/s Blk_read Blk_wrtn

         sda 1.00 0.00 12.00 0 24

         The columns in the output above are explained as follows:
    Blk_read/s is the number of blocks read per second.
    Blk_wrtn/s is the number of blocks written per second.
    Blk_read is the total number of blocks read.
    Blk_wrtn is the total number of blocks written.
    The values of Blk_read/s and Blk_wrtn/s give a basic picture of the disk's read and write activity. A large Blk_wrtn/s value means the disk is written to very frequently; consider optimizing the disk or the program. A large Blk_read/s value means many reads go directly to the disk; consider keeping the data being read in memory.
    There is no fixed normal value for these two columns; they vary with the application. One rule does hold, though: sustained, heavy reading and writing over a long period is abnormal and will affect system performance.
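
To make the block counts easier to read, they can be converted to throughput. The sketch below assumes that one iostat block is a 512-byte sector, which is how the iostat documentation describes blocks on 2.4 and later kernels; the function name is hypothetical.

```python
SECTOR_BYTES = 512  # assumption: one iostat block equals one 512-byte sector


def blocks_to_kib_per_s(blocks_per_s, block_bytes=SECTOR_BYTES):
    """Convert a Blk_read/s or Blk_wrtn/s value to KiB per second."""
    return blocks_per_s * block_bytes / 1024


# Blk_wrtn/s from the first iostat report above.
print(blocks_to_kib_per_s(114.12))
```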

        (3) Use sar to evaluate disk performance

         The "sar -d" option gives basic statistics on the system's disk I/O. See the following output:
    [root@webserver ~]# sar -d 2 3
    Linux 2.6.9-42.ELsmp (webserver) 11/30/2008 _i686_ (8 CPU)

    11:09:33 PM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
    11:09:35 PM dev8-0 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00

    11:09:35 PM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
    11:09:37 PM dev8-0 1.00 0.00 12.00 12.00 0.00 0.00 0.00 0.00

    11:09:37 PM DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
    11:09:39 PM dev8-0 1.99 0.00 47.76 24.00 0.00 0.50 0.25 0.05

    Average: DEV tps rd_sec/s wr_sec/s avgrq-sz avgqu-sz await svctm %util
    Average: dev8-0 1.00 0.00 19.97 20.00 0.00 0.33 0.17 0.02

    The meaning of the parameters that need attention:
        await is the average time (in milliseconds) each I/O request to the device takes, including the time spent waiting in the queue.
        svctm is the average service time (in milliseconds) of each I/O request to the device.
        %util is the percentage of each second spent on I/O operations.

    For disk I/O performance, the following criteria generally apply:
    Under normal circumstances svctm should be less than await. The svctm value depends on disk performance; CPU and memory load also affect it, and an excessive number of requests indirectly increases it.
    The await value generally depends on svctm, the length of the I/O queue, and the I/O request pattern. If svctm is very close to await, there is almost no I/O waiting and disk performance is very good. If await is far higher than svctm, requests are waiting in the I/O queue too long and the applications running on the system will slow down. In that case, replacing the disk with a faster one can solve the problem.
    The %util value is also an important indicator of disk I/O. If %util is close to 100%, the disk is receiving too many I/O requests, the I/O system is working at full capacity, and the disk may be a bottleneck. In the long run this will inevitably affect system performance. It can be addressed by optimizing the program or by replacing the disk with a faster one.
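
The criteria above can be turned into a rough check on one sar -d line. This is a sketch under stated assumptions: the function name is hypothetical, and wait_factor and util_limit are placeholder thresholds for "far higher than svctm" and "close to 100%", which the text does not quantify.

```python
def assess_disk(await_ms, svctm_ms, util_pct, wait_factor=5.0, util_limit=90.0):
    """Apply the await/svctm and %util rules of thumb to one device sample.

    wait_factor and util_limit are hypothetical thresholds.
    """
    findings = []
    if svctm_ms > 0 and await_ms > wait_factor * svctm_ms:
        findings.append("requests wait too long in the I/O queue")
    if util_pct >= util_limit:
        findings.append("disk is close to saturation")
    return findings


# Third sample for dev8-0 from the sar -d output above.
print(assess_disk(await_ms=0.50, svctm_ms=0.25, util_pct=0.05))
```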

    5. Network performance evaluation

    (1) Check network connectivity with the ping command
    (2) Check the network interface status with netstat -i
    (3) Check the system's routing table with netstat -r
    (4) Display the system's network activity with sar -n

     

    =================================================

    Remarks: Although some standards are given above, our own criteria differ slightly. For example, CPU utilization should generally stay below 75%, load is judged against the number of CPU cores, and memory utilization should stay below 80%. For web applications there are further dimensions to judge: both the error rate and the timeout rate must be below one in ten thousand.

Origin blog.csdn.net/qq_32907195/article/details/112831475