Viewing resource usage on Linux

top - 16:38:04 up 53 days, 21:04,  0 users,  load average: 72.35, 70.02, 71.65
Tasks: 886 total,   2 running, 883 sleeping,   0 stopped,   1 zombie
%Cpu(s): 68.9 us, 15.2 sy,  0.0 ni, 15.8 id,  0.0 wa,  0.0 hi,  0.1 si,  0.0 st
KiB Mem : 98833648 total, 12209316 free, 30220856 used, 56403472 buff/cache
KiB Swap:  4194300 total,  4090868 free,   103432 used. 60035444 avail Mem 

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                               
24900 root      20   0   27.6g  95844  21544 S 105.1  0.1   0:03.43 java                                                                                                  
25028 root      20   0   27.4g  88684  21376 S 101.3  0.1   0:03.15 java                                                                                                  
26004 root      20   0 2929492  23272   3540 S  52.7  0.0  24245:31 hcs_shvfs                                                                                             
30444 root      20   0 7699112  39772  18820 S  52.4  0.0  56161:17 mps                                                                                                   
 3222 root      20   0   11.7g  68088   7452 S  44.7  0.1  24466:39 hcs_vs                                                                                                
17739 root      20   0   17.8g   4.6g  45176 S  44.7  4.9  78:40.45 java                                                                                                  
 4963 root      20   0   16.2g 183272  13812 S  21.5  0.2   9474:29 das_media                                                                                             
23430 root      20   0 7379000 191428  22620 S  19.9  0.2   5187:15 ncg-gb35114                                                                                           
29940 root      20   0 4100840 200800  11304 S  17.0  0.2  16963:03 hik.opsmgr.moni                                                                                       
 5029 root      20   0   11.3g 148652   9228 S  12.2  0.2   9141:19 RegService                                                                                            
 1913 root      20   0 3479096 570748  76120 S  11.9  0.6  44:23.91 Web Content                                                                                           
30846 root      20   0 7515652  70476  30020 S   9.3  0.1   8151:32 Probe 

Explanation of the top half (the summary area):
(1) First line: current system time, uptime, number of logged-in users, and the system load average over the last 1, 5, and 15 minutes.
(2) Second line: total number of processes (total), followed by the number running (running), sleeping (sleeping), stopped (stopped), and zombie (zombie).
(3) Third line (the key one), CPU time as percentages:
        us: time spent in user space.
        sy: time spent in kernel space.
        ni: time spent in user processes whose priority was changed with nice.
        id: idle time.
        wa: time spent waiting for I/O.
        hi: time spent servicing hardware interrupts (hardware IRQs).
        si: time spent servicing software interrupts.
        st: time stolen from this virtual machine by the hypervisor.
(4) Fourth line: physical memory in KiB, in the order shown: total, free, used, and buff/cache (memory used for kernel buffers and the page cache).
(5) Fifth line: swap space in KiB, in the order shown: total, free, and used. The final value, avail Mem, is an estimate of the physical memory available for starting new applications without swapping.
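
The CPU breakdown on the third line can also be read by a script. The following is a minimal sketch, assuming the `%Cpu(s):` line layout shown in the sample output above; the function name `cpu_busy_percent` is hypothetical. It reports busy CPU as 100 minus the `id` value.

```shell
# Hypothetical helper: read top's summary text on stdin and print the
# busy CPU percentage, computed as 100 - id.
# Assumes the "%Cpu(s): ... 15.8 id, ..." layout shown in the sample above.
cpu_busy_percent() {
    awk '/^%Cpu/ {
        for (i = 2; i <= NF; i++) {
            if ($i == "id,") {
                printf "%.1f\n", 100 - $(i - 1)
                exit
            }
        }
    }'
}
```

On a live system the input could come from top's batch mode, for example `top -b -n 1 | cpu_busy_percent`. For the sample line above (15.8 id) it prints 84.2.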
 
Columns in the bottom half of top (the process list):
PID: process ID
USER: name of the user who owns the process
PR: scheduling priority
NI: nice value; a negative value means higher priority, a positive value means lower priority
VIRT: virtual memory used by the process
RES: resident (physical) memory used by the process
SHR: shared memory
S: process state. D = uninterruptible sleep; R = running; S = sleeping; T = traced or stopped; Z = zombie
%CPU: percentage of CPU time used since the last screen update
%MEM: percentage of physical memory used by the process
TIME+: total CPU time used by the process, to hundredths of a second
COMMAND: command name or command line

Parameters:
top -d 2: refresh the resource usage of all processes every 2 seconds
top -c: refresh at the default interval and show each process's full command line (by default only the process name is shown)
top -p 12345 -p 6789: show only the two processes with PIDs 12345 and 6789, refreshing at the default interval
top -d 2 -c -p 12345: show only the process with PID 12345, refreshing every 2 seconds and displaying the command line it was started with

Locating zombie processes and their parent processes
Use the command: ps -A -ostat,ppid,pid,cmd | grep -e '^[Zz]'

Running kill -HUP <zombie PID> against the zombie itself usually has no effect: a zombie has already exited, and only its process-table entry remains until the parent reaps it. In that case, signal the parent process of the zombie instead. If the parent exits, the zombie is adopted by init and reaped.
kill -HUP <parent PID of the zombie>

Parameter interpretation

ps -A -ostat,ppid,pid,cmd | grep -e '^[Zz]'

-A lists all processes.
-o selects custom output fields: stat (state), ppid (parent process ID), pid (process ID), cmd (command).
A process whose state starts with z or Z is a zombie, so grep keeps only the lines whose stat field begins with Z or z.
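
The ps/grep filter above can be taken one step further to list just the parent PIDs that need attention. This is a minimal sketch, assuming the stat, ppid, pid, cmd column order from the command above; `zombie_parents` is a hypothetical name.

```shell
# Hypothetical helper: read `ps -A -o stat,ppid,pid,cmd` output on stdin
# and print "parent_pid zombie_pid" for every zombie process.
zombie_parents() {
    awk '$1 ~ /^[Zz]/ { print $2, $3 }'
}

# Typical use on a live system:
#   ps -A -o stat,ppid,pid,cmd | zombie_parents
```

The header row is skipped automatically because its first field (STAT) does not start with Z.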


# vmstat is short for Virtual Memory Statistics. It monitors the operating system's virtual memory, processes, and CPU activity, and reports on the system as a whole. Its limitation is that it cannot analyze an individual process in depth.
# Sample every 10 seconds, 10 times in total
[root@HikvisionOS ~]# vmstat 10 10
procs -------------memory------------ -swap ---io--- ---system--- -----cpu------
 r  b   swpd     free   buff    cache si so  bi   bo    in     cs us sy id wa st
35  0 103432 12347228 756072 55675072  0  0   1  125     0      0 16 12 72  0  0
45  0 103432 12357168 756072 55671712  0  0   0 4336 80632 234562 64 16 20  0  0
80  0 103432 12308676 756072 55708188  0  0   0 4288 80557 233907 64 16 19  0  0
76  0 103432 12307848 756072 55704420  0  0   0 4212 80367 233457 64 17 19  0  0
41  0 103432 12311544 756072 55701380  0  0 108 4694 81747 236129 63 16 21  0  0
30  0 103432 12267112 756072 55739932  0  0   0 4440 79537 232021 65 16 19  0  0
34  0 103432 12310188 756072 55695764  0  0   0 4038 80051 231772 64 17 19  0  0
38  0 103432 12344380 756072 55650796  0  0   0 3833 80565 233747 64 16 20  0  0
98  0 103432 12234400 756072 55688820  0  0   0 4235 81235 235143 64 16 20  0  0
44  0 103432 12307036 756072 55684912  0  0   0 4797 80051 230640 64 17 18  0  0

Field explanation:
procs
    r The number of processes in the run queue (running or waiting for CPU time). If it stays above the number of CPU cores for a long time (above 1 on a single-core machine), the CPU is insufficient and more CPU capacity is needed.
    b The number of processes blocked waiting for I/O.

-----cpu-----
    us The percentage of CPU time spent running user processes. A high value means user processes are consuming a lot of CPU; if it stays above 50% for a long time, consider optimizing the application.
    sy The percentage of CPU time spent in the kernel. The reference value for us + sy is 80%; if us + sy stays above 80%, the CPU may be insufficient.
    wa The percentage of CPU time spent waiting for I/O. The reference value is 30%; if wa exceeds 30%, I/O wait is serious. This may be caused by heavy random disk access or by a bandwidth bottleneck in the disk or disk controller (mainly block operations).
    id The percentage of time the CPU is idle.

-system--
    in The number of interrupts per second during the sampling interval, including clock interrupts.
    cs The number of context switches per second. If cs is much higher than the disk I/O and network packet rates, investigate further.

----memory-----
    swpd The amount of memory swapped out to the swap area (in KB). Even if swpd is nonzero or fairly large (for example, over 100 MB), system performance is still normal as long as si and so stay at 0 for a long time.
    free The amount of memory on the free page list (in KB).
    buff The amount of memory used as buffer cache; buffering is generally needed for reads and writes to block devices.
    cache The amount of memory used as page cache, generally for caching file system data. A large cache means many files are cached; if bi in the io section is small at the same time, the file system is working efficiently.

---swap--
    si The amount of memory read from the swap area into memory per second.
    so The amount of memory written from memory to the swap area per second.

---io---
    bi The number of blocks read from block devices per second (disk reads, KB per second).
    bo The number of blocks written to block devices per second (disk writes, KB per second).
# Here the reference value for bi + bo is 1000. If it exceeds 1000 and wa is also high, consider balancing the disk load; analyze this together with the iostat output.

Summary:
If r is often greater than 4 and id is often less than 40, the CPU is heavily loaded.
If si and so are nonzero for a long time, memory is insufficient.
If disk activity (bi, bo) is often nonzero and the queue in b is greater than 3, I/O performance is poor.
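
These rules of thumb can be applied to captured vmstat output with a short script. The following is a minimal sketch, not a definitive check: the function name `vmstat_summary` is hypothetical, the column positions assume the 17-column layout shown in the sample output, and swap traffic (si + so) is used as the memory-pressure indicator, which is an assumption of this sketch.

```shell
# Hypothetical helper: read `vmstat <interval> <count>` output on stdin and
# apply the rule-of-thumb thresholds from the summary above.
# Column positions assume the 17-column layout shown in the sample output.
vmstat_summary() {
    awk '
        $1 ~ /^[0-9]+$/ {               # data rows start with a number
            seen++
            if (seen == 1) next         # first row reports averages since boot
            r += $1; b += $2; swap += $7 + $8; idle += $15; rows++
        }
        END {
            if (rows == 0) { print "no samples"; exit 1 }
            cpu = (r / rows > 4 && idle / rows < 40) ? "heavy" : "ok"
            mem = (swap > 0) ? "swapping" : "ok"
            io  = (b / rows > 3) ? "blocked" : "ok"
            printf "cpu=%s memory=%s io=%s\n", cpu, mem, io
        }'
}
```

A typical call would be `vmstat 10 10 | vmstat_summary`. For the sample output above (r between 30 and 98, id around 20, si, so, and b all 0) it reports cpu=heavy memory=ok io=ok.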

Viewing disk load with iostat
With no arguments, iostat shows the load of all devices:
[root@HikvisionOS tmp]# iostat 
Linux 3.10.0-957.5.1.el7.x86_64 (HikvisionOS) 07/30/2020 _x86_64_ (24 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          16.53    0.00   11.64    0.15    0.00   71.68

Device:            tps    kB_read/s    kB_wrtn/s    kB_read    kB_wrtn
vda               1.75         0.73        15.86    3416632   73884217
vdb               0.12         5.98         0.00   27834078         16
vdc               0.11         8.47        14.04   39448987   65401980
vdd              35.14         2.24      2845.00   10424269 13250311520
dm-0              1.72         0.38        15.85    1758836   73824399
dm-1              0.00         0.05         0.00     210741       3525

Description of the avg-cpu values:
%user: the percentage of time the CPU spent in user mode.
%nice: the percentage of time the CPU spent in user mode running processes with a nice priority.
%system: the percentage of time the CPU spent in system (kernel) mode.
%iowait: the percentage of time the CPU was idle while waiting for outstanding I/O to complete.
%steal: the percentage of time the virtual CPU waited involuntarily while the hypervisor was servicing another virtual processor.
%idle: the percentage of time the CPU was idle.

Note:
If %iowait is too high, the disk has an I/O bottleneck.
If %idle is high, the CPU is mostly idle.
If %idle is high but the system responds slowly, the CPU may be waiting on memory; consider adding memory.
If %idle stays below 10, the CPU's processing capacity is too low, and CPU is the resource the system most needs.

Taking multiple samples for comparison:
# [-d displays disk usage, -x displays extended statistics, -k displays read and write figures in kB]
[root@HikvisionOS tmp]# iostat -x 1 10 # Sample once per second, 10 times
Linux 3.10.0-957.5.1.el7.x86_64 (HikvisionOS) 07/30/2020 _x86_64_ (24 CPU)

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
          16.52    0.00   11.64    0.15    0.00   71.69

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
vda               0.00     0.22    0.08    1.67     0.73    15.86    19.02     0.00    3.21    0.67    3.33   0.42   0.07
vdb               0.00     0.00    0.12    0.00     5.98     0.00    98.58     0.00    0.15    0.15    0.02   0.12   0.00
vdc               0.00     1.13    0.05    0.06     8.47    14.04   406.96     0.00   19.06    0.42   37.14   0.23   0.00
vdd               0.00   644.72    0.03   35.12     2.24  2844.91   162.03     0.52   17.51   27.11   17.51   0.63   2.23
dm-0              0.00     0.00    0.01    1.71     0.38    15.85    18.83     0.01    7.25    1.16    7.30   0.45   0.08
dm-1              0.00     0.00    0.00    0.00     0.05     0.00    70.41     0.00    1.44    1.02    4.94   0.91   0.00

......

Field explanation:
   rrqm/s: The number of read requests merged per second. That is, delta(rmerge)/s
   wrqm/s: The number of write requests merged per second. That is, delta(wmerge)/s
   r/s: The number of read requests completed per second. That is, delta(rio)/s
   w/s: The number of write requests completed per second. That is, delta(wio)/s
   rsec/s: The number of sectors read per second. That is, delta(rsect)/s
   wsec/s: The number of sectors written per second. That is, delta(wsect)/s
   rkB/s: The number of kilobytes read per second. It is half of rsec/s, because each sector is 512 bytes. (Calculated value)
   wkB/s: The number of kilobytes written per second. It is half of wsec/s. (Calculated value)
   avgrq-sz: The average size (in sectors) of each device I/O request. That is, delta(rsect+wsect)/delta(rio+wio)
   avgqu-sz: The average I/O queue length. That is, delta(aveq)/s/1000 (because aveq is in milliseconds)
   await: The average time (in milliseconds) each device I/O request spends waiting, including queue time and service time. That is, delta(ruse+wuse)/delta(rio+wio)
   r_await, w_await: The same average as await, computed separately for read requests and write requests.
   svctm: The average service time (in milliseconds) of each device I/O request. That is, delta(use)/delta(rio+wio)
   %util: The percentage of each second spent on I/O, that is, the fraction of time the I/O queue is not empty. That is, delta(use)/s/1000 (because use is in milliseconds)
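
The relationship between the throughput columns and avgrq-sz can be checked with a little arithmetic. This is a minimal sketch (the function name `avg_request_kb` is hypothetical) that divides total KB per second by total requests per second, using the vdd row from the sample output as a worked example.

```shell
# Hypothetical helper: average request size in KB from iostat -x columns.
# Arguments: rkB/s wkB/s r/s w/s
avg_request_kb() {
    awk -v rkb="$1" -v wkb="$2" -v r="$3" -v w="$4" 'BEGIN {
        if (r + w == 0) { print "0.00"; exit }
        printf "%.2f\n", (rkb + wkb) / (r + w)
    }'
}

# vdd row from the sample: 2.24 kB/s read, 2844.91 kB/s written,
# 0.03 reads/s, 35.12 writes/s
avg_request_kb 2.24 2844.91 0.03 35.12
```

The result is about 81 KB per request, which is roughly 162 sectors of 512 bytes, in line with the avgrq-sz of 162.03 that iostat reports for vdd.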
  
If %util is close to 100%, too many I/O requests are being generated, the I/O system is fully loaded, and the disk may be a bottleneck.
If %idle is below 70%, I/O pressure is fairly high and reads generally spend more time waiting.
# At the same time, check vmstat's b column (the number of processes waiting for resources) and wa column (the percentage of CPU time spent waiting for I/O; I/O pressure is high when it exceeds 30%).
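
The %util check can be automated over captured iostat output. The following is a minimal sketch, assuming %util is the last column of the device table as in the sample output; the function name `busy_disks` and the default limit of 80 are illustrative choices, not values from iostat itself.

```shell
# Hypothetical helper: read `iostat -x` output on stdin and print every
# device whose %util (assumed to be the last column) is at or above a limit.
# Usage: iostat -x 1 10 | busy_disks 80
busy_disks() {
    awk -v limit="${1:-80}" '
        $1 ~ /^Device/ { in_table = 1; next }   # device table header
        NF == 0        { in_table = 0; next }   # blank line ends the table
        in_table && $NF + 0 >= limit + 0 { print $1, $NF }
    '
}
```

With the sample output above and a limit of 2, only vdd (2.23) is reported; with the default limit nothing is printed, since no device is close to saturation.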

Load is, simply put, the length of the process queue.
The system load average is defined as the average number of processes in the run queue (processes running on a CPU or waiting to run) over a given time interval. A process is in the run queue if it meets all of the following conditions:

It is not waiting for the result of an I/O operation.
It has not voluntarily entered a wait state (that is, it has not called 'wait').
It has not been stopped (for example, while waiting for termination).

Note that on Linux the load average also counts processes in uninterruptible sleep (state D, typically waiting on disk I/O), which is why heavy I/O can raise the load even when the CPU is idle.

A runnable process sits in a run queue and competes with other runnable processes for CPU time. The system load is the total number of processes that are running plus those ready to run. For example, if the system has 2 running processes and 3 runnable processes, the load is 5. The load average is the average of that load over a given period of time.

What determines the size of the CPU load

The metric for CPU load is the load value, which measures how much work the system is being asked to do. Simply put, it is the length of the process queue. When requests arrive faster than the system can process them, they have to wait, and the load rises.


How to judge whether the system is overloaded?
For a typical system, judge by the number of CPUs. If the load average stays below 1.2 on a machine with 2 CPUs, there is basically no CPU shortage. In other words, the load average should be less than the number of CPUs.
Load and capacity planning
       Capacity planning is generally based on the 15-minute load average.

Load misconceptions:
Misconception: a high system load must mean a performance problem.
    Truth: the high load may simply be due to CPU-intensive computation.
Misconception: a high system load must mean the CPUs are too weak or too few.
    Truth: a high load only means that too many tasks have piled up in the run queue. The tasks in the queue may be consuming CPU, or they may be waiting on I/O or other factors.
Misconception: when the load has been high for a long time, add CPUs first.
    Truth: load is a symptom, not the cause. In some cases adding CPUs makes the load drop temporarily, but it treats the symptom without fixing the root cause.

How to identify the system bottleneck when the load average is high:
   Is it insufficient CPU, insufficient I/O, or insufficient memory?


How to judge what a reasonable load average is for your performance requirements?

Generally speaking, the load average is related to the number of cores. Take a single-core machine as an example: load = 0.5 means half of the CPU's capacity is still free to handle other requests; load = 1 means the CPU is fully occupied with no capacity to spare; load = 2 means the CPU is overloaded, with twice as many runnable threads as it can process at once. So for a single-core machine the load average should ideally stay below 1, and for a dual-core processor below 2. The conclusion: on a multi-core machine, the load average should not exceed the total number of processor cores.
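
The per-core rule can be expressed as a small calculation. This is a minimal sketch, assuming the simple "load divided by core count should not exceed 1" rule described above; the function name `load_per_core` is hypothetical.

```shell
# Hypothetical helper: divide a load average by the number of cores and
# classify it with the "load should not exceed the core count" rule.
# Usage: load_per_core LOAD CORES
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN {
        if (cores <= 0) { print "invalid core count"; exit 1 }
        ratio = load / cores
        printf "%.2f %s\n", ratio, (ratio > 1 ? "overloaded" : "ok")
    }'
}

# On a live Linux system the inputs could come from /proc/loadavg and nproc:
#   load_per_core "$(cut -d" " -f1 /proc/loadavg)" "$(nproc)"
```

For the machine in the sample output (1-minute load 72.35, and 24 CPUs according to the iostat header), `load_per_core 72.35 24` prints 3.01 overloaded.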

How to convert a load value between machines with different core counts?
You may run into this in performance testing: the production machine has 8 cores, but the offline test machine has only 4. If the load measured on the 4-core machine is 4, what does that correspond to on the 8-core machine?
As a rough proportional estimate, assuming the same workload spread over twice as many cores: 4 * 4 / 8 = 2

Viewing the top five processes by memory usage

Command: ps auxw | head -1; ps auxw | sort -rn -k4 | head -5
Memory is reported in KB; VSZ is the virtual memory used, and RSS is the resident (physical) memory used.
       Command breakdown:
       ps auxw lists every process with its resource usage;
       head -1 displays the first line, which is the header row;
       sort -r sorts in reverse order, -n sorts numerically, and -k4 sorts by the fourth column (%MEM).

Viewing the top three processes by CPU usage

Command: ps auxw | head -1; ps auxw | sort -rn -k3 | head -3
Here "-k3" selects the third column (%CPU) as the sort key.
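
Both one-liners run ps twice, so the header and the data come from two separate snapshots. The following is a minimal sketch of a single-pass alternative; the function name `top_by_column` is hypothetical. It keeps the header row and sorts the remaining rows by a chosen column.

```shell
# Hypothetical helper: read `ps aux`-style output on stdin, keep the header
# row, and print the COUNT rows with the largest value in column COL.
# Usage: ps auxw | top_by_column 4 5    # top five by %MEM
top_by_column() {
    col="$1"
    count="$2"
    IFS= read -r header && printf '%s\n' "$header"
    sort -rn -k"$col","$col" | head -n "$count"
}
```

With this helper, `ps auxw | top_by_column 3 3` would give the top three processes by %CPU. Restricting the sort key to a single column with `-kN,N` keeps the rest of the line from affecting the order.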

Origin blog.csdn.net/Doudou_Mylove/article/details/107757389