[Linux] Still using the top command? Try atop: system information at a glance, and a new option for operations engineers

Using atop

Because of its stability, Linux is increasingly used as the operating system for servers (of course, some people will insist that Linux is only the kernel of the operating system). But does running on Linux guarantee that our services stay up 24/7? No. Business functions are implemented by programs running on the system, so choosing Linux is only the first step toward stable business functions. Most of our work goes into keeping the business programs from becoming the weak point in that stability.

When something goes wrong on a server, the visible symptom is that the business function is no longer provided normally. Seen from the program side, the underlying cause may be a fault in the business program (a bug in the program itself) or human error on the server (a script or command run improperly). Seen from the system resource side, it may be CPU contention, a memory leak, abnormal disk I/O, a network problem, and so on. With so many possible causes, how should we go about analyzing a problem once it has occurred? Are there tools that help locate it?

Introduction to atop

atop, the subject of this article, is a tool for monitoring Linux system resources and processes. It records the state of the system at a fixed interval. The collected data covers the usage of system resources (CPU, memory, disk and network) and the running state of processes, and it can be saved to disk as a log file. After a problem occurs on the server, we can take the corresponding atop log file and analyze it. atop is open source software; its source code and RPM packages are available from the project's download page.

How to use atop

After installing atop, we can type the "atop" command on the command line to see the current running status of the system:

atop default view

The meaning of the system resource monitoring field

The figure above lists many fields and values. What does each field mean, and how should we read it? All of the values are relative to the sampling period. Let's first look at the upper part of the figure.

ATOP line: shows the host name, the date and time of the sample, and the sampling interval
PRC line: shows the overall state of the processes
  1. sys: total CPU time consumed by all processes in kernel mode during the past 10 s
  2. usr: total CPU time consumed by all processes in user mode during the past 10 s
  3. #proc: the total number of processes
  4. #zombie: the number of zombie processes
  5. #exit: the number of processes that exited during the 10 s sampling period
CPU line: shows the overall state of the server's CPUs: the share of time spent in kernel mode and user mode, the share spent handling interrupts, and the share spent idle. The values are relative to 100% × the number of CPU cores, and some of the unused CPU time can be caused by waiting on a slow disk
  1. sys, usr: the share of CPU time spent running processes in kernel mode and in user mode
  2. irq: the share of CPU time spent handling interrupts
  3. idle: the share of time the CPU was completely idle (no process was waiting for disk I/O)
  4. wait: the share of time the CPU was idle while processes were waiting for disk I/O

The values on the CPU line add up to N × 100%, where N is the number of CPU cores.
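
Because the CPU line is scaled to N × 100%, a reading has to be divided by the number of cores before it can be compared across machines. The following is a minimal sketch of that arithmetic; the function name cpu_busy_pct and the sample values are made up for illustration and are not part of atop.

```shell
#!/bin/sh
# cpu_busy_pct SYS USR IRQ NCPU
# SYS, USR and IRQ are the percentages read from the CPU line (each can
# exceed 100 on a multi-core host); NCPU is the number of cores.
# Prints the busy share of the whole machine on a 0-100 scale.
cpu_busy_pct() {
    awk -v sys="$1" -v usr="$2" -v irq="$3" -v n="$4" \
        'BEGIN { printf "%.1f\n", (sys + usr + irq) / (n * 100) * 100 }'
}

# Hypothetical reading from a 4-core host: sys 12%, usr 30%, irq 2%.
cpu_busy_pct 12 30 2 4    # prints 11.0
```

In this example 44% out of a possible 400% is in use, so the machine as a whole is about 11% busy.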

cpu line: the state of each individual core, with the same fields as the CPU line; the per-core values in each column add up to the value on the CPU line.
CPL line: also reflects the overall load of the server. It shows the load averages over the past 1, 5 and 15 minutes.
  1. avg1, avg5, avg15: the average number of processes in the run queue over the past 1, 5 and 15 minutes
  2. csw: number of context switches
  3. intr: number of interrupts
  4. numcpu: number of CPU cores
MEM line: shows memory usage
  1. tot: total amount of physical memory
  2. free: amount of free memory. This field alone does not tell you whether memory is short: cache and buffer memory can be reclaimed at any time, so also look at the "-/+ buffers/cache" free value in free -m, and check whether swap is being used
  3. cache: memory used for the page cache
  4. dirty: amount of dirty pages in memory
  5. buff: memory used for file system metadata (buffers)
  6. slab: memory allocated by the kernel (slab)
SWP line: shows swap space usage
  1. tot: total amount of swap space
  2. free: amount of swap space still free
PAG line: shows virtual memory paging activity
  1. swin: number of memory pages swapped in
  2. swout: number of memory pages swapped out
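
The advice for the MEM, SWP and PAG lines (do not judge by free alone, add cache and buffers, and watch for swapping) can be written down as a small check. This is only a rough sketch: the function name memory_verdict is hypothetical, the inputs are values you read off the screen yourself, and not all cache is necessarily reclaimable, so the sum is an optimistic estimate.

```shell
#!/bin/sh
# memory_verdict FREE_MB CACHE_MB BUFF_MB SWOUT_PAGES
# FREE_MB, CACHE_MB and BUFF_MB come from the MEM line, SWOUT_PAGES from
# the swout field of the PAG line.
memory_verdict() {
    avail=$(( $1 + $2 + $3 ))
    if [ "$4" -gt 0 ]; then
        echo "swapping out ($4 pages): memory is tight"
    else
        echo "no swap-out, roughly ${avail} MB free or reclaimable"
    fi
}

# Hypothetical reading: little free memory, but a large page cache.
memory_verdict 200 3000 150 0
```

With only 200 MB free this host looks short of memory at first glance, yet with no pages swapped out and a large cache it is not under pressure.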

LVM/DSK line: shows disk usage. Each disk device gets its own line; if there is an sdb device, a second DSK line is added
  1. sda: disk device identifier
  2. busy: the share of time the disk was busy
  3. read, KiB/r, MBr/s: the number of read requests, the average KiB per read, and the read throughput in MB per second
  4. write, KiB/w, MBw/s: the number of write requests, the average KiB per write, and the write throughput in MB per second
  5. avq: average disk queue length (in practice this column seems to be the average number of outstanding disk requests, avgrq)
  6. avio: the average time per disk I/O request
NET line: several NET lines show the traffic at the transport layer (TCP/UDP), the network layer (IP), and each network interface.
  1. transport: input and output at the transport layer (TCP/UDP). For example, data exchanged between processes on the same server shows up at the transport layer, because it does not need to go out over the network.
  2. network: input and output at the network layer (IP)
  3. eth0: input and output on the default network interface, that is, data transferred through the IP address of eth0
  4. sp: the speed of the network card (1000M)
  5. pcki: number of packets received
  6. pcko: number of packets sent
  7. si: amount of data received per second
  8. so: amount of data sent per second
  9. coll (collisions): number of collisions
  10. mlti (multicast): number of multicast packets
  11. erri/erro: number of errors on input and output
  12. drpi/drpo: number of packets dropped on input and output
  13. lo: the traffic through the 127.0.0.1 loopback interface; the fields are the same as for eth0 above
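
The si, so and sp fields can be combined into a utilization figure for an interface. The sketch below assumes si/so are read in Kbps and sp in Mbps; check the units shown on your own screen before relying on it. The function name net_util_pct and the sample values are invented for illustration.

```shell
#!/bin/sh
# net_util_pct RATE_KBPS SPEED_MBPS
# RATE_KBPS is the si or so value, SPEED_MBPS the sp value of the same
# interface. Prints the share of the link speed in use, in percent.
net_util_pct() {
    awk -v rate="$1" -v speed="$2" \
        'BEGIN { printf "%.1f\n", rate / (speed * 1000) * 100 }'
}

# Hypothetical reading: 250000 Kbps incoming on a 1000 Mbps card.
net_util_pct 250000 1000    # prints 25.0
```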

Process section: the process section shows the data of each process over the past 10 s

m mode: memory view
  1. SYSCPU: CPU time the process consumed in kernel mode in the past 10 s
  2. USRCPU: CPU time the process consumed in user mode in the past 10 s
  3. VSIZE: the virtual memory size of the process
  4. RSIZE: the resident (physical) memory size of the process
  5. PSIZE: the proportional set size (PSS) of the process
  6. VGROW: how much the virtual memory of the process grew in the past 10 s
  7. RGROW: how much the resident memory of the process grew in the past 10 s
  8. SWAPSZ: the amount of swap space used by the process
  9. MEM: the percentage of physical memory occupied by the process

Memory view (Memory consumption)

The memory view shows the memory usage of the process, press the m key to enter the memory view.

[Figure: atop memory view]

The lower part of the figure above shows, for each process, the virtual memory size (VSIZE), the resident memory size (RSIZE), and how much the virtual and physical memory grew during the previous sampling period (VGROW, RGROW). The MEM column shows the share of physical memory the process occupies.

From the PAG line in the figure above we can tell that the memory load is high at this moment and that pages are being swapped. The VGROW and RGROW columns of the process view show that the memory used by the VirtualBox process has grown significantly, while the memory used by some other processes has shrunk (negative VGROW or RGROW values) to make room for it.

d mode: disk view
  1. WRDSK: the amount of data the process wrote to disk in the past 10 s
  2. DSK: the process's share of the total disk activity in the past 10 s
  3. CMD: process name
p mode: program view; processes with the same name are grouped and shown on one line
  1. NPROCS: number of processes with the same name

The other fields have already been described above.

v mode: various process information (PPID, user/group, date/time)
u mode: user view

Processes are grouped by user.

g mode: generic view (the default)

Default view (Generic information)

When atop starts, the process information is shown in the default view (the lower part of the picture above). Press the g key to return to the default view from any other view.

[Figure: atop default view]

From the figure above we can see that the find process with PID 3061 used 3.43 seconds of CPU time in kernel mode and 0.96 seconds in user mode before it exited, 4.39 seconds of CPU time in total. Relative to the 10-minute sampling period, that is a CPU share of about 1%. The ST column shows the process status: N means the process was newly created during the previous sampling period, and E means the process has exited. The EXC column shows the exit code of the process. The "<>" around the process name also tells us that the process has exited.
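
The 1% figure follows from dividing the CPU time by the length of the sampling period (10 minutes = 600 seconds). A minimal sketch of the calculation, using a hypothetical helper named cpu_share_pct:

```shell
#!/bin/sh
# cpu_share_pct CPU_SECONDS INTERVAL_SECONDS
# Prints the CPU time as a percentage of the sampling interval.
cpu_share_pct() {
    awk -v cpu="$1" -v interval="$2" \
        'BEGIN { printf "%.2f\n", cpu / interval * 100 }'
}

# 3.43 s in kernel mode plus 0.96 s in user mode, over 600 s.
total=$(awk 'BEGIN { print 3.43 + 0.96 }')
cpu_share_pct "$total" 600    # prints 0.73
```

The result, 0.73%, is what atop rounds to the 1% shown in the view.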

S: the current state of the process, for example S (sleeping) or R (running)
c mode: command line view

Command view (Command line)

By pressing the c key we can enter the command view, which shows the commands corresponding to each process.

[Figure: atop command view]

Sometimes a careless colleague runs a script or command that drives system resource usage extremely high. In that case the command view of atop makes it easy to find the command that caused the problem.

atop related files

/etc/atop: the directory holding atop's configuration files
/etc/rc.d/init.d/atop: atop's startup (init) script
/etc/cron.d/atop: atop's scheduled task file; by default it runs at 0:00 every day
/var/log/atop: atop's log directory. By default a new log file for the day is created at 0:00 every day. A log can be viewed with atop -r file. There seems to be no automatic playback: inside the viewer you can only press b and enter a time to jump to it, although a loop in a script can step through a log
/usr/bin/atop: the atop executable

atop -r atop_20160510 -b 13:00 -e 17:00
The log files generated by atop are recorded with a sampling period of 10 minutes, which can be changed by editing the /etc/atop/atop.daily file.
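
The loop mentioned above for stepping through a log can be sketched as follows. It splits a time range into consecutive windows and runs atop -r with -b and -e for each of them, printing the text output one window after another. The function names time_windows and replay are hypothetical, and the sketch assumes that atop prints plain text when its output is not a terminal.

```shell
#!/bin/sh
# time_windows START END STEP_MIN
# Prints one "HH:MM HH:MM" pair per line, covering START..END in steps
# of STEP_MIN minutes (the last window is shortened if needed).
time_windows() {
    awk -v s="$1" -v e="$2" -v step="$3" 'BEGIN {
        split(s, a, ":"); split(e, b, ":")
        from = a[1] * 60 + a[2]
        stop = b[1] * 60 + b[2]
        while (from < stop) {
            to = from + step
            if (to > stop) to = stop
            printf "%02d:%02d %02d:%02d\n",
                   int(from / 60), from % 60, int(to / 60), to % 60
            from = to
        }
    }'
}

# replay LOGFILE START END STEP_MIN
# Runs atop -r once per window; needs atop to be installed.
replay() {
    command -v atop >/dev/null 2>&1 || { echo "atop not found" >&2; return 1; }
    time_windows "$2" "$3" "$4" | while read -r begin stop; do
        echo "=== $begin - $stop ==="
        atop -r "$1" -b "$begin" -e "$stop" </dev/null
    done
}

time_windows 13:00 13:30 10
```

For the log used above, replay atop_20160510 13:00 17:00 10 would walk through the afternoon in ten-minute windows.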

The samples taken at each point in time together make up an atop log file, which we can view with the "atop -r XXX" command. So how should atop log files be stored?

For the storage method of atop log files, we can do this:

  1. Save an atop log file every day, which records the information of the day
  2. Log files are named in the form of "atop_YYYYMMDD"
  3. Set the log expiration date and automatically delete the log files from a period of time ago
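
With the "atop_YYYYMMDD" naming scheme, the log file for any day can be derived from its date. A small sketch, assuming GNU date (for the -d option) and the /var/log/atop directory mentioned earlier; the function name atop_log_name is made up for illustration:

```shell
#!/bin/sh
# atop_log_name DATE
# DATE is anything GNU date -d understands, e.g. 2016-05-10 or "yesterday".
atop_log_name() {
    date -d "$1" +atop_%Y%m%d
}

atop_log_name 2016-05-10                          # prints atop_20160510
echo "/var/log/atop/$(atop_log_name yesterday)"   # path of yesterday's log
```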

In fact, atop developers have provided the above log saving methods, and the corresponding atop.daily script can be found in the source code directory. In the atop.daily script, we can change the sampling period of atop information by modifying the INTERVAL variable (the default is 10 minutes); change the number of days the log is saved by modifying the value in the following command (the default is 28 days):

(sleep 3; find $LOGPATH -name 'atop_*' -mtime +28 -exec rm {} \; )&
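
Before changing the retention period it can be useful to see which files the find expression would delete. The sketch below uses the same -name and -mtime tests but prints the matches instead of removing them. The function name list_expired_logs is hypothetical, and the demonstration relies on GNU touch -d to create an artificially old file in a temporary directory.

```shell
#!/bin/sh
# list_expired_logs DIR DAYS
# Prints the atop_* files in DIR last modified more than DAYS days ago,
# without deleting anything.
list_expired_logs() {
    find "$1" -name 'atop_*' -mtime +"$2" -print
}

# Demonstration in a throwaway directory.
demo=$(mktemp -d)
touch -d '40 days ago' "$demo/atop_20160401"   # old enough to expire
touch "$demo/atop_20160510"                    # fresh, should be kept
list_expired_logs "$demo" 28                   # lists only atop_20160401
rm -rf "$demo"
```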

Finally, we modify the cron file to execute the atop.daily script every morning:

0 0 * * * root /etc/cron.daily/atop.daily
Export the records of atop to text:

atop -r /var/log/atop/atop/atop_slot10_suse10sp2_20120622 -b 04:00 -e 16:10 >> atop_log.txt

A view flag can be combined with -r in the same command, so that a specific view is redirected into a file, for example:

atop -v -r atop_linux_20160119 -b 01:00 -e 01:05 > me.log

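Using atop to see the CPU idle rate:

atop -PCPU -r atop_linux_20160119 | grep -v SEP | grep -v RESET | awk '{print $4, $5, $12 / ($6 * $7 * $8) * 100 "%"}'

In the parseable CPU line, $4 and $5 are the date and time, $6 is the interval in seconds, $7 the clock ticks per second, $8 the number of CPUs and $12 the idle ticks; check the field order against the man page of your atop version.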
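
The idle rate calculation can be tried without a log file by feeding it a hand-written line. The sample below is invented, not real atop output, and it assumes the parseable CPU layout described in the atop man page: label, host, epoch, date, time, interval, ticks per second, number of CPUs, then sys, user, nice, idle, wait, irq and softirq ticks. The function name idle_rate is hypothetical.

```shell
#!/bin/sh
# idle_rate
# Reads "atop -PCPU" style lines on stdin and prints date, time and the
# idle percentage: idle ticks / (interval * ticks per second * CPUs).
idle_rate() {
    grep -v -e SEP -e RESET |
        awk '$1 == "CPU" {
            printf "%s %s %.1f%%\n", $4, $5, $12 / ($6 * $7 * $8) * 100
        }'
}

# Invented sample: 600 s interval, 100 ticks/s, 4 CPUs, 210000 idle ticks.
printf '%s\n' \
    'RESET' \
    'CPU myhost 1453165200 2016/01/19 01:00:00 600 100 4 6000 18000 0 210000 4800 600 600 0 0' \
    'SEP' | idle_rate
```

The sample line has 240000 ticks available in total (600 × 100 × 4), of which 210000 are idle, so the output is 2016/01/19 01:00:00 87.5%.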

Other parameters of atop:

Usage: atop [-flags] [interval [samples]]
or
Usage: atop -w file [-S] [-a] [interval [samples]]
atop -r [file] [-b hh:mm] [-e hh:mm] [-flags]

generic flags:
  -a  show or log all processes (i.s.o. active processes only)
  -R  calculate proportional set size (PSS) per process
  -P  generate parseable output for specified label(s)
  -L  alternate line length (default 80) in case of non-screen output
  -f  show fixed number of lines with system statistics
  -F  suppress sorting of system resources
  -G  suppress exited processes in output
  -l  show limited number of lines for certain resources
  -y  show individual threads
  -1  show average-per-second i.s.o. total values

  -x  no colors in case of high occupation
  -g  show general process-info (default)
  -m  show memory-related process-info
  -d  show disk-related process-info
  -n  show network-related process-info
  -s  show scheduling-related process-info
  -v  show various process-info (ppid, user/group, date/time)
  -c  show command line per process
  -o  show own defined process-info
  -u  show cumulated process-info per user
  -p  show cumulated process-info per program (i.e. same name)

  -C  sort processes in order of cpu-consumption (default)
  -M  sort processes in order of memory-consumption
  -D  sort processes in order of disk-activity
  -N  sort processes in order of network-activity
  -A  sort processes in order of most active resource (auto mode)

specific flags for raw logfiles:
  -w  write raw data to   file (compressed)
  -r  read  raw data from file (compressed)
      special file: y[y...] for yesterday (repeated)
  -S  finish atop automatically before midnight (i.s.o. #samples)
  -b  begin showing data from specified time
  -e  finish showing data after specified time



Origin blog.csdn.net/imliuqun123/article/details/131046878