Linux system performance monitoring collection items

1. Basic collection items for Linux operation and maintenance

In operations work, problems themselves are not what you should fear. What you should fear is failing to capture the scene when they occur and being left in the dark. A strong monitoring system that collects as many indicators as possible is therefore of great value. But which indicators are meaningful? Knowledge comes from practice, so the experience that engineers have accumulated over years of operations work is the most valuable guide.

In the long-term work practice of operation and maintenance engineers, we have summarized some indicators that are often referenced in the process of system operation and maintenance, mainly including the following categories:

  • CPU
  • Load
  • Memory
  • Disk
  • IO
  • Network related
  • Kernel parameters
  • ss statistics output
  • Port collection
  • Process survival information collection of core services
  • Resource consumption of critical business processes
  • NTP offset collection
  • DNS resolution collection

For each category, the specific detailed indicators are as follows. These indicators are directly supported by the agent component of open-falcon. falcon-agent will collect relevant indicators at regular intervals (currently 60 seconds) and report them to the server.

2. CPU-related collection items

Calculation method: the values are derived from /proc/stat. They correspond to the statistics reported by the sar command, whose documentation explains each one.

  • cpu.idle: Percentage of time that the CPU or CPUs were idle and the system did not have an outstanding disk I/O request.
  • cpu.busy: The complement of cpu.idle; its value is 100 minus cpu.idle.
  • cpu.guest: Percentage of time spent by the CPU or CPUs to run a virtual processor.
  • cpu.iowait: Percentage of time that the CPU or CPUs were idle during which the system had an outstanding disk I/O request.
  • cpu.irq: Percentage of time spent by the CPU or CPUs to service hardware interrupts.
  • cpu.softirq: Percentage of time spent by the CPU or CPUs to service software interrupts.
  • cpu.nice: Percentage of CPU utilization that occurred while executing at the user level with nice priority.
  • cpu.steal: Percentage of time spent in involuntary wait by the virtual CPU or CPUs while the hypervisor was servicing another virtual processor.
  • cpu.system: Percentage of CPU utilization that occurred while executing at the system level (kernel).
  • cpu.user: Percentage of CPU utilization that occurred while executing at the user level (application).
  • cpu.cnt: The number of CPU cores.
  • cpu.switches: Number of CPU context switches, counter type.
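
As a rough illustration of how percentages like these can be derived from /proc/stat, the sketch below parses the aggregate cpu line at two points in time and divides each field's increase by the total increase. It is not falcon-agent's code: the function names are hypothetical, and the choice to leave guest out of the total (because the kernel already counts guest time inside user time) is an assumption of this sketch.

```python
CPU_FIELDS = ["user", "nice", "system", "idle", "iowait",
              "irq", "softirq", "steal", "guest"]


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into jiffies per field."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    values = [int(v) for v in parts[1:1 + len(CPU_FIELDS)]]
    # Older kernels expose fewer columns; treat the missing ones as 0.
    values += [0] * (len(CPU_FIELDS) - len(values))
    return dict(zip(CPU_FIELDS, values))


def cpu_percentages(prev, curr):
    """Turn two samples into cpu.* percentages over the interval."""
    delta = {name: curr[name] - prev[name] for name in CPU_FIELDS}
    # Assumption: guest time is already included in user time,
    # so it is left out of the total to avoid counting it twice.
    total = sum(v for name, v in delta.items() if name != "guest")
    if total <= 0:
        return {}
    metrics = {"cpu." + name: 100.0 * v / total for name, v in delta.items()}
    metrics["cpu.busy"] = 100.0 - metrics["cpu.idle"]
    return metrics
```

In practice the two samples would come from reading the first line of /proc/stat twice, one collection interval apart.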

3. Disk related collection items

Calculation method: first read /proc/mounts to get all mount points, and then get the usage of blocks and inodes through syscall.Statfs_t. Each metric will be appended with a set of tag descriptions, similar to mount=$mount, fstype=$fstype, where $mount is the mount point, such as /home, and $fstype is the file system, such as ext4.

  • df.bytes.free: free disk space, int64
  • df.bytes.free.percent: free disk space as a percentage of the total, float64, such as 32.1
  • df.bytes.total: total disk size, int64
  • df.bytes.used: used disk space, int64
  • df.bytes.used.percent: used disk space as a percentage of the total, float64
  • df.inodes.total: total number of inodes, int64
  • df.inodes.free: number of free inodes, int64
  • df.inodes.free.percent: percentage of free inodes, float64
  • df.inodes.used: number of used inodes, int64
  • df.inodes.used.percent: percentage of used inodes, float64
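
The block and inode arithmetic behind these metrics can be sketched as follows. The agent itself uses Go's syscall.Statfs_t; this sketch substitutes Python's os.statvfs, which exposes the same kind of counters. The function names are hypothetical, and treating f_bfree (rather than f_bavail, which excludes blocks reserved for root) as the free count is an assumption of this sketch.

```python
import os


def df_metrics(total_blocks, free_blocks, block_size, total_inodes, free_inodes):
    """Compute df.* style values from raw block and inode counts."""
    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    bytes_total = total_blocks * block_size
    bytes_free = free_blocks * block_size
    bytes_used = bytes_total - bytes_free
    inodes_used = total_inodes - free_inodes
    return {
        "df.bytes.total": bytes_total,
        "df.bytes.free": bytes_free,
        "df.bytes.used": bytes_used,
        "df.bytes.free.percent": pct(bytes_free, bytes_total),
        "df.bytes.used.percent": pct(bytes_used, bytes_total),
        "df.inodes.total": total_inodes,
        "df.inodes.free": free_inodes,
        "df.inodes.used": inodes_used,
        "df.inodes.free.percent": pct(free_inodes, total_inodes),
        "df.inodes.used.percent": pct(inodes_used, total_inodes),
    }


def df_metrics_for(mount):
    """Query one mount point, such as /home, with statvfs (Unix only)."""
    st = os.statvfs(mount)
    # Assumption: f_bfree is used as the free count; f_bavail would
    # exclude the blocks reserved for the superuser.
    return df_metrics(st.f_blocks, st.f_bfree, st.f_frsize,
                      st.f_files, st.f_ffree)
```

Calling df_metrics_for once per mount point listed in /proc/mounts, and tagging each result with mount and fstype, would mirror the collection flow described above.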

4. megacli tool output

The megacli tool is used to read RAID-related information. Each metric carries a set of tags indicating the PD (physical disk) or VD (logical disk) it belongs to. The PD format is PD=Enclosure_ID:SLOT_ID. For example, PD=32:0 indicates the first physical disk, and VD=0 indicates the first logical disk.

  • sys.disk.lsiraid.pd.Media_Error_Count: This metric and the three that follow are currently collected for reference only. They do not necessarily mean that the disk is damaged, only that the probability of damage is increasing.
  • sys.disk.lsiraid.pd.Other_Error_Count
  • sys.disk.lsiraid.pd.Predictive_Failure_Count
  • sys.disk.lsiraid.pd.Drive_Temperature
  • sys.disk.lsiraid.pd.Firmware_state: If the value is not 0, there is a problem with this physical disk
  • sys.disk.lsiraid.vd.cache_policy: If the value is not 0, it means that the logical disk cache policy does not match the settings
  • sys.disk.lsiraid.vd.state: If the value is not 0, there is a problem with this logical disk

5. SMART tool output

The smartctl tool is used to read the SMART information of each disk. At present, all of these metrics are collected for reference only. They do not necessarily mean that the disk is damaged, only that the probability of damage is increasing. Each metric carries a set of tags identifying the device, such as device=/dev/sda.

  • sys.disk.smart.Reallocated_Sector_Ct
  • sys.disk.smart.Spin_Retry_Count
  • sys.disk.smart.Reallocated_Event_Count
  • sys.disk.smart.Current_Pending_Sector
  • sys.disk.smart.Offline_Uncorrectable
  • sys.disk.smart.Temperature_Celsius

6. Partition read and write monitoring

This check tests whether every mounted partition is readable and writable. Each metric carries a set of tags indicating the mount point, such as mount=/home.

  • sys.disk.rw: If the value is not 0, it indicates that there is a problem with reading and writing this partition

7. IO related collection items

Calculation method: Collect /proc/diskstats every second and calculate the difference, which are all counter types. Each metric will have a set of tag descriptions, such as device=$device, which are used to indicate specific devices, such as sda1 and sdb. Users can refer to the help documentation of iostat to understand the specific metric meaning.

  • disk.io.ios_in_progress: Number of actual I/O requests currently in flight.
  • disk.io.msec_read: Total number of ms spent by all reads.
  • disk.io.msec_total: Amount of time during which ios_in_progress >= 1.
  • disk.io.msec_weighted_total: Measure of recent I/O completion time and backlog.
  • disk.io.msec_write: Total number of ms spent by all writes.
  • disk.io.read_merged: Adjacent read requests merged into a single request.
  • disk.io.read_requests: Total number of reads completed successfully.
  • disk.io.read_sectors: Total number of sectors read successfully.
  • disk.io.write_merged: Adjacent write requests merged into a single request.
  • disk.io.write_requests: Total number of writes completed successfully.
  • disk.io.write_sectors: Total number of sectors written successfully.
  • disk.io.read_bytes: Number of bytes read.
  • disk.io.write_bytes: Number of bytes written.
  • disk.io.avgrq_sz: This and the following values are the ones shown by iostat -x 1.
  • disk.io.avgqu-sz
  • disk.io.await
  • disk.io.svctm
  • disk.io.util: A percentage, such as 56.43, which means 56.43%.
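
To make the "collect and calculate the difference" step concrete, here is a minimal sketch that parses one /proc/diskstats line and derives a few of the values above from two samples. The function names are hypothetical, and the await and util formulas follow the usual iostat definitions, which is an assumption about how the agent computes them.

```python
DISKSTAT_FIELDS = [
    "read_requests", "read_merged", "read_sectors", "msec_read",
    "write_requests", "write_merged", "write_sectors", "msec_write",
    "ios_in_progress", "msec_total", "msec_weighted_total",
]
SECTOR_BYTES = 512  # /proc/diskstats counts sectors in 512-byte units


def parse_diskstats_line(line):
    """Return (device, counters) for one line of /proc/diskstats."""
    parts = line.split()
    device = parts[2]  # columns 0 and 1 are the major and minor numbers
    values = [int(v) for v in parts[3:3 + len(DISKSTAT_FIELDS)]]
    return device, dict(zip(DISKSTAT_FIELDS, values))


def io_rates(prev, curr, interval_ms):
    """Derive per-interval values from two samples of the same device."""
    d = {k: curr[k] - prev[k] for k in DISKSTAT_FIELDS}
    ios = d["read_requests"] + d["write_requests"]
    return {
        "disk.io.read_bytes": d["read_sectors"] * SECTOR_BYTES,
        "disk.io.write_bytes": d["write_sectors"] * SECTOR_BYTES,
        # Average time per completed request, in ms.
        "disk.io.await": (d["msec_read"] + d["msec_write"]) / ios if ios else 0.0,
        # Share of the interval during which the device had I/O in flight.
        "disk.io.util": 100.0 * d["msec_total"] / interval_ms,
    }
```

Note that ios_in_progress is an instantaneous value, so its difference between samples is not used here.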

8. Collection items related to machine load

Calculation method: read /proc/loadavg. All three metrics are raw-value types:

  • load.1min
  • load.5min
  • load.15min
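
Reading these three values is straightforward, since they are the first three whitespace-separated fields of /proc/loadavg. A minimal sketch, with a hypothetical function name:

```python
def parse_loadavg(text):
    """Map the first three fields of /proc/loadavg to load.* metrics."""
    one, five, fifteen = text.split()[:3]
    return {
        "load.1min": float(one),
        "load.5min": float(five),
        "load.15min": float(fifteen),
    }
```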

9. Memory related collection items

Calculation method: read the content in /proc/meminfo, where mem.memfree is free+buffers+cached, mem.memused=mem.memtotal-mem.memfree. Users can refer to the output of the free command and the help documentation to understand the meaning of each metric.

  • mem.memtotal: total memory size
  • mem.memused: how much memory is used
  • mem.memused.percent: percentage of memory used
  • mem.memfree
  • mem.memfree.percent
  • mem.swaptotal: total size of swap
  • mem.swapused: how much swap is used
  • mem.swapused.percent: The percentage of swap used
  • mem.swapfree
  • mem.swapfree.percent
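
The formulas above (mem.memfree = free + buffers + cached, mem.memused = mem.memtotal - mem.memfree) can be sketched as below. The function names are hypothetical, and the values are kept in kB as /proc/meminfo reports them; whether the agent converts them to bytes is not stated here, so no conversion is applied.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into {key: integer value}."""
    info = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields:
            info[key.strip()] = int(fields[0])
    return info


def mem_metrics(info):
    """Compute mem.* values (in kB) using the formulas from the text."""
    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    total = info["MemTotal"]
    free = info["MemFree"] + info["Buffers"] + info["Cached"]
    used = total - free
    swap_total = info["SwapTotal"]
    swap_free = info["SwapFree"]
    swap_used = swap_total - swap_free
    return {
        "mem.memtotal": total,
        "mem.memfree": free,
        "mem.memused": used,
        "mem.memfree.percent": pct(free, total),
        "mem.memused.percent": pct(used, total),
        "mem.swaptotal": swap_total,
        "mem.swapfree": swap_free,
        "mem.swapused": swap_used,
        "mem.swapfree.percent": pct(swap_free, swap_total),
        "mem.swapused.percent": pct(swap_used, swap_total),
    }
```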

10. Network related collection items

Calculation method: read /proc/net/dev. Each metric has a set of tags attached, such as iface=$iface, identifying the specific interface, such as eth0. Metrics containing in measure inbound traffic, those containing out measure outbound traffic, and total is the sum of in and out. The supported metrics are as follows:

  • net.if.in.bytes
  • net.if.in.compressed
  • net.if.in.dropped
  • net.if.in.errors
  • net.if.in.fifo.errs
  • net.if.in.frame.errs
  • net.if.in.multicast
  • net.if.in.packets
  • net.if.out.bytes
  • net.if.out.carrier.errs
  • net.if.out.collisions
  • net.if.out.compressed
  • net.if.out.dropped
  • net.if.out.errors
  • net.if.out.fifo.errs
  • net.if.out.packets
  • net.if.total.bytes
  • net.if.total.dropped
  • net.if.total.errors
  • net.if.total.packets
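
The mapping from /proc/net/dev columns to these metric names can be sketched as follows. The function name is hypothetical; the column order (eight receive counters followed by eight transmit counters) is the one printed in the file's own header.

```python
IN_FIELDS = ["bytes", "packets", "errors", "dropped",
             "fifo.errs", "frame.errs", "compressed", "multicast"]
OUT_FIELDS = ["bytes", "packets", "errors", "dropped",
              "fifo.errs", "collisions", "carrier.errs", "compressed"]


def parse_net_dev(text):
    """Parse /proc/net/dev content into {iface: {metric: value}}."""
    result = {}
    for line in text.splitlines()[2:]:  # the first two lines are headers
        iface, sep, rest = line.partition(":")
        values = rest.split()
        if not sep or len(values) < 16:
            continue
        nums = [int(v) for v in values[:16]]
        metrics = {}
        for name, value in zip(IN_FIELDS, nums[:8]):
            metrics["net.if.in." + name] = value
        for name, value in zip(OUT_FIELDS, nums[8:]):
            metrics["net.if.out." + name] = value
        for name in ("bytes", "packets", "errors", "dropped"):
            metrics["net.if.total." + name] = (
                metrics["net.if.in." + name] + metrics["net.if.out." + name]
            )
        result[iface.strip()] = metrics
    return result
```

The values in the file are cumulative counters, so a collector would typically report them as counter types or compute the difference between two reads.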

11. Port collection items

Calculation method: ss -ln is used to determine whether the specified port is in the listen state. The metric is a raw-value type: 1 means the port is listening and 0 means it is not. Each metric has a set of tags attached, such as port=$port, where $port is the specific port.

  • net.port.listen
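
One way to turn ss -ln output into this 1/0 value is sketched below. The function names are hypothetical, and the parsing assumes the usual column layout in which the local address follows the State, Recv-Q and Send-Q columns; output formats can differ between ss versions, so treat this as illustrative only.

```python
def listening_ports(ss_output):
    """Collect the port numbers of sockets shown in LISTEN state."""
    ports = set()
    for line in ss_output.splitlines():
        cols = line.split()
        if "LISTEN" not in cols or cols[0].startswith("u_"):
            continue  # skip headers, non-listening rows and Unix sockets
        idx = cols.index("LISTEN")
        if len(cols) <= idx + 3:
            continue
        # Assumed layout: State, Recv-Q, Send-Q, Local Address:Port
        local = cols[idx + 3]
        port = local.rsplit(":", 1)[-1]
        if port.isdigit():
            ports.add(int(port))
    return ports


def net_port_listen(ss_output, port):
    """Return 1 if the port is listening, otherwise 0."""
    return 1 if port in listening_ports(ss_output) else 0
```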

12. Machine Kernel Configuration

  • kernel.maxfiles: read from /proc/sys/fs/file-max
  • kernel.files.allocated: the first field of /proc/sys/fs/file-nr
  • kernel.files.left: value = kernel.maxfiles - kernel.files.allocated
  • kernel.maxproc: read from /proc/sys/kernel/pid_max
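
The file-handle arithmetic above can be sketched in a few lines. The function name is hypothetical; its two arguments are the raw contents of /proc/sys/fs/file-max and /proc/sys/fs/file-nr.

```python
def kernel_file_metrics(file_max_text, file_nr_text):
    """Derive kernel.* file-handle metrics from the two proc files."""
    max_files = int(file_max_text.strip())
    # The first field of file-nr is the number of allocated file handles.
    allocated = int(file_nr_text.split()[0])
    return {
        "kernel.maxfiles": max_files,
        "kernel.files.allocated": allocated,
        "kernel.files.left": max_files - allocated,
    }
```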

13. ntp collection item

ntpq -pn is used to get the offset of the local time relative to the NTP server.

  • sys.ntp.offset: local offset time, in ms
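
A minimal sketch of extracting the offset from ntpq -pn output follows. The function name is hypothetical, and reading the value from the peer marked with a leading * (the currently selected source) is an assumption of this sketch; the text above does not say which peer's offset is reported.

```python
def parse_ntp_offset(ntpq_output):
    """Return the offset (ms) of the selected peer, or None if there is none."""
    for line in ntpq_output.splitlines():
        if line.startswith("*"):
            cols = line.split()
            # Columns: remote refid st t when poll reach delay offset jitter
            if len(cols) >= 10:
                return float(cols[8])
    return None
```

Returning None when no peer is selected lets the caller distinguish "not synchronized" from a genuine offset of 0.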

14. Process monitoring

  • proc.num: The number of instances of a given process. There are two ways to match. The first is by process name, such as name=sshd. The second is by cmdline: for example, several Java applications may all have the process name java, so they cannot be told apart by name. In that case a cmdline can be configured instead, such as cmdline=./falcon_agent-c./cfg.ini
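
The two matching modes can be sketched as below. The function names are hypothetical, and two details are assumptions of this sketch rather than the agent's documented behavior: cmdline matching is done as a substring test, and the NUL separators in /proc/<pid>/cmdline are replaced with spaces before matching.

```python
import os


def count_procs(procs, name=None, cmdline=None):
    """Count (name, cmdline) pairs matching a process name and/or a cmdline substring."""
    count = 0
    for proc_name, proc_cmdline in procs:
        if name is not None and proc_name != name:
            continue
        if cmdline is not None and cmdline not in proc_cmdline:
            continue
        count += 1
    return count


def list_procs(proc_root="/proc"):
    """Read (name, cmdline) for every process visible under /proc."""
    procs = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                name = f.read().strip()
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as f:
                raw = f.read()
        except OSError:
            continue  # the process exited while we were reading it
        cmd = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        procs.append((name, cmd))
    return procs
```

For example, count_procs(list_procs(), name="sshd") would give a proc.num value for sshd on a Linux host.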

15. Process resource monitoring

  • process.cpu.all: total CPU time (sys + user) used by the process and its child processes, in jiffies
  • process.cpu.sys: system CPU time used by the process and its child processes, in jiffies
  • process.cpu.user: user CPU time used by the process and its child processes, in jiffies
  • process.swap: swap used by the process and its child processes, in pages
  • process.fd: the number of file descriptors used by the process
  • process.mem: memory occupied by the process, in bytes
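
As an illustration of where per-process CPU jiffies can come from, the sketch below reads utime, stime, cutime and cstime from the content of /proc/<pid>/stat. The function name is hypothetical and this is not necessarily how the agent collects these metrics. Note that cutime and cstime only cover child processes that have already exited and been waited for.

```python
def parse_proc_stat_cpu(stat_text):
    """Extract CPU jiffies for a process and its reaped children."""
    # Field 2 (comm) may contain spaces or parentheses, so split
    # after the last ')' and index the remaining fields from there.
    rest = stat_text.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); utime..cstime are fields 14 to 17.
    utime, stime, cutime, cstime = (int(v) for v in rest[11:15])
    user = utime + cutime
    system = stime + cstime
    return {
        "process.cpu.user": user,
        "process.cpu.sys": system,
        "process.cpu.all": user + system,
    }
```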

16. ss command output

  • ss.orphaned
  • ss.closed
  • ss.timewait
  • ss.slabinfo.timewait
  • ss.synrecv
  • ss.estab
