Summary of O&M Monitoring Indicators

This article summarizes what to monitor in operations and maintenance (O&M) work.

Monitoring goals

Understand why monitoring matters and which business goals it is meant to achieve.

These usually come down to the following three points:

  • Monitor the target system in real time

  • Give real-time feedback on the current state of the target system: whether its hardware, software, and business functions are normal, and what state each is in

  • Ensure the reliability of the target system so that the business keeps running stably

Monitoring methods

  • Understand the monitored object, for example: how does the CPU work?

  • Establish performance baseline indicators, for example: CPU usage, load, user mode, kernel mode, context switches

  • Define alarm thresholds, for example: what counts as high CPU load, and how high kernel-mode and user-mode usage may go

  • Work out how to handle faults more efficiently
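To make the baseline indicators and alarm thresholds above concrete, here is a minimal Python sketch. It assumes a Linux host, where the aggregate `cpu` line of `/proc/stat` holds cumulative counters. The function names and the threshold values are hypothetical and would need tuning against your own baseline.

```python
from dataclasses import dataclass


@dataclass
class CpuTimes:
    """First eight cumulative counters of the aggregate 'cpu' line in /proc/stat."""
    user: int
    nice: int
    system: int
    idle: int
    iowait: int
    irq: int
    softirq: int
    steal: int

    @property
    def total(self) -> int:
        return (self.user + self.nice + self.system + self.idle
                + self.iowait + self.irq + self.softirq + self.steal)


def parse_cpu_line(line: str) -> CpuTimes:
    """Parse a line such as 'cpu  100 0 50 800 50 0 0 0 0 0'."""
    fields = line.split()
    if fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return CpuTimes(*(int(value) for value in fields[1:9]))


def cpu_percentages(before: CpuTimes, after: CpuTimes) -> dict:
    """User-mode, kernel-mode, and overall busy percentages between two samples."""
    delta_total = after.total - before.total
    if delta_total <= 0:
        return {"user": 0.0, "system": 0.0, "busy": 0.0}
    user = (after.user + after.nice) - (before.user + before.nice)
    system = ((after.system + after.irq + after.softirq)
              - (before.system + before.irq + before.softirq))
    idle = (after.idle + after.iowait) - (before.idle + before.iowait)
    return {
        "user": 100.0 * user / delta_total,
        "system": 100.0 * system / delta_total,
        "busy": 100.0 * (delta_total - idle) / delta_total,
    }


# Hypothetical alarm thresholds in percent; not taken from the article.
THRESHOLDS = {"user": 70.0, "system": 30.0, "busy": 85.0}


def breached(percentages: dict, thresholds: dict = THRESHOLDS) -> list:
    """Names of the indicators that exceed their threshold."""
    return sorted(name for name, limit in thresholds.items()
                  if percentages[name] > limit)
```

In practice the two samples would come from reading the first line of `/proc/stat` twice, a few seconds apart, and passing each line to `parse_cpu_line`.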

Core of monitoring

  • Find the problem

  • Locate the problem

  • Solve the problem

  • Summarize the problem: record the cause of the failure and how to prevent it, so that it does not recur

Monitoring tools

  • Long-established tools

    • Cacti

    • Nagios

    • smokeping

  • Currently popular tools

    • Zabbix

    • Open-Falcon

    • Prometheus + Grafana

    • Nightingale (open-sourced by Didi)

    • SmartPing (dedicated to network monitoring)

    • Lepus (dedicated to database monitoring)

    • In-house developed systems

  • Third-party monitoring services

    • Jiankongbao

    • Tingyun

    • New Relic

Monitoring process

  • Collection

Collect data from the system through SNMP, agents, ICMP, SSH, IPMI, etc.

  • Storage

Store the data in a database service such as MySQL or PostgreSQL

  • Analysis

Provide graphs and timeline information to help locate faults

  • Display

Show indicator values and indicator trends

  • Alerting

Phone, email, WeChat, SMS, plus an alert escalation mechanism

  • Handling

Determine the fault level and find the responsible people to handle it quickly
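The alert escalation mechanism mentioned in the alerting step can be sketched as a simple ladder. The delays, channels, and recipients below are hypothetical placeholders rather than values from any particular tool. The idea is only that the longer an alert stays unacknowledged, the more intrusive the channel and the more senior the recipient.

```python
# Hypothetical escalation ladder: (minutes unacknowledged, channel, recipient).
ESCALATION_LADDER = [
    (0, "email", "on-call engineer"),
    (5, "wechat", "on-call engineer"),
    (15, "sms", "team lead"),
    (30, "phone", "department manager"),
]


def notifications_due(minutes_unacknowledged: float, ladder=ESCALATION_LADDER) -> list:
    """Return every (channel, recipient) step whose delay has already elapsed."""
    return [(channel, recipient)
            for delay, channel, recipient in ladder
            if minutes_unacknowledged >= delay]
```

For example, `notifications_due(16)` returns the email, WeChat, and SMS steps, while the phone call is held back until the 30-minute mark.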

Monitoring indicators

Hardware monitoring

  • Machine hardware: CPU temperature, physical disks, virtual disks, motherboard temperature, disk arrays
    If the IPMI tool cannot obtain the hardware status, the MegaCli tool can be used to check the status of the RAID disk array
    https://www.ibm.com/developerworks/cn/linux/l-ipmi/

System monitoring

  • Host liveness

  • CPU, memory, and disk usage

  • Inode usage

  • Load

  • Network interface inbound and outbound bandwidth

  • Number of TCP connections

  • Disk reads and writes, and read-only state
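Several of the system indicators above can be read with the Python standard library alone. The following is a minimal sketch for disk usage, inode usage, and load, assuming a Unix-like host (`os.statvfs` and `os.getloadavg` are not available on Windows). The function names are illustrative.

```python
import os


def filesystem_usage(path: str = "/") -> dict:
    """Block and inode usage percentages for the filesystem that holds `path`."""
    st = os.statvfs(path)
    # f_bfree counts all free blocks; f_bavail excludes blocks reserved for root.
    used_blocks = st.f_blocks - st.f_bfree
    usable_blocks = used_blocks + st.f_bavail
    disk_pct = 100.0 * used_blocks / usable_blocks if usable_blocks else 0.0
    # Some filesystems report zero total inodes, so guard against division by zero.
    used_inodes = st.f_files - st.f_ffree
    inode_pct = 100.0 * used_inodes / st.f_files if st.f_files else 0.0
    return {"disk_pct": disk_pct, "inode_pct": inode_pct}


def load_per_cpu() -> float:
    """1-minute load average divided by the number of logical CPUs."""
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)
```

Normalizing the load by CPU count makes one threshold usable across machines of different sizes. A sustained value above 1.0 means more runnable work than there are CPUs.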

Application monitoring

MySQL

  • Service availability

  • Memory usage

  • Disk usage

  • Master-slave replication out of sync, and replication delay

  • Backup status

  • Number of connections

Redis, Redis Cluster

  • load

  • memory usage

  • number of connections

  • SWC

Nginx

  • status code

  • connection status information

  • RabbitMQ

  • PHP-FPM

  • OpenLDAP

    • Access IP

    • Number of calls

  • Zimbra

  • OpenVPN

    • Version information, number of users currently online

    • User, assigned IP, client connection IP, geographic location resolved from the IP, traffic received and sent, connection time, duration, connection ID

  • ELK

  • Graylog

  • GitLab

  • Jenkins

  • MongoDB

  • HAProxy
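For the Nginx status-code indicator listed above, one possible approach is to tally the codes in the access log. This sketch assumes the common "combined" log format, where the status code directly follows the quoted request line. A custom `log_format` would need a different pattern, and request lines that themselves contain quote characters are not handled.

```python
import re
from collections import Counter

# Matches the quoted request line followed by the 3-digit status code.
STATUS_RE = re.compile(r'"[^"]*" (\d{3}) ')


def status_counts(lines) -> Counter:
    """Count HTTP status codes across access-log lines; unparsable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def server_error_ratio(counts: Counter) -> float:
    """Share of responses with a 5xx status code."""
    total = sum(counts.values())
    errors = sum(n for code, n in counts.items() if code.startswith("5"))
    return errors / total if total else 0.0
```

Feeding it one minute of log lines at a time gives a per-minute 5xx ratio that can be compared with an alarm threshold.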

Network Monitoring

  • network quality

  • Public network egress

  • Dedicated line bandwidth

  • Network equipment

Traffic Analysis

log monitoring

Security Monitoring

  • URL, API monitoring

  • In-house developed solution

  • Alibaba Cloud Solution

Performance monitoring (APM) for Java, PHP, Go, and Node.js; distributed tracing

  • PinPoint

  • Zipkin

  • SkyWalking

  • CAT, Jaeger

Business monitoring

Taking an e-commerce business as an example:

  • How many orders are generated per minute

  • How many users register per minute

  • How many users are active per minute

  • How many promotions run per day

  • How many users a promotion brings in

  • How much traffic a promotion brings in

  • How much profit a promotion brings in
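Per-minute business indicators such as orders or registrations usually reduce to bucketing event timestamps by minute. This is a minimal sketch. Where the timestamps come from (an orders table, a message queue, application logs) is left open, and the function name is illustrative.

```python
from collections import Counter
from datetime import datetime


def events_per_minute(timestamps) -> dict:
    """Count events in each calendar minute, ordered by time."""
    buckets = Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)
    return dict(sorted(buckets.items()))
```

The resulting series can be plotted as a trend or compared with the same minute on previous days to spot an abnormal drop in orders.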

Other

  • SSL certificate monitoring

  • Process liveness, port monitoring, log rotation

  • Health indicators: MQ message backlog

  • Interface monitoring: API success rate, latency, QPS, etc.
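SSL certificate monitoring from the list above mostly means alerting before a certificate expires. The sketch below uses only the Python standard library. The function names are illustrative, and the alert threshold (how many days ahead to warn) is left to the caller.

```python
import socket
import ssl
import time
from typing import Optional


def days_until(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter time, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400.0


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter field of the certificate presented by host:port.

    The default context verifies the certificate, so an already expired or
    otherwise invalid certificate raises ssl.SSLError instead of returning.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A check would then be something like `days_until(fetch_not_after("example.com")) < 14`, with both a failed connection and a verification error treated as alerts in their own right.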

Alert notification

  • Email

  • SMS

  • Instant messaging software such as DingTalk, WeChat, and WeCom (Enterprise WeChat)

  • Phone call

Alarm handling

Fault self-healing: automatically restart a service when it goes down. This can be implemented with software mechanisms such as supervisor or systemd, or with custom scripts
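As an illustration of the custom-script option, here is a minimal check-and-restart sketch. It is deliberately generic: the check and restart commands are supplied by the caller, and the retry count and wait time are hypothetical defaults.

```python
import subprocess
import time


def is_healthy(check_cmd) -> bool:
    """A service counts as healthy when its check command exits with status 0."""
    result = subprocess.run(check_cmd, stdout=subprocess.DEVNULL,
                            stderr=subprocess.DEVNULL)
    return result.returncode == 0


def heal(check_cmd, restart_cmd, max_attempts: int = 3,
         wait_seconds: float = 2.0) -> bool:
    """Restart the service until the check passes; return False if it never recovers."""
    if is_healthy(check_cmd):
        return True
    for _ in range(max_attempts):
        subprocess.run(restart_cmd, check=False)
        time.sleep(wait_seconds)  # give the service time to come up
        if is_healthy(check_cmd):
            return True
    return False
```

On a systemd host this could be called as `heal(["systemctl", "is-active", "--quiet", "nginx"], ["systemctl", "restart", "nginx"])`, where `nginx` is only an example unit name. A `False` result means self-healing failed and the alert should go to a person.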

Comprehensive monitoring

Hardware monitoring

Routers and switches are monitored through SNMP, and other hardware through IPMI. If everything runs on a public cloud, this part can be ignored. Case: Open-Falcon monitoring an H3C-ER3260G2 router

System monitoring

Service monitoring

  • Built into the service

    • Nginx ships with a status module

    • PHP-FPM has a corresponding status page

    • MySQL can be monitored with the official Percona tools

  • Data obtained with custom methods

    • MySQL: show global status xxx;

    • Redis: output of the info command
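As an example of the custom-method approach, the text reply of the Redis `INFO` command consists of `# Section` header lines and `field:value` lines, which is easy to turn into indicators. The sketch below parses such a reply that has already been fetched (for instance with `redis-cli info`). The helper names are illustrative.

```python
from typing import Optional


def parse_redis_info(text: str) -> dict:
    """Parse the text reply of Redis INFO into {section: {field: value}}."""
    sections = {}
    current = sections.setdefault("default", {})
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Section header such as '# Memory'.
            current = sections.setdefault(line.lstrip("# ").lower(), {})
            continue
        key, sep, value = line.partition(":")
        if sep:
            current[key] = value
    # Drop sections that ended up with no fields.
    return {name: fields for name, fields in sections.items() if fields}


def memory_usage_ratio(info: dict) -> Optional[float]:
    """used_memory / maxmemory, or None when no memory limit is configured."""
    memory = info.get("memory", {})
    limit = int(memory.get("maxmemory", 0))
    if limit == 0:
        return None
    return int(memory["used_memory"]) / limit
```

The same pattern applies to the MySQL status counters: collect the raw key/value pairs on a schedule, then derive the rates and ratios you actually alert on.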

  • Network Monitoring (Hybrid Cloud Architecture)

    • smokeping

    • smartping

  • Security monitoring

    • Cloud services can use the provider's security groups directly, optionally supplemented with native iptables

    • Hardware firewall

    • Web services can implement a web-level firewall with Nginx + Lua, or with OpenResty

  • Log monitoring
    ELK or Graylog can monitor keywords in exception logs and error logs

  • Business monitoring
    Decide which indicators to monitor and then monitor them; they differ from business to business

  • Traffic analysis
    Baidu Tongji or Google Analytics is recommended; the business and R&D teams implement it by embedding tracking code.

    Alternatively, use Piwik

  • Visualization
    Dashboards

  • Automated monitoring
    Through APIs and batch operations

Monitoring summary

Building a complete monitoring system requires a detailed understanding of the business; the software is only a means to that end.

Origin blog.csdn.net/LinkSLA/article/details/130213736