This note summarizes what to monitor in operations and maintenance (O&M) work.
Monitoring goals
Understand why monitoring matters and which business goals it is meant to achieve.
These usually come down to the following three points:
-
Real-time monitoring of the target system
-
Monitoring gives real-time feedback on the current state of the target system: whether its hardware, software, and business are running normally and what state they are currently in
-
Ensure the reliability of the target system so that the business keeps running stably
Monitoring methods
-
Understand the monitored object, for example: how does the CPU work?
-
Establish performance baseline metrics, for example: CPU usage, load, user mode, kernel mode, context switches
-
Define alarm thresholds, for example: what counts as a high CPU load, and how high kernel-mode and user-mode usage may go
-
How to deal with faults more efficiently
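The threshold idea above can be sketched in a few lines. This is a minimal sketch, assuming load is judged relative to the number of CPU cores; the function name `classify_load` and the ratios 0.7 and 1.0 are illustrative assumptions, not values from these notes.

```python
import os


def classify_load(load1: float, cpu_count: int,
                  warn_ratio: float = 0.7, crit_ratio: float = 1.0) -> str:
    """Classify the 1-minute load average relative to the core count."""
    per_core = load1 / max(cpu_count, 1)
    if per_core >= crit_ratio:
        return "critical"
    if per_core >= warn_ratio:
        return "warning"
    return "ok"


def current_load_status() -> str:
    """Read the live 1-minute load average (Unix only) and classify it."""
    load1, _, _ = os.getloadavg()
    return classify_load(load1, os.cpu_count() or 1)
```

Calling `current_load_status()` on a Linux host returns one of `"ok"`, `"warning"`, or `"critical"`; the right ratios depend on the workload and should come from the baseline metrics.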
Core of monitoring
-
Detect the problem
-
Locate the problem
-
Solve the problem
-
Review the problem: summarize the cause of the failure and how to prevent it, so that it does not recur
Monitoring tools
-
Established monitoring tools
-
Cacti
-
Nagios
-
SmokePing
-
Popular monitoring tools
-
Zabbix
-
Open-Falcon
-
Prometheus + Grafana
-
Nightingale (open-sourced by Didi)
-
SmartPing (dedicated to network monitoring)
-
Lepus (dedicated to database monitoring)
-
Self-developed tools
-
Third-party monitoring services
-
Jiankongbao
-
Tingyun
-
New Relic
Monitoring process
-
Collection
Collect data from the system through SNMP, agents, ICMP, SSH, IPMI, etc.
-
Storage
Various database services, such as MySQL and PostgreSQL
-
Analysis
Provide graphs and timeline information that help locate the fault
-
Display
Metric values and metric trends
-
Alerting
Phone, email, WeChat, SMS, and an alert escalation mechanism
-
Handling
Determine the fault severity and find the responsible people so it can be handled quickly
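The alert escalation mechanism mentioned in the alerting step can be illustrated as follows. This is only a sketch: the table `ESCALATION_STEPS`, its time limits, channels, and recipients are hypothetical and would be defined per team.

```python
# Hypothetical policy: (minutes without acknowledgement, channel, recipient)
ESCALATION_STEPS = [
    (0, "email", "on-call engineer"),
    (10, "sms", "on-call engineer"),
    (30, "phone", "team lead"),
]


def due_notifications(minutes_unacked: int) -> list[tuple[str, str]]:
    """Return every (channel, recipient) step reached by an unacknowledged alert."""
    return [(channel, who)
            for after, channel, who in ESCALATION_STEPS
            if minutes_unacked >= after]
```

An alert that nobody has acknowledged for 15 minutes has reached the email and SMS steps; after 30 minutes the phone step is added.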
Monitoring metrics
Hardware monitoring
-
Machine hardware: CPU temperature, physical disk, virtual disk, motherboard temperature, disk array
If the IPMI tools cannot obtain the hardware status, the MegaCli tool can be used to check the status of the RAID disk array
https://www.ibm.com/developerworks/cn/linux/l-ipmi/
System monitoring
-
Host liveness
-
CPU, memory, and disk usage
-
Inode usage
-
Load
-
NIC inbound and outbound bandwidth
-
Number of TCP connections
-
Disk reads and writes, and read-only state
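Two of the system metrics above, disk usage and inode usage, can be read with the Python standard library. This is a minimal sketch for Unix-like hosts; the function names are illustrative, and a real agent would report these values to the monitoring system rather than just compute them.

```python
import os
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Percentage of disk space used on the filesystem containing path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def inode_usage_percent(path: str = "/") -> float:
    """Percentage of inodes used on the filesystem containing path (Unix only)."""
    st = os.statvfs(path)
    if st.f_files == 0:  # some filesystems do not report an inode count
        return 0.0
    return 100.0 * (st.f_files - st.f_ffree) / st.f_files
```

Inode usage is worth tracking separately because a filesystem full of small files can run out of inodes while still showing free space.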
Application monitoring
MySQL
-
Service availability
-
Memory usage
-
Disk usage
-
Master-slave replication out of sync and replication lag
-
Backup status
-
Number of connections
Redis, Redis Cluster
-
Load
-
Memory usage
-
Number of connections
-
SWC
Nginx
-
Status codes
-
Connection status information
-
RabbitMQ
-
PHP-FPM
-
OpenLDAP
-
Client IP addresses
-
Number of calls
-
Zimbra
-
OpenVPN
-
Version information, currently online users, assigned IP, client connection IP, geographic location derived from the IP, traffic received and sent, connection time, duration, connection ID
-
ELK
-
Graylog
-
GitLab
-
Jenkins
-
MongoDB
-
HAProxy
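For the Nginx connection status item in the list above, the text produced by the stub_status module can be turned into metrics. This is a sketch under assumptions: `SAMPLE` is illustrative text in the format that module documents, and `parse_stub_status` is a hypothetical helper, not part of Nginx or of any monitoring tool.

```python
import re

# Illustrative text in the stub_status output format
SAMPLE = """Active connections: 291
server accepts handled requests
 16630948 16630948 31070465
Reading: 6 Writing: 179 Waiting: 106
"""


def parse_stub_status(text: str) -> dict[str, int]:
    """Extract connection counters from stub_status text."""
    active = re.search(r"Active connections:\s+(\d+)", text)
    totals = re.search(r"^[ \t]*(\d+)[ \t]+(\d+)[ \t]+(\d+)[ \t]*$",
                       text, re.MULTILINE)
    states = re.search(r"Reading:\s+(\d+)\s+Writing:\s+(\d+)\s+Waiting:\s+(\d+)",
                       text)
    if not (active and totals and states):
        raise ValueError("unexpected stub_status format")
    return {
        "active": int(active.group(1)),
        "accepts": int(totals.group(1)),
        "handled": int(totals.group(2)),
        "requests": int(totals.group(3)),
        "reading": int(states.group(1)),
        "writing": int(states.group(2)),
        "waiting": int(states.group(3)),
    }
```

In practice the text would be fetched over HTTP from whichever location the status module is enabled on; that URL depends on the local Nginx configuration.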
Network Monitoring
-
Network quality
-
Public network egress
-
Dedicated line bandwidth
-
Network devices
Traffic Analysis
Log monitoring
Security Monitoring
-
URL and API monitoring
-
Self-developed solution
-
Alibaba Cloud solution
Performance monitoring (APM) for Java, PHP, Go, and Node.js, including distributed tracing
-
Pinpoint
-
Zipkin
-
SkyWalking
-
CAT, Jaeger
Business monitoring
Taking an e-commerce business as an example:
-
How many orders are generated per minute
-
How many users are registered per minute
-
How many users are active per minute
-
How many promotions run per day
-
How many users the promotions bring in
-
How much traffic the promotions bring in
-
How much profit the promotions bring in
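Per-minute business metrics such as orders per minute reduce to counting events in one-minute buckets. This is a minimal sketch; the `orders` timestamps are made up for illustration, and in a real system they would come from the order database or an event stream.

```python
from collections import Counter
from datetime import datetime


def events_per_minute(timestamps: list[datetime]) -> Counter:
    """Count events in each calendar minute."""
    return Counter(ts.replace(second=0, microsecond=0) for ts in timestamps)


# Made-up order timestamps for illustration
orders = [
    datetime(2024, 1, 1, 12, 0, 5),
    datetime(2024, 1, 1, 12, 0, 40),
    datetime(2024, 1, 1, 12, 1, 10),
]
per_minute = events_per_minute(orders)
```

The same bucketing works for registrations or active users; only the source of the timestamps changes.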
Other
-
SSL certificate monitoring
-
Process liveness (whether the process is still running), port monitoring, log rotation
-
Health metrics, such as MQ message backlog
-
Interface monitoring: API success rate, latency, QPS, etc.
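SSL certificate monitoring and port monitoring from the list above can both be sketched with the standard library. The function names are illustrative assumptions. Note that `fetch_not_after` uses a verifying TLS context, so it only succeeds for certificates that still validate.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter date, e.g. 'Jan  5 09:34:43 2018 GMT'."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400.0


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Fetch the notAfter field of the certificate presented by host:port."""
    context = ssl.create_default_context()  # verifies the certificate chain
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def port_is_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

A typical check would alert when `days_until_expiry(fetch_not_after(host))` drops below a chosen number of days; that number is a policy decision and is not specified in these notes.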
Alert notification
-
Email
-
SMS
-
Instant messaging software such as DingTalk, WeChat, and WeCom (Enterprise WeChat)
-
Phone call
Alarm handling
Fault self-healing: automatically restart a service when it goes down, implemented with supervisor, systemd, or custom scripts
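The custom-script option for self-healing can be sketched as a small check-and-restart loop. This is a sketch under assumptions: `heal_if_down` and `restart_with_systemd` are hypothetical helpers, the health check is supplied by the caller, and a production script would also wait between attempts and raise an alert when healing fails.

```python
import subprocess
from typing import Callable


def heal_if_down(is_healthy: Callable[[], bool],
                 restart: Callable[[], None],
                 max_attempts: int = 3) -> bool:
    """Restart a service until it passes the health check or attempts run out."""
    for _ in range(max_attempts):
        if is_healthy():
            return True
        restart()
    return is_healthy()


def restart_with_systemd(unit: str) -> None:
    """Restart a systemd unit; the unit name is supplied by the caller."""
    subprocess.run(["systemctl", "restart", unit], check=True)
```

Passing the check and the restart action as callables keeps the loop independent of how the service is managed, so the same logic works with supervisor or any other restart command.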
Comprehensive monitoring
Hardware monitoring
Routers and switches are monitored through SNMP, and other hardware through IPMI. If everything runs on a public cloud, this part can be skipped. Case: monitoring an H3C-ER3260G2 router with Open-Falcon
System monitoring
Service monitoring
-
Built into the service
-
Nginx ships with a status module
-
PHP has a corresponding status module
-
MySQL can be monitored with the official Percona tools
-
Data obtained by custom methods
-
MySQL: show global status xxx;
-
Redis: output of the info command
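As an illustration of the custom collection approach, the text returned by the Redis info command can be parsed into key-value pairs. This is a minimal sketch: `parse_redis_info` is a hypothetical helper and `SAMPLE` is a shortened, made-up excerpt in the info output format.

```python
def parse_redis_info(text: str) -> dict[str, str]:
    """Parse the key:value text returned by the Redis info command."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and section headers
            continue
        key, sep, value = line.partition(":")
        if sep:
            info[key] = value
    return info


# Made-up excerpt for illustration
SAMPLE = """# Clients
connected_clients:12

# Memory
used_memory:1048576
"""
```

Values are kept as strings because the output mixes numbers, version strings, and comma-separated fields; a collector would convert only the metrics it needs, for example `int(info["used_memory"])`.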
-
Network monitoring (hybrid cloud architecture)
-
SmokePing
-
SmartPing
-
Security monitoring
-
Cloud services can use cloud security groups directly, optionally supplemented by iptables on the host
-
Hardware firewall
-
Web services can implement a web application firewall with Nginx + Lua, or with OpenResty
-
Log monitoring
Use ELK or Graylog to monitor exception logs and error logs by keyword
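The keyword matching that ELK or Graylog would perform can be illustrated with a plain script. This is a stand-in sketch, not how either product is configured: the keyword list in `KEYWORDS` is hypothetical and should be adjusted to the application's log format.

```python
import re

# Hypothetical keyword list; adjust to the application's log format
KEYWORDS = re.compile(r"ERROR|FATAL|Exception")


def find_alert_lines(lines):
    """Return (line_number, line) pairs whose text contains an alert keyword."""
    return [(number, line.rstrip("\n"))
            for number, line in enumerate(lines, start=1)
            if KEYWORDS.search(line)]
```

Because the function accepts any iterable of lines, it can be applied to an open file object directly, for example `find_alert_lines(open(path))` for a log file at a caller-chosen `path`.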
Business monitoring
Decide which metrics matter and monitor them; they differ from business to business
Traffic analysis
Commercial services such as Baidu Tongji or Google Analytics are recommended, with the tracking code embedded by the development team. Alternatively, use Piwik
-
Visualization
Dashboards
Automated monitoring
Batch operations through the API
Monitoring summary
A complete monitoring system requires a detailed understanding of the business; software is only a means to that end.