[Monitoring system] An introduction to monitoring systems and a comparison of mainstream monitoring frameworks

Internet applications are inseparable from monitoring systems. So why do we need a monitoring system?

Products of Internet companies usually provide services through software, websites, apps or other digital means, and such products may face a series of risks and challenges during use.

For example, network failures, hardware failures, or configuration errors can cause unstable access or downtime, which hurts the user experience. Performance bottlenecks and latency are another risk: when traffic grows, servers may be pushed beyond their load capacity or bandwidth may reach its upper limit, so web pages and other platforms respond more slowly and user satisfaction drops. Data security is a third concern: attackers may steal data or perform malicious operations by breaching firewalls or using viruses and other malware. Internet companies therefore need to monitor their products at every stage (development, testing, release, and operations) so that they can detect problems in time and take corresponding measures.

What can a monitoring system help us do?

  • Monitor system performance and service quality in real time to ensure system stability and detect and resolve faults in a timely manner.

  • Monitor user behavior and data traffic to further optimize service quality and improve user experience.

  • Discover and predict end-to-end IT operational risks in advance to avoid data security issues and financial losses.

  • Monitor the architecture, code, and services of the system to improve the quality of system construction and deployment and to optimize resource and operations costs.

What are system availability indicators?

An SLA (service level agreement) states how available a system is. The goal is to provide uninterrupted service 7 x 24 hours, and cloud vendors usually advertise their product SLAs by the number of 9s.

  • Time dimension: the ratio of the time the system is usable to the total time. Taking a whole year as an example (1 year = 365 days = 8760 hours), the allowed downtime is:

    • 99.9%: 8760 * 0.1% = 8760 * 0.001 = 8.76 hours
    • 99.99%: 8760 * 0.0001 = 0.876 hours = 0.876 * 60 = 52.6 minutes
    • 99.999%: 8760 * 0.00001 = 0.0876 hours = 0.0876 * 60 = 5.26 minutes
  • Request dimension: the proportion of failed requests in the total number of requests (1000 requests as a simple example)

    • 99% availability: 1000 * (1 - 99%) = 10 of 1000 requests are allowed to fail
    • 99.9% availability: 1000 * (1 - 99.9%) = 1 of 1000 requests is allowed to fail
  • The more 9s, the longer the service is available over the year, the shorter the downtime, and the more reliable the service. In practice, network or data center problems and application version updates are frequent causes of unavailability.

  • For most business lines at large tech companies, four 9s is a hard requirement, five 9s is the goal, and six 9s is the ideal.

  • How the industry uses the number of 9s to grade system availability

    • Basically available: two 9s (99%), less than 88 hours of downtime per year

    • Fairly high availability: three 9s (99.9%), less than 9 hours of downtime per year

    • High availability: four 9s (99.99%), less than 53 minutes of downtime per year

    • Very high availability: five 9s (99.999%), less than 6 minutes of downtime per year (about 5.26 minutes)
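
The arithmetic in both dimensions above can be reproduced with a few lines of Python. This is only a sketch: the function names are made up for illustration, and a year is taken as 365 days, as in the list above.

```python
HOURS_PER_YEAR = 365 * 24  # 8760, the same figure used in the list above


def allowed_downtime_hours(availability_percent: float) -> float:
    """Yearly downtime budget, in hours, for a given availability target."""
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def allowed_failed_requests(availability_percent: float, total_requests: int) -> float:
    """Number of requests that may fail without breaching the target."""
    return total_requests * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        hours = allowed_downtime_hours(target)
        print(f"{target}%: {hours:.4f} hours = {hours * 60:.2f} minutes per year")
```

Running it prints the downtime budget for two to five 9s, which is handy when you want to check a vendor's SLA claim against your own target.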

Common monitoring layers of Internet companies

Internet companies track many monitoring indicators, which are usually grouped by layer. The indicators listed here are not absolute: they should be selected according to the company's business and systems, and indicators from different layers should be included according to the monitoring priorities.

(1) System and network layer

It mainly focuses on the status and events of the underlying infrastructure to ensure the normal operation of servers, networks, CPUs, and storage devices, and to provide stable resource support.

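  • Common system indicators
    • CPU utilization: utilization of the server's main CPU cores.
    • Real-time load: the number of processes on the system, counting both the processes doing work and those waiting in the queue, that is, the current real-time pressure on the machine.
    • Memory usage: server memory usage, including used and free memory.
    • Network bandwidth utilization: how heavily the server's network is used, including the bandwidth used by network cards, load balancers, and network connections.
    • Disk I/O speed: disk read and write rates.
    • Disk capacity: server disk capacity usage, including used and free space.
    • Process usage: the state of system processes, including execution status and consumption of cluster resources.
    • Port connection status: the status of connections on system ports.
    • Error logs: error logs produced by the system, including error type, time, and handling result.
    • Response time: the time the server takes to respond to a request.
  • Common network indicators
    • Bandwidth utilization: network bandwidth utilization, including upload and download ratios.
    • Packet loss rate: the number and percentage of packets lost in transit.
    • Latency: the time from sending a request to receiving the response and finishing processing it.
    • Network traffic: the real-time volume of data flowing through the network.
    • Network error rate: the number and percentage of errors that occur in network transmission.
    • Connection count: the total number of network connections.
    • Network response time: the response time of network requests.
    • Network congestion status: the amount of available system resources and their utilization.
    • Transfer speed: the network transfer rate, including average and real-time rates.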
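
To make a few of these indicators concrete, here is a minimal sketch that reads disk usage and load from the local machine using only the Python standard library, and derives a packet loss rate from raw counters. The function names are hypothetical, and a real monitoring agent collects far more than this.

```python
import os
import shutil


def disk_usage_percent(path: str = "/") -> float:
    """Used share of the filesystem containing `path`, as a percentage."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total * 100


def load_per_core() -> float:
    """1-minute load average divided by the number of logical CPUs.

    os.getloadavg() is only available on Unix-like systems.
    """
    one_minute, _, _ = os.getloadavg()
    return one_minute / (os.cpu_count() or 1)


def packet_loss_percent(sent: int, received: int) -> float:
    """Share of packets lost in transit; 0.0 when nothing was sent."""
    if sent == 0:
        return 0.0
    return (sent - received) / sent * 100


if __name__ == "__main__":
    print(f"disk used: {disk_usage_percent('/'):.1f}%")
    print(f"packet loss: {packet_loss_percent(200, 190):.1f}%")
```

A load-per-core value that stays well above 1 usually means processes are queuing for CPU time, which is the "real-time pressure" described above.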

(2) Application layer (business applications, middleware applications, etc.)

Pay attention to the status and operation quality of the overall service, be able to predict system operation bottlenecks in time, and ensure product efficiency and user experience.

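  • Request response time: the total time from sending a request to receiving the response.
  • Error rate: the percentage of all requests for which the application produced an error.
  • CPU usage: the percentage of processor resources the application is currently using.
  • Thread count: the number of thread instances currently running in the application.
  • Average program execution time: the average execution time of each application module.
  • Heap memory usage: the percentage of the memory allocated by the Java virtual machine (JVM) that the application is using.
  • Average latency: the time from sending a request until the response starts.
  • Garbage collection time: the time the JVM needs to collect memory objects that are no longer in use.
  • Response codes: HTTP request success or failure codes.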
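
As a small illustration of the first two indicators, the sketch below computes an error rate and an average response time from a list of request records. The `RequestRecord` type is invented for this example, and counting only 5xx responses as errors is an assumption; your own definition of "error" may differ.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List


@dataclass
class RequestRecord:
    status_code: int     # HTTP response code
    duration_ms: float   # time from request to complete response


def error_rate_percent(records: List[RequestRecord]) -> float:
    """Share of requests that ended in a 5xx response (an assumption)."""
    if not records:
        return 0.0
    errors = sum(1 for r in records if r.status_code >= 500)
    return errors / len(records) * 100


def average_response_ms(records: List[RequestRecord]) -> float:
    """Mean request duration in milliseconds; 0.0 for an empty list."""
    return mean(r.duration_ms for r in records) if records else 0.0
```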

(3) Business layer

Focus on the analysis and results of business operations, and obtain more business data by monitoring platform operations and various configurations. Discover industry development trends in a timely manner, guide business direction, and realize all-round monitoring, forecasting and intervention.

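  • GMV (gross merchandise volume): the total sales of the project over a given period.
  • DAU and MAU: daily active users and monthly active users.
  • Average order value: the average amount per order.
  • Payment success rate: the ratio of successfully paid orders to total orders.
  • Order volume: the number of orders in each transaction.
  • Cart conversion rate: the ratio between items added to the cart and items actually paid for.
  • Conversion rate: the percentage of site visitors who become leads (by registering, subscribing, purchasing, and so on).
  • User retention rate: the ratio of this month's active users to last month's active users.
  • Refund rate: the ratio of the refunded amount to the total transaction amount.
  • Average time per transaction: the total time from visiting the site to completing the transaction.
  • Average time per visit: the total time users spend on the site divided by the number of valid payments.
  • Startup error rate: the number of times the application fails to start correctly on first launch.
  • Average execution time: the average time a task takes to run.
  • Exception count: the number of errors, crashes, deadlocks, and similar events produced by the system.
  • Page load time and speed: the time and speed from requesting a page to loading it.
  • Click-through rate: the click rate of ads on the site.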
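
Several of these business indicators are simple ratios over order data. The sketch below shows three of them, using a made-up `Order` record. It assumes that average order value and refund rate are computed over paid orders only, which the definitions above do not spell out.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Order:
    amount: float            # order value
    paid: bool               # True if the payment succeeded
    refunded: bool = False   # True if the paid order was later refunded


def payment_success_rate(orders: List[Order]) -> float:
    """Successfully paid orders as a percentage of all orders."""
    if not orders:
        return 0.0
    return sum(1 for o in orders if o.paid) / len(orders) * 100


def average_order_value(orders: List[Order]) -> float:
    """Average amount per paid order (assumption: unpaid orders excluded)."""
    paid = [o.amount for o in orders if o.paid]
    return sum(paid) / len(paid) if paid else 0.0


def refund_rate(orders: List[Order]) -> float:
    """Refunded amount as a percentage of the total paid amount."""
    paid_total = sum(o.amount for o in orders if o.paid)
    refunded_total = sum(o.amount for o in orders if o.paid and o.refunded)
    return refunded_total / paid_total * 100 if paid_total else 0.0
```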

OK, we have covered some general monitoring concepts. Now let's look at the mainstream monitoring frameworks in the industry.

Monitoring systems collect data in one of two architectural modes: push and pull.

Pull mode: The monitoring system fetches data on a schedule as needed, which avoids redundant data transfer and saves network bandwidth and storage space. It can also request only part of the data selectively. It suits scenarios where the monitored object does not have to meet strict stability requirements. The main cost of Pull mode falls on the monitoring system side, and the cost on the application side is relatively low. However, because data only arrives when the monitoring system sends its periodic requests, Pull mode struggles to keep up with services that need real-time data. Pull mode also requires each application to expose an interface that the monitoring system can query, which adds some cost and complexity.
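
Here is a minimal in-memory sketch of the pull idea. `PullCollector` and `app_metrics` are hypothetical names, and a plain function call stands in for the HTTP request a real system would make, so treat this as an illustration of who drives the collection, not as a real implementation.

```python
import time
from typing import Callable, Dict, List, Optional, Tuple

Metrics = Dict[str, float]
Sample = Tuple[float, str, str, float]  # (timestamp, target, metric, value)


class PullCollector:
    """Monitoring-system side: it decides when to fetch and from which target."""

    def __init__(self) -> None:
        self.targets: Dict[str, Callable[[], Metrics]] = {}
        self.samples: List[Sample] = []

    def register(self, name: str, read_metrics: Callable[[], Metrics]) -> None:
        """Add a target; read_metrics stands in for an HTTP metrics endpoint."""
        self.targets[name] = read_metrics

    def scrape_once(self, now: Optional[float] = None) -> None:
        """Fetch the current value of every metric from every target."""
        timestamp = time.time() if now is None else now
        for name, read_metrics in self.targets.items():
            for metric, value in read_metrics().items():
                self.samples.append((timestamp, name, metric, value))

    def run(self, interval_seconds: float, rounds: int) -> None:
        """Scrape on a fixed schedule; data is never fresher than the interval."""
        for _ in range(rounds):
            self.scrape_once()
            time.sleep(interval_seconds)


# Application side: it only exposes its current state and sends nothing itself.
request_count = {"value": 0.0}


def app_metrics() -> Metrics:
    return {"requests_total": request_count["value"]}


if __name__ == "__main__":
    collector = PullCollector()
    collector.register("app-1", app_metrics)
    collector.run(interval_seconds=0.01, rounds=3)
    print(len(collector.samples), "samples collected")
```

Notice that the application does almost nothing here, while the collector owns the schedule, the target list, and the storage. That is the cost distribution described above.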

Push mode: The monitored object actively pushes data, and the monitoring system does not need to send requests, which avoids the network overhead of polling. For services that need real-time data, Push mode offers higher transmission efficiency and better timeliness. It suits a stable network environment, where it can deliver real-time data quickly. However, the main cost of Push mode falls on the pushing application and its Push Agent, while the cost on the monitoring system side is lower than with Pull. Because the monitored object must actively send data, Push mode is a poor fit when the monitored object is under tight constraints, for example when its outbound data flow must be strictly controlled.
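
The push idea can be sketched the same way. `MonitoringServer` and `PushAgent` are made-up classes, and a method call stands in for the network send; the batching by `batch_size` is just one possible design, chosen here to show that the buffering work sits on the application side.

```python
from typing import List, Tuple

Sample = Tuple[float, str, str, float]  # (timestamp, source, metric, value)


class MonitoringServer:
    """Monitoring-system side: it only receives what the agents send."""

    def __init__(self) -> None:
        self.samples: List[Sample] = []

    def receive(self, batch: List[Sample]) -> None:
        self.samples.extend(batch)


class PushAgent:
    """Application side: buffers samples and pushes them in batches."""

    def __init__(self, source: str, server: MonitoringServer, batch_size: int = 3) -> None:
        self.source = source
        self.server = server
        self.batch_size = batch_size
        self.buffer: List[Sample] = []

    def record(self, metric: str, value: float, timestamp: float) -> None:
        """Buffer one sample and push as soon as the batch is full."""
        self.buffer.append((timestamp, self.source, metric, value))
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        """Send whatever is buffered; call this on shutdown as well."""
        if self.buffer:
            self.server.receive(self.buffer)
            self.buffer = []
```

With a `batch_size` of 1 every sample is sent immediately, which gives the best timeliness at the price of more traffic from the monitored object.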

For the company's internal monitoring system, it is more appropriate to have both Pull and Push capabilities.

Let's compare the mainstream monitoring and alerting frameworks.

Zabbix (supports both pull and push modes): Written in C, it is enterprise-level monitoring software built on a server-client architecture. It can monitor a wide range of network services, server resources, and hardware.

The advantage is that it is user-friendly, easy to deploy and set up, and has a plug-in framework that can flexibly meet different monitoring needs. It provides rich graphical monitoring data with high scalability, and users can easily add custom functions.

The disadvantage is that, although powerful, it is cumbersome to use, and installing and deploying Zabbix takes more time and resources. For complex IT architectures, Zabbix is difficult to configure, and its responses are not real-time, which may not be ideal for scenarios with strict real-time requirements.

Open-Falcon (push mode) : It is a distributed, high-performance monitoring solution, which is used in large-scale high-availability systems and extensive cloud computing environments.

The advantage is that it is lightweight, can be deployed on physical machines and virtual machines, and the deployment and configuration are simple enough. Supports query optimization, supports multiple data source input, and has a simple and easy-to-use graphical interface.

The disadvantage is that its domestic user base is small and the community is not very active. For some complex IT architectures, Open-Falcon can be difficult to set up and install, and experienced developers may need to modify the code to meet specific requirements.

Prometheus (Pull mode): Written in Go, it is a scalable open source monitoring system for collecting and storing multi-dimensional time series data. It supports the PromQL query language and provides a basic graphical display. Visualization and alerting are handled by separate components such as Grafana and Alertmanager, which keeps the system highly extensible. Its core functionality is simple; as a lightweight rising star, it has clear advantages in performance and display, and it supports container monitoring very well.

The advantage is that it is highly scalable, can be scaled horizontally (for example through federation), and meets the needs of large-scale monitoring. It offers a flexible query language with rich functions and operators. Data can be collected from different sources, and multi-dimensional aggregation is supported. Alertmanager can be easily integrated for alerting.

The disadvantage is that the storage mechanism is not flexible enough, storing and processing very large volumes of data puts it under pressure, and it demands a high level of technical skill from operations engineers.
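
To give a rough feel for what "multi-dimensional time series data" looks like when Prometheus pulls it, the sketch below renders samples in the Prometheus text exposition format: a metric name, optional labels in braces, and a value, preceded by a `# TYPE` line. The helper names and the example metric are made up for illustration, and label-value escaping is skipped. In a real application you would most likely use an official client library instead of hand-rolling this.

```python
from typing import Dict, Iterable, List, Tuple


def render_sample(name: str, labels: Dict[str, str], value: float) -> str:
    """Render one sample line: name{label="value",...} value."""
    if labels:
        label_text = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
        return f"{name}{{{label_text}}} {value}"
    return f"{name} {value}"


def render_family(
    name: str,
    metric_type: str,
    samples: Iterable[Tuple[Dict[str, str], float]],
) -> str:
    """Render a TYPE line followed by one line per label combination."""
    lines: List[str] = [f"# TYPE {name} {metric_type}"]
    lines.extend(render_sample(name, labels, value) for labels, value in samples)
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_family(
        "http_requests_total",
        "counter",
        [({"method": "GET", "code": "200"}, 1027), ({"method": "POST", "code": "500"}, 3)],
    ))
```

Each distinct combination of label values is its own time series, which is what makes aggregation by dimension (per method, per status code, and so on) possible in PromQL.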

OK, this article has briefly introduced the business needs behind monitoring systems and the mainstream monitoring frameworks. Next, we will focus on Prometheus.

That's all for now. See you next time, and remember to like, bookmark, and follow!

Origin blog.csdn.net/weixin_47533244/article/details/131628981