Hierarchical architecture design for enterprise-level IT application operation and maintenance monitoring

Most enterprises run their own IT systems, and each IT system usually comes with its own monitoring system.

An enterprise-level IT application monitoring framework is a comprehensive solution involving many levels and corresponding tools. As the scale and complexity of enterprise IT systems continue to increase, monitoring and management systems are also facing increasing challenges.

Sometimes people don't know where to start when setting up monitoring; sometimes, after a monitoring system is in place, they find that it still leaves many blind spots.

This article covers the basic principles of IT application operation and maintenance monitoring, the general monitoring layers and their application scenarios, monitoring platform design, and methods for implementing intelligent monitoring, in the hope of helping with the monitoring and management of enterprise IT systems.

1. Monitoring principle

The basic principle of an enterprise-level IT application operation and maintenance monitoring architecture is to monitor and manage enterprise IT systems comprehensively by collecting, storing, analyzing, and displaying monitoring data. This data includes indicator data for systems, networks, and applications, as well as event data and log data, all of which can be gathered by data collectors.

The collected data can be stored in storage systems such as distributed databases, NoSQL databases, or data warehouses, and transformed into visual monitoring indicators through data analysis and processing, and displayed through dashboards, charts, reports, etc.

At the same time, an alarm system can evaluate the monitoring data in real time and raise alarms, and automated operation and maintenance can manage and optimize the IT system automatically.
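
As a rough illustration of the collect, store, analyze, and alarm flow described above, the following Python sketch uses an in-memory list in place of a real storage system. The names `MetricPoint`, `MetricStore`, and `check_threshold` are hypothetical and are not taken from any particular monitoring product.

```python
from dataclasses import dataclass, field
from statistics import mean


@dataclass
class MetricPoint:
    """One collected sample: metric name, timestamp in seconds, and value."""
    name: str
    timestamp: float
    value: float


@dataclass
class MetricStore:
    """In-memory stand-in for a time-series database (an assumption for illustration)."""
    points: list = field(default_factory=list)

    def write(self, point: MetricPoint) -> None:
        self.points.append(point)

    def query(self, name: str, since: float) -> list:
        return [p.value for p in self.points
                if p.name == name and p.timestamp >= since]


def check_threshold(store: MetricStore, name: str, since: float, limit: float):
    """Analyze step: average the recent values; return an alarm message if over the limit."""
    values = store.query(name, since)
    if not values:
        return None
    avg = mean(values)
    if avg > limit:
        return f"ALERT {name}: average {avg:.1f} exceeds {limit:.1f}"
    return None


# Collect step: a real collector would read these values from an agent.
store = MetricStore()
for t, v in [(0, 40.0), (60, 85.0), (120, 95.0)]:
    store.write(MetricPoint("cpu_percent", t, v))

alert = check_threshold(store, "cpu_percent", since=60, limit=80.0)
print(alert)
```

In a real deployment each step would be a separate component (collector, database, rule engine, dashboard); the sketch only shows how the data moves between them.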

2. Monitoring levels

Generally speaking, wherever there are IT systems there must be monitoring, and IT systems are distributed differently from one enterprise to another. Some enterprises have a large number of edge systems, such as computers and industrial computers; some have their own IDC computer room and build their IT systems there; some build their IT systems on the public cloud; and some have established a hybrid cloud architecture that includes both IDC computer rooms and public clouds.

The monitoring system follows the IT system it serves. Edge systems have IoT-style monitoring systems; the IDC computer room has a monitoring system for network equipment (generally provided by the network provider); systems on the public cloud get a complete monitoring system from the cloud provider; and with a hybrid cloud architecture, the team building the monitoring system needs to integrate the on-cloud and off-cloud monitoring systems to provide unified monitoring.

The classification above is made from the perspective of the system, and what it describes is system monitoring. This article instead discusses how to divide monitoring into layers from the perspective of application operation and maintenance.

2.1 API monitoring

API (Application Programming Interface) monitoring, also known as front-end monitoring, refers to real-time monitoring and management of API usage, performance, and security. It usually includes:

A. Usage monitoring: monitor API call status, call frequency, error rate, and so on, to understand how APIs are used and how much traffic they carry.

B. Performance monitoring: monitor API performance indicators such as response time, latency, and throughput, to discover performance problems and bottlenecks in time.

C. Security monitoring: monitor API security, including authentication, authorization, and access control, to protect APIs from security threats.

D. Error monitoring: monitor API errors, including error type, error code, and error frequency, to discover and resolve API errors in a timely manner.
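
The usage, performance, security, and error items above can be sketched as a single aggregation over call records. The example below is a minimal sketch in Python; the `ApiCall` record, the `summarize` function, and the choice of HTTP status codes (5xx as errors, 401/403 as denied access) are assumptions made for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ApiCall:
    endpoint: str
    status: int         # HTTP status code
    latency_ms: float


def summarize(calls):
    """Group calls by endpoint and compute usage, error, latency, and denial figures."""
    grouped = defaultdict(list)
    for call in calls:
        grouped[call.endpoint].append(call)

    summary = {}
    for endpoint, items in grouped.items():
        errors = sum(1 for c in items if c.status >= 500)
        summary[endpoint] = {
            "calls": len(items),                                   # usage
            "error_rate": errors / len(items),                     # errors
            "avg_latency_ms": sum(c.latency_ms for c in items) / len(items),  # performance
            "denied": sum(1 for c in items if c.status in (401, 403)),        # security
        }
    return summary


calls = [
    ApiCall("/orders", 200, 120.0),
    ApiCall("/orders", 500, 300.0),
    ApiCall("/orders", 403, 30.0),
    ApiCall("/login", 200, 50.0),
]
summary = summarize(calls)
print(summary["/orders"])
```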

2.2 Application Layer Monitoring

Application layer monitoring refers to the process of real-time monitoring and management of application performance, availability, security, etc. Usually includes:

A. Application performance monitoring: monitor the application's performance indicators, including the four golden signals (request response time, throughput, error rate, and saturation), to discover application performance problems and bottlenecks in time.

B. Availability monitoring: monitor the availability of the application, including the running status of the application, the number of visits, the error rate, etc., to ensure the normal operation and availability of the application.

C. Security monitoring: monitor the security of the application, including application firewall, intrusion detection, security events, etc., to protect the application from security threats. Generally, this is the responsibility of the security team, and the operation and maintenance personnel are less involved.

D. Log management: collect, analyze, and visualize application log information to help users quickly discover and resolve application problems and exceptions.
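
The four indicators named in item A (response time, throughput, error rate, and saturation) can be computed from a window of request records. The following is a minimal sketch; the `Request` record and the definition of saturation as busy workers divided by total workers are assumptions, and real systems may measure saturation differently (for example by queue depth or CPU usage).

```python
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    ok: bool


def golden_signals(requests, window_seconds, busy_workers, total_workers):
    """Return response time, throughput, error rate, and saturation for one window."""
    saturation = busy_workers / total_workers
    count = len(requests)
    if count == 0:
        return {"latency_ms": 0.0, "traffic_rps": 0.0,
                "error_rate": 0.0, "saturation": saturation}
    return {
        "latency_ms": sum(r.duration_ms for r in requests) / count,
        "traffic_rps": count / window_seconds,
        "error_rate": sum(1 for r in requests if not r.ok) / count,
        "saturation": saturation,
    }


window = [Request(100.0, True), Request(200.0, True),
          Request(300.0, False), Request(400.0, True)]
signals = golden_signals(window, window_seconds=2, busy_workers=6, total_workers=8)
print(signals)
```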

Application layer monitoring relies on corresponding tools and platforms:

A. Application performance monitoring tools: By monitoring application performance indicators, it helps users quickly discover application performance problems and bottlenecks.

B. Availability monitoring tools: monitor the running status and access counts of the application to ensure its normal operation and availability.

C. Security monitoring tools: similar to API security monitoring, these mainly consist of vulnerability scanners, intrusion detection systems, and related tools. For example, if newly released application code uses a third-party tool that contains a backdoor vulnerability, these tools will detect it.

D. Log management tool: By collecting, analyzing and visualizing application log information, it helps users quickly discover and solve application problems and abnormal situations.
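
To make the log management item concrete, the sketch below parses a few log lines, counts them by level, and collects the error messages. The line format, the `LOG_PATTERN` expression, and the sample lines are all assumptions for illustration; real applications use many different log formats.

```python
import re
from collections import Counter

# Assumed line format: "<date> <time> <LEVEL> <message>"
LOG_PATTERN = re.compile(r"^(\S+ \S+) (DEBUG|INFO|WARN|ERROR) (.*)$")


def parse_logs(lines):
    """Return (count per level, list of ERROR messages); non-matching lines are skipped."""
    levels = Counter()
    errors = []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue
        _, level, message = match.groups()
        levels[level] += 1
        if level == "ERROR":
            errors.append(message)
    return levels, errors


sample = [
    "2023-07-30 10:00:01 INFO order service started",
    "2023-07-30 10:00:05 ERROR database connection timed out",
    "2023-07-30 10:00:06 WARN retrying connection",
    "2023-07-30 10:00:09 ERROR database connection timed out",
    "not a log line",
]
levels, errors = parse_logs(sample)
top_error = Counter(errors).most_common(1)
print(levels["ERROR"], top_error)
```

Counting repeated error messages in this way is often the first step toward the visualization the list describes.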

2.3 Resource layer monitoring

Resource layer monitoring refers to real-time monitoring and management of the resources of a computer system (such as CPU, memory, disk, and network). It covers not only servers but also containers and their resource scheduling, so it also includes monitoring the number of containers and their status.
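
A minimal sketch of these two checks, resource usage against limits and container count against the desired number, might look as follows. The `HostSample` record, the threshold values in `LIMITS`, and the container state strings are hypothetical; real limits depend on the workload.

```python
from dataclasses import dataclass


@dataclass
class HostSample:
    host: str
    cpu_percent: float
    memory_percent: float
    disk_percent: float


# Hypothetical thresholds chosen only for this example.
LIMITS = {"cpu_percent": 90.0, "memory_percent": 85.0, "disk_percent": 80.0}


def over_limit(sample):
    """Return the names of the resources whose usage exceeds its limit."""
    return [name for name, limit in LIMITS.items() if getattr(sample, name) > limit]


def container_gap(desired, states):
    """Return how many containers are missing compared with the desired count."""
    running = sum(1 for state in states if state == "running")
    return desired - running


sample = HostSample("web-1", cpu_percent=95.0, memory_percent=60.0, disk_percent=82.0)
print(over_limit(sample))
print(container_gap(3, ["running", "exited", "running"]))
```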

2.4 Link Layer Monitoring

Link layer monitoring, often called call-chain monitoring or distributed tracing, refers to real-time monitoring and management of the interactions between components and modules in a distributed system. It helps users quickly locate the component responsible for a problem or bottleneck, improving application reliability and performance.
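
One way to picture this is as a set of timed spans that share a trace ID, where each span records which span called it. The sketch below finds the service that spent the most time in its own work. The `Span` record and the helper names are hypothetical, and the calculation assumes that the children of a span run one after another rather than in parallel.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Span:
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    service: str
    start_ms: float
    end_ms: float

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms


def self_time(span, spans):
    """Time spent in the span itself, excluding its direct children."""
    children = [s for s in spans if s.parent_id == span.span_id]
    return span.duration_ms - sum(c.duration_ms for c in children)


def bottleneck(spans):
    """Return the service whose span has the largest self time in the trace."""
    return max(spans, key=lambda s: self_time(s, spans)).service


trace = [
    Span("t1", "a", None, "gateway", 0.0, 500.0),
    Span("t1", "b", "a", "order-service", 10.0, 480.0),
    Span("t1", "c", "b", "database", 50.0, 450.0),
]
print(bottleneck(trace))
```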

2.5 Backend Monitoring

Backend monitoring refers to the process of real-time monitoring and management of application backends (such as databases, caches, message queues, etc.). Database monitoring is an important part of back-end monitoring, mainly to monitor and manage the performance, availability and security of the database to ensure the normal operation and stability of the application.

Back-end monitoring also includes performance monitoring, availability monitoring, security monitoring, and log monitoring, which are similar to application layer monitoring.

Now that the public cloud is popular, more and more enterprises are migrating their backends (databases, Redis, etc.) to the public cloud. The public cloud provides these indicators, and what we have to do is import them from the public cloud for local display.

2.6 Business Monitoring

Business monitoring refers to the process of real-time monitoring and management of the business functions of the application, focusing on the business process and business indicators of the application to ensure the normal operation of the business functions of the application and the realization of business value.

2.7 Monitoring of operation and maintenance capabilities

SLA (Service Level Agreement), SLO (Service Level Objective), and SLI (Service Level Indicator) are important measures of operation and maintenance capability. An SLA is an agreement that defines the quality of service promised to customers, while SLOs and SLIs measure whether the reliability of the operated system is up to standard: the SLI is the measured value, and the SLO is the target it is compared against.
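
A small worked example may help relate the three terms. The sketch below computes an availability SLI from request counts and compares it with an SLO target by way of an error budget, a concept commonly used alongside SLOs but not mentioned in this article. The function names and the sample numbers are assumptions.

```python
def sli_availability(good_requests, total_requests):
    """SLI: fraction of requests that were served successfully."""
    if total_requests == 0:
        return 1.0
    return good_requests / total_requests


def error_budget_remaining(sli, slo):
    """Fraction of the error budget still unused; negative once the SLO is breached."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    allowed_failure = 1.0 - slo
    actual_failure = 1.0 - sli
    return (allowed_failure - actual_failure) / allowed_failure


sli = sli_availability(good_requests=99_950, total_requests=100_000)
remaining = error_budget_remaining(sli, slo=0.999)
print(f"SLI={sli:.4f}, error budget remaining={remaining:.0%}")
```

With an SLO of 99.9% and a measured SLI of 99.95%, about half of the permitted failures have been used. An SLA would typically promise customers a looser figure than the internal SLO.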

3. Monitoring dashboard and synthetic monitoring indicators

Seeing the state of all these layers at a glance depends on a monitoring dashboard.

LinkSLA Monitoring Platform

There are too many monitoring indicators to watch individually, so they have to be combined, which is how synthetic monitoring indicators came about.

Synthetic monitoring indicators are comprehensive indicators obtained by combining and calculating multiple individual monitoring indicators. They are used to judge the overall health and performance of an application.

The calculation method of the synthetic monitoring index can be determined according to the specific situation, and the common calculation methods include the following:

A. Average: Calculate the average of multiple indicators, such as the average of request response time, average of server load, etc.

B. Weighted average: Calculate the weighted average of multiple indicators, and assign weights to different indicators according to their importance, such as the weighted average of request response time and server load.

C. Percentile: Calculate the percentile of multiple indicators to reflect the distribution and extreme value of the indicator, such as the 95th percentile of request response time.

D. Comprehensive index: By weighting and summing multiple indicators, a comprehensive index is generated to measure the overall performance and health of the application, such as application health score, performance index, etc.

Synthetic monitoring metrics can help users gain a more comprehensive understanding of application performance and health, and identify and resolve issues more quickly.
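
The four calculation methods listed above can be sketched in a few lines of Python. The function names, the nearest-rank definition of the percentile, and the sample weights are assumptions; the composite score also assumes that each indicator has already been normalized to a 0 to 100 scale, since raw values such as milliseconds and percentages cannot be added directly.

```python
import math
from statistics import mean


def weighted_average(values, weights):
    """Weighted average: each value counts in proportion to its weight."""
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)


def percentile(values, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def health_score(scores, weights):
    """Comprehensive index: weighted sum of per-indicator scores on a 0-100 scale."""
    return sum(scores[name] * weights[name] for name in weights)


response_times_ms = [120, 80, 100, 300, 100]
avg_ms = mean(response_times_ms)
p95_ms = percentile(response_times_ms, 95)
combined = weighted_average([90, 60], [3, 1])
score = health_score(
    {"latency": 90, "errors": 100, "saturation": 60},
    {"latency": 0.5, "errors": 0.3, "saturation": 0.2},
)
print(avg_ms, p95_ms, combined, round(score, 1))
```

Note how the single slow request barely moves the average but determines the 95th percentile, which is why percentiles are usually preferred for response time.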

4. Intelligent monitoring and alarming

Intelligent monitoring and alarming refers to the process of intelligently monitoring and managing applications using artificial intelligence and machine learning technologies. Intelligent monitoring can help users find and solve problems more quickly and accurately, and improve the stability and reliability of applications.

A. Automatic anomaly identification: instead of relying solely on thresholds, use machine learning and statistical analysis to analyze and model the application's monitoring indicators, automatically identify anomalies, and generate alarms or trigger preset response actions.

B. Automatic configuration adjustment: use machine learning and optimization algorithms to adjust application configuration parameters automatically, optimizing application performance and stability.

C. Predictive analysis: use machine learning and time series analysis to analyze and model the application's historical data, predict future trends and possible failures, and take preventive measures before problems occur.

D. Reducing duplicate alarms (alarm merging): use machine learning to evaluate alarms that have already been generated, lower the level of unimportant alarms or merge similar ones, and trigger fewer notifications or none at all.

E. Reducing alarm jitter (alarm convergence): whether the monitoring system should alarm when monitoring data jitters abnormally has always been a problem. Use machine learning to analyze multi-dimensional monitoring data and clustering algorithms to determine how events are related, reducing the chance of repeated or false alarms.

F. Alarm scenario generation: to produce alarms that are close to the user's business scenarios, perform extreme value analysis and noise reduction on monitoring data, and associate business-related alarms with CI items to form a scenario.
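
As an illustration of item A, anomaly identification without a fixed threshold, the sketch below uses a rolling z-score: each value is compared with the mean and standard deviation of the points just before it. This is a simple statistical baseline rather than a machine learning model, and the function name, window size, and threshold are assumptions chosen for the example.

```python
from statistics import mean, stdev


def detect_anomalies(values, window=5, threshold=3.0):
    """Return the indexes whose value deviates from the mean of the previous
    `window` points by more than `threshold` standard deviations."""
    anomalies = []
    for i in range(window, len(values)):
        history = values[i - window:i]
        mu = mean(history)
        sigma = stdev(history)
        if sigma == 0:
            # Flat history: any change at all counts as a deviation.
            if values[i] != mu:
                anomalies.append(i)
            continue
        if abs(values[i] - mu) / sigma > threshold:
            anomalies.append(i)
    return anomalies


latency_ms = [100, 102, 98, 101, 99, 100, 250, 101]
anomalies = detect_anomalies(latency_ms)
print(anomalies)
```

Because the baseline moves with the data, the same code works for an indicator that normally sits at 100 and for one that normally sits at 10,000, which a fixed threshold cannot do. One limitation is visible in the sample: the spike itself enters the history window and inflates the baseline for the next few points.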

Intelligent monitoring and alarming does not replace the original monitoring and alarming system; it extends it, giving enterprises a better monitoring and alarming experience.
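
Because the intelligent features build on the existing system, alarm merging and jitter suppression can start from plain rules before any model is involved. The sketch below merges alarms that share a host and rule within a time window, and only alarms after several consecutive breaches. The function names, the tuple layout, and the default window and count are assumptions for illustration.

```python
def merge_alarms(alarms, window_seconds=300):
    """Merge alarms with the same (host, rule) key that arrive within
    `window_seconds` of the first alarm in their group.

    `alarms` is a list of (timestamp, host, rule) tuples sorted by timestamp.
    """
    merged = []
    open_groups = {}  # key -> index of the current group in `merged`
    for timestamp, host, rule in alarms:
        key = (host, rule)
        index = open_groups.get(key)
        if index is not None and timestamp - merged[index]["first_seen"] <= window_seconds:
            merged[index]["count"] += 1
            merged[index]["last_seen"] = timestamp
        else:
            open_groups[key] = len(merged)
            merged.append({"host": host, "rule": rule, "first_seen": timestamp,
                           "last_seen": timestamp, "count": 1})
    return merged


def should_alert(breaches, required=3):
    """Suppress jitter: alarm only if the last `required` checks all breached."""
    return len(breaches) >= required and all(breaches[-required:])


raw = [
    (0, "db-1", "cpu_high"),
    (60, "db-1", "cpu_high"),
    (90, "web-1", "disk_full"),
    (400, "db-1", "cpu_high"),
]
merged = merge_alarms(raw)
print(len(raw), "raw alarms ->", len(merged), "merged alarms")
```

A learned model can later replace the fixed key and window, for example by clustering alarms that tend to occur together, without changing how the merged alarms are delivered.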

LinkSLA Intelligent Operation and Maintenance Manager

Origin blog.csdn.net/LinkSLA/article/details/132018837