Monitoring and Alarming for a High-Concurrency System with Hundreds of Millions of Users

What is system monitoring?
For software with simple functions and few users, most companies do not need a separate monitoring system to keep the business running normally. As a company grows, its systems become more diverse, each individual system becomes more complex, and the number of users it serves increases. To keep these systems stable and to monitor external-facing business in real time, most Internet companies design and build a monitoring system that fits their own architecture and business scale, such as Alibaba's "Eagle Eye" system.

Ge Patrol: Getui's System Monitoring
As Getui's business keeps expanding and its user base grows, Getui urgently needs a complete monitoring system to keep its systems and business running normally in real time. At the system level, Getui must keep the system stable while hundreds of millions of users are connected at the same time. At the business level, Getui needs real-time data that reflects daily business growth and decline. Ge Patrol was created to meet these needs.

System difficulties and design

Diversified data
Getui's business is built on push, and push has expanded into many independently running systems, each with different monitoring data. To keep the monitoring system stable and scalable, we divide all data sources into two categories: configurable data exposed through JMX, and data that each system packages and reports independently. Based on the characteristics of the two categories, JMX data is actively collected and independently packaged data is passively received.
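
The split between actively collected and passively received data can be sketched as follows. This is a minimal illustration, not Ge Patrol's actual code: `MetricSample`, `JmxPuller`, and `PushReceiver` are hypothetical names, and the local platform MBean server stands in for a remote node. Both paths produce the same envelope, so downstream logic does not need to know where a data point came from.

```java
import java.lang.management.ManagementFactory;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import javax.management.MBeanServer;
import javax.management.ObjectName;

// Hypothetical common envelope: every data point takes this shape
// before it reaches the business logic, whatever its origin.
final class MetricSample {
    final String source;        // "jmx" or "push"
    final String name;
    final double value;
    final long timestampMillis;

    MetricSample(String source, String name, double value, long timestampMillis) {
        this.source = source;
        this.name = name;
        this.value = value;
        this.timestampMillis = timestampMillis;
    }
}

// Actively collected source: the monitor reads JMX attributes itself.
final class JmxPuller {
    // Assumption: the local platform MBean server stands in for a remote node.
    private final MBeanServer server = ManagementFactory.getPlatformMBeanServer();

    // Reads one numeric attribute and wraps it in the common envelope.
    MetricSample pull(String objectName, String attribute) {
        try {
            Object raw = server.getAttribute(new ObjectName(objectName), attribute);
            double value = ((Number) raw).doubleValue();
            return new MetricSample("jmx", objectName + "#" + attribute,
                    value, System.currentTimeMillis());
        } catch (Exception e) {
            throw new IllegalStateException("JMX read failed for " + objectName, e);
        }
    }
}

// Passively received source: other systems report their own data.
final class PushReceiver {
    private final BlockingQueue<MetricSample> inbox = new LinkedBlockingQueue<>();

    // Called on behalf of the reporting system; the monitor only receives.
    void accept(String name, double value) {
        inbox.add(new MetricSample("push", name, value, System.currentTimeMillis()));
    }

    // Returns the next received sample, or null if none is waiting.
    MetricSample poll() {
        return inbox.poll();
    }
}
```

For example, `new JmxPuller().pull("java.lang:type=Threading", "ThreadCount")` reads the JVM's live thread count through the standard Threading MXBean.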

A large number of distributed nodes
To serve a large number of users, Getui has to deploy many nodes in different regions to keep its services real-time. With so many nodes, collecting and receiving data concurrently is the only workable approach, and we also need to encapsulate different kinds of threads and thread pools for the different data sources. Heavy multi-threaded concurrency brings difficulties of its own: designing and allocating shared resources, guaranteeing atomic operations and rolling them back, and keeping the collected data accurate. To deal with this, the code is structured around the Producer-Consumer pattern, together with process- and thread-level design.
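
A minimal Producer-Consumer sketch in Java shows the shape of this design. It rests on several assumptions: `CollectionPipeline`, the pool sizes, the queue capacity, and the string samples are all hypothetical, and real collection would call out to nodes rather than build a string. A bounded queue is the only shared resource, and an atomic counter keeps the processed count accurate under concurrency.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

final class CollectionPipeline {
    // Sentinel that tells a consumer to stop.
    private static final String POISON = "__STOP__";

    // Collects one sample per node and returns how many samples were processed.
    static long run(int nodeCount, int consumerCount) {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(64); // bounded buffer
        AtomicLong processed = new AtomicLong();
        ExecutorService producers = Executors.newFixedThreadPool(4);
        ExecutorService consumers = Executors.newFixedThreadPool(consumerCount);
        try {
            // Consumers: take samples until they see the sentinel.
            for (int c = 0; c < consumerCount; c++) {
                consumers.execute(() -> {
                    try {
                        while (true) {
                            String sample = queue.take();
                            if (sample.equals(POISON)) {
                                return;
                            }
                            processed.incrementAndGet();
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // Producers: one collection task per node.
            CountDownLatch produced = new CountDownLatch(nodeCount);
            for (int n = 0; n < nodeCount; n++) {
                final int node = n;
                producers.execute(() -> {
                    try {
                        queue.put("node-" + node + ":ok"); // blocks while the queue is full
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    } finally {
                        produced.countDown();
                    }
                });
            }
            produced.await();
            // All samples are queued; one sentinel per consumer ends the run.
            for (int c = 0; c < consumerCount; c++) {
                queue.put(POISON);
            }
            consumers.shutdown();
            consumers.awaitTermination(10, TimeUnit.SECONDS);
            return processed.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("pipeline interrupted", e);
        } finally {
            producers.shutdownNow();
            consumers.shutdownNow();
        }
    }
}
```

Because the queue is bounded, slow consumers make `put` block, which applies back-pressure to the collectors instead of letting memory grow without limit.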

Complex business logic
Another function of the monitoring system is to reflect the company's business trends in real time and to raise alarms promptly. Many systems are connected to it, and they send many different kinds of data. Many analysis strategies and alarm strategies have to be written into the program on top of this data, so the business logic is extremely complex, and different strategies have to be loaded dynamically. The Strategy design pattern was the best fit.
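
The Strategy pattern with runtime replacement might look like the sketch below. The names `AlarmStrategy`, `ThresholdStrategy`, and `StrategyRegistry` are hypothetical, and a simple upper-bound check stands in for the analysis and alarm strategies, which the article does not describe in detail.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;

// One interchangeable alarm rule.
interface AlarmStrategy {
    // Returns an alarm message when the value should raise an alarm.
    Optional<String> evaluate(String metric, double value);
}

// Example strategy (an assumption): alarm when the value exceeds a fixed maximum.
final class ThresholdStrategy implements AlarmStrategy {
    private final double max;

    ThresholdStrategy(double max) {
        this.max = max;
    }

    @Override
    public Optional<String> evaluate(String metric, double value) {
        return value > max
                ? Optional.of(metric + " = " + value + " exceeds " + max)
                : Optional.empty();
    }
}

// Maps each metric to its current strategy.
final class StrategyRegistry {
    private final Map<String, AlarmStrategy> byMetric = new ConcurrentHashMap<>();

    // Registering again for the same metric swaps the strategy at runtime.
    void register(String metric, AlarmStrategy strategy) {
        byMetric.put(metric, strategy);
    }

    // Metrics without a registered strategy never raise an alarm.
    Optional<String> check(String metric, double value) {
        AlarmStrategy strategy = byMetric.get(metric);
        return strategy == null ? Optional.empty() : strategy.evaluate(metric, value);
    }
}
```

The calling code depends only on `AlarmStrategy`, so a new rule is added by writing one class and registering it; the dispatch logic does not change.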

Real-time requirements
A major feature of the monitoring system is that it raises alarms on abnormal data promptly, and collects, classifies, analyzes, and displays large amounts of data within seconds. The in-memory database (Couchbase) and the search engine (Elasticsearch) therefore became the key middleware for keeping the system real-time.

At the system level, Ge Patrol integrates a series of external tools, including a database, Couchbase, Elasticsearch, Flume, and Kafka.
At the code level, we tried different design patterns to make the whole system handle different kinds of data better, so that it runs stably and captures and displays data accurately.

Features of Ge Patrol

Abnormal log alarm
When an online system writes an exception log, the log is synchronized to Ge Patrol's Elasticsearch (ES) in real time. As soon as Ge Patrol detects the exception log, it sends an alarm message to the responsible personnel. We therefore learn of system exceptions in real time, which is a precondition for handling online problems promptly.

Periodic comparison
Some monitoring points should follow a fixed trend every day, as shown in the figure below. We update this trend with the data from the previous 7 days. When the online data does not match the trend, we send an alarm message.
[Figure: daily trend of a monitoring point]
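
One way to express a periodic comparison is sketched below. The article does not say how the trend is computed or how a mismatch is judged, so both rules here are assumptions: the expected value for a time slot is the mean of that slot over the last 7 days, and a mismatch is a relative deviation above a configurable tolerance. `PeriodicBaseline` is a hypothetical name.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class PeriodicBaseline {
    private static final int WINDOW_DAYS = 7;

    // One array per day; each array holds one value per time slot.
    private final Deque<double[]> history = new ArrayDeque<>();
    private final double tolerance; // e.g. 0.3 means 30% relative deviation

    PeriodicBaseline(double tolerance) {
        this.tolerance = tolerance;
    }

    // Adds a finished day and evicts the oldest once 7 days are stored.
    void addDay(double[] slots) {
        if (history.size() == WINDOW_DAYS) {
            history.removeFirst();
        }
        history.addLast(slots.clone());
    }

    // Expected value for a slot: mean of that slot over the stored days.
    double expected(int slot) {
        if (history.isEmpty()) {
            throw new IllegalStateException("no history yet");
        }
        double sum = 0;
        for (double[] day : history) {
            sum += day[slot];
        }
        return sum / history.size();
    }

    // True when the observed value is too far from the expected trend.
    boolean deviates(int slot, double observed) {
        double expected = expected(slot);
        double scale = Math.max(Math.abs(expected), 1e-9); // avoid division by zero
        return Math.abs(observed - expected) / scale > tolerance;
    }
}
```

A mean is easy to reason about but sensitive to a single abnormal day; a median or a per-weekday baseline would be reasonable alternatives under the same interface.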

Self-monitoring
Ge Patrol monitors the online systems, but Ge Patrol is itself one of those systems, so how does it monitor itself? We achieve self-monitoring by automatically modifying a threshold. When the threshold is modified, Ge Patrol sends an alarm email; 10 minutes later the threshold is changed back to its original value and we receive a return-to-normal email. The whole process is automatic. So when the self-alarm email fails to arrive, we know that Ge Patrol itself has a problem.
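
The threshold-flip cycle can be illustrated with a small simulation. Everything in it is an assumption made for illustration: `ThresholdAlarm` and `SelfCheck` are hypothetical names, mail is recorded in a list rather than sent, and the 10-minute wait is omitted so that the logic stays visible.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// A threshold alarm that sends mail only when its state changes.
final class ThresholdAlarm {
    private double threshold;
    private boolean firing;
    private final List<String> sentMail = new ArrayList<>();

    ThresholdAlarm(double threshold) {
        this.threshold = threshold;
    }

    double threshold() {
        return threshold;
    }

    void setThreshold(double threshold) {
        this.threshold = threshold;
    }

    // Evaluates one observation against the current threshold.
    void observe(double value) {
        boolean breach = value > threshold;
        if (breach && !firing) {
            sentMail.add("ALERT");
        }
        if (!breach && firing) {
            sentMail.add("RECOVERED");
        }
        firing = breach;
    }

    List<String> sentMail() {
        return sentMail;
    }
}

final class SelfCheck {
    // Runs one self-check cycle and reports whether both mails were produced.
    // Assumes normalValue does not breach the original threshold.
    static boolean runCycle(ThresholdAlarm alarm, double normalValue) {
        double original = alarm.threshold();
        int before = alarm.sentMail().size();

        alarm.setThreshold(normalValue - 1); // a normal value now counts as a breach
        alarm.observe(normalValue);

        // In the article the threshold is restored after 10 minutes.
        alarm.setThreshold(original);
        alarm.observe(normalValue);

        List<String> mail = alarm.sentMail().subList(before, alarm.sentMail().size());
        return mail.equals(Arrays.asList("ALERT", "RECOVERED"));
    }
}
```

If `runCycle` returns false, or if the two mails never reach the inbox, the alarm path itself is broken, which is exactly the condition self-monitoring is meant to expose.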

Development summary
I believe many projects run into the four problems described above. In practice, during intense development it is hard for many teams to step back and summarize problems and lessons from a global perspective. Here we offer just one way of analyzing a huge system. When data sources are diverse, developers must make all data uniform before it enters the business logic, that is, apply a common data encapsulation; this keeps the core modules stable while requirements change. The main problem that a huge number of nodes brings is the data flow, so it is extremely important to add a layer (this system's Producer-Consumer) between the sending and receiving ends to keep the data flow stable and controllable. Complex business logic is the most common problem in software development, and many classic books are devoted to it, but in actual development, especially on a tight schedule, it is hard to find one specific solution that applies everywhere. In developing Ge Patrol, we could only shape the Strategy code framework around the requirements and the business logic. Finally, the growth in data volume is often striking. The principle we adopted in Ge Patrol is to store data separately and to use different databases for different data uses.
