Mafengwo Big Transportation Business Monitoring and Alarm System: Architecture Design and Implementation

As our lines of business multiply, any application running online can go wrong for many reasons: at the operational level, orders drop compared with last week, or traffic suddenly plummets; at the technical level, the system throws ERRORs, or interface responses slow down. For the big transportation business, one obvious characteristic is its dependence on many supplier services, so we also need to watch whether calls to supplier interfaces behave abnormally, and so on.

To ensure that problems in every line of business under heavy traffic are discovered through alarms and resolved as quickly as possible, and thus to improve the service quality of our business systems, we decided to build a unified monitoring and alarm system. On the one hand, the system should detect anomalies the moment they occur so they can be fixed promptly; on the other hand, it should surface potential problems as early as possible: issues such as operations that are already slow, which do not yet affect business logic but, if not handled promptly, are likely to hamper future business growth.

This article describes the positioning and overall architecture design of Mafengwo's big transportation monitoring and alarm system, along with some of the pitfalls we encountered while putting it into practice.

 

Architecture Design and Implementation

We want the monitoring and alarm system to provide three main capabilities:

1. Automatic alarms for common components: create default alarm rules for the framework components business systems commonly use (RPC, HTTP, etc.), enabling unified monitoring at the framework level.

2. Custom business alarms: business developers add their own tracking points ("buried points") with custom fields to record the running state of each specific business module.

3. Fast problem location: discovering a problem is not the goal; solving it is. We want developers, once an alarm message arrives, to see at a glance where the problem lies so they can fix it quickly.

With these goals in mind, the overall architecture and key flows of the alarm center are shown in the figure below:

[Figure: overall architecture and key flows of the alarm center]

Viewed vertically, the alarm center, fed by Kafka, is on the left, and the business systems are on the right.

The alarm center architecture is divided into three layers: the top layer is the web admin page, used mainly to maintain alarm rules and query alarm logs; the middle layer is the core of the alarm center; the bottom layer is the data layer. Business systems connect to the alarm center by integrating a mes-client-starter jar package.

The work of the alarm center can be divided into five modules:

[Figure: the five modules of the alarm center]

1. Data collection

We discover system problems through metric reporting: recording and uploading the runtime metrics we care about. Supported upload modes include logs, UDP, and so on.

For the data collection module we did not reinvent the wheel but built directly on MES (Mafengwo's internal big data analysis platform), mainly for several reasons: first, the data sources for analysis and for alarming are similar; second, it saved a great deal of development cost; and third, systems already reporting to MES can hook into alarms easily.

Which specific metrics should be captured? Take a user placing a single order in the big transportation business as an example: the whole call chain may include an HTTP request, Dubbo calls, and SQL operations, and may also include steps such as validation, conversion, and assignment. Going down the call chain involves a great many classes and methods; instrumenting every class and every method call is impossible, time-consuming, and pointless.

To discover as many problems as possible at minimal cost, we chose to instrument common framework components automatically, such as HTTP, SQL, and the Dubbo RPC framework we use, implementing unified monitoring at the framework level.

For business metrics, the focus differs from business to business. Developers can add tracking points manually through the API the system provides, defining the metrics that their own business modules need to watch, such as the number of successful payments or the number of orders.
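As an illustration, a manually added business tracking point might look like the minimal sketch below. The real mes-client-starter API is not shown in this article, so the client interface and method names here are hypothetical:

import java.time.LocalDateTime;
import java.util.HashMap;
import java.util.Map;

public class PaymentService {

    // Hypothetical stand-in for the real mes-client-starter entry point.
    interface MesClient {
        void report(Map<String, Object> fields);
    }

    private final MesClient mesClient = fields -> { /* upload via log/UDP */ };

    public void onPaySuccess(String orderId) {
        Map<String, Object> point = new HashMap<>();
        point.put("app_name", "order-service");          // which system reported it
        point.put("module", "payment");                  // business module
        point.put("pay_success", true);                  // custom business field
        point.put("order_id", orderId);
        point.put("datetime", LocalDateTime.now().toString());
        mesClient.report(point);                         // ends up in Kafka, then the alarm center
    }
}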

2. Data storage

For the dynamically structured metric data we collect, we chose Elasticsearch as the store, mainly for two reasons:

First, dynamic field storage. The metrics each business system cares about differ, and each middleware's focus differs, so which fields a tracking point carries, and their types, are unpredictable. This calls for a store that can add fields dynamically. Elasticsearch needs no predefined field types; new fields are added automatically when tracking-point data is inserted.

Second, it can withstand massive data volumes. Every user request generates multiple tracking points as it passes through each monitored component, so the data volume is very large. Elasticsearch supports storage at this scale and scales out well.

In addition, Elasticsearch supports aggregation calculations, making it quick to run count, sum, avg, and similar computations.
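To make the dynamic-field point concrete, here is a small sketch (field names are illustrative) of two tracking points that share no schema yet can land in the same Elasticsearch index, because dynamic mapping adds unseen fields on first insert:

import java.util.Map;

public class DynamicFieldsDemo {
    public static void main(String[] args) {
        // A framework-level RPC point and a business point carry different fields.
        Map<String, Object> rpcPoint = Map.of(
                "app_name", "order-service",
                "rpc_method", "getOrderById",
                "cost_ms", 230);
        Map<String, Object> bizPoint = Map.of(
                "app_name", "pay-service",
                "pay_success_count", 1,
                "supplier_code", "S001");
        // Neither cost_ms nor supplier_code needs to be declared in a mapping
        // beforehand; Elasticsearch adds the fields when the documents arrive.
        System.out.println(rpcPoint);
        System.out.println(bizPoint);
    }
}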

3. Alarm rules

With tracking-point data in hand, the next step is to define alarm rules: quantifiable checks on the data we care about, verifying whether a metric exceeds a preset threshold. This is the most complex, and most central, part of the whole alarm center.

In the overall architecture diagram shown earlier, the core component is the "rule execution engine", which drives the system by running scheduled tasks. The engine first queries all active rules, then issues filter and aggregation queries to Elasticsearch according to each rule's description, and finally compares the aggregated result with the rule's preset threshold, sending an alarm message if the condition is met.

This process involves several key technical points:

1) Scheduled tasks

To ensure availability and avoid a single point of failure bringing down the whole monitoring and alarm system, we run the configured alarm rules on a one-minute cycle, once every minute. We use Elastic Job for distributed task scheduling, which makes it easy to start and stop tasks.
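A minimal sketch of the per-minute rule job, assuming Elastic-Job Lite 2.x's SimpleJob interface (the class name and the commented-out service call are illustrative):

import com.dangdang.ddframe.job.api.ShardingContext;
import com.dangdang.ddframe.job.api.simple.SimpleJob;

// Registered with a cron such as "0 * * * * ?" (once a minute). Elastic-Job
// shards the work across nodes, so one node failing does not stop all rules.
public class AlarmRuleJob implements SimpleJob {

    @Override
    public void execute(ShardingContext context) {
        // Load the active rules assigned to this shard, run each rule's
        // filter / aggregate / compare pipeline against Elasticsearch,
        // and emit alarm messages for rules whose thresholds are crossed.
        // ruleEngine.runShard(context.getShardingItem()); // hypothetical
    }
}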

2) "Three-stage" alarm rules

We define alarm rules in three stages: "filter, aggregate, compare". As an example, assume the following ERROR tracking-point logs (service A is the one we care about):

app_name=B   is_error=false  warn_msg=aa   datetime=2019-04-01 11:12:00
app_name=A   is_error=false                datetime=2019-04-02 12:12:00
app_name=A   is_error=true   error_msg=bb  datetime=2019-04-02 15:12:00
app_name=A   is_error=true   error_msg=bb  datetime=2019-04-02 16:12:09

The alarm rule is defined as follows:

  • Filter: narrow the data down to the set we care about through a number of conditions. For the problem above, the filter conditions could be: app_name = A, is_error = true, datetime between '2019-04-02 16:12:00' and '2019-04-02 16:13:00'.

  • Aggregate: run a predefined aggregation (count, avg, sum, max, etc.) over the filtered data set to produce a single value. For the problem above, we choose count to compute how many times ERROR appears.

  • Compare: compare the result of the previous step with the preset threshold.
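To make the three stages concrete, the sketch below runs them in plain Java over the four example rows above, as an in-memory stand-in for the real Elasticsearch query (not the alarm center's actual code):

import java.util.List;
import java.util.Map;

public class ThreeStageRuleDemo {
    public static void main(String[] args) {
        // The four tracking-point rows from the example above (msg fields omitted).
        List<Map<String, Object>> points = List.of(
                Map.of("app_name", "B", "is_error", false, "datetime", "2019-04-01 11:12:00"),
                Map.of("app_name", "A", "is_error", false, "datetime", "2019-04-02 12:12:00"),
                Map.of("app_name", "A", "is_error", true, "datetime", "2019-04-02 15:12:00"),
                Map.of("app_name", "A", "is_error", true, "datetime", "2019-04-02 16:12:09"));

        // Stage 1, filter: app_name = A, is_error = true, within the one-minute window.
        // Stage 2, aggregate: count the rows that survive the filter.
        long errorCount = points.stream()
                .filter(p -> "A".equals(p.get("app_name")))
                .filter(p -> Boolean.TRUE.equals(p.get("is_error")))
                .filter(p -> {
                    String t = (String) p.get("datetime");
                    return t.compareTo("2019-04-02 16:12:00") >= 0
                        && t.compareTo("2019-04-02 16:13:00") < 0;
                })
                .count();

        // Stage 3, compare: alarm if the aggregated value exceeds the threshold.
        long threshold = 0;
        if (errorCount > threshold) {
            System.out.println("ALARM: " + errorCount + " ERROR(s) in the window");
        }
    }
}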

What about more complex alarm conditions, such as the failure rates and traffic fluctuations mentioned at the beginning? How do we implement those?

Suppose the requirement is: if the failure rate of calls to service A exceeds 80% and the total number of requests exceeds 100, send an alarm notification.

The failure rate is simply the number of failures divided by the total count, and both the failure count and the total count can be obtained via the "filter + aggregate" steps described above, so this requirement can be described by the following expression:

failedCount / totalCount > 0.8 && totalCount > 100

We then evaluate this expression with the fast-el expression engine and compare the result against the configured threshold.
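Evaluated with an expression engine, the check might look like the sketch below; we show it with the Fel ("fast expression language") API, which we believe fast-el refers to, so treat the exact class names as an assumption:

import com.greenpineyu.fel.FelEngine;
import com.greenpineyu.fel.FelEngineImpl;
import com.greenpineyu.fel.context.FelContext;

public class ComplexRuleDemo {
    public static void main(String[] args) {
        FelEngine fel = new FelEngineImpl();
        FelContext ctx = fel.getContext();
        // Both counts come from two separate "filter + aggregate" queries.
        ctx.set("failedCount", 90.0);
        ctx.set("totalCount", 110.0);
        Object hit = fel.eval("failedCount / totalCount > 0.8 && totalCount > 100");
        System.out.println(hit); // true, so the alarm fires
    }
}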

3) Automatic creation of default alarm rules

Developers can maintain alarm rules through the admin page; rules are stored in a MySQL database and cached in Redis. For commonly used components such as Dubbo and HTTP, however, many classes and methods are involved, so we create default rules automatically.

Take Dubbo as an example: we first obtain all providers and consumers through Dubbo's ApplicationModel, then combine the class and method information with rule templates (a rule template can be understood as a concrete rule with the class and method information stripped out), creating one rule per method of each class.

For example: if the average response time of the Dubbo interface /order/getOrderById that service A exposes exceeds one second within a minute, alarm; if the number of failures of the Dubbo interface /train/grabTicket that service B calls exceeds 10 within a minute, alarm; and so on.
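A sketch of the bootstrap that stamps default rules out of templates; the ApplicationModel calls follow the Dubbo 2.6-era model API as we recall it, so the import paths and method names are an assumption:

import com.alibaba.dubbo.config.model.ApplicationModel;
import com.alibaba.dubbo.config.model.ProviderMethodModel;
import com.alibaba.dubbo.config.model.ProviderModel;

public class DefaultRuleBootstrap {

    public void createDefaultRules() {
        // One rule per exposed method, stamped out of a rule template
        // (a template is a rule minus the concrete class/method names).
        for (ProviderModel provider : ApplicationModel.allProviderModels()) {
            for (ProviderMethodModel method : provider.getAllMethods()) {
                String target = provider.getServiceName() + "#" + method.getMethodName();
                // ruleService.createFromTemplate("dubbo-provider-rt-1s", target); // hypothetical
            }
        }
    }
}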

4. Alarm behavior

Currently, when a rule is triggered, alarms are delivered mainly in two ways:

  • Email alarms: each alarm type is routed to different responsible developers, so the relevant people learn of system anomalies immediately.

  • WeChat alarms: a supplement to the email alarms.

We will continue to improve the alarm strategy, for example using different alarm channels for problems of different severity, so that developers can identify alarmed problems promptly without diverting too much energy from developing new features.

5. Auxiliary problem location

To help developers locate problems quickly, we designed a "hit sampling" feature:

First, we extract the tracer_id from the data that hit the rule and provide a link that jumps directly to Kibana to view the logs, restoring the full call link.

Second, developers can configure additional fields they care about; the values of those fields are extracted as well, so where the problem lies can be seen at a glance.

Technically, we define a hit-sampling field in which users can enter one or more ${} placeholders. For example, if we care about the interface behavior of a particular supplier, the hit-sampling field might be configured as in the upper half of the figure below. When an alarm message needs to be sent, the placeholders are extracted, the corresponding values are queried back from ES, and FreeMarker performs the substitution; the final message sent to developers looks like the lower half of the figure, telling them immediately where the problem is.

[Figure: hit-sampling field configuration (top) and the resulting alarm message (bottom)]
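The substitution step itself is plain FreeMarker; a minimal sketch (template text and field names are illustrative):

import freemarker.template.Configuration;
import freemarker.template.Template;

import java.io.StringReader;
import java.io.StringWriter;
import java.util.Map;

public class HitSampleRender {
    public static void main(String[] args) throws Exception {
        Configuration cfg = new Configuration(Configuration.VERSION_2_3_28);
        String text = "Alarm hit: supplier=${supplier_code}, trace=${tracer_id}";
        Template template = new Template("alarm-msg", new StringReader(text), cfg);

        // Values extracted from the hit documents queried back from ES.
        Map<String, Object> model = Map.of("supplier_code", "S001",
                                           "tracer_id", "f3a9c2");
        StringWriter out = new StringWriter();
        template.process(model, out);
        System.out.println(out); // Alarm hit: supplier=S001, trace=f3a9c2
    }
}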

Pitfalls and Evolution

Building the big transportation monitoring and alarm system was a zero-to-one process, and along the way we hit quite a few problems, such as memory filling up instantly, Elasticsearch queries getting slower, and frequent Full GCs. Below we describe how we addressed each of these.

Pitfalls

1. Memory filling up instantly

Any system has a limit to what it can withstand, so it needs a dam that can hold back the flood.

 

For the alarm center, the biggest external bottleneck is ingesting the MES tracking-point logs handed over by Kafka. Shortly after launch, a large number of anomalies in a business system once caused a flood of tracking-point logs to hit the alarm center at the same instant, filling up its memory.

The solution is to evaluate each node's maximum capacity and protect the system accordingly. We adopted rate limiting: since Kafka consumers use a pull model, we only need to control the pull rate, for example with Guava's RateLimiter:

import com.google.common.util.concurrent.RateLimiter;

// Create the limiter once, outside the handler; creating a new RateLimiter
// per message would reset it every time and never actually throttle.
RateLimiter messageRateLimiter = RateLimiter.create(20000); // permits per second

messageHandler = (message) -> {
    double waitedSeconds = messageRateLimiter.acquire(); // blocks until a permit is free
    // save ...
};

2. Slow Elasticsearch queries

Because the MES log volume is large and splits into hot and cold data, to preserve performance while keeping data migration easy, we create Elasticsearch indices at the granularity of application + month, as follows:

[Figure: monthly per-application index layout]
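Building such an index name is trivial; a sketch assuming a "mes-{app}-{yyyyMM}" naming pattern (the real pattern is not spelled out in this article):

import java.time.YearMonth;
import java.time.format.DateTimeFormatter;

public class IndexNameDemo {
    public static void main(String[] args) {
        // One index per application per month keeps the hot index small and
        // lets cold months be migrated or dropped wholesale.
        String app = "order-service";
        String index = "mes-" + app + "-"
                + YearMonth.now().format(DateTimeFormatter.ofPattern("yyyyMM"));
        System.out.println(index); // e.g. mes-order-service-201905
    }
}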

3. Frequent Full GC

We use Logback as our logging framework. To collect ERROR and WARN logs, we implemented a custom Appender. To also collect logs emitted before the Spring container starts (when TalarmLogbackAppender has not yet been initialized), we used DelegatingLogbackAppender from a Logback extension jar, which provides a caching mechanism, and the memory leak was hiding exactly in this cache.

Normally, once the system is up, the Spring context held in ApplicationContextHolder is non-null, and the cached logs are automatically drained from the cache. But if, for whatever reason, the ApplicationContextHolder class is never initialized, logs pile up in the cache indefinitely, eventually causing frequent Full GCs.

Solutions:

1. Make sure ApplicationContextHolder gets initialized.

2. DelegatingLogbackAppender has three modes: OFF, SOFT, and ON. If it must be enabled, prefer SOFT mode: the cache is then held in a list wrapped in SoftReferences, which the garbage collector can reclaim when the system runs short of memory.
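Conceptually, the difference between ON and SOFT caching is the following (a sketch, not the appender's actual code):

import java.lang.ref.SoftReference;
import java.util.ArrayList;
import java.util.List;

public class SoftCacheSketch {

    // ON mode keeps strong references: if the delegate appender never shows up,
    // the list can only grow, eventually driving the heap into constant Full GC.
    private final List<Object> strongCache = new ArrayList<>();

    // SOFT mode wraps entries in SoftReferences: under memory pressure the GC
    // may reclaim them, trading some early log events for a healthy heap.
    private final List<SoftReference<Object>> softCache = new ArrayList<>();

    public void cacheSoft(Object event) {
        softCache.add(new SoftReference<>(event));
    }
}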

 

Near-Term Plans

The system still has some rough spots, which shape our plans for the future:

  • Easier to use: provide more usage hints and help so developers can get familiar with the system quickly.

  • More alarm dimensions: automatic alarms currently cover the HTTP, SQL, and Dubbo components; MQ, Redis, scheduled tasks, and more will follow.

  • Graphical display: present tracking-point data as charts, which shows the system's running state more intuitively and also helps developers choose sensible thresholds.

 

Summary

To sum up, the big transportation business monitoring and alarm system architecture has the following characteristics:

  • Flexible alarm rule configuration with rich filtering logic

  • Automatic alarms for common components: Dubbo and HTTP are wired into alarming automatically

  • Simple onboarding: any system already connected to MES can start using it quickly

Operating and maintaining production systems mainly means doing three things: discovering problems, locating problems, and solving problems. Discovering a problem means notifying the system's owner as soon as an anomaly occurs. Locating and solving a problem means giving developers the information they need to repair the system quickly; the more precise that information, the better.

An alarm system is the first step in, and the entry point of, the chain for solving online problems. Linked organically with data backtracking systems (trace links and the like), deployment and release systems, and others, it can greatly improve the efficiency of solving online problems and better safeguard production.

Whatever we do, the ultimate goal is the same: to improve the quality of service.

 

Author: Song Chunshi, senior R&D engineer on Mafengwo's Big Transportation platform.
