Beyond traditional monitoring: HM, a new approach to business status monitoring | JD Cloud technical team

1. Blind spots of traditional monitoring systems, and how to build business status monitoring

Data monitoring and eventual data consistency are very important parts of system architecture design. Consistency compensation has already been summarized by a senior colleague in the algorithm department, so I will not repeat it here. This article is mainly about how to compensate and which compensation schemes are available, which leads us to the data monitoring system. Some readers may ask why a business status monitoring system can provide compensation at all. Don't worry, read on.

Traditional monitoring systems fall into two types: system monitoring and business monitoring. System monitoring includes concurrency monitoring, exception monitoring, call chain monitoring, port monitoring, Zabbix monitoring, HTTP monitoring, and so on. Business monitoring checks whether business data is normal, and it requires users to add tracking points to their business code for data collection. Underneath, business monitoring usually relies on a log reporting system, so you must apply for access to the log reporting system before connecting to business monitoring, as shown in Figure 1.

(Figure 1)

As the business monitoring sequence diagram shows, the process generally has five steps:

1. Data tracking: after tracking points are added, the business side reports logs; the source can also be MySQL. The data is finally reported through Flume or the binlog.
2. Data collection: usually done through Kafka.
3. Data cleaning: generally performed at the ODS layer, with Spark Streaming splitting and cleaning the stream.
4. Data storage: after splitting, the data is stored in the DW layer and finally lands in the various databases.
5. Data display: there are many open source options, of which Grafana is the most widely used; large data screens are another.
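The five steps above can be sketched as a tiny in-memory pipeline. This is only an illustration: the in-process queue stands in for Kafka, the dictionary of lists stands in for warehouse tables, and the field names `order_id` and `status` are assumptions made for the example, not part of any real system described here.

```python
import json
from collections import defaultdict, deque

def emit(queue, event):
    """Step 1, tracking: the business code reports an event as a JSON log line."""
    queue.append(json.dumps(event))

def collect(queue):
    """Step 2, collection: drain the queue (a stand-in for a Kafka topic)."""
    while queue:
        yield queue.popleft()

def clean(lines):
    """Step 3, cleaning: parse each line and drop records missing required fields."""
    for line in lines:
        record = json.loads(line)
        if "order_id" in record and "status" in record:
            yield record

def store(records):
    """Step 4, storage: group records by status (a stand-in for warehouse tables)."""
    tables = defaultdict(list)
    for record in records:
        tables[record["status"]].append(record)
    return tables

def display(tables):
    """Step 5, display: reduce each table to a count, as a dashboard panel would."""
    return {status: len(rows) for status, rows in tables.items()}

queue = deque()
emit(queue, {"order_id": "A1", "status": "ok"})
emit(queue, {"order_id": "A2", "status": "failed"})
emit(queue, {"status": "ok"})  # malformed: no order_id, dropped during cleaning
summary = display(store(clean(collect(queue))))
print(summary)
```

Each stage only depends on the output of the previous one, which is why real deployments can swap in a message queue, a streaming job and a warehouse without changing the overall shape.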

Feeling a little confused at this point? Does this sound just like distributed tracing? The difference between the two is that business monitoring reports through intrusive tracking points, while distributed tracing captures and reports through a non-intrusive agent. That answer alone seemed to miss the point, so I asked an AI, which answered: "Business monitoring is a technology used to monitor business indicators and key business processes; its purpose is to provide real-time understanding of business operations and quick response."

2. The birth of a new kind of business monitoring: hunter-monitor

Standing on the shoulders of giants and surveying the whole picture, we identified the real requirements:

1. Alarm capability: built around business and operations scenarios. Thresholds are set for the various alerts, and the system responds promptly when a threshold is reached.
2. Data calculation and statistics: calculate the abnormal data of each node on the entire link from the tracking points, and help with statistics and output.
3. Notification reach: internal chat tools, email, phone calls, text messages, and WeChat when necessary.
4. Data archiving: archiving is the safety net for eventual consistency, and it supports data comparison when an exception occurs.
5. Self-healing: in the AI era, the system needs to be able to digest and process abnormal data automatically.
6. Alarm rules: a tree structure is used to chain the entire system link together.



We are the R&D department of the JD Insurance platform, which handles the end-to-end flow of extended warranty orders from the mall. All of this traffic is transaction data, and transaction data must not be lost. That is why we built our own business monitoring system, "hunter-monitor" (hm for short). hm implements the six capabilities above. When a problem occurs, business and product staff are notified as soon as possible. hm also provides abnormal data statistics, per-node data calculation, backtracking, compensation and other capabilities. When business, product or R&D staff need it, data comparison can be done on the platform. hm is also extensible: for example, it can connect to a JSF interface to achieve automatic compensation.

The core capabilities of hm business status monitoring are data chaining and data calculation. Tracking points can be placed along the entire business link in the system and are connected linearly in sequence. hm then displays the abnormal-state data of each node and, finally, helps digest that abnormal data.

3. Three questions: who should connect, how is it used, and are there real examples?

1. Who should connect

All systems connected to the insurance SaaS workbench can use business status monitoring. What about systems that are not connected? They only need to create a tenant in the insurance SaaS workbench to use hm business status monitoring.

2. How to use it

2.1 Monitoring access

Connecting to hm takes only three simple steps: create tracking rules, create alarm rules, and add tracking points to the business system. Rules are created in the same way as in a conventional business monitoring system.

2.2 Data processing

Abnormal data eventually needs to be dealt with. It can be processed with one click in the monitoring list.

2.3 Customization

We support customizing the notification content, the way abnormal data is processed, and the abnormal data statistics. hm can call the JSF interface of the business system to complete processing automatically, can generate abnormal data reports on request, and can further help the business side customize exception handling along the system link. hm has already been applied to the full-link extended warranty transaction system, the contract performance platform, the business-finance integration platform, the insurance abTest system and others. Let's look at several scenarios from the extended warranty service.

3. In practice: extended warranty service access scenarios

3.1 Large screen display:

Every week, the problems that occurred in the extended warranty business during the previous week are published and sent to the business owners through the internal communication tool and email, with support for downloading the abnormal insurance applications. After receiving the email, the business side follows the strategy in it to complete the insurance application correctly. So far, this has helped the business side re-apply for insurance on 400,000+ abnormal policies, reducing the customer complaint rate for the business and helping the insurance company collect premiums (Figure 2).

(Figure 2)

3.2 Automatic order replenishment:

Most of the upstream business for extended warranties comes from the mall; the business processes each order in the system and distributes it downstream. Because of the large volume and the high operational threshold, abnormal situations always occur, such as a missing parameter that causes the transaction to fail or prevents the user from claiming under the contract normally. In the past, such problems were only discovered when the customer tried to claim or when the downstream side failed to initiate settlement. With monitoring configured in hm, when an abnormal situation is found, the order replenishment JSF interface is called to trigger automatic order replenishment. Problems that used to take days to resolve can now be resolved in minutes, which reduces cost and increases efficiency.
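The automatic replenishment flow can be illustrated with a minimal sketch. Everything named here is hypothetical: `replenish_order` stands in for the business system's replenishment interface (a JSF interface in the real system), `UPSTREAM` stands in for the upstream order store, and `sku_id` is an invented example of a parameter that may go missing.

```python
# Hypothetical upstream store holding the complete version of each order.
UPSTREAM = {
    "A1": {"order_id": "A1", "sku_id": "S-1"},
    "A2": {"order_id": "A2", "sku_id": "S-9"},
}

REQUIRED_FIELDS = ("sku_id",)  # assumed parameters that downstream systems need

def replenish_order(order):
    """Stand-in for the replenishment interface: rebuild the order from upstream."""
    return dict(UPSTREAM[order["order_id"]])

def check_and_compensate(orders, handler):
    """Pass complete orders through; route incomplete ones to the handler."""
    compensated = []
    result = []
    for order in orders:
        missing = [name for name in REQUIRED_FIELDS if name not in order]
        if missing:
            compensated.append(order["order_id"])
            order = handler(order)
        result.append(order)
    return result, compensated

downstream = [
    {"order_id": "A1", "sku_id": "S-1"},
    {"order_id": "A2"},  # sku_id was lost on the way downstream
]
fixed, compensated = check_and_compensate(downstream, replenish_order)
```

Passing the handler in as a parameter keeps the detection logic independent of how a particular business system chooses to repair its data.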

3.3 Data archiving:

hm provides permanent data archiving for the upstream and downstream transactions of the extended warranty. When an abnormal situation is found, the data can be exported from hm for comparison. Amount data can also be fed automatically into the reconciliation system, where reconciliation results can be viewed online and the differences exported (Figure 3). At the same time, an abnormal data email is sent to notify the corresponding product and business staff (Figure 4).

(Figure 3)

(Figure 4)

4. The core: HM's technical architecture and implementation

What if you really cannot connect and have to build something yourself? No problem: I will lay out the technical approach to give you some ideas for a solution.

1. Technical Architecture

The hm architecture keeps things simple and gets straight to the point. It starts from the core business data: tracking points are placed in the business application, and the entire link is connected through the tree node identifier nodeId. After cleaning, the tracking data is loaded into the data warehouse. The scheduling center then triggers data calculation and statistics at regular intervals, and the results are displayed on the front end. The architecture is shown in Figure 5.

(Figure 5)

2. Core technology

2.1 Rule Engine

The rule engine manages the tracking rules. It borrows from the Jaeger source code to generate our rule codes, the nodeId (Figure 6). The rules form a rule tree built into hm, which is cached and finally displayed on the workbench (Figure 7).
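To make the idea of a rule tree keyed by nodeId concrete, here is a small sketch. The dotted path format for `node_id` ("1", "1.1", "1.1.1") is an assumption chosen for readability; the article does not describe how hm actually encodes its nodeId, and the node names are invented.

```python
from dataclasses import dataclass, field

@dataclass
class RuleNode:
    name: str
    node_id: str
    children: list = field(default_factory=list)

    def add_child(self, name):
        """Create a child whose id extends this node's id with its position."""
        child = RuleNode(name, f"{self.node_id}.{len(self.children) + 1}")
        self.children.append(child)
        return child

    def walk(self):
        """Yield this node and all descendants in depth-first order."""
        yield self
        for child in self.children:
            yield from child.walk()

root = RuleNode("extended-warranty", "1")
order = root.add_child("order-received")
order.add_child("policy-issued")
root.add_child("settlement")

ids = [(node.node_id, node.name) for node in root.walk()]
```

Because every identifier contains the identifiers of its ancestors, a tracking event that carries only its `node_id` is enough to place it on the link.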

(Figure 6)

(Figure 7)

2.2 Alarm engine

The alarm engine covers the rules for configuring alarms, the data calculation rules, and the notification methods. After creating the tracking rules, you configure a detailed alarm for each rule, including the type of alarm triggered, the alarm rule, the calculation threshold, the processing method, and so on (Figure 8). The alarm type is the notification method; it inherits the capabilities of the insurance SaaS-msg service and supports email, the internal chat tool, WeChat, telephone and other channels. The task system uses Easy-Job to manage tasks dynamically. The processing method can either call the business side's JSF interface to close the loop, or be set to archive the data for later export or reconciliation.
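An alarm configuration of this kind can be sketched as a small data structure plus an evaluation function. The field names, the channel names and the two handling modes ("callback" and "archive") are assumptions that mirror the description above, not hm's actual schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AlarmRule:
    node_id: str      # rule-tree node this alarm is attached to
    threshold: int    # fire when the abnormal count reaches this value
    channels: tuple   # notification channels, e.g. ("email", "chat")
    handling: str     # "callback" (call the business side) or "archive"

def evaluate(rule, abnormal_count):
    """Return the actions to take for the given abnormal count, if any."""
    if abnormal_count < rule.threshold:
        return []
    actions = [f"notify:{channel}" for channel in rule.channels]
    actions.append(f"handle:{rule.handling}")
    return actions

rule = AlarmRule(node_id="1.1", threshold=10, channels=("email", "chat"), handling="callback")
quiet = evaluate(rule, 3)
fired = evaluate(rule, 10)
```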

(Figure 8)

2.3 Data buried points

After the tracking rules and alarm rules are configured on the insurance workbench, you can add tracking points on the business side. Unlike distributed tracing and traditional agent-based systems, which are non-intrusive, hm is a highly intrusive tracking system. We have therefore defined a tracking specification: "an asynchronous thread must be used to send MQ messages or call API interfaces". Tracking points support two methods. One is to send a message to a topic; the MQ supports jmq2/jmq4. The other is to initialize the hunter-expoxt entity class and call the API, in which case hm sends the message.

2.4 Data cleaning

The main responsibility of hm is to consolidate and organize business data. Besides tracking points, it also supports data sources such as MQ and databases. All data is transferred by the DTS of the DataBus system on the group's DP (DataPilot) platform and loaded into the FDM/BDM layer of the data warehouse. The group's scheduling center, Buffalo (EMR), then runs configured Spark tasks to organize the data. The final data is stored in Doris, Hive or ES.

2.5 Data Calculation

hm records only abnormal data and focuses on its statistics and calculation. Once the rule nodes and the system tracking points are configured, hm calculates the abnormal data of each node. It then acts according to the alarm rules: it notifies business, product and R&D staff, calls the JSF interface of the business system to process the abnormal data automatically, or processes the data itself according to the rules.
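Per-node calculation of abnormal data reduces to grouping events by node. The sketch below assumes each event carries a `node_id` and a `status` field, and that any status other than "ok" counts as abnormal; both are illustrative assumptions.

```python
from collections import Counter

def abnormal_by_node(events):
    """Count abnormal events per rule-tree node, ignoring normal ones."""
    return Counter(event["node_id"] for event in events if event["status"] != "ok")

events = [
    {"node_id": "1.1", "status": "ok"},
    {"node_id": "1.1", "status": "failed"},
    {"node_id": "1.2", "status": "timeout"},
    {"node_id": "1.1", "status": "failed"},
]
counts = abnormal_by_node(events)
```

The resulting counts are what an alarm rule's threshold would be compared against.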

2.6 Statistics

hm publishes a statistical report every week and sends it to business, product and R&D staff. The report covers the abnormal data of all systems under the business line the recipient is responsible for, including processed and unprocessed abnormal data, comparisons of abnormal data between business line A and business line B, comparisons between business systems, and so on. Reports can be customized to business needs, helping business, product and R&D staff keep track of the latest state of the system.

2.7 Task Center

The task center, modeled on xxljob tasks, is the dispatch center and is tightly bound to the alarm rules. Scheduled tasks fall into two categories. Business tasks are created dynamically and executed according to the configured cron expression. Platform tasks maintain the business tasks, for example by regularly deleting tasks that have no exceptions (Figure 9).
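The split between business tasks and platform tasks can be sketched as a small registry. This is not a scheduler: the cron strings are stored but never parsed, and the class names, the `abnormal_count` field and the cleanup criterion are assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass
class BusinessTask:
    rule_id: str
    cron: str                # schedule expression, stored as-is in this sketch
    abnormal_count: int = 0  # abnormal records found by this task so far

class TaskCenter:
    def __init__(self):
        self.tasks = {}

    def create(self, rule_id, cron):
        """Business task: created dynamically when an alarm rule is configured."""
        task = BusinessTask(rule_id, cron)
        self.tasks[rule_id] = task
        return task

    def cleanup(self):
        """Platform task: delete business tasks that have found no exceptions."""
        idle = [rule_id for rule_id, task in self.tasks.items() if task.abnormal_count == 0]
        for rule_id in idle:
            del self.tasks[rule_id]
        return idle

center = TaskCenter()
center.create("1.1", "0 */5 * * *").abnormal_count = 4
center.create("1.2", "0 0 * * *")
removed = center.cleanup()
```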

(Figure 9)

2.8 Notification and display

Notifications can be delivered through the insurance workbench, the internal chat tool, email, corporate WeChat, telephone voice calls and so on. The channel is chosen according to the needs of the business side.

2.9 Processing method

If abnormal data remains unprocessed after three notifications, the alert is escalated automatically, and the next notification is also copied to the department's superior. Once abnormal data has been handled, its status must be updated on the hm list page.
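The escalation rule can be written as a single function. The function and parameter names are hypothetical; the limit of three notifications comes from the text above.

```python
def next_recipients(owner, superior, notifications_sent, escalate_after=3):
    """Recipients of the next notification about a still-unprocessed anomaly."""
    if notifications_sent >= escalate_after:
        return [owner, superior]  # escalated: copy the department's superior
    return [owner]

history = [next_recipients("owner", "superior", sent) for sent in range(5)]
```

Here `history[3]` is the fourth notification, the first one sent after three have gone unanswered, and it is the first to include the superior.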

2.10 Open source capability: Jaeger

The bottom layer of hm builds on jaeger-core and rewrites the JaegerSpan and JaegerTracer classes. jaeger-core and opentracing-api are then repackaged into our own jar (hunter-api).

5. Summary

Those are the full technical details of hm. The soul of hm is data calculation, data governance and data statistics. hm draws on the strengths of many existing approaches. It is a business-oriented solution for monitoring and handling abnormal data, developed by our platform R&D department, and it brings together the wisdom of every member of the team.

 

Author: Guan Shunli, JD Insurance

Source: JD Cloud Developer Community


Origin my.oschina.net/u/4090830/blog/10092638