Real-time data warehouse chaos drill practice

1. Background introduction

Today, the real-time delivery metrics produced by the real-time data warehouse carry an ever-higher priority. They are no longer used only for standalone reports and dashboards; in particular, the data fed to the downstream rule engine directly drives the advertising delivery run by the delivery operations team, so delayed or abnormal data can cause direct or indirect financial losses.

Looking at the end-to-end link of the delivery management platform, the real-time data warehouse is an indispensable part: it processes massive amounts of data quickly, extracts actionable information, and also supports manual control from the platform. An incident on a single real-time node can bring the whole delivery link to a halt. In addition, the delivery rule engine is an automated service that must run 24 hours a day, so timely and effective data quality monitoring and alerting needs to be in place to quickly spot abnormal fluctuations or non-compliant business data. We therefore decided to introduce chaos engineering: by proactively injecting faults, we hope to perceive risks in advance as much as possible, uncover latent problems, and harden the system in a targeted way, so that real incidents do not escalate into serious consequences and the overall risk resistance of the real-time data warehouse improves.

2. Exercise scope

To describe the chaos drill in more detail, the real-time data warehouse chaos work is divided into two parts according to the content of the drill: the technical side and the business side.

Chaos on the technical side: inject common faults into middleware, databases, the JVM, basic resources, the network, services, and so on, and run chaos drills against the core application scenarios identified in the actual business, in order to probe the system's weak points and the team's emergency response, thereby improving stability and incident-handling capability.

Chaos on the business side: for a company with intensive e-commerce activity, metrics such as various arrival rates and exposure rates, as well as more macro-level figures such as GMV, the number of new users, and the number of user calls, all reflect the health of the business. In practice, describing a steady state requires a model built from a set of indicators rather than any single one. Whether or not chaos engineering is involved, recognizing the health of such indicators is critical, so a complete mechanism for data collection, monitoring, and early warning must be built around them. When business indicators fluctuate sharply, we can then quickly detect, locate, repair, and stop the bleeding.

In the past, the data warehouse chaos projects were all on the technical side. This time, with the delivery link built out and the primary and backup links completed, we hope that several rounds of business-side chaos drills will improve the system's overall ability to sense data changes.

3. Exercise plan

As the saying goes, to do a good job you must first sharpen your tools. Before executing a chaos drill, the groundwork has to be done: draw up a reasonable drill SOP, schedule, and contingency plan; assess the feasibility of the drill environment, scripts, data, tools, scenarios, blast radius, and so on; and only after feasibility is confirmed, make an appointment with the related parties and carry out the actual operation.

This article mainly shares the business-side chaos drill process for the real-time data warehouse:

1. Write the drill SOP

An SOP (standard operating procedure) refines, quantifies, and optimizes the steps and requirements of an activity into a standard operating process. For business-side chaos, and especially drills on real-time warehouse data, this was our first attempt, and we found no drill guidance in the industry to refer to, so we were in an exploratory stage. To keep the project running smoothly and make subsequent drill operations more standardized and efficient, the team discussed the approach at the start of the drill and compiled the following SOP drill template:

2. Research on drill plans

First, collect the core indicators of the real-time delivery link. On that basis, pull historical data over a period of time, analyze it to find a healthy fluctuation threshold for each indicator, and then configure the corresponding DQC monitoring rules. For indicators that fluctuate outside the healthy threshold, alarms should fire within minutes (15 minutes is the expectation) so that investigation and response can start quickly. To this end, we went through a series of plan research and exploration in the early stage of the exercise, as follows:

"In the solutions provided below, the indicator data are analyzed based on the number of device activations as an example."

  • Plan 1: by day, collect the number of device activations at each hour of the day over a recent period, compute each hour's share of that day's total, and take the minimum and maximum shares as the healthy fluctuation threshold for the indicator;

  • Plan 2: by day, collect the fluctuation between adjacent hours within each day over a period of time and look for patterns, for example the change from 9 am to 10 am every day, then run the data through a series of statistical distributions in the hope of finding a reasonably stable fluctuation range;

  • Plan 3: by day, collect the indicator's fluctuation between adjacent days over a period of time and look for patterns, for example the change from 9 am the day before yesterday to 9 am yesterday, then run the same statistical analysis in the hope of finding a stable fluctuation range;

  • Plan 4: building on the previous three plans, indicators may fluctuate differently on weekdays and weekends, so in addition to the daily statistics we also examined the distribution at the weekly level for the same weekday, for example the fluctuation from 9 am to 10 am on every Monday, again running the data through statistical analysis in the hope of finding a stable range;

  • Plan 5: in the same way, we examined week-over-week fluctuation at the weekly level, for example the change from 9 am last Monday to 9 am this Monday, and ran the same statistical analysis in the hope of finding a stable range;

  • Plan 6: based on the primary and backup links, when the source is the same, the indicators computed by the real-time data warehouse and written to the sinks of the two links over the same time window should be consistent or fluctuate only slightly; for example, with a 10-minute delay tolerance between the primary and backup links, the fluctuation should not exceed 10% and the average difference should reach better than 90% consistency.

We tried Plans 1 through 5, analyzing the data for each scenario with statistics such as the maximum, minimum, average, percentile distribution, variance, and standard deviation (a sketch of this kind of exploration is shown below). It proved difficult to find a reasonably stable fluctuation pattern, and we could not pin down a concrete threshold range for the indicators. In an actual drill, if the fluctuation alarm threshold is set too wide, genuine abnormal fluctuations in production business data will not be detected in time; if it is set too narrow, alarms fire so frequently that their accuracy and usefulness come into question. Moreover, real-time delivery has dozens of core indicators, each with its own healthy threshold, so the cost of collection and analysis is very high while the benefit to the drill is not that obvious.
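As a concrete illustration of the Plan 1-5 style exploration, the sketch below computes, for each hour of the day, the hour's share of that day's device activations and then looks at the min / max / average / standard deviation of that share across days. It reuses the table_a naming and the Chinese column names from the drill queries later in this article; the indicator name '设备激活数' (device activations) and the 14-day window are assumptions for illustration only, not the production configuration.

-- Hour-of-day share of device activations and its spread across days (Plan 1 style).
-- table_a and the 日期 / 小时段 / 指标名称 / 指标值 columns follow the drill queries below;
-- the indicator name and the 14-day window are illustrative assumptions.
select
  h.`小时段`,
  min(h.hour_value / d.day_value)    as min_ratio,     -- lowest share seen for this hour
  max(h.hour_value / d.day_value)    as max_ratio,     -- highest share seen for this hour
  avg(h.hour_value / d.day_value)    as avg_ratio,
  stddev(h.hour_value / d.day_value) as stddev_ratio   -- how stable the share is across days
from
  (
    select `日期`, `小时段`, sum(`指标值`) as hour_value
    from table_a
    where `指标名称` = '设备激活数'
      and date >= date_format(date_sub(now(), interval 14 day), '%Y%m%d')
    group by `日期`, `小时段`
  ) h
join
  (
    select `日期`, sum(`指标值`) as day_value
    from table_a
    where `指标名称` = '设备激活数'
      and date >= date_format(date_sub(now(), interval 14 day), '%Y%m%d')
    group by `日期`
  ) d
  on h.`日期` = d.`日期`
group by h.`小时段`
order by h.`小时段`;

Even with these statistics in hand, the spread across days was too wide to turn into a dependable per-indicator alarm threshold, which is what pushed us toward Plan 6.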

After an overall evaluation, the exercise mainly adopted Plan 6: 29 real-time delivery core indicators were collected, and within a given window (15 minutes) the fluctuation difference between the primary and backup link indicators must not exceed 10%.
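The check itself can be expressed as a comparison query along the following lines. This is a minimal sketch, not the production DQC rule: primary_sink_table and backup_sink_table are placeholder names for the sink tables of the two links, while the column names follow the drill queries in the next section. The alarm fires when the gap between the two links exceeds 10% for the same indicator and hour.

-- Plan 6 style check (sketch): compare the same indicator aggregated from the
-- primary and backup sinks for the same hour and flag a gap above 10%.
-- primary_sink_table / backup_sink_table are assumed names, not the real schema.
select
  p.`指标名称`,
  p.`小时段`,
  p.metric_value as primary_value,
  b.metric_value as backup_value,
  abs(p.metric_value - b.metric_value) / p.metric_value as diff_ratio
from
  (
    select `指标名称`, `小时段`, sum(`指标值`) as metric_value
    from primary_sink_table
    where date = date_format(now(), '%Y%m%d')
    group by `指标名称`, `小时段`
  ) p
join
  (
    select `指标名称`, `小时段`, sum(`指标值`) as metric_value
    from backup_sink_table
    where date = date_format(now(), '%Y%m%d')
    group by `指标名称`, `小时段`
  ) b
  on p.`指标名称` = b.`指标名称` and p.`小时段` = b.`小时段`
where p.metric_value > 0
  and abs(p.metric_value - b.metric_value) / p.metric_value > 0.1;  -- alarm condition: gap > 10%

In the drill this comparison is evaluated continuously by the monitoring platform within the 15-minute window; it is written here as a one-off query purely for illustration.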

3. Drill method

The drill uses a red-blue confrontation format, with the team split into two sides: red (defense) and blue (offense).

The testers form the blue team: responsible for formulating the chaos drill plan, injecting faults into the target system, and recording the drill process in detail;

The real-time data warehouse developers form the red team: responsible for fault detection, emergency response, and troubleshooting, while also verifying the system's reliability under different fault scenarios, including fault tolerance, monitoring coverage, personnel response, and recovery capability.

4. Exercise process

The overall exercise process is roughly divided into three stages: preparation, attack and defense, and review.

1. Preparation stage

  • After the drill plan is prepared and reviewed, confirm the link to be used;

  • The blue team prepares the test data and scripts in advance according to the agreed attack plan;

  • The red team ensures in advance that the environment is available, and puts monitoring, defense, and emergency response measures in place before the drill based on the planned attack scenarios.

2. Attack and defense stage

  • The blue team simulates real attack behavior according to the pre-planned attack scheme, attacks the drill link (the backup link) at the agreed time, performs fault injection, and records the operation steps for later reporting;

  • After the blue team attacks, the red team watches the monitoring system in real time through Feishu/email alerts and other notification channels. If an abnormal alert appears, they must troubleshoot and locate the problem as quickly as possible and then evaluate a repair plan;

  • During the confrontation, the blue team can adjust and refine its attack strategy based on the red team's defensive measures, doing its best to break through the system's defenses and reach the stated goals; meanwhile, the red team can analyze the blue team's attack techniques and behavior patterns to keep improving and strengthening its defenses.

3. Review and improvement stage

  • After the chaos drill, summarize and evaluate: analyze the performance of the red and blue teams and assess the system's security and resistance to attack;

  • Summarize lessons learned, including the defensive measures that worked and the attacks that failed, in order to improve the system's security strategy;

  • Based on the evaluation results and lessons learned, draw up improvement plans to fix vulnerabilities and weak points and raise the system's ability to resist risk.

5. Attack and defense in practice

This exercise covers 29 indicator-fluctuation cases in total, and the drill operations are broadly similar for each.

Take case 17, "recalled product-collection UV fluctuates abnormally on the hour for a certain channel", as an example. The drill proceeds as follows.

1. Data preparation

  • Through the background database, query the production primary (and backup) link and record the overall statistic N for product-collection UV for the given channel (for example media_id = '2') and hour (for example hour = 10).
-- Aggregated product-collection UV at the channel / on-the-hour level
-- (gives the overall statistic N for the chosen channel and hour)
select
  `指标名称`,
  `日期`,
  '2' as `指标ID`,
  `小时段`,
  sum(`指标值`)
from table_a
where
  date = date_format(now(), '%Y%m%d')
  and `指标名称` in ('商品收藏uv')
  and `小时段` = 10
  and `指标id` = '2'
group by
  `指标名称`,
  `日期`,
  `小时段`
order by
  `指标名称`;
  • Then, from the backup link, pull one specific detail record for the same channel (media_id = '2') and hour (hour = 10), record its product-collection UV value as n, and change n to n + 0.1N; this adjusted record is later injected into the backup link, producing a 10% fluctuation difference between the primary and backup links (a small worked sketch of the adjustment follows this list).
-- Detail-level data: pick a single record (the one with the largest 指标值)
-- whose product-collection UV value will be adjusted and injected
select
  t.`指标名称`, t.`账户id`, t.`计划id`, t.`设备类型`, t.`指标值`
from
  (
    select
      `账户id`,
      `计划id`,
      `指标名称`,
      `指标值`,
      `设备类型`,
      row_number() over (partition by `指标名称` order by `指标值` desc) as rn
    from table_a
    where
      date = date_format(now(), '%Y%m%d')
      and `指标名称` in ('商品收藏uv')
      and `设备类型` = '召回'
      and `小时段` = 10
      and `指标id` = '2'
  ) t
where
  t.rn = 1
order by t.`指标名称`;
  • After the adjustment, the data to be injected is ready (see the rows highlighted in yellow).
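To make the adjustment concrete, here is a tiny sketch with placeholder values (not drill data): given the overall statistic N from the summary query and the chosen detail record's value n, the injected record carries n + 0.1 * N, which is what pushes the backup link roughly 10% away from the primary link for this indicator.

-- Placeholder values only: N is the hourly total from the summary query above,
-- n is the value of the single detail record chosen for injection.
set @N := 1000;                          -- overall statistic N (placeholder)
set @n := 37;                            -- chosen record's value n (placeholder)
select @n + 0.1 * @N as injected_value;  -- value carried by the injected record: n + 0.1N

The adjusted record is then serialized into the single message column of the drill table created in the next step.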

2. Fault injection into ODPS

  • Import the data to be injected into ODPS.

Before importing, create a test table du_qa_dw_dev.hundun_case in the DataWorks workspace to hold the drill data.

-- drop table if exists du_qa_dw_dev.hundun_case;
CREATE TABLE IF NOT EXISTS hundun_case
(
    message  STRING COMMENT 'message content'
)
COMMENT 'chaos drill'
;
  • Populate the du_qa_dw_dev.hundun_case table with the prepared data (a minimal sketch follows the next bullet).

  • Verify that the data was imported successfully.
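A minimal sketch of these two bullets, assuming the drill data is loaded with plain SQL rather than the DataWorks import tool; the message body below is a placeholder, and in the real drill it is the serialized record prepared during data preparation.

-- Populate the drill table (placeholder message body, not real drill data).
INSERT INTO TABLE du_qa_dw_dev.hundun_case VALUES ('{"指标名称":"商品收藏uv","指标值":137}');

-- Verify the import before starting the Flink sync job.
SELECT COUNT(*) AS injected_rows FROM du_qa_dw_dev.hundun_case;
SELECT message FROM du_qa_dw_dev.hundun_case LIMIT 10;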

3. ODPS synchronization to Kafka

Run the Flink synchronization job to sync the data in the ODPS du_qa_dw_dev.hundun_case table to the corresponding Kafka topic.

Flink task script:

--SQL
--********************************************************************--
-- ODPS-to-Kafka sync script, used for fault injection in the
-- real-time data warehouse chaos drill
--********************************************************************--
-- Basic functions
CREATE FUNCTION JsonParseField AS 'com.alibaba.blink.udx.log.JsonParseField';
CREATE FUNCTION jsonStringUdf AS 'com.alibaba.blink.udx.udf.JsonStringUdfV2';

-- ODPS source table holding the injected drill messages
CREATE TABLE `source` (
  message  VARCHAR
) WITH (
  'connector' = 'du-odps',
  'endPoint' = '***',
  'project' = '***',
  'tableName' = 'hundun_case_01',
  'accessId' = '*******',
  'accessKey' = '*******'
);

-- Kafka sink: the topic feeding the backup (drill) link
CREATE TABLE `kafka_sink` (
  `messageKey`  VARBINARY,
  `message`  VARBINARY,
  PRIMARY KEY (`messageKey`) NOT ENFORCED
) WITH (
  'connector' = 'du-kafka',
  'topic' = '********',
  'properties.bootstrap.servers' = '*******',
  'properties.compression.type' = 'gzip',
  'properties.batch.size' = '40960',
  'properties.linger.ms' = '1000',
  'key.format' = 'raw',
  'value.format' = 'raw',
  'value.fields-include' = 'EXCEPT_KEY'
);

-- Write every message to Kafka, keyed by its MD5 hash
INSERT INTO kafka_sink
SELECT
  cast(MD5(message) as VARBINARY),
  cast(message as VARBINARY)
FROM source
;

4. Query data on the Kafka platform

After the Flink synchronization task runs, check through the platform's background whether the corresponding data has been synchronized successfully.

5. Fault injection notification

Once the fault injection is complete, notify the red team through the Feishu drill group; if an alarm is received, it must be reported in the group as soon as possible.

Blue team: data preparation is complete. Please make sure the environment is OK and the rule configuration is finished before the exercise. In addition, the planned drill time must be communicated to downstream parties in time;

Blue team: injection completed.

6. Alarm trigger notification

  • Before the drill, the red team can configure defense rules in advance on the monitoring platform.
  • After the fault injection, if everything goes as expected and the abnormal indicator fluctuation is detected within 15 minutes, the red team reports it to the drill group promptly. A sample alert looks like this:

[Medium risk] Dual-link primary/backup consistency monitoring

Service name: ****    Environment: ******    Alarm time: ******
Trigger condition: dual-link comparison fluctuates abnormally, lasting for 10 minutes
Alarm details: indicator prd_collect_uv, decrease versus primary: [-10%], primary: 1066, backup: 956

Business domain: real-time data warehouse

Application owner: ***

  • If things do not go as expected and no abnormal fluctuation is detected within 15 minutes, the red team needs to locate and follow up on the problem promptly, and after the fix, coordinate a follow-up drill to verify the repair.

Red team: no alarm received within 15 minutes; we are locating the cause.

Red team: root cause found. Because of the attack, the alarm data was not sent out in time; a fix is in progress.

Red team: fixed, please re-run the attack.

7. Record of the drill process

Collect and record every operation during the exercise, including the time, the executor, and the action taken, as follows:

6. Exercise summary

7. Future Outlook

Business-side chaos drills for the real-time data warehouse went from 0 to 1. After a series of exploration and practice, the primary/backup link comparison method lets us quickly identify and sense abnormally fluctuating indicators during a drill. Judging from the results, the drills achieved good outcomes, but there are also some limitations, for example:

  • Abnormal data injected manually during a drill may affect the use of the backup link if it cannot be cleaned up quickly.

  • For real-time indicators without a backup link, a more refined and practical plan is still needed to find a healthy fluctuation range.

These issues require further exploration by the team. At the same time, we will keep accumulating and enriching drill cases during exercises and improve the drill case library. Follow-up plans include introducing tools (platforms), establishing a drill assistance mechanism, and scheduling regular drills, making chaos drills more automated, standardized, and routine, and improving the overall data stability of the real-time data warehouse.

*Text / Yuan Xiao

This article is original content from Dewu Technology. For more articles, please see the Dewu Technology official website.

Reprinting without the permission of Dewu Technology is strictly prohibited; otherwise legal liability will be pursued according to law.
