1. Background

Common preparation tasks for major promotions include special stress testing (full-link stress testing, internal stress testing), disaster recovery drills, downgrade drills, rate limiting, inspections (monitoring, application health), and chaos drills (red-blue confrontation), as shown in Figure 1. As the platform's business grows more complex, red-blue confrontation plays an increasingly important role. This article describes in detail how the big data platform carried out red-blue confrontation in preparation for this Double Eleven promotion.

Figure 1. Common work instructions for preparing for big promotions

First, let's look at what red-blue confrontation is and what benefits it brings.

2. Introduction to red-blue confrontation

Red-blue confrontation is a common adversarial drill method in network security. It brings together a platform's security threat monitoring, emergency response, and protection capabilities in red-versus-blue drills carried out by real teams in a real network environment, with the goal of continuously improving security protection technology and management systems.

The blue side represents the attacking side and the red side represents the defending side. Red-blue confrontation simulates real network attacks and defense processes in a controlled environment. The blue side simulates various threats and attack methods to attack the red side to test its defense capabilities and system high availability. The red team is responsible for defense and response, finding and fixing problems in the system, and gathering information about attackers.

Figure 2. Red-blue confrontation

3. The benefits of red-blue confrontation

1. Ensure the effectiveness of monitoring alarms

Red-blue confrontation helps product and R&D teams verify the effectiveness of monitoring alarm configurations, the timeliness of notifications, and the accuracy of the alarm information.

2. Enhance system reliability

Red-blue confrontation helps improve system reliability by identifying potential problems that can cause errors in the system.

3. Reduce risk

Red-blue confrontation helps reduce the risk associated with online issues by identifying potential weaknesses that could be exploited by malicious attackers.

4. Cost-effective testing

Red-blue confrontation simulates the scenario of the production environment, but does not cause risks to the production environment, ensuring the quality of the system from a testing perspective.

Figure 3. Benefits of red-blue confrontation

4. Red-blue confrontation practice

The practice of red-blue confrontation drills mainly includes six parts: drill announcement, personnel designation and task assignment, pre-drill scenario review, red-blue confrontation process, drill results collection, and drill review.

Figure 4. Red-blue confrontation practice mainly includes six parts

4.1 Exercise Announcement

It mainly includes two parts:

First, the person in charge of the red-blue confrontation organizes the drill kick-off meeting, determines the time window for the drill, and designates the points of contact for the real-time and offline sides.

Second, the real-time and offline product teams notify business users in advance, by email or the internal office app, that a red-blue confrontation drill will take place.

Figure 5. Announcement of Red-Blue Confrontation Drill

4.2 Personnel designation and task allocation

First, designate the person in charge of the red-blue confrontation. This person coordinates the entire drill, including formulating the plan, maintaining the drill documents, announcing and reviewing the scenario collection, organizing the blue team's attacks and the red team's defense, and running the drill review.

Second, designate points of contact for the real-time side and the offline side. Acting as the blue team (attackers), they define the drill attack scenarios and launch the attacks on the system.

Third, designate backup personnel for the real-time side and the offline side. These are generally core R&D engineers. Because the exact attack time is not known in advance, the red team may, for various reasons, be unable to handle a fault in time after the blue team attacks, which would affect normal online business. In that case the backup personnel can restore the system quickly.

Finally, assign drill monitors for the real-time side and the offline side. These are generally testers, mainly responsible for recording the alarm information (mdc, ump) issued during the drill and for reviewing the drill record documents.

Figure 6. Designation and task allocation of red and blue confrontation personnel

4.3 Scenario collection before the exercise

This is the most important step before the drill. It consists of determining which applications are in scope and determining the attacker's drill scenarios.

4.3.1 Determine the application scope of the drill

For the drill, it is recommended to give priority to applications at levels L0 and L1, selected according to business needs. Within JD, the corresponding applications can be looked up quickly in either of the following two ways:

http://XXX.jd.com/dashboard/4/node/XXX

http://XXX.jd.com/health

The detailed list of drill applications is provided by the real-time and offline points of contact and is reviewed and approved by the C3 leader. Output: the attacker's collection of scenarios for batch injection.

Figure 7. Drill application scope

4.3.2 Collect drill failure scenarios

Jdos applications mainly use the [Chaos Engineering] platform for fault injection, using the following drill scenarios:

High CPU usage, high memory usage, high disk usage, network delay, network packet loss, process termination, mysql request delay exception, jimdb request delay exception, etc.

For the underlying clusters, operations personnel mainly inject faults through scripts and commands. The main drill scenarios are:

High database instance CPU, full hdfs queues, pending computing tasks, a busy RSS cluster, abnormal zk node failures, etc.

4.4 Red-blue confrontation process

Once the drill scenarios are ready and the product team has sent the drill notification email, the red-blue confrontation can begin. There are a few points to note:

① The specific attack time must not be disclosed to the red team (the defenders);

② It is recommended to choose production environment applications or clusters for attacks to simulate online problems as realistically as possible.

4.4.1 [Principal person in charge] Notification before the drill

Before the blue team formally starts the drill, the person in charge posts a message to the group chat. The template is as follows:

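@All members
[Important notice]
Today from 17:30 to 21:30, the big data platform (real-time + offline) will hold a red-blue confrontation drill, with surprise fault injections at unannounced times. Please post your handling progress in this group as you go. There are three stages: problem discovery, cause analysis and diagnosis, and fault handling.
As soon as each stage (problem discovery, fault diagnosis, fault handling) is settled, send a message right away. Do not wait and send a summary at the end!
As soon as each stage (problem discovery, fault diagnosis, fault handling) is settled, send a message right away. Do not wait and send a summary at the end!

1. Problem discovery
[Problem discovered]
Product-service name:
(1) Received a phone/Dongdong alert; alert content: xxx
or
(2) The radar dashboard turned red; screenshot: xx. Starting investigation and handling.

2. Cause analysis
[Fault diagnosis]
Product-service name: the cause of problem xx has been found; brief description of the cause.

3. Fault handling
[Fault handling]
Product-service name: problem xx has been handled and service has recovered; attach a screenshot of the alert recovery or the monitoring view.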

4.4.2 [Blue Team] Create & Execute Exercise Tasks

On the chaos engineering platform, the blue team creates drill tasks, individually or in batches, based on the previously collected drill scenarios, as shown below:

Figure 8. Blue team creates tasks

A few points to note:

① Attacks on the underlying clusters are mainly carried out through commands and scripts and are not described in detail here.

② Network delay and packet loss drills may fail. The reason is that network fault drills are restricted on hosts running kernel version "4.18.0-80.11.2.el8_0.x86_64", which has a known bug and cannot be drilled.

③ For the high memory utilization scenario, do not target 100%. When memory is exhausted, Linux triggers the OOM killer, so it is recommended to set the target to 90%.

④ It is recommended that each drill last longer than 5 minutes. The reason is that the mdc alarm cycle configured for some applications is up to 5 minutes, so if the drill lasts less than 5 minutes the alarm may never be received.
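Notes ② to ④ can be turned into a pre-flight check that runs before a drill task is submitted. The following is a minimal Python sketch under stated assumptions, not the chaos engineering platform's actual API: the `DrillTask` fields, the scenario names, and `check_task` are hypothetical names introduced only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical constants derived from notes 2 to 4 above.
AFFECTED_KERNELS = {"4.18.0-80.11.2.el8_0.x86_64"}  # kernel with the known bug
NETWORK_SCENARIOS = {"network_delay", "network_packet_loss"}
MAX_MEMORY_PERCENT = 90         # stay below 100% to avoid the OOM killer
MIN_DURATION_SECONDS = 5 * 60   # some alarm cycles are up to 5 minutes


@dataclass
class DrillTask:
    app: str
    scenario: str                  # e.g. "cpu_high", "memory_high", "network_delay"
    duration_seconds: int
    host_kernel: str
    target_percent: Optional[int] = None


def check_task(task: DrillTask) -> List[str]:
    """Return human-readable warnings; an empty list means no known issue."""
    warnings = []
    if task.scenario in NETWORK_SCENARIOS and task.host_kernel in AFFECTED_KERNELS:
        warnings.append(
            f"{task.app}: network drills are restricted on kernel {task.host_kernel}"
        )
    if (
        task.scenario == "memory_high"
        and task.target_percent is not None
        and task.target_percent > MAX_MEMORY_PERCENT
    ):
        warnings.append(
            f"{task.app}: memory target {task.target_percent}% may trigger the "
            f"OOM killer; use {MAX_MEMORY_PERCENT}% or less"
        )
    if task.duration_seconds <= MIN_DURATION_SECONDS:
        warnings.append(
            f"{task.app}: duration {task.duration_seconds}s is not longer than "
            f"{MIN_DURATION_SECONDS}s; the alarm may not fire"
        )
    return warnings


if __name__ == "__main__":
    risky = DrillTask("demo-app", "memory_high", 120, "5.10.0", target_percent=100)
    for line in check_task(risky):
        print(line)
```

Returning warnings instead of raising an exception leaves the final decision with the blue team, who may deliberately want to run a short or aggressive scenario.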

4.4.3 [Red Team] Defense Repair Fault

After the blue team launches an attack, the red team will receive an alarm from the internal office app and repair the fault according to the established plan. Some screenshots are as follows:

Figures 9 and 10. Internal office app alarm indication

4.4.4 [Red Team] System Recovery

Some drill scenarios (for example, process termination) do not recover automatically. In these cases the red team must manually restart the application services to make sure the production services return to normal.

4.4.5 [Red Team + Blue Team] End of drill

After the red-blue confrontation drill, both sides fill in the "Red-Blue Confrontation Drill Scenario" document. The blue side fills in: chaos task link, chaos drill scenario, drill status, chaos drill execution start time, and chaos drill execution end time. The red side fills in: troubleshooter, alarm information, root cause, time taken to find the cause, description of the troubleshooting process (including the steps taken, tools used, and supporting decisions and judgments), planned solutions and emergency plans, and estimated time to handle the impact. As shown below:

Figure 11. Document filling in after the exercise.
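Because the drill monitors review these record documents, a simple completeness check can flag entries that either side left blank. The sketch below is a hedged illustration in Python: the field keys are hypothetical names chosen to mirror the fields listed above, and the record is assumed to be a plain dictionary rather than any real document format.

```python
from typing import Dict, List

# Hypothetical keys mirroring the fields each side is asked to fill in.
BLUE_FIELDS = [
    "chaos_task_link",
    "chaos_scenario",
    "drill_status",
    "start_time",
    "end_time",
]
RED_FIELDS = [
    "troubleshooter",
    "alarm_information",
    "root_cause",
    "time_to_find_cause",
    "troubleshooting_process",
    "solution_and_emergency_plan",
    "estimated_impact_handling_time",
]


def _is_empty(value: object) -> bool:
    """Treat None and whitespace-only strings as unfilled."""
    return value is None or str(value).strip() == ""


def missing_fields(record: Dict[str, object]) -> Dict[str, List[str]]:
    """List the unfilled fields for each side of one drill record."""
    return {
        "blue": [f for f in BLUE_FIELDS if _is_empty(record.get(f))],
        "red": [f for f in RED_FIELDS if _is_empty(record.get(f))],
    }


if __name__ == "__main__":
    record = {
        "chaos_task_link": "task-123",      # placeholder value
        "chaos_scenario": "High CPU usage",
        "drill_status": "finished",
        "start_time": "17:40",
        "end_time": "17:50",
        "troubleshooter": "on-call engineer",
        "root_cause": "",                   # left blank by the red side
    }
    print(missing_fields(record))
```

Grouping the result by side makes it easy to send each team only the gaps it is responsible for.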

4.5 Drill results collection

The person in charge reviews the drill results and sorts out the issues exposed by the drill so that the red and blue teams can fix them as soon as possible. The main problems are as follows:

1) Failure to handle the problem in a timely manner: After receiving the alarm, the red team failed to handle the fault in a timely manner due to various reasons (meeting, absence from work, etc.).

2) Incomplete processing: After the red team processed the ns failure problem, it did not notify the user to process the failed task.

3) No alarm received:

① No alarm rules are configured. For example, no alarms are configured on the mdc or ump platform.

② The alarm threshold is not reached. For example, the blue team's attack drives CPU utilization to 90%, but the mdc alarm rule is configured to fire at 95%.

③ Alarms are disabled on the mdc platform. For example, MDC monitoring and alarming has been temporarily disabled in the mdc template center.

Figure 12. Problems with the drill
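The three causes of a missing alarm listed under 3) can be checked before the attack instead of being discovered afterwards. A minimal Python sketch follows. `AlarmRule` and `diagnose_missing_alarm` are hypothetical names, not mdc or ump APIs, and the sketch assumes an alarm fires when the metric is greater than or equal to its configured threshold.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlarmRule:
    metric: str               # e.g. "cpu", "memory" (hypothetical metric names)
    threshold_percent: float  # assumed to fire when value >= threshold
    enabled: bool = True


def diagnose_missing_alarm(
    injected_metric: str,
    injected_percent: float,
    rules: List[AlarmRule],
) -> Optional[str]:
    """Return why no alarm would fire, or None if at least one rule fires."""
    matching = [r for r in rules if r.metric == injected_metric]
    if not matching:
        return "no alarm rule configured"
    enabled = [r for r in matching if r.enabled]
    if not enabled:
        return "alarm rule disabled"
    if all(injected_percent < r.threshold_percent for r in enabled):
        return "alarm threshold not reached"
    return None


if __name__ == "__main__":
    # The example from the text: the attack reaches 90%, the rule is set to 95%.
    rules = [AlarmRule("cpu", 95.0)]
    print(diagnose_missing_alarm("cpu", 90.0, rules))
```

Running such a check over the planned scenarios during scenario collection would separate configuration gaps from genuine response problems before the drill starts.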

4.6 Exercise review

The person in charge organizes a red-blue confrontation review meeting and provides drill results and a list of issues. Both real-time and offline architects participate and evaluate or make suggestions for this drill from the perspectives of the drill process, drill effects, etc.

① Alarm levels need to be reviewed and corrected. At present, some alarm levels are configured too low: CPU utilization above 90% is reported only as [Warning]. It is recommended to change it to [Urgent].

② Extend the attack time. Pick a few applications and attack them for 30+ minutes to verify whether the defenders actually drain traffic away from the affected instances.

③ Make chaos drills routine. This can be done through the routine drill feature of the chaos engineering platform, combined with the on-call schedule, to increase drill frequency and train the team through real practice.

④ Drill the [Warning] and [Urgent] scenarios step by step. In the first step, attack for 10 minutes with a scenario that triggers [Warning]. In the second step, attack for 10 minutes with a scenario that triggers [Urgent].

⑤ Java method exception and delay scenarios have not yet been drilled. In the future, testers are expected to help by generating the incoming traffic through forcebot stress testing.

Features we hope the chaos engineering platform will support:

① Batch selection of multiple applications when creating, starting, and stopping chaos drill tasks. This would make task creation more efficient. At present, the batch creation function only allows applications to be added one at a time.

② An API for routine chaos drills, so that users can customize and create routine drill tasks.

③ Viewing mdc and ump alarms inside the platform, so that users do not have to switch back and forth between multiple systems.

5. Summary

This red-blue confrontation drill effectively strengthened the risk resistance of the big data platform's applications and reduced the probability of failures in the production environment. It also greatly improved the ability of R&D personnel to troubleshoot and resolve faults, and left us with a fast and efficient drill plan for future use.

Finally, thank you to the Chaos Engineering Platform for its strong support!

Author: JD Retail Yin Wei

Source: JD Cloud Developer Community. Please indicate the source when reprinting.

Big data platform red and blue confrontation-sharpen the sword and temper the soldiers! | JD Cloud Technical Team