Big Data Real-Time Link Preparation: High-Fidelity Stress Testing of Data Dual Streams | JD Cloud Technical Team

1. Big data dual-stream construction

1.1 Data dual stream

In the era of big data, more and more businesses rely on real-time data for decision-making, such as promotion adjustments, click-through rate estimation, and advertising commission calculation. To keep the business running smoothly and the overall big data link highly available, a growing number of level 0 systems are building dual streams so that data streams stay stable both day to day and during big promotions: the core data link is deployed in two computer rooms, with the two streams running active-active. However, building dual streams requires deploying every link of the entire pipeline in both computer rooms, which doubles the physical resources consumed. The construction process also requires coordination across upstream and downstream links (data producers, data warehouse processing, intermediate processing nodes, and business consumers), which adds significant communication and construction costs. To balance resource consumption against business stability, we have formulated dual-stream construction standards and implementation procedures that guide business parties in evaluating whether they really need dual streams and in carrying out construction smoothly.

1.2 Evaluation dimensions and standards for building data dual streams

Each row below gives the dimension, the evaluation criteria, and the standard definition and remarks.

(1) System level
Criteria: level 0 system.
Definition and remarks: A level 0 system is one of the company's core business service systems. If it becomes unavailable, it directly affects the golden transaction flow or the company's reputation, brand, group strategy, marketing plans, etc., and may cause a P0 to P2 accident. Level 0 systems are defined according to sections 4.1 to 4.2 of the retail subgroup's standard for online accident grading, liability determination, and point deduction. Level 0 systems receive priority in server and human resources, but in return, their high availability is closely tied to the accident level.

(2) Task level
Criteria: L0 real-time task.
Definition and remarks: For the specific task levels of level 0 systems on each service line, refer to the protocol level settings in the operation level management specification of the real-time data platform.

(3) Physical resources
Criteria: The business party applies for and bears the physical resource consumption required for dual-stream construction, and that cost is reasonable. The backup stream is built to carry 80% of the main stream's load (80% of its resources).
Definition and remarks: The business side must provide specific physical resource information: a. physical resource costs include storage, compute, bandwidth, queue resources, etc.; b. the estimate covers all links, including upstream production system storage, data warehouse processing, intermediate processing nodes, and business consumers; c. both traffic and transactions are assessed.

(4) Data timeliness
Criteria: Required data timeliness is <= 20 minutes at hour 0 of a big promotion (or the corresponding business peak), or <= 40 minutes in normal times.

(5) Data peak
Criteria: Forecast peaks for big promotions (transactions, traffic) and for normal times (transactions, traffic).
Definition and remarks: The data peak is the main reference. For a new system without historical data, the assessment is adjusted according to whether the business is part of group strategy. If all other conditions are met but the data peak is very small, dual streams are generally not recommended; special cases are discussed separately.

(6) Production source
Criteria: The production source must be deployed in two computer rooms.

(7) Business scenario
Criteria: Missing data can cause a class XX accident.
Definition and remarks: The business side provides the complete business scenario and the impact of a failure to support the evaluation.
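The criteria above can be read as a checklist. The sketch below encodes our reading of it, assuming hypothetical field names and a hypothetical minimum-peak cut-off (the table only says "very small" peaks generally do not qualify, without giving a number). It also sizes the backup stream at 80% of the main stream, per row (3). It is an illustration, not the actual review process, which also weighs business scenarios and special cases.

```python
from dataclasses import dataclass
from typing import Dict

# Thresholds transcribed from the evaluation table.
PROMO_TIMELINESS_MIN = 20    # required latency at the big-promotion peak, minutes
NORMAL_TIMELINESS_MIN = 40   # required latency in normal times, minutes
BACKUP_CAPACITY_RATIO = 0.8  # backup stream carries 80% of the main stream

# Hypothetical cut-off: the table gives no concrete number for a "very small" peak.
DEFAULT_MIN_PEAK_PER_SEC = 1000.0


@dataclass
class Candidate:
    is_level0_system: bool
    is_l0_task: bool
    source_in_two_rooms: bool
    promo_timeliness_min: float   # latency the business needs at the promotion peak
    normal_timeliness_min: float  # latency the business needs in normal times
    promo_peak_per_sec: float     # forecast big-promotion peak, records per second


def meets_timeliness(c: Candidate) -> bool:
    # Either requirement from the table qualifies: <= 20 min at the promotion
    # peak, or <= 40 min in normal times.
    return (c.promo_timeliness_min <= PROMO_TIMELINESS_MIN
            or c.normal_timeliness_min <= NORMAL_TIMELINESS_MIN)


def recommend_dual_stream(c: Candidate,
                          min_peak_per_sec: float = DEFAULT_MIN_PEAK_PER_SEC) -> bool:
    # Hard requirements first; special cases are reviewed outside this function.
    if not (c.is_level0_system and c.is_l0_task and c.source_in_two_rooms):
        return False
    if not meets_timeliness(c):
        return False
    # A very small peak generally does not justify doubling resources.
    return c.promo_peak_per_sec >= min_peak_per_sec


def backup_resources(main: Dict[str, float]) -> Dict[str, float]:
    # Size every resource type of the backup stream at 80% of the main stream.
    return {name: amount * BACKUP_CAPACITY_RATIO for name, amount in main.items()}
```

For example, a main stream using 100 CPU cores and 50 TB of storage would get a backup stream of about 80 cores and 40 TB under this rule.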

2. High-fidelity dam stress testing of big data dual streams

2.1 Dual-stream dam stress testing

Starting with preparations for the 2021 big promotion, stress testing of the core data links on the big data side shifted from single-module, single-task tests to full-link dam stress tests. The floodgates were moved further upstream and the scope of testing was expanded: orders and transactions are held back and released at the same time, reproducing with high fidelity the network peaks and resource contention of a big promotion. While the backlog is being discharged, data products (Golden Eye, Business Intelligence, and the large screens in the combat command room) simultaneously run read and query stress tests, simulating the real big-promotion scenario in which read and write peaks occur in parallel.
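Why does holding data back create a peak? A simple fluid model helps, assuming a constant live input rate and a fixed maximum processing capacity (both simplifications; real links have bursty input and autoscaling). During the hold, the backlog grows at the input rate; after release, it shrinks only by the spare capacity, so the link runs at full capacity until the backlog is gone. The function names and numbers below are hypothetical.

```python
def backlog_after_hold(rate_per_sec: float, hold_sec: float) -> float:
    # Everything produced while the gate is closed accumulates upstream.
    return rate_per_sec * hold_sec


def drain_seconds(backlog: float, rate_per_sec: float, capacity_per_sec: float) -> float:
    # After release, live data keeps arriving at rate_per_sec, so the backlog
    # only shrinks by the spare capacity (capacity - rate) each second.
    spare = capacity_per_sec - rate_per_sec
    if spare <= 0:
        raise ValueError("capacity must exceed the live input rate or the backlog never drains")
    return backlog / spare


def hold_needed(rate_per_sec: float, capacity_per_sec: float, full_load_sec: float) -> float:
    # Inverse question: how long to hold so that discharge keeps the link at
    # full capacity for full_load_sec seconds.
    return full_load_sec * (capacity_per_sec - rate_per_sec) / rate_per_sec
```

With a hypothetical input of 10,000 records/s held for 30 minutes, the backlog is 18 million records; a link that can process 50,000 records/s then runs flat out for 450 seconds while draining it.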

2.2 Setting dual-stream dam stress testing targets

(1) Stress test targets are generally set with reference to historical peaks and market forecasts, which yield estimated peaks for the core transaction- and traffic-themed links, for example 1.2 times the peak of Double 11 in 2022. For key data flow topics, an estimated peak consumption is provided for downstream reference, as shown in the table below (confidential figures are not shown in detail).
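As a worked version of the target arithmetic, the sketch below scales a historical peak by the multiplier (1.2 in the example above). Taking the larger of that value and a market forecast is our assumption about how the two references combine; the article does not specify the rule.

```python
from typing import Optional


def stress_target(historical_peak: float, multiplier: float = 1.2,
                  market_forecast: Optional[float] = None) -> float:
    # Scale the historical peak (e.g. Double 11 2022) by the agreed multiplier.
    target = historical_peak * multiplier
    # Assumption: if a market forecast is available and higher, use it instead.
    if market_forecast is not None:
        target = max(target, market_forecast)
    return target
```

A hypothetical topic that peaked at 100,000 records/s would get a target of about 120,000 records/s, or 150,000 if the market forecast were that high.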

2.3 Dual-stream dam stress testing scheme

(1) Order-holding mode: orders are held back by stopping the synchronization task. The transaction dual-stream architecture diagram is as follows:

(2) Traffic-holding mode: the lossless traffic-holding stress test holds back traffic by stopping the collection service from writing to the JDQ write cluster. Business parties that do not participate in the stress test can switch to "JDQ4 Lancang River_Click Stream New Stream" (a new JDQ write cluster created for the stress test), so that downstream businesses can keep consuming real-time traffic data normally and without loss during the test.
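Both modes rely on the same property: held data is delayed, not dropped, and is released in its original order. The toy gate below illustrates that contract in a single process. It is a hypothetical model only; in production the data is retained upstream (for example by the stopped synchronization task or collection service), not in an in-memory queue.

```python
from collections import deque
from typing import Callable, Deque


class Dam:
    """Toy model of order/traffic holding: records are buffered while the gate is closed."""

    def __init__(self, sink: Callable[[str], None]) -> None:
        self._sink = sink
        self._held: Deque[str] = deque()
        self._closed = False

    def close(self) -> None:
        # Start holding (analogous to stopping the sync task or collection writes).
        self._closed = True

    def on_record(self, record: str) -> None:
        if self._closed:
            self._held.append(record)  # nothing is dropped while holding
        else:
            self._sink(record)

    def open(self) -> int:
        """Flood discharge: replay the backlog in arrival order, then pass records through."""
        released = len(self._held)
        while self._held:
            self._sink(self._held.popleft())
        self._closed = False
        return released
```

Usage: with `out = []` and `dam = Dam(out.append)`, records sent after `dam.close()` only reach `out` when `dam.open()` is called, in the order they arrived.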

2.4 Dual-stream dam stress testing specification

(1) Before each full-link stress test, the specific order-holding and traffic-holding start times and the flood discharge time are announced 24 to 48 hours in advance (by email and work group). Once the notice has been issued, the flood discharge time will not be changed.

(2) Full-link stress tests are scheduled through the group so that they avoid important promotional activities. They must also avoid disaster recovery drills of the components involved, such as storage (HBase, JimDB, ES), JDQ, and JRC; otherwise the stress test results would be invalid.
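The two rules above lend themselves to an automated pre-check. The sketch below assumes that the notice lead time is measured against the holding start time and that blackout windows (promotions, disaster recovery drills) are available as named time intervals; both are assumptions, and the function and field names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

Window = Tuple[datetime, datetime]

MIN_NOTICE = timedelta(hours=24)
MAX_NOTICE = timedelta(hours=48)


def overlaps(a: Window, b: Window) -> bool:
    # Half-open intervals [start, end) overlap if each starts before the other ends.
    return a[0] < b[1] and b[0] < a[1]


def validate_schedule(notice_at: datetime, hold_start: datetime, test_end: datetime,
                      blackouts: List[Tuple[str, Window]]) -> List[str]:
    problems = []
    lead = hold_start - notice_at
    if not (MIN_NOTICE <= lead <= MAX_NOTICE):
        problems.append(f"notice lead time {lead} is outside the 24-48 hour window")
    if test_end <= hold_start:
        problems.append("test window ends before it starts")
    # Any overlap with a promotion or a drill would invalidate the stress test.
    for name, window in blackouts:
        if overlaps((hold_start, test_end), window):
            problems.append(f"overlaps with {name}")
    return problems
```

An empty result means the proposed window passes both checks; each string in a non-empty result names one rule it breaks.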

2.5 High-fidelity stress testing for distorted scenarios

Pre-sale orders make up too small a share of ordinary orders: the weekday peak of pre-sale orders is only 0.05% to 5.9% of the big-promotion peak. As a result, the dual-stream dam stress test cannot reproduce pre-sale orders with high fidelity. We therefore reworked the big data pre-sale link as a whole and, by joining the online military exercise (a stress test of the business production systems), added high-fidelity stress testing coverage for the pre-sale link.
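The size of the gap follows directly from the ratio: to lift a weekday pre-sale peak to the big-promotion peak, the volume would have to be multiplied by 1/ratio. The helper below (a hypothetical name) just performs that inversion for the range quoted above.

```python
from typing import Tuple


def amplification_range(ratio_low: float, ratio_high: float) -> Tuple[float, float]:
    # ratio = weekday pre-sale peak / big-promotion pre-sale peak.
    # The factor needed is 1/ratio, so the highest ratio gives the smallest factor.
    if not (0 < ratio_low <= ratio_high):
        raise ValueError("expected 0 < ratio_low <= ratio_high")
    return 1 / ratio_high, 1 / ratio_low
```

For 0.05% to 5.9%, `amplification_range(0.0005, 0.059)` gives roughly 17x to 2000x, far more than holding back ordinary traffic can produce, which is why pre-sale needed dedicated stress data.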

Implementation plan: the military exercise provides pre-sale order data covering the deposit payment and final payment scenarios, and the big data link is modified to accept stress test data, so that data stress testing can run without polluting online data.

As shown in the figure below, the yellow part is the storage that holds the data of the online military exercise, namely the shadow databases and tables. The green part is newly added for stress testing: the top layer is the stress test data source (JMQ/JDQ), and below it are the transparent stress test environment and the shadow storage writes built for the test. The corresponding tasks of the Golden Eye pre-sale general source and the Shangzhi pre-sale transaction general source were changed to dual-input, dual-output, so they can process the online data source and the stress test data source at the same time: online data is written to the online output topics and online storage, while data from the stress test source, after processing, is written to the stress test output topics and shadow storage. This way, the online topology does not have to change for every stress test, and downstream business parties can flexibly choose whether to participate.
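The dual-input, dual-output idea can be sketched as one processing function shared by both inputs, with the input source deciding where results go. The topic and storage names below are hypothetical, and the in-memory `outputs` dictionary stands in for real topic and storage writers. The key design point is that routing is keyed on the source, so stress test data can never fall through to online outputs.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical output names: each input source maps to its own topic and storage.
ROUTES: Dict[str, Tuple[str, str]] = {
    "online": ("presale_online_topic", "online_storage"),
    "stress": ("presale_stress_topic", "shadow_storage"),
}


def transform(order: dict) -> dict:
    # Placeholder for the task's real business logic, shared by both inputs.
    return {**order, "processed": True}


class DualInDualOutTask:
    def __init__(self) -> None:
        # Stand-in for topic/storage writers: output name -> written records.
        self.outputs: Dict[str, List[dict]] = defaultdict(list)

    def process(self, source: str, order: dict) -> None:
        # Unknown sources raise KeyError instead of defaulting to online outputs.
        topic, storage = ROUTES[source]
        result = transform(order)
        self.outputs[topic].append(result)
        self.outputs[storage].append(result)
```

Calling `process("online", ...)` and `process("stress", ...)` on the same task leaves online and shadow outputs strictly separated.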

3. Migration plan for business parties during big data stress testing

3.1 Impact of dual-stream dam stress testing on business parties

While traffic and orders are being held back during a big data dual-stream stress test, the computer room being held (Huitian or Langfang) has no real-time data distribution; distribution resumes after flood discharge. Business parties that do not participate in the stress test need to switch accordingly.

3.2 Migration plan for business parties that do not participate in stress testing

(1) Switch clusters:

A. Transactions: no cluster switch is needed. Transaction source topics are dual-stream and active-active, so the business can switch to the topic of the computer room that is not under stress test.

B. Traffic: businesses that directly consume the topics emitted by the click stream need to switch to the lossless stress test cluster "JDQ4 Lancang River_Click Stream New Stream". This cluster switch supports one-click migration without restarting tasks. To use it, upgrade jdq-sdk to jdq4-clients: 1.3.0-SNAPSHOT; flink: 1.10/1.12/1.14-1.0.9-SNAPSHOT. If the cluster "JDQ4 Lancang River_Click Stream New Stream" does not appear during migration, contact Ping O&M for support.

(2) Switch topic authorization

A. Transactions are dual-stream: both Langfang and Huitian have corresponding topics. Business parties that do not participate in the stress test can apply for access to, and consume, the topics of the computer room that is not under stress test.

B. Traffic: businesses that consume not the topics emitted directly by the collection service but the topics of the real-time traffic data warehouse and later links are also on dual-stream, active-active topics, and should switch consumption to the topic of the computer room that is not under stress test.
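The migration rules above reduce to one decision per consumer. The sketch below assumes a consumer knows whether it reads the raw click stream and has a map of its topic per computer room; the room keys and topic names are hypothetical, and only the cluster name comes from this article. It decides what to switch to; it does not perform the switch.

```python
from typing import Dict, Tuple

# Cluster name taken from the article; room keys and topic names are hypothetical.
NEW_STREAM_CLUSTER = "JDQ4 Lancang River_Click Stream New Stream"


def counterpart_topic(topics_by_room: Dict[str, str], room_under_test: str) -> str:
    # Pick the same logical topic in the computer room that is NOT being held.
    if room_under_test not in topics_by_room:
        raise ValueError(f"unknown room: {room_under_test}")
    others = [room for room in topics_by_room if room != room_under_test]
    if not others:
        raise ValueError("topic has no dual-stream counterpart")
    return topics_by_room[others[0]]


def plan_switch(consumes_raw_click_stream: bool, topics_by_room: Dict[str, str],
                room_under_test: str) -> Tuple[str, str]:
    # Direct click-stream consumers move to the lossless stress test cluster;
    # transaction and downstream traffic consumers move to the other room's topic.
    if consumes_raw_click_stream:
        return ("cluster", NEW_STREAM_CLUSTER)
    return ("topic", counterpart_topic(topics_by_room, room_under_test))
```

For example, a transaction consumer with topics in Huitian and Langfang would be pointed at its Langfang topic while Huitian is being held.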

Author: JD Retail Jing Minglan

Source: JD Cloud Developer Community

Origin my.oschina.net/u/4090830/blog/10090554