[A consolidation of experience] Performance testing and optimization: practice and exploration (storage model optimization + call chain analysis) | JD Logistics Technical Team

1. Introduction

Performance testing is a key measure for ensuring a software system's business carrying capacity and stability. With system capability building as the main thread, capability design work and performance testing work both follow a sequence and influence each other. During this work, the main aspects of performance testing (scenario selection, architecture analysis, traffic analysis, stress test execution, and bottleneck analysis and optimization) prompted many thoughts on consolidating the system's capability foundation and improving the testing strategy.

During the performance testing phase, we analyze how system capabilities are implemented and tuned, and explore room for better solutions and better performance testing strategies.

2. Practice and reflections on stress testing the hot-data storage model

Through performance testing, we can infer the performance bottlenecks and risks of the SKU inventory pre-emption scenario under different storage models.

After the data architecture upgrade, SKU inventory pre-emption throughput (TPS) increased by 2300%↑.

In a test-driven way, combined with the system implementation, we demonstrate the necessity of cache preheating and use big data analysis to explore scientific cache preheating and heat-preservation strategies.

Combined with new business models, we consider more scientific approaches to test data construction and plans for improving testing process efficiency.

1. Stress test scenario

Inventory pre-emption refers to a short-term reservation of SKU inventory for a document (order) during the order receiving process. In the logistics warehouse's order receiving flow, pre-emption is initiated at the SKU dimension.

The inventory center exposes the standard SKU inventory pre-emption capability through the pre-emption interface of the "Inventory Pre-emption Main Application". Three key applications, "inventory deduction logic control and database layer interaction", "cache layer interaction", and "task scheduling", carry the inventory calculation logic and the interaction with the storage layer.

From the data model perspective, there are two implementations of the pre-emption capability:

▪Inventory pre-emption at the business-department dimension is mainly carried by the Redis cache layer.

▪Batch inventory pre-emption is carried directly by the database.

When warehouse allocation volume enters the explosive phase of a major promotion, pre-emption requests for hot SKUs surge and go directly to the database. The system's TP99 then jumps or even keeps rising, and in serious cases order receiving times out.

For the reasons above, we planned targeted stress test scenarios and data models to confirm the system's peak carrying capacity and the effectiveness of the tuning strategy.

2. First stress round and analysis

◦Stress test target: for the "pre-emption interface" of the "Inventory Pre-emption Main Application", with the database carrying hot-SKU pre-emption requests, explore the peak traffic that can be carried within the target TP99 (≤3000ms), and verify the optimized peak carrying capacity (target TP99 ≤500ms).

◦Stress test plan: a single hot SKU is pre-empted under sustained pressure. The load starts at QPS=10 and is increased in steps of +10 QPS to find the upper limit of the system's request handling capacity (a minimal load-step sketch appears after the monitoring charts below).

◦Stress test process and conclusions

▪At QPS=50, the system stably supported the inventory pre-emption business (TP99 ≈ 100ms).

▪"Inventory pre-emption" main application: CPU usage ≤15%, memory usage ≤35%

▪"Inventory deduction logic control and database layer interaction" application: CPU usage ≤18%, memory usage ≤65%

▪Database: CPU usage ≤7.8% (no slow SQL)

▪Based on the current system performance, the conditions for further load increase were met.

▪When the load was raised to QPS=60 (in +10 increments), TP99 rose rapidly to 7000ms within about 2 minutes while TPS of the "Inventory pre-emption" main application stayed ≤60. We judged that system capacity had reached its bottleneck and stopped the test.

"Inventory pre-emption" main application TP99+TPS trend

"Inventory pre-emption" main application hardware resource trend

Database key indicators (CPU)

Database key indicators (slow SQL)

Database key indicators (memory)
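To make the stepped-load plan above concrete, here is a minimal load-step sketch in Java. It is illustrative only, not the stress platform actually used: `preemptInventory` is a hypothetical stand-in for the real pre-emption interface call, and the step size, step duration, and TP99 stop condition mirror the plan described above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.*;

/**
 * Minimal stepped-load sketch: start at 10 QPS, add 10 QPS per step,
 * stop when TP99 exceeds the 3000 ms target. The request is a
 * placeholder for the real pre-emption interface call.
 */
public class SteppedLoadSketch {

    // Hypothetical stand-in for the real "pre-emption interface" client call.
    static void preemptInventory(String skuId) throws Exception {
        Thread.sleep(20); // simulate server-side latency
    }

    public static void main(String[] args) throws Exception {
        String hotSku = "SKU-HOT-001";
        long tp99TargetMs = 3000;
        int stepSeconds = 120;          // hold each load step for ~2 minutes

        ExecutorService workers = Executors.newFixedThreadPool(64);
        for (int qps = 10; ; qps += 10) {
            List<Long> latencies = Collections.synchronizedList(new ArrayList<>());
            ScheduledExecutorService pacer = Executors.newSingleThreadScheduledExecutor();
            long periodMicros = 1_000_000L / qps;

            // Fire requests at a fixed rate for the duration of this step.
            pacer.scheduleAtFixedRate(() -> workers.submit(() -> {
                long start = System.nanoTime();
                try {
                    preemptInventory(hotSku);
                } catch (Exception ignored) {
                }
                latencies.add((System.nanoTime() - start) / 1_000_000);
            }), 0, periodMicros, TimeUnit.MICROSECONDS);

            Thread.sleep(stepSeconds * 1000L);
            pacer.shutdownNow();

            long tp99 = percentile(latencies, 99);
            System.out.printf("QPS=%d -> TP99=%d ms (%d samples)%n", qps, tp99, latencies.size());
            if (tp99 > tp99TargetMs) {
                System.out.println("TP99 target exceeded, stopping the load ramp.");
                break;
            }
        }
        workers.shutdown();
    }

    static long percentile(List<Long> values, int p) {
        List<Long> sorted = new ArrayList<>(values);
        Collections.sort(sorted);
        if (sorted.isEmpty()) return 0;
        int idx = Math.min(sorted.size() - 1, (int) Math.ceil(p / 100.0 * sorted.size()) - 1);
        return sorted.get(idx);
    }
}
```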

Bottleneck prediction: document-dimension inventory pre-emption is implemented as check-then-write (first read the available inventory, then write the pre-emption). When hot-SKU orders are placed at high frequency, that SKU's database row is read and written continuously, and the database guarantees the atomicity of each transaction through its row lock mechanism. The lock contention caused by these row-level locks most likely drove the system to its processing bottleneck and limited execution efficiency. At the same time, no hardware resource bottleneck was observed from the application layer to the storage layer, which rules out insufficient hardware resources. The pattern is sketched below.
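A minimal JDBC sketch of this check-then-write pattern, under assumed table and column names (`sku_stock`, `available_qty`, `reserved_qty`). It is not the production code, but it shows why concurrent pre-emptions of the same hot SKU serialize on a single row lock: the `FOR UPDATE` read holds the row lock until the transaction commits.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

/**
 * Minimal sketch of the database-carried pre-emption: read available
 * inventory, then deduct it, inside one transaction. For a hot SKU all
 * transactions contend on the same row lock, which serializes them.
 * Table and column names are assumptions for illustration.
 */
public class DbPreemptSketch {

    public static boolean preempt(Connection conn, String skuId, int qty) throws Exception {
        conn.setAutoCommit(false);
        try (PreparedStatement check = conn.prepareStatement(
                     "SELECT available_qty FROM sku_stock WHERE sku_id = ? FOR UPDATE");
             PreparedStatement deduct = conn.prepareStatement(
                     "UPDATE sku_stock SET available_qty = available_qty - ?, "
                   + "reserved_qty = reserved_qty + ? WHERE sku_id = ?")) {

            check.setString(1, skuId);
            try (ResultSet rs = check.executeQuery()) {
                // Row lock is now held until commit/rollback.
                if (!rs.next() || rs.getInt(1) < qty) {
                    conn.rollback();
                    return false;            // not enough inventory
                }
            }
            deduct.setInt(1, qty);
            deduct.setInt(2, qty);
            deduct.setString(3, skuId);
            deduct.executeUpdate();
            conn.commit();                   // lock released here
            return true;
        } catch (Exception e) {
            conn.rollback();
            throw e;
        }
    }

    public static void main(String[] args) throws Exception {
        // Connection details are placeholders.
        try (Connection conn = DriverManager.getConnection(
                "jdbc:mysql://localhost:3306/inventory", "user", "password")) {
            System.out.println("preempted: " + preempt(conn, "SKU-HOT-001", 2));
        }
    }
}
```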

3. Tuning and re-testing

◦Storage layer transformation (see Inventory Center - system architecture diagram of the inventory pre-emption scenario): after the first round of stress testing and analysis, to remove the known performance bottleneck the data architecture was upgraded from the database directly carrying batch inventory pre-emption to Redis carrying the main request pressure. Redis's high-throughput capability addresses the data read/write efficiency problem in concurrent scenarios, and Redis is placed in front to carry the main traffic for hot SKUs.

◦Consistency guarantee (see Inventory Center - simplified diagram of the inventory change monitoring mechanism; a minimal deduction and write-back sketch follows the diagrams below)

▪To ensure data consistency between the cache layer and the database layer, on a cache hit the database is written back asynchronously through a scheduled task or MQ.

▪On a cache miss (breakdown), the flow is read (database) first, write (Redis) next, then feed back (API), with an asynchronous write-back to the database afterwards to ensure data consistency.

Inventory Center-Inventory pre-occupation scenario system architecture diagram

Inventory Center-Simplified Diagram of Inventory Change Monitoring Mechanism
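The following is a minimal sketch of the cache-carried deduction, assuming the Jedis client, an illustrative `stock:{skuId}` key layout, and a placeholder write-back queue; it is not the production implementation. The Lua script makes check-and-deduct atomic on the Redis side, a cache miss falls back to read (database) then write (Redis) then retry, and a successful deduction is handed to an asynchronous write-back step standing in for the scheduled-task/MQ write-back described above.

```java
import java.util.Collections;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

import redis.clients.jedis.Jedis;

/**
 * Minimal sketch of Redis-carried inventory pre-emption with async DB
 * write-back. Key layout, queue, and DB loader are illustrative
 * assumptions, not the production implementation.
 */
public class RedisPreemptSketch {

    // Atomic check-and-deduct: returns remaining stock, -1 if insufficient, -2 if key missing.
    private static final String DEDUCT_LUA =
            "local stock = redis.call('GET', KEYS[1]) "
          + "if not stock then return -2 end "
          + "if tonumber(stock) < tonumber(ARGV[1]) then return -1 end "
          + "return redis.call('DECRBY', KEYS[1], ARGV[1])";

    private final Jedis jedis;
    private final BlockingQueue<String> writeBackQueue = new LinkedBlockingQueue<>();

    public RedisPreemptSketch(Jedis jedis) {
        this.jedis = jedis;
    }

    public boolean preempt(String skuId, long qty) {
        String key = "stock:" + skuId;
        Long result = (Long) jedis.eval(DEDUCT_LUA,
                Collections.singletonList(key),
                Collections.singletonList(String.valueOf(qty)));

        if (result == -2) {
            // Cache miss: read DB first, write Redis, then retry the deduction.
            long dbStock = loadAvailableFromDb(skuId);     // placeholder DB read
            jedis.set(key, String.valueOf(dbStock));
            jedis.expire(key, 7 * 24 * 3600);              // 7-day key validity, as in the text
            result = (Long) jedis.eval(DEDUCT_LUA,
                    Collections.singletonList(key),
                    Collections.singletonList(String.valueOf(qty)));
        }

        boolean ok = result >= 0;
        if (ok) {
            // Async write-back: in production this is a scheduled task or MQ message.
            writeBackQueue.offer(skuId + ":" + qty);
        }
        return ok;
    }

    private long loadAvailableFromDb(String skuId) {
        return 1000L; // placeholder for the real database read
    }

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            RedisPreemptSketch sketch = new RedisPreemptSketch(jedis);
            System.out.println("preempted: " + sketch.preempt("SKU-HOT-001", 2));
        }
    }
}
```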

◦Re-test conclusions

▪After completing the data architecture upgrade and the hot-SKU cache warm-up, the load started at QPS=1100 and increased in steps of 100. At TPS=1200, TP99 ≈ 130ms and the system stably supported the batch inventory pre-emption business.

▪At TPS=1300, TP99 fluctuated significantly (spikes ≈ 420ms) and the CPU usage of the "cache layer interaction" application soared above 90%. The stability of the core link deteriorated, so the test was stopped.

▪Compared with the database-carried mode, after the cache upgrade TP99 meets the expectation (≤500ms) and the TPS carrying capacity increased by 2300% = (1200-50)/50.

"Inventory pre-emption" main application TP99+TPS trend

"Inventory pre-emption" main application hardware resource trend

Database key indicators (CPU)

Database key indicators (slow SQL)

Database key indicators (memory)

Redis cluster key indicators

4. Thinking about system robustness

◦**Disadvantages of full caching:** different industries in the supply chain differ greatly in SKU category life cycle (for example, the apparel industry ≈ 3 months). A full-caching mode leaves a large number of inactive categories in Redis, resource consumption grows uncontrollably, and resource costs rise. A more effective caching solution is needed.

◦**Necessity of cache preheating and heat preservation:** the cache hit rate is closely tied to the preheating mechanism and the heat-preservation strategy.

▪Necessity: the regular rhythm of major promotions and the opening sales period trigger the first cache initialization, and the overlap between promotional categories and daily-sales categories determines the probability of the first cache breakdown. The current key validity period is 7 days, while the interval from the pre-sale period to the "good start" to the peak period exceeds 7 days; without a heat-preservation strategy, the cache is likely to have expired before the next promotion node.

Cache hit rate trend during the 11.11 "good start" period

Although the system as a whole carried the traffic smoothly, the cache hit rate curve shows that there is still room for improvement.

▪Preheating ideas: to keep the cache effective during specific periods such as major promotions and to improve the cache hit rate (that is, reduce the probability of breakdown), multi-dimensional analysis and research can be done in advance, including but not limited to big-data analysis of the category distribution of pre-promotion centralized purchasing, the density and distribution of promoted categories in previous major promotions and at key nodes, and research into key customers' promotion plans, combined with technical means for prediction, preheating, and heat preservation.

◦**Cache preheating practice:** by analyzing the SKU category overlap between a customer's centralized purchasing period before a major promotion and the promotion nodes themselves, the following patterns were found

▪Centralized purchasing perspective: SKU category overlap during the promotion period is ≈69% relative to the "good start" categories and ≈75% relative to the 11.11 categories.

▪Sales perspective: SKU categories in the pre-sale period overlap ≈94% with the "good start" categories, and the "good start" categories overlap ≈75% with the 11.11 categories.

▪The above data shows that preheating the cache with the available-inventory data of SKUs from the centralized purchasing period and the previous promotion period, before key promotion nodes such as the "good start" and 11.11, helps improve the cache hit rate of pre-emption requests (a minimal warm-up sketch follows the overlap figure below).

Analysis of SKU category overlap across major promotion phases
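A minimal warm-up sketch based on the idea above: before a key promotion node, the SKUs identified from the centralized purchasing period and the previous promotion are loaded into Redis in advance. The Jedis client, key layout, TTL, and the placeholder database loader are assumptions for illustration.

```java
import java.util.Arrays;
import java.util.List;

import redis.clients.jedis.Jedis;
import redis.clients.jedis.Pipeline;

/**
 * Minimal cache-preheating sketch: push available inventory for the
 * predicted hot SKUs into Redis before the promotion starts. The SKU
 * list and DB loader are illustrative placeholders.
 */
public class CacheWarmUpSketch {

    public static void warmUp(Jedis jedis, List<String> predictedHotSkus, int ttlSeconds) {
        Pipeline pipeline = jedis.pipelined();             // batch the writes
        for (String skuId : predictedHotSkus) {
            long available = loadAvailableFromDb(skuId);   // placeholder DB read
            String key = "stock:" + skuId;
            pipeline.setex(key, ttlSeconds, String.valueOf(available));
        }
        pipeline.sync();
    }

    private static long loadAvailableFromDb(String skuId) {
        return 1000L; // placeholder
    }

    public static void main(String[] args) {
        // The predicted hot SKUs would come from the centralized-purchasing and
        // previous-promotion overlap analysis described above.
        List<String> predicted = Arrays.asList("SKU-001", "SKU-002", "SKU-003");
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            warmUp(jedis, predicted, 7 * 24 * 3600);       // 7-day validity, per the text
        }
    }
}
```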

◦**Abnormal scenario identification:** inventory scenarios place high requirements on data accuracy, timeliness, and completeness. During two-way synchronization between the database and the cache, business exceptions caused by consistency issues must be avoided.

▪Oversell exception identification: during the sales peak, cache synchronization is rate-limited to protect the primary database, which delays synchronization between the cache and the database, so deductions for the same SKU are not applied at the database layer in time. If a key expiration is superimposed on this situation, the interface returns the stale MySQL data directly, which may cause an oversell business exception.

▪System optimization ideas (a minimal sketch of the dynamic plan follows this list)

▪Static plan: during the order-volume peak period, extend the key validity period so that it covers the key intervals of the major promotion.

▪Dynamic plan: add a hot-SKU cache validity extension strategy. On day T-1 before key expiration, SKUs whose average daily pre-emption request volume is greater than 1 automatically have their key validity period extended.
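A minimal sketch of the dynamic plan, assuming a hypothetical per-SKU daily request counter and the same illustrative key layout as above: keys expiring within one day (the T-1 window) are extended when the SKU's average daily pre-emption request volume exceeds the threshold.

```java
import java.util.Set;

import redis.clients.jedis.Jedis;

/**
 * Minimal sketch of the dynamic key-validity extension: scan stock keys
 * that expire within one day and extend the TTL of SKUs whose average
 * daily pre-emption request volume is above the threshold. Key layout
 * and the request counter are illustrative assumptions.
 */
public class TtlExtensionSketch {

    private static final int ONE_DAY_SECONDS = 24 * 3600;
    private static final int EXTENDED_TTL_SECONDS = 7 * 24 * 3600;   // back to 7 days
    private static final double DAILY_REQUEST_THRESHOLD = 1.0;

    public static void extendHotKeys(Jedis jedis) {
        Set<String> stockKeys = jedis.keys("stock:*");   // fine for a sketch; use SCAN in production
        for (String key : stockKeys) {
            long ttl = jedis.ttl(key);
            if (ttl < 0 || ttl > ONE_DAY_SECONDS) {
                continue;                                // not within the T-1 window
            }
            String skuId = key.substring("stock:".length());
            if (avgDailyPreemptRequests(jedis, skuId) > DAILY_REQUEST_THRESHOLD) {
                jedis.expire(key, EXTENDED_TTL_SECONDS); // keep the hot SKU warm
            }
        }
    }

    // Hypothetical counter: assumes a per-SKU daily request count is maintained elsewhere.
    private static double avgDailyPreemptRequests(Jedis jedis, String skuId) {
        String count = jedis.get("preempt:count:" + skuId);
        return count == null ? 0 : Double.parseDouble(count);
    }

    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            extendHotKeys(jedis);
        }
    }
}
```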

5. Thoughts on improving testing strategies

◦Scenario expansion

▪Live-streaming e-commerce is rapidly becoming mainstream (in the first three quarters of 2023, national live-streaming e-commerce sales reached 1.98 trillion yuan, up 60.6%, accounting for 18.3% of online retail sales and lifting online retail growth by 7.7 percentage points). Compared with traditional e-commerce, its limited-time promotion model combined with social spread leads to large instantaneous traffic on single products, lower category overlap between promotion sessions, and a higher promotion frequency, which places different requirements on system performance.

▪The performance testing strategy should be derived from this: from a platform perspective, increase SKU diversity as much as possible while reducing the category overlap of the SKUs within a single stress test request, so as to identify potential performance risks in real, complex scenarios.

◦Efficiency improvement: warehouse distribution order performance testing in complex scenarios requires a large reserve of basic data (commodities, inventory) as well as the preparation of highly complex interface request data. How can basic data such as products and inventory be prepared quickly? And can the message body of an order request be constructed and assembled automatically according to the required SKU density and complexity? Extensions to the existing stress testing framework are needed to support one-click initialization from basic data to complex documents, reduce the difficulty of constructing complex scenarios, and improve testing efficiency (a minimal data-builder sketch follows).
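A minimal sketch of the one-click construction idea: generate base SKU records, then assemble an order request body with a configurable number of SKU lines (density). All field names and the message shape are illustrative assumptions, not the real interface contract.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.concurrent.ThreadLocalRandom;

/**
 * Minimal sketch of one-click test data construction: build base SKU
 * records, then assemble an order request body with the required SKU
 * density. Field names and message shape are illustrative assumptions.
 */
public class TestDataBuilderSketch {

    record SkuRecord(String skuId, long availableQty) {}

    // Step 1: base data (in practice this would be written to DB/cache).
    static List<SkuRecord> buildBaseSkus(int count) {
        List<SkuRecord> skus = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            skus.add(new SkuRecord("SKU-" + String.format("%06d", i),
                    ThreadLocalRandom.current().nextLong(100, 10_000)));
        }
        return skus;
    }

    // Step 2: assemble an order request body with N SKU lines.
    static String buildOrderRequest(List<SkuRecord> pool, int skuLinesPerOrder) {
        StringBuilder body = new StringBuilder();
        body.append("{\"orderNo\":\"").append(UUID.randomUUID()).append("\",\"lines\":[");
        for (int i = 0; i < skuLinesPerOrder; i++) {
            SkuRecord sku = pool.get(ThreadLocalRandom.current().nextInt(pool.size()));
            if (i > 0) body.append(',');
            body.append("{\"skuId\":\"").append(sku.skuId())
                .append("\",\"qty\":").append(ThreadLocalRandom.current().nextInt(1, 5))
                .append('}');
        }
        return body.append("]}").toString();
    }

    public static void main(String[] args) {
        List<SkuRecord> pool = buildBaseSkus(10_000);      // diverse SKU base data
        String request = buildOrderRequest(pool, 20);      // 20 SKU lines per document
        System.out.println(request);
    }
}
```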

3. Analysis, identification and optimization of invalid calls

During the traffic analysis phase of performance testing, combined with business scenario research, suspected performance bottlenecks were identified in advance.

After driving the investigation and adjusting the calling logic of the core link, the total number of core interface calls during the calibrated business window was reduced by 60%↓.

Business scenarios were segmented further to deduce the remaining tuning space.

1. Background

After an order is shipped out of the warehouse, the logistics system uses the order details query application to provide external query capabilities for the order and its associated package details. It is mainly called by external systems (Top 2 callers: access callback 67%, fulfillment callback 11%). After a document is shipped out of the warehouse, it outputs basic order information such as the outbound quantity and package details.

Key (Top2) caller topology

2. Scenario investigation and identification of doubtful points

◦Scenario research and risk prediction (production traffic analysis)

▪A call volume trend analysis was conducted on the "Order and Package Details Query Interface", sampling 06:30~23:00 on October 12, 2023 (the traffic analysis period). Compared with the same period of the most recent major promotion (its request peak), the total peak call volume of the Top 2 callers surged by 305%.

▪Based on preliminary research, from the call volume perspective the average warehouse outbound capacity is normally ≈400,000 orders/minute, and the daily warehouse outbound peak is 08:00~18:00. The "regular ratio" of warehouse outbound count to the peak call volume of the "Order and Package Details Query Interface" is ≈1:10.

▪Observation of the online data for October 12 showed the ratio of warehouse outbound count to the peak call volume of the "Order and Package Details Query Interface" to be (400,000 / 6,532,200) ≈ 1:16, a large deviation from the "regular ratio".

▪Through the above production traffic analysis, the call volume of the "Order and Package Details Query Interface" during the warehouse outbound peak was identified as doubtful, and further in-depth analysis was conducted.

Key application call volume during the latest promotion period

Volume of key application calls on October 12, 2023

◦Coarse screening of the call chain

▪At the warehouse-allocation outbound document dimension, the fulfillment callback application calls the warehouse details query interface when pushing outbound details to the order system.

▪The access callback application calls the warehouse details query interface when returning order information.

▪Fulfillment callback peak call volume / access callback peak call volume ≈ 1:9. The access callback peak is clearly larger, so the suspect system (the access callback application) was gradually pinned down.

◦In-depth analysis of the doubtful points

▪In-depth investigation first confirmed that the early judgments about the abnormal traffic and the suspect system were basically accurate.

▪At the technical architecture level, the access callback application called the target interface without checking the order status, which produced a large number of invalid calls for documents that had not yet been shipped out of the warehouse and therefore had no shipping details.

▪It was also discovered that the alias configuration of the AB test environment was incorrect, causing test traffic to be superimposed on production traffic by mistake.

3. Tuning strategy

◦Call logic adjustment (a minimal sketch of the guard follows the before/after diagrams below)

▪ "I" In the order return stage of the business scenario, if the document status is before shipment, the "Order Package Details Query Interface" call will not be initiated and invalid queries will be eliminated.

▪Based on the final returned content (whether detailed information is required), determine the necessity of the call and eliminate unnecessary queries.

◦Adjust the AB test environment alias configuration to prevent test traffic from putting unnecessary pressure on the production environment.

Access callback application logic before optimization

Access callback application logic after optimization
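A minimal sketch of the optimized guard in the access callback path: the detail query is skipped when the document has not been shipped or when the returned content does not need detail information. The status values, interfaces, and field names are illustrative assumptions.

```java
/**
 * Minimal sketch of the optimized access-callback logic: only call the
 * order and package details query interface when the document has been
 * shipped and the returned content actually needs detail information.
 * Enum values, interfaces, and field names are illustrative assumptions.
 */
public class AccessCallbackSketch {

    enum OrderStatus { CREATED, PICKING, SHIPPED }

    interface DetailsQueryClient {
        String queryOrderAndPackageDetails(String orderNo);
    }

    private final DetailsQueryClient detailsClient;

    public AccessCallbackSketch(DetailsQueryClient detailsClient) {
        this.detailsClient = detailsClient;
    }

    public String buildCallback(String orderNo, OrderStatus status, boolean detailsRequired) {
        // Guard 1: before shipment there are no outbound details, so do not call.
        // Guard 2: if the returned content does not need details, do not call.
        if (status != OrderStatus.SHIPPED || !detailsRequired) {
            return "{\"orderNo\":\"" + orderNo + "\",\"details\":null}";
        }
        String details = detailsClient.queryOrderAndPackageDetails(orderNo);
        return "{\"orderNo\":\"" + orderNo + "\",\"details\":" + details + "}";
    }

    public static void main(String[] args) {
        AccessCallbackSketch callback = new AccessCallbackSketch(
                orderNo -> "{\"packages\":1}");            // stub client
        // Not shipped yet: the detail query is skipped entirely.
        System.out.println(callback.buildCallback("ORD-1", OrderStatus.PICKING, true));
        // Shipped and details required: the query is made.
        System.out.println(callback.buildCallback("ORD-1", OrderStatus.SHIPPED, true));
    }
}
```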

4. Tuning effect

◦Compared with before tuning (October 12), the total number of calls from the "access callback application" decreased by 60%↓ (before: 2,397,252,500; after: 925,890,100), and the peak call volume decreased by 64%↓ (before: 5,921,500; after: 2,121,800).

The figure below compares the call volume distribution before and after the adjustment.

5. Preliminary identification of performance risks

◦The stress test execution phase is not the only stage at which performance risks can be discovered. If performance risks can be identified and argued for during the traffic analysis phase, then the earlier the problem is found, the smaller the risk-control cost (resources) and the lower the quality risk.

6. Normalize OpsReview

◦Traffic anomaly observation: traffic analysis and performance risk identification need to combine actual production and operation characteristics with the interface's key call chains to define the normal pattern of system call volume. The called party needs to continuously identify call sources and their typical magnitudes, take inventory of external calling strategies, and investigate risks whenever call volume changes.

◦Coding standards: interface calling logic should be abstracted into a standard method, to avoid coding differences between individuals in collaborative development and reduce the probability of invalid queries.

◦Customized logic troubleshooting: the system contains many customized logics for non-standard services, and the risk of invalid queries in these special logics needs to be checked.

7. Potential tuning space deduction

◦Based on testing experience and a review of the business scenarios, subdivided non-standard customized processes were found under "scenario I", as well as a "scenario P" standard process parallel to "scenario I".

◦Together with R&D, the non-standard customized process in "scenario I" and the standard process in "scenario P" were analyzed in depth. It was confirmed that there is room for further optimization, and the optimization plan was clarified (as shown below).

4. Summary

Performance testing is a key measure for consolidating and upgrading system capabilities. Through the presentation of and reflection on typical cases, we explored room to improve system capabilities and performance testing strategies, so that core system links carry peak business traffic stably and efficiently while handling extreme scenarios calmly.

Author: JD Logistics Liu Rui et al.

Source: JD Cloud Developer Community, Ziyuanqishuo Tech. Please indicate the source when reprinting.
