"Digital Twin + Automation" - Tencent's Advanced Road to Basic Network Testing

To meet the fast-growing Tencent Cloud business's demand for the highest possible network quality, Tencent established a basic network testing laboratory, which has significantly reduced network failures caused by solution defects. However, as the basic network keeps growing, the complexity of the production network architecture rises rapidly, and insufficient testing caused by delivery deadlines and test environment constraints has become increasingly prominent. This article presents Tencent's answer to how testing capability can be optimized to reach that quality goal.

2. The continuous "evolution" of the basic network test system

To provide low-latency, high-throughput network services for businesses worldwide, Tencent Cloud has deployed more than 1 million servers across 26 physical regions and 70 availability zones. The network contains hundreds of thousands of devices, covering hundreds of models from dozens of suppliers, and runs multiple versions of the network architecture, so a network failure usually has a large-scale, serious impact on business. At the same time, because global businesses keep raising their requirements for interconnection quality and capacity, the basic network must continuously go through expansion, optimization, and upgrade iterations. At present, Tencent's basic network averages 32 architecture versions per year, and the annual change volume reaches 30,000, an average of more than 90 changes per day. Among all faults rated serious or above, those directly or indirectly related to changes account for more than 50% of both the fault count and the impact duration. How to keep a large, highly complex network running safely and iterating continuously without disrupting normal business operation is therefore a major problem for the basic network.

2.1 Traditional testing stage

The best strategy for improving network quality is to prevent problems before they happen. Tencent formally established a basic network testing laboratory in 2018, requiring that all operations of the production network be tested and verified in advance to achieve 100% test coverage of changes. The basic network testing laboratory has effectively improved the network quality in the past few years, but there are two core limitations in the testing process:

●  Cost constraints. Although the laboratory covers most of the equipment models and software versions used at scale in the production network, the number of devices is limited. In an actual test, the production network can only be abstracted into a typical model on which the corresponding policy configuration and traffic simulation are carried out;

●  Efficiency limitations. Testing requires manually adjusting the topology and configuration and running a large number of functional and performance tests, so test execution takes a long time. To reduce project delays, testers usually merge scenarios and processes of the same type to cut the number of test items.

These limitations force testers to abstract and combine test environments and scenarios, which inevitably brings the risk of missing tests.

2.2 Era of Intelligent and Efficient Testing

To meet the cost and efficiency challenges, Tencent's basic network has added a virtualized simulation environment and an automated test system to its testing, forming an automated test system that combines virtual and real environments. The system architecture is shown in Figure 1:

Figure 1 Basic network test system architecture

●  Device layer: the bottom layer consists of formal simulation, virtual network elements, and physical equipment. Formal simulation provides logical configuration analysis through mathematical computation. The physical environment combined with virtual network elements can build a test environment that approaches a 1:1 replica of the production network;

●  Channel layer: users or systems interact directly with devices through the channel layer to control device topology, configuration, and status. Devices expose several channels, such as the CLI for direct use by users and open upward-facing interfaces such as tRPC for calls from other systems;

●  System layer: Tencent provides various internal systems for the testing process, including equipment management, process management, and resource management. Applications issue control commands to the device layer by calling system APIs or through the operations system;

●  Application layer: basic network testing covers many kinds of tests, such as architecture evolution, traffic scheduling, and policy optimization. Each application uses the testing methods and environments that suit its characteristics.

3. Two "weapons" of basic network testing system

3.1 "Lingjing": Low-cost digital twin network simulation

Tencent's "Lingjing" network verification platform provides virtualization and formal verification capabilities, a strong supplement to the existing physical test environment. The three test methods (physical, virtualization, and formal simulation) each have their own strengths in simulation capability, scale, and cost.

Figure 2 Comparative analysis of the advantages and disadvantages of various test methods and applicable scenarios

●  Physical simulation testing uses the same equipment and software as the production network and supports all types of tests, including function, performance, and reliability. However, its cost is extremely high, so the simulated scale is small and only typical scenarios can be built;

●  Virtualization simulation testing provides virtual network elements for equipment from major commercial vendors, Tencent's self-developed network equipment, and testers. Their software functions closely match the production network, and the simulation cost is extremely low (an equivalent cost of 10 yuan per unit), so very large topologies can be built to verify the functions of various solutions and devices. Its drawbacks are that virtual network elements depend on vendor updates and cannot simulate device performance;

●  Formal simulation testing uses a self-developed software system built on open source to convert device configurations into a mathematical model, which helps analyze configuration defects such as routing black holes and reachability problems. Because this method does not rely on the device's hardware or software implementation, it is limited to logic testing. Combined with simulation testing, it enables dual verification of both logic and implementation.
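To make the idea of formal simulation more concrete, here is a minimal Python sketch of a configuration-level check for routing black holes and loops. It is only an illustration: the article does not describe the actual model, so the `Fib` structure, the `trace` and `audit` functions, and the device names are all hypothetical, and the sketch uses exact prefix matching rather than longest-prefix matching.

```python
from typing import Dict, Optional

# Hypothetical forwarding model: device -> {prefix -> next-hop device}.
# A next hop of None means the device delivers the prefix locally.
Fib = Dict[str, Dict[str, Optional[str]]]


def trace(fib: Fib, src: str, prefix: str) -> str:
    """Follow next hops from src and classify the outcome for one prefix."""
    visited = set()
    node = src
    while True:
        if node in visited:
            return "loop"
        visited.add(node)
        routes = fib.get(node, {})
        if prefix not in routes:
            return "black hole"  # traffic arrives here but no route matches
        next_hop = routes[prefix]
        if next_hop is None:
            return "reachable"
        node = next_hop


def audit(fib: Fib, prefix: str) -> Dict[str, str]:
    """Classify every device's forwarding result for the prefix."""
    return {device: trace(fib, device, prefix) for device in fib}


# Illustrative data only: edge2 forwards to core2, which has no route.
fib = {
    "edge1": {"10.0.0.0/8": "core1"},
    "core1": {"10.0.0.0/8": "dc1"},
    "dc1": {"10.0.0.0/8": None},
    "edge2": {"10.0.0.0/8": "core2"},
    "core2": {},
}
report = audit(fib, "10.0.0.0/8")
```

Because the check works purely on the extracted forwarding data, it can run on every submitted configuration without any device, which is also why it cannot say anything about how a real implementation behaves.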

The mapping between common test scenarios and test methods is shown in Figure 3:

Figure 3 Classification of test methods applicable to various tests

3.1.1 Combining virtual and real environments for complementary strengths

By connecting and combining the different test environments, an integrated environment with complementary strengths can be formed. Combined with the existing operations system, it enables higher-quality and more efficient testing.

Figure 4 The connection between the virtualized environment and the real environment

●  Connecting the virtualized environment with the physical environment. The physical test environment can only simulate typical scenarios, with testers simulating traffic and the routes of other devices, but a tester cannot fully reproduce a vendor's implementation behavior. Once the virtual and physical environments are connected over the network, virtual and real devices can establish real BGP neighbor relationships and advertise routes, which yields a test environment closer to the production network. Because virtual network elements are so cheap, it even becomes possible to build a combined virtual and real test network that replicates the production network 1:1, improving the quality of network testing.

●  Connecting the virtualized environment with the operations system. The traditional physical environment differs considerably from the production network, so building the test environment and running test operations require abstraction followed by manual work. Tencent's basic network operations platform integrates functions such as device management, topology monitoring, traffic monitoring, configuration audit, automatic configuration generation and delivery, and automated change. Once a virtual environment provides a 1:1 replica, many of these functions can be reused for building the test environment and running tests: production configuration backups can be used directly for configuration analysis; the production network's connection relationships, equipment models, and configurations can be used directly to generate the virtual environment; and the platform's traffic monitoring and change procedures can run directly in the test environment for fast, accurate verification.

Figure 5 Automatic mapping and one-click generation from the production network to the test environment
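The mapping from production data to a test environment can be pictured with a small Python sketch. Everything here is an assumption made for illustration: the `VIRTUAL_IMAGES` catalog, the `Device` record, and `generate_virtual_env` are hypothetical names, not part of Tencent's platform. The sketch reuses production connection relationships, models, and configuration backups, and reports devices whose model has no virtual image, since virtual network elements depend on what vendors provide.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Hypothetical catalog: device model -> virtual network element image.
VIRTUAL_IMAGES: Dict[str, str] = {
    "vendorA-9000": "vendorA-vos:1.2",
    "selfdev-sw": "selfdev-vsw:3.0",
}


@dataclass
class Device:
    name: str
    model: str
    config_backup: str  # identifier of the production configuration backup


def generate_virtual_env(devices: List[Device], links: List[Tuple[str, str]]) -> dict:
    """Build a test environment description from production inventory data."""
    nodes, unsupported = {}, []
    for dev in devices:
        image = VIRTUAL_IMAGES.get(dev.model)
        if image is None:
            unsupported.append(dev.name)  # no virtual element for this model
            continue
        nodes[dev.name] = {"image": image, "config": dev.config_backup}
    # Keep only links whose two endpoints could both be virtualized.
    kept = [(a, b) for a, b in links if a in nodes and b in nodes]
    return {"nodes": nodes, "links": kept, "unsupported": unsupported}


env = generate_virtual_env(
    [
        Device("core1", "vendorA-9000", "backup/core1"),
        Device("leaf1", "selfdev-sw", "backup/leaf1"),
        Device("old1", "legacy-x", "backup/old1"),
    ],
    [("core1", "leaf1"), ("core1", "old1")],
)
```

In this sketch the unsupported devices are reported rather than silently dropped, so an operator could decide to back them with physical equipment instead.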

3.2 Automation: efficient systematic testing

With the construction of Tencent's R&D efficiency platforms in recent years, the basic network has moved test management onto a platform and automated test execution, which has greatly improved test efficiency and effectively safeguarded test quality.

3.2.1 Systematization of test management

Test management faces challenges such as unclear requirement sources, test cases that are hard to reuse and iterate on, and inconsistent standards for recording test results. Relying on Tencent's company-level R&D efficiency platforms (TAPD, Zhiyan, Workerbee, etc.), the basic network has brought the entire testing process online, from requirement triggering to result analysis, achieving systematic and automated management.

Figure 6 Test management system

3.2.2 Test Execution Automation

In the manual testing phase, testers had to modify the network topology and configuration by hand and then analyze the test results manually. This process was inefficient and prone to test omissions or deviations caused by human error. With the automation system in place, manually built standard test topologies and configurations, together with those of the production network, form a topology and configuration library used for automatic environment construction. For repetitive test processes, the corresponding protocol models, traffic models, and device actions are orchestrated as test code and executed automatically. After a test completes, an analysis model lets the system reason over the behavior monitoring results from the tester and the device under test and produce a preliminary test conclusion automatically.

Figure 7 Test automation process
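As a rough illustration of how an analysis model might turn monitoring results into a preliminary conclusion, the following Python sketch applies simple thresholds. The article does not describe the real analysis model, so the metric names, the limits in `THRESHOLDS`, and the `preliminary_verdict` function are all assumptions.

```python
from typing import Dict, List, Tuple

# Hypothetical pass criteria: metric name -> maximum acceptable value.
THRESHOLDS: Dict[str, float] = {"packet_loss_pct": 0.01, "convergence_ms": 500.0}


def preliminary_verdict(
    samples: Dict[str, List[float]],
    thresholds: Dict[str, float] = THRESHOLDS,
) -> Tuple[str, List[str]]:
    """Compare the worst observed value of each metric against its limit."""
    findings = []
    for metric, limit in thresholds.items():
        values = samples.get(metric)
        if not values:
            findings.append(f"{metric}: no data collected")
            continue
        worst = max(values)
        if worst > limit:
            findings.append(f"{metric}: worst value {worst} exceeds limit {limit}")
    # Anything unusual is flagged for human review rather than failed outright.
    return ("PASS" if not findings else "REVIEW", findings)


verdict, findings = preliminary_verdict(
    {"packet_loss_pct": [0.0, 0.005], "convergence_ms": [120.0, 730.0]}
)
```

Returning "REVIEW" instead of a hard failure reflects that the conclusion is preliminary: the system narrows down what a tester needs to look at.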

4. Best Practices

4.1 Enhancement of Backbone Network Overseas Traffic Engineering Capabilities

Overseas bandwidth resources are relatively scarce and expensive, which leads to complex backbone topologies and severe cost challenges in capacity planning. At the same time, transmission distances are long: a detour in a fault scenario may increase delay by more than 20% and degrade services, so the impact of topology design on quality is amplified.

The key difficulty in planning and design is predicting network-wide changes and impacts under complex topologies and varied faults. In a complex topology, the actual traffic matrix and its changes are important factors in addition to connection relationships, line resource distribution, device models, configurations, and protocol implementations. Tencent's basic network test system uses a virtual environment to achieve 1:1 simulation of the production network together with traffic simulation:

●  Network mirroring: Lingjing pulls live data from the production network's operations system and automatically generates a network mirror composed of virtual network elements, with a simulation fidelity above 95%. For example, the connections to the 256 devices in the data center's unified switching matrix (CUF) are automatically simplified according to the number of planes and link bundles, which works around the fact that the virtual OS from some commercial vendors cannot simulate all ports of a large chassis device;

●  Traffic playback: pull the traffic data that the production network's collection system gathers and stores using XFLOW and optical splitting, then dynamically generate simulated traffic in the test topology for a specified time and a specified range of network nodes.
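The traffic playback step can be sketched as selecting stored flow records by time window and node range, then aggregating them into a traffic matrix for the traffic generator. This is a simplified Python illustration; the `FlowRecord` fields and the `replay_matrix` function are hypothetical and do not reflect the actual XFLOW record format.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, Set, Tuple


@dataclass(frozen=True)
class FlowRecord:
    timestamp: int  # epoch seconds of the sample
    src_node: str
    dst_node: str
    bits: int


def replay_matrix(
    records: Iterable[FlowRecord], start: int, end: int, nodes: Set[str]
) -> Dict[Tuple[str, str], float]:
    """Average bits per second per (src, dst) pair inside [start, end)."""
    if end <= start:
        raise ValueError("end must be after start")
    totals: Dict[Tuple[str, str], int] = defaultdict(int)
    for rec in records:
        in_window = start <= rec.timestamp < end
        in_scope = rec.src_node in nodes and rec.dst_node in nodes
        if in_window and in_scope:
            totals[(rec.src_node, rec.dst_node)] += rec.bits
    duration = end - start
    return {pair: total / duration for pair, total in totals.items()}


records = [
    FlowRecord(0, "SG", "FRA", 600),
    FlowRecord(30, "SG", "FRA", 600),
    FlowRecord(90, "SG", "FRA", 999),  # outside the window
    FlowRecord(10, "SG", "TYO", 100),  # TYO is outside the node range
]
matrix = replay_matrix(records, start=0, end=60, nodes={"SG", "FRA"})
```

The resulting matrix is what a traffic generator in the test topology would be configured with for the chosen window.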

The goal of capacity planning is to deliver lower latency and higher quality at lower cost. In particular, the network must converge quickly and quality degradation must stay acceptable in typical fault scenarios. The verification system can directly simulate network failure scenarios, capacity expansion and reduction, and migration schemes, and measure how cost, delay, and availability change under different degrees of failure, helping designers choose the optimal scheme.

Figure 8 Capacity planning design model

Because overseas cable resources are relatively scarce, and submarine cables in particular are hard to repair and take a long time to fix (typically 1 to 6 months), cumulative multi-point faults are common during a repair period. This example therefore sets the maximum number of simultaneous fault points to k ≤ 3 and the peak bandwidth utilization to ≤ 80%, then compares the delay and cost changes of each alternative. With the current cost unchanged, optimizing connections and capacity design reduces the average delay in failure scenarios by up to 18%; with a 10% larger budget, the reduction can reach 35%. We continue to optimize resource deployment by cutting redundant lines, removing bottleneck points, and similar means, and we simulate and analyze the candidate optimization schemes to match the resource characteristics and network requirements of each region.
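The failure analysis described above can be approximated with a small Python sketch that enumerates every combination of up to k failed links, reroutes demands, and checks the peak utilization limit. This is a toy model under stated assumptions, not the actual planning system: it routes each demand on a single shortest-delay path (no load balancing or traffic engineering), and the `shortest_path` and `evaluate` functions, link names, and numbers are all illustrative.

```python
import heapq
from itertools import combinations


def shortest_path(links, src, dst):
    """Dijkstra over undirected links {(a, b): delay_ms}, keys sorted as a < b."""
    adj = {}
    for (a, b), delay in links.items():
        adj.setdefault(a, []).append((b, delay))
        adj.setdefault(b, []).append((a, delay))
    best, prev, heap = {src: 0.0}, {}, [(0.0, src)]
    while heap:
        dist, node = heapq.heappop(heap)
        if dist > best[node]:
            continue  # stale heap entry
        for nxt, delay in adj.get(node, []):
            cand = dist + delay
            if cand < best.get(nxt, float("inf")):
                best[nxt], prev[nxt] = cand, node
                heapq.heappush(heap, (cand, nxt))
    if dst not in best:
        return None, []
    path, node = [], dst
    while node != src:
        path.append(tuple(sorted((node, prev[node]))))
        node = prev[node]
    return best[dst], path


def evaluate(links, capacity, demands, k=3, max_util=0.8):
    """Check every scenario with at most k simultaneous link failures."""
    results = []
    for n in range(k + 1):
        for failed in combinations(sorted(links), n):
            alive = {l: d for l, d in links.items() if l not in failed}
            load = {l: 0.0 for l in alive}
            delays, connected = [], True
            for (src, dst), gbps in demands.items():
                delay, path = shortest_path(alive, src, dst)
                if delay is None:
                    connected = False  # demand cannot be delivered at all
                    continue
                delays.append(delay)
                for l in path:
                    load[l] += gbps
            peak = max((load[l] / capacity[l] for l in alive), default=0.0)
            results.append({
                "failed": failed,
                "avg_delay": sum(delays) / len(delays) if delays else None,
                "peak_util": peak,
                "ok": connected and peak <= max_util,
            })
    return results


# Illustrative triangle: the direct A-C link is slower than the path via B.
links = {("A", "B"): 10.0, ("A", "C"): 30.0, ("B", "C"): 10.0}
capacity = {link: 100.0 for link in links}
demands = {("A", "C"): 50.0}
results = evaluate(links, capacity, demands, k=2)
```

Comparing the `avg_delay` and `ok` fields across candidate topologies is the kind of comparison the text describes; the number of scenarios grows combinatorially with k, which is one reason automating the sweep matters.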

The planning and design model is built from the real topology and traffic information of the production network. It can pull the latest data in real time, traverse all feasible solutions, and run rolling tests automatically. Whereas the traditional manual evaluation takes about 20 man-days, the model completes all test items automatically within 24 hours, and it can be adjusted flexibly as the environment and goals change. During a construction period lasting several months, changes in business deployment, hot-spot traffic bursts, new demands from important businesses or customers, and topology adjustments occur from time to time; the model collects traffic and senses topology changes automatically in real time to update the solution.

4.2 Network change quality improvement

Taking advantage of automation and the combination of virtual and real environments, Tencent connected the virtual environment with the physical environment and the production network operations system, designed and developed a complete automated pre-change system, and embedded testing into the automated change process. The overall process of a typical change is as follows:

Figure 9 Schematic diagram of the whole process from pre-change verification to implementation

Take a bandwidth expansion change on the dedicated network as an example:

1. The operator first orchestrates the entire automated change process on the Tencent NetChange automated change platform. The process covers pre-change network status inspection, graceful device isolation, configuration deployment, grayscale switchback, post-change inspection, and the EOP emergency rollback operation to run if any step fails;

2. The pre-change system automatically pulls the corresponding topology. The operator sets the devices strongly related to the change as physical devices, and all other devices use automatically generated virtual devices;

3. Historical traffic for the corresponding time window is injected automatically and the automated change process is run. During the pre-change run, the operations systems continuously monitor network quality, routing changes, equipment performance, and so on, while the formal simulation system continuously performs logic analysis on each submitted configuration and automatically generates test reports;

4. Once the pre-change test passes and the operator confirms, the process is handed over to the production network for actual implementation.
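Steps 2 and 4 above can be sketched in a few lines of Python. The function names `assign_roles` and `may_promote`, the device names, and the check names are hypothetical; the sketch only illustrates the split between physical and virtual devices and the rule that a change reaches production only after every pre-change check passes and the operator confirms.

```python
from typing import Dict, Iterable, Set


def assign_roles(devices: Iterable[str], change_related: Set[str]) -> Dict[str, str]:
    """Devices tied to the change run on physical gear; the rest are virtual."""
    return {d: ("physical" if d in change_related else "virtual") for d in devices}


def may_promote(checks: Dict[str, bool], operator_confirmed: bool) -> bool:
    """Allow production rollout only with passing checks and explicit confirmation."""
    return bool(checks) and all(checks.values()) and operator_confirmed


roles = assign_roles(["pe1", "pe2", "core1", "core2"], change_related={"pe1", "pe2"})
checks = {"network_quality": True, "routing_stable": True, "formal_analysis": True}
ready = may_promote(checks, operator_confirmed=True)
```

Requiring a non-empty set of checks is a deliberate choice in this sketch: a run that produced no results should never count as a pass.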

The system has effectively improved test efficiency and quality in network change scenarios: the time to build the environment has been shortened from ≥1 day to ≤2 hours, test coverage has increased to 100%, the test environment satisfaction rate is close to 99%, the perception rate has dropped by 80%, and network quality has improved greatly.

5. Summary and Outlook

Test quality underpins the quality of the basic network. From architecture design to implementation, the closer the test environment is to the production environment and the more scenarios the tests cover, the higher the test quality. Introducing a simulation environment and automated processes effectively improves both the efficiency and the quality of testing. However, because the network contains many types of equipment, virtual network elements cannot match all of them; and because there are many types of tests, automated test cases need continuous iterative optimization. A complete one-to-one replica of the production environment and fully automated testing are therefore not yet achievable. As self-developed switches continue to be rolled out and iterated, the production network will eventually use them uniformly, and testing will no longer depend on network elements from commercial vendors. Together with the continued improvement of automation capabilities, this will bring a qualitative leap to the basic network test system, providing high-quality, efficient testing capabilities for the basic network.

Origin blog.csdn.net/wx17343624830/article/details/132473104