Exploring the Construction of a Global Safety Production and Quality Assurance System

Authors: Xiao Gangyi, Zhang Jun, Li Jinglei (Globalization Business Platform Team)

Globalized e-commerce differs from domestic e-commerce in business, technology, and architecture. From the perspective of safety production and quality assurance, these differences bring additional challenges. This article shares our experience with safety production and quality assurance in this context.

1. Introduction

As a team with deep domestic e-commerce technology experience, when serving the global e-commerce business we naturally inherit the domestic technology system to solve problems common to all e-commerce. At the same time, differences in business, technology, organization, culture, policy, and other dimensions push us to evolve new or better-suited technical systems: the stage of development our international e-commerce business is at, the characteristics of overseas infrastructure, the ways international business differs from domestic business, user and cultural characteristics, and far less controllable policy and compliance requirements. In a previous article we shared our experience with development technology and architecture; in this article we want to share another important piece of the technical field: our experience with safety production and quality assurance.

Before sharing technical experience, we first break down the global e-commerce business we serve along a few key dimensions. These are our understandings of key features of global e-commerce, accumulated over years of experience, and they are crucial to how we build our technology.

1.1 Business Differences

In terms of business, globalized e-commerce has two basic models: local-to-local and cross-border.

Lazada, a Southeast Asian e-commerce brand, is a typical overseas local-to-local business: it rapidly builds e-commerce systems that serve local buyers and sellers within a foreign country or region. This model means most merchants are located overseas, the currency per site is relatively uniform, and operating capabilities need to be more localized.

AliExpress is a typical cross-border business model: leveraging the strength of Chinese goods, it lets Chinese merchants sell to cross-border buyers through our system. The problems involved are quite different; language, currency, time zones, categories, campaigns, exchange-rate differences, logistics, supply chain, and so on each need different technology to bridge the gaps. An e-commerce business is mainly composed of the e-commerce system, the payment system, and the logistics system, and each of the three differs between its international and domestic technical implementations.

1.2 Differences in technical architecture

In terms of technology and architecture, there are also significant differences:

  • Differences in infrastructure: Most overseas e-commerce serves users across multiple countries, and overseas hardware facilities differ from domestic ones. With stability as the premise, our systems must achieve data synchronization, system reuse, compliance, and so on at lower cost. Our technical keywords here are multi-unit deployment, multi-data-center deployment, and cloud native, and these points are critical to how the quality system we build differs.

  • Differences in application architecture: In e-commerce technology we usually build a large technical network out of different applications or microservices, each split by its own domain. Unlike domestic e-commerce, our applications or services must be site-aware, that is, aware of country sites, and one service may even serve multiple country sites through multiple domain names at the same time. In addition, to let an agile middle platform support the business well, we designed the Cathedral & Bazaar application architecture, which qualitatively improved application flexibility and the relationship between R&D and production, while also raising the bar for guaranteeing system quality and stability.

  • Differences in the R&D process: The middle platform needs to abstract and merge e-commerce logic as much as possible, but country-level differentiation inevitably produces a large number of country-specific customization needs and, with them, a more complex organizational structure, whether in local-to-local businesses or in the localized operations of cross-border businesses. Our systems are therefore developed together with many more engineers who sit closer to e-commerce users, and we need more flexible application orchestration, parallel code development, isolated releases, traffic scheduling, and similar capabilities. In a more flexible and larger R&D process, quality control naturally becomes harder.

  • Differences in the running state: Considering QPS characteristics and similar factors, at runtime we achieve multi-tenant parallelism with architectural isolation for reasons of cost and performance. This also makes our work on complexity, testability, troubleshooting, test environments, automation, and data construction more complicated.

  • Differences in data synchronization: Even though our systems may be shared by tenants at runtime, data is physically or logically isolated by site out of consideration for privacy, security, compliance, performance, and scalability. In cross-border scenarios, making Chinese products quickly available worldwide requires second-level data synchronization. This makes system stability harder to guarantee and also increases the complexity of test cases.

Some of these differences are determined by the nature of the business; others are new problems we introduce when solving business or architectural problems technically. From the perspective of safety production and quality assurance, these differences bring both more challenges and more opportunities for technological innovation, and they sit at the core of the entire global technology system.

2. Safety production system

2.1 Overview of Global High Availability Architecture

System availability was first formalized in Patrick O'Connor's book Practical Reliability Engineering.
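
The standard definition is:

$$\text{Availability} = \frac{\text{MTBF}}{\text{MTBF} + \text{MTTR}}$$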

Among them, MTTR is Mean Time To Repair, the average time to repair the system after a failure, and indicates the system's maintainability. MTBF is Mean Time Between Failures, the average time between failures, and indicates the system's reliability.

There are therefore two main directions for keeping the system highly available day to day: increase MTBF, lengthening the failure-free time and improving overall reliability and fault tolerance through stability patterns and system redundancy; and reduce MTTR, shortening recovery time by streamlining incident response, change management, monitoring, and operations, thereby improving maintainability.

Beyond these two day-to-day directions, a large e-commerce system must also stay stable under the traffic peaks of a big promotion: finding system bottlenecks within the relatively short preparation window before the promotion and keeping a grip on the core links overall is an aspect of high-availability construction that cannot be ignored.

Therefore, this article will discuss the global high-availability architecture in three areas:

  • High availability system construction

  • Stability assurance construction for big promotions

  • System high-availability architecture

2.2 High availability system construction

At present, the common industry standard for measuring high availability is 1-5-10: 1 minute to discover, 5 minutes to respond and locate, 10 minutes to recover. In past practice, achieving 1-5-10 depended heavily on the SRE team: its familiarity with the business, its troubleshooting methods, and its hands-on proficiency. As system complexity grows and link dependencies become more tangled, 1-5-10 places ever higher demands on the SRE team.

High-availability faults differ somewhat from business faults in that high-availability faults are universal, so we can build targeted defenses around them. Next we introduce the high-availability system our international business built around fault definitions.

We will walk through how the whole high-availability system processes a fault in which the transaction order success rate drops. Assume the fault definition is: the order success rate drops by 5% for 10 minutes.

2.2.1 One-minute discovery

In the past, the fault notification was only issued after the success rate had already dropped 5% for 10 minutes, at which point the SRE team stepped in to handle the problem.

When the order success rate drops by 1% or 2%, or drops 5% for only 5 minutes, nobody is alerted. We wanted to move the response earlier and intervene as soon as system indicators turn abnormal.

We therefore added a risk early-warning mechanism (the solid line in Figure 1) between the moment system indicators turn abnormal and the moment the fault is declared. The risk warning is triggered by the GOC, after which the SRE team starts handling the problem, so there is a good chance the risk is eliminated before it ever becomes a fault.
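
As an illustration, here is a minimal Java sketch of the two-tier trigger; the thresholds and class names are hypothetical, and the real rules live in the GOC configuration:

```java
import java.time.Duration;

/** Sketch: pre-warn before the formal fault definition is met (hypothetical thresholds). */
public class OrderSuccessRateWatcher {

    // Formal fault definition: success rate drops by 5% for 10 minutes.
    private static final double FAULT_DROP = 0.05;
    private static final Duration FAULT_WINDOW = Duration.ofMinutes(10);

    // Risk early warning: intervene earlier, on a smaller and shorter anomaly.
    private static final double WARN_DROP = 0.01;
    private static final Duration WARN_WINDOW = Duration.ofMinutes(3);

    /** drop = baseline rate minus current rate; sustained = how long the drop has lasted. */
    public Signal evaluate(double drop, Duration sustained) {
        if (drop >= FAULT_DROP && sustained.compareTo(FAULT_WINDOW) >= 0) {
            return Signal.FAULT;        // GOC issues a fault notice, full SRE emergency process
        }
        if (drop >= WARN_DROP && sustained.compareTo(WARN_WINDOW) >= 0) {
            return Signal.RISK_WARNING; // GOC triggers a risk warning, SRE starts early
        }
        return Signal.NORMAL;
    }

    public enum Signal { NORMAL, RISK_WARNING, FAULT }
}
```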

Figure 1 Comparison of the risk early warning process and the original fault emergency process

2.2.2 Five-minute location

Once a problem is discovered, we need to locate the cause as fast as possible. High-availability faults fall into two categories: change-type and operation-type.

Locating either type of fault requires several pieces of information:

  • Which systems are involved in the fault, and what are their dependencies?

  • Which system is producing the current error: which is the symptom and which is the root cause?

  • Is there a change related to the failure, by time or by scope?

Figure 2 Five-minute location products

In the global business, core applications adopt a unified logging framework to collect application information, including all RPC call links, core output parameters (error codes, success flags), and middleware information. From the collected logs we can reconstruct the link topology of all core applications, plus real-time information such as interface success rates, error codes, and RT.

When a risk warning fires, two core pieces of information are known:

  • The entry application

  • The problem scenario

Combining the entry application, the problem scenario, and the link topology, we can narrow down which downstream nodes on the link may be at fault.

Beyond link information, we also collect change information, including release changes, configuration changes, and operational changes. A change-correlation algorithm in the "brain" (the decision engine) then computes the change most relevant to the current fault scenario.
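
To make the two localization steps concrete, here is a simplified Java sketch with hypothetical types; the production "brain" uses a much richer correlation model:

```java
import java.time.Duration;
import java.time.Instant;
import java.util.*;
import java.util.stream.Collectors;

/** Sketch: find suspect nodes from the link topology, then rank recent changes (hypothetical model). */
public class FaultLocator {

    record Node(String app, double successRate, List<String> downstream) {}
    record Change(String app, Instant time, String type) {}

    /** Walk the topology downstream from the entry application and collect anomalous nodes. */
    static List<String> suspects(Map<String, Node> topology, String entryApp, double threshold) {
        List<String> out = new ArrayList<>();
        Deque<String> queue = new ArrayDeque<>(List.of(entryApp));
        Set<String> seen = new HashSet<>();
        while (!queue.isEmpty()) {
            Node n = topology.get(queue.poll());
            if (n == null || !seen.add(n.app())) continue;
            if (n.successRate() < threshold) out.add(n.app());
            queue.addAll(n.downstream());
        }
        return out;
    }

    /** Rank changes: scope-related (touches a suspect app) and time-related (closest before the anomaly). */
    static List<Change> correlate(List<Change> changes, List<String> suspects, Instant anomalyStart) {
        return changes.stream()
                .filter(c -> !c.time().isAfter(anomalyStart))   // the change must precede the anomaly
                .filter(c -> suspects.contains(c.app()))        // and touch a suspect application
                .sorted(Comparator.comparing((Change c) -> Duration.between(c.time(), anomalyStart)))
                .collect(Collectors.toList());
    }
}
```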

With this approach, high-availability faults can be located within 2 minutes with an accuracy above 90%. Once located, the next step is recovery.

2.2.3 Ten-minute recovery

Of the 1-5-10 trio, the 10-minute recovery is often the hardest.

For change-type faults, configuration changes roll back quickly and can complete within 10 minutes. Release-type changes are harder to recover in 10 minutes because application startup time is involved. For releases still in progress, we built fine-grained traffic cutoff, which can cut traffic away from the problematic (already released) machines; for releases that have fully completed, recovery still relies on a complete rollback of the application.

2.3 Stability Assurance Construction for Big Promotions

The big promotion is the project with the largest manpower investment each year; guaranteeing its stability alone often takes thousands of person-days, spanning various review special projects and multiple full-link stress tests. Machine resources also peak every year. Given that promotion stability must be guaranteed, how to reduce the cost of people and hardware has become the main proposition of promotion assurance.

2.3.1 Guaranteeing promotion stability and creating promotion certainty

During preparation, the most critical activity is stress testing. A stress test divides into several key steps: traffic evaluation, load application, and stress-test review.

For traffic evaluation, the whole process and all its data have been moved into the system. Once the key promotion indicators are confirmed, they are entered into the system, which derives the traffic value of each core link; subsequent assurance work prepares hardware resources based on these traffic values. Developers in each business domain only need to keep maintaining the calculation formulas linking the key indicators to system entry traffic.

Past stress tests often suffered from insufficient load or missed load, which led to abnormal traffic at the promotion peak; such abnormal traffic has caused P1 failures. So during load application we collect key system traffic and performance indicators, compare them with the traffic-evaluation results above, and find the links whose traffic falls short of expectations, eliminating the risk by adding traffic or raising the load. In addition, the link topology engine compares daily traffic against stress traffic to find links that may have been missed.

With this method we uncovered many cases of insufficient or missed load, which kept the stress test faithful to reality and safeguarded the stability of the promotion.
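
A rough Java sketch of the evaluation-versus-observation check described above; the link names, ratios, and thresholds are all illustrative assumptions:

```java
import java.util.*;

/** Sketch: compare evaluated target traffic with observed stress-test traffic (hypothetical numbers). */
public class StressTrafficChecker {

    /** Expected QPS per core link, derived from key promotion indicators via maintained formulas. */
    static Map<String, Double> evaluateTargets(double entryQps) {
        Map<String, Double> targets = new HashMap<>();
        targets.put("createOrder", entryQps);           // every entry request creates an order
        targets.put("inventoryDeduct", entryQps * 1.2); // e.g. 1.2 items per order on average
        targets.put("payment", entryQps * 0.8);         // e.g. ~80% of orders proceed to payment
        return targets;
    }

    /** Flag links whose observed load is below target (low load) or entirely absent (missed load). */
    static void check(Map<String, Double> targets, Map<String, Double> observed) {
        targets.forEach((link, target) -> {
            double actual = observed.getOrDefault(link, 0.0);
            if (actual == 0.0) {
                System.out.printf("MISSED %s: no stress traffic at all%n", link);
            } else if (actual < target * 0.95) {
                System.out.printf("LOW    %s: %.0f qps observed, %.0f qps required%n", link, actual, target);
            }
        });
    }

    public static void main(String[] args) {
        Map<String, Double> targets = evaluateTargets(50_000);
        check(targets, Map.of("createOrder", 50_000.0, "inventoryDeduct", 40_000.0));
        // "payment" is absent from the observed traffic, so it is reported as a missed-load link
    }
}
```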

2.3.2 Cost reduction and efficiency improvement

2.3.2.1 Hardware cost

Big promotions share one trait: short cycles and huge traffic. Outside of stress tests and the promotion peak, nowhere near as many hardware resources are needed. We therefore built automatic scale-out and scale-in across the whole path from stress testing to the promotion itself, greatly reducing the container resources we use.

Stress tests: before a stress test starts, expand the load generators and business containers; after it ends, shrink them back.

Promotion day: capacity is expanded during the promotion window and shrunk outside it. In this way we can reduce container resources by 50%.
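
A minimal sketch of this scaling rhythm; the ContainerScaler interface and the replica counts are hypothetical stand-ins for the cloud-native scaling service:

```java
/** Sketch: expand before a stress test or promotion window, shrink afterwards (hypothetical API). */
public class ElasticCapacityPlan {

    interface ContainerScaler {               // stands in for the cloud-native scaling service
        void scaleTo(String group, int replicas);
    }

    private final ContainerScaler scaler;

    ElasticCapacityPlan(ContainerScaler scaler) { this.scaler = scaler; }

    /** Called by the stress-test platform just before the load generators start. */
    void beforeStressTest() {
        scaler.scaleTo("load-generators", 200);
        scaler.scaleTo("business-containers", 1_000);
    }

    /** Called when the stress test ends: release everything that was expanded. */
    void afterStressTest() {
        scaler.scaleTo("load-generators", 0);
        scaler.scaleTo("business-containers", 500);   // back to the daily baseline
    }
}
```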

Figure 3 Automated stress testing

2.3.2.2 Manpower efficiency

Preparing a global promotion spans multiple countries and time zones. For each preparation special project, the content to review has essentially been systematized; apart from the review meetings themselves, most of the work can be done in the system, which greatly reduces coordination and data-consolidation work. The person in charge of the promotion can also see the progress and risks of the current preparations from the system. Combined with automated tools such as automated stress testing and unattended operation, personnel investment drops sharply.

Across the past several promotions, the manpower for applying load has been cut by 80%, and the staffing of the promotion assurance project team by 30%.

Figure 4 Big Promotion Workbench

2.4 System High Availability Architecture

The chapters above covered high availability around faults and around promotion preparation; what remains is the high-availability architecture of each system. The high availability of every individual system is the foundation of the high availability of the whole business, yet the SRE team cannot fully understand every system. How to ensure all systems meet the same high standard is the direction we had to consider.

Here we introduce a high-availability measurement system that quantitatively evaluates each system's reliability, throughput, supervision (monitoring), and fault tolerance. For each score we provide suggested optimization strategies and methods, so the owners of each business system can optimize in a targeted way. The SRE team also operates this system: by continuously refining the measurement model and adding indicators, it keeps driving system owners to optimize and govern.
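
In sketch form, the scoring might look like the following; the dimension weights are illustrative assumptions, not the real measurement model:

```java
import java.util.Map;

/** Sketch: quantitative high-availability score per system (dimension weights are illustrative). */
public class HaScorecard {

    // Dimensions named in the measurement system; the weights are hypothetical.
    private static final Map<String, Double> WEIGHTS = Map.of(
            "reliability", 0.35, "throughput", 0.20, "supervision", 0.25, "faultTolerance", 0.20);

    /** scores: 0-100 per dimension, e.g. computed from indicator statistics collected online. */
    static double overall(Map<String, Double> scores) {
        return WEIGHTS.entrySet().stream()
                .mapToDouble(e -> e.getValue() * scores.getOrDefault(e.getKey(), 0.0))
                .sum();
    }

    public static void main(String[] args) {
        double s = overall(Map.of("reliability", 90.0, "throughput", 80.0,
                                  "supervision", 60.0, "faultTolerance", 70.0));
        System.out.printf("overall HA score: %.1f%n", s); // low dimensions get suggested optimizations
    }
}
```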

Figure 5 High availability measurement system

These indicators are typically computed after the fact from online data; the quality of each individual change still needs to be guaranteed by the technical quality system. The safety production system is the second line of defense built on top of the technical quality system: safety production is the shift-right of the quality system, and the quality system is the foundation of safety production. To keep both system and business stable, we made the same kind of effort and exploration in the quality assurance system.

3. Quality Assurance System

The quality assurance of a global e-commerce system must first solve the quality problems of the e-commerce business itself: different quality systems must be built on different e-commerce links to solve the most critical problems in each field. At the same time, given the business and technical differences between international and domestic e-commerce, the quality systems or their implementations will also differ. For example:

  • Multiple sites inflate the number of site applications considerably, and the kernel code owned by the business middle platform must run simultaneously inside every site application of every business. How do we test and cover it efficiently, and how do we measure coverage accurately?

  • Because online QPS is relatively low, the ratio of test environments to production is markedly higher and testing costs rise sharply. Meanwhile, parallel operation of multiple businesses means more parallel iterations, which significantly increases the number of test environments needed and worsens the over-consumption of test resources. How do we build a test-environment system that supports parallel business iteration efficiently while cutting environment costs dramatically?

Through years of technical iteration and several large-scale refactoring campaigns in the global e-commerce system, we solved many of these problems while supporting business and technology upgrades, and kept iterating the quality assurance system. We built efficient automation and continuous-integration capabilities, a more efficient test-environment technology and operating system, and turned asset-loss prevention from single-point checks into a sustained coverage system that can shift both left and right. In practice we implemented a standard R&D process with supporting measurement and operation tools. These efforts show clearly in improved technology and efficiency. At the same time, the testing capabilities of each domain are productized through the quality platform and offered as friendly services to other business teams.

3.1 Automation system

The challenge of building automation

Global Deployment & Legal Compliance

The previous article introduced the challenges of internationalized infrastructure, and automated testing faces the same ones. Because the object under test is deployed globally, and because of each country's laws and regulations, such as prohibitions on data leaving or entering the country, even basic demands like displaying test cases and test data or operating and maintaining test cases face great challenges.

Open architecture

The latest generation of the international architecture gives the business (bazaar layer) and the middle platform (cathedral layer) closed loops and independent iteration, which means each must be independently testable. Traditional interface testing only tests the complete application; independent testing of the bazaar and cathedral layers demands finer-grained automated testing, and shifts the focus from black-box automation toward gray-box automation.

3.1.1 Automation Practice

We designed the automation system according to classic layered-testing theory and applied traffic record-and-replay to multiple layers of the pyramid, including unit test generation, module test generation, interface testing, and link test cases, providing the test data to support them.

Figure 6 Hierarchical test pyramid

3.1.1.1 Unit testing

Unit testing is the development engineer's first choice for guaranteeing quality, but several problems still need solving:

  • Existing old applications lack unit testing foundation and need to be supplemented quickly;

  • It is difficult to construct test data for complex systems, and supporting tools are required;

  • System refactoring can lead to mass failure of unit tests, requiring quick fixes at low cost.

For the legacy code of old systems, complete unit test cases are generated directly from traffic, quickly filling in unit tests for existing code.

For new functions, static analysis quickly generates the basic "skeleton" of a unit test case; the development engineer only needs to supplement the test data to finish it.

For system refactoring scenarios, unit test cases are regenerated in batches through static analysis and re-recorded traffic, so unit tests become sustainable, renewable assets rather than a burden on development engineers.
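
For illustration, roughly what a generated skeleton could look like once the engineer fills in the data; the class and method names are hypothetical:

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;

/** Sketch of a statically generated unit-test skeleton; the engineer fills in the data. */
class OrderPriceServiceTest {

    @Test
    void calculatePayableAmount() {
        // --- generated arrange section: inputs inferred from the method signature ---
        long itemPriceCents = 10_000;   // filled by the engineer, or replayed from recorded traffic
        long discountCents  = 1_500;

        // --- generated act section ---
        long payable = OrderPriceService.payableAmount(itemPriceCents, discountCents);

        // --- generated assert section: expected value from recorded traffic or engineer input ---
        assertEquals(8_500, payable);
    }

    /** Minimal stand-in for the class under test, so the sketch is self-contained. */
    static class OrderPriceService {
        static long payableAmount(long price, long discount) { return price - discount; }
    }
}
```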

3.1.1.2 Interface Test

International business involves multi-tenant and multi-site features, so the cost of maintaining and managing use cases grows significantly as the business expands horizontally. More efficient use-case collection, maintenance, and management are therefore required.

For use-case collection we adopt automatic feature analysis and de-duplication: the use-case set accumulates automatically with no human involvement, after which expert experience can be added manually to enrich the features and improve coverage.
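
A minimal sketch of the feature-based de-duplication idea, with hypothetical feature fields:

```java
import java.util.*;

/** Sketch: distill recorded traffic into a use-case set by feature signature (hypothetical features). */
public class UseCaseDeduplicator {

    record Request(String api, String site, String tenant, boolean newBuyer, Map<String, String> params) {}

    /** The feature signature decides equivalence; requests with the same signature are one use case. */
    static String signature(Request r) {
        return String.join("|", r.api(), r.site(), r.tenant(), Boolean.toString(r.newBuyer()));
    }

    static Collection<Request> distill(List<Request> recorded) {
        Map<String, Request> cases = new LinkedHashMap<>();
        for (Request r : recorded) {
            cases.putIfAbsent(signature(r), r);   // keep the first sample per feature combination
        }
        return cases.values();                    // experts then add features to enrich coverage
    }
}
```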

For use-case management, the traditional, relatively heavyweight use-case-set management features are de-emphasized; use cases are maintained under system hosting and automatically scheduled at run time according to multi-tenant, compliance, and other conditions, which keeps runs accurate and cuts manual intervention costs.

On replay results, aggregated analysis of failure points and assertions effectively reduces the number of failure classes to triage, cutting manual troubleshooting costs, and each run produces code-coverage and business-coverage measurements.

Figure 8 Schematic diagram of interface test principle

3.1.1.3 Module Test

Besides unit testing, development engineers have a strong demand for small-scale integration testing of a single business scenario. We therefore propose a new testing method: module testing. Its granularity sits between unit testing and interface testing, and it mainly tests modules whose logic is relatively cohesive.

Figure 9 Comparison of test scope between module test, unit test and interface test

The idea of module testing is to run the test case directly inside the application, with access to the application's resources and instances while it runs, which is simpler and more realistic; yet the test case remains a JUnit case, consistent with unit tests.

The solution uses JavaAgent technology for real-time communication between the test case and the application under test, so a modified test case takes effect immediately: modify, then test right away.
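
Sketched below is what a module test case might look like; the @RunInApplication annotation and agent wiring are hypothetical, and a local stand-in replaces the live application beans to keep the example self-contained:

```java
import org.junit.jupiter.api.Test;
import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertTrue;

/**
 * Sketch of a module test: still a JUnit case, but executed inside the running
 * application via a JavaAgent, so it can use real in-application resources.
 */
// @RunInApplication("trade-center")   // hypothetical: routes execution into the app under test
class InventoryModuleTest {

    @Test
    void deductThenRestore() {
        // In a real module test this instance would come from the live application context;
        // here a local stand-in keeps the sketch self-contained.
        InventoryModule inventory = new InventoryModule(10);

        assertTrue(inventory.deduct(3));
        inventory.restore(3);
        assertEquals(10, inventory.available());
    }

    static class InventoryModule {
        private int stock;
        InventoryModule(int stock) { this.stock = stock; }
        boolean deduct(int n) { if (stock < n) return false; stock -= n; return true; }
        void restore(int n) { stock += n; }
        int available() { return stock; }
    }
}
```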

Table 1 Comparison of the characteristics of module testing, unit testing, and interface testing

Module testing not only solves development engineers' small-scale integration-testing problem, but also satisfies the independent-testing requirement of the bazaar and cathedral layers under the international open architecture.

3.1.1.4 R&D assistance

Besides layered testing, to improve R&D self-testing efficiency and pull quality assurance forward, we give development engineers self-test aids, including local interface testing before handover to QA and local independent joint debugging.

Interface self-test

Interface self-testing generally happens after unit tests and module function tests are done. The strategy is to reuse the interface-testing capability: provide the complete self-test flow and related functions inside the IDE, bound to the current code branch (change), so development engineers can finish testing quickly inside their own coding environment.

Figure 10 Interface self-test process

Local joint debugging

Joint debugging is the last step of R&D self-testing, and also a step where upstream and downstream are tightly coupled, interdependent, and inefficient. To improve its efficiency we implemented a local self-test and joint-debugging solution: all upstream and downstream parties share the same link use case, and each party modifies the data standing in for its dependencies according to the interface contract. Every participating engineer can then complete joint debugging relying only on the link use case, achieving thorough decoupling.

Figure 11 Principle of local joint debugging

3.1.2 Measurement system

3.1.2.1 Code coverage

Code coverage is one of the effective means of test measurement, but full-code coverage usually has a huge base, which hinders fine-grained management. In daily iteration, incremental code coverage is more intuitive: coverage can be managed per code change and improved continuously.
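
The principle can be sketched in a few lines: intersect the lines changed by the diff with the lines executed in the coverage run (e.g. from a JaCoCo report):

```java
import java.util.Set;

/** Sketch: incremental coverage = covered changed lines / all changed lines. */
public class IncrementalCoverage {

    /** changed: line numbers touched by this code change (e.g. parsed from a git diff).
     *  covered: line numbers executed during the test run (e.g. from a coverage report). */
    static double rate(Set<Integer> changed, Set<Integer> covered) {
        if (changed.isEmpty()) return 1.0;
        long hit = changed.stream().filter(covered::contains).count();
        return (double) hit / changed.size();
    }

    public static void main(String[] args) {
        Set<Integer> changed = Set.of(10, 11, 12, 40, 41);
        Set<Integer> covered = Set.of(10, 11, 12, 30, 31);
        System.out.printf("incremental coverage: %.0f%%%n", rate(changed, covered) * 100); // 60%
    }
}
```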

Figure 12 Principle of incremental code coverage

3.1.2.2 Analysis of influence surface

Measuring risk is also an important part of the measurement system, and impact-surface analysis helps R&D engineers assess the overall impact of each function change more accurately. By modeling the function-call relationships collected from production traffic, we form a "graph" of the relationships between all functions; through this graph, the upstream callers of each function, that is, the affected functions, can be analyzed.

Impact-surface analysis not only evaluates the impact of a function change on the distributed system, but under the international open architecture also helps cathedral-layer development engineers accurately assess the impact of cathedral-layer changes on the bazaar layer.
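
A compact sketch of the reverse traversal over the call graph; the function names are illustrative:

```java
import java.util.*;

/** Sketch: impact-surface analysis as a reverse walk over the function-call "graph". */
public class ImpactAnalyzer {

    /** callers.get(f) = functions that call f, modeled from production traffic. */
    static Set<String> impactedBy(String changedFunction, Map<String, List<String>> callers) {
        Set<String> impacted = new LinkedHashSet<>();
        Deque<String> queue = new ArrayDeque<>(List.of(changedFunction));
        while (!queue.isEmpty()) {
            for (String caller : callers.getOrDefault(queue.poll(), List.of())) {
                if (impacted.add(caller)) queue.add(caller);  // walk transitively upstream
            }
        }
        return impacted;
    }

    public static void main(String[] args) {
        Map<String, List<String>> callers = Map.of(
                "PriceCalc.discount", List.of("Order.create", "Cart.render"),
                "Order.create", List.of("CheckoutController.submit"));
        // Changing a cathedral-layer function surfaces every bazaar-layer caller above it.
        System.out.println(impactedBy("PriceCalc.discount", callers));
    }
}
```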

Figure 13 Schematic diagram of the principle of influence surface analysis

3.1.3 Continuous Integration for Automation Use Cases

3.1.3.1 What is continuous integration for automation use cases?

What distinguishes it from traditional continuous integration is the object under test: continuous integration of automated use cases is an innovation aimed at the corrosion of automated use cases. Continuous replay runs with the trunk pre-release deployment unit as the smallest replay unit, which reduces run noise and makes the run state fine-grained, and it provides friendlier quantification, including use-case pass rate, interface coverage, code coverage, and business coverage, to help use cases keep iterating.

3.1.3.2 Solution & Implementation

To improve generality and scalability, we productized the production, consumption, and anti-corrosion of use cases, replacing manual maintenance. The continuous-integration platform supports application access, use-case set management, and scheduled execution configuration.

Figure 14 Automated continuous integration

The continuous integration platform is mainly divided into 4 modules:

  • Use case management: support multiple automation use case platforms within the group

  • Accurate playback: define the source of use cases and improve the management accuracy of use cases

  • Anti-corrosion of use cases: Realize timely anti-corrosion of use cases through configurable rules, environmental control, and real-time alarms

  • Result measurement: provide multi-dimensional coverage report, more effective and convenient to help use case iteration

3.1.3.3 Effect

The platform makes it easy to observe the run results and coverage rates of the core objects under test over the last 20 days, and it also offers further optimization suggestions.

Figure 15 Continuous integration effect

3.2 Test environment

3.2.1 Test environment issues

  • Test efficiency: the stability of the test environment seriously affects upstream and downstream testing efficiency;

  • Test environment resources: peak demand brings concurrent preemption of test environments, while troughs leave redundant environments idle;

  • Test environment isolation: requirements within a subdomain must be isolated from one another, while joint debugging across subdomains must be supported quickly;

  • Test environment usage cost: the Q&A (support) cost and the construction cost of the environments.

3.2.2 Difficulties in the test environment

  • Test environment isolation must support isolation of both synchronous interface/HTTP services and asynchronous messages;

  • Test environment resources should be occupied elastically on demand, while the sprawl of pre-release environments is kept under control;

  • Test environment Q&A: business and system problems are often reported as environment problems, making troubleshooting time-consuming. How can problems be located quickly?

3.2.3 Test environment scheme

3.2.3.1 Test environment control plan

Plan: trunk and project pre-release construction; environment anti-corrosion and control; environment troubleshooting tools.

  • Offline environment: changes related to DB, Tair, and funds are controlled; other needs are outside the scope of control;

  • Pre-release environment: a project pre-release environment is introduced to support daily requirement changes. It is not subject to control and can be built on demand; resources are released once the change is released or closed. The project pre-release environment is isolated by environment tag; dependencies on applications that are not in the change list are routed to the trunk pre-release environment.

  • Release phase: this covers two situations. The first is production deployment after testing completes: once the trunk pre-release environment is deployed, functional regression verification is performed. The second is business UAT acceptance: the trunk pre-release environment is deployed from the UAT changes. For the trunk pre-release we added functional-regression and test-case gating. (The trunk pre-release is an environment kept consistent with the production code version, used to provide stable upstream and downstream dependencies.)

Figure 16 Flow of a change through the test environment

3.2.3.2 Project pre-release environment isolation plan

  • Flow forwarding process:

1) A user HTTP request enters the access layer, and the group network layer routes the traffic to the trunk pre-release environment;

2) The trunk pre-release environment marks the traffic with the routing identifier of the target project pre-release environment;

3) The traffic scheduling service resolves the target project pre-release environment, and the trunk pre-release environment forwards the request to it.

  • Traffic isolation scheme:

1) Isolation between subdomain projects: isolation is enforced through the traffic environment tag, so requirements do not affect each other, covering both messages and interface services;

2) Cross-domain joint debugging: a project pre-release environment can join a joint-debugging session with one click; the applications involved in the change are assigned to one pre-release joint-debugging environment, and applications sharing the same isolation ID call each other to support the joint test (a routing sketch follows below).
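
A minimal sketch of the tag-based routing rule, with hypothetical names; unmatched traffic falls back to the trunk pre-release environment, as described above:

```java
import java.util.Map;

/** Sketch: route by environment tag; untagged or unmatched traffic falls back to trunk pre-release. */
public class EnvTagRouter {

    static final String TRUNK = "trunk-pre-release";

    /** deployments.get(app) = map from environment tag to that app's isolated project environment. */
    static String route(String app, String envTag, Map<String, Map<String, String>> deployments) {
        Map<String, String> envs = deployments.getOrDefault(app, Map.of());
        // Same tag, same isolated environment; missing tag or no deployment means trunk.
        return envs.getOrDefault(envTag, TRUNK);
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> deployments = Map.of(
                "trade-app", Map.of("feature-123", "trade-app-env-123"));
        System.out.println(route("trade-app", "feature-123", deployments)); // isolated project env
        System.out.println(route("promo-app", "feature-123", deployments)); // falls back to trunk
    }
}
```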

Figure 17 Project pre-release environment isolation

3.2.3.3 Environmental anti-corrosion & troubleshooting

Purpose: monitor the trunk pre-release, project pre-release, and offline trunk environments to keep them available and to discover and treat environment corrosion.

Strategy: Actively discover environmental anomalies through service availability and business correctness monitoring.

Figure 18 Environmental Stability

So far, governance items such as "HSF service abnormality" and "business health below standard" have been accumulated to monitor and measure whether an environment is corroded.

As mentioned above, the business platform currently adopts the group's general traffic isolation scheme, which has the following advantages:

  • Different requirements can be tested at the same time without interfering with each other;

  • Developers can self-test and fix bugs in an isolated test environment, keeping development and testing efficient.

However, the following problems may be encountered during use:

  • A service exception in an isolated environment causes traffic to fall through to the trunk pre-release;

  • Deployment on and off the cloud is not peer-to-peer, so user traffic is routed to the trunk pre-release.

The harm of isolation failure includes missed test coverage caused by exercising the wrong branch, so it is necessary to confirm during testing that environment isolation is in effect. For long-link applications this is troublesome to check, so we also provide a link troubleshooting tool: given a request ID, it queries and analyzes every application environment and machine the request passed through, helping testers quickly rule environment problems in or out. It has worked very well in practice.

3.3 Quality tool platform

3.3.1 Problems

To reduce repetitive work and make test experts' experience transferable, many test teams build tool portals for things like data construction and measurement. But the deeper problem of tool building is how to keep operating the tools and keep iterating them in the right direction.

  • The huge, long-lived business team left tools scattered, with complex front-end and back-end links and no unified brand mindshare.

  • There is no overall planning across tools; capabilities are missing in places and overlapping in others, and wheel-reinvention is rampant.

  • Developing and maintaining new test tools in each domain is costly and unfriendly to newcomers.

  • During upgrades of the DevOps R&D system, cloud native, and other architectures, it is hard to integrate and upgrade the quality tools in step.

  • There is no unified metric for tool effectiveness, which makes operation and iteration difficult.

3.3.2 Architecture and Design

Figure 19 Quality tool platform design

Figure 20 Quality tool platform scheme

3.3.3 Effect

  • Unified the product and technical architecture of multiple scattered field testing tools to improve sustainable operation capabilities;

  • Provide 10+ kinds of platform testing capabilities to the outside world, and have carried out 2.0 upgrades, including automation, environment, fund security, test accounts, open tools, etc.;

  • The platform serves more than 8 different business departments, with daily average UV>100 and daily average PV>1000.

3.3.4 A case of quality platform incubation: feature recognition & business measurement

The quality platform upgrade not only improved existing tools; by clarifying the relationships between products and redefining new problems, it also incubated many new tools. The business measurement tool, implemented on the back of the label-system upgrade, is a typical example. For experienced testers, incomplete test-scenario coverage is an unavoidable problem; unlike code coverage, our goal in polishing the business measurement tool is to measure effective coverage from a more objective and realistic perspective.

Business pain points

  • Many use cases: automation for the various applications of international e-commerce mainly relies on use cases collected and accumulated from traffic, and a single application has anywhere from thousands to hundreds of thousands of them.

  • Many scenarios: consumer-link business scenarios are complex and hard to sort out, and different domains sort them out in different ways.

  • Insufficient coverage of existing scenarios: despite the many use cases, scenario coverage is insufficient. Taking transactions as an example, only the backbone links are covered, and special scenarios are easily missed.

  • Scenario coverage is not updated in time: new functions add new scenarios, use cases lag behind, and coverage of the new scenarios is often missed.

  • Impossible to measure: there is no agreed coverage calculation at project delivery, coverage results are hard to evaluate, and release criteria can only rely on human experience.

Design ideas

Figure 21 Test case feature identification and business measurement scheme

To collect online traffic more comprehensively, algorithmic analysis plus a small amount of manual work yields automatic extraction of single-application business labels, improving extraction accuracy and reducing label maintenance cost. Through traffic coloring and other data-analysis capabilities, the label-chain relationships are analyzed and managed online, so full business scenarios can be derived. Of course, this process takes multiple iterations to reduce the noise of special parameters, and the object under measurement must keep the produced full set of business scenarios synchronized and maintained while iterating. We have now rolled this out on several core applications, effectively supplementing a large number of previously uncovered test cases.

3.4 Asset Loss Prevention and Control

A fund security failure is an online fund-loss failure caused by technical reasons (coding defects, change execution, system design loopholes, security vulnerabilities, and so on). By who bears the loss, asset-loss scenarios divide into two categories:

  • Platform over-payment: buyers, sellers, the platform, suppliers, partners, and others suffer direct financial losses or fail to receive the discounts they were due (e.g., red envelopes or shopping allowances cannot be used), and the platform ultimately compensates with outflows of funds, equity, or money.

  • Platform under-collection: receivables such as platform commissions and freight charges are under-collected, or income decreases.

International e-commerce technology serves multiple businesses and country sites, and its deployment structure and operating model make asset-loss prevention harder: data-synchronization delays under multi-unit deployment, the complexity of multi-currency calculation and exchange-rate conversion in cross-border mode, fiscal and tax compliance, multiple time zones, multiple languages, and so on. We keep the overall asset-loss risk controllable through sustained investment in three areas: asset-loss scenario analysis, monitoring noise reduction, and intelligent monitoring rules.

3.4.1 Asset Loss Scenario Analysis

  • Business decomposition: for business modules with fund-loss risk, such as order placement, inventory deduction, and marketing freezes, we carry out internal consistency sorting plus supplementary reconciliation checks close to white-box testing, based on each module's characteristics; this generally involves sorting out the business, its technical risks, and its strong and weak dependencies (a reconciliation sketch follows this list).

  • Consistency with upstream: check the consistency of requests forwarded into this application, or of the messages this application must consume.

  • Consistency with downstream: check the consistency of requests this application forwards onward, or of the consumers of the messages this application sends.
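
A minimal reconciliation sketch for the downstream case, with hypothetical record types and a single invariant (paid amount equals payable amount):

```java
import java.util.Map;

/** Sketch: a minimal up/downstream consistency (reconciliation) check for a paid-order flow. */
public class OrderReconciliation {

    record Order(String id, long payableCents) {}
    record Payment(String orderId, long paidCents) {}

    /** Consistency with downstream: every order must have a payment of exactly the payable amount. */
    static void check(Map<String, Order> orders, Map<String, Payment> payments) {
        for (Order o : orders.values()) {
            Payment p = payments.get(o.id());
            if (p == null) {
                System.out.println("ALARM missing payment for order " + o.id());
            } else if (p.paidCents() != o.payableCents()) {
                System.out.printf("ALARM amount mismatch on %s: payable=%d paid=%d%n",
                        o.id(), o.payableCents(), p.paidCents());
            }
        }
    }
}
```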

3.4.2 Asset Loss Monitoring and Noise Reduction

Monitoring with heavy noise is ineffective monitoring. In asset-loss monitoring especially, without control the alarm volume easily reaches the level of 100,000 per day. We have now unified the monitoring and verification rules of all sites, including the e-commerce platforms, Lazada, and AE, into the fund security dashboard, and carried out sustained logic- and script-level alarm anti-corrosion on the large stock of existing reconciliation scripts.

Figure 22 Fund security monitoring dashboard

3.4.3 Intelligent monitoring rules

Mainstream fund-loss monitoring tools today require manual script writing and debugging. In international e-commerce, where site and business expansion is the norm, the marginal cost is high, so a low-cost yet complete way to generate fund-security check scripts is needed to reduce the technology input cost. Based on field-level traffic analysis, we compute field correlations algorithmically and generate fund security rules automatically.

Product Architecture Diagram:

Figure 23 Fund security monitoring rule generation

The algorithm module consumes the data set produced by the traffic-analysis module and runs the analysis: data processing, field-relationship analysis, conditional-relationship analysis, and result processing. The overall flow is shown below:

Figure 24 Field correlation judgment

Through the above analysis and data aggregation, the algorithm module outputs field relationships plus a reliability score, which the monitoring-rule generation module uses to generate monitoring items automatically or with manual confirmation.
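
A toy sketch of how one candidate field relationship could be tested against traffic samples and scored for reliability; real rule mining searches many candidate relationships, and all names here are hypothetical:

```java
import java.util.List;
import java.util.Map;

/** Sketch: test a candidate field relationship on traffic samples and report its reliability. */
public class FieldRuleMiner {

    /** Candidate rule: payable = total - discount (one of many relationships the algorithm tries). */
    static double reliability(List<Map<String, Long>> samples) {
        if (samples.isEmpty()) return 0.0;
        long hits = samples.stream()
                .filter(s -> s.get("payable") == s.get("total") - s.get("discount"))
                .count();
        return (double) hits / samples.size();
    }

    public static void main(String[] args) {
        List<Map<String, Long>> samples = List.of(
                Map.of("total", 100L, "discount", 20L, "payable", 80L),
                Map.of("total", 50L, "discount", 0L, "payable", 50L));
        double r = reliability(samples);
        if (r > 0.99) {   // only high-reliability relationships become monitoring rules
            System.out.println("emit rule: payable = total - discount (reliability " + r + ")");
        }
    }
}
```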

3.5 R&D process SOP

3.5.1 Existing Pain Points in the R&D Process

According to our survey, before the SOP existed the R&D process had many pain points; the top ones were:

  • Requirement reviews, technical reviews, and testing were all offline operations, with no online data;

  • The pre-release environment was unstable, and the R&D and test environments interfered with each other;

  • CR happened too late, so code review never delivered its full value;

  • Unit-test coverage was insufficient, and continuous integration tests did not play their full role;

  • Business coverage was hard to measure, and there were no gating checkpoints.

3.5.2 Building the R&D process SOP step by step

The derivation process of constructing the entire R&D SOP is as follows:

1) Observation: The current irregular R&D process affects R&D efficiency and R&D quality.

2) Research: Obtain appeals or pain points from different perspectives by investigating different stakeholders (PD, development, testing, etc.).

3) Screening: Screen out the demands or pain points that TOP-N urgently needs to solve.

4) Analysis: Analyze the status quo of these nodes and derive solutions.

5) Landing: Analyze which pain points can be solved by reusing or customizing existing tools, without reinventing the wheel, and which pain points need to be supported by developing new products.

6) Measurement: SOP measurement adopts differentiated operation. Basic metrics directly reuse existing R&D-efficiency products for data analysis and drill-down; new metrics are maintained in a separate set of visual reports for differentiated operation.

7) Growth: Through the analysis of data or feedback from stakeholders, new Top-N pain points will be generated, and then a continuously iterative and generalized R&D process SOP system will be formed.

Figure 25 R&D process SOP

The final results of the R&D process SOP system are as follows:

The SOP covers the entire R&D process, standardizing it through "screening key nodes + focused offline meetings + online process control", and counts, analyzes, and drills into the key-node data gathered along the way to locate bottlenecks in the R&D process, thereby improving engineering quality and R&D efficiency.

Figure 26 R&D process SOP control

3.5.3 SOP digital operation

The goal of the R&D SOP is to digitalize the R&D process, improve the smoothness of the R&D delivery process, standardize the R&D process, and improve project quality.

The SOP rollout mainly piloted with two teams in the department, merchants and transactions, adopting different strategies suited to each team's habits and the status of its R&D process:

1) The transaction team bound the SOP to technical-architecture upgrade projects, driving the upgrade of R&D efficiency and obtaining key data.

2) The merchant team bound the SOP to excellent-engineering projects and OKRs, promoting SOP adoption and efficiency improvement application by application, focusing on process increment metrics.

With key indicator data in hand, the SOP can be operated in depth. SOP measurement adopts differentiated operation: basic metrics directly reuse the measurement of the existing group R&D-efficiency products, while new metrics are kept in a separate set of customized reports that collect and analyze key-node information across the R&D process (requirement review, technical review, creation and change, deployment, test handover, and release) to help govern R&D efficiency.

Figure 27 R&D process SOP operation objectives

Figure 28 SOP digital operation results in the R&D process

3.6 Globalization of the Middle-Platform Quality Assurance System

The quality construction in the fields above is integrated into one quality system, which defines the focus and critical path of quality work. It is built on quality infrastructure, is measurable, and is bound to key business indicators; through SOP construction and operation it runs through and lands in the quality assurance process of the global e-commerce platform. We also continuously operate on quality data of all dimensions through a monthly quality report, and iterate the system upward.

4. Future Outlook

Cross-border, overseas, and globalized e-commerce have become Internet hot spots in recent years, and more and more companies are building cross-border e-commerce capabilities or products along various dimensions. Against unpredictable international situations and policies, the opportunities and challenges we face are highly uncertain. From a technical point of view, however, the object under test and the object of stability assurance must be deterministic. We need to keep exploring and accumulating experience and knowledge to support the evolution of global e-commerce technology. Going forward we will continue to explore safety production, high-availability governance, measurable and trustworthy quality assurance, and automated testing that moves closer to intelligence, supporting fast global technology and product iteration more efficiently, more clearly, and at lower cost, staying flexible in the face of change.
