Test environment usage problems and optimization practices | JD Logistics technical team

1. Introduction

We often hear developers and testers complain: "Why is the test environment unusable again?", "Someone has deployed the master package to the test environment!", "The test environment data has been changed again?", "What's wrong with the test environment? Deployment is so slow!", "Do I really have to wait while you use the shared services in the test environment?", "The test environment is down, and all my automated scripts failed!" The test environment is the foundation of test execution, and its soundness and stability directly affect project progress. A test environment is measured by quality, efficiency, and cost. Quality is mainly the stability of the environment; efficiency covers deployment, updates, and day-to-day use; cost is mainly resource usage and personnel maintenance. Together, these three directly affect the delivery quality and efficiency of a project.





 

Test environment problems are faced by every testing team. During the 2022 1024 Dark Horse Luncheon and exchanges with technical leaders, some colleagues also raised the questions of environment governance and improving the efficiency of environment use, and Cao Peng, Chairman of the JD Group Technical Committee, gave suggestions.





 



Ideally, test environment problems would be solved by having multiple environments, but server resource costs and personnel maintenance costs do not allow it. So we need to consider how to use specifications, processes, and our current tools and resources to keep the test environment stable and improve its efficiency. This article introduces some of the exploration and optimization we have done in test environment management.



2. Problem research

To address everyone's pain points in using the test environment, we ran a survey on the test environment within our second-level department and collected 20 samples from different positions. The survey results are as follows:

1. Test environment user group

Result: Most users of the test environment are R&D engineers and testers.





 

2. What is the general purpose of using the test environment?

Result: By frequency of use: general functional testing > R&D self-testing > performance testing.





 

3. Are you satisfied with the current compilation and deployment time of the environment?

Result: Respondents are not satisfied with the efficiency of current test environment deployment.





 

4. Do you think the current test environment is stable and available at any time?

Result: Respondents are not satisfied with the stability of the current test environment.





 

5. Which applications are the main cause of dissatisfaction with compilation and deployment time?

Application                  Feedback count
tms-bid-web-test             1
tms-tfc-web-test             1
tms-workbench-client-test    3
tms-workbench-web-test       1
tms-basic-web-test           1

The following are sample compilation and deployment times for the applications reported as slow in the survey:

tms-bid-web-test:





 

tms-tfc-web-test:





 

tms-workbench-client-test:





 

tms-workbench-web-test:





 

tms-basic-web-test:





 

6. What do you think are the main reasons the environment is unstable or unavailable?

Everyone shares one test environment for R&D joint debugging and regression testing, so problems occur easily.
Occasionally, the host address of the middleware test environment changes.
The workbench test environment currently has two machines. In principle, at least one machine should keep serving while the other restarts during deployment. In practice, however, the interval between the two deployments is a fixed 60s, and the first machine takes longer than 60 seconds from deployment to serving traffic, so both instances are unavailable at the same time. Because workbench is developed by multiple teams, a lot of time is spent waiting for the service to start, which lowers R&D and testing efficiency.
Interfaces time out frequently.
Downstream business testing relies on many applications. For example, testing the Jingyi app requires the back-end tfc, rfq, basic, cmc, jdi, and other environments to be continuously available, and during troubleshooting these services are found not to be continuously online.
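
The workbench problem above comes down to simple arithmetic: if the second machine is restarted a fixed interval after the first, and the first needs longer than that interval to start serving, both machines are down for the difference. The following is a minimal sketch of that calculation. The class and method names are hypothetical, both machines are assumed to take the same time to start, and the 90-second startup time in the note below is an illustrative assumption, not a measured value.

```java
// Hypothetical helper: estimates how long both instances are unavailable
// when the second restart begins a fixed interval after the first.
// Assumes both instances need the same time to start.
final class FixedIntervalOverlap {
    private FixedIntervalOverlap() {}

    /**
     * @param startupSeconds  time one instance needs from restart to serving traffic
     * @param intervalSeconds fixed delay before the second instance is restarted
     * @return seconds during which neither instance can serve requests
     */
    static int fullOutageSeconds(int startupSeconds, int intervalSeconds) {
        // Instance 1 is down during [0, startup); instance 2 goes down at `interval`.
        // Both are down from `interval` until instance 1 is back at `startup`.
        return Math.max(0, startupSeconds - intervalSeconds);
    }
}
```

With an assumed 90-second startup and the fixed 60-second interval, `FixedIntervalOverlap.fullOutageSeconds(90, 60)` gives 30 seconds of full outage per deployment. If startup took only 45 seconds, the result would be 0.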

7. What is most unsatisfactory about the current test environment, or what affects your work the most?

There is only one set of test environments with no separation by purpose, and it is unstable.
The test environment is relatively stable and there are currently no such problems.
Configuration updates also require recompilation.
Many people use workbench and deploy it often, which delays testing.
The jmq of different groups is not isolated.
High availability
There are too few environments. When an online regression is running, the test environment is occupied.
Downstream orders require checking the test environment multiple times, and the platform reports that the connected back-end service is unavailable or has no available back-end service instance.
Frequent downtime
Frequent deployment, long compilation time, and the jsf service is unavailable after deployment.
Deployment is too slow. Some hosts run untouched for a long time and then the service suddenly stops, such as Tonglian, and it is not clear whether the code is at fault.
While basic is being deployed, other systems are essentially unusable and have to wait.
Deployment takes a long time and many people deploy. The cargo airline system is fairly satisfactory once it has been deployed for a while, though occasionally there are problems in the interaction between systems.
Frequent deployment affects work efficiency.

8. What needs do users expect the test environment to meet?

R&D joint debugging and integration testing should have a separate environment.
Regression testing should have its own environment, not mixed with the functional test environment.
A stable version should have a separate environment that mirrors the online environment and serves as the stable environment for external parties.
Stability
Is there any solution to improve unit test coverage?
Develop an automated verification method.
High availability
Independent chat environment and test environment
Stable environment



3. Solution

We addressed the problems reported in the survey together with the other known test environment problems. The main solutions are as follows:

1. Problem: System compilation and deployment take a long time.

Solution: Ask the Xingyun deployment platform for help and optimize the compilation commands and configuration.





 

Optimization results: Compilation time dropped from the original 5 to 10 minutes to less than 3 minutes.

Before tms-bid-web-test optimization:





 

After tms-bid-web-test optimization:





 



Before tms-tfc-web-test optimization:





 

After tms-tfc-web-test optimization:





 

2. Problem: System stability is poor. Shared applications such as basic and workbench are deployed frequently, and each deployment disrupts other users.

Solution: Use Xingyun's existing orchestrated deployment. For core applications, especially those that provide services to other applications, increase the number of instances in the trunk test group to 2 and use rolling deployment: deploy to one machine, probe jsf liveness, and once the interface is available, deploy to the second machine. One instance can serve requests throughout the deployment, which reduces the instability caused by frequent deployment.
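
The control flow of such a health-gated rolling deployment can be sketched as follows. This is only an illustration of the idea, not Xingyun's implementation: the deploy action and the liveness probe are injected as placeholders because the real Xingyun and jsf calls are not shown in this article, and all names here are hypothetical.

```java
import java.util.List;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Sketch of a health-gated rolling deployment. The deploy action and the
// liveness probe are placeholders supplied by the caller.
final class RollingDeploySketch {
    private RollingDeploySketch() {}

    /**
     * Deploys instances one at a time and only moves on when the instance
     * just deployed reports alive.
     *
     * @return true if every instance was deployed and passed the probe
     */
    static boolean rollOut(List<String> instances,
                           Consumer<String> deploy,
                           Predicate<String> isAlive,
                           int maxProbes,
                           long probeDelayMillis) {
        for (String instance : instances) {
            deploy.accept(instance); // this instance is offline while it restarts
            boolean alive = false;
            for (int attempt = 0; attempt < maxProbes && !alive; attempt++) {
                alive = isAlive.test(instance);
                if (!alive) {
                    try {
                        Thread.sleep(probeDelayMillis);
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                        return false;
                    }
                }
            }
            if (!alive) {
                // Stop here: the remaining instances keep serving the old version.
                return false;
            }
        }
        return true;
    }
}
```

The key difference from a fixed-interval restart is that the second instance is never touched until the first one has passed the probe, so a slow startup delays the rollout instead of taking the whole service down.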

Orchestration deployment configuration method:

(1) Expansion

Most test groups currently run on one machine, so they need to be expanded to two machines before rolling deployment. On the instance list page, select the test group, choose expansion/shrinking under container status, and select the expansion operation.

The expansion quantity and expansion version default to the current values. After expansion, the group has 2 containers.

After you confirm, the system adds a new machine and deploys to it automatically. Once the deployment succeeds, you can continue with the following steps.





 

(2) Orchestration and deployment (compilation + deployment)

On the deployment orchestration page, click [Add deployment orchestration]





 

Select [Compile + Deployment] in the public template





 

In the default template that opens, select the branch to compile in the build task. We usually choose the test branch.





 

In the rolling deployment node, configure the basic settings:

Group: the group we want to deploy, usually the daily test group

Update method: by the number in each group

Number of containers updated each time: 1

Update interval: the default 15s is fine





 

In the task steps, add two nodes, jsf offline and jsf online, and keep their default configuration.





 

After confirming, save and run. The compile-and-build step runs first (not detailed here). When it finishes, the rolling deployment runs in 2 batches, starting by taking jsf offline.





 

After jsf goes offline, one instance is still serving and the other instance's jsf is offline.





 

After the deployment is completed, a jsf liveness check is performed. Once it succeeds, the following steps run.





 

When the jsf check reports 100% alive, the 15s sleep we configured is performed.





 

Then the second instance is deployed. After all deployments are complete, the jsf interfaces of both instances are available.





 







 

After all deployments are completed, a Dongdong notification reports that the execution is complete.





 



For every later version update, go straight to the deployment orchestration page, open the compile + deployment task, and click Run.





 

(3) Deployment-only orchestration (optional)

Application scenario: No compilation operation is required, only deployment operation is required

On the deployment orchestration page, click Scenario Deployment Orchestration





 

Select rolling deployment





 

Select the group to deploy (for the trunk test environment, select the daily test group). The deployment image defaults to the image currently in use.





 

In the next step, choose update by quantity and set the number of containers per update and the interval. Then choose whether to take jsf online/offline, remove/mount load balancing, and take np offline/online. Whether to check the load balancing and np options depends on whether load balancing is configured and whether a domain name is associated in np.





 

After saving and running, start the rolling deployment operation





 

(4) Restart operation (optional)

Application scenario: only restart, no compilation, no deployment

On the deployment orchestration page, click Scenario Deployment Orchestration





 

Choose to specify an IP to restart





 

Manually enter the 2 instance IPs of the test environment and click Next





 

Set the number of batches to 2 with 50% per batch, set the pause option to continue automatically, customize the interval, and set the traffic type to jsf traffic online/offline.
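
To make the batch setting concrete, the sketch below splits a list of instance IPs into a given number of consecutive batches, which is what "2 batches, 50% each" amounts to for two machines. The class name is hypothetical and the IPs in the note are documentation placeholders. How Xingyun itself rounds uneven batches is not described in this article, so spreading the remainder over the first batches is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that plans restart batches from a list of instance IPs.
final class RestartBatches {
    private RestartBatches() {}

    /** Splits the given IPs into batchCount consecutive batches of near-equal size. */
    static List<List<String>> split(List<String> ips, int batchCount) {
        if (batchCount <= 0) {
            throw new IllegalArgumentException("batchCount must be positive");
        }
        List<List<String>> batches = new ArrayList<>();
        int size = ips.size();
        int start = 0;
        for (int i = 0; i < batchCount; i++) {
            // Assumption: any remainder is spread over the first batches.
            int len = size / batchCount + (i < size % batchCount ? 1 : 0);
            if (len == 0) {
                continue; // fewer IPs than batches: skip empty batches
            }
            batches.add(new ArrayList<>(ips.subList(start, start + len)));
            start += len;
        }
        return batches;
    }
}
```

For the two test-environment instances, `RestartBatches.split(List.of("192.0.2.10", "192.0.2.11"), 2)` yields two batches of one IP each, so one instance is always left running while the other restarts.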





 

Once this is set up, just open the task and click Run whenever you need to restart.





 

(5) Deployment effect (detailed explanation of the process, no operation required)

Take deployment orchestration as an example (deploying from the compile-and-build page has the same effect).





 

First, the jsf offline operation is performed. At this point only one instance of the application is serving.





 

Then the jsf online operation runs and keeps checking whether the jsf interface is available. When it is, the jsf online verification succeeds.





 

At this point both jsf interfaces are alive: one instance runs the new package and the other runs the old package.





 

After the first batch is deployed successfully, the 60s wait we configured begins.





 

When the wait is over, the jsf offline and online operations are executed for the second instance.





 





 



Finally, both servers are available and running the new package.
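
The sequence above can be checked with a small model that counts how many instances are registered as online after each step. This is a simplified sketch with hypothetical names: it tracks only the offline and online steps of a one-at-a-time rollout and ignores timing.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of a one-at-a-time rolling deployment.
final class RollingTimeline {
    private RollingTimeline() {}

    /**
     * Returns the number of instances registered as online after each step
     * (offline, then online, for each instance in turn).
     */
    static List<Integer> onlineCounts(int instanceCount) {
        List<Integer> counts = new ArrayList<>();
        int online = instanceCount; // everything serves the old package at the start
        for (int i = 0; i < instanceCount; i++) {
            online--;               // jsf offline for instance i
            counts.add(online);
            online++;               // jsf online again once the new package is verified
            counts.add(online);
        }
        return counts;
    }
}
```

For two instances the sequence is 1, 2, 1, 2, so the count never reaches zero. With a single instance it is 0, 1, which is why the group is first expanded to two machines.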





 



(6) Log viewing

With two instances deployed, a call may land on either machine. To view logs, select all current machines under Instance list - Log Service, choose the log path, and search by keyword.





 



Optimized results:

One-click deployment and one-click compile + deploy: there is no need to watch the compilation and deployment, because a notification is sent on completion.
Environment stability is ensured: one instance of the application is available at all times during deployment, so frequent deployment no longer makes the environment unavailable or hurts test efficiency.

3. Problem: There is no separate regression test environment; regression shares the functional test environment.

Solution: Build a separate regression test environment for each application under test and deploy the master version there for pre-release regression testing and automated regression testing.





 

Optimized results:

The master branch and the test branch are isolated in separate environments, so regression testing and daily requirement testing do not affect each other.



 

4. Problem: Every configuration update requires recompilation.

Solution: A restart-and-update-configuration operation exists, but it is not recommended. Re-publishing is recommended instead, because it generates a new version number, which makes rollback easy. Restarting with updated configuration takes effect quickly, but the new configuration cannot be rolled back.
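
The trade-off can be illustrated with a toy model of a version history. This is not how Xingyun stores versions; it is a hypothetical sketch in which re-publishing records a new entry, while an in-place update overwrites the current entry and therefore leaves nothing to roll back to.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model (hypothetical): a stack of published configuration snapshots.
final class VersionedRelease {
    private final Deque<String> history = new ArrayDeque<>();

    /** Re-publishing records a new version on top of the previous ones. */
    void publish(String configSnapshot) {
        history.push(configSnapshot);
    }

    /** Restart-and-update overwrites the running config without recording a version. */
    void updateInPlace(String configSnapshot) {
        if (!history.isEmpty()) {
            history.pop();
        }
        history.push(configSnapshot);
    }

    /** The configuration currently running, or null if nothing was published. */
    String current() {
        return history.peek();
    }

    /** Returns to the previous version, or reports false when none was recorded. */
    boolean rollback() {
        if (history.size() < 2) {
            return false;
        }
        history.pop();
        return true;
    }
}
```

In this model, publishing `timeout=3s` and then `timeout=5s` allows a rollback to `timeout=3s`. Publishing `timeout=3s` and then updating in place to `timeout=5s` leaves a single entry, so `rollback()` returns false.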





 

5. Problem: The test environment is hard to maintain. It is unclear whether it is currently available and which environments are abnormal.

Solution: We built our own test environment monitoring dashboard, which calls the Xingyun interface to check whether applications are alive and shows the current status of each business line's test environment in real time (automatic refresh every 5 minutes, with manual refresh supported).





 

When a test environment becomes unavailable, a Dongdong message is sent to both the test environment operators and the maintenance personnel. On receiving it, the relevant maintainers investigate the cause.
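
A minimal sketch of the probe-and-notify loop behind such a dashboard is shown below. It is an illustration under assumptions, not our actual tool: the liveness probe and the message sender are injected as placeholders because the Xingyun and messaging APIs are not shown here, and all class and method names are hypothetical. Only the 5-minute refresh period comes from the description above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Sketch of a periodic environment check with alerting.
final class EnvMonitorSketch {
    private EnvMonitorSketch() {}

    /** Probes every application once and returns the ones that are not alive. */
    static List<String> findAbnormal(List<String> apps, Predicate<String> isAlive) {
        List<String> abnormal = new ArrayList<>();
        for (String app : apps) {
            if (!isAlive.test(app)) {
                abnormal.add(app);
            }
        }
        return abnormal;
    }

    /** Runs one refresh: probe, then notify only when something is abnormal. */
    static void refresh(List<String> apps,
                        Predicate<String> isAlive,
                        Consumer<List<String>> notifier) {
        List<String> abnormal = findAbnormal(apps, isAlive);
        if (!abnormal.isEmpty()) {
            notifier.accept(abnormal);
        }
    }

    /** Schedules refresh() every five minutes; the caller shuts the executor down. */
    static ScheduledExecutorService startAutoRefresh(List<String> apps,
                                                     Predicate<String> isAlive,
                                                     Consumer<List<String>> notifier) {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        scheduler.scheduleAtFixedRate(
                () -> refresh(apps, isAlive, notifier), 0, 5, TimeUnit.MINUTES);
        return scheduler;
    }
}
```

A manual refresh is simply a direct call to `refresh(...)`. Note that `scheduleAtFixedRate` stops rescheduling a task that throws, so a real probe should catch its own exceptions.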





 

6. Problem: Test environment exceptions cause the daily automation runs to fail.

Solution: Provide wrapped interfaces and configure the application test groups and machines to be probed every day. If an application fails the probe multiple times, restart it, and trigger the pipeline automation task once the probe succeeds.

Interfaces:

/**
 * Get all machine IPs
 * @return
 */
public List<ConfMachineIp> getAllMachine();

/**
 * Get the machine status.
 * This method is used to obtain the current status of the machine.
 * @return the status of the machine
 */
public void machineStatus();

/**
 * Get the current status based on the application name and group name
 * @param appName application name
 * @param groupName group name
 * @return status value: true means the acquisition is successful, false indicates acquisition failure
 */
Boolean getStatusByAppAndGroup(String appName, String groupName);
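
The probe, restart, and trigger flow described in the solution can be sketched as follows. This is a hedged illustration, not our production code: the status check stands in for a call such as `getStatusByAppAndGroup`, the restart and pipeline actions are injected placeholders, and the class name, the `Outcome` enum, and the two threshold parameters are hypothetical. A real version would also wait between probes.

```java
import java.util.function.BiPredicate;

// Sketch of gating an automation pipeline on a successful liveness probe.
final class PipelineGateSketch {
    private PipelineGateSketch() {}

    /** Outcome of the pre-run check. */
    enum Outcome { TRIGGERED, SKIPPED }

    /**
     * Probes the application until it is alive or maxProbes is reached.
     * After failuresBeforeRestart consecutive failures the application is
     * restarted; the pipeline is triggered only after a successful probe.
     */
    static Outcome probeThenTrigger(String appName,
                                    String groupName,
                                    BiPredicate<String, String> statusCheck,
                                    Runnable restartApp,
                                    Runnable triggerPipeline,
                                    int failuresBeforeRestart,
                                    int maxProbes) {
        int consecutiveFailures = 0;
        for (int probe = 0; probe < maxProbes; probe++) {
            if (statusCheck.test(appName, groupName)) {
                triggerPipeline.run();
                return Outcome.TRIGGERED;
            }
            consecutiveFailures++;
            if (consecutiveFailures == failuresBeforeRestart) {
                restartApp.run();
                consecutiveFailures = 0; // start counting again after the restart
            }
        }
        return Outcome.SKIPPED; // environment never became healthy: do not run the tests
    }
}
```

Skipping the pipeline when the environment never recovers keeps environment outages from showing up as failed test cases in the daily automation report.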

Flag indicating whether probing is required:





 

4. Summary

A sound, stable, and pleasant-to-use test environment has always been a key factor in product iteration efficiency and stability, and it is also a prerequisite for automated testing. Through targeted optimization, we have set up independent test environments for different goals, including the test environment, regression environment, automation environment, and performance test environment. This keeps the test environment as stable as possible and improves its usage efficiency within the resource budget. Comprehensive environment monitoring and exception alerts let maintainers resolve environment problems in a targeted way and reduce the personnel cost of environment maintenance.

 

Author: JD Logistics Zhu Fei

Source: JD Cloud Developer Community, Ziyuanqishuo Tech. Please indicate the source when reprinting.


Origin my.oschina.net/u/4090830/blog/10678298