How much do you know about test stability?

1. Test stability issues

Ideally, we want every failed test case to be caused by a real defect. In practice, most test case failures have other causes:

a. The wrong version of a service was deployed

b. The disk of the test executor is full because the logs written during the previous run were not cleared

c. There is dirty data in the database

d. The test case itself was written incorrectly

e. Someone manually triggered a scheduled task during the test run, and it picked up the transaction records the test was working with

f. Messages got crossed

...

Every investigation turns up many problems like these. Over time, developers and testers get tired of it. Some take a quick glance at a failed test case, call it an "environment problem", and investigate no further. In this way, many real defects are missed.

2. The three axes of test stability

How do we deal with test stability problems? Many people will say: environments, process control, monitoring, tooling, adding machines, dedicated personnel, and so on. These are all right, but they are all at the level of solutions, not at the level of methodology and theory.

At the level of methodology and theory, production safety has its three axes: gray release, monitorability, and rollback. Similarly, for test stability, I have three axes of my own:

a. High frequency

b. Isolation

c. Throw away when used up

The first axis: high frequency

1. The benefits of running tests at high frequency are:

a. Shorten the verification delay

b. Change active verification to "passive waiting"

c. Identify intermittent issues

d. Expose unstable factors at every level

e. Force the automation of manual steps

f. Provide more data for analysis

...
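Benefit c can be illustrated with a minimal sketch. It assumes that the outcomes of repeated runs against the same code version are available as (test name, passed) pairs; the function name and the three labels are hypothetical, chosen only for illustration. A test that both passes and fails on unchanged code is intermittent, and a single daily run cannot show that.

```python
from collections import defaultdict


def classify_tests(runs):
    """Classify tests from repeated runs of the same code version.

    runs: iterable of (test_name, passed) tuples (a hypothetical record format).
    Returns a dict mapping test_name to "stable-pass", "stable-fail",
    or "intermittent".
    """
    history = defaultdict(list)
    for name, passed in runs:
        history[name].append(passed)

    result = {}
    for name, outcomes in history.items():
        if all(outcomes):
            result[name] = "stable-pass"
        elif not any(outcomes):
            # Fails every time: likely a real defect or a broken test case.
            result[name] = "stable-fail"
        else:
            # Mixed outcomes on unchanged code: an intermittent issue.
            result[name] = "intermittent"
    return result
```

With one run per day, an intermittent test looks the same as a stable one on any given day; the more runs per code version, the sooner the three groups separate.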

2. High frequency is not only the one sure way to manage test stability, but also a game changer for other engineering problems:

a. Continuous packaging: In the past, packaging was only done right before deploying to the test environment. Deployment often took a long time because of packaging problems, which also delayed the subsequent testing. To address this, we introduced continuous packaging: every hour, we package the HEAD of master. Whenever we hit a problem (such as a missing mvn dependency or a missing configuration), we fix it immediately.

b. Daily production release: Today we release to production once a week, and every release takes a lot of effort. I asked whether we could release every day. New code would still ship on the original rhythm, once a week; on the remaining days we would release anyway, even with no new code, running idle. The only purpose is to use high frequency to expose problems, to force the automation of manual steps, and to force the optimization of every step.

c. Branch merging: Merging branches is painful, so merge frequently: once a day, or many times a day. Taken to the extreme, this becomes trunk-based development: keep rebasing, keep committing.

Ant's SRE team also applies high-frequency thinking. To strengthen disaster tolerance and improve the success rate of disaster tolerance drills, one of the SRE team's main ideas is to run drills at high frequency, using them to fully expose problems and force capability building.

High frequency is not so easy to do.

High frequency requires infrastructure support. First of all, high frequency requires resources. High-frequency execution also puts unprecedented pressure on every part of the infrastructure. High frequency also requires capabilities to have reached a certain baseline. Take SRE's high-frequency drills as an example: if every drill still uncovers many problems, running drills at high frequency is impossible. The prerequisite for high-frequency drills is that our isolation mechanisms and recovery capabilities have reached a certain level. For test runs, running at high frequency requires isolation and throwing things away when used up.

A very common doubt about high-frequency test runs is: I used to run only once a day, and I already had no time to investigate the failed test cases one by one. If I run at high frequency, won't I have even less time? My answer is: not really, because problems converge quickly once tests run at high frequency, so the total amount of investigation may be about the same or even smaller.

The second axis: isolation

Compared with the other two axes (high frequency, throw away when used up), the importance of isolation is probably more widely accepted. The benefits of isolation include:

a. Avoid the mutual influence of test runs and reduce noise.

b. Improve efficiency: destructive tests no longer require coordinating with each other

Isolation comes in just two types: hard isolation and soft isolation. Whether to take the hard isolation route or the soft isolation route must be analyzed case by case according to the technology stack, architecture, and business form. But both roads can lead to an end state:

a. For hard isolation (fully isolated environments, physical isolation) to become the end state, the key is cost: the cost must be reduced without adding quality blind spots. For example, if the entire payment system could be compressed onto a single server with all functions still well covered (including middleware-level ones such as scheduled tasks, message subscription, and database and table sharding rules), that would be an ideal end state. Everyone could bring up several full environments at any time, which would be great. In addition, decoupling the architecture (for example, releasing independently by domain) helps reduce the cost of hard isolation, because it can greatly reduce the deployment scope of a system under test.

b. For soft isolation (semi-shared environments, logical isolation, link-level isolation), the key is how well the isolation works. If the isolation is perfect, today's joint debugging environment can be deployed into the production environment. Then the problem of keeping a stable environment stable no longer exists, and real testing in production is achieved, which is also an ideal end state.

I have achieved both of these end states in my previous work, and they really do work. Reaching either end state is a technical challenge: reducing cost is a technical problem, and so is making logical isolation thorough and reliable.

For our payment or e-commerce systems today, will the end state be hard isolation or soft isolation? It is hard to say now. Judging from technical feasibility, soft isolation is more likely to be our end state. Hard isolation becomes difficult once it reaches deep water, because it runs into the physical limits of the architecture, and breaking through those limits may create new quality blind spots. But for a long time to come, hard isolation will continue to help us a lot. For example, various unconventional tests need hard isolation; supporting them with soft isolation is technically very complex. Since the last fiscal year, I have pushed one-click bring-up of a full test environment (hard isolation) in my team because it is relatively easy to do, being mainly a matter of automation, whereas the routing-based soft isolation solution is not yet ready and can hardly reach the isolation level we need in the short term.

Hard isolation and soft isolation are not opposites and can be used together. For example, when we pull up a routing-based isolation environment, we also pull up a new database. This is hard isolation at the database level, which compensates for the lack of soft isolation capability at the database level.
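To make "routing-based" soft isolation more concrete, here is a minimal sketch under stated assumptions; it is not the actual routing solution referred to above. It assumes that every request carries an environment tag, that an isolated environment deploys only the services under test, and that all other services fall back to a shared baseline. The names route, BASELINE, and the layout of the deployments table are hypothetical.

```python
BASELINE = "baseline"  # hypothetical tag for the shared, semi-shared part


def route(service, env_tag, deployments):
    """Pick the instance that should serve a request tagged with env_tag.

    deployments: dict mapping service name -> {environment tag -> address}
    (an assumed registry layout, for illustration only).
    """
    instances = deployments.get(service, {})
    if env_tag in instances:
        # The isolated environment has its own copy of this service.
        return instances[env_tag]
    if BASELINE in instances:
        # Otherwise fall back to the shared baseline instance.
        return instances[BASELINE]
    raise LookupError(f"no instance of {service!r} for {env_tag!r} or baseline")


deployments = {
    "payment": {"baseline": "10.0.0.1", "env-a": "10.0.1.1"},
    "account": {"baseline": "10.0.0.2"},
}
print(route("payment", "env-a", deployments))  # the isolated copy
print(route("account", "env-a", deployments))  # falls back to the baseline
```

In such a scheme the quality of the isolation depends on the tag being propagated through every hop (RPC, messages, scheduled tasks, database access); wherever it cannot be propagated, hard isolation of that layer fills the gap, as with the separate database above.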

In short, isolation is a must. What kind of isolation plan to adopt should be based on a comprehensive consideration of factors such as complexity, cost, and effectiveness.

The third axis: throw away when used up

Another favorite sentence of mine is: Test environment is ephemeral. I came up with this sentence myself. Ephemeral means short-lived. I repeat it to my QA team, hoping that everyone will always remember this principle in their daily work.

"Test environment is ephemeral" means:

a. Our test setup capability must be very strong. The one-click environment bring-up we are building today is part of this capability. And after setup, we must be able to verify the environment quickly.

b. Our test strategy, test plan, testability design, and test automation must not rely on a long-lived test environment. This includes not relying on old data in a long-lived test environment. For example, test automation must be able to create all the data it needs by itself.

With these capabilities, an "out of the box" test environment can be built from scratch at zero labor cost, quickly and repeatably, along with all the data needed for testing. Then the test environment can be thrown away when used up: create an environment when we want to run tests, and destroy it when the tests finish. Next time, build a new one. Moreover, not only the test environment but also the test execution machine should be thrown away when used up.
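The create, run, destroy cycle can be sketched as follows. This is only an illustration under assumptions: create_env and destroy_env are hypothetical stand-ins for real one-click provisioning, and the "environment" here is just an in-memory dict.

```python
import contextlib
import uuid


@contextlib.contextmanager
def ephemeral_environment(create, destroy):
    """Create a fresh environment, hand it to the test, always destroy it."""
    env_id = f"env-{uuid.uuid4().hex[:8]}"  # new name every time, never reused
    env = create(env_id)
    try:
        yield env
    finally:
        destroy(env)  # runs even if the test raised


# Hypothetical stand-ins for real provisioning and teardown.
live_envs = {}


def create_env(env_id):
    live_envs[env_id] = {"id": env_id, "orders": []}
    return live_envs[env_id]


def destroy_env(env):
    del live_envs[env["id"]]


def run_test():
    with ephemeral_environment(create_env, destroy_env) as env:
        # The test creates all the data it needs; nothing is left over
        # from earlier runs because the environment did not exist before.
        env["orders"].append({"order_id": 1, "amount": 100})
        return len(env["orders"]) == 1
```

The try/finally is the important part of the design: the environment is destroyed whether the test passes, fails, or raises, so nothing can quietly turn into a long-lived environment.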

For environments that do need to be kept for a while after use, a relatively short upper limit should still be set. For example, I have used this approach before:

a. The default life cycle of the joint debugging test environment is 7 days.

b. If you still need the environment at that point, you can extend the expiration date. Each extension reaches at most 7 days from now (that is, newExpDate = now + 7, not newExpDate = currentExpDate + 7).

c. The environment can be extended up to 30 days in total (counting from createDate). Beyond 30 days, special approval is required (for example, from the CTO of the business group).

d. The advantage of this approach is that it is mandatory. It must be enforced across the board. It is a bit painful at first, but everyone soon gets used to it, and automation soon follows. Without pushing this hard, many improvements would never happen.
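Rules a to c can be written down as a small sketch. The constant and function names are hypothetical, and the special approval path beyond 30 days is deliberately not modeled, since the text does not say how it works.

```python
from datetime import datetime, timedelta

DEFAULT_LIFETIME = timedelta(days=7)     # rule a
MAX_EXTENSION = timedelta(days=7)        # rule b
MAX_TOTAL_LIFETIME = timedelta(days=30)  # rule c


def initial_expiration(create_date):
    """Expiration date assigned when the environment is created."""
    return create_date + DEFAULT_LIFETIME


def extend_expiration(create_date, now):
    """New expiration date after an extension requested at time `now`.

    The extension counts from now, not from the current expiration date,
    and is capped at 30 days after creation.
    """
    new_exp = now + MAX_EXTENSION
    hard_limit = create_date + MAX_TOTAL_LIFETIME
    return min(new_exp, hard_limit)


created = datetime(2021, 1, 1)
print(initial_expiration(created))                      # 2021-01-08 00:00:00
print(extend_expiration(created, datetime(2021, 1, 5)))   # 2021-01-12 00:00:00
print(extend_expiration(created, datetime(2021, 1, 28)))  # 2021-01-31 00:00:00
```

Counting from now rather than from the current expiration date means extensions cannot be stacked in advance: an owner who wants to keep an environment has to come back and ask again at least every 7 days.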

The benefits of throwing away when used up are:

a. Solve the problem of environment corruption and reduce dirty data

b. Improve repeatability to ensure that the environment in which each test is run is consistent

c. Force the construction of various optimization and automation capabilities (preparation of test environment, data creation, etc.)

d. Improve the turnover of resources. With the actual physical resources unchanged, higher turnover increases the effective capacity.

Throwing away the test environment when used up does introduce some new quality risks. In a long-lived environment, the existing data was generated by older versions of the code. After a new version of the code is deployed, this old data can help us find data compatibility problems in the new code. With throw-away environments there is no old data, so these data compatibility problems may go undiscovered.

This risk does exist. The way to address it is to look forward, not backward. We want to explore whether there are other solutions to the data compatibility problem. Are there other testing or quality assurance methods? We can even think about how to go "from testing to not testing": eliminate the data compatibility problem through architecture design, so that it is no longer a problem.

Putting it into practice

The three axes described above (high frequency, isolation, and throw away when used up) are indeed a bit idealistic. Our infrastructure, architecture, and automation today are still far from ideal.

But we should be a little idealistic. Doing these three things well is a very big technical challenge, but we are optimistic and believe that we can reach the goal. We are also realistic: we can break the goal down, take the actual situation into account, and proceed step by step.

Closing words:

No winter lasts forever, and no spring fails to arrive. 2020 was an extraordinary year for people all over the world, with everyone bravely fighting the epidemic. Let us encourage one another here. 2021 has arrived as promised, so set a good goal and keep growing.


Origin blog.csdn.net/feng8403000/article/details/114807230