Stress test failure in a high-concurrency environment

1. High-concurrency stress test failure

Before every big sale, we run a stress test on the service. During one of these stress tests a red alert suddenly fired and the order volume dropped by half. This was a very serious stress test failure. First, let's look at how the stress test is performed:

① Under normal conditions, computer rooms A and B each carry half of the online traffic.

② Before the stress test, we route all traffic to one computer room, that is, computer room B carries all of the online traffic.

③ After computer room A has been drained, we direct the stress test traffic into computer room A and run all stress test operations there, so that online traffic is not affected.

In fact, the accident did not happen during the stress test itself. It happened after the stress test was finished: when we switched the online traffic that originally belonged to computer room A back to computer room A, an alert fired saying that the order volume was declining compared with the previous period.
Our first reaction was to roll back quickly to the last state that had no problems. After switching back to the previous traffic grouping, the alarm disappeared. This raises a question: why was there no alarm during the stress test, but an alarm as soon as the traffic was switched back?
Feedback from the order side was that every request sent to computer room A carried a stress test mark, and orders with a stress test mark are discarded. To distinguish stress test traffic from online traffic, the underlying code uses a ThreadLocal to hold a stress test identifier. Because of our mistake, the stress test flag was set during the stress test but never removed, so the flag never disappeared. Even after the stress test traffic was withdrawn, the threads in the thread pool still carried the flag. The thread pool was polluted: when new online traffic came in, the reused threads treated it as stress test traffic, and all incoming order requests were discarded. So after a set, be sure to remove.
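
The pollution described above can be reproduced in a few lines. The following is only a minimal sketch, not the code from the incident: the names `StressFlagDemo`, `STRESS_FLAG`, and the handler methods are hypothetical, and a single-thread executor stands in for the service's thread pool so that the stress request and the later online order are guaranteed to share one worker thread.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class StressFlagDemo {
    // Hypothetical stress test identifier; defaults to false for every thread.
    static final ThreadLocal<Boolean> STRESS_FLAG =
            ThreadLocal.withInitial(() -> Boolean.FALSE);

    // Buggy handler: sets the flag and never clears it.
    static void handleStressRequestLeaky() {
        STRESS_FLAG.set(Boolean.TRUE);
        // ... process the stress test request ...
    }

    // Fixed handler: the flag is always removed, even if processing throws.
    static void handleStressRequestSafe() {
        STRESS_FLAG.set(Boolean.TRUE);
        try {
            // ... process the stress test request ...
        } finally {
            STRESS_FLAG.remove();
        }
    }

    // Online order: kept only if the current thread carries no stress flag.
    static boolean handleOnlineOrder() {
        return !STRESS_FLAG.get();
    }

    // Runs one stress request, then one online order, on the same pooled thread.
    // Returns true if the online order was kept, false if it was discarded.
    static boolean onlineOrderAcceptedAfter(Runnable stressHandler) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            pool.submit(stressHandler).get();
            Callable<Boolean> onlineOrder = StressFlagDemo::handleOnlineOrder;
            return pool.submit(onlineOrder).get();
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("leaky handler, order kept: "
                + onlineOrderAcceptedAfter(StressFlagDemo::handleStressRequestLeaky));
        System.out.println("safe handler, order kept: "
                + onlineOrderAcceptedAfter(StressFlagDemo::handleStressRequestSafe));
    }
}
```

With the leaky handler the online order is discarded because the reused thread still holds the flag; with the `try`/`finally` version it is kept. Putting `remove()` in `finally` is the usual choice because it also covers the case where request processing throws.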

2. JVM tuning

I would also like to add the tuning process for OOM in heap memory and in off-heap memory.

1. Heap memory tuning

① First, take the problematic machine out of the load balancing pool;
② Then use the jps command to find the process ID of the Java process;
③ Use the jmap -dump command to dump its heap snapshot, for example jmap -dump:format=b,file=heap.hprof <pid>;
④ Then use the TCP copy command to duplicate the traffic, so that one copy goes to the production environment and one goes to the test environment. We analyze the parameters and modify the business code in the test environment.
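
Step ③ uses jmap from outside the process. As a complement, the same kind of snapshot can be requested from inside the JVM. The sketch below is an alternative technique, not the procedure described above: it assumes a HotSpot-based JVM, where `com.sun.management.HotSpotDiagnosticMXBean` is available, and the class name `HeapDumper` and the temporary output path are made up for illustration.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.io.IOException;
import java.lang.management.ManagementFactory;
import java.nio.file.Files;
import java.nio.file.Path;

public class HeapDumper {
    // Writes a heap snapshot of the current JVM to the given .hprof file.
    // liveOnly = true dumps only objects that are still reachable.
    static Path dumpHeap(Path target, boolean liveOnly) throws IOException {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // dumpHeap fails if the file already exists, so clear it first.
        Files.deleteIfExists(target);
        bean.dumpHeap(target.toString(), liveOnly);
        return target;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical output location; a real service would use a known dump directory.
        Path dir = Files.createTempDirectory("heapdump");
        Path file = dumpHeap(dir.resolve("heap.hprof"), true);
        System.out.println("Heap dump written to " + file
                + " (" + Files.size(file) + " bytes)");
    }
}
```

The resulting file has the same format as a jmap dump and can be opened with the same heap analysis tools.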

2. Off-heap memory tuning

The off-heap memory here refers to the method area (implemented as Metaspace in native memory since JDK 8). Class information is stored in the method area, so if the method area keeps growing, the code is continuously generating classes, which results in frequent major GC.

① The -verbose:class JVM option prints all class loading information;
② Find the class that is loaded frequently and do a global search for it in IntelliJ IDEA;
③ We found that every request compiled and generated new classes, causing the method area to grow continuously;
④ The reason was that no caching was done. After adding a cache, the problem was solved.
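
The fix in step ④ can be sketched as follows. This is an illustration under assumptions, not the original code: `GeneratedClassCache` is a hypothetical name, and the expensive "compile and generate a class" step is simulated by a function that only counts how often it is called. The point is that `ConcurrentHashMap.computeIfAbsent` invokes the compile step once per distinct key, however many requests arrive.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class GeneratedClassCache<T> {
    private final ConcurrentHashMap<String, T> cache = new ConcurrentHashMap<>();
    private final Function<String, T> compiler;

    // compiler is the expensive step that would generate a new class.
    GeneratedClassCache(Function<String, T> compiler) {
        this.compiler = compiler;
    }

    // Returns the cached result, compiling only on the first request per key.
    T get(String source) {
        return cache.computeIfAbsent(source, compiler);
    }

    // Simulates `requests` identical requests and returns how many times
    // the compile step actually ran.
    static int compilationsFor(int requests) {
        AtomicInteger compilations = new AtomicInteger();
        GeneratedClassCache<String> cache = new GeneratedClassCache<>(src -> {
            compilations.incrementAndGet(); // stand-in for real class generation
            return "compiled:" + src;
        });
        for (int i = 0; i < requests; i++) {
            cache.get("price * quantity");
        }
        return compilations.get();
    }

    public static void main(String[] args) {
        System.out.println("compilations for 1000 requests: " + compilationsFor(1000));
    }
}
```

Without the cache, the compile step would run on every one of the 1000 requests, each time adding class metadata to the method area. In a real service the cache key space should be bounded, otherwise the cache itself becomes a source of growth.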