A story about database stress testing

Recently I worked with a customer on a stress test of the XX system. After talking with the customer, I learned that the expected load after go-live was not heavy. However, the application vendor's earlier performance had not been satisfactory and the customer did not fully trust them, so I was asked to watch this stress test closely.

Participants
Me, the customer, and the application vendor (the customer and I are referred to as Party A; the application vendor is referred to as Party B)

Environment configuration
Database: RAC all-in-one cluster (to simplify the statistics, the application connects to a single node)

Stress testing tool: JMeter

Stress test scenarios
About 10 major scenarios, each with three sub-scenarios at concurrency levels of 100, 200, and 300; each sub-scenario runs for 10 minutes

Stress test data volume
The test data was fabricated by the application vendor. The database is 2 GB in size, and the key business tables involved hold about 400,000, 100,000, and 30,000 rows.

The stress test

I have done many stress tests before. On the database side, the work is mainly collecting the server's CPU and memory usage during the test and checking the SQL execution sections of the AWR report for anomalies, so that system resources can be allocated sensibly after the official launch. 2 GB is a very small amount of data, and a maximum concurrency of 300 is not much for a data set of that size. I expected the stress test to go smoothly, but the ideal was rosy and the reality was not.

Episode One

While testing scenario A at 300 concurrency, JMeter began reporting errors (the specific error is not recorded here). Party B's explanation was that the data volume was too large to reach 300 concurrency. We moved on to scenario B at 100 concurrency, and after that run finished, the following conversation took place.

Party A: The xxx table still had 100,000 rows during scenario A at 300 concurrency, but after scenario B at 100 concurrency it has 30,000.

Party B (stress tester): @Manager I don't really understand this; could you take a look?

Party B (manager): I had someone deal with this. 100,000 rows was a relatively large volume, although in practice it is not that big.

Party A: This is a test in progress. Did you clean up the data?

Party A: Today, list the test data tables and their row counts in the test plan.

Party A: Do not delete data during the test.

Party A: You cannot delete data just to hit the concurrency target. If the target is not reached, it is not reached. It can be optimized later.

Party A: No tricks during the test.

Party B (stress tester): I did not know about the data deletion. Normally nobody is allowed to touch anything during a stress test.

From this conversation I roughly understood what had happened. Party B apparently assumed that the data volume was why scenario A failed at 300 concurrency and, without telling Party A, quietly purged data from the main business table. Party A noticed and was not pleased. The point of a stress test is to confirm how the system behaves under load; if the data is quietly cleaned up the way Party B did, the test loses its meaning. My advice to DBAs and application staff in the middle of a test: communicate in real time.
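One way to make this kind of change visible is to record the row count of each key table before and after every run and compare the two snapshots. The following is a minimal Python sketch under stated assumptions: the table names other than xxx are hypothetical, the counts are illustrative values taken from the figures quoted above, and how the snapshots are collected (for example, one COUNT(*) query per table) is left out.

```python
def find_shrunk_tables(before, after):
    """Return {table: (rows_before, rows_after)} for tables that lost rows.

    A table missing from `after` is treated as having zero rows.
    """
    shrunk = {}
    for table, rows_before in before.items():
        rows_after = after.get(table, 0)
        if rows_after < rows_before:
            shrunk[table] = (rows_before, rows_after)
    return shrunk


# Hypothetical snapshots; only "xxx" is a name used in the conversation above.
before_scenario_b = {"table_1": 400_000, "xxx": 100_000, "table_3": 30_000}
after_scenario_b = {"table_1": 400_000, "xxx": 30_000, "table_3": 30_000}

for table, (old, new) in find_shrunk_tables(before_scenario_b, after_scenario_b).items():
    print(f"{table}: {old} -> {new} rows")
```

Growth in row counts is expected when a scenario inserts data, so this sketch only flags tables that shrank between snapshots.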

Episode Two

Each major scenario has three sub-scenarios at different concurrency levels. When analyzing the AWR reports, however, I found no significant change in the SQL execution counts: about 30,000 executions at 100 concurrency, and only a little over 30,000 at 200 concurrency. Based on my previous stress test experience, this clearly indicated a problem. System CPU usage confirmed it: there was no obvious difference in CPU usage between the two concurrency levels. So the conversation between Party A and Party B began.


Party A: The number of SQL executions on the database side is about the same at 100 and 200 concurrency.

Party B (stress tester): 100 concurrency for 10 minutes gives this many executions; 200 concurrency for 10 minutes would not necessarily double it.

Party B (stress tester): Is this the total execution count?

Party A: Yes.

Party B (stress tester): Then I think this is fine?

Party B (stress tester): I have noted what you said for now, and they will look into it afterwards.

Party B (stress tester): I have asked about the situation you described; it may require changing parameters in the corresponding service.

Party B (stress tester): So let's run just the 100-concurrency level first.

At this point I basically understood that the earlier runs at the higher concurrency levels had been wasted. The lesson is to be careful, and also that to convince Party B you have to show evidence, which keeps the two sides from arguing. No wonder the customer had said in advance that this stress test needed close attention. If I had treated it as a routine job, simply collected some data, and analyzed it afterwards, wasted working hours would have been unavoidable.
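The comparison behind this episode can be sketched as a simple check on how execution counts grow relative to concurrency. This is an illustrative Python sketch, not the method actually used during the test: the function name, the 0.5 tolerance, and the figure of 31,000 (standing in for "a little over 30,000") are all assumptions.

```python
def load_did_not_scale(conc_low, execs_low, conc_high, execs_high, tolerance=0.5):
    """Return True if executions grew far less than concurrency did.

    `tolerance` is the fraction of the concurrency growth that the
    execution count must reach to be considered plausible (assumed value).
    """
    expected_growth = conc_high / conc_low - 1   # e.g. 100 -> 200 gives 1.0
    actual_growth = execs_high / execs_low - 1   # e.g. 30,000 -> 31,000 gives ~0.03
    return actual_growth < tolerance * expected_growth


# 30,000 executions at 100 concurrency versus an assumed 31,000 at 200.
if load_did_not_scale(100, 30_000, 200, 31_000):
    print("Execution count barely changed; check whether the extra load reached the database.")
```

Throughput does not have to scale linearly with concurrency, so a flagged result is a reason to investigate and to bring the numbers to the other party, not proof of a fault by itself.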

Episode Three

The stress test finally reached the last three scenarios. In the earlier scenarios the CPU load had looked normal, or at least there was some load, but in the last three there was almost none. Was the all-in-one machine simply that fast? That could not be right. Besides, these scenarios cover customer analysis and market analysis, which, judging by their names, should access many tables. This time I analyzed each SQL statement that was running and the business tables it touched.

Party A: What is the XXXX table used for in the customer analysis scenario we just ran?

Party B (stress tester): Let me ask.

Party A: The database server is under almost no load in the customer analysis scenario. On the database side, this is the table that is accessed most often.

Party B (manager): That one is the region and province selector.

Party A: Oh, so the customer analysis back end has only one main table as its data source?

At this point Party B's tester sent a crying emoticon, and I knew something had gone wrong.

Party B (stress tester): When you asked, I took a look.

Party B (stress tester): Part of the xx analysis script was disabled during an earlier adjustment.

Party B (stress tester): Let's run the xx analysis again; I have stopped the current run.

Party A: . . . . . . . . . . . . . .

It seems Party A's initial distrust was well founded. Party B had spent about a week preparing for this stress test, and still all these problems appeared.
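The check that exposed the disabled script amounts to comparing the tables a scenario is supposed to touch with the tables its SQL actually touched. Below is a minimal Python sketch of that comparison. All table names are hypothetical, and the observed set is assumed to have been compiled by hand from the SQL text in the AWR report.

```python
def untouched_tables(expected, observed):
    """Return the expected tables that never appeared in the observed SQL."""
    return sorted(set(expected) - set(observed))


# Hypothetical names: what a customer analysis scenario might be expected
# to read, versus what was actually seen on the database side.
expected_tables = {"region_lookup", "customer_main", "customer_orders"}
observed_tables = {"region_lookup"}

missing = untouched_tables(expected_tables, observed_tables)
if missing:
    print("Never accessed during the run:", ", ".join(missing))
```

Agreeing on the expected table list with the application vendor before the run gives both sides something concrete to compare against afterwards.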

Summary

In this test, apart from episode one, where Party B did not act honestly, the other two episodes came down to Party B's inadequate preparation. I will refrain from making any more "positive" comments on Party B here.

For me, I have the following insights:

1. Whether you are working for yourself or for a customer, act with an owner's mentality. Handling things perfunctorily harms both others and yourself, as with the XX party in this case.

2. When discussing uncertain issues with people in other parts of the process, bring conclusive evidence so that neither side passes the buck.

3. Good communication is the first step in customer service. Your skills may fall short for now, but you cannot fool the customer. Nobody is a fool.


Origin blog.csdn.net/newdreamIT/article/details/101368792