Squeezing the server dry: an insane performance optimization

Background

Anyone who has worked on B2B systems knows their most painful trait: they love batch operations for everything. Sure enough, I recently ran into an ugly requirement: 50 users each importing 10,000 documents at the same time, with seventy or eighty fields per document, and I was asked to optimize it.

Excel import technology selection

When it comes to Excel import requirements, many developers have built them and know them well. The underlying technology is the POI family of libraries.

However, native POI is clumsy to use: you have to call the POI API to parse Excel yourself, and every time the template changes you end up writing a pile of repetitive, meaningless code.

That is why EasyPOI appeared later. It adds a layer of encapsulation on top of native POI and uses annotations to automatically map Excel rows to your Java objects.

EasyPOI is easy to use, but when the data volume gets very large it runs out of memory from time to time, which is very annoying.

So EasyExcel appeared later (a further layer of encapsulation, open-sourced by Alibaba), which can be configured so that it does not run out of memory, at the cost of slower parsing.

If you dig into the technical details, it comes down to the difference between DOM parsing and SAX parsing. DOM parsing loads the entire Excel file into memory and parses all the data in one go; for a large file, if memory is insufficient you get an OOM. SAX parsing supports row-by-row parsing, so as long as the per-row handling is done properly there is no memory overflow.

So, after evaluation (our system targets 5 million orders per day and the import demand is heavy), we chose EasyExcel for the Excel import, for the sake of stability.
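For illustration, here is a minimal sketch of EasyExcel's listener-based, row-by-row read (the SAX-style approach mentioned above). The OrderRow class and its fields are hypothetical stand-ins for our seventy-odd columns:

```java
import com.alibaba.excel.EasyExcel;
import com.alibaba.excel.annotation.ExcelProperty;
import com.alibaba.excel.context.AnalysisContext;
import com.alibaba.excel.event.AnalysisEventListener;

public class ImportDemo {

    // Hypothetical row model; the real one has seventy or eighty fields.
    public static class OrderRow {
        @ExcelProperty("Order No")
        private String orderNo;
        @ExcelProperty("Amount")
        private String amount;
        // getters/setters omitted for brevity
    }

    public static void main(String[] args) {
        EasyExcel.read("orders.xlsx", OrderRow.class, new AnalysisEventListener<OrderRow>() {
            @Override
            public void invoke(OrderRow row, AnalysisContext context) {
                // Called once per row; only the current row is held in memory,
                // which is why large files do not blow up the heap.
                handleRow(row);
            }

            @Override
            public void doAfterAllAnalysed(AnalysisContext context) {
                // All rows parsed.
            }
        }).sheet().doRead();
    }

    private static void handleRow(OrderRow row) {
        // Placeholder for per-row handling (in our case: persist + send to Kafka).
    }
}
```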

Import design

In systems we built before, the import feature was always coupled with the normal business flow, which led to a very serious problem: when one part suffers, everything suffers. Whenever a big import came in, the whole system often became extremely sluggish.

An import request, like any other request, can only be processed on a single machine. Whichever machine the import happens to land on is out of luck, and every other request routed to that machine suffers too, because the import eats up a lot of resources, whether CPU or memory (usually memory).

There is another really nasty problem: once business traffic is affected, the only fix is to add memory. If 4G is not enough, go to 8G; if 8G is not enough, go to 16G, and every machine has to be upgraded at the same time. Yet import requests may account for only a handful of all requests, so a huge amount of resources and machine cost is wasted.

In addition, each imported record has seventy or eighty fields, and processing it involves writing to the database, writing to ES, writing logs and so on, so each record is relatively slow to process. At 50ms per record (in reality longer), processing 10,000 records takes 10,000 * 50 / 1000 = 500 seconds, close to 10 minutes. That speed is simply unacceptable.

So I kept thinking: is there a way to cut costs, speed up the handling of import requests, and still give users a good experience?

After much thought, I came up with a plan: split the import out into a separate, general-purpose import service.

The import service is only responsible for receiving the request. Once the request is received, it immediately tells the front end that the request has been accepted and that the result will be delivered later.

Then it parses the Excel file and, after parsing each row, throws it straight into Kafka without any further processing; the downstream service consumes it. After consuming a row, the downstream service sends another message to Kafka telling the import service the processing result for that row. Once the import service detects that feedback has been received for every row, it notifies the front end that the import is complete (the front end polls for this).

As shown in the figure above, let's take importing XXX as an example and walk through the whole process:

  1. The front end initiates a request to import XXX;
  2. The back-end import service returns immediately after receiving the request, telling the front-end that the request has been received;
  3. Each time the import service parses a row, it writes a row record to the database and sends the data to Kafka's XXX_IMPORT topic;
  4. Multiple instances of the processing service pull data from different partitions of XXX_IMPORT and process it; processing may include data validation, calling other services to enrich the data, writing to the database, writing to ES, writing logs, etc.;
  5. After a row is processed, a message is sent to Kafka's IMPORT_RESULT topic saying the row has been processed, either successfully or not; a failure must carry a failure reason;
  6. Multiple instances of the import service pull from IMPORT_RESULT and update each row's processing result in the database;
  7. During one of its polls, the front end finds that the import has completed and tells the user the import succeeded;
  8. Users can view and download the records that failed to import on the page.

That is the entire import process. Now the journey of stepping on pits begins. Are you ready?

Observant readers will notice that bulk import is actually quite similar to flash sales (seckill) in e-commerce. That is why Kafka is introduced throughout the flow, for peak shaving and asynchronous processing.
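To make the flow above concrete, here is a minimal sketch of the processing-service side using Spring Kafka: it consumes rows from XXX_IMPORT and reports a per-row result to IMPORT_RESULT. Only the topic names come from the flow described above; the listener class, group id, message format and helper methods are hypothetical.

```java
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.kafka.core.KafkaTemplate;
import org.springframework.stereotype.Component;

@Component
public class ImportRowConsumer {

    private final KafkaTemplate<String, String> kafkaTemplate;

    public ImportRowConsumer(KafkaTemplate<String, String> kafkaTemplate) {
        this.kafkaTemplate = kafkaTemplate;
    }

    // Pulls rows from the XXX_IMPORT topic; multiple instances share the work
    // because they join the same consumer group and split the partitions.
    @KafkaListener(topics = "XXX_IMPORT", groupId = "import-processing")
    public void onRow(String rowJson) {
        try {
            // Hypothetical steps: validate, enrich via other services,
            // write to the database, write to ES, write logs.
            process(rowJson);
            kafkaTemplate.send("IMPORT_RESULT", buildResult(rowJson, true, null));
        } catch (Exception e) {
            // A failed row must carry a failure reason.
            kafkaTemplate.send("IMPORT_RESULT", buildResult(rowJson, false, e.getMessage()));
        }
    }

    private void process(String rowJson) { /* placeholder */ }

    private String buildResult(String rowJson, boolean success, String reason) {
        // Placeholder: serialize a result message for the import service.
        return "{\"success\":" + success + ",\"reason\":\"" + (reason == null ? "" : reason) + "\"}";
    }
}
```

The import service would run a symmetrical listener on IMPORT_RESULT to update each row's status in the database, as in step 6 above.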

Preliminary test

With the design above in place, we tested importing 10,000 rows: it took only 20 seconds, far more than a little faster than the previously estimated 10 minutes.

However, we found a very serious problem: while data was being imported, the query interface became stuck and took 10 seconds to refresh. On the surface, the import was affecting queries.

Initial suspicion

Because our queries only hit ES, my initial suspicion was that ES did not have enough resources.

However, checking the ES monitoring showed that ES's CPU and memory were still fine; there was no problem there.

Then we double-checked the code and found no obvious problems, and the CPU, memory and bandwidth of the service itself showed nothing abnormal either.

It was really baffling; there was no lead at all.

Moreover, our logs are also written to ES, and the log volume is even larger than the import volume, yet checking the logs showed no stalling.

So I thought: try querying the data directly through Kibana.

No sooner said than done. While an import was running, I queried the data in Kibana and saw no lag at all: the results showed the query itself took only a few milliseconds, with most of the time spent on network transfer, and the whole response came back in about a second.

So ES itself could be ruled out; it had to be a problem in our code.

At this point I ran a simple test: I separated the query from the import processing service and found that queries were no longer stuck and returned in under a second.

The answer was about to surface: it had to be that the ES connection pool was exhausted during the import, so queries had no connection available and had to wait.

By reading the source code, I finally found that the number of ES connections is hard-coded in the RestClientBuilder class: DEFAULT_MAX_CONN_PER_ROUTE = 10 and DEFAULT_MAX_CONN_TOTAL = 30, i.e. at most 10 connections per route and at most 30 connections in total. Worse, these two values are written into the code with no configuration parameter exposed, so the only way to change them is to change the code.

A rough estimate confirms this. Our processing service runs on 4 machines, each of which can open at most 30 connections, so 4 machines give 120 connections. If the 10,000 imported rows are spread evenly, each connection has to handle 10,000 / 120 ≈ 83 rows. At 100ms per row (this and the 50ms above are both estimates), that is about 8.3 seconds, so waiting around 10 seconds for a query is plausible.

We simply raised these two parameters tenfold, to 100 and 300, and redeployed the service. Testing showed that queries were back to normal while an import was running.
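For illustration, here is one way those limits can be raised when the client is built in our own configuration code; this is a sketch assuming direct construction of the RestHighLevelClient, with a placeholder host and port:

```java
import org.apache.http.HttpHost;
import org.elasticsearch.client.RestClient;
import org.elasticsearch.client.RestClientBuilder;
import org.elasticsearch.client.RestHighLevelClient;

public class EsClientFactory {

    public static RestHighLevelClient create() {
        RestClientBuilder builder = RestClient.builder(new HttpHost("es-host", 9200, "http")) // placeholder host
                // Override the hard-coded defaults (10 per route / 30 total)
                // via the underlying Apache HTTP async client.
                .setHttpClientConfigCallback(httpClientBuilder ->
                        httpClientBuilder
                                .setMaxConnPerRoute(100)
                                .setMaxConnTotal(300));
        return new RestHighLevelClient(builder);
    }
}
```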

Next, we tested 50 users importing 10,000 orders each at the same time, i.e. 500,000 orders imported concurrently. At 20 seconds per 10,000 orders, the total should have been 50 * 20 = 1000 seconds, roughly 16 minutes. But the test took more than 30 minutes. Where was the bottleneck this time?

Doubting again

Our previous stress tests were all based on a single user importing 10,000 orders, with 4 machines for the import service and 4 for the processing service. According to the architecture diagram above, both the import service and the processing service should scale out indefinitely: add machines and throughput should go up.

So, first of all, we increased the processing service to 25 machines (we run on k8s, so scaling out is just a matter of changing a number), ran the 500,000 orders again, and saw no improvement: still more than 30 minutes.

Then we scaled the import service to 25 machines as well, ran 500,000 orders again, and again saw no effect. At this point I started questioning my life choices.

Looking at the monitoring of each component, we found that the database used by the import service had a metric called IOPS that had reached 5,000 and stayed around 5,000. What is IOPS?

It is the number of read/write IO operations per second, analogous to TPS/QPS, and indicates how often MySQL interacts with the disk each second. Generally speaking, 5,000 is already quite high.

The bottleneck was probably here. Checking the configuration of this MySQL instance showed it was on "ultra-high IO" storage, which in fact is still an ordinary spinning disk. Would switching to SSD make it better?

So we went ahead and asked ops to provision a new MySQL instance backed by an SSD.

After switching, we ran 500,000 orders again, and this time the time really did drop: only 16 minutes, almost half of before.

So SSD really is much faster. Looking at the monitoring, while importing 500,000 orders, the IOPS of the SSD-backed MySQL reached around 12,000, more than double before.

Later we also switched the processing service's MySQL disk to SSD, and the time dropped again, to about 8 minutes.

Do you think it's over here?

Thinking

As mentioned above, according to the architecture diagram, the import service and the processing service can both scale out indefinitely, and we had added 25 machines each, yet performance was still short of ideal. Let's do the math.

Assume all the bottleneck is in MySQL. For the import service, each row requires about 4 interactions with MySQL: the whole Excel is split into a header table and a rows table; the first row inserts the header record, each following row updates the header record and inserts a row record, and both the header record and the row record are updated again after processing. At 12,000 IOPS, MySQL alone costs us 500,000 * 4 / 12,000 / 60 ≈ 2.7 minutes. The processing service is similar; it also writes to ES but has no header table, so its time also comes out at roughly 2.7 minutes. Since the two services are essentially parallel and independent of each other, the total time should be controllable within about 4 minutes, which means we still had around 4 minutes of room for optimization.

Re-optimization

After a round of investigation, we found a Kafka-related parameter, kafka.listener.concurrency. The processing service had it set to 20, and the topic had 50 partitions, which means that in effect only about 2.5 of our 25 machines were actually consuming messages from Kafka (a guess).

Once the problem was found, it was easy to handle. We first lowered this parameter to 2 while keeping the partition count unchanged and tested again; sure enough, the time dropped to 5 minutes. After a series of further tuning tests, we found that 100 partitions with a concurrency of 4 was the most efficient, reaching 4 and a half minutes at the fastest.
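For reference, a minimal sketch of how listener concurrency is typically set with Spring Kafka. The article only names the kafka.listener.concurrency setting; the configuration class, bootstrap address and group id below are assumptions:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.kafka.config.ConcurrentKafkaListenerContainerFactory;
import org.springframework.kafka.core.DefaultKafkaConsumerFactory;

@Configuration
public class KafkaConsumerConfig {

    @Bean
    public ConcurrentKafkaListenerContainerFactory<String, String> kafkaListenerContainerFactory() {
        Map<String, Object> props = new HashMap<>();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "import-processing");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class);

        ConcurrentKafkaListenerContainerFactory<String, String> factory =
                new ConcurrentKafkaListenerContainerFactory<>();
        factory.setConsumerFactory(new DefaultKafkaConsumerFactory<>(props));
        // Consumer threads per instance; with 25 instances and 100 partitions,
        // a concurrency of 4 gives 100 threads, one per partition.
        factory.setConcurrency(4);
        return factory;
    }
}
```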

At this point, the entire optimization process has come to an end.

Summary

Now let's summarize what has been optimized:

  1. The Excel import library chosen was EasyExcel, which really is excellent; OOM never occurred;
  2. The import architecture was redesigned to be asynchronous, borrowing from flash-sale (seckill) architecture;
  3. The Elasticsearch connection limits were raised to 100 per route and 300 in total;
  4. The MySQL disks were replaced with SSD;
  5. The Kafka partition count and the kafka.listener.concurrency parameter were tuned.

In addition, there were many other small problems that I cannot cover one by one, due to limits of space (and of my memory).

Follow-up plans

Through this optimization, we also found that once the data volume is large enough, the bottleneck is still in storage. So, can performance be improved further by optimizing the storage layer?

The answer is yes. For example, there are ideas such as these:

  1. Shard the databases of both the import service and the processing service (split into multiple databases and tables), so that different Excel imports land in different databases and the pressure on any single database is reduced;
  2. Batch the writes to MySQL to reduce the number of IO operations (see the sketch after this list);
  3. Have the import service record per-row status in Redis instead of MySQL.
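As a rough illustration of the second idea, a minimal JDBC batching sketch; the table, columns, batch size and connection settings are hypothetical:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.util.List;

public class BatchRowWriter {

    // Hypothetical row type with just two of the seventy-odd fields.
    public record Row(String orderNo, String amount) {}

    public static void insertRows(List<Row> rows) throws Exception {
        // rewriteBatchedStatements lets the MySQL driver merge the batch into
        // fewer round trips, cutting the IO count per imported row.
        String url = "jdbc:mysql://mysql-host:3306/import_db?rewriteBatchedStatements=true"; // placeholder
        try (Connection conn = DriverManager.getConnection(url, "user", "password");
             PreparedStatement ps = conn.prepareStatement(
                     "INSERT INTO import_row (order_no, amount) VALUES (?, ?)")) {
            int count = 0;
            for (Row row : rows) {
                ps.setString(1, row.orderNo());
                ps.setString(2, row.amount());
                ps.addBatch();
                if (++count % 500 == 0) {
                    ps.executeBatch(); // flush every 500 rows
                }
            }
            ps.executeBatch(); // flush the remainder
        }
    }
}
```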

However, do we need to try all of them right now? Actually, no. This stress test has at least given us a clear picture, and it will not be too late to optimize further when the volume actually reaches that level.

Alright, that's it for today's article.

Original link:
https://www.cnblogs.com/tong-yuan/p/14523848.html?utm_source=tuicool&utm_medium=referral
