Performance optimization: importing 100,000 (10w) rows of Excel data

Requirements

The Excel import needs to handle 100,000+ rows of data.
In our import/export trading system, the imported data volumes used to be small, so efficiency was never much of a concern. In the secondary-development version, however, I estimate that an imported Excel file will have 100,000+ rows, and the amount of data inserted into the database will be more than 3n; that is, for 100,000 Excel rows, at least 300,000 rows will be inserted into the database.

Some details

Data import: the template used for the import is provided by the system, and the format is xlsx (which supports more than 65,535 rows). Users fill the corresponding data into the corresponding columns according to the header.

Data verification: there are two kinds of verification (a short sketch of both kinds follows this details list):

1. In-memory verification, such as field length and regular-expression checks. These involve no external data access and have little impact on performance.

2. Database verification, such as duplicate checks, e.g. whether a ticket number already exists in the system. These require querying the database and have a large impact on performance.

Data insertion: the test-environment database is MySQL 5.5, the tables are not sharded, and the connection pool is Druid.
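To illustrate the difference between the two kinds of verification, here is a minimal sketch; the row type, field names, and regular expression are assumptions made up for this example, not the project's actual code.

import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical row type and checks, for illustration only.
public class ImportRowValidator {

    public static class ImportRow {
        private String ticketNo;
        public String getTicketNo() { return ticketNo; }
        public void setTicketNo(String ticketNo) { this.ticketNo = ticketNo; }
    }

    private static final Pattern TICKET_NO_PATTERN = Pattern.compile("^[A-Z0-9]{8,20}$");

    // In-memory verification: field length / regex, no external data access, cheap per row.
    public String checkInMemory(ImportRow row) {
        String ticketNo = row.getTicketNo();
        if (ticketNo == null || ticketNo.length() > 20) {
            return "ticket number is missing or too long";
        }
        if (!TICKET_NO_PATTERN.matcher(ticketNo).matches()) {
            return "ticket number format is invalid";
        }
        return null; // passed
    }

    // Database verification: duplicate check against ticket numbers already in the system.
    // existingTicketNos is assumed to have been loaded from the database beforehand.
    public String checkDuplicate(ImportRow row, Set<String> existingTicketNos) {
        if (existingTicketNos.contains(row.getTicketNo())) {
            return "ticket number already exists in the system";
        }
        return null; // passed
    }
}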

Iteration history

First edition: POI + row-by-row query verification + row-by-row insertion

This is the oldest version. It uses native POI to manually map each Excel row into an object and collect the objects into a List. The code executes the following steps:
1. Manually read the Excel file into a List.

2. Loop over the rows and, for each row:

- check field lengths;

- run the verifications that query the database, e.g. checking whether the invoices referenced by the current purchase/sales contract exist in the system, which requires querying the invoice table;

- write the current row.

3. Return the execution result; if an error occurs or a check fails, return the error message and roll back the data. This implementation was obviously rushed out, and because it saw little use afterwards, no performance problem was noticed; but it is at best suitable for single-digit or double-digit row counts. It has the following obvious problems (a sketch of the flow follows this list):

Database verification requires one query per row of data, so the number of network I/O round trips between the application and the database is multiplied by n, and so is the time.

Data is also written row by row, which has the same problem as above.

Data reading uses native POI, so the code is very verbose and maintainability is poor.
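A minimal sketch of this first-edition flow, under stated assumptions: the POI reading code is simplified, and InvoiceMapper/OrderMapper are hypothetical stand-ins for the project's real mappers. It makes the cost visible: two network round trips (one query, one insert) per Excel row.

import java.io.File;
import org.apache.poi.ss.usermodel.*;

public class FirstEditionImport {

    // Hypothetical MyBatis mappers, not the project's real interfaces.
    interface InvoiceMapper { int countByInvoiceNo(String invoiceNo); }
    interface OrderMapper  { void insert(String contractNo, String invoiceNo); }

    private final InvoiceMapper invoiceMapper;
    private final OrderMapper orderMapper;

    public FirstEditionImport(InvoiceMapper invoiceMapper, OrderMapper orderMapper) {
        this.invoiceMapper = invoiceMapper;
        this.orderMapper = orderMapper;
    }

    public void importFile(File excel) throws Exception {
        try (Workbook workbook = WorkbookFactory.create(excel)) {
            Sheet sheet = workbook.getSheetAt(0);
            for (int i = 1; i <= sheet.getLastRowNum(); i++) {   // row 0 is the header
                Row row = sheet.getRow(i);
                String contractNo = row.getCell(0).getStringCellValue();
                String invoiceNo = row.getCell(1).getStringCellValue();

                // In-memory check: cheap
                if (invoiceNo.isEmpty() || invoiceNo.length() > 32) {
                    throw new IllegalArgumentException("row " + i + ": bad invoice number");
                }
                // Database check: one network round trip per row
                if (invoiceMapper.countByInvoiceNo(invoiceNo) == 0) {
                    throw new IllegalArgumentException("row " + i + ": invoice not found");
                }
                // Insert: another network round trip per row
                orderMapper.insert(contractNo, invoiceNo);
            }
        }
    }
}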

Second edition: EasyPOI + cached database lookups + batch insert

In view of the three problems analyzed in the first edition, the following three optimizations were applied.

Cache the verification data: trade space for time

The time cost of querying the database row by row lies mainly in the back-and-forth network I/O, and the optimization is very simple: cache all the data that participates in verification in a HashMap and hit the HashMap directly instead of the database.

For example, to check whether the waybill in a row exists, the contract number was originally used to query the waybill table and match the contract ID; if it is found, the check passes and the packing-list ID is generated, and if it fails, an error message is returned to the user. An expired contract is not renewed, so this mapping does not change during the import and can safely be cached.
So I use a single SQL statement to load all orders under the packing list as keys, with the corresponding purchase/sales contract IDs as values, into a HashMap; subsequent verification only needs to hit the HashMap. MyBatis does not natively support writing a query result directly into a HashMap, so you need a custom SessionMapper that specifies a MapResultHandler to process the result set of the SQL query.

@Repository
public class SessionMapper extends SqlSessionDaoSupport {

    @Resource
    public void setSqlSessionFactory(SqlSessionFactory sqlSessionFactory) {
        super.setSqlSessionFactory(sqlSessionFactory);
    }

    // area / building / unit / room number -> house ID
    @SuppressWarnings("unchecked")
    public Map<String, Long> getHouseMapByAreaId(Long areaId) {
        MapResultHandler handler = new MapResultHandler();
        this.getSqlSession().select(
                BaseUnitMapper.class.getName() + ".getHouseMapByAreaId", areaId, handler);
        Map<String, Long> map = handler.getMappedResults();
        return map;
    }
}
The MapResultHandler puts the result set into a HashMap:
public class MapResultHandler implements ResultHandler {

    private final Map mappedResults = new HashMap();

    @Override
    public void handleResult(ResultContext context) {
        @SuppressWarnings("rawtypes")
        Map map = (Map) context.getResultObject();
        mappedResults.put(map.get("key"), map.get("value"));
    }

    public Map getMappedResults() {
        return mappedResults;
    }
}
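A short usage sketch under the same assumptions: since MapResultHandler reads map.get("key") and map.get("value"), the SQL behind getHouseMapByAreaId is expected to alias its two selected columns as key and value. The surrounding variable names here are illustrative.

// Load the verification data once, before the import loop.
Map<String, Long> houseMap = sessionMapper.getHouseMapByAreaId(areaId);

// During per-row verification: a pure in-memory lookup instead of one query per row.
Long houseId = houseMap.get(rowKey);   // rowKey built from the row's area/building/unit/room fields
if (houseId == null) {
    // record "house does not exist" for this row instead of querying the database
}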

Use values(),(),() to batch insert
MySQL's INSERT statement supports inserting multiple rows at a time with values (),(),(). Batch insertion can be implemented with a MyBatis foreach over a Java collection. The mapper statement is written as follows:

<insert id="insertList">
    insert into table(column1, column2)
    values
    <foreach collection="list" item="item" index="index" separator=",">
        (#{item.column1}, #{item.column2})
    </foreach>
</insert>
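On the Java side, the mapper method and its call might look roughly like this; the interface name, entity name, and field names are assumptions for illustration. The whole list goes into a single statement, so one batch costs one network round trip instead of one per row.

// Hypothetical mapper interface bound to the <insert id="insertList"> statement above.
public interface OrderDetailMapper {
    int insertList(@Param("list") List<OrderDetail> list);
}

// Caller: one multi-row INSERT for the whole batch.
int inserted = orderDetailMapper.insertList(orderDetails);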

Use EasyPOI to read and write Excel
EasyPOI supports annotation-based import and export: columns are mapped by annotating the entity class, so changes to the Excel layout only require changing the annotations, which is very convenient and easy to maintain.
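A minimal sketch of what this looks like with EasyPOI; the entity, column titles, and file name are made up for the example, while @Excel, ImportParams, and ExcelImportUtil come from the EasyPOI library.

import java.io.File;
import java.util.List;

import cn.afterturn.easypoi.excel.ExcelImportUtil;
import cn.afterturn.easypoi.excel.annotation.Excel;
import cn.afterturn.easypoi.excel.entity.ImportParams;

// Illustrative entity: each field is bound to an Excel column by its @Excel annotation.
class OrderRow {

    @Excel(name = "Contract No.")
    private String contractNo;

    @Excel(name = "Invoice No.")
    private String invoiceNo;

    // getters and setters omitted
}

class EasyPoiReadExample {
    public static void main(String[] args) throws Exception {
        ImportParams params = new ImportParams();
        params.setTitleRows(0);   // no title row above the header
        params.setHeadRows(1);    // one header row
        // Each data row becomes one OrderRow, mapped by the annotations.
        List<OrderRow> rows = ExcelImportUtil.importExcel(new File("import.xlsx"), OrderRow.class, params);
        System.out.println("read " + rows.size() + " rows");
    }
}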

Third edition: EasyExcel + cached database lookups + batch insert

After adopting EasyPOI in the second edition, importing a few thousand or a few tens of thousands of Excel rows became easy, but it still took a long time (about 10 minutes to write 50,000 rows to the database). Since later imports were basically run by a developer who watched the log while importing, no further optimization was done at the time.

Don't panic: go to GitHub and look for other open-source projects. At this point Alibaba's EasyExcel came into sight. EasyExcel reads and writes Excel with the same annotation style as EasyPOI, so switching over from EasyPOI is very convenient and takes only minutes.

It is indeed as Alibaba describes: reading 410,000 rows and 25 columns, 45.5 MB of data, takes about 50 s on average, so EasyExcel is recommended for reading large Excel files.
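A minimal sketch of reading with EasyExcel; the entity, column titles, and listener are illustrative, while EasyExcel.read, @ExcelProperty, and AnalysisEventListener come from the EasyExcel library. EasyExcel pushes rows to the listener one by one instead of loading the whole sheet into memory, which is what keeps large files cheap to read.

import java.io.File;
import java.util.ArrayList;
import java.util.List;

import com.alibaba.excel.EasyExcel;
import com.alibaba.excel.annotation.ExcelProperty;
import com.alibaba.excel.context.AnalysisContext;
import com.alibaba.excel.event.AnalysisEventListener;

// Illustrative entity: columns are bound by @ExcelProperty, same idea as EasyPOI's @Excel.
class OrderRow {
    @ExcelProperty("Contract No.")
    private String contractNo;
    @ExcelProperty("Invoice No.")
    private String invoiceNo;
    // getters and setters omitted
}

// EasyExcel calls invoke() once per row, so rows can be buffered and handed off in batches.
class OrderRowListener extends AnalysisEventListener<OrderRow> {
    private static final int BATCH_SIZE = 1000;
    private final List<OrderRow> buffer = new ArrayList<>();

    @Override
    public void invoke(OrderRow row, AnalysisContext context) {
        buffer.add(row);
        if (buffer.size() >= BATCH_SIZE) {
            handleBatch(buffer);      // verify + batch insert (omitted)
            buffer.clear();
        }
    }

    @Override
    public void doAfterAllAnalysed(AnalysisContext context) {
        if (!buffer.isEmpty()) {
            handleBatch(buffer);
        }
    }

    private void handleBatch(List<OrderRow> batch) {
        // hand the buffered rows to verification and batch insert
    }
}

class EasyExcelReadExample {
    public static void main(String[] args) {
        // Read the first sheet, mapping rows to OrderRow via the listener.
        EasyExcel.read(new File("import.xlsx"), OrderRow.class, new OrderRowListener())
                 .sheet()
                 .doRead();
    }
}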

Fourth edition: Optimize data insertion speed

For the inserts in the second edition, I used values batch insert instead of inserting row by row: a long SQL statement was spliced together for every 30,000 rows and the statements were executed one after another. This insert was the most time-consuming part of the whole import, and its performance was poor. Later I reduced the number of rows per spliced statement to 10,000, 5,000, 3,000, 1,000 and 500, and found that 1,000 rows per statement executed fastest.

Combined with some descriptions of innodb_buffer_pool_size found online, my guess is that the overly long SQL exceeded a memory threshold during the write and caused disk swapping, which limited the speed; in addition, the database on the test server does not perform particularly well and cannot handle too large an insert at once. So in the end 1,000 rows per insert were used.

After settling on 1,000 rows per insert, in order to saturate the database's CPU, the time spent waiting on network I/O has to be put to use, and that requires multiple threads. The simplest way to get multi-threading is parallel streams, so I tested the code with parallel streams: 100,000 Excel rows, 420,000 backorder rows, 420,000 detail records and 20,000 records, inserted into the database by 16 threads in parallel, 1,000 rows at a time. The insert took 72 s and the whole import took 95 s.
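A minimal sketch of this batching plus parallel-stream insert, reusing the hypothetical insertList mapper from the batch-insert section above; the variable names are illustrative. Note that a single Spring transaction does not span the worker threads, so rollback on failure needs separate handling.

// Split the verified rows into batches of 1000 rows each.
int batchSize = 1000;
List<List<OrderDetail>> batches = new ArrayList<>();
for (int i = 0; i < orderDetails.size(); i += batchSize) {
    batches.add(orderDetails.subList(i, Math.min(i + batchSize, orderDetails.size())));
}

// parallelStream() runs the batches on the common ForkJoinPool (sized from the CPU count by default);
// each batch is one multi-row INSERT, so the threads mostly overlap their network I/O waits.
batches.parallelStream().forEach(batch -> orderDetailMapper.insertList(batch));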

Other content that affects performance

Logging: avoid printing too many info logs inside the for loop.

During the optimization I also found something that particularly affects performance: info logging. Still using the 410,000-row, 25-column, 45.5 MB data set, an info log was printed every 1,000 rows between the start and the end of reading, and 3+ info logs were printed per row while caching and verifying the data. The logging framework is Slf4j, and the logs are both printed to the console and persisted to disk. The difference in efficiency between printing these logs and not printing them was significant.
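A small sketch of the kind of change this implies, with illustrative names: keep per-row details at debug level and only emit a coarse progress log at info level.

private static final Logger log = LoggerFactory.getLogger(ExcelImportService.class);

for (int i = 0; i < rows.size(); i++) {
    // Per-row details go to debug (normally disabled), not info.
    if (log.isDebugEnabled()) {
        log.debug("verifying row {}: {}", i, rows.get(i));
    }
    // A progress log every 10,000 rows is enough at info level.
    if (i % 10000 == 0) {
        log.info("verified {} / {} rows", i, rows.size());
    }
}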

Summary

Ways to improve Excel import speed:

Use a faster Excel reading framework (Alibaba's EasyExcel is recommended)

For verification that needs to interact with the database, cache the data appropriately according to the business logic and trade space for time
Use values(),(),() to splice long SQL statements and insert multiple rows at a time

Use multiple threads to insert data and make use of the network I/O wait time (parallel streams are recommended; they are easy to use)

Avoid printing useless logs in a loop


Source: blog.csdn.net/weixin_46011971/article/details/108784325