[Repost] Inserting tens of millions of rows into MySQL quickly and safely

Source: ksfzhaohui
juejin.im/post/5da5b568f265da5b6c4bc13d

Overview

There is a requirement to parse an order file. The file can contain up to ten million records, each with about 20 fields separated by commas, and all of it has to be stored in the database within half an hour.

Approach

1. Estimate the file size

Because the file holds tens of millions of records with about 20 fields each, you can roughly estimate the size of the whole order file. The method is simple: use FileWriter to write 10 million records into a file and check the file size. After testing, it is about 1.5 GB;

2. How to batch insert

From the above, the file is fairly large, so reading it all into memory at once is clearly not feasible. The approach is to take a slice of the order file each time and insert it in batches. The batch insert can use insert (...) values (...),(...), which testing shows is quite efficient, as sketched below;
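
For illustration, a batch insert in this form looks roughly like the following (a minimal sketch against the file_order table defined later, with only a few fields shown; in the actual project the statement is generated by MyBatis):

INSERT INTO file_order (file_id, field1, field2, crt_time, upd_time)
VALUES
  (1, 'vaule1', 'vaule2', NOW(), NOW()),
  (1, 'vaule1', 'vaule2', NOW(), NOW()),
  (1, 'vaule1', 'vaule2', NOW(), NOW());

A single round trip carries many rows, which is much faster than issuing one INSERT statement per row.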

3. Data integrity

When slicing the data, take care to keep records intact. Each record ends with a newline character, so every slice must end exactly on this delimiter and never cut a record in half.

4. Does the database support batch data?

Since the data is inserted in batches, the database has to accept large writes. With MySQL, you can set max_allowed_packet so the server accepts the size of each batched submission;

5. The case of mistakes in the middle

Because a large file is being parsed, an error may occur midway; for example, the database connection may fail just as the 9 millionth record is being inserted. Re-inserting everything from scratch is not acceptable, so the position reached must be recorded after each chunk, and it must be saved in the same transaction as that batch of data, so that after recovery the job can resume from the recorded position.
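
As a rough sketch of the recovery side (the mapper method and property names here are hypothetical; the actual logic is in the GitHub repository linked at the end), a restarted job would first load the recorded state and continue from there:

// Hypothetical resume logic: load the last committed parse state for this file
FileAnalysis fileAnalysis = fileAnalysisMapper.selectByFileName(fileName);
long startPosition = fileAnalysis == null
        ? 0L                              // never processed: start from the beginning
        : fileAnalysis.getPosition();     // otherwise resume where the last transaction ended
// the slicing and batch-insert loop then starts reading at startPosition instead of 0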

Implementation

1. Prepare the tables

Two tables need to be prepared: a table for the file status and parse position, and the order table;

CREATE TABLE `file_analysis` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `file_type` varchar(255) NOT NULL COMMENT 'file type 01: type 1, 02: type 2',
  `file_name` varchar(255) NOT NULL COMMENT 'file name',
  `file_path` varchar(255) NOT NULL COMMENT 'file path',
  `status` varchar(255) NOT NULL COMMENT 'file status 0: initialized; 1: success; 2: failed; 3: processing',
  `position` bigint(20) NOT NULL COMMENT 'position where the last processing finished',
  `crt_time` datetime NOT NULL COMMENT 'creation time',
  `upd_time` datetime NOT NULL COMMENT 'update time',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=2 DEFAULT CHARSET=utf8;
CREATE TABLE `file_order` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `file_id` bigint(20) DEFAULT NULL,
  `field1` varchar(255) DEFAULT NULL,
  `field2` varchar(255) DEFAULT NULL,
  `field3` varchar(255) DEFAULT NULL,
  `field4` varchar(255) DEFAULT NULL,
  `field5` varchar(255) DEFAULT NULL,
  `field6` varchar(255) DEFAULT NULL,
  `field7` varchar(255) DEFAULT NULL,
  `field8` varchar(255) DEFAULT NULL,
  `field9` varchar(255) DEFAULT NULL,
  `field10` varchar(255) DEFAULT NULL,
  `field11` varchar(255) DEFAULT NULL,
  `field12` varchar(255) DEFAULT NULL,
  `field13` varchar(255) DEFAULT NULL,
  `field14` varchar(255) DEFAULT NULL,
  `field15` varchar(255) DEFAULT NULL,
  `field16` varchar(255) DEFAULT NULL,
  `field17` varchar(255) DEFAULT NULL,
  `field18` varchar(255) DEFAULT NULL,
  `crt_time` datetime NOT NULL COMMENT 'creation time',
  `upd_time` datetime NOT NULL COMMENT 'update time',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=10000024 DEFAULT CHARSET=utf8;

2. Configure the database packet size

mysql> show VARIABLES like '%max_allowed_packet%';
+--------------------------+------------+
| Variable_name            | Value      |
+--------------------------+------------+
| max_allowed_packet       | 1048576    |
| slave_max_allowed_packet | 1073741824 |
+--------------------------+------------+
2 rows in set

mysql> set global max_allowed_packet = 1024*1024*10;
Query OK, 0 rows affected

Setting max_allowed_packet ensures the database can accept packets as large as the batched insert; otherwise the following error occurs:

Caused by: com.mysql.jdbc.PacketTooBigException: Packet for query is too large (4980577 > 1048576). You can change this value on the server by setting the max_allowed_packet' variable.
    at com.mysql.jdbc.MysqlIO.send(MysqlIO.java:3915)
    at com.mysql.jdbc.MysqlIO.sendCommand(MysqlIO.java:2598)
    at com.mysql.jdbc.MysqlIO.sqlQueryDirect(MysqlIO.java:2778)
    at com.mysql.jdbc.ConnectionImpl.execSQL(ConnectionImpl.java:2834)
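
Note that SET GLOBAL only lasts until the server restarts and only takes effect for new connections. To make the value permanent you would also set it in the MySQL configuration file (my.cnf / my.ini), for example:

[mysqld]
max_allowed_packet = 16M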

3. Prepare test data

public static void main(String[] args) throws IOException {
    FileWriter out = new FileWriter(new File("D://xxxxxxx//orders.txt"));
    for (int i = 0; i < 10000000; i++) {
        out.write(
                "vaule1,vaule2,vaule3,vaule4,vaule5,vaule6,vaule7,vaule8,vaule9,vaule10,vaule11,vaule12,vaule13,vaule14,vaule15,vaule16,vaule17,vaule18");
        out.write(System.getProperty("line.separator"));
    }
    out.close();
}

Use FileWriter to write 10 million records into the file in a loop; this is quite fast. Don't forget to append a line separator (\r\n on Windows, here via System.getProperty("line.separator")) after each record;

4. Integrity of intercepted data

Besides setting how much of the file to read per slice, you also need a small buffer that is read at the boundary of each slice in order to locate the newline characters (\r\n). The buffer should be roughly the size of a single record. Part of the implementation is as follows:

ByteBuffer byteBuffer = ByteBuffer.allocate(buffSize); // allocate a buffer
long endPosition = batchFileSize + startPosition - buffSize;// end position of this slice

long startTime, endTime;
for (int i = 0; i < count; i++) {
    startTime = System.currentTimeMillis();
    if (i + 1 != count) {
        int read = inputChannel.read(byteBuffer, endPosition);// read data
        readW: while (read != -1) {
            byteBuffer.flip();// switch to read mode
            byte[] array = byteBuffer.array();
            for (int j = 0; j < array.length; j++) {
                byte b = array[j];
                if (b == 10 || b == 13) { // check for \n or \r
                    endPosition += j;
                    break readW;
                }
            }
            endPosition += buffSize;
            byteBuffer.clear(); // reset the buffer for the next read
            read = inputChannel.read(byteBuffer, endPosition);
        }
    } else {
        endPosition = fileSize; // the last slice ends at the end of the file
    }
    ... omitted; see the full code on GitHub ...
}

As the code shows, a buffer of about 200 bytes, roughly one line of data, is allocated; it is scanned for the newline characters (\r\n), and the offset of the match is added to the previous end position, which guarantees that every slice ends on a complete record;
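
The snippet above assumes that inputChannel and the slice bookkeeping are prepared elsewhere; a minimal sketch of that setup (the path and slice size are assumptions taken from this article, and the real code is in the GitHub repo) could look like this:

import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.channels.FileChannel;

public static void main(String[] args) throws IOException {
    try (RandomAccessFile file = new RandomAccessFile("D://xxxxxxx//orders.txt", "r");
         FileChannel inputChannel = file.getChannel()) {
        long fileSize = inputChannel.size();               // total size of the order file
        long batchFileSize = 2 * 1024 * 1024;              // read about 2 MB per slice
        int buffSize = 200;                                // buffer of roughly one record
        long startPosition = 0;                            // advances after each committed slice
        int count = (int) Math.ceil((double) fileSize / batchFileSize); // number of slices
        // ... the slicing loop shown above runs here ...
    }
}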

5. Batch insert data

Data is inserted in batches with insert (...) values (...),(...); part of the code is as follows:

// save the orders and the parse position in the same transaction
SqlSession session = sqlSessionFactory.openSession();
try {
    long startTime = System.currentTimeMillis();
    FielAnalysisMapper fielAnalysisMapper = session.getMapper(FielAnalysisMapper.class);
    FileOrderMapper fileOrderMapper = session.getMapper(FileOrderMapper.class);
    fileOrderMapper.batchInsert(orderList);

    // update the position parsed so far and set the update time
    fileAnalysis.setPosition(endPosition + 1);
    fileAnalysis.setStatus("3");
    fileAnalysis.setUpdTime(new Date());
    fielAnalysisMapper.updateFileAnalysis(fileAnalysis);
    session.commit();
    long endTime = System.currentTimeMillis();
    System.out.println("===插入数据花费:" + (endTime - startTime) + "ms===");
} catch (Exception e) {
    session.rollback();
} finally {
    session.close();
}
... omitted; see the full code on GitHub ...

The code above saves the batch of order data and the file-parsing position in the same transaction. batchInsert iterates over the order list with MyBatis's foreach tag to generate the values clauses, roughly as sketched below;
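
For reference, the batchInsert mapper could look roughly like this (a sketch of the foreach usage trimmed to a few fields, with bean property names assumed from the table columns; the actual mapper XML is in the GitHub repo):

<insert id="batchInsert" parameterType="java.util.List">
    INSERT INTO file_order (file_id, field1, field2, crt_time, upd_time)
    VALUES
    <foreach collection="list" item="item" separator=",">
        (#{item.fileId}, #{item.field1}, #{item.field2}, #{item.crtTime}, #{item.updTime})
    </foreach>
</insert>

Each element of the list becomes one (...) group in the values clause, so one statement inserts the whole batch.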

Summary

Only part of the code is shown above; the complete code is in the batchInsert module at the GitHub address below. Locally, the slice size was set to 2 MB per read. In testing, inserting 10 million records (about 1.5 GB) into the MySQL database took roughly 20 minutes. You can of course adjust the slice size, and the time taken will change accordingly.

Complete code

https://github.com/ksfzhaohui/blog/tree/master/mybatis
