Optimizing the insertion of millions of rows

Background

I recently came across a project with millions of rows of data. A very simple single-table query still took more than 20 seconds even after an index was added, so I had no choice but to spend the weekend studying how to optimize queries over millions of rows. The first step is inserting millions of rows of test data. My first approach was a stored procedure, on the assumption that it would be fast and reusable. Unexpectedly, inserting 5 million rows took 2 hours!
So I kept searching on Baidu, reading articles, analyzing, summarizing, and thinking.
In the end I settled on this approach: multi-value inserts (insert ... values (tuple 1), (tuple 2), ...), with several SQL statements committed together.
1. Each SQL statement concatenates 10 value tuples (chosen with the maximum statement length MySQL supports and the width of my own fields in mind; I may have used fewer tuples than I could have).
2. Commit 10 SQL statements per transaction (some articles claim this number is the most efficient; I'm not sure, but what matters is that the resulting performance meets the requirement).
3. That is 100 rows per task, which would take 50,000 iterations. That is obviously too slow single-threaded, so multiple threads are needed. The thread count is set to 16 (twice the machine's 8 cores).
4. Each thread then handles about 3,125 tasks on average, so a queue is used to hand out the tasks (a queue is the more efficient way to share them).
In general, the work breaks down into two steps:
Step 1: a Java program uses multiple threads and a queue to assemble the SQL statements.
Step 2: Alibaba's Druid connection pool executes the SQL against the database.
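Step 1 above (assembling a multi-value statement) can be sketched without touching the database. The table and column names below are placeholders for illustration, not the real schema:

```java
import java.util.StringJoiner;

// Sketch of assembling one multi-value INSERT statement.
public class MultiValueSql {

    /** Repeat the same value tuple `count` times after a single VALUES keyword. */
    public static String build(String table, String columns, String valueTuple, int count) {
        StringJoiner values = new StringJoiner(", ");
        for (int i = 0; i < count; i++) {
            values.add(valueTuple);
        }
        return "insert into " + table + " (" + columns + ") values " + values;
    }

    public static void main(String[] args) {
        // 10 tuples per statement, as in the write-up
        System.out.println(build("t_demo", "name, status", "('student', 1)", 10));
    }
}
```

One such statement covers 10 rows in a single round trip, which is where most of the speed-up over row-at-a-time inserts comes from.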

Start

The target: 5 million rows!

Get a database connection

The first step is to get hold of a database connection, so create a new DataSourceUtil utility class.

import com.alibaba.druid.pool.DruidDataSource;

import java.sql.Connection;

/**
 * @author: aliyu
 * @create: 2021-06-26 23:09
 * @description:
 */
public class DataSourceUtil {

    private static DruidDataSource druidDataSource;

    // initialize the data source
    static {
        druidDataSource = new DruidDataSource();
        // database connection parameters
        druidDataSource.setUrl("jdbc:mysql://localhost:3306/db_myframe?serverTimezone=GMT");
        druidDataSource.setDriverClassName("com.mysql.cj.jdbc.Driver");
        druidDataSource.setUsername("root");
        druidDataSource.setPassword("XXXXX");
        // initial, minimum, and maximum pool size
        druidDataSource.setInitialSize(16);
        druidDataSource.setMinIdle(1);
        druidDataSource.setMaxActive(16);
        // connection-leak detection
        druidDataSource.setRemoveAbandoned(true);
        druidDataSource.setRemoveAbandonedTimeout(30);
        // maximum wait when acquiring a connection, in milliseconds
        druidDataSource.setMaxWait(20000);
        // interval between eviction runs that close idle connections, in milliseconds
        druidDataSource.setTimeBetweenEvictionRunsMillis(20000);
        // keep idle connections from expiring
        druidDataSource.setValidationQuery("SELECT 'x'");
        druidDataSource.setTestWhileIdle(true);
        druidDataSource.setTestOnBorrow(true);
    }

    /* fetch a connection from the pool */
    public static Connection getConnect() {
        Connection con = null;
        try {
            con = druidDataSource.getConnection();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return con;
    }
}

ps: The main points are the data source settings: url, user name, password, and so on. I set the maximum number of connections to twice the number of cores on my machine (8 * 2 = 16), and made the initial size 16 as well (full firepower from the start).
The getConnect method then fetches a connection from the pool.
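The 16 above comes from doubling this machine's 8 cores. As a sketch, the same number can be derived at runtime instead of being hard-coded (the class and method names here are illustrative only):

```java
// Derive the pool/thread size from the core count instead of hard-coding 16.
public class PoolSizing {

    /** Twice the available core count, the rule of thumb used in this write-up. */
    public static int poolSize() {
        return Runtime.getRuntime().availableProcessors() * 2;
    }

    public static void main(String[] args) {
        System.out.println(poolSize());
    }
}
```

On an 8-core machine this yields the 16 used for both the pool and the thread count; on other hardware it scales accordingly.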

Create the thread class

According to the analysis above, each SQL statement groups 10 value tuples and a commit is made every 10 statements, so one task inserts 100 rows. That makes 50,000 tasks in total, which 16 threads complete together.

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

public class InsertDataThread implements Runnable {

    /**
     * Queue initialized with 50,000 tasks; 16 threads complete them together.
     */
    private static final BlockingQueue<Integer> queue = new LinkedBlockingQueue<Integer>(50000);
    /**
     * Start by simulating records from the day before yesterday.
     */
    private static String create_date = "DATE_SUB(current_time(3), INTERVAL 2 day)";
    private static String create_time = "DATE_SUB(curdate(), INTERVAL 2 day)";

    /**
     * The constructor fills the queue with the task numbers.
     */
    public InsertDataThread() {
        for (int j = 0; j < 50000; j++) {
            queue.offer(j);
        }
    }

    /**
     * Thread task (each task inserts 10 values times 10 SQL statements committed together = 100 rows).
     */
    @Override
    public void run() {
        while (true) {
            try {
                synchronized (this) {
                    // the queue is shared by all threads, so only one may operate on it at a time
                    Integer task = queue.poll(); // take the head of the queue
                    if (task == null) {
                        // null means every task has been taken; stop this thread
                        break;
                    }
                    if (queue.size() == 33200) {
                        // simulate yesterday's data
                        create_date = "DATE_SUB(current_time(3), INTERVAL 1 day)";
                        create_time = "DATE_SUB(curdate(), INTERVAL 1 day)";
                    } else if (queue.size() == 16600) {
                        // simulate today's data
                        create_date = "current_time(3)";
                        create_time = "curdate()";
                    }
                    System.out.println("tasks remaining: " + queue.size());
                }
                /* each task inserts 100 rows */
                Connection conn = DataSourceUtil.getConnect();
                conn.setAutoCommit(false); // auto-commit must be off to commit several statements together
                for (int k = 0; k < 10; k++) {
                    /* one SQL statement carries 10 value tuples */
                    StringBuilder sb = new StringBuilder();
                    sb.append("insert into db_myframe.t_authority (authority, authority_type, url, authority_parentid , remark, show_status, hide_children, authority_seq, create_time, create_date) values ");
                    for (int j = 0; j < 10; j++) {
                        sb.append("('student management', 10, '/mocknnd', 0, 'student management', 1, 1, 1, " + create_time + ", " + create_date + "), ");
                    }
                    String sql = sb.substring(0, sb.lastIndexOf(","));
                    // the transaction is committed only after all statements have run, via conn.commit()
                    PreparedStatement pstmt = conn.prepareStatement(sql);
                    pstmt.execute();
                    pstmt.close(); // with a pool in charge this may not strictly be needed
                }
                conn.commit(); // commit once per 10 SQL statements
                conn.close(); // with Druid this returns the connection to the pool
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }
}

Write a test class to run the insertion

public class InsertMillionDataTest {

    public static void main(String[] args) {
        InsertDataThread insertDataThread = new InsertDataThread();
        // 16 threads (twice the core count) execute the tasks
        for (int i = 0; i < 16; i++) {
            Thread th1 = new Thread(insertDataThread);
            th1.start();
        }
    }
}
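To see the queue-draining pattern in isolation, here is a minimal sketch of the same 16-thread / 50,000-task split, with the database work replaced by a counter so it runs standalone. The class and method names are illustrative only:

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.atomic.AtomicInteger;

// The database insert is replaced by a counter increment; everything else
// mirrors the task-distribution scheme in the article.
public class TaskDrainSketch {

    public static int drain(int tasks, int threads) {
        BlockingQueue<Integer> queue = new LinkedBlockingQueue<>(tasks);
        for (int i = 0; i < tasks; i++) {
            queue.offer(i);
        }
        AtomicInteger done = new AtomicInteger();
        Thread[] pool = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            pool[t] = new Thread(() -> {
                // poll() returns null once the queue is empty, ending the loop
                while (queue.poll() != null) {
                    done.incrementAndGet(); // stand-in for one 100-row insert task
                }
            });
            pool[t].start();
        }
        for (Thread th : pool) {
            try {
                th.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return done.get();
    }

    public static void main(String[] args) {
        // every one of the 50,000 tasks is handled exactly once
        System.out.println(drain(50000, 16));
    }
}
```

Because poll() is atomic on a LinkedBlockingQueue, no task can be taken twice, and each thread simply stops when the queue runs dry.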

Other

With 100 rows submitted per batch, the whole run finished in 2 minutes and 59 seconds.
Main reference article: https://blog.csdn.net/weixin_39561577/article/details/111257340

Originally published at blog.csdn.net/mofsfely2/article/details/118283519