How to Efficiently Generate Millions of Rows of Data


Background

At work, we often run into scenarios that require generating data, such as producing test data in batches, simple data migrations, and statistical reports. When a single operation inserts or updates a particularly large amount of data, execution takes far longer than expected, which is often a headache.

So how do we generate millions of rows quickly? We usually turn to multithreading to speed up execution, but before that, how do we get the most out of a single thread, so that each thread in a later multithreaded setup is as fast as it can be? Here I test several different approaches and compare them to see which is fastest.

Technologies Used

  • Java 1.8
  • MyBatis
  • MySQL 5.7

Preparation

First, we prepare a simple user information table for testing. The table structure is as follows.

create table zcy_user
(
    id          int auto_increment primary key,
    user_name   varchar(32)  null,
    sex         int          null,
    email       varchar(50)  null,
    address     varchar(100) null,
    create_time timestamp    null
);

Add a batch insert statement to the MyBatis XML mapper file.

<insert id="batchInsert" parameterType="java.util.List">
  insert into zcy_user(`user_name`,`sex`,`email`,`address`) VALUES
  <foreach collection="list" item="emp" separator=",">
    (#{emp.userName}, #{emp.sex}, #{emp.email}, #{emp.address})
  </foreach>
</insert>

Timing Tests

To keep the test variables controlled, we use a single thread to insert 1 million rows in total, varying the number of rows per insert, and measure how long each run takes.
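For reference, here is a minimal sketch of the kind of test harness used for these runs. The UserMapper interface, the ZcyUser entity and its constructor, and the MyBatisUtil session helper are assumptions for illustration, not the article's exact code.

import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;

import java.util.ArrayList;
import java.util.List;

public class BatchInsertTest {

    private static final int TOTAL = 1_000_000;  // total rows to insert
    private static final int BATCH_SIZE = 1_000; // rows per insert ... values statement

    public static void main(String[] args) {
        // MyBatisUtil is a hypothetical helper that builds the SqlSessionFactory
        SqlSessionFactory factory = MyBatisUtil.getSqlSessionFactory();
        long start = System.currentTimeMillis();
        try (SqlSession session = factory.openSession(true)) { // auto-commit each statement
            UserMapper mapper = session.getMapper(UserMapper.class);
            List<ZcyUser> batch = new ArrayList<>(BATCH_SIZE);
            for (int i = 1; i <= TOTAL; i++) {
                batch.add(new ZcyUser("user" + i, i % 2, "user" + i + "@test.com", "address" + i));
                if (batch.size() == BATCH_SIZE) {
                    mapper.batchInsert(batch); // runs the <insert id="batchInsert"> above
                    batch.clear();
                }
            }
            if (!batch.isEmpty()) {
                mapper.batchInsert(batch); // flush the remainder
            }
        }
        System.out.printf("Inserted %d rows in %.2f s%n",
                TOTAL, (System.currentTimeMillis() - start) / 1000.0);
    }
}

Varying BATCH_SIZE gives the timings below.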

Insert 1 record at a time

Since inserting one row at a time takes very long, we extrapolate: 10,000 rows took 311.65 seconds, so 1 million rows would take roughly 31,165 seconds.

Insert 100 records at a time

Inserting 100 rows per batch, the 1 million rows take 373.58 seconds.

Insert 1,000 records at a time

Inserting 1,000 rows per batch takes 62.08 seconds.

Insert 10,000 records at a time

Inserting 10,000 rows per batch takes 31.83 seconds.

Insert 50,000 records at a time

Inserting 50,000 rows per batch takes 29.30 seconds.

Insert 100,000 records at a time

Inserting 100,000 rows per batch fails with an error: the packet of 5,900,064 bytes exceeds the max_allowed_packet limit of 4,194,304 bytes (4 MB).
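If you hit this limit and genuinely need batches that large, you can check and raise max_allowed_packet; the 64 MB below is only an example value, and the new setting applies to connections opened after the change.

# Check the current packet limit: 4194304 bytes = 4 MB
mysql> show variables like 'max_allowed_packet';
+--------------------+---------+
| Variable_name      | Value   |
+--------------------+---------+
| max_allowed_packet | 4194304 |
+--------------------+---------+
1 row in set (0.00 sec)

# Raise it to 64 MB (example value; takes effect for new connections)
mysql> set global max_allowed_packet = 67108864;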

Comparison

Rows per insert   Time (s)   Memory usage   Speed    Speedup
1                 31,165     low            slow     /
100               373        low            slow     83x
1,000             62         low            medium   502x
10,000            31         medium         fast     1,005x
50,000            29         high           fast     1,074x
100,000           /          very high      /        /

Extended Knowledge

MySQL Memory Structure

The MySQL documentation describes the InnoDB storage engine as consisting of two parts: in-memory structures and on-disk structures. The in-memory part contains the Buffer Pool and the Log Buffer. Because random disk reads and writes are inefficient, MySQL is designed so that data being read or written goes to the Buffer Pool first and is written to disk later by I/O threads.

Buffer Pool

To avoid a disk I/O on every access, MySQL caches table data and index data: pages on disk that are called frequently are loaded into the buffer pool to speed up access. How does it decide which data is frequently used? The Buffer Pool manages its cached pages with an LRU (Least Recently Used) algorithm. With a plain LRU, however, a SQL request that scans a table's full data, or deep pagination that keeps fetching the first few pages, would leave the cache holding pages that are not really the most frequently called data, so the cache space would not be put to real use.

For this reason, MySQL optimizes the LRU list: on top of the basic LRU it adds a distinction between cold and hot data, using 3/8 of the space for cold data and 5/8 for hot data. How is data judged cold or hot? As MySQL's configuration shows, a 1-second access interval is the dividing line: a page accessed again within 1 second stays at the front of the cold region, while a page accessed again after more than 1 second is promoted into the hot region. This modification ensures that what the cache stores is genuinely hot data.

# Check the cold/hot dividing time: 1000 ms
mysql> show variables like 'innodb_old_blocks_time';
+------------------------+-------+
| Variable_name          | Value |
+------------------------+-------+
| innodb_old_blocks_time | 1000  |
+------------------------+-------+
1 row in set (0.00 sec)
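The cold region's share of the buffer pool is also configurable: innodb_old_blocks_pct defaults to 37, which is roughly the 3/8 mentioned above.

# Check the cold ("old") region's share of the buffer pool: 37% ≈ 3/8
mysql> show variables like 'innodb_old_blocks_pct';
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| innodb_old_blocks_pct | 37    |
+-----------------------+-------+
1 row in set (0.00 sec)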

Change Buffer

Take an update as an example. If the data page to be updated is already in memory, it is updated in place directly. If the page is not in memory, then, provided data consistency is not affected, InnoDB caches the update in the Change Buffer instead of reading the page in from disk. The next time a query needs that page, the page is read into memory and the operations recorded for it in the change buffer are applied. This lets multiple operations be merged and applied together, improving performance. If execution fails or the system crashes unexpectedly, the Redo Log can be replayed to keep the data correct.
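Which operations the Change Buffer applies to is controlled by innodb_change_buffering (it takes effect only for non-unique secondary indexes); the default is all:

# Check which operations are buffered (default: all)
mysql> show variables like 'innodb_change_buffering';
+-------------------------+-------+
| Variable_name           | Value |
+-------------------------+-------+
| innodb_change_buffering | all   |
+-------------------------+-------+
1 row in set (0.00 sec)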

Why Performance Improves

Merging Transactions

In the tests above, an insert ... values statement with multiple rows is effectively many single-row inserts merged into one transaction and committed together. In MySQL, each transaction produces corresponding Redo Log writes, so merging multiple commands into one transaction reduces the number of disk I/Os and improves performance. The log files have a size limit, though; once the configured maximum is exceeded, the gains stop growing dramatically.

# A multi-row VALUES list: one transaction
insert into zcy_user(`user_name`,`sex`,`email`,`address`) VALUES
(1,1,1,1),(2,2,2,2)…………(n,n,n,n);

# Multiple single-row inserts wrapped in one transaction
begin;
insert into zcy_user(`user_name`,`sex`,`email`,`address`) value(1,1,1,1);
insert into zcy_user(`user_name`,`sex`,`email`,`address`) value(2,2,2,2);
…………
insert into zcy_user(`user_name`,`sex`,`email`,`address`) value(n,n,n,n);
commit;
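When a single giant VALUES list is impractical, the same "one transaction, many statements" idea can be expressed in code with MyBatis's batch executor. A minimal sketch, assuming the hypothetical UserMapper also exposes a single-row insert method:

import java.util.List;

import org.apache.ibatis.session.ExecutorType;
import org.apache.ibatis.session.SqlSession;
import org.apache.ibatis.session.SqlSessionFactory;

public class BatchExecutorExample {

    public static void insertInOneTransaction(SqlSessionFactory factory, List<ZcyUser> users) {
        // ExecutorType.BATCH buffers the JDBC statements; autoCommit=false keeps one transaction
        try (SqlSession session = factory.openSession(ExecutorType.BATCH, false)) {
            UserMapper mapper = session.getMapper(UserMapper.class);
            for (ZcyUser user : users) {
                mapper.insert(user); // hypothetical single-row insert mapped in the XML
            }
            session.flushStatements(); // send the buffered statements to MySQL
            session.commit();          // one commit, one redo-log flush, like begin ... commit above
        }
    }
}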

Keeping Data in Order

Data is stored on disk as files, and the InnoDB engine stores it in a B+ tree index, which is ordered. If the inserted index values are sequential, disk lookups perform sequential I/O within one contiguous region; if the index values are out of order, disk scans turn into random I/O, which is very expensive. Therefore, when generating data, make the index values sequential IDs or otherwise regular data as far as possible, so that new rows can be stored in contiguous space.
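As a hypothetical illustration: if the primary key cannot be auto_increment, prefer a monotonically increasing key over a random one such as a UUID, which scatters inserts across the B+ tree.

import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

public class KeyGeneration {

    private static final AtomicLong SEQ = new AtomicLong();

    // Sequential keys: new rows append at the right edge of the B+ tree, giving sequential I/O
    static long nextSequentialId() {
        return SEQ.incrementAndGet();
    }

    // Random keys: inserts scatter across the tree, causing page splits and random I/O
    static String randomId() {
        return UUID.randomUUID().toString();
    }
}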

Use Cases

  • Importing data for reports
  • Batch initialization of data
  • Migrating millions to tens of millions of rows
  • Writing log table records

Summary

In many of today's projects, data volumes over a million rows are common, yet performance still has to be guaranteed. Based on the analysis of the test results, when generating data in bulk, merge as many requests as the business scenario allows into a single transaction. This brings a considerable performance improvement, reduces the pressure on the database and services, and makes the program more stable.

Recommended Reading

Zhengcaiyun's Exploration of a Low-Cost Flutter Screen Adaptation Solution

Redis Series: Bitmaps

MySQL: InnoDB Lock System Source Code Analysis

Join Us

The Zhengcaiyun technical team (Zero) is a team full of passion, creativity, and execution, based in picturesque Hangzhou. The team currently has more than 300 R&D partners, including veterans from Ali, Huawei, and NetEase, as well as newcomers from Zhejiang University, the University of Science and Technology of China, Hangzhou Dianzi University, and other schools. Beyond daily business development, the team explores and practices in fields such as cloud native, blockchain, artificial intelligence, low-code platforms, middleware, big data, material systems, engineering platforms, performance and experience, and visualization, has landed a series of internal technology products, and keeps exploring the new frontiers of technology. The team is also devoted to community building, contributing to excellent open source projects such as Google Flutter, scikit-learn, Apache Dubbo, Apache RocketMQ, Apache Pulsar, CNCF Dapr, Apache DolphinScheduler, and Alibaba Seata. If you want a change: you have been tossed around by things and want to start driving them yourself; you have been told you need more ideas but cannot break out of the situation; you have the ability to deliver results but are not given the chance; you want to lead something but there is no team for you; you have the understanding but are always separated from success by that last thin layer of paper. If you believe in the power of belief, believe that ordinary people can achieve extraordinary things, and believe you can meet a better self, and if you want to take part in the process of taking off with the business and personally drive the growth of a technical team with deep business understanding, a sound technical system, technology that creates value, and spillover influence, then we should talk. Anytime, we are waiting for you to write something and send it to [email protected]

WeChat Official Account

Articles are published simultaneously on the Zhengcaiyun technical team's WeChat official account; you are welcome to follow us.
