MySQL Best Practices: Efficiently Inserting Data

When you need to insert millions of rows into a MySQL database, you quickly realize that sending INSERT statements one at a time is not a viable approach.

There are some INSERT optimization tips worth reading in the MySQL documentation.

In this article, I will outline two techniques for efficiently loading data into a MySQL database.

LOAD DATA INFILE

If you are looking for raw performance, this is undoubtedly your first choice. LOAD DATA INFILE is a highly optimized, MySQL-specific statement that inserts data into a table directly from a CSV / TSV file.

There are two ways to use LOAD DATA INFILE. You can copy the data file to the server's data directory (usually /var/lib/mysql-files/) and run:

LOAD DATA INFILE '/path/to/products.csv' INTO TABLE products;

This method is rather cumbersome, because it requires access to the server's file system, setting the appropriate permissions on the data file, and so on.
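
One thing worth checking first: on recent MySQL versions, the secure_file_priv system variable restricts which directory the server may read data files from. A quick check:

-- Show the directory the server is allowed to read data files from
-- (empty means no restriction; NULL means server-side LOAD DATA INFILE is disabled)
SHOW VARIABLES LIKE 'secure_file_priv';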

The good news is that you can also store data files on the client and use the LOCAL keyword:

LOAD DATA LOCAL INFILE '/path/to/products.csv' INTO TABLE products;

In this case, the file is read from the client's file system, transparently copied to a temporary directory on the server, and then imported from there. All in all, this is almost as fast as loading the file directly from the server's file system, though you need to make sure this option is enabled on your server.
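
That option is the local_infile system variable. A minimal sketch of checking and enabling it on the server side (the client must allow it too, for example via the mysql client's --local-infile flag):

-- Check whether the server accepts LOAD DATA LOCAL INFILE
SHOW VARIABLES LIKE 'local_infile';

-- Enable it for new connections (requires a privileged account; set it in
-- the server configuration as well if it should survive a restart)
SET GLOBAL local_infile = 1;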

LOAD DATA INFILE has many options, mostly related to the structure of the data file (field separators, enclosing characters, and so on). Please browse the documentation for the full list.
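
As an illustration, here is a sketch for a hypothetical CSV file whose fields are comma-separated and enclosed in double quotes, with a header row to skip:

LOAD DATA LOCAL INFILE '/path/to/products.csv'
INTO TABLE products
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES;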

Although LOAD DATA INFILE is the best option from a performance point of view, it requires you to first export your data to a comma-separated text file. If you do not already have such a file, you will have to spend additional resources creating it, and this may add some complexity to your application. Fortunately, there is an alternative.

Extended inserts

A typical INSERT SQL statement looks like this:

INSERT INTO user (id, name) VALUES (1, 'Ben');

An extended INSERT groups several records into a single query:

INSERT INTO user (id, name) VALUES (1, 'Ben'), (2, 'Bob');

The key is to find the optimal number of records per statement. There is no one-size-fits-all number, so you need to benchmark with a sample of your data to find the value that yields the maximum performance gain, or the best trade-off between memory usage and performance.

To get the most out of extended inserts, it is also recommended to (both are sketched below):

  • Use prepared statements
  • Run the statements inside a transaction
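
Here is a minimal sketch of the combination, using the user table from above; in real application code, the VALUES list would be bound through a prepared statement with placeholders rather than written as literals:

-- Group many rows into one statement, and wrap the batch in a transaction
START TRANSACTION;

INSERT INTO user (id, name) VALUES
  (1, 'Ben'),
  (2, 'Bob'),
  (3, 'Carol');

COMMIT;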

Benchmarks

I want to insert 1.2 million records, each composed of 6 fields of mixed types, each field about 26 bytes in size. I tested two common configurations:

  • The client and server are on the same machine, communicating through a UNIX socket
  • The client and server are on different machines, communicating over a gigabit network with very low latency (less than 0.1 ms)

As a baseline for comparison, I copied the table using INSERT ... SELECT, which achieved 313,000 inserts per second.
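
For reference, that baseline is a plain table-to-table copy of this form (the table names here are hypothetical):

INSERT INTO products_copy SELECT * FROM products;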

LOAD DATA INFILE

To my surprise, the benchmark shows that LOAD DATA INFILE is even faster than copying the table:

  • LOAD DATA INFILE: 377,000 inserts per second
  • LOAD DATA LOCAL INFILE over the network: 322,000 inserts per second

The difference between these two numbers appears to be directly related to the time it takes to transfer the data from the client to the server: the data file is 53 MB, and the 543 ms difference between the two benchmarks corresponds to a transfer speed of 780 Mbps, close to gigabit speed.

This suggests that, very likely, the MySQL server does not start processing the file until it has been fully transferred: the insertion speed is therefore directly tied to the bandwidth between the client and the server, which is important to keep in mind if they are not on the same machine.

Extended inserts

I tested the insertion speed using BulkInserter, a PHP class that is part of an open-source library I wrote, which can insert up to 10,000 records per query:

[Chart: insert speed as a function of the number of inserts per query, for the local-socket and network configurations]

As the chart shows, the insert speed rises rapidly as the number of inserts per query grows. Compared with inserting records one by one, performance improved 6-fold on the local host and 17-fold on the network host:

  • On the local host, from 40,000 to 247,000 inserts per second
  • Over the network, from 12,000 to 201,000 inserts per second

Both configurations need about 1,000 inserts per query to reach maximum throughput, but 40 inserts per query is already enough to reach 90% of that throughput on the local host, which may be a good compromise. Note also that after the peak, performance actually decreases as the number of inserts per query keeps growing.

The advantage of extended inserts is even more pronounced over a network connection, because sequential insert speed is bounded by your network latency:

max sequential inserts per second ~= 1000 / ping in milliseconds

The higher the latency between the client and the server, the more you benefit from extended inserts.
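
For example, at a latency of 0.1 ms the formula caps a single connection sending one INSERT at a time at about:

1000 / 0.1 = 10,000 inserts per second

This fits the ~12,000 single-row inserts per second measured over the network above, since the test network's latency was slightly below 0.1 ms.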

Conclusion

As expected, LOAD DATA INFILE is the preferred solution for maximizing performance on a single connection. It requires you to prepare a file in the right format; if you must first generate and/or transfer this file to the database server, be sure to include that process when measuring your insertion speed.

On the other hand, extended inserts do not require a temporary text file and can reach a throughput equivalent to about 65% of LOAD DATA INFILE, which is a very reasonable insertion speed. Interestingly, whether over the network or on the local host, aggregating multiple inserts into a single query always yields better performance.

If you decide to use extended inserts, be sure to first test your environment with a sample of production data and a few different numbers of inserts per query to find the optimal value.

Be careful when increasing the number of inserts per query, as doing so may require you to:

  • Allocate more memory on the client side
  • Increase the max_allowed_packet setting on the MySQL server (see the sketch below)
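
For the second point, a quick sketch of checking and raising the limit (the 64 MB value is an arbitrary example; set it in the server configuration to make it permanent, and note that existing connections keep the old value):

-- Check the current maximum size of a single query packet
SHOW VARIABLES LIKE 'max_allowed_packet';

-- Raise it to 64 MB for new connections (privileged account required)
SET GLOBAL max_allowed_packet = 67108864;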

Finally, it is worth mentioning that, according to Percona, you can achieve even better performance with concurrent connections, partitioning, and multiple buffer pools.

The benchmark ran on a bare-metal server with CentOS 7 and MySQL 5.7, a Xeon E3 @ 3.8 GHz processor, 32 GB of RAM, and an NVMe SSD. The benchmark table used the InnoDB storage engine.

The benchmark's source code is available on gist, and the result charts on plot.ly.
