Python (47): Multi-threaded and multi-process applications - a summary of batch data creation

1. Data-creation requirements

1.1 Row counts at the million level

1.2 Write multiple dbf files; each dbf file is associated with two sql files, joined on the cumulative columns described in 1.3.

1.3 Two columns of each dbf file are globally unique and cumulative (monotonically increasing); some other columns are randomly generated.

1.4 Three columns of each sql file are globally unique and cumulative; some other columns are randomly generated.

Factors to consider: CPU, network, and disk I/O read/write speed.
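To make the requirements concrete, here is a minimal sketch of how one row might be generated; the column names and value ranges are hypothetical, and only the split between cumulative and random columns follows the requirements above:

```python
import itertools
import random

# Globally unique, cumulative (monotonically increasing) columns
id_gen = itertools.count(1)
serial_gen = itertools.count(1)

def make_row():
    """Build one row: cumulative columns plus randomly generated columns.
    Column names and value ranges are illustrative only."""
    return {
        "id": next(id_gen),          # globally unique, cumulative
        "serial": next(serial_gen),  # globally unique, cumulative
        "amount": round(random.uniform(1, 10_000), 2),  # random column
        "flag": random.choice("ABC"),                   # random column
    }
```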

2. Implementation plan

2.1 Serial execution

100,000 rows of data

Creating one dbf file and its associated sql1/sql2 files takes 3 minutes 21 seconds. CPU usage is about 12% on Windows (reported as 100% on Linux, which shows per-core usage); on the 8-core system this means a single core is fully occupied.
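A minimal sketch of this serial baseline, assuming make_row from the sketch in section 1; write_dbf and write_sql are hypothetical placeholders for the actual file writers:

```python
def create_file_set(n_rows, dbf_path, sql1_path, sql2_path):
    """Serial baseline: a single process and thread generates all rows,
    then writes the dbf file and both sql files one after another."""
    rows = [make_row() for _ in range(n_rows)]  # make_row from section 1 sketch
    write_dbf(dbf_path, rows)    # hypothetical dbf writer
    write_sql(sql1_path, rows)   # hypothetical sql writer
    write_sql(sql2_path, rows)   # hypothetical sql writer

create_file_set(100_000, "data.dbf", "data_1.sql", "data_2.sql")
```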

2.2 Multi-threaded parallelism, with a thread lock protecting the globally unique counter

100,000 rows of data

Creating 2 dbf files and their associated sql1/sql2 files takes 7 minutes. Consistent with 2.1, CPU usage stays at single-core levels, so two files take roughly twice the single-file time.
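A sketch of the thread-locked variant; the point is only the lock around the shared counter, and the file names are hypothetical:

```python
import itertools
import random
import threading

_lock = threading.Lock()
_counter = itertools.count(1)

def next_unique_value():
    """Thread-safe allocation of one globally unique cumulative value."""
    with _lock:
        return next(_counter)

def thread_worker(n_rows, dbf_path, sql1_path, sql2_path):
    rows = []
    for _ in range(n_rows):
        rows.append({
            "id": next_unique_value(),  # one lock acquisition per row
            "amount": round(random.uniform(1, 10_000), 2),
        })
    # write_dbf / write_sql as in the serial sketch (hypothetical helpers)

threads = [threading.Thread(target=thread_worker,
                            args=(100_000, f"data{i}.dbf",
                                  f"d{i}_1.sql", f"d{i}_2.sql"))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Because the row generation is CPU-bound and CPython's GIL lets only one thread execute Python bytecode at a time, the two threads make no real progress in parallel, which matches the doubled time measured above.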

2.3 Multi-process parallelism, with a process lock protecting the globally unique counter

100,000 rows of data

Creating one dbf file plus its associated sql1/sql2 files takes 7 minutes 05 seconds. Process-lock acquisition and context switching add overhead.

Creating 4 dbf files plus their associated sql1/sql2 files takes 12 minutes 06 seconds.

Since every row requires a lock acquisition, this situation is no longer suited to using locks to allocate the globally unique values. To keep lock waits from limiting CPU utilization, the approach in 2.4 is adopted.
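A sketch of the process-locked variant using a shared multiprocessing.Value; every row pays the lock round-trip described above, and the file names are hypothetical:

```python
import multiprocessing as mp

def next_unique_value(counter):
    """Process-safe allocation: each row pays one lock round-trip."""
    with counter.get_lock():  # Value carries its own internal lock
        counter.value += 1
        return counter.value

def worker(counter, n_rows, dbf_path, sql1_path, sql2_path):
    for _ in range(n_rows):
        row_id = next_unique_value(counter)  # one lock acquisition per row
        ...  # build the row and buffer it, then write the three files

if __name__ == "__main__":
    counter = mp.Value("q", 0)  # shared 64-bit signed counter
    procs = [mp.Process(target=worker,
                        args=(counter, 100_000,
                              f"data{i}.dbf", f"d{i}_1.sql", f"d{i}_2.sql"))
             for i in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

The processes do run in parallel, but every row's lock acquisition now crosses process boundaries, which is far more expensive than a thread lock.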

2.4 Multi-process parallelism with no process locks; the range of globally unique values used by each process is pre-allocated according to the amount of data it will generate (sketched after the timings below).

100,000 rows of data

Creating one dbf file plus its associated sql1/sql2 files takes 3 minutes 39 seconds.

Creating 4 dbf files plus their associated sql1/sql2 files takes 4 minutes 02 seconds. Four Python processes run, each occupying about 12% of the CPU (8-core system).

1 million rows of data

Creating 4 dbf files plus their associated sql1/sql2 files takes 33 minutes 48 seconds, in line with expectations.
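A sketch of the lock-free 2.4 scheme: because each process knows in advance how many rows it will generate, it can own a disjoint, contiguous range of the unique values and never needs a lock. File names are hypothetical:

```python
import multiprocessing as mp

def worker(start, n_rows, dbf_path, sql1_path, sql2_path):
    """This process owns the half-open range [start, start + n_rows),
    so its values are globally unique without any locking."""
    for offset in range(n_rows):
        row_id = start + offset
        ...  # build the row and buffer it, then write the three files

if __name__ == "__main__":
    n_rows = 100_000
    procs = []
    for i in range(4):  # 4 dbf files -> 4 processes
        p = mp.Process(target=worker,
                       args=(1 + i * n_rows, n_rows,
                             f"data{i}.dbf", f"d{i}_1.sql", f"d{i}_2.sql"))
        procs.append(p)
        p.start()
    for p in procs:
        p.join()
```

For example, with 100,000 rows per file, process 0 uses IDs 1-100000 and process 1 uses 100001-200000, so uniqueness holds with zero coordination.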

3. Summary

3.1 Network optimization:

When the tool fetches environment, account, and similar information, it must query the relevant data from the server database over the network.

Optimization: cache the data after the first fetch, and check the cache first on subsequent data-creation runs.
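A minimal sketch of this caching idea using functools.lru_cache; query_server_db stands in for the real network query and is hypothetical:

```python
import functools

@functools.lru_cache(maxsize=None)
def get_account_info(account_id):
    """First call per account_id queries the server database over the
    network; every later call is answered from the in-process cache."""
    return query_server_db(account_id)  # hypothetical network query
```

Repeated lookups for the same account then cost a dictionary hit instead of a network round-trip.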

3.2 CPU utilization optimization:

Use multiple processes to spread the CPU-bound work across cores.

3.3 Plan summary:

2.4 is the optimal plan. When multiple dbf files are created at the same time, each file gets its own process, and each process occupies about 12% of the CPU while running (so the 8-core system is fully utilized); within each process, the dbf, sql1, and sql2 files are written serially.

Since the measurements above are based on 100,000 rows, the time for million-level data scales by the same multiple: about 35-40 minutes (consistent with the 33 minutes 48 seconds measured in 2.4).

Note: when the number of files to create (i.e., the number of processes) exceeds the number of CPU cores, the total time grows in multiples of the single-file time, as the example and the estimate sketch below show.

Example:

8-core system, 100,000 rows of data:

Single file: ~3 minutes

Creating 1-8 files simultaneously: ~3 minutes

Creating 9-16 files simultaneously: ~6 minutes
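The scaling rule in this example can be written as a one-line estimate (the single-file time is taken from the measurements above):

```python
import math

def estimated_minutes(n_files, n_cores=8, single_file_minutes=3):
    """Files run in parallel up to the core count, so total time
    grows in steps of the single-file time."""
    return math.ceil(n_files / n_cores) * single_file_minutes

estimated_minutes(8)   # -> 3
estimated_minutes(9)   # -> 6
```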

The timings above were measured in the tool's current environment and are affected by the CPU performance and disk write speed of the machine; a higher-performance machine will take less time.
