High-performance Python programming (line_profiler + multiprocessing)

Optimizing a function: first single-threaded optimization (with a line profiler), then multi-process optimization.

Using line_profiler

If an error occurs during installation (the package is installed with pip install line_profiler), see this post on line_profiler installation errors.
line_profiler reports the execution time spent on each line of a program.

from line_profiler import LineProfiler

lp = LineProfiler()
lp_wrapper = lp(transferdata)  # pass the function to the profiler
lp_wrapper('C:/Users/Administrator.SC-201610171623/Desktop/湖南采集数据/不接地-所有/不接地-A/D/1200')
lp.print_stats()

The output shows both the time spent on each line of the function and the time spent inside the functions that each line calls, which solves the problem of finding where the time goes. It can be used as a reference.
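If the function you profile itself calls other functions whose per-line times you also want, LineProfiler.add_function registers them too. A minimal sketch, using hypothetical outer/inner functions rather than the ones from this post:

from line_profiler import LineProfiler

def inner(n):
    return sum(i * i for i in range(n))

def outer(n):
    total = inner(n)  # the time on this line includes the call into inner()
    return total + n

lp = LineProfiler()
lp.add_function(inner)  # also collect per-line times inside inner()
lp_wrapper = lp(outer)  # wrap the top-level function as before
lp_wrapper(100000)
lp.print_stats()        # prints one per-line table for outer and one for inner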

Multiprocessing

The problem to be analyzed consists of two parts:
(1) I/O
(2) computation
Multiple processes are suited to compute-intensive work.

Why not multithreading?

The standard Python interpreter (CPython, the default execution environment for most Python code) has a GIL (Global Interpreter Lock); other implementations, such as Jython, have no GIL.
To take advantage of multiple cores, Python added support for multithreading, and the simplest way to guarantee data integrity and state synchronization between threads is a lock. The GIL is that lock.
On a multi-core CPU, Python multithreading only benefits I/O-intensive work; as soon as even one CPU-intensive thread exists, multithreading efficiency drops sharply because of the GIL.
Multiprocessing, however, is not affected by the GIL.
This article explains it well.
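A minimal sketch (not from the original post) that makes the effect visible: the same pure-Python, CPU-bound function mapped over a thread pool and over a process pool.

import time
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool

def cpu_bound(n):
    # pure-Python loop: the thread running it holds the GIL the whole time
    total = 0
    for i in range(n):
        total += i * i
    return total

if __name__ == '__main__':
    work = [10_000_000] * 4
    for label, PoolCls in (('threads', ThreadPool), ('processes', Pool)):
        t = time.time()
        with PoolCls(4) as p:
            p.map(cpu_bound, work)
        print(label, time.time() - t)

On a multi-core machine running CPython, the process pool is typically several times faster, because each process has its own interpreter and its own GIL.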
Improved code:

import time
from multiprocessing import Pool

import numpy as np
import pandas as pd

def analyse(datname):  # both the data processing and the file writing happen in here
    data = transferdata(datname)
    character = char(data)
    df = pd.DataFrame(np.array(character), columns=['均值','标准差','偏度','峰度',
                                  'd1/a6','d2/a6','d3/a6','d4/a6','d5/a6','d6/a6','能量比之和',
                                  'd1方差','d2方差','d3方差','d4方差','d5方差','d6方差',
                                  'd1最大','d2最大','d3最大','d4最大','d5最大','d6最大','a6最大',
                                  '第一周期均值','第一周期标准差','第一周期偏度','第一周期峰度',
                                  '第二周期均值','第二周期标准差','第二周期偏度','第二周期峰度'])
    df.to_csv('my_csv.csv', mode='a', header=True)

def generate(dat_txt, workers):
    with Pool(workers) as p:
        p.map(analyse, dat_txt)  # analyse is the final data-processing function

if __name__ == '__main__':
    t = time.time()
    generate(dat_txt, workers=8)  # dat_txt: list of data-file paths, built elsewhere
    used = time.time() - t
    print(used)

When multiple processes write to the same file at the same time, the writes clash or block.
https://blog.csdn.net/Q_AN1314/article/details/51923022
Two solutions:
1. Lock the write operation, so a process only releases the lock once its write is finished; this works, but it is inefficient (a sketch of this approach follows the list).
2. The more elegant approach: use a multiprocessing callback function.
a. Abstract the write into a single function.
b. Have each process return the content to be written as its return value.
c. Write the returned content in the main process, in the callback.
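For completeness, a minimal sketch of approach 1 (locking), which the post mentions but does not show; do_work and the file list are hypothetical stand-ins. A Lock cannot be passed as a task argument to a pool worker, so each worker inherits it through the pool initializer:

import csv
from multiprocessing import Pool, Lock

def init_pool(l):
    global write_lock
    write_lock = l  # every worker process ends up sharing the same Lock

def do_work(datname):  # hypothetical stand-in for transferdata + char
    return [datname, len(datname)]

def analyse_and_write(datname):
    row = do_work(datname)
    with write_lock:  # serialize writes: one process at a time
        with open('my_csv.csv', 'a', newline='') as f:
            csv.writer(f).writerow(row)

if __name__ == '__main__':
    dat_txt = ['a.dat', 'b.dat', 'c.dat']  # hypothetical file list
    lock = Lock()
    with Pool(8, initializer=init_pool, initargs=(lock,)) as p:
        p.map(analyse_and_write, dat_txt)

The more elegant callback version follows: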

# data-analysis function
def analyse(datname):
    data = transferdata(datname)
    character = char(data)
    df = pd.DataFrame(np.array(character), columns=['均值','标准差','偏度','峰度',
                                  'd1/a6','d2/a6','d3/a6','d4/a6','d5/a6','d6/a6','能量比之和',
                                  'd1方差','d2方差','d3方差','d4方差','d5方差','d6方差',
                                  'd1最大','d2最大','d3最大','d4最大','d5最大','d6最大','a6最大',
                                  '第一周期均值','第一周期标准差','第一周期偏度','第一周期峰度',
                                  '第二周期均值','第二周期标准差','第二周期偏度','第二周期峰度'])
    return df  # return the content this process needs written, as the return value

# the file write lives in its own function
def write2csv(df):
    df.to_csv('my_csv.csv', mode='a', header=True)

if __name__ == '__main__':
    t = time.time()
    pool = Pool()
    for i in dat_txt:
        pool.apply_async(analyse, (i,), callback=write2csv)  # analyse is the target each worker runs
    pool.close()  # no new tasks
    pool.join()   # wait for all workers to finish before timing
    used = time.time() - t
    print(used)
  • apply_async passes positional arguments, like Python's apply function, but it is non-blocking and supports a callback on the result once it returns.
  • close() closes the pool so that it accepts no new tasks; join() blocks the main process until the child processes exit. join() must be called after close().

On Windows, Python multiprocessing does not behave the way it does on Linux. On Linux the multiprocessing library is based on the fork function: after a fork, the parent's resources, such as file handles, are passed on to the child process. Windows has no fork function, so if you open a file in the parent process and write to it in a child process, you get ValueError: I/O operation on closed file. On Windows it is also best to add the if __name__ == '__main__' guard, to avoid a possible RuntimeError or deadlock.
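A minimal sketch (not from the post) of the structure this requires: on Windows, multiprocessing uses spawn, which re-imports the main module in every child process, so anything that creates processes must sit behind the guard.

from multiprocessing import Pool

def work(x):  # hypothetical worker
    return x * x

if __name__ == '__main__':  # required on Windows: without the guard, each
    with Pool(4) as p:      # spawned child would re-run the pool creation
        print(p.map(work, range(8)))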

With df.to_csv(), every appended batch of data also writes the DataFrame index (an extra key column) and a header row, which makes later processing of the CSV inconvenient (index=False and header=False would suppress them). The improvement below writes the computed features to the CSV file directly with csv.writer instead, which is also a lot faster.

import csv
import time
from multiprocessing import Pool

def analyse(datname):
    data = transferdata(datname)
    character = char(data)
    return character

def write2csv(df):
    # runs in the parent process (the pool's result-handler thread),
    # so the single open file handle is safe to use here
    writer = csv.writer(file_csv, dialect='excel')
    for data in df:
        writer.writerow(data)

def generate(dat_txt, workers):  # unused in this version; kept from the map-based variant above
    with Pool(workers) as p:
        p.map(analyse, dat_txt)

if __name__ == '__main__':
    __spec__ = "ModuleSpec(name='builtins', loader=<class '_frozen_importlib.BuiltinImporter'>)"  # workaround for a Spyder multiprocessing error
    t = time.time()
    file_csv = open('try.csv', 'a', newline='')  # newline='' keeps blank rows out of the CSV
    pool = Pool()
    for i in dat_txt:
        pool.apply_async(analyse, (i,), callback=write2csv)
    pool.close()
    pool.join()
    file_csv.close()
    used = time.time() - t
    print(used)

Related reading:
Python multithreading and multiprocessing: the join() method
Multithread deadlock
Deadlock and deadlock solutions (Java)
Deadlock / recursive lock / mutex / ...


Origin blog.csdn.net/m0_38126296/article/details/92831306