Python: A Small Example of Multi-Process Parallel Processing of a Large File

Copyright notice: this is an original article by orangleliu (http://blog.csdn.net/orangleliu/). Feel free to share it, but please credit the source when reposting. Thanks. https://blog.csdn.net/lzz957748332/article/details/82382593

The requirement here is simple: count the lines of a fairly large log file. For the final version, see the last code block.

Environment

  • 64 GB RAM, 32 cores, mechanical (spinning) disk
  • Python 2.7.5

File details

$ tail www.geniatech.net
14.182.200.249 - - [23/Aug/2018:00:11:06 HKT] "GET http://www.geniatech.net/down-eng/upgrade/stvm8_5.0_MyGica_Dolby//update.xml HTTP/1.1" 404 0 0 319 "-" "Apache-HttpClient/UNAVAILABLE (java 1.4)" "-" "-" "HIT" "-" 1

$ ls -lsh www.geniatech.net
2.5G -rw-r--r-- 1 liuzz liuzz 2.5G Aug 24 16:01 www.geniatech.net

$ time wc -l www.geniatech.net
12762416 www.geniatech.net

real    0m0.579s
user    0m0.184s
sys 0m0.395s

Single process

The most straightforward approach: iterate over the file object line by line (a lazy, generator-style read) and count.

# coding:utf-8
filename = "www.geniatech.net"
linenums = 0

with open(filename) as f:
    for line in f:
        linenums += 1

print linenums

Timing the count:

$ time python batch.py
12762416

real    0m2.606s
user    0m2.022s
sys 0m0.584s

$ time python batch.py
12762416

real    0m2.588s
user    0m2.040s
sys 0m0.548s
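
Incidentally, since the approach above reads the file lazily line by line, the same count can also be written as a one-line generator expression. This is just an equivalent sketch of the same single-process idea, not a separate script from the original post:

# coding:utf-8
filename = "www.geniatech.net"

with open(filename) as f:
    # sum() consumes the lazy line iterator, never holding the whole file in memory
    linenums = sum(1 for _ in f)

print linenums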

Multiprocessing

First, a flawed multiprocessing approach:

# coding:utf-8
import multiprocessing as mp

filename = "www.geniatech.net"
cores = 20

pool = mp.Pool(cores)
jobs = []

def work(line):
    pass

with open(filename) as f:
    for line in f:
        # one task (and one pending result) per line -- this is the problem
        jobs.append(pool.apply_async(work, (line,)))

for job in jobs:
    job.get()

pool.close()

This approach is problematic: memory keeps climbing because the whole file effectively gets queued in memory (one pending task per line), CPU usage also goes well above 100%, and it still had not finished after tens of seconds, so I killed it.
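
Before moving on to the improvement below, note that if you really do need to hand every line to a worker, multiprocessing.Pool.imap_unordered with a chunksize lets the pool batch lines into tasks and stream the results back, instead of piling up millions of pending results in a jobs list. This is only an aside sketch (the per-line work function and the chunksize value are placeholders), not the approach the rest of the post takes:

# coding:utf-8
import multiprocessing as mp

filename = "www.geniatech.net"

def work(line):
    # placeholder per-line work; returns 1 so the results sum to a line count
    return 1

pool = mp.Pool(5)
with open(filename) as f:
    # chunksize batches 10000 lines per task; results stream back lazily,
    # so no list of AsyncResult objects builds up in memory
    total = sum(pool.imap_unordered(work, f, chunksize=10000))
pool.close()
pool.join()
print total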

Here is the improved approach:

# coding:utf-8
import os
import multiprocessing as mp

filename = "www.geniatech.net"
cores = 5


def process_wrapper(chunkStart, chunkSize):
    # count the lines in one chunk of the file
    num = 0
    with open(filename) as f:
        f.seek(chunkStart)
        lines = f.read(chunkSize).splitlines()
        for line in lines:
            num += 1
    return num

def chunkify(fname, size=1024*1024):
    # yield (start, length) pairs for ~1 MB chunks aligned to line boundaries
    fileEnd = os.path.getsize(fname)
    with open(fname, 'r') as f:
        chunkEnd = f.tell()
        while True:
            chunkStart = chunkEnd
            f.seek(size, 1)   # jump ahead roughly one chunk
            f.readline()      # then read to the next newline so the chunk ends on a line boundary
            chunkEnd = f.tell()
            yield chunkStart, chunkEnd - chunkStart
            if chunkEnd > fileEnd:
                break

pool = mp.Pool(cores)
jobs = []

for chunkStart, chunkSize in chunkify(filename):
    jobs.append(pool.apply_async(process_wrapper, (chunkStart,chunkSize)))

res = []
for job in jobs:
    res.append(job.get())

pool.close()
print sum(res)

The key idea: a generator yields one chunk of the file at a time (taking care to keep chunks aligned with line boundaries), each chunk is handed to a worker process, and the per-chunk counts are summed at the end to get the total number of lines.
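
To see why the f.readline() call in chunkify keeps chunks aligned to line boundaries: after seeking into the middle of a line, readline() advances the file position to just past the next newline. A minimal illustration, using a hypothetical throwaway file demo.txt created only for this demonstration:

# coding:utf-8
# "first line\n" is 11 bytes, so after consuming the partial line the
# position lands exactly at the start of "second line".
with open("demo.txt", "w") as f:
    f.write("first line\nsecond line\nthird line\n")

with open("demo.txt") as f:
    f.seek(3)           # land in the middle of "first line"
    f.readline()        # consume the rest of that partial line
    print f.tell()      # 11 -- the offset just past the first '\n'
    print f.readline()  # second line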

# 5 processes
$ time python batch.py
12762416

real    0m1.367s
user    0m4.004s
sys 0m2.653s

# 15 processes
$ time python batch.py
12762416

real    0m0.672s
user    0m4.780s
sys 0m3.755s

I hope this gives you some ideas.
