References:
https://www.ctolib.com/topics/85107.html
https://twiki.cern.ch/twiki/bin/view/Main/PythonLoggingThreadingMultiprocessingIntermixedStudy
https://github.com/google/python-atfork
Our service combines multithreading, multiprocessing and logging, and every now and then one of its child processes deadlocks.
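Analysis showed why the child deadlocks. When multiprocessing creates a child it calls os.fork(), and the child inherits a copy of every lock in the parent's logging module, in whatever state that lock was in at that instant. fork() copies only the thread that called it. So if another thread was holding a logging lock at that moment (for example, while writing a record), the child's copy stays locked and no thread in the child will ever release it. Releasing the lock in the parent does not help, because the two copies are independent. The child blocks forever the first time it tries to acquire that lock and can no longer write logs.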
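The effect can be reproduced deterministically without logging at all. The sketch below (Python 3, Unix only; the function name is made up for illustration) uses a plain threading.Lock in place of a logging lock: one thread holds it, the main thread forks, and the child tries to acquire its copy with a timeout instead of blocking forever. On Python 3.12+ os.fork() also emits a DeprecationWarning in this situation, which warns about exactly this hazard.

```python
import os
import threading


def child_can_acquire_after_fork(timeout=0.5):
    """Fork while another thread holds a lock and report whether the child
    can acquire its copy of that lock. Unix only (os.fork)."""
    lock = threading.Lock()
    held = threading.Event()
    done = threading.Event()

    def holder():
        with lock:          # stands in for a thread busy inside log.info()
            held.set()
            done.wait()     # keep the lock until the parent lets go

    t = threading.Thread(target=holder)
    t.start()
    held.wait()             # guarantee the lock is held at fork time

    pid = os.fork()
    if pid == 0:
        # Child: the holder thread was not copied, so nobody can release
        # this lock. The timeout stands in for "blocks forever".
        ok = lock.acquire(timeout=timeout)
        os._exit(0 if ok else 1)

    # Parent: let the holder release the parent's copy of the lock.
    done.set()
    t.join()
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    # prints: child acquired the lock: False
    print("child acquired the lock:", child_can_acquire_after_fork())
```

Note that the parent releases its lock right after the fork, yet the child still cannot acquire its own copy.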
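The problem never showed up in the test environment, and in production the deadlock only happens occasionally, so we first needed a way to reproduce it.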
Test script:
import os
import sys
curdir = os.path.dirname(os.path.abspath(__file__))
sys.path.append(os.path.dirname(curdir))
from multiprocessing import Process
import threading
from common import log  # the service's own logging wrapper (not shown in this post)
import time


def worker(id):
    # logger = logging.getLogger("worker '%s'" % id)
    while True:
        t = time.time()
        msg = "worker '%s', time: %s" % (id, t)
        # print msg
        log.info(msg)
        # logger.debug(msg)


def startProcessWorkers(numWorkers):
    workers = []
    for i in range(numWorkers):
        id = "process %02i" % i
        w = Process(target=worker, args=(id,))
        # With the 'fork' start method (the Unix default in Python 2), start()
        # forks here, possibly while a thread holds a logging lock.
        w.start()
        workers.append((id, w))
    return workers


class Worker(threading.Thread):
    def __init__(self, id, iterations=None):
        threading.Thread.__init__(self)
        self.id = id
        self.iterations = iterations
        self._stopFlag = False

    def run(self):
        # logger = logging.getLogger("worker '%s'" % id)
        counter = self.iterations
        while not self._stopFlag:
            t = time.time()
            msg = "worker '%s', time: %s" % (self.id, t)
            # print msg
            log.info(msg)
            # logger.debug(msg)
            if self.iterations:
                counter -= 1
                if not counter:
                    self._stopFlag = True

    def terminate(self):
        self._stopFlag = True


def startThreadWorkers(numWorkers, iterations=None):
    workers = []
    for i in range(numWorkers):
        id = "thread %02i" % i
        w = Worker(id, iterations=iterations)
        w.start()
        workers.append((id, w))
    return workers


if __name__ == "__main__":
    workers = []
    # Start the logging threads first so they are busy inside log.info()
    # when the child processes are forked.
    workers.extend(startThreadWorkers(2, iterations=10000))
    # iterations - make the threads stop after some time - processes
    # in the case (2) will never resume anyway
    workers.extend(startProcessWorkers(2))  # problem
    # workers.extend(startThreadWorkers(2, iterations = 10000))
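The figure shows that 3 processes were started (the parent plus two children), yet the log file contains only the threads' output; the child processes logged nothing.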
Solution: https://github.com/google/python-atfork
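How it works: once os.fork() has been monkeypatched, every call to fork() from Python runs a set of registered callbacks, kept in three lists: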
_prepare_call_list
_parent_call_list
_child_call_list
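which correspond to three points in time:
1. prepare: before fork() is called.
2. parent: in the parent process, immediately after fork() returns (whether it succeeded or failed).
3. child: in the child process, immediately after fork().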
Usage:
import atfork
atfork.monkeypatch_os_fork_functions()
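Then register the three callbacks with atfork.atfork(). For a lock, prepare acquires it so that fork() only happens while no other thread holds it, and parent and child each release their own copy afterwards: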
import threading
my_lock = threading.Lock()  # the lock to protect across fork()
atfork.atfork(prepare=my_lock.acquire,
              parent=my_lock.release,
              child=my_lock.release)
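python-atfork targets Python 2. Since Python 3.7 the standard library offers the same three hooks directly as os.register_at_fork(before=..., after_in_parent=..., after_in_child=...) (Unix only), and they also run for processes forked by multiprocessing's 'fork' start method. Below is a minimal sketch of the registration above using it, assuming a hypothetical module-level my_lock; the helper (also hypothetical) forks while another thread briefly holds the lock and checks that the child can still take it.

```python
import os
import threading
import time

# Hypothetical application lock, playing the role of my_lock above.
my_lock = threading.Lock()

# Same pattern as atfork.atfork(prepare=..., parent=..., child=...).
os.register_at_fork(
    before=my_lock.acquire,          # fork() waits until no other thread holds it
    after_in_parent=my_lock.release,
    after_in_child=my_lock.release,  # the child starts with an unlocked copy
)


def child_can_lock_after_fork(busy_seconds=0.2):
    """Fork while another thread briefly holds my_lock; True if the child can take it."""
    started = threading.Event()

    def busy_thread():
        with my_lock:
            started.set()
            time.sleep(busy_seconds)  # e.g. a thread in the middle of writing a log record

    t = threading.Thread(target=busy_thread)
    t.start()
    started.wait()                    # my_lock is held when fork() is called

    pid = os.fork()                   # the 'before' hook blocks until busy_thread is done
    if pid == 0:
        ok = my_lock.acquire(timeout=1)
        os._exit(0 if ok else 1)

    t.join()
    _, status = os.waitpid(pid, 0)
    return os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    print("child acquired my_lock:", child_can_lock_after_fork())  # prints: child acquired my_lock: True
```

Two caveats: registrations cannot be undone, and because my_lock is a non-reentrant Lock, the before hook would deadlock if the thread calling fork() already held it. As far as we can tell from the CPython source, recent releases register hooks of this kind inside logging itself, so the deadlock described here mainly affects older interpreters.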
For the locks inside the logging module itself, python-atfork provides a ready-made fix:
from atfork import stdlib_fixer
stdlib_fixer.fix_logging_module()