MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library

While running distributed training with PyTorch, the following error occurred:

Error: mkl-service + Intel(R) MKL: MKL_THREADING_LAYER=INTEL is incompatible with libgomp.so.1 library.
    Try to import numpy first or set the threading layer accordingly. Set MKL_SERVICE_FORCE_INTEL to force it.

Solution 1: add this environment variable setting

export MKL_SERVICE_FORCE_INTEL=1

Solution 2: alternatively, add this environment variable setting

export MKL_THREADING_LAYER=GNU
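Either variable can also be set from inside the entry script itself, as long as it runs before numpy or torch is imported (a minimal sketch; MKL reads the variable once when it initializes its threading layer, so setting it later has no effect):

```python
import os

# Must run before any MKL-backed import (numpy, torch): MKL reads
# MKL_THREADING_LAYER only once, at initialization time.
os.environ["MKL_THREADING_LAYER"] = "GNU"

# ...now it is safe to import numpy / torch and start training.
```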

Analysis:

Grepping conda manifests, libgomp is pulled in by libgcc-ng, which is in turn pulled in by pretty much everything. So the culprit is more likely whoever is setting MKL_THREADING_LAYER=INTEL. As far as that goes, it's weird.

import os

def print_layer(prefix):
    print(f'{prefix}: {os.environ.get("MKL_THREADING_LAYER")}')

if __name__ == '__main__':
    print_layer('Pre-import')
    from torch import multiprocessing as mp  # torch imported first
    import numpy as np
    print_layer('Post-import')

    mp.set_start_method('spawn')
    p = mp.Process(target=print_layer, args=('Child',))
    p.start()
    p.join()

See, if torch is imported before numpy then the child process here gets a GNU threading layer (even though the parent doesn't have the variable defined).

Pre-import: None
Post-import: None
Child: GNU

But if the imports are swapped so numpy is imported before torch, the child process gets an INTEL threading layer:

Pre-import: None
Post-import: None
Child: INTEL

So I suspect numpy - or one of its imports - is messing with the env parameter of Popen, but after half an hour's search I can't figure out how.
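Whatever the mechanism turns out to be, explicitly setting the variable in the parent process (solution 2 above) is reliable, because a spawned child inherits the parent's os.environ. A self-contained sketch using subprocess rather than torch.multiprocessing, to avoid the torch dependency:

```python
import os
import subprocess
import sys

# Pin the layer in the parent before launching any workers.
os.environ["MKL_THREADING_LAYER"] = "GNU"

# The child inherits the parent's environment by default, so it sees
# the pinned value regardless of its own import order.
child = subprocess.run(
    [sys.executable, "-c",
     'import os; print(os.environ.get("MKL_THREADING_LAYER"))'],
    capture_output=True, text=True,
)
print(child.stdout.strip())  # GNU
```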

Ref: https://github.com/pytorch/pytorch/issues/37377


Reposted from blog.csdn.net/dou3516/article/details/121396950