[Exception] Unexpected option: --local_rank=0 (PyCharm can run, but cannot debug)

Today, after converting the shell script into a single command and launching it with Run, the program ran fine, but Debug failed with the following error:

Usage:
	pydevd.py --port N [(--client hostname) | --server] --file executable [file_options]
Traceback (most recent call last):
  File "/home/mapengsen/.pycharm_helpers/pydev/pydevd.py", line 2016, in main
    setup = process_command_line(sys.argv)
  File "/home/mapengsen/.pycharm_helpers/pydev/_pydevd_bundle/pydevd_command_line_handling.py", line 146, in process_command_line
    raise ValueError("Unexpected option: " + argv[i])
ValueError: Unexpected option: --local_rank=0
[2023-07-08 10:08:11,202] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2934
[2023-07-08 10:08:11,202] [ERROR] [launch.py:321:sigkill_handler] ['/home/mapengsen/anaconda3/envs/38/bin/python', '-u', '/home/mapengsen/.pycharm_helpers/pydev/pydevd.py', '--local_rank=0', '--multiprocess', '--qt-support=auto', '--client', '127.0.0.1', '--port', '58899', '--file', '/mnt/d/Pycharm_workspace/DoubleTarget/RetMol/MolBART/train_retrieval.py', '--model-parallel-size', '1', '--pipe-parallel-size', '0', '--num-layers', '4', '--hidden-size', '256', '--num-attention-heads', '8', '--seq-length', '512', '--max-position-embeddings', '512', '--batch-size', '320', '--gas', '16', '--train-iters', '50000', '--lr-decay-iters', '320000', '--data-impl', 'mmap', '--distributed-backend', 'nccl', '--lr', '0.0001', '--lr-decay-style', 'cosine', '--min-lr', '1.0e-5', '--weight-decay', '0', '--clip-grad', '1.0', '--warmup', '0.01', '--checkpoint-activations', '--log-interval', '1', '--save-interval', '10000', '--eval-interval', '50000', '--eval-iters', '10', '--fp16', '--dataset_path', '../data/zinc.tab', '--deepspeed', '--deepspeed_config', 'megatron_molbart/ds_config.json', '--zero-stage', '1', '--zero-reduce-bucket-size', '50000000', '--zero-allgather-bucket-size', '5000000000', '--zero-reduce-scatter', '--checkpoint-activations', '--checkpoint-num-layers', '1', '--partition-activations', '--synchronize-each-layer', '--contigious-checkpointing', '--stage', '1', '--train_from', 'pretrain', '--model_ckpt_itr', '50000', '--attr', 'logp-sa', '--attr_offset', '0', '--data_source', 'jtnn', '--enumeration_input', 'false', '--retriever_rule', 'random', '--pred_target', 'reconstruction', '--n_retrievals', '10', '--n_neighbors', '100'] exits with return code = 1

Process finished with exit code 1

Searching online, most answers pointed to distributed training: because the script is launched through DeepSpeed, the launcher injects distributed arguments into the command line. As the log above shows, under Debug that command starts with pydevd.py, so --local_rank=0 lands on the debugger instead of the training script, and pydevd.py rejects the unknown option.
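
For context, torch.distributed.launch (and DeepSpeed's launcher) starts every worker with an extra --local_rank=N argument that the training script is expected to parse itself. A minimal sketch of that convention (illustrative, not this project's actual code):

    import argparse

    parser = argparse.ArgumentParser()
    # torch.distributed.launch / DeepSpeed append --local_rank=N for each worker
    parser.add_argument("--local_rank", type=int, default=0)
    # parse_known_args tolerates any other launcher-injected flags
    args, _ = parser.parse_known_args()
    print(f"running as local rank {args.local_rank}")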

The fix comes from this CSDN article: "How to debug torch.distributed under PyCharm" (Qi Yuanyuan's blog).

The method itself is simple: just a few settings in PyCharm's Run/Debug Configuration:

  • Open Run -> Edit Configurations...
  • Script path is no longer your own code, but the location where torch's launch.py is installed. For example, mine is:
\\wsl$\Ubuntu-18.04\home\mapengsen\anaconda3\envs\38\lib\python3.8\site-packages\torch\distributed\launch.py
  • Set Parameters to the launcher flag followed by your own entry script (main.py here stands for your script; the resulting launch is sketched after this list):
--nproc_per_node=1 main.py
  • Add CUDA_VISIBLE_DEVICES=0,1 under Environment variables.
  • Delete the deepspeed part from the Interpreter options (the distributed launch now goes through launch.py as the script path).
  • The remaining settings, Python interpreter and Working directory, can be filled in as usual.
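
With this configuration, PyCharm debugs launch.py itself, and launch.py in turn spawns each worker roughly as follows (a sketch; main.py stands for your own training script):

    python -u main.py --local_rank=0 [your script's own arguments]

This way --local_rank reaches your script, which knows how to parse it, instead of pydevd.py, which does not.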

After these steps, you can debug distributed training code in PyCharm.
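
As a quick sanity check once the debugger attaches, you can verify that the distributed environment came up; a minimal sketch, assuming your script has not already initialized the process group:

    import torch.distributed as dist

    # launch.py exports MASTER_ADDR/MASTER_PORT/RANK/WORLD_SIZE, so the
    # default env:// initialization works without extra arguments
    dist.init_process_group(backend="nccl")  # "gloo" works for CPU-only debugging
    print(f"rank {dist.get_rank()} of {dist.get_world_size()}")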


Reprinted from: blog.csdn.net/weixin_43135178/article/details/131608938