[Exception] ValueError: Unexpected option: --local_rank=0 (PyCharm can run but cannot debug)

Today, after converting the command from the shell script into a PyCharm run configuration, Run works normally, but Debug fails with the following error:

Usage:
	pydevd.py --port N [(--client hostname) | --server] --file executable [file_options]
Traceback (most recent call last):
  File "/home/mapengsen/.pycharm_helpers/pydev/pydevd.py", line 2016, in main
    setup = process_command_line(sys.argv)
  File "/home/mapengsen/.pycharm_helpers/pydev/_pydevd_bundle/pydevd_command_line_handling.py", line 146, in process_command_line
    raise ValueError("Unexpected option: " + argv[i])
ValueError: Unexpected option: --local_rank=0
[2023-07-08 10:08:11,202] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 2934
[2023-07-08 10:08:11,202] [ERROR] [launch.py:321:sigkill_handler] ['/home/mapengsen/anaconda3/envs/38/bin/python', '-u', '/home/mapengsen/.pycharm_helpers/pydev/pydevd.py', '--local_rank=0', '--multiprocess', '--qt-support=auto', '--client', '127.0.0.1', '--port', '58899', '--file', '/mnt/d/Pycharm_workspace/DoubleTarget/RetMol/MolBART/train_retrieval.py', '--model-parallel-size', '1', '--pipe-parallel-size', '0', '--num-layers', '4', '--hidden-size', '256', '--num-attention-heads', '8', '--seq-length', '512', '--max-position-embeddings', '512', '--batch-size', '320', '--gas', '16', '--train-iters', '50000', '--lr-decay-iters', '320000', '--data-impl', 'mmap', '--distributed-backend', 'nccl', '--lr', '0.0001', '--lr-decay-style', 'cosine', '--min-lr', '1.0e-5', '--weight-decay', '0', '--clip-grad', '1.0', '--warmup', '0.01', '--checkpoint-activations', '--log-interval', '1', '--save-interval', '10000', '--eval-interval', '50000', '--eval-iters', '10', '--fp16', '--dataset_path', '../data/zinc.tab', '--deepspeed', '--deepspeed_config', 'megatron_molbart/ds_config.json', '--zero-stage', '1', '--zero-reduce-bucket-size', '50000000', '--zero-allgather-bucket-size', '5000000000', '--zero-reduce-scatter', '--checkpoint-activations', '--checkpoint-num-layers', '1', '--partition-activations', '--synchronize-each-layer', '--contigious-checkpointing', '--stage', '1', '--train_from', 'pretrain', '--model_ckpt_itr', '50000', '--attr', 'logp-sa', '--attr_offset', '0', '--data_source', 'jtnn', '--enumeration_input', 'false', '--retriever_rule', 'random', '--pred_target', 'reconstruction', '--n_retrievals', '10', '--n_neighbors', '100'] exits with return code = 1

Process finished with exit code 1

After searching online, most answers say this is caused by distributed launching; in my case it is most likely because I use DeepSpeed, whose launcher injects distributed arguments such as --local_rank.
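To make the cause concrete, here is a rough sketch of what a distributed launcher such as deepspeed (or torch.distributed.launch) does; this is an illustration, not the real launcher source. The launcher prepends --local_rank to the arguments of whatever script it is told to start. Under Run, that script is train_retrieval.py, which accepts --local_rank; under Debug, PyCharm inserts its pydevd.py bootstrap in that position, and pydevd.py does not know the option, hence the error above.

# Rough illustration of how a distributed launcher spawns worker processes
# (simplified sketch, not the actual deepspeed / torch.distributed.launch code).
import subprocess
import sys

def spawn_workers(training_script, script_args, nproc_per_node=1):
    procs = []
    for local_rank in range(nproc_per_node):
        # The launcher inserts --local_rank right after the script it launches.
        cmd = [sys.executable, "-u", training_script,
               f"--local_rank={local_rank}", *script_args]
        procs.append(subprocess.Popen(cmd))
    return procs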

For the fix, refer to this article: "How to debug torch.distributed under PyCharm" (Qi Yuanyuan's blog on CSDN).

The method is simple; it only requires a few settings in PyCharm's Run/Debug Configuration:

  • Open Run -> Edit Configurations...
  • Script path is no longer your own training script, but the path where torch's launch.py is saved. For example, mine is:
\\wsl$\Ubuntu-18.04\home\mapengsen\anaconda3\envs\38\lib\python3.8\site-packages\torch\distributed\launch.py
  • Set Parameters to the launcher options followed by your own script and its arguments (main.py here is just a placeholder; see the example command after this list):
--nproc_per_node=1 main.py
  • In Environment variables, add CUDA_VISIBLE_DEVICES=0,1.
  • Delete the deepspeed part from Interpreter options (distributed launching is now handled by launch.py, which is used as the Script path).
  • The remaining settings, Python interpreter and Working directory, can be configured as usual.
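Put together, these settings make the PyCharm debugger attach to launch.py itself, which then spawns your training script as a worker process. Roughly speaking (an illustration, not an exact command PyCharm prints), the debugged command is equivalent to:

python /path/to/torch/distributed/launch.py --nproc_per_node=1 main.py [your training script's own arguments]

where main.py stands for your real entry point, e.g. train_retrieval.py together with the options shown in the log above.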

After these steps, you can debug the distributed training code in PyCharm.
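One extra point worth noting (not part of the original steps, but standard behaviour of torch.distributed.launch): launch.py passes --local_rank to every worker it starts, so the script given in Parameters has to accept that argument, typically via argparse. A minimal sketch, assuming your entry point is main.py:

# Minimal sketch: a script started by torch.distributed.launch must accept
# --local_rank, which the launcher passes to each worker process.
import argparse

def parse_args():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0,
                        help="filled in automatically by torch.distributed.launch")
    # ... your own training arguments go here ...
    return parser.parse_args()

if __name__ == "__main__":
    args = parse_args()
    print(f"worker started with local_rank={args.local_rank}")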

Origin: blog.csdn.net/weixin_43135178/article/details/131608938