Server Quantization Training Operation Instructions

The main steps of Freespace server pre-training:

  1. Log in to the bastion host with the following command:

ssh [email protected]   (the user name is the prefix of your personal email address)

The password is your personal email password.

  2. Log in to the working machine with the following command:

ssh [email protected]

The password is: l3

  3. Locate the training source code and scripts for the freespace network on the working machine (this is the verified version). The original path is /home/l3/chenghongkuan/freespace/perception-tnt8.2. Create a new personal directory under the root directory and copy the contents of the original path into it.
  4. Cluster environment configuration
    1. Copying the SLURM client tool: the SLURM client is already installed on the working machine, so you can copy it directly into your own directory.
    2. Token application: in your own client directory, run the token application command. You will receive an email, so check your inbox.
    3. Token configuration: run the token configuration command, using the ak and sk values from the email you received when applying. A further email arrives if the configuration succeeds.

  5. Training task submission: for now this uses the simplest procedure, as follows:
    1. In your own training path, find submit.sh and change the HGclient address in it to your own directory.
    2. In your own training path, find train.sh; you can change job_name to a name of your own choosing.
    3. In your own training path, find freespace.yaml and check that DATASET:TRAINING:DATA_MODULE is apps.freespace.src.data.sfs_v3.SFSDataset and that MODEL:BACKBONE:CONV_BODY is SfsVps.
    4. In your own training path, run sh train.sh to submit the training task to the remote cluster. If the submission succeeds, confirmation information is printed.

The submitted task can also be checked in the Qianmo console.
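The freespace.yaml check in the submission steps above can also be scripted. The following is a minimal sketch, assuming the colon-separated names DATASET:TRAINING:DATA_MODULE and MODEL:BACKBONE:CONV_BODY correspond to nested mappings in the YAML file. The helper names get_nested and check_config are hypothetical and are not part of perception-tnt. The configuration is shown as an already-parsed dictionary; for a real file it could be obtained with PyYAML's yaml.safe_load.

```python
from typing import Any, Dict, List

# Expected values copied from the checklist above.
# The dotted-key layout is an assumption about how freespace.yaml is nested.
EXPECTED = {
    "DATASET.TRAINING.DATA_MODULE": "apps.freespace.src.data.sfs_v3.SFSDataset",
    "MODEL.BACKBONE.CONV_BODY": "SfsVps",
}


def get_nested(cfg: Dict[str, Any], dotted_key: str) -> Any:
    """Walk a nested dict along 'A.B.C'; return None if any level is missing."""
    node: Any = cfg
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    return node


def check_config(cfg: Dict[str, Any], expected: Dict[str, str] = EXPECTED) -> List[str]:
    """Return one human-readable message per mismatching key (empty list if all match)."""
    problems = []
    for key, want in expected.items():
        got = get_nested(cfg, key)
        if got != want:
            problems.append(f"{key}: expected {want!r}, found {got!r}")
    return problems


# In-memory stand-in for the parsed freespace.yaml (hypothetical content).
sample_cfg = {
    "DATASET": {"TRAINING": {"DATA_MODULE": "apps.freespace.src.data.sfs_v3.SFSDataset"}},
    "MODEL": {"BACKBONE": {"CONV_BODY": "SfsVps"}},
}

problems = check_config(sample_cfg)
print("config OK" if not problems else "\n".join(problems))
```

Running such a check before sh train.sh catches a wrong dataset module or backbone name before the job reaches the cluster queue.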

  6. Training model acquisition: find the task for this training run under "My Job" on the Qianmo server page, click "View overview" to enter the Log Agent interface, click "output", select the model, and click "download" to download it.

 

The main steps of Freespace quantization parameter extraction:

  1. Copy the pre-trained model (for example, model_0530000.pth) to the directory perception-tnt/tnt/entries/output/.
  2. Modify freespace.yaml in the perception-tnt directory as follows:

Uncomment the relevant line and set WEIGHT to the newly generated model file, for example /tnt/entries/output/model_0530000.pth.

  3. Modify train_net.py in the perception-tnt/tnt/entries/ directory as follows:

Comment out the do_train code.

  4. In the perception-tnt/tnt/entries/ directory, execute python train_net.py --config-file ../../freespace.yaml --compress-file ../../apps/freespace/config/compressor_config.yaml --skip-test 0 . After execution, quantization_stats.yaml is generated in the perception-tnt/tnt/entries/output/ directory.
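The extraction run in the last step can be wrapped in a small script that also confirms the statistics file was produced. This is only a sketch: build_extraction_command and run_extraction are hypothetical helper names, the arguments are exactly those listed in the step above, and the repository root is passed in by the caller.

```python
import shlex
import subprocess
import sys
from pathlib import Path
from typing import List


def build_extraction_command(python_exe: str = "python") -> List[str]:
    """Return the argv for the quantization-statistics run described above."""
    return [
        python_exe, "train_net.py",
        "--config-file", "../../freespace.yaml",
        "--compress-file", "../../apps/freespace/config/compressor_config.yaml",
        "--skip-test", "0",
    ]


def run_extraction(repo_root: Path) -> Path:
    """Run train_net.py from tnt/entries and return the path of the generated stats file."""
    entries = repo_root / "tnt" / "entries"
    # The relative paths in the command only resolve when run from tnt/entries.
    subprocess.run(build_extraction_command(sys.executable), cwd=entries, check=True)
    stats = entries / "output" / "quantization_stats.yaml"
    if not stats.is_file():
        raise FileNotFoundError(f"expected {stats} after the run")
    return stats


if __name__ == "__main__":
    # Only print the command here; call run_extraction(Path("perception-tnt")) on the working machine.
    print(shlex.join(build_extraction_command()))
```

Passing check=True makes a failed run raise immediately instead of leaving a stale quantization_stats.yaml from an earlier attempt unnoticed.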

The main steps of Freespace server retraining:

  1. Download the latest perception-tnt source code to the working machine and switch to the dev_fpga_v3 branch.
  2. Modify the content of freespace.yaml as follows:

Change sfs_swapchannel to sfs_v3 and SfsVps to SfsVpsV3.

  3. Copy the model file and the quantization parameter file quantization_stats.yaml to the /tnt/entries/output/ directory.
  4. Set BASE_LR to the final learning rate of the original training run. If the earlier records have been lost, a smaller value can be used instead.
  5. Follow steps 5 and 6 of the pre-training procedure for training and model acquisition.
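If the original run's final learning rate was not recorded but its schedule is known, the value for BASE_LR can be recomputed. The sketch below assumes a multi-step decay schedule (the rate is multiplied by gamma at each milestone), which the source does not confirm; all numbers are hypothetical placeholders, and lr_at_iteration is an illustrative helper, not a perception-tnt function.

```python
from bisect import bisect_right
from typing import Sequence


def lr_at_iteration(base_lr: float, gamma: float,
                    milestones: Sequence[int], iteration: int) -> float:
    """Learning rate of a multi-step decay schedule at the given iteration."""
    # Every milestone already reached multiplies the rate by gamma once.
    decays = bisect_right(sorted(milestones), iteration)
    return base_lr * gamma ** decays


# Hypothetical values for illustration; read the real ones from the original training config.
original_base_lr = 0.01
gamma = 0.1
milestones = (300000, 450000)
last_iteration = 530000

new_base_lr = lr_at_iteration(original_base_lr, gamma, milestones, last_iteration)
print(f"BASE_LR for retraining: {new_base_lr:.6g}")
```

With these placeholder numbers both milestones have been passed, so the rate has been reduced twice and the retraining BASE_LR comes out at roughly 1e-4.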

Related and important information:

  1. For general guidance on server training, see the README at http://icode.baidu.com/repos/baidu/adu-lab/perception-tnt/blob/master:README.md .

  2. For the installation and configuration of the SLURM client, refer to the wiki:

http://wiki.baidu.com/pages/viewpage.action?pageId=365141511

The client tool is already installed on the working machine, so the main remaining work is applying for and configuring the token.

  3. The GPU clusters available for training are listed at http://newqianmo.baidu.com/index.jsp#/user/gpu?status=RUNNING&_k=j0cckg . The colleagues involved should already have the relevant permissions.

  4. View the details of a submitted training task:

  5. Withdraw a submitted training task:

Origin blog.csdn.net/weixin_45905610/article/details/131819830