stuck during mmclassification test

Server: 6 x GTX 1080 Ti
System: Ubuntu 16.04
cuda: 10.2
pytorch: 1.8.2
mmcv-full: 1.8.0+cu102
mmclassification: 20220511

Description: When testing a model's accuracy with mmclassification, the program hangs. The progress bar shows that forward inference over all the data has completed, but the final accuracy is never printed.

Solution

By reading the source code and inserting print statements, I found that distributed testing in mmclassification gathers the results from all processes through the collect_results_cpu function. Each GPU process writes its results to the temporary folder .dist_test, and the rank 0 process then loads them back into memory, so that the scattered results are merged into one.
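
To make the write-then-reload pattern concrete, here is a minimal sketch in plain Python. It is not the code of collect_results_cpu: the helper names (dump_part, missing_parts, merge_parts) and the part_<rank>.pkl file naming are assumptions made for illustration, and the synchronization between processes and the ordering of results in the real function are not reproduced. The missing_parts helper shows one way to find out which rank never wrote its file.

```python
import os
import pickle
import tempfile


def dump_part(tmpdir, rank, results):
    """Each rank writes its own partial results to part_<rank>.pkl (assumed naming)."""
    with open(os.path.join(tmpdir, f"part_{rank}.pkl"), "wb") as f:
        pickle.dump(results, f)


def missing_parts(tmpdir, world_size):
    """Return the ranks whose result file is absent from tmpdir."""
    return [r for r in range(world_size)
            if not os.path.exists(os.path.join(tmpdir, f"part_{r}.pkl"))]


def merge_parts(tmpdir, world_size):
    """Rank 0 loads every part file and concatenates them in rank order."""
    merged = []
    for r in range(world_size):
        with open(os.path.join(tmpdir, f"part_{r}.pkl"), "rb") as f:
            merged.extend(pickle.load(f))
    return merged


if __name__ == "__main__":
    tmpdir = tempfile.mkdtemp()
    # Simulate 6 ranks where rank 4 fails to save its results.
    for rank in (0, 1, 2, 3, 5):
        dump_part(tmpdir, rank, [f"result-from-rank-{rank}"])
    print("ranks with no result file:", missing_parts(tmpdir, 6))
```

If one rank never produces its file, the merge step cannot complete, which matches the symptom of a finished progress bar followed by no accuracy output.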

Further debugging showed that the process running on the GPU with id=4 could not save its temporary results properly, so the collection step never finished and the program froze.

Use the CUDA_VISIBLE_DEVICES environment variable to hide the GPU with id=4 and run the test on the other 5 cards only.
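
As a small illustration, the value for CUDA_VISIBLE_DEVICES can be built and set from Python. The helper name visible_devices is hypothetical; the sketch assumes the variable is set before any CUDA library is initialized in the process, since it is read at that point.

```python
import os


def visible_devices(total_gpus, excluded):
    """Build a CUDA_VISIBLE_DEVICES value that skips the excluded GPU ids."""
    skip = set(excluded)
    return ",".join(str(i) for i in range(total_gpus) if i not in skip)


# Hide GPU 4 out of 6; set this before importing torch or touching CUDA.
os.environ["CUDA_VISIBLE_DEVICES"] = visible_devices(6, [4])
print(os.environ["CUDA_VISIBLE_DEVICES"])  # prints 0,1,2,3,5
```

The same value can be exported in the shell before launching the test script. Inside the process the remaining cards are renumbered from 0 to 4, so the number of GPUs passed to the launcher should be 5.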

Summary

Other OpenMMLab projects probably use similar logic. If one of the GPUs in the server is faulty or unreliable, this kind of problem may occur.

Origin blog.csdn.net/bcfd_yundou/article/details/124716285