Deep learning pitfall notes [continuously updated]

Background

During deep-learning "alchemy" (model training) you constantly run into strange problems, and you usually end up finding the answer on CSDN or Zhihu. Each time a problem gets solved but never written down, which is a pity, because at some point in the future you, or someone else, will hit a similar problem again. So this article collects issues around pytorch, python, conda, pip and more, in the hope that it helps.

pytorch

Problem 1: CUDA error: device-side assert triggered

If the error is raised from your own code, it is usually a logic error:
1. Check whether the loss becomes NaN during training; changing how the word vectors are concatenated, etc., may help.
2. For classification tasks, the number of labels may not match the model's output dimension.
One article suggests adding torch.backends.cudnn.enabled = True and torch.backends.cudnn.benchmark = True to the code (this did not help in my case).
In general, this error means an invalid tensor operation, such as an index out of bounds or a precision/dtype mismatch.
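A minimal sketch (my own illustration, not from the original post) of the most common classification case: a label outside [0, num_classes) triggers the opaque device-side assert on GPU, so validating labels on CPU first gives a readable error instead.

```python
import torch

num_classes = 3
logits = torch.randn(4, num_classes)
labels = torch.tensor([0, 2, 1, 5])  # 5 is out of range for 3 classes

# check labels on CPU before moving anything to the GPU
bad = (labels < 0) | (labels >= num_classes)
if bad.any():
    print("out-of-range labels at positions:", bad.nonzero().flatten().tolist())
else:
    loss = torch.nn.functional.cross_entropy(logits, labels)
```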

Problem 2: CUDA out of memory

  • Reduce the batch_size
  • Call torch.cuda.empty_cache() at the point where the error is raised
  • Wrap inference/evaluation code in with torch.no_grad():
  • Call model.eval() before evaluation
  • Convert the loss and evaluation metrics to plain floats with float(), or delete the loss at the end of each epoch, so the computation graph can be freed
  • Change "pin_memory": True to False; the original blog explains why:
    pin_memory refers to page-locked memory. Setting pin_memory=True when creating a DataLoader means the generated tensors are placed in page-locked host memory, which makes copying them to GPU memory faster.
    Host memory exists in two forms, page-locked and pageable. Content in page-locked memory is never swapped out to the host's virtual memory (i.e., the hard disk), whereas pageable data may be swapped out when host memory runs low. GPU memory is entirely page-locked. With ample host RAM you can set pin_memory=True; if the system freezes or swap usage is high, set pin_memory=False. Because this depends on hardware, the PyTorch developers cannot assume every alchemist has a high-end machine, so pin_memory defaults to False.
    References: https://blog.csdn.net/xiyou__/article/details/118529350
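A small sketch (illustrative, not from the original post) combining two of the remedies above: passing pin_memory explicitly to the DataLoader and running evaluation under no_grad so no autograd graph is kept in memory.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(64, 8), torch.randint(0, 2, (64,)))
# set pin_memory=False when host RAM is tight or swap usage is high
loader = DataLoader(dataset, batch_size=16, pin_memory=False)

model = torch.nn.Linear(8, 2)
model.eval()
with torch.no_grad():  # no computation graph is built -> far less memory
    total = sum(model(x).argmax(1).eq(y).sum().item() for x, y in loader)
print("correct:", total)
```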

Problem 3: NaN appears when nn.TransformerEncoder is used with src_key_padding_mask

The NaN comes from src_key_padding_mask. It is a boolean tensor: positions to be ignored should be True, positions whose values should be kept should be False. Inspection showed that src_key_padding_mask was all True, which causes every encoded result to be NaN.
The solution is to fix the mask (keep at least one False per row) or not use the mask at all.
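The mechanism can be shown in isolation (my illustration): a fully-masked attention row reduces to a softmax over all -inf scores, which is NaN; keeping even one position makes it well-defined.

```python
import torch

scores = torch.full((3,), float("-inf"))   # every position masked out
print(torch.softmax(scores, dim=0))        # all NaN

mixed = torch.tensor([0.5, float("-inf"), float("-inf")])  # one position kept
print(torch.softmax(mixed, dim=0))         # well-defined: [1., 0., 0.]
```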

Problem 4: Tensor must be contiguous

Solution

batch_data[1].permute([1,0,2]).contiguous()

Reference:
https://www.jianshu.com/p/51678ea7a959
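A self-contained sketch (my illustration) of why the fix above works: permute() returns a non-contiguous view, and operations such as view() fail until .contiguous() copies the data into a contiguous layout.

```python
import torch

t = torch.randn(2, 3, 4)
p = t.permute([1, 0, 2])          # shape (3, 2, 4), non-contiguous view
print(p.is_contiguous())          # False
c = p.contiguous()                # materialize a contiguous copy
flat = c.view(-1)                 # works; p.view(-1) would raise RuntimeError
print(flat.shape)                 # torch.Size([24])
```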

Problem 5: After loading a model, optimizer.step() raises RuntimeError: expected device cpu but got device cuda:0

Reason: when the optimizer loads its parameters, the tensors are on the CPU by default, so they all need to be moved to the GPU; otherwise optimizer.step() raises RuntimeError: expected device cpu but got device cuda:0.

optimizer = model.configure_optimizers(weight_decay, learning_rate, (beta1, beta2), device_type=cfg.device)
optimizer.load_state_dict(checkpoint['optimizer'])
# move every tensor in the optimizer state onto the target device
for state in optimizer.state.values():
    for k, v in state.items():
        if torch.is_tensor(v):
            state[k] = v.to(cfg.device)

Problem 6: The process hangs at startup; GPU memory is allocated but training never begins (stuck at "Using /home/faith/.cache/torch_extensions as PyTorch extensions root...")

Solution:
There is an automatically generated .cache folder in the user's home directory (it may be hidden; enable showing hidden files). Delete it (or just the torch_extensions directory inside it) and let it be regenerated; the conflict then disappears.
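A minimal sketch of the cleanup, assuming the default cache location shown in the stuck message (adjust the path if TORCH_EXTENSIONS_DIR is set in your environment):

```shell
# remove the stale extensions cache; PyTorch rebuilds it on the next run
CACHE_DIR="${TORCH_EXTENSIONS_DIR:-$HOME/.cache/torch_extensions}"
rm -rf "$CACHE_DIR"
echo "removed $CACHE_DIR"
```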

Problem 7: ImportError: cannot import name 'UnencryptedCookieSessionFactoryConfig' when importing apex

Solution:
Download the apex sources and install them manually; stop using the PyPI-managed apex!
1. git clone https://github.com/NVIDIA/apex
2. cd apex
3. pip install -v --no-cache-dir ./

Note: the package installed via pip install apex is not NVIDIA's official apex.

Problem 8: accelerate library: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:1 and cuda:0! (when checking argument for argument index in method wrapper_CUDA__index_select)

Note: this problem is still being investigated.

Problem 9: RuntimeError: Library cudart is not initialized

Cause:
cudatoolkit is missing
Solution:
conda install cudatoolkit=11.7 -c nvidia

python

1) sqlalchemy

Problem 1: sqlalchemy + pandas.read_sql: AttributeError: 'Engine' object has no attribute 'execution_options'
Scenario: reading data from a database with pandas.read_sql and an SQL query.
Analysis: SQLAlchemy v2.0.0 differs substantially from the 1.x series; the same code that runs on 1.x fails on 2.0.0 and above.
Solution:
Import text and pass the SQL string through the text() helper:

import pandas as pd
from sqlalchemy import create_engine, text

# engine_cloud and query are defined in the surrounding code
s_settings_df = pd.DataFrame(engine_cloud.connect().execute(text(query)))

conda

Problem 1: An unexpected error has occurred. Conda has prepared the above report. If submitted, this report will be used by core maintainers to improve future releases of conda.

Scenario: using miniconda on a server to create a new environment. Running conda create -n chatgpt python==3.8.0 failed with the "An unexpected error has occurred. Conda has prepared the above report." error.

Analysis: I hit this because I had changed the path of the miniconda package directory, since the root partition of that server was short on space.

Solution: I tried everything I could find online, and none of it worked. In the end I uninstalled miniconda, reinstalled it, and specified a new directory.


pip

Problem 1: ERROR: Could not install packages due to an OSError: [Errno 28] No space left on device

Background: installing the torch wheel with pip, as follows:

pip install -I https://download.pytorch.org/whl/cu118/torch-2.0.1%2Bcu118-cp38-cp38-linux_x86_64.whl

Reason: during pip install, the /tmp directory is used to stage build files; /tmp ran out of space, so the installation failed.
Solution: delete the large files in /tmp. The following command, run inside /tmp, helps find them:

du -h --max-depth=1 | sort -hr

Problem 2: pip install reports FileNotFoundError: [Errno 2] No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp']

Scenario: installing the torch wheel with pip, as follows:

pip install -I https://download.pytorch.org/whl/cu118/torch-2.0.1%2Bcu118-cp38-cp38-linux_x86_64.whl

Reason: the pip-related temporary files were deleted while fixing Problem 1.
Solution:
Delete all files under ~/.cache/pip/

Here are several **.cache cleaning methods** for reference:

  • Delete files not accessed for more than 365 days: find ~/.cache/ -type f -atime +365 -delete
  • List files larger than 10 MB, then clean up as appropriate: find ~/.cache/ -size +10M
  • List directories larger than 100 MB, then clean up as appropriate: du ~/.cache -t 100M

References: https://blog.csdn.net/qq_36332660/article/details/129241167

image library

clip

Problem 1: module 'clip' has no attribute 'load'

Running pip install clip installs a different, unrelated package that happens to be called clip; install OpenAI's official repository instead.

With a fast network: pip install git+https://github.com/openai/CLIP.git
With a slow network:

  • First clone the official code to a directory: git clone https://github.com/openai/CLIP.git /tmp/CLIP
  • Then, inside that directory, run pip install ./ (or python setup.py install)

Problem 2: model, preprocess = clip.load('ViT-L/14') fails with certificate verify failed: self signed certificate in certificate chain

Scenario: the simplified core code is as follows:

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

Reason: unreliable network access to the download host (and no proxy at hand).
Solution:
Download the weights manually. Reading the clip source shows the download logic lives in one core file, clip.py, which contains the model URL table:

_MODELS = {
    "RN50": "https://openaipublic.azureedge.net/clip/models/afeb0e10f9e5a86da6080e35cf09123aca3b358a0c3e3b6c78a7b63bc04b6762/RN50.pt",
    "RN101": "https://openaipublic.azureedge.net/clip/models/8fa8567bab74a42d41c5915025a8e4538c3bdbe8804a470a72f30b0d94fab599/RN101.pt",
    "RN50x4": "https://openaipublic.azureedge.net/clip/models/7e526bd135e493cef0776de27d5f42653e6b4c8bf9e0f653bb11773263205fdd/RN50x4.pt",
    "RN50x16": "https://openaipublic.azureedge.net/clip/models/52378b407f34354e150460fe41077663dd5b39c54cd0bfd2b27167a4a06ec9aa/RN50x16.pt",
    "RN50x64": "https://openaipublic.azureedge.net/clip/models/be1cfb55d75a9666199fb2206c106743da0f6468c9d327f3e0d0a543a9919d9c/RN50x64.pt",
    "ViT-B/32": "https://openaipublic.azureedge.net/clip/models/40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt",
    "ViT-B/16": "https://openaipublic.azureedge.net/clip/models/5806e77cd80f8b59890b7e101eabd078d9fb84e6937f9e85e4ecb61988df416f/ViT-B-16.pt",
    "ViT-L/14": "https://openaipublic.azureedge.net/clip/models/b8cca3fd41ae0c99ba7e8951adf17d267cdb84cd88be6f7c2e0eca1737a03836/ViT-L-14.pt",
    "ViT-L/14@336px": "https://openaipublic.azureedge.net/clip/models/3035c92b350959924f9f00213499208652fc7ea050643e8b385c2dac08641f02/ViT-L-14-336px.pt",
}

Depending on which model you use, download the corresponding weights file (e.g. ViT-B-32.pt) into ~/.cache/clip/; the code above then runs normally.
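A small sketch (my illustration) of computing where clip.load() expects the cached file, using the ViT-B/32 URL from the _MODELS table above; the ~/.cache/clip default is taken from the post.

```python
import os

url = ("https://openaipublic.azureedge.net/clip/models/"
       "40d365715913c9da98579312b702a82c18be219cc2a73407c4526f58eba950af/ViT-B-32.pt")
# the cache path is the basename of the URL under ~/.cache/clip
target = os.path.expanduser(os.path.join("~/.cache/clip", os.path.basename(url)))
print(target)  # download the file here (browser/wget), then clip.load("ViT-B/32") finds it
```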

Reference: https://zhuanlan.zhihu.com/p/613923088


Origin blog.csdn.net/stark_summer/article/details/130796106