A summary of problems encountered when using PyTorch

  1. TypeError: unhashable type: 'numpy.ndarray'
    Reason for error: this happens when a PyTorch LongTensor is converted to numpy and then used as a dict key. The printed value looks like a plain int, but it is still a (0-dimensional) ndarray, which is unhashable.
    Solution: append .item() to the conversion to get a native Python int:

    classId = support_y[i].long().cpu().detach().numpy().item()
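
    A minimal, self-contained illustration of the fix (the example values are made up; support_y is taken from the snippet above):

    import torch

    support_y = torch.tensor([3, 1, 3])                                # example label tensor
    counts = {}
    for i in range(len(support_y)):
        class_id = support_y[i].long().cpu().detach().numpy().item()   # plain Python int
        counts[class_id] = counts.get(class_id, 0) + 1                 # an int is a valid dict key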
    
  2. TypeError: 'int' object is not callable
    Reason (encountered during data loading): the data is not a Tensor but an np.array or some other type.
    Solution:

    import torch
    from torch import autograd

    tensor = torch.LongTensor(data_x)
    data_x = autograd.Variable(tensor)   # wrap as Variable (optional in PyTorch >= 0.4)
    tensor = torch.LongTensor(data_y)
    data_y = autograd.Variable(tensor)
    
  3. RuntimeError: DataLoader worker (pid(s) 18620, 45872) exited unexpectedly
    Encountered when loading data. Solution: set num_workers=0 in the DataLoader:
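
    A minimal sketch of the workaround, assuming train_db is an existing Dataset (hypothetical name):

    from torch.utils.data import DataLoader

    # num_workers=0 loads batches in the main process, avoiding worker-process crashes
    train_loader = DataLoader(train_db, batch_size=32, shuffle=True, num_workers=0)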

  4. RuntimeError: input.size(-1) must be equal to input_size. Expected 10, got 2000
    Reason: the dimensions given to view are wrong. If batch_first=True was set when constructing the LSTM, the input to lstm(input, (h0, c0)) must have shape (batch_size, seq_len, input_size); otherwise it must be (seq_len, batch_size, input_size).
    Solution:

    # with batch_first=True the input shape is (batch_size, seq_len, input_size)
    lstm_out, self.hidden = self.lstm(
            embeds.view(self.batch_size, 200, EMBEDDING_DIM), self.hidden)
    
  5. AttributeError: module 'torch.utils.data' has no attribute 'random_split'
    Reason: in PyTorch 1.1.0, random_split lives in torch.utils.data, while in version 0.4.0 it lives in torch.utils.data.dataset.
    Solution: from torch.utils.data.dataset import random_split.
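
    A version-tolerant import, falling back to the older location (a sketch, assuming one of the two locations exists):

    try:
        from torch.utils.data import random_split          # newer PyTorch
    except ImportError:
        from torch.utils.data.dataset import random_split  # older versions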

  6. ValueError: Sum of input lengths does not equal the length of the input dataset!
    Reason for error: the lengths passed to random_split do not add up to the length of the dataset being split.
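
    A sketch of a split whose lengths are derived from the dataset itself (full_db is a hypothetical Dataset):

    from torch.utils.data.dataset import random_split

    train_len = int(0.8 * len(full_db))
    val_len = len(full_db) - train_len            # the lengths must sum to len(full_db)
    train_db, val_db = random_split(full_db, [train_len, val_len])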

  7. TypeError: can't convert CUDA tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
    Solution: move the tensor to the CPU first, e.g. change the accuracy calculation to:

    acc1 = (pred_cls1 == val_y1).cpu().sum().numpy() / pred_cls1.size()[0]
    
  8. RuntimeError: Input and hidden tensors are not at the same device, found input tensor at cuda:1 and hidden tensor at cuda:0
    Reason for error: caused by using

    if torch.cuda.device_count() > 1:
        print("Let's use", torch.cuda.device_count(), "GPUs!")
        model = nn.DataParallel(model)
    model.to(device)
    

    while the input tensors were not assigned to a specific card.
    Solution: two ways (option 1 is sketched after this item).
    1) First define device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu') (the device is thus "cuda:0", i.e. card 0), then model = torch.nn.DataParallel(model, device_ids=[0, 1, 2]) (assuming three cards). After that, every tensor also has to be moved to that GPU, and all tensors must be on the same GPU: tensor1 = tensor1.to(device), tensor2 = tensor2.to(device), etc. Note: it must be an assignment; calling tensor1.to(device) without assigning the result only creates a copy and does not move the original.
    2) Use the tensor.cuda() method directly. That is, first model = torch.nn.DataParallel(model, device_ids=[0, 1, 2]) (assuming three cards with IDs 0, 1, 2), then tensor1 = tensor1.cuda(0), tensor2 = tensor2.cuda(0), and so on. (Here all tensors are put on the card with ID 0; they could equally all be put on the card with ID 1.)
    Reference URL: Pitfalls encountered in the process of learning Pytorch (continuously updated)
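
    A minimal sketch of option 1), assuming model, tensor1 and tensor2 already exist:

    import torch
    import torch.nn as nn

    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    if torch.cuda.device_count() > 1:
        model = nn.DataParallel(model, device_ids=[0, 1, 2])  # assuming three cards
    model.to(device)

    # every input tensor must live on the same primary device as the model
    tensor1 = tensor1.to(device)
    tensor2 = tensor2.to(device)
    output = model(tensor1, tensor2)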

  9. Error: 'DataParallel' object has no attribute 'init_hidden'
    Reason: nn.DataParallel(m) does not return the original m but a DataParallel wrapper; the original m is stored in the wrapper's module attribute.
    Solution: if you need to touch the original model after DataParallel and to(device), unwrap it first:

    if isinstance(model, nn.DataParallel):
        model = model.module
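
    Alternatively, reach the wrapped model's custom methods through .module (init_hidden here is the author's own model method):

    if isinstance(model, nn.DataParallel):
        model.module.init_hidden()   # the original model lives in .module
    else:
        model.init_hidden()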
    
  10. Error: Assertion `cur_target >= 0 && cur_target < n_classes' failed.
    Reason: this is often met during classification training. It generally occurs when the label values do not match the number of classes the network outputs; when using CrossEntropyLoss, PyTorch requires the labels to start from 0.
    Solution:

    import pandas as pd
    tags_ids = range(len(tags_set))               # label ids start from 0
    tag2id = pd.Series(tags_ids, index=tags_set)  # map each tag to its id
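
    A quick sanity check before computing CrossEntropyLoss (labels is assumed to be a LongTensor of class indices and n_classes the size of the network's output layer):

    assert labels.min().item() >= 0 and labels.max().item() < n_classes, \
        "labels must lie in [0, n_classes) for CrossEntropyLoss"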

  11. RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location='cpu' to map your storages to the CPU.
    Reason for the error: a model trained on GPU is being loaded on a CPU-only machine.
    Solution: model = torch.load(model_path, map_location='cpu')
    Similarly, if a model trained on 4 GPUs is to be loaded on a single GPU, use model = torch.load(model_path, map_location='cuda:0');
    if going from 4 GPUs to two, change map_location to a remapping dict: map_location={'cuda:1': 'cuda:0'}.
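
    The three variants side by side (model_path is a hypothetical checkpoint path):

    import torch

    model = torch.load(model_path, map_location='cpu')                 # GPU checkpoint -> CPU
    model = torch.load(model_path, map_location='cuda:0')              # multi-GPU -> single GPU
    model = torch.load(model_path, map_location={'cuda:1': 'cuda:0'})  # remap storages on cuda:1 to cuda:0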

  12. size mismatch for word_embeddings.weight: copying a param with shape torch.Size([3403, 128]) from checkpoint, the shape in current model is torch.Size([12386, 128]).
    Reason for error: the word_embeddings weights in the checkpoint do not have the same shape as the embedding layer of the current model (the vocabulary sizes differ).
    Solution: build the current model with the same word2id / tag2id (i.e. the same vocabulary sizes) that were used when the checkpoint was trained.
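
    A small diagnostic to locate the mismatching parameters, assuming the checkpoint at ckpt_path (hypothetical) stores a state_dict:

    state = torch.load(ckpt_path, map_location='cpu')
    for name, param in model.state_dict().items():
        if name in state and state[name].shape != param.shape:
            print(name, 'checkpoint:', tuple(state[name].shape), 'model:', tuple(param.shape))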

  13. RuntimeError: Expected hidden[0] size (2, 359, 256), got (2, 512, 256)
    Reason: the number of training instances in the dataset is not divisible by batch_size, so the last batch produced by the DataLoader is smaller than batch_size, while the hidden state was initialized with a fixed batch_size, e.g. autograd.Variable(torch.zeros(self.num_layers * 2, self.batch_size, self.hidden_dim // 2)).
    Solution: if the model cannot handle a batch size that changes on the fly, set drop_last=True in the DataLoader so that only full batches are processed during training:

    testset_loader = DataLoader(test_db, batch_size=args.batch_size, shuffle=False,
                                num_workers=1, pin_memory=True, drop_last=True)

  14. Loss becomes NaN (loss=nan) during PyTorch training.
    Reasons:
    1. The learning rate is too high. With a large learning rate the parameters can overshoot and never find the minimum; lowering the learning rate lets them move towards the extremum.
    2. There is a problem with the loss function.
    3. For regression problems there may be a division by zero somewhere; adding a small epsilon may solve it.
    4. The data itself may contain NaN; check the input and target with numpy.any(numpy.isnan(x)) (see the check sketched below).
    5. The target must be something the loss function can handle; for example, with a sigmoid activation the target should be greater than 0, and the dataset needs to be checked in the same way.
    Solution: reduce the learning rate or increase the batch_size.
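
    A quick NaN check along the lines of point 4 (x and y stand for the input and target arrays):

    import numpy as np

    assert not np.any(np.isnan(x)), "input contains NaN"
    assert not np.any(np.isnan(y)), "target contains NaN"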

  15. RuntimeError: Trying to backward through the graph second time, but the buffers have already been freed. Please specify retain_variables=True when calling backward for the first time.
    Reason for the error: the network contains several sub-networks, and two or more losses update the parameters separately, e.g. both loss1.backward() and loss2.backward() have to be executed. The two losses may share part of the computation graph, so once the first loss1.backward() has finished, PyTorch automatically frees the saved graph, and the second loss2.backward() then finds the graph gone.
    Solution:
    1. Call loss.backward(retain_graph=True) to keep the computation graph. However, this is likely to cause a memory overflow (CUDA out of memory): PyTorch's mechanism is to free all buffers every time .backward() is called, hence the prompt to use retain_graph; once retained, the buffers are not freed, so the run can go OOM. Reference URL: https://blog.csdn.net/Mundane_World/article/details/81038274
    2. When the generator's gradient is not needed, e.g. when generated data is fed to the discriminator as input, call .detach() on it. This splits it off from the current graph and returns a new Variable that never requires gradient. Reference URL: https://blog.csdn.net/u011276025/article/details/76997425
    3. In my code, retain_graph=True overflowed the memory and there was no obvious place to call .detach(). It turned out that the model did not reinitialize its hidden layer on each training step: after model.zero_grad() you also need model.hidden = model.init_hidden() to clear the LSTM's hidden state, detaching it from the history of the previous instance and avoiding interference from previously run code. Without that reinitialization the error is raised. A short sketch of these options follows.
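
    A minimal sketch of options 1 and 3, assuming two losses that share part of the graph and an LSTM model with the author's init_hidden method (hypothetical names):

    # option 1: keep the graph alive for the second backward pass
    loss1.backward(retain_graph=True)
    loss2.backward()

    # option 3: reset the LSTM hidden state at the start of every training step
    model.zero_grad()
    model.hidden = model.init_hidden()   # detaches the state from the previous instance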
