TAGNN code interpretation

Hi everyone. The paper itself was already covered in the previous blog post; there isn't much to it, and I couldn't have written it myself, haha.

For video recommendation in deep learning: QQ Group 277356808

If you work on video recommendation with deep learning, join this group.

For vision in deep learning: QQ Group 629530787

If you work on deep learning for vision, join this one; don't mix them up.

I'm here waiting for you.

Don't join both, there's no need; also, I don't accept private chats/messages on this page!!!

I hit an error right at the start, ahhh. Life is still hard; how do I fix it? [I'm just saying that casually.]

1- Using the data from GNN directly raises an error

epoch:  0
/home/xulm1/anaconda3/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:122: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
start training:  2020-10-30 11:20:35.947314
[0/5839] Loss: 10.6708
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [8,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "main.py", line 69, in <module>
    main()
  File "main.py", line 48, in main
    hit, mrr = train_test(model, train_data, test_data)
  File "/data1/xulm1/TAGNN/model.py", line 139, in train_test
    loss.backward()
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: device-side assert triggered

Searching around suggests the number of classes is either too large or too small, but increasing or decreasing n_node by 1 gives the same error. So the only option is to read the code more carefully and check whether the data-processing part is the same.

    train_data = pickle.load(open('./datasets/' + opt.dataset + '/train.txt', 'rb'))
    test_data = pickle.load(open('./datasets/' + opt.dataset + '/test.txt', 'rb'))

# for the sample dataset:
>>> len(train_data[0])
1205
>>> len(train_data[1])
1205

>>> len(test_data[0])
99
>>> len(test_data[1])
99

So it's basically certain that the training set consists of input sequences paired with the next click, i.e. "inputs-label" pairs; for the diginetica dataset I may have changed how the data was stored earlier.

Consider the following example:

>>> train_data[0][-5:]
[[272, 287, 287, 287, 271], [272, 287, 287, 287], [272, 287, 287], [272, 287], [272]]
>>> train_data[1][-5:]
[287, 271, 287, 287, 287]

The chronological click sequence [272, 287, 287, 287, 271] can be split into 4 inputs-label pairs, as follows (a small code sketch of this splitting is given after the list):

[272, 287, 287, 287] —— 271

[272, 287, 287] —— 287

[272, 287] —— 287

[272] —— 287
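
The splitting itself is simple; here is a minimal sketch of it (my own reconstruction, not necessarily the exact preprocessing script behind the released files):

    # reconstruct "inputs-label" pairs from raw click sessions (sketch)
    def split_sessions(sessions):
        inputs, labels = [], []
        for seq in sessions:
            # every proper prefix predicts the item that follows it
            for i in range(1, len(seq)):
                inputs.append(seq[:-i])
                labels.append(seq[-i])
        return inputs, labels

    inputs, labels = split_sessions([[272, 287, 287, 287, 271]])
    # inputs -> [[272, 287, 287, 287], [272, 287, 287], [272, 287], [272]]
    # labels -> [271, 287, 287, 287]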

So I reloaded the data to inspect it, as shown below: the maximum id is 43136, so n_node=43137 [i.e. increased by 1], and with that value the run is correct.

>>> np.min(train_data[1])
1
>>> np.max(train_data[1])
43136
start training:  2020-10-30 14:49:05.805033
[0/13] Loss: 5.7167
[3/13] Loss: 5.7231
[6/13] Loss: 5.7123
[9/13] Loss: 5.7171
[12/13] Loss: 5.7065
	Loss:	74.300
start predicting:  2020-10-30 14:49:06.557334
Best Result:
	Recall@20:	11.1111	MMR@20:	1.6607	Epoch:	0,	0
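
In other words, n_node has to be the largest item id plus 1 (id 0 is reserved for padding in this kind of session data). Rather than guessing, it can be derived from the pickled data directly; this is my own quick check, and the dataset folder name here is just an assumption:

    import pickle
    import numpy as np

    # assumption: same (inputs, labels) pickle layout as shown above
    train_data = pickle.load(open('./datasets/diginetica/train.txt', 'rb'))
    test_data = pickle.load(open('./datasets/diginetica/test.txt', 'rb'))

    # largest item id across inputs and labels of both splits
    max_id = max(
        max(max(seq) for seq in train_data[0] if seq),
        int(np.max(train_data[1])),
        max(max(seq) for seq in test_data[0] if seq),
        int(np.max(test_data[1])),
    )
    n_node = max_id + 1  # 43136 + 1 = 43137 here
    print(n_node)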

2- Small video data test

The short-video data is relatively small, so it's good for checking the running time: about 400,000 users and about 80,000 items. However, given the relatively long sequence lengths (40~50), a random start index drawn from a Poisson distribution and limited to 2~8 is used to cut the sequences.
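
Roughly, the idea is something like the sketch below; this is only my own illustration of that truncation, and the Poisson mean of 4 is an arbitrary choice:

    import numpy as np

    rng = np.random.default_rng(0)

    def truncate_session(seq, lam=4, lo=2, hi=8):
        # draw a start index from a Poisson distribution, clipped to [lo, hi],
        # and drop the head of the session; short sessions are left as-is
        if len(seq) <= hi:
            return seq
        start = int(np.clip(rng.poisson(lam), lo, hi))
        return seq[start:]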

One problem is that the uid is not recorded in the training data. Oh well... whatever, skip recording it for now and just test the effect first.

  File "mymain.py", line 28, in main
    train_data,test_data =dat.load_data()
  File "/data1/xulm1/TAGNN/utils.py", line 163, in load_data
    return (train_seq,train_label) (test_seq,test_label)
TypeError: 'tuple' object is not callable

This error is just a missing comma between the two tuples in the return statement. Fine.
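
The corrected line at the end of load_data should simply read:

    # utils.py, load_data(): a comma was missing between the two tuples
    return (train_seq, train_label), (test_seq, test_label)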

    input_in = torch.matmul(A[:, :, :A.shape[1]], self.linear_edge_in(hidden)) + self.b_iah
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
    return F.linear(input, self.weight, self.bias)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/torch/nn/functional.py", line 1372, in linear
    output = input.matmul(weight.t())
RuntimeError: CUDA error: CUBLAS_STATUS_ALLOC_FAILED when calling `cublasCreate(handle)

The problem is that the number of categories is incorrect. Just modify it.

After the fix, GPU memory blew up immediately... what a mess. This won't work.

RuntimeError: CUDA out of memory. Tried to allocate 2.68 GiB (GPU 0; 10.73 GiB total capacity; 6.46 GiB already allocated; 763.56 MiB free; 9.14 GiB reserved in total by PyTorch)

I'll look at Redis storage tomorrow, or maybe reduce the number of users and try with 100,000? I tried the data below, and it's too slow.

(1140664, 2)
Train Lines: 850791
user numner 58098, test_seq length 58098, test_label length 58098
users , [    0     1     2 ... 58095 58096 58097]
The number of users: 58098
The number of items: 47681
The number of ratings: 1140664
Average actions of users: 19.63
Average actions of items: 23.92
The sparsity of the dataset: 99.958823%
-------------------------------------------------------
epoch:  0
/home/xulm1/anaconda3/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:122: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
start training:  2020-10-30 20:42:16.552962
[0/8508] Loss: 10.7723

But what followed was not a smooth run, but another error. It seems the item encoding must start from 1.

RuntimeError: CUDA error: device-side assert triggered
/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [0,0,0] Assertion `t >= 0 && t < n_classes` failed.

The code hadn't changed; the error above was actually due to insufficient memory. There were other scheduled tasks running on the GPU, which I only discovered after running it again.

/pytorch/aten/src/THCUNN/ClassNLLCriterion.cu:106: void cunn_ClassNLLCriterion_updateOutput_kernel(Dtype *, Dtype *, Dtype *, long *, Dtype *, int, int, int, int, long) [with Dtype = float, Acctype = float]: block: [0,0,0], thread: [2,0,0] Assertion `t >= 0 && t < n_classes` failed.
Traceback (most recent call last):
  File "mymain.py", line 65, in <module>
    main()
  File "mymain.py", line 44, in main
    hit, mrr = train_test(model, train_data, test_data)
  File "/data1/xulm1/TAGNN/model.py", line 139, in train_test
    loss.backward()
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/torch/tensor.py", line 195, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph)
  File "/home/xulm1/anaconda3/lib/python3.7/site-packages/torch/autograd/__init__.py", line 99, in backward
    allow_unreachable=True)  # allow_unreachable flag
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`

>>> torch.cuda.is_available()
True

The encoding has to be changed so that item ids start from 1, with n_node equal to the total number of items plus 1. After this change it runs normally, but it's far too slow; and there is still an out-of-memory error during prediction, because the memory used for training is not released.
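
What I mean by re-encoding is roughly the following (my own sketch, not the repo's code; `raw_sessions` is a hypothetical list of sessions with arbitrary item ids):

    import itertools

    def reindex_items(raw_sessions):
        # map arbitrary item ids to consecutive ids starting from 1 (0 = padding)
        item2id = {}
        for item in itertools.chain.from_iterable(raw_sessions):
            if item not in item2id:
                item2id[item] = len(item2id) + 1
        sessions = [[item2id[i] for i in seq] for seq in raw_sessions]
        n_node = len(item2id) + 1  # total items plus 1, because 0 is the padding id
        return sessions, item2id, n_node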

The number of users: 58098
The number of items: 47681
The number of ratings: 1140664
Average actions of users: 19.63
Average actions of items: 23.92
The sparsity of the dataset: 99.958823%
-------------------------------------------------------
epoch:  0
/home/xulm1/anaconda3/lib/python3.7/site-packages/torch/optim/lr_scheduler.py:122: UserWarning: Detected call of `lr_scheduler.step()` before `optimizer.step()`. In PyTorch 1.1.0 and later, you should call them in the opposite order: `optimizer.step()` before `lr_scheduler.step()`.  Failure to do this will result in PyTorch skipping the first value of the learning rate schedule. See more details at https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate
  "https://pytorch.org/docs/stable/optim.html#how-to-adjust-learning-rate", UserWarning)
start training:  2020-11-01 15:15:19.663547
[0/8503] Loss: 10.7745
[1701/8503] Loss: 6.2509
[3402/8503] Loss: 5.0878
[5103/8503] Loss: 4.9218
[6804/8503] Loss: 4.8871
	Loss:	45659.489
start predicting:  2020-11-01 15:33:33.184119
Traceback (most recent call last):
  File "mymain.py", line 65, in <module>
    main()
  File "mymain.py", line 44, in main
    hit, mrr = train_test(model, train_data, test_data)
  File "/data1/xulm1/TAGNN/model.py", line 151, in train_test
    targets, scores = forward(model, i, test_data)
  File "/data1/xulm1/TAGNN/model.py", line 125, in forward
    return targets, model.compute_scores(seq_hidden, mask)
  File "/data1/xulm1/TAGNN/model.py", line 92, in compute_scores
    scores = torch.sum(a * b, -1)  # b,n
RuntimeError: CUDA out of memory. Tried to allocate 1.78 GiB (GPU 0; 10.73 GiB total capacity; 8.61 GiB already allocated; 751.56 MiB free; 9.15 GiB reserved in total by PyTorch)
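
One thing worth trying here, though I haven't verified it on this run, is to make sure the test forward pass builds no autograd graph and that cached training blocks are released before predicting; a sketch:

    import torch

    # before the prediction loop in train_test (sketch, not the repo's exact code;
    # `slices` stands for whatever yields the test batch indices)
    model.eval()
    torch.cuda.empty_cache()      # release cached blocks left over from training
    with torch.no_grad():         # don't keep activations for backward
        for i in slices:
            targets, scores = forward(model, i, test_data)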

Life is hard, and this reminds me of the strange behavior of GPUs. How do you deal with this kind of weirdness? One problem is that a job can occupy resources on other GPUs, which becomes very noticeable when GPUs are in seriously short supply.

Leave it for now? Or try another server? There are no spare servers at the moment... Tired.

I'll switch next time; if no better GPU is available, a 1080 will do.

 

I advise you to go back soon,

You say you don't want to go back, just tell me to hold you

The long sea breeze blows gently,

Cooled the wildfire,

I saw you sad,

You say how am I willing to go,

The bitterness is also beautiful,

How to stop crying, I have to kiss your hair softly

Let the wind continue to blow,

Can't bear to stay away.

 

 
