Recommender System Notes (16): In-depth Understanding of Graph-based Collaborative Filtering: GDE

Background

After studying the SimGCL algorithm, we found that piling on more graph augmentation operations brings little improvement. This raises a question: why is graph convolution effective in recommendation models in the first place? The embeddings produced by graph convolution contain many features, but does every feature actually contribute to the model's results?

With these questions in mind, I searched the related literature and found an answer in a paper from this year's SIGIR: Less is More: Reweighting Important Spectral Graph Features for Recommendation.

Paper link: https://arxiv.org/pdf/2204.11346.pdf

The paper makes the following points:

  1. Previous work has not studied neighborhood aggregation in graph convolution thoroughly enough, so the authors analyze graph convolution in the frequency domain.

  2. The analysis yields two conclusions: a. only a small portion of the neighbor-smoothness or neighbor-difference information actually helps recommendation, and most of the graph information can be regarded as noise; b. stacking graph convolutions only promotes neighbor smoothness, cannot effectively filter out the noise, and is inefficient. Based on this, the paper further proposes an efficient GCN variant (a hypergraph convolution) that acts as a band-pass filter. In addition, the gradient on negative samples is adjusted dynamically to speed up convergence.

Main Idea

In graph-based collaborative filtering, only a small subset of spectral features plays a role; the remaining features can be regarded as noise. The authors order the features from smooth to rough with the following rule: compute the eigenvalues and eigenvectors of the adjacency matrix; the larger the variance of an eigenvector's entries, the rougher the corresponding feature, and vice versa.
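As an illustration of this smooth-to-rough ordering, here is a minimal NumPy sketch (my own, not code from the paper) that eigendecomposes a symmetrically normalized adjacency matrix of a toy graph and sorts the eigenvectors by the variance of their entries; the toy matrix and the choice of symmetric normalization are assumptions.

```python
import numpy as np

# Toy symmetric adjacency matrix of a small graph (hypothetical data, not from the paper).
A = np.array([
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
], dtype=float)

# Symmetric normalization D^{-1/2} A D^{-1/2}, as commonly used in graph convolution.
d = A.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
A_norm = D_inv_sqrt @ A @ D_inv_sqrt

# Eigendecomposition of the symmetric normalized adjacency matrix.
eigvals, eigvecs = np.linalg.eigh(A_norm)      # columns of eigvecs are eigenvectors

# Variance of each eigenvector's entries: small variance ~ smooth, large variance ~ rough.
variances = eigvecs.var(axis=0)
order = np.argsort(variances)                  # smooth -> rough

for t in order:
    print(f"eigenvalue={eigvals[t]: .3f}  variance={variances[t]:.4f}")
```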

The GDE algorithm proposed by the authors holds that only the particularly smooth and particularly rough features influence the final model performance. The authors verify this experimentally and implement low-pass and high-pass filters to extract these features effectively.

Principle

The authors first ran an experiment on the GCN and LightGCN recommendation algorithms, computing the eigenvalues and eigenvectors of the adjacency matrix and the variance of each eigenvector. The smaller the variance, the smaller the difference between a node and its neighbors; the larger the variance, the larger the difference between nodes.

In the corresponding figure of the paper, we can see that NDCG is higher for the smoothest and roughest features, and accuracy saturates on the smooth side, indicating that it is indeed these two parts of the embedded features that play the major role in the model's predictions.

In that figure, the red dotted line is the accuracy obtained with a randomly initialized adjacency matrix. Removing the intermediate features improves the model's performance and also makes it more efficient. From the derivation in the LightGCN paper, one can further see that as the number of layers increases the embeddings become smoother and smoother, i.e., the whole model keeps drifting toward smoothness and even suppresses the contribution of the rough eigenvectors.

Based on this, the authors designed GDE to perform feature extraction: keep the rough and smooth features and filter out the features that can be regarded as noise, thereby improving the effectiveness of collaborative filtering.

So how is this achieved? The core is feature extraction, and the authors use hypergraph convolution to obtain stronger and more informative embeddings. The general form of hypergraph convolution is as follows:
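The formula image from the original post is not reproduced here; a commonly cited normalized form (HGNN-style notation, which I assume is what the post refers to, with incidence matrix H, hyperedge weights W, vertex/hyperedge degree matrices D_v and D_e, and layer-l embeddings E^{(l)}) is:

```latex
E^{(l+1)} = \sigma\!\left( D_v^{-1/2}\, H\, W\, D_e^{-1}\, H^{\top}\, D_v^{-1/2}\, E^{(l)}\, \Theta^{(l)} \right)
```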

The hypergraph convolution used in this work essentially converts the hypergraph into a weighted simple graph and then runs a GCN on that simple graph. (The accompanying figure, from the HyperGCN post referenced below, shows a single update of HyperGCN on a node v.)

In this paper, items and users are in turn treated as hyperedges, from which a user-side and an item-side adjacency matrix can be obtained. Intuitively, user (or item) embeddings are first aggregated to form the hyperedge representation, and then aggregated back from the hyperedges to obtain the representation of each user (or item).
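A minimal sketch of this two-step aggregation, assuming R is the binary user-item interaction matrix; the variable names and the exact degree normalization are my assumptions and may differ from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
R = (rng.random((6, 5)) < 0.4).astype(float)   # hypothetical 6-user x 5-item interaction matrix

# Degree vectors for users (rows) and items (columns); epsilon avoids division by zero.
du = R.sum(axis=1) + 1e-8
di = R.sum(axis=0) + 1e-8

# User-side hypergraph adjacency: items act as hyperedges connecting users.
# Step 1: aggregate users into hyperedges (R^T ...); step 2: aggregate hyperedges back to users (R ...).
A_U = np.diag(du ** -0.5) @ R @ np.diag(di ** -1.0) @ R.T @ np.diag(du ** -0.5)

# Item-side hypergraph adjacency: users act as hyperedges connecting items.
A_I = np.diag(di ** -0.5) @ R.T @ np.diag(du ** -1.0) @ R @ np.diag(di ** -0.5)

print(A_U.shape, A_I.shape)   # (6, 6) (5, 5)
```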

To achieve effective feature extraction, the authors decompose the graph G into three parts Gs, Gr, and Gn, representing the smooth, rough, and noise components respectively, and design filter functions that act differently on each part, so that the convolution keeps only the desired features:

Here γ(u/i, λt) can be understood as a frequency response function, i.e., a filter, or equivalently as an importance-evaluation function for the t-th spectral feature.
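The corresponding formula is missing from this copy of the post; in spectral terms, applying such a frequency response to the user-side hypergraph adjacency A_U amounts to reweighting its eigencomponents, which can be written as follows (my reconstruction from the surrounding description, with v_t and λ_t the t-th eigenvector and eigenvalue):

```latex
\tilde{A}_U = \sum_{t} \gamma(\lambda_t)\, v_t v_t^{\top},
\qquad
\tilde{E}_U = \tilde{A}_U E_U = \sum_{t} \gamma(\lambda_t)\, v_t \left( v_t^{\top} E_U \right)
```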

After the required features are extracted they need to be aggregated; the authors use pooling to aggregate the features produced by the hypergraph convolution:

After GDE feature extraction, the hypergraph representations of users and items are each aggregated over the smooth and rough features, and the user and item features are finally combined to obtain the final representation:

where P(r) and π(r) are the m2 eigenvectors and eigenvalues of A_U with the smallest eigenvalues (i.e., the roughest components), and Q(r) and σ(r) are the n2 eigenvectors and eigenvalues of A_I with the smallest eigenvalues. E_U is the user embedding matrix and E_I is the item embedding matrix.
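Putting the pieces together, here is a minimal sketch of this filtered aggregation under the assumptions above: keep only the smoothest (largest-eigenvalue) and roughest (smallest-eigenvalue) components of each hypergraph adjacency, reweight them with γ, and filter the embeddings. The function name, the exponential response, and the toy sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
R = (rng.random((6, 5)) < 0.4).astype(float)               # same hypothetical interactions as above
du, di = R.sum(axis=1) + 1e-8, R.sum(axis=0) + 1e-8
A_U = np.diag(du ** -0.5) @ R @ np.diag(di ** -1.0) @ R.T @ np.diag(du ** -0.5)
A_I = np.diag(di ** -0.5) @ R.T @ np.diag(du ** -1.0) @ R @ np.diag(di ** -0.5)

def gde_filter(A, E, m_smooth, m_rough, gamma=np.exp):
    """Keep only the smoothest / roughest eigencomponents of A, reweight them with gamma,
    and use them to filter the embeddings E. Sketch only; the paper's exact response
    function and aggregation may differ."""
    eigvals, eigvecs = np.linalg.eigh(A)                    # eigenvalues in ascending order
    keep = np.r_[np.arange(m_rough), np.arange(len(eigvals) - m_smooth, len(eigvals))]
    V, lam = eigvecs[:, keep], eigvals[keep]
    return V @ (gamma(lam)[:, None] * (V.T @ E))            # V diag(gamma(lam)) V^T E

E_U = rng.normal(size=(6, 8))                               # randomly initialized user embeddings
E_I = rng.normal(size=(5, 8))                               # randomly initialized item embeddings
H_U = gde_filter(A_U, E_U, m_smooth=2, m_rough=1)
H_I = gde_filter(A_I, E_I, m_smooth=2, m_rough=1)
scores = H_U @ H_I.T                                        # predicted user-item affinities
print(scores.shape)                                         # (6, 5)
```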

How the importance is evaluated directly affects the quality of the features the model extracts. The authors propose two ways to compute feature importance. The first is dynamic learning of the weights, essentially an attention mechanism:

        

The other is to design a static function of the eigenvalues:

 

This is the authors' reformulation of graph convolution based on a Taylor expansion of the response function; see the paper for the detailed derivation.
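As a sketch of these two options (not the paper's code), the snippet below contrasts a learnable, attention-like weight per kept eigenvalue with a fixed function of the eigenvalues; the softplus, the exponential form exp(β·λ), and the class name are my own illustrative assumptions.

```python
import torch

class SpectralWeights(torch.nn.Module):
    """Two hypothetical ways to score the importance of kept eigencomponents (sketch)."""

    def __init__(self, eigvals, dynamic=True, beta=5.0):
        super().__init__()
        self.register_buffer("eigvals", torch.as_tensor(eigvals, dtype=torch.float32))
        self.dynamic = dynamic
        self.beta = beta
        if dynamic:
            # Dynamic variant: one learnable importance score per kept eigenvalue,
            # trained end-to-end with the rest of the model (attention-like).
            self.raw = torch.nn.Parameter(torch.zeros_like(self.eigvals))

    def forward(self):
        if self.dynamic:
            return torch.nn.functional.softplus(self.raw)   # learned weights, kept positive
        # Static variant: a fixed monotone function of the eigenvalues
        # (exp(beta * lambda) is used here purely as an illustrative choice).
        return torch.exp(self.beta * self.eigvals)

# Hypothetical usage on three kept eigenvalues:
weights = SpectralWeights([0.95, 0.90, -0.40], dynamic=False)()
```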

After proposing the feature extraction and the new hypergraph convolution, the authors also optimize the BPR loss. The original loss does not weight the samples, i.e., every sample carries the same weight, which leads to slower convergence and a worse final result, so the authors propose a dynamic weighting scheme for negative samples:

Here the parameter ξ = 0.99 and λ controls the degree of regularization. The experiments show that this negatively weighted loss speeds up convergence, as shown in the figure:

From the figure: (a) on LightGCN, the gradient on negative samples vanishes faster than on MF; (b) this problem can be alleviated by adaptively adjusting the gradient on negative samples.
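A minimal sketch of a BPR-style pairwise loss with a dynamic weight on the negative term. The exact weighting formula is not reproduced in this copy of the post, so the `neg_weight` below (built from the negative score and the ξ hyperparameter mentioned above) is only an illustrative assumption.

```python
import torch

def weighted_bpr_loss(pos_scores, neg_scores, xi=0.99):
    """BPR-style pairwise loss with a dynamic weight on negative samples (sketch only;
    the weighting below is an illustrative assumption, not the paper's exact formula)."""
    diff = pos_scores - neg_scores
    # Give a larger weight to negatives that currently score high (hard negatives),
    # which keeps their gradient from vanishing; xi damps the weight toward 1.
    neg_weight = (1.0 - xi) + xi * torch.sigmoid(neg_scores).detach()
    return -(neg_weight * torch.nn.functional.logsigmoid(diff)).mean()

# Hypothetical usage with toy scores for three (user, positive item, negative item) triples:
pos = torch.tensor([2.0, 0.5, 1.2])
neg = torch.tensor([1.5, -0.3, 0.8])
loss = weighted_bpr_loss(pos, neg)
```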

Summary

(1) The authors find that the main contribution to GCN's feature representation comes from the rough and smooth spectral parts, so hypergraph convolution is applied to users and items separately for feature extraction. (The convolution keeps only the components with the largest and the smallest eigenvalues, which act as two convolution kernels; the kernel weights can either be learned dynamically or produced by a fixed function of the eigenvalues.)

(2) The starting point of this article is GCN itself. Through frequency-domain analysis, it confirms that GCN essentially performs convolution over local neighborhoods and that stacking layers amounts to designing a multi-layer frequency response; this work, however, uses only one convolution layer yet still achieves the effect of aggregating information from a wider receptive field.

(3) As for why the features with the highest and lowest variance play the key role, the authors do not give a principled explanation; the conclusion is drawn only from the results: model accuracy is determined by a small number of highly smooth or highly rough (large-difference) features, and the smooth signal is more effective than the rough one. This point deserves further study.

Reference links:

What is the difference between collaborative filtering and content-based recommendation? - Zhihu

HyperGCN: A New Method of Training Graph Convolutional Networks on Hypergraphs - popozyl - cnblogs


Original article: https://blog.csdn.net/qq_46006468/article/details/126397359