MF vs MLP: On the Significance of Careful Tuning in Recommendation Models

MF vs MLP

Author: Nine Yu (Alchemy Notes)


Embedding-based recommendation models have been a hot research topic in recent years, and results from both research and industry practice appear regularly at major international conferences and journals. MF (Matrix Factorization), a traditional method that combines user and item embeddings through a dot product, is widely used in recommender systems. MF is mostly used to model the interaction between users and items: the latent user and item features are combined with an inner product, which is a linear operation.

The fact that user and item biases are introduced to improve MF also suggests that the inner product alone is not sufficient to capture the complex structure in user interaction data. In the NCF (Neural Collaborative Filtering) paper, the authors therefore introduce deep learning to model the relationship between features non-linearly as a way to address this.

The main contents of this article are as follows:

1. Under the same experimental conditions, can matrix factorization (MF), after careful hyperparameter tuning, clearly outperform an MLP (Multi-Layer Perceptron)?

2. Although an MLP can in theory approximate any function, the paper analyzes experimentally how well an MLP actually approximates the dot-product function;

3. Finally, it discusses the high cost of serving an MLP in a real online production environment. With a dot product, by contrast, similar items can be retrieved quickly with efficient nearest-neighbor search libraries such as Faiss (see the sketch below).
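To make the serving argument concrete, here is a minimal sketch of retrieving top items for a user with Faiss when the model scores by dot product. The array sizes are illustrative stand-ins; only the exact inner-product index (`IndexFlatIP`) from the Faiss API is used.

```python
import faiss                    # pip install faiss-cpu
import numpy as np

d = 64                                                      # embedding dimension (illustrative)
item_vecs = np.random.rand(100000, d).astype("float32")     # stand-in for trained item embeddings
user_vec = np.random.rand(1, d).astype("float32")           # stand-in for one user embedding

index = faiss.IndexFlatIP(d)    # exact maximum inner product search
index.add(item_vecs)            # index all item vectors once, offline
scores, item_ids = index.search(user_vec, 10)   # top-10 items by dot product at serving time
```

An MLP scorer, by contrast, must evaluate the network on every <user, item> candidate pair, which is far more expensive at serving time.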

What are Dot Product and MLP?


Dot Product

MF scores a user–item pair by the dot product of the user vector UserEmbedding (p in the figure) and the item vector ItemEmbedding (q in the figure).
[Figure: dot product of the user embedding p and the item embedding q]
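Written out, the prediction is simply the inner product of the two embeddings (the user and item biases mentioned above can optionally be added):

$$\hat{y}_{ui} = \mathbf{p}_u^{\top}\mathbf{q}_i = \sum_{f=1}^{d} p_{uf}\, q_{if}$$

where d is the embedding dimension.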

MLP (Multi-Layer Perceptron) & NCF


In theory, an MLP can fit any function. In the NCF paper, the authors replace the dot product with an MLP and use the concatenation of the user vector UserEmbedding and the item vector ItemEmbedding as its input.

[Figure: NCF model architecture]

The NCF network can be decomposed into two sub-networks: one called Generalized Matrix Factorization (GMF) and the other a Multi-Layer Perceptron (MLP).
In GMF, the user vector UserEmbedding and the item vector ItemEmbedding are combined with the Hadamard product (element-wise: elements at corresponding positions of two same-sized matrices are multiplied, e.g. (3x3) ⊙ (3x3) = 3x3), and the result then goes through a fully connected layer that performs a linear weighted combination, i.e. a weight vector h is learned.
In the MLP part, the user vector UserEmbedding and the item vector ItemEmbedding are concatenated and then fed through several FC layers. Because the concatenated <User, Item> features pass through these FC layers, they undergo sufficient non-linear combination; the final output goes through a sigmoid function.
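To make the two sub-networks concrete, below is a minimal PyTorch sketch of the GMF and MLP branches and their fused output. It is not the original Keras implementation from the authors' repository; the layer sizes and names are illustrative.

```python
import torch
import torch.nn as nn

class NeuMFSketch(nn.Module):
    """Minimal sketch of the two NCF sub-networks (GMF + MLP)."""
    def __init__(self, num_users, num_items, dim=32, mlp_layers=(64, 32, 16)):
        super().__init__()
        # Separate embedding tables for the GMF and MLP branches
        self.user_gmf = nn.Embedding(num_users, dim)
        self.item_gmf = nn.Embedding(num_items, dim)
        self.user_mlp = nn.Embedding(num_users, dim)
        self.item_mlp = nn.Embedding(num_items, dim)
        # MLP tower over the concatenated <user, item> embeddings
        layers, in_dim = [], 2 * dim
        for out_dim in mlp_layers:
            layers += [nn.Linear(in_dim, out_dim), nn.ReLU()]
            in_dim = out_dim
        self.mlp = nn.Sequential(*layers)
        # Final linear layer over [GMF output ; MLP output]
        # (in the GMF-only model this reduces to the weight vector h described above)
        self.out = nn.Linear(dim + mlp_layers[-1], 1)

    def forward(self, user, item):
        # GMF branch: Hadamard (element-wise) product of the embeddings
        gmf = self.user_gmf(user) * self.item_gmf(item)
        # MLP branch: concatenate, then non-linear FC layers
        mlp = self.mlp(torch.cat([self.user_mlp(user), self.item_mlp(item)], dim=-1))
        # Fuse both branches and squash with a sigmoid
        logit = self.out(torch.cat([gmf, mlp], dim=-1))
        return torch.sigmoid(logit).squeeze(-1)
```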

The results reported in the original paper are as follows:
[Figure: evaluation results from the original NCF paper]

Dot Product vs. MLP

The interesting part of this work is the question the authors raise: is the MLP model really better than the dot product?

From the introduction above, one tends to assume that replacing the dot product with an MLP enhances the expressive power of the model; after all, an MLP can fit arbitrary functions. In the paper "Neural Collaborative Filtering vs. Matrix Factorization Revisited", the NCF experiments were reproduced on the same datasets, using the leave-one-out protocol that holds out each user's last click for evaluation. HR and NDCG are then used to compare the dot product and NCF, with the results shown below:

[Figure: HR and NDCG comparison of the dot product and MLP models]

Do the results in the figure challenge the original intuition? Of course, neither the comparison in the original paper nor this article denies the positive contribution of deep learning to recommendation. For a deep-learning "alchemist", thinking about the meaning behind the comparison is the more interesting part. The tuning section of the original paper is quite detailed and worth studying: the authors describe their tuning ("alchemy") process and how they searched for the optimal hyperparameters of the matrix factorization (MF) model.
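For reference, HR@K and NDCG@K under this leave-one-out protocol are computed by ranking each user's held-out item against a set of sampled negatives (100 negatives per positive in the NCF setup). A minimal sketch; `score_fn` is a hypothetical callable that returns model scores for a user and an array of candidate items:

```python
import numpy as np

def hit_ratio_at_k(rank, k=10):
    """HR@K: 1 if the held-out positive item is ranked within the top K."""
    return 1.0 if rank < k else 0.0

def ndcg_at_k(rank, k=10):
    """NDCG@K with a single relevant item: 1 / log2(rank + 2) if it is in the top K."""
    return 1.0 / np.log2(rank + 2) if rank < k else 0.0

def evaluate_user(score_fn, user, pos_item, neg_items, k=10):
    """Rank the held-out positive against the sampled negatives, return (HR@K, NDCG@K)."""
    candidates = np.concatenate(([pos_item], neg_items))
    scores = score_fn(user, candidates)        # hypothetical: higher score = more relevant
    rank = int((scores > scores[0]).sum())     # 0-based rank of the positive item
    return hit_ratio_at_k(rank, k), ndcg_at_k(rank, k)
```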

Matrix factorization alchemy process

Original paper
From our past experience with matrix factorization models, if the other hyperparameters are chosen properly, then the larger the embedding dimension the better the quality – our experiments Figure 2 confirm this. For the other hyperparameters: learning rate and number of training epochs influence the convergence curves. Usually, the lower the learning rate, the better the quality but also the more epochs are needed. We set a computational budget of up to 256 epochs and search for the learning rate within this setting. In the first hyperparameter pass, we search a coarse grid of learning rates η ∈ {0.001, 0.003, 0.01} and number of negatives m = {4, 8, 16} while fixing the regularization to λ = 0. Then we did a search for regularization in {0.001, 0.003, 0.01} around the promising candidates. To speed up the search, these first coarse passes were done with 128 epochs and a fixed dimension of d = 64 (Movielens) and d = 128 (Pinterest). We did further refinements around the most promising values of learning rate, number of negatives and regularization using d = 128 and 256 epochs.

Throughout the experiments we initialize embeddings from a Gaussian distribution with standard deviation of 0.1; we tested some variation of the standard deviation but did not see much effect. The final hyperparameters for Movielens are: learning rate η = 0.002, number of negatives m = 8, regularization λ = 0.005, number of epochs 256. For Pinterest: learning rate η = 0.007, number of negative samples m = 10, regularization λ = 0.01, number of epochs 256.

Alchemy Notes

(1) Split the data into training, validation, and test sets: each user's last click goes into the test set, and the second-to-last click is the positive example of the validation set (see the sketch after this list).

(2) Hyperparameter tuning. The tunable parameters are:

Parameter | Meaning
epochs    | number of training epochs
m         | number of negative samples per positive (negative sampling rate)
η         | SGD learning rate
d         | embedding dimension
std       | standard deviation of the Gaussian used to initialize the embeddings
λ         | regularization coefficient

(3) Grid search: a first, coarse pass over the learning rate η ∈ {0.001, 0.003, 0.01} and the number of negatives m ∈ {4, 8, 16} with the regularization fixed at λ = 0; then, around the promising candidates, a second pass over λ ∈ {0.001, 0.003, 0.01} for finer-grained selection. During these coarse passes the number of epochs, the embedding dimension, and the initialization standard deviation are held fixed.

(4) Finally, refine the number of training epochs, the number of negatives, and so on around the most promising values (a grid-search sketch follows below).
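A minimal sketch of steps (1) and (3): the leave-one-out split and the two-pass coarse grid search. The column names and the `train_mf` / `evaluate_hr` helpers are hypothetical placeholders, not the paper's actual code:

```python
import itertools
import pandas as pd

def leave_one_out_split(df):
    """Step (1): last click per user -> test set, second-to-last -> validation set.
    Assumes columns ['user', 'item', 'timestamp'] (names are illustrative)."""
    df = df.sort_values(["user", "timestamp"])
    recency = df.groupby("user").cumcount(ascending=False)  # 0 = most recent click of each user
    return df[recency >= 2], df[recency == 1], df[recency == 0]  # train, valid, test

def coarse_grid_search(train, valid, epochs=128, d=64):
    """Step (3): two coarse passes, as described in the quoted paragraph above."""
    best = {"hr": -1.0}
    # Pass 1: learning rate x number of negatives, regularization fixed at 0.
    for lr, m in itertools.product([0.001, 0.003, 0.01], [4, 8, 16]):
        model = train_mf(train, lr=lr, num_negatives=m, reg=0.0, epochs=epochs, dim=d)  # hypothetical helper
        hr = evaluate_hr(model, valid, k=10)                                            # hypothetical helper
        if hr > best["hr"]:
            best = {"hr": hr, "lr": lr, "m": m, "reg": 0.0}
    # Pass 2: regularization around the most promising (lr, m) pair.
    for reg in [0.001, 0.003, 0.01]:
        model = train_mf(train, lr=best["lr"], num_negatives=best["m"], reg=reg, epochs=epochs, dim=d)
        hr = evaluate_hr(model, valid, k=10)
        if hr > best["hr"]:
            best = {**best, "hr": hr, "reg": reg}
    return best
```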

References

1. 《Neural Collaborative Filtering vs. Matrix Factorization Revisited》

https://arxiv.org/abs/2005.09683


2. Reproduction code: https://github.com/hexiangnan/neural_collaborative_filtering


Original post: https://blog.csdn.net/weixin_43901214/article/details/113115712