GRU4Rec study notes (Session-based Recommendations with Recurrent Neural Networks)

Table of contents

1. Abstract

2. Introduction

3. RNN/GRU background

4. Customized GRU model

5. Highlights of this article

1. session-parallel mini-batch

2. negative sampling

3. ranking loss

6. Experiment

1. Data set

2. Evaluation metrics

3. Baselines

4. Parameter and structure optimization

5. Results

6. Summary


1. Abstract

        The paper is essentially an improvement of the RNN. When an RNN is used in a recommender system, it faces the problem that only short sessions are available rather than long user histories, which hurts accuracy. To solve this problem, the authors model the entire session.

2. Introduction

        Paragraph 1: discusses an issue neglected in machine learning and recommender-systems research, namely session-based recommendation. Tracking users across sessions with techniques such as cookies and browser fingerprints is unreliable, so most e-commerce deployments fall back on simple methods that use no user profile: item-to-item similarity, co-occurrence relationships, or transition probabilities. These usually consider only the last click or selection and ignore the information in past clicks.
        In recommender systems and information retrieval, a "co-occurrence relationship" refers to the frequency or probability with which two or more items appear together in user behavior. If a user selects item A in one action and item B in another, we say A and B co-occur in that user's behavior; such relationships can be used to infer similarity or relatedness between items.
        Likewise, "transition probability" (the probability of switching to the next item) refers to the probability that a user who selects one item at some point in time selects a particular other item next. It reflects the user's tendency to move between different items.
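To make these two notions concrete, here is a tiny Python sketch (toy sessions and hypothetical item ids, not data from the paper) that counts co-occurrences and estimates transition probabilities:

```python
from collections import Counter
from itertools import combinations

# Toy click sessions with hypothetical item ids.
sessions = [["A", "B", "C"], ["A", "C"], ["B", "C", "A"]]

# Co-occurrence: how often a pair of items appears in the same session.
cooc = Counter()
for s in sessions:
    for a, b in combinations(set(s), 2):
        cooc[frozenset((a, b))] += 1

# Transition probability: P(next item = b | current item = a),
# estimated from consecutive clicks.
trans, totals = Counter(), Counter()
for s in sessions:
    for a, b in zip(s, s[1:]):
        trans[(a, b)] += 1
        totals[a] += 1
p_next = {(a, b): c / totals[a] for (a, b), c in trans.items()}

print(cooc[frozenset(("A", "C"))])  # sessions containing both A and C -> 3
print(p_next[("A", "C")])           # P(C follows A) -> 0.5
```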
        Paragraph 2: introduces two families of methods, factor models and neighborhood methods. The former decompose and reconstruct a sparse interaction matrix; since session-based recommendation has no user profile to factorize, they are hard to apply here. The latter compute similarities between items or users, and are widely used for session recommendation.
        Paragraph 3: reviews the success of deep neural networks in image and speech recognition in recent years; RNN variants have become the first choice for sequence modeling, in applications such as machine translation, conversation modeling, and image captioning. In short, the RNN has already succeeded in other sequential domains.
        Paragraph 4: discusses the problems and methods of applying RNNs to recommender systems. The authors propose session-based recommendation with an RNN, explain how they handle sparse sequence data, and introduce ranking loss functions to adapt the RNN model. The initial input of the RNN is the user's first click on an item; each subsequent prediction depends on all previous clicks. Since the volume of click-sequence data is very large, training time and scalability are important.

3. RNN/GRU background

RNN: the hidden state $h_t$ at time step $t$ is updated as

$$h_t = g(W x_t + U h_{t-1})$$

where $g$ is a smooth, bounded activation function (e.g. the sigmoid), $x_t$ is the input of the unit at time $t$, $W$ is the input-to-hidden weight matrix, and $U$ is the weight matrix carrying the previous time step's state to the current one.

update gate:

$$z_t = \sigma(W_z x_t + U_z h_{t-1})$$

reset gate:

$$r_t = \sigma(W_r x_t + U_r h_{t-1})$$

candidate hidden state:

$$\hat{h}_t = \tanh(W x_t + U (r_t \odot h_{t-1}))$$

hidden state:

$$h_t = (1 - z_t) h_{t-1} + z_t \hat{h}_t$$

output: the item scores are produced from $h_t$ by the layers that follow (in this model, the feedforward output layer described in the next section).
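The equations above map directly to code. Below is a minimal NumPy sketch of a single GRU step; the weight names mirror the formulas, while the dictionary layout and toy dimensions are just for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, P):
    """One GRU update; P holds the weight matrices from the equations above."""
    z = sigmoid(P["Wz"] @ x_t + P["Uz"] @ h_prev)          # update gate
    r = sigmoid(P["Wr"] @ x_t + P["Ur"] @ h_prev)          # reset gate
    h_hat = np.tanh(P["W"] @ x_t + P["U"] @ (r * h_prev))  # candidate state
    return (1.0 - z) * h_prev + z * h_hat                  # new hidden state

# Toy dimensions: 4-dim input, 3-dim hidden state.
rng = np.random.default_rng(0)
P = {k: rng.normal(scale=0.1, size=(3, 4) if k.startswith("W") else (3, 3))
     for k in ["Wz", "Uz", "Wr", "Ur", "W", "U"]}
h = gru_step(rng.normal(size=4), np.zeros(3), P)
print(h.shape)  # (3,)
```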

4. Customized GRU model

        The input of the model is the state of the current session, and the output is the item of the next event in that session. The input vector is normalized for stability; an embedding layer can be added between the input and the first GRU layer, and feedforward layers (also called fully connected or dense layers) are added between the last GRU layer and the output. Because recommender systems are not a primary application field of recurrent neural networks, the basic network is modified to better fit the task.
        1-of-N coding is simply one-hot encoding: a vector as long as the item catalogue with a 1 at the position of the current item.
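As a rough sketch of this architecture (not the paper's exact configuration), a PyTorch version could look like the following; the class name, layer sizes and item count are invented for the example:

```python
import torch
import torch.nn as nn

class SessionGRU(nn.Module):
    """Embedding -> GRU -> feedforward scores over all items (illustrative)."""
    def __init__(self, n_items, emb_dim=100, hidden_dim=100):
        super().__init__()
        self.emb = nn.Embedding(n_items, emb_dim)   # stands in for one-hot input
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, n_items)   # feedforward scoring layer

    def forward(self, item_ids, hidden=None):
        # item_ids: (batch, seq_len) integer item indices of the session so far
        x = self.emb(item_ids)
        h_seq, hidden = self.gru(x, hidden)
        return self.out(h_seq), hidden  # next-item scores at every step

model = SessionGRU(n_items=10_000)
scores, h = model(torch.randint(0, 10_000, (32, 5)))
print(scores.shape)  # torch.Size([32, 5, 10000])
```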

5. Highlights of this article

1. session-parallel mini-batch

Sessions are processed in parallel within a mini-batch. Two standard strategies exist for handling variable lengths: (1) sort by length, so there is relatively little padding, or (2) truncate or clip the long sequences. Neither strategy is suitable here, because session lengths in recommendation vary enormously. The authors therefore propose session-parallel mini-batches, as shown in the paper's figure with 3 parallel slots: when one session ends in its slot, a new session is swapped in, with no padding at all.
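A simplified generator for session-parallel mini-batches, assuming each session is a list of item ids with at least two clicks (a real implementation would also drain the remaining slots once no fresh sessions are left):

```python
import numpy as np

def session_parallel_batches(sessions, batch_size):
    """Yield (input, target) arrays; each batch slot walks its own session."""
    next_sess = batch_size               # index of the next unused session
    slot_sess = list(range(batch_size))  # which session each slot currently holds
    slot_pos = [0] * batch_size          # position inside that session
    while True:
        inp, tgt = [], []
        for s in range(batch_size):
            sess = sessions[slot_sess[s]]
            inp.append(sess[slot_pos[s]])
            tgt.append(sess[slot_pos[s] + 1])  # the next click is the target
            slot_pos[s] += 1
            if slot_pos[s] + 1 >= len(sess):   # session exhausted: swap in a new one
                if next_sess >= len(sessions):
                    return                     # simplification: stop here
                slot_sess[s], slot_pos[s] = next_sess, 0
                next_sess += 1
        yield np.array(inp), np.array(tgt)
```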

2. negative sampling

        The author first explains why negative sampling is necessary: the number of items in a recommender system is simply too large. A medium-sized online store already carries tens of thousands of items, and larger ones hundreds of thousands or even millions, so the output is sampled and scores are computed for only a fraction of the items.
        For negatives the author uses popular items: if an item is popular, the user was presumably aware of it, so the fact that it was not chosen means the probability of dislike is relatively high. Popularity-based sampling is therefore more interpretable, and it is also fast to compute; in the paper it is implemented cheaply by using the other items of the same mini-batch as negative examples.
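A small sketch of popularity-proportional negative sampling (hypothetical click counts; as noted above, the paper obtains a similar effect by reusing the other items of the mini-batch as negatives):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical click counts per item id 0..4; sampling probability ~ popularity.
click_counts = np.array([500.0, 120.0, 60.0, 15.0, 5.0])
probs = click_counts / click_counts.sum()

def sample_negatives(positive_item, n_samples):
    """Draw negatives proportional to popularity, resampling any draw
    that collides with the positive item."""
    neg = rng.choice(len(probs), size=n_samples, p=probs)
    while positive_item in neg:
        mask = neg == positive_item
        neg[mask] = rng.choice(len(probs), size=mask.sum(), p=probs)
    return neg

print(sample_negatives(positive_item=0, n_samples=3))
```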

3. ranking loss

The author briefly reviews three families of ranking losses, pointwise, pairwise, and listwise, with their advantages and disadvantages:

        Pointwise losses score each item independently (intuitive), but they do not directly force relevant items to be ranked high, and the authors found training with them unstable.

        Pairwise losses compare the scores of positive and negative samples, requiring the positive item to be ranked above the negatives. They performed best in the experiments.

        Listwise losses operate on the ranking of all items, but their computational cost is too high, so they are rarely used.

BPR loss:

$$L_s = -\frac{1}{N_S} \sum_{j=1}^{N_S} \log \sigma\left(\hat{r}_{s,i} - \hat{r}_{s,j}\right)$$

where $\hat{r}_{s,i}$ is the score of the positive (next) item $i$ at session step $s$, $\hat{r}_{s,j}$ the score of negative sample $j$, and $N_S$ the number of sampled negatives.

TOP1 loss:

$$L_s = \frac{1}{N_S} \sum_{j=1}^{N_S} \sigma\left(\hat{r}_{s,j} - \hat{r}_{s,i}\right) + \sigma\left(\hat{r}_{s,j}^{2}\right)$$

        Designed by the authors. Compared with BPR, it appends a regularization term on the negative sample scores, pushing them toward zero, i.e. as small as possible.
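Both losses are short to write down. A minimal PyTorch sketch, assuming `pos` holds the positive-item score for each example and `neg` the scores of its sampled negatives:

```python
import torch

def bpr_loss(pos, neg):
    # pos: (batch,) positive scores; neg: (batch, n_neg) negative scores
    return -torch.log(torch.sigmoid(pos.unsqueeze(1) - neg)).mean()

def top1_loss(pos, neg):
    rank_term = torch.sigmoid(neg - pos.unsqueeze(1))  # positives above negatives
    reg_term = torch.sigmoid(neg ** 2)                 # push negative scores to zero
    return (rank_term + reg_term).mean()

pos, neg = torch.randn(32), torch.randn(32, 50)
print(bpr_loss(pos, neg).item(), top1_loss(pos, neg).item())
```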

6. Experiment

1. Data set

        RecSys Challenge 2015 (RSC15): click-stream data of an e-commerce website; some sessions end in purchase events.
        VIDEO: events of watching videos for a certain period of time, collected from an OTT video service platform similar to YouTube.
        OTT is short for "Over-The-Top": services that deliver audio, video and other media content over the Internet, bypassing traditional cable or satellite TV operators.

2. Evaluation metrics

        Recall@20 and MRR@20.
        Mean Reciprocal Rank (MRR) ranges between 0 and 1; the closer to 1, the better the system performs. MRR cares only about the position of the first correct result and ignores everything ranked after it.
        Calculation formula:

$$\mathrm{MRR} = \frac{1}{|Q|} \sum_{i=1}^{|Q|} \frac{1}{\mathrm{rank}_i}$$

where $|Q|$ is the number of queries (here, test events) and $\mathrm{rank}_i$ is the rank of the first correct item for the $i$-th one.
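A small sketch of how the two metrics can be computed for next-item prediction (the function name and data layout are illustrative):

```python
import numpy as np

def recall_mrr_at_k(ranked_lists, targets, k=20):
    """ranked_lists: per event, item ids sorted by predicted score (best first);
    targets: the actual next item of each event."""
    hits, rr = 0, 0.0
    for ranked, target in zip(ranked_lists, targets):
        top_k = list(ranked[:k])
        if target in top_k:
            hits += 1
            rr += 1.0 / (top_k.index(target) + 1)  # reciprocal rank of the hit
    return hits / len(targets), rr / len(targets)

ranked = np.array([[3, 1, 2], [2, 0, 1]])
print(recall_mrr_at_k(ranked, targets=[1, 1], k=2))  # (0.5, 0.25)
```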

3. Baselines

POP:
always recommends the most popular items of the whole training set. The POP strategy ignores the user's personal interests and relies purely on global popularity; to some extent this mitigates the cold-start problem for new users.
S-POP:
a session-based POP that recommends the most popular items of the current session, capturing the user's short-term interests better.
Item-KNN: recommends the items most similar to the current one, where similarity is derived from how often two items co-occur in sessions (e.g. cosine similarity on session co-occurrence).
BPR-MF: matrix factorization trained with the BPR pairwise ranking loss; for session-based use, the feature vectors of the items clicked so far can be averaged to stand in for the missing user vector.
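A toy sketch of the two popularity baselines above, assuming sessions are lists of item ids:

```python
from collections import Counter

def pop_recommend(train_sessions, k=20):
    """Global-popularity baseline: always the k most clicked items."""
    counts = Counter(item for s in train_sessions for item in s)
    return [item for item, _ in counts.most_common(k)]

def s_pop_recommend(current_session, global_top, k=20):
    """Most popular items of the current session, padded with global top items."""
    local = [item for item, _ in Counter(current_session).most_common(k)]
    filler = [i for i in global_top if i not in local]
    return (local + filler)[:k]

train = [["A", "B", "A"], ["B", "C"], ["A", "C", "C"]]
top = pop_recommend(train, k=3)                    # e.g. ['A', 'C', 'B']
print(s_pop_recommend(["C", "C", "B"], top, k=3))  # ['C', 'B', 'A']
```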

4. Parameter and structure optimization

        For hyperparameter tuning, the authors tried two optimizers, rmsprop and adagrad, and found that adagrad performed better. They also tried LSTM and plain RNN units, which did worse than the GRU. Among the loss functions, TOP1 worked best; one-hot input encoding beat the added embedding layer; a single GRU layer gave the best results, and increasing the width (number of units) of the GRU helped.

5. Results

        The cross-entropy loss was numerically unstable on the VIDEO dataset with 1000 units, so the author gives no results for that setting. As units are added, results with cross-entropy get worse, while the pairwise losses (BPR and TOP1) keep improving.

6. Summary

        The authors applied the GRU to a new field, recommender systems, and made targeted improvements to the basic model: session-parallel mini-batches, mini-batch-based output (negative) sampling, and ranking loss functions. The resulting model outperforms the commonly used baselines.

Origin: blog.csdn.net/zhu_xian_gang/article/details/134722767