Paragraph 3: Following the success of deep neural networks in image and speech recognition in recent years, RNN variants have become the first choice for sequence modeling. Sequence-modeling tasks include machine translation, dialogue modeling, and image captioning, which illustrates the success of RNNs in other applications.
Paragraph 4: Discussing the issues and methods of applying RNNs to recommender systems, the authors propose session-based recommendation, explain the problem of handling sparse session sequences, and introduce ranking loss functions to adapt the RNN model. The initial input of the RNN is the first item the user clicks in the session, and each subsequent click depends on the previous ones. The volume of click-sequence data is very large, which matters greatly for training time and scalability.
3. RNN/GRU for recommendation
RNN: h_t = g(W x_t + U h_{t-1}), where h_t is the hidden state at time step t, g is a smooth, bounded activation function (such as the sigmoid), W is the weight matrix of the input-to-hidden layer, x_t is the input of the unit at time t, and U is the weight matrix carrying the hidden state of the previous time step to the current one.
Update gate: z_t = σ(W_z x_t + U_z h_{t-1})
Reset gate: r_t = σ(W_r x_t + U_r h_{t-1})
Candidate hidden state: ĥ_t = tanh(W x_t + U (r_t ⊙ h_{t-1}))
Hidden state (the unit's output): h_t = (1 − z_t) ⊙ h_{t-1} + z_t ⊙ ĥ_t
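As a concrete illustration, a minimal NumPy sketch of one standard GRU step (the weight names and dictionary layout are illustrative, not taken from the paper's code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x_t, h_prev, p):
    """One standard GRU step; p holds the illustrative weight matrices."""
    z = sigmoid(p["Wz"] @ x_t + p["Uz"] @ h_prev)            # update gate
    r = sigmoid(p["Wr"] @ x_t + p["Ur"] @ h_prev)            # reset gate
    h_cand = np.tanh(p["W"] @ x_t + p["U"] @ (r * h_prev))   # candidate hidden state
    return (1.0 - z) * h_prev + z * h_cand                   # new hidden state

# Tiny usage with random weights: input size 4, hidden size 3.
rng = np.random.default_rng(0)
p = {k: rng.normal(size=(3, 4)) for k in ("Wz", "Wr", "W")}
p.update({k: rng.normal(size=(3, 3)) for k in ("Uz", "Ur", "U")})
h = gru_cell(rng.normal(size=4), np.zeros(3), p)
```

Because the new state is a convex combination of the previous state and a tanh candidate, starting from a zero state keeps every component of h strictly inside (−1, 1).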
4. Customized GRU model
The input of the model is the current state of the session, and the output is the item of the next event in the session. The input vector is normalized for stability, an embedding layer is added between the input and the first GRU layer, and a feedforward layer (feedforward layers are also called fully connected or dense layers) is added before the output. Because recommender systems are not the main application field of recurrent neural networks, the basic network is modified to better fit the task.
1-of-N coding is one-hot encoding: a vector with one position per item, where the position of the current item is 1 and all others are 0.
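For concreteness, 1-of-N (one-hot) encoding can be sketched as:

```python
import numpy as np

def one_hot(item_idx, n_items):
    """1-of-N encoding: a vector with one slot per item, 1 at the active item."""
    v = np.zeros(n_items)
    v[item_idx] = 1.0
    return v

v = one_hot(2, 5)  # item 2 out of a 5-item catalog
```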
5. Highlights of this article
1. Session-parallel mini-batches
Sessions are parallelized into mini-batches, but their lengths vary. Two standard strategies for handling this are: (1) sort by length, so there is relatively little padding; (2) truncate or clip the long sessions. Neither strategy is suitable here, because in recommender systems long and short sequences differ greatly, so the authors propose session-parallel mini-batches: as shown in the figure, e.g., 3 sessions are processed in parallel, and when one session ends, a new session takes over its slot, with no padding needed at all.
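A simplified sketch of this session-parallel batching scheme (the function name and the assumption that every session has at least two events are mine, not the paper's):

```python
def session_parallel_batches(sessions, batch_size):
    """Yield (input_items, target_items) for up to `batch_size` sessions at once;
    when a session runs out of clicks, the next unseen session takes its slot.
    Assumes every session has at least two events."""
    active = list(range(min(batch_size, len(sessions))))  # session index per slot
    pos = {s: 0 for s in active}                          # cursor inside each session
    next_sess = len(active)
    while active:
        inputs = [sessions[s][pos[s]] for s in active]
        targets = [sessions[s][pos[s] + 1] for s in active]
        yield inputs, targets
        refreshed = []
        for s in active:
            pos[s] += 1
            if pos[s] + 1 < len(sessions[s]):   # session still has a next click
                refreshed.append(s)
            elif next_sess < len(sessions):     # a fresh session takes the slot
                pos[next_sess] = 0
                refreshed.append(next_sess)
                next_sess += 1
        active = refreshed

sessions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
batches = list(session_parallel_batches(sessions, batch_size=2))
```

Note how session [4, 5] ends after one step and session [6, 7, 8, 9] immediately replaces it in the second slot, so no padding is ever emitted.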
2. Negative output sampling
The authors first explain why negative sampling is necessary: the number of items in a recommender system is too large. For a medium-sized online store the item count can be in the tens of thousands, and for larger ones in the hundreds of thousands or even millions, so the output is sampled and scores are computed for only a fraction of the items.
The authors take popular items as negative samples: presumably users are aware of them, so if a user did not choose one, the probability that the user dislikes it is relatively high. Popularity-based sampling is also a fast negative sampling method and is more interpretable; in practice the other items of the same mini-batch serve as the negatives, which naturally follows the popularity distribution and makes the computation faster.
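A simple sketch of popularity-proportional negative sampling (the function name and signature are illustrative; the paper itself speeds this up by reusing the other items of the mini-batch as negatives):

```python
import random
from collections import Counter

def popularity_negative_samples(click_log, positive, k, rng=None):
    """Sample k negatives proportionally to item popularity in the click log,
    excluding the positive item."""
    rng = rng or random.Random(0)
    counts = Counter(click_log)
    candidates = [i for i in counts if i != positive]
    weights = [counts[i] for i in candidates]
    return rng.choices(candidates, weights=weights, k=k)

# Item 1 is most popular but is the positive, so only 2 and 3 can be drawn.
negs = popularity_negative_samples([1, 1, 1, 2, 2, 3], positive=1, k=5)
```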
3. Ranking loss
The authors briefly cover three families of ranking losses, pointwise, pairwise, and listwise, and analyze their advantages and disadvantages:
Pointwise estimates each item's score independently (intuitive), but it does not keep relevant items ranked high, making the ranking unstable.
Pairwise compares the scores of positive and negative samples, requiring the positive ones to be ranked higher than the negatives; it performed better in the experiments.
Listwise scores all items jointly, but the computational cost is too high, so it is not commonly used.
BPR loss: compares the score of the positive item i against sampled negative items j: L = −(1/N_S) · Σ_j log σ(ŝ_i − ŝ_j).
TOP1 loss: designed by the authors; compared to BPR, it appends a regularization term on the negative sample scores to the loss, L = (1/N_S) · Σ_j [σ(ŝ_j − ŝ_i) + σ(ŝ_j²)], pushing the negative scores as close to zero as possible.
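The two losses can be sketched in NumPy for a single positive score against a vector of sampled negative scores (a minimal sketch; the function names are mine):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpr_loss(pos_score, neg_scores):
    """BPR: negative mean over negatives of log sigmoid(s_pos - s_neg)."""
    return -np.mean(np.log(sigmoid(pos_score - neg_scores)))

def top1_loss(pos_score, neg_scores):
    """TOP1: mean of sigmoid(s_neg - s_pos) + sigmoid(s_neg**2);
    the second term regularizes negative scores toward zero."""
    return np.mean(sigmoid(neg_scores - pos_score) + sigmoid(neg_scores ** 2))

negs = np.array([0.5, -1.0])
```

Both losses shrink as the positive item's score rises above the negatives', which is exactly the pairwise ranking behavior described above.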
6. Experiment
1. Data set
RecSys Challenge 2015 (RSC15): clickstream data of an e-commerce website, where sessions sometimes end with a purchase event.
VIDEO: events of watching videos for a certain period of time, collected from a YouTube-like OTT video service platform. OTT is the abbreviation of "Over-The-Top" and refers to services that deliver audio, video, and other media content over the Internet, without going through traditional cable or satellite TV operators.
2. Evaluation metrics
Recall@20 and MRR@20
Mean Reciprocal Rank: MRR ranges between 0 and 1, and the closer to 1, the better the system performs. MRR focuses on the position of the first correct result, regardless of the subsequent ranking.
Calculation formula: MRR = (1/|Q|) · Σ_i 1/rank_i, where |Q| is the number of queries and rank_i is the position of the first correct result for query i.
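A small sketch computing Recall@k and MRR@k over ranked recommendation lists (the function name and inputs are illustrative, not the paper's evaluation code):

```python
def recall_and_mrr_at_k(ranked_lists, targets, k=20):
    """Recall@k and MRR@k over (ranked recommendation list, target item) pairs;
    each list is ordered best-first."""
    hits, rr_sum = 0, 0.0
    for ranked, target in zip(ranked_lists, targets):
        top_k = ranked[:k]
        if target in top_k:
            hits += 1
            rr_sum += 1.0 / (top_k.index(target) + 1)  # reciprocal rank of the hit
    n = len(targets)
    return hits / n, rr_sum / n

recall, mrr = recall_and_mrr_at_k([[1, 2, 3], [4, 5, 6], [7, 8, 9]],
                                  targets=[2, 6, 0], k=3)
```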
3. Baselines for comparison
POP: does not consider the user's personal interests, but focuses on each item's global popularity over the entire training set, always recommending the items selected by the most users. To a certain extent this mitigates the cold-start problem for new users.
S-POP: a session-based POP that recommends the most popular items of the current session, better matching the user's short-term interests.
Item-KNN: recommends items similar to the current one, where similarity is based on co-occurrence of items within sessions.
BPR-MF: matrix factorization optimized with the BPR ranking loss; to apply it to sessions, the feature vectors of the items clicked so far in the session are averaged.
4. Parameter and structure optimization
For hyperparameter tuning, the authors tried two optimizers, rmsprop and adagrad, and found that adagrad performed better. They also tried LSTM and plain RNN units, which did not do as well as GRU. The experiments showed that the TOP1 loss works best, one-hot input encoding beats the embedding layer, a single GRU layer gives the best results, and increasing the width of the GRU layer helps.
5. Results
The cross-entropy loss was unstable on VIDEO with 1000 units, so the authors do not report those results. As units are added, the performance of cross-entropy degrades, while the pairwise losses keep improving.
6. Summary
The authors applied the GRU to a new field, recommender systems, and made targeted improvements to the basic GRU: session-parallel mini-batches, mini-batch-based output sampling, and ranking loss functions, outperforming commonly used baselines.