
1 INTRODUCTION

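In this paper, the authors study how to make effective use of contextual data in neural recommender systems. They first analyze the conventional approach of feeding context into the model as features and show that it is inefficient at capturing feature crosses. They then use this insight to design an RNN recommender system. In the authors' words: "We first describe our RNN-based recommender system in use at YouTube. Next, we offer 'Latent Cross,' an easy-to-use technique to incorporate contextual data in the RNN by embedding the context feature first and then performing an element-wise product of the context embedding with model's hidden states."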

The goal is to best learn from users' actions, e.g., clicks, purchases, watches, and ratings.

Some important contextual data: request and watch time, the type of device, and the page on the website or mobile app.

2 PROBLEM DESCRIPTION

In the Netflix Prize setting, an event is $e \equiv (i, j, R)$: user $i$ gave movie $j$ a rating of $R$. With context, an event can be $e \equiv (i, j, t, d)$: user $i$ watched video $j$ at time $t$ on device type $d$.

A recommender system can be viewed as trying to predict one value of the event given the others: for a tuple $e = (i, j, R)$, use $(i, j)$ to predict $R$.

$\begin{array}{ll}{\text { Symbol }} & {\text { Description }} \\ \hline e & {\text { Tuple of } k \text { values describing an observed event }} \\ {e_{\ell}} & {\text { Element } \ell \text { in the tuple }} \\ {\mathcal{E}} & {\text { Set of all observed events }} \\ {u_{i}, v_{j}} & {\text { Trainable embeddings of user } i \text { and item } j} \\ {X_{i}} & {\text { All events for user } i} \\ {X_{i, t}} & {\text { All events for user } i \text { before time } t} \\ {e^{(\tau)}} & {\text { Event at step } \tau \text { in a particular sequence }} \\ {\langle\cdot\rangle} & {k \text {-way inner product }} \\ {*} & {\text { Element-wise product }} \\ {f(\cdot)} & {\text { An arbitrary neural network }}\end{array}$

From a machine learning perspective, we can split the tuple $e$ into features $x$ and label $y$ such that $x=(i, j)$ and $y=R$.

Matrix factorization: $u_{i} \cdot v_{j}$
Tensor factorization: $\sum_{r} u_{i, r} v_{j, r} w_{t, r}$
Written as a $k$-way inner product: $\left\langle u_{i}, v_{j}, w_{t}\right\rangle=\sum_{r} u_{i, r} v_{j, r} w_{t, r}$
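
As a small illustration of the formulas above, the following sketch computes the matrix factorization score and the three-way tensor factorization score with one helper. The function name `k_way_inner` and the toy embedding values are assumptions made for this example; they do not come from the paper.

```python
from math import prod


def k_way_inner(*vectors):
    """k-way inner product: sum over r of the product of the r-th entries."""
    length = len(vectors[0])
    if any(len(v) != length for v in vectors):
        raise ValueError("all embeddings must have the same length")
    return sum(prod(entries) for entries in zip(*vectors))


# Toy embeddings (hypothetical values, rank r = 2).
u_i = [1.0, 2.0]   # user embedding
v_j = [3.0, 4.0]   # item embedding
w_t = [0.5, -1.0]  # time embedding

mf_score = k_way_inner(u_i, v_j)       # matrix factorization: u_i . v_j
tf_score = k_way_inner(u_i, v_j, w_t)  # tensor factorization: <u_i, v_j, w_t>
print(mf_score, tf_score)
```

With two vectors the helper reduces to the ordinary dot product, which is why matrix factorization is the $k = 2$ case of the same expression.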

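3 MODELING PRELIMINARIES

3.1 Limitations of First-Order DNNs

The output of layer $\tau$ is $h_{\tau}=g\left(W_{\tau} h_{\tau-1}+b_{\tau}\right)$. This can be seen as a first-order transformation of $h_{\tau - 1}$, because it only involves weighted sums of the elements of $h_{\tau - 1}$ and no multiplication between elements.
Matrix factorization, by contrast, can capture low-rank relations between different types of input (user, item, time, etc.).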

3.2 Modeling Low-Rank Relations

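Synthetic low-rank data is generated to test whether a first-order DNN can model low-rank relations well.
Generate random vectors $u_i$ of length $r$: $u_{i} \sim \mathcal{N}\left(0, \frac{1}{r^{1 /(2 m)}} \mathbf{I}\right)$, where $r$ is the rank of the data and $m$ is the number of features.
When $m = 3$, each example can be written as $\left(i, j, t,\left\langle u_{i}, u_{j}, u_{t}\right\rangle\right)$. The embeddings of the three features are concatenated as the input, passed through a hidden layer with ReLU activation, and then fed into a final linear layer. The loss is MSE (mean squared error), the optimizer is Adagrad, and the model is evaluated by Pearson correlation.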
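1. As the hidden layer grows, the model fits the training data better.
2. When the rank goes from 1 to 2, the number of hidden nodes needs to double to reach the same accuracy.
3. Considering collaborative filtering models will often discover rank 200 relations, this intuitively suggests that real world models would require very wide layers for a single two-way relation to be learned.
Result: wider ReLU layers fit the data better, but they do so inefficiently. This motivates looking at RNN models.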
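
The data generation and the evaluation metric described in this section can be sketched as follows. This is a minimal sketch under stated assumptions: one separate table of latent vectors per feature, uniformly sampled ids, and the names `make_low_rank_data` and `pearson` with their default sizes are all choices made for this example. Fitting the ReLU network itself is left out.

```python
import random
from math import prod, sqrt


def make_low_rank_data(n_ids=20, m=3, r=2, n_samples=100, seed=0):
    """Sample (ids, label) pairs whose label is an m-way inner product of rank r."""
    rng = random.Random(seed)
    # Variance is 1 / r^(1/(2m)); gauss() expects a standard deviation.
    std = sqrt(1.0 / r ** (1.0 / (2 * m)))
    # Assumption: each of the m features has its own table of latent vectors.
    tables = [
        [[rng.gauss(0.0, std) for _ in range(r)] for _ in range(n_ids)]
        for _ in range(m)
    ]
    samples = []
    for _ in range(n_samples):
        ids = tuple(rng.randrange(n_ids) for _ in range(m))
        vectors = [tables[f][ids[f]] for f in range(m)]
        label = sum(prod(entries) for entries in zip(*vectors))
        samples.append((ids, label))
    return tables, samples


def pearson(xs, ys):
    """Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


tables, samples = make_low_rank_data()
labels = [label for _, label in samples]
# A model's predictions would be compared to `labels` with pearson().
print(len(samples), pearson(labels, labels))
```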

4 YOUTUBE’S RECURRENT RECOMMENDER

RNNs are notable as a baseline model because they are already second-order neural networks, significantly more complex than the first-order models explored above, and are at the cutting edge of dynamic recommender systems.

4.1 Formal Description

The input to the model is the set of events for user $i$: $X_{i}=\left\{e=(i, j, \psi(j), t) \in \mathcal{E} \mid e_{0}=i\right\}$. We use $X_{i,t}$ to denote all watches before $t$ for user $i$: $X_{i, t}=\left\{e=(i, j, \psi(j), t) \in \mathcal{E} \mid e_{0}=i \wedge e_{3}<t\right\} \subset X_{i}$. $\operatorname{Pr}\left(j \mid i, t, X_{i, t}\right)$ denotes the probability of the video $j$ that user $i$ will watch at a given time $t$, based on all watches before $t$.
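
The two sets $X_i$ and $X_{i,t}$ are simple filters over the event log, which the following sketch makes concrete. The `Event` type, its field names, and the toy log are assumptions made for illustration.

```python
from typing import NamedTuple


class Event(NamedTuple):
    """One watch event e = (i, j, psi(j), t); field names are hypothetical."""
    user: int
    video: int
    uploader: int
    time: float


def events_for_user(events, i):
    """X_i: all events whose first element e_0 equals user i."""
    return [e for e in events if e.user == i]


def events_before(events, i, t):
    """X_{i,t}: events of user i whose time e_3 is strictly before t."""
    return [e for e in events if e.user == i and e.time < t]


# Toy event log.
log = [
    Event(user=1, video=10, uploader=100, time=1.0),
    Event(user=2, video=11, uploader=101, time=2.0),
    Event(user=1, video=12, uploader=100, time=3.0),
    Event(user=1, video=13, uploader=102, time=5.0),
]
x_1 = events_for_user(log, 1)
x_1_before_5 = events_before(log, 1, 5.0)
print(len(x_1), len(x_1_before_5))
```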

User $i$ watches, at time $t$, video $j$ uploaded by $\psi(j)$, with context features $w_t$. The model takes user $i$'s watch history before time $t$, $X_{i,t}$, as input. Let $e^{(\tau)}$ denote the $\tau$-th event in the sequence, $x^{(\tau)}$ the input derived from event $e^{(\tau)}$ (the concatenated embeddings for that event), and $y^{(\tau)}$ the label to predict. If the current event is $e^{(\tau)}=(i, j, \psi(j), t)$ and the next event is $e^{(\tau+1)}=\left(i, j^{\prime}, \psi\left(j^{\prime}\right), t^{\prime}\right)$, then the input $x^{(\tau)}=\left[v_{j} ; u_{\psi(j)} ; w_{t}\right]$ is used to predict the label $y^{(\tau+1)}=j^{\prime}$, where $v_{j}$ is the video embedding, $u_{\psi(j)}$ is the uploader embedding, and $w_{t}$ is the context embedding. When predicting $y^{(\tau)}$, the label of $e^{(\tau)}$ cannot be used as input, but the context features of $e^{(\tau)}$ can; these are denoted $c^{(\tau)}=\left[w_{t}\right]$.

4.2 Structure of the Baseline RNN Model

The RNN models a sequence of actions:
1. Each event $e^{(\tau)}$ is mapped to an input $x^{(\tau)}$, which first passes through a neural network layer to give $h_{0}^{(\tau)}=f_{i}\left(x^{(\tau)}\right)$.
2. This is fed into the recurrent unit (LSTM or GRU), giving $h_{1}^{(\tau)}, z^{(\tau)}=f_{r}\left(h_{0}^{(\tau)}, z^{(\tau-1)}\right)$.
3. $f_{o}\left(h_{1}^{(\tau-1)}, c^{(\tau)}\right)$ is used to predict $y^{(\tau)}$.
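
The three steps above can be sketched in plain Python to show how the indices line up. This is not the production model: a simple tanh recurrent cell stands in for the LSTM/GRU, the weights are random and untrained, and all sizes and names (`D_IN`, `run`, and so on) are assumptions made for this example.

```python
import math
import random

rng = random.Random(0)
D_IN, D_H, D_CTX, N_VIDEOS = 6, 4, 2, 5  # hypothetical sizes


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def matvec(W, x):
    return [sum(w * a for w, a in zip(row, x)) for row in W]


def softmax(v):
    top = max(v)
    exps = [math.exp(a - top) for a in v]
    total = sum(exps)
    return [e / total for e in exps]


W_i = rand_matrix(D_H, D_IN)              # weights of f_i
W_x = rand_matrix(D_H, D_H)               # input weights of f_r
W_z = rand_matrix(D_H, D_H)               # recurrent weights of f_r
W_o = rand_matrix(N_VIDEOS, D_H + D_CTX)  # weights of f_o


def f_i(x):
    """Step 1: h0 = f_i(x), here one ReLU layer."""
    return [max(0.0, a) for a in matvec(W_i, x)]


def f_r(h0, z_prev):
    """Step 2: simple tanh cell (stand-in for LSTM/GRU); returns (h1, z)."""
    z = [math.tanh(p + q) for p, q in zip(matvec(W_x, h0), matvec(W_z, z_prev))]
    return z, z


def f_o(h1_prev, c):
    """Step 3: distribution over videos from h1^(tau-1) and context c^(tau)."""
    return softmax(matvec(W_o, h1_prev + c))


def run(xs, cs):
    """Return predictions for y^(tau), tau >= 1, given inputs xs and contexts cs."""
    z = [0.0] * D_H
    h1_prev = None
    predictions = []
    for x, c in zip(xs, cs):
        if h1_prev is not None:
            predictions.append(f_o(h1_prev, c))  # uses h1 from the previous step
        h1_prev, z = f_r(f_i(x), z)
    return predictions


xs = [[0.1 * (k + 1)] * D_IN for k in range(3)]  # toy inputs x^(0..2)
cs = [[1.0, 0.0]] * 3                            # toy contexts c^(0..2)
preds = run(xs, cs)
print(len(preds))
```

Note that the prediction for step $\tau$ only sees the recurrent output of step $\tau-1$ plus the context of step $\tau$, which is why a sequence of three events yields two predictions.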

4.3 Context Features

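1. TimeDelta
$\Delta t^{(\tau)}=\log \left(t^{(\tau+1)}-t^{(\tau)}\right)$
2. Software Client
The length of a video affects which device a user is likely to watch it on.
3. Page
A user who starts browsing from the home page may be more interested in new content, while a user who arrives from a specific video page may be more interested in a particular topic.
4. Pre- and Post-Fusion
The context features were denoted $c^{(\tau)}$ above. Pre-fusion means feeding the context features in at the bottom of the network as input; post-fusion means combining them with the output of the RNN. $c^{(\tau-1)}$ is used as a pre-fusion feature to influence the RNN state, while $c^{(\tau)}$ is used as a post-fusion feature that directly helps predict $y^{(\tau)}$.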
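
As a small example of the first context feature in the list above, the TimeDelta values for a watch sequence can be computed as follows. The function name and the choice to reject non-increasing timestamps are assumptions of this sketch; the notes do not say how ties or time units are handled.

```python
import math


def time_deltas(timestamps):
    """Delta t^(tau) = log(t^(tau+1) - t^(tau)) for consecutive events."""
    deltas = []
    for t, t_next in zip(timestamps, timestamps[1:]):
        gap = t_next - t
        if gap <= 0:
            # Assumption: timestamps must be strictly increasing so the log is defined.
            raise ValueError("timestamps must be strictly increasing")
        deltas.append(math.log(gap))
    return deltas


# Toy watch times (arbitrary units).
deltas = time_deltas([0.0, 1.0, 11.0])
print(deltas)
```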

5 CONTEXT MODELING WITH THE LATENT CROSS

As discussed above, directly concatenating context features is inefficient, so this section explores an alternative.

5.1 Single Feature

Taking time as an example, perform an element-wise product in the middle of the network: $h_{0}^{(\tau)}=\left(1+w_{t}\right) * h_{0}^{(\tau)}$, where $w_t$ is initialized from a 0-mean Gaussian (so $w_t = 0$ gives the identity). This has two benefits:
1. This can be interpreted as the context providing a mask or attention mechanism over the hidden state.
2. It enables low-rank relations between the input previous watch and the time.
The same applies to $h_{1}^{(\tau)}$: $h_{1}^{(\tau)}=\left(1+w_{t}\right) * h_{1}^{(\tau)}$.
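
The single-feature latent cross is just an element-wise rescaling of the hidden state, as the following sketch shows. The function name and toy values are assumptions made for illustration; in a real model $w_t$ would be a trained embedding.

```python
def latent_cross(h, w):
    """Return (1 + w) * h element-wise for hidden state h and context embedding w."""
    if len(h) != len(w):
        raise ValueError("context embedding must match the hidden state size")
    return [(1.0 + w_k) * h_k for h_k, w_k in zip(h, w)]


h0 = [0.5, -2.0, 1.0]   # toy hidden state
w_t = [1.0, -0.5, 0.5]  # toy time embedding

unchanged = latent_cross(h0, [0.0, 0.0, 0.0])  # w = 0 leaves h untouched
crossed = latent_cross(h0, w_t)
print(unchanged, crossed)
```

Because $w = 0$ is the identity, a zero-mean initialization starts the model close to the baseline without the cross.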

5.2 Using Multiple Features

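Usually there are many contextual features. Taking device and time as an example: $h^{(\tau)}=\left(1+w_{t}+w_{d}\right) * h^{(\tau)}$
1. It acts as a mask or attention mechanism over the hidden state.
2. It captures 2-way relations.
3. The additive form is easy to train, whereas $w_{t} * w_{d} * h^{(\tau)}$ and $f\left(\left[w_{t} ; w_{d}\right]\right)$ are hard to train.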
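
With several context features, the embeddings are summed before the element-wise product. A minimal sketch, where the function name and the toy values are assumptions for this example:

```python
def latent_cross_multi(h, context_embeddings):
    """Return (1 + sum of context embeddings) * h, element-wise."""
    scale = [1.0] * len(h)
    for w in context_embeddings:
        if len(w) != len(h):
            raise ValueError("each context embedding must match the hidden state size")
        scale = [s + w_k for s, w_k in zip(scale, w)]
    return [s * h_k for s, h_k in zip(scale, h)]


h = [2.0, 4.0]      # toy hidden state
w_t = [0.5, 0.0]    # toy time embedding
w_d = [-0.25, 1.0]  # toy device embedding

crossed = latent_cross_multi(h, [w_t, w_d])  # (1 + w_t + w_d) * h
print(crossed)
```

Expanding the product gives $h + w_t * h + w_d * h$: each context feature interacts with the hidden state separately, with no $w_t * w_d$ term.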

Recommendation algorithms | "Latent Cross: Making Use of Context in Recurrent Recommender Systems"