Wide & Deep: model evolution

Recommender system model evolution:

LR-->GBDT+LR

FM-->FFM-->GBDT+FM|FFM

FTRL-->GBDT+FTRL

Wide & Deep model (deep learning era)

Each model is analyzed from the following three aspects:

1. why (the principle behind the model's design)

2. how (how the model is designed and how it is applied)

3. discussion (discussion of the model and possible improvements)

Wide&Deep

  • why

Memorization and Generalization

Suppose you design a food-delivery recommendation system called gugu. A user wakes up and wants to order takeout, and the system recommends a barbecue meal. If the user buys it, the sample is labeled 1; otherwise it is labeled 0 (the recommendation was not a good one). Predicted CTR is the offline metric used to evaluate the recommender.

wide (memorization)

How do we recommend the right items? We need to remember user preferences. So you design a few relevant features, use a simple linear model to learn the weights of these feature combinations, have the model predict the probability that a user clicks a particular item, and launch gugu 2.0. Over time, users get tired of the same food and want a change of taste, but the model has only memorized specific patterns. Some feature combinations never appear in the training set; since the model has never seen them, it has no memory of them, so the recommendations become monotonous and users will not be satisfied.

deep (generalization)

We would like to recommend new foods that are related to what the user ordered before but different in taste. The model must capture the intrinsic connections between foods, which ordinary discrete (one-hot) features cannot express. Embeddings represent sparse discrete features as dense low-dimensional vectors, so foods that are similar along some dimension end up with similar embeddings. For example, saliva chicken and pepper chicken, represented by a four-dimensional embedding [chicken, spicy, numbing, sweet]:

[0.52, 0.23, 0.312, 0.002] and [0.52, 0.23, 0.45, 0.002]
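As a purely illustrative check (a minimal numpy snippet using the two example vectors above; the dimension labels are not learned names), the two dishes indeed end up very close in embedding space:

```python
import numpy as np

# Example embeddings from above, dimensions read as [chicken, spicy, numbing, sweet].
saliva_chicken = np.array([0.52, 0.23, 0.312, 0.002])
pepper_chicken = np.array([0.52, 0.23, 0.45, 0.002])

def cosine_similarity(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# The two dishes differ mainly in the "numbing" dimension,
# so their cosine similarity is close to 1.
print(cosine_similarity(saliva_chicken, pepper_chicken))
```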

With dense embedding vectors, the model can fully exploit similarities between different foods and make reasonable new recommendations. A feed-forward neural network on top of the embeddings gives the model generalization: thanks to deep learning's generalization ability, it can still make good predictions for feature combinations it has never seen. But you will find that the model over-generalizes: when the user's behavior is sparse, gugu recommends some barely relevant foods.

wide+deep

Why not have both memorization and generalization at the same time? Wide & Deep combines a linear model and a deep model and trains them jointly, exploiting the advantages of both.

  • how

input

wide: sparse features, including raw input features and manually crafted cross features

deep: dense features, including real-valued features and categorical features after embedding

training

wide: \(y=wx+b\)

deep: \(a^{(l+1)}=f(w^{(l)}a^{(l)}+b^{(l)})\)

joint: \(P(Y=1 \mid \mathbf{x})=\sigma\left(\mathbf{w}_{wide}^{T}[\mathbf{x}, \phi(\mathbf{x})]+\mathbf{w}_{deep}^{T} a_{f}^{\left(l_{f}\right)}+b\right)\)

The wide part is trained with FTRL + L1 regularization, the deep part with AdaGrad; joint training uses backpropagation.
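A minimal numpy sketch of the joint forward pass, with toy shapes and random weights just to make the formulas above concrete (this is not the paper's implementation, and the FTRL/AdaGrad training loop is omitted):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(z, 0.0)

def wide_deep_forward(x_wide, x_deep, w_wide, deep_layers, w_deep_out, b):
    """P(Y=1|x) = sigma(w_wide^T [x, phi(x)] + w_deep^T a_f^(l_f) + b)."""
    a = x_deep
    for W, b_l in deep_layers:           # a^(l+1) = f(W^(l) a^(l) + b^(l))
        a = relu(W @ a + b_l)
    logit = w_wide @ x_wide + w_deep_out @ a + b
    return sigmoid(logit)

rng = np.random.default_rng(0)
x_wide = rng.random(8)                   # raw + manually crossed sparse features
x_deep = rng.random(6)                   # embeddings + real-valued dense features
deep_layers = [(rng.standard_normal((4, 6)), np.zeros(4)),
               (rng.standard_normal((4, 4)), np.zeros(4))]
print(wide_deep_forward(x_wide, x_deep, rng.standard_normal(8),
                        deep_layers, rng.standard_normal(4), 0.0))
```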

  • discussion

1. Wide and deep are used in combination: the wide part relies on manually crossed features, the deep part embeds discrete features.

2. Improving the wide part by automating the feature crosses: DeepFM, DCN.

3. Embeddings are trained online; offline pre-training is a possibility.

4. Improving the deep part: AFM.

DCN

  • why

FM combines features automatically, but it is limited to second-order cross products. Can we say goodbye to manual feature combinations and automatically learn higher-order combinations such as

\(x_1 x_2 x_3\)?

  • how

Fitting the residual:
\[ \mathbf{x}_{l+1}=\mathbf{x}_{0} \mathbf{x}_{l}^{T} \mathbf{w}_{l}+\mathbf{b}_{l}+\mathbf{x}_{l}=f\left(\mathbf{x}_{l}, \mathbf{w}_{l}, \mathbf{b}_{l}\right)+\mathbf{x}_{l} \]
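A minimal numpy sketch of this cross layer (toy dimensions, random weights), to make the recurrence concrete:

```python
import numpy as np

def cross_layer(x0, xl, w, b):
    """One DCN cross layer: x_{l+1} = x0 * (xl^T w) + b + xl."""
    return x0 * (xl @ w) + b + xl

# Stacking k layers raises the polynomial order of the crosses by one per layer,
# while the number of parameters per layer stays linear in the input dimension.
rng = np.random.default_rng(0)
x0 = rng.random(5)
x = x0
for _ in range(3):
    x = cross_layer(x0, x, rng.standard_normal(5), np.zeros(5))
print(x)
```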

  • discussion
  1. Explicit high-order feature combinations; the order of the combinations grows with network depth.

  2. Complexity grows linearly with depth, cheaper than a plain DNN.

  3. Only the final layer's high-order combinations enter the output; since higher-layer combinations already contain the lower-order ones, one could also consider feeding each layer's combinations into the final computation.

  4. Feature interactions are still bit-wise; does this actually improve the model's memorization ability?

  5. Does it really learn high-order feature interactions? The output is a scalar multiple of the input \(\mathbf{x}_{0}\).

xDeepFM

  • why

Drawbacks of traditional feature engineering:

1. Good features require expert knowledge.

2. With large data volumes, features cannot be crossed manually.

3. Manually crossed features cannot generalize to crosses unseen in training.

FM enumerates all pairwise feature combinations, which introduces noise; FNN and PNN focus on high-order features and ignore low-order ones.

A DNN can learn high-order feature interactions, but the interactions it learns are implicit and bit-wise, so is a DNN really effective for high-order features? CIN is designed to learn high-order interactions at the vector-wise level.

Embedding: different samples may have different numbers of raw features, but the embedding dimension is the same for all of them.

Implicit high-order feature interactions: bit-wise (plain DNN).

Explicit high-order feature interactions: DCN, whose output is restricted to interactions with \(\mathrm{x}_{0}\) and is still bit-wise.

CIN (Compressed Interaction Network)

The DCN cross network does not learn high-order feature interactions effectively: with the bias omitted, its output is a scalar multiple of \(\mathrm{x}_{0}\):
\[ \begin{aligned} \mathrm{x}_{i+1} &=\mathrm{x}_{0} \mathrm{x}_{i}^{T} \mathrm{w}_{i+1}+\mathrm{x}_{i} \\ &=\mathrm{x}_{0}\left(\left(\alpha^{i} \mathrm{x}_{0}\right)^{T} \mathrm{w}_{i+1}\right)+\alpha^{i} \mathrm{x}_{0} \\ &=\alpha^{i+1} \mathrm{x}_{0} \end{aligned} \]
But a scalar multiple does not mean linearity! The coefficient \(\alpha^{i+1}\) itself still depends on \(\mathrm{x}_{0}\).
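A tiny numpy sanity check of the derivation above (bias set to zero, toy dimensions): after every layer the vector stays parallel to \(\mathrm{x}_{0}\), yet the scalar \(\alpha\) is itself a function of \(\mathrm{x}_{0}\), so the mapping is not linear:

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.random(4)
x = x0
for i in range(3):
    w = rng.standard_normal(4)
    x = x0 * (x @ w) + x                      # bias omitted, as in the derivation
    alpha = x[0] / x0[0]                      # candidate scalar multiple
    print(i + 1, np.allclose(x, alpha * x0))  # True: x_k stays parallel to x0
```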

  • how

From bit-wise to vector-wise interactions.

Explicit interactions.

Complexity that does not grow exponentially:
\[ \mathrm{X}_{h, *}^{k}=\sum_{i=1}^{H_{k-1}} \sum_{j=1}^{m} \mathrm{W}_{i j}^{k, h}\left(\mathrm{X}_{i, *}^{k-1} \circ \mathrm{X}_{j, *}^{0}\right) \]

The \(H_{k-1}\) vectors of the previous layer \(\mathrm{X}^{k-1}\) and the \(m\) vectors of the input layer \(\mathrm{X}^{0}\) are combined pairwise with Hadamard products, giving \(H_{k-1} \times m\) vectors, which are then summed with weights.

The different vectors of layer \(k\) differ only in the weight matrix used to sum these \(H_{k-1} \times m\) vectors; \(H_{k}\) is simply the number of distinct weight matrices \(\mathrm{W}^{k, h}\).
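A compact numpy sketch of one CIN layer as defined above (toy sizes, random weights); note that every output row is still a \(D\)-dimensional vector, so the interaction stays vector-wise:

```python
import numpy as np

def cin_layer(X_prev, X0, W):
    """One CIN layer.

    X_prev : (H_{k-1}, D)       previous layer's feature map
    X0     : (m, D)             input embedding matrix (m field embeddings of dim D)
    W      : (H_k, H_{k-1}, m)  one weight matrix per output vector

    Implements X^k_{h,*} = sum_{i,j} W^{k,h}_{ij} (X^{k-1}_{i,*} o X^0_{j,*}):
    pairwise Hadamard products followed by a weighted sum.
    """
    Z = X_prev[:, None, :] * X0[None, :, :]   # (H_{k-1}, m, D) Hadamard products
    return np.einsum('hij,ijd->hd', W, Z)     # weighted sum over the (i, j) pairs

rng = np.random.default_rng(0)
m, D, H1 = 4, 3, 5                            # 4 fields, embedding dim 3, 5 outputs
X0 = rng.random((m, D))
X1 = cin_layer(X0, X0, rng.standard_normal((H1, m, m)))
print(X1.shape)                               # (5, 3): rows are still D-dim vectors
```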

1. Why the Hadamard product?

It keeps the embedding dimension unchanged.

2. Vector-wise interaction

Each layer computes Hadamard products on whole embedding vectors, preserving the embedding structure.

3. Each layer's output is determined jointly by the input layer and the previous hidden state, similar to an RNN.

4. Similar to a CNN: the weight matrices \(\mathrm{W}^{k, h}\) play the role of filters and each layer's output is like a set of feature maps.

Effectiveness of sum pooling: \(p_{i}^{k}=\sum_{j=1}^{D} \mathrm{X}_{i, j}^{k}\)

With only one layer, sum pooling becomes a weighted sum of pairwise inner products of the embedding vectors, degenerating to FM.
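Expanding the one-layer case makes this explicit (writing the field embeddings as \(\mathbf{x}_{i} \in \mathbb{R}^{D}\), so \(\mathrm{X}_{i,*}^{0}=\mathbf{x}_{i}\)):
\[ p_{h}^{1}=\sum_{d=1}^{D} \mathrm{X}_{h, d}^{1}=\sum_{d=1}^{D} \sum_{i=1}^{m} \sum_{j=1}^{m} \mathrm{W}_{i j}^{1, h}\, x_{i, d}\, x_{j, d}=\sum_{i=1}^{m} \sum_{j=1}^{m} \mathrm{W}_{i j}^{1, h}\left\langle\mathbf{x}_{i}, \mathbf{x}_{j}\right\rangle \]
a weighted sum of pairwise inner products; with all weights fixed to 1, this is exactly FM's second-order interaction term.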

Combination

\[ \hat{y}=\sigma\left(\mathbf{w}_{\text {linear}}^{T} \mathbf{a}+\mathbf{w}_{d n n}^{T} \mathbf{x}_{d n n}^{k}+\mathbf{w}_{\operatorname{cin}}^{T} \mathbf{p}^{+}+b\right) \]
Linear unit, DNN, CIN — memorization, generalization, memorization + generalization.

1. How does CIN perform explicit feature interactions?

2. Must explicit and implicit representations be combined?

3. How do xDeepFM's hyperparameter settings affect performance?

  • discussion

1. Feature crosses are performed on dense vectors; is there a network that performs high-order vector-wise interactions directly on discrete features?

2. Improving the interaction depth: residual connections, a global view of the information.

3. The identity activation function: is CIN essentially linear?

bit-wise VS vector-wise

Suppose the latent vectors are 3-dimensional and two features have vectors (a1, b1, c1) and (a2, b2, c2). If their interaction takes a form like f(w1 * a1 * a2, w2 * b1 * b2, w3 * c1 * c2), we say the interaction happens at the element level (bit-wise). If it takes a form like f(w * (a1 * a2, b1 * b2, c1 * c2)), we say it happens at the feature-vector level (vector-wise).
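A tiny numpy illustration of the two granularities (toy numbers, using a plain sum as the aggregation f):

```python
import numpy as np

v1 = np.array([0.1, 0.2, 0.3])   # (a1, b1, c1)
v2 = np.array([0.4, 0.5, 0.6])   # (a2, b2, c2)

# bit-wise: each element of the Hadamard product gets its own weight
w_bit = np.array([1.0, 2.0, 3.0])
bit_wise = np.sum(w_bit * (v1 * v2))      # f(w1*a1*a2, w2*b1*b2, w3*c1*c2)

# vector-wise: the Hadamard product is treated as a whole, sharing one weight
w_vec = 2.0
vector_wise = w_vec * np.sum(v1 * v2)     # f(w * (a1*a2, b1*b2, c1*c2))

print(bit_wise, vector_wise)
```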

explicitly VS implicitly

Explicit vs. implicit feature interactions: take two features xi and xj. If, after a series of transformations, the result can be written in the form wij * (xi * xj), the interaction is explicit; otherwise it is implicit.
