Deep Crossing: Web-Scale Modeling without Manually Crafted Combinatorial Features [Paper Notes]

Combinatorial (cross) features are highly effective, but manually discovering meaningful feature combinations is difficult.
Deep learning removes the need for manual feature engineering, and can even discover high-order features that human experts miss.
The distinctive points of this paper are its use of residual units and its feature representation.

1 Abstract

  • automatically combines features to produce superior models

  • achieve superior results with only a sub-set of the features used in the production models

2 Sponsored Search

  • Sponsored search is responsible for showing ads alongside organic search results
| Concept | Meaning |
| --- | --- |
| Query | A text string that a user types into the search box |
| Keyword | A text string related to a product, specified by an advertiser to match user queries |
| Title | The title of a sponsored ad, specified by an advertiser to capture a user's attention |
| Landing page | A product's web site that a user reaches after clicking the corresponding ad |
| Match type | An option given to the advertiser on how closely the keyword should match a user query, usually one of four kinds: exact, phrase, broad, and contextual |
| Campaign | A set of ads that share the same settings such as budget and location targeting, often used to organize products into categories |
| Impression | An instance of an ad being displayed to a user; usually logged at runtime together with other available information |
| Click | An indication of whether an impression was clicked by a user; usually logged at runtime together with other available information |
| Click-through rate (CTR) | Total number of clicks over total number of impressions |
| Click prediction | A critical model of the platform, which predicts the likelihood that a user clicks on a given ad for a given query |

3 Feature Representation

  • Simply converting campaign ids into a one-hot vector would significantly increase the size of the model

    • One solution is to use a pair of companion features as exemplified in the table, where CampaignID is a one-hot representation consisting only of the top 10,000 campaigns with the highest number of clicks

    • Other campaigns are covered by CampaignIDCount, which is a numerical feature that stores per-campaign statistics such as click-through rate. Such features will be referred to as counting features in the following discussions (see the sketch after this list)

  • Deep Crossing avoids using combinatorial features. It works with both sparse and dense individual features
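
The companion-feature pair can be illustrated with a short sketch. This is a minimal example assuming hypothetical containers `top_campaigns` (the 10,000 most-clicked campaign ids mapped to one-hot indices) and `campaign_stats` (per-campaign statistics such as CTR); the paper does not prescribe an implementation.

```python
import numpy as np

# Minimal sketch of the CampaignID / CampaignIDCount companion pair.
# `top_campaigns` and `campaign_stats` are hypothetical containers.
TOP_K = 10_000  # one-hot slots only for the 10,000 most-clicked campaigns

def build_campaign_features(campaign_id, top_campaigns, campaign_stats):
    """Return (CampaignID one-hot, CampaignIDCount) for one campaign."""
    one_hot = np.zeros(TOP_K, dtype=np.float32)
    idx = top_campaigns.get(campaign_id)
    if idx is not None:
        one_hot[idx] = 1.0  # head campaign: gets its own one-hot slot
    # Counting feature: dense statistics that also cover tail campaigns
    count = np.array([campaign_stats.get(campaign_id, 0.0)], dtype=np.float32)
    return one_hot, count
```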

4 Model Architecture

(Figure: Deep Crossing model architecture)

  • The objective function is log loss but can be easily customized to soft-max or other functions:

    $$\text{logloss} = -\frac{1}{N} \sum_{i=1}^{N}\left(y_{i} \log\left(p_{i}\right)+\left(1-y_{i}\right) \log\left(1-p_{i}\right)\right) \tag{1}$$

    where $p_i$ is the output of a node in the Scoring layer.
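
As a quick sanity check on Eq. (1), here is a minimal NumPy sketch (toy labels and probabilities, for illustration only):

```python
import numpy as np

def log_loss(y, p, eps=1e-12):
    """Eq. (1): mean binary cross-entropy over N samples."""
    p = np.clip(p, eps, 1.0 - eps)  # guard against log(0)
    return -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))

y = np.array([1, 0, 1, 0])          # observed click labels y_i
p = np.array([0.9, 0.2, 0.7, 0.4])  # predicted probabilities p_i
print(log_loss(y, p))               # ~0.299
```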

4.1 Embedding and Stacking Layers

  • The embedding layer consists of a single layer of a neural network, with the general form

    $$X_{j}^{O}=\max\left(\mathbf{0},\, \mathbf{W}_{j} X_{j}^{I}+\mathbf{b}_{j}\right) \tag{2}$$

    where $X^I_j$ is the $n_j$-dimensional input of the $j$-th feature, $\mathbf{W}_j$ is an $m_j \times n_j$ matrix, and $\mathbf{b}_j$ is an $m_j$-dimensional bias vector.
    When $m_j < n_j$, the embedding reduces the dimensionality of the input feature.
    The $\max(\mathbf{0}, \cdot)$ operation is the ReLU activation (a toy implementation follows this list).

  • Note that both {$\mathbf{W}_j$} and {$\mathbf{b}_j$} are the parameters of the network, and will be optimized together with the other parameters in the network. This differs from word2vec, where the embeddings are pre-trained separately.
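
The sketch below illustrates Eq. (2) and the Stacking layer with toy dimensions and random weights; stacking a low-dimensional dense feature directly, without embedding it, is how the paper handles small features, but the exact sizes here are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(x, W, b):
    """Per-feature embedding: X_j^O = max(0, W_j X_j^I + b_j) -- Eq. (2)."""
    return np.maximum(0.0, W @ x + b)

# Feature j=0: a sparse 10,000-dim one-hot, embedded into m_0 = 256 dims.
n0, m0 = 10_000, 256
W0 = rng.normal(scale=0.01, size=(m0, n0))
b0 = np.zeros(m0)
x0 = np.zeros(n0)
x0[42] = 1.0  # one-hot input

# Feature j=1: a low-dimensional dense feature (e.g. a counting feature),
# stacked directly without an embedding.
x1 = np.array([0.37])

stacked = np.concatenate([embed(x0, W0, b0), x1])  # Stacking layer
print(stacked.shape)  # (257,)
```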

4.2 Residual Layers

The Residual Units are adapted, with modifications, from the Residual Net.

  • The unique property of the Residual Unit is to add back the original input feature after passing it through two layers of ReLU transformations:

    $$X^{O}=\mathcal{F}\left(X^{I},\left\{\mathbf{W}_{0}, \mathbf{W}_{1}\right\},\left\{\mathbf{b}_{0}, \mathbf{b}_{1}\right\}\right)+X^{I} \tag{3}$$

    where $\mathcal{F}(\cdot)$ fits the residual $X^{O}-X^{I}$ (a toy implementation follows this list).

  • the authors believed that fitting residuals has a numerical advantage. While the actual reason why Residual Net [1] could go as deep as 152 layers with high performance is subject to more investigations, Deep Crossing did exhibit a few properties that might benefit from the Residual Units.

    (Figure: the Residual Unit)

  • Deep Crossing was applied to a wide variety of tasks. It was also applied to training data with large differences in sample sizes. It is likely that the Residual Units are implicitly performing some kind of regularization that leads to such stability.
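
Here is a toy version of the Residual Unit in Eq. (3), with a made-up width and random weights; applying the final ReLU after the addition follows the original ResNet convention and is an assumption about the exact placement.

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 257  # width of the stacked input vector (toy value)

W0 = rng.normal(scale=0.01, size=(dim, dim)); b0 = np.zeros(dim)
W1 = rng.normal(scale=0.01, size=(dim, dim)); b1 = np.zeros(dim)

def residual_unit(x):
    h = np.maximum(0.0, W0 @ x + b0)  # first ReLU layer
    f = W1 @ h + b1                   # second layer: F(x), fits X^O - X^I
    return np.maximum(0.0, f + x)     # add back the input, then ReLU

x = rng.normal(size=dim)
print(residual_unit(x).shape)  # (257,) -- output keeps the input's shape
```

Because the shortcut forces each unit to learn only a correction to its input, stacking many such units stays stable even in deep networks.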

5 Related Work

  • Fig. 3 is the architecture of a modified DSSM using log loss as the objective function. The modified DSSM is more closely related to the applications of click prediction. It keeps the basic structure of DSSM on the left side of the green dashed line, but uses log loss to compare the predictions with real-world labels.

    (Figure 3: DSSM with log loss)

6 Conclusion

  • Deep Crossing demonstrated that with the recent advance in deep learning algorithms, modeling language, and GPU-based infrastructure, a nearly dummy solution exists for complex modeling tasks at large scale

  1. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015. ↩︎


Reprinted from blog.csdn.net/qq_40860934/article/details/110451599