A 2020 Review of Industrial-Grade Recommendation Systems: Deeply Optimizing User Experience and Empowering the Business

In the mobile Internet era, data is extremely abundant, but this abundance also reduces the efficiency with which people obtain useful information, a problem known as information overload. By proactively pushing personalized content that users are interested in, recommendation systems greatly alleviate information overload, and they have therefore become one of the fastest-growing and most widely applied technologies of the mobile Internet era.

Compared with search engines, users in recommendation scenarios actively express less information and their intent is less clear, so combining multi-dimensional user information to effectively model user interests is a challenging and valuable problem. Under ultra-large-scale data, an industrial recommendation system mainly consists of two core phases: the recall phase (match) and the ranking phase (rank). The recall phase quickly selects a set of candidate items the user may be interested in from an ultra-large candidate pool, narrowing hundreds of millions of candidate items down to thousands within tens of milliseconds. The ranking phase then scores the recalled candidate set precisely, estimating the user's click-through rate on each target item, and surfaces the content the user is most interested in.

The following reviews the latest developments in recall and ranking, the core links of the recommendation system.

Recall phase

The recall phase is a core part of the recommendation system. Its purpose is to reduce the pool of candidate items from hundreds of millions to a small range (thousands) in a very short time, while preserving the effectiveness of the recommendation system as much as possible. The quality of the recalled items therefore sets the upper bound on the effect of the entire system, making recall a very important component of industrial recommendation, and related research has flourished in recent years.

Mobius [1] explores how to optimize the recall model from the sample perspective. Its starting point is that if samples which are inherently "good" but were never recalled by the system can be identified, they can be used effectively to train and correct the model. Determining which un-recalled samples count as "good", however, is not a simple problem. The paper therefore proposes an "approximate evaluation" scheme that takes the score of the downstream ranking model as a reference: samples that score high under the CTR model yet were not recalled are treated as hard samples. In the offline stage, samples with low relevance but high estimated CTR are selected on top of the original recall method and labeled as hard samples for data augmentation, and the original binary classification is upgraded to a three-class problem (positive / negative / hard sample), which effectively alleviates the sparsity of click-exposure data and the insufficient training of long-tail items. Applying the method directly online would create heavy computational pressure, so in the online phase the paper introduces approximate nearest neighbor (ANN) retrieval for acceleration and uses vector compression to reduce online memory requirements, effectively mitigating serving performance problems. Continuing this attention to training data, the EBR [2] model likewise argues that negative sample selection is the most important issue in the recall phase: the ranking setup cannot simply be copied by using exposed-but-unclicked data as negatives; instead, negatives should be randomly sampled from the unexposed, unclicked data, with clicked data as positives. EBR also borrows hard-negative mining from the CV field, splitting negatives into easy and hard and treating them separately, similar in spirit to Mobius.
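To make this sampling recipe concrete, here is a minimal numpy sketch, assuming a two-tower model trained with a triplet-style hinge loss on cosine similarity. The function names and the hard-negative pool construction are illustrative, not the paper's exact procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_negatives(clicked_ids, corpus_size, n_easy, hard_pool, n_hard):
    """EBR-style negative sampling (illustrative): easy negatives are drawn
    uniformly from the full corpus rather than from exposed-but-unclicked
    logs; hard negatives come from a pool of high-scoring but non-relevant
    items."""
    easy = rng.choice(corpus_size, size=n_easy, replace=False)
    easy = easy[~np.isin(easy, clicked_ids)]        # drop accidental positives
    hard = rng.choice(hard_pool, size=min(n_hard, len(hard_pool)), replace=False)
    return easy, hard

def triplet_hinge_loss(user_vec, pos_vec, neg_vecs, margin=0.1):
    """Hinge loss on cosine similarity, pushing the positive item above every
    negative by at least `margin`, as in two-tower retrieval training."""
    def cos(b):
        return b @ user_vec / (np.linalg.norm(user_vec)
                               * np.linalg.norm(b, axis=-1))
    return np.maximum(0.0, margin - cos(pos_vec) + cos(neg_vecs)).mean()
```

EBR is shown in the figure below: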


As mentioned earlier, the recall system has extremely demanding latency requirements, which has kept some excellent models, such as deep models, out of the recall stage: complex deep models are too computationally expensive, and applying them directly to the full candidate set would cause an explosion in computation. To address this, TDM, a recall technique based on tree-structured deep matching, was proposed, enabling full-corpus retrieval with complex models on top of a tree organization. Recent years have seen continuous progress around this framework, from TDM and JTM [3] to BSAT [4], a series of explorations along the dimensions of model structure, tree construction, and the retrieval process, as shown in the following figure:


Image source: https://flashgene.com/archives/145299.html

The core goal of the TDM line of work is to break through the limitations of the vector-retrieval paradigm, so that complex deep learning models can achieve approximately optimal retrieval over the entire corpus within limited resource and time budgets. TDM builds a tree-structured index over the whole corpus based on the max-heap assumption on user interest, and uses a complex deep neural network to model the user's degree of interest in each node of the tree. During online serving, beam search is used for fast retrieval.
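To make the retrieval step concrete, here is a minimal sketch of layer-wise beam search over a tree index. The Node structure and the score function are hypothetical stand-ins for TDM's index and deep interest model:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    item: int = -1                               # item id at leaves, -1 otherwise
    children: list = field(default_factory=list)

def beam_search(root, user, score, beam_size=20):
    """Layer-wise beam search over a tree index (TDM-style, illustrative).
    `score(user, node)` stands in for the deep interest model. Assuming a
    complete tree, only O(depth * beam_size * branching) model calls are
    needed instead of scoring every item in the corpus."""
    frontier = [root]
    while frontier[0].children:                  # descend until the leaf level
        children = [c for node in frontier for c in node.children]
        children.sort(key=lambda n: score(user, n), reverse=True)
        frontier = children[:beam_size]          # keep the highest-interest nodes
    return [n.item for n in frontier]            # recalled candidate items
```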

In TDM, the optimization objective of the model and that of the tree-structure index are not exactly aligned, which can cause the two optimizations to constrain each other and leave the overall effect suboptimal. The quality of the tree index has a crucial influence on the overall recall effect, so how to learn a high-quality tree index is the main problem JTM sets out to solve.

JTM follows TDM's framework of a tree-structured index plus an arbitrarily complex model, and addresses TDM's shortcomings through joint optimization and hierarchical feature modeling. JTM proposes a joint training framework that alternately optimizes the model and the tree index under a common loss function, avoiding the suboptimal solutions that arise when the two objectives are inconsistent. It also introduces hierarchical user-interest representation: leveraging the hierarchy of the tree index, user behavior features are modeled at different levels of granularity, which better characterizes the coarse-to-fine expression of user interest during retrieval. Building on JTM, BSAT jointly learns the retrieval process itself (i.e., beam search) to alleviate the mismatch in node distributions between offline training and online serving: 1) during training, tree nodes are generated by tracing positive samples up the tree plus random negative sampling within the same layer, whereas during online serving nodes are generated by beam search over the tree; 2) a node's label is determined by the labels of its child nodes, which implicitly assumes the beam search makes no errors.

BSAT first gives a theoretical definition of the beam-search-optimal tree model and proves its existence. To train such a model, the paper defines a loss function and a training algorithm whose core differences are: 1) the tree nodes used for training are generated by beam search; 2) a node's label equals the label of the item with the highest marginal probability in its subtree, rather than being determined solely by the labels of its children. These improvements effectively resolve the training/serving mismatch, and experiments show that BSAT achieves better recall metrics than JTM.
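The contrast between the two labeling rules can be written in a few lines. This is a toy rendering under stated assumptions: the Node class, the probability table p, and the helper names are hypothetical, and the real training algorithm also generates the trained nodes via beam search:

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: int = 0                 # 1 if this leaf's item is a positive for the user
    item: int = -1
    children: list = field(default_factory=list)

def collect_leaves(node):
    if not node.children:
        return [node]
    return [leaf for c in node.children for leaf in collect_leaves(c)]

def tdm_label(node):
    """TDM/JTM-style rule: a node is positive iff any leaf below it is positive."""
    return max(leaf.label for leaf in collect_leaves(node))

def bsat_label(node, p):
    """BSAT-style pseudo-label (illustrative): the node inherits the label of the
    item with the highest estimated marginal probability p[item] in its subtree,
    so a subtree of low-probability positives no longer forces a positive node."""
    best = max(collect_leaves(node), key=lambda leaf: p[leaf.item])
    return best.label
```

BSAT is shown in the figure below: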


Although tree-based recall has achieved good results, two shortcomings remain: 1) the tree structure itself is hard to learn, and sparse data at the leaf level makes it difficult to learn a good structure at finer granularity, which limits retrieval quality; 2) each candidate item can be assigned to only one leaf node, which limits the model's ability to describe candidate items from multiple angles. DeepRetrieval [5] changes the index structure: it uses a D x K matrix as the item index, where retrieval makes D sequential predictions with K choices at each step, weaving K^D possible paths. Path assignments are learned jointly by an EM algorithm and the model, yielding a many-to-many relationship between paths and items, so the relationship between users and items can be learned from a richer perspective. Experiments show that DeepRetrieval matches the effect of brute-force full-corpus matching at near-linear computational complexity.
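A sketch of how retrieval over the D x K path structure might look. This is illustrative: step_prob stands in for the learned per-step model, and the path-to-item lookup is left abstract:

```python
import numpy as np

def retrieve_paths(user, step_prob, D, K, beam_size=10):
    """Beam search over D sequential decisions with K choices each
    (DeepRetrieval-style, illustrative). `step_prob(user, prefix)` returns a
    length-K probability vector for the next step given the current path
    prefix. Cost is O(D * beam_size * K), i.e., near-linear, versus
    enumerating all K**D paths."""
    beams = [((), 0.0)]                          # (path prefix, log-probability)
    for _ in range(D):
        expanded = []
        for prefix, logp in beams:
            probs = step_prob(user, prefix)      # shape (K,)
            for k in range(K):
                expanded.append((prefix + (k,), logp + np.log(probs[k] + 1e-12)))
        expanded.sort(key=lambda b: b[1], reverse=True)
        beams = expanded[:beam_size]
    return beams   # top paths; candidate items are then gathered via the
                   # many-to-many path-to-item index
```

DeepRetrieval is shown in the figure below: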


Ranking phase

The core problem of the ranking phase is click-through rate (CTR) estimation: measuring the probability that a particular user clicks on an item, and ultimately recommending the items with the highest predicted CTR. Early ranking relied on logistic regression (LR) plus manual feature engineering; the model is simple and interpretable, but manual feature engineering is costly, conclusions are hard to reuse across tasks, and, more importantly, it is difficult to extract effective manual features from sparse data. To address these problems, factorization machines (FM) introduced feature embeddings and second-order feature crosses to alleviate data sparsity and the cost of manual feature engineering. In recent years, with the rapid development of deep learning, ranking models have kept evolving, from the classic DNN model, to Wide&Deep which adds a shallow component, to DeepFM which incorporates second-order feature crosses; deep learning is now applied to CTR estimation ever more widely. Works such as PNN, NFM, DCN, and xDeepFM have further enriched the ways of performing effective high-order feature interaction in deep CTR models. Entering 2020, feature interaction remains one of the research hotspots of deep-learning-based CTR models, and user behavior sequence modeling has also received great attention.
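Why FM copes with sparsity becomes clearer with the standard linear-time reformulation of its pairwise term. Below is a minimal numpy sketch; variable names are illustrative, not from any specific library:

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Factorization machine score: bias + linear term + pairwise interactions.
    x: (n,) features; w0: scalar bias; w: (n,) linear weights; V: (n, k)
    feature embeddings. The pairwise term uses the well-known O(nk) identity
      sum_{i<j} <V_i, V_j> x_i x_j
        = 0.5 * sum_f [ (sum_i V_if x_i)^2 - sum_i V_if^2 x_i^2 ],
    so every pair of features interacts without enumerating pairs."""
    s = V.T @ x                        # (k,)  sum_i V_if x_i
    s2 = (V ** 2).T @ (x ** 2)         # (k,)  sum_i V_if^2 x_i^2
    return w0 + w @ x + 0.5 * np.sum(s * s - s2)

rng = np.random.default_rng(0)
n, k = 10, 4
print(fm_predict(rng.random(n), 0.1, rng.normal(size=n), rng.normal(size=(n, k))))
```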

Since the Transformer architecture was proposed, feature interaction in recommendation models has entered the attention era. AutoInt [6] proposes using multi-head self-attention to lift feature interactions from low order to high order, while also demonstrating the natural interpretability advantages of the attention mechanism. FiBiNet [7] borrows the channel-attention structure of the Squeeze-and-Excitation Network (SENet) to learn the dynamic importance of features, and uses bilinear functions to better model cross features.
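A minimal single-head rendering of self-attention over field embeddings, in the spirit of AutoInt. Dimensions and weights are illustrative; the paper itself uses multiple heads plus residual connections:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def feature_self_attention(E, Wq, Wk, Wv):
    """One self-attention head over the field embedding matrix E (m fields x d).
    Every feature attends to every other feature, so output rows mix
    information across fields; stacking such layers raises the interaction
    order layer by layer."""
    Q, K, V = E @ Wq, E @ Wk, E @ Wv                 # (m, d_head) each
    att = softmax(Q @ K.T / np.sqrt(K.shape[1]))     # (m, m) interaction weights
    return att @ V                                   # (m, d_head) interacted features

m, d, d_head = 8, 16, 8
rng = np.random.default_rng(0)
E = rng.normal(size=(m, d))                          # one embedding per feature field
out = feature_self_attention(E, *(rng.normal(size=(d, d_head)) for _ in range(3)))
print(out.shape)                                     # (8, 8)
```

The (m, m) attention matrix is also what gives these models their interpretability: it can be read directly as learned pairwise feature-interaction strengths.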

Research on attention mechanisms for CTR estimation did not stop in 2020. For example, Adaptive Factorization Networks (AFN) [8] observe that although attention can model high-order interaction features, the same interaction order does not necessarily suit all raw features, so the paper proposes a network that adaptively adjusts the interaction order to improve model performance. InterHAt [9] focuses on improving self-attention's feature modeling by introducing a hierarchical strategy. Beyond direct interactions between features themselves, DRM [10] models feature interactions from the semantic correlation of the bases of the feature embedding space, i.e., feature dimension relationship modeling. The paper argues that the dimensions of the feature space themselves carry "latent" semantics, and shows that modeling the correlations among these semantics with attention has a clearly positive effect on CTR estimation, especially when the feature dimensionality is large, where the gain becomes more pronounced. AFN is shown in the figure below:


User behavior sequence modeling is another hot topic in CTR estimation. The early YouTube work simply mean-pooled the sequence of videos a user had watched as the representation of the user's historical interests. Later, DIN introduced attention into behavior sequence modeling: the target item attends over the items in the user's behavior sequence to obtain attention scores, and the user's interest is represented as the score-weighted sum of the behavior features. This markedly improves the quality of behavior sequence modeling, but it cannot effectively distinguish where an interest begins and ends within the user's behavior, a problem later examined and addressed in depth by DIEN.
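A minimal sketch of DIN-style target attention under stated assumptions: att_mlp is a hypothetical stand-in for DIN's small attention network over the target, the behavior, and their interaction, and the raw scores are kept unnormalized (the DIN paper avoids softmax so that the magnitude of interest is preserved):

```python
import numpy as np

def din_interest(target, behaviors, att_mlp):
    """DIN-style target attention (illustrative): score each historical
    behavior against the target item, then represent the user's interest as
    the score-weighted sum of behavior embeddings."""
    scores = np.array([att_mlp(np.concatenate([target, b, target * b]))
                       for b in behaviors])      # one scalar per behavior
    return scores @ behaviors                    # (d,) interest vector

d = 8
rng = np.random.default_rng(0)
w = rng.normal(size=3 * d)                       # toy parameters for the scorer
interest = din_interest(rng.normal(size=d), rng.normal(size=(20, d)),
                        att_mlp=lambda pair: pair @ w)
```

DIN learns the scorer end to end with the CTR objective; the toy linear scorer here only illustrates the data flow.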

DIEN models the evolution of interests within user behavior with an RNN, but RNNs process the behavior sequence serially, so latency is relatively high and online serving pressure is heavy; BST [11] therefore proposes using a Transformer to model the user's behavior sequence and alleviate this problem. Meanwhile, DSIN [12] observes that although a user's behaviors within one session are similar, they differ considerably across sessions, so the paper proposes modeling the behavior sequence per session, i.e., session-based modeling.

In 2020, progress in sequence modeling was mainly reflected in two areas: how to model longer user sequences, and how to model the evolution of user sequences.

Modeling longer user sequences means a more complete understanding of the user, but also more noise to contend with. For this problem, the 2019 MIMN [13] model introduced an online UIC (User Interest Center) module dedicated to updating the user's latest interest representation, splitting the heaviest computation out of the real-time path into asynchronous updates. The multi-channel user interest memory network MIMN mainly consists of an NTM (Neural Turing Machine) module with regularization-based memory correction and an MIU (Memory Induction Unit) that strengthens interest extraction. Through UIC's update mechanism and MIMN's sequence modeling, ultra-long user behavior sequence modeling was extended to thousand-scale lengths for the first time.

Although MIMN pushes the handled sequence length to a thousand, it inherits the classic CTR Embedding+MLP paradigm, so the feature scale grows significantly, and encoding all of a user's historical behavior into one fixed-size memory matrix introduces a great deal of noise into the memory cells. Continuing this direction, the 2020 SIM [14] model models lifelong user behavior sequences in two stages. The first stage uses a General Search Unit (GSU) to find the top-K relevant behavior subsequences from the raw long-term sequence in sub-linear time to reduce noise, with two implementations: hard-search, which matches behaviors in the same category as the target item, and soft-search, which uses inner-product approximate nearest neighbor retrieval. The second stage applies an Exact Search Unit (ESU) to the filtered, shorter behavior subsequence to capture more precise user interest, using complex models such as DIN/DIEN. This enables modeling of ultra-long user behavior sequences at request time.
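Hard-search in the GSU is essentially a category filter over the long behavior log. A minimal sketch under an assumed record layout (the field names are hypothetical):

```python
def gsu_hard_search(behaviors, target_category, top_k=50):
    """SIM's hard-search GSU (illustrative): keep only behaviors in the same
    category as the target item, most recent first, capped at top_k. The
    filtered subsequence is then fed to the ESU (e.g., a DIN/DIEN model).
    Each behavior is assumed to be a dict like
    {"item": 123, "category": "shoes", "ts": 1600000000}."""
    same_cat = [b for b in behaviors if b["category"] == target_category]
    same_cat.sort(key=lambda b: b["ts"], reverse=True)
    return same_cat[:top_k]

log = [{"item": 1, "category": "shoes", "ts": 100},
       {"item": 2, "category": "books", "ts": 200},
       {"item": 3, "category": "shoes", "ts": 300}]
print(gsu_hard_search(log, "shoes", top_k=2))    # item 3, then item 1
```

SIM is shown in the figure below: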



On modeling the evolution of user sequences, DTS [15] introduces time information into an ordinary differential equation and uses a neural network to continuously model the evolution of user interest from historical behaviors. In e-commerce scenarios, however, hot items in temporary promotions can become a user's short-term new interest; behavior sequences alone usually cannot predict such new interests, and predicting them depends heavily on the evolution of the items themselves. The Deep Time-Aware Item Evolution Network (DTAN [16]) therefore proposes a time-attention-based item evolution modeling network to address this problem.

The works above all assume that every user behavior reflects the user's true intent, but in real e-commerce scenarios some behaviors may be random and unrelated to true intent; recommending on the basis of such irrelevant behaviors may produce wrong recommendations, and behaviors that do reflect true intent may also be missing. Based on this, paper [17] proposes a Kalman-filtering-based attention mechanism to overcome these observation errors and omissions. The authors use a Transformer to capture the relationships and dynamics of long-term behavior sequences, but within the attention they treat the user's historical behaviors as indirect measurements of the user's hidden interests and derive an analytical solution for those hidden interests. The final model delivers a relatively significant improvement.

Future outlook

With the development of Internet technology, cloud computing and real-time computing services are pushing feed recommendation toward real-time, edge-oriented deployment to better satisfy users' diverse needs. This requires real-time tracking and modeling of user interests and further upgrades to existing deep-learning recommendation models: capturing user interests promptly makes it possible to recommend items of interest more quickly and accurately, improving the user experience.

With the proliferation of smart devices and the development of 5G and the Internet of Things, new modes of interaction in future scenarios will let us obtain user information in more dimensions and improve recommendation models. With the explosive growth of feed scenarios, the commercial value of recommendation systems keeps rising and personalized recommendation has become standard; how to satisfy user experience and how to raise commercial value are important directions for the deep exploration of recommendation systems by experts in industry and academia.

