1.9 trillion parameters: the industry's first trillion-parameter recommendation ranking model

2021-02-03 15:24:02

Published by Machine Heart

Machine Heart Editorial Department

A look inside the internal structure of Kuaishou's trillion-parameter fine-ranking model, the first of its kind in the industry.

A personalized recommendation system aims to provide a "customized" product experience based on user behavior data, and a precise recommendation model is the core competitive advantage of many Internet products. As a short-video app of national scale, Kuaishou recommends tens of billions of videos to hundreds of millions of users every day. This raises a core challenge: how can the recommendation model accurately describe and capture user interests?

Today's industry solutions usually combine massive datasets with huge numbers of fitted parameters to train deep learning models that more closely match reality. Google recently released Switch Transformer, the first trillion-scale model, with 1.6 trillion parameters; it trains up to 4 times faster than the largest language model Google had previously developed (T5-XXL).

In fact, the total parameter count of Kuaishou's trillion-parameter fine-ranking model exceeds 1.9 trillion: it is larger in scale and has already been deployed in production. This article introduces the development history of Kuaishou's fine-ranking model.


First look at a comparison chart, from left to right:

  • Google BERT-large NLP pre-training model: 340 million parameters
  • Google Meena open domain chatbot: 2.6 billion parameters
  • Google T5 pre-training model: 11 billion parameters
  • OpenAI GPT3 language model: 175 billion parameters
  • Google Switch Transformer language model: 1.6 trillion parameters
  • Kuaishou fine-ranking model: 1.9 trillion parameters

Parameter-personalized CTR model: PPNet

Before 2019, the Kuaishou app was built mainly around a two-column waterfall feed: users first browsed video covers in the feed, then clicked into a video to watch, a two-stage interaction. In this form, the CTR prediction model becomes particularly critical, because it directly determines whether users are willing to click the videos shown to them. At the time, the mainstream recommendation models in the industry were still simple fully connected deep learning models such as DNN and DeepFM. However, the semantics a given user associates with a video deviate in a personalized way from the semantic model learned jointly over all users, so learning a unique personalized bias on the DNN network parameters for each user became an optimization direction for the Kuaishou recommendation team.

In the field of speech recognition, the LHUC algorithm (Learning Hidden Unit Contributions), proposed in work from 2014 and 2016, centers on speaker adaptation: its key breakthrough is to learn speaker-specific hidden unit contributions inside the DNN to improve recognition for different speakers. Borrowing LHUC's idea, the Kuaishou recommendation team experimented on the fine-ranking model and, after many rounds of iteration, designed a gating mechanism that personalizes the DNN network parameters while still letting the model converge quickly. Kuaishou calls this model PPNet (Parameter Personalized Net). According to Kuaishou, after PPNet went live in 2019, it significantly improved the model's CTR prediction ability.


PPNet structure diagram

As shown in the figure above, the left side of PPNet is a common DNN network structure composed of sparse features, embedding layers, and neural-network layers. On the right are the modules unique to PPNet: the Gate NNs, plus id features fed only to the Gate NNs. Here uid, pid, and aid denote the user id, photo id, and author id respectively. The embeddings of all the features on the left are concatenated with the embeddings of these 3 id features to form the input of every Gate NN. Note that the left-side feature embeddings do not receive back-propagated gradients from the Gate NNs; this reduces the Gate NNs' influence on the convergence of the existing feature embeddings.

The number of Gate NNs equals the number of neural-network layers on the left, and each Gate NN's output is multiplied element-wise with the input of the corresponding layer to act as the user's personalized bias. Each Gate NN is a 2-layer network whose second-layer activation is 2 * sigmoid, constraining every output element to the range [0, 2] with a default value of 1. When a Gate NN outputs the default value, PPNet is equivalent to the left-hand network alone.

Experiments showed that adding personalized bias terms to the layer inputs through the Gate NNs significantly improves the model's target prediction ability. PPNet uses the Gate NNs to personalize the DNN network parameters and thereby improve prediction. In principle, any prediction scenario built on a DNN model can use it, such as personalized recommendation, advertising, and DNN-based reinforcement learning.
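The gating mechanism described above can be illustrated with a minimal NumPy sketch. All dimensions, weights, and variable names here are illustrative assumptions for exposition, not Kuaishou's actual implementation; only the structure (id embeddings feeding a 2-layer Gate NN with a 2 * sigmoid output that element-wise scales a layer input) follows the text.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Illustrative sizes (not Kuaishou's actual dimensions).
feat_dim, id_dim, hidden = 16, 8, 32

# Concatenated embeddings of all left-tower features for one request.
feat_emb = rng.normal(size=feat_dim)
# Embeddings of uid / pid / aid, concatenated.
id_emb = rng.normal(size=3 * id_dim)

# Gate NN input: left-tower embeddings (gradient-stopped during training,
# so treated as constants here) concatenated with the 3 id embeddings.
gate_in = np.concatenate([feat_emb, id_emb])

# 2-layer Gate NN; the second activation is 2 * sigmoid, so every output
# element lies in (0, 2) with a neutral value of 1.
W1 = rng.normal(size=(gate_in.size, hidden)) * 0.1
W2 = rng.normal(size=(hidden, feat_dim)) * 0.1
gate = 2.0 * sigmoid(np.maximum(gate_in @ W1, 0.0) @ W2)

# The gate scales the corresponding DNN layer's input element-wise,
# injecting the user's personalized bias.
personalized_input = feat_emb * gate
```

When the gate outputs its neutral value of 1 everywhere, `personalized_input` equals `feat_emb` and the network reduces to the plain left-hand DNN, matching the equivalence noted above.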

Multi-objective prediction optimization: an MMoE-based multi-task learning framework

As short-video users' needs continued to evolve, Kuaishou released version 8.0 in September 2020. This version added a bottom navigation bar with a new "Featured" tab supporting a single-column, swipe-up-and-down feed. Supporting both the two-column click feed and the single-column swipe feed is meant to give users a better consumption experience and more diverse ways to consume content. Under the new interface, a considerable number of users use both the two-column and single-column feeds, but their consumption patterns and interaction forms on the two pages differ greatly, so the data distributions they produce also differ greatly. How to model both parts of the data and make good use of them became an urgent problem for the Kuaishou recommendation team.

The Kuaishou team found that multi-task learning becomes more important as single-column business scenarios grow. In the single-column scenario, the user's interactions all follow from the video being shown directly, with no pivotal click behavior as in the two-column feed. These interaction behaviors are relatively equal in importance, and there are dozens of them (watch-time-related prediction targets, likes, follows, reposts, and so on).


Prediction targets of the fine-ranking model (partial)

As single-column business data grew, at the model level the recommendation team tried splitting out a separately optimized model for the single-column business. At the feature level, the two-column model's features could be fully reused, with additional personalized bias features and some statistical features added for the single-column targets. At the embedding level, because the early volume of single-column data was small and embedding convergence could not be guaranteed, training was first led by two-column click behavior, and later by single- and two-column video-viewing behavior (effective plays, long plays, short plays). At the network-structure level, training was based on a shared-bottom structure: unrelated targets each occupy their own tower, while related targets share the top-level output of the same tower, which improves target prediction to some extent.

After going online the model showed some effect at first, but problems were quickly exposed. First, it did not account for the difference in embedding distributions between the single- and two-column businesses, so the embeddings were insufficiently learned. Second, at the multi-task-learning level, in the single-column scenario all user interactions follow from the single act of showing the current video, so the targets influence one another, and improving a single target does not necessarily bring overall online gains.

Therefore, designing a good multi-task learning framework in which all prediction targets improve is critical. Such a framework must take into account the data, features, embeddings, network structure, and the characteristics of individual user interactions. After thorough investigation and practice, the recommendation team decided to adopt the MMoE (Multi-gate Mixture-of-Experts) model to improve on the existing one.

MMoE is a classic multi-task learning algorithm proposed by Google. Its core idea is to replace the shared-bottom network with an expert layer: multiple gating networks, one per target, learn a set of weights over the expert networks, fuse the expert outputs into a per-task representation, and each task network then learns its task on top of that fused representation.
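The core MMoE computation, as described in that paper, can be sketched in a few lines of NumPy. The sizes, initializations, and names below are illustrative assumptions, not Kuaishou's configuration; the point is the structure: shared experts, one softmax gate per task, and per-task towers on the fused representation.

```python
import numpy as np

rng = np.random.default_rng(42)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative sizes: 4 experts, 3 prediction targets (tasks).
in_dim, expert_dim, n_experts, n_tasks = 8, 16, 4, 3

x = rng.normal(size=in_dim)  # shared input representation

# The expert layer replaces the shared-bottom network.
experts = [rng.normal(size=(in_dim, expert_dim)) * 0.1 for _ in range(n_experts)]
expert_outs = np.stack([np.maximum(x @ W, 0.0) for W in experts])  # (n_experts, expert_dim)

# One gating network per task learns task-specific expert weights.
gates = [rng.normal(size=(in_dim, n_experts)) * 0.1 for _ in range(n_tasks)]
task_towers = [rng.normal(size=(expert_dim, 1)) * 0.1 for _ in range(n_tasks)]

task_preds = []
for g, tower in zip(gates, task_towers):
    w = softmax(x @ g)           # (n_experts,) mixture weights for this task
    fused = w @ expert_outs      # weighted fusion of expert outputs
    task_preds.append((fused @ tower).item())  # task tower on the fused representation
```

Because each task has its own gate, tasks with conflicting objectives can weight the experts differently, which is exactly the flexibility a single shared-bottom tower lacks.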

Drawing on the MMoE algorithm and the difficulties of Kuaishou's recommendation scenario described above, the recommendation team modified MMoE and designed a new multi-task learning framework. Specifically, at the feature level, semantics were unified: features with inconsistent semantics between the single- and two-column businesses were corrected, and single-column-specific user features were added. At the embedding level, the space was remapped: an embedding transform layer directly learns the mapping between single- and two-column embeddings, helping map both into a unified distribution. At the feature-importance level, a slot-gating layer was designed to select feature importance per business.

Through these three changes, the model normalizes and regularizes the input-layer embeddings at three levels (feature semantics, embedding distribution across businesses, and per-business feature importance) and remaps them into a unified representation space, allowing the MMoE network to better capture the posterior probability relationships among the tasks in that space. With this improvement to MMoE, all of the model's targets improved significantly.

Short-term behavior sequence modeling: the Transformer model

In Kuaishou's fine-ranking model, the user's historical behavior features are very important and represent the dynamic changes in user interests well. In Kuaishou's recommendation scenario, user behavior features are rich and changeable, and their complexity far exceeds that of video or context features, so an algorithm that can effectively model user behavior sequences is needed.

At present, user behavior sequence modeling in the industry falls into two main modes: one takes a weighted sum over the user's historical behaviors; the other performs time-series modeling with models such as RNNs. In Kuaishou's early two-column fine-ranking model, the user behavior sequence was simply sum-pooled as model input. In the single-column scenario, the user passively receives the recommended video; having lost the cover information, the user must watch for a while before giving feedback. The user's active choice over videos is thus reduced, which makes the scenario better suited for the recommendation system to do E&E (Exploit & Explore) on the user's interests.

Kuaishou's sequence modeling is inspired by the Transformer, a classic neural machine translation model proposed by Google in 2017; the later, widely popular BERT and GPT-3 are also based on parts of its structure. The Transformer consists of two parts, an Encoder and a Decoder. The Encoder models the input language sequence, which closely resembles the goal of user behavior sequence modeling, so Kuaishou borrowed this part of the structure and optimized its computation cost.


MMoE combined with Transformer to model user interest sequence

First, the Kuaishou recommendation team uses the user's video-playback history as the behavior sequence. Candidate sequences include the user's long-play history, short-play history, click history, and so on. Such a list records, for each watched video, the video id, author id, video duration, video tags, watch duration, watch time, and other content, fully describing the user's viewing history.

Second, a log transform of the video watch time replaces position embedding. In Kuaishou's recommendation scenario, the user's recent viewing behavior is more relevant to the current prediction, while long-term viewing behavior reflects the user's multi-interest distribution; the log transform captures this correlation better.

Finally, multi-head self-attention is replaced with multi-head target attention, using the current embedding-layer input as the query. This design has two purposes. First, the current user features, candidate-video features, and context features carry more information than the behavior sequence alone. Second, it reduces computation from O(d*n*n*h) to O(d*n*h + e*d), where d is the attention dimension, n the sequence length, h the number of heads, and e*d the cost of transforming the embedding-layer dimension into the attention dimension.
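The two key substitutions (log-transformed watch time in place of position embedding, and a single target-attention query in place of self-attention) can be sketched as follows. Every size, weight, and scale factor here is an illustrative assumption; only the structure follows the description above. Note there is one query per head rather than n queries, which is where the n*n term in the cost drops to n.

```python
import numpy as np

rng = np.random.default_rng(7)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Illustrative sizes: n history entries, h heads, d per-head dim, e input dim.
n, h, d, e = 50, 2, 8, 12

hist = rng.normal(size=(n, e))       # behavior-sequence embeddings
watch_time = rng.uniform(1, 300, n)  # seconds watched per historical video

# Log-transformed watch time stands in for position embedding (the 0.01
# scale is an arbitrary choice for this sketch).
hist = hist + np.log1p(watch_time)[:, None] * 0.01

# Target attention: the query comes from the current embedding-layer input
# (user + candidate video + context), not from the sequence itself.
query = rng.normal(size=e)

out_heads = []
for _ in range(h):
    Wq = rng.normal(size=(e, d)) * 0.1
    Wk = rng.normal(size=(e, d)) * 0.1
    Wv = rng.normal(size=(e, d)) * 0.1
    q = query @ Wq                # (d,) a single query, so scoring is O(n),
    k, v = hist @ Wk, hist @ Wv   # not the O(n*n) of full self-attention
    attn = softmax(q @ k.T / np.sqrt(d))
    out_heads.append(attn @ v)    # (d,) per-head sequence summary

out = np.concatenate(out_heads)   # (h * d,) summary fed to the ranking model
```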

The modified Transformer network significantly improves the model's prediction ability: in offline evaluation, the watch-time prediction improved markedly, and online user watch time also rose significantly.

Long-term interest modeling

For a long time, Kuaishou's fine-ranking model leaned toward the user's recent behavior. As described above, with the Transformer and MMoE models, the fine-ranking model accurately models users' short-term interests and has delivered very large gains. The earlier model used only the user's most recent dozens of historical behaviors, and given the nature of the short-video industry, those behaviors usually express only the user's interests over a short period. This made the model rely too heavily on short-term behavior and left the user's long-term interests under-modeled.

Given Kuaishou's business characteristics, the recommendation team also modeled users' long-term interests, so that the model is aware of a user's long-term history. The team found that after extending the user's interaction history sequence (plays, likes, follows, reposts, etc.), the model captures some latent user interests better, even when such behaviors are relatively sparse. Building on the previous model, the team designed and refined an ultra-long-term interest modeling module that comprehensively models user behavior from several months up to a year, with behavior sequences up to ten thousand entries long. The model has been rolled out across all businesses and has achieved large online gains.


Structure of Kuaishou's long-term user-interest fine-ranking model

Hundreds of billions of features, trillions of parameters

As the model iterates, the deep learning network grows ever more complex and the number of features added to the model keeps increasing, so the model's feature capacity has become a bottleneck for iterating the fine-ranking model. Limited capacity not only caps the feature scale, forces some features to be evicted, and brings instability to model convergence, but also makes low-frequency features easier to evict, worsening online cold-start performance (new videos, new users) and treating long-tail videos and new users unfairly.

To solve this, the Kuaishou recommendation and architecture teams improved the training engine and online serving so that both offline training and the online serving service can scale flexibly with the configured feature count, supporting the fine-ranking model with hundreds of billions of features and trillions of parameters, offline and online. In particular, the new model distributes traffic more fairly to new videos and new users, with significant improvement in their metrics, realizing Kuaishou's recommendation principle of inclusiveness. The current Kuaishou fine-ranking model has more than 100 billion features and more than 1.9 trillion total parameters.

Online training and estimation services

To support online training and real-time inference of a model with hundreds of billions of features in recommendation scenarios, the recommendation team reworked the training framework and the parameter server of the online inference service. In the online learning of recommendation models, the parameter server storing embeddings must control memory usage precisely to keep training and inference efficient. To this end, the team proposed a collision-free, memory-efficient parameter server design: the Global Shared Embedding Table (GSET).


Mapping every ID to an embedding vector quickly fills a machine's memory. To keep the system running for long periods, GSET uses a customized feature-score eviction strategy that keeps the memory footprint below a preset threshold at all times. Traditional cache eviction strategies such as LFU and LRU consider only how often entities appear and mainly aim to maximize cache hit rate; the feature-score strategy additionally considers machine-learning-specific information when deciding which features to evict.

During the online learning of the recommendation system, large numbers of low-frequency IDs enter the system. These IDs usually never appear in future predictions, and the system would quickly evict them again after admitting them. To prevent this meaningless admission-and-eviction churn from hurting system performance, GSET supports feature admission strategies that filter out low-frequency features. Meanwhile, to improve GSET's efficiency and reduce cost, Kuaishou also adopted a new class of storage device: non-volatile memory (Intel AEP). Non-volatile memory gives a single machine several terabytes of capacity at near-memory access speeds. To adapt to this hardware, the team implemented NVMKV, an underlying KV engine that supports GSET, ensuring the online stability of the trillion-parameter model.
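To make the admission-plus-eviction idea concrete, here is a toy Python model of the two policies working together. The class name, the counting-based admission threshold, and the additive scoring rule are all illustrative assumptions, not GSET's actual design; the sketch only shows why admission filtering prevents one-off IDs from churning the table, and how a feature score generalizes LFU by folding in ML-specific importance.

```python
class FeatureStoreSketch:
    """Toy model of admission + score-based eviction for an embedding table.
    Thresholds and the scoring formula are illustrative, not GSET's design."""

    def __init__(self, capacity, admit_min_count=2):
        self.capacity = capacity
        self.admit_min_count = admit_min_count
        self.pending = {}  # low-frequency ids waiting for admission
        self.table = {}    # feature id -> (score, embedding)

    def touch(self, fid, embedding, importance=1.0):
        if fid in self.table:
            score, emb = self.table[fid]
            self.table[fid] = (score + importance, emb)
            return
        # Admission: ignore ids until they appear often enough, so one-off
        # low-frequency ids never enter (and churn) the table.
        self.pending[fid] = self.pending.get(fid, 0) + 1
        if self.pending[fid] < self.admit_min_count:
            return
        del self.pending[fid]
        if len(self.table) >= self.capacity:
            self._evict()
        self.table[fid] = (importance, embedding)

    def _evict(self):
        # Evict the lowest-scoring feature; unlike pure LFU/LRU, the score
        # can fold in ML-specific signals via the importance argument.
        victim = min(self.table, key=lambda k: self.table[k][0])
        del self.table[victim]

store = FeatureStoreSketch(capacity=2)
for fid in ["a", "a", "b", "b", "c", "a", "c"]:
    store.touch(fid, embedding=[0.0])
# "a" was touched most and survives; "b" is evicted when "c" is admitted.
```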

Look to the future

According to Dr. Song Yang, head of the Kuaishou recommendation algorithm team and formerly a Staff Research Manager at Google Research, the short-video industry faces its own unique challenges: a huge user base, a huge volume of video uploads, short content life cycles, and rapidly shifting user interests, among others. Short-video recommendation therefore cannot simply imitate the refined manual operations of the traditional video industry; it must rely on recommendation algorithms to distribute videos in a timely and accurate manner. The Kuaishou recommendation algorithm team has deeply customized and actively innovated for the short-video business, proposing many industry-first recommendation models and ideas, while also posing many system and hardware challenges to the recommendation engineering architecture team.

Dr. Song Yang believes the Kuaishou trillion-parameter fine-ranking model is a milestone breakthrough for recommendation systems. It combines the strengths of sequence models, long- and short-term interest models, gated models, and expert models, making it one of the most comprehensive and effective recommendation models in the industry to date. The model has been fully launched across Kuaishou's main businesses to serve users. Looking forward, there will be more challenges and opportunities for the "algorithm-system-hardware" trinity, which the team hopes will further drive technological innovation and breakthroughs in Kuaishou's recommendation system, improving user experience and creating value for users.

Source: blog.csdn.net/weixin_42137700/article/details/113818220