YouTube deep learning video recommendation system

1. Application scenarios of the recommendation system

As the world's largest video-sharing site, YouTube hosts content that comes almost entirely from UGC (User-Generated Content). This content model has two important implications for recommendation:

  • First, its business model differs from Netflix and from Chinese streaming platforms such as Tencent Video and iQiyi. Those services mostly offer licensed or self-produced head content such as films and drama series, whereas YouTube's content consists almost entirely of videos uploaded by users, covering a huge variety of styles, so the head effect is much weaker;
  • Second, because YouTube's video corpus is enormous, it is hard for users to find the content they like on their own.

2. YouTube recommendation system architecture

To rank this massive video corpus quickly and accurately, YouTube adopts the classic recall layer + ranking layer recommendation architecture.


The recommendation process is divided into two stages. The first stage uses the Candidate Generation Model to quickly screen candidate videos, cutting the candidate set from millions down to several hundred; this corresponds to the recall layer in the classic recommendation architecture. The second stage uses the Ranking Model to finely rank those few hundred candidates; this corresponds to the ranking layer.

For both the candidate generation model and the ranking model, YouTube uses deep learning solutions.
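
The two-stage flow can be sketched in plain Python. This is only an illustrative skeleton: `candidate_score` and `rank_score` are hypothetical stand-ins for the two models described below, not YouTube's actual interfaces.

```python
# Hypothetical two-stage recommendation pipeline:
# a cheap candidate-generation score prunes millions of videos
# down to a few hundred, then a more expensive ranking score orders them.

def recommend(user, corpus, candidate_score, rank_score,
              n_candidates=300, n_results=20):
    # Recall layer: fast, coarse screening of the full corpus.
    candidates = sorted(corpus, key=lambda v: candidate_score(user, v),
                        reverse=True)[:n_candidates]
    # Ranking layer: fine-grained scoring of the small candidate set only.
    ranked = sorted(candidates, key=lambda v: rank_score(user, v),
                    reverse=True)
    return ranked[:n_results]
```

The key design point is cost asymmetry: the recall score must be cheap enough to touch millions of items, while the ranking score can use many more features because it only sees a few hundred.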

3. Candidate set generation model

The candidate generation model is used for video recall; its architecture is shown in the figure below.

The bottom layer is the input layer. Input features include embeddings of the videos in the user's watch history and embeddings of the user's search terms. These embeddings are pre-trained with an Item2vec-style method on the user's watch sequences and search sequences.

Beyond the video and search-term embeddings, the feature vector also includes the user's geographic-location embedding, age, gender, and other features. One detail worth noting: for the sample-age ("example age") feature, YouTube feeds the model not only the raw value but also its square as an additional input.
This gives the network access to a simple non-linear transform of the feature.
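
As a small illustration (the values are made up, not YouTube's), feeding a feature together with its square looks like this:

```python
import numpy as np

# "Example age" in days for a batch of training samples (illustrative values).
example_age = np.array([0.5, 3.0, 10.0, 30.0])

# Feed both the raw value and its square, so the linear layers that follow
# can fit a simple non-linear (quadratic) response to this feature.
age_features = np.stack([example_age, example_age ** 2], axis=1)
# age_features[:, 0] is the raw age, age_features[:, 1] its square.
```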

Once the features are determined, they are concatenated in the concat layer and fed into the stacked ReLU layers above for training.

After three ReLU layers, YouTube uses a softmax function as the output layer. Notably, the output layer does not predict whether the user will click a given video, but rather which video the user will watch next, which differs from most deep recommendation models.

Overall, the candidate generation model of the YouTube recommendation system is a standard deep recommendation model with pre-trained embedding features: it follows the Embedding+MLP architecture, differing only in the final output layer.
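
A minimal numpy sketch of this Embedding+MLP structure follows. All sizes, initializations, and the two extra scalar features are illustrative assumptions, not the paper's actual configuration; the point is the shape of the computation: pooled embeddings, a ReLU tower, and a softmax over the whole video corpus.

```python
import numpy as np

rng = np.random.default_rng(0)
N_VIDEOS, EMB = 1000, 16          # toy corpus size and embedding width

# Pre-trained lookup tables (random stand-ins here).
video_emb = rng.normal(size=(N_VIDEOS, EMB))
search_emb = rng.normal(size=(N_VIDEOS, EMB))   # stand-in search-term table

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def candidate_probs(watch_ids, search_ids, other_feats, weights):
    # Average-pool watch and search embeddings, then concat other features.
    x = np.concatenate([video_emb[watch_ids].mean(axis=0),
                        search_emb[search_ids].mean(axis=0),
                        other_feats])
    for W, b in weights:              # three ReLU layers
        x = relu(x @ W + b)
    # Softmax over the whole corpus: "which video is watched next?"
    # The last hidden activation x plays the role of the user embedding.
    return softmax(x @ video_emb.T)

# Layer sizes: input = 16 + 16 + 2 extra features; the last layer matches
# EMB so the logits are dot products between user and video embeddings.
dims = [EMB + EMB + 2, 64, 32, EMB]
weights = [(rng.normal(size=(a, b)) * 0.1, np.zeros(b))
           for a, b in zip(dims[:-1], dims[1:])]

probs = candidate_probs([1, 2, 3], [4, 5], np.array([0.3, 0.09]), weights)
```

Making the last hidden layer the same width as the video embeddings is what later allows serving to be reduced to a nearest-neighbor search between user and video vectors.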

4. Unique online service method of candidate set generation model

5. Ranking Model

The architecture of YouTube's deep learning ranking model is as follows.
In the input layer, unlike the candidate generation model, which must coarsely screen millions of candidates, the ranking model only needs to rank a few hundred candidate videos, so it can afford many more features for fine ranking. Specifically, from left to right, the features in YouTube's input layer are:

  • impression video ID embedding: the embedding of the current candidate video;
  • watched video IDs average embedding: the average of the embeddings of the last N videos the user watched;
  • language embedding: the embedding of the user's language and of the current candidate video's language;
  • time since last watch: the time elapsed since the user last watched a video from the same channel;
  • #previous impressions: the number of times this video has already been shown to this user.

After these five groups of features are concatenated, they pass through a three-layer ReLU network for thorough feature crossing before reaching the output layer. Note that the output layer of the ranking model differs from that of the candidate generation model in two ways: first, the candidate generation model uses softmax as its output layer, while the ranking model uses weighted logistic regression (Weighted LR); second, the candidate generation model predicts which video the user will watch next, while the ranking model predicts whether the user will click the current video.

The fundamental reason the ranking model uses a different output layer is that YouTube wants a more accurate estimate of the user's watch time, since watch time is YouTube's most valued business metric, and using Weighted LR as the output layer achieves exactly that.
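
Why weighting achieves this can be seen with a short derivation, roughly following the argument in the original paper. Suppose there are $N$ impressions, of which $k$ are positives (clicks) with watch times $T_i$. With each positive weighted by $T_i$ and each negative weighted by 1, the odds learned by the weighted LR become

```latex
\text{odds} = \frac{\sum_i T_i}{N - k} \approx \frac{\sum_i T_i}{N}
            = \frac{\sum_i T_i}{k}\cdot\frac{k}{N}
            = \mathrm{E}[T]\cdot p \approx \mathrm{E}[T]
```

where $p = k/N$ is the click probability. Since $p$ is small in practice, the odds, and hence $e^{\text{logit}}$ computed at serving time, approximate the expected watch time.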

When training Weighted LR, each sample is assigned a weight whose magnitude reflects the sample's importance. To estimate watch time, YouTube sets the weight of each positive sample to the length of time the user watched the video and then trains with Weighted LR, so the model learns information about users' watch time.
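
A minimal sketch of weighted logistic regression trained by gradient descent, under the assumption (as described above) that positives carry their watch time as weight and negatives carry weight 1. The data, learning rate, and step count are toy values for illustration.

```python
import numpy as np

def train_weighted_lr(X, y, w, lr=0.5, steps=2000):
    """Logistic regression where sample i contributes with weight w[i].
    For watch-time weighting: w[i] = watch seconds if y[i] == 1, else 1."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ theta)))
        grad = X.T @ (w * (p - y)) / w.sum()  # gradient of weighted log loss
        theta -= lr * grad
    return theta

def predict(x, theta):
    return 1.0 / (1.0 + np.exp(-(x @ theta)))

# Toy data: column 0 is a bias term, column 1 a single feature.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])
w = np.array([1.0, 1.0, 30.0, 60.0])   # positives weighted by watch time
theta = train_weighted_lr(X, y, w)
```

The heavily weighted positives pull the decision surface toward them, which is exactly the mechanism that lets the learned odds track watch time rather than plain click probability.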

For the ranking model, a model service platform such as TensorFlow Serving must be used for online inference of the model.

6. Processing of training and test samples

To improve training efficiency and prediction accuracy, YouTube applies several engineering measures to its training samples, chiefly three:

  • The candidate generation model casts recommendation as a multi-class classification problem: when predicting the next video to be watched, every candidate video is its own class, so training a full softmax over millions of classes is extremely inefficient.
    • YouTube therefore uses the negative sampling technique common in word2vec training to reduce the number of classes evaluated per prediction, which greatly speeds up model convergence.
  • When preprocessing the training set, YouTube does not use the raw user logs directly, but instead extracts an equal number of training samples per user.
    • The purpose is to limit the outsized influence of highly active users on the model's loss, which would otherwise bias the model toward active users' behavior patterns and neglect the experience of the far more numerous long-tail users.
  • For the test set, YouTube does not use classic random leave-one-out splitting, but instead always holds out each user's most recent behavior.
    • Holding out only the last watch event for testing avoids introducing future information, i.e., the data-leakage problem that arises when training can "see" events that happened after the test interaction.
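
The second and third measures, capping samples per user and holding out each user's latest event, can be sketched together. This is a hypothetical helper with made-up names, not YouTube's pipeline:

```python
import random
from collections import defaultdict

def build_train_test(logs, per_user=10, seed=42):
    """logs: list of (user, video, timestamp) events.
    Hold out each user's most recent event as the test set (no future
    leakage), then cap every user at `per_user` training samples so that
    heavy users do not dominate the loss."""
    rng = random.Random(seed)
    by_user = defaultdict(list)
    for user, video, ts in logs:
        by_user[user].append((ts, video))
    train, test = [], []
    for user, events in by_user.items():
        events.sort()                      # chronological order
        *history, last = events
        test.append((user, last[1]))       # most recent event -> test set
        if len(history) > per_user:
            history = rng.sample(history, per_user)
        train.extend((user, v) for _, v in history)
    return train, test
```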

7. Deal with the user's interest in the new video

8. Summary

The YouTube recommendation system is a typical recall layer + ranking layer architecture. The candidate generation model recalls several hundred candidate videos from a corpus of millions, and the ranking model finely ranks those candidates, finally presenting a few dozen recommendations to the user.

The candidate generation model is a typical Embedding+MLP architecture, except that its output layer is a multi-class layer predicting which video the user will watch next. During serving, the video embeddings are extracted from the output layer, the user embedding is taken from the last ReLU layer, and a fast (approximate) nearest-neighbor search between them produces the candidate set.
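
The serving step reduces to a top-k inner-product search between the user embedding and all video embeddings. A brute-force numpy sketch (a real system would swap the exact scan for an approximate nearest-neighbor index):

```python
import numpy as np

def top_k_candidates(user_emb, video_embs, k=5):
    """Serving-time recall: score every video by inner product with the
    user embedding and return the indices of the k highest-scoring videos.
    Production systems replace this exact scan with an ANN index."""
    scores = video_embs @ user_emb          # one dot product per video
    top = np.argpartition(-scores, k)[:k]   # unordered top-k, O(n)
    return top[np.argsort(-scores[top])]    # sort only the k winners
```

Because the softmax logits during training were already dot products of the user vector with the video vectors, this search returns exactly the videos the model would rank highest, without evaluating the softmax over millions of classes.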

The ranking model is also an Embedding+MLP architecture; the difference is that its input layer contains many more user and video features, and its output layer uses Weighted LR, with watch time as the weight of positive samples. Being able to predict watch time brings the model closer to the business goal YouTube actually cares about.

References:

  • Deep learning recommendation system
  • Deep Neural Networks for YouTube Recommendations

Origin: blog.csdn.net/weixin_44127327/article/details/112991157