Exploration and Practice of QQ Music's Recommendation Recall Algorithms

1. Business introduction

1. Business introduction

There are many recommendation products on the QQ Music homepage, such as personalized radio, 30 Songs Per Day, single-song recommendation, UGC playlist recommendation, and AI playlists.


Each of these products has different characteristics and forms. For example, personalized radio provides an immersive listening experience, while the AI playlist updates 30 songs every day. These varied product forms pose many challenges to recommendation algorithms and architecture: different entrances have different optimization objectives and different sample constructions.

2. Features of QQ Music recommendation scenarios

Next, we introduce some features of QQ Music's recommendation scenarios.

First, at the user level, the platform covers a very wide range of listeners, both young and old.


Second, inherent attributes of the target users are relatively scarce. Apart from the profile of the music itself, the only other attributes are a small amount of demographic information filled in by users. At the behavioral level, i.e., user interactions, completing and skipping songs are the main operations, along with favoriting, blocking, following, and adding songs to self-built playlists.

Finally, unlike e-commerce and video-streaming scenarios, repeated consumption is a major feature of music recommendation. In addition, the recommended products are diverse, and the characteristics of different item forms are very distinct: for example, a song's audio, lyrics, and singer, or a UGC playlist's title and cover image.

These recommendation scenarios bring the following challenges to the recall algorithm:

Listening behavior is quite noisy; without careful sample processing and filtering, recall accuracy suffers.

Head-popularity bias is severe; without specific intervention, the recommendation results lack serendipity.

User attributes are scarce, which makes cold start relatively difficult.

3. QQ Music recommendation solutions

Based on the above three problems, we propose the following solutions:

Recall fused with the music knowledge graph;

Introduce sequence and multi-interest recall;

Mine audio-based recall to find songs that "sound similar" for users;

Explore federated learning to address the scarcity of user attributes.

These schemes are described in detail below.

2. Recall fused with the music knowledge graph

First, we introduce recall fused with the knowledge graph. This part mainly improves recall accuracy.

Music itself carries many basic attributes; almost every song has an album, singer, genre, and language. To improve recall accuracy, many recall models incorporate these attributes into the model as the song's Side-Info, and models such as EGES and GraphSage are also used in QQ Music's recall. However, these models have shortcomings. EGES, for example, can fuse meta information such as language and album, but adding such features reduces the generalization of recall; the Douyin ecosystem also distorts the meta-association logic of many songs. In addition, QQ Music's library is very large and rich, so complex graph models have long training cycles and their efficiency depends strongly on engineering capability. Recall fused with the knowledge graph, introduced next, strikes a compromise between these two aspects and works well.


Music has a rich knowledge graph, usually expressed as triples, for example: Jay Chou sang "Dongfengpo", a Chinese-style song. Compared with plain song features, the graph contains richer information and relations, and relations can propagate. Taking self-built playlists as training samples, introducing the graph (right figure) is equivalent to linking together songs that co-occur across different playlists.


Modeling uses the Song2vec method, with the objective function shown in the figure above. On top of Song2vec, relation learning is added, where the gamma factor controls how strongly the current relation is integrated into the model.

There are many ways to construct triples. Taking the genre graph as an example, there are two constructions: (songid1, genre, songid2) and (songid1, relation, genre). The former is common in NLP, but in the music scene this relation is symmetric, and the Cartesian product yields on the order of N(N-1) relation pairs; the latter is more direct, and the number of relations drops to the order of N. Fusing the knowledge graph greatly improves recall accuracy, and the BadCase rate also improves significantly. Taking Quan Zhilong's "Today" as an example: Song2vec alone (left) strongly binds the song to popular Douyin songs, while fusing Song2vec with TransE (right) keeps the associations accurate while retaining some generalization.
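The count difference between the two triple constructions can be sketched with a toy genre group. This is an illustrative example, not QQ Music's actual pipeline; the ids and genre name are hypothetical.

```python
from itertools import permutations

def triples_pairwise(song_ids, genre):
    """(songid1, genre, songid2): every ordered pair of songs sharing the
    genre -- the Cartesian-product construction, ~N(N-1) triples."""
    return [(s1, genre, s2) for s1, s2 in permutations(song_ids, 2)]

def triples_direct(song_ids, genre):
    """(songid, relation, genre): one triple per song, i.e. N triples."""
    return [(s, "has_genre", genre) for s in song_ids]

songs = ["s1", "s2", "s3", "s4"]              # toy genre group, N = 4
pairwise = triples_pairwise(songs, "rock")
direct = triples_direct(songs, "rock")
print(len(pairwise), len(direct))              # 12 4
```

With N = 4 the pairwise construction already produces 12 triples versus 4 for the direct one, which is why the latter scales better on a large library.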

3. Sequence and multi-interest recall

Sequence and multi-interest recall mainly mines the temporal and spatial characteristics of the behavior sequence, as well as multi-interest representations of users.

After improvements to the sample features and model structure, the YouTube model achieves a very good recall effect, and the song completion rate of this recall channel is very high. But there are still problems, for example:

  • Problem 1: users' listening behaviors form a sequence; in recommendation scenarios, besides positional information, the temporal influence of each behavior also matters, i.e., spatial and temporal relationships exist simultaneously;

  • Problem 2: avg/sum pooling over the sequence is too coarse; when a user has many interests, it neutralizes or even wipes out those interests.


Next, we introduce improvements and practice for these problems from two angles: sequence modeling and multi-interest modeling.

3.1 Spatial and temporal modeling schemes

QQ Music adopts SASRec sequence modeling for users' historical play behavior to extract more valuable information, stacking multiple self-attention layers to learn more complex feature transformations. The main idea is to use the user's sequence L to predict the target P. In the self-attention layer, attention weights computed from Q and K are applied to V and fed into the subsequent network; finally sampled_softmax_loss is used for multi-class prediction. With absolute position and relative time fused in, and the item input and output sharing embeddings, the HR@100 metric improves substantially over the YouTube model: SASRec + shared embeddings, modeling both time and position, reaches 23.72% accuracy versus 21.25% for the original YouTube model, an increase of about 2.5 percentage points.
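The core of the SASRec-style layer described above can be sketched in NumPy. This is a minimal single-head illustration under simplifying assumptions (Q = K = V = the input, projection matrices omitted); the shapes and the way position/time embeddings are added are hypothetical, not QQ Music's exact configuration.

```python
import numpy as np

def self_attention(X):
    """Causal scaled dot-product self-attention over a behavior sequence.
    Simplified: Q = K = V = X, no learned projections."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)                    # (L, L) attention logits
    # causal mask: position i may only attend to positions <= i
    mask = np.triu(np.ones_like(scores), k=1).astype(bool)
    scores = np.where(mask, -1e9, scores)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ X                               # (L, d)

L, d = 5, 8
rng = np.random.default_rng(0)
item_emb = rng.normal(size=(L, d))   # embeddings of the listened songs
pos_emb = rng.normal(size=(L, d))    # absolute-position embeddings
time_emb = rng.normal(size=(L, d))   # bucketized relative-time embeddings
# "fusing absolute position and relative time" = adding them to the input
out = self_attention(item_emb + pos_emb + time_emb)
print(out.shape)                     # (5, 8)
```

With the causal mask, the first position attends only to itself, and the final position's output summarizes the whole history and can be matched against candidate items via sampled softmax.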


3.2 Multi-interest extraction scheme

In the QQ Music scene, more than 80% of users listen to more than two genres, and more than 47% listen to songs in more than two languages. Accurately mining users' multiple listening interests, even niche ones, is therefore very important.

Taking the MIND model as an example, a multi-interest model has several important modules:

The first is the Context/Demographic module, which fuses contextual information with statistics such as age, gender, and city;

The second is the multi-interest extractor, which extracts multiple interests from the user sequence and is the core of the model. MIND uses a capsule network for multi-interest extraction; unlike an ordinary neuron, a capsule's input and output are both vectors rather than scalars;

The last is the Online Serving module: online, the multiple interest vectors are used separately for nearest-neighbor retrieval. Each retrieved set corresponds to one of the user's interest clusters, i.e., different User Embeddings index different kinds of user interests online.

When we first tried the model, we encountered some problems, such as:

The clustering effect of song Embedding is not very good;

The discriminative degree of user's interest vector clustering is not enough.

We also made some optimizations for these two problems:

  • Optimization 1: for problem 1, concatenate the language, genre, and other attributes of completed songs onto the Songid, minimizing the model's learning cost and explicitly telling the model which songs should cluster together.

  • Optimization 2: for problem 2, in the second layer, i.e., the dynamic-routing layer, the routing logits are re-initialized for each new sample. This significantly improves the clustering of song embeddings, and MIND combined with side-info and the modified dynamic routing reaches 25.2% on the Hitrate@200 metric, a clear improvement over the two earlier multi-interest baselines.
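The dynamic-routing step with per-sample re-initialized logits can be sketched as follows. This is a simplified illustration of capsule routing, assuming a single shared bilinear map and hypothetical dimensions; it is not MIND's full architecture.

```python
import numpy as np

def squash(v, axis=-1, eps=1e-9):
    """Capsule squashing: keeps direction, maps the norm into [0, 1)."""
    n2 = (v ** 2).sum(axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * v / np.sqrt(n2 + eps)

def dynamic_routing(behavior_caps, S, K, iters=3, seed=0):
    """behavior_caps: (L, d) song embeddings of one listen sequence.
    S: (d, d) shared bilinear map. Returns K interest capsules (K, d).
    Routing logits b are re-initialized per sample (the modified routing)."""
    L, d = behavior_caps.shape
    u_hat = behavior_caps @ S                              # mapped low-level capsules
    b = np.random.default_rng(seed).normal(size=(L, K))    # fresh logits per sample
    for _ in range(iters):
        w = np.exp(b - b.max(axis=1, keepdims=True))
        w /= w.sum(axis=1, keepdims=True)                  # softmax over interests
        interests = squash(w.T @ u_hat)                    # (K, d) high-level capsules
        b = b + u_hat @ interests.T                        # agreement update
    return interests

rng = np.random.default_rng(1)
caps = dynamic_routing(rng.normal(size=(20, 16)),
                       rng.normal(size=(16, 16)) * 0.1, K=4)
print(caps.shape)   # (4, 16)
```

Re-initializing `b` per sample (instead of learning it globally) is what keeps the interest capsules from collapsing toward one shared solution across users.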


3.3 Multi-interest representation method based on Self-Attention

In addition to MIND's capsule network, the industry also has multi-interest representation methods based on Self-Attention. The differences mainly lie in the type of neuron, how weights are assigned, and how they are updated. As the figure below shows, the capsule network (left) normalizes its weight distribution over all capsules of the previous layer, while in the attention approach (right) each attention head independently processes its input.


We have also experimented extensively with Self-Attention-based multi-interest extraction. Experiments show that the Self-Attention multi-interest model describes users' preferences across genres and languages well, and the average popularity of recommendations is lower than with the YouTube recall. The picture on the left is a screenshot of one user's 30 Songs Per Day: the model mined three interests for this user, Mandarin pop, English pop, and Japanese pop. In the AB experiment, completion rate and favorites improved noticeably. Taking 30 Songs Per Day as an example, DAU increased by 2%, total-play and favorite penetration rates increased by more than 2 points, and language and genre diversity increased by 3 points.


This method has also eased the concentration on head songs: their share dropped by about 2%, and the over-recommendation of popular songs improved.

4. Audio recall

Audio recall is a recall method characteristic of music scenarios; it is explained in two parts.

4.1 Audio Feature Mining Method

For songs in the library, we run four-way version detection (pure vocals, pure instrumental, vocals with accompaniment, and other) and ten-way genre detection (rock, folk, country, etc.) to characterize a song's version and genre. Specifically, taking 3 seconds as a segment, each of the 14 class scores yields T values along the time axis, from which we compute statistics: maximum, minimum, mean, variance, kurtosis, and skewness. These statistics over the 14 classes form the extracted audio features, i.e., the song's audio representation (audio vector).
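The feature extraction just described can be sketched directly: six statistics per class, 14 classes, giving an 84-dimensional vector. The random scores below are placeholders for real detector outputs; the skew/kurtosis formulas use the standard moment definitions.

```python
import numpy as np

def segment_stats(scores):
    """scores: (T,) detector scores for one of the 14 classes,
    one value per 3-second segment. Returns six summary statistics."""
    m = scores.mean()
    v = scores.var()
    sd = np.sqrt(v) + 1e-12
    z = (scores - m) / sd
    return {
        "max": scores.max(), "min": scores.min(),
        "mean": m, "var": v,
        "skew": (z ** 3).mean(),            # skewness
        "kurtosis": (z ** 4).mean() - 3.0,  # excess kurtosis
    }

def audio_vector(score_matrix):
    """score_matrix: (14, T) -- 4 version + 10 genre class scores.
    Concatenates the six stats per class into an 84-dim audio vector."""
    return np.concatenate([list(segment_stats(row).values())
                           for row in score_matrix])

vec = audio_vector(np.random.default_rng(0).random((14, 40)))
print(vec.shape)   # (84,)
```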


We did some experimental analysis and drew some conclusions. The upper-left figure shows the distribution of cold-start new songs recommended to users, and the upper-right figure shows the distribution of those users' favorite songs. We computed the Pearson correlation coefficient between the completion rate of a cold-start new song and its audio similarity to the user's favorite songs (the calculation is given below). As the lower-left figure shows, the correlation coefficient across users follows a normal distribution, which indicates to some extent that some users' listening behavior is sensitive to audio (r_value > 0).
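A plausible form of the per-user calculation is sketched below: average audio similarity between each cold-start song and the user's favorites, correlated against completion rates. The data here is random placeholder data and the exact weighting scheme is an assumption; the source only specifies a Pearson correlation between audio similarity and completion rate.

```python
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def audio_sensitivity(new_song_vecs, completion_rates, fav_vecs):
    """Pearson r between each cold-start song's completion rate and its
    mean audio similarity to the user's favorite songs."""
    sims = [np.mean([cosine(s, f) for f in fav_vecs]) for s in new_song_vecs]
    return np.corrcoef(sims, completion_rates)[0, 1]

rng = np.random.default_rng(0)
favs = rng.normal(size=(5, 8))         # a user's favorite-song audio vectors
new_songs = rng.normal(size=(30, 8))   # cold-start candidates shown to the user
rates = rng.random(30)                 # placeholder completion rates
r_value = audio_sensitivity(new_songs, rates, favs)
print(-1.0 <= r_value <= 1.0)          # a user with r_value > 0 is audio-sensitive
```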

Based on this conclusion, audio embeddings are used in several recall scenarios of QQ Music's single-song recommendation. For example, single-point recall based on audio similarity improves users' sense of discovery, and favoriting behavior increased significantly. For the recently popular "Leave The Door Open" by Bruno Mars ("Martian Brother"), audio similarity recalled songs such as "Peaches" and "Walk on Water". Mining songs' audio representations also aids cold-start distribution when no collaborative information is available.


In new-song cold start and new-release recall, QQ Music builds the user's audio preference and the song's audio representation from the audio vector, recalls candidate songs with the song's audio representation, and then uses the user's audio preference as a ranking feature, with very good results.

4.2 Multimodal Audio Recall Method

The methods described above are based on pure audio representations. Can user behavior be combined for metric learning? Through practice we propose the User-Audio Embedding modeling method. The user part is a 40-dimensional user embedding computed by a deep model. The audio part takes one song the user likes and n songs the user dislikes and performs metric learning against the 40-dimensional user embedding. The trained audio model can then produce a 40-dimensional embedding for any audio input. Compared with the plain audio embedding above, the user-audio embedding that fuses user information further improves audio recall accuracy; it also achieved the best results in history on three MIREX genre-classification tasks: country, rap/hip-hop, and K-pop. The User-Audio Embedding model won the MIREX award, and the paper was published at ICASSP; interested readers can look it up.
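The one-liked-vs-n-disliked setup can be sketched as a softmax-based metric-learning objective: score the user embedding against one positive and n negatives, and minimize the negative log-probability of the positive. The temperature and cosine scoring are illustrative assumptions; the paper's exact loss may differ.

```python
import numpy as np

def contrastive_loss(user_emb, pos_emb, neg_embs, temperature=0.1):
    """Pull the 40-dim user embedding toward the audio embedding of a
    liked song, push it away from n disliked songs (softmax over 1+n)."""
    cands = np.vstack([pos_emb[None, :], neg_embs])        # (1 + n, d)
    cands = cands / np.linalg.norm(cands, axis=1, keepdims=True)
    u = user_emb / np.linalg.norm(user_emb)
    logits = cands @ u / temperature                       # scaled cosine scores
    m = logits.max()
    log_probs = logits - m - np.log(np.exp(logits - m).sum())
    return -log_probs[0]                                   # positive is index 0

rng = np.random.default_rng(0)
user = rng.normal(size=40)                    # 40-dim user embedding
liked = user + 0.1 * rng.normal(size=40)      # audio embedding close to the user
disliked = rng.normal(size=(5, 40))           # n = 5 disliked songs
print(contrastive_loss(user, liked, disliked))
```

The loss is small when the liked song's audio embedding is the user's nearest candidate, which is exactly the geometry nearest-neighbor recall relies on.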


5. Federated Learning Recall

5.1 Federated Learning Recall Method

Federated learning is a machine learning technique that trains algorithms across multiple distributed edge devices or servers holding local data samples, without exchanging the samples, thereby protecting data privacy. In recent years, with the rise of federated learning, there have been many successful joint-modeling cases in finance and other fields. We have also begun to introduce vertical federated learning within the Tencent ecosystem to improve recall accuracy.

There are three categories of federated learning:

  • Horizontal federated learning applies when businesses are similar and features overlap heavily; it mainly combines samples;
  • Vertical federated learning applies when the user bases overlap heavily; it mainly combines different features of the same users;
  • Federated transfer learning applies when businesses are dissimilar and both feature and user overlap are small; it mainly transfers feature knowledge across domains.


In the QQ Music scene, we use vertical federated learning to further characterize users. QQ Music jointly trains a two-tower DSSM model with data from another business's system: the QQ Music tower contains song-related attributes such as language, singer, and version, while the other business's tower mainly contains user attributes, user interest preferences, interest tags, and so on.

In online serving, the QQ Music tower produces Item Embeddings and the other business's tower produces User Embeddings; the Item Embeddings are used to build an index, and User Embeddings are computed by real-time online serving for nearest-neighbor queries.
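The serving flow above can be sketched with a brute-force stand-in for a real ANN index (production systems would use something like Faiss); the embedding sizes and song ids are hypothetical.

```python
import numpy as np

class BruteForceIndex:
    """Stand-in for an ANN index: stores normalized Item Embeddings
    from the item tower and answers cosine top-k queries."""
    def __init__(self, item_embs, item_ids):
        self.ids = list(item_ids)
        E = np.asarray(item_embs, dtype=float)
        self.E = E / np.linalg.norm(E, axis=1, keepdims=True)

    def search(self, user_emb, k=3):
        q = user_emb / np.linalg.norm(user_emb)
        scores = self.E @ q                      # cosine similarity to every item
        top = np.argsort(-scores)[:k]
        return [(self.ids[i], float(scores[i])) for i in top]

rng = np.random.default_rng(0)
items = rng.normal(size=(100, 32))               # Item Embeddings, indexed offline
index = BruteForceIndex(items, [f"song_{i}" for i in range(100)])
user_vec = items[7] + 0.01 * rng.normal(size=32) # User Embedding from the user tower
print(index.search(user_vec, k=3)[0][0])         # song_7
```

Because the user vector sits almost on top of item 7 in embedding space, the nearest-neighbor query returns `song_7` first, mirroring how User Embeddings retrieve candidates online.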


5.2 Federated Learning Upgrade Scheme

The figure below is a two-tower multi-objective model. QQ Music upgraded the two-tower recall model so it can be combined with multi-business-scenario modeling, using the MMoE model to learn multiple objectives. The left side is the user side, where different Experts are introduced for learning; the right side holds the business data of different scenarios, including QQ Music's Item side and business X's Item side. This joint learning integrates attributes and features of different domains into the model, learning user representations more accurately. The introduction of federated learning has greatly improved cold-start metrics at entrances such as personalized radio, 30 Songs Per Day, and single-song modules; average cold-start listening duration improved by about 10%. It should be emphasized that federated learning fully protects user privacy: TME strictly follows relevant laws and regulations and the principle of privacy protection, providing users with safer and more reliable services.
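The MMoE piece of this upgrade can be sketched as follows: shared experts on the user side, with one gate per objective mixing their outputs. The dimensions, tanh experts, and two example tasks are illustrative assumptions, not the production configuration.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mmoe(user_feats, experts, gates):
    """experts: list of (W, b) shared expert nets; gates: one weight
    matrix per task. Each task mixes expert outputs with its own gate."""
    expert_out = np.stack([np.tanh(user_feats @ W + b)
                           for W, b in experts])       # (E, d)
    task_reprs = []
    for G in gates:                                    # one gate per objective
        w = softmax(user_feats @ G)                    # (E,) mixture weights
        task_reprs.append(w @ expert_out)              # task-specific user repr
    return task_reprs

rng = np.random.default_rng(0)
f, d, E = 16, 8, 3
experts = [(rng.normal(size=(f, d)), rng.normal(size=d)) for _ in range(E)]
gates = [rng.normal(size=(f, E)) for _ in range(2)]    # e.g. completion & favorite tasks
reps = mmoe(rng.normal(size=f), experts, gates)
print(len(reps), reps[0].shape)   # 2 (8,)
```

Each objective gets its own gated mixture of the shared experts, which is what lets one user tower serve multiple business-scenario objectives at once.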


6. Q&A session

Q1: How are recall samples for music constructed? How does sample selection differ between recall and ranking, and why?

A: The music scene has many entrances, and each entrance's sample distribution differs greatly, as does its feature distribution. For example, the user distribution and features of 30 Songs Per Day differ greatly from those of the radio station, as mentioned at the beginning. For ranking, samples are optimized per entrance, so each entrance's samples are its own. Recall, however, is a model shared across all entrances, so the recall model is trained on overall QQ Music data.

The advantage is that the data is richer and information from different circles can be learned. Deep recall samples mainly use completed-play sequences, plus demographic features and some favoriting information. Ranking-side samples differ in this respect.

The above concerns recall samples for deep models. For ordinary single-point recall, the main question is how to build the graph. Currently the graph is built mainly from users' self-built playlists; this data may contain billions of records. Based on the co-occurrence of songs within playlists and the interactions between songs and users, a very large graph can be constructed, and various graph models can then be used to represent its nodes.

Q2: How do you balance a user's long-term and short-term interests in the music scene?

A: First, the input of the deep recall model is itself a relatively long sequence, describing the user's overall listening behavior over a long period, so its representation is relatively long-term. Single-point (I2I) recall is tied to the user's recent playback behavior and is more short-term. For example, a singer the user has recently favored in the last day or two is treated as a strong short-term interest, and based on it we recommend more songs with similar audio, or collaboratively similar songs, that the user may like.

Therefore, along the long-term/short-term dimension, one approach is the deep sequence model, which focuses more on long-term interests, while single-point recall is relatively short-term. We also build long-term and short-term user portraits and run corresponding recall paths on them to satisfy both long-term and short-term interest exploration. This is not done only in recall: long-term and short-term user features are also added to the ranking model to capture user interests. The recall results must then be fused to achieve the best final effect.

Q3: For multi-interest recall, how do you choose the number of recalled items per interest?

A: We ran online experiments. First, on choosing the number per interest: in offline experiments we compared different hyperparameters to see how different settings affect Hitrate. Generally, the larger K is, the better the diversity; but if K is too large, accuracy decreases. Online there are several options. For example, with three clusters, each cluster recalls 50 songs, i.e., 150 in total, with the Quota allocated fairly to each cluster. Another approach is for each cluster to recall more songs, sort them all together, and truncate to 150; stronger clusters then get more exposure and weaker clusters less.

We ran this online experiment as well: merging the three clusters' results for sorting, instead of assigning a Quota of 50 to each, yields relatively better data, but the share of popular songs grows and content utilization is not that high. So the current method gives each cluster a fixed quota, so that different interests, even low-weight ones, have a chance to enter the ranking stage and compete fairly.

The results shown in the multi-interest recall section are based on this method; the overall effect is very good, with multiple metrics improving together.
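The per-cluster quota strategy described in this answer can be sketched as follows. The cluster names, scores, and deduplication logic are illustrative assumptions.

```python
def quota_merge(cluster_candidates, total=150):
    """cluster_candidates: {interest_cluster: [(song, score), ...]},
    each list sorted by score descending. Giving every cluster an equal
    quota lets weaker interests still reach the ranking stage."""
    quota = total // len(cluster_candidates)
    merged, seen = [], set()
    for cands in cluster_candidates.values():
        taken = 0
        for song, score in cands:
            if song not in seen and taken < quota:
                merged.append((song, score))
                seen.add(song)
                taken += 1
    return merged

cands = {
    "mandarin_pop": [(f"m{i}", 1.0 - i * 0.01) for i in range(80)],
    "english_pop":  [(f"e{i}", 0.9 - i * 0.01) for i in range(80)],
    "japanese_pop": [(f"j{i}", 0.8 - i * 0.01) for i in range(80)],
}
picked = quota_merge(cands, total=150)
print(len(picked))   # 150: 50 songs from each of the three interest clusters
```

The alternative (merge-then-truncate) would instead sort all 240 candidates by score and keep the top 150, letting high-scoring clusters crowd out the rest.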

Q4: Questions about audio features.

A: Audio features are added to the ranking model and are widely used in QQ Music's ranking models. As mentioned earlier, audio is a key feature in the music scene and can reflect user interest to a certain extent.

Combined with the analysis in the slides, the listening behavior of some, even most, users correlates with audio information. Our recent audio work, whether singer-level or song-level audio-similarity recall, performs very well, reflected in a higher favoriting rate. Because users are unfamiliar with these songs, or because the recall breaks out of the current collaborative logic, it brings users more of a sense of discovery.

Q5: What is the QQ Music technology stack like?

A: On the data side, QQ Music is built on a ClickHouse + Superset OLAP analysis and visualization platform, combined with Tencent's big-data components; QQ Music has also open-sourced some components and built its own machine learning platform. For model training, TensorFlow is the main framework. Data processing mainly uses big-data languages and components such as Hive. At the service or Serving level, skills such as C++ and Go are required, which is also the direction of most Tencent businesses.

Origin: blog.csdn.net/qq_35812205/article/details/123971401