Recommendation Systems (10): User Behavior Sequence Modeling - the Pooling Route

For recommendation systems, accurately capturing user interests is the core proposition. Whether we optimize samples, features, or model structure, what we are essentially doing is improving the system's ability to capture user interests, so improving this ability plays an important role in improving recommendation quality. It is also the core starting point of an algorithm engineer's daily work.
User historical behavior is very important information. It is rich, differs greatly across users, and changes over time. Effectively using it to discover the real interests hidden behind user behaviors, and to express those interests accurately, not only reflects the differences between users but also captures how a user's interests shift over time, which is critical to the recommendation effect.

1. Commonly used features

The essence of recommendation is ranking, and ranking is the art of features. Although feature engineering does not look as "high-end" as deep models, in real business, optimization based on feature engineering is more stable and reliable than model-side optimization, and its effect is often no worse. Feature engineering must be combined with business understanding: in a concrete scenario, imagine you are an actual user and ask which features would most influence whether you click or convert. Generally speaking, the following feature families can be enumerated:

1.1 Context features

Such as day of the week, time, network type, operating system, client version, etc.

1.2 User features

That is, the user portrait. User features in various dimensions can be shared across other apps, or across different scenarios of the same app (large Internet companies often span e-commerce, short video, travel, payments, and other domains, so it is easy to collect user features along many dimensions and build an accurate user portrait). User features mainly include three parts:

  • Static features: User ID, gender, age, city, occupation, income level, whether a college student, whether married, whether they have children, registration time, whether a VIP, whether a new user, etc. Static features generally discriminate well: people of different genders and ages have very different interests, and whether a user has children directly determines their interest in maternal and infant products.
  • Statistical features: for example, the user's PV (page views), VV (video views), CTR (click-through rate), completion rate, and per-VV duration over the past 30, 14, and 7 days. It is best to include both absolute and relative values; after all, 2 exposures with 1 click has the same CTR as 200 exposures with 100 clicks, but the confidence levels are very different. Most statistical features are posterior features and are very helpful for model prediction. When constructing them, pay attention to the **data time travel** problem (defined below): never include statistics from the sample's own day.
  • Behavior sequence features: currently a very hot research direction and the key to optimizing fine-ranking models. You can construct short-term click sequences and long-term purchase sequences for each user, as well as positive-feedback click/purchase sequences and negative-feedback exposed-but-not-clicked sequences. Sequence length is the current pain point: when sequences get too long, models such as Transformer become computationally heavy, inference latency (RT) and P99 metrics blow up, and large numbers of requests time out.

Data time travel: using future data for training. The statistics attached to a sample at time t may only use data from before t. Suppose CTR features are updated hourly and each sample's CTR is computed from all samples in its hour: the CTR feature of a sample at 21:01 would then be computed from the whole 21:00-22:00 hour, even though it should only use data from before 21:01. The sample at 21:01 ends up using future click data, which is exactly the data time travel problem.
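As an illustration, here is a minimal pandas sketch of a leakage-free item CTR feature; the column names and toy data are hypothetical. The key is that each row's statistic aggregates only the rows strictly before it:

```python
import pandas as pd

# Toy impression log, sorted by event time within each item (hypothetical columns).
df = pd.DataFrame({
    "item_id": [1, 1, 1, 2, 2],
    "ts":      [1, 2, 3, 1, 2],
    "click":   [0, 1, 1, 1, 0],
})
df = df.sort_values(["item_id", "ts"])

g = df.groupby("item_id")["click"]
# shift(1) drops the current row, so the cumulative sums cover the past only.
past_clicks = g.transform(lambda s: s.shift(1).fillna(0).cumsum())
past_views = df.groupby("item_id").cumcount()  # number of earlier impressions

df["ctr_past"] = (past_clicks / past_views.clip(lower=1)).where(past_views > 0, 0.0)
```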

1.3 Item features

Unlike user features, item features usually cannot be shared with other apps: important features such as Item IDs cannot be aligned across apps, which rules out domain transfer. The main ones are as follows:

  • Static features: such as Item ID, author ID, category ID, shelf time, clarity, physical duration, item tags, etc. These features are generally produced by machine recognition, manual annotation, user submission, and operations review, and they are very important.
  • Statistical features: such as the item's PV, VV (video views), CTR, completion rate, and per-VV duration over the past 14, 7, and 3 days. It is best to include both absolute and relative values. As with the user-side statistical features, watch out for data time travel.

1.4 Cross features

Item × User cross features, such as an item's statistics over users of different genders and ages. Although models can cross features automatically, whether they cross them well is another matter, so manually constructing key cross features still makes sense.
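For instance, here is a minimal sketch (hypothetical columns and toy data) of one such manual cross feature: an item's historical CTR computed separately per user gender, which would then be joined back onto training samples by (item_id, gender):

```python
import pandas as pd

# Toy impression log (hypothetical columns).
logs = pd.DataFrame({
    "item_id": [1, 1, 1, 2],
    "gender":  ["m", "f", "m", "f"],
    "click":   [1, 0, 0, 1],
})

# Item x gender CTR: how well this item performs for each gender segment.
cross_ctr = (logs.groupby(["item_id", "gender"])["click"]
                 .mean()
                 .rename("item_gender_ctr")
                 .reset_index())
```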

2. How to deal with features

Feature processing mainly includes the following situations:

2.1 Discrete values

Embed them directly. Pay attention to convergence issues with high-dimensional sparse ID features such as Item ID and User ID.
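A common way to keep such ID spaces manageable is to hash the raw IDs into a fixed number of buckets before embedding. Below is a minimal Keras sketch; the bucket count, embedding dimension, and ID strings are illustrative assumptions:

```python
import tensorflow as tf

# Hash raw string IDs into a bounded vocabulary, then embed the bucket index.
num_buckets, emb_dim = 1_000_000, 16
hashing = tf.keras.layers.Hashing(num_bins=num_buckets)
embedding = tf.keras.layers.Embedding(input_dim=num_buckets, output_dim=emb_dim)

item_ids = tf.constant([["item_9f3a"], ["item_02bc"]])  # raw Item IDs
item_emb = embedding(hashing(item_ids))                  # shape [2, 1, 16]
```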

2.2 Continuous values

There are two main approaches. First, directly concatenate the raw value with other embeddings: simple to implement, but poor generalization. Second, equal-frequency bucketing followed by discretization (each bucket receives roughly the same number of samples, and the bucket index is then embedded like any categorical feature): strong generalization, and the common solution at present.
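A minimal sketch of equal-frequency bucketing with pandas (toy values; in practice the bucket boundaries are fit on training data and frozen for serving):

```python
import pandas as pd

# Equal-frequency bucketing: each bucket gets roughly the same sample count.
prices = pd.Series([3.5, 12.0, 48.0, 5.2, 99.0, 7.7, 150.0, 21.0])
buckets, bins = pd.qcut(prices, q=4, labels=False, retbins=True, duplicates="drop")
# `buckets` is the discrete bucket id per sample (fed to an embedding);
# `bins` holds the learned boundaries, reused unchanged at serving time.
```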

2.3 Multi-valued features

The most typical one is the user behavior sequence. The main methods are:

  • Mean-pooling / sum-pooling: mean- or sum-pool the item embeddings in the behavior sequence element-wise.
  • Att-pooling: compute the attention between each item in the behavior sequence and the target item being scored, then take the weighted average, as in DIN (see the sketch after this list). This method accounts for each item's importance and supports introducing important item side information; by adding position indices it can also carry some temporal information, and it serves as a solid baseline for sequence modeling.
  • Sequence modeling: run each item of the behavior sequence through an RNN such as a GRU and take the output at the last position, as in DIEN. This captures the temporal order of user behavior and interest drift. Nowadays Transformers are mostly used for temporal modeling, which alleviates problems such as vanishing gradients in backpropagation, weak long-sequence modeling, and the high latency of sequential computation.
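To make att-pooling concrete, here is a minimal TensorFlow sketch of DIN-style target attention. The tensor names and the small MLP size are illustrative assumptions, not the paper's exact configuration, and for brevity the Dense layers are created inline (a real model would create them once and reuse them):

```python
import tensorflow as tf

def target_attention_pooling(seq_emb, target_emb, seq_len):
    """Weight each history item by its relevance to the candidate item.

    seq_emb:    [batch, max_len, dim]  history item embeddings
    target_emb: [batch, dim]           candidate (target) item embedding
    seq_len:    [batch]                true (unpadded) sequence lengths
    """
    max_len = tf.shape(seq_emb)[1]
    target = tf.tile(tf.expand_dims(target_emb, 1), [1, max_len, 1])
    # DIN feeds the item, the target, and their interactions to a small MLP.
    att_in = tf.concat([seq_emb, target, seq_emb - target, seq_emb * target], axis=-1)
    hidden = tf.keras.layers.Dense(32, activation="sigmoid")(att_in)
    logits = tf.squeeze(tf.keras.layers.Dense(1)(hidden), axis=-1)  # [batch, max_len]
    # Mask padded positions so they receive ~zero attention weight.
    mask = tf.sequence_mask(seq_len, max_len)
    logits = tf.where(mask, logits, tf.fill(tf.shape(logits), -1e9))
    weights = tf.nn.softmax(logits, axis=-1)
    return tf.reduce_sum(seq_emb * tf.expand_dims(weights, -1), axis=1)  # [batch, dim]

# Toy usage: 2 users, up to 4 history items, embedding dim 8.
user_interest = target_attention_pooling(
    tf.random.normal([2, 4, 8]), tf.random.normal([2, 8]), tf.constant([4, 2]))
```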

3. Why is user behavior sequence modeling needed?

In recommendation scenarios (product recommendation, video recommendation, music recommendation, etc.), user behavior data is usually very rich. When an item is exposed to a user, the user may perform a variety of behaviors on it. Typical cases:

  • E-commerce platforms: click, browse, add to cart, place an order, exit, etc.;
  • Video platforms: click, play, like, comment, tip, replay, etc.;

These behaviors imply the user's diverse interests and directly express the user's preference for an item. When a user repeatedly plays (or repeatedly purchases or clicks) an item, the user is very likely interested in it; when a user swipes straight past an item, the user is very likely not interested. In daily life, many users cannot clearly articulate their own interests, but their behavior can reveal interests they themselves are not aware of. As the saying goes: the mouth may be stubborn, but the body is honest. A person's behavior is the best test of their ideas.

Besides being rich, user historical behavior data is also highly user-specific and fast-changing:

  1. Large differences: behavioral data differs hugely across users. For example, a user interested in technology usually has a large number of technology-related items in their history, while music-related items appear frequently in the history of a user interested in music;
  2. Rapid changes: user behavior changes quickly, so its distribution drifts over time. In an e-commerce scenario, behavior follows consumption needs: a user who needs to buy a refrigerator will usually comparison-shop and browse many refrigerator-related items, but once the purchase is complete, the need disappears in the short term and the user's interest in refrigerator-related items drops sharply.

In short, effectively using this rich, user-specific, and time-varying behavioral data to discover and accurately express the real interests behind user behaviors is exactly what behavior sequence modeling is for, and it is critical to the recommendation effect.

4. User behavior sequence modeling methods

In the era of deep learning, everything is represented as an embedding. When constructing features, user behavior data is represented as behavior sequences. As deep learning spread through the recommendation field, user behavior sequence features received more and more attention. In recommendation scenarios, the key point of user interest modeling is how to use the user's behavior sequence to obtain an effective embedding that represents the user's interests. Many methods have been proposed for this.

Early uses of behavior sequences were limited in length, typically tens to hundreds of items. In the past two years, as squeezing further gains out of the model alone has become harder, practitioners have turned to data and features. Together with engineering work on performance, the usable sequence length has grown from tens or hundreds of items to thousands or even tens of thousands, and user interest representation has evolved from short sequences to long sequences. As sequences lengthen, interest representation has also developed from a single interest vector to multiple interests.

Short-sequence and long-sequence methods follow different modeling ideas; the difference in design mainly comes from the performance pressure of model complexity. For short sequences, the main industrial approaches are:

  1. Based on the pooling idea: simple and direct, using pooling operations such as sum and mean;
  2. Based on the RNN sequential-modeling idea: generally implemented with RNN [1], LSTM [2], GRU [3], and related recurrent neural networks;
  3. Based on the attention idea: divided into self-attention and target attention; the typical self-attention method is the Transformer [4], while target-attention methods include DIN [5], DIEN [6], DSIN [7], etc.

The core of long-sequence methods is solving the computational performance problem caused by sequence length; representative methods include MIMN [8] and SIM [9].

Users often show diverse interests in real scenarios, so the industry has also proposed multi-interest modeling methods such as MIND [10] and DMIN [11].

This article first shares the modeling method based on pooling. In subsequent articles, we will continue to introduce methods such as attention, long sequences, and multiple interests.

4.1 Ideas based on pooling

When deep models were first applied in the recommendation field, sequence features were handled simply and directly with the pooling idea. The paper "Deep Neural Networks for YouTube Recommendations" [12], published by Google's YouTube team at RecSys 2016, used mean pooling to process sequence features. This idea was then widely adopted in industry; sum pooling, mean pooling, and max pooling are the commonly used variants.

Figure 1 shows the DNN used in the paper. For sequence features, the embedding vectors of the user's search history and watch history are each averaged, yielding fixed-length summaries of the user's overall search and watch state.
[Figure 1: the YouTube DNN architecture; watch-history and search-history embeddings are averaged into fixed-length vectors]
We first give a formal definition of the behavior sequence. Let $U=\{u_1,u_2,u_3,...,u_n\}$ denote the set of users and $I=\{i_1,i_2,i_3,...,i_n\}$ the set of items (short videos, products, music, etc.). The behavior sequence of a user $u$ can be written as $B_u=\{b_1^u,b_2^u,b_3^u,...,b_{|B_u|}^u\}$, where $|B_u|$ is the length of the behavior sequence. Each $b_i^u$, the $i$-th behavior of user $u$, can carry multiple kinds of side information, $b_i^u=(s_{i,1}^u,s_{i,2}^u,...,s_{i,k}^u)$, where $s_{i,k}^u$ is the $k$-th piece of side info of that behavior, typically the item ID, its category, the time the behavior occurred, and so on. In practice, each behavior is converted into a dense vector $e_i^u=\mathrm{concat}(\mathrm{Embedding}(b_i^u))$, so the sequence can be written as $E_u=\{e_1^u,e_2^u,e_3^u,...,e_{|B_u|}^u\}$.

After mean-pooling, the result is:
$$A_u = f(E_u) = \frac{1}{|B_u|} \sum_{i=1}^{|B_u|} e_i^u$$

Processing user behavior sequence features with the pooling idea is simple and direct: it can be implemented with TensorFlow's built-in tf.reduce_sum, tf.reduce_mean, and tf.reduce_max, as in the sketch below.
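A minimal sketch, assuming padded sequences with per-user true lengths (tensor shapes are illustrative). Masking matters in practice so that padded positions do not pollute the pooled statistics:

```python
import tensorflow as tf

# Batch of 2 users, max sequence length 4, embedding dim 8 (illustrative).
batch, max_len, dim = 2, 4, 8
seq_emb = tf.random.normal([batch, max_len, dim])  # item embeddings E_u
seq_len = tf.constant([4, 2])                      # true lengths |B_u|

# Mask padded positions so they contribute nothing.
mask = tf.expand_dims(tf.sequence_mask(seq_len, max_len, dtype=seq_emb.dtype), -1)

sum_pool = tf.reduce_sum(seq_emb * mask, axis=1)                        # sum-pooling
mean_pool = sum_pool / tf.cast(seq_len, seq_emb.dtype)[:, None]         # A_u above
# For max-pooling, push padded slots to the dtype minimum first.
neg_inf = tf.fill(tf.shape(seq_emb), seq_emb.dtype.min)
max_pool = tf.reduce_max(tf.where(mask > 0, seq_emb, neg_inf), axis=1)  # max-pooling
```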

The shortcomings of this idea are also obvious: the sequence is treated as an unordered set and every item is weighted equally, so the importance of different items cannot be distinguished, which weakens the representation of user interest. In real scenarios, different items in the user's history represent the user's current interest to different degrees; for example, items browsed a day ago are more appropriate than items browsed a month ago for representing the user's current interest.

4.2 Extension: Application of pooling technology in the field of image processing

In image processing, the pooling layer has an obvious role: it reduces the size of the feature map, which reduces both the amount of computation and the required GPU memory.
Mean-pooling: average the feature points within the neighborhood.
Advantages and disadvantages: preserves background information well, but tends to blur the image.
Forward propagation: take the average within the neighborhood.

[Figure: mean-pooling forward propagation]

Max-pooling: take the maximum feature point within the neighborhood.

Advantages and disadvantages: preserves texture features well. Max-pooling is now generally preferred, and mean-pooling is rarely used.
Forward propagation: take the largest value in the neighborhood and record the index of the maximum so that backpropagation can route the gradient there.
[Figure: max-pooling forward propagation]

Stochastic-pooling: select an element of the feature map at random according to its probability value, so larger values are more likely to be chosen; unlike max-pooling, the maximum element is not always taken. Within a region, the values are first normalized: for the example region below with values (1, 2, 3, 4), 1/(1+2+3+4)=0.1, 2/10=0.2, 3/10=0.3, 4/10=0.4.
[Figure: a 2x2 region with values (1, 2, 3, 4) normalized to probabilities (0.1, 0.2, 0.3, 0.4)]

Then an element is drawn at random according to these probabilities; higher probabilities are more likely to be selected. For example, if the element with probability 0.3 is drawn, the pooled value of the region (1, 2, 3, 4) is 3. At inference time, stochastic pooling is also simple: take the probability-weighted average of the region. For the example above, the pooled output is 1×0.1 + 2×0.2 + 3×0.3 + 4×0.4 = 3.
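A minimal NumPy sketch of this behavior on the worked 2×2 region above:

```python
import numpy as np

region = np.array([1.0, 2.0, 3.0, 4.0])  # the 2x2 region, flattened
probs = region / region.sum()            # [0.1, 0.2, 0.3, 0.4]

# Training: sample one element in proportion to its value.
rng = np.random.default_rng(0)
train_out = rng.choice(region, p=probs)

# Inference: replace sampling with the probability-weighted average.
infer_out = np.dot(region, probs)        # 1*0.1 + 2*0.2 + 3*0.3 + 4*0.4 = 3.0
```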

5. References

https://weibo.com/ttarticle/p/show?id=2309634696248908382628

  • [1] RNN (ICLR 2015): Recurrent Neural Network Regularization. https://arxiv.org/pdf/1409.2329.pdf
  • [2] LSTM: Convolutional LSTM Network: A Machine Learning Approach for Precipitation Nowcasting. https://arxiv.org/pdf/1506.04214.pdf
  • [3] GRU: Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation. https://arxiv.org/pdf/1406.1078.pdf
  • [4] Attention Is All You Need. https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
  • [5] Deep Interest Network for Click-Through Rate Prediction. https://dl.acm.org/doi/pdf/10.1145/3219819.3219823
  • [6] Deep Interest Evolution Network for Click-Through Rate Prediction. https://arxiv.org/pdf/1809.03672.pdf
  • [7] Deep Session Interest Network for Click-Through Rate Prediction. https://arxiv.org/pdf/1905.06482.pdf
  • [8] Practice on Long Sequential User Behavior Modeling for Click-Through Rate Prediction. https://arxiv.org/pdf/1905.09248.pdf
  • [9] Search-based User Interest Modeling with Lifelong Sequential Behavior Data for Click-Through Rate Prediction. https://arxiv.org/pdf/2006.05639.pdf
  • [10] Multi-Interest Network with Dynamic Routing for Recommendation at Tmall. https://arxiv.org/pdf/1904.08030.pdf
  • [11] Deep Multi-Interest Network for Click-through Rate Prediction. https://dl.acm.org/doi/pdf/10.1145/3340531.3412092
  • [12] Deep Neural Networks for YouTube Recommendations. https://static.googleusercontent.com/media/research.google.com/zh-CN//pubs/archive/45530.pdf
  • [13] https://blog.csdn.net/m0_59023219/article/details/130883277
