[Recommendation] Tuning the Ranking Model

Ranking Model Selection

  LR, GBDT, LR+GBDT, FM/FFM, deep models (Wide & Deep, DeepFM, DCN, etc.)

Common training methods for ranking models

Method 1: T+1 training (train on the full data within a fixed sliding time window)
Method 2: Daily incremental training
Method 3: Combine methods 1 and 2, i.e., do incremental training within the day (for example, hourly incremental training) and a full T+1 training at the end of the day.

Handling Sample Class Imbalance

  In ranking datasets for search, recommendation, and advertising, positive and negative samples are severely imbalanced: the number of negatives is often hundreds or thousands of times the number of positives. If nothing is done to mitigate such severe class imbalance, it is very hard for the model to learn. For ranking tasks, the common mitigation methods are as follows:

  · Downsample negative samples: sample a subset of the huge pool of negatives in some way so that the proportion of negatives is reduced. After downsampling negatives, ranking in computational advertising needs calibration when the final ranking score is computed (calibration happens during online inference and is not considered during offline training); a sketch of the usual calibration formula follows below. Personalized search and recommendation systems only care about the relative order of the predicted probabilities, and that relative order does not change after sampling, so no calibration is needed there.

  A drawback is that downsampling discards a large number of negative samples.
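
  A minimal sketch of the commonly used calibration step after negative downsampling (the function name and numbers are illustrative, not from the original article): if p is the probability predicted on the downsampled data and w is the rate at which negatives were kept, the calibrated probability is q = p / (p + (1 - p) / w).

```python
def calibrate(p: float, w: float) -> float:
    """Map a probability predicted on negative-downsampled data back to the
    original distribution; w is the keep rate of negative samples (0 < w <= 1)."""
    return p / (p + (1.0 - p) / w)

# Example: negatives were kept at 10%, the model predicts 0.30 on sampled data.
print(calibrate(0.30, 0.10))  # ~0.041 on the original distribution
```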

  · Oversample positive samples: increase the number of positives in some way. If positives are oversampled to the same magnitude as negatives, the overall training set becomes very large and training takes much longer than before. If positives are oversampled, the predicted probability again needs calibration in computational-advertising ranking, while ranking in personalized search and recommendation does not.

  We tried simply copying positive samples until they matched the number of negatives. The offline results were acceptable, but training was slow: a single epoch took 4 to 5 hours on one machine. For this reason we did not adopt this solution.

  · Sample negatives and positives 1:1 in each epoch: no negatives are wasted, and the number of epochs can be chosen by trading off the amount of negative data against training time.

  Concretely, we used tf.keras with the tf.data.experimental.sample_from_datasets API to sample from the large pool of negatives and mix them with the positives to form the final training set (a sketch follows below). It turned out that the validation AUC did not change from epoch to epoch. If the model had fully converged, that is, the loss and the model parameters were basically unchanged, an unchanging validation AUC would be plausible; but in the first few epochs the model should still be seeing different negative samples and should keep learning rather than converge, so the validation AUC of the first few epochs should change. Our guess is that this was a TensorFlow pitfall: each epoch may have been fed the same negative samples, and the model essentially converged after the first epoch.
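
  A minimal sketch of per-epoch 1:1 sampling with the API mentioned above; pos_ds and neg_ds are illustrative placeholder datasets, and reshuffle_each_iteration is one way to make each epoch draw different negatives (which was the behavior we expected but did not observe):

```python
import tensorflow as tf

# Illustrative placeholder datasets; in practice they come from TFRecords, etc.
pos_ds = tf.data.Dataset.from_tensor_slices(
    ({"f": tf.random.normal([1_000, 8])}, tf.ones([1_000])))
neg_ds = tf.data.Dataset.from_tensor_slices(
    ({"f": tf.random.normal([100_000, 8])}, tf.zeros([100_000])))

# Reshuffle the pools on every pass so each epoch can see different negatives.
pos_ds = pos_ds.shuffle(1_000, reshuffle_each_iteration=True).repeat()
neg_ds = neg_ds.shuffle(100_000, reshuffle_each_iteration=True).repeat()

# Draw from the two datasets with equal probability, i.e. roughly 1:1 per batch.
train_ds = tf.data.experimental.sample_from_datasets(
    [pos_ds, neg_ds], weights=[0.5, 0.5])
train_ds = train_ds.batch(512).prefetch(tf.data.AUTOTUNE)

# Because of repeat(), pass steps_per_epoch explicitly to model.fit(...).
```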

  · Use class weights or sample weights: make the model pay more attention to positive samples by giving them larger weights. This method is used in many projects and works well in practice; a sketch follows below.
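
  A minimal, self-contained sketch of class weights in tf.keras; the toy data, model, and the weight value 20.0 are all illustrative, not taken from the original project:

```python
import numpy as np
import tensorflow as tf

# Toy data with ~1% positives, mimicking a heavily imbalanced CTR dataset.
X = np.random.randn(10_000, 8).astype("float32")
y = (np.random.rand(10_000) < 0.01).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])

# Upweight the positive class; the weight is tuned, often starting somewhere
# below the raw negative/positive ratio.
model.fit(X, y, batch_size=512, epochs=3, class_weight={0: 1.0, 1: 20.0})
```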

  · Use focal loss: focal loss focuses training on hard examples and also alleviates class imbalance. Due to time constraints, this project did not try it; a sketch is given below for reference.
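
  A minimal sketch of a binary focal loss written from the standard formula FL(p_t) = -alpha_t * (1 - p_t)^gamma * log(p_t); the default gamma and alpha are the commonly cited values, not something validated in this project:

```python
import tensorflow as tf

def binary_focal_loss(gamma: float = 2.0, alpha: float = 0.25):
    """Return a binary focal loss usable as model.compile(loss=...)."""
    def loss_fn(y_true, y_pred):
        y_true = tf.cast(y_true, y_pred.dtype)
        eps = tf.keras.backend.epsilon()
        y_pred = tf.clip_by_value(y_pred, eps, 1.0 - eps)
        # p_t is the predicted probability of the true class.
        p_t = y_true * y_pred + (1.0 - y_true) * (1.0 - y_pred)
        alpha_t = y_true * alpha + (1.0 - y_true) * (1.0 - alpha)
        return -tf.reduce_mean(alpha_t * tf.pow(1.0 - p_t, gamma) * tf.math.log(p_t))
    return loss_fn

# Usage: model.compile(optimizer="adam", loss=binary_focal_loss(), metrics=["AUC"])
```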

Underfitting

  Underfitting, simply put, means the model has not learned enough. Common remedies are as follows:

  Add more and better features:

  Apply feature scaling to continuous features (deep models are very sensitive to the magnitude of continuous features, so if a deep model is used, continuous features must be scaled). There are many scaling methods, such as Z-score standardization, min-max normalization, log transforms, and smoothing (such as Bayesian smoothing). For ratio features such as historical CTR, smoothing is often applied in ranking tasks. The goal is that, after smoothing, an item with the same historical CTR but more exposures and clicks gets a larger value than a long-tail item. For example, an item clicked once out of 2 exposures in 7 days and an item clicked 50 times out of 100 exposures in 7 days have the same click-through rate, but their popularity is very different; using the raw 7-day CTR as a feature cannot distinguish them, so CTR smoothing is very meaningful here (see the sketch below).
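
  A minimal sketch of Beta-prior (Bayesian-style) CTR smoothing; the prior parameters alpha and beta are illustrative and would normally be estimated from historical data rather than hard-coded:

```python
def smoothed_ctr(clicks: int, impressions: int,
                 alpha: float = 5.0, beta: float = 500.0) -> float:
    """CTR smoothed with a Beta(alpha, beta) prior: items with few impressions
    are pulled toward the prior mean alpha / (alpha + beta)."""
    return (clicks + alpha) / (impressions + alpha + beta)

# The two items from the example above have the same raw CTR (0.5) but very
# different confidence, and the smoothed values now separate them.
print(smoothed_ctr(1, 2))     # ~0.012, long-tail item, close to the prior mean
print(smoothed_ctr(50, 100))  # ~0.091, popular item, pulled less toward the prior
```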

  It is also possible to gradually add more cross features between the device (user) side and the advertising side according to business semantics. We tried this in the project as well, and underfitting was further alleviated; a simple sketch of a manual feature cross follows below.
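
  A minimal sketch of a manually built cross feature hashed into buckets; the column names and bucket count are illustrative, not from the original article:

```python
import zlib

import pandas as pd

df = pd.DataFrame({
    "device_type": ["ios", "android", "ios"],
    "ad_category": ["game", "ecommerce", "game"],
})

# Cross two categorical columns, then hash into a fixed number of buckets so the
# crossed vocabulary stays bounded (crc32 is stable across processes, unlike
# Python's built-in hash()).
NUM_BUCKETS = 10_000
df["device_x_ad_category"] = (
    (df["device_type"] + "_" + df["ad_category"])
    .map(lambda s: zlib.crc32(s.encode("utf-8")) % NUM_BUCKETS)
)
print(df)
```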

  Increase model complexity/capacity:

  For the Wide & Deep ranking model currently in use, there are two ways to increase model capacity: one is to add more fully connected layers or more neurons per layer, the other is to increase the length of the embedding vectors. In industry, ranking models usually use 3 fully connected layers, and this project also uses 3; we tried increasing the number of neurons per layer. In addition, many articles set the embedding dimension of the itemid/userid embedding tables used in ranking models to 8 or 10, which is considered a good empirical value. A sketch of these two knobs is given below.
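
  A minimal sketch of the two capacity knobs on the deep side of a Wide & Deep style model; the vocabulary sizes, layer widths, and feature set are illustrative:

```python
import tensorflow as tf

NUM_ITEMS, NUM_USERS = 1_000_000, 5_000_000
EMB_DIM = 8              # knob 2: embedding vector length (8-10 is a common start)
HIDDEN = [256, 128, 64]  # knob 1: number of layers / neurons per layer

item_id = tf.keras.Input(shape=(), dtype="int32", name="item_id")
user_id = tf.keras.Input(shape=(), dtype="int32", name="user_id")

item_emb = tf.keras.layers.Embedding(NUM_ITEMS, EMB_DIM)(item_id)
user_emb = tf.keras.layers.Embedding(NUM_USERS, EMB_DIM)(user_id)

x = tf.keras.layers.Concatenate()([item_emb, user_emb])
for units in HIDDEN:
    x = tf.keras.layers.Dense(units, activation="relu")(x)
output = tf.keras.layers.Dense(1, activation="sigmoid")(x)

model = tf.keras.Model(inputs=[item_id, user_id], outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["AUC"])
```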

  Besides the methods above, adjusting the learning rate and the batch size, as well as the way class imbalance is handled, may also alleviate underfitting. Many factors can therefore be combined; the recommended approach is to change only one factor at a time and compare results after training. Feature scaling must be done first. Beyond the three factors of learning rate, batch size, and class-imbalance handling, the following steps can be tried one by one: mine more good features (excluding cross features, e.g., historical statistical features), then increase model complexity (mainly the number of layers or neurons), and finally gradually add meaningful cross features.

Overfitting

  Overfitting means the model performs well on the training set but much worse on the validation set. In real production projects, what matters is whether the model performs well enough on both sets; if it does, we do not mind some overfitting. For example, a training AUC of 0.95 with a validation AUC of 0.8 is overfitting, but the validation AUC is high enough, so it is acceptable; a training AUC of 0.95 with a validation AUC of 0.6 is the kind of overfitting we need to worry about. Going from underfitting to overfitting can happen in an instant (for example, adding many features at once can easily flip the model from underfitting to overfitting). For ranking tasks with deep models, the common ways to handle overfitting are as follows:

  Collect more data: the goal is to let the model see more of the data distribution and learn more. For example, in T+1 training, T commonly covers 7 days of data as the training set (how large T should be depends heavily on how many positive samples the training set contains). In our project, T was 30 days, which yields more samples, and in particular more positives.

  Reduce model complexity/capacity: that is, use a smaller neural network, including smaller embedding tables, so that the capacity of both becomes smaller. In real projects I have seen the embedding length of itemid/userid embedding tables set to several hundred or even several thousand; this is not recommended because it overfits far too easily. As mentioned above, around 8 is a good starting point. Note that this refers to the input embeddings; for choosing the length of output embeddings and text embeddings, see the article "Recommendation System Overview" on my GitHub.

  Use BatchNormalization (BN, which essentially reshapes the neuron activations; it is very useful in deep learning and recommended whenever possible): with BN, the batch size should not be too small, and batch-size changes are usually accompanied by learning-rate changes in the same direction (if the batch size is increased, the learning rate can be increased slightly as well). Although BN is mostly used in CNN convolutional layers, it can also be used in MLP layers; for RNNs, use LayerNormalization (LN) instead. In the current project, the offline improvement after adding BN was very noticeable; a sketch follows below.
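
  A minimal sketch of adding BatchNormalization to the MLP layers of the deep tower; whether BN goes before or after the activation is itself a choice, and this sketch places it between the linear layer and the ReLU:

```python
import tensorflow as tf

def dense_bn_block(x, units):
    """Dense -> BatchNormalization -> ReLU, a common pattern for MLP towers."""
    x = tf.keras.layers.Dense(units, use_bias=False)(x)
    x = tf.keras.layers.BatchNormalization()(x)
    return tf.keras.layers.Activation("relu")(x)

inputs = tf.keras.Input(shape=(64,))
x = inputs
for units in [256, 128, 64]:
    x = dense_bn_block(x, units)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
```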

  Use early stopping: monitor the model's metric on the validation set and stop training early. Early stopping is not mandatory; you can also run a fixed number of epochs and afterwards pick the checkpoint of the best-performing epoch, in which case early stopping is not needed. A callback sketch follows below.
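
  A minimal sketch of both options with tf.keras callbacks, monitoring validation AUC; the checkpoint path is illustrative, and "val_auc" must match the name of the compiled metric:

```python
import tensorflow as tf

callbacks = [
    # Option 1: stop once validation AUC has not improved for 2 epochs.
    tf.keras.callbacks.EarlyStopping(
        monitor="val_auc", mode="max", patience=2, restore_best_weights=True),
    # Option 2: run a fixed number of epochs and keep the best checkpoint.
    tf.keras.callbacks.ModelCheckpoint(
        "best_model.h5", monitor="val_auc", mode="max", save_best_only=True),
]

# model.fit(train_ds, validation_data=val_ds, epochs=10, callbacks=callbacks)
```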

  Regularization: in deep learning, common regularization methods are Dropout, L1/L2 regularization, label smoothing, and so on. The current project uses dropout and L1/L2 regularization. When tuning the dropout rate and the L1/L2 hyperparameters, adjust them in small steps; a large adjustment can easily flip the model straight from overfitting to underfitting. A sketch follows below.
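
  A minimal sketch of dropout plus L2 regularization on a Dense layer; the dropout rate and L2 strength are illustrative starting points, not values from the original project:

```python
import tensorflow as tf

inputs = tf.keras.Input(shape=(64,))
x = tf.keras.layers.Dense(
    128, activation="relu",
    kernel_regularizer=tf.keras.regularizers.l2(1e-5))(inputs)
x = tf.keras.layers.Dropout(0.1)(x)  # adjust in small steps, e.g. 0.1 -> 0.15
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)
```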

  Use fewer features: in this project, after adding several kinds of cross features at once, the model went from underfitting to overfitting. After removing a few of the cross features, the overfitting was alleviated. So add new features a little at a time, in small steps.

  When a deep model overfits, first check the data distribution of the validation set (for example, the statistical distribution of each continuous feature and the coverage of each categorical feature, compared against the training set). If the training and validation distributions differ a lot, consider how to rebuild the training and validation sets. Otherwise, try the following order to alleviate overfitting (after each step, train and check the result; if the validation performance is acceptable, stop, otherwise continue): use BN (basically standard); use fewer features (skip this if there are not many features to begin with, and mainly check whether there are many cross features); collect more data (skip if there are already enough positive samples); apply regularization; reduce model complexity/capacity (paying particular attention to the embedding vector length in the embedding tables).

Other Issues

  The dataset changed, and the model's offline AUC changed a lot:

  A larger dataset may make a small-capacity model perform worse and underfit. For CTR/CVR tasks, once the training pipeline runs end to end, train and tune the model on data from a fixed sliding window, since the data volume inside a fixed window is generally of similar magnitude. Dataset cleaning and preprocessing must behave consistently every day; otherwise, debugging problems takes a lot of time.

  Keep features as consistent as possible between online serving and offline training.

  With the same dataset and the same model, two experiments show different validation AUC:

  ML introduces a lot of randomness, so comparisons should be made with the context as identical as possible, including hyperparameter settings, the training job's parameters, and fixed random seeds (this is very important: both the Python random seed and the TensorFlow random seed need to be fixed). A sketch of fixing the seeds follows below.
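
  A minimal sketch of fixing the usual seeds when comparing two runs (numpy is included because most pipelines use it; PYTHONHASHSEED only takes full effect if set before the interpreter starts):

```python
import os
import random

import numpy as np
import tensorflow as tf

SEED = 42
os.environ["PYTHONHASHSEED"] = str(SEED)  # only fully effective if set before startup
random.seed(SEED)                         # Python's built-in RNG
np.random.seed(SEED)                      # numpy RNG
tf.random.set_seed(SEED)                  # TensorFlow global RNG
```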

  The model's evaluation metrics under distributed training are often worse than under single-machine training:

  This is very common. With distributed training, or even just single-machine multi-GPU, the learning rate that worked for single-machine single-GPU may no longer be suitable and needs adjustment. For Horovod-style distributed training, increasing the learning rate somewhat is usually enough; do not blindly follow the Horovod documentation's suggestion of multiplying the old single-GPU learning rate by the number of workers, which can produce a very large learning rate and hurt training. For parameter-server distributed training with asynchronous gradient updates, the learning rate may instead need to be reduced, so that updates from the slowest, most stale model replica have less impact on the whole. A sketch follows below.
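
  A minimal sketch of a milder learning-rate scaling with Horovod's Keras API, assuming Horovod is installed; square-root scaling is shown only as one common gentler alternative to linear (worker-count) scaling, not something prescribed by the original article:

```python
import math

import horovod.tensorflow.keras as hvd
import tensorflow as tf

hvd.init()

BASE_LR = 1e-3  # learning rate that worked for single-machine single-GPU training
# Linear scaling would be BASE_LR * hvd.size(), which can be too aggressive;
# a gentler option is square-root scaling, then fine-tune from there.
lr = BASE_LR * math.sqrt(hvd.size())

opt = hvd.DistributedOptimizer(tf.keras.optimizers.Adam(learning_rate=lr))
# model.compile(optimizer=opt, loss="binary_crossentropy", metrics=["AUC"])
```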

  Feature coverage issues:

  If some values of a categorical feature occur very rarely in the samples (for example, fewer than 10 times), consider rolling the feature up to a coarser level or merging those small categories into a single "Other" bucket; a sketch follows below.
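
  A minimal sketch of folding rare categorical values into an "Other" bucket with pandas; the threshold of 10 comes from the example above, while the column name and data are illustrative:

```python
import pandas as pd

df = pd.DataFrame(
    {"ad_category": ["game"] * 50 + ["ecommerce"] * 30 + ["niche_x"] * 3})

counts = df["ad_category"].value_counts()
rare = counts[counts < 10].index          # values seen fewer than 10 times
df["ad_category"] = df["ad_category"].where(
    ~df["ad_category"].isin(rare), "Other")

print(df["ad_category"].value_counts())   # niche_x has been folded into "Other"
```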
