#Paper title: [Exposure Bias] UKD: Debiasing Conversion Rate (CVR) Estimation via Uncertainty-regularized Knowledge Distillation
#Paper link: https://arxiv.org/pdf/2201.08024.pdf
#Source code: not yet open-sourced
#Conference: WWW 2022
#Affiliation: Alibaba
1. Introduction
This article covers a method for the sample selection bias (SSB) problem in advertising. Common recommender systems suffer from the same SSB problem, so the method is of broader reference value.
Traditional post-click conversion rate (CVR) estimation models are trained only on clicked samples. Once deployed, however, the model must produce estimates for all displayed ads, which leads to the sample selection bias problem. Reliable supervision signals for unclicked ads are therefore needed to alleviate SSB.
This paper proposes an Uncertainty Regularized Knowledge Distillation (UKD) framework:
- Debiases CVR estimates by extracting knowledge from unclicked ads. The teacher model learns click-adaptive representations and generates pseudo-conversion labels on unclicked ads as supervision signals.
- The student model is then trained on both clicked and unclicked ads via knowledge distillation, with uncertainty modeling to mitigate the inherent noise in the pseudo-labels.
2. Method
As shown in the figure, the overall pipeline has two stages: a click-adaptive teacher model generates pseudo-labels, and an uncertainty-regularized student model distills knowledge from them.
2.1 Click Adaptive Teacher Model
The goal of the teacher model is to generate pseudo-labels for the unclicked data D_unclick, given access only to conversion labels on the clicked data D_click. The feature distributions of clicked and unclicked samples differ, so to infer accurately on unclicked ads, the teacher model needs to learn click-adaptive representations. Pseudo-conversion-label generation is approached from the perspective of unsupervised domain adaptation, with the clicked/unclicked spaces as the source/target domains. The problem is thus formulated as generating reliable pseudo-conversion labels for unlabeled unclicked ads (D_unclick as the target domain), given labeled clicked ads (D_click as the source domain).
2.1.1 Click Adaptive Representation Learning
2.1.1.1 Model structure
The teacher model is the left part of Figure 2 and mainly consists of a feature representation learner T_f(·), a CVR predictor T_p(·), and a click discriminator T_d(·). The feature representation learner T_f(·) takes the sample features as input and learns their dense representation h^(T); T_f(·) contains an embedding layer followed by several dense layers. The CVR predictor T_p(·) is used to predict the CVR score and mainly consists of a dense layer and a softmax function.
To make the feature representation h^(T) click-adaptive, so as to facilitate pseudo-conversion-label generation on unclicked ads, the teacher model introduces a click discriminator T_d(·) that classifies the domain (i.e., clicked or unclicked) of each sample. If a strong click discriminator cannot correctly predict the domain label of a sample, its representation h^(T) is click-adaptive. The forward pass can be written as h^(T) = T_f(x), p_conv = T_p(h^(T)), p_d = T_d(h^(T)), where p_conv is the predicted CVR distribution and p_d is the predicted domain distribution.
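The teacher's forward pass described above can be sketched in plain NumPy. This is a minimal illustration only: the layer sizes, the tanh activation, and all weight names are assumptions, not the paper's exact architecture.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
n_features, n_hidden = 16, 8                   # hypothetical sizes

W_f = rng.normal(size=(n_features, n_hidden))  # feature representation learner T_f
W_p = rng.normal(size=(n_hidden, 2))           # CVR predictor T_p (convert / not)
W_d = rng.normal(size=(n_hidden, 2))           # click discriminator T_d (clicked / not)

x = rng.normal(size=(4, n_features))           # a mini-batch of ad feature vectors
h_T = np.tanh(x @ W_f)                         # h^(T) = T_f(x)
p_conv = softmax(h_T @ W_p)                    # p_conv = T_p(h^(T))
p_d = softmax(h_T @ W_d)                       # p_d = T_d(h^(T))
```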
2.1.1.2 Adversarial Learning
To learn click-adaptive representations, the representation produced by T_f(·) for a given ad should confuse the click discriminator and maximize the domain classification loss, while the click discriminator T_d(·) itself aims to minimize the domain classification loss so as to be a strong classifier. The teacher model is optimized via adversarial learning:
The first term minimizes the CVR estimation loss to optimize T_f(·) and T_p(·). The second term means that the learner T_f(·) makes the representations of clicked and unclicked ads indistinguishable, while the click discriminator T_d(·) is optimized to better distinguish clicked from unclicked ads.
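The two-part objective can be illustrated numerically. This is a toy sketch: the distributions, labels, and the trade-off weight `lam` are made up, and in practice the min-max is typically implemented with a gradient reversal layer rather than an explicit negated loss.

```python
import numpy as np

def cross_entropy(p, y, eps=1e-12):
    # mean negative log-likelihood of the true class
    return -np.mean(np.log(p[np.arange(len(y)), y] + eps))

p_conv = np.array([[0.9, 0.1], [0.2, 0.8]])  # CVR predictions on clicked ads
y_conv = np.array([0, 1])                    # true conversion labels
p_d = np.array([[0.6, 0.4], [0.3, 0.7]])     # domain predictions (clicked vs unclicked)
y_d = np.array([0, 1])                       # domain labels

lam = 0.5                                    # hypothetical trade-off weight
L_cvr = cross_entropy(p_conv, y_conv)        # minimized w.r.t. T_f and T_p
L_dom = cross_entropy(p_d, y_d)              # minimized w.r.t. T_d ...
L_adv = L_cvr - lam * L_dom                  # ... while T_f maximizes L_dom
```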
2.1.2 Generating pseudo-labels for unclicked ads
Unclicked samples are fed into the teacher model, and its predicted CVR distributions are taken as pseudo-labels. These serve as the supervision signal for unclicked samples in the subsequent student model.
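Pseudo-label generation is simply teacher inference on the unclicked set. A minimal sketch follows; `teacher_predict` and the randomly initialized weight matrices are hypothetical stand-ins for the trained teacher.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def teacher_predict(x, W_f, W_p):
    # p_conv = T_p(T_f(x)): the teacher's predicted CVR distribution
    return softmax(np.tanh(x @ W_f) @ W_p)

rng = np.random.default_rng(1)
W_f, W_p = rng.normal(size=(16, 8)), rng.normal(size=(8, 2))
x_unclick = rng.normal(size=(5, 16))          # unclicked ad features

# soft pseudo-conversion labels used to supervise the student on D_unclick
pseudo_labels = teacher_predict(x_unclick, W_f, W_p)
```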
2.2 Uncertainty regularized student model
Based on the pseudo-conversion labels of unclicked ads learned by the click-adaptive teacher model, the UKD framework further builds a knowledge distillation-based student model that learns from clicked ads (with real labels) and unclicked ads (using pseudo labels) for the whole space CVR estimation. This model alleviates the SSB problem by explicitly considering unclicked samples during training compared to models trained using only clicked samples.
The distillation strategy can guide the student model to mine the valuable knowledge learned by the teacher model. Due to the inherent noise in the teacher's predictions, pseudo-labels for unclicked ads are less confident than true conversion labels for clicked ads. To address this issue, an uncertainty-regularized student model is proposed to reduce the negative impact of noise by simulating the uncertainty of pseudo-labels during distillation.
2.2.1 Base Student Model: Label Distillation
2.2.1.1 Model structure
The base student model is built on a multi-task model and contains two feature representation learners (S_vf(·) for the CVR task and S_cf(·) for the CTR task). The two learners share a feature embedding layer, and each has several dense layers to learn its respective representation h. Each of the two predictors consists of a dense layer followed by a softmax function. The forward pass of the base student model is h_v = S_vf(x), p_conv = S_vp(h_v) for CVR, and h_c = S_cf(x), p_ctr = S_cp(h_c) for CTR.
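The base student's two-tower forward pass might look like the following sketch, with a shared embedding and separate task towers. All sizes, the tanh activation, and weight names (including the CTR predictor name `W_cp`) are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(2)
n_raw, n_emb, n_hidden = 16, 12, 8

W_emb = rng.normal(size=(n_raw, n_emb))       # shared feature embedding layer
W_vf = rng.normal(size=(n_emb, n_hidden))     # CVR representation learner S_vf
W_cf = rng.normal(size=(n_emb, n_hidden))     # CTR representation learner S_cf
W_vp = rng.normal(size=(n_hidden, 2))         # CVR predictor S_vp
W_cp = rng.normal(size=(n_hidden, 2))         # CTR predictor S_cp

x = rng.normal(size=(4, n_raw))
e = x @ W_emb                                 # shared embedding
h_v = np.tanh(e @ W_vf)                       # CVR-task representation
h_c = np.tanh(e @ W_cf)                       # CTR-task representation
p_conv = softmax(h_v @ W_vp)                  # CVR prediction
p_ctr = softmax(h_c @ W_cp)                   # CTR prediction
```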
2.2.1.2 Distilling knowledge from unclicked ads
With the pseudo-conversion labels of unclicked ads learned by the teacher, the student model is optimized over the entire exposure space to alleviate the SSB problem. The CVR estimation objective sums the loss on clicked ads (with true conversion labels) and the loss on unclicked ads (with pseudo-labels): L_CVR = L_CVR_click + L_CVR_unclick.
The overall loss function combines the CVR and CTR objectives: L = L_CVR + L_CTR.
2.2.2 Uncertainty regularization
The confidence of the pseudo-conversion labels of unclicked ads is expected to be lower than that of real conversion labels of clicked ads, since the latter is obtained from user feedback logs while the former is generated by the teacher model. Due to the inherent noise in teacher predictions, unclicked samples with noisy pseudo-labels can mislead the training process of the student model. For effective knowledge distillation of unclicked ads, there are two key aspects:
- (i) identify noisy and unreliable unclicked samples, and
- (ii) reduce their negative effects during the distillation process.
This paper reflects the noise by estimating the uncertainty of the pseudo-labels of unclicked samples, where higher uncertainty indicates lower reliability. Using high uncertainty as a measure of noisy unclicked samples, their negative impact can be reduced by simply assigning low weights to their CVR loss, avoiding misleading the student model's distillation process. Therefore, an uncertainty-regularized student model is proposed: it estimates the uncertainty of the pseudo-label of each unclicked ad and dynamically adjusts the weight of the CVR loss of unclicked ads according to the uncertainty level, reducing the negative impact of noise.
2.2.2.1 Uncertainty modeling
The uncertainty-regularized student model contains two CVR predictors, S_vp(·) and S_vp'(·), which estimate CVR scores simultaneously (shown on the right side of Figure 2); the uncertainty is then modeled as their inconsistency. Let p_conv and p'_conv denote the predicted distributions of the two CVR predictors. The uncertainty is expressed as the KL divergence between the two predictions, u = KL(p_conv || p'_conv), where dropout is applied to enlarge the difference between the two predictions.
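The uncertainty as KL divergence between the two predictors' outputs is straightforward to compute. A sketch with illustrative toy distributions:

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    # KL(p || q) per sample over the class axis
    return np.sum(p * np.log((p + eps) / (q + eps)), axis=-1)

p_conv = np.array([[0.9, 0.1], [0.5, 0.5]])   # predictor S_vp (with dropout)
p2_conv = np.array([[0.9, 0.1], [0.2, 0.8]])  # predictor S_vp' (with dropout)

u = kl_div(p_conv, p2_conv)  # per-sample uncertainty: ~0 when they agree
```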
2.2.2.2 Uncertainty regularization
Based on the estimated uncertainty of each unclicked sample, the negative impact of noisy unclicked samples during distillation is reduced by dynamically weighting their CVR loss with an uncertainty-based factor. Compared to the base student model, the distillation process on unclicked ads is now regularized by the pseudo-label uncertainty, mitigating the inherent noise in the teacher's predictions.
For each unclicked sample, an uncertainty-based regularization factor is used to weight its original CVR loss. This factor is inversely proportional to the uncertainty and is controlled by a hyperparameter. The loss L_CVR_unclick can then be defined as the weighted sum of the per-sample CVR losses on unclicked ads.
If a sample has high uncertainty, the factor takes a small value, reducing the weight of its CVR loss. As the uncertainty approaches 0, the factor tends to 1 and the model degenerates to the base student model.
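The exact form of the factor is not reproduced in these notes. One plausible inverse-proportional choice with the described limiting behavior is an exponential decay exp(-beta * u), where beta is the hyperparameter; this is an assumption for illustration, not necessarily the paper's formula.

```python
import numpy as np

def uncertainty_weight(u, beta=1.0):
    # hypothetical factor: -> 1 as u -> 0, -> 0 as u grows
    return np.exp(-beta * u)

u = np.array([0.0, 0.1, 2.0])   # per-sample pseudo-label uncertainties
w = uncertainty_weight(u)
# the per-sample CVR loss on unclicked ads would be scaled by w before averaging
```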
3. Results
Reference: https://zhuanlan.zhihu.com/p/471138795