#Reading Paper# [Exposure Bias] WWW 2022 UKD: Debiasing Conversion Rate Estimation via Uncertainty-regularized Knowledge Distillation

#Paper title: [Exposure Bias] UKD: Debiasing Conversion Rate Estimation via Uncertainty-regularized Knowledge Distillation
#Paper address: https://arxiv.org/pdf/2201.08024.pdf
#Source code: not yet open-sourced
#Conference: WWW 2022
#Affiliation: Alibaba

1. Introduction

This paper addresses the sample selection bias (SSB) problem in advertising. Recommender systems commonly face the same SSB problem, so the method is of broader reference value.
Conventional post-click conversion rate (CVR) estimation models are trained only on clicked samples. However, once deployed, the model must estimate CVR for all displayed ads, which causes the sample selection bias (SSB) problem. Reliable supervision signals on unclicked ads are therefore needed to alleviate SSB.
This paper proposes an Uncertainty Regularized Knowledge Distillation (UKD) framework:

  • It debiases CVR estimation by distilling knowledge from unclicked ads: the teacher model learns click-adaptive representations and produces pseudo-conversion labels on unclicked ads as supervision signals.
  • The student model is then trained on both clicked and unclicked ads via knowledge distillation, with uncertainty modeling to mitigate the inherent noise in the pseudo-labels.

2. Method

[Figure 2: overview of the UKD framework]
As shown in Figure 2, the overall pipeline has two stages: the click-adaptive teacher model generates pseudo-conversion labels, and the uncertainty-regularized student model distills knowledge from them.

2.1 Click Adaptive Teacher Model

The goal of the teacher model is to generate pseudo-labels for the unclicked data $D_{unclick}$, given access only to conversion labels on the clicked data $D_{click}$. The feature distributions of clicked and unclicked samples differ, so to infer accurately on unclicked ads the teacher model must learn click-adaptive representations. Pseudo-conversion label generation is framed as unsupervised domain adaptation, with the clicked/unclicked spaces as the source/target domains: given labeled clicked ads ($D_{click}$ as the source domain), generate reliable pseudo-conversion labels for the unlabeled unclicked ads ($D_{unclick}$ as the target domain).

2.1.1 Click Adaptive Representation Learning

2.1.1.1 Model structure

The teacher model is the left part of Figure 2 and mainly consists of a feature representation learner $T_f(\cdot)$, a CVR predictor $T_p(\cdot)$, and a click discriminator $T_d(\cdot)$. The feature representation learner $T_f(\cdot)$ takes sample features as input and learns their dense representation $h^{(T)}$; it contains an embedding layer and multiple dense layers. The CVR predictor $T_p(\cdot)$ predicts the CVR score $p_{conv}$, mainly consisting of dense layers and a softmax function.

To make the feature representation $h^{(T)}$ click-adaptive, so as to facilitate generating pseudo-conversion labels on unclicked ads, the teacher model introduces a click discriminator $T_d(\cdot)$ that classifies the domain (i.e., clicked or unclicked) of each sample. If a strong click discriminator cannot correctly predict the domain label of a sample, its representation $h^{(T)}$ is click-adaptive. The forward process is expressed as follows, where $p_{conv}$ is the predicted CVR distribution and $p_d$ is the predicted domain distribution:
$$h^{(T)} = T_f(x), \qquad p_{conv} = T_p\big(h^{(T)}\big), \qquad p_d = T_d\big(h^{(T)}\big)$$
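Since the paper's code is not released, here is a minimal PyTorch sketch of this structure; the layer sizes, the two-class softmax outputs, and the feature-field handling are all assumptions for illustration:

```python
import torch
import torch.nn as nn

class Teacher(nn.Module):
    """Sketch of the click-adaptive teacher: T_f, T_p, T_d (hypothetical sizes)."""
    def __init__(self, n_fields, vocab_size, emb_dim=16, hidden=128):
        super().__init__()
        # T_f: embedding layer + dense layers -> representation h^(T)
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.t_f = nn.Sequential(
            nn.Linear(n_fields * emb_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        # T_p: dense layer + softmax -> CVR distribution p_conv
        self.t_p = nn.Sequential(nn.Linear(hidden, 2), nn.Softmax(dim=-1))
        # T_d: dense layer + softmax -> domain distribution p_d (clicked vs. unclicked)
        self.t_d = nn.Sequential(nn.Linear(hidden, 2), nn.Softmax(dim=-1))

    def forward(self, x_ids):
        # x_ids: (batch, n_fields) integer feature ids
        h = self.t_f(self.embedding(x_ids).flatten(1))  # h^(T)
        return self.t_p(h), self.t_d(h)                 # p_conv, p_d
```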

2.1.1.2 Adversarial Learning

To learn click-adaptive representations, given an ad, the representation produced by $T_f(\cdot)$ should confuse the click discriminator and maximize the domain classification loss, while the click discriminator $T_d(\cdot)$ itself aims to minimize the domain classification loss and become a strong classifier. The teacher model is optimized via adversarial learning:
$$\min_{T_f,\,T_p} \mathcal{L}_{CVR}, \qquad \min_{T_d}\,\max_{T_f} \mathcal{L}_{d}$$
The first term minimizes the CVR estimation loss to optimize $T_f(\cdot)$ and $T_p(\cdot)$. The second is a min-max game: the learner $T_f(\cdot)$ makes the representations of clicked and unclicked ads indistinguishable, while the click discriminator $T_d(\cdot)$ is optimized to better distinguish clicked from unclicked ads.
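A common way to implement such a min-max game in a single backward pass is a gradient reversal layer (GRL) between $T_f$ and $T_d$; whether UKD uses a GRL or alternating updates is not stated here, so treat this as one plausible implementation:

```python
import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; multiplies the gradient by -lam on backward."""
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # Reversed gradient flows into T_f; lam itself gets no gradient.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# In the teacher's forward pass, feed the discriminator the reversed
# representation: p_d = t_d(grad_reverse(h)). Minimizing the domain loss
# then trains T_d normally while pushing T_f to maximize it (confuse T_d).
```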

2.1.2 Generating pseudo-labels for unclicked ads

Unclicked samples are fed into the trained teacher model, and the predicted CVR distribution $q_{conv}$ is taken as the pseudo-conversion label. These pseudo-labels serve as the supervision on unclicked samples for the subsequent student model.
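A sketch of this labeling pass, assuming the hypothetical Teacher class above and a data loader over unclicked samples:

```python
import torch

@torch.no_grad()
def generate_pseudo_labels(teacher, unclick_loader):
    """Run the trained teacher over unclicked ads and keep its predicted
    CVR distribution as the soft pseudo-conversion label q_conv."""
    teacher.eval()
    pseudo = []
    for x_ids in unclick_loader:
        p_conv, _ = teacher(x_ids)  # discard the domain head's output
        pseudo.append(p_conv)
    return torch.cat(pseudo)        # one soft label per unclicked sample
```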

2.2 Uncertainty regularized student model

Based on the pseudo-conversion labels of unclicked ads produced by the click-adaptive teacher, the UKD framework further builds a knowledge-distillation-based student model that learns from both clicked ads (with true labels) and unclicked ads (with pseudo-labels) for whole-space CVR estimation. Compared with models trained only on clicked samples, this alleviates the SSB problem by explicitly considering unclicked samples during training.

The distillation strategy guides the student model to mine the valuable knowledge learned by the teacher. Due to the inherent noise in the teacher's predictions, pseudo-labels on unclicked ads are less trustworthy than the true conversion labels on clicked ads. To address this, an uncertainty-regularized student model is proposed to reduce the negative impact of noise by modeling the uncertainty of pseudo-labels during distillation.

2.2.1 Base Student Model: Label Distillation

2.2.1.1 Model structure

The base student model is built on a multi-task architecture and consists of two feature representation learners ($S_{vf}(\cdot)$ for the CVR task and $S_{cf}(\cdot)$ for the CTR task). The two learners share a feature embedding layer, and each has several dense layers to learn its own representation $h$. Each of the two predictors ($S_{vp}(\cdot)$ and $S_{cp}(\cdot)$) contains dense layers followed by a softmax function. The forward process of the base student model is:
$$h_v = S_{vf}(x),\quad h_c = S_{cf}(x), \qquad p_{conv} = S_{vp}(h_v),\quad p_{ctr} = S_{cp}(h_c)$$
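A minimal sketch of the base student under the same assumptions as the teacher sketch (hypothetical sizes, two-class softmax heads):

```python
import torch
import torch.nn as nn

class Student(nn.Module):
    """Multi-task student: shared embedding, separate towers and heads
    for the CVR task (S_vf, S_vp) and the CTR task (S_cf, S_cp)."""
    def __init__(self, n_fields, vocab_size, emb_dim=16, hidden=128):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)  # shared layer
        def tower():
            return nn.Sequential(
                nn.Linear(n_fields * emb_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU())
        self.s_vf, self.s_cf = tower(), tower()
        self.s_vp = nn.Sequential(nn.Linear(hidden, 2), nn.Softmax(dim=-1))
        self.s_cp = nn.Sequential(nn.Linear(hidden, 2), nn.Softmax(dim=-1))

    def forward(self, x_ids):
        e = self.embedding(x_ids).flatten(1)
        return self.s_vp(self.s_vf(e)), self.s_cp(self.s_cf(e))  # p_conv, p_ctr
```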

2.2.1.2 Distilling knowledge from unclicked ads

With the pseudo-conversion labels of unclicked ads learned by the teacher, the student model is optimized over the whole impression space to alleviate the SSB problem. The objective of the CVR estimation task is:
$$\mathcal{L}_{CVR} = \sum_{x_i \in D_{click}} \ell\big(y_i,\, p_{conv}^{(i)}\big) \;+\; \sum_{x_j \in D_{unclick}} \ell\big(q_{conv}^{(j)},\, p_{conv}^{(j)}\big)$$

where $\ell(\cdot,\cdot)$ is the cross-entropy loss, $y_i$ is the true conversion label of a clicked ad, and $q_{conv}^{(j)}$ is the teacher's pseudo-label for an unclicked ad.
The overall loss function combines the two tasks:
$$\mathcal{L} = \mathcal{L}_{CVR} + \mathcal{L}_{CTR}$$
where $\mathcal{L}_{CTR}$ is the CTR task's loss against click labels over all impressions.
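A sketch of this whole-space loss; the exact form of the CTR term and the use of soft cross-entropy against $q_{conv}$ are assumptions consistent with the description above:

```python
import torch
import torch.nn.functional as F

def base_student_loss(p_conv, p_ctr, click, conv, q_conv):
    """click, conv: 0/1 float tensors over all impressions; q_conv: the
    teacher's soft pseudo-label, used only where click == 0."""
    ce = lambda p, t: -(t * torch.log(p.clamp_min(1e-8))).sum(-1)  # per-sample CE
    onehot = lambda y: F.one_hot(y.long(), 2).float()
    l_click = (click * ce(p_conv, onehot(conv))).sum()    # true conversion labels
    l_unclick = ((1 - click) * ce(p_conv, q_conv)).sum()  # teacher pseudo-labels
    l_ctr = ce(p_ctr, onehot(click)).sum()                # CTR task on all impressions
    return l_click + l_unclick + l_ctr
```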

2.2.2 Uncertainty regularization

The confidence of the pseudo-conversion labels of unclicked ads is expected to be lower than that of real conversion labels of clicked ads, since the latter is obtained from user feedback logs while the former is generated by the teacher model. Due to the inherent noise in teacher predictions, unclicked samples with noisy pseudo-labels can mislead the training process of the student model. For effective knowledge distillation of unclicked ads, there are two key aspects:

  • (i) identify noisy and unreliable unclicked samples;
  • (ii) reduce their negative effect during the distillation process.

This paper captures the noise by estimating the uncertainty of the pseudo-labels of unclicked samples, where a higher uncertainty value indicates worse reliability. Using high uncertainty as a measure of noisy unclicked samples, their negative impact can be reduced by simply assigning low weights to their CVR loss, thus avoiding misleading the distillation process of the student model. Therefore, an uncertainty-regularized student model is proposed: it estimates the uncertainty of the pseudo-label for each unclicked ad and dynamically adjusts the weight of the CVR loss of unclicked ads according to the uncertainty level, reducing the negative impact of noise.

2.2.2.1 Uncertainty modeling

The uncertainty-regularized student model contains two CVR predictors $S_{vp}(\cdot)$ and $S'_{vp}(\cdot)$ that estimate CVR scores simultaneously (shown on the right side of Figure 2); the uncertainty is then modeled as their inconsistency. Let $p_{conv}$ and $p'_{conv}$ denote the predicted distributions of the two CVR predictors. The uncertainty $u$ is expressed as the KL divergence between the two predictions, where dropout is applied to enhance the difference between them:
$$u = D_{KL}\big(p_{conv} \,\|\, p'_{conv}\big)$$
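A per-sample sketch of this uncertainty estimate, assuming the two heads' softmax outputs are given (the dropout lives inside the heads):

```python
import torch

def pseudo_label_uncertainty(p_conv, p_conv_prime, eps=1e-8):
    """Uncertainty u = KL(p || p') between the two CVR heads' predicted
    distributions; higher u means a less reliable pseudo-label."""
    p = p_conv.clamp_min(eps)
    q = p_conv_prime.clamp_min(eps)
    return (p * (p / q).log()).sum(-1)  # (batch,) per-sample KL divergence
```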

2.2.2.2 Uncertainty regularization

Based on the estimated uncertainty of each unclicked sample, the negative impact of noisy unclicked samples during distillation is reduced by dynamically applying uncertainty-based weights to the CVR loss. Compared with the base student model, the distillation process on unclicked ads is now regularized by pseudo-label uncertainty, mitigating the inherent noise in the teacher's predictions.
For each unclicked sample, a factor $e^{-\eta u}$ is used as the uncertainty regularizer to weight its original CVR loss; the factor decreases monotonically with the uncertainty ($\eta$ is a hyperparameter). The loss $\mathcal{L}_{CVR}^{unclick}$ can then be defined as:
$$\mathcal{L}_{CVR}^{unclick} = \sum_{x_j \in D_{unclick}} e^{-\eta u_j}\, \ell\big(q_{conv}^{(j)},\, p_{conv}^{(j)}\big)$$
If a sample has high uncertainty, the factor yields a small value and down-weights its CVR loss; as the uncertainty approaches 0, the factor tends to 1 and the model reduces to the base student model.
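A sketch of this weighted loss, using the exponential factor $e^{-\eta u}$ reconstructed above (the exact weighting form and whether the weight receives gradients are assumptions):

```python
import torch

def unclick_cvr_loss(p_conv, q_conv, u, eta=1.0):
    """Uncertainty-regularized distillation loss on unclicked ads:
    exp(-eta * u) -> 1 as u -> 0 (recovering the base student) and
    shrinks toward 0 for samples with noisy pseudo-labels."""
    ce = -(q_conv * torch.log(p_conv.clamp_min(1e-8))).sum(-1)
    w = torch.exp(-eta * u).detach()  # treat the weight as a constant (assumption)
    return (w * ce).sum()
```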

3. Results

[Figure: experimental results reported in the paper]
Reference: https://zhuanlan.zhihu.com/p/471138795

Origin: blog.csdn.net/CRW__DREAM/article/details/127669706