Contextual Bandit approach for personalized news article recommendations

Summary

Personalized web services strive to adapt their services (advertising, news articles, etc.) to individual users by using both content and user information. There are two challenges in solving this problem:

First, web services are characterized by dynamic changes in content pools, making traditional collaborative filtering methods inapplicable. Second, the scale of most web services of practical interest requires solutions that are both fast to learn and fast to compute.

Therefore, we model personalized recommendation of news articles as a contextual bandit problem, in which a learning algorithm sequentially selects articles to serve users based on contextual information about the users and articles, while adapting its article-selection strategy based on user click feedback to maximize the total number of user clicks.


Introduction

This article discusses the challenge of identifying the most appropriate web-based content for each individual user at the best time. A news filter must promptly recognize the popularity of breaking news while also adapting to the declining value of existing, aging stories. It is generally difficult to model popularity and temporal change using content information alone. In practice, we usually explore the unknown by collecting real-time user feedback, assessing the popularity of new content while monitoring changes in its value. For example, a small fraction of traffic can be dedicated to such exploration: based on user responses (such as clicks) to randomly selected content in this small slice of traffic, the most popular content can be identified and then exploited on the remaining traffic. This strategy, which explores randomly on a fraction of the traffic and exploits greedily on the rest, is known as an ε-greedy strategy. Intuitively, we want to allocate more traffic to new content so that its value is learned quickly, and devote fewer users to tracking temporal changes in the value of existing content.
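As a concrete illustration of this explore/exploit split (not taken from the paper), a minimal ε-greedy selector might look as follows; the article identifiers, CTR values, and the `epsilon` parameter are all hypothetical.

```python
import random

def epsilon_greedy_select(ctr_estimates, epsilon=0.05):
    """Pick an article: explore with probability epsilon, otherwise exploit.

    ctr_estimates: dict mapping article id -> current estimated click-through rate.
    """
    if random.random() < epsilon:
        # Exploration: serve a uniformly random article to a small slice of traffic.
        return random.choice(list(ctr_estimates))
    # Exploitation: serve the article with the highest estimated CTR.
    return max(ctr_estimates, key=ctr_estimates.get)

# Hypothetical usage; the CTR estimates would be updated from click feedback elsewhere.
ctr_estimates = {"article_a": 0.031, "article_b": 0.045, "article_c": 0.012}
print(epsilon_greedy_select(ctr_estimates))
```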

Both users and content are represented by sets of features: user features may include historical activity at an aggregate level as well as declared demographic information, and content features may include descriptive information and categories. In this setting, exploration and exploitation must be deployed at an individual level, since different users may view the same content very differently. Because the number of available choices or actions can be large, it becomes critical to identify commonalities between content items and to transfer that knowledge across the content pool.

In many web-based scenarios, the content pool changes frequently and content popularity changes over time. In addition, a large fraction of visitors may be brand new, with no historical consumption record at all; this is known as the cold-start situation. When the user, the content, or both are new, understanding the match between them becomes essential. However, obtaining such information can be costly and may reduce user satisfaction in the short term, which raises the question of how to optimally balance two competing goals: maximizing user satisfaction in the long run, and gathering information about the match between user interests and content. The above problem is known as a feature-based exploration/exploitation problem. In this paper, we formulate it as a contextual bandit problem: a principled approach in which a learning algorithm sequentially selects articles to serve users based on contextual information about the users and articles, while adapting its article-selection strategy based on user click feedback to maximize the total number of user clicks in the long run. We define the bandit problem and review some existing methods in Section 2. We then propose a new algorithm, LinUCB, in Section 3, which enjoys a regret analysis comparable to the best known algorithms for competing with the best linear predictor, while having lower computational overhead. We also discuss offline evaluation in Section 4, showing that unbiased offline evaluation is possible when interactions are independent and identically distributed (i.i.d.), which can be a reasonable assumption across different users. We then test our new algorithm and several existing algorithms using this offline evaluation strategy in Section 5.

Contextual bandit problem modeling

In trial $t$, the algorithm A observes the current user $u_t$, a set of arms $a \in \mathcal{A}_t$, and a feature vector $x_{t,a}$ for each arm $a$; the vector $x_{t,a}$ summarizes information about both the user $u_t$ and the arm $a$ and is referred to as the context.

Based on the payoffs observed in previous trials, A chooses an arm $a_t \in \mathcal{A}_t$ and receives payoff $r_{t,a_t}$, whose expectation depends on both the user $u_t$ and the arm $a_t$.

A then improves its arm-selection strategy using the new observation $(x_{t,a_t}, a_t, r_{t,a_t})$. Importantly, no feedback $r_{t,a}$ is observed for unchosen arms $a \ne a_t$.

In this process, the total T-trial payoff of A is defined as $\sum_{t=1}^T r_{t,a_t}$, and the optimal expected T-trial payoff is defined as $E\left[\sum_{t=1}^T r_{t,a_t^*}\right]$, where $a_t^*$ is the arm with the maximum expected payoff at trial $t$.

Our goal is to design an algorithm A that maximizes the expected total payoff above. Equivalently, we can look for an algorithm that minimizes regret with respect to the optimal arm-selection strategy. Here, the T-trial regret $R_A(T)$ of algorithm A is formally defined as:

$$R_A(T) \stackrel{\text{def}}{=} E\left[\sum_{t=1}^T r_{t,a_t^*}\right] - E\left[\sum_{t=1}^T r_{t,a_t}\right] \quad\quad (1)$$
Under this definition of payoff, the expected payoff of an article is simply its click-through rate (CTR), and choosing the article with the largest CTR is equivalent to maximizing the expected number of user clicks, which in turn is the same as maximizing the total expected payoff in the bandit formulation.
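To make the interaction protocol and the regret definition in (1) concrete, here is a minimal simulation sketch, assuming a toy environment with synthetic linear payoffs and a placeholder uniformly random policy; the dimensions, noise level, and `random_policy` are illustrative and not part of the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, T = 5, 3, 1000                      # feature dimension, number of arms, trials

# Hypothetical ground-truth coefficients theta*_a for each arm (unknown to the learner).
theta_star = rng.normal(size=(K, d))

def random_policy(contexts):
    """Placeholder learner: picks an arm uniformly at random."""
    return rng.integers(len(contexts))

total_payoff, optimal_payoff = 0.0, 0.0
for t in range(T):
    contexts = rng.normal(size=(K, d))                 # x_{t,a} for every arm a
    expected = (contexts * theta_star).sum(axis=1)     # E[r_{t,a}] = x_{t,a}^T theta*_a
    a_t = random_policy(contexts)                      # the algorithm's choice
    r_t = expected[a_t] + rng.normal(scale=0.1)        # payoff observed only for a_t
    total_payoff += r_t
    optimal_payoff += expected.max()                   # payoff of the optimal arm a*_t

print(f"empirical T-trial regret: {optimal_payoff - total_payoff:.1f}")   # cf. equation (1)
```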

Due to its imprecise knowledge of the arms, A may pick an arm that appears optimal but is in fact suboptimal. To avoid this, A must explore by occasionally selecting arms that appear suboptimal in order to gather more information about them. Exploration may increase short-term regret, since some suboptimal arms will be selected; however, the information obtained about the arms' average payoffs refines A's estimates and thereby reduces long-term regret.

LinUCB algorithm

Assume that the expected payoff of arm $a$ is linear in its $d$-dimensional feature vector $x_{t,a}$ with some unknown coefficient vector $\theta_a^*$; that is, for all $t$,

$$E[r_{t,a} \mid x_{t,a}] = x_{t,a}^T \theta_a^* \quad\quad (2)$$
This model is called a disjoint model because the parameters are not shared among different arms. Let $D_a$ be the $m \times d$ design matrix at trial $t$, whose rows correspond to the $m$ contexts previously observed for article $a$, and let $c_a \in \mathbb{R}^m$ be the corresponding vector of click/no-click feedback. Applying ridge regression to the training data $(D_a, c_a)$ gives the coefficient estimate:

$$\hat{\theta}_a = (D_a^T D_a + I_d)^{-1} D_a^T c_a$$

where $I_d$ is the $d \times d$ identity matrix.
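For reference, the ridge-regression estimate above can be computed directly with NumPy; the design matrix and click vector below are synthetic placeholders.

```python
import numpy as np

rng = np.random.default_rng(1)
m, d = 200, 6                       # m past contexts for this arm, d features
D_a = rng.normal(size=(m, d))       # design matrix: one row per observed context
c_a = rng.integers(0, 2, size=m)    # click (1) / no-click (0) feedback

# Ridge-regression estimate: theta_hat = (D_a^T D_a + I_d)^{-1} D_a^T c_a
A_a = D_a.T @ D_a + np.eye(d)
theta_hat = np.linalg.solve(A_a, D_a.T @ c_a)
print(theta_hat)
```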
When the components of $c_a$ are independent conditioned on the corresponding rows of $D_a$, it can be shown that, with probability at least $1-\delta$ for any $\delta > 0$:

$$\left| x_{t,a}^T \hat{\theta}_a - E[r_{t,a} \mid x_{t,a}] \right| \le \alpha \sqrt{x_{t,a}^T (D_a^T D_a + I_d)^{-1} x_{t,a}}$$

where $\alpha = 1 + \sqrt{\ln(2/\delta)/2}$ is a constant.

The above inequality gives a reasonably tight UCB on the expected payoff of arm $a$, from which a UCB-type arm-selection strategy can be derived: in each trial $t$, choose

$$a_t \stackrel{\text{def}}{=} \arg\max_{a \in \mathcal{A}_t} \left( x_{t,a}^T \hat{\theta}_a + \alpha \sqrt{x_{t,a}^T A_a^{-1} x_{t,a}} \right) \quad\quad (5)$$

where $A_a \stackrel{\text{def}}{=} D_a^T D_a + I_d$.

The arm-selection criterion in equation (5) can thus be viewed as an additive trade-off between the payoff estimate and the reduction of model uncertainty.
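Putting the pieces together, the following is a minimal sketch of LinUCB with disjoint linear models, keeping $A_a = D_a^T D_a + I_d$ and $b_a = D_a^T c_a$ incrementally per arm; the class name, the `alpha` value, and the toy click simulation at the bottom are assumptions for illustration.

```python
import numpy as np

class DisjointLinUCB:
    """Sketch of LinUCB with disjoint linear models: one (A_a, b_a) pair per arm."""

    def __init__(self, d, alpha=1.0):
        self.d, self.alpha = d, alpha
        self.A = {}   # arm -> d x d matrix, D_a^T D_a + I_d
        self.b = {}   # arm -> d vector,     D_a^T c_a

    def _init_arm(self, a):
        if a not in self.A:
            self.A[a] = np.eye(self.d)
            self.b[a] = np.zeros(self.d)

    def select(self, contexts):
        """contexts: dict arm -> feature vector x_{t,a}; returns the arm with the largest UCB."""
        best_arm, best_ucb = None, -np.inf
        for a, x in contexts.items():
            self._init_arm(a)
            A_inv = np.linalg.inv(self.A[a])
            theta_hat = A_inv @ self.b[a]
            ucb = x @ theta_hat + self.alpha * np.sqrt(x @ A_inv @ x)   # equation (5)
            if ucb > best_ucb:
                best_arm, best_ucb = a, ucb
        return best_arm

    def update(self, a, x, r):
        """Incorporate the payoff r observed for the chosen arm a with context x."""
        self.A[a] += np.outer(x, x)
        self.b[a] += r * x

# Hypothetical usage with synthetic contexts and click feedback.
rng = np.random.default_rng(2)
agent = DisjointLinUCB(d=6, alpha=0.5)
for t in range(100):
    contexts = {a: rng.normal(size=6) for a in ("art1", "art2", "art3")}
    a_t = agent.select(contexts)
    r_t = float(rng.random() < 0.05)      # placeholder click/no-click feedback
    agent.update(a_t, contexts[a_t], r_t)
```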

Hybrid model

Features that are shared by all arms, in addition to arm-specific ones, are often helpful. This is done by adding another linear term to the right-hand side of equation (2), giving the following hybrid model:

$$E[r_{t,a} \mid x_{t,a}, z_{t,a}] = z_{t,a}^T \beta^* + x_{t,a}^T \theta_a^* \quad\quad (6)$$

where $z_{t,a} \in \mathbb{R}^k$ is the feature vector of the current user/article combination and $\beta^*$ is an unknown coefficient vector common to all arms. The model is hybrid in the sense that the coefficients $\beta^*$ are shared by all arms, while the coefficients $\theta_a^*$ are not.

Because of the shared features, the confidence intervals of the different arms are no longer independent. Fortunately, UCBs can still be computed efficiently along the same line of reasoning as in the previous section; the derivation relies heavily on block matrix inversion techniques. This yields the hybrid-model version of the LinUCB algorithm.

[Algorithm listing: LinUCB with hybrid linear models]
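To complement the algorithm listing, below is a minimal sketch of a hybrid-model LinUCB, assuming shared statistics $A_0, b_0$ for $\beta^*$ plus per-arm blocks $A_a, B_a, b_a$ as described above; the bookkeeping is modeled on the paper's hybrid algorithm but should be treated as an illustration rather than the canonical pseudocode.

```python
import numpy as np

class HybridLinUCB:
    """Sketch of LinUCB with hybrid linear models: a shared beta plus a per-arm theta_a."""

    def __init__(self, d, k, alpha=1.0):
        self.d, self.k, self.alpha = d, k, alpha
        self.A0 = np.eye(k)                    # shared statistics for beta*
        self.b0 = np.zeros(k)
        self.A, self.B, self.b = {}, {}, {}    # per-arm blocks

    def _init_arm(self, a):
        if a not in self.A:
            self.A[a] = np.eye(self.d)
            self.B[a] = np.zeros((self.d, self.k))
            self.b[a] = np.zeros(self.d)

    def select(self, contexts):
        """contexts: dict arm -> (z, x); z is the shared k-dim feature, x the arm-specific d-dim feature."""
        A0_inv = np.linalg.inv(self.A0)
        beta_hat = A0_inv @ self.b0
        best_arm, best_p = None, -np.inf
        for a, (z, x) in contexts.items():
            self._init_arm(a)
            A_inv = np.linalg.inv(self.A[a])
            theta_hat = A_inv @ (self.b[a] - self.B[a] @ beta_hat)
            # Variance term combining shared and arm-specific uncertainty.
            s = (z @ A0_inv @ z
                 - 2 * z @ A0_inv @ self.B[a].T @ A_inv @ x
                 + x @ A_inv @ x
                 + x @ A_inv @ self.B[a] @ A0_inv @ self.B[a].T @ A_inv @ x)
            p = z @ beta_hat + x @ theta_hat + self.alpha * np.sqrt(max(s, 0.0))
            if p > best_p:
                best_arm, best_p = a, p
        return best_arm

    def update(self, a, z, x, r):
        """Update shared and per-arm statistics after observing payoff r for arm a."""
        A_inv = np.linalg.inv(self.A[a])
        self.A0 += self.B[a].T @ A_inv @ self.B[a]
        self.b0 += self.B[a].T @ A_inv @ self.b[a]
        self.A[a] += np.outer(x, x)
        self.B[a] += np.outer(x, z)
        self.b[a] += r * x
        A_inv = np.linalg.inv(self.A[a])
        self.A0 += np.outer(z, z) - self.B[a].T @ A_inv @ self.B[a]
        self.b0 += r * z - self.B[a].T @ A_inv @ self.b[a]
```

The two inversions of `self.A[a]` in `update` bracket the per-arm changes so that the shared statistics are corrected before and after the arm's own blocks are modified, reflecting the block-matrix structure mentioned above.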

Offline evaluation

We propose an offline evaluation approach that is easy to implement, based on log data, and unbiased.

Given the current history $h_{t-1}$, if the policy π selects the same arm $a$ as the logging policy did, the event is added to the history and the total payoff is updated. Otherwise, if π selects a different arm than the one chosen by the logging policy, the event is ignored entirely and the algorithm proceeds to the next event without any other change to its state. Note that because the logging policy chooses each arm uniformly at random, each event is retained by this procedure with probability exactly 1/K, independently of all other events. This means that the retained events have the same distribution as events drawn from D. Consequently, the two procedures are equivalent: evaluating the policy on T real events drawn from D, and evaluating the policy on the stream of logged events with the policy evaluator.
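Assuming the logged events were collected by a uniformly random policy over K arms and are stored as (contexts, displayed arm, click) tuples, a sketch of this rejection-based evaluator could look like the following; the event format and the `policy` callable are illustrative.

```python
def evaluate_policy(policy, logged_events, T):
    """Offline evaluation of a bandit policy on uniformly-random logged data.

    policy: callable(history, contexts) -> chosen arm
    logged_events: iterable of (contexts, displayed_arm, reward) tuples
    Returns the average payoff over the first T retained events.
    """
    history, total_payoff = [], 0.0
    for contexts, displayed_arm, reward in logged_events:
        if len(history) >= T:
            break
        if policy(history, contexts) == displayed_arm:
            # Keep the event only when the policy agrees with the logged choice;
            # this happens with probability exactly 1/K under the random logger.
            history.append((contexts, displayed_arm, reward))
            total_payoff += reward
        # Otherwise the event is discarded and the evaluator's state is unchanged.
    return total_payoff / max(len(history), 1)
```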

Theorem 1:

For all distributions D of contexts, all policies π, all T, and all event sequences $h_T$:
$$\Pr_{\text{Policy\_Evaluator}(\pi, S)}(h_T) = \Pr_{\pi, D}(h_T)$$

where S is a stream of events drawn i.i.d. from D with arms chosen by the uniformly random logging policy. Moreover, the expected number of events that must be drawn from the stream in order to collect a history of length T is $KT$.

This theorem says that every history $h_T$ has the same probability in the real world as under the policy evaluator. Consequently, statistics of these histories, such as the average payoff $R_T/T$ returned by Algorithm 3, are unbiased estimates of the value of the policy π. Furthermore, the theorem states that $KT$ logged events are expected to be required in order to retain a sample of size T.

Since any randomization in the policy is independent of randomization in the world, we only need to show that, conditioned on the history $h_{t-1}$, the distribution over the t-th event is the same for both processes. In other words, we must show:

$$\Pr_{\text{Policy\_Evaluator}(\pi, S)}(x_{t,1},\ldots,x_{t,K}, r_{t,a} \mid h_{t-1}) = \Pr_{\pi, D}(x_{t,1},\ldots,x_{t,K}, r_{t,a} \mid h_{t-1})$$

Since the arm $a$ is chosen uniformly at random by the logging policy, the probability that the policy evaluator exits its inner loop is the same for any policy, any history, any features, and any arm; this happens on the last event considered, and the probability of that last event is $\Pr_D(x_{t,1},\ldots,x_{t,K}, r_{t,a})$.

Since every event in the stream is retained with probability exactly 1/K, the expected number of events required to retain T of them is exactly $KT$.
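As a quick sanity check of the 1/K retention argument (a toy simulation, not from the paper), the expected stream length needed to retain T events should be close to KT:

```python
import random

random.seed(0)
K, T, runs = 10, 1000, 20
lengths = []
for _ in range(runs):
    retained, consumed = 0, 0
    while retained < T:
        consumed += 1
        # The logging policy displayed a uniformly random arm, so any fixed choice
        # by the evaluated policy matches it with probability exactly 1/K.
        if random.randrange(K) == 0:
            retained += 1
    lengths.append(consumed)
print(sum(lengths) / runs)   # should be close to K * T = 10000
```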

Origin blog.csdn.net/perfectzxiny/article/details/119672275