Ranking loss functions: list-wise loss (Series 3)

Ranking series:

The earliest list-wise article is the paper Learning to Rank: From Pairwise Approach to Listwise Approach. Many variants have appeared since, but the core idea remains the same. This article focuses on the principles.

Paper link: ListNet. Reference implementation: implementation code.

1. Why List-wise loss?

Advantages and disadvantages of pairwise
Advantages:

  • Some well-proven classification models can be used directly.
  • In some specific scenarios, its pairwise features are easy to obtain.

Shortcomings:

  • The learning objective is to minimize the classification error over document pairs, not the document ranking error, so the training objective is inconsistent with the actual evaluation objective (e.g., MAP, NDCG).
  • The training process can be extremely time-consuming because the number of generated document pairs can be very large.

So how does this paper solve these problems?
In pointwise, each <query, document> pair is used as a training sample for a classification model; this ignores the ordering relationships between documents. In pairwise, the relative order of any two documents under the same query is considered, but it has the shortcomings mentioned above. In listwise, a whole <query, documents> pair is treated as one training sample, where "documents" is the list of documents related to that query.
The paper proposes computing the listwise loss via probability distributions over rankings, and introduces two such distributions: permutation probability and top-one probability. Both are detailed below.

2. Method introduction

2.1. loss input format

Suppose we have m queries:

Q = (q^{(1)}, q^{(2)}, q^{(3)}, ..., q^{(m)})

Each query has n documents that may be relevant to it (n may differ across queries):

d^{(i)} = (d_1^{(i)}, d_2^{(i)}, ..., d_n^{(i)})

For all documents under each query, we can obtain each document's true relevance score for the query according to the specific application scenario:

y^{(i)} = (y_1^{(i)}, y_2^{(i)}, ..., y_n^{(i)})

From each pair (q^{(i)}, d_j^{(i)}) we can extract a feature vector x_j^{(i)}; doing this for every document under the query gives the feature vectors of all its documents:

x^{(i)} = (x_1^{(i)}, x_2^{(i)}, ..., x_n^{(i)})

Together with the known true relevance scores y^{(i)}, we can construct the training set:

T = \{(x^{(i)}, y^{(i)})\}

Note in particular that a single training sample here is (x^{(i)}, y^{(i)}), where x^{(i)} covers the whole list of documents related to the query. This is the key feature that distinguishes listwise from pointwise and pairwise.
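To make the sample layout concrete, here is a minimal sketch of one listwise training sample; the feature values and relevance labels are made-up illustrative numbers, and the dictionary layout is just one possible representation:

```python
# One listwise training sample = one query together with ALL of its
# candidate documents, not a single <query, document> pair.
sample = {
    "query_id": 1,
    "x": [               # one feature vector x_j per document d_j
        [0.2, 0.7, 0.1],
        [0.9, 0.3, 0.4],
        [0.5, 0.5, 0.5],
    ],
    "y": [2.0, 0.0, 1.0],  # true relevance score y_j per document
}

# The training set T is simply a list of such per-query samples.
T = [sample]

# Invariant: one relevance label per document feature vector.
assert len(sample["x"]) == len(sample["y"])
```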

The paper's description of y^{(i)} is given in a figure (omitted here).

2.2. Loss calculation

So now that we have training samples, how do we compute the loss?
Assuming we already have a ranking function f, we can compute the scores of the feature vectors x^{(i)}:

z^{(i)} = (f(x_1^{(i)}), f(x_2^{(i)}), ..., f(x_n^{(i)}))

Our learning goal is to minimize the error between the true scores and the predicted scores:

\sum_{i=1}^{m} L(y^{(i)}, z^{(i)})

where L is the listwise loss function.

2.2.1. Probabilistic model

Suppose that for a certain query, the candidate documents are {1, 2, 3, ..., n}, and that one particular ranking of them is π:

π = <π(1), π(2), ..., π(n)>

For n documents there are n! permutations; the set of all permutations is denoted Ω_n. Given a ranking function, we can compute a relevance score for each document, s = (s_1, s_2, ..., s_n). Every permutation is possible, but each permutation has its own probability of occurring.

We can define the probability of a particular permutation π as:

P_s(\pi) = \prod_{j=1}^{n} \frac{\phi(s_{\pi(j)})}{\sum_{k=j}^{n} \phi(s_{\pi(k)})}

where φ is an increasing, strictly positive transformation of the scores.
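The permutation probability above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's code; exp is used as φ here (one valid increasing, strictly positive choice, and the one ListNet itself later adopts), and the scores are made-up numbers:

```python
import math
from itertools import permutations

def permutation_probability(scores, pi, phi=math.exp):
    """P_s(pi) = prod_j phi(s_pi(j)) / sum_{k>=j} phi(s_pi(k)).

    `pi` lists document indices in ranked order (first element = rank 1).
    """
    prob = 1.0
    for j in range(len(pi)):
        # denominator: phi of the scores of all documents not yet placed
        denom = sum(phi(scores[k]) for k in pi[j:])
        prob *= phi(scores[pi[j]]) / denom
    return prob

scores = [3.0, 1.0, 0.5]  # s_1 > s_2 > s_3 (illustrative)
best = permutation_probability(scores, [0, 1, 2])   # descending order
worst = permutation_probability(scores, [2, 1, 0])  # ascending order
assert best > worst

# Sanity check: the probabilities of all n! permutations sum to 1,
# so P_s really is a distribution over Omega_n.
total = sum(permutation_probability(scores, list(p))
            for p in permutations(range(3)))
assert abs(total - 1.0) < 1e-9
```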

For example, with three documents and π = <1, 2, 3>, the ranking function gives scores s = (s_1, s_2, s_3). Then:

P_s(\pi) = \frac{\phi(s_1)}{\phi(s_1)+\phi(s_2)+\phi(s_3)} \cdot \frac{\phi(s_2)}{\phi(s_2)+\phi(s_3)} \cdot \frac{\phi(s_3)}{\phi(s_3)}
For another permutation, say π' = <3, 2, 1>, the probability is:

P_s(\pi') = \frac{\phi(s_3)}{\phi(s_3)+\phi(s_2)+\phi(s_1)} \cdot \frac{\phi(s_2)}{\phi(s_2)+\phi(s_1)} \cdot \frac{\phi(s_1)}{\phi(s_1)}

If s_1 > s_2 > s_3, then the permutation <1, 2, 3> has the highest probability and <3, 2, 1> the lowest.

2.2.2. Top K Probability

The permutation-probability method above has computational complexity O(n!), which is far too expensive, so the paper proposes the more efficient top-one probability. Here we generalize it to top-k for analysis.

The probability of a particular permutation, computed above, was:

P_s(\pi) = \prod_{j=1}^{n} \frac{\phi(s_{\pi(j)})}{\sum_{k=j}^{n} \phi(s_{\pi(k)})}
There are n choices for the first position, n−1 for the second, and so on, so the formula above is effectively a "top-n" computation. The top-K probability (K < n) is then defined as:

P_s(\pi) = \prod_{j=1}^{K} \frac{\phi(s_{\pi(j)})}{\sum_{k=j}^{n} \phi(s_{\pi(k)})}
The computational complexity is then n·(n−1)·(n−2)·...·(n−K+1), i.e. there are n!/(n−K)! distinct top-K arrangements, greatly reducing the computation.
If K = 1, this becomes the top-one case in the paper. There are then only n distinct arrangements, and the probability of document j being ranked first is:

P_s(j) = \frac{\phi(s_j)}{\sum_{k=1}^{n} \phi(s_k)}
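The top-one probability can be sketched directly; again this is an illustrative snippet with exp as the choice of φ and made-up scores:

```python
import math

def top_one_probability(scores, phi=math.exp):
    """Top-1 probability of each document: phi(s_j) / sum_k phi(s_k)."""
    vals = [phi(s) for s in scores]
    z = sum(vals)
    return [v / z for v in vals]

p = top_one_probability([3.0, 1.0, 0.5])

assert abs(sum(p) - 1.0) < 1e-9  # a valid distribution over the n documents
assert p[0] == max(p)            # highest score -> highest top-1 probability
```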

For the n!/(n−K)! distinct top-K arrangements, the predicted scores yield a probability distribution over arrangements, and the true relevance scores yield another such distribution in the same way. Cross entropy between the two distributions can then be used as the loss function:

L(y^{(i)}, z^{(i)}) = -\sum_{j=1}^{n} P_{y^{(i)}}(j) \cdot \log\left(P_{z^{(i)}}(j)\right)
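Putting the two pieces together, the cross-entropy loss between the true and predicted top-1 distributions can be sketched as follows (a minimal illustration with φ = exp; the score vectors are made-up numbers):

```python
import math

def listwise_ce_loss(y_true, z_pred):
    """Cross entropy between top-1 distributions of true and predicted scores."""
    def softmax(s):
        m = max(s)
        e = [math.exp(v - m) for v in s]  # subtract max for numerical stability
        t = sum(e)
        return [v / t for v in e]
    p = softmax(y_true)  # target distribution from true relevance scores
    q = softmax(z_pred)  # predicted distribution from model scores
    return -sum(pj * math.log(qj) for pj, qj in zip(p, q))

# The loss is smaller when the predicted ordering matches the true one...
good = listwise_ce_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0])
bad = listwise_ce_loss([2.0, 1.0, 0.0], [0.0, 1.0, 2.0])
assert good < bad

# ...but note CE bottoms out at the entropy of the target, not at zero.
assert good > 0.0
```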

For example, suppose there are three documents <A, B, C> under a query.

In the paper's figure (omitted here), g is the permutation probability distribution computed from the true scores, while f and h are two other permutation probability distributions. We simply compare which predicted distribution is closer to the true one, and the predicted relevance scores producing that closer distribution are used as the final scores.

2.2.3. ListNet

Here is the final form of ListNet.
In the paper, ListNet simply instantiates the top-K probability above with K = 1 and takes φ to be the exp function:

P_s(j) = \frac{\exp(s_j)}{\sum_{k=1}^{n} \exp(s_k)}

Isn't this just the softmax of the predicted scores? Indeed it is, and that is exactly how the implementation code does it. Reading the code directly can be confusing: it is just a softmax over the documents' predicted scores, so what does that have to do with top-one probability? Only after reading the paper carefully does the connection become clear.

With top-1 there are only n arrangements, which greatly reduces the amount of computation. With top-K (K > 1), the number of arrangements to compute grows rapidly.

Assuming the ranking function f has parameters w, the top-one probability distribution of the predicted scores is:

P_{z^{(i)}(f_w)}(x_j^{(i)}) = \frac{\exp(f_w(x_j^{(i)}))}{\sum_{k=1}^{n} \exp(f_w(x_k^{(i)}))}

Note again: the list of all documents that may be relevant to a query is used as one training sample.

The final loss function:

L(y^{(i)}, z^{(i)}(f_w)) = -\sum_{j=1}^{n} P_{y^{(i)}}(x_j^{(i)}) \cdot \log\left(P_{z^{(i)}(f_w)}(x_j^{(i)})\right)
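As a complete end-to-end sketch, the loss can be computed for a concrete scoring function. The linear model f_w(x) = w·x below is a hypothetical minimal choice for illustration, and the features, labels, and weights are all made-up numbers:

```python
import math

def listnet_loss(w, x_list, y_list):
    """ListNet top-1 loss for one query, using a linear scorer f_w(x) = w . x."""
    def softmax(s):
        m = max(s)
        e = [math.exp(v - m) for v in s]
        t = sum(e)
        return [v / t for v in e]
    z = [sum(wi * xi for wi, xi in zip(w, x)) for x in x_list]  # predicted scores
    p = softmax(y_list)  # target top-1 distribution from true relevance
    q = softmax(z)       # predicted top-1 distribution
    return -sum(pj * math.log(qj) for pj, qj in zip(p, q))

x_list = [[1.0, 0.0], [0.0, 1.0]]  # two documents (illustrative features)
y_list = [1.0, 0.0]                # the first document is more relevant

good_w = [1.0, -1.0]  # scores the relevant document higher
bad_w = [-1.0, 1.0]   # scores it lower
assert listnet_loss(good_w, x_list, y_list) < listnet_loss(bad_w, x_list, y_list)
```

Training then consists of minimizing this loss over w across all queries (e.g., by gradient descent).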

Editor's summary:
Simply put, the listwise loss boils down to:
step 1: apply a softmax over each query's full set of positive and negative samples; step 2: use cross entropy, summed over all samples, to compute the loss.
In principle, the top-one probability is a compromise made for computation speed. Also, the ground truth in the cross entropy is derived from the true scores y_j^{(i)}, so the higher a document's true score, the larger the loss caused by a big prediction error. The loss therefore pays more attention to documents with high true scores but low predicted scores, improving quality at the top of the ranking.

Implementation details

I discovered a detail when reading the source code: the paper speaks of cross entropy, but the code actually implements KL divergence (in the referenced figure, omitted here, the KL divergence is marked in red and the cross entropy in green).
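The two choices are interchangeable for training, because cross entropy equals KL divergence plus the entropy of the target distribution, and that entropy does not depend on the model parameters. A small numerical check (with made-up score vectors):

```python
import math

def softmax(s):
    m = max(s)
    e = [math.exp(v - m) for v in s]
    t = sum(e)
    return [v / t for v in e]

p = softmax([2.0, 1.0, 0.0])  # "true" top-1 distribution (illustrative)
q = softmax([1.5, 1.2, 0.1])  # predicted distribution (illustrative)

ce = -sum(pi * math.log(qi) for pi, qi in zip(p, q))      # cross entropy
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))  # KL divergence
h = -sum(pi * math.log(pi) for pi in p)                   # entropy of p

# CE = KL + H(p); H(p) is constant w.r.t. the model, so minimizing
# either quantity yields the same gradients and the same optimum.
assert abs(ce - (kl + h)) < 1e-9
```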

Summary

Pairwise considers only the relative order within each document pair, not a document's position in the ranked list. Documents at the top of the search results matter more, and a misjudgment there should cost more than one among lower-ranked documents. One improvement is to introduce cost-sensitive weighting: each document pair is weighted according to its position in the list, so that ordering errors near the top of the ranking incur a higher cost (as reflected in the NDCG evaluation metric). Listwise, by contrast, treats all documents under a query as one sample; because it considers the probability distribution over different orderings and minimizes the distance to the true distribution, the ordering relationships among documents are taken into account and this problem is largely avoided.
It defines the loss function from the perspective of a probabilistic model.
In practice, all n candidate documents under a query are used as one training sample (you can think of this as batch_size = n). Note that when computing the top-1 probability, the softmax is taken over all documents within a query, not over all samples currently in the batch. This is an important difference from pointwise and pairwise.
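The per-query vs. per-batch distinction can be demonstrated concretely (a sketch with made-up scores for two queries):

```python
import math

def softmax(s):
    e = [math.exp(v) for v in s]
    t = sum(e)
    return [v / t for v in e]

# Two queries batched together; scores are illustrative.
query_a = [2.0, 0.5]
query_b = [1.0, 1.0, 0.2]

# Correct for ListNet: normalize within each query's own document list,
# giving one top-1 distribution per query.
per_query = [softmax(query_a), softmax(query_b)]
assert all(abs(sum(p) - 1.0) < 1e-9 for p in per_query)

# Incorrect: a single softmax over the flattened batch mixes documents
# from different queries, so no per-query distribution remains.
flat = softmax(query_a + query_b)
assert abs(sum(flat[:2]) - 1.0) > 1e-6  # query A's slice no longer sums to 1
```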

Reference link:
Paper sharing: Learning to Rank: From Pairwise Approach to Listwise Approach

Origin blog.csdn.net/u014665013/article/details/129283415