Search ranking algorithm

Ranking models: LTR (L2R, learning to rank)

  • Pointwise: for each item in the list, directly learn a value, for example the predicted click-through rate (pCTR), then sort in descending order of the estimates (see the sketch after this list). Common models include LR, FFM, GBDT and XGBoost; among nonlinear models, GBDT is the one most often applied in LTR. The Additive Groves (AG) model is built on top of random forests with Bagging added, which gives it stronger generalization ability. AG consists of several Groves combined by Bagging; each Grove consists of multiple trees, and when fitting each tree the training target is the residual between the true value and the predictions of the other trees. Once the specified number of trees is reached, further training retrains trees to replace earlier ones. Google's FTRL method allows linear models to be updated online.
  • Pairwise: learn the pairwise ordering relation between two items. Common models include GBRank, RankNet, LambdaMART and RankSVM. LambdaMART combines Lambda with MART (Multiple Additive Regression Tree, an alias of GBDT) and is an improvement of GBDT for ranking problems. When computing the gradient, LambdaMART replaces it with Lambda, which gives the gradient a ranking-specific physical meaning: the probability of each ordered pair is computed with a sigmoid, cross entropy is used as the loss function, and offline ranking metrics (e.g. MAP, NDCG) are folded into the gradient.
  • Listwise: optimize the model with the quality of the whole ranked list as the final objective. A typical model is ListNet, which learns by minimizing the gap between the predicted ranking distribution and the true distribution. Normalized Discounted Cumulative Gain (NDCG) is introduced as a measure of list quality so that the ranking is optimal at the list level.
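
A minimal pointwise sketch, as referenced in the first bullet: train a pCTR model and sort by its predictions. The toy data and the choice of logistic regression are illustrative assumptions, not part of the original text.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy features and click labels for the candidate items of one query (illustrative)
X = np.random.rand(200, 8)
clicks = np.random.randint(0, 2, size=200)

model = LogisticRegression().fit(X, clicks)   # pointwise pCTR model
pctr = model.predict_proba(X)[:, 1]           # predicted click-through rate per item
ranking = np.argsort(-pctr)                   # sort items by pCTR, descending
print(ranking[:10])
```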

A pairwise model forms a pair from every two documents, e.g. (X1, X2): if X1's score is greater than X2's, the pair is labeled a positive example (+1), otherwise a negative example (-1). Pairwise usually works better than pointwise (this is the case in academia, and industry is also increasingly using pairwise).
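
As a concrete illustration of the pair construction just described, here is a minimal sketch; the function name and the per-query grouping are illustrative assumptions, not from the original text.

```python
import numpy as np

def make_pairs(features, scores):
    """Turn per-document features/scores for one query into pairwise examples.

    For every pair (i, j) where document i is scored higher than document j,
    emit (x_i - x_j, +1) and the mirrored (x_j - x_i, -1).
    """
    X_pairs, y_pairs = [], []
    for i in range(len(scores)):
        for j in range(len(scores)):
            if scores[i] > scores[j]:
                X_pairs.append(features[i] - features[j]); y_pairs.append(+1)
                X_pairs.append(features[j] - features[i]); y_pairs.append(-1)
    return np.array(X_pairs), np.array(y_pairs)
```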

Here are a few classic ranking models:

RankSVM

RankSVM is a classic pairwise learning-to-rank model that scores and ranks query-doc pairs with an SVM. The input feature of RankSVM is the difference between the feature vectors of two query-doc pairs, i.e. \(X_2-X_1\), and the label is their relative order (+1, -1).
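
A minimal RankSVM-style sketch, assuming scikit-learn is available and reusing the `make_pairs` helper from the earlier sketch; the toy data is purely illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

# toy query-doc features and graded relevance labels for one query (illustrative)
X = np.random.rand(30, 10)
rel = np.random.randint(0, 3, size=30)

# pairwise differences X_i - X_j labeled by relative order (make_pairs from above)
diffs, labels = make_pairs(X, rel)

svm = LinearSVC().fit(diffs, labels)          # linear SVM on the difference vectors
scores = X @ svm.coef_.ravel()                # ranking score is the linear score w·x
print(np.argsort(-scores))                    # documents ranked by descending score
```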

GBRank

GBRank was proposed by researchers at Yahoo. The base learner is a Gradient Boosting Machine (GBM). For all N training samples with partial order x > y, the loss function is defined as:
\[ L_1=\sum_{i=1}^N \max(0,\ h(y_i)-h(x_i)) \\
L_2={1\over 2}\sum_{i=1}^N \big(\max(0,\ h(y_i)-h(x_i))\big)^2 \]
This is solved with gradient descent; at the m-th iteration h is updated as \( h(y_i^m)=h_{m-1}(y_i)-\rho_m \times {\partial L_2^{m-1} \over \partial y_i} \). A partial order is updated only when it is mispredicted, in which case \( h(y_i^m)=h(x_i^{m-1});\ h(x_i^m)=h(y_i^{m-1}) \). To make the learned partial order more robust, the difference also needs to be large enough, so the loss function is modified to:
\[ L_3={1\over 2}\sum_{i=1}^N \big(\max(0,\ \tau+h(y_i)-h(x_i))\big)^2 \]
so that after a mispredicted pair, the update becomes \( h(y_i^m)=h(x_i^{m-1})-\tau;\ h(x_i^m)=h(y_i^{m-1})+\tau \).

GBRank normalizes across all GBM iterations: \( h_m(x)={m\,h_{m-1}(x)+\eta\, g_m(x) \over m+1} \), where \(\eta\) is the shrinkage.
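
A rough sketch of one GBRank boosting iteration under the losses above, using a regression tree as the base learner; the data layout, helper names, and hyperparameters are assumptions for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gbrank_step(h_prev, X_pos, X_neg, m, tau=1.0, eta=0.1):
    """One GBRank iteration; X_pos[i] should be ranked above X_neg[i]."""
    s_pos, s_neg = h_prev(X_pos), h_prev(X_neg)
    wrong = s_pos < s_neg + tau                  # pairs violating the margin tau
    if not wrong.any():
        return h_prev                            # every pair already satisfied
    # regression targets: push the violated pairs apart by tau
    X_fit = np.vstack([X_pos[wrong], X_neg[wrong]])
    y_fit = np.concatenate([s_neg[wrong] + tau, s_pos[wrong] - tau])
    g_m = DecisionTreeRegressor(max_depth=3).fit(X_fit, y_fit)
    # normalized combination: h_m = (m * h_{m-1} + eta * g_m) / (m + 1)
    return lambda X: (m * h_prev(X) + eta * g_m.predict(X)) / (m + 1)

# usage: start from h_0 = 0 and iterate over boosting rounds
# h = lambda X: np.zeros(len(X))
# for m in range(1, 101):
#     h = gbrank_step(h, X_pos, X_neg, m)
```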

RankNet

RankNet was proposed by Burges's team at Microsoft (ICML 2005), earlier than GBRank. Its key point: in learning the partial order, avoid mapping each sample to an exact ranking value; it is enough to learn the partial order itself. RankNet learns the probability of the partial order with a neural network. The target probability to be learned is \(\bar P_{xy}\): when x is ranked before y (x > y), \(\bar P_{xy}=1\); if the partial-order relation cannot be determined, it is 0.5; otherwise it is 0. Let the outputs for inputs x and y be \(f(x), f(y)\), and let \(O_{xy}=f(x)-f(y)\); the predicted partial-order probability is \(P_{xy}=\text{sigmoid}(O_{xy})={1\over 1+e^{-O_{xy}}}\). The loss function is the cross-entropy loss (negative log probability):
\[ \begin{align} L_{xy} &= -\bar P_{xy}\log P_{xy} - (1-\bar P_{xy})\log(1-P_{xy}) \\
&= -\bar P_{xy}\big[O_{xy}-\log(1+e^{O_{xy}})\big] + (1-\bar P_{xy})\log(1+e^{O_{xy}}) \\
&= -\bar P_{xy}O_{xy} + \log(1+e^{O_{xy}}) \end{align} \]
Since the pairwise partial-order probabilities are transitive, RankNet can be trained with O(n) complexity. Because RankNet's optimization goal is to reduce the number of mis-ordered pairs, it does not do particularly well on position-sensitive metrics such as NDCG (which care about where the relevant documents end up); the same is true of RankSVM, GBDT and other ranking models. Later improved models include LambdaRank and LambdaMART.
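
A minimal numpy sketch of the RankNet pairwise loss and its gradient with respect to the score difference, following the formula above; the function names are illustrative.

```python
import numpy as np

def ranknet_loss(o_xy, p_bar):
    """L_xy = -P_bar * O_xy + log(1 + exp(O_xy))."""
    return -p_bar * o_xy + np.log1p(np.exp(o_xy))

def ranknet_grad(o_xy, p_bar):
    """dL/dO_xy = sigmoid(O_xy) - P_bar, backpropagated into f(x) and f(y)."""
    return 1.0 / (1.0 + np.exp(-o_xy)) - p_bar

# example: x should rank above y (P_bar = 1), current scores f(x)=0.2, f(y)=0.5
o = 0.2 - 0.5
print(ranknet_loss(o, 1.0), ranknet_grad(o, 1.0))
```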

Open-source implementation: RankLib contains a variety of these algorithms.

LambdaMART

This is a model Microsoft has used in Bing for a long time, and it has a strong reputation in the L2R field. After Burges's team realized that RankNet does not directly optimize search metrics (NDCG, MAP, etc.), they proposed LambdaRank, which uses the change in NDCG to drive the optimization of the model parameters. They then combined the idea of LambdaRank with GBDT and proposed the stronger LambdaMART.
Lambda is defined as the change in NDCG caused by swapping two documents (in fact, this change is multiplied by the gradient of the pairwise probability loss). The Lambda gradient pays more attention to promoting high-quality documents ranked near the top, effectively avoiding the situation where a high-quality document near the top gets pushed down.
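
A sketch of how the Lambda for one document pair can be computed from the definition above, i.e. the pairwise-loss gradient scaled by the NDCG change from swapping the two documents; the exact gain formula and function names here are assumptions.

```python
import numpy as np

def delta_ndcg(rel, i, j, idcg):
    """|Change in NDCG| if the documents at ranks i and j (0-based) are swapped."""
    gain = lambda r, pos: (2 ** r - 1) / np.log2(pos + 2)
    before = gain(rel[i], i) + gain(rel[j], j)
    after = gain(rel[i], j) + gain(rel[j], i)
    return abs(after - before) / idcg

def lambda_ij(s_i, s_j, rel, i, j, idcg):
    """Lambda for a pair where rel[i] > rel[j]: pairwise log-loss gradient times |dNDCG|."""
    rho = 1.0 / (1.0 + np.exp(s_i - s_j))     # gradient of the pairwise cross-entropy
    return -rho * delta_ndcg(rel, i, j, idcg)
```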

Learning to rank evaluation

Precision and Recall (P-R)

P-R has two obvious drawbacks:

  • Documents are only divided into two classes, relevant and irrelevant, which is obviously too coarse a classification.
  • It does not take position into account.

Discounted Cumulative Gain (DCG)

DCG addresses both problems of P-R. For a query, all documents can be divided into multiple relevance levels, denoted here by \(rel_1, rel_2, \dots\). A document's contribution of relevance to the overall evaluation of the list decays logarithmically with position: the later the position, the stronger the decay.

The assumption behind NDCG is that, in a ranked result, relevant information should not be ranked below irrelevant information; the most relevant information should be placed at the very top and the most irrelevant at the very bottom. Any ranking that deviates from this assumption is "penalized". NDCG is a ranking evaluation computed on a test set.

NDCG measures the quality of a ranking; to understand NDCG we need to start from CG. CG (cumulative gain) can be used to evaluate score/rating based personalized recommender systems. Suppose we recommend k items; the cumulative gain \(CG_k\) of this recommendation list is computed as follows:

\[ CG_k=\sum_{i=1}^k \text{rel}_i \]

\(\text{rel}_i\) denotes the relevance or rating of the i-th item. Suppose we recommend k movies in total; then \(\text{rel}_i\) can be the user's rating of the i-th movie.

For example, suppose Douban recommends five movies to a user: M1, M2, M3, M4, M5

The user's ratings of these five movies are: 5, 3, 2, 1, 2

Then the CG of this recommendation list is
\( CG_5 = 5 + 3 + 2 + 1 + 2 = 13 \).
CG does not consider the order of the recommendations. Taking the order of items into account on top of CG gives DCG (discounted cumulative gain). The formula is as follows:

\[ DCG_k=\sum_{i=1}^k \frac{2^{\text{rel}_i}-1}{\log_2(i+1)} \]

Then the DCG of the recommendation list is
\[ DCG_5 = \frac{2^5-1}{\log_2 2} + \frac{2^3-1}{\log_2 3} + \frac{2^2-1}{\log_2 4} + \frac{2^1-1}{\log_2 5} + \frac{2^2-1}{\log_2 6} = 31 + 4.4 + 1.5 + 0.4 + 1.2 = 38.5 \]

For a ranking engine, the length of the result list usually differs from request to request. When comparing the ranking performance of different engines, DCG values over requests of different lengths are not very comparable. DCG does not account for the length of the recommendation list or the real number of results for each query, so the metric commonly used in industry is Normalized DCG (NDCG). It assumes that, for a request, one can obtain the perfect ranking of the top p positions; the DCG of this perfect list is called the Ideal DCG (IDCG), and NDCG equals the ratio of DCG to IDCG. NDCG is therefore a value between 0 and 1.

The DCG of the perfect result, i.e. IDCG, is defined as follows:

\[ \text{IDCG}_p=\sum_{i=1}^{|REL|}{2^{rel_i}-1 \over \log_2(i+1)} \]

|REL| denotes the list of the top p results ordered by decreasing relevance (the ideal ranking).

\[ NDCG_k=\frac{DCG_k}{IDCG_k} \]

Continuing the example above, suppose there are 7 movies in total: M1, M2, M3, M4, M5, M6, M7,
and the user's ratings of these seven movies are: 5, 3, 2, 1, 2, 4, 0

Sorting these seven ratings in descending order gives: 5, 4, 3, 2, 2, 1, 0

The perfect DCG in this case is
\[ IDCG_5 = \frac{2^5-1}{\log_2 2} + \frac{2^4-1}{\log_2 3} + \frac{2^3-1}{\log_2 4} + \frac{2^2-1}{\log_2 5} + \frac{2^2-1}{\log_2 6} = 31 + 9.5 + 3.5 + 1.3 + 1.2 = 46.5 \]
so \( NDCG_5 = \frac{DCG_5}{IDCG_5} = \frac{38.5}{46.5} \approx 0.83 \).
NDCG is a number between 0 and 1; the closer to 1, the more accurate the recommendation.
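
A short sketch that reproduces the worked example above; the helper names are illustrative.

```python
import numpy as np

def dcg(rels, k):
    rels = np.asarray(rels, dtype=float)[:k]
    return np.sum((2 ** rels - 1) / np.log2(np.arange(2, len(rels) + 2)))

def ndcg(ranked_rels, all_rels, k):
    """DCG of the presented ranking divided by the DCG of the ideal ranking."""
    return dcg(ranked_rels, k) / dcg(sorted(all_rels, reverse=True), k)

# ratings of the 5 recommended movies vs. all 7 candidate movies
print(ndcg([5, 3, 2, 1, 2], [5, 3, 2, 1, 2, 4, 0], k=5))  # ~0.83
```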

Reference: http://sofasofa.io/forum_main_post.php?postid=1002561

Expected Reciprocal Rank (ERR)

Compared with DCG, besides considering position and allowing multiple relevance levels (denoted R1, R2, R3, ...), ERR goes a step further and also considers the relevance of all documents ranked before the current one. For example, suppose document A is very relevant but ranked 5th. If the four documents in front of it are not very relevant, then A contributes a lot to the list. Conversely, if the four documents before it are highly relevant and have already fully satisfied the user's search need, the user may not even click the document at the 5th position, and A's contribution to the list is small.
ERR is defined as follows:

\[ ERR=\sum_{r=1}^n{1\over r}\prod_{i=1}^{r-1}(1-R_i)R_r \]
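
A sketch of the ERR computation, where the stopping probability \(R_i\) is derived from graded relevance with the usual mapping \(R_i = (2^{rel_i}-1)/2^{rel_{\max}}\); this mapping is an assumption, since the text above only gives the final formula.

```python
import numpy as np

def err(rels, max_grade=5):
    """Expected Reciprocal Rank for a ranked list of graded relevance labels."""
    R = (2 ** np.asarray(rels, dtype=float) - 1) / (2 ** max_grade)  # stop probabilities
    p_reach, score = 1.0, 0.0
    for r, R_r in enumerate(R, start=1):
        score += p_reach * R_r / r      # user stops at rank r with prob. p_reach * R_r
        p_reach *= (1 - R_r)            # probability of continuing past rank r
    return score

print(err([5, 3, 2, 1, 2]))
```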
