Ranking Algorithms (3): Learning to Rank

Reposted from: https://blog.csdn.net/anshuai_aw1/article/details/86018105

Learning to rank (abbreviated LTR or L2R), also called machine-learned ranking, refers to any machine learning technique used for ranking.

Table of Contents

1. Introduction to LTR

1.1 Background of LTR

1.2 The basic LTR framework

2. Obtaining Training Data

2.1 Manual annotation

2.2 Search Logs

2.3 Public datasets

3. Feature Extraction

4. Model Training

4.1 Single-document method (Pointwise Approach)

4.2 Document-pair method (Pairwise Approach)

4.3 Document-list method (Listwise Approach)

4.4 Comparison of the methods

4.5 Learning to rank in XGBoost

4.5.1 Data format

4.5.2 XGBoost's pairwise ranking implementation

5. Evaluation Metrics

References

 


 

1. Introduction to LTR

1.1 Background of LTR

Using machine learning techniques to rank search results has been a very popular research direction in recent years. Information retrieval has been developing for decades, so why did the combination of machine learning and information retrieval appear only relatively late? There are two reasons.

One reason is that traditional information retrieval models considered only a small number of factors when ranking documents by relevance to a query, mainly term frequency, inverse document frequency, and document length. With so few factors, fitting a ranking formula by hand was feasible, and machine learning had little to offer: it shines when many features must be combined, whereas hand-tuning a formula over dozens of factors is unrealistic, and that kind of work is exactly what machine learning is good at. As search engines evolved, the number of factors to consider when ranking a web page kept growing: a page's PageRank value, the number of query terms matched in the document, the length of the page's URL, and so on all influence the ranking. Google's ranking formula reportedly takes around 200 factors into account. At that scale machine learning becomes genuinely useful, and this is the first reason.

The other reason is that supervised machine learning first requires a large amount of training data, on top of which a ranking model can be learned automatically, and relying on manual annotation for that much training data is unrealistic. For search engines, although it is impossible to label enough training data by hand, user click logs can serve as a substitute: when a user issues a query, the search engine returns results and the user clicks on some of the pages, and we can assume that the clicked pages are more relevant to the user's query. Although this assumption often does not hold, experience has shown that training a machine learning ranking system with such click data is indeed feasible.

In short, ranking by relevance is common in the field of information retrieval. A typical case is a search engine query: the engine returns the documents related to the query, sorts them by the degree of relevance between (query, document), and returns them to the user. As the number of factors affecting relevance grows, conventional ranking methods become hard to apply, and people hoped to solve this problem with machine learning, which led to the birth of LTR.

1.2 The basic LTR framework

At its core LTR is still machine learning, but the goal is not simply classification or regression: what matters most is outputting a ranked list of documents. Its usual framework is as follows:

The steps are: obtain training data -> extract features -> train the model -> predict on test data -> evaluate the results.

Next, we walk through each of these steps.

2. Obtaining Training Data

2.1 Manual annotation

Manually labeled data come in the following major types (a small sketch after the list makes them concrete):

  • Pointwise labels

    • Each query-document pair is given an absolute label
    • Binary labels: relevant vs. not relevant
    • Five-level labels: Perfect, Excellent, Good, Fair, Bad, where the last two levels are usually treated as not relevant

      Advantage: relatively little labeling effort, O(n)
      Disadvantage: the labeling standard is hard to define and hard to keep consistent

  • Pairwise labels

    • For a query q, label whether document d1 is more relevant than document d2, i.e. (q, d1) ≻ (q, d2)

      Advantage: easier to annotate
      Disadvantage: the amount of labeling is estimated to grow to O(n^2)

  • List labels

    • For a query q, manually produce an ideal complete ranking of the documents

      Advantage: compared with the two above, the label quality is the best
      Disadvantage: the labeling effort is very large
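
To make the three annotation schemes above concrete, here is a small sketch (Python, with made-up documents and grades) showing the same query annotated pointwise, pairwise, and listwise; the document names and grades are illustrative assumptions:

  from itertools import combinations

  # Made-up five-level pointwise grades for the documents of one query:
  # 4 = Perfect, 3 = Excellent, 2 = Good, 1 = Fair, 0 = Bad
  pointwise = {"d1": 3, "d2": 0, "d3": 4, "d4": 1}

  # Pairwise labels: one preference per pair of documents with different grades,
  # oriented so that the more relevant document comes first.
  pairwise = [(a, b) if pointwise[a] > pointwise[b] else (b, a)
              for a, b in combinations(pointwise, 2)
              if pointwise[a] != pointwise[b]]

  # Listwise label: a single ideal ordering of all documents for this query.
  listwise = sorted(pointwise, key=pointwise.get, reverse=True)

  print(pairwise)   # up to n*(n-1)/2 pairs per query, hence the O(n^2) cost
  print(listwise)   # ['d3', 'd1', 'd4', 'd2']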

2.2 Search Logs

Once a search engine has accumulated a large amount of user click data, that data becomes very valuable.

For example, suppose results A, B, and C sit at positions 1, 2, and 3. If B, which is ranked below A, nonetheless receives more clicks, then B's relevance may well be better than A's.

Click data implicitly reflects the relative relevance of search results under the same query. Among the results, one in a high position has a higher probability of being clicked than one in a low position, which is known as "click bias" (Click Bias). However, the approach above sidesteps this problem: we only record "click inversions", cases where a lower result out-clicks a higher one, and use such preferences as training data.

In practice, data beyond clicks is often used as well. For example, session logs can be mined for dimensions such as dwell time on a page, on the assumption that a longer dwell time deserves a higher score. In real scenarios, search logs often contain a lot of noise, and only top queries (queries searched frequently enough) produce search logs that are statistically meaningful.
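
As a rough sketch of how "click inversions" can be turned into preference pairs (the log format and numbers below are hypothetical; real click models are more involved):

  # Hypothetical click log for one query: results in ranked order,
  # with the number of clicks each received.
  ranked_results = ["docA", "docB", "docC"]
  clicks = {"docA": 2, "docB": 9, "docC": 1}

  # A "click inversion": a lower-ranked result got more clicks than a
  # higher-ranked one, so we record it as the preferred document.
  preference_pairs = []
  for i, hi in enumerate(ranked_results):
      for lo in ranked_results[i + 1:]:
          if clicks[lo] > clicks[hi]:
              preference_pairs.append((lo, hi))  # lo preferred over hi

  print(preference_pairs)  # [('docB', 'docA')]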

2.3 Public datasets

A number of public datasets are available:

  1. LETOR, http://research.microsoft.com/en-us/um/beijing/projects/letor/
  2. Microsoft Learning to Rank Dataset, http://research.microsoft.com/en-us/projects/mslr/
  3. Yahoo Learning to Rank Challenge, http://webscope.sandbox.yahoo.com/

3. Feature Extraction

Search engines use a series of factors to determine the ranking of results; each such factor is called a "feature". In my understanding, features can be divided into three categories:

  1. Features of the document itself: PageRank, content richness, whether it is spam, quality score, CTR, etc.
  2. Query-document features: the relevance between query and document, the number of times the query terms appear in the document, the proximity of the query terms (i.e. how many of the query terms can be found within one window in the document), and so on. Some query-document features are not explicit but semantic: even if the query terms do not appear in the document, the document may still be semantically related to the query.
  3. Query features: the number of times the query occurs among all queries, its proportion, etc.

This stage extracts all of these features for use in the subsequent training.
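
A minimal sketch of extracting two of the query-document features mentioned above, a query-term count and a crude proximity check; the tokenization, the window size, and the example strings are illustrative assumptions:

  def query_doc_features(query, doc, window=10):
      """Toy query-document features: query-term count in the doc and
      whether all query terms co-occur within one small window (proximity)."""
      q_terms = query.lower().split()
      d_terms = doc.lower().split()

      # Feature 1: how many times the query terms appear in the document.
      term_matches = sum(d_terms.count(t) for t in q_terms)

      # Feature 2: crude proximity -- do all query terms fit in one window?
      proximity = 0
      for start in range(max(1, len(d_terms) - window + 1)):
          if all(t in d_terms[start:start + window] for t in q_terms):
              proximity = 1
              break

      return [term_matches, proximity]

  print(query_doc_features("learning to rank", "learning to rank with xgboost"))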

4. Model Training

L2R algorithms fall into three main categories: the single-document method (Pointwise Approach), the document-pair method (Pairwise Approach), and the document-list method (Listwise Approach).

We introduce the three classes of algorithms in turn, and finally describe how XGBoost does learning to rank.

4.1 Single-document method (Pointwise Approach)

The single-document method treats each individual document as the unit of processing: each document is converted into a feature vector, and the machine learning system learns a classification or regression function from the training data to score documents; the scores then determine the search results. A simple example illustrates this approach.

The figure shows a manually annotated training set. In this example, each document is described by three features: the cosine similarity between the query and the document, the PageRank value of the page, and the proximity value of the query terms in the page. The relevance judgment is binary, i.e. either relevant or not relevant; of course, the relevance judgment can be extended to graded relevance, and it is simplified here only for ease of explanation.

The figure provides 5 training instances; each is labeled with the corresponding query, its three feature scores, and the relevance judgment. From this training data, the machine learning system needs to learn a linear scoring function:

Score(Q, D) = a*CS + b*PM + c*PR + d

In this formula, CS is the cosine similarity variable, PM the proximity value variable, and PR the PageRank variable, while a, b, c, and d are the parameters attached to these variables.

If the score is greater than a set threshold, the document can be considered relevant; if it is below the threshold, it can be considered irrelevant. From the training examples, an optimal combination of the parameters a, b, c, and d can be obtained; once the parameters are determined, learning is complete, and the scoring function can then be used to judge relevance. For a new query Q and document D, the system first obtains the three feature values of document D with respect to Q, computes the score with the learned parameters, and judges the document relevant if the score exceeds the threshold and irrelevant otherwise.
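
A minimal pointwise sketch of the idea above: fit the weights a, b, c, d of Score(Q, D) = a*CS + b*PM + c*PR + d with logistic regression on binary relevance labels. The feature values below are made up, not the actual table from the figure:

  import numpy as np
  from sklearn.linear_model import LogisticRegression

  # Made-up training instances: [cosine similarity, proximity, PageRank]
  X = np.array([[0.90, 7.0, 0.8],
                [0.15, 2.0, 0.1],
                [0.70, 5.0, 0.6],
                [0.30, 1.0, 0.2],
                [0.80, 6.0, 0.9]])
  y = np.array([1, 0, 1, 0, 1])     # binary relevance judgments

  model = LogisticRegression().fit(X, y)
  a, b, c = model.coef_[0]          # weights for CS, PM, PR
  d = model.intercept_[0]           # bias term

  # Score a new (query, document) pair and compare to a threshold of 0.5.
  new_doc = np.array([[0.60, 4.0, 0.5]])
  print(model.predict_proba(new_doc)[0, 1] > 0.5)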

The link to the Microsoft dataset was given in Section 2.3 above.

Pointwise form

  • Input: the feature vector of a document
  • Output: a relevance score for each document
  • Loss function: regression loss, classification loss, or ordinal classification loss

Advantages:

  • Fast, low complexity.

Disadvantages:

  • Mediocre ranking quality
  • Does not take the relative relationships between documents into account
  • In ranking, the few documents placed at the top have the greatest impact on ranking quality, and Pointwise does not account for this positional effect

Commonly used Pointwise algorithms:

  • Classification
    • Discriminative model for IR (SIGIR 2004)
    • McRank (NIPS 2007)
  • Regression
    • Least Square Retrieval Function (TOIS 1989)
    • Regression Tree for Ordinal Class Prediction (Fundamenta Informaticae, 2000)
    • Subset Ranking using Regression (COLT 2006)
  • Ordinal Classification
    • Pranking (NIPS 2002)
    • OAP-BPM (EMCL 2003)
    • Ranking with Large Margin Principles (NIPS 2002)
    • Constraint Ordinal Regression (ICML 2005)

4.2 Document-pair method (Pairwise Approach)

For a search task, the system receives a user query and returns a list of related documents, so the key issue is to determine the relative order between documents; Pairwise shifts the focus to judging whether the relative order of a pair of documents is reasonable.

Pairwise mainly turns the ranking problem into deciding the order of a pair of documents.

For query Q1, after manual labeling, Doc2 has the highest score of 5, followed by Doc3 with 4 points, and the worst is Doc1 with 3 points. This yields the relative relationships Doc2 > Doc1, Doc2 > Doc3, and Doc3 > Doc1; conversely, from such pairwise relationships the relevance ordering can be recovered. The ranking problem can therefore be reduced to judging the relative order of any two documents, and judging the order of two documents is just the familiar binary classification problem.
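
A sketch of the reduction just described, using the RankSVM-style trick of classifying feature-vector differences; the feature values are made up, and logistic regression stands in for any linear binary classifier:

  import numpy as np
  from sklearn.linear_model import LogisticRegression

  # Made-up feature vectors and graded labels for the documents under query Q1.
  feats = {"Doc1": np.array([0.2, 0.1]),
           "Doc2": np.array([0.9, 0.7]),
           "Doc3": np.array([0.6, 0.4])}
  grade = {"Doc1": 3, "Doc2": 5, "Doc3": 4}

  # Binary classification data: for every ordered pair where a is more relevant
  # than b, use the feature difference x_a - x_b with label 1, and the reversed
  # difference with label 0.
  X, y = [], []
  for a in feats:
      for b in feats:
          if grade[a] > grade[b]:
              X.append(feats[a] - feats[b]); y.append(1)
              X.append(feats[b] - feats[a]); y.append(0)

  clf = LogisticRegression().fit(np.array(X), np.array(y))

  # The learned weights w define a scoring function score(x) = w.x; ranking by
  # this score reproduces Doc2 > Doc3 > Doc1 on the toy data.
  scores = {d: float(clf.coef_[0] @ x) for d, x in feats.items()}
  print(sorted(scores, key=scores.get, reverse=True))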

 

Pairwise form

Input:

  • Two documents x_i and x_j under the same query, together with sign(y_i - y_j)
  • The relative relationship between the two labeled documents: if document x_i is more relevant than x_j, then sign(y_i - y_j) = 1
  • Only relationships between documents under the same query are retained

Output:

  • A scoring function that computes a score for each document, from which the ranking is derived

Commonly used Pairwise algorithms:

  • Learning to Retrieve Information (SCC 1995)
  • Learning to Order Things (NIPS 1998)
  • Ranking SVM (ICANN 1999)
  • RankBoost (JMLR 2003)
  • LDM (SIGIR 2005)
  • RankNet (ICML 2005)
  • FRank (SIGIR 2007)
  • MHR (SIGIR 2007)
  • GBRank (SIGIR 2007)
  • QBRank (NIPS 2007)
  • MPRank (ICML 2007)
  • IRSVM (SIGIR 2006)
  • LambdaRank (NIPS 2006)
  • LambdaMART (Inf. Retr. 2010)

Although the Pairwise approach improves on the Pointwise approach, it clearly has two problems:

  1. It only considers the relative order of two documents, and ignores where the documents appear in the search result list.
  2. The number of relevant documents varies greatly across queries; after conversion to document pairs, some queries may produce several hundred pairs while others only a few dozen, which ultimately makes it difficult to evaluate the learned model fairly.

4.3 Document-list method (Listwise Approach)

Unlike Pointwise and Pairwise, Listwise treats the entire list of search results for one query as a single training instance, hence the name document-list method.

The document-list method trains an optimal scoring function F from K training instances. For a new query, F scores every document, and sorting by score from high to low gives the corresponding search results. The key question is therefore: given the training data, how do we train the optimal scoring function?

Listwise methods fall into two main categories:

  • Measure-specific: the loss function is directly tied to an evaluation metric, for example L(F(x), y) = exp(-NDCG)
  • Non-measure-specific: the loss function is not explicitly tied to an evaluation metric, but takes some particularities of information retrieval into account

Here we introduce one training method, which is based on the probability distribution over permutations of the search results; the figure below illustrates this training process.

First, let us explain what the probability distribution over permutations of search results is. For a search engine, the user issues a query Q and the engine returns a set of results; assume the result set contains three documents A, B, and C. The engine must order these results, and the three documents have 6 possible permutations in total:

ABC, ACB, BAC, BCA, CAB, and CBA,

and each permutation is one possible ordering of the search results.

For a given scoring function F, scoring the relevance of the three result documents yields three relevance scores F(A), F(B), and F(C); from these three scores, the probability of each of the 6 permutations can be computed. Different scoring functions induce different probability distributions over the 6 permutations.

Having seen what the permutation probability distribution is, we now describe how to find the optimal scoring function from the training instances. The figure above shows a concrete training instance: query Q1 and the scores of its 3 documents. These scores were assigned by humans, so they can be regarded as the ground truth. Imagine there exists an optimal scoring function g whose scores for query Q1 are: document A gets 6 points, document B gets 4 points, and document C gets 3 points. Since the scores were assigned by humans, we do not know what g actually looks like; our task is to find a function whose ranking of Q1's results matches the human ranking as closely as possible. Since the human scores (the virtual function g) are known, we can compute the permutation probability distribution corresponding to g, shown in the middle of the figure. Suppose there are two other functions h and f whose computation is known; their scores for the 3 results are shown in the figure, and from those scores each function's permutation probability distribution can likewise be derived. Which of h and f is closer to the virtual optimal function g? Similarity is usually measured by the distance between two probability distributions, and the KL divergence is one tool for measuring how much two distributions differ. By computing the divergence between h and g and between f and g, we find that f is closer to the optimal function g than h is, so between these candidates we should prefer f as the scoring function to use in future searches. The training process is exactly this: among the candidate functions, find the one closest to the virtual optimal function g and use it as the training result, i.e. as the scoring function at search time.
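
A sketch of this idea using the Plackett-Luce model on exponentiated scores, which is one common way to turn scores into a distribution over permutations. The scores for g are taken from the text (A = 6, B = 4, C = 3); the scores for h and f are made up so that f comes out closer to g, as in the example:

  from itertools import permutations
  import math

  def perm_probs(scores):
      """Plackett-Luce probability of each permutation of the documents,
      using exp(score) as each document's 'strength'."""
      probs = {}
      for perm in permutations(scores):
          p, remaining = 1.0, list(perm)
          while remaining:
              weights = [math.exp(scores[d]) for d in remaining]
              p *= math.exp(scores[remaining[0]]) / sum(weights)
              remaining.pop(0)
          probs[perm] = p
      return probs

  def kl(p, q):
      """KL divergence between two distributions over the same permutations."""
      return sum(p[k] * math.log(p[k] / q[k]) for k in p)

  g = {"A": 6, "B": 4, "C": 3}          # "ideal" human scores from the text
  h = {"A": 4, "B": 6, "C": 3}          # candidate function h (made up)
  f = {"A": 5, "B": 4, "C": 2}          # candidate function f (made up)

  pg, ph, pf = perm_probs(g), perm_probs(h), perm_probs(f)
  print(kl(pg, ph), kl(pg, pf))         # f's distribution is closer to g's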

The example above only describes how to find the optimal function from a single training instance. In reality we have K training instances, but the training process is analogous: we can assume a virtual optimal scoring function g (in fact the human scores) exists, and training searches over all candidate functions on all training instances and selects the one whose KL divergence from g is smallest as the scoring function actually used. Empirical results show that machine-learned ranking based on the document-list method outperforms the two methods described earlier.

Commonly used Listwise algorithms:

  • Measure-specific
    • AdaRank (SIGIR 2007)
    • SVM-MAP (SIGIR 2007)
    • SoftRank (LR4IR 2007)
    • RankGP (LR4IR 2007)
    • LambdaMART (Inf. Retr. 2010) (can also be used as a Listwise method)
  • Non-measure specific
    • ListNet (ICML 2007)
    • ListMLE (ICML 2008)
    • BoltzRank (ICML 2009)

4.4 Comparison of the methods

In industry, Pairwise is probably still the most widely used, because constructing training samples is relatively convenient and the complexity is acceptable.

4.5 Learning to rank in XGBoost

XGBoost provides a learning-to-rank interface, which I briefly introduce here. LightGBM also offers a ranking interface similar to XGBoost's.

The official XGBoost learning-to-rank documentation is taken as the reference here.

For ranking, the dataset generally needs to be formatted as grouped input. For example, when learning to rank web pages, the page data is divided into groups according to the different queries.

4.5.1 Data format

① Basic data file train.txt
XGBoost accepts data in libSVM format, for example:


  
  
  1 101:1.2 102:0.03
  0 1:2.1 10001:300 10002:400
  0 0:1.3 1:0.3
  1 0:0.01 1:0.3
  0 0:0.2 1:0.3

Each line has the form:

  label  feature1_index:value  feature2_index:value  ...

  
  

② Group index file train.txt.group
In addition to the grouped input format, XGBoost needs a file that records the group information. The group index file train.txt.group has the following format:


  
  
  2
  3

This means the dataset contains 5 instances: the first two form one group and the last three form another group.

③ Instance weight file train.txt.weight

XGBoost also supports adjusting the weight of each instance; the data format is as follows:


  
  
  1
  0.5
  0.5
  1
  0.5
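
Putting the three files together, a minimal training sketch with the XGBoost Python API might look like the following; the file name follows the example above and the parameter values are illustrative, not tuned:

  import xgboost as xgb

  # Load the libSVM-format file from the example above (newer XGBoost versions
  # may want the explicit "train.txt?format=libsvm" URI). Depending on the
  # version, train.txt.group / train.txt.weight next to it may be read
  # automatically; setting the groups explicitly is the safe option.
  dtrain = xgb.DMatrix("train.txt")
  dtrain.set_group([2, 3])              # first 2 rows = group 1, next 3 = group 2

  params = {
      "objective": "rank:pairwise",     # pairwise (LambdaRank-style) objective
      "eta": 0.1,
      "max_depth": 6,
      "eval_metric": "ndcg",
  }
  model = xgb.train(params, dtrain, num_boost_round=50)

  # Prediction returns one score per document; rank the documents within each
  # query group by this score.
  print(model.predict(dtrain))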

4.5.2 XGBoost's pairwise ranking implementation

XGBoost uses the LambdaRank algorithm; see reference [2] for the underlying theory.

How are the pairs constructed?
Pair construction starts at line 75 of xgboost/src/objective/rank_obj.cc. As the theory above says, the direction and magnitude of each document's movement depend on all the other documents with a different label, so we only need to construct "directed document pairs" between documents with different labels. The method is roughly: iterate over all samples, and for each sample take an arbitrary sample from the buckets of the other labels (labels different from the current sample's) to form a pair.
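
A rough Python illustration of the pair sampling described above; this is a sketch of the idea, not the actual C++ code in rank_obj.cc:

  import random
  from collections import defaultdict

  random.seed(0)

  def sample_pairs(labels):
      """For each document, sample one document with a different label from the
      same query group and emit a (better, worse) pair."""
      buckets = defaultdict(list)            # label -> document indices
      for idx, lab in enumerate(labels):
          buckets[lab].append(idx)

      pairs = []
      for idx, lab in enumerate(labels):
          other_labels = [l for l in buckets if l != lab]
          if not other_labels:
              continue
          other = random.choice(buckets[random.choice(other_labels)])
          # orient the pair so the higher-labeled document comes first
          pairs.append((idx, other) if lab > labels[other] else (other, idx))
      return pairs

  print(sample_pairs([2, 1, 0, 1]))          # labels for one query group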

How is the gradient defined?
In xgboost/src/objective/rank_obj.cc, the code indicates that a lambda weight (LambdaWeight) is used.
The gradients and the document pairs are then fed into the GBDT training.
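
For reference, the pairwise lambda used by RankNet/LambdaRank-style objectives has roughly the following shape; this is a sketch of the standard formula from the LambdaMART literature, not the exact code in rank_obj.cc. LambdaMART additionally scales it by the |ΔNDCG| obtained from swapping the pair:

  import math

  def lambda_ij(s_i, s_j, delta_ndcg=1.0, sigma=1.0):
      """Lambda gradient pushing document i above document j, where i is
      labeled more relevant; delta_ndcg is the NDCG change from swapping."""
      return -sigma / (1.0 + math.exp(sigma * (s_i - s_j))) * abs(delta_ndcg)

  # The more the model already separates i from j, the smaller the push.
  print(lambda_ij(2.0, 0.5), lambda_ij(0.5, 2.0))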

What is the output?
According to the LambdaMART formulation, the output is the model's score for each document. I still have a doubt here: is it actually LambdaMART or LambdaRank? I need to find time to look into these two algorithms.

5. Evaluation Metrics

See my earlier article, "An intuitive explanation of the IR evaluation metrics MAP, MRR, and NDCG".
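
As a quick reminder of one of those metrics, here is a minimal NDCG sketch using the common 2^rel - 1 gain and log2(position + 1) discount; the relevance grades are made up:

  import math

  def dcg(rels):
      return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(rels))

  def ndcg(rels):
      ideal = dcg(sorted(rels, reverse=True))
      return dcg(rels) / ideal if ideal > 0 else 0.0

  # Relevance grades of the results in the order the system returned them.
  print(ndcg([3, 2, 3, 0, 1]))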

References

[1] A brief introduction to Learning to Rank for machine-learned ranking — a general overview of LTR

[2] Machine learning algorithms: a first look at Learning to Rank — a general overview of LTR

[3] Machine learning algorithms: a closer look at L2R — describes how ranking is used in XGBoost

[4] XGBoost for Ranking — describes how to use XGBoost for ranking

[5] Learning to rank fundamentals

[6] A summary of basic learning-to-rank algorithms — a Zhihu article briefly introducing RankNet, LambdaRank, and LambdaMART

[7] Understanding and applying Learning to Rank — a Zhihu article describing some of the ideas

[8] Applying LTR (Learning to Rank) to personalized e-commerce search

 
