Summary of Retrieval Model and Evaluation Metrics

Classic retrieval model

 

The information retrieval model has evolved through several stages since its inception: models based on set theory, on linear algebra, and on statistics and probability. Although expert retrieval differs from traditional information retrieval, the two remain closely related, and this paper uses retrieval over expert description documents as the Baseline for subsequent optimization. It is therefore necessary to understand the traditional retrieval models, and this section briefly introduces the classic models of each stage.

2.1.1.1 Boolean Model

The Boolean model is a simple but elegant model built on set theory and Boolean algebra. It attracted much attention in the past and was widely used in many early commercial search engines. In recent years it has gradually been replaced by vector space models and probability models, but it still has a place in information retrieval and serves as a good Baseline.

The Boolean model rests on the following assumptions: 1. A document can be represented as a set of words. 2. A query can be expressed as a Boolean expression of keywords connected by the logical operators AND, OR, and NOT. 3. A document is relevant if and only if its index terms satisfy the query's Boolean expression.

For example, given the user query "Apple AND (Jobs OR ipad4)", an article is relevant and meets the user's needs if it contains "Apple" together with at least one of "Jobs" or "ipad4".

Boolean models are easy to understand, easy to implement, and theoretically supported by Boolean algebra.

But the Boolean model has a fatal deficiency: relevance is strictly binary (0 or 1), i.e., a document is either relevant or completely irrelevant. The model cannot reflect different degrees of relevance, and the results it returns are unordered and too coarse. For example, given

Q = World AND Cup AND South AND Africa AND 2010, Index(d) = {World, Cup, South, Africa},

d will be considered irrelevant. Likewise, given

Q = World OR Cup OR South OR Africa OR 2010, Index(d1) = {World, Cup, South, Africa, 2010},

Index(d2) = {2010}, d1 and d2 will be considered equally relevant, although d1 is clearly more relevant. Furthermore, it may not be practical to expect ordinary users to construct proper Boolean query expressions themselves.
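The set-based matching described above can be sketched in a few lines; the toy corpus and the hardcoded query below are illustrative, not from the text:

```python
# Boolean model sketch: documents are sets of terms,
# and a query is evaluated with set-membership tests.

docs = {
    "d1": {"Apple", "Jobs", "iPhone"},
    "d2": {"Apple", "ipad4"},
    "d3": {"Jobs", "banana"},
}

def matches(doc_terms):
    # Query: Apple AND (Jobs OR ipad4)
    return "Apple" in doc_terms and ("Jobs" in doc_terms or "ipad4" in doc_terms)

relevant = [d for d, terms in docs.items() if matches(terms)]
print(sorted(relevant))  # d1 and d2 satisfy the expression; d3 does not
```

Note that the result is an unordered set of matching documents: the model offers no way to say that one match is better than another.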

2.1.1.2 Vector Space Model (VSM)

Salton, a founder of the information retrieval field, proposed the VSM in the 1970s. In contrast to the strict binary relevance of the Boolean model, it introduced a partial-matching retrieval strategy. As a model for document representation and similarity computation, VSM is widely used not only in retrieval but also in text mining, natural language processing, and other fields.

VSM represents both the query and the document as sets of words and maps them into a high-dimensional space in which each dimension corresponds to a word in the document collection. The similarity between the vector representing the query's word set and the vector representing the document's word set then serves as the relevance of the document to the query.

When mapping words into the high-dimensional space, VSM converts them into weights. The most common mapping function is TF*IDF, which considers a word's occurrences both in the document and in the document collection. A basic TF*IDF formula is:

ω_i = tf_i(d) * log(N / df_i)                   (2-1)

 

 


where N is the number of documents in the collection, tf_i(d) (the term frequency) is the number of times word i appears in document d, and df_i (the document frequency) is the number of documents in the collection that contain word i. According to the TF*IDF formula, the higher a word's frequency in the document, the greater its weight, indicating that the word more strongly characterizes the document; but the more documents in the collection contain the word, the smaller its weight, indicating that the word is less discriminative and characterizes the document more weakly.
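Formula (2-1) can be sketched directly; the toy counts (N, tf, df) below are invented for illustration:

```python
import math

# Basic TF*IDF weight from formula (2-1): w = tf_i(d) * log(N / df_i).

def tfidf(tf, df, N):
    return tf * math.log(N / df)

N = 1000                     # documents in the collection
w_rare = tfidf(5, 10, N)     # frequent in d, rare in the collection -> large weight
w_common = tfidf(5, 900, N)  # frequent in d, common in the collection -> small weight
print(w_rare > w_common)     # True
```

A word that occurs in every document (df = N) gets weight 0, reflecting that it has no discriminative power.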

In the VSM model, feature weights follow the TF*IDF computation framework, but the specific TF and IDF formulas have various variants:

One TF variant is W_tf = 1 + log(TF), which suppresses the side effects of excessive term frequency. If a word appears 10 times in one document and once in another, formula (2-1) makes their TF differ by a factor of 10; in practice such a large gap is unnecessary, so the log dampens it. The 1 in the formula smooths the case where the term frequency is 1.

Another TF variant is W_tf = α + (1 - α) * TF / Max(TF), which suppresses long documents. TF is the actual frequency of the word in the document, and Max(TF) is the frequency of the most frequent word in the document. α is a tuning factor; newer research shows that α = 0.4 works better.

One IDF variant is W_idf = log((N + 1) / df_i), where each symbol has the same meaning as in formula (2-1); the 1 smooths the case where df equals N.

Under the TF*IDF framework, if a word's TF and IDF are each classified as high or low, its combined weight falls into one of four quadrants, as shown in Table 2.1.

Table 2.1 Combined TF*IDF weight

           | TF high | TF low
  IDF high | high    | medium
  IDF low  | medium  | low

 

VSM assumes that the closer two vectors are in the high-dimensional space, the more relevant their contents are. The similarity between a query and a document can therefore be expressed as the cosine of the angle between their vectors in that space:

 

 

cos(D, Q) = (Σ_i w_i(D) * w_i(Q)) / (sqrt(Σ_i w_i(D)^2) * sqrt(Σ_i w_i(Q)^2))                   (2-2)

The more similar D and Q are, the smaller the angle between their vectors and the larger cos(D, Q); a larger angle means the two vectors differ more and the similarity between D and Q is lower. However, formula (2-2) has an obvious defect: it over-suppresses long documents. Suppose the weights of the words related to the retrieval topic are similar in a long and a short document, but the long document also discusses other topics. Then the numerator of the cosine formula is essentially the same for both, while the vector norm in the denominator is larger for the long document, so cos(D_long, Q) comes out smaller than cos(D_short, Q).
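The cosine similarity of formula (2-2) can be sketched over term-weight dictionaries; the weights below are illustrative:

```python
import math

# Cosine similarity, formula (2-2), between a document and a query
# represented as sparse term -> weight mappings.

def cosine(d, q):
    dot = sum(w * q.get(t, 0.0) for t, w in d.items())
    norm_d = math.sqrt(sum(w * w for w in d.values()))
    norm_q = math.sqrt(sum(w * w for w in q.values()))
    return dot / (norm_d * norm_q) if norm_d and norm_q else 0.0

doc = {"world": 0.5, "cup": 0.4, "africa": 0.3}
query = {"world": 1.0, "cup": 1.0}
print(round(cosine(doc, query), 3))  # 0.9
```

Adding extra off-topic terms to `doc` enlarges the denominator without changing the numerator, which is exactly the long-document suppression discussed above.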

In addition to the cosine, the following common similarity measures can be used, as shown in Table 2.2 (X and Y denote the two term sets being compared).

Table 2.2 Commonly used similarity measures

  Method name           | Formula
  Simple matching       | |X ∩ Y|
  Dice's Coefficient    | 2|X ∩ Y| / (|X| + |Y|)
  Jaccard's Coefficient | |X ∩ Y| / |X ∪ Y|
  Overlap Coefficient   | |X ∩ Y| / min(|X|, |Y|)

 

 

By computing the similarity between the query and each document, the documents can be ranked by similarity and returned as results. VSM is also easy to implement, and the relevant documents it returns can be ranked. However, when the vector dimensionality is very high, the computation becomes time-consuming and resource usage grows sharply. VSM also carries a basic assumption that the keyword terms are mutually independent, whereas in fact most terms are dependent; words in natural language are closely related (for example, through co-occurrence), and researchers later proposed N-gram language models to improve on this. In addition, VSM is an empirical model, refined gradually through experience and intuition, and its theoretical support is not very strong.

 

2.1.1.3 Probabilistic Model

Probabilistic models were proposed by Robertson and Sparck Jones in 1976; they use relevance feedback to refine the results step by step until the desired query results are obtained. The basic idea of the probability model is this: given a query Q, the documents in the collection are divided into two classes, the set R relevant to Q and the set R' not relevant to Q. Within a class, the distribution of each index term is the same or similar; across classes, the distributions differ. Thus, by estimating the distribution of each index term over the documents, we can determine the relevance between a document and the query, that is:

 

 

P(R=1|d) / P(R=0|d) = [P(d|R=1) * P(R=1)] / [P(d|R=0) * P(R=0)]                   (2-3)

 

Here P(R=1) and P(R=0) depend only on the specific Q, so the ratio P(R=1)/P(R=0) is fixed. P(d|R=1) is the probability of document d occurring in the set of documents relevant to Q, and P(d|R=0) is the probability of d occurring in the set of documents not relevant to Q.

The most widely used probability model formula to date is the BM25 formula proposed by Robertson:

 

 

score(q, d) = Σ_{t∈q} log((N - df_t + 0.5) / (df_t + 0.5)) * ((k_1 + 1) * tf) / (K + tf) * ((k_3 + 1) * qtf) / (k_3 + qtf)                   (2-4)

 

 

K = k_1 * ((1 - b) + b * dl / avdl)                   (2-5)

where qtf is the frequency of word t in the query, tf is the frequency of word t in document d, K is the document-length normalization (dl is the document length and avdl the average document length), and the smoothed log term is the inverse document frequency of word t. k_1, b, and k_3 are empirical parameters.

The probability model has a rigorous mathematical foundation and strong theoretical grounding; it is one of the best-performing models at present, as confirmed in evaluation projects such as TREC. However, the model depends heavily on the text collection, its formula requires parameter estimation, and it relies on the binary independence assumption.
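A minimal sketch of BM25 scoring as in formulas (2-4) and (2-5); the toy corpus statistics and the parameter defaults (k1 = 1.2, b = 0.75, k3 = 7) are illustrative assumptions:

```python
import math

def bm25_term(tf, qtf, df, N, dl, avdl, k1=1.2, b=0.75, k3=7.0):
    # Smoothed inverse document frequency of the term.
    idf = math.log((N - df + 0.5) / (df + 0.5))
    # Document-length normalization K, formula (2-5).
    K = k1 * ((1 - b) + b * dl / avdl)
    # Term contribution to the score, formula (2-4).
    return idf * ((k1 + 1) * tf / (K + tf)) * ((k3 + 1) * qtf / (k3 + qtf))

def bm25(query, doc, df, N, avdl):
    dl = len(doc)
    score = 0.0
    for t in set(query):
        tf = doc.count(t)
        if tf:
            score += bm25_term(tf, query.count(t), df[t], N, dl, avdl)
    return score

# Toy corpus statistics (illustrative).
df = {"world": 2, "cup": 1}
doc = ["world", "cup", "world", "news", "today"]
score = bm25(["world", "cup"], doc, df, N=10, avdl=5.0)
print(score > 0.0)  # True
```

The saturation term (k1+1)·tf/(K+tf) grows sublinearly in tf, so repeating a query word many times yields diminishing returns, and K penalizes documents longer than average.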

2.1.1.4 Language Model

Before being applied to information retrieval, language models had already been applied successfully in speech recognition, machine translation, and Chinese word segmentation, offering high accuracy and ease of training and maintenance.

Language model construction falls roughly into two categories: one relies entirely on large-scale text data for statistical modeling; the other is the deterministic language model based on Chomsky's formal languages, which pays more attention to grammatical analysis.

In terms of basic ideas, other retrieval models reason from query to document, that is, how to find relevant documents given a user query. The language model works in the opposite direction: it reasons from document to query, building a separate language model for each document, computing the probability that the document generates the query, and ranking by this probability to produce the final results.

When applied to IR, a language model is tied to each document. Given a query q, documents are ranked by the query likelihood, i.e., the probability that the document's language model generates the query. Assuming terms are mutually independent, a unigram language model gives:

 

 

P(q|d) = Π_{i=1}^{M} P(t_i|d)                   (2-6)

where the query q consists of the words t_1, ..., t_M.

However, the language model faces a data sparseness problem: if a query word does not appear in the document, the entire generation probability becomes 0. Language models therefore introduce data smoothing, a "rob the rich to help the poor" redistribution of probability mass so that every word receives a non-zero probability. Language retrieval models often use a background probability over the document set for smoothing. The background probability is an overall language model of the collection; because the collection is relatively large, most query words occur in it, which avoids zero probabilities. If the collection also uses a unigram language model, the background probability of a word is its number of occurrences divided by the total number of word occurrences in the collection. With data smoothing, the probability of generating a query word becomes:

 

 

P(t_i|d) = λ * (TF / LEN) + (1 - λ) * P(t_i|C)                   (2-7)

where P(t_i|C) is the background language model for t_i, TF/LEN is the document language model for t_i, the whole probability is a linear interpolation of the two parts, and λ ∈ [0, 1] is a tuning factor.

Later, many language model variants appeared in IR, some going beyond the query-likelihood framework, such as models based on the KL distance. In current retrieval evaluation projects, the language-model approach with tuned parameters performs slightly better than the vector space model and on par with probability models such as BM25.
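The smoothed query likelihood of formulas (2-6) and (2-7) can be sketched as follows; the two-document collection and λ = 0.7 are illustrative:

```python
import math

def query_log_likelihood(query, doc, collection, lam=0.7):
    # Log of formula (2-6), with each P(t|d) smoothed by formula (2-7).
    clen = sum(len(d) for d in collection)
    logp = 0.0
    for t in query:
        p_doc = doc.count(t) / len(doc)                    # TF / LEN
        p_bg = sum(d.count(t) for d in collection) / clen  # P(t|C)
        # Linear interpolation keeps the probability non-zero even for
        # query words that never occur in the document.
        logp += math.log(lam * p_doc + (1 - lam) * p_bg)
    return logp

docs = [["world", "cup", "south", "africa"], ["2010", "world"]]
s1 = query_log_likelihood(["world", "2010"], docs[0], docs)
s2 = query_log_likelihood(["world", "2010"], docs[1], docs)
print(s2 > s1)  # the document containing both query words scores higher
```

Without the background term, the score of the first document would be log(0) for "2010"; the interpolation is exactly the smoothing the text describes.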

 

2.1.1.5 DFR Model

The DFR (Divergence From Randomness) model was proposed by Gianni Amati and Keith van Rijsbergen in 2002. Grounded in statistics and information theory, it is one of the more effective retrieval models today and is a parameter-free statistical retrieval model.

Inf(tf|d) denotes the amount of information each word contributes in document d. Ranking the documents then means placing those containing more information toward the front.

 

where TF is the word's frequency in the document set, l(d) is the length of document d, tf is the word's frequency within the document of length l(d), and TFC is the total number of words in the document set.

DFR models fall into two types: models over each individual document (Type I) and models over the entire document set (Type II).

The Type I model DLH uses entropy as the weight of word t:

The Type II model DPH uses another form as the weight of word t:

 

 

 

Retrieval Model Evaluation Metrics

 

The two most commonly used basic metrics in information retrieval are precision and recall, as shown in Figure 2.3:

 

Figure 2.3 Schematic diagram of precision and recall

The definitions of precision and recall are as follows:

Precision = number of relevant documents in the returned results / number of returned results                   (2-8)

Recall = number of relevant documents in the returned results / number of all relevant documents                   (2-9)

A metric that combines precision (P) and recall (R) is the F-value:

 

 

F = ((β^2 + 1) * P * R) / (β^2 * P + R)                   (2-10)

β < 1 emphasizes precision, and β > 1 emphasizes recall. Usually β is set to 1, treating precision and recall equally, which gives:

 

 

F = 2 * P * R / (P + R)                   (2-11)
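Precision, recall, and the F-value with β = 1 (formulas (2-8), (2-9), (2-11)) can be sketched as follows; the returned and relevant sets are invented for illustration:

```python
def precision_recall_f1(returned, relevant):
    hits = len(returned & relevant)
    p = hits / len(returned)                      # formula (2-8)
    r = hits / len(relevant)                      # formula (2-9)
    f1 = 2 * p * r / (p + r) if p + r else 0.0    # formula (2-11)
    return p, r, f1

# Illustrative result sets.
returned = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d5"}
p, r, f1 = precision_recall_f1(returned, relevant)
print(p, r)  # 0.5 and 2/3
```

Returning more documents can only raise recall, while it tends to lower precision; F1 balances the two.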

The evaluation metrics commonly used in TREC are:

1. Average Precision (AP)

 

 

AP = (1 / R_q) * Σ_{i=1}^{R_q} [ i / DocQ(i) ]                   (2-12)

AP is the average of the precision values obtained at each relevant document, where R_q is the number of documents relevant to query q, and DocQ(i) is the total number of documents retrieved when the i-th relevant document is found, i.e., the position of the i-th relevant document.

2. MAP (Mean Average Precision)

Based on AP, MAP is the mean of AP over a set of queries:

 

 

MAP = (1 / |Q|) * Σ_{q∈Q} AP(q)                   (2-13)

3. P@n

P@n is the proportion of relevant documents among the top n returned documents.
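AP (2-12), MAP (2-13), and P@n can be sketched for a ranked result list; the ranking and relevance judgments below are illustrative:

```python
def average_precision(ranking, relevant):
    # AP, formula (2-12): average the precision measured at the position
    # of each relevant document, divided by the number of relevant docs.
    hits, total = 0, 0.0
    for pos, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / pos
    return total / len(relevant)

def mean_average_precision(runs):
    # MAP, formula (2-13): mean of AP over a set of queries.
    return sum(average_precision(r, rel) for r, rel in runs) / len(runs)

def p_at_n(ranking, relevant, n):
    # P@n: fraction of relevant documents among the top n results.
    return sum(1 for d in ranking[:n] if d in relevant) / n

# Illustrative ranked list and judgments.
ranking = ["d1", "d2", "d3", "d4", "d5"]
relevant = {"d1", "d3", "d5"}
ap = average_precision(ranking, relevant)  # (1/1 + 2/3 + 3/5) / 3
```

Unlike P@n, AP rewards placing relevant documents early: swapping d1 and d2 in the ranking lowers AP but leaves P@5 unchanged.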

4. NDCG (Normalized Discounted Cumulative Gain)

Documents are not merely relevant or irrelevant; they can be divided into different relevance levels, such as the five-level relevance mentioned earlier: 0, 1, 2, 3, and 4. The calculation of NDCG is more involved and proceeds through CG (Cumulative Gain), DCG (Discounted Cumulative Gain), and finally NDCG.

DCG is an evaluation metric for multi-level relevance and includes a position discount factor.

 

 

DCG(k) = Σ_{r=1}^{k} G(r) / log_2(1 + r)                   (2-14)

For the ranked list of query q, r denotes the position and l(r) the relevance level of the document at position r. G(r) is the gain of that document, generally set to G(r) = 2^l(r) - 1, and the position discount factor is generally set to 1 / log_2(1 + r).

NDCG is the normalized DCG:

 

 

NDCG(k) = DCG(k) / Z_k                   (2-15)

where Z_k is the DCG value of the documents arranged in ideal descending order of relevance. A worked example is given in the table below.

Table 2.3 Example NDCG calculation

  Doc Id | Relevance level | Gain | CG | DCG                 | Max DCG | NDCG
  1      | 5               | 31   | 31 | 31 = 31*1           | 31      | 1 = 31/31
  2      | 2               | 3    | 34 | 32.9 = 31+3*0.63    | 40.5    | 0.81
  3      | 4               | 15   | 49 | 40.4 = 32.9+15*0.5  | 48.0    | 0.84
  4      | 4               | 15   | 64 | 46.9                | 54.5    | 0.86
  5      | 4               | 15   | 79 | 52.7                | 60.4    | 0.87
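The DCG/NDCG computation above can be sketched as follows, using the gain 2^l(r) - 1 and discount 1/log_2(1 + r); the relevance levels reproduce the worked example in the table:

```python
import math

def dcg(levels):
    # Formula (2-14): gain 2^l - 1 discounted by 1/log2(1 + position).
    return sum((2 ** l - 1) / math.log2(1 + r)
               for r, l in enumerate(levels, start=1))

def ndcg(levels, k):
    # Formula (2-15): normalize by Z_k, the DCG of the ideal ordering.
    ideal = sorted(levels, reverse=True)
    return dcg(levels[:k]) / dcg(ideal[:k])

levels = [5, 2, 4, 4, 4]          # relevance levels at positions 1..5
print(round(ndcg(levels, 2), 2))  # 0.81, matching the table entry
```

Note that the ideal ordering is taken over the whole judged list before truncating to k, which is how the Max DCG column in the table is obtained.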

 
