Network Information Retrieval (1): Retrieval Models - Boolean, Vector Space, and Probabilistic Retrieval

I. Basic Concepts

1. Why build models?

A model is an abstraction of a process or object, built so that its properties can be studied, conclusions drawn, and predictions made. The quality of those conclusions depends on how closely the model represents reality; robotics is a familiar example.

2. What is a retrieval model?

The core problem of IR is predicting which documents are relevant and which are not. The heart of that problem is ranking: computing each document's relevance to the query so that the documents can be ordered.
A retrieval model specifies the following details:
 Document representation: how the documents in the collection are represented
 Query representation: how the query entered by the user is represented
 Comparison function: how documents are scored and ranked against the query
Different retrieval models use different notions of relevance.

3. Formal characterization of a retrieval model

(Figure: formal characterization of a retrieval model)

II. A General Method: the Shared Bag of Words

 Determining relevance during retrieval is very difficult: user needs are vague and wrapped in context, requirements, and so on, so the representation must be kept as simple as possible.
(1) The shared-bag-of-words assumption (the most basic assumption): relevance is determined by words.
All words (in documents and queries) come from the same vocabulary; when certain words match between the two, the document is considered relevant.
(2) The bag-of-words method: word-based, ignoring grammatical structure.
Each document is treated as a bag filled with words; information retrieval then amounts to matching the words in a document against the words in the query.
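As a minimal sketch (my own Python illustration, not from the original post; the stop-word list and helper name are assumptions), a bag of words can be built with a counter, and the shared-bag assumption reduces relevance to whether document and query share any word:

```python
from collections import Counter

# Illustrative stop-word list; a real system would use a fuller one.
STOP_WORDS = frozenset({"a", "an", "the", "of", "to", "on"})

def bag_of_words(text):
    """Lowercase, strip trailing punctuation, drop stop words, count the rest."""
    tokens = (t.strip(".,!?;:").lower() for t in text.split())
    return Counter(t for t in tokens if t and t not in STOP_WORDS)

doc = bag_of_words("The cat sat on the mat. The cat slept.")
query = bag_of_words("sleeping cat")
shared = doc.keys() & query.keys()  # shared-bag relevance: any word in common?
```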

1. Index terms

Definition of an index term: a keyword (word) or phrase extracted from a document or taken from a dictionary or thesaurus, reflecting the primary or secondary content of the material at some level of description.

 Index terms are generally the nouns in a document, but identifying nouns reliably has always been particularly difficult, so in practice we usually just remove stop words (words like "the" and "of").

 From here on, assume that every word other than the stop words is an index term (the text representation); storing this data is generally not a problem.

2. Term weights

Not every word expresses the theme of a document, so each word is given a weight measuring its importance to that document. Note that a word's weight is meaningful only within the corresponding document. The term weight expresses how well the word indexes the document's content.
(Figure: term weighting)

3. The classic retrieval models

(Figure: the classic retrieval models)
The core question for each model: how are the two representations defined, and how is the comparison performed?

III. The Boolean Model

 A simple retrieval model based on set theory.
 Index term weights are binarized: weight 1 if the term appears in the document, 0 if it does not.
 A query is a Boolean expression over terms (AND, OR, NOT). For example, to find documents containing both "information" and "learning": information AND learning; containing either: information OR learning.
AND/OR/NOT queries are a bit hard for a computer to process directly, so a query is generally rewritten as a disjunction of conjunctions (disjunctive normal form, DNF), which the computer can evaluate.

1. Example

(Figure: a Boolean model example)

2. Similarity measure

 If a document satisfies the Boolean query, its similarity is 1; otherwise it is 0. There is no third possibility.

3. Retrieval steps

 Step 1: represent each document in the collection as a binary vector over the index terms (the document representation).
 Step 2: express the query in DNF (the query representation).
 Step 3: compute the similarity (0 or 1) of each document to the query (the comparison).
 Step 4: if the similarity is 1 the document matches and can be output as a result; if it is 0 the document does not meet the user's need (the output).
The model thus accomplishes the two representations and the comparison, but it cannot rank results; this binary-only output is its shortcoming.
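The steps above can be sketched as follows (an illustrative toy with my own data and function names; a DNF query is modeled as a list of conjunctions of (term, required) pairs, where required=False encodes NOT):

```python
# Toy document collection: each document is its set of index terms.
docs = {
    "d1": {"information", "retrieval", "model"},
    "d2": {"machine", "learning", "model"},
    "d3": {"information", "learning"},
}

def matches(doc_terms, dnf):
    """A document matches if it satisfies at least one conjunction."""
    return any(
        all((term in doc_terms) == required for term, required in conj)
        for conj in dnf
    )

# Query: information AND learning (already a single conjunction).
dnf = [[("information", True), ("learning", True)]]
results = [d for d, terms in docs.items() if matches(terms, dnf)]  # ["d3"]
```

Similarity here is exactly the 0/1 decision of step 3: `matches` returns True (1) or False (0), with no ranking possible.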

4. Example Searches

(Figure: example Boolean searches)

5. Discussion of the Boolean model

(1) Advantages

AND can express relationships between concepts.
OR can express alternative vocabulary (synonyms).
NOT can express exclusion (antonyms).
 Precise and efficient (0/1 weights and DNF evaluation are very fast), simple and elegant.
(2) Disadvantages

 Natural language is complex. AND can assert relationships that do not exist, forcibly binding together words from different sentences and paragraphs. The right OR terms are very hard to guess, and synonyms must be chosen manually. Guessing which words to exclude with NOT is harder still.
 Everything is an exact match: decisions are binary, with no partial matching, yet IR is in practice a fuzzy task. Queries often retrieve far too few or far too many documents.
 No result ranking: there is no ordering mechanism, only relevant vs. irrelevant, with no degrees in between.
 Queries must be Boolean expressions: users have to phrase their needs with AND, OR, and NOT. This burden is too heavy, so the queries users build are usually too simple.

IV. The Vector Space Model (the Mainstream Approach)

 Also called the vector model. Main idea: in a high-dimensional space, every object (document, query, term) can be represented as a vector, for example:
(Figure: documents and queries represented as vectors)
 Each index term carries a non-binary weight; these weights are used to compute the degree of similarity between the query and each document. Ranking the results then allows better matching, including partial matches.

1. Similarity measure

(Figure: similarity measurement in vector space)

2. Issues the vector space model must consider

 • How to choose the basis vectors (dimensions)?
 • How to convert objects (terms, documents, queries) into vectors?
 • How to choose the magnitude along each dimension?
 • How to compare objects in the vector space?

3. Vector spaces and basis vectors

Formally, a vector space is defined by a set of linearly independent basis vectors.

 The basis vectors correspond to the dimensions and directions of the space. In the vector space model we want the space to be relatively static: the number of basis vectors should not change frequently, and the dimensionality should not be too high.
 The basis vectors must be orthogonal, i.e. mutually unrelated.

(1) Selecting the basis vectors

(Figure: choosing basis vectors)
Orthogonality means each basis vector is completely unrelated to the others, yet words often are correlated: Sun Yang with swimming, Yao Ming with basketball. Even so, words are ultimately used as the basis vectors: any given word is correlated with only a few other words and unrelated to most.

(2) Coordinates along each dimension

 The coordinate of a vector along each dimension (i.e. the weight of each word) represents whether the word appears and how important or relevant it is.
How should these word weights be set? Is a Boolean representation enough?

4. TF-IDF weighting method

Weight = word frequency?
(Figure: using raw term frequency as the weight)
Is term frequency alone enough? Is a word more meaningful the more often it occurs? No: words like "the" and "an" occur many times in an English document yet carry little meaning. A document is better described by its rare words (frequent here, seldom seen elsewhere).

High frequency within a document is good evidence; low frequency across the collection makes the term more valuable. (The more often a word appears in this article and the less often it appears in other articles, the more important it is to this article. For instance, if two documents both contain A but only the first also contains B, then B says more about the first document than A does, since A occurs in both.)

(1) The tf factor: term frequency, how often term k occurs in document d

(Figure: the tf factor; in the classic formulation, tf(k, d) = freq(k, d) / max_j freq(j, d), the frequency of k in d normalized by the document's most frequent term.)

(2) The idf factor: inverse document frequency, the reciprocal of term k's frequency of occurrence across the document collection

(Figure: the idf factor; classically idf(k) = log(N / n_k), where N is the total number of documents and n_k the number of documents containing term k.)

(3) The weighting formula

(Figure: the tf-idf weight, w(k, d) = tf(k, d) · idf(k).)

(4) Interpretation

(Figure: interpretation of the weight. A term scores highly when it occurs often within the document, high tf, but in few documents overall, high idf.)
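A toy computation of these weights (my own sketch; the helpers follow the classic normalized-frequency and log(N/n_k) formulations described above):

```python
import math
from collections import Counter

def tf(term, doc_counts):
    """Normalized term frequency: raw count / count of the most frequent term."""
    return doc_counts[term] / max(doc_counts.values())

def idf(term, collection):
    """Inverse document frequency: log(N / n_k)."""
    n_k = sum(1 for d in collection if term in d)
    return math.log(len(collection) / n_k)

docs = [Counter(text.split()) for text in (
    "information retrieval retrieval model",
    "machine learning model",
    "information theory",
)]
weight = tf("retrieval", docs[0]) * idf("retrieval", docs)  # = 1.0 * log(3)
```

"retrieval" gets the highest weight in the first document: it is that document's most frequent term and appears nowhere else in the collection.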

(5) The query vector

(Figure: the query term weight; classically w(k, q) = (0.5 + 0.5 · freq(k, q) / max_j freq(j, q)) · log(N / n_k).)
The 0.5 is added because early document collections were small: freq could easily be zero, and the smoothing let relevant documents be retrieved even for query terms that barely appear. Today's collections are very large, so the 0.5 is generally unnecessary.

5. Similarity calculation

(Figure: inner-product similarity, sim(d, q) = Σ_k w(k, d) · w(k, q).)

(1) Inner product vs. cosine similarity

The inner product is affected by document length; normalizing it by the vector lengths yields the cosine similarity.
(Figure: cosine similarity, sim(d, q) = Σ_k w(k, d) · w(k, q) / (‖d‖ · ‖q‖).)
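A minimal cosine-similarity sketch over sparse term-weight vectors (my own illustration; vectors are plain dicts from term to weight):

```python
import math

def cosine(u, v):
    """Cosine similarity of two sparse {term: weight} vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * \
           math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

doc = {"information": 0.8, "retrieval": 0.6}
query = {"information": 1.0}
score = cosine(doc, query)  # 0.8
```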

6. Summary of the vector space model

(Figure: vector space model summary)
(1) Advantages

 Simple: a purely mathematical approach; everything is vectorized and compared via the angle between vectors.
 Can use term-frequency weights, and weights of any kind can be incorporated, taking into account both local (within-document) and global (collection-wide) frequency.
 Provides a ranking mechanism and partial matching of the output.
(2) Disadvantages

 Semantic and contextual information is lost (there is no way around this).
 Terms are assumed independent; that is simply what the basis vectors dictate.
 Lacks the control of the Boolean model: for a two-word query "A B", a first document in which A is frequent but B absent may have a better chance of being selected than a second document that contains both A and B, each only a few times.
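The last weakness can be seen in a toy example (my own; raw term-frequency weights and inner-product similarity assumed):

```python
def inner_product(u, v):
    """Unnormalized similarity of two sparse {term: weight} vectors."""
    return sum(w * v.get(t, 0.0) for t, w in u.items())

query = {"A": 1.0, "B": 1.0}
doc1 = {"A": 10.0}            # A frequent, B absent
doc2 = {"A": 1.0, "B": 1.0}   # contains both query terms, each once

s1 = inner_product(query, doc1)  # 10.0
s2 = inner_product(query, doc2)  # 2.0
# doc1 outranks doc2 despite missing B entirely.
```

A Boolean query A AND B would reject doc1 outright; the vector model has no way to express that requirement.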

V. The Probabilistic Model (the Most Promising Direction)

The probabilistic retrieval model is among the best-performing information retrieval models today. Based on Bayes' theorem, it ranks documents for the current query by analyzing existing relevance-feedback results.
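As a rough sketch (my own, not from the post): in the classic binary independence model, a document is scored by summing, over the query terms it contains, a log-odds term weight (the Robertson-Sparck Jones weight) estimated from relevance feedback, with 0.5 smoothing:

```python
import math

def rsj_weight(n_k, N, r_k=0, R=0):
    """Robertson-Sparck Jones weight with 0.5 smoothing.
    N documents in total, n_k containing term k;
    R judged relevant, r_k of those containing k."""
    return math.log(
        ((r_k + 0.5) / (R - r_k + 0.5)) /
        ((n_k - r_k + 0.5) / (N - n_k - R + r_k + 0.5))
    )

def score(doc_terms, query_terms, df, N):
    """Sum the weights of query terms present in the document."""
    return sum(rsj_weight(df[t], N) for t in query_terms if t in doc_terms)
```

With no feedback yet (R = r_k = 0), the weight reduces to log((N - n_k + 0.5) / (n_k + 0.5)), an idf-like quantity; feedback then sharpens the estimate, which is the Bayesian re-ranking idea described above.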

Origin blog.csdn.net/csyifanZhang/article/details/104656408