Vector Space Model (VSM)

Ranking is the core component of a search engine and largely determines how good that search engine is. Although search engines consider hundreds of factors when ranking actual results, the most important one is how relevant the page content is to the user's query. (P.S.: Baidu's most notorious practice is its "paid ranking" (PPC) strategy, in which the advertisers who pay the most are placed at the top of the search results regardless of content quality, which seriously harms the user experience.) The real question here is: given the user's search terms, how should web pages be ranked by content relevance? Whether page content is judged relevant to the user depends on the retrieval model the search engine uses. Common retrieval models include the Boolean model, the vector space model, the probabilistic model, and machine-learned ranking algorithms. My project uses the vector space model (VSM), so this article summarizes the material related to it.

The vector space model is a tool for representing documents and computing their similarity. It is widely used not only in search but also in natural language processing, text mining, and other fields.

1. Document representation

As a document representation tool, the vector space model treats each document as a vector made up of t feature dimensions. Features can be defined in different ways; the most common is to use words as features. That is, t keywords are extracted from the documents, each feature is assigned a weight by some weighting algorithm, and the resulting t-dimensional vector of weighted features represents the document.
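
The idea of turning a document into a t-dimensional vector can be sketched in a few lines of Python. This is only a minimal illustration: the vocabulary, the sample text, and the helper name to_vector are hypothetical, and raw term counts stand in for whatever weighting algorithm is actually used.

```python
from collections import Counter

def to_vector(text, vocabulary):
    """Map text to a t-dimensional vector of raw term counts over `vocabulary`."""
    # Naive whitespace tokenization; a real system would use a proper tokenizer.
    counts = Counter(text.lower().split())
    # One dimension per vocabulary term, in a fixed order.
    return [counts[term] for term in vocabulary]

# Hypothetical vocabulary of t = 4 keywords (not taken from the article).
vocabulary = ["search", "engine", "vector", "model"]
doc = "Vector space model for a search engine search"
doc_vector = to_vector(doc, vocabulary)
print(doc_vector)
```

Words outside the vocabulary ("space", "for", "a") are simply ignored, so every document ends up with the same number of dimensions.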

Figure 4 shows how documents are represented in a three-dimensional vector space. Document 2, for example, consists of three weighted features {w21, w22, w23}. In practice the dimensionality is usually very high, often reaching tens of thousands of dimensions; three dimensions are used here only to simplify the explanation. A user query can also be viewed as a special document and converted into a t-dimensional feature vector. It is converted into a t-dimensional vector so that its similarity to each document can be computed, as discussed later.

 

The following is an example of document representation: after keyword extraction, documents D4 and D5 and the user query are each converted into a feature vector, as shown below.

 

 

2. Similarity calculation
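
A common way to compare a query vector with a document vector in the vector space model is cosine similarity, the cosine of the angle between the two vectors: their dot product divided by the product of their lengths. The sketch below assumes that measure; the function name, the document names, and all weights are hypothetical values chosen for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension t")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here: an all-zero vector matches nothing
    return dot / (norm_u * norm_v)

# Hypothetical 3-dimensional weights, for illustration only.
query = [1.0, 0.0, 1.0]
docs = {
    "doc_a": [2.0, 0.0, 2.0],  # same direction as the query
    "doc_b": [0.0, 3.0, 0.0],  # shares no feature with the query
    "doc_c": [1.0, 1.0, 0.0],  # shares one feature with the query
}

# Rank documents by decreasing similarity to the query.
ranking = sorted(docs, key=lambda name: cosine_similarity(query, docs[name]), reverse=True)
print(ranking)
```

Because the cosine depends only on direction, doc_a scores as a perfect match even though its weights are twice those of the query; a long document is not favored just for being long.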

 

3. Feature weight calculation
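
A widely used weighting scheme for vector space model features is TF-IDF: a term's frequency in the document multiplied by its inverse document frequency across the collection. The sketch below assumes that scheme with one common IDF variant, log(N / df); the function name, sample documents, and vocabulary are hypothetical, and other TF or IDF variants are equally valid choices.

```python
import math
from collections import Counter

def tf_idf_vectors(docs, vocabulary):
    """Return one TF-IDF weight vector per document, ordered like `vocabulary`."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = {term: sum(1 for tokens in tokenized if term in tokens) for term in vocabulary}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vec = []
        for term in vocabulary:
            tf = counts[term]  # raw term frequency in this document
            # IDF variant assumed here: log(N / df); 0 if no document has the term.
            idf = math.log(n_docs / df[term]) if df[term] else 0.0
            vec.append(tf * idf)
        vectors.append(vec)
    return vectors

# Hypothetical three-document collection and vocabulary.
docs = ["search engine ranking", "vector space model", "search vector search"]
vocabulary = ["search", "vector", "ranking"]
vectors = tf_idf_vectors(docs, vocabulary)
for doc, vec in zip(docs, vectors):
    print(doc, [round(w, 3) for w in vec])
```

In this sample, "ranking" appears in only one document, so it receives a higher weight in that document than "search", which appears in two of the three.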

 

Origin www.cnblogs.com/kkbill/p/11517121.html