The Beauty of Mathematics - Chapter 11 Personal Notes

Chapter 11 How to Determine the Relevance of Web Pages to a Query

Apart from user click data, four main factors determine search engine quality today:

1. Complete index

2. Measurement of web page quality

3. User preferences

4. Methods to determine the relevance of a web page to a query

 

1 TF-IDF, a scientific measure of search keyword weight

A simple way to measure the relevance of a web page to a query is to sum the term frequencies (TF) of the query keywords in that page.

That is, for a query with N keywords: TF1 + TF2 + ... + TFN

Of course, stop words such as "of" should be removed first, since they appear in almost every page and say nothing about its topic.
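
As a rough sketch of this first approach, the Python snippet below sums the relative term frequencies of the query keywords in a page after dropping stop words. The whitespace tokenizer, the tiny stop-word list, and the function name tf_score are assumptions made for illustration, not something taken from the book.

```python
from collections import Counter

# Hypothetical, tiny stop-word list; real systems use much larger ones.
STOP_WORDS = {"of", "the", "is", "and", "a"}

def tf_score(query, page):
    """Sum of relative term frequencies of the non-stop query words in a page."""
    words = page.lower().split()
    total = len(words)
    if total == 0:
        return 0.0
    counts = Counter(words)
    # Drop stop words from the query before scoring.
    keywords = [w for w in query.lower().split() if w not in STOP_WORDS]
    return sum(counts[w] / total for w in keywords)

page = "the application of atomic energy is the peaceful use of atomic energy"
print(tf_score("application of atomic energy", page))
```

Here every remaining keyword counts equally, which is exactly the weakness the weighting discussed next is meant to fix.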

Different words also deserve different weights. The weighting must satisfy two conditions:

① The better a word predicts the topic, the greater its weight; the worse it predicts the topic, the smaller its weight.

② The weight of stop words is zero.

If a word appears in only a few web pages, it pins down the search target easily, so its weight should be large. Conversely, a word that appears in many pages should get a small weight.

In information retrieval, the most widely used weight is the "Inverse Document Frequency" (IDF), given by log(D/Dw), where D is the total number of web pages and Dw is the number of web pages in which keyword w appears.
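
To get a feel for the numbers, here is a small Python sketch of the log(D/Dw) formula. The page counts are hypothetical, and the base-2 logarithm is an assumption, since the notes do not state the base.

```python
import math

def idf(total_pages, pages_with_word):
    """Inverse document frequency log2(D / Dw); base 2 is an assumption."""
    return math.log2(total_pages / pages_with_word)

D = 1_000_000_000  # hypothetical total number of web pages

print(idf(D, D))            # word found in every page (a stop word): 0.0
print(idf(D, 500_000_000))  # word found in half of all pages: 1.0
print(idf(D, 2_000_000))    # rare, topic-specific word: about 8.97
```

A word that appears in every page gets weight zero, which matches the second condition on stop words.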

With IDF, the relevance calculation changes from a simple sum of term frequencies to a weighted sum, that is,

TF1 * IDF1 + TF2 * IDF2 + ... + TFN * IDFN
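
The weighted sum can be sketched in Python as follows. This is a minimal illustration under assumptions: pages are split on whitespace, TF is the keyword count divided by the page length, the natural logarithm is used, and the function name tf_idf_scores and the three sample pages are made up for the example.

```python
import math
from collections import Counter

def tf_idf_scores(query, pages):
    """Return one TF-IDF relevance score per page for the given query."""
    docs = [p.lower().split() for p in pages]
    D = len(docs)
    keywords = query.lower().split()
    # Dw: number of pages that contain each keyword.
    df = {w: sum(1 for d in docs if w in d) for w in keywords}
    # IDF = log(D / Dw); a keyword found in no page contributes nothing.
    idf = {w: math.log(D / df[w]) if df[w] else 0.0 for w in keywords}
    scores = []
    for d in docs:
        counts = Counter(d)
        tf = {w: counts[w] / len(d) if d else 0.0 for w in keywords}
        scores.append(sum(tf[w] * idf[w] for w in keywords))
    return scores

pages = [
    "atomic energy and the application of atomic energy",
    "the application of software",
    "the history of the city",
]
scores = tf_idf_scores("the application of atomic energy", pages)
print(scores)
```

In this toy corpus "the" and "of" occur in every page, so their IDF is zero and they drop out of the score without an explicit stop-word list. The first page ranks highest because it contains the rare words "atomic" and "energy".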

The concept of IDF is in fact the relative entropy (Kullback-Leibler divergence) of the probability distribution of keywords under a specific condition.

 

2 Further reading: Information Theory Basis of TF-IDF

The weight of each keyword w in a query should reflect how much information the word provides to the query.

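A simple approach is to use the information content of each word as its weight, namely:

I(w) = -P(w) log P(w) = -(TF(w)/N) log(TF(w)/N) = (TF(w)/N) log(N/TF(w))

where TF(w) is the number of times w occurs in the corpus and N is the size of the entire corpus, a constant that can be omitted. The formula then simplifies to:

I(w) = TF(w) log(N/TF(w))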

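A drawback of this formula is that it cannot reflect how well a keyword discriminates between documents: two keywords with the same TF get the same weight, even if one is concentrated in a few documents and the other is scattered across many.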

Now make two idealized assumptions:

① Every document has roughly the same size of M words, that is, M = N/D, where N is the size of the corpus and D is the number of documents.

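② Once a keyword appears in a document, its contribution is the same no matter how many times it appears. Such a word either appears c(w) = TF(w)/D(w) times in a document, where D(w) is the number of documents containing w, or not at all. Note that c(w) < M. Substituting N = M * D and TF(w) = c(w) * D(w):

TF(w) log(N/TF(w)) = TF(w) log((M * D)/(c(w) * D(w))) = TF(w) log((D/D(w)) * (M/c(w)))

The left-hand side is the information content I(w), and TF-IDF(w) = TF(w) log(D/D(w)), so it follows that:

TF-IDF(w) = I(w) - TF(w) log(M/c(w))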

It follows that the more information I(w) a word carries, the greater its TF-IDF value. At the same time, the more often w appears on average in the documents that contain it, the smaller the second term and the greater the TF-IDF.
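
As a quick numeric check of this relationship, the Python sketch below verifies that, under the two assumptions (N = M * D and c(w) = TF(w)/D(w)), TF(w) log(D/D(w)) equals TF(w) log(N/TF(w)) minus TF(w) log(M/c(w)). All corpus statistics are hypothetical values chosen only for illustration.

```python
import math

# Hypothetical corpus statistics, chosen only for illustration.
D = 1000     # number of documents
M = 500      # words per document (assumption 1)
N = M * D    # size of the whole corpus
TF_w = 600   # total occurrences of keyword w in the corpus
D_w = 200    # number of documents that contain w

c_w = TF_w / D_w                    # occurrences per hit document (assumption 2)
info = TF_w * math.log(N / TF_w)    # I(w) = TF(w) log(N / TF(w))
tf_idf = TF_w * math.log(D / D_w)   # TF-IDF(w) = TF(w) log(D / D(w))
second = TF_w * math.log(M / c_w)   # second term: TF(w) log(M / c(w))

print(tf_idf, info - second)        # the two values agree up to rounding
```

Raising c_w while keeping TF_w fixed (that is, concentrating w in fewer documents) shrinks the second term and raises the TF-IDF value, as the conclusion above states.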

