Weight Metrics in Search: TF-IDF and BM25

When we search the Internet, search engines show highly relevant content first and less relevant content further down. So how does a search engine calculate the relevance between keywords and content? This article introduces two important weighting methods: TF-IDF and BM25.

    Before getting into the theory, consider an example. Suppose we want to find articles related to "Lucene". Intuitively, an article in which "Lucene" appears only once is probably discussing some other technology and mentioning the tool Lucene in passing, while an article in which "Lucene" appears two or three times is more likely devoted to Lucene. This intuition suggests a rule: the more often a keyword appears, the better the document matches that keyword.


Definition of TF

    There is a technical term for the number of times a keyword appears: term frequency (TF). In general, the larger the TF, the higher the relevance.

    However, there is a problem. If "Lucene" appears once in a short essay but twice in a book of several hundred pages, we would not consider the book more relevant to Lucene. To eliminate the influence of document size, the length of the text is usually taken into account when computing TF:

TF Score = the number of times a word appears in the document / the length of the document

For example, take a document D of length 200 in which "Lucene" appears 2 times, "of" appears 20 times, and "principle" appears 3 times. Then:

TF(Lucene|D) = 2/200 = 0.01
TF(of|D) = 20/200 = 0.1
TF(principle|D) = 3/200 = 0.015

    The relevance of the phrase "principle of Lucene" to document D is the sum of the relevance of its three words:

TF(principle of Lucene|D) = 0.01 + 0.1 + 0.015 = 0.125
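    As a quick illustration, here is a minimal Python sketch of this length-normalized TF calculation (a simplified model of the formula above; a real search engine would tokenize text with a proper analyzer):

def tf_score(term_count, doc_length):
    # Length-normalized term frequency: occurrences / document length.
    return term_count / doc_length

# Document D: length 200; "Lucene" appears 2 times, "of" 20 times, "principle" 3 times.
doc_length = 200
counts = {"Lucene": 2, "of": 20, "principle": 3}
for word, count in counts.items():
    print(word, tf_score(count, doc_length))   # 0.01, 0.1, 0.015
# The phrase score is the sum over its words:
print(sum(tf_score(c, doc_length) for c in counts.values()))  # 0.125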

    One problem stands out: the word "of" carries a large share of the score, yet it contributes very little to the topic of the document. Such words are called stop words, and their term frequency is ignored when measuring relevance. After removing "of", the relevance above becomes 0.025, of which "Lucene" contributes 0.01 and "principle" contributes 0.015.

    A careful reader will also notice that "principle" is a very general word, while "Lucene" is a specialized one. Intuition tells us that "Lucene" matters more to this search than "principle". Abstracting this: the better a word predicts the topic, the more important it is and the greater its weight should be; conversely, the weaker its predictive power, the smaller its weight.

    Suppose we treat all the documents in the world as one document repository. If a word rarely appears in this repository, it is easy to locate the target documents through it, so its weight should be large. Conversely, if a word appears everywhere, seeing it still tells us little about what a document discusses, so its weight should be small. Function words (such as "of" and "the" in English, or 的, 地, and 得 in Chinese) appear so frequently that setting their weight to zero does not affect search results, which is one reason they are treated as stop words.

Definition of IDF

    Suppose the keyword w appears in n documents: the larger n is, the smaller the weight of w should be. The commonly used measure is called inverse document frequency (IDF). In general:

IDF = log(N/n)


Note: The log here refers to the base 2 logarithm, not the base 10 logarithm.

    N is the total number of documents. If the total number of documents in the world is 10 billion, "Lucene" appears in 10,000 of them, and "principle" appears in 200 million, then their IDF values are:

IDF(Lucene) = log(10,000,000,000 / 10,000) = 19.93
IDF(principle) = log(10,000,000,000 / 200,000,000) = 5.64
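    The same numbers can be reproduced with a few lines of Python (a minimal sketch, using the base-2 logarithm as noted above):

import math

def idf(total_docs, docs_with_term):
    # Inverse document frequency with a base-2 logarithm.
    return math.log2(total_docs / docs_with_term)

N = 10_000_000_000           # total documents: 10 billion
print(idf(N, 10_000))        # IDF(Lucene)    = 19.93...
print(idf(N, 200_000_000))   # IDF(principle) = 5.64...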

    "Lucence" is 3.5 times as important as "Principle". The stop word "的" appears in all documents, and its IDF=log(1)=0. The final relevance of the phrase to the document is the weighted sum of TF and IDF:

similarity = TF1*IDF1 + TF2*IDF2 + ... + TFn*IDFn

    The relevance of "principle of Lucene" to document D can now be computed:

similarity(principle of Lucene|D) = 0.01*19.93 + 0 + 0.015*5.64 = 0.2839
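    Putting the two together, a sketch of the full phrase score in Python (the TF and IDF values come from the calculations above; the stop word "of" is given an IDF of 0):

terms = [
    ("Lucene",    0.01,  19.93),   # (word, TF, IDF)
    ("of",        0.1,   0.0),     # stop word: IDF forced to 0
    ("principle", 0.015, 5.64),
]
print(sum(tf * idf_w for _, tf, idf_w in terms))  # 0.2839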


    "Lucene" now accounts for about 70% of the score, and "principle" for only about 30%.

TF-IDF in Lucene

    Early versions of Lucene used TF-IDF directly as the default similarity, with some adjustments. The similarity formula is:

similarity = log(numDocs / (docFreq + 1)) * sqrt(tf) * (1/sqrt(length))

  •     numDocs: the number of documents in the index, corresponding to N in the previous section. Lucene does not (and cannot) use every document on the Internet as its base; it uses the total number of documents in the index instead.

  •     docFreq: the number of documents that contain the keyword, corresponding to n in the previous section.

  •     tf: the number of times the keyword appears in the document.

  •     length: the length of the document.

    Within Lucene, the formula above is split into three parts when computing the score:

IDF Score = log(numDocs / (docFreq + 1))
TF Score = sqrt(tf)
fieldNorms = 1/sqrt(length)

    fieldNorms is the normalization for text length. The formula can therefore also be written as:

similarity = IDF score * TF score * fieldNorms
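    Expressed as Python, a sketch of this classic formula (it mirrors the simplified formula above; Lucene's actual Java implementation includes additional factors such as boosts and query normalization):

import math

def classic_similarity(num_docs, doc_freq, tf, length):
    idf_score = math.log(num_docs / (doc_freq + 1))   # IDF Score
    tf_score = math.sqrt(tf)                          # TF Score
    field_norms = 1 / math.sqrt(length)               # fieldNorms
    return idf_score * tf_score * field_norms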


BM25, the next generation of TF-IDF

    Newer versions of Lucene no longer use TF-IDF as the default relevance algorithm; they use BM25 (BM stands for Best Matching). BM25 is an improved algorithm built on top of TF-IDF.

TF in BM25

    The traditional TF value is theoretically unbounded. BM25, by contrast, adds a constant k to the TF calculation to cap the growth of the TF value. Here are the two formulas:

Traditional TF Score = sqrt(tf)
BM25 TF Score = ((k + 1) * tf) / (k + tf)

    Plotting TF Score against term frequency for the two methods shows the difference: as tf grows, both scores increase, but the BM25 TF Score is bounded between 0 and k + 1. It can approach k + 1 but never reach it. In business terms, the influence of any single factor should not be unlimited but should have a maximum, which matches our intuition about text relevance. In Lucene's default settings k = 1.2, and users can modify it.
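    A few lines of Python make the saturation effect concrete (a sketch, with k = 1.2 as stated above):

import math

def tf_traditional(tf):
    return math.sqrt(tf)

def tf_bm25(tf, k=1.2):
    return ((k + 1) * tf) / (k + tf)

for tf in (1, 2, 5, 10, 100, 1000):
    print(tf, round(tf_traditional(tf), 3), round(tf_bm25(tf), 3))
# The traditional score grows without bound, while the BM25 score
# approaches k + 1 = 2.2 but never reaches it.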

How BM25 treats document length

    BM25 also introduces the notion of average document length: the effect of a single document's length on relevance depends on its ratio to the average. In addition to k, the BM25 TF formula introduces two more parameters, L and b. L is the ratio of the document length to the average length; if a document is twice the average length, then L = 2. b is a constant that controls how strongly L affects the score. With L and b added, the formula becomes:

TF Score = ((k + 1) * tf) / (k * (1.0 - b + b * L) + tf)

    Plotting TF Score against term frequency for different values of L shows that the shorter the document, the faster the score approaches its upper limit, and vice versa. This makes sense: for content with only a few words, such as an article title, matching just a few terms is enough to establish relevance, while large bodies of content, such as a whole book, must match many terms before the main point is clear.

    As mentioned above, the parameter b sets how much L affects the score. If b is 0, L loses its influence on the score entirely; the larger b is, the greater the influence of L on the total score. The final complete similarity formula is:

similarity = IDF * ((k + 1) * tf) / (k * (1.0 - b + b * (|d|/avgDl)) + tf)
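    As a final sketch, the complete BM25 term score in Python (b = 0.75 here is an assumed illustrative value alongside the k = 1.2 mentioned above):

def bm25_similarity(idf, tf, doc_len, avg_doc_len, k=1.2, b=0.75):
    L = doc_len / avg_doc_len  # document length relative to the average
    return idf * ((k + 1) * tf) / (k * (1.0 - b + b * L) + tf)

# A document twice the average length scores lower than one half the average:
print(bm25_similarity(idf=19.93, tf=2, doc_len=400, avg_doc_len=200))  # ~21.4
print(bm25_similarity(idf=19.93, tf=2, doc_len=100, avg_doc_len=200))  # ~31.9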


Traditional TF-IDF vs. BM25

    Traditional TF-IDF is a foundational theory of natural language search, and it is consistent with how entropy is computed in information theory. Although its author reportedly did not connect it to information entropy when first proposing it, the IDF formula turns out to resemble the entropy formula; in fact, IDF can be interpreted as the cross entropy of the keyword's probability distribution under a specific condition.
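    One standard way to see the connection (our reading, not a derivation from the original author): if a word w appears in n of N documents, then n/N estimates the probability that a randomly chosen document contains w, so

IDF(w) = log(N/n) = -log(n/N)

    is exactly the self-information of the event "a document contains w" in information theory: the rarer the word, the more information its presence carries.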

    BM25 adds several tunable parameters on top of traditional TF-IDF, which makes it more flexible and powerful in practice.

Questions for the Reader

    Why does the BM25 TF Score use the ratio |d|/avgDl directly, rather than a square root, logarithm, or some other transformation? Is there theoretical support behind this choice?


