TF-IDF, Continued (Big Data & Data Mining)


Example

Example 1

Term frequency (TF) is the number of times a word occurs divided by the total number of words in the document.
If a document contains 100 words in total and the word "cow" appears 3 times, then the term frequency of "cow" in that document is 3/100 = 0.03.
One way to calculate inverse document frequency (IDF) is to divide the total number of documents in the collection by the number of documents in which the word "cow" appears, then take the logarithm of the quotient.
So if "cow" appears in 1,000 documents out of a total of 10,000,000, the inverse document frequency is lg(10,000,000 / 1,000) = 4.
The final TF-IDF score is 0.03 * 4 = 0.12.
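
A minimal Python sketch of this arithmetic, using the example's numbers (the base-10 logarithm matches the lg in the text):

```python
import math

def tf(term_count, total_terms):
    # Term frequency: occurrences of the word divided by total words in the document
    return term_count / total_terms

def idf(total_docs, docs_with_term):
    # Inverse document frequency: log of total documents over documents containing the word
    return math.log10(total_docs / docs_with_term)

# "cow" appears 3 times in a 100-word document,
# and in 1,000 of the 10,000,000 documents in the collection
tf_cow = tf(3, 100)               # 0.03
idf_cow = idf(10_000_000, 1_000)  # 4.0
print(tf_cow * idf_cow)           # 0.12
```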

Example 2

Suppose a web page contains 1,000 words in total, and the words "atomic energy", "of" (的), and "application" appear 2, 35, and 5 times respectively; their term frequencies are then 0.002, 0.035, and 0.005.
Adding these three numbers gives 0.042, a simple measure of the relevance of this page to the query "applications of atomic energy".
In general, if a query contains the keywords w1, w2, …, wN, and their term frequencies (TF) in a particular web page are TF1, TF2, …, TFN,
then a first approximation of the relevance between the query and the page is TF1 + TF2 + … + TFN.
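
A sketch of this naive relevance score in Python (the term counts are taken as given; a real system would tokenize the page first, and "atomic energy" is treated as a single term here for simplicity):

```python
def simple_relevance(query_terms, term_counts, total_words):
    # Naive relevance: the sum of the query keywords' term frequencies in the page
    return sum(term_counts.get(w, 0) / total_words for w in query_terms)

# The example's 1,000-word page
counts = {"atomic energy": 2, "of": 35, "application": 5}
score = simple_relevance(["atomic energy", "of", "application"], counts, 1000)
print(round(score, 3))  # 0.042
```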

Flaws

In the example above, the word "的" ("of") accounts for more than 80% of the total term frequency, yet it is almost useless for determining the topic of the page. Such words are called stopwords, meaning their frequency should not be counted when measuring relevance. Chinese has a few dozen stopwords, including "的", "是", "和", "中", "地", and "得". After ignoring them, the relevance of the page above becomes 0.007, of which "atomic energy" contributes 0.002 and "application" contributes 0.005.

A careful reader may spot another small flaw. In Chinese, "application" is a very general word, while "atomic energy" is a very specialized one, and the latter should count for more than the former in relevance ranking. We therefore need to assign a weight to each word, and the weights must satisfy two conditions:

  1. The stronger a word's ability to predict the topic, the greater its weight should be, and vice versa. Seeing the word "atomic energy" on a web page tells us more or less what the page is about, whereas seeing "application" once tells us almost nothing. The weight of "atomic energy" should therefore be greater than that of "application".
  2. The weight of stopwords should be zero.

It is easy to see that if a keyword appears in only a few web pages, it lets us pin down the search target easily, so its weight should be large. Conversely, if a word appears in a huge number of pages, seeing it still leaves the topic unclear, so its weight should be small. In short, if a keyword w appears in Dw web pages, then the larger Dw is, the smaller the weight of w should be, and vice versa. In information retrieval, the most commonly used such weight is the inverse document frequency (IDF), given by log(D / Dw), where D is the total number of web pages.

For example, suppose there are D = 1 billion Chinese web pages and the stopword "的" appears on all of them, i.e. Dw = 1 billion; then its IDF = log(1 billion / 1 billion) = log(1) = 0. If the specialized term "atomic energy" appears on 2 million pages, i.e. Dw = 2 million, its weight is IDF = log(500) ≈ 2.7. And if the general word "application" appears on 500 million pages, its weight is only IDF = log(2) ≈ 0.3. In other words, finding one match for "atomic energy" in a web page is worth as much as finding nine matches for "application". Using IDF, the relevance formula changes from a simple sum of term frequencies to a weighted sum: TF1·IDF1 + TF2·IDF2 + … + TFN·IDFN. In the example above, the relevance of the page to "applications of atomic energy" becomes 0.0069, of which "atomic energy" contributes 0.0054 and "application" only 0.0015. This ratio agrees much better with our intuition.
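
Sticking with the same numbers, a small sketch of the weighted sum; note that stopwords drop out automatically because their IDF is log(1) = 0:

```python
import math

D = 1_000_000_000  # assumed total number of web pages, as in the example

def idf(docs_with_term):
    # Inverse document frequency: log(D / Dw)
    return math.log10(D / docs_with_term)

def tf_idf_relevance(query_terms, tf_in_page, docs_with_term):
    # Weighted sum TF1·IDF1 + TF2·IDF2 + ... + TFN·IDFN
    return sum(tf_in_page[w] * idf(docs_with_term[w]) for w in query_terms)

tf_in_page = {"atomic energy": 0.002, "的": 0.035, "application": 0.005}
docs_with_term = {"atomic energy": 2_000_000, "的": 1_000_000_000, "application": 500_000_000}
score = tf_idf_relevance(["atomic energy", "的", "application"], tf_in_page, docs_with_term)
print(round(score, 4))  # 0.0069 = 0.002·2.7 + 0.035·0 + 0.005·0.3
```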

Applications

This weighting scheme is often used together with cosine similarity in the vector space model to determine the similarity between two documents.
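
A minimal sketch of that combination, representing each document as a sparse word → TF-IDF-weight dict (the vectors below are made-up values for illustration):

```python
import math

def cosine_similarity(vec_a, vec_b):
    # Cosine of the angle between two sparse TF-IDF vectors
    dot = sum(w * vec_b.get(term, 0.0) for term, w in vec_a.items())
    norm_a = math.sqrt(sum(w * w for w in vec_a.values()))
    norm_b = math.sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical TF-IDF vectors for two short documents
doc1 = {"atomic": 0.54, "energy": 0.54, "application": 0.15}
doc2 = {"nuclear": 0.60, "energy": 0.50, "application": 0.10}
print(cosine_similarity(doc1, doc2))  # higher means more similar
```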

Theoretical assumptions

The TF-IDF algorithm is based on the assumption that the words most useful for distinguishing documents are those that appear frequently in a document but rarely in the other documents of the collection; if term frequency (TF) is used as the measure along the feature space's coordinates, it can capture what similar texts have in common. To also account for a word's ability to separate different categories, the TF-IDF method assumes that the lower a word's document frequency, the greater its power to distinguish categories of text. The concept of inverse document frequency (IDF) is therefore introduced, and the product of TF and IDF is taken as the value in the feature space coordinate system, adjusting the raw TF weight. The purpose of the adjustment is to highlight important words and suppress unimportant ones.

In essence, however, IDF is a weighting that merely tries to suppress noise, under the simplistic belief that words with low document frequency are important while words with high document frequency are useless. This is obviously not entirely correct. IDF's simple structure cannot faithfully reflect the importance of words or the distribution of feature words, so it cannot perform the weight adjustment well, and the precision of the TF-IDF method is consequently not very high.
In addition, the TF-IDF algorithm does not reflect word position. For Web documents, the weight calculation should reflect the structure of HTML: feature words in different markup elements reflect the article's content to different degrees, so their weights should be computed differently. Feature words at different positions on a web page should be given different coefficients, which are then multiplied by the words' term frequencies to improve the quality of the text representation.
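
As a sketch of that idea (the article names no concrete coefficients, so the values for each HTML region below are invented for illustration):

```python
# Hypothetical coefficients for HTML regions; the text suggests the idea
# of position-dependent weighting but does not specify values
POSITION_WEIGHT = {"title": 5.0, "h1": 3.0, "anchor": 2.0, "body": 1.0}

def positional_tf(occurrence_tags, total_words):
    # Each occurrence of a feature word is scaled by the coefficient
    # of the HTML region it appears in, then normalized by page length
    weighted = sum(POSITION_WEIGHT.get(tag, 1.0) for tag in occurrence_tags)
    return weighted / total_words

# "atomic energy" appearing once in the title and twice in the body
# of a 1,000-word page
print(positional_tf(["title", "body", "body"], 1000))  # (5 + 1 + 1) / 1000 = 0.007
```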

A probabilistic model

Overview of Information Retrieval
Information retrieval is a widely used technology; paper retrieval and search engines both fall within its scope. The problem is usually abstracted as follows: over a document collection D, given a query string q composed of keywords w[1] … w[k], return a list D' of related documents sorted by a relevance score relevance(q, d) between the query q and each document d. A series of classic information retrieval models have been proposed for this problem, each offering its own solution from a different angle. The Boolean model is based on Boolean operations over sets and is efficient to query, but it is too simple to rank documents meaningfully, so its results are poor. The vector model treats both the document and the query string as high-dimensional vectors of words, with relevance corresponding to the angle between the vectors; but because the vocabulary is huge, the vectors have very high dimension and most components are zero, so the computed angles are not very informative, and the sheer amount of computation makes the vector model nearly impossible to apply to massive collections such as those behind Internet search engines.
tf-idf model
Currently, the tf-idf model is widely used in practical applications such as search engines. Its main idea is: if a word w appears frequently in a document d but rarely in other documents, then w is considered to have good discriminating power and is suitable for distinguishing document d from other documents.
A Probabilistic Perspective on Information Retrieval
Intuitively, tf describes how often a word occurs within a document, while idf is a weight related to the number of documents in which the word appears. The basic idea of tf-idf is fairly easy to grasp qualitatively, but explaining why its specific details take the form they do is not so easy.
Summary
The TF-IDF model is an information retrieval model widely used in practical applications such as search engines, but it has always attracted various doubts. This article proposes a box-and-ball model based on conditional probability for the information retrieval problem. Its core idea is to turn "the degree of match between query string q and document d" into "the conditional probability that query string q was generated from document d". From a probabilistic perspective this gives the information retrieval problem a clearer objective than the matching score of the TF-IDF model. The probabilistic model can subsume the TF-IDF model, on the one hand explaining why TF-IDF is reasonable and on the other revealing where it falls short. It can also explain the meaning of PageRank, and why the PageRank weight and the TF-IDF weight combine as a product.
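
The summary's "conditional probability that q comes from d" reads like a query-likelihood language model; the following is only a sketch under that interpretation, with Jelinek-Mercer smoothing chosen as an assumed detail the article does not specify:

```python
import math

def query_log_likelihood(query_terms, doc_counts, doc_len,
                         coll_counts, coll_len, lam=0.5):
    # log P(q | d): the log-probability that query q was "drawn from"
    # document d, smoothed against collection statistics so unseen
    # words do not zero out the whole product
    log_p = 0.0
    for w in query_terms:
        p_doc = doc_counts.get(w, 0) / doc_len
        p_coll = coll_counts.get(w, 1) / coll_len
        log_p += math.log(lam * p_doc + (1 - lam) * p_coll)
    return log_p

# Documents would be ranked by log P(q | d); a higher value means
# the page is a better match for the query
```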


Source: blog.csdn.net/weixin_45646640/article/details/130451093