TF-IDF: word segmentation, stop-word filtering, and keyword extraction

Task 1: Given a long article, "China's Bee Farming", use a computer to extract its keywords.

1. Word frequency: if a word is important, it should appear multiple times in the article, so we begin with "term frequency" (Term Frequency, abbreviated TF) statistics.

2. Stop words: as you may have guessed, the most frequent words turn out to be "的", "是", "在" and similar words, the most common function words in the language. They are called "stop words", meaning words that do not help identify the result and must be filtered out.

 

Rule 1: if a word is relatively rare overall but appears many times in this article, it probably reflects what the article is about, which is exactly the kind of keyword we need.

  Suppose we filter out the stop words and consider only the remaining meaningful words.

  We find that the three words "China", "bee", and "farming" appear the same number of times.

  But "China" is a very common word, while "bee" and "farming" are comparatively rare, so "bee" and "farming" are more important than "China".

 

3. IDF: the most common words ("的", "是", "在") get the smallest weight,

    more common words ("China") get a smaller weight,

    and less common words ("bee", "farming") get a larger weight.

    This weight is called the "inverse document frequency" (Inverse Document Frequency, abbreviated IDF),

    and its size is inversely related to how common the word is.

 

4. TF-IDF: multiplying the two values, "term frequency" (TF) and "inverse document frequency" (IDF), gives a word's TF-IDF value.

    The more important a word is to the article, the larger its TF-IDF value.

    Therefore, the top-ranked words are the keywords of the article.

 

Implementation:

1. Calculate word frequency

  Term frequency (TF) = number of times a word appears in the article

Articles vary in length. To make different articles comparable, "term frequency" is normalized:

  Term frequency (TF) = number of times a word appears in the article / total number of words in the article

or: Term frequency (TF) = number of times a word appears in the article / number of occurrences of the most frequent word
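As a sketch, the normalized term frequency can be computed like this (the sample document and function name are illustrative):

```python
from collections import Counter

def term_frequency(words):
    """Map each word to (occurrences / total word count)."""
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

# Toy segmented document
doc = ["china", "bee", "farming", "bee", "farming", "bee"]
tf = term_frequency(doc)
print(tf["bee"])  # 3 occurrences out of 6 words -> 0.5
```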

 

2. Calculate inverse document frequency

This step requires a corpus to model the language environment.

Inverse document frequency (IDF) = log(total number of documents in the corpus / (number of documents containing the word + 1))
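A minimal sketch of this IDF formula; the toy corpus below is illustrative, and log base 10 is an assumption:

```python
import math

def inverse_document_frequency(word, documents):
    """IDF = log(N / (number of documents containing the word + 1))."""
    containing = sum(1 for doc in documents if word in doc)
    return math.log10(len(documents) / (containing + 1))

# Each document is represented as the set of words it contains
corpus = [{"china", "bee"}, {"china"}, {"china", "farming"}, {"news"}]
print(inverse_document_frequency("bee", corpus))    # log10(4 / 2), a positive weight
print(inverse_document_frequency("china", corpus))  # log10(4 / 4) = 0.0
```

A word that appears in every document gets a weight near zero, which is exactly how the stop words are suppressed.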

 

3. Calculate TF-IDF

  TF-IDF = term frequency (TF) * inverse document frequency (IDF)

  It follows that TF-IDF is proportional to how often a word appears in the document and inversely related to how often it appears across the corpus.

  Therefore, the algorithm for automatic keyword extraction is: compute the TF-IDF value of every word in the document,

  sort the words in descending order of TF-IDF, and take the top few.
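Putting the two steps together, a hedged sketch of keyword extraction by TF-IDF; all data and names below are illustrative:

```python
import math
from collections import Counter

def tfidf_keywords(doc_words, corpus_docs, k=3):
    """Score every word in the document by TF * IDF and return the top k."""
    total = len(doc_words)
    counts = Counter(doc_words)
    scores = {}
    for word, c in counts.items():
        tf = c / total
        containing = sum(1 for d in corpus_docs if word in d)
        idf = math.log10(len(corpus_docs) / (containing + 1))
        scores[word] = tf * idf
    return sorted(scores, key=scores.get, reverse=True)[:k]

# "china" is frequent in the article but also common across the corpus
article = ["china", "bee", "farming"] * 5 + ["china"] * 5
corpus = [{"china"} for _ in range(90)] + [{"bee", "farming"}] * 9 + [set(article)]
print(tfidf_keywords(article, corpus))  # -> ['bee', 'farming', 'china']
```

Despite "china" having the highest raw frequency, its low IDF pushes it below "bee" and "farming".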

 

In the resulting table, "bee" has the highest TF-IDF value, "farming" comes second, and "China" is lowest. (If the TF-IDF of "的" were also computed, it would be a value extremely close to 0.)

So, if we pick only one word, "bee" is the keyword of this article.

 

Summary:

The advantage of the TF-IDF algorithm is that it is simple and fast, and its results match intuition fairly well.

The disadvantage is that measuring a word's importance purely by "word frequency" is not comprehensive enough; sometimes important words do not appear many times.

Moreover, the algorithm cannot capture word position: a word near the beginning of the text and a word near the end are treated as equally important, which is incorrect.

(One remedy is to give more weight to the first paragraph of the text and to the first sentence of each paragraph.)

 

Task 2: TF-IDF combined with cosine similarity: finding similar articles

Besides extracting keywords, we also want to find other articles similar to a given one.

This requires cosine similarity. Consider:

Sentence A: I like watching TV, I don't like watching movies.

Sentence B: I don't like watching TV, and I don't like watching movies either.

The basic idea is: the more similar the words two sentences use, the more similar their content should be.

      Therefore, we can start from word frequency and compute their similarity.

 

1. Word segmentation

  Sentence A: I / like / watch / TV, not / like / watch / movies.

  Sentence B: I / not / like / watch / TV, also / not / like / watch / movies.

2. List all the words

  I, like, watch, TV, movie, not, also.

3. Calculate word frequency

  Sentence A: I 1, like 2, watch 2, TV 1, movie 1, not 1, also 0.

  Sentence B: I 1, like 2, watch 2, TV 1, movie 1, not 2, also 1.

4. Write the word frequency vector.

  Sentence A: [1, 2, 2, 1, 1, 1, 0]

  Sentence B: [1, 2, 2, 1, 1, 2, 1]

We can judge the similarity of the two vectors by the angle between them: the smaller the angle, the more similar they are.

Assuming vector a is [x1, y1] and vector b is [x2, y2], the law of cosines can be rewritten in the following form:

  cos θ = (x1·x2 + y1·y2) / (√(x1² + y1²) · √(x2² + y2²))

The same form extends to n-dimensional vectors A and B:

  cos θ = Σ(Ai × Bi) / (√Σ(Ai²) × √Σ(Bi²))

For the two word frequency vectors above, cos θ = 13 / (√12 × √16) ≈ 0.938: the angle is small, so the two sentences are highly similar.
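A minimal sketch of this computation for the two sentence vectors above:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

sentence_a = [1, 2, 2, 1, 1, 1, 0]
sentence_b = [1, 2, 2, 1, 1, 2, 1]
print(round(cosine_similarity(sentence_a, sentence_b), 3))  # 0.938
```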

Conclusion:

  We obtain an algorithm for "finding similar articles":

  • Use the TF-IDF algorithm to find the keywords of the two articles;
  • Take several keywords from each article (say, 20), merge them into one set, and compute each article's word frequency over this set (to avoid differences in article length, relative word frequency can be used);
  • Generate each article's word frequency vector;
  • Compute the cosine similarity of the two vectors: the larger the value, the more similar the articles.

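The four steps can be sketched as follows, assuming the keyword lists have already been extracted with TF-IDF; the helper name and sample data are illustrative:

```python
import math
from collections import Counter

def similarity_from_keywords(words_a, words_b, keywords_a, keywords_b):
    """Cosine similarity of two articles over their merged keyword set."""
    vocab = sorted(set(keywords_a) | set(keywords_b))  # merged keyword set
    ca, cb = Counter(words_a), Counter(words_b)
    # Relative word frequency, to cancel out differences in article length
    va = [ca[w] / len(words_a) for w in vocab]
    vb = [cb[w] / len(words_b) for w in vocab]
    dot = sum(x * y for x, y in zip(va, vb))
    na = math.sqrt(sum(x * x for x in va))
    nb = math.sqrt(sum(y * y for y in vb))
    return dot / (na * nb) if na and nb else 0.0

a = ["bee", "farming", "bee"]
b = ["bee", "farming", "farming"]
print(similarity_from_keywords(a, b, ["bee", "farming"], ["bee", "farming"]))  # 0.8
```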

 

Task 3: Using word frequency to summarize articles automatically

Information is carried by sentences: some sentences contain more information, some less.

"Automatic summarization" means finding the sentences that contain the most information.

A sentence's information content is measured by its keywords: the more keywords it contains, the more important the sentence is.

Luhn proposed using "clusters" to represent the aggregation of keywords: a "cluster" is a sentence fragment containing multiple keywords.

 

As long as the distance between two keywords is below a "threshold", they are considered to be in the same cluster; the threshold Luhn suggested is 4 or 5.

In other words, if more than 5 other words separate two keywords, the two keywords fall into two different clusters.

Importance of a cluster = (number of keywords it contains)² / length of the cluster

For example, a cluster of 7 words, 4 of which are keywords, has an importance score of (4 × 4) / 7 ≈ 2.3.
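Luhn's cluster scoring can be sketched as follows (the sample cluster is illustrative):

```python
def cluster_score(cluster_words, keywords):
    """Luhn significance: (keyword hits in the cluster)^2 / cluster length."""
    hits = sum(1 for w in cluster_words if w in keywords)
    return hits ** 2 / len(cluster_words)

# 7 words, of which 4 are keyword occurrences ("bee" x3, "farming" x1)
cluster = ["bee", "colonies", "are", "kept", "bee", "farming", "bee"]
print(round(cluster_score(cluster, {"bee", "farming"}), 1))  # 2.3
```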

 

Then, find the sentences containing the highest-scoring clusters (for example, the top 5 sentences) and put them together to form the article's automatic summary.


Summarizer(originalText, maxSummarySize):
    // Count word frequencies in the original text, producing an array
    // such as [(10, 'the'), (3, 'language'), (8, 'code'), ...]
    wordFrequences = getWordCounts(originalText)
    // Filter out stop words; the array becomes [(3, 'language'), (8, 'code'), ...]
    contentWordFrequences = filtStopWords(wordFrequences)
    // Sort by word frequency and drop the counts; the array becomes ['code', 'language', ...]
    contentWordsSortbyFreq = sortByFreqThenDropFreq(contentWordFrequences)
    // Split the article into sentences
    sentences = getSentences(originalText)
    // For each keyword, select the first sentence in which it appears
    setSummarySentences = {}
    foreach word in contentWordsSortbyFreq:
        firstMatchingSentence = search(sentences, word)
        setSummarySentences.add(firstMatchingSentence)
        if setSummarySentences.size() == maxSummarySize:
            break
    // Join the selected sentences in their order of appearance to form the summary
    summary = ""
    foreach sentence in sentences:
        if sentence in setSummarySentences:
            summary = summary + " " + sentence
    return summary

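The pseudocode above can be turned into runnable Python roughly as follows; the stop-word list and the sentence-splitting regex are simplified assumptions:

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "of", "and", "is", "in", "to"}  # toy list

def summarize(original_text, max_summary_size):
    """Pick the first sentence mentioning each top content word, in text order."""
    words = re.findall(r"[a-z']+", original_text.lower())
    # Word frequencies with stop words filtered out
    freqs = Counter(w for w in words if w not in STOP_WORDS)
    # Content words sorted by descending frequency, counts dropped
    content_words = [w for w, _ in freqs.most_common()]
    # Naive sentence splitting on terminal punctuation
    sentences = re.split(r"(?<=[.!?])\s+", original_text.strip())
    chosen = set()
    for word in content_words:
        for sentence in sentences:
            if word in sentence.lower():
                chosen.add(sentence)
                break
        if len(chosen) == max_summary_size:
            break
    # Join the selected sentences in their order of appearance
    return " ".join(s for s in sentences if s in chosen)

text = ("Code is read by humans. Language shapes thought. "
        "Code and language evolve together.")
print(summarize(text, 1))  # -> "Code is read by humans."
```

Note the substring match (`word in sentence.lower()`) is a simplification; a real implementation would match whole words.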


Source: blog.csdn.net/qq_41587243/article/details/87799925