Task 1: Suppose we have a long article, "China's Bee Farming", and we want a computer to extract its keywords.
1. Word frequency: If a word is important, it should appear many times in the article, so we start with "term frequency" (Term Frequency, abbreviated as TF) statistics.
2. Stop words: As you have probably guessed, the most frequent words turn out to be "的" ("of"), "是" ("is"), and "在" ("in"), the most commonly used function words in Chinese. They are called "stop words": words that do not help identify the result and must be filtered out.
Rule 1: If a word is relatively rare but appears many times in this article, it is likely to reflect the characteristics of the article; it is exactly the keyword we need.
Suppose we filter out the stop words and consider only the remaining meaningful words.
We then find that the three words "China", "bee", and "farming" appear the same number of times.
Does that mean they are equally important? No: "China" is a very common word, while "bee" and "farming" are relatively uncommon, so "bee" and "farming" are more important than "China".
3. IDF: The most common words ("的", "是", "在") are given the smallest weight,
more common words ("China") are given a smaller weight,
and less common words ("bee", "farming") are given a greater weight.
This weight is called the "Inverse Document Frequency" (abbreviated as IDF);
its size is inversely related to how common a word is.
4. TF-IDF: Multiplying the "term frequency" (TF) and the "inverse document frequency" (IDF) gives the TF-IDF value of a word.
The more important a word is to the article, the greater its TF-IDF value.
Therefore, the top-ranked words are the keywords of the article.
Implementation:
1. Calculate word frequency
Term Frequency (TF) = the number of times a word appears in the article
Since articles vary in length, the "term frequency" is normalized so that different articles can be compared:
Term frequency (TF) = (number of times the word appears in the article) / (total number of words in the article)
or term frequency (TF) = (number of times the word appears in the article) / (number of occurrences of the most frequent word)
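As a minimal sketch (assuming the article has already been segmented into a list of tokens; the sample tokens below are invented for illustration), both normalizations can be computed like this:

```python
from collections import Counter

def term_frequency(words):
    """TF normalized by the total number of words in the article."""
    counts = Counter(words)
    total = len(words)
    return {w: c / total for w, c in counts.items()}

def term_frequency_max(words):
    """TF normalized by the count of the most frequent word."""
    counts = Counter(words)
    peak = max(counts.values())
    return {w: c / peak for w, c in counts.items()}

# Hypothetical token list standing in for a segmented article.
tokens = ["china", "bee", "farming", "bee", "farming", "bee", "china"]
tf = term_frequency(tokens)        # e.g. tf["bee"] == 3/7
tf_max = term_frequency_max(tokens)  # e.g. tf_max["china"] == 2/3
```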
2. Calculate inverse document frequency (IDF)
This step requires a corpus to simulate the language environment.
Inverse document frequency (IDF) = log(total number of documents in the corpus / (number of documents containing the word + 1))
The +1 in the denominator avoids division by zero when no document in the corpus contains the word.
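A sketch of this formula, assuming each document is represented as the set of words it contains (the three-document corpus here is a toy stand-in):

```python
import math

def inverse_document_frequency(word, documents):
    """IDF = log(N / (number of documents containing the word + 1)).

    Note: because of the +1, a word that appears in every document
    gets a slightly negative IDF, i.e. effectively zero weight.
    """
    n_containing = sum(1 for doc in documents if word in doc)
    return math.log(len(documents) / (n_containing + 1))

# Toy corpus: each document is a set of its words.
docs = [{"bee", "farming"}, {"china", "economy"}, {"china", "bee"}]
idf_china = inverse_document_frequency("china", docs)    # log(3/3) = 0.0
idf_economy = inverse_document_frequency("economy", docs)  # log(3/2) > 0
```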
3. Calculate TF-IDF
TF-IDF = term frequency (TF) * inverse document frequency (IDF)
It can be seen that TF-IDF is directly proportional to the number of occurrences of a word in the document, and inversely related to how often the word appears across the corpus.
Therefore, the algorithm for automatically extracting keywords is to calculate the TF-IDF value of every word in the document,
sort them in descending order, and take the top few words.
It can be seen from the table above that "bee" has the highest TF-IDF value, "farming" comes second, and "China" is the lowest. (If the TF-IDF of "的" were also calculated, it would be extremely close to 0.)
So, if you choose only one word, "bee" is the key word of this article.
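The whole Task 1 procedure (compute TF-IDF for every word, sort in descending order, keep the top few) can be sketched as follows; the corpus and token lists are invented for illustration:

```python
import math
from collections import Counter

def extract_keywords(doc_tokens, corpus, top_k=3):
    """Rank the words of one document by TF-IDF against a corpus.

    doc_tokens: list of words in the target document.
    corpus: list of token lists, one per document (used only for IDF).
    """
    tf = Counter(doc_tokens)
    total = len(doc_tokens)
    n_docs = len(corpus)
    scores = {}
    for word in tf:
        df = sum(1 for doc in corpus if word in doc)
        idf = math.log(n_docs / (df + 1))
        scores[word] = (tf[word] / total) * idf
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

# Toy corpus: "the" appears everywhere, "bee" only in the first document.
corpus = [["the", "bee", "farming", "bee"],
          ["the", "market"],
          ["the", "china", "economy"],
          ["the", "china", "trade"]]
keywords = extract_keywords(corpus[0], corpus, top_k=2)  # ["bee", "farming"]
```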
To sum up:
The advantage of the TF-IDF algorithm is that it is simple and fast, and its results match reality fairly well.
The disadvantage is that measuring a word's importance simply by "word frequency" is not comprehensive enough; sometimes important words do not appear many times.
Moreover, this algorithm cannot reflect the position of words: words that appear early in the text and words that appear late are treated as equally important, which is incorrect.
(One solution is to give extra weight to the first paragraph of the text and the first sentence of each paragraph.)
Task 2: TF-IDF combined with cosine similarity: finding similar articles
In addition to finding keywords, we would also like to find other articles similar to the original one.
This requires "cosine similarity". Take these two sentences as an example:
Sentence A: I like watching TV, and do not like watching movies.
Sentence B: I do not like watching TV, and also do not like watching movies.
The basic idea is: the more similar the words used by two sentences, the more similar their content should be.
Therefore, we can start from word frequency and calculate their similarity.
1. Word segmentation
Sentence A: I / like / watch / TV, not / like / watch / movies.
Sentence B: I / not / like / watch / TV, also / not / like / watch / movies.
2. List all the words
I, like, watch, TV, movies, not, also.
3. Calculate word frequency
Sentence A: I 1, like 2, watch 2, TV 1, movies 1, not 1, also 0.
Sentence B: I 1, like 2, watch 2, TV 1, movies 1, not 2, also 1.
4. Write out the word frequency vectors.
Sentence A: [1, 2, 2, 1, 1, 1, 0]
Sentence B: [1, 2, 2, 1, 1, 2, 1]
We can judge how similar two vectors are by the angle between them: the smaller the angle, the more similar they are.
Suppose vector a is [x1, y1] and vector b is [x2, y2]; the law of cosines can then be rewritten in the following form:
cos(θ) = (x1*x2 + y1*y2) / (sqrt(x1^2 + y1^2) * sqrt(x2^2 + y2^2))
For n-dimensional vectors A = [A1, A2, ..., An] and B = [B1, B2, ..., Bn], the same formula becomes:
cos(θ) = (A1*B1 + ... + An*Bn) / (sqrt(A1^2 + ... + An^2) * sqrt(B1^2 + ... + Bn^2))
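Applied to the two word-frequency vectors above, a direct implementation of the cosine formula might look like this:

```python
import math

def cosine_similarity(a, b):
    """cos(theta) = dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# The word frequency vectors of Sentence A and Sentence B.
sentence_a = [1, 2, 2, 1, 1, 1, 0]
sentence_b = [1, 2, 2, 1, 1, 2, 1]
score = cosine_similarity(sentence_a, sentence_b)  # about 0.938
```

For these two vectors the result is about 0.938: the angle is small, so the two sentences are very similar.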
In conclusion:
We get an algorithm for "finding similar articles":
- Use the TF-IDF algorithm to find the keywords of the two articles;
- Take several keywords from each article (for example, 20), merge them into one set, and compute each article's word frequency for every word in the set (relative word frequency can be used to avoid differences in article length);
- Generate a word frequency vector for each article;
- Calculate the cosine similarity of the two vectors: the larger the value, the more similar the articles.
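The last three steps of this algorithm can be sketched as follows. The keyword set, which step 1 would obtain via TF-IDF, is supplied directly here, and the token lists reuse the Sentence A / Sentence B example:

```python
import math
from collections import Counter

def similarity(tokens_a, tokens_b, keywords):
    """Cosine similarity of two articles over a shared keyword set."""
    ca, cb = Counter(tokens_a), Counter(tokens_b)
    va = [ca[w] for w in keywords]  # word frequency vector of article A
    vb = [cb[w] for w in keywords]  # word frequency vector of article B
    dot = sum(x * y for x, y in zip(va, vb))
    na = math.sqrt(sum(x * x for x in va))
    nb = math.sqrt(sum(x * x for x in vb))
    return dot / (na * nb) if na and nb else 0.0

# Merged keyword set and the two segmented sentences from Task 2.
keywords = ["I", "like", "watch", "TV", "movies", "not", "also"]
a = ["I", "like", "watch", "TV", "not", "like", "watch", "movies"]
b = ["I", "not", "like", "watch", "TV", "also", "not", "like", "watch", "movies"]
score = similarity(a, b, keywords)  # about 0.938, same as before
```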
Task 3: How to use word frequency to automatically summarize articles
Information is contained in sentences, some sentences contain more information, and some sentences contain less information.
"Automatic summarization" means finding the sentences that contain the most information.
The amount of information in a sentence is measured by its keywords: the more keywords a sentence contains, the more important it is.
Luhn proposed using "clusters" to represent the aggregation of keywords. A "cluster" is a sentence fragment containing multiple keywords.
As long as the distance between two keywords is below a "threshold", they are considered to belong to the same cluster; the threshold Luhn suggested is 4 or 5.
In other words, if the number of other words between two keywords exceeds the threshold, the two keywords are split into separate clusters.
Importance of a cluster = (number of keywords it contains)^2 / (length of the cluster in words)
For example, a cluster of 7 words, 4 of which are keywords, has an importance score of (4 x 4) / 7 ≈ 2.3.
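Luhn's cluster scoring formula, applied to the worked example above (the keyword names are placeholders):

```python
def cluster_score(cluster_words, keywords):
    """Luhn's importance: (number of keywords in the cluster)^2 / cluster length."""
    k = sum(1 for w in cluster_words if w in keywords)
    return k * k / len(cluster_words)

# A 7-word cluster containing 4 keywords, as in the example.
cluster = ["k1", "x", "k2", "k3", "y", "z", "k4"]
score = cluster_score(cluster, {"k1", "k2", "k3", "k4"})  # 16/7, about 2.3
```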
Then, find the sentences containing the highest-scoring clusters (for example, the top 5 sentences) and put them together to form the automatic summary of the article. A simplified version of this idea can be written as pseudocode:
Summarizer(originalText, maxSummarySize):
    // Calculate the word frequencies of the original text,
    // producing an array such as [(10, 'the'), (3, 'language'), (8, 'code'), ...]
    wordFrequences = getWordCounts(originalText)
    // Filter out stop words; the array becomes [(3, 'language'), (8, 'code'), ...]
    contentWordFrequences = filtStopWords(wordFrequences)
    // Sort by word frequency and drop the counts; the array becomes ['code', 'language', ...]
    contentWordsSortbyFreq = sortByFreqThenDropFreq(contentWordFrequences)
    // Split the article into sentences
    sentences = getSentences(originalText)
    // For each keyword, select the first sentence in which it appears
    setSummarySentences = {}
    foreach word in contentWordsSortbyFreq:
        firstMatchingSentence = search(sentences, word)
        setSummarySentences.add(firstMatchingSentence)
        if setSummarySentences.size() == maxSummarySize:
            break
    // Join the selected sentences in order of appearance to form the summary
    summary = ""
    foreach sentence in sentences:
        if sentence in setSummarySentences:
            summary = summary + " " + sentence
    return summary
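Under the assumptions that words can be split with a simple regex, that sentences end in ".", "!" or "?", and that a substring test is good enough for keyword matching, the pseudocode can be turned into runnable Python (a sketch, not Luhn's exact method; the function and parameter names are illustrative):

```python
import re

def summarize(original_text, max_summary_size, stop_words=frozenset()):
    """Pick one sentence per frequent content word, up to max_summary_size."""
    # 1. Word frequencies over the whole text, stop words removed.
    words = re.findall(r"\w+", original_text.lower())
    freqs = {}
    for w in words:
        if w not in stop_words:
            freqs[w] = freqs.get(w, 0) + 1
    content_words = sorted(freqs, key=freqs.get, reverse=True)
    # 2. Split the text into sentences.
    sentences = [s.strip()
                 for s in re.split(r"(?<=[.!?])\s+", original_text) if s.strip()]
    # 3. For each content word (most frequent first), take the first
    #    sentence containing it (simple substring match).
    chosen = set()
    for word in content_words:
        for s in sentences:
            if word in s.lower():
                chosen.add(s)
                break
        if len(chosen) >= max_summary_size:
            break
    # 4. Emit the chosen sentences in their original order.
    return " ".join(s for s in sentences if s in chosen)

text = "Bees make honey. Honey is sweet. Farmers keep bees. Markets sell honey."
summary = summarize(text, 2, stop_words={"is"})
```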