Word frequency statistics and keyword extraction with jieba

1 Word frequency statistics

1.1 Simple word frequency statistics

  1. Import the jieba library and define the text
import jieba
text = "Python是一种高级编程语言,广泛应用于人工智能、数据分析、Web开发等领域。"
  2. Tokenize the text
words = jieba.cut(text)

This step segments the text into words and returns a generator object, words, which can be traversed with a for loop. A generator can be consumed only once; use jieba.lcut(text) instead if a list is needed.
  3. Count word frequency

word_count = {}
for word in words:
    if len(word) > 1:
        word_count[word] = word_count.get(word, 0) + 1

This step traverses all the words, counts the occurrences of each one, and stores the counts in the dictionary word_count. Word frequency counts can be improved by removing stop words; here, words shorter than two characters are simply filtered out, which also drops punctuation.
  4. Output the result

for word, count in word_count.items():
    print(word, count)
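
The counting loop above can also be written with collections.Counter from the standard library. The following is a minimal sketch, not the original author's code: the token list is hand-written to stand in for the output of jieba.cut(text) (the real segmentation may differ), and count_words and min_length are hypothetical names introduced for illustration.

```python
from collections import Counter

# Hand-written tokens standing in for the output of jieba.cut(text);
# the segmentation jieba actually produces may differ.
tokens = ["Python", "是", "一种", "高级", "编程语言", ",", "广泛应用", "于",
          "人工智能", "、", "数据分析", "、", "Web", "开发", "等", "领域", "。"]


def count_words(tokens, min_length=2):
    """Count tokens that have at least min_length characters."""
    return Counter(tok for tok in tokens if len(tok) >= min_length)


word_count = count_words(tokens)

# most_common(n) returns the n most frequent (word, count) pairs.
for word, count in word_count.most_common(3):
    print(word, count)
```

Counter behaves like the dictionary used above, and most_common() additionally sorts the words by frequency, which is usually what a word frequency report needs.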


1.2 Add stop words

To count word frequency more accurately, we can filter out stop words, that is, common words that carry little meaning. The steps are as follows:

  1. Define a list of stop words
import jieba

# Stop word list
stopwords = ['是', '一种', '等']
  2. Tokenize the text and filter out stop words
text = "Python是一种高级编程语言,广泛应用于人工智能、数据分析、Web开发等领域。"
words = jieba.cut(text)
words_filtered = [word for word in words if word not in stopwords and len(word) > 1]
  3. Count the word frequency and output the result
word_count = {}
for word in words_filtered:
    word_count[word] = word_count.get(word, 0) + 1
for word, count in word_count.items():
    print(word, count)

After adding stop words, the stopped word 一种 no longer appears in the output.
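
In practice the stop word list is often longer and kept one word per line. The sketch below illustrates that under stated assumptions: load_stopwords and count_without_stopwords are hypothetical helpers, the lines are supplied in memory rather than read from a real file, and the tokens are hand-written stand-ins for the output of jieba.cut(text). A set is used so that each membership test does not scan the whole list.

```python
from collections import Counter


def load_stopwords(lines):
    """Build a stop word set from an iterable of lines, one word per line."""
    return frozenset(line.strip() for line in lines if line.strip())


def count_without_stopwords(tokens, stopwords, min_length=2):
    """Count tokens that are not stop words and have at least min_length characters."""
    return Counter(
        tok for tok in tokens
        if tok not in stopwords and len(tok) >= min_length
    )


# In-memory stand-in for the lines of a stop word file (blank lines are skipped).
stopword_lines = ["是\n", "一种\n", "等\n", "\n"]
stopwords = load_stopwords(stopword_lines)

# Hand-written tokens standing in for the output of jieba.cut(text).
tokens = ["Python", "是", "一种", "高级", "编程语言", ",", "广泛应用", "于",
          "人工智能", "、", "数据分析", "、", "Web", "开发", "等", "领域", "。"]

word_count = count_without_stopwords(tokens, stopwords)
for word, count in word_count.items():
    print(word, count)
```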

2 Keyword extraction

2.1 Keyword extraction principle

Unlike word frequency statistics that simply count words, jieba extracts keywords based on the TF-IDF (Term Frequency-Inverse Document Frequency) algorithm. The TF-IDF algorithm is a commonly used text feature extraction method, which can measure the importance of a word in a text.

Specifically, the TF-IDF algorithm consists of two parts:

  1. Term frequency (TF): the number of times a word appears in a text, expressed as a raw count, a count normalized by the length of the text, or a binary indicator. Term frequency reflects how prominent a word is within the text, but ignores how common the word is across the entire corpus.
  2. Inverse document frequency (IDF): the logarithm of the total number of documents divided by the number of documents that contain the word, which measures how rare the word is across the corpus. The larger the inverse document frequency, the rarer the word and the higher its importance; the smaller the inverse document frequency, the more common the word and the lower its importance.

The TF-IDF algorithm calculates the importance of each word in the text by comprehensively considering word frequency and inverse document frequency, so as to extract keywords. In jieba, the specific implementation of keyword extraction includes the following steps:

  1. Segment the text to get the word segmentation result.
  2. Count the number of times each word appears in the text and calculate the word frequency.
  3. Obtain the inverse document frequency of each word. jieba looks this value up in a precomputed IDF table built from a reference corpus, rather than counting documents itself.
  4. Combine term frequency and inverse document frequency to calculate the TF-IDF value of each word in the text.
  5. Sort the TF-IDF values, and select the words with the highest scores as keywords.
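
The steps above can be sketched in plain Python. This is an illustration of the general TF-IDF procedure, not jieba's implementation: the corpus is a hand-written toy collection of already segmented documents, the IDF is computed from that corpus instead of a precomputed table, and top_keywords and top_k are hypothetical names.

```python
import math
from collections import Counter

# Toy corpus of already segmented documents (hand-written for illustration).
corpus = [
    ["python", "python", "language", "data"],
    ["java", "language", "web"],
    ["java", "web", "data", "analysis"],
    ["language", "web"],
]


def top_keywords(doc, corpus, top_k=2):
    """Rank the words of doc by TF-IDF; doc is assumed to be part of corpus."""
    counts = Counter(doc)              # step 2: occurrences of each word
    total_words = len(doc)
    n_docs = len(corpus)
    scores = {}
    for word, count in counts.items():
        tf = count / total_words
        # step 3: number of documents that contain the word (at least 1 here)
        doc_freq = sum(1 for other in corpus if word in other)
        idf = math.log10(n_docs / doc_freq)
        scores[word] = tf * idf        # step 4: TF-IDF value
    # step 5: sort by score and keep the highest-scoring words
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_k]


keywords = top_keywords(corpus[0], corpus)
for word, score in keywords:
    print(word, round(score, 4))
```

In the first document, "python" ranks highest because it is frequent there and appears in no other document, while "language" scores low because it occurs in three of the four documents.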

The two quantities are calculated as follows.
TF (term frequency) is the frequency with which a word appears in a document:
$$\mathrm{TF} = \frac{\text{number of occurrences of the word in the document}}{\text{total number of words in the document}}$$
For example, if a word appears 10 times in a document containing 100 words, the TF of the word is $10 / 100 = 0.1$.
IDF (inverse document frequency) is the logarithm of the total number of documents in the collection divided by the number of documents that contain the word:
$$\mathrm{IDF} = \log\left(\frac{\text{total number of documents in the document collection}}{\text{number of documents containing the word}}\right)$$
For example, in a document collection containing 1000 documents, if a word appears in 100 documents, the IDF of the word is $\log_{10}(1000 / 100) = 1.0$ (taking the logarithm to base 10).
TF-IDF is the product of TF and IDF:
$$\mathrm{TFIDF} = \mathrm{TF} \times \mathrm{IDF}$$
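
The arithmetic in these examples can be checked with a few lines of Python. This is only a sketch of the formulas as stated here, with a base 10 logarithm assumed so that the IDF example comes out to 1.0; the function names tf, idf, and tf_idf are introduced for illustration.

```python
import math


def tf(occurrences, total_words):
    """Term frequency: occurrences of the word divided by words in the document."""
    return occurrences / total_words


def idf(total_docs, docs_with_word):
    """Inverse document frequency with a base 10 logarithm (an assumption)."""
    return math.log10(total_docs / docs_with_word)


def tf_idf(occurrences, total_words, total_docs, docs_with_word):
    """TF-IDF as the product of the two quantities above."""
    return tf(occurrences, total_words) * idf(total_docs, docs_with_word)


tf_value = tf(10, 100)           # word appears 10 times in 100 words
idf_value = idf(1000, 100)       # word appears in 100 of 1000 documents
score = tf_idf(10, 100, 1000, 100)
print(tf_value, idf_value, score)
```

With the example values, TF is 0.1 and IDF is 1.0, so the TF-IDF value of the word is 0.1.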

It should be noted that the TF-IDF algorithm only considers the occurrence of words in the text, while ignoring the correlation between words. Therefore, in some specific application scenarios, it is necessary to use other text feature extraction methods, such as word vectors, topic models, etc.

2.2 Keyword extraction code

import jieba.analyse

# Text to extract keywords from
text = "Python是一种高级编程语言,广泛应用于人工智能、数据分析、Web开发等领域。"

# Extract keywords with jieba
keywords = jieba.analyse.extract_tags(text, topK=5, withWeight=True)

# Print each keyword and its weight
for keyword, weight in keywords:
    print(keyword, weight)

In this example, we first import the jieba.analyse module and define the text from which keywords will be extracted. Next, we call the jieba.analyse.extract_tags() function, where the topK parameter sets the number of keywords to extract and the withWeight parameter indicates whether to return the weight of each keyword. Finally, we iterate through the list of keywords and print each keyword with its weight.
The output lists the keywords that jieba extracts from the input text with the TF-IDF algorithm, together with the weight of each keyword.

Origin blog.csdn.net/nkufang/article/details/129803982