Several methods to count text word frequency (Python)

Table of contents

1. Word frequency statistics of single sentences

2. Word frequency statistics of articles

Method 1: Use the collection deduplication method

Method 2: Use dictionary statistics

Method 3: Use a counter


Word frequency statistics is a basic task in natural language processing: for a sentence, an article, or a group of articles, count how many times each word occurs, and use those counts to discover the topic words and hot words of the text.

1. Word frequency statistics of single sentences

Idea: first define an empty dictionary my_dict, then traverse the article (or sentence) word by word and check whether each word is already a key of my_dict. If it is not, add the word as a key of my_dict with the value 1; if it already exists, increase the corresponding value by 1.

# Count how many times each word occurs in a single sentence
news = "Xi, also general secretary of the Communist Party of China (CPC) Central Committee and chairman of the Central Military Commission, made the remarks while attending a voluntary tree-planting activity in the Chinese capital's southern district of Daxing."    
def couWord(news_list): 
    # Counting function. Input: word list of a sentence. Output: dictionary of word -> count
    my_dict = {}  # empty dictionary to hold the number of occurrences of each word
    for v in news_list:
        if my_dict.get(v):
            my_dict[v] += 1
        else:
            my_dict[v] = 1
    return my_dict

print(couWord(news.split()))

output

{'Xi,': 1, 'also': 1, 'general': 1, 'secretary': 1, 'of': 4, 'the': 4, 'Communist': 1, 'Party': 1, 'China': 1, '(CPC)': 1, 'Central': 2, 'Committee': 1, 'and': 1, 'chairman': 1, 'Military': 1, 'Commission,': 1, 'made': 1, 'remarks': 1, 'while': 1, 'attending': 1, 'a': 1, 'voluntary': 1, 'tree-planting': 1, 'activity': 1, 'in': 1, 'Chinese': 1, "capital's": 1, 'southern': 1, 'district': 1, 'Daxing.': 1}

The above couWord function performs basic word frequency statistics, but it has the following two problems.

(1) Stop words are not removed

The output still contains stop words such as 'also', 'and' and 'in'. Stop words have little to do with the topic of the article and need to be filtered out in word frequency statistics and other processing steps.

(2) The words are not sorted by number of occurrences

Sorting the words by their number of occurrences makes it easy to spot the topic words or hot words of an article at a glance.

The improved couWord function is as follows:

def couWord(news_list, word_list, N):
    # Input: word list of the article, stop word list. Output: top N words
    my_dict = {}  # empty dictionary to hold the number of occurrences of each word
    for v in news_list:
        if (v not in word_list):  # skip words that appear in the stop word list
            if my_dict.get(v):
                my_dict[v] += 1
            else:
                my_dict[v] = 1
                  
    topWord = sorted(zip(my_dict.values(),my_dict.keys()),reverse=True)[:N] 
    
    return topWord

Load a list of English stop words:

stopPath = r'Data/stopword.txt'
with open(stopPath,encoding = 'utf-8') as file:
    word_list = file.read().split()      # read() returns a string; split() turns it into a list

print(couWord(news.split(),word_list,5)) 

output

[(2, 'Central'), (1, 'voluntary'), (1, 'tree-planting'), (1, 'southern'), (1, 'secretary')]

2. Word frequency statistics of articles

(1) Word frequency statistics of a single article

By defining a function that reads the article and applies case conversion and other processing, we obtain the word list of the input article.

https://python123.io/resources/pye/hamlet.txt

The above is the download link for the English text of Hamlet. After downloading, save it under the project path.

Use the open() function to open the hamlet.txt file, and use the read() method to read the file content and save the text in the txt variable.

def readFile(filePath): 
    # Input: file path. Output: list of words
    with open(filePath,encoding = 'utf-8') as file:
        txt = file.read().lower()  # the whole file as a single lowercase string
        words = txt.split()        # convert the string into a list of words
    
    return words

filePath = r'Data/news/hamlet.txt'
new_list = readFile(filePath)  # read the file
print(couWord(new_list,word_list,5))

Next, we need to preprocess the text, remove punctuation marks, segment it into words, etc. We can use regular expressions to achieve this step.

import re

# remove punctuation
text = re.sub(r'[^\w\s]', '', text)

# split into words
words = text.split()

We use the re.sub() function and the regular expression [^\w\s] to remove punctuation, then use the split() method to split the text into words and save the results in the words list.
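Putting reading and cleaning together, a minimal sketch might look like this (readFileClean is just an illustrative name, and the regex is the same one used above):

import re

def readFileClean(filePath):
    # read the file, lowercase it, strip punctuation, and split into words
    with open(filePath, encoding='utf-8') as file:
        text = file.read().lower()
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation
    return text.split()                  # list of cleaned words

words = readFileClean(r'Data/news/hamlet.txt')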

Alternatively:

Our text contains punctuation and other noisy characters, so we need to clean the data so that the documents contain only the letters we need (for convenience, replace the noise characters with spaces and convert the whole document to lowercase).

Open the file, read it, clean the data, and store the result.

def getText():
    txt = open("Hmlet.txt","r").read()
    txt = txt.lower()
    for ch in '!@#$%^&*()_/*-~':
        txt = txt.replace(ch," ")
    return txt


hamlet = getText()
words = hamlet.split()
counts = {}
for word in words:
    counts[word] = counts.get(word,0) + 1

items = list(counts.items())
items.sort(key= lambda x:x[1],reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word,count))

Now that we have the split word list words, we need to count the number of times each word appears. We can use Python's dictionary data structure to implement word frequency statistics.

word_counts = {}

for word in words:
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1

In this code, we first create an empty dictionary word_counts, and then iterate through each word in the words list. For each word, if it already exists in the word_counts dictionary, add 1 to the corresponding count value; otherwise, add a new key-value pair in the dictionary, with the key being the word and the value being 1.

After counting word frequencies, we need to sort them in descending order by word frequency in order to output the results later. We can use Python's built-in function sorted() to achieve sorting.

sorted_word_counts = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)

We use the word_counts.items() method to obtain all key-value pairs in the word_counts dictionary, use key=lambda x: x[1] to sort by the value in each pair, and set reverse=True for descending order. The sorted result is saved in the sorted_word_counts list.

Finally, we output the word frequency statistics results to the console or file.

for word, count in sorted_word_counts:
    print(f'{word}: {count}')

In this code, we use a for loop to traverse each element in the sorted_word_counts list (each element is a key-value pair), and use the print() function to output the word and corresponding word frequency.
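If you would rather write the results to a file than to the console, a minimal sketch (the file name result.txt is just an example) could be:

with open('result.txt', 'w', encoding='utf-8') as out:
    for word, count in sorted_word_counts:
        out.write(f'{word}: {count}\n')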

(2) Word frequency statistics of multiple articles

You need to use the os.listdir method to read the list of files in the folder, and then process the files one by one.

import os
folderPath = r'Data/news'  # folder path
tmpFile = os.listdir(folderPath)
allNews = []
for file in tmpFile:  # read each file in turn
    newsfile = folderPath + '/' + file  # build the full file path
    allNews += readFile(newsfile)       # concatenate every file's word list into allNews
    
print(couWord(allNews,word_list,5))  

output

[(465, 'china'), (323, 'chinese'), (227, 'xi'), (196, "china's"), (134, 'global')]

(3) Processing of Chinese articles

For word frequency statistics of Chinese articles, first use a word segmenter such as jieba to segment the article, load a Chinese stop word list, and then count the word frequencies, as sketched below.
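A rough sketch of this process, assuming a Chinese article saved as article_cn.txt and a stop word file Data/cn_stopwords.txt with one word per line (both file names are just examples):

import jieba

with open('Data/cn_stopwords.txt', encoding='utf-8') as f:
    cn_stopwords = set(f.read().split())   # Chinese stop word list

txt = open('article_cn.txt', 'r', encoding='utf-8').read()
words = jieba.lcut(txt)                    # segment the article into words
counts = {}
for word in words:
    if len(word) == 1 or word in cn_stopwords:  # skip single characters and stop words
        continue
    counts[word] = counts.get(word, 0) + 1

print(sorted(counts.items(), key=lambda x: x[1], reverse=True)[:5])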

3. Frequency of appearances of characters in the Romance of the Three Kingdoms

Use the jieba library to segment the Chinese text and store the result in the list words, traverse it and accumulate each phrase and its frequency as key-value pairs in the dictionary counts, then convert the dictionary to a list, sort it, and print the result.

https://python123.io/resources/pye/threekingdoms.txt

The above is the link to obtain the Chinese version of Romance of the Three Kingdoms. After downloading, save it to the project path.

import jieba
txt = open("threekingdoms.txt","r",encoding="utf-8").read()
counts = {}
words = jieba.lcut(txt)
for word in words:
    if len(word) == 1:
        continue
    else:
        counts[word] = counts.get(word,0) + 1
items = list(counts.items())
items.sort(key = lambda x:x[1] , reverse=True)
for i in range(15):
    word , count = items[i]
    print("{0:<10}{1:>5}".format(word,count))

This is simpler than the word frequency count for the English Hamlet because there is no character noise to clean up, which is largely thanks to how easy the jieba library is to use.

What follows, however, is ambiguity in the counts: because of how jieba segments the text, phrases that are not personal names are also counted.

For example, entries such as "the two of them" (二人) and "Kongming said" (孔明曰) in the results are redundant or incorrectly segmented.

Therefore, we need further processing so that the counts reflect how many times each character's name appears.

The previous steps output the 15 most frequent phrases, but what if we want the number of appearances of each character? This requires filtering the results and removing the entries we do not need.

Because the previous output makes it easy to spot phrases that appear frequently but are not personal names, we store them in a set, then traverse it and delete those phrases from the counts.

excludes = {"将军","却说","二人","不可","荆州","不能","如此","商议","如何","主公","军士","左右","军马"}
for i in excludes:
    del counts[i]
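Note that del counts[i] raises a KeyError if one of the excluded words never appears in the text; if that is a concern, counts.pop(i, None) is a safer variant:

for i in excludes:
    counts.pop(i, None)  # remove the entry if present, otherwise do nothing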

Redundancy processing: unify the frequently occurring aliases of the same character

 elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word ==  "丞相":
        rword = "曹操"

After this processing, we get the output we want:

import jieba
txt = open("threekingdoms.txt","r",encoding="utf-8").read()
counts = {}
excludes = {"将军","却说","二人","不可","荆州","不能","如此","商议","如何","主公","军士","左右","军马"}
words = jieba.lcut(txt)
for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word ==  "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword,0) + 1
for i in excludes:
    del counts[i]
items = list(counts.items())
items.sort(key = lambda x:x[1] , reverse=True)
for i in range(7):
    word,count = items[i]
    print("{0:<10}{1:>5}".format(word,count))

 

Method 1: Use the collection deduplication method


def word_count1(words, n):
    word_list = []
    for word in set(words):        # deduplicate with a set so each word is counted only once
        num = words.count(word)    # count how many times the word occurs in the original list
        word_list.append([word, num])
    word_list.sort(key=lambda x: x[1], reverse=True)
    for i in range(n):
        word, count = word_list[i]
        print('{0:<15}{1:>5}'.format(word, count))

Description: deduplicate the word list with a set so that no word is counted twice, use the list's count method to get each word's frequency, pack each word and its count into a small list appended to word_list, then sort word_list with the sort method and print the top n entries.
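A quick usage sketch, reusing the readFile helper and the hamlet word list from section 2 (note that words.count scans the whole list for every distinct word, so this method gets slow on long texts):

words = readFile(r'Data/news/hamlet.txt')
word_count1(words, 10)  # print the 10 most frequent words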

Method 2: Use dictionary statistics


def word_count2(words, n):
    counts = {}
    for word in words:
        if len(word) == 1:
            continue
        else:
            counts[word] = counts.get(word, 0) + 1
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    for i in range(n):
        word, count = items[i]
        print("{0:<15}{1:>5}".format(word, count))

Method 3: Use a counter


def word_count3(words, n):
    from collections import Counter
    counts = Counter(words)
    for ch in "":  # delete elements that should not be counted (placeholder: fill in as needed)
        del counts[ch]
    for word, count in counts.most_common(n):  # most_common() already returns items sorted by count
        print("{0:<15}{1:>5}".format(word, count))
