Table of contents
1. Word frequency statistics of single sentences
2. Word frequency statistics of articles
3. Frequency of appearances of characters in the Romance of the Three Kingdoms
Method 1: Use the collection deduplication method
Method 2: Use dictionary statistics
Method 3: Use a counter
Word frequency statistics is a basic task of natural language processing. Given a sentence, an article, or a group of articles, we count the number of occurrences of each word and, on that basis, identify the topic words and hot words of the text.
1. Word frequency statistics of single sentences
Idea: first define an empty dictionary my_dict, then traverse the article (or sentence) word by word and check whether each word is already a key of my_dict. If it is not, add the word as a key of my_dict with a value of 1; if it already exists, increment the corresponding value by 1.
# Count the number of occurrences of each word in a single sentence
news = "Xi, also general secretary of the Communist Party of China (CPC) Central Committee and chairman of the Central Military Commission, made the remarks while attending a voluntary tree-planting activity in the Chinese capital's southern district of Daxing."

def couWord(news_list):
    # Counting function. Input: list of words in the sentence; output: a word -> count dictionary
    my_dict = {}  # empty dictionary to hold each word's number of occurrences
    for v in news_list:
        if my_dict.get(v):
            my_dict[v] += 1
        else:
            my_dict[v] = 1
    return my_dict

print(couWord(news.split()))
output
{'Xi,': 1, 'also': 1, 'general': 1, 'secretary': 1, 'of': 4, 'the': 4, 'Communist': 1, 'Party': 1, 'China': 1, '(CPC)': 1, 'Central': 2, 'Committee': 1, 'and': 1, 'chairman': 1, 'Military': 1, 'Commission,': 1, 'made': 1, 'remarks': 1, 'while': 1, 'attending': 1, 'a': 1, 'voluntary': 1, 'tree-planting': 1, 'activity': 1, 'in': 1, 'Chinese': 1, "capital's": 1, 'southern': 1, 'district': 1, 'Daxing.': 1}
The couWord function above implements word frequency statistics, but it has the following two problems.
(1) Stopwords are not removed
Stopwords such as 'also', 'and', and 'in' appear in the output. Stopwords have little to do with the topic of an article and need to be filtered out in word frequency statistics and other processing steps.
(2) Words are not sorted by number of occurrences
Sorting the words by their occurrence counts makes it easy to spot the topic words or hot words of the article at a glance.
The improved couWord function is as follows:
def couWord(news_list, word_list, N):
    # Input: list of words in the article, stopword list; output: top N words
    my_dict = {}  # empty dictionary to hold each word's number of occurrences
    for v in news_list:
        if v not in word_list:  # skip words that appear in the stopword list
            if my_dict.get(v):
                my_dict[v] += 1
            else:
                my_dict[v] = 1
    topWord = sorted(zip(my_dict.values(), my_dict.keys()), reverse=True)[:N]
    return topWord
Load a list of English stop words:
stopPath = r'Data/stopword.txt'
with open(stopPath, encoding='utf-8') as file:
    word_list = file.read().split()  # read() returns one string; split() turns it into a list

print(couWord(news.split(), word_list, 5))
output
[(2, 'Central'), (1, 'voluntary'), (1, 'tree-planting'), (1, 'southern'), (1, 'secretary')]
2. Word frequency statistics of articles
(1) Word frequency statistics of a single article
By defining a function that reads an article and applies lowercase conversion and other processing to it, we obtain the word list of the input article.
https://python123.io/resources/pye/hamlet.txt
The above is the path for obtaining the English text of Hamlet. After downloading it, save it to the project path.
Use the open() function to open the hamlet.txt file and the read() method to read its contents, saving the text in the txt variable.
def readFile(filePath):
    # Input: file path; output: list of strings
    with open(filePath, encoding='utf-8') as file:
        txt = file.read().lower()  # returns one string, converted to lowercase
        words = txt.split()  # convert the string into a list of words
    return words

filePath = r'Data/news/hamlet.txt'
new_list = readFile(filePath)  # read the file
print(couWord(new_list, word_list, 5))
Next, we need to preprocess the text, remove punctuation marks, segment it into words, etc. We can use regular expressions to achieve this step.
import re

# Remove punctuation marks
text = re.sub(r'[^\w\s]', '', text)
# Split into words
words = text.split()
We use the re.sub() function and the regular expression [^\w\s] to remove punctuation, then use the split() method to split the text into words and save the results in the words list.
Alternatively:
Our text contains noisy punctuation and other characters, so we need to clean the data so that each document contains only the letters we need (for convenience, we replace the noise characters with spaces and convert the whole document to lowercase).
Open the file, read it, clean the data, and return the result.
def getText():
    txt = open("hamlet.txt", "r").read()
    txt = txt.lower()
    for ch in '!@#$%^&*()_/*-~':
        txt = txt.replace(ch, " ")  # replace noise characters with spaces
    return txt

hamlet = getText()
words = hamlet.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
Now that we have the split word list words, we need to count the number of times each word appears. We can use Python's dictionary data structure to implement word frequency statistics.
word_counts = {}
for word in words:
    if word in word_counts:
        word_counts[word] += 1
    else:
        word_counts[word] = 1
In this code, we first create an empty dictionary word_counts, and then iterate through each word in the words list. For each word, if it already exists in the word_counts dictionary, add 1 to the corresponding count value; otherwise, add a new key-value pair in the dictionary, with the key being the word and the value being 1.
After counting word frequencies, we need to sort them in descending order by word frequency in order to output the results later. We can use Python's built-in function sorted() to achieve sorting.
sorted_word_counts = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)
We use the word_counts.items() method to obtain all key-value pairs in the word_counts dictionary, and use key=lambda x: x[1] to specify sorting according to the values in the key-value pairs, and reverse=True indicates descending order. The sorted results will be saved in the sorted_word_counts list.
Finally, we output the word frequency statistics results to the console or file.
for word, count in sorted_word_counts:
    print(f'{word}: {count}')
In this code, we use a for loop to traverse each element in the sorted_word_counts list (each element is a key-value pair), and use the print() function to output the word and corresponding word frequency.
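Since the passage mentions writing the results to a file as well, here is a minimal sketch of the file variant; word_freq.txt is an assumed output path, not from the original article.

```python
# Write the sorted word frequencies to a file instead of the console.
# 'word_freq.txt' is an assumed output path.
word_counts = {'the': 3, 'cat': 2, 'sat': 1}  # example counts
sorted_word_counts = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)
with open('word_freq.txt', 'w', encoding='utf-8') as out:
    for word, count in sorted_word_counts:
        out.write(f'{word}: {count}\n')
```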
(2) Word frequency statistics of multiple articles
You need to use the os.listdir method to read the list of files in the folder, and then process the files one by one.
import os

folderPath = r'Data/news'  # folder path
tmpFile = os.listdir(folderPath)
allNews = []
for file in tmpFile:  # read each file
    newsfile = folderPath + '/' + file  # build the full file path (os.path.join is more portable)
    allNews += readFile(newsfile)  # concatenate every file's word list into allNews
print(couWord(allNews, word_list, 5))
output
[(465, 'china'), (323, 'chinese'), (227, 'xi'), (196, "china's"), (134, 'global')]
(3) Processing of Chinese articles
For word frequency statistics of Chinese articles, first use a word segmenter such as jieba to segment the article, load a Chinese stopword list, and then perform the word frequency statistics.
3. Frequency of appearances of characters in the Romance of the Three Kingdoms
Use the jieba library to segment the Chinese text into a list of words, traverse it while storing each word and its frequency as a key-value pair in a counts dictionary, sort the items by count, and then output the result.
https://python123.io/resources/pye/threekingdoms.txt
The above is the link to obtain the Chinese version of Romance of the Three Kingdoms. After downloading, save it to the project path.
import jieba

txt = open("threekingdoms.txt", "r", encoding="utf-8").read()
counts = {}
words = jieba.lcut(txt)
for word in words:
    if len(word) == 1:
        continue
    else:
        counts[word] = counts.get(word, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
This is simpler than the English Hamlet count and does not need to deal with noisy character data, thanks to how easy the jieba library is to use.
What follows, however, is ambiguity in the word frequencies: because of how jieba segments text, phrases that are not personal names are also counted.
For example, '二人' ('the two of them') and '孔明曰' ('Kongming said') in the results are redundant or mis-segmented entries.
Therefore, we need further processing so that the word frequencies count how often each character's name appears.
After the previous steps, we have output the 15 most frequent phrases, but what if we want the appearance counts of the characters? That requires filtering the results and deleting the output we do not need.
Because the previous output makes it easy to identify phrases that appear frequently but are not personal names, we store them in a set and delete them from the counts.
excludes = {"将军","却说","二人","不可","荆州","不能","如此","商议","如何","主公","军士","左右","军马"}
for i in excludes:
    del counts[i]
Redundancy processing: unify the frequently appearing aliases of the same character
elif word == "诸葛亮" or word == "孔明曰":
    rword = "孔明"
elif word == "关公" or word == "云长":
    rword = "关羽"
elif word == "玄德" or word == "玄德曰":
    rword = "刘备"
elif word == "孟德" or word == "丞相":
    rword = "曹操"
Putting all of this processing together, we get the output we want:
import jieba

txt = open("threekingdoms.txt", "r", encoding="utf-8").read()
counts = {}
excludes = {"将军","却说","二人","不可","荆州","不能","如此","商议","如何","主公","军士","左右","军马"}
words = jieba.lcut(txt)
for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for i in excludes:
    del counts[i]
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(7):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
Method 1: Use the collection deduplication method
Description: use a set to deduplicate the list of words so that no word is counted more than once; use the list's count method to get each word's frequency; pack each word and its count into a list and append it to word_list; finally, sort word_list with the list's sort method.
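The code block for this method was lost in the original; a sketch matching the description might look like this:

```python
# Method 1 sketch: set() removes duplicates, list.count() gives frequencies.
text = "the cat sat on the mat the cat"  # example input
words = text.split()
word_list = []
for w in set(words):  # iterate over each distinct word
    word_list.append([words.count(w), w])  # pack as [count, word] pairs
word_list.sort(reverse=True)  # sort by count, descending
print(word_list)
```

Note that list.count scans the whole list once per distinct word, so this is slower than a single-pass dictionary count on large texts.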
Method 2: Use dictionary statistics
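The original code block for this method is also missing; a sketch of the dictionary approach described earlier in the article:

```python
# Method 2 sketch: count in one pass with a dictionary via dict.get,
# then sort the (word, count) pairs by count.
text = "the cat sat on the mat the cat"  # example input
counts = {}
for w in text.split():
    counts[w] = counts.get(w, 0) + 1  # 0 is the default for unseen words
items = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
print(items)
```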
Method 3: Use a counter
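The original code block is missing here as well; a sketch using collections.Counter from the standard library, which handles the counting and top-N selection in one step:

```python
# Method 3 sketch: collections.Counter counts the words, and
# most_common(N) returns the N most frequent (word, count) pairs.
from collections import Counter

text = "the cat sat on the mat the cat"  # example input
counter = Counter(text.split())
print(counter.most_common(3))
```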