[Compiler Principles] Word frequency statistics for an English text in Python

Use Python to count the word frequencies of an English text. Text link: https://www.philippinetimes.com/news/257886068/australia-blocks-chinese-firms-huawei-zte-from-5g-network

1. Tuple creation

tup1 = ('Google', 'atguigu', 1997, 2000)
tup2 = (1, 2, 3, 4, 5)
tup3 = "a", "b", "c", "d"  # parentheses are optional

2. Dictionary creation

d = {'Alice': '2341', 'Beth': '9102', 'Cecil': '3258'}
d = {x: x + 1 for x in range(10)}  # dict comprehension

3. Set creation

s = {'name', 'aa', 'bb'}
s = set(sequence)  # sequence is any iterable; if it is a dict, its keys become the elements
s = {x for x in range(10) if x not in range(5, 10)}
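
The three containers above map naturally onto word counting. The following is a minimal sketch on a made-up sentence (not the news article): a dict holds the counts, a set gives the distinct words, and each (word, count) pair returned by `items()` is a tuple.

```python
# Illustrative sample text (an assumption, not taken from the article)
words = "the cat and the hat and the bat".split()

# dict: word -> number of occurrences
counts = {}
for w in words:
    counts[w] = counts.get(w, 0) + 1

# set: the distinct words, with duplicates removed
unique_words = set(words)

# tuple: every item of a dict is a (key, value) tuple
pairs = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)

print(counts)             # {'the': 3, 'cat': 1, 'and': 2, 'hat': 1, 'bat': 1}
print(len(unique_words))  # 5
print(pairs[0])           # ('the', 3)
```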

The word frequency statistics for the document follow the flowchart shown in Figure 4.2; the full code is given in the appendix.

Figure 4.2 Word frequency statistics flow chart
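
Assuming the flowchart follows the same steps as the appendix code (read the text, lowercase it, split it into words, count, sort), the process can also be sketched with `collections.Counter` from the standard library. This is an alternative, not the author's version; the helper name `word_frequencies` and the sample string are made up for illustration.

```python
import re
from collections import Counter

def word_frequencies(text):
    """Return (word, count) pairs sorted by count, highest first."""
    # Lowercase, then split on the same separators the appendix code uses
    tokens = re.split(r'[ ,.+"\n]', text.lower())
    # Consecutive separators yield empty strings; skip them
    counter = Counter(token for token in tokens if token)
    return counter.most_common()

# Illustrative sample, not the news article
sample = 'Huawei said the ban is unfair. The ban, said Huawei, hurts users.'
print(word_frequencies(sample)[:3])
```

Filtering out empty tokens before counting avoids the separate step of removing the '' key afterwards.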

3 Word frequency statistics results

(1) For the full word frequency statistics, see the appendix files ex1_word frequency statistics results.xlsx and ex1_word frequency statistics results.txt. The first file was produced by code I wrote myself, which processes the data before storing it, so its contents are more regular. The second follows the teacher's explanation and writes the results to the file in list format.
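
Storing one word per row keeps the results more regular than writing the whole list as a single string. The code that produced the .xlsx file is not shown in the post, so the sketch below uses the standard `csv` module as a stand-in; the output file name, the column headers, and the sample counts are all assumptions.

```python
import csv
import os
import tempfile

# Made-up counts for illustration only
voc_sorted = [('the', 12), ('to', 9), ('and', 7)]

# Hypothetical output location in the system temp directory
out_path = os.path.join(tempfile.gettempdir(), 'word_freq_example.csv')

# newline='' avoids blank rows between records on Windows
with open(out_path, 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['word', 'count'])  # header row
    writer.writerows(voc_sorted)        # one (word, count) pair per row

# Read the file back to check what was stored
with open(out_path, newline='', encoding='utf-8') as f:
    rows = list(csv.reader(f))
print(rows)
```

A CSV file opens directly in spreadsheet software; note that the counts come back as strings when the file is read again.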
           

Figure 4.3 Schematic diagram of word frequency statistics

As the files above show, the five most frequent words are "the", "to", "and", "huawei", and "that". Apart from "huawei", these are all function words (an article, a preposition, a conjunction, and a demonstrative).
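
Because function words dominate the top of the list, a common follow-up is to filter them out before ranking. This is a sketch under stated assumptions: the counts and the stop-word set below are illustrative and are not taken from the author's results.

```python
# Illustrative counts, not the actual results
voc = {'the': 30, 'to': 21, 'and': 18, 'huawei': 15, 'that': 12,
       'network': 9, 'australia': 8}

# A tiny hand-picked stop-word set (hypothetical; real lists are much longer)
stop_words = {'the', 'to', 'and', 'that', 'a', 'of', 'in'}

# Keep only the words that are not stop words, then sort by count
content = {w: c for w, c in voc.items() if w not in stop_words}
top = sorted(content.items(), key=lambda x: x[1], reverse=True)
print(top)  # [('huawei', 15), ('network', 9), ('australia', 8)]
```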

(2) A word cloud image generated from the word frequencies is shown in Figure 4.4.

Figure 4.4 Word Cloud Diagram

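import re

# 'ansi' is a Windows-only codec alias; on other systems use the file's actual encoding
with open('ex1_news.txt', encoding='ansi') as file:
    lowerText = file.read().lower()

# Split on spaces, commas, periods, plus signs, double quotes, and newlines
arr = re.split('[ ,.+"\n]', lowerText)

# Count how often each token occurs
voc = {}
for each in arr:
    if each not in voc:
        voc[each] = 1
    else:
        voc[each] += 1
voc.pop('', None)  # drop the empty token produced by consecutive separators, if present

# Sort the (word, count) pairs by count, highest first
vocSorted = sorted(voc.items(), key=lambda x: x[1], reverse=True)
with open('ex1_词频统计结果.txt', 'w') as newFile:
    newFile.write(str(vocSorted))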

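# Generate a word cloud from the word frequencies
from wordcloud import WordCloud, ImageColorGenerator
from PIL import Image
import numpy as np

image = Image.open('china_map.jpg')
graph = np.array(image)  # mask image as an array
# Arguments: font file, background color, maximum number of words, and the image used as the mask shape
wc = WordCloud(font_path="C:/Windows/Fonts/simfang.ttf", background_color='white', max_words=100, mask=graph)
wc.generate_from_frequencies(voc)  # build the word cloud from the given word frequencies
# Color generator based on the mask image; it has no effect unless passed to wc.recolor(color_func=image_color)
image_color = ImageColorGenerator(graph)
# Render the word cloud as an image
image = wc.to_image()
# Display the image
image.show()
# Save the image
image.save('云图1.jpg')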

 
