Download the full text of Hamlet.txt: https://python123.io/resources/pye/hamlet.txt
① The word frequency statistics code is as follows:
# CalHamlet_1.py
def getText():
    with open("Hamlet.txt", 'r') as f:
        txt = f.read()
    txt = txt.lower()  # convert all English letters in the text to lowercase
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, ' ')  # replace special characters with spaces
    return txt

hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1  # count each word, starting from 0
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, descending
for i in range(10):
    word, count = items[i]
    print('{0:<10}{1:>5}'.format(word, count))
Output:
the        1138
and         965
to          754
of          669
you         550
i           542
a           542
my          514
hamlet      462
in          436
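The same counting steps can be tried on a short inline string, without needing Hamlet.txt. The following is only a minimal sketch: the function names `normalize` and `top_words` and the sample sentence are illustrative assumptions, not part of the original program.

```python
# Illustrative sketch: normalize, top_words and the sample text are
# assumptions made for demonstration, not names from CalHamlet_1.py.
PUNCTUATION = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'

def normalize(text):
    """Lowercase the text and replace each punctuation character with a space."""
    text = text.lower()
    for ch in PUNCTUATION:
        text = text.replace(ch, ' ')
    return text

def top_words(text, n):
    """Return up to n (word, count) pairs, most frequent first."""
    counts = {}
    for word in normalize(text).split():
        # dict.get returns 0 for a word that has not been seen yet
        counts[word] = counts.get(word, 0) + 1
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    return items[:n]

sample = "To be, or not to be: that is the question."
for word, count in top_words(sample, 3):
    print('{0:<10}{1:>5}'.format(word, count))
```

Slicing with `items[:n]` instead of indexing `items[i]` means a text with fewer than n distinct words does not raise an IndexError.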
② The improved word frequency statistics code is as follows:
(It excludes most articles, pronouns, conjunctions, and other grammatical words.)
# CalHamlet_2.py
# Build an exclusion set to drop most articles, pronouns,
# conjunctions, and other grammatical words
excludes = {"the", "and", "of", "you", "a", "i", "my", "in"}

def getText():
    with open("Hamlet.txt", 'r') as f:
        txt = f.read()
    txt = txt.lower()  # convert all English letters in the text to lowercase
    for ch in '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~':
        txt = txt.replace(ch, ' ')  # replace special characters with spaces
    return txt

hamletTxt = getText()
words = hamletTxt.split()
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1  # count each word, starting from 0
for word in excludes:
    del counts[word]  # remove excluded words from the tally
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)  # sort by count, descending
for i in range(10):
    word, count = items[i]
    print('{0:<10}{1:>5}'.format(word, count))
Output:
to          754
hamlet      462
it          416
that        391
is          340
not         314
lord        309
his         296
this        295
but         269
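The `del` statement in CalHamlet_2.py raises a KeyError if an excluded word never occurs in the text. One possible variant, sketched here on the assumption that a missing word should simply be skipped, uses `pop` with a default value. It also swaps the hand-written dictionary for the standard library's `collections.Counter`, which is a substitution and not the original program's approach. The function name `top_words_excluding` and the sample text are hypothetical.

```python
from collections import Counter

# Illustrative sketch: top_words_excluding and the sample text are
# assumptions made for demonstration, not names from CalHamlet_2.py.
PUNCTUATION = '!"#$%&()*+,-./:;<=>?@[\\]^_`{|}~'

def top_words_excluding(text, excludes, n):
    """Return up to n (word, count) pairs, skipping any word in excludes."""
    text = text.lower()
    for ch in PUNCTUATION:
        text = text.replace(ch, ' ')
    counts = Counter(text.split())
    for word in excludes:
        # pop with a default does nothing if the word is absent
        counts.pop(word, None)
    return counts.most_common(n)

sample = "The king and the queen and the prince; my lord, the king!"
excludes = {"the", "and", "my", "hamlet"}  # "hamlet" is absent from the sample
for word, count in top_words_excluding(sample, excludes, 2):
    print('{0:<10}{1:>5}'.format(word, count))
```

The output above still lists "to", "it", "that", and "is", so with this approach the exclusion set can be extended freely without first checking that every word appears in the text.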
References:
[1] Song Tian, Li Xin, Huang Tianyu. Python Language Programming Fundamentals [M]. 2nd ed. Beijing: Higher Education Press, 2019: 171-174.