Word frequency statistics for a Harry Potter novel in Python


Requirements

Count how often each word occurs in the English novel HarryPotter5.txt, find the 20 most frequent words, and print them or write them to a file.


1. Open the file

Open the file, read its contents, and replace every non-word character with a space.
Code:

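import re

# Read the contents of the novel
with open('HarryPotter5.txt') as fp:
    content = fp.read()
# Replace all punctuation with spaces:
# \W matches any character that is not a word character
content = re.sub(r'\W', ' ', content)
# split() with no arguments splits on any run of whitespace, including \n
words = content.split()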

Regular expressions

The re module

\W
Matches any character that is not a word character. This is the opposite of \w. For str patterns in Python 3, matching is Unicode-aware by default. If the ASCII flag is used, \W is equivalent to [^a-zA-Z0-9_]. If the LOCALE flag is used, it matches characters that are neither alphanumeric in the current locale nor the underscore.
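
As a small illustration of the substitution used above, the sketch below applies \W to a made-up sentence (the sample text is an assumption, not taken from the novel). The apostrophe is also a non-word character, so a contraction such as "didn't" ends up as two separate words, while the underscore is left alone.

```python
import re

# Sample sentence made up for illustration; it is not from the novel
sample = "Harry didn't move; he stared."

# Every non-word character, including the apostrophe, becomes a space
cleaned = re.sub(r'\W', ' ', sample)
words = cleaned.split()
print(words)

# The underscore counts as a word character, so it is not replaced
underscore_kept = re.sub(r'\W', ' ', 'snake_case')
print(underscore_kept)
```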

split() function

Python's split() slices a string at the given separator. If maxsplit is given, at most maxsplit splits are made, so the result has at most maxsplit+1 substrings.
str.split(sep=None, maxsplit=-1)
   sep - the separator. By default, any run of whitespace, including spaces, newlines (\n), tabs (\t), etc.
   maxsplit - the maximum number of splits. The default is -1, which means no limit

Example

s = "Line1-abcdef \nLine2-abc \nLine4-abcd"
print(s.split())        # split on whitespace, including \n
print(s.split(' ', 1))  # split on a space, at most once, giving two pieces

The output of the above example is as follows:

['Line1-abcdef', 'Line2-abc', 'Line4-abcd']
['Line1-abcdef', '\nLine2-abc \nLine4-abcd']

The following example uses the # sign as the separator and passes 1 as the second argument, so it returns a list of two elements.

txt = "Google#Runoob#Taobao#Facebook"
# The second argument is 1, so the result is a list of two elements
x = txt.split("#", 1)
print(x)

The output of the above example is as follows:

['Google', 'Runoob#Taobao#Facebook']


2. Word frequency statistics

The code is as follows (example):

# Count how many times each word occurs
# key --> count, stored in a dict
wordCounter = {}
for word in words:
    if word in wordCounter:
        wordCounter[word] += 1
    else:
        wordCounter[word] = 1
#print(wordCounter)

# Printed this way, the counts are not ordered by frequency, so they still need to be sorted with sorted()
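
For comparison, the standard library's collections.Counter can do the same counting in one call. This is an alternative to the loop above, not the approach this article uses, and the short words list below is a stand-in made up for illustration.

```python
from collections import Counter

# Stand-in word list; in the article, words comes from content.split()
words = ['the', 'boy', 'who', 'lived', 'the', 'boy', 'the']

# Counter is a dict subclass that maps each word to its count
wordCounter = Counter(words)
the_count = wordCounter['the']
print(the_count)

# most_common(n) returns the n most frequent (word, count) pairs
top_two = wordCounter.most_common(2)
print(top_two)
```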


3. Sort the words

The code is as follows (example):

# sorted() sorts in ascending order by default; reverse=True reverses the order
sortedWordCounter = sorted(wordCounter.items(),key=lambda item: item[1],reverse=True)
#print(sortedWordCounter) # prints the word counts after sorting

Under the hood, a dict is a hash table, and a hash table cannot be sorted in place.
The pairs from wordCounter.items() are therefore passed to sorted(), which returns a list.
Each element of this list is a tuple with two elements: the first is the key (the word) and the second is the value (its count).
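
A minimal sketch of that conversion, using a three-word dict made up for illustration, shows the list of (key, value) tuples before and after sorting by count:

```python
# Small stand-in dict; the real wordCounter is built from the novel
wordCounter = {'wand': 2, 'the': 5, 'owl': 1}

# items() yields (key, value) pairs; list() materialises them as tuples
items = list(wordCounter.items())
print(items)

# Sort the tuples by their second element (the count), largest first
sortedWordCounter = sorted(items, key=lambda item: item[1], reverse=True)
print(sortedWordCounter)
```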


Explanation of lambda item: item[1]:
A lambda expression defines an anonymous function. The part before the colon is the parameter and the part after the colon is the return value.
It is equivalent to the following function:

def func(item):
    return item[1]

An anonymous function has the same effect as a normal function, except that it has no name.
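
To check that equivalence, the sketch below sorts the same made-up pairs three ways: with the lambda, with the named function func, and with operator.itemgetter(1) from the standard library, which is another common way to write the same key (it is not used elsewhere in this article).

```python
from operator import itemgetter

def func(item):
    # Named equivalent of: lambda item: item[1]
    return item[1]

# Stand-in (word, count) pairs made up for illustration
pairs = [('owl', 1), ('the', 5), ('wand', 2)]

by_lambda = sorted(pairs, key=lambda item: item[1], reverse=True)
by_func = sorted(pairs, key=func, reverse=True)
by_getter = sorted(pairs, key=itemgetter(1), reverse=True)

# All three keys extract the count, so the results are identical
print(by_lambda == by_func == by_getter)
```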


4. Print the result or write it to a file

The code is as follows (example):

# Method 1
print(sortedWordCounter[:20])
# Method 2
for item in sortedWordCounter[:20]:
    print(item)
# The result can also be written to a file
with open('countwords_result.csv', 'w') as fp:
    for (word, count) in sortedWordCounter:
        line = word + ',' + str(count) + '\n'
        fp.write(line)

The file countwords_result.csv is created automatically when the script runs.
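
As a variation, the standard csv module can write the rows and handle quoting. The sketch below is an assumption-laden alternative, not the article's code: it writes only the top 20 entries, adds a header row, uses a made-up stand-in list, and writes into a temporary directory so it does not touch any existing file.

```python
import csv
import os
import tempfile

# Stand-in for the sorted list produced in the previous step
sortedWordCounter = [('the', 5), ('wand', 2), ('owl', 1)]

# Write into a temporary directory (hypothetical location for this demo)
out_path = os.path.join(tempfile.mkdtemp(), 'countwords_result.csv')

# newline='' lets the csv module control line endings itself
with open(out_path, 'w', newline='') as fp:
    writer = csv.writer(fp)
    writer.writerow(['word', 'count'])        # header row, an addition
    writer.writerows(sortedWordCounter[:20])  # at most the top 20 pairs

# Read the file back to check what was written
with open(out_path, newline='') as fp:
    rows = list(csv.reader(fp))
print(rows)
```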

Writing files in Python

Writing to an existing file
To write to an existing file, you must pass a mode argument to the open() function:
   "a" - append - new content is added to the end of the file
   "w" - write - any existing content is overwritten
Note: the "w" mode overwrites the entire content of the file.

# Open "demofile2.txt" and append content to it:
f = open("demofile2.txt", "a")
f.write("Now the file has more content!")
f.close()

# After appending, open the file and read it:
f = open("demofile2.txt", "r")
print(f.read())
f.close()

# Open "demofile3.txt" and overwrite its content:
f = open("demofile3.txt", "w")
f.write("Woops! I have deleted the content!")
f.close()

# After writing, open the file and read it:
f = open("demofile3.txt", "r")
print(f.read())
f.close()

To create a new file in Python, call open() with one of the following modes:
   "x" - create - a new file is created; if the file already exists, an error is raised
   "a" - append - if the specified file does not exist, it is created
   "w" - write - if the specified file does not exist, it is created

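# Create a file named "myfile.txt":
f = open("myfile.txt", "x")
f.close()
# Result: a new, empty file has been created

# Create a new file if it does not exist (an existing file is emptied):
f = open("myfile.txt", "w")
f.close()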
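
The three modes can be compared in a single run. The sketch below works in a temporary directory (an assumption made so that it never touches real files) and shows that "x" raises FileExistsError on a second attempt, "a" keeps the old content, and "w" replaces it.

```python
import os
import tempfile

# Work in a fresh temporary directory so the file is known not to exist
path = os.path.join(tempfile.mkdtemp(), 'myfile.txt')

# "x" creates the file because it does not exist yet
with open(path, 'x') as f:
    f.write('first\n')

# A second "x" on the same path fails with FileExistsError
try:
    open(path, 'x')
    created_again = True
except FileExistsError:
    created_again = False

# "a" appends after the existing content
with open(path, 'a') as f:
    f.write('second\n')
with open(path) as f:
    after_append = f.read()

# "w" discards the existing content before writing
with open(path, 'w') as f:
    f.write('third\n')
with open(path) as f:
    after_write = f.read()

print(created_again, repr(after_append), repr(after_write))
```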


Origin blog.csdn.net/HG0724/article/details/112298542