First, the development environment
(1) Testing tool: PyCharm Community Edition 2019.2.2 x64
(2) Python version: 3.7.2
Second, program analysis
(1) Read the file into a buffer
def process_file(path):  # read the file into a buffer
    try:  # open the file
        f = open(path, 'r')  # path is the path of the file
    except IOError as s:
        print(s)
        return None
    try:  # read the file into the buffer
        bvffer = f.read()
    except:
        print('Read File Error!')
        return None
    f.close()
    return bvffer
(2) Process the buffer and count the frequency of each word
def process_buffer(bvffer):  # process the buffer; return the word frequencies in the dictionary word_freq
    if bvffer:
        word_freq = {}
        # process the buffer bvffer: count each word's frequency and store it in the dictionary word_freq
        # convert the text to lowercase and remove punctuation
        for ch in '“‘!;:,.?”':
            bvffer = bvffer.lower().replace(ch, " ")
        # strip() removes whitespace characters (including '\n', '\r', '\t'); split() splits the string on spaces
        words = bvffer.strip().split()
        for word in words:
            word_freq[word] = word_freq.get(word, 0) + 1
        return word_freq
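To sanity-check the counting logic, the same steps can be inlined into a standalone helper and run on a short string (an illustrative sketch, not part of the assignment; count_words is a hypothetical name):

```python
def count_words(text):
    """Mirror of process_buffer's logic: lowercase, strip punctuation, count."""
    text = text.lower()
    for ch in '“‘!;:,.?”':
        text = text.replace(ch, " ")
    freq = {}
    for word in text.strip().split():
        freq[word] = freq.get(word, 0) + 1
    return freq

print(count_words("The cat, the dog."))
# → {'the': 2, 'cat': 1, 'dog': 1}
```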
(3) Output the ten most frequent words
def output_result(word_freq):
    if word_freq:
        sorted_word_freq = sorted(word_freq.items(), key=lambda v: v[1], reverse=True)
        for item in sorted_word_freq[:10]:  # print the top 10 words
            print("Word: %s  Count: %d" % (item[0], item[1]))
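For comparison, the standard library's collections.Counter implements the same sort-and-slice pattern via most_common(); a minimal sketch (output_top10 is an illustrative name, not the original function):

```python
from collections import Counter

def output_top10(word_freq):
    # Counter.most_common(n) returns the n highest-count pairs, descending
    for word, count in Counter(word_freq).most_common(10):
        print("Word: %s  Count: %d" % (word, count))

output_top10({"the": 5, "cat": 2, "dog": 1})
```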
(4) The main block calls the functions defined above
if __name__ == "__main__":
    path = "Gone_with_the_wind.txt"
    bvffer = process_file(path)
    word_freq = process_buffer(bvffer)
    output_result(word_freq)
Third, screenshots of the program's results
1. Gone_with_the_wind.txt word frequency statistics
2. A_Tale_of_Two_Cities.txt word frequency statistics
Fourth, visualization
Fifth, performance analysis and improvement
(1) Performance Analysis Code
def main():  # wrap the word-frequency steps in a main() so they can be profiled
    path = "Gone_with_the_wind.txt"
    bvffer = process_file(path)
    word_freq = process_buffer(bvffer)
    output_result(word_freq)

if __name__ == "__main__":
    import cProfile
    import pstats
    cProfile.run("main()", filename="word_freq.out")
    # create a Stats object
    p = pstats.Stats('word_freq.out')
    # print the ten most-called functions
    # sort_stats(): sort; print_stats(): print the analysis, limited to the given number of rows
    p.sort_stats('calls').print_stats(10)
    # print the ten functions with the longest running time
    # strip_dirs(): remove irrelevant path information
    p.strip_dirs().sort_stats("cumulative", "name").print_stats(10)
    # the results above show that process_buffer() is the most time-consuming function
    # see which functions process_buffer() calls
    p.print_callees("process_buffer")
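The cProfile/pstats workflow above needs the novel's text file to run; the same pattern can be demonstrated self-contained on a toy function (busy() is an illustrative name):

```python
import cProfile
import io
import pstats

def busy():
    # a deliberately slow loop to give the profiler something to measure
    total = 0
    for i in range(100000):
        total += i
    return total

profiler = cProfile.Profile()
profiler.enable()
busy()
profiler.disable()

# collect the stats into a string instead of stdout
stream = io.StringIO()
stats = pstats.Stats(profiler, stream=stream)
stats.strip_dirs().sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

The report lists each function with its call count and cumulative time, which is how process_buffer() was identified as the hot spot.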
(2) The most time-consuming code
for word in words:
    word_freq[word] = word_freq.get(word, 0) + 1
(3) Improved code
Move bvffer.lower() out of the for loop.
The modified code:
bvffer = bvffer.lower()
for ch in '“‘!;:,.?”':
    bvffer = bvffer.replace(ch, " ")
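The gain comes from hoisting the loop-invariant lower() call out of the loop, so it runs once instead of once per punctuation character. A rough, self-contained timeit comparison (clean_slow/clean_fast are illustrative names, and actual timings vary by machine):

```python
import timeit

text = "The quick brown Fox, jumps; over the lazy Dog! " * 200

def clean_slow(buf):
    # original: lower() re-runs on every iteration of the loop
    for ch in '“‘!;:,.?”':
        buf = buf.lower().replace(ch, " ")
    return buf

def clean_fast(buf):
    # improved: lower() runs exactly once, before the loop
    buf = buf.lower()
    for ch in '“‘!;:,.?”':
        buf = buf.replace(ch, " ")
    return buf

# both versions must produce identical output
assert clean_slow(text) == clean_fast(text)

slow = timeit.timeit(lambda: clean_slow(text), number=200)
fast = timeit.timeit(lambda: clean_fast(text), number=200)
print("slow: %.4fs  fast: %.4fs" % (slow, fast))
```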
Performance analysis before the modification:
Performance analysis after the modification:
It can be seen that the modified version runs 0.247 s faster than the original.