Second assignment: word frequency statistics

First, the development environment

(1) Testing tool: PyCharm Community Edition 2019.2.2 x64

(2) Python version: 3.7.2

 

Second, program analysis

(1) Read the file into a buffer

def process_file(path):  # read the file into a buffer
    try:  # open the file
        f = open(path, 'r')  # path is the path of the file
    except IOError as s:
        print(s)
        return None
    try:  # read the file into the buffer
        bvffer = f.read()
    except:
        print('Read file error!')
        return None
    f.close()
    return bvffer
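As an aside, the same read can be written with a context manager, so the file is closed even if the read itself fails. A minimal sketch of that variant (not the code used in the rest of this post):

def process_file_with(path):  # hypothetical variant, same behaviour on success
    try:
        with open(path, 'r') as f:  # the file is closed automatically on exit
            return f.read()
    except IOError as s:
        print(s)
        return None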

(2) Process the buffer and count the frequency of each word

def process_buffer(bvffer):  # process the buffer and return a dictionary word_freq storing each word's frequency
    if bvffer:
        # process bvffer below: count the frequency of each word and store it in the dictionary word_freq
        word_freq = {}
        # convert the text to lowercase and remove English punctuation
        for ch in '“‘!;:,.?”':
            bvffer = bvffer.lower().replace(ch, " ")
        # strip() removes whitespace characters (including '\n', '\r', '\t'); split() splits the string on spaces
        words = bvffer.strip().split()
        for word in words:
            word_freq[word] = word_freq.get(word, 0) + 1
        return word_freq
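For comparison, the counting can also be done with collections.Counter from the standard library. A short sketch under the same lowercasing and punctuation-stripping steps as above (Counter behaves like a dict, so output_result below works on it unchanged):

from collections import Counter

def process_buffer_counter(bvffer):  # hypothetical alternative to process_buffer()
    if bvffer:
        bvffer = bvffer.lower()
        for ch in '“‘!;:,.?”':
            bvffer = bvffer.replace(ch, " ")
        return Counter(bvffer.strip().split())  # counts every word in one pass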

(3) Output the ten most frequent words

def output_result(word_freq):
    if word_freq:
        sorted_word_freq = sorted(word_freq.items(), key=lambda v: v[1], reverse=True)
        for item in sorted_word_freq[:10]:  # print the top 10 words
            print("Word: %s  Count: %d" % (item[0], item[1]))
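A quick sanity check with a hand-made dictionary (the sample data is made up) shows the output format:

sample = {"the": 3, "wind": 2, "gone": 1}
output_result(sample)
# Word: the  Count: 3
# Word: wind  Count: 2
# Word: gone  Count: 1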

(4) The main block calls the functions defined above

if __name__ == "__main__":
    path = "Gone_with_the_wind.txt"  
    bvffer = process_file(path)
    word_freq = process_buffer(bvffer)
    output_result(word_freq)

Third, screenshots of the program results

1. Word frequency statistics for Gone_with_the_wind.txt

2. Word frequency statistics for A_Tale_of_Two_Cities.txt

Fourth, performance analysis and improvement

1. Visualization:

(1) Performance Analysis Code

def main():  # wrap the word frequency run so it can be profiled as a whole
    path = "Gone_with_the_wind.txt"
    bvffer = process_file(path)
    word_freq = process_buffer(bvffer)
    output_result(word_freq)


if __name__ == "__main__":
    import cProfile
    import pstats
    cProfile.run("main()", filename="word_freq.out")
    # create a Stats object
    p = pstats.Stats('word_freq.out')
    # print the ten most frequently called functions
    # sort_stats(): sort the entries
    # print_stats(): print the profile report, limited to the given number of lines
    p.sort_stats('calls').print_stats(10)
    # print the ten functions with the longest running time
    # strip_dirs(): remove irrelevant path information
    p.strip_dirs().sort_stats("cumulative", "name").print_stats(10)
    # the output above shows that process_buffer() is the most time-consuming function
    # list the functions called by process_buffer()
    p.print_callees("process_buffer")
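Since Python 3.7, pstats also exposes the sort keys as the SortKey enum, which avoids typos in the string arguments. A small sketch reusing the same word_freq.out file:

import pstats
from pstats import SortKey

p = pstats.Stats('word_freq.out').strip_dirs()
p.sort_stats(SortKey.CUMULATIVE).print_stats(10)  # top 10 by cumulative time
p.print_callers('process_buffer')                 # which functions call process_buffer()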

 

(2) The most time-consuming code

for word in words:
    word_freq[word] = word_freq.get(word, 0) + 1
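One common way to trim this loop is collections.defaultdict, which removes the per-word .get() lookup. A sketch of that variant (a hypothetical helper, not part of the original program; whether it actually wins here should be checked with the same cProfile setup as above):

from collections import defaultdict

def count_words(words):  # hypothetical helper
    word_freq = defaultdict(int)  # missing keys default to 0
    for word in words:
        word_freq[word] += 1
    return word_freq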

(5) Improved Code

Move bvffer.lower() outside the for loop.

The code after the modification:

bvffer = bvffer.lower()
for ch in '“‘!;:,.?”':
    bvffer = bvffer.replace(ch, " ")

Profiling results before the modification:

Profiling results after the modification:

The modified version runs 0.247 s faster than the original.
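A rough way to reproduce this comparison without cProfile is timeit. A sketch that times only process_buffer() on an already-loaded buffer (the repetition count is an arbitrary choice):

import timeit

bvffer = process_file("Gone_with_the_wind.txt")
elapsed = timeit.timeit(lambda: process_buffer(bvffer), number=10)
print("process_buffer: %.3f s per run" % (elapsed / 10))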


Origin www.cnblogs.com/KKKK1/p/11586361.html