8.2 English word frequency statistics (project)

Table of contents

Level 1: Read the file

Level 2: Count the number of words

Level 3: Count the number of occurrences of words

Level 4: Count the occurrences of non-special words


Level 1: Read the file

The task of this level: write a small program that reads a file.

Problem Description

"Who Moved My Cheese?" is a fable by the American writer Spencer Johnson, first published in 1998. The book tells the story of four "characters", the two little mice "Sniff" and "Scurry" and the two little people "Hem" and "Haw", as they search for cheese.

import string


def read_file(file):
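    """Take a file name, read the file's contents into a string,
    keep only English letters and Western symbols (drop Chinese characters),
    convert everything to lowercase,
    replace all punctuation and symbols with spaces, and return the string."""
    ########## Begin ##########
    with open(file, 'r', encoding='utf-8') as f:
        txt = f.read()
    # Keep only characters with code points below 256, which drops the Chinese text.
    txt = ''.join(x for x in txt if ord(x) < 256).lower()
    # Replace every ASCII punctuation character, not just a handful, with a space.
    for ch in string.punctuation:
        txt = txt.replace(ch, ' ')
    return txt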
    
    ########## End ##########


if __name__ == '__main__':
    filename = 'Who Moved My Cheese.txt'  # file name
    content = read_file(filename)  # call the function; it returns the cleaned text as a string
    n = int(input())
    print(content[:n])
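
The cleaning step can also be done in a single pass with a translation table instead of one `replace` call per punctuation character. The following is a minimal sketch, not the exercise's reference solution; the helper name `normalize` and the sample string are made up for illustration.

```python
import string

# Translation table that maps every ASCII punctuation character to a space.
PUNCT_TO_SPACE = str.maketrans(string.punctuation, ' ' * len(string.punctuation))


def normalize(text):
    """Hypothetical helper: keep code points below 256, lowercase, blank out punctuation."""
    kept = ''.join(ch for ch in text if ord(ch) < 256)
    return kept.lower().translate(PUNCT_TO_SPACE)


# Made-up sample line, not taken from the book.
sample = 'Who Moved My Cheese? "Sniff" and "Scurry" 谁动了我的奶酪'
print(normalize(sample).split())
```

Because the table is built once, the text is scanned a single time however many punctuation characters there are.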

Level 2: Count the number of words

The task of this level: write a small program that counts the number of words.

import string


def count_of_words(txt):
    """Take a string with punctuation and symbols removed; count and return the number of words and the number of distinct words."""
    ########## Begin ##########
    txt = txt.split()
    counts = {}
    for i in txt:
        counts[i] = counts.get(i,0) + 1
    return len(txt),len(counts)
    


    ########## End ##########

def read_file(file):
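    """Take a file name, read the file's contents into a string,
    keep only English letters and Western symbols (drop Chinese characters),
    convert everything to lowercase,
    replace all punctuation and symbols with spaces, and return the string."""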
    with open(file, 'r', encoding='utf-8') as novel:
        txt = novel.read()
    english_only_txt = ''.join(x for x in txt if ord(x) < 256)
    english_only_txt = english_only_txt.lower()
    for character in string.punctuation:
        english_only_txt = english_only_txt.replace(character, ' ')
    return english_only_txt

if __name__ == '__main__':
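    filename = 'Who Moved My Cheese.txt'  # file name
    content = read_file(filename)  # call the function; it returns the cleaned text as a string
    amount_results = count_of_words(content)
    # The message below reads: "The text has {} words in total, {} of them distinct."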
    print('文章共有单词{}个,其中不重复单词{}个'.format(*amount_results))
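
If only the two totals are needed, a set gives the number of distinct words without building a frequency dictionary first. This is a small sketch under that assumption; `count_of_words_with_set` is a hypothetical name and the sample sentence is invented.

```python
def count_of_words_with_set(txt):
    """Hypothetical variant: return (total words, distinct words)."""
    words = txt.split()
    # A set keeps one copy of each word, so its size is the distinct count.
    return len(words), len(set(words))


sample = 'the mice ran and the little people ran'
total, distinct = count_of_words_with_set(sample)
print(total, distinct)  # prints: 8 6
```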
    

Level 3: Count the number of occurrences of words

Expected output:

  1. the 369
  2. he 337
  3. to 333
  4. and 312
  5. cheese 214
  6. it 187
  7. they 166
  8. of 158
  9. a 146
  10. had 142

import string


def word_frequency(txt):
    """Take a string with punctuation and symbols removed; count how often each word occurs.
    Returns a dict with words as keys and their counts as values."""
    ########## Begin ##########
    txt = txt.split()
    counts = {}
    for i in txt:
        counts[i] = counts.get(i,0) + 1
    return counts
    ########## End ##########


def top_ten_words(frequency, cnt):
    """Take a word-frequency dict and print the cnt most frequent words with their counts."""
    ########## Begin ##########
    dic = sorted(frequency.items(),key = lambda x: x[1], reverse = True)
    for i in dic[0:cnt]:
        print(*i)
    
    ########## End ##########

def read_file(file):
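    """Take a file name, read the file's contents into a string,
    keep only English letters and Western symbols (drop Chinese characters),
    convert everything to lowercase,
    replace all punctuation and symbols with spaces, and return the string."""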
    with open(file, 'r', encoding='utf-8') as novel:
        txt = novel.read()
    english_only_txt = ''.join(x for x in txt if ord(x) < 256)
    english_only_txt = english_only_txt.lower()
    for character in string.punctuation:
        english_only_txt = english_only_txt.replace(character, ' ')
    return english_only_txt

if __name__ == '__main__':
    filename = 'Who Moved My Cheese.txt'  # file name
    content = read_file(filename)  # call the function; it returns the cleaned text as a string
    frequency_result = word_frequency(content)  # count word frequencies
    n = int(input())
    top_ten_words(frequency_result, n)
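
The standard library's `collections.Counter` covers both the counting and the ranking in this level. The sketch below is an alternative, not the solution the exercise expects; `top_words` is a hypothetical helper and the sample text is invented.

```python
from collections import Counter


def top_words(txt, cnt):
    """Hypothetical helper: return the cnt most frequent words as (word, count) pairs."""
    # most_common sorts by count, highest first; ties keep first-seen order.
    return Counter(txt.split()).most_common(cnt)


sample = 'cheese maze cheese haw cheese maze'
for word, count in top_words(sample, 2):
    print(word, count)
# prints:
# cheese 3
# maze 2
```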
    

Level 4: Count the occurrences of non-special words

Test input: 8

Expected output:

  1. cheese 214
  2. haw 113
  3. what 105
  4. change 86
  5. hem 83
  6. new 70
  7. said 60
  8. maze 46

import string


def top_ten_words_no_excludes(frequency, cnt):
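    """Take a word-frequency dict, drop common articles, pronouns, linking verbs and conjunctions,
    then print the cnt most frequent remaining words with their counts.
    The words to exclude are: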
    excludes_words = ['a', 'an', 'the', 'i', 'he', 'she', 'his', 'my', 'we','or', 'is', 'was', 'do',
                      'and', 'at', 'to', 'of', 'it', 'on', 'that', 'her', 'c','in', 'you', 'had',
                      's', 'with', 'for', 't', 'but', 'as', 'not', 'they', 'be', 'were', 'so', 'our',
                      'all', 'would', 'if', 'him', 'from', 'no', 'me', 'could', 'when', 'there',
                      'them', 'about', 'this', 'their', 'up', 'been', 'by', 'out', 'did', 'have']
    """
    ########## Begin ##########
    excludes_words = ['a', 'an', 'the', 'i', 'he', 'she', 'his', 'my', 'we','or', 'is', 'was', 'do',
                      'and', 'at', 'to', 'of', 'it', 'on', 'that', 'her', 'c','in', 'you', 'had',
                      's', 'with', 'for', 't', 'but', 'as', 'not', 'they', 'be', 'were', 'so', 'our',
                      'all', 'would', 'if', 'him', 'from', 'no', 'me', 'could', 'when', 'there',
                      'them', 'about', 'this', 'their', 'up', 'been', 'by', 'out', 'did', 'have']
    for word in excludes_words:
        # pop with a default so a word missing from the text does not raise KeyError
        frequency.pop(word, None)
    dic = sorted(frequency.items(),key = lambda x: x[1], reverse = True)

    for i in dic[0:cnt]:
        print(*i)

    ########## End ##########


def read_file(file):
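    """Take a file name, read the file's contents into a string,
    keep only English letters and Western symbols (drop Chinese characters),
    convert everything to lowercase,
    replace all punctuation and symbols with spaces, and return the string."""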
    with open(file, 'r', encoding='utf-8') as novel:
        txt = novel.read()
    english_only_txt = ''.join(x for x in txt if ord(x) < 256)
    english_only_txt = english_only_txt.lower()
    for character in string.punctuation:
        english_only_txt = english_only_txt.replace(character, ' ')
    return english_only_txt

def word_frequency(txt):
    """Take a string with punctuation and symbols removed; count how often each word occurs.
    Returns a dict with words as keys and their counts as values."""
    frequency = dict()
    words_list = txt.split()
    for word in words_list:
        frequency[word] = frequency.get(word, 0) + 1
    return frequency



if __name__ == '__main__':
    filename = 'Who Moved My Cheese.txt'  # file name
    content = read_file(filename)  # call the function; it returns the cleaned text as a string
    frequency_result = word_frequency(content)  # count word frequencies
    n = int(input())
    top_ten_words_no_excludes(frequency_result, n)
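
Removing the excluded words with `pop` changes the dictionary the caller passed in. Where that side effect is unwanted, the words can be filtered into a new list instead. This is a minimal sketch of that idea; the name `top_words_excluding` and the sample counts are made up for illustration.

```python
def top_words_excluding(frequency, cnt, excludes):
    """Hypothetical variant: rank words without modifying the frequency dict."""
    excluded = set(excludes)  # set membership tests are fast
    kept = [(word, count) for word, count in frequency.items()
            if word not in excluded]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:cnt]


# Invented counts, not the real figures from the book.
frequency = {'the': 5, 'cheese': 4, 'he': 3, 'maze': 2}
print(top_words_excluding(frequency, 2, ['the', 'he', 'she']))
print(len(frequency))  # still 4: the original dict is untouched
```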


Origin blog.csdn.net/m0_70456205/article/details/130716386