8-4 Counting character appearances in "Romance of the Three Kingdoms" (unlisted version), Python

The Romance of the Three Kingdoms.docx
"Romance of the Three Kingdoms" is one of the Four Great Classical Novels of Chinese literature, and hundreds of named characters appear in it. Write a program that counts how many times each character's name appears in the text and prints the 20 characters with the most appearances.

Because the novel is provided as a .docx file, make sure the python-docx library is installed before running the program.

The steps below use Thonny as an example.

Open the package manager (Tools > Manage packages...).

Search for python-docx and install it. Packages are downloaded from PyPI, which is hosted overseas and can be slow; the Tsinghua mirror is a faster alternative.

Alternatively, install it from the command line:

pip install python-docx
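
Before running the full program, it can help to confirm that the library is actually importable. The following is a minimal sketch; the helper name `is_installed` is made up for this example. Note that the package is installed as `python-docx` but imported as `docx`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located (without importing it)."""
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    if is_installed("docx"):
        print("python-docx is available")
    else:
        print("python-docx is missing; run: pip install python-docx")
```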
Complete code

from collections import Counter
from docx import Document

def count_character_appearances(text):
    # Character name list, embedded directly in the code
    character_list = [
        '荀彧', '荀攸', '贾诩', '郭嘉', '程昱', '戏志才', '刘晔', '蒋济', '陈群', '华歆', 
        '钟繇', '满宠', '董昭', '王朗', '崔琰', '毛玠', '杜畿', '田畴', '王修', '杨修',
        '辛毗', '杨阜', '田豫', '王粲', '蒯越', '张继', '杜袭', '枣祗', '任峻', '陈矫',
        '郗虑', '桓玠', '丁仪', '丁廙', '司马朗', '韩暨', '韦康', '邴原', '赵俨', '娄圭',
        '贾逵', '陈琳', '司马懿', '张辽', '徐晃', '夏侯惇', '夏侯渊', '庞德', '张郃',
        '李典', '乐进', '典韦', '曹洪', '曹仁', '曹彰', '曹纯', '于禁', '许褚', '吕虔',
        '李通', '文聘', '臧霸', '郭淮', '钟会', '邓艾', '曹休', '张燕', '张绣', '朱灵',
        '路昭', '史涣', '韩浩', '王凌', '孙礼', '秦朗', '郑文', '夏侯尚', '毌丘俭',
        '诸葛诞', '孙乾', '简雍', '糜竺', '糜芳', '庞统', '法正', '许靖', '马良', '徐庶',
        '陈震', '杨仪', '费祎', '蒋琬', '孟优', '黄皓', '诸葛亮', '关羽', '张飞', '马超',
        '黄忠', '赵云', '魏延', '关平', '周仓', '关兴', '张苞', '陈到', '李严', '姜维',
        '廖化', '马谡', '马岱', '陈式', '雷铜', '吴兰', '王平', '任夔', '张翼', '马忠',
        '张南', '冯习', '傅佥', '关索', '陆逊', '张昭', '张紘', '鲁肃', '虞翻', '顾雍',
        '诸葛谨', '诸葛恪', '陆凯', '骆统', '周鲂', '周瑜', '吕蒙', '甘宁', '太史慈',
        '程普', '黄盖', '韩当', '周泰', '蒋钦', '丁奉', '徐盛', '陈武', '凌操', '凌统',
        '潘璋', '朱然', '孙桓', '马忠', '孙韶', '朱桓', '夏恂', '周平', '全琮', '于诠',
        '张角', '何进', '董卓', '袁绍', '吕布', '袁术', '刘表', '刘璋', '马腾', '张鲁',
        '韩遂', '公孙瓒', '韩馥', '刘岱', '王匡', '张邈', '孔伷', '陶谦', '鲍信', '桥瑁',
        '袁遗', '孔融', '张超', '张杨', '刘度', '赵范', '金旋', '韩玄', '黄巾军',
        '张宝', '张梁', '程远志', '邓茂', '马元义', '赵弘', '韩忠', '孙夏', '管亥',
        '何仪', '刘辟', '龚都', '裴元绍', '高升', '张闿', '韩暹', '李乐', '杨奉', '董承',
        '王子服', '李儒', '陈宫', '田丰', '沮授', '审配', '许攸', '郭图', '逢纪', '辛评',
        '荀谌', '辛毗', '陈登', '蒯良', '王累', '韩胤', '沮鹄', '杨弘', '阎象', '蒯越',
        '伍孚', '李傕', '郭汜', '颜良', '文丑', '潘凤', '俞涉', '武安国', '穆顺', '华雄',
        '牛辅', '张济', '樊稠', '胡轸', '胡车儿', '李肃', '高顺', '张任', '高览', '曹性',
        '闵纯', '纪灵', '马休', '马铁', '高览', '袁谭', '袁熙', '袁尚', '高干', '麴义',
        '吕翔', '吕旷', '韩猛', '淳于琼', '焦触', '张南', '马延', '雷薄', '张勋', '陈纪',
        '桥蕤', '郝萌', '侯成', '宋宪', '魏续', '成廉', '蔡瑁', '张允', '黄祖', '苏飞',
        '吕公', '侯选', '程银', '李堪', '张横', '梁兴', '成宜', '马玩', '杨秋', '张让',
        '赵忠', '封谞', '段珪', '曹节', '侯览', '蹇硕', '程旷', '夏恽', '郭胜', '吕伯奢',
        '普净', '华佗', '于吉', '左慈', '吉平'
    ]

    # Count how many times each name appears in the text
    character_counts = Counter()
    for character in character_list:
        count = text.count(character)
        character_counts[character] = count
    return character_counts

def main():
    docx_file = "三国演义.docx"

    # Read the text of every paragraph in the .docx file
    doc = Document(docx_file)
    text = " ".join(paragraph.text for paragraph in doc.paragraphs)

    character_counts = count_character_appearances(text)

    # Take the 20 characters with the most appearances
    most_common_characters = character_counts.most_common(20)

    # Print the results
    print("出场最多的前20个人物:")
    for character, count in most_common_characters:
        print(f"{character}:{count}次")

if __name__ == "__main__":
    main()
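
To see how the counting step behaves without opening the .docx file, the same approach can be tried on a short string. This is only a sketch: the sample sentence, the short name list, and the function name `count_names` are invented for illustration. `str.count` counts non-overlapping substring matches, and `dict.fromkeys` drops repeated names while keeping their order (the list in the program above repeats a few names, such as '马忠' and '高览', which is harmless there because the same count is simply assigned twice).

```python
from collections import Counter


def count_names(text, names):
    """Count non-overlapping substring occurrences of each name in text."""
    unique_names = list(dict.fromkeys(names))  # drop repeats, keep order
    return Counter({name: text.count(name) for name in unique_names})


# Made-up sample text and name list, for illustration only.
sample_text = "关羽与张飞同行。张飞大怒,关羽劝之。赵云后至,关羽迎之。"
sample_names = ["关羽", "张飞", "赵云", "关羽"]

counts = count_names(sample_text, sample_names)
for name, count in counts.most_common(2):
    print(f"{name}:{count}次")
# Prints:
# 关羽:3次
# 张飞:2次
```

Because this is plain substring matching, a name that happens to be part of another name or of an ordinary word is counted as well, so the totals are best read as approximate.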

P.S. These counts are not fully accurate, because the novel refers to the same person by several names and titles. For example, Cao Cao (曹操) also appears under his courtesy name Mengde (孟德), so an accurate count needs a more detailed name list that includes each character's aliases.
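
One possible way to handle aliases, sketched here under stated assumptions, is to map each canonical name to all the names it goes by and add up the individual counts. The alias table below is a tiny hand-written sample (three characters with their courtesy names), not a complete list, and the function name `count_with_aliases` is made up for this example. It also assumes that no alias is contained in another alias of the same character, since that would count the same passage twice.

```python
from collections import Counter

# Sample alias table for illustration only; a real table would have to
# cover the whole cast and be checked against the text.
ALIASES = {
    "曹操": ["曹操", "孟德"],
    "诸葛亮": ["诸葛亮", "孔明"],
    "关羽": ["关羽", "云长"],
}


def count_with_aliases(text, aliases):
    """Sum the substring counts of every alias under its canonical name."""
    counts = Counter()
    for canonical, names in aliases.items():
        counts[canonical] = sum(text.count(name) for name in names)
    return counts


# Made-up sample sentence, not a quotation from the novel.
sample = "曹操字孟德。孔明笑曰:云长必不负我。孟德大惊。"
result = count_with_aliases(sample, ALIASES)
for name, count in result.most_common():
    print(f"{name}:{count}次")
```

Titles shared by several people (丞相, for instance, is used for more than one character) cannot be resolved by a lookup table like this and would need context to attribute correctly.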

Origin blog.csdn.net/c3872931/article/details/131819772