Python daily exercise: counting word frequencies in a document

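Problem: You have a diary entry. To sidestep word-segmentation issues, assume the text is entirely in English. Count how often each word appears in it.

Requirements: 1. Output the frequency of each word as a dictionary. 2. Keep the algorithm as concise as possible.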

Method 1: Use re.findall to list every word, then use Counter to count them.

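import re
from collections import Counter

b = []  # every word found in the file, in order
with open(r'c:\hello.log', 'r') as f:
    for i in f:
        # Find the words: each word is a run of letters or digits.
        # re.findall returns all matches as a list, so no further conversion is needed.
        s = re.findall(r'[a-zA-Z0-9]+', i.lower())
        b.extend(s)
    print(b)
    print(Counter(b))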

Output:

['everyone', 'has', 'their', 'own', 'dreams', 'i', 'am', 'the', 'same', 'but', 'my', 'dream', 'is', 'not', 'a', 'lawyer', 'not', 'a', 'doctor', 'not', 'actors', 'not', 'even', 'an', 'industry', 'perhaps', 'my', 'dream', 'big', 'people', 'will', 'find', 'it', 'ridiculous', 'but', 'this', 'has', 'been', 'my', 'pursuit', 'my', 'dream', 'is', 'to', 'want', 'to', 'have', 'a', 'folk', 'life', 'i', 'want', 'it', 'to', 'become', 'a', 'beautiful', 'painting', 'it', 'is', 'not', 'only', 'sharp', 'colors', 'but', 'also', 'the', 'colors', 'are', 'bleak', 'i', 'do', 'not', 'rule', 'out', 'the', 'painting', 'is', 'part', 'of', 'the', 'black', 'but', 'i', 'will', 'treasure', 'these', 'bleak', 'colors', 'not', 'yet', 'how', 'about', 'a', 'colorful', 'painting', 'if', 'not', 'bleak', 'add', 'color', 'how', 'can', 'it', 'more', 'prominent', 'american', 'life', 'is', 'like', 'painting', 'ainting', 'the', 'bright', 'red', 'color', 'represents', 'life', 'beautiful', 'happy', 'moments', 'painting', 'a', 'bleak', 'color', 'represents', 'life', 'difficult', 'unpleasant', 'time', 'you', 'may', 'find', 'a', 'flat', 'with', 'a', 'beautiful', 'road', 'is', 'not', 'very', 'good', 'yet', 'but', 'i', 'do', 'not', 'think', 'it', 'will', 'if', 'a', 'person', 'lives', 'flat', 'then', 'what', 'is', 'the', 'point', 'life', 'is', 'only', 'a', 'short', 'few', 'decades', 'i', 'want', 'it', 'to', 'go', 'finally', 'each', 'memory', 'is', 'a', 'solid']
Counter({'a': 11, 'not': 10, 'is': 9, 'i': 6, 'the': 6, 'it': 6, 'but': 5, 'life': 5, 'painting': 5, 'my': 4, 'to': 4, 'bleak': 4, 'dream': 3, 'will': 3, 'want': 3, 'beautiful': 3, 'colors': 3, 'color': 3, 'has': 2, 'find': 2, 'only': 2, 'do': 2, 'yet': 2, 'how': 2, 'if': 2, 'represents': 2, 'flat': 2, 'everyone': 1, 'their': 1, 'own': 1, 'dreams': 1, 'am': 1, 'same': 1, 'lawyer': 1, 'doctor': 1, 'actors': 1, 'even': 1, 'an': 1, 'industry': 1, 'perhaps': 1, 'big': 1, 'people': 1, 'ridiculous': 1, 'this': 1, 'been': 1, 'pursuit': 1, 'have': 1, 'folk': 1, 'become': 1, 'sharp': 1, 'also': 1, 'are': 1, 'rule': 1, 'out': 1, 'part': 1, 'of': 1, 'black': 1, 'treasure': 1, 'these': 1, 'about': 1, 'colorful': 1, 'add': 1, 'can': 1, 'more': 1, 'prominent': 1, 'american': 1, 'like': 1, 'ainting': 1, 'bright': 1, 'red': 1, 'happy': 1, 'moments': 1, 'difficult': 1, 'unpleasant': 1, 'time': 1, 'you': 1, 'may': 1, 'with': 1, 'road': 1, 'very': 1, 'good': 1, 'think': 1, 'person': 1, 'lives': 1, 'then': 1, 'what': 1, 'point': 1, 'short': 1, 'few': 1, 'decades': 1, 'go': 1, 'finally': 1, 'each': 1, 'memory': 1, 'solid': 1})
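
The intermediate list is not strictly needed: Counter can be updated one line at a time. The following is a minimal sketch of that variation, not the original author's code. The function name `count_words` and the sample text are assumptions made for illustration, and an in-memory string stands in for c:\hello.log. Because the task asks for a dictionary, the result is also converted with dict().

```python
import re
from collections import Counter

def count_words(lines):
    """Count words (runs of ASCII letters or digits) across an iterable of lines."""
    counts = Counter()
    for line in lines:
        # update() adds this line's occurrences to the running totals
        counts.update(re.findall(r'[a-zA-Z0-9]+', line.lower()))
    return counts

# Hypothetical sample text standing in for the diary file
sample = "My dream is not a lawyer,\nnot a doctor. My dream is big!"
counts = count_words(sample.splitlines())

print(dict(counts))           # plain dict, as the task requires
print(counts.most_common(2))  # the two most frequent words
```

Counter is already a subclass of dict, so the dict() call matters only when a plain dictionary is explicitly required.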

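Method 2: Use re.sub to replace every run of non-word characters with a space, call strip to remove the leading and trailing spaces, and call split to turn the line into a list of words. Then count them.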

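import re
from collections import Counter

b = []  # every word found in the file, in order
with open(r'c:\hello.log', 'r') as f:
    for i in f:
        # Replace each run of characters that are not letters or digits with one space
        s = re.sub(r'[^a-zA-Z0-9]+', ' ', i.lower())
        print(s)
        # Trim the ends, then split on whitespace to get the words
        b.extend(s.strip().split())
    print(Counter(b))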

Output:

everyone has their own dreams 
i am the same but my dream is not a lawyer 
not a doctor not actors not even an industry 
 perhaps my dream big people will find it ridiculous 
but this has been my pursuit my dream is to want to have a folk life 
i want it to become a beautiful painting it is not only sharp colors 
 but also the colors are bleak i do not rule out the painting is part of the black 
but i will treasure these bleak colors not yet how about a colorful painting 
if not bleak add color how can it more prominent american life is like painting 
ainting the bright red color represents life beautiful happy moments painting a bleak 
color represents life difficult unpleasant time you may find a flat with a beautiful 
road is not very good yet but i do not think it will if a person lives flat then what 
 is the point life is only a short few decades i want it to go finally each memory is 
 a solid 
Counter({'a': 11, 'not': 10, 'is': 9, 'i': 6, 'the': 6, 'it': 6, 'but': 5, 'life': 5, 'painting': 5, 'my': 4, 'to': 4, 'bleak': 4, 'dream': 3, 'will': 3, 'want': 3, 'beautiful': 3, 'colors': 3, 'color': 3, 'has': 2, 'find': 2, 'only': 2, 'do': 2, 'yet': 2, 'how': 2, 'if': 2, 'represents': 2, 'flat': 2, 'everyone': 1, 'their': 1, 'own': 1, 'dreams': 1, 'am': 1, 'same': 1, 'lawyer': 1, 'doctor': 1, 'actors': 1, 'even': 1, 'an': 1, 'industry': 1, 'perhaps': 1, 'big': 1, 'people': 1, 'ridiculous': 1, 'this': 1, 'been': 1, 'pursuit': 1, 'have': 1, 'folk': 1, 'become': 1, 'sharp': 1, 'also': 1, 'are': 1, 'rule': 1, 'out': 1, 'part': 1, 'of': 1, 'black': 1, 'treasure': 1, 'these': 1, 'about': 1, 'colorful': 1, 'add': 1, 'can': 1, 'more': 1, 'prominent': 1, 'american': 1, 'like': 1, 'ainting': 1, 'bright': 1, 'red': 1, 'happy': 1, 'moments': 1, 'difficult': 1, 'unpleasant': 1, 'time': 1, 'you': 1, 'may': 1, 'with': 1, 'road': 1, 'very': 1, 'good': 1, 'think': 1, 'person': 1, 'lives': 1, 'then': 1, 'what': 1, 'point': 1, 'short': 1, 'few': 1, 'decades': 1, 'go': 1, 'finally': 1, 'each': 1, 'memory': 1, 'solid': 1})
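
Both methods tokenize the text the same way, which is why the two Counter results above are identical. Here is a quick check of that, written as a sketch: the helper names `words_findall` and `words_sub` and the sample line are made up for illustration. It also shows that split() with no argument already ignores leading and trailing whitespace, so the strip() call in Method 2 is harmless but not required.

```python
import re
from collections import Counter

def words_findall(line):
    # Method 1: collect every run of letters or digits
    return re.findall(r'[a-zA-Z0-9]+', line.lower())

def words_sub(line):
    # Method 2: turn every other run of characters into a space, then split
    return re.sub(r'[^a-zA-Z0-9]+', ' ', line.lower()).split()

# Hypothetical sample line with leading spaces and punctuation
line = "  Perhaps my dream big, people will find it ridiculous.\n"

assert words_findall(line) == words_sub(line)
assert Counter(words_findall(line)) == Counter(words_sub(line))
# split() without arguments drops leading/trailing whitespace on its own
assert ' a solid '.split() == ' a solid '.strip().split() == ['a', 'solid']
print(words_sub(line))
```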


Reposted from www.cnblogs.com/xuehaiwuya0000/p/10132444.html