python词频统计_英文

代码

大家都在写中文的词频统计,我接触了python都有好几年了,还写英文的,真的是,就。直接贴个代码吧。

text = """
British newspapers are much smaller than they used to be and their readers are often in a hurry , so newspapermen write as few words as possible . 
They tell their readers at once what happened , where , when and how it happened and what was the result : how many people were killed , what change was done and so on . 
Readers want the fact set out as fully and accurately as possible . 
Readers are also interested in the people who have seen the accident . 
So a newspaperman always likes to get some information from someone who was there , which can be given in the person’s own words . 
Because he can use only a few words , the newspaperman must choose those words carefully , every one must be effective . 
Instead of “ he called out in a loud voice ” , he writes ” he shouted ” ; instead of “the loose stones rolled noisily down the side of the mountain ” , he will write ” they thundered down the mountainside ” . 
Because many of the readers are not very clever, and most of them are in a hurry.
"""

def getTxt(txt): #对文本预处理(包括)
    txt = txt.lower()#将所有的单词全部转化成小写
    for ch in ",,,.!、!@#$%^'”“;'’": #将所有除了单词以外的符号换成空格
        txt=txt.replace(ch, ' ')
    return txt

txtArr = getTxt(text).split()
counts = {}
for word in txtArr:
    counts[word] = counts.get(word, 0) + 1
countsList = list(counts.items())
countsList.sort(key=lambda x:x[1], reverse=True)
for i in range(20):
    word, count = countsList[i]
    print('{0:<10}{1:>10}'.format(word,count))

代码解说

  • 在百度找了一篇英语阅读,作为text统计词频。
  • str.lower(),将所有的单词全部转化成小写然后返回转化结果,原str不变
  • str.replace(‘a’, ‘b’),将str中的所有的a字符换成b字符并返回换后结果,原str不变
  • str.split(),split()不带参数默认为以所有的空字符,包括空格、换行(\n)、制表符(\t)等为分隔符分割str,并返回分割结果(list)
  • dic.get(“a”,val),在字典dic中取出键为a对应的值,如果字典中不存在键为a的键值对,则返回val
  • list.sort( key=None, reverse=False)
    key – 主要是用来进行比较的元素,只有一个参数,具体的函数的参数就是取自于可迭代对象中,指定可迭代对象中的一个元素来进行排序。
    reverse – 排序规则,reverse = True 降序, reverse = False 升序(默认)。
    文中用了lambda表达式,lambda是声明符,后面跟参数,前面是参数,冒号后面的表达式是lambda的处理结果,这个表达式中,参数是x,处理结果是x[1]。sort中key参数会给后面的表达式赋值一个list中的元素。如:list为[('a':5),('b':3)],执行sort时会分别把('a':5)('b':3)赋值给key后面的lambda表达式,也就是x参数会接受到这两个值。
    countsList.sort(key=lambda x:x[1], reverse=True)
    #等同与
    def takeSecond(elem):
    	return elem[1]
    countsList.sort(key=takeSecond, reverse=True)
    
  • print在python3 中已经被函数化了,python2中可以print a,python3 中必须print(a).
  • 在python3中可以help(print), (注意,在python2中是不能help(print)的,因为其不是一个函数)
  • print('{0:<10}{1:>10}'.format(word,count))参数括号里第一个大括号的0表示这个大括号是给format中第一个参数word占位的,:后<号表示这一列左对齐,10表示这一列长度为10。第二个大括号里的1表示这个大括号是给format中第二个参数count占位的,:后的>表示这一列右对齐,1010表示这一列长度为10。只有单位的话,有人弄清楚可以跟我讲。。。

运行结果

没截完整,看看就行

  • 下一个做。。。中文的分词和词云图吧,看着好像挺好玩的。
发布了4 篇原创文章 · 获赞 0 · 访问量 38

猜你喜欢

转载自blog.csdn.net/weixin_44385465/article/details/104278874
今日推荐