Code
Everyone seems to be writing Chinese word-frequency counters. I have been using Python for several years now, so here is one for English text. The code is attached directly below.
text = """
British newspapers are much smaller than they used to be and their readers are often in a hurry , so newspapermen write as few words as possible .
They tell their readers at once what happened , where , when and how it happened and what was the result : how many people were killed , what change was done and so on .
Readers want the fact set out as fully and accurately as possible .
Readers are also interested in the people who have seen the accident .
So a newspaperman always likes to get some information from someone who was there , which can be given in the person’s own words .
Because he can use only a few words , the newspaperman must choose those words carefully , every one must be effective .
Instead of “ he called out in a loud voice ” , he writes ” he shouted ” ; instead of “the loose stones rolled noisily down the side of the mountain ” , he will write ” they thundered down the mountainside ” .
Because many of the readers are not very clever, and most of them are in a hurry.
"""
def getTxt(txt):  # preprocess the text (lowercase it and strip punctuation)
    txt = txt.lower()  # convert every word to lowercase
    for ch in ",.:!?、@#$%^;'“”‘’":  # replace every non-word symbol with a space
        txt = txt.replace(ch, ' ')
    return txt

txtArr = getTxt(text).split()
counts = {}
for word in txtArr:
    counts[word] = counts.get(word, 0) + 1
countsList = list(counts.items())
countsList.sort(key=lambda x: x[1], reverse=True)
for word, count in countsList[:20]:  # slicing avoids an IndexError if there are fewer than 20 distinct words
    print('{0:<10}{1:>10}'.format(word, count))
Code Description
- I searched Baidu for an English reading passage and used it as the text for the word-frequency statistics.
- str.lower() converts every letter to lowercase and returns the converted result; the original str is unchanged.
- str.replace('a', 'b') replaces every occurrence of the character a in str with the character b and returns the result; the original str is unchanged.
- str.split() with no arguments uses any whitespace, including spaces, newlines (\n), and tabs (\t), as the delimiter, splits str on it, and returns the pieces as a list.
- dic.get("a", val) returns the value stored under the key a in the dictionary dic; if dic has no entry for the key a, it returns val.
- list.sort(key=None, reverse=False)
  key: a function that takes one argument. It is called on each element of the list, and the elements are sorted by the values it returns.
  reverse: the sort order. reverse=True sorts in descending order; reverse=False sorts in ascending order (the default).
  This post uses a lambda expression. lambda is the keyword, the parameter comes before the colon, and the expression after the colon is the result. Here the parameter is x and the result is x[1]. sort evaluates the key expression once for each element of the list. For example, with the list [('a', 5), ('b', 3)], sort passes ('a', 5) and ('b', 3) in turn to the lambda given as key, so the parameter x receives each of these two values.
  countsList.sort(key=lambda x:x[1], reverse=True)
  # equivalent to
  def takeSecond(elem):
      return elem[1]
  countsList.sort(key=takeSecond, reverse=True)
- In Python 3, print is a function. Python 2 allowed print a, but Python 3 requires print(a).
- In Python 3 you can call help(print). (Note: help(print) does not work in Python 2, because print is not a function there.)
- print('{0:<10}{1:>10}'.format(word,count)): in the first pair of braces, 0 makes it the placeholder for the first argument, word; the < after the colon left-aligns that column, and 10 is the column width. In the second pair of braces, 1 makes it the placeholder for the second argument, count; the > after the colon right-aligns that column, and 10 is again the column width. As for the unit of that width, I have not figured it out; if anyone knows, please tell me...
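On the question of units: in Python's format specification the number after < or > is a minimum field width counted in characters, not pixels or bytes. Below is a minimal sketch that checks this; the sample values ("the", 12, "mountainside") are made up for illustration and are not output from the script above.

```python
# Hypothetical sample row; any word/count pair would do.
word, count = "the", 12

row = '{0:<10}{1:>10}'.format(word, count)
print(repr(row))   # repr makes the padding spaces visible
print(len(row))    # two fields of 10 characters each: 20

# The width is only a minimum: a longer value is not truncated,
# it simply makes that field (and the whole row) wider.
long_row = '{0:<10}{1:>10}'.format("mountainside", 1)
print(len(long_row))   # 12 characters for the word + 10 for the count: 22
```

Because the width is a minimum rather than a fixed size, columns only line up when every word fits within 10 characters; a wider field such as {0:<15} would be one way to leave more room.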
Output
- Next up: word segmentation and a Chinese word cloud. It looks like fun.