Python Natural Language Processing: Chapter 1 Function Summary

The examples in this chapter require importing the NLTK packages first:

>>> import nltk
>>> from nltk.book import *
*** Introductory Examples for the NLTK Book ***
Loading text1, ..., text9 and sent1, ..., sent9
Type the name of the text or sentence to view it.
Type: 'texts()' or 'sents()' to list the materials.
text1: Moby Dick by Herman Melville 1851
text2: Sense and Sensibility by Jane Austen 1811
text3: The Book of Genesis
text4: Inaugural Address Corpus
text5: Chat Corpus
text6: Monty Python and the Holy Grail
text7: Wall Street Journal
text8: Personals Corpus
text9: The Man Who Was Thursday by G . K . Chesterton 1908

concordance(): show every occurrence of a given word together with its surrounding context

>>> text1.concordance('monstrous')
Displaying 11 of 11 matches:
ong the former , one was of a most monstrous size . ... This came towards us , 
ON OF THE PSALMS . " Touching that monstrous bulk of the whale or ork we have r
ll over with a heathenish array of monstrous clubs and spears . Some were thick
d as you gazed , and wondered what monstrous cannibal and savage could ever hav
that has survived the flood ; most monstrous and most mountainous ! That Himmal

similar(): show words that appear in the same contexts as the given word (i.e., words used similarly in the text)

>>> text1.similar('monstrous')
true contemptible christian abundant few part mean careful puzzled
mystifying passing curious loving wise doleful gamesome singular
delightfully perilous fearless

common_contexts(): examine the contexts shared by two or more words

>>> text2.common_contexts(['monstrous','very'])
a_pretty am_glad a_lucky is_pretty be_glad

dispersion_plot(): draw a dispersion plot showing where each word occurs across the text (requires matplotlib)

>>> text4.dispersion_plot(['citizens','democracy','freedom','duties','America'])



generate(): produces random text in the style of the corpus; the method was removed in NLTK 3.0, so it fails here (later NLTK releases restored it)

sorted(): return a new list sorted in ascending (lexicographic) order
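A quick illustration of sorted() on a small word list (plain Python, no NLTK needed):

```python
words = ['said', 'done', 'more', 'is', 'than']

# sorted() returns a new list in ascending (lexicographic) order;
# the original list is left unchanged.
print(sorted(words))  # ['done', 'is', 'more', 'said', 'than']
print(words)          # ['said', 'done', 'more', 'is', 'than']
```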

set(): get the set of unique elements (duplicates are removed)
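Combining set() with sorted() gives the vocabulary of a token list, a pattern the chapter uses repeatedly:

```python
sent = ['the', 'cat', 'sat', 'on', 'the', 'mat']

# set() discards duplicates; sorting the result gives a stable view,
# since sets themselves are unordered.
vocab = set(sent)
print(len(sent), len(vocab))  # 6 tokens, 5 unique types
print(sorted(vocab))          # ['cat', 'mat', 'on', 'sat', 'the']
```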


' '.join(list): return a single string made by concatenating the strings in list, separated by the string that join is called on

>>> ' '.join(['Monty','Python'])
'Monty Python'
>>> '&'.join(['Monty','Python'])
'Monty&Python'

FreqDist(): count how often each token occurs in the text; the result behaves like a dictionary whose keys are the tokens and whose values are their counts

>>> FreqDist(text1)
FreqDist({',': 18713, 'the': 13721, '.': 6862, 'of': 6536, 'and': 6024, 'a': 4569, 'to': 4542, ';': 4072, 'in': 3916, 'that': 2982, ...})
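In NLTK 3, FreqDist subclasses collections.Counter, so the usual dictionary-style lookups work on it. A minimal sketch using Counter directly (standing in for FreqDist so the example runs without NLTK):

```python
from collections import Counter

# Counter stands in for nltk.FreqDist here; both support
# per-token lookup and most_common().
tokens = ['the', 'cat', 'sat', 'on', 'the', 'mat']
fdist = Counter(tokens)
print(fdist['the'])          # count of a single token: 2
print(fdist.most_common(1))  # most frequent (token, count) pair: [('the', 2)]
```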

bigrams(): produce the pairs of adjacent words (bigrams) in a sequence; note that it returns a generator, so wrap it in list() to see the pairs

>>> bigrams(['more','is','said','than','done'])
<generator object bigrams at 0x000001B848A13620>
>>> list(bigrams(['more','is','said','than','done']))
[('more', 'is'), ('is', 'said'), ('said', 'than'), ('than', 'done')]

collocations(): find frequently occurring bigrams (collocations) in the text

>>> text4.collocations()
United States; fellow citizens; four years; years ago; Federal
Government; General Government; American people; Vice President; Old
World; Almighty God; Fellow citizens; Chief Magistrate; Chief Justice;
God bless; every citizen; Indian tribes; public debt; one another;
foreign nations; political parties

dict.keys(): get the dictionary's keys

dict.items(): view the dictionary as a sequence of (key, value) tuples

>>> fdist = FreqDist([len(w) for w in text1])
>>> fdist
FreqDist({3: 50223, 1: 47933, 4: 42345, 2: 38513, 5: 26597, 6: 17111, 7: 14399, 8: 9966, 9: 6428, 10: 3528, ...})
>>> fdist.keys()
dict_keys([1, 4, 2, 6, 8, 9, 11, 5, 7, 3, 10, 12, 13, 14, 16, 15, 17, 18, 20])
>>> fdist.items()
dict_items([(1, 47933), (4, 42345), (2, 38513), (6, 17111), (8, 9966), (9, 6428), (11, 1873), (5, 26597), (7, 14399), (3, 50223), (10, 3528), (12, 1053), (13, 567), (14, 177), (16, 22), (15, 70), (17, 12), (18, 1), (20, 1)])

fdist.freq(): return the relative frequency of the given sample (its count divided by the total number of samples)

>>> fdist.freq(3)
0.19255882431878046
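freq(x) is simply count(x) / N, where N is the total number of samples. A minimal sketch with collections.Counter standing in for FreqDist:

```python
from collections import Counter

# freq(x) = count(x) / N; Counter stands in for nltk.FreqDist.
lengths = [3, 1, 4, 1, 5, 3, 3]
fdist = Counter(lengths)
total = sum(fdist.values())
print(fdist[3] / total)  # relative frequency of length 3: 3/7
```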

Reposted from blog.csdn.net/qq_36303924/article/details/80928281