NLTK (Conditional Frequency Distributions)

This series of posts is my study notes on the book Natural Language Processing with Python.
Section 2.2, p. 55

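In earlier posts we looked at several corpora and saw that the Brown Corpus is organized by genre. We also met the frequency distribution object FreqDist: given a list of words mylist, FreqDist(mylist) counts how many times each item occurs in the list. This post covers the conditional frequency distribution object, ConditionalFreqDist. A conditional frequency distribution is a collection of frequency distributions, each one for a different condition; the condition is usually the category of the text.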

ConditionalFreqDist

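import nltk
from nltk.corpus import brown

# Build (genre, word) pairs for every genre in the Brown Corpus
pairs = [(genre, word) for genre in brown.categories() for word in brown.words(categories=genre)]

# Restrict to two genres to keep the example small
pairs = [(genre, word) for genre in ['news', 'romance'] for word in brown.words(categories=genre)]

cfd = nltk.ConditionalFreqDist(pairs)

# Sorted so that the printed order does not depend on the Python version
print(sorted(cfd.conditions()))
['news', 'romance']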

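A conditional frequency distribution works on a list of pairs, each of the form (condition, event). In this example the condition is the genre and the event is the word.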
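To make the (condition, event) idea concrete, here is a minimal plain-Python sketch of the same bookkeeping. It does not use NLTK: it assumes that one `collections.Counter` per condition is a fair stand-in for what ConditionalFreqDist keeps internally, and the toy pairs are made up for illustration.

```python
from collections import Counter, defaultdict

# Toy (condition, event) pairs; the words are invented, not taken from Brown.
toy_pairs = [
    ("news", "the"), ("news", "court"), ("news", "the"),
    ("romance", "the"), ("romance", "love"), ("romance", "love"),
]

# One Counter per condition, created the first time the condition is seen.
cond_counts = defaultdict(Counter)
for condition, event in toy_pairs:
    cond_counts[condition][event] += 1

print(sorted(cond_counts))           # ['news', 'romance']
print(cond_counts["news"]["the"])    # 2
```

Each value in `cond_counts` plays the role of one frequency distribution, and the dictionary keys play the role of the conditions.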

ConditionalFreqDist::conditions(): returns the list of conditions.

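from nltk.corpus import inaugural

# Condition: the target prefix; event: the year, taken from the
# first four characters of each inaugural address file ID
cfd = nltk.ConditionalFreqDist(
    (target, fileid[:4])
    for fileid in inaugural.fileids()
    for w in inaugural.words(fileid)
    for target in ['america', 'citizen']
    if w.lower().startswith(target))

cfd.plot()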

Figure 2-1
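The filtering logic in that generator expression can be traced on a tiny example. The sketch below is plain Python with two hypothetical "speeches"; the file names only imitate the year-first naming that `fileid[:4]` relies on, and none of the text comes from the inaugural corpus.

```python
from collections import Counter, defaultdict

# Hypothetical miniature documents keyed by file names that start with a year.
speeches = {
    "1789-First.txt": ["Fellow", "Citizens", "of", "America"],
    "1793-Second.txt": ["American", "citizens", "and", "citizenship"],
}
targets = ["america", "citizen"]

year_counts = defaultdict(Counter)
for fileid, words in speeches.items():
    for w in words:
        for target in targets:
            # Lowercase first, so "Citizens" and "citizenship" both match "citizen".
            if w.lower().startswith(target):
                year_counts[target][fileid[:4]] += 1

print(dict(year_counts["citizen"]))   # {'1789': 1, '1793': 2}
print(dict(year_counts["america"]))   # {'1789': 1, '1793': 1}
```

Because the match is a prefix test, inflected forms such as "American" are counted under "america", which is why the plotted counts cover more than the exact words.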

ConditionalFreqDist::plot(conditions, samples): plots the conditional frequency distribution for the given conditions and samples.

from nltk.corpus import udhr

languages = ['Chickasaw', 'English', 'German_Deutsch']

# Condition: the language; event: the length of each word
cfd = nltk.ConditionalFreqDist(
    (lang, len(word))
    for lang in languages
    for word in udhr.words(lang + '-Latin1'))

cfd.plot(cumulative=True)

Figure 2-2
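What cumulative=True changes is easier to see on numbers than on a plot. The following is a small plain-Python sketch, assuming that a cumulative plot shows running totals of the counts over the sorted samples; the word list is a toy sample, not text from udhr.

```python
from collections import Counter
from itertools import accumulate

words = ["a", "to", "of", "the", "and", "word"]  # toy sample

# Plain counts of each word length.
length_counts = Counter(len(w) for w in words)

# Running totals over the lengths in increasing order.
lengths = sorted(length_counts)
cumulative = list(accumulate(length_counts[n] for n in lengths))

print(lengths)      # [1, 2, 3, 4]
print(cumulative)   # [1, 3, 5, 6]
```

The last cumulative value equals the total number of words, so each curve in a cumulative plot ends at the size of its sample.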

ConditionalFreqDist::tabulate(conditions, samples): prints the conditional frequency distribution as a table for the specified conditions and samples.

import nltk

from nltk.corpus import brown

pairs = [(genre, word) for genre in brown.categories() for word in brown.words(categories=genre)]

cfd = nltk.ConditionalFreqDist(pairs)

genres = ['news', 'religion', 'hobbies', 'science_fiction', 'romance', 'humor']

modals = ['can', 'could', 'may', 'might', 'must', 'will']

cfd.tabulate(conditions=genres, samples=modals)
                  can could   may might  must  will 
           news    93    86    66    38    50   389 
       religion    82    59    78    12    54    71 
        hobbies   268    58   131    22    83   264 
science_fiction    16    49     4    12     8    16 
        romance    74   193    11    51    45    43 
          humor    16    30     8     8     9    13

The table shows how many times each modal verb occurs in each genre.
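As a rough illustration of what such a table involves, here is a plain-Python sketch that lays out nested counts as rows (conditions) and columns (samples). The helper `format_table` is hypothetical and is not how NLTK implements tabulate; the four counts are copied from the output above.

```python
def format_table(counts, conditions, samples):
    """Return a text table: one row per condition, one column per sample."""
    width = max(len(c) for c in conditions)
    lines = [" " * width + "".join(f"{s:>6}" for s in samples)]
    for c in conditions:
        # A sample that never occurred under a condition is shown as 0.
        row = "".join(f"{counts.get(c, {}).get(s, 0):>6}" for s in samples)
        lines.append(f"{c:>{width}}" + row)
    return "\n".join(lines)

# A few counts taken from the tabulate output shown earlier.
modal_counts = {
    "news": {"can": 93, "will": 389},
    "humor": {"can": 16, "will": 13},
}

table = format_table(modal_counts, ["news", "humor"], ["can", "will"])
print(table)
```

Passing explicit lists of conditions and samples, as the call to tabulate does, fixes both the row order and the column order of the result.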

Bigrams

nltk.bigrams(words): generates all bigrams (pairs of adjacent words) from the given list of words.

sent = ['I', 'am', 'a', 'good', 'man']
print(list(nltk.bigrams(sent)))

[('I', 'am'), ('am', 'a'), ('a', 'good'), ('good', 'man')]
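The same pairs can be produced without NLTK, which makes it clear what a bigram is. This is a minimal sketch; `simple_bigrams` is a hypothetical helper name, and it assumes the input is a list so that it can be sliced.

```python
def simple_bigrams(words):
    """Pair each word with the word that follows it."""
    return list(zip(words, words[1:]))

sent = ['I', 'am', 'a', 'good', 'man']
print(simple_bigrams(sent))
# [('I', 'am'), ('am', 'a'), ('a', 'good'), ('good', 'man')]
```

A list of n words yields n - 1 bigrams, since the last word has no successor.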

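Applying bigrams to a text gives every bigram in that text. If we then build a conditional frequency distribution from those bigrams, with the first word as the condition and the second word as the event, we get, for each word, the frequency distribution of the words that follow it.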

text = brown.words(categories='news')
bigrams_words = nltk.bigrams(text)
cfd = nltk.ConditionalFreqDist(bigrams_words)
fd = cfd['can']
fd.plot(10)


The plot shows that the word that most often follows can is be.
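The step from bigrams to "most common next word" can be sketched in plain Python on a toy sentence. The text below is invented for illustration and is not from the Brown Corpus; one Counter per first word stands in for the conditional frequency distribution.

```python
from collections import Counter, defaultdict

text = "you can be sure that we can be there and they can go".split()

# Condition: the first word of each bigram; event: the word that follows it.
following = defaultdict(Counter)
for first, second in zip(text, text[1:]):
    following[first][second] += 1

print(following["can"].most_common(1))   # [('be', 2)]
```

Here `following["can"]` corresponds to cfd['can'] in the code above, and most_common gives the same ranking that the plot displays from left to right.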
