python.nlp随笔（五）词性标注详解

1. 准备工作：分词和清洗

[python]view plain copy
import nltk  
from nltk.corpus import stopwords  
from nltk.corpus import brown  
import numpy as np  
  
#分词  
text = "Sentiment analysis is a challenging subject in machine learning.\  
 People express their emotions in language that is often obscured by sarcasm,\  
  ambiguity, and plays on words, all of which could be very misleading for \  
  both humans and computers.".lower()  
text_list = nltk.word_tokenize(text)  
#去掉标点符号  
english_punctuations = [',', '.', ':', ';', '?', '(', ')', '[', ']', '&', '!', '*', '@', '#', '$', '%']  
text_list = [word for word in text_list if word not in english_punctuations]  
#去掉停用词  
stops = set(stopwords.words("english"))  
text_list = [word for word in text_list if word not in stops]  

2. 使用词性标注器：处理一个词序列，为每个词附加一个词性标记

[python]view plain copy
nltk.pos_tag(text_list)  
Out[81]:   
[('sentiment', 'NN'),  
 ('analysis', 'NN'),  
 ('challenging', 'VBG'),  
 ('subject', 'JJ'),  
 ('machine', 'NN'),  
 ('learning', 'VBG'),  
 ('people', 'NNS'),  
 ('express', 'JJ'),  
 ('emotions', 'NNS'),  
 ('language', 'NN'),  
 ('often', 'RB'),  
 ('obscured', 'VBD'),  
 ('sarcasm', 'JJ'),  
 ('ambiguity', 'NN'),  
 ('plays', 'NNS'),  
 ('words', 'NNS'),  
 ('could', 'MD'),  
 ('misleading', 'VB'),  
 ('humans', 'NNS'),  
 ('computers', 'NNS')]  

3. 读取已标注的语料库：NLTK中包括的若干语料库已经标注了词性

[python]view plain copy
brown_taged= nltk.corpus.brown.tagged_words()  

4. 自动标注

[python]view plain copy
brown_tagged_sents = brown.tagged_sents(categories='news')  
brown_sents = brown.sents(categories='news')  
#默认标注  
tags = [tag for (word,tag) in brown.tagged_words(categories='news')]  
print(nltk.FreqDist(tags).max())  
  
NN  

[python]view plain copy
raw = 'I do not like green eggs and ham, I do not like them Sam I am!'  
tokens = nltk.word_tokenize(raw)  
default_tagger = nltk.DefaultTagger('NN')  
print(default_tagger.tag(tokens))  
print(default_tagger.evaluate(brown_tagged_sents))  
  
[('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('green', 'NN'), ('eggs', 'NN'), ('and', 'NN'), ('ham', 'NN'), (',', 'NN'), ('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('them', 'NN'), ('Sam', 'NN'), ('I', 'NN'), ('am', 'NN'), ('!', 'NN')]  
0.13089484257215028  

[python]view plain copy
#正则表达式标注器  
patterns= [(r'.*ing$','VBG'),(r'.*ed$','VBD'),(r'.*es$','VBZ'),(r'.*ould$','MD'),\  
           (r'.*\'s$','NN$'),(r'.*s$','NNS'),(r'^-?[0-9]+(.[0-9]+)?$','CD'),(r'.*','NN')]  
regexp_tagger = nltk.RegexpTagger(patterns)  
regexp_tagger.tag(brown_sents[3])  
print(regexp_tagger.evaluate(brown_tagged_sents))  
  
0.20326391789486245  

[python]view plain copy
#查询标注器：找出100个最频繁的词，存储它们最有可能的标记。然后可以使用这个信息作为  
#"查询标注器"（NLTK UnigramTagger）的模型  
fd = nltk.FreqDist(brown.words(categories='news'))  
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))  
most_freq_words = list(fd.keys())[:100]  
likely_tags = dict((word,cfd[word].max()) for word in most_freq_words)  
# baseline_tagger = nltk.UnigramTagger(model=likely_tags)  
#许多词都被分配了None标签，因为它们不在100个最频繁的词中，可以使用backoff参数设置这些词的默认词性  
baseline_tagger = nltk.UnigramTagger(model=likely_tags,backoff=nltk.DefaultTagger('NN'))  
print(baseline_tagger.evaluate(brown_tagged_sents))  
0.46063806511923944  

5. N-gram 标注

（1）一元标注器：利用一种简单的算法，对每个标识符分配最有可能的标记，不考虑上下文

[python]view plain copy
In[87]: unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)     #训练一个一元标注器  
print(unigram_tagger.tag(brown_sents[2007]))  
unigram_tagger.evaluate((brown_tagged_sents))  
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'), ('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'), (',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'), ('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'), ('direct', 'JJ'), ('.', '.')]  
Out[87]: 0.9349006503968017  

[python]view plain copy
#分离训练集和测试集  
size = int(len(brown_tagged_sents)*0.9)  
train_sents = brown_tagged_sents[:size]  
test_sents = brown_tagged_sents[size:]  
unigram_tagger = nltk.UnigramTagger(train_sents)  
unigram_tagger.evaluate(test_sents)  
Out[89]: 0.8121200039868434  

（2）一般的N-gram标注：n-gram标注器是unigram标注器的一般化，它的上下文是当期词和它前面n-1个标注分的词性标记。

NgramTagger类使用一个已标注的训练语料库来确定每个上下文中哪个词性标记最有可能。

[python]view plain copy
bigram_tagger = nltk.BigramTagger(train_sents)  
bigram_tagger.tag(brown_sents[2007])  
bigram_tagger.evaluate(test_sents)  
Out[90]: 0.10206319146815508  

注意，bigram标注器能够标注训练量中它看到过的句子中的所有词，但对一个没见过的句子却不行。只要遇到一个新词就无法给它分配标记，也无法给新词后面的一个词

分配标记，因为在训练过程中从来没有见过哪个词前面有None标记的词。它的整体准确度非常低。

（3）组合标注器

 尝试使用bigram标注器标注标识符

 如果bigram无法找到标记，尝试unigram标注器

[python]view plain copy
t0 = nltk.DefaultTagger('NN')  
t1 = nltk.UnigramTagger(train_sents,backoff=t0)  
t2 = nltk.BigramTagger(train_sents,backoff=t1)  
t2.evaluate(test_sents)  
Out[92]: 0.8452108043456593  

[python]view plain copy
t3 = nltk.BigramTagger(train_sents,cutoff=2,backoff=t1)  
t3.evaluate(test_sents)  
Out[95]: 0.8424200139539519  

cutoff=2表示将丢弃那些只出现一次或者两次的上下文。

（4）存储标注器：在大语料库中训练标注器可能需要大量的时间，保存标注器很有必要

[python]view plain copy
In[101]: #保存标注器  
from pickle import dump  
output = open('t2.pkl','wb')  
dump(t2,output,-1)  
output.close()  
#加载标注器  
from pickle import load  
input = open('t2.pkl','rb')  
tagger = load(input)  
input.close()  
#使用标注器  
text = "Sentiment analysis is a challenging subject in machine learning."  
tokens = text.split()  
tagger.tag(tokens)  
Out[101]:   
[('Sentiment', 'NN'),  
 ('analysis', 'NN'),  
 ('is', 'BEZ'),  
 ('a', 'AT'),  
 ('challenging', 'JJ'),  
 ('subject', 'NN'),  
 ('in', 'IN'),  
 ('machine', 'NN'),  
 ('learning.', 'NN')]  

## 原文代码来源于书籍《Python自然语言处理》

python.nlp随笔（五）词性标注详解

猜你喜欢