Part-of-Speech Tagging
Part-of-speech tagging: determining the grammatical category of each word in a given sentence and labeling it with the corresponding part-of-speech tag.
Difficulties: disambiguating words that belong to multiple part-of-speech classes, and tagging out-of-vocabulary (unregistered) words.
In a specific context, a word belongs to exactly one part of speech.
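A small illustration of this ambiguity, using NLTK's default English tagger (a hedged sketch: it assumes the averaged_perceptron_tagger data package is installed, and the example sentences are my own):
import nltk

# the same word 'book' receives different tags depending on context
print(nltk.pos_tag(['I', 'book', 'a', 'flight']))  # 'book' is typically tagged as a verb here
print(nltk.pos_tag(['I', 'read', 'a', 'book']))    # 'book' is typically tagged as a noun here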
Special Issues with Part-of-Speech Tagging
- Morphological criterion: does not suit Chinese, which has little inflection;
- Meaning criterion: serves only as a reference;
- Distribution criterion (functional criterion).
Part-of-speech tagging methods
- Rule-based part-of-speech tagging
- Part-of-speech tagging based on machine learning
Unsupervised learning: clustering methods
Semi-supervised learning: self-training, multi-view algorithms, transductive learning
- Methods combining rules and statistics
Take the initial part-of-speech tagging of a sentence, disambiguate first with rules and then statistically, guess the tags of out-of-vocabulary words, and finally proofread manually to obtain a more accurate tagging result.
- Perceptron-based tagging method
Input: word feature set
Output: tagging result (part of speech)
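A minimal sketch of this idea using NLTK's pretrained averaged-perceptron tagger (an assumption on my part that this library model is acceptable as an illustration; it requires the averaged_perceptron_tagger data package):
from nltk.tag.perceptron import PerceptronTagger

pt = PerceptronTagger()  # loads the pretrained averaged-perceptron model
# internally, each word is mapped to a feature set (suffixes, neighboring
# words, previous tags) and the perceptron outputs the highest-scoring tag
print(pt.tag(['the', 'man', 'is', 'walking']))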
Markov chain: the future state depends only on the current state (short-range dependence).
Hidden Markov model (HMM) combines two components:
Markov chain: describes the state transitions, characterized by transition probabilities;
General stochastic process: describes the relationship between states and the observation sequence, characterized by observation (emission) probabilities.
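A hedged sketch of supervised HMM training in NLTK (train_supervised estimates both probability tables from tagged sentences; without smoothing, unseen words are handled poorly):
import nltk
from nltk.corpus import brown
from nltk.tag.hmm import HiddenMarkovModelTrainer

train_sents = brown.tagged_sents(categories='news')[100:]
trainer = HiddenMarkovModelTrainer()
hmm = trainer.train_supervised(train_sents)  # estimate transition and emission probabilities
print(hmm.tag(['the', 'man', 'is']))         # decode the best tag sequence (Viterbi)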
Designing a simple tagger
- Default tagger nltk.DefaultTagger
It uses only corpus-level tag statistics: every word is assigned the single most frequent tag.
NLTK evaluates a tagger against a gold-standard test set that was tagged manually.
import nltk
from nltk.corpus import brown
from nltk.probability import FreqDist
tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
most_freq_tag = FreqDist(tags).max()
# Result: NN
print(most_freq_tag)
brown_sents = brown.sents(categories='news')
brown_tagged_sents = brown.tagged_sents(categories='news')
dt = nltk.DefaultTagger(most_freq_tag)  # a tagger that always outputs the most frequent tag ('NN')
dt.tag(brown_sents[0])  # tag the sentence brown_sents[0]
# Result: 0.13089484257215028
print(dt.evaluate(brown_tagged_sents))  # evaluate the tagger
- The regular expression tagger nltk.RegexpTagger
uses morphological knowledge (word endings) for an initial performance gain.
Define the regular patterns: the expressions are tried in order, and the first match takes precedence.
import nltk
from nltk.corpus import brown
from nltk.probability import FreqDist
patterns = [(r'.*ing$', 'VBG'),               # gerunds
            (r'.*ed$', 'VBD'),                # simple past
            (r'.*es$', 'VBZ'),                # 3rd singular present
            (r'.*ould$', 'MD'),               # modals
            (r'.*\'s$', 'NN$'),               # possessive nouns
            (r'.*s$', 'NNS'),                 # plural nouns
            (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'), # cardinal numbers (note the escaped dot)
            (r'.*', 'NN')                     # nouns (default)
            ]
brown_sents = brown.sents(categories='news')
brown_tagged_sents = brown.tagged_sents(categories='news')
rt = nltk.RegexpTagger(patterns)  # define the regular-expression tagger
rt.tag(brown_sents[0])  # tag the sentence brown_sents[0]
# Result: 0.20326391789486245
print(rt.evaluate(brown_tagged_sents))  # evaluate the tagger
- The lookup tagger nltk.UnigramTagger
exploits word-frequency knowledge to improve performance further.
Implementation: handle high-frequency words specially; find the most likely tag for each of the N (e.g. 100) most frequent words.
import nltk
from nltk.corpus import brown
from nltk.probability import FreqDist
# choose words from the Brown corpus
fd = nltk.FreqDist(brown.words(categories='news'))
# distribution of words and their tags in the Brown corpus
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))  # cfd acts like a dict of frequency distributions
# Result: {'AT': 5558, 'AT-TL': 18, 'AT-HL': 4}
print(dict(cfd['the']))
# the tag AT occurs 5558 times, AT-TL 18 times, and AT-HL 4 times
# take the N most frequent words and look up their most likely tags
N = 100
most_freq_words = [word for (word, num) in fd.most_common(N)]
likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)
brown_sents = brown.sents(categories='news')
brown_tagged_sents = brown.tagged_sents(categories='news')
baseline_tagger = nltk.UnigramTagger(model=likely_tags)  # build the lookup tagger
baseline_tagger.tag(brown_sents[0])  # tag a sentence
# Result: 0.45578495136941344
print(baseline_tagger.evaluate(brown_tagged_sents))  # evaluate the tagger
As the number of words in the model grows, the tagger's performance rises significantly; around 3,000 words is generally a good choice. These high-frequency words are mostly function words and are therefore largely independent of the topic.
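A quick sketch to check this claim, reusing fd, cfd and brown_tagged_sents from the example above (the exact accuracies depend on the corpus slice):
# lookup-tagger accuracy as the model size N grows
for N in [100, 500, 1000, 3000]:
    words = [w for (w, _) in fd.most_common(N)]
    model = dict((w, cfd[w].max()) for w in words)
    tagger = nltk.UnigramTagger(model=model)
    print(N, tagger.evaluate(brown_tagged_sents))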
- Combining taggers
The ideal tagging process: first consult the lookup tagger; if the word is not in the model, back off to the default tagger or the regular-expression tagger.
# (setup code from the examples above omitted)
btr = nltk.UnigramTagger(model=likely_tags, backoff=dt)  # lookup tagger backed off to the default tagger
btr.tag(brown_sents[0])  # tag the sentence brown_sents[0]
# Result: 0.5817769556656125
print(btr.evaluate(brown_tagged_sents))  # evaluate the tagger
btr = nltk.UnigramTagger(model=likely_tags, backoff=rt)  # lookup tagger backed off to the regex tagger
btr.tag(brown_sents[0])  # tag the sentence brown_sents[0]
# Result: 0.6498697217415518
print(btr.evaluate(brown_tagged_sents))  # evaluate the tagger
- Unigram (1-gram) and bigram (2-gram) taggers
These taggers introduce training, and the bigram tagger adds context features. The tagged corpus is the basis of the tagger: for each word, the tag it receives most frequently in the corpus is selected as its tag. This process is called "training".
Unigram Tagging
siz = 100
train_sents = brown_tagged_sents[siz:]
test_sents = brown_tagged_sents[:siz]
ug = nltk.UnigramTagger(train_sents)  # train
# Result: [('the', 'AT'), ('man', 'NN'), ('is', 'BEZ')]
print(list(ug.tag(['the', 'man', 'is'])))
# Result: 0.8562610229276896
print(ug.evaluate(test_sents))  # evaluate
Bigram Tagging
bg = nltk.BigramTagger(train_sents)  # train
# Result: [('the', 'AT'), ('man', 'NN'), ('is', 'BEZ')]
print(list(bg.tag(['the', 'man', 'is'])))
# Result: 0.1318342151675485
print(bg.evaluate(test_sents))  # evaluate
The three-word version is the trigram tagger (nltk.TrigramTagger); the general case is the n-gram tagger (nltk.NgramTagger).
The problem with the bigram tagger is data sparseness: when it meets a (previous tag, word) context never seen in training, it assigns None, and since that None becomes the context for the next word, all following words fail as well. This explains the very low accuracy above.
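This failure mode is easy to reproduce with the bg tagger trained above ('veranda' stands in for any word presumably absent from the news training data):
print(bg.tag(['The', 'veranda', 'was', 'sunny']))
# once a context is unknown, the word is tagged None, and that None context
# makes every following word come out as None too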
- The n-gram combined tagger
Used alone, the bigram tagger correctly tags only a small fraction of words; chaining n-gram taggers through backoff fixes this, as in the sketch below.
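A sketch of the standard backoff chain (bigram → unigram → default), reusing train_sents and test_sents from the unigram example above:
t0 = nltk.DefaultTagger('NN')                     # last resort: the most common tag
t1 = nltk.UnigramTagger(train_sents, backoff=t0)  # per-word statistics
t2 = nltk.BigramTagger(train_sents, backoff=t1)   # context statistics, backing off when unseen
print(t2.evaluate(test_sents))                    # substantially better than BigramTagger alone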
Introduction to commonly used taggers
- nltk part-of-speech tagging tool
from nltk.book import text1  # text1: Moby Dick from the NLTK book module
word_lis = list(text1)  # build the word list (text1 is already a tokenized Text)
nltk.pos_tag(word_lis)  # the tagging tool for a token sequence
result = nltk.corpus.nps_chat.tagged_words()  # read the tagged nps_chat corpus
# Result: [('now', 'RB'), ('im', 'PRP'), ('left', 'VBD'), ...]
print(result)
# Result: ('fly', 'NN')
print(nltk.tag.str2tuple('fly/NN'))  # convert a tagged string to tuple form
- thulac Chinese part-of-speech tagging tool
import thulac
thu = thulac.thulac()
# Result: [['从', 'p'], ['繁体', 'n'], ['转换', 'v'], ['为', 'v'], ['简体', 'n'], ['。', 'w']]
print(thu.cut("从繁体转换为简体。"))
- jieba part-of-speech tagger
import jieba
from jieba import posseg
# posseg.cut() needs a sentence argument; it yields pair(word, flag) objects
pos = list(posseg.cut("从繁体转换为简体。"))
print(pos)
Applications of part-of-speech taggers
Part-of-speech distribution
In a corpus, find the most common parts of speech.
import nltk
from nltk.corpus import brown
brown_news_tagged = brown.tagged_words(categories='news')
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)
# Result: [('NN', 13162), ('IN', 10616), ('AT', 8893)]
print(tag_fd.most_common(3))
Word combinations based on part-of-speech tags
- Two-word combinations (bigrams)
Find the words that immediately follow 'often'.
import nltk
from nltk.corpus import brown
text = brown.words(categories='news')
bitext = nltk.bigrams(text)
ss = list(bitext)
# Result: ['ambiguous', ',', 'hear', 'a', 'needs', 'that', 'in', 'enough', 'did', 'acceptable', 'mar', '.', 'obstructed', 'build']
print([b for (a, b) in ss if a == 'often'])  # all words immediately following 'often'
bd = brown.tagged_words(categories='news')
bibd = nltk.bigrams(bd)
# Result: ['JJ', ',', 'VB', 'AT', 'VBZ', 'CS', 'IN', 'QLP', 'DOD', 'JJ', 'VB', '.', 'VBD', 'VB']
print([b[1] for (a, b) in bibd if a[0] == 'often'])  # tags of the words immediately following 'often'
- Three-word combinations (trigrams)
Find trigrams whose middle word is 'to' (tag TO) with a verb on each side.
# (setup code from above omitted: bd = brown.tagged_words(categories='news'))
tribd = nltk.trigrams(bd)
lis = [(a[0], b[0], c[0]) for (a, b, c) in tribd
       if a[1].startswith('V') and c[1].startswith('V') and b[1] == 'TO']  # search for verb-TO-verb trigrams
# Result: 344
print(len(lis))