Python part-of-speech tagging

Part-of-speech tagging

Part-of-speech tagging: the process of determining the grammatical category (part of speech) of each word in a sentence and labeling it with the corresponding tag.

Difficulties: disambiguating words that can take more than one part of speech, and tagging unknown (out-of-vocabulary) words.

In a specific context, a word belongs to only one part of speech.

Special Issues with Part-of-Speech Tagging

  1. Morphological criterion: does not fit the classification of Chinese well;
  2. Semantic criterion: useful only as a reference;
  3. Distributional criterion (functional criterion).

Part-of-speech tagging methods

  1. Rule-based part-of-speech tagging

  2. Part-of-speech tagging based on machine learning
    Unsupervised learning: clustering methods
    Semi-supervised learning: self-training, multi-view algorithms (e.g. co-training), transductive learning

  3. Methods combining rules and statistics
    Starting from an initial part-of-speech tagging of the sentence, first disambiguate with rules, then disambiguate with statistics and guess the tags of unknown words, and finally proofread manually to obtain a more accurate result.

  4. Perceptron-based tagging (see the sketch after this list)
    Input: a set of word features
    Output: the tagging result (part of speech)
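
For example, NLTK ships a pretrained averaged-perceptron tagger; a minimal sketch, assuming the 'averaged_perceptron_tagger' model has been downloaded:

import nltk
from nltk.tag.perceptron import PerceptronTagger

# nltk.download('averaged_perceptron_tagger')  # fetch the pretrained model once
tagger = PerceptronTagger()  # loads the pretrained English model

# Input: a list of word tokens; output: (word, tag) pairs
print(tagger.tag(['the', 'man', 'is', 'walking']))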

Markov chain: the future state depends only on the current state (short-range dependence).
A hidden Markov model (HMM) combines two stochastic processes:
a Markov chain that describes the transitions between states, characterized by transition probabilities;
a general stochastic process that describes the relationship between the states and the observation sequence, characterized by observation probabilities.
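
As an illustration, NLTK's HiddenMarkovModelTagger can learn both distributions from a tagged corpus; a minimal sketch:

import nltk
from nltk.corpus import brown
from nltk.tag.hmm import HiddenMarkovModelTagger

# Train an HMM tagger: transition probabilities between tags,
# emission probabilities of words given tags
tagged_sents = brown.tagged_sents(categories='news')
hmm_tagger = HiddenMarkovModelTagger.train(tagged_sents[100:])

print(hmm_tagger.tag(['the', 'man', 'is']))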

Designing a simple tagger

  1. The default tagger nltk.DefaultTagger

Only statistical information about the word itself is used.
NLTK evaluates taggers against a standard test set that has been tagged manually (a gold standard).

import nltk
from nltk.corpus import brown
from nltk.probability import FreqDist

tags = [tag for (word, tag) in brown.tagged_words(categories='news')]
word = FreqDist(tags).max()  # the most frequent tag in the corpus
# Result: NN
print(word)

brown_sents = brown.sents(categories='news')
brown_tagged_sents = brown.tagged_sents(categories='news')

dt = nltk.DefaultTagger(word)  # build a tagger that always outputs this tag
dt.tag(brown_sents[0])  # tag the sentence brown_sents[0]

# Result: 0.13089484257215028
print(dt.evaluate(brown_tagged_sents))  # evaluate the tagger

  2. The regular expression tagger nltk.RegexpTagger
    leverages morphological knowledge to improve performance.

Define the regular expression patterns. The patterns are ordered: earlier patterns take precedence.

import nltk
from nltk.corpus import brown

patterns = [
    (r'.*ing$', 'VBG'),                # gerunds
    (r'.*ed$', 'VBD'),                 # simple past
    (r'.*es$', 'VBZ'),                 # 3rd singular present
    (r'.*ould$', 'MD'),                # modals
    (r'.*\'s$', 'NN$'),                # possessive nouns
    (r'.*s$', 'NNS'),                  # plural nouns
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),  # cardinal numbers
    (r'.*', 'NN'),                     # nouns (default)
]

brown_sents = brown.sents(categories='news')
brown_tagged_sents = brown.tagged_sents(categories='news')

rt = nltk.RegexpTagger(patterns)  # define the regular expression tagger
rt.tag(brown_sents[0])            # tag the sentence brown_sents[0]

# Result: 0.20326391789486245
print(rt.evaluate(brown_tagged_sents))  # evaluate the tagger

  3. The lookup tagger nltk.UnigramTagger
    exploits word-frequency knowledge to further improve performance.

Implementation: handle the high-frequency words specially; find the most likely tag for each of the N (e.g. 100) most frequent words.

import nltk
from nltk.corpus import brown

# Frequency distribution of the words in the Brown news corpus
fd = nltk.FreqDist(brown.words(categories='news'))
# Distribution of tags for each word (cfd subclasses dict, keyed by word)
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))

# Result: {'AT': 5558, 'AT-TL': 18, 'AT-HL': 4}
print(dict(cfd['the']))
# 'the' is tagged AT 5558 times, AT-TL 18 times, and AT-HL 4 times

# Take the N most frequent words and look up each word's most likely tag
N = 100
most_freq_words = [word for (word, num) in fd.most_common(N)]
likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)

brown_sents = brown.sents(categories='news')
brown_tagged_sents = brown.tagged_sents(categories='news')

baseline_tagger = nltk.UnigramTagger(model=likely_tags)  # build the lookup tagger
baseline_tagger.tag(brown_sents[0])  # tag the sentence

# Result: 0.45578495136941344
print(baseline_tagger.evaluate(brown_tagged_sents))  # evaluate the tagger

As the number of words N increases, the tagger's performance improves significantly; around 3000 words is generally a good choice. These high-frequency words are mostly function words, largely independent of the topic.
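
To see how accuracy grows with N, the model size can be swept; a self-contained sketch (the exact accuracies depend on the corpus version):

import nltk
from nltk.corpus import brown

fd = nltk.FreqDist(brown.words(categories='news'))
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))
brown_tagged_sents = brown.tagged_sents(categories='news')

# Accuracy of the lookup tagger for increasing model sizes N
for N in [100, 500, 1000, 3000]:
    most_freq_words = [w for (w, _) in fd.most_common(N)]
    likely_tags = dict((w, cfd[w].max()) for w in most_freq_words)
    tagger = nltk.UnigramTagger(model=likely_tags)
    print(N, tagger.evaluate(brown_tagged_sents))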

  4. Combining taggers
    The ideal tagging process: first use the lookup tagger; if the word is not found, fall back to the default tagger or the regular expression tagger.
# previous code omitted

btr = nltk.UnigramTagger(model=likely_tags, backoff=dt)  # lookup tagger backed off to the default tagger
btr.tag(brown_sents[0])  # tag the sentence brown_sents[0]

# Result: 0.5817769556656125
print(btr.evaluate(brown_tagged_sents))  # evaluate the tagger

btr = nltk.UnigramTagger(model=likely_tags, backoff=rt)  # lookup tagger backed off to the regex tagger
btr.tag(brown_sents[0])  # tag the sentence brown_sents[0]

# Result: 0.6498697217415518
print(btr.evaluate(brown_tagged_sents))  # evaluate the tagger

  5. Unigram (uni-gram) and bigram (bi-gram) taggers
    take context features into account.

    A tagged corpus is the basis of such a tagger: during training, each word is assigned the tag it most frequently carries in the corpus. This process is called "training".

    Unigram Tagging

    siz = 100
    train_sents = brown_tagged_sents[siz:]
    test_sents = brown_tagged_sents[:siz]

    ug = nltk.UnigramTagger(train_sents)  # training

    # Result: [('the', 'AT'), ('man', 'NN'), ('is', 'BEZ')]
    print(list(ug.tag(['the', 'man', 'is'])))
    # Result: 0.8562610229276896
    print(ug.evaluate(test_sents))  # evaluation
    

    Bigram Tagging

    bg = nltk.BigramTagger(train_sents)  # training

    # Result: [('the', 'AT'), ('man', 'NN'), ('is', 'BEZ')]
    print(list(bg.tag(['the', 'man', 'is'])))
    # Result: 0.1318342151675485
    print(bg.evaluate(test_sents))  # evaluation
    

    The trigram version is nltk.TrigramTagger, and the general n-gram version is nltk.NgramTagger.

    The problem with the bigram tagger is data sparsity: it cannot tag correctly when it meets a word or context that never appeared in the tagged corpus.

  6. The n-gram combined tagger
    A bigram tagger used alone correctly tags only a small fraction of tokens, so it is combined with simpler taggers via back-off, as sketched below.
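
A standard remedy is to chain the n-gram taggers so that each one falls back to a simpler tagger when it lacks statistics for the current context; a sketch following the usual NLTK back-off pattern:

import nltk
from nltk.corpus import brown

brown_tagged_sents = brown.tagged_sents(categories='news')
train_sents = brown_tagged_sents[100:]
test_sents = brown_tagged_sents[:100]

# Back-off chain: trigram -> bigram -> unigram -> default
t0 = nltk.DefaultTagger('NN')
t1 = nltk.UnigramTagger(train_sents, backoff=t0)
t2 = nltk.BigramTagger(train_sents, backoff=t1)
t3 = nltk.TrigramTagger(train_sents, backoff=t2)

print(t3.evaluate(test_sents))  # noticeably higher than the bigram tagger alone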

Introduction to commonly used taggers

  1. The nltk part-of-speech tagging tool
import nltk
from nltk.book import text1  # text1 (Moby Dick) comes from the NLTK book module

# join the tokens back into a string so word_tokenize has text to split
word_lis = nltk.word_tokenize(' '.join(text1))  # build the word list
nltk.pos_tag(word_lis)  # tag the word sequence

result = nltk.corpus.nps_chat.tagged_words()  # read the tagged corpus nps_chat
# Result: [('now', 'RB'), ('im', 'PRP'), ('left', 'VBD'), ...]
print(result)

# Result: ('fly', 'NN')
print(nltk.tag.str2tuple('fly/NN'))  # convert a string annotation to a tuple

  2. The thulac Chinese part-of-speech tagging tool
import thulac

thu = thulac.thulac()
# Result: [['从', 'p'], ['繁体', 'n'], ['转换', 'v'], ['为', 'v'], ['简体', 'n'], ['。', 'w']]
print(thu.cut("从繁体转换为简体。"))

  3. The jieba part-of-speech tagger
import jieba
from jieba import posseg

# posseg.cut takes the sentence to segment and tag;
# each element of the result is a (word, flag) pair
pos = list(posseg.cut("从繁体转换为简体。"))
print(pos)

Applications of part-of-speech taggers

Part-of-speech distribution

In a corpus, find the most common parts of speech.

import nltk
from nltk.corpus import brown

brown_news_tagged = brown.tagged_words(categories='news')
tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)

# Result: [('NN', 13162), ('IN', 10616), ('AT', 8893)]
print(tag_fd.most_common(3))

Word combinations based on part-of-speech tags

  1. Two-word combinations (bigrams)

Find the words that come immediately after 'often'.

import nltk
from nltk.corpus import brown

text = brown.words(categories='news')
bitext = nltk.bigrams(text)
ss = [word for word in bitext]

# Result: ['ambiguous', ',', 'hear', 'a', 'needs', 'that', 'in', 'enough', 'did', 'acceptable', 'mar', '.', 'obstructed', 'build']
print([b for (a, b) in ss if a == 'often'])  # all words immediately following 'often'

bd = brown.tagged_words(categories='news')
bibd = nltk.bigrams(bd)
# Result: ['JJ', ',', 'VB', 'AT', 'VBZ', 'CS', 'IN', 'QLP', 'DOD', 'JJ', 'VB', '.', 'VBD', 'VB']
print([b[1] for (a, b) in bibd if a[0] == 'often'])  # tags of the words that follow 'often'

  2. Trigrams

Find trigrams whose middle word is 'to' and whose first and last words are verbs.

# previous code omitted

tribd = nltk.trigrams(bd)
# Search for (verb, 'to', verb) triples
lis = [(a[0], b[0], c[0]) for (a, b, c) in tribd
       if a[1].startswith('V') and c[1].startswith('V') and b[1] == 'TO']
# Result: 344
print(len(lis))
