Python natural language processing - NLTK - reconstructing the part-of-speech tag dictionary (pos_tag)

  • Continuing from the previous article on part-of-speech tags
  • After running that code, I found a problem

It turns out that like and hate are not added to ret[].


But like and hate are crucial sentiment keywords for us. The reason they are dropped is that, in this context, like and hate are tagged as IN and NN:
from nltk import pos_tag, word_tokenize

a_sentence = 'like hate'
pos_tag(word_tokenize(a_sentence))
[('like', 'IN'), ('hate', 'NN')]
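
The tag pos_tag assigns depends on the surrounding context, which is worth checking before blaming a dictionary. A quick sketch; the outputs in the comments are what the default tagger typically returns, not guaranteed:

from nltk import pos_tag, word_tokenize

#In isolation 'like' and 'hate' come back as IN/NN, but inside a full sentence
#the same words are usually tagged as verbs (VBP).
print(pos_tag(word_tokenize('like hate')))       # e.g. [('like', 'IN'), ('hate', 'NN')]
print(pos_tag(word_tokenize('I like it')))       # 'like' usually comes back as VBP here
print(pos_tag(word_tokenize('I hate Mondays')))  # 'hate' usually comes back as VBP here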
  • Solution ideas
There are two possible solutions:
  1. Manually find the keywords whose parts of speech are tagged inaccurately and therefore get filtered out, and add these keywords back to the content.
  2. Modify the dictionary of pos_tag.

The first method takes a lot of manual work, but it is simpler. If the amount of data is small, it can be used directly.

Here I can only use the second method, because I have to process 40,000 records.

But where is this dictionary? I searched Python's lib directory for a long time without finding the relevant source code.

So I turned to the internet. After searching Chinese and foreign websites for an afternoon, I finally learned a few things; nltk.org explains it most clearly. If your English is weak, translating the page is recommended.

http://www.nltk.org/book/ch05.html

http://www.cs.cmu.edu/~ark/TweetNLP/

https://stackoverflow.com/questions/30791194/nltk-get-and-simplify-list-of-tags

https://wenku.baidu.com/view/c63bec3b366baf1ffc4ffe4733687e21af45ffab.html

https://www.jianshu.com/p/22be6550c18b

One of the most useful is a blog post on CSDN (https://blog.csdn.net/fxjtoday/article/details/5841453). I annotated it in italics; it is reproduced below:

POS tagging: part-of-speech tagging, also called word classes or lexical categories. These different terms all refer to the same thing: part-of-speech tagging.

First, use the off-the-shelf tool in NLTK's toolset to do simple POS tagging of a text:

>>> import nltk
>>> text = nltk.word_tokenize("And now for something completely different")
>>> nltk.pos_tag(text)
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

This is how the interface is described in the API documentation:

Use NLTK's currently recommended part-of-speech tagger to tag the given list of tokens.

I checked the code: pos_tag loads the standard Treebank POS tagger.

1.      CC      Coordinating conjunction
2.     CD     Cardinal number
3.     DT     Determiner
4.     EX     Existential there
5.     FW     Foreign word
6.     IN     Preposition or subordinating conjunction
7.     JJ     Adjective
8.     JJR     Adjective, comparative
9.     JJS     Adjective, superlative
10.     LS     List item marker
11.     MD     Modal
12.     NN     Noun, singular or mass
13.     NNS     Noun, plural
14.     NNP     Proper noun, singular
15.     NNPS     Proper noun, plural
16.     PDT     Predeterminer
17.     POS     Possessive ending
18.     PRP     Personal pronoun
19.     PRP$     Possessive pronoun
20.     RB     Adverb
21.     RBR     Adverb, comparative
22.     RBS     Adverb, superlative
23.     RP     Particle
24.     SYM     Symbol
25.     TO     to
26.     UH     Interjection
27.     VB     Verb, base form
28.     VBD     Verb, past tense
29.     VBG     Verb, gerund or present participle
30.     VBN     Verb, past participle
31.     VBP     Verb, non-3rd person singular present
32.     VBZ     Verb, 3rd person singular present
33.     WDT     Wh-determiner
34.     WP     Wh-pronoun
35.     WP$     Possessive wh-pronoun
36.     WRB     Wh-adverb 
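
If you do not want to memorize this table, NLTK can print the definition of a tag directly. A small sketch using nltk.help.upenn_tagset (it needs the 'tagsets' data package, installable via nltk.download('tagsets')):

import nltk

#Print the definition and example words for a Penn Treebank tag
nltk.help.upenn_tagset('NN')
#A regular expression over tag names also works, e.g. all verb tags
nltk.help.upenn_tagset('VB.*')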

 

With the explanations of the main part-of-speech abbreviations above, the tagging produced by the interface is easier to understand.

Some of the corpora in nltk are tagged with parts of speech; these can be used as training sets. Tagged corpora have a tagged_words() method:

>>> nltk.corpus.brown.tagged_words()
[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ...]
>>> nltk.corpus.brown.tagged_words(simplify_tags=True)
[('The', 'DET'), ('Fulton', 'N'), ('County', 'N'), ...]
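
Note: in NLTK 3 the simplify_tags argument was removed; the equivalent today is the tagset parameter (it needs the 'universal_tagset' data package). A minimal sketch:

#Newer NLTK versions map the Brown tags to the universal tagset instead
nltk.corpus.brown.tagged_words(tagset='universal')
# e.g. [('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ...]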

 

Automatic Tagging

Now let's look at various automatic tagging methods. Because tags depend on a word's context, tagging works in units of sentences rather than individual words; if the unit were the whole word stream, the last word of one sentence could unreasonably influence the tag of the first word of the next sentence. Working sentence by sentence avoids this mistake, so that context effects never cross sentence boundaries.

Let's use brown corpus as an example,

>>> from nltk.corpus import brown

>>> brown_tagged_sents = brown.tagged_sents(categories='news')

#What does this categories argument mean? It selects a section of the Brown corpus, here the news texts.

>>> brown_sents = brown.sents(categories='news')

The tagged sentence set and the plain sentence set can then be used separately, for example as the training/evaluation data and as the raw test input for a tagging algorithm.
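
In practice the tagged sentences are usually split into a training part and a held-out part, following the NLTK book's 90/10 convention. A minimal sketch, assuming brown_tagged_sents from above:

>>> size = int(len(brown_tagged_sents) * 0.9)
>>> train_sents = brown_tagged_sents[:size]
>>> test_sents = brown_tagged_sents[size:]

#A tagger trained on train_sents can then be scored with tagger.accuracy(test_sents) (called evaluate() in older NLTK versions).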

 

The Default Tagger

The simplest possible tagger assigns the same tag to each token.

>>> raw = 'I do not like green eggs and ham, I do not like them Sam I am!'
>>> tokens = nltk.word_tokenize(raw)

>>> default_tagger = nltk.DefaultTagger('NN')

#Define a tagger and mark all parts of speech as NN.

>>> default_tagger.tag(tokens)
[('I', 'NN'), ('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('green', 'NN'),
('eggs', 'NN'), ('and', 'NN'), ('ham', 'NN'), (',', 'NN'), ('I', 'NN'),
('do', 'NN'), ('not', 'NN'), ('like', 'NN'), ('them', 'NN'), ('Sam', 'NN'),
('I', 'NN'), ('am', 'NN'), ('!', 'NN')]

This tagger really is simple: it marks everything with whatever tag you told it to use, which seems pointless on its own, but as a backoff it is still useful.

 In fact, a tagger is just a labeling procedure: for each word it produces a pair of [word, part of speech]. The tag itself can be any value, for example an empty list or a string.
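
As a quick sanity check, the default tagger can be scored against the tagged Brown sentences. A sketch, assuming brown_tagged_sents from above; the NLTK book reports roughly 0.13 here, since tagging everything as NN is right only about 13% of the time:

>>> default_tagger.accuracy(brown_tagged_sents)   #called evaluate() in older NLTK versions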

The Regular Expression Tagger

The regular expression tagger assigns tags to tokens on the basis of matching patterns.

>>> patterns = [
... (r'.*ing$', 'VBG'),               # gerunds
... (r'.*ed$', 'VBD'),                # simple past
... (r'.*es$', 'VBZ'),                # 3rd singular present
... (r'.*ould$', 'MD'),               # modals
... (r'.*\'s$', 'NN$'),               # possessive nouns
... (r'.*s$', 'NNS'),                 # plural nouns
... (r'^-?[0-9]+(.[0-9]+)?$', 'CD'),  # cardinal numbers
... (r'.*', 'NN')                     # nouns (default)
... ]

#A tag dictionary can also be created from (word, tag) pairs, and a word can even be given two parts of speech at a time. For example:

>>> pos = {'colorless': 'ADJ', 'ideas': 'N', 'sleep': 'V', 'furiously': 'ADV'}
>>> pos = dict(colorless='ADJ', ideas='N', sleep='V', furiously='ADV')
#Or build one from a defaultdict
>>> from collections import defaultdict
>>> pos = defaultdict(list)
>>> pos['sleep'] = ['NOUN', 'VERB']
>>> pos['ideas']
[]
#With list as the default factory, an unseen word's part of speech is []. A word can also be given two parts of speech at once, as 'sleep' is here.

>>> regexp_tagger = nltk.RegexpTagger(patterns)

#RegexpTagger is the tagger class dedicated to regular-expression patterns

>>> regexp_tagger.tag(brown_sents[3])

#What is this brown_sents? It is the list of Brown corpus sentences loaded above; the integer index just selects which sentence to tag.

[('``', 'NN'), ('Only', 'NN'), ('a', 'NN'), ('relative', 'NN'), ('handful', 'NN'),
('of', 'NN'), ('such', 'NN'), ('reports', 'NNS'), ('was', 'NNS'), ('received', 'VBD'),
("''", 'NN'), (',', 'NN'), ('the', 'NN'), ('jury', 'NN'), ('said', 'NN'), (',', 'NN'),
('``', 'NN'), ('considering', 'VBG'), ('the', 'NN'), ('widespread', 'NN'), ...]

This tagger is a bit of an improvement: you can define regular-expression rules, and when a rule matches, the token gets the corresponding tag; otherwise it falls through to the default.
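
The same kind of check as before works here. A sketch, assuming regexp_tagger and brown_tagged_sents from above; the NLTK book reports a score of roughly 0.2 for these patterns on the Brown news sentences:

>>> regexp_tagger.accuracy(brown_tagged_sents)   #called evaluate() in older NLTK versions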

 

The Lookup Tagger

A lot of high-frequency words do not have the NN tag. Let’s find the hundred most frequent words and store their most likely tag.

This method starts to have some practical value: it tags parts of speech based on statistics over the corpus, assigning each of the most common words its most likely part of speech.

>>> fd = nltk.FreqDist(brown.words(categories='news'))

FreqDist({'The': 806, 'Fulton': 14, 'County': 35, 'Grand': 6, 'Jury': 2, 'said': 402, 'Friday': 41, ...})

# Count word frequencies from sentences

>>> cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))

#Count the number of occurrences of each word's part of speech

>>> most_freq_words = fd.keys()[:100]

#Pick out the top-100 words by frequency (in Python 3, fd.keys() is not subscriptable; use fd.most_common(100) instead)

>>> likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)
>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags)

This code takes the top 100 words from the corpus, finds the most frequent part of speech for each of these words, and builds the likely_tags dictionary.

This dictionary is then passed to UnigramTagger as its model.

UnigramTagger is a unigram tagger, i.e. a simple tagger that ignores the surrounding context.

The biggest problem with this approach is that you have only specified the parts of speech of the top 100 words; what about all the other words?

This is where the default tagger from earlier comes in handy:

baseline_tagger = nltk.UnigramTagger(model=likely_tags, backoff=nltk.DefaultTagger('NN'))

This partially solves the problem: words the model does not know are tagged by the default tagger.

The accuracy of this method depends entirely on the size of the model. Taking only the top 100 words, the accuracy may not be high, but it keeps improving as you include more words.
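
For reference, here is the whole lookup tagger written for Python 3, where fd.keys() can no longer be sliced, so most_common() is used instead. A minimal sketch:

import nltk
from nltk.corpus import brown

#Word frequencies and per-word tag frequencies from the news section
fd = nltk.FreqDist(brown.words(categories='news'))
cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))

#Top-100 words and the single most likely tag of each
most_freq_words = [word for word, _ in fd.most_common(100)]
likely_tags = {word: cfd[word].max() for word in most_freq_words}

#Lookup tagger with a default-NN backoff for everything it does not know
baseline_tagger = nltk.UnigramTagger(model=likely_tags,
                                     backoff=nltk.DefaultTagger('NN'))
print(baseline_tagger.tag(brown.sents(categories='news')[3]))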

 

N-Gram Tagging

Unigram taggers are based on a simple statistical algorithm: for each token, assign the tag that is most likely for that particular token.

The lookup tagger above already uses UnigramTagger; here is the more general way to use a UnigramTagger:

>>> from nltk.corpus import brown
>>> brown_tagged_sents = brown.tagged_sents(categories='news')
>>> brown_sents = brown.sents(categories='news')
>>> unigram_tagger = nltk.UnigramTagger(brown_tagged_sents) #Training 
>>> unigram_tagger.tag(brown_sents[2007])
[('Various', 'JJ'), ('of', 'IN'), ('the', 'AT'), ('apartments', 'NNS'),
('are', 'BER'), ('of', 'IN'), ('the', 'AT'), ('terrace', 'NN'), ('type', 'NN'),
(',', ','), ('being', 'BEG'), ('on', 'IN'), ('the', 'AT'), ('ground', 'NN'),
('floor', 'NN'), ('so', 'QL'), ('that', 'CS'), ('entrance', 'NN'), ('is', 'BEZ'),
('direct', 'JJ'), ('.', '.')]

You can train the UnigramTagger on an already-tagged corpus.

 

An n-gram tagger is a generalization of a unigram tagger whose context is the current word together with the part-of-speech tags of the n-1 preceding tokens.

An n-gram tagger takes context into account: it uses the tags of the previous n-1 words to tag the current word.

Take the bigram tagger, the n=2 special case, as an example:

#train_sents here is a tagged training set, e.g. the 90% split of brown_tagged_sents shown earlier
>>> bigram_tagger = nltk.BigramTagger(train_sents)
>>> bigram_tagger.tag(brown_sents[2007])

There is a problem here: if the context of some word in the sentence being tagged never appeared in the training set, that word cannot be tagged even when the word itself is in the training set. Again, backoff is used to solve this:

>>> t0 = nltk.DefaultTagger('NN')
>>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)
>>> t2 = nltk.BigramTagger(train_sents, backoff=t1)
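
To see how much the chain helps, the combined tagger can be scored on the held-out sentences from the 90/10 split above. A sketch; the NLTK book reports a score of roughly 0.84 for this combination:

>>> t2.accuracy(test_sents)   #called evaluate() in older NLTK versions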

 

Transformation-Based Tagging

The problems with an n-gram tagger are that its model takes up quite a lot of space, and that when it considers context it only looks at the tags of the preceding words, never the words themselves.

The tagger introduced next solves these problems nicely: it stores rules instead of a model, which saves a lot of space, and the rules are not restricted to tags; they can also look at the words themselves.

 

Brill tagging is a kind of transformation-based learning, named after its inventor. The general idea is very simple: guess the tag of each word, then go back and fix the mistakes.

 

The principle of Brill tagging can be understood from the example below:

(1) replace NN with VB when the previous word is TO;

(2) replace TO with IN when the next tag is NNS.

Phrase   to  increase  grants  to  states  for  vocational  rehabilitation
Unigram  TO  NN        NNS     TO  NNS     IN   JJ          NN
Rule 1       VB
Rule 2                         IN
Output   TO  VB        NNS     IN  NNS     IN   JJ          NN

Step one: run a unigram tagger over all the words; many of the resulting tags may be wrong.

Then the rules are used to correct the tags that were guessed wrong in the first step, finally giving a reasonably accurate tagging.

 

So how are these rules generated? The answer: they are generated automatically during the training phase.

During its training phase, the tagger guesses values for T1, T2, and C, to create thousands of candidate rules. Each rule is scored according to its net benefit: the number of incorrect tags that it corrects, less the number of correct tags it incorrectly modifies.

In other words, during the training phase, thousands of candidate rules are created first. These rules can be generated by simple statistics, so some of them may be inaccurate. Each rule is then used to fix mistakes, and the result is compared with the correct tags; the number of tags it fixes minus the number it breaks serves as the score for judging the rule. Naturally, high-scoring rules are kept and low-scoring rules are discarded. Below are some example rules:

NN -> VB if the tag of the preceding word is 'TO'
NN -> VBD if the tag of the following word is 'DT'
NN -> VBD if the tag of the preceding word is 'NNS'
NN -> NNP if the tag of words i-2...i-1 is '-NONE-'
NN -> NNP if the tag of the following word is 'NNP'
NN -> NNP if the text of words i-2...i-1 is 'like'
NN -> VBN if the text of the following word is '*-1'
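
NLTK 3 ships a trainer for exactly this kind of transformation-based tagger. A minimal sketch, assuming the train_sents split and the unigram tagger t1 from the n-gram section above, and using the ready-made brill24 rule templates:

from nltk.tag.brill import brill24
from nltk.tag.brill_trainer import BrillTaggerTrainer

#Start from the unigram tagger's guesses and learn correction rules
trainer = BrillTaggerTrainer(initial_tagger=t1, templates=brill24(), trace=1)
brill_tagger = trainer.train(train_sents, max_rules=10)

#Inspect the learned transformation rules
for rule in brill_tagger.rules():
    print(rule)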

  • But the approach above only works if you train the rules yourself, or create the rules yourself. Is there an already-trained tagger? Or can the default tagger's classification be modified?
Looking at the result of
 nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))

we get:

ConditionalFreqDist(nltk.probability.FreqDist,
                    {'The': FreqDist({'AT': 775, 'AT-HL': 3, 'AT-TL': 28}),
                     'Fulton': FreqDist({'NP': 4, 'NP-TL': 10}),
                     'County': FreqDist({'NN-TL': 35}),
                     'Grand': FreqDist({'FW-JJ-TL': 1, 'JJ-TL': 5}),
                     'Jury': FreqDist({'NN-TL': 2}),
                     'said': FreqDist({'VBD': 382, 'VBN': 20}),
                     'Friday': FreqDist({'NR': 41}),
                     'an': FreqDist({'AT': 300}),
                     'investigation': FreqDist({'NN': 9}),
                     'of': FreqDist({'IN': 2716, 'IN-HL': 5, 'IN-TL': 128}),
                     "Atlanta's": FreqDist({'NP$': 4}),
                     'recent': FreqDist({'JJ': 20}),
                     'primary': FreqDist({'JJ': 4, 'NN': 13}),
                     'election': FreqDist({'NN': 38}),
                     'produced': FreqDist({'VBD': 5, 'VBN': 1}),

From this you can see that the built-in tagger does not actually rely on a dictionary; it classifies parts of speech by rules. And the reason hate was picked out of some sentences but not others is that the uppercase HATE does not get the expected tag. So the solution is tolower: convert all the text to lowercase before it goes into the filter!
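
A sketch of that fix: lowercase the raw text before it is tokenized and tagged, so that HATE and hate are treated the same way (the helper name tag_lowercased is just for illustration):

from nltk import pos_tag, word_tokenize

def tag_lowercased(text):
    #Lowercase first, then tokenize and tag
    return pos_tag(word_tokenize(text.lower()))

print(tag_lowercased('I HATE this'))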

(runs away)

(Still, I learned quite a lot.)

