NLTK (Tagging Words)

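1. Using a part-of-speech tagger

nltk.word_tokenize(text): tokenizes the given sentence and returns a list of words.

nltk.pos_tag(words): assigns a part-of-speech tag to each word in the given list and returns a list of (word, tag) tuples.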

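import nltk

# Both calls need NLTK's tokenizer and tagger models, installed via nltk.download().
words = nltk.word_tokenize('And now for something completely different')  # split the sentence into tokens
print(words)
word_tag = nltk.pos_tag(words)  # attach a part-of-speech tag to each token
print(word_tag)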

['And', 'now', 'for', 'something', 'completely', 'different']
[('And', 'CC'), ('now', 'RB'), ('for', 'IN'), ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]
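Since pos_tag returns an ordinary list of (word, tag) tuples, the result can be processed with plain Python. The following is a minimal sketch that reuses the output shown above as a literal, so it runs without NLTK; the variable names adverbs, only_words and only_tags are illustrative choices, not part of the NLTK API.

```python
# Tagged output copied from the pos_tag example (hard-coded so NLTK is not required).
word_tag = [('And', 'CC'), ('now', 'RB'), ('for', 'IN'),
            ('something', 'NN'), ('completely', 'RB'), ('different', 'JJ')]

# Keep every word whose tag is 'RB' (adverb in the Penn Treebank tagset).
adverbs = [word for word, tag in word_tag if tag == 'RB']
print(adverbs)  # ['now', 'completely']

# Split the pairs back into two parallel tuples: one of words, one of tags.
only_words, only_tags = zip(*word_tag)
print(only_tags)  # ('CC', 'RB', 'IN', 'NN', 'RB', 'JJ')
```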

2. Tagged corpora

The str2tuple function

str2tuple(s, sep='/')
    Given the string representation of a tagged token, return the
    corresponding tuple representation.  The rightmost occurrence of
    *sep* in *s* will be used to divide *s* into a word string and
    a tag string.  If *sep* does not occur in *s*, return (s, None).

        >>> from nltk.tag.util import str2tuple
        >>> str2tuple('fly/NN')
        ('fly', 'NN')

    :type s: str
    :param s: The string representation of a tagged token.
    :type sep: str
    :param sep: The separator string used to separate word strings
        from tags.

The tag is converted to uppercase.
The default separator is sep='/'.

t = nltk.str2tuple('fly~abc', sep='~')  # custom separator
print(t)
('fly', 'ABC')

t = nltk.str2tuple('fly/abc')  # default separator '/'
print(t)
('fly', 'ABC')
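The behavior described in the docstring (split on the rightmost separator, uppercase the tag, return None as the tag when the separator is missing) can be sketched in a few lines of plain Python. This is an illustrative re-implementation under those stated rules, not NLTK's source; the helper name split_tagged_token and the sample string are made up for the example.

```python
def split_tagged_token(s, sep="/"):
    """Hypothetical helper mirroring the documented str2tuple behavior."""
    # rpartition splits on the rightmost occurrence of sep.
    word, found, tag = s.rpartition(sep)
    if not found:
        # Separator absent: the whole string is the word and there is no tag.
        return (s, None)
    return (word, tag.upper())


# Parse a whitespace-separated string of word/tag tokens (made-up sample text).
tagged_text = "fly/nn over/in New/York/np"
pairs = [split_tagged_token(token) for token in tagged_text.split()]
print(pairs)  # [('fly', 'NN'), ('over', 'IN'), ('New/York', 'NP')]
```

The last token shows why the rightmost separator matters: a word that itself contains '/' is still split at the final one.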

Reading a tagged corpus

from nltk.corpus import brown
words_tag = brown.tagged_words(categories='news')
print(words_tag[:10])

[('The', 'AT'), ('Fulton', 'NP-TL'), ('County', 'NN-TL'), ('Grand', 'JJ-TL'), ('Jury', 'NN-TL'), ('said', 'VBD'), ('Friday', 'NR'), ('an', 'AT'), ('investigation', 'NN'), ('of', 'IN')]

Simplified tags: the former simplify_tags parameter was replaced by tagset in NLTK 3.

words_tag = brown.tagged_words(categories='news',tagset = 'universal')
print(words_tag[:10])

[('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ('Grand', 'ADJ'), ('Jury', 'NOUN'), ('said', 'VERB'), ('Friday', 'NOUN'), ('an', 'DET'), ('investigation', 'NOUN'), ('of', 'ADP')]

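brown can be regarded as an instance of CategorizedTaggedCorpusReader.

CategorizedTaggedCorpusReader.tagged_words(fileids, categories): accepts file identifiers or category names and returns the words of those texts as a list of (word, tag) tuples.

CategorizedTaggedCorpusReader.tagged_sents(fileids, categories): accepts file identifiers or category names and returns the sentences of those texts as a list, where each sentence is itself a list of (word, tag) tuples.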

tagged_sents = brown.tagged_sents(categories='news')
print(tagged_sents)

[[('The', 'AT'), ..., ('.', '.')],
 [('The', 'AT'), ..., ('jury', 'NN'), ...],
 ...]
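The two return shapes differ only in nesting: tagged_sents gives a list of sentences, tagged_words gives one flat list. A minimal sketch of that relationship follows, using a small hand-written sample in place of the corpus (the data is illustrative, not copied from Brown); with the corpus installed, brown.tagged_sents(categories='news') could presumably be passed through the same flattening step.

```python
from itertools import chain

# Hand-written sample in the same shape as tagged_sents() output
# (illustrative data only, not real corpus content).
sample_sents = [
    [("The", "AT"), ("jury", "NN"), ("said", "VBD"), (".", ".")],
    [("The", "AT"), ("investigation", "NN"), (".", ".")],
]

# Flattening the list of sentences yields the flat tagged_words() shape.
sample_words = list(chain.from_iterable(sample_sents))

print(len(sample_sents))   # 2 sentences
print(len(sample_words))   # 7 (word, tag) pairs
print(sample_words[:3])    # [('The', 'AT'), ('jury', 'NN'), ('said', 'VBD')]
```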
