NLP（4）Lemmatization

Lemmatizationは、テキストの前処理の重要な部分であり、ステミングと非常によく似ています。

1.レンマ化とは

「形態的復元」の機能は、英語の単語セグメンテーション後の品詞に従って、辞書内のプロトタイプの語彙に単語を復元することです。簡単に言うと、形態的復元とは、単語の接辞を削除して単語の主要部分を抽出することです。通常、抽出された単語は辞書にある単語になります。ステミングとは異なり、抽出された単語は単語に表示されない場合があります。。たとえば、形態復元後の「車」という単語は「車」であり、形態復元後の「食べる」という単語は「食べる」です。
Pythonのnltkモジュールでは、WordNetを使用すると、堅牢な形態学的復元機能が提供されます。次のサンプルPythonコードなど：

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
# lemmatize nouns
print(wnl.lemmatize('cars', 'n'))
print(wnl.lemmatize('men', 'n'))

# lemmatize verbs
print(wnl.lemmatize('running', 'v'))
print(wnl.lemmatize('ate', 'v'))

# lemmatize adjectives
print(wnl.lemmatize('saddest', 'a'))
print(wnl.lemmatize('fancier', 'a'))

#------output--------------
car
men
run
eat
sad
fancy

上記のコードでは、wnl.lemmatize（）関数はlemmatizationを実行できます。最初のパラメーターは単語、2番目のパラメーターは名詞、動詞、形容詞などの単語の品詞であり、返される結果は入力単語の単語です。形状が復元された後の結果。

次に、単語の品詞を指定します

1. Lemmatizationは、品詞を指定する必要があります

Lemmatizationは一般的に単純ですが、特に使用する場合、指定された単語の品詞が非常に重要です。そうしないと、次のコードのように、Lemmatizationが効果的でない場合があります。

from nltk.stem import WordNetLemmatizer

wnl = WordNetLemmatizer()
print(wnl.lemmatize('ate', 'n'))
print(wnl.lemmatize('fancier', 'v'))

#------output--------------
ate
fancier

2.単語の品詞を取得する方法

では、単語の品詞を取得するにはどうすればよいですか？NLPでは、品詞（POS）テクノロジを使用して実装されます。nltkでは、nltk.pos_tag（）を使用して、次のPythonコードのように、文中の単語の品詞を取得できます。

sentence = 'The brown fox is quick and he is jumping over the lazy dog'
import nltk
tokens = nltk.word_tokenize(sentence)
tagged_sent = nltk.pos_tag(tokens)
print(tagged_sent)

#------output--------------
[('The', 'DT'), ('brown', 'JJ'), ('fox', 'NN'), ('is', 'VBZ'), ('quick', 'JJ'), ('and', 'CC'), ('he', 'PRP'), ('is', 'VBZ'), ('jumping', 'VBG'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN')]

3.品詞の比較表

ここに画像の説明を挿入

第三に、レンマ化コード

OK、文中の単語の品詞を知った後、形態学的復元と組み合わせることで、形態学的復元機能をうまく完了することができます。サンプルのPythonコードは次のとおりです。

from nltk import word_tokenize, pos_tag
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

# 获取单词的词性
def get_wordnet_pos(tag):
    if tag.startswith('J'):
        return wordnet.ADJ
    elif tag.startswith('V'):
        return wordnet.VERB
    elif tag.startswith('N'):
        return wordnet.NOUN
    elif tag.startswith('R'):
        return wordnet.ADV
    else:
        return None

sentence = 'football is a family of team sports that involve, to varying degrees, kicking a ball to score a goal.'
tokens = word_tokenize(sentence)  # 分词
tagged_sent = pos_tag(tokens)     # 获取单词词性

wnl = WordNetLemmatizer()
lemmas_sent = []
for tag in tagged_sent:
    wordnet_pos = get_wordnet_pos(tag[1]) or wordnet.NOUN
    lemmas_sent.append(wnl.lemmatize(tag[0], pos=wordnet_pos)) # 词形还原

print(lemmas_sent)

出力結果は、文中の単語の形態学的復元の結果です。