[Xiao Mu learns NLP] An introductory tutorial for Python using the NLTK library

1. Introduction

NLTK (Natural Language Toolkit) is an open source suite of Python modules, datasets, and tutorials that supports research and development in Natural Language Processing. NLTK requires Python version 3.7, 3.8, 3.9, 3.10, or 3.11.

NLTK is a Python platform for processing human language data. It provides easy-to-use interfaces to more than 50 corpora and lexical resources (such as WordNet), a set of text processing libraries for classification, tokenization, stemming, parsing, and semantic reasoning, wrappers for industrial-strength NLP libraries, and an active discussion forum.


2. Installation

2.1 Install the nltk library

Install NLTK with pip. The second command below downloads from the Tsinghua University PyPI mirror instead of the default index.

pip install nltk
# or
pip install nltk -i https://pypi.tuna.tsinghua.edu.cn/simple

Tokenization can be tested once the corpus data from section 2.2 is installed; see the word segmentation test in that section.

2.2 Install nltk corpus

NLTK ships with dozens of complete corpora that can be used for practice, including:
Gutenberg corpus: gutenberg, a small selection of texts from Project Gutenberg, an archive of about 36,000 free e-books.
Web and chat corpora: webtext, nps_chat
Brown corpus: brown
Reuters corpus: reuters
Movie review corpus: movie_reviews, a corpus of reviews labeled as positive or negative.
Inaugural address corpus: inaugural, a collection of 55 texts, each one a presidential inaugural address from a different year.

  • Method 1: Download online
import nltk
nltk.download()

In practice, this online download often fails, so the offline method below is a useful fallback.

  • Method 2: Download manually, install offline
    github: https://github.com/nltk/nltk_data/tree/gh-pages
    gitee: https://gitee.com/qwererer2/nltk_data/tree/gh-pages

  • Check which path the packages folder should be placed in. NLTK searches the directories listed in nltk.data.path.
    Rename the downloaded packages folder to nltk_data and place it so that its full path matches one of those entries (for example, nltk_data directly inside your home directory).
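One way to find the right location is to print NLTK's search path. This is a minimal sketch, assuming NLTK is already installed; the directories it prints differ by operating system and user account.

```python
import nltk

# nltk.data.path is the list of directories NLTK searches for nltk_data.
for directory in nltk.data.path:
    print(directory)
```

Any directory in that list is a valid target for the renamed nltk_data folder.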

  • Verify that the installation was successful

from nltk.book import *


  • word segmentation test
import nltk
ret = nltk.word_tokenize("A pivot is the pin or the central point on which something balances or turns")
print(ret)


  • wordnet thesaurus test

WordNet is a large lexical database of English, started in the 1980s by a team led by George Miller, a cognitive psychologist at Princeton University. Nouns, verbs, adjectives, and adverbs are stored in it as sets of synonyms (synsets).

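import nltk
nltk.download('wordnet')                  # fetch WordNet if it is not installed yet
from nltk.corpus import wordnet as wn
from nltk.corpus import brown
print(wn.synsets('dog'))                  # WordNet lookup: synsets that contain "dog"
print(brown.words())                      # also confirms that the Brown corpus is readable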


3. Test

3.1 Sentence segmentation

Sentence segmentation: nltk.sent_tokenize splits a text into sentences.
Word segmentation: nltk.word_tokenize splits text into word and punctuation tokens and returns them as a list.

from nltk.tokenize import sent_tokenize, word_tokenize

EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard."

print(sent_tokenize(EXAMPLE_TEXT))
print(word_tokenize(EXAMPLE_TEXT))

from nltk.corpus import stopwords
stop_word = set(stopwords.words('english'))    # all English stop words
word_tokens = word_tokenize(EXAMPLE_TEXT)      # all tokens in the example text
filtered_sentence = [w for w in word_tokens if w not in stop_word]  # tokens that are not stop words
print(filtered_sentence)

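To see why a trained sentence tokenizer is worth using, here is a minimal sketch of a naive splitter built on the standard library. The function name naive_sent_split is hypothetical and not part of NLTK. It splits after every '.', '?' or '!' followed by whitespace, so it wrongly breaks after the abbreviation "Mr.", which is the kind of case sent_tokenize is designed to handle.

```python
import re

def naive_sent_split(text):
    """Split after '.', '?' or '!' followed by whitespace (hypothetical helper)."""
    return re.split(r'(?<=[.?!])\s+', text.strip())

EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? The weather is great."
parts = naive_sent_split(EXAMPLE_TEXT)
print(parts)
# ['Hello Mr.', 'Smith, how are you doing today?', 'The weather is great.']
```

The first "sentence" is cut off after "Mr.", so a rule this simple is not reliable on ordinary text.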

3.2 Stop word filtering

Stop words: nltk.corpus.stopwords provides the list of English stop words.

The example below defines a function that filters English stop words. It keeps only alphabetic tokens and lowercases them, loads the English stop word list from the stop word corpus, and returns the tokens that are not in that list.

from nltk.tokenize import sent_tokenize, word_tokenize   # sentence and word tokenizers
from nltk.corpus import stopwords                       # stop word corpus
def remove_stopwords(text):
    text_lower=[w.lower() for w in text if w.isalpha()]
    stopword_set =set(stopwords.words('english'))
    result = [w for w in text_lower if w not in stopword_set]
    return result

example_text = "Stray birds of summer come to my window to sing and fly away. And yellow leaves of autumn,which have no songs,flutter and fall there with a sigh."
word_tokens = word_tokenize(example_text) 
print(remove_stopwords(word_tokens))


# Variant: work on the set of unique lowercase tokens
from nltk.tokenize import sent_tokenize, word_tokenize   # sentence and word tokenizers
from nltk.corpus import stopwords

example_text = "Stray birds of summer come to my window to sing and fly away. And yellow leaves of autumn,which have no songs,flutter and fall there with a sigh."
word_tokens = word_tokenize(example_text)

stopword_set = set(stopwords.words('english'))            # build the set once, not once per token
test_words_set = set(word.lower() for word in word_tokens)
print(test_words_set.intersection(stopword_set))          # stop words that occur in the text
filtered = [w for w in test_words_set if w not in stopword_set]
print(filtered)

3.3 Stemming

Stemming is the process of removing affixes to obtain the stem of a word; for example, fishing and fished both reduce to the stem fish. NLTK provides PorterStemmer for stemming.

from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize,word_tokenize
ps = PorterStemmer()
example_words = ["python","pythoner","pythoning","pythoned","pythonly"]
print(example_words)
for w in example_words:
    print(ps.stem(w),end=' ')

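The fishing/fished example from the description of stemming can be checked directly. This is a minimal sketch; PorterStemmer is purely algorithmic, so it should run without downloading any corpus data.

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()
# Both inflected forms should collapse to the same stem.
stems = {word: ps.stem(word) for word in ["fishing", "fished"]}
print(stems)
```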

from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize,word_tokenize
ps = PorterStemmer()

example_text = "Stray birds of summer come to my window to sing and fly away. And yellow leaves of autumn,which have no songs,flutter and fall there with a sigh."
print(example_text)
words = word_tokenize(example_text)

for w in words:
    print(ps.stem(w), end=' ')


from nltk.stem import PorterStemmer
from nltk.tokenize import sent_tokenize, word_tokenize
ps = PorterStemmer()

example_text1 = "Stray birds of summer come to my window to sing and fly away. And yellow leaves of autumn,which have no songs,flutter and fall there with a sigh."
example_text2 = "These little thoughts are the rustle of leaves; they have their whisper of joy in my mind."
example_text3 = "We, the rustling leaves, have a voice that answers the storms,but who are you so silent? I am a mere flower."
example_text4 = "The light that plays, like a naked child, among the green leaves happily knows not that man can lie."
example_text5 = "My heart beats her waves at the shore of the world and writes upon it her signature in tears with the words, I love thee."
example_text_list = [example_text1, example_text2, example_text3, example_text4, example_text5]

for sent in example_text_list:
    words = word_tokenize(sent)
    print("tokenize: ", words)

    stems = [ps.stem(w) for w in words]
    print("stem: ", stems)


3.4 Lemmatization/stem restoration

Lemmatization is similar to stemming, but stemming can produce strings that are not real words, whereas lemmatization always returns an actual word (the lemma).

from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
print('cats\t',lemmatizer.lemmatize('cats'))
print('better\t',lemmatizer.lemmatize('better',pos='a'))


from nltk.stem import WordNetLemmatizer

lemmatizer = WordNetLemmatizer()

print(lemmatizer.lemmatize("cats"))
print(lemmatizer.lemmatize("cacti"))
print(lemmatizer.lemmatize("geese"))
print(lemmatizer.lemmatize("rocks"))
print(lemmatizer.lemmatize("python"))
print(lemmatizer.lemmatize("better", pos="a"))
print(lemmatizer.lemmatize("best", pos="a"))
print(lemmatizer.lemmatize("run"))
print(lemmatizer.lemmatize("run",'v'))

The only thing to note is that lemmatize accepts a part-of-speech argument pos. If not provided, the default is "noun".

  • tenses and plural
from nltk.tokenize import sent_tokenize, word_tokenize
from nltk.stem import PorterStemmer

tokens = word_tokenize(text="All work and no play makes jack a dull boy, all work and no play,playing,played", language="english")
ps=PorterStemmer()
stems = [ps.stem(word)for word in tokens]
print(stems)

from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer('english')
ret = snowball_stemmer.stem('presumably')
print(ret)

from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()
ret = wordnet_lemmatizer.lemmatize('dogs')
print(ret)


3.5 Synonyms and antonyms

NLTK provides access to lexical databases such as WordNet, which can be used to look up synonyms and antonyms.

  • synonyms
from nltk.corpus import wordnet
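# Look up the synsets of the word "girl"
syns = wordnet.synsets('girl')
print(syns[0].name())
# Just the word (the first lemma name)
print(syns[0].lemmas()[0].name())
# Definition of the first synset
print(syns[0].definition())
# Usage examples for the first synset
print(syns[0].examples())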


  • Synonyms and Antonyms
from nltk.corpus import wordnet
synonyms = []  # collected synonyms
antonyms = []  # collected antonyms
for syn in wordnet.synsets('bad'):
    for i in syn.lemmas():
        synonyms.append(i.name())
        if i.antonyms():
            antonyms.append(i.antonyms()[0].name())

print(set(synonyms))
print(set(antonyms))


3.6 Semantic relevance

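WordNet's wup_similarity() method computes the Wu-Palmer similarity between two synsets, a score based on the depths of the two synsets and of their most specific common ancestor in the WordNet hierarchy.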

from nltk.corpus import wordnet

w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('boat.n.01')
print(w1.wup_similarity(w2))

w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('car.n.01')
print(w1.wup_similarity(w2))

w1 = wordnet.synset('ship.n.01')
w2 = wordnet.synset('cat.n.01')
print(w1.wup_similarity(w2))


NLTK provides a variety of similarity scorers, such as:

  • path_similarity
  • lch_similarity
  • wup_similarity
  • res_similarity
  • jcn_similarity
  • lin_similarity
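
To make the Wu-Palmer score less of a black box, the following minimal sketch computes it on a tiny hand-written taxonomy using only the standard library. The PARENT table and the helper names are hypothetical and are not WordNet data. The formula used is 2 * depth(lcs) / (depth(a) + depth(b)), where lcs is the deepest common ancestor and the root has depth 1.

```python
# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "vehicle": "entity",
    "animal": "entity",
    "vessel": "vehicle",
    "car": "vehicle",
    "ship": "vessel",
    "boat": "vessel",
    "cat": "animal",
}

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path

def depth(node):
    """Number of nodes from the root down to node (the root has depth 1)."""
    return len(path_to_root(node))

def wup(a, b):
    """Wu-Palmer similarity on the toy taxonomy."""
    ancestors_a = path_to_root(a)
    # Walking up from b, the first node shared with a is the deepest common ancestor.
    lcs = next(n for n in path_to_root(b) if n in ancestors_a)
    return 2 * depth(lcs) / (depth(a) + depth(b))

for other in ["boat", "car", "cat"]:
    print("ship", other, round(wup("ship", other), 3))
```

On this toy tree the scores fall in the order boat, car, cat: the closer the shared ancestor sits to both concepts, the higher the score.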

3.7 Part-of-speech tagging

Label the words in a sentence as nouns, adjectives, verbs, etc.

from nltk.tokenize import sent_tokenize, word_tokenize   # sentence and word tokenizers

example_text = "Stray birds of summer come to my window to sing and fly away. And yellow leaves of autumn,which have no songs,flutter and fall there with a sigh."
word_tokens = word_tokenize(example_text) 

from nltk import pos_tag
tags = pos_tag(word_tokens)
print(tags)


  • The tags are defined as follows (POS tag list):
| POS Tag | Meaning | Example |
| --- | --- | --- |
| CC | coordinating conjunction | |
| CD | cardinal number | |
| DT | determiner | |
| EX | existential there | "there is" (think of it as "there exists") |
| FW | foreign word | |
| IN | preposition/subordinating conjunction | |
| JJ | adjective | big |
| JJR | adjective, comparative | bigger |
| JJS | adjective, superlative | biggest |
| LS | list item marker | 1) |
| MD | modal | could, will |
| NN | noun, singular | desk |
| NNS | noun, plural | desks |
| NNP | proper noun, singular | Harrison |
| NNPS | proper noun, plural | Americans |
| PDT | predeterminer | all the kids |
| POS | possessive ending | parent's |
| PRP | personal pronoun | I, he, she |
| PRP$ | possessive pronoun | my, his, hers |
| RB | adverb | very, silently |
| RBR | adverb, comparative | better |
| RBS | adverb, superlative | best |
| RP | particle | give up |
| TO | to | go "to" the store |
| UH | interjection | errrrrrrrm |
| VB | verb, base form | take |
| VBD | verb, past tense | took |
| VBG | verb, gerund/present participle | taking |
| VBN | verb, past participle | taken |
| VBP | verb, non-3rd person singular present | take |
| VBZ | verb, 3rd person singular present | takes |
| WDT | wh-determiner | which |
| WP | wh-pronoun | who, what |
| WP$ | possessive wh-pronoun | whose |
| WRB | wh-adverb | where, when |
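
These tags can feed the pos argument of the lemmatizer from section 3.4. Below is a minimal sketch of a commonly used mapping from Penn Treebank tags to WordNet part-of-speech letters. The helper name penn_to_wordnet is hypothetical, and falling back to noun for every other tag is an assumption that mirrors the lemmatizer's default.

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS letter (hypothetical helper)."""
    if tag.startswith("JJ"):
        return "a"   # adjective
    if tag.startswith("VB"):
        return "v"   # verb
    if tag.startswith("RB"):
        return "r"   # adverb
    return "n"       # noun, also used as the fallback

tagged = [("birds", "NNS"), ("sing", "VB"), ("better", "JJR"), ("silently", "RB")]
mapped = [(word, penn_to_wordnet(tag)) for word, tag in tagged]
print(mapped)
```

Each pair can then be passed on as lemmatizer.lemmatize(word, pos=letter).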

3.8 Named Entity Recognition

Named entity recognition (NER) is the first step in information extraction. It aims to find named entities in text and classify them into predefined categories such as person names, organizations, locations, times, quantities, monetary values, and percentages.


import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag

ex= 'European authorities fined Google a record $5.1 billion on Wednesday for abusing its power in the mobile phone market and ordered the company to alter its practices'

def preprocess(sent):
    sent= nltk.word_tokenize(sent)
    sent= nltk.pos_tag(sent)
    return sent

# Tokenize and tag parts of speech
sent= preprocess(ex)
print(sent)

# Noun phrase chunking: optional determiner, any number of adjectives, then a noun
pattern='NP: {<DT>?<JJ>*<NN>}'
cp= nltk.RegexpParser(pattern)
cs= cp.parse(sent)
print(cs)

# IOB tags
from nltk.chunk import conlltags2tree, tree2conlltags
from pprint import pprint
iob_tagged= tree2conlltags(cs)
pprint(iob_tagged)

# Classifier-based named entity recognition with category labels (such as PERSON, ORGANIZATION and GPE)
from nltk import ne_chunk
ne_tree= ne_chunk(pos_tag(word_tokenize(ex)))
print(ne_tree)

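The chunking and IOB steps can be tried in isolation on a hand-tagged sentence, which avoids the tokenizer and tagger models. This is a minimal sketch; the token list is made up for illustration, and the grammar is the NP pattern from the example above.

```python
import nltk
from nltk.chunk import tree2conlltags

# Hand-tagged tokens (illustrative), so no tagger model is needed.
tagged = [("the", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD")]

# NP = optional determiner, any number of adjectives, then a singular noun.
parser = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = parser.parse(tagged)

# Each token becomes (word, pos, chunk tag): B- begins a chunk, I- continues it, O is outside.
iob = tree2conlltags(tree)
print(iob)
```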


import nltk
from nltk.tokenize import word_tokenize
from nltk.tag import pos_tag
from nltk.chunk import conlltags2tree, tree2conlltags

def learnAnaphora():
    sentences = [
        "John is a man. He walks",
        "John and Mary are married. They have two kids",
        "In order for Ravi to be successful, he should follow John",
        "John met Mary in Barista. She asked him to order a Pizza"
    ]

    for sent in sentences:
        chunks = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sent)), binary=False)
        stack = []
        print(sent)
        items = tree2conlltags(chunks)
        for item in items:
            if item[1] == 'NNP' and (item[2] == 'B-PERSON' or item[2] == 'O'):
                stack.append(item[0])
            elif item[1] == 'CC':
                stack.append(item[0])
            elif item[1] == 'PRP':
                stack.append(item[0])
        print("\t {}".format(stack)) 
    
learnAnaphora()



import nltk

sentence = 'Peterson first suggested the name "open source" at Palo Alto, California'

# Preprocess first
words = nltk.word_tokenize(sentence)
pos_tagged = nltk.pos_tag(words)

# Run the named entity chunker
ne_tagged = nltk.ne_chunk(pos_tagged)
print("NE tagged text:")
print(ne_tagged)

# Extract only the named entities from the tree
print("Recognized named entities:")
for ne in ne_tagged:
    if hasattr(ne, "label"):
        print(ne.label(), ne[0:])

ne_tagged.draw()

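The extraction loop works on any nltk.Tree, so it can be exercised without the chunker. This is a minimal sketch on a hand-built tree; the tree is written by hand for illustration and is not actual ne_chunk output.

```python
from nltk import Tree

# Hand-built tree (illustrative): subtrees carry an entity label, plain tuples do not.
tree = Tree("S", [
    Tree("PERSON", [("Peterson", "NNP")]),
    ("suggested", "VBD"),
    ("the", "DT"),
    ("name", "NN"),
    ("at", "IN"),
    Tree("GPE", [("Palo", "NNP"), ("Alto", "NNP")]),
])

entities = []
for node in tree:
    if isinstance(node, Tree):
        # Join the words of the subtree's (word, tag) leaves.
        entities.append((node.label(), " ".join(word for word, tag in node.leaves())))
print(entities)
```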
NLTK's built-in named entity chunker is trained on corpus data from the Automatic Content Extraction (ACE) program, distributed through the Linguistic Data Consortium at the University of Pennsylvania. It can identify common entity types such as organization (ORGANIZATION), person (PERSON), location (LOCATION), facility (FACILITY), and geopolitical entity (GPE).

NLTK can also use other taggers, such as the Stanford Named Entity Recognizer. That tagger is written in Java, but NLTK provides an interface to it (see nltk.parse.stanford or nltk.tag.stanford for details).

3.9 Text object

from nltk.tokenize import sent_tokenize, word_tokenize   # sentence and word tokenizers

example_text = "Stray birds of summer come to my window to sing and fly away. And yellow leaves of autumn,which have no songs,flutter and fall there with a sigh."
word_tokens = word_tokenize(example_text) 
word_tokens = [word.lower() for word in word_tokens]

from nltk.text import Text
t = Text(word_tokens)
print(t.count('and') )
print(t.index('and') )
t.plot(8)

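For intuition, count and index on a Text behave like their list counterparts, and plot(8) charts the eight most frequent tokens. Below is a minimal standard-library sketch of the same quantities, using collections.Counter as a stand-in rather than NLTK's own implementation.

```python
from collections import Counter

tokens = "stray birds of summer come to my window to sing and fly away".split()

freq = Counter(tokens)
print(tokens.count("to"))      # number of occurrences of a token
print(tokens.index("to"))      # position of its first occurrence
print(freq.most_common(3))     # the kind of data a frequency plot displays
```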

3.10 Text Classification

import nltk
import random
from nltk.corpus import movie_reviews

documents = [(list(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

random.shuffle(documents)

print(documents[1])

all_words = []
for w in movie_reviews.words():
    all_words.append(w.lower())

all_words = nltk.FreqDist(all_words)
print(all_words.most_common(15))
print(all_words["stupid"])

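A common next step, which the listing above stops short of, is to turn each review into a feature dictionary for a classifier. This is a minimal standard-library sketch; the function name document_features, the tiny vocabulary, and the sample review are assumptions for illustration.

```python
def document_features(document_words, vocabulary):
    """Mark which vocabulary words occur in the document (hypothetical helper)."""
    present = set(document_words)
    return {f"contains({word})": (word in present) for word in vocabulary}

vocabulary = ["stupid", "great", "boring"]            # stand-in for the most frequent words
review = ["a", "great", "film", "never", "boring", "great"]

features = document_features(review, vocabulary)
print(features)
```

A list of (features, category) pairs built this way has the shape that nltk.NaiveBayesClassifier.train expects.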

3.11 Other classifiers

  • Listed below are the classifiers that come with NLTK:
from nltk.classify.api import ClassifierI, MultiClassifierI
from nltk.classify.megam import config_megam, call_megam
from nltk.classify.weka import WekaClassifier, config_weka
from nltk.classify.naivebayes import NaiveBayesClassifier
from nltk.classify.positivenaivebayes import PositiveNaiveBayesClassifier
from nltk.classify.decisiontree import DecisionTreeClassifier
from nltk.classify.rte_classify import rte_classifier, rte_features, RTEFeatureExtractor
from nltk.classify.util import accuracy, apply_features, log_likelihood
from nltk.classify.scikitlearn import SklearnClassifier
from nltk.classify.maxent import (MaxentClassifier, BinaryMaxentFeatureEncoding,TypedMaxentFeatureEncoding,ConditionalExponentialClassifier)
  • Predict gender by name

import random
import nltk
from nltk.corpus import names
from nltk import classify

# Feature: the last letter of the name
def gender_features(word):
    return {'last_letter': word[-1]}

# Prepare the data
name=[(n,'male') for n in names.words('male.txt')]+[(n,'female') for n in names.words('female.txt')]
random.shuffle(name)   # mix male and female names so the held-out slice contains both
print(len(name))

# Extract features and train the model
features=[(gender_features(n),g) for (n,g) in name]
classifier = nltk.NaiveBayesClassifier.train(features[:6000])

# Test: classify a few names, then report accuracy on the held-out names
print(classifier.classify(gender_features('Frank')))
print(classifier.classify(gender_features('Tom')))
print(classifier.classify(gender_features('Sonya')))
print(classify.accuracy(classifier, features[6000:]))


  • Sentiment analysis
from nltk.classify import NaiveBayesClassifier


def word_feats(words):
    # One boolean feature per word in the list
    return dict([(word, True) for word in words])

# Prepare the data
positive_vocab = ['awesome', 'outstanding', 'fantastic', 'terrific', 'good', 'nice', 'great', ':)']
negative_vocab = ['bad', 'terrible', 'useless', 'hate', ':(']
neutral_vocab = ['movie', 'the', 'sound', 'was', 'is', 'actors', 'did', 'know', 'words', 'not']

# Extract features (each word is wrapped in a list so it is not split into characters)
positive_features = [(word_feats([pos]), 'pos') for pos in positive_vocab]
negative_features = [(word_feats([neg]), 'neg') for neg in negative_vocab]
neutral_features = [(word_feats([neu]), 'neu') for neu in neutral_vocab]

train_set = negative_features + positive_features + neutral_features

# Train
classifier = NaiveBayesClassifier.train(train_set)

# Test
neg = 0
pos = 0
sentence = "Awesome movie, I liked it"
sentence = sentence.lower()
words = sentence.split(' ')
for word in words:
    classResult = classifier.classify(word_feats([word]))
    if classResult == 'neg':
        neg = neg + 1
    if classResult == 'pos':
        pos = pos + 1

print('Positive: ' + str(float(pos) / len(words)))
print('Negative: ' + str(float(neg) / len(words)))


3.12 Data cleaning

  • Imports and a sample text (the sample is illustrative)
import re
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

text = "RT @user: NLTK is great &amp; fun #nlp $AAPL is up\nsee https://example.com/page for more"
  • Remove HTML entities (such as &amp;), hashtags, and @mentions
text_no_special_html_label = re.sub(r'\&\w+;|#\w*|\@\w*','',text)
print(text_no_special_html_label)
  • Remove links (each link ends at the next whitespace)
text_no_link = re.sub(r'https?://\S+','',text_no_special_html_label)
print(text_no_link)
  • Remove line breaks (replaced by a space so that words do not merge)
text_no_next_line = re.sub(r'\n',' ',text_no_link)
print(text_no_next_line)
  • Remove $-prefixed tokens (such as stock tickers)
text_no_dollar = re.sub(r'\$\w*\s','',text_no_next_line)
print(text_no_dollar)
  • Remove very short words (one or two characters)
text_no_short = re.sub(r'\b\w{1,2}\b','',text_no_dollar)
print(text_no_short)
  • Remove extra spaces
text_no_more_space = re.sub(r'\s+',' ',text_no_short)
print(text_no_more_space)
  • Tokenize with NLTK
tokens = word_tokenize(text_no_more_space)
tokens_lower = [s.lower() for s in tokens]
print(tokens_lower)
  • Remove stop words
cache_english_stopwords = stopwords.words('english')
tokens_stopwords = [s for s in tokens_lower if s not in cache_english_stopwords]
print(tokens_stopwords)
print(" ".join(tokens_stopwords))
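
The regular-expression steps can also be wrapped in one function and checked without NLTK. This is a minimal standard-library sketch; the function name clean_text and the sample string are hypothetical. The link pattern here stops at the first whitespace, and line breaks are replaced by a space so that neighboring words do not merge.

```python
import re

def clean_text(text):
    """Apply the cleaning steps in order and return a lowercase string."""
    text = re.sub(r'&\w+;|#\w*|@\w*', '', text)     # HTML entities, hashtags, mentions
    text = re.sub(r'https?://\S+', '', text)        # links
    text = re.sub(r'\n', ' ', text)                 # line breaks
    text = re.sub(r'\$\w*\s', '', text)             # $-prefixed tokens
    text = re.sub(r'\b\w{1,2}\b', '', text)         # words of one or two characters
    text = re.sub(r'\s+', ' ', text)                # runs of whitespace
    return text.strip().lower()

sample = "Loving #nlp &amp; NLTK $AAPL rocks\nsee https://example.com now"
cleaned = clean_text(sample)
print(cleaned)
# loving nltk rocks see now
```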

In addition to NLTK, spaCy has been widely used in recent years. It covers similar ground, and it is generally regarded as more capable and more frequently updated.

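Epilogue

If you found this method or code even a little useful, you can give the author a like or buy the author a coffee; ╮( ̄▽ ̄)╭
If you think the method or code is not great //(ㄒoㄒ)//, leave a comment and the author will keep improving it; o_O???
If you need custom development of related functionality, you can send the author a private message; (✿◡‿◡)
Thanks to everyone for your support! ( ´ ▽´ )ノ ( ´ ▽´)! ! !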

Origin blog.csdn.net/hhy321/article/details/132643832