NLTK and Natural Language Processing Basics
NLTK (Natural Language Toolkit)
NLTK is a well-known Python natural language processing toolkit, aimed mainly at English text. It ships with documentation, corpora, and an accompanying book.
- The most commonly used Python library in the NLP field
- Open source project
- Built-in functions for classification, tokenization, and more
- Strong community support
- Corpora: collections of language material as it actually appears in real use
- http://www.nltk.org/py-modindex.html
The NLTK homepage explains how to install NLTK on Mac, Linux, and Windows: http://nltk.org/install.html. Downloading Anaconda directly is recommended, since it saves you from installing most packages by hand. Once NLTK is installed, run import nltk to test it; if that works, you can also download the official NLTK corpora.
Installation steps:
- Download the NLTK package:
pip install nltk
- Run Python and enter the following commands:
import nltk
nltk.download()
- In the window that pops up, it is recommended to select "all" and install every package
[Image: ../images/nltk_install.png]
- Test the installation:
[Image: ../images/nltk_test.png]
Corpus
nltk.corpus
import nltk
# The brown corpus needs to be downloaded first
from nltk.corpus import brown
# Use the Brown University corpus
# List the categories included in the corpus
print(brown.categories())
# Inspect the size of the brown corpus
print('{} sentences in total'.format(len(brown.sents())))
print('{} words in total'.format(len(brown.words())))
Output:
['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']
57340 sentences in total
1161192 words in total
Word segmentation (tokenization)
- Splitting a sentence into words that are meaningful at the semantic level
- Differences between Chinese and English:
- In English, spaces serve as natural delimiters between words
- Chinese has no formal word delimiter, so segmentation is more complex than in English
- Chinese word segmentation tools, for example jieba:
pip install jieba
- Once segmentation is done, subsequent processing differs little from English
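The delimiter difference can be seen with plain Python before reaching for any segmenter (a minimal sketch; the example sentences are made up):

```python
# English words are separated by spaces, so a plain split already
# yields word tokens; a Chinese sentence has no such delimiter and
# comes back as a single unsplit token.
english = "the quick brown fox"
chinese = "欢迎来到北京"  # "Welcome to Beijing", written without spaces

print(english.split())  # ['the', 'quick', 'brown', 'fox']
print(chinese.split())  # ['欢迎来到北京'] -- one unsplit token
```

This is exactly the gap that a dedicated segmenter such as jieba fills for Chinese.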
# Import the jieba segmenter
import jieba
seg_list = jieba.cut("欢迎来到黑马程序员Python学科", cut_all=True)
print("Full mode: " + "/ ".join(seg_list))  # full mode
seg_list = jieba.cut("欢迎来到黑马程序员Python学科", cut_all=False)
print("Precise mode: " + "/ ".join(seg_list))  # precise (default) mode
Output:
Full mode: 欢迎/ 迎来/ 来到/ 黑马/ 程序/ 程序员/ Python/ 学科
Precise mode: 欢迎/ 来到/ 黑马/ 程序员/ Python/ 学科
The word form problem
- look, looked, looking
- Different forms of the same word reduce the accuracy of learning from a corpus
- Solution: word form normalization
1. Stemming
Example:
# PorterStemmer
from nltk.stem.porter import PorterStemmer
porter_stemmer = PorterStemmer()
print(porter_stemmer.stem('looked'))
print(porter_stemmer.stem('looking'))
# Output:
# look
# look
Example:
# SnowballStemmer
from nltk.stem import SnowballStemmer
snowball_stemmer = SnowballStemmer('english')
print(snowball_stemmer.stem('looked'))
print(snowball_stemmer.stem('looking'))
# Output:
# look
# look
Example:
# LancasterStemmer
from nltk.stem.lancaster import LancasterStemmer
lancaster_stemmer = LancasterStemmer()
print(lancaster_stemmer.stem('looked'))
print(lancaster_stemmer.stem('looking'))
# Output:
# look
# look
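All three stemmers are rule-based suffix strippers at heart. The core idea can be sketched in a few lines of plain Python (a toy rule set for illustration only; the real Porter algorithm applies many more rules and conditions):

```python
# Toy stemmer: strip a known suffix if the remaining stem is long
# enough. Real stemmers also rewrite endings (e.g. 'ies' -> 'i').
SUFFIXES = ("ing", "ed", "s")

def naive_stem(word):
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

print(naive_stem("looked"))   # look
print(naive_stem("looking"))  # look
print(naive_stem("cats"))     # cat
```

Note that this approach knows nothing about meaning: it would happily turn "sing" into "s" without the minimum-length guard, which is why real stemmers carry so many extra conditions.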
2. Lemmatization
- Stemming: strip affixes such as -ing or -ed, keeping only the word stem
- Lemmatization: merge the various inflected forms of a word into a single canonical form, e.g. am, is, are -> be; went -> go
- Stemmers in NLTK:
PorterStemmer, SnowballStemmer, LancasterStemmer
- Lemmatizer in NLTK:
WordNetLemmatizer
- Problem:
went as a verb -> go, but Went as a noun (e.g. a proper name) -> Went
- Indicating the part of speech makes lemmatization more accurate
Example:
from nltk.stem import WordNetLemmatizer
# The wordnet corpus needs to be downloaded first
wordnet_lematizer = WordNetLemmatizer()
print(wordnet_lematizer.lemmatize('cats'))
print(wordnet_lematizer.lemmatize('boxes'))
print(wordnet_lematizer.lemmatize('are'))
print(wordnet_lematizer.lemmatize('went'))
# Output:
# cat
# box
# are
# went
Example:
# Specifying the part of speech makes lemmatization more accurate
# lemmatize treats words as nouns by default
print(wordnet_lematizer.lemmatize('are', pos='v'))
print(wordnet_lematizer.lemmatize('went', pos='v'))
# Output:
# be
# go
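Since lemmatize defaults to nouns, a common pattern is to feed it the tags produced by nltk.pos_tag. Penn Treebank tags (NN, VBD, ...) must first be mapped to WordNet's single-letter POS codes; a sketch of that mapping follows (the helper name penn_to_wordnet is our own, and the commented usage assumes the wordnet and averaged_perceptron_tagger resources have been downloaded):

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank tag to a WordNet POS code."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun (also the lemmatizer's default)

print(penn_to_wordnet('VBD'))  # v
print(penn_to_wordnet('NNS'))  # n

# Usage sketch (requires the corpora mentioned above):
# from nltk import pos_tag, word_tokenize
# from nltk.stem import WordNetLemmatizer
# lemmatizer = WordNetLemmatizer()
# tagged = pos_tag(word_tokenize('She went home'))
# print([lemmatizer.lemmatize(w, penn_to_wordnet(t)) for w, t in tagged])
```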
3. Part-of-speech tagging (Part-Of-Speech)
- POS tagging in NLTK:
nltk.pos_tag()
Example:
import nltk
words = nltk.word_tokenize('Python is a widely used programming language.')
print(nltk.pos_tag(words))  # the averaged_perceptron_tagger resource needs to be downloaded first
# Output:
# [('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('widely', 'RB'), ('used', 'VBN'), ('programming', 'NN'), ('language', 'NN'), ('.', '.')]
4. Removing stop words
- To save storage space and improve search efficiency, NLP pipelines automatically filter out certain words or phrases
- Stop words are compiled by hand into a stop list; they are not generated automatically
- Categories:
Function words, such as the, is, ...
Content words in very broad use, such as want
- Chinese stop lists:
Chinese stop-word list
HIT (Harbin Institute of Technology) stop list
Stop list from the Machine Intelligence Laboratory of Sichuan University
Baidu stop-word list
- Stop lists for other languages:
http://www.ranks.nl/stopwords
- Removing stop words with NLTK:
stopwords.words()
Example:
from nltk.corpus import stopwords  # the stopwords corpus needs to be downloaded first
filtered_words = [word for word in words if word not in stopwords.words('english')]
print('Original words:', words)
print('After removing stop words:', filtered_words)
# Output:
# Original words: ['Python', 'is', 'a', 'widely', 'used', 'programming', 'language', '.']
# After removing stop words: ['Python', 'widely', 'used', 'programming', 'language', '.']
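One practical note: stopwords.words('english') returns a list, so the membership test in the comprehension above is linear per word. Converting the list to a set once makes each lookup constant-time, which matters on large texts. A sketch, with a tiny hand-written list standing in for the NLTK one:

```python
# Stand-in for set(stopwords.words('english')); illustration only.
stop_words = set(["is", "a", "the", "of"])

words = ['Python', 'is', 'a', 'widely', 'used', 'programming', 'language', '.']
filtered = [w for w in words if w not in stop_words]
print(filtered)  # ['Python', 'widely', 'used', 'programming', 'language', '.']
```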
A typical text preprocessing pipeline
Example:
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords
# Raw text
raw_text = 'Life is like a box of chocolates. You never know what you\'re gonna get.'
# Tokenize
raw_words = nltk.word_tokenize(raw_text)
# Normalize word forms
wordnet_lematizer = WordNetLemmatizer()
words = [wordnet_lematizer.lemmatize(raw_word) for raw_word in raw_words]
# Remove stop words
filtered_words = [word for word in words if word not in stopwords.words('english')]
print('Raw text:', raw_text)
print('Preprocessing result:', filtered_words)
Output:
Raw text: Life is like a box of chocolates. You never know what you're gonna get.
Preprocessing result: ['Life', 'like', 'box', 'chocolate', '.', 'You', 'never', 'know', "'re", 'gon', 'na', 'get', '.']
Use case:
import nltk
from nltk.tokenize import WordPunctTokenizer
sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')
paragraph = "The first time I heard that song was in Hawaii on radio. I was just a kid, and loved it very much! What a fantastic song!"
# Split into sentences
sentences = sent_tokenizer.tokenize(paragraph)
print(sentences)
sentence = "Are you old enough to remember Michael Jackson attending. the Grammys with Brooke Shields and Webster sat on his lap during the show?"
# Tokenize into words
words = WordPunctTokenizer().tokenize(sentence.lower())
print(words)
Output:
['The first time I heard that song was in Hawaii on radio.', 'I was just a kid, and loved it very much!', 'What a fantastic song!']
['are', 'you', 'old', 'enough', 'to', 'remember', 'michael', 'jackson', 'attending', '.', 'the', 'grammys', 'with', 'brooke', 'shields', 'and', 'webster', 'sat', 'on', 'his', 'lap', 'during', 'the', 'show', '?']