Data Analysis Study Notes, Day 27: NLTK and Natural Language Processing Basics (installation steps, corpora, tokenization, word-form problems)

NLTK and Natural Language Processing Basics

NLTK (Natural Language Toolkit)

NLTK is a well-known Python toolkit for natural language processing, although it mainly targets English text. NLTK ships with documentation, corpora, and an accompanying book.

  • The most commonly used Python library in the NLP field
  • Open-source project
  • Built-in classification, tokenization, and other functions
  • Strong community support
  • Corpora: collections of language material drawn from actual, real-world usage
  • http://www.nltk.org/py-modindex.html

The NLTK home page explains in detail how to install NLTK on Mac, Linux, and Windows: http://nltk.org/install.html. Downloading Anaconda directly is recommended, since it saves installing most packages by hand. Once NLTK is installed, run import nltk to test it; if that works, also download the corpora that NLTK officially provides.

Installation steps:

  1. Install the NLTK package: pip install nltk

  2. Run Python and enter the following commands:

     import nltk
     nltk.download()
    
  3. In the pop-up window that appears, it is recommended to install all packages ("all"):

    [Image: NLTK download manager window (images/nltk_install.png)]

  4. Test the installation:

    [Image: NLTK test run (images/nltk_test.png)]
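If the interactive downloader is inconvenient (for example on a server), the same resources can be fetched by name instead of installing "all". A minimal sketch, fetching just the resources used later in these notes:

import nltk

# download each named resource without opening the GUI
for resource in ['brown', 'punkt', 'wordnet',
                 'averaged_perceptron_tagger', 'stopwords']:
    nltk.download(resource)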

Corpus

nltk.corpus
import nltk
from nltk.corpus import brown  # the brown corpus must be downloaded first
# use the corpus from Brown University

# list the categories contained in the corpus
print(brown.categories())

# inspect the brown corpus
print('There are {} sentences in total'.format(len(brown.sents())))
print('There are {} words in total'.format(len(brown.words())))

Output:

['adventure', 'belles_lettres', 'editorial', 'fiction', 'government', 'hobbies', 'humor', 'learned', 'lore', 'mystery', 'news', 'religion', 'reviews', 'romance', 'science_fiction']

There are 57340 sentences in total
There are 1161192 words in total
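The corpus can also be sliced by category. A small sketch built on the same brown corpus; the category name comes from the brown.categories() output above:

from nltk.corpus import brown

# restrict the word list to a single category
news_words = brown.words(categories='news')
print('news category: {} words'.format(len(news_words)))
print(news_words[:10])  # peek at the first ten tokens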

Tokenization (word segmentation)

  • Splitting a sentence into words that carry meaning at the semantic level of the language
  • Differences between Chinese and English tokenization:
    • In English, the spaces between words serve as natural delimiters
    • Chinese has no formal word delimiter, so segmentation is more complex than in English
  • Chinese word segmentation tools, e.g. jieba: pip install jieba
  • Once segmentation results are obtained, subsequent processing differs little from English
# import the jieba segmenter
import jieba

seg_list = jieba.cut("欢迎来到黑马程序员Python学科", cut_all=True)
print("Full mode: " + "/ ".join(seg_list))  # full mode

seg_list = jieba.cut("欢迎来到黑马程序员Python学科", cut_all=False)
print("Precise mode: " + "/ ".join(seg_list))  # precise mode (the default)

Output:

Full mode: 欢迎/ 迎来/ 来到/ 黑马/ 程序/ 程序员/ Python/ 学科
Precise mode: 欢迎/ 来到/ 黑马/ 程序员/ Python/ 学科
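Besides full and precise mode, jieba also provides a search-engine mode, which re-segments long words on top of precise mode to improve recall when building a search index. A brief sketch on the same sentence:

import jieba

# search-engine mode: further splits long words for indexing
seg_list = jieba.cut_for_search("欢迎来到黑马程序员Python学科")
print("Search mode: " + "/ ".join(seg_list))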

Word Form Problems

  • look, looked, looking
  • Varying forms of the same word reduce the accuracy of learning from a corpus
  • Solution: word-form normalization

1. Stemming

Example:

# PorterStemmer
from nltk.stem.porter import PorterStemmer

porter_stemmer = PorterStemmer()
print(porter_stemmer.stem('looked'))
print(porter_stemmer.stem('looking'))

# Output:
# look
# look

Example:

# SnowballStemmer
from nltk.stem import SnowballStemmer

snowball_stemmer = SnowballStemmer('english')
print(snowball_stemmer.stem('looked'))
print(snowball_stemmer.stem('looking'))

# Output:
# look
# look

Example:

# LancasterStemmer
from nltk.stem.lancaster import LancasterStemmer

lancaster_stemmer = LancasterStemmer()
print(lancaster_stemmer.stem('looked'))
print(lancaster_stemmer.stem('looking'))

# Output:
# look
# look
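The three stemmers implement different rule sets and do not always agree: Porter is generally the most conservative and Lancaster the most aggressive. A quick comparison sketch (outputs vary from word to word):

from nltk.stem.porter import PorterStemmer
from nltk.stem import SnowballStemmer
from nltk.stem.lancaster import LancasterStemmer

# run the same words through all three stemmers side by side
for word in ['looking', 'maximum', 'presumably']:
    print(word,
          PorterStemmer().stem(word),
          SnowballStemmer('english').stem(word),
          LancasterStemmer().stem(word))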

2. Lemmatization (word-form merging)

  • Stemming: strips affixes such as -ing and -ed, keeping only the word stem

  • Lemmatization: merges the various inflected forms of a word into one form, e.g. am, is, are -> be; went -> go

  • Stemmers in NLTK

    PorterStemmer, SnowballStemmer, LancasterStemmer

  • Lemmatizer in NLTK

    WordNetLemmatizer

  • Problem

    'went' as a verb should map to 'go', but 'Went' as a noun (e.g. a surname) should stay 'Went'

  • Specifying the part of speech allows more accurate lemmatization

Example:

from nltk.stem import WordNetLemmatizer 
# the wordnet corpus must be downloaded first

wordnet_lematizer = WordNetLemmatizer()
print(wordnet_lematizer.lemmatize('cats'))
print(wordnet_lematizer.lemmatize('boxes'))
print(wordnet_lematizer.lemmatize('are'))
print(wordnet_lematizer.lemmatize('went'))

# Output:
# cat
# box
# are
# went

Example:

# specifying the part of speech gives a more accurate lemma
# lemmatize defaults to treating words as nouns
print(wordnet_lematizer.lemmatize('are', pos='v'))
print(wordnet_lematizer.lemmatize('went', pos='v'))

# Output:
# be
# go

3. Part-of-Speech Tagging

  • POS tagging in NLTK

    nltk.pos_tag()

Example:

import nltk

words = nltk.word_tokenize('Python is a widely used programming language.')
print(nltk.pos_tag(words))  # the averaged_perceptron_tagger model must be downloaded first

# Output:
# [('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('widely', 'RB'), ('used', 'VBN'), ('programming', 'NN'), ('language', 'NN'), ('.', '.')]
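The tagger and the lemmatizer can be combined: tag each word first, then convert the Penn Treebank tag to the WordNet POS that lemmatize() expects. A sketch; penn_to_wordnet is a hypothetical helper written for this example:

import nltk
from nltk.corpus import wordnet
from nltk.stem import WordNetLemmatizer

def penn_to_wordnet(tag):
    # hypothetical helper: map a Penn Treebank tag to a WordNet POS constant
    if tag.startswith('J'):
        return wordnet.ADJ
    if tag.startswith('V'):
        return wordnet.VERB
    if tag.startswith('R'):
        return wordnet.ADV
    return wordnet.NOUN  # lemmatize() defaults to noun anyway

lemmatizer = WordNetLemmatizer()
tagged = nltk.pos_tag(nltk.word_tokenize('The cats went home'))
for word, tag in tagged:
    print(word, '->', lemmatizer.lemmatize(word, pos=penn_to_wordnet(tag)))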

4. Removing Stop Words

  • To save storage space and improve search efficiency, NLP systems automatically filter out certain words or phrases

  • Stop words are entered by hand to form a stop list; they are not generated automatically

  • Categories

    Function words of the language, such as 'the', 'is', ...

    Lexical words that are used very broadly, such as 'want'

  • Chinese stop lists

    Chinese stop-word library (中文停用词库)

    HIT (Harbin Institute of Technology) stop list

    Sichuan University Machine Intelligence Laboratory stop list

    Baidu stop list

  • Stop lists for other languages

    http://www.ranks.nl/stopwords

  • Removing stop words with NLTK

    stopwords.words()

Example:

from nltk.corpus import stopwords  # the stopwords corpus must be downloaded first

filtered_words = [word for word in words if word not in stopwords.words('english')]
print('Original words:', words)
print('After stop-word removal:', filtered_words)

# Output:
# Original words: ['Python', 'is', 'a', 'widely', 'used', 'programming', 'language', '.']
# After stop-word removal: ['Python', 'widely', 'used', 'programming', 'language', '.']
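It also helps to look at the stop list itself; NLTK bundles lists for many languages:

from nltk.corpus import stopwords

print(len(stopwords.words('english')))  # size of the English stop list
print(stopwords.words('english')[:10])  # the first few entries, all lowercase
print(stopwords.fileids())              # languages with a bundled stop list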

A Typical Text Preprocessing Pipeline

Example:

import nltk
from nltk.stem import WordNetLemmatizer
from nltk.corpus import stopwords

# raw text
raw_text = 'Life is like a box of chocolates. You never know what you\'re gonna get.'

# tokenize
raw_words = nltk.word_tokenize(raw_text)

# word-form normalization
wordnet_lematizer = WordNetLemmatizer()
words = [wordnet_lematizer.lemmatize(raw_word) for raw_word in raw_words]

# remove stop words
filtered_words = [word for word in words if word not in stopwords.words('english')]

print('Raw text:', raw_text)
print('Preprocessing result:', filtered_words)

Output:

Raw text: Life is like a box of chocolates. You never know what you're gonna get.
Preprocessing result: ['Life', 'like', 'box', 'chocolate', '.', 'You', 'never', 'know', "'re", 'gon', 'na', 'get', '.']
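Note that 'You' survives the filtering above: stopwords.words('english') contains only lowercase entries and the comparison is case-sensitive. A sketch of a case-insensitive variant, using a set for faster membership tests:

import nltk
from nltk.corpus import stopwords

raw_text = 'Life is like a box of chocolates. You never know what you\'re gonna get.'
stops = set(stopwords.words('english'))  # set membership tests are O(1)

# compare lowercased tokens against the stop list, but keep the original casing
filtered = [w for w in nltk.word_tokenize(raw_text) if w.lower() not in stops]
print(filtered)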

Use Cases:

import nltk
from nltk.tokenize import WordPunctTokenizer

sent_tokenizer = nltk.data.load('tokenizers/punkt/english.pickle')  
paragraph = "The first time I heard that song was in Hawaii on radio.  I was just a kid, and loved it very much! What a fantastic song!"  

# sentence segmentation
sentences = sent_tokenizer.tokenize(paragraph) 
print(sentences)

sentence = "Are you old enough to remember Michael Jackson attending. the Grammys with Brooke Shields and Webster sat on his lap during the show?"  

# word tokenization
words = WordPunctTokenizer().tokenize(sentence.lower())  
print(words)

Output:

['The first time I heard that song was in Hawaii on radio.', 'I was just a kid, and loved it very much!', 'What a fantastic song!']

['are', 'you', 'old', 'enough', 'to', 'remember', 'michael', 'jackson', 'attending', '.', 'the', 'grammys', 'with', 'brooke', 'shields', 'and', 'webster', 'sat', 'on', 'his', 'lap', 'during', 'the', 'show', '?']
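Loading the punkt pickle by its path works, but the same pre-trained model is also exposed through convenience functions, which is usually simpler:

import nltk

# sent_tokenize and word_tokenize use the punkt model internally
paragraph = "The first time I heard that song was in Hawaii on radio.  I was just a kid, and loved it very much! What a fantastic song!"
print(nltk.sent_tokenize(paragraph))
print(nltk.word_tokenize("What a fantastic song!"))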
