3.1 NLTK toolset
Introduction
- NLTK is a natural language processing toolkit
- Multiple corpora (Corpora)
- Lexicon resources (Lexicon), such as WordNet
- Basic natural language processing toolset
- Tokenization
- Stemming
- POS Tagging
- Syntactic Parsing
- Installation (from the command line):
pip install nltk
3.1.1 Common corpora and dictionary resources
- download
- Call the nltk.download() method to fetch corpora and models:
import nltk
nltk.download()
- stop words
- Words that carry little semantic weight (such as articles) are often removed to shrink the data and speed up processing; these words are called stop words
from nltk.corpus import stopwords
stopwords.words('english')  # the English stop-word list
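The filtering step that NLTK's list supports can be sketched in plain Python. The miniature stop-word list below is hand-written for illustration; NLTK's real English list is far larger:

```python
# A miniature, hand-written stop-word list (illustrative only).
STOP_WORDS = {"the", "a", "an", "is", "by", "of", "and"}

def remove_stop_words(tokens):
    """Drop every token whose lowercase form is a stop word."""
    return [t for t in tokens if t.lower() not in STOP_WORDS]

print(remove_stop_words(["The", "fire", "is", "warm"]))  # ['fire', 'warm']
```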
- common corpora
- A corpus is a text dataset (novels, chat logs, etc.)
- Unlabeled corpus (raw text / raw corpus)
- Use the download method shown above, then read the original text files directly (from the storage directory chosen at download time)
- Or call the access functions provided by NLTK:
import nltk
from nltk.corpus import gutenberg
gutenberg.raw("austen-emma.txt")  # full raw text of Austen's Emma
- Annotated corpus
- Stores the results of some annotation task (e.g. the sentence polarity corpus sentence_polarity contains positive and negative sentences and has already been preprocessed)
- How to use sentence_polarity
- sentence_polarity.categories() returns the polarity labels, 'neg' and 'pos'
- sentence_polarity.words(categories='pos') returns the list of all words in the positive sentences
[(sentence, category) for category in sentence_polarity.categories() for sentence in sentence_polarity.sents(categories=category)]  # returns one big list; each element is a tuple of (the word list of a sentence, its polarity label)
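Because the comprehension above needs the downloaded corpus, its shape is easier to see on toy stand-in data; the dictionary below is invented and is not the real corpus:

```python
# Toy stand-in for the corpus: category -> list of tokenized sentences.
corpus = {
    "neg": [["boring", "plot"], ["weak", "acting"]],
    "pos": [["great", "film"]],
}

# Same nested-comprehension shape as with sentence_polarity:
pairs = [(sentence, category)
         for category in sorted(corpus)
         for sentence in corpus[category]]

print(pairs[0])  # (['boring', 'plot'], 'neg')
```

Each sentence keeps its category, so the list can feed a classifier directly.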
- common dictionaries
- WordNet
- An English semantic dictionary (thesaurus). It defines synsets: sets of words sharing one meaning, each with a gloss (short definition); different synsets are connected by semantic relations
- Usage example
import nltk
from nltk.corpus import wordnet
nltk.download('wordnet')
syns = wordnet.synsets("bank")  # all 18 synsets (word senses) of "bank"
print(syns[0].name())  # name of the first sense; the 'n' marks a noun
syns[0].definition()  # gloss (English definition) of the first sense, i.e. the "bank" of finance
syns[0].examples()  # usage examples for the first sense
syns[0].hypernyms()  # hypernym synsets of the first sense
dog = wordnet.synset('dog.n.01')
cat = wordnet.synset('cat.n.01')
dog.wup_similarity(cat)  # Wu-Palmer similarity between the two synsets
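The idea behind wup_similarity can be illustrated on a toy taxonomy (the hierarchy below is hand-made, not WordNet): similarity = 2·depth(LCS) / (depth(a) + depth(b)), where LCS is the lowest common subsumer of the two nodes:

```python
# Hand-made toy taxonomy: node -> parent (root has None).
parents = {
    "entity": None,
    "animal": "entity",
    "dog": "animal",
    "cat": "animal",
    "plant": "entity",
}

def path_to_root(node):
    """Return [node, parent, ..., root]."""
    path = [node]
    while parents[path[-1]] is not None:
        path.append(parents[path[-1]])
    return path

def wup(a, b):
    """Wu-Palmer similarity: 2*depth(lcs) / (depth(a) + depth(b))."""
    pa, pb = path_to_root(a), path_to_root(b)
    ancestors_b = set(pb)
    lcs = next(n for n in pa if n in ancestors_b)  # lowest common subsumer
    depth = lambda n: len(path_to_root(n))         # root has depth 1
    return 2 * depth(lcs) / (depth(a) + depth(b))

print(round(wup("dog", "cat"), 3))  # 2*2/(3+3) = 0.667
```

WordNet's own taxonomy is much deeper, so the number NLTK returns for dog/cat differs from this toy value.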
- Method summary
- synsets returns all senses; definition returns the gloss; examples returns usage examples; hypernyms returns the hypernym synsets; wup_similarity computes Wu-Palmer similarity
- SentiWordNet
- A sentiment lexicon built on WordNet: each word sense is annotated with a positive and a negative score
- Usage example
from nltk.corpus import sentiwordnet
sentiwordnet.senti_synset('good.a.01')
# first adjective sense of the word "good"; the result exposes pos_score() and neg_score()
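The lookup pattern can be mimicked with a hand-made score table; the entries and values below are illustrative placeholders, not SentiWordNet's real scores:

```python
# Hand-made sense-level scores in the style of SentiWordNet entries:
# sense id -> (positive score, negative score). Values are illustrative.
SENTI = {
    "good.a.01": (0.75, 0.0),
    "bad.a.01": (0.0, 0.625),
}

def polarity(sense):
    """Net polarity = positive score minus negative score."""
    pos, neg = SENTI[sense]
    return pos - neg

print(polarity("good.a.01"))  # 0.75
```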
3.1.2 Common natural language processing tools
- Sentence segmentation
- Long text must first be split into sentences; simple punctuation rules work, but there are exceptions such as abbreviations ("Mr.")
- sent_tokenize example
from nltk.corpus import gutenberg
from nltk.tokenize import sent_tokenize
text = gutenberg.raw("austen-emma.txt")
sentence = sent_tokenize(text)  # split the full text of Emma into sentences
print(sentence[100])  # show one of the sentences
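The abbreviation exception is the tricky part of sentence splitting. A minimal rule-based splitter might look like the sketch below; ABBREVIATIONS is a hypothetical short list, while real splitters enumerate or learn many more cases:

```python
import re

# Hypothetical abbreviation list (real splitters use far larger ones).
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr."}

def split_sentences(text):
    """Split at '.', '!' or '?' followed by whitespace, unless the
    preceding token is a known abbreviation."""
    sentences, start = [], 0
    for m in re.finditer(r"[.!?](?=\s)", text):
        end = m.end()
        last_token = text[start:end].split()[-1]
        if last_token in ABBREVIATIONS:
            continue  # "Mr." etc. does not end a sentence
        sentences.append(text[start:end].strip())
        start = end
    if text[start:].strip():
        sentences.append(text[start:].strip())  # trailing sentence
    return sentences

print(split_sentences("Mr. Smith arrived. He sat down."))
# ['Mr. Smith arrived.', 'He sat down.']
```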
- Tokenization
- A sentence is a sequence of tokens (Token); a token can be a word or a punctuation mark and is the most basic input unit for natural language processing. Splitting a sentence into tokens is called tokenization; one difficulty is separating punctuation from the word before it
- word_tokenize example
from nltk.tokenize import word_tokenize
word_tokenize(sentence[100])
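Peeling punctuation off the preceding word, as word_tokenize does in the simple cases, can be approximated with a single regular expression. This is a rough sketch only; the real tokenizer also handles contractions, quotes, and other special cases:

```python
import re

def simple_tokenize(sentence):
    """Runs of word characters are tokens; every other non-space
    character becomes a token of its own."""
    return re.findall(r"\w+|[^\w\s]", sentence)

print(simple_tokenize("They sat by the fire."))
# ['They', 'sat', 'by', 'the', 'fire', '.']
```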
- Part-of-speech tagging
- NLTK provides a part-of-speech tagger (POS Tagger) that follows the Penn Treebank tagging standard (e.g. NN is a noun, VBP is a verb), and also provides a function for looking up what each tag means
- pos_tag example
from nltk import pos_tag
from nltk.tokenize import word_tokenize
pos_tag(word_tokenize("They sat by the fire."))
- Tag-meaning lookup example
import nltk
nltk.help.upenn_tagset('NN')
nltk.help.upenn_tagset('VBP')
nltk.help.upenn_tagset()  # show the whole tag set, with examples for each tag
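A unigram lookup tagger shows the core idea behind statistical POS tagging: give each word its most frequent tag and fall back to a default for unknown words. The lexicon below is hand-made, and NN as the default is just a common convention, not NLTK's implementation:

```python
# Hand-made lexicon: word -> most frequent Penn Treebank tag (illustrative).
LEXICON = {"they": "PRP", "sat": "VBD", "by": "IN", "the": "DT", "fire": "NN"}

def lookup_tag(tokens, default="NN"):
    """Tag each token via the lexicon; unseen words get the default tag."""
    return [(tok, LEXICON.get(tok.lower(), default)) for tok in tokens]

print(lookup_tag(["They", "sat", "by", "the", "fire"]))
# [('They', 'PRP'), ('sat', 'VBD'), ('by', 'IN'), ('the', 'DT'), ('fire', 'NN')]
```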
- Other tools
- Named entity recognition
- Chunking (shallow parsing)
- Syntactic parsing