[Natural Language Processing] 3.1 NLTK Toolset

3.1 NLTK toolset

Introduction

  • NLTK is a natural language processing toolkit that provides:
  • Multiple corpora (Corpora)
  • Lexical resources (Lexicon), such as WordNet
  • A basic set of natural language processing tools
    • Tokenization
    • Stemming
    • POS Tagging
    • Syntactic Parsing
  • Installation: run `pip install nltk` in a terminal

3.1.1 Common corpora and dictionary resources

  • Download
    • The nltk.download() method opens an interactive downloader:
    import nltk
    nltk.download()
    
    1. Stop words
    • Words that carry little semantic weight (such as articles) can be removed to shrink the input and speed up processing; such words are called stop words.
       from nltk.corpus import stopwords
       stopwords.words('english')
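The filtering step itself can be sketched in plain Python. The stop-word set below is a tiny hand-made stand-in so the example runs without downloading NLTK data; in practice it would come from `stopwords.words('english')`.

```python
# Illustrative subset only -- a real run would use nltk's stop-word list.
STOPWORDS = {"the", "a", "an", "is", "by", "of"}

def remove_stopwords(tokens):
    """Drop tokens that appear in the stop-word set (case-insensitive)."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = ["The", "cat", "sat", "by", "the", "fire"]
print(remove_stopwords(tokens))  # ['cat', 'sat', 'fire']
```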
      
    1. Common corpora
    • Text datasets (books, chat logs, etc.)
      • Unlabeled corpora (raw text / raw corpus)
        • Use the download method shown above, then open the original text files directly (in the storage directory chosen when downloading)
        • Or call the access functions provided by NLTK:
          import nltk
          nltk.download('gutenberg')
          from nltk.corpus import gutenberg
          gutenberg.raw("austen-emma.txt")
      • Annotated corpora
        • Contain the labels produced for a certain task (e.g., the sentence polarity corpus sentence_polarity contains positive and negative sentences and has already been preprocessed)
        • How to use sentence_polarity
          • sentence_polarity.categories() returns the polarity categories: neg and pos
          • sentence_polarity.words(categories='pos') returns the word list of all positive sentences in the corpus
            [(sentence, category) for category in sentence_polarity.categories()
                for sentence in sentence_polarity.sents(categories=category)]
            # Returns one large list; each element is a tuple of a sentence's word list and its polarity label
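The pairing pattern above can be sketched with a small stand-in corpus object, since sentence_polarity itself requires a download. FakePolarityCorpus and its data are invented here purely to mimic the corpus reader's `categories()`/`sents()` interface.

```python
class FakePolarityCorpus:
    """Minimal stand-in mimicking sentence_polarity's reader interface."""
    _data = {
        "neg": [["terrible", "movie"]],
        "pos": [["great", "film"], ["loved", "it"]],
    }
    def categories(self):
        return sorted(self._data)
    def sents(self, categories=None):
        return self._data[categories]

corpus = FakePolarityCorpus()
# Same comprehension as in the text: one (word-list, label) tuple per sentence.
pairs = [(sentence, category)
         for category in corpus.categories()
         for sentence in corpus.sents(categories=category)]
print(pairs[0])  # (['terrible', 'movie'], 'neg')
```

A list in this shape is the usual starting point for training a sentiment classifier.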
        
    1. Common dictionaries
      1. WordNet
      • An English sense dictionary (thesaurus). It defines synsets, each a set of words sharing one meaning together with a gloss (short definition); different synsets are connected by semantic relations.
        • Usage example:
           from nltk.corpus import wordnet
           nltk.download('wordnet')
           syns = wordnet.synsets("bank")
           # Returns the synsets for all 18 senses of "bank"
           print(syns[0].name())
           # Returns the name of the first sense of "bank"; the "n" marks a noun
           syns[0].definition()
           # Returns the definition (English gloss) of the first sense of "bank"
           syns[0].examples()
           # Returns usage examples of the first sense of "bank"
           syns[0].hypernyms()
           # Returns the hypernym synsets of the first sense of "bank"
           dog = wordnet.synset('dog.n.01')
           cat = wordnet.synset('cat.n.01')
           dog.wup_similarity(cat)
           # Computes the Wu-Palmer similarity between the two synsets

    • synsets returns all senses, definition returns the gloss, examples returns usage examples, hypernyms returns the hypernym synsets, wup_similarity returns the Wu-Palmer similarity
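Wu-Palmer similarity has a simple closed form: 2·depth(LCS) / (depth(a) + depth(b)), where LCS is the lowest common hypernym of the two synsets. A sketch of the arithmetic, with depths that are illustrative rather than taken from the real WordNet hierarchy:

```python
def wu_palmer(depth_a, depth_b, depth_lcs):
    """Wu-Palmer similarity from taxonomy depths:
    2 * depth(lowest common hypernym) / (depth(a) + depth(b))."""
    return 2 * depth_lcs / (depth_a + depth_b)

# e.g. two synsets both at depth 13 whose common ancestor sits at depth 11:
print(wu_palmer(13, 13, 11))  # ~0.846
```

The closer the common ancestor is to the two synsets, the closer the score gets to 1.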
    1. SentiWordNet
    • A sentiment-orientation dictionary built on WordNet: each WordNet synset is annotated with a positive and a negative score.
    • Usage example:
        from nltk.corpus import sentiwordnet
        sentiwordnet.senti_synset('good.a.01')
        # The first sense of "good" as an adjective (a); returns its PosScore and NegScore
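A common use of such scores is summing (positive − negative) over a token list to estimate a sentence's orientation. The scores in the dictionary below are made up for illustration; real values would come from `senti_synset(...).pos_score()` and `.neg_score()`.

```python
# Hand-made (pos_score, neg_score) pairs -- illustrative only.
SCORES = {"good": (0.75, 0.0), "bad": (0.0, 0.625), "movie": (0.0, 0.0)}

def orientation(tokens):
    """Sum of (pos - neg) scores; unknown words count as neutral."""
    return sum(SCORES.get(t, (0.0, 0.0))[0] - SCORES.get(t, (0.0, 0.0))[1]
               for t in tokens)

print(orientation(["good", "movie"]))  # 0.75
print(orientation(["bad", "movie"]))   # -0.625
```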

3.1.2 Common natural language processing tool sets

    1. Sentence segmentation
    • Longer text first needs to be split into sentences. Simple punctuation-based rules mostly suffice, but there are exceptions such as abbreviations ("Mr.").
    • sent_tokenize example:
    from nltk.tokenize import sent_tokenize
    text = gutenberg.raw("austen-emma.txt")
    sentence = sent_tokenize(text)
    # Split the full text of the novel Emma into sentences
    print(sentence[100])
    # Show one of the sentences
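The abbreviation exception mentioned above is easy to demonstrate: a naive regular-expression splitter that breaks after every sentence-final punctuation mark mishandles "Mr.", which is exactly what sent_tokenize's trained model avoids.

```python
import re

def naive_split(text):
    """Split after . ! ? followed by whitespace -- no abbreviation handling."""
    return [s for s in re.split(r'(?<=[.!?])\s+', text) if s]

text = "Mr. Smith arrived. He sat down."
print(naive_split(text))
# ['Mr.', 'Smith arrived.', 'He sat down.']  <- "Mr." wrongly ends a sentence
```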
    
    1. Tokenization
    • A sentence is a sequence of tokens (Token); a token can be a word or a punctuation mark and is the most basic input unit for natural language processing. Splitting a sentence into tokens is called tokenization; one of its jobs is separating punctuation marks from the word that precedes them.
    • word_tokenize example:
    from nltk.tokenize import word_tokenize
    word_tokenize(sentence[100])
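The punctuation-splitting behavior can be approximated with a short regular expression: keep runs of word characters together and emit each punctuation mark as its own token. This is only a sketch of the idea; word_tokenize itself handles many more cases (contractions, quotes, etc.).

```python
import re

def simple_tokenize(sentence):
    """Words stay whole; each non-space punctuation mark becomes a token."""
    return re.findall(r"\w+|[^\w\s]", sentence)

print(simple_tokenize("They sat by the fire."))
# ['They', 'sat', 'by', 'the', 'fire', '.']
```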
    
    1. Part-of-speech tagging
    • NLTK provides a part-of-speech tagger (POS Tagger) that uses the Penn Treebank tag set (e.g., NN is a noun, VBP is a verb). It also provides a lookup function for the meaning of each tag.
    • pos_tag example:
    from nltk import pos_tag
    pos_tag(word_tokenize("They sat by the fire."))
    
    • Tag-meaning lookup example:
    nltk.help.upenn_tagset('NN')
    nltk.help.upenn_tagset('VBP')
    nltk.help.upenn_tagset() # Returns the full tag set with examples of each part of speech
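A typical next step after tagging is counting how often each tag occurs. The tagged pairs below are hard-coded (the kind of output pos_tag would typically produce for the example sentence) so the sketch runs without NLTK.

```python
from collections import Counter

# Hard-coded (word, tag) pairs standing in for pos_tag output.
tagged = [("They", "PRP"), ("sat", "VBD"), ("by", "IN"),
          ("the", "DT"), ("fire", "NN"), (".", ".")]

# Count each Penn Treebank tag's frequency.
tag_counts = Counter(tag for _, tag in tagged)
print(tag_counts.most_common())
```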
    
    1. Other tools
    • Named entity recognition
    • Chunking (shallow parsing)
    • Syntactic parsing

Origin blog.csdn.net/xiaziqiqi/article/details/131362637