Introduction to Natural Language Processing (NLP) (1)

1. How to represent words in a way that computers can process (also called "tokenization")

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ['I love my dog',
             'I love my cat']
tokenizer = Tokenizer(num_words=100)

Create an instance of the Tokenizer object that keeps only the 100 most frequent words in the vocabulary

tokenizer.fit_on_texts(sentences)

Scan all the text in sentences and map each word to a corresponding number

word_index= tokenizer.word_index

Get the word index, a dictionary mapping every word in the vocabulary to its identifier (note: all uppercase letters are converted to lowercase, so remember to account for the original casing later when doing the embedding)

print(word_index)

Output: {'i': 1, 'love': 2, 'my': 3, 'dog': 4, 'cat': 5} — each word and its corresponding identifier

Note: The tokenizer is quite smart. Even if there is a "!" after dog, as in "I love my dog!", the tokenizer still recognizes "dog" and automatically strips the "!".
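A quick sketch of this behavior (same Tokenizer as above, with an exclamation mark added — punctuation is stripped by the tokenizer's default `filters` argument, so "dog!" and "dog" map to the same token):

```python
from tensorflow.keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(['I love my dog!'])

# "dog!" is indexed as plain "dog"; the "!" never appears in the vocabulary.
print(tokenizer.word_index)  # {'i': 1, 'love': 2, 'my': 3, 'dog': 4}
```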

2. Create a number sequence for the sentence and convert the sentence containing the above words into a number sequence

import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ['I love my dog',
             'I love my cat',
             'Do you think my dog is amazing?']
tokenizer=Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
word_index= tokenizer.word_index
sequences=tokenizer.texts_to_sequences(sentences)

This creates, for each sentence, a sequence of the identifiers representing its words

print(word_index)
print(sequences)

Output:
{'my': 1, 'i': 2, 'love': 3, 'dog': 4, 'cat': 5, 'do': 6, 'you': 7, 'think': 8, 'is': 9, 'amazing': 10}
[[2, 3, 1, 4], [2, 3, 1, 5], [6, 7, 8, 1, 4, 9, 10]]

3. Convert the sentences in the test set into sequences


Since "manatee", "really", and "loves" are not in word_index (they never appeared in the training corpus), they are simply dropped from the output sequences, which therefore come out shorter than the original sentences.
In order not to lose the length information of the sentence, you can use the oov_token argument to set a special token that replaces any word not in the corpus.
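The two behaviors described above can be sketched like this, using the training sentences from section 2 and test sentences containing the unseen words "really", "loves", and "manatee":

```python
from tensorflow.keras.preprocessing.text import Tokenizer

sentences = ['I love my dog',
             'I love my cat',
             'Do you think my dog is amazing?']

test_data = ['i really love my dog',
             'my dog loves my manatee']

# Without oov_token, unknown words are silently dropped,
# so the sequences come out shorter than the sentences.
tokenizer = Tokenizer(num_words=100)
tokenizer.fit_on_texts(sentences)
print(tokenizer.texts_to_sequences(test_data))

# With oov_token, every unknown word is replaced by '<OOV>' (which
# gets index 1), so the sequences keep their original lengths.
tokenizer_oov = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer_oov.fit_on_texts(sentences)
print(tokenizer_oov.texts_to_sequences(test_data))
```

Note that adding the OOV token shifts every other word's index up by one, since '<OOV>' takes index 1.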

4. To deal with sentences of different lengths, you can use padding, which pads the short sentences so that every sentence becomes as long as the longest one

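A minimal sketch with pad_sequences, reusing the corpus from section 2: by default it pads with zeros at the front ('pre') up to the length of the longest sequence, and padding='post' or maxlen can change that.

```python
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

sentences = ['I love my dog',
             'I love my cat',
             'Do you think my dog is amazing?']

tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')
tokenizer.fit_on_texts(sentences)
sequences = tokenizer.texts_to_sequences(sentences)

# Zeros are prepended so every row matches the longest sequence (length 7).
padded = pad_sequences(sequences)
print(padded)
```

Index 0 is reserved for padding, which is why the word index starts counting from 1.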

See the next blog!


Origin blog.csdn.net/qq_45234219/article/details/114462107