NLP-TF2.0-C3W1L5-Text to sequence

Course notes for Coursera's Natural Language Processing in TensorFlow.

Builds on the word tokenization covered in the previous lesson.

Example 1.

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words=100)  # keep at most the 100 most frequent words
tokenizer.fit_on_texts(sentences)     # build the word index from the corpus
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)

print(word_index)
print(sequences)

Output:

{'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5, 'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}
[[4, 2, 1, 3], [4, 2, 1, 6], [5, 2, 1, 3], [7, 5, 8, 1, 3, 9, 10]]

Note that 'i love my dog' is encoded as [4, 2, 1, 3]: each word is replaced by its index in word_index. Also note that num_words only caps how many of the most frequent words texts_to_sequences will keep; word_index itself records every word seen during fitting.
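Under the hood, texts_to_sequences lowercases each sentence, strips punctuation, and looks each word up in word_index. A minimal pure-Python sketch of that lookup (not the Keras implementation; word_index is hard-coded from the output above):

```python
import string

# word_index as printed above
word_index = {'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5,
              'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}

def encode(sentence, index):
    # lowercase, drop punctuation, then map every known word to its index
    words = sentence.lower().translate(
        str.maketrans('', '', string.punctuation)).split()
    return [index[w] for w in words if w in index]

print(encode('i love my dog', word_index))     # [4, 2, 1, 3]
print(encode('You love my dog!', word_index))  # [5, 2, 1, 3]
```

This reproduces the sequences above: case and punctuation are normalized away before the lookup, which is why 'You love my dog!' still encodes cleanly.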

----------------------------------------------------------------------------------------------------------------------------------------------------

Example 2: add the following test sentences to the previous example:


test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)

Note the new words: really in the first sentence, and loves and manatee in the second.

Output:

[[4, 2, 1, 3], [1, 3, 1]]

'i really love my dog' is encoded as [4, 2, 1, 3], exactly the same as 'i love my dog' in Example 1.

Clearly the new word really was not encoded, and likewise loves and manatee in 'my dog loves my manatee' were dropped.

This is because the tokenizer's vocabulary does not contain these words, so it simply skips them.
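The silent dropping can be made visible by checking each word against the vocabulary before encoding. A small sketch (word_index hard-coded from Example 1; encode_and_report is a hypothetical helper, not a Keras API):

```python
# word_index from Example 1
word_index = {'my': 1, 'love': 2, 'dog': 3, 'i': 4, 'you': 5,
              'cat': 6, 'do': 7, 'think': 8, 'is': 9, 'amazing': 10}

def encode_and_report(sentence, index):
    # encode known words; collect the words that the vocabulary is missing
    words = sentence.lower().split()
    seq = [index[w] for w in words if w in index]
    unknown = [w for w in words if w not in index]
    return seq, unknown

seq, unknown = encode_and_report('my dog loves my manatee', word_index)
print(seq)      # [1, 3, 1]
print(unknown)  # ['loves', 'manatee']
```

In practice this kind of check is useful for estimating how much of a test set falls outside the training vocabulary.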

(The following is from C2W1L6; since it is short, it is merged into this section.)

Example 3: another way to handle out-of-vocabulary words is to substitute a placeholder token.

from tensorflow.keras.preprocessing.text import Tokenizer

sentences = [
    'i love my dog',
    'I love my cat',
    'You love my dog!',
    'Do you think my dog is amazing?'
]

tokenizer = Tokenizer(num_words=100, oov_token='<OOV>')  # reserve a token for unknown words
tokenizer.fit_on_texts(sentences)
word_index = tokenizer.word_index
sequences = tokenizer.texts_to_sequences(sentences)

print(word_index)
print(sequences)

test_data = [
    'i really love my dog',
    'my dog loves my manatee'
]

test_seq = tokenizer.texts_to_sequences(test_data)
print(test_seq)

The only difference from Example 2 is the extra oov_token argument: tokenizer = Tokenizer(num_words=100, oov_token='<OOV>').

OOV stands for Out Of Vocabulary: any word outside the vocabulary is replaced by '<OOV>'.

Output:

{'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6, 'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}
[[5, 3, 2, 4], [5, 3, 2, 7], [6, 3, 2, 4], [8, 6, 9, 2, 4, 10, 11]]
[[5, 1, 3, 2, 4], [2, 4, 1, 2, 1]]
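Note that '<OOV>' is assigned index 1, pushing every other word down by one, and that really, loves, and manatee all encode to 1 instead of disappearing. A pure-Python sketch of this lookup (not the Keras source; word_index hard-coded from the output above):

```python
# word_index as printed above; index 1 is reserved for the OOV token
word_index = {'<OOV>': 1, 'my': 2, 'love': 3, 'dog': 4, 'i': 5, 'you': 6,
              'cat': 7, 'do': 8, 'think': 9, 'is': 10, 'amazing': 11}

def encode_with_oov(sentence, index, oov='<OOV>'):
    # unknown words fall back to the reserved OOV index instead of vanishing
    return [index.get(w, index[oov]) for w in sentence.lower().split()]

print(encode_with_oov('i really love my dog', word_index))
# [5, 1, 3, 2, 4]
print(encode_with_oov('my dog loves my manatee', word_index))
# [2, 4, 1, 2, 1]
```

Keeping a placeholder preserves sentence length and word positions, which matters once the sequences are padded and fed to a model.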
 


Reprinted from blog.csdn.net/menghaocheng/article/details/93158111