Deep Learning: Working with Text Data

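Deep learning models do not take raw text as input; they only work with numeric tensors. Vectorizing text is the process of converting text into numeric tensors.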
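
As a rough illustration of the idea, the sketch below walks through the usual steps on one short sentence: split the text into tokens, map each token to an integer index, and expand each index into a one-hot vector. The helper names `build_index` and `vectorize` are made up for this example, and whitespace splitting is an assumption; real tokenizers are more careful.

```python
def build_index(texts):
    """Assign each distinct whitespace-separated token an index starting at 1."""
    index = {}
    for text in texts:
        for token in text.split():
            index.setdefault(token, len(index) + 1)
    return index


def vectorize(text, index):
    """Turn a text into a list of one-hot vectors (plain Python lists)."""
    size = len(index) + 1  # slot 0 is left unused
    vectors = []
    for token in text.split():
        vec = [0.0] * size
        vec[index[token]] = 1.0
        vectors.append(vec)
    return vectors


texts = ["the cat sat"]
index = build_index(texts)
vectors = vectorize(texts[0], index)
print(index)
print(vectors)
```

Each token ends up as a vector that is zero everywhere except at the position given by its index.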

1. One-hot encoding of words and characters

1) Word level

import numpy as np

# Initial data: each sample here is one sentence, but it could also be an entire document.
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

token_index = {}   # Index of all tokens in the data
for sample in samples:
    # Tokenize each sample with split(). In real applications you may also
    # need to handle punctuation.
    for word in sample.split():
        if word not in token_index:
            # Assign a unique index to each unique word.
            # Note that index 0 is not assigned to any word.
            token_index[word] = len(token_index) + 1

max_length = 10   # Only consider the first max_length words of each sample

# Store the results in `results`, shaped (samples, max_length, vocabulary size + 1).
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.
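
The `split()` call above keeps punctuation attached to words, so `mat.` and `mat` would get different indices, as would `The` and `the`. One possible way to handle this, sketched here as an assumption rather than the method used in this article, is to lowercase the text and pull out word characters with a regular expression. The helper name `tokenize` is hypothetical.

```python
import re


def tokenize(text):
    """Lowercase the text and return runs of letters, digits, or apostrophes."""
    return re.findall(r"[a-z0-9']+", text.lower())


samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Same indexing scheme as before: indices start at 1, and 0 stays unassigned.
token_index = {}
for sample in samples:
    for word in tokenize(sample):
        if word not in token_index:
            token_index[word] = len(token_index) + 1

print(tokenize(samples[0]))
print(len(token_index))
```

With this tokenizer the two samples yield 9 distinct tokens, rather than the 10 produced by plain `split()`, because `The` and `the` now share one index.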

2) Character level

import string

import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
characters = string.printable   # All printable ASCII characters
# Map each character to a unique index starting at 1; index 0 is not assigned.
token_index = dict(zip(characters, range(1, len(characters) + 1)))

max_length = 50   # Only consider the first max_length characters of each sample
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, character in enumerate(sample[:max_length]):
        index = token_index.get(character)
        results[i, j, index] = 1.
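
To check an encoding like the one above, it can help to decode it back into text. The sketch below uses plain Python lists instead of NumPy so that it runs without dependencies; the helper names `encode` and `decode` are hypothetical. Because no character is assigned index 0, an all-zero row can be read as padding.

```python
import string

characters = string.printable
token_index = dict(zip(characters, range(1, len(characters) + 1)))
index_token = {i: c for c, i in token_index.items()}  # reverse lookup


def encode(text, max_length):
    """One-hot encode the first max_length characters; unused rows stay all zero."""
    width = len(token_index) + 1
    rows = [[0.0] * width for _ in range(max_length)]
    for j, character in enumerate(text[:max_length]):
        rows[j][token_index[character]] = 1.0
    return rows


def decode(rows):
    """Recover the text, skipping all-zero (padding) rows."""
    chars = []
    for row in rows:
        if any(row):
            chars.append(index_token[row.index(1.0)])
    return "".join(chars)


encoded = encode('The cat sat on the mat.', max_length=50)
print(decode(encoded))
```

The round trip only works for characters in `string.printable`; anything else would need its own index or an "unknown" slot.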

3) Keras implementation

from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Create a tokenizer configured to only consider the 1,000 most common words.
tokenizer = Tokenizer(num_words=1000)

# Build the word index.
tokenizer.fit_on_texts(samples)

# Turn the strings into lists of integer indices.
sequences = tokenizer.texts_to_sequences(samples)

# You can also get the one-hot binary representation directly.
# The tokenizer supports vectorization modes other than one-hot encoding as well.
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

# Recover the word index that was computed.
word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))
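
For readers who want to see roughly what the tokenizer does, here is a dependency-free sketch that approximates it: lowercase the text, strip punctuation, rank words by frequency, and keep only indices below `num_words`. This is an illustration under those assumptions, not the Keras source code, and the helper names are hypothetical.

```python
import re
from collections import Counter


def fit_word_index(texts):
    """Rank words by frequency; the most frequent word gets index 1."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z0-9']+", text.lower()))
    return {word: rank for rank, (word, _) in enumerate(counts.most_common(), start=1)}


def texts_to_sequences(texts, word_index, num_words):
    """Map each text to a list of indices, dropping words ranked num_words or higher."""
    sequences = []
    for text in texts:
        words = re.findall(r"[a-z0-9']+", text.lower())
        sequences.append([word_index[w] for w in words
                          if w in word_index and word_index[w] < num_words])
    return sequences


def sequences_to_binary_matrix(sequences, num_words):
    """Return one row per text with a 1.0 at every index that occurs in it."""
    matrix = [[0.0] * num_words for _ in sequences]
    for row, sequence in zip(matrix, sequences):
        for index in sequence:
            row[index] = 1.0
    return matrix


samples = ['The cat sat on the mat.', 'The dog ate my homework.']
word_index = fit_word_index(samples)
sequences = texts_to_sequences(samples, word_index, num_words=1000)
matrix = sequences_to_binary_matrix(sequences, num_words=1000)
print(sequences)
print('Found %s unique tokens.' % len(word_index))
```

Note that the binary matrix has one row per sample rather than one row per word: each row marks which words occur in that sample, so word order is lost.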

2. Using word embeddings

To be added.
