Processing raw text with word embeddings

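This time the dataset consists of IMDB movie reviews, which we process and analyze with word embeddings. We first embed each sentence as a sequence of word vectors, then flatten that sequence, and finally train a Dense layer on top. Here, however, we use pretrained word embeddings. In addition, instead of the pre-tokenized IMDB dataset packaged with Keras, we start from scratch with the downloaded raw IMDB text.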
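
To make the three steps (embed, flatten, Dense) concrete, here is a minimal pure-Python sketch of the forward pass for one padded review. The tiny embedding table, the weights, and the helper names embed, flatten, and dense_sigmoid are all made up for illustration; a real model would use Keras layers with learned or pretrained weights.

```python
import math

# Hypothetical embedding table: 4 word indices, 2 dimensions each.
embedding_table = [
    [0.0, 0.0],   # index 0 is reserved for padding
    [0.1, 0.3],
    [-0.2, 0.5],
    [0.4, -0.1],
]

def embed(sequence):
    """Map each integer word index to its embedding vector."""
    return [embedding_table[i] for i in sequence]

def flatten(vectors):
    """Concatenate the per-word vectors into one long feature vector."""
    return [value for vector in vectors for value in vector]

def dense_sigmoid(features, weights, bias):
    """A single Dense unit with a sigmoid activation."""
    z = sum(f * w for f, w in zip(features, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))

sequence = [0, 2, 3]                  # a padded review of length 3
embedded = embed(sequence)            # 3 vectors of 2 values each
features = flatten(embedded)          # 6 values in total
weights = [0.5] * len(features)       # made-up weights
probability = dense_sigmoid(features, weights, bias=0.0)
print(len(embedded), len(features), round(probability, 3))
```

With maxlen words per review and d-dimensional vectors, the flattened input to the Dense layer has maxlen * d values, which is why the review length has to be fixed in advance.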

Downloading and processing the raw IMDB data

We download the raw IMDB dataset from http://mng.bz/0tIo and uncompress it.

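Next, we collect the training reviews into a list of strings, one string per review.

The code below reads the reviews and builds the matching list of labels (0 for negative, 1 for positive):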

import os

imdb_dir = '/home/einstellung/桌面/aclImdb'
train_dir = os.path.join(imdb_dir, 'train')

labels = []
texts = []

# Read every review; the 'neg' folder is labeled 0 and the 'pos' folder 1.
for label_type in ['neg', 'pos']:
    dir_name = os.path.join(train_dir, label_type)
    for fname in os.listdir(dir_name):
        if fname.endswith('.txt'):
            # An explicit encoding keeps the result independent of the system locale.
            with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                texts.append(f.read())
            if label_type == 'neg':
                labels.append(0)
            else:
                labels.append(1)
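
If you want to check the loading logic before pointing it at the full download, the same loop can be wrapped in a function and run against a throwaway directory. The sketch below is only an illustration: the function name load_reviews, the three sample reviews, and the file names are invented, and only the neg/pos folder layout is taken from the dataset.

```python
import os
import tempfile

def load_reviews(split_dir):
    """Read every .txt file under split_dir/neg (label 0) and split_dir/pos (label 1)."""
    texts, labels = [], []
    for label, label_type in enumerate(['neg', 'pos']):
        dir_name = os.path.join(split_dir, label_type)
        # Sort the file names so the order is reproducible across runs.
        for fname in sorted(os.listdir(dir_name)):
            if fname.endswith('.txt'):
                with open(os.path.join(dir_name, fname), encoding='utf-8') as f:
                    texts.append(f.read())
                labels.append(label)
    return texts, labels

# Build a tiny stand-in for aclImdb/train with made-up reviews.
with tempfile.TemporaryDirectory() as tmp:
    samples = {'neg': ['dull and slow'], 'pos': ['a real delight', 'loved it']}
    for label_type, reviews in samples.items():
        os.makedirs(os.path.join(tmp, label_type))
        for i, review in enumerate(reviews):
            path = os.path.join(tmp, label_type, '%d_0.txt' % i)
            with open(path, 'w', encoding='utf-8') as f:
                f.write(review)
    demo_texts, demo_labels = load_reviews(tmp)

print(demo_texts, demo_labels)
```

Note that all negative reviews come out before all positive ones, which is why the data has to be shuffled before it is split.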

Tokenizing the data

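Next we tokenize the text and split it into a training set and a validation set. Pretrained word embeddings are most useful when very little training data is available (with plenty of data, task-specific embeddings are likely to perform better). Since we want to use pretrained embeddings here, we add a restriction: the training data is limited to the first 200 samples. The model therefore has to learn to classify movie reviews after seeing only 200 examples.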

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
import numpy as np

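maxlen = 100  # Cap reviews at 100 words (pad_sequences keeps the last 100 by default)
training_samples = 200  # Train on 200 samples
validation_samples = 10000  # Validate on 10,000 samples
max_words = 10000  # Consider only the 10,000 most common words in the dataset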

tokenizer = Tokenizer(num_words=max_words)
tokenizer.fit_on_texts(texts)
sequences = tokenizer.texts_to_sequences(texts)

word_index = tokenizer.word_index
print('Found %s unique tokens.' % len(word_index))

data = pad_sequences(sequences, maxlen=maxlen)

labels = np.asarray(labels)
print('Shape of data tensor:', data.shape)
print('Shape of label tensor:', labels.shape)


# Split the data into a training set and a validation set, but shuffle first:
# the samples start out ordered, with all negative reviews before all positive ones.
indices = np.arange(data.shape[0])
np.random.shuffle(indices)
data = data[indices]
labels = labels[indices]

x_train = data[:training_samples]
y_train = labels[:training_samples]
x_val = data[training_samples: training_samples + validation_samples]
y_val = labels[training_samples: training_samples + validation_samples]

Found 88582 unique tokens.
Shape of data tensor: (25000, 100)
Shape of label tensor: (25000,)
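
To see what the tokenizing and padding steps do to a review, here is a simplified pure-Python sketch of the same idea. It is not the Keras implementation: it splits on whitespace only (Keras also strips punctuation), breaks frequency ties alphabetically, and the helper names build_word_index, to_sequence, and pad are invented for this illustration. It does follow two behaviors of the Keras defaults: words whose index is not below num_words are dropped, and pad_sequences pads and truncates at the start of the sequence.

```python
from collections import Counter

def build_word_index(texts):
    """Rank words by frequency; index 1 is the most common word."""
    counts = Counter(word for text in texts for word in text.lower().split())
    ranked = sorted(counts, key=lambda w: (-counts[w], w))
    return {word: rank for rank, word in enumerate(ranked, start=1)}

def to_sequence(text, word_index, num_words):
    """Replace words by their index, dropping words outside the top ranks."""
    ids = (word_index.get(word) for word in text.lower().split())
    return [i for i in ids if i is not None and i < num_words]

def pad(sequence, maxlen):
    """Keep the last maxlen ids and left-pad with zeros (index 0 means padding)."""
    kept = sequence[-maxlen:]
    return [0] * (maxlen - len(kept)) + kept

# Two made-up reviews.
demo_reviews = ['the film was great', 'the plot was thin and the film dragged']
demo_index = build_word_index(demo_reviews)
demo_sequences = [to_sequence(t, demo_index, num_words=6) for t in demo_reviews]
demo_data = [pad(s, maxlen=5) for s in demo_sequences]
print(demo_index)
print(demo_data)
```

Every row of demo_data has the same length, which is what lets the real code stack 25,000 reviews into a single (25000, 100) tensor.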

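Next, we download the precomputed GloVe embeddings trained on 2014 English Wikipedia from https://nlp.stanford.edu/projects/glove/. They contain 100-dimensional embedding vectors for 400,000 words (or non-word tokens).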
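
The GloVe text files store one token per line, followed by its vector components separated by spaces. A minimal sketch of parsing that format into a word-to-vector dictionary is shown below. The function name parse_glove and the two sample lines with 3-dimensional vectors are invented for illustration; with the real download you would pass an open file handle instead, for example to the 100-dimensional file, assumed here to be named glove.6B.100d.txt.

```python
import io

def parse_glove(lines):
    """Build a word -> vector dict from lines of the form 'word v1 v2 ... vN'."""
    embeddings_index = {}
    for line in lines:
        values = line.rstrip().split(' ')
        if len(values) < 2:
            continue  # skip blank or malformed lines
        embeddings_index[values[0]] = [float(v) for v in values[1:]]
    return embeddings_index

# Made-up 3-dimensional vectors standing in for the real 100-dimensional file.
sample = io.StringIO("the 0.1 -0.2 0.3\nfilm 0.5 0.0 -0.4\n")
embeddings_index = parse_glove(sample)
print(len(embeddings_index), embeddings_index['film'])
```

Words from the tokenizer's word_index that are missing from this dictionary have no pretrained vector, so they are usually left as all zeros when the embedding matrix is built.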