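All materials for this assignment come from Andrew Ng's Deep Learning Specialization on Coursera: the Week 2 programming assignment, Emojify, of the fifth course, Sequence Models.
Course link: https://www.coursera.org/learn/nlp-sequence-models
The materials and code needed for this assignment can be downloaded from the assignment's file directory:

https://www.coursera.org/learn/nlp-sequence-models/notebook/acNYU/emojify
For how to download the course notebooks from Coursera as one archive, see the tutorial "Downloading all Coursera notebook assignment files".

Note: parts of this post are quoted from

【中文】【吴恩达课后编程作业】Course 5 - 序列模型 - 第二周作业 - 词向量的运算与Emoji生成器 (Andrew Ng programming assignment, Course 5, Sequence Models, Week 2: Operations on Word Vectors and the Emoji Generator)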

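1. Assignment Overview

In this assignment we use word vectors to build an emoji generator (an "Emojifier").

Have you ever wanted to make your text messages more expressive? For example, you write "Congratulations on the promotion! Lets get coffee and talk. Love you!" and the Emojifier automatically turns it into "Congratulations on the promotion! 👍 Lets get coffee and talk. ☕️ Love you! ❤️".
On the other hand, if you are not keen on emoji yourself but your friends send you messages full of them, you can use the Emojifier to reply in kind.
We will build a model that takes a sentence as input (such as "Let’s go see the baseball game tonight!") and outputs an emoji (⚾️). In many emoji interfaces you have to remember that ❤️ is the "heart" symbol rather than the "love" symbol. With word vectors, however, even if the training set explicitly associates only a few words with a particular emoji, the model can generalize and map words in the test set to the same emoji, including words that never appear in the training set.
The assignment covers a baseline model that uses word embeddings (Emojifier-V1) and a more sophisticated model that adds an LSTM (Emojifier-V2).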

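2. Baseline Model: Reproducing Emojifier-V1 in PyCharm

2.1 - Dataset

The dataset (X, Y):
X: 127 short sentences (strings)
Y: the label of each sentence (an integer from 0 to 4)

Figure 2-1: EMOJISET, sample entries from the five-class classification dataset

Import the modules: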

import numpy as np
from emo_utils import *
import emoji
import matplotlib.pyplot as plt

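Load the dataset. The assignment text gives 127 training examples and 56 test examples (the Keras log in section 3.5 reports 132 training samples per epoch):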

X_train, Y_train = read_csv('data/train_emoji.csv')
X_test, Y_test = read_csv('data/tesss.csv')

maxLen = len(max(X_train, key=len).split())  # word count of the longest sentence, used later for zero-padding
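
One detail worth noting about this expression: `max(X_train, key=len)` picks the sentence with the most characters and only then counts its words, so in principle it may not return the largest word count. The sketch below illustrates the difference on three invented sentences (not the course data); `max_len_by_words` is a hypothetical name for a word-based alternative.

```python
import numpy as np

# Toy sentences, invented for illustration (not from train_emoji.csv).
X_toy = np.array([
    "congratulations everybody",   # 25 characters, 2 words
    "i am so happy today",         # 19 characters, 5 words
    "lets go",                     # 7 characters, 2 words
])

# Expression used in the post: longest sentence by character count,
# then the number of words in that one sentence.
max_len_by_chars = len(max(X_toy, key=len).split())

# Hypothetical alternative: largest word count over all sentences.
max_len_by_words = max(len(s.split()) for s in X_toy)

print(max_len_by_chars, max_len_by_words)
```

On the course data the two expressions may well agree; the word-based version simply removes the dependence on character length.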

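You can take a look at what the training set contains:
Note: because of the font, the emoji are rendered in black in PyCharm, and the crying and smiling faces look somewhat alike.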

for idx in range(8):
    print("example" + str(idx) + ":")
    print(X_train[idx], label_to_emoji(Y_train[idx]))

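PyCharm output:

Figure 2-2: Training set examples, PyCharm output

Notebook output:

Figure 2-3: Training set examples, Notebook output

2.2 - Structure of Emojifier-V1

We implement the "Emojifier-V1" baseline model:

Figure 2-4: Baseline model (Emojifier-V1).

The model takes a piece of text as input (for example "I love you") and outputs a probability vector of shape (1,5); the argmax layer then picks the most likely class. We now convert the labels Y into the format the softmax classifier needs, that is, from shape (m,1) to a one-hot encoding of shape (m,5). Each row is one one-hot encoded example, and the "oh" in Y_oh stands for "one-hot".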

Y_oh_train = convert_to_one_hot(Y_train, C = 5)
Y_oh_test = convert_to_one_hot(Y_test, C = 5)
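
`convert_to_one_hot` comes from the course's emo_utils.py, whose source is not shown in this post. As a rough sketch of what such a helper could do, the snippet below builds one-hot rows by indexing into an identity matrix. The function name `to_one_hot` and the toy labels are assumptions made for illustration.

```python
import numpy as np

def to_one_hot(labels, num_classes):
    """Return an (m, num_classes) one-hot matrix for integer labels."""
    labels = np.asarray(labels, dtype=int).reshape(-1)
    # Row k of the identity matrix is the one-hot vector for class k.
    return np.eye(num_classes)[labels]

Y_toy = np.array([0, 3, 4])           # made-up labels in the range 0-4
Y_oh_toy = to_one_hot(Y_toy, num_classes=5)
print(Y_oh_toy)
```

Taking `argmax` along each row recovers the original integer labels, which is how the predictions are decoded later on.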

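Now we implement the model.

2.3 - Implementing the Emojifier-V1 Model

① As Figure 2-4 shows, the first step is to convert the input sentence into its word-vector representation and then average those vectors. We use pretrained 50-dimensional GloVe word embeddings.
Load the word embeddings: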

word_to_index, index_to_word, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt')

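Here we have loaded:
word_to_index: a dictionary mapping words to indices (400,001 words, valid indices 0 to 400,000)
index_to_word: a dictionary mapping indices back to words.
word_to_vec_map: a dictionary mapping each word to its GloVe vector.
② Next we implement sentence_to_avg(), which breaks down into two steps:

Convert the sentence to lowercase and split it into a list of words, using X.lower() and X.split().
For each word in the sentence, look up its GloVe vector, then average the vectors.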

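def sentence_to_avg(sentence, word_to_vec_map):
    # Step 1: lowercase the sentence and split it into a list of words.
    words = sentence.lower().split()

    # Running sum of the word vectors (50-dimensional GloVe)
    total = np.zeros((50,))

    # Step 2: sum the word vectors, then divide by the number of words
    for w in words:
        total += word_to_vec_map[w]
    avg = total / len(words)

    return avg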
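
The function above raises a KeyError for a word that is missing from word_to_vec_map and divides by zero for an empty sentence. The following is a hedged sketch of a more forgiving variant, tried on made-up 3-dimensional vectors instead of the real GloVe file. The name `sentence_to_avg_safe`, the `dim` parameter and the toy vectors are assumptions and are not part of the assignment.

```python
import numpy as np

# Tiny made-up 3-dimensional "embeddings" standing in for the 50-d GloVe vectors.
toy_vec_map = {
    "i":    np.array([1.0, 0.0, 0.0]),
    "love": np.array([0.0, 2.0, 0.0]),
    "you":  np.array([0.0, 0.0, 3.0]),
}

def sentence_to_avg_safe(sentence, word_to_vec_map, dim):
    """Average the vectors of the known words; unknown words are skipped."""
    vectors = [word_to_vec_map[w] for w in sentence.lower().split()
               if w in word_to_vec_map]
    if not vectors:                      # empty or fully unknown sentence
        return np.zeros((dim,))
    return np.mean(vectors, axis=0)

avg_known = sentence_to_avg_safe("I love you", toy_vec_map, dim=3)
avg_oov = sentence_to_avg_safe("I love pizzaaa", toy_vec_map, dim=3)
print(avg_known, avg_oov)
```

Skipping unknown words is only one possible policy; mapping them to a dedicated "unknown" vector would be another.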

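We can now implement the whole model: after sentence_to_avg(), run forward propagation, compute the loss, run backpropagation, and finally update the parameters.
Implement model() following the structure in Figure 2-4. Yoh is Y after one-hot encoding, so forward propagation and the loss for the i-th example are:

$$z^{(i)} = W \cdot avg^{(i)} + b$$
$$a^{(i)} = \mathrm{softmax}(z^{(i)})$$
$$\mathcal{L}^{(i)} = -\sum_{k=0}^{n_y-1} Yoh^{(i)}_k \log\left(a^{(i)}_k\right)$$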

def model(X, Y, word_to_vec_map, learning_rate = 0.01, num_iterations = 400):
    """
    Train the word-vector averaging model in numpy.
    Arguments:
        X -- input data, sentences as strings, shape (m, 1).
        Y -- labels, array of integers between 0 and 4, shape (m, 1).
        word_to_vec_map -- dictionary mapping words to their 50-dimensional vectors.
        learning_rate -- learning rate.
        num_iterations -- number of iterations.
    Returns:
        pred -- vector of predictions, shape (m, 1).
        W -- weight matrix, shape (n_y, n_h).
        b -- bias vector, shape (n_y,)
    """
    np.random.seed(1)

    # Number of training examples, number of classes, embedding dimension
    m = Y.shape[0]
    n_y = 5
    n_h = 50

    # Initialize the parameters with Xavier initialization
    W = np.random.randn(n_y, n_h) / np.sqrt(n_h)
    b = np.zeros((n_y,))

    # Convert Y to one-hot (the helpers are available via `from emo_utils import *`)
    Y_oh = convert_to_one_hot(Y, C=n_y)

    # Optimization loop
    for t in range(num_iterations):
        for i in range(m):
            # Average the word vectors of the i-th training example
            avg = sentence_to_avg(X[i], word_to_vec_map)

            # Forward propagation
            z = np.dot(W, avg) + b
            a = softmax(z)

            # Loss for the i-th training example
            cost = -np.sum(Y_oh[i]*np.log(a))

            # Gradients
            dz = a - Y_oh[i]
            dW = np.dot(dz.reshape(n_y,1), avg.reshape(1, n_h))
            db = dz

            # Parameter update
            W = W - learning_rate * dW
            b = b - learning_rate * db
        if t % 100 == 0:
            print("Epoch {t}, cost = {cost}".format(t=t,cost=cost))
            pred = predict(X, Y, W, b, word_to_vec_map)

    return pred, W, b
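
The gradient lines `dz = a - Y_oh[i]` and `dW = dz · avgᵀ` follow from combining softmax with the cross-entropy loss. A quick way to convince yourself is a finite-difference check. The sketch below is self-contained and uses random stand-in data; the local `softmax` and `loss` functions are written here for illustration and are not the emo_utils versions.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - np.max(z))      # shift for numerical stability
    return e / e.sum()

def loss(W, b, avg, y_oh):
    a = softmax(W @ avg + b)
    return -np.sum(y_oh * np.log(a))

rng = np.random.default_rng(0)
n_y, n_h = 5, 50
W = rng.standard_normal((n_y, n_h)) / np.sqrt(n_h)
b = np.zeros(n_y)
avg = rng.standard_normal(n_h)     # stand-in for an averaged sentence vector
y_oh = np.eye(n_y)[2]              # pretend the true class is 2

# Analytic gradients, using the same formulas as in model()
a = softmax(W @ avg + b)
dz = a - y_oh
dW = np.outer(dz, avg)

# Central finite differences for a few entries of W
eps = 1e-6
max_err = 0.0
for (r, c) in [(0, 0), (2, 7), (4, 49)]:
    W_plus, W_minus = W.copy(), W.copy()
    W_plus[r, c] += eps
    W_minus[r, c] -= eps
    numeric = (loss(W_plus, b, avg, y_oh) - loss(W_minus, b, avg, y_oh)) / (2 * eps)
    max_err = max(max_err, abs(numeric - dW[r, c]))

print(max_err)
```

If the analytic formulas are right, the largest discrepancy should be tiny compared with the gradient entries themselves.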

2.4 - Training the Model

Train the model:

pred, W, b = model(X_train, Y_train, word_to_vec_map)
print(pred)

Output:

Figure 2-5: Training results
The model reaches a training accuracy of 97%.

2.5 - Evaluating on the Test Set

print("Training set:")
pred_train = predict(X_train, Y_train, W, b, word_to_vec_map)
print('Test set:')
pred_test = predict(X_test, Y_test, W, b, word_to_vec_map)

Output:

Figure 2-6: Results on the test set
Test-set accuracy reaches 86%.

2.6 - Testing the Emojifier-V1 Model

Let's feed in a few sentences and see whether the model picks the right emoji!

X_my_sentences = np.array(["i adore you", "i love you", "funny lol", "lets play with a ball", "food is ready", "not feeling happy"])
Y_my_labels = np.array([[0], [0], [2], [1], [4],[3]])

pred = predict(X_my_sentences, Y_my_labels , W, b, word_to_vec_map)
print_predictions(X_my_sentences, pred)

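Output:

Figure 2-7: Results on the test sentences
Figure 2-7 shows an accuracy of 83%: of the six sentences, only the last one is predicted incorrectly. The model cannot handle sentences such as "not feeling happy" or "This movie is not good and not enjoyable", because it simply averages all the word vectors and ignores word order. This motivates the second model covered in this post: Emojifier-V2.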
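
The order-blindness of averaging is easy to demonstrate: any permutation of the same words yields the same average vector, so the classifier sees identical input. The sketch below uses made-up 2-dimensional vectors (the real model uses 50-dimensional GloVe vectors), and `average_vector` is a hypothetical helper.

```python
import numpy as np

# Made-up 2-dimensional vectors, for illustration only.
toy_vec_map = {
    "not":     np.array([-1.0, 0.5]),
    "feeling": np.array([0.2, 0.1]),
    "happy":   np.array([1.5, 2.0]),
}

def average_vector(sentence):
    words = sentence.lower().split()
    return np.mean([toy_vec_map[w] for w in words], axis=0)

v1 = average_vector("not feeling happy")
v2 = average_vector("happy feeling not")

# Same words, different order: the averages coincide.
same = np.allclose(v1, v2)
print(same)
```

Because "happy" contributes to the average no matter where "not" appears, a linear classifier on top of the average has no way to tell negated sentences apart from their rearrangements.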

3. Emojifier-V2: Using LSTMs in Keras

We now build a model that takes a sequence of words as input and takes word order into account. Emojifier-V2 still uses the pretrained word embeddings.

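import numpy as np
np.random.seed(0)
import keras
from keras.models import Model
from keras.layers import Dense, Input, Dropout, LSTM, Activation
from keras.layers import Embedding  # keras.layers exposes Embedding directly
from keras.preprocessing import sequence
from emo_utils import *  # read_csv, read_glove_vecs, convert_to_one_hot, label_to_emoji

np.random.seed(1)
from keras.initializers import glorot_uniform

Load the training set, the test set, and the word embeddings: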

X_train, Y_train = read_csv('data/train_emoji.csv')
X_test, Y_test = read_csv('data/tesss.csv')

word_to_index, index_to_word, word_to_vec_map = read_glove_vecs('data/glove.6B.50d.txt')

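3.1 - Structure of the Emojifier-V2 Model

Figure 3-1: The Emojifier-V2 model, a two-layer LSTM sequence classifier.

3.2 - Keras and Mini-batching

In this part we train the Keras model with mini-batches. Most deep learning frameworks, however, require all sequences in a mini-batch to have the same length. A 3-word sentence and a 4-word sentence need a different number of computation steps once converted to vectors (3 LSTM steps for one, 4 for the other), so they cannot be processed together in one batch.
The common solution is padding: choose a maximum sequence length and pad every shorter sentence up to it. For example, with a maximum length of 20, each sentence is padded with "0"s until it has length 20, so the sentence "I love you" is represented as:

$(e_{I}, e_{love}, e_{you}, \vec{0}, \vec{0}, \ldots, \vec{0})$
In this example, any sentence longer than 20 words would be truncated. A simple way to choose the maximum length is to find the longest sentence in the training set and use its length as maxLen.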

maxLen = len(max(X_train, key=len).split())
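
The padding and truncation rule described above can be sketched on plain lists of word indices. The function name `pad_or_truncate` and the index values are made up for illustration; Keras also ships its own utilities for this, which are not used here.

```python
def pad_or_truncate(indices, max_len, pad_value=0):
    """Return a list with exactly max_len entries."""
    clipped = list(indices)[:max_len]                         # cut off anything beyond max_len
    return clipped + [pad_value] * (max_len - len(clipped))   # fill the rest with zeros

short = pad_or_truncate([7, 42, 99], max_len=5)           # a 3-word sentence
long = pad_or_truncate([1, 2, 3, 4, 5, 6, 7], max_len=5)  # a 7-word sentence
print(short, long)
```

With maxLen taken from the longest training sentence, truncation never triggers on the training set, but it can still matter for longer sentences seen at test time.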

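3.3 - The Embedding Layer

In Keras, the embedding matrix is represented as a "layer" that maps positive integers (word indices) to dense vectors of fixed size (the word embeddings). It can be initialized with pretrained embeddings and then either kept fixed or trained further. Here we create an Embedding() layer in Keras and initialize it with the 50-dimensional GloVe vectors. Because our dataset is small, we do not update the word embeddings and keep their values fixed.
The Embedding() layer takes as input an integer matrix of shape (batch size, max input length), as the following figure shows:

Figure 3-2: Structure of the Embedding layer
So next we implement a function that converts input sentences into arrays of word indices that the embedding layer can accept.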

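def sentences_to_indices(X, word_to_index, max_len):
    """
    Convert an array of sentences (strings) into an array of word indices
    that can be fed to Embedding() (see Figure 3-2).

    Arguments:
        X -- array of sentences, shape (m, 1)
        word_to_index -- dictionary mapping each word to its index
        max_len -- maximum sentence length; no sentence in the dataset is longer than this.

    Returns:
        X_indices -- array of word indices corresponding to X, shape (m, max_len)
    """

    m = X.shape[0]  # number of examples
    # Initialize X_indices with zeros
    X_indices = np.zeros((m, max_len))

    for i in range(m):
        # Lowercase the i-th sentence and split it into words.
        sentences_words = X[i].lower().split()

        # Position of the current word within the sentence
        j = 0

        # Loop over the words
        for w in sentences_words:
            # Set entry (i, j) of X_indices to the index of the current word
            X_indices[i, j] = word_to_index[w]

            j += 1

    return X_indices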
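
As written, the function fails with a KeyError on a word outside the vocabulary and with an IndexError on a sentence longer than max_len. Below is a hedged sketch of a tolerant variant, run on a five-word toy vocabulary. The name `sentences_to_indices_safe`, the `unk_index` parameter and the vocabulary are assumptions. Mapping unknown words to 0 reuses the padding index, which is a simplification rather than a recommendation.

```python
import numpy as np

# Toy vocabulary, invented for illustration; index 0 is left for padding.
toy_word_to_index = {"i": 1, "love": 2, "you": 3, "funny": 4, "lol": 5}

def sentences_to_indices_safe(X, word_to_index, max_len, unk_index=0):
    X_indices = np.zeros((len(X), max_len))
    for i, sentence in enumerate(X):
        words = sentence.lower().split()[:max_len]      # truncate long sentences
        for j, w in enumerate(words):
            # Unknown words fall back to unk_index instead of raising KeyError.
            X_indices[i, j] = word_to_index.get(w, unk_index)
    return X_indices

X_toy = np.array(["I love you", "funny lol", "I love unknownword so much more"])
toy_indices = sentences_to_indices_safe(X_toy, toy_word_to_index, max_len=4)
print(toy_indices)
```

Each row has exactly max_len entries, so the result can be passed to the embedding layer as one rectangular batch.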

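We can now build the embedding layer from the pretrained word vectors. Once it is built, it takes the output of sentences_to_indices() as input and returns the word embeddings for each sentence.
Implement pretrained_embedding_layer(), which consists of the following steps:

Initialize the embedding matrix with zeros.
Fill the embedding matrix with the word vectors from word_to_vec_map.
Define the embedding layer in Keras. The parameters of this layer must not be trained, so we set trainable=False when calling Embedding().
Set the weights of the embedding layer to the embedding matrix.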

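def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    """
    Create a Keras Embedding() layer and load the pretrained 50-dimensional GloVe vectors.

    Arguments:
        word_to_vec_map -- dictionary mapping words to their embedding vectors
        word_to_index -- dictionary mapping words to their indices in the vocabulary (400,001 words).

    Returns:
        embedding_layer -- Keras layer instance holding the pretrained embeddings.
    """
    vocab_len = len(word_to_index) + 1  # +1 to fit the Keras Embedding layer
    emb_dim = word_to_vec_map["cucumber"].shape[0]

    # Initialize the embedding matrix with zeros
    emb_matrix = np.zeros((vocab_len, emb_dim))

    # Set row "index" of the embedding matrix to the vector of the word with that index
    for word, index in word_to_index.items():
        emb_matrix[index, :] = word_to_vec_map[word]

    # Define the Keras embedding layer; trainable=False keeps the vectors fixed
    embedding_layer = Embedding(vocab_len, emb_dim, trainable=False)

    # Build the embedding layer so that its weights exist.
    embedding_layer.build((None,))

    # Set the weights of the embedding layer to the embedding matrix.
    embedding_layer.set_weights([emb_matrix])

    return embedding_layer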
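
Conceptually, a frozen embedding layer is just a row lookup into emb_matrix. The NumPy sketch below mimics that lookup with toy sizes so the shapes are easy to follow. The matrix values and indices are invented, and no Keras call is involved.

```python
import numpy as np

vocab_len, emb_dim = 6, 3     # toy sizes; the post uses the GloVe vocabulary and 50 dimensions
emb_matrix = np.arange(vocab_len * emb_dim, dtype=float).reshape(vocab_len, emb_dim)
emb_matrix[0] = 0.0           # row 0 stays zero, matching the padding index

X_indices = np.array([[1, 2, 3, 0],
                      [4, 5, 0, 0]])      # shape (m, max_len)

# Integer-array indexing picks one row of emb_matrix per word index.
embeddings = emb_matrix[X_indices]        # shape (m, max_len, emb_dim)
print(embeddings.shape)
```

The output has one emb_dim-sized vector per word position, which is the (batch, time steps, features) layout an LSTM layer expects.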

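3.4 - Building the Emojifier-V2 Model

def Emojify_V2(input_shape, word_to_vec_map, word_to_index):
    """
    Build the computation graph of the Emojify-V2 model.

    Arguments:
        input_shape -- shape of the input, usually (max_len,)
        word_to_vec_map -- dictionary mapping words to their embedding vectors.
        word_to_index -- dictionary mapping words to their indices in the vocabulary (400,001 words).

    Returns:
        model -- Keras model instance
    """
    # Define sentence_indices as the input of the graph, with shape input_shape and dtype 'int32'
    sentence_indices = Input(input_shape, dtype='int32')

    # Create the embedding layer
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)

    # Propagate sentence_indices through the embedding layer to get the embeddings
    embeddings = embedding_layer(sentence_indices)

    # Propagate the embeddings through an LSTM layer with a 128-dimensional hidden state.
    # Note that the returned output must be a batch of sequences.
    X = LSTM(128, return_sequences=True)(embeddings)
    # Apply dropout with probability 0.5
    X = Dropout(0.5)(X)
    # Propagate X through another LSTM layer with a 128-dimensional hidden state.
    # Note that the returned output must be a single hidden state, not a batch of sequences.
    X = LSTM(128, return_sequences=False)(X)
    # Apply dropout with probability 0.5
    X = Dropout(0.5)(X)
    # Propagate X through a Dense layer to get a batch of 5-dimensional vectors.
    X = Dense(5)(X)
    # Add the softmax activation
    X = Activation('softmax')(X)

    # Create the model instance
    model = Model(inputs=sentence_indices, outputs=X)

    return model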
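
As a sanity check on this architecture, the number of trainable parameters can be worked out by hand. The sketch assumes the standard LSTM parameterization with biases (four gates, each with input weights, recurrent weights and a bias vector) and a Dense layer with bias. The helper names are hypothetical, and the frozen embedding layer is left out because it contributes no trainable weights.

```python
def lstm_param_count(input_dim, units):
    # Four gates, each with input weights, recurrent weights and a bias.
    return 4 * (units * (input_dim + units) + units)

def dense_param_count(input_dim, units):
    # One weight per input/output pair plus one bias per output.
    return input_dim * units + units

lstm1 = lstm_param_count(input_dim=50, units=128)    # fed by 50-d GloVe vectors
lstm2 = lstm_param_count(input_dim=128, units=128)   # fed by the first LSTM's outputs
dense = dense_param_count(input_dim=128, units=5)    # five emoji classes
trainable_total = lstm1 + lstm2 + dense
print(lstm1, lstm2, dense, trainable_total)
```

If Keras uses this parameterization (its default for LSTM, as far as I know), the per-layer figures should line up with what model.summary() reports.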

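3.5 - Compiling and Training the Model

After creating the model in Keras, we need to compile it before training and evaluating it. We use the categorical_crossentropy loss, the adam optimizer, and ['accuracy'] as the metric.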

model = Emojify_V2((maxLen,), word_to_vec_map, word_to_index)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

The Emojifier-V2 model takes input of shape (m, max_len) and produces output of shape (m, number of classes). Before training, we therefore convert X_train to X_train_indices and Y_train to Y_train_oh.
We fit the model on X_train_indices and Y_train_oh with epochs = 50 and batch_size = 32.

X_train_indices = sentences_to_indices(X_train, word_to_index, maxLen)
Y_train_oh = convert_to_one_hot(Y_train, C=5)

model.fit(X_train_indices, Y_train_oh, epochs=50, batch_size=32, shuffle=True)

Output:

Epoch 1/50
132/132 [==============================] - 5s 41ms/step - loss: 1.6106 - acc: 0.1667
Epoch 2/50
132/132 [==============================] - 0s 1ms/step - loss: 1.5380 - acc: 0.3106
Epoch 3/50
132/132 [==============================] - 0s 1ms/step - loss: 1.5063 - acc: 0.3030

...

Epoch 48/50
132/132 [==============================] - 0s 1ms/step - loss: 0.0759 - acc: 0.9697
Epoch 49/50
132/132 [==============================] - 0s 1ms/step - loss: 0.0467 - acc: 0.9924
Epoch 50/50
132/132 [==============================] - 0s 1ms/step - loss: 0.0417 - acc: 0.9848
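
The first-epoch loss of 1.6106 in the log is close to what an untrained five-class softmax would give: predicting a uniform distribution costs ln 5 per example. The short calculation below checks that figure and also works out the number of mini-batches per epoch for the 132 samples shown in the log with batch_size = 32. It is plain arithmetic and involves no Keras calls.

```python
import math

num_classes = 5
# Cross-entropy of a uniform prediction: -log(1/5) = log(5)
uniform_loss = -math.log(1.0 / num_classes)

num_samples, batch_size = 132, 32
# The last batch is smaller (132 = 4 * 32 + 4), so round up.
batches_per_epoch = math.ceil(num_samples / batch_size)

print(round(uniform_loss, 4), batches_per_epoch)
```

A starting loss far above ln 5 would usually hint at a problem with initialization or label encoding, so this makes a handy first check.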

3.6 - Evaluating on the Test Set

With the training accuracy close to 100%, we evaluate the model on the test set.

X_test_indices = sentences_to_indices(X_test, word_to_index, max_len=maxLen)
Y_test_oh = convert_to_one_hot(Y_test, C=5)
loss, acc = model.evaluate(X_test_indices, Y_test_oh)
print("Test accuracy = ", acc)

Result:

56/56 [==============================] - 1s 10ms/step
Test accuracy =  0.928571428571

The test-set accuracy is 93%.
We can look at which examples were misclassified:

print("Mislabeled examples:")
X_test_indices = sentences_to_indices(X_test, word_to_index, maxLen)
pred = model.predict(X_test_indices)
for i in range(len(X_test)):
    num = np.argmax(pred[i])
    if (num != Y_test[i]):
        print('Expected emoji:' + label_to_emoji(Y_test[i]) + ' prediction: ' + X_test[i] + label_to_emoji(num).strip())
print('\n')

Result:

Figure 3-3: Mislabeled examples.
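
Besides listing individual mistakes, a confusion matrix shows which emoji classes get mixed up with which. The sketch below builds one with NumPy from made-up labels and predictions (not the actual test results); the function name `confusion_matrix` is local to this example.

```python
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes):
    """cm[t, p] counts examples with true class t that were predicted as class p."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

# Invented labels and predictions, for illustration only.
y_true_toy = np.array([0, 0, 1, 2, 3, 3, 4])
y_pred_toy = np.array([0, 2, 1, 2, 3, 0, 4])

cm = confusion_matrix(y_true_toy, y_pred_toy, num_classes=5)
accuracy = np.trace(cm) / cm.sum()      # correct predictions lie on the diagonal
print(cm)
print(accuracy)
```

With the real model, y_pred would come from `np.argmax(pred, axis=1)` and y_true from Y_test.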

3.7 - Testing the Emojifier-V2 Model

x_test = np.array(['not feeling happy','no one knows America better than Trump', 'I want to have lunch with you', 'I love playing basketball', 'I love China',
                   'I love yangping'])
X_test_indices = sentences_to_indices(x_test, word_to_index, maxLen)
pred = model.predict(X_test_indices)
for i in range(len(x_test)):
    num = np.argmax(pred[i])
    print(x_test[i] + ' ' + label_to_emoji(num).strip())

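PyCharm result:

Figure 3-4: PyCharm output
For a clearer view, here is the Notebook result:

Figure 3-5: Notebook output
Note: the final outputs in PyCharm and in the Notebook differ somewhat, possibly because the weights trained in the two environments are not the same. In addition, the GloVe embeddings used in the PyCharm run were downloaded from the web and may not be the same version as the ones in the Notebook.
Still, the Emojifier-V2 model predicts the emoji for "not feeling happy" correctly.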

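Closing Remarks

After an afternoon of trial and error, I finally reproduced the Coursera project on my own machine. I ran into quite a few bugs along the way, but fortunately resolved them one by one. One thing to note: when reproducing the project in PyCharm, reading the txt file fails with a gbk encoding error. The fix is as follows:
Open emo_utils.py and, in read_glove_vecs(glove_file), change open(glove_file, 'r') to open(glove_file, 'r', encoding='utf-8').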

def read_glove_vecs(glove_file):
    with open(glove_file, 'r', encoding='utf-8') as f:
        words = set()
        word_to_vec_map = {}
        for line in f:
            line = line.strip().split()
            curr_word = line[0]
            words.add(curr_word)
            word_to_vec_map[curr_word] = np.array(line[1:], dtype=np.float64)
        # ... the rest of the function is not shown in this excerpt
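
The effect of the explicit encoding can be tried without the real GloVe file. The sketch below writes two made-up lines in GloVe text format to a temporary file, one with a non-ASCII word, and reads them back with encoding='utf-8'. The file name and the vectors are invented for illustration.

```python
import os
import tempfile
import numpy as np

# Two made-up lines in GloVe text format; the second word is non-ASCII.
sample = "hello 0.1 0.2 0.3\ncafé 0.4 0.5 0.6\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy_glove.txt")
    with open(path, "w", encoding="utf-8") as f:
        f.write(sample)

    toy_vec_map = {}
    # With an explicit encoding the result no longer depends on the
    # platform default (gbk on the machine described above).
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            parts = line.strip().split()
            toy_vec_map[parts[0]] = np.array(parts[1:], dtype=np.float64)

print(sorted(toy_vec_map))
```

Without the encoding argument, Python falls back to the locale's preferred encoding, which is why the same script can behave differently in two environments.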

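Finally, I strongly recommend Andrew Ng's deep learning courses; they really are excellent!
This post is a record of my own attempt, and I hope it helps anyone who needs it!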

Andrew Ng's Deep Learning Course, Sequence Models Programming Assignment: Emojify Implemented in PyCharm
