Coursera Deep Learning, Course 5, Week 2

Word embeddings
Previously each word (or character) was represented by a one-hot vector, but the dot product between any two such vectors is 0: every pair of words looks equally unrelated, so one-hot representations cannot measure similarity between words. Instead, each word is represented as a dense vector, where each dimension can loosely be read as an attribute of the word; this representation is called a word embedding. The embedding matrix is learned by setting up a supervised prediction task on a corpus, using methods such as word2vec, negative sampling, and GloVe, each successively simpler than the last.
Properties of word embeddings
For example, the difference vectors man − woman and king − queen both point roughly along a "gender" direction, which is what makes word analogies such as man : woman :: king : queen possible.
The similarity between two word vectors can be measured with cosine similarity or Euclidean distance.
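A minimal numpy sketch of both ideas; the vectors below are made up for illustration, not real GloVe values:

import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = u.v / (||u|| * ||v||): 1 means same direction, 0 means orthogonal (like one-hot vectors)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# Toy 3-d embeddings, made up for illustration (real GloVe vectors are 50-300 dimensional).
word_to_vec_map = {
    "man":   np.array([ 1.0, 0.2, 0.8]),
    "woman": np.array([-1.0, 0.2, 0.8]),
    "king":  np.array([ 1.0, 0.9, 0.1]),
    "queen": np.array([-1.0, 0.9, 0.1]),
}

# man - woman and king - queen both point along the same "gender" axis, so their cosine
# similarity is close to 1, which is what makes the analogy king - man + woman ~ queen work.
g1 = word_to_vec_map["man"] - word_to_vec_map["woman"]
g2 = word_to_vec_map["king"] - word_to_vec_map["queen"]
print(cosine_similarity(g1, g2))   # 1.0 for these toy vectors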
Embedding matrix
Multiplying the embedding matrix by a word's one-hot vector gives that word's embedding, but computing it that way is wasteful in practice, because almost every multiplication involves a zero. Frameworks therefore implement the operation as a table lookup; Keras, for example, provides a dedicated Embedding layer for this.
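A small numpy sketch of why the lookup view matters, with hypothetical vocabulary and embedding sizes: multiplying E by a one-hot vector just selects one column of E, so it can be replaced by an index operation.

import numpy as np

vocab_size, emb_dim = 10000, 300          # hypothetical sizes
E = np.random.randn(emb_dim, vocab_size)  # embedding matrix, one column per word (course convention;
                                          # the Keras code below stores the transpose, one row per word)
w = 1234                                  # index of some word in the vocabulary
o_w = np.zeros(vocab_size)
o_w[w] = 1.0                              # one-hot vector for that word

e_slow = E @ o_w                          # emb_dim * vocab_size multiply-adds, almost all with zeros
e_fast = E[:, w]                          # direct column lookup, which is what an Embedding layer does

assert np.allclose(e_slow, e_fast)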
Word2vec: the skip-gram model
Pick a word in a sentence as the context word, then pick another word within some window to its left or right as the target word. A classifier trained on pairs generated this way will not reach high accuracy on the prediction task itself, but training it produces good word embeddings. Both the embedding matrix E and the softmax parameters θ are learned. In practice the softmax is slow because its denominator sums over the entire vocabulary, so a hierarchical softmax can be used instead; the tree is generally not symmetric, with frequent words placed closer to the root.
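A numpy sketch of the skip-gram softmax step, with hypothetical sizes and word indices and randomly initialized parameters: the context word's embedding e_c is scored against a parameter vector theta_t for every candidate target word, and the normalization over the whole vocabulary is the expensive part.

import numpy as np

vocab_size, emb_dim = 10000, 300
E = 0.01 * np.random.randn(emb_dim, vocab_size)      # embedding matrix (to be trained)
Theta = 0.01 * np.random.randn(vocab_size, emb_dim)  # softmax parameters, one row theta_t per target word (to be trained)

c = 4257                                  # hypothetical index of the context word, e.g. "orange"
t = 6093                                  # hypothetical index of the sampled target word, e.g. "juice"

e_c = E[:, c]                             # context embedding, equivalent to E @ one_hot(c)
logits = Theta @ e_c                      # theta_t . e_c for every candidate target word t
p = np.exp(logits - logits.max())
p /= p.sum()                              # the softmax denominator sums over all 10,000 words -- the slow part

loss = -np.log(p[t])                      # cross-entropy loss for this (context, target) pair

Hierarchical softmax replaces this flat classifier with a tree of binary classifiers, so a prediction costs roughly log2(vocab_size) decisions instead of a sum over the whole vocabulary; placing frequent words near the root makes the average cost even lower.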
If the context word were sampled uniformly from the corpus, very frequent words such as "the" and "of" would dominate the training pairs, which is inefficient, so in practice the sampling distribution is adjusted to balance common and rare words.
Negative Sampling
Negative sampling is more efficient than the skip-gram softmax: fix a context word, generate one positive (context, target) pair as above, then draw k negative target words from the corpus according to some distribution, and train a binary (logistic) classifier on each (context, word) pair.
Note that k shrinks as the dataset grows: roughly 5-20 negative examples for smaller datasets, 2-5 for larger ones.
On choosing the negative samples: 1. Sampling by empirical frequency, i.e. by each word's probability of occurring in the corpus, tends to pick words like "the", "of", and "and". 2. Sampling uniformly over the vocabulary ignores the actual word distribution. 3. The paper therefore uses a compromise between the two, where f(w_i) is the empirical frequency of word w_i in the corpus:

P(w_i) = f(w_i)^(3/4) / Σ_j f(w_j)^(3/4)
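A sketch of how the sampling distribution and the training rows could be built, using a tiny made-up frequency table:

import numpy as np

# Made-up empirical frequencies f(w_i); a real corpus would have a full vocabulary here.
word_freq = {"the": 0.050, "of": 0.030, "orange": 0.0005, "juice": 0.0004, "durian": 0.00001}
words = list(word_freq)
f = np.array([word_freq[w] for w in words])

# Compromise distribution: P(w_i) = f(w_i)^(3/4) / sum_j f(w_j)^(3/4)
p = f ** 0.75
p /= p.sum()

k = 4                                              # 2-5 for large corpora, 5-20 for small ones
context, target = "orange", "juice"                # one positive (context, target) pair
negatives = np.random.choice(words, size=k, p=p)   # k negative target words drawn from P

# Each (context, word, label) row is fed to a logistic classifier sigmoid(theta_word . e_context),
# so only k+1 binary classifiers are updated per context word instead of a full softmax.
rows = [(context, target, 1)] + [(context, w, 0) for w in negatives]
print(rows)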
GloVe (global vectors for word representation) is even simpler: build a co-occurrence matrix X, where X_ij counts how often word i appears in the context of word j, and minimize Σ_{i,j} f(X_ij) (θ_i^T e_j + b_i + b'_j − log X_ij)², where the weight f(X_ij) is 0 when X_ij = 0 and damps extremely frequent pairs.
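A numpy sketch of that objective on a tiny random co-occurrence matrix; the sizes and counts are made up, and the weighting function below is one common capped-power choice:

import numpy as np

vocab_size, emb_dim = 5, 4                            # tiny, made-up sizes
X = np.random.randint(0, 20, size=(vocab_size, vocab_size)).astype(float)  # co-occurrence counts X_ij

theta = 0.1 * np.random.randn(vocab_size, emb_dim)    # one vector theta_i per word
e     = 0.1 * np.random.randn(vocab_size, emb_dim)    # one vector e_j per word
b, bp = np.zeros(vocab_size), np.zeros(vocab_size)    # bias terms b_i and b'_j

def weight(x):
    # f(X_ij): 0 when a pair never co-occurs, capped so extremely frequent pairs don't dominate
    return np.where(x > 0, np.minimum(x / 100.0, 1.0) ** 0.75, 0.0)

pred = theta @ e.T + b[:, None] + bp[None, :]         # theta_i . e_j + b_i + b'_j
err = pred - np.log(np.maximum(X, 1e-12))             # deviation from log X_ij (zero-weighted where X_ij = 0)
loss = np.sum(weight(X) * err ** 2)                   # GloVe objective, minimized by gradient descent
print(loss)

Since θ_i and e_i play symmetric roles in this objective, the final word vector is usually taken as their average (θ_i + e_i) / 2.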
Sentiment Classification
1. Approach 1: simply sum or average the word embeddings of the sentence and feed the result to a softmax classifier. This easily misclassifies reviews like "Completely lacking in good taste, good service, and good ambience", where "good" appears many times next to a negation, because averaging ignores word order (see the sketch after this list).
2. Approach 2: feed the word embeddings into an RNN (a many-to-one architecture) and classify from the final hidden state, which does take word order into account.
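A sketch of approach 1, with made-up placeholder vectors and classifier weights standing in for pre-trained GloVe embeddings and a trained softmax:

import numpy as np

emb_dim, n_classes = 50, 5

def softmax(z):
    z = np.exp(z - z.max())
    return z / z.sum()

def predict_avg(sentence, word_to_vec_map, W, b):
    # Approach 1: average the word embeddings, then apply a softmax classifier.
    # Word order is lost, so repeated "good"s can outweigh "lacking"/"not".
    words = sentence.lower().split()
    avg = np.mean([word_to_vec_map[w] for w in words], axis=0)   # shape (emb_dim,)
    return softmax(W @ avg + b)                                  # shape (n_classes,)

# Placeholder random vectors and weights, only to show the shapes; real use would plug in
# pre-trained GloVe vectors and a classifier trained on labelled reviews.
rng = np.random.default_rng(0)
word_to_vec_map = {w: rng.normal(size=emb_dim) for w in "completely lacking in good taste service".split()}
W, b = rng.normal(size=(n_classes, emb_dim)), np.zeros(n_classes)
print(predict_avg("Completely lacking in good taste good service", word_to_vec_map, W, b))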
Debiasing word embeddings (removing biases such as gender bias that the embeddings pick up from the training corpus)
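The course's recipe has three steps: (1) identify the bias direction g, e.g. e_woman − e_man or an average of several such differences; (2) neutralize words that are not definitional for gender (such as "doctor" or "babysitter") by removing their component along g; (3) equalize definitional pairs (such as "grandmother"/"grandfather") so they differ only along g. A sketch of the neutralize step on made-up 3-d vectors:

import numpy as np

def neutralize(word, g, word_to_vec_map):
    # Project e onto the bias direction g and subtract that component:
    # e_bias = (e.g / ||g||^2) * g, and e - e_bias is orthogonal to g.
    e = word_to_vec_map[word]
    e_bias = (np.dot(e, g) / np.dot(g, g)) * g
    return e - e_bias

# Made-up 3-d vectors where the first axis plays the role of the gender direction.
word_to_vec_map = {"woman":  np.array([ 1.0, 0.3, 0.2]),
                   "man":    np.array([-1.0, 0.3, 0.2]),
                   "doctor": np.array([-0.4, 0.9, 0.5])}  # spurious component along the gender axis
g = word_to_vec_map["woman"] - word_to_vec_map["man"]     # bias direction
print(neutralize("doctor", g, word_to_vec_map))           # first component becomes 0

The remainder of this post is the code from the week's Emojify-v2 programming assignment: sentences are mapped to index sequences, looked up in a frozen Embedding layer initialized with 50-dimensional GloVe vectors, and fed through a two-layer LSTM with dropout and a 5-way softmax.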

import numpy as np
from keras.models import Model
from keras.layers import Dense, Input, Dropout, LSTM, Activation
from keras.layers.embeddings import Embedding
from keras.preprocessing import sequence
from keras.initializers import glorot_uniform
def sentences_to_indices(X, word_to_index, max_len):
    """
    Converts an array of sentences (strings) into an array of indices corresponding to words in the sentences.
    The output shape should be such that it can be given to `Embedding()` (described in Figure 4). 

    Arguments:
    X -- array of sentences (strings), of shape (m, 1)
    word_to_index -- a dictionary mapping each word to its index
    max_len -- maximum number of words in a sentence. You can assume every sentence in X is no longer than this. 

    Returns:
    X_indices -- array of indices corresponding to words in the sentences from X, of shape (m, max_len)
    """

    m = X.shape[0]                                   # number of training examples

    ### START CODE HERE ###
    # Initialize X_indices as a numpy matrix of zeros and the correct shape (≈ 1 line)
    X_indices = np.zeros((m, max_len))

    for i in range(m):                               # loop over training examples

        # Convert the ith training sentence to lower case and split it into words; you get a list of words.
        sentence_words = X[i].lower().split()

        # Initialize j to 0
        j = 0

        # Loop over the words of sentence_words
        for w in sentence_words:
            # Set the (i,j)th entry of X_indices to the index of the correct word.
            X_indices[i, j] = word_to_index[w]
            # Increment j to j + 1
            j = j + 1

    ### END CODE HERE ###   
    return X_indices

def pretrained_embedding_layer(word_to_vec_map, word_to_index):
    """
    Creates a Keras Embedding() layer and loads in pre-trained GloVe 50-dimensional vectors.

    Arguments:
    word_to_vec_map -- dictionary mapping words to their GloVe vector representation.
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)

    Returns:
    embedding_layer -- pretrained layer Keras instance
    """

    vocab_len = len(word_to_index) + 1                  # adding 1 to fit Keras embedding (requirement)
    emb_dim = word_to_vec_map["cucumber"].shape[0]      # define dimensionality of your GloVe word vectors (= 50)

    ### START CODE HERE ###
    # Initialize the embedding matrix as a numpy array of zeros of shape (vocab_len, dimensions of word vectors = emb_dim)
    emb_matrix = np.zeros((vocab_len, emb_dim))

    # Set each row "index" of the embedding matrix to be the word vector representation of the "index"th word of the vocabulary
    for word, index in word_to_index.items():
        emb_matrix[index, :] = word_to_vec_map[word]

    # Define the Keras embedding layer with the correct input/output sizes and make it non-trainable by setting trainable=False.
    embedding_layer = Embedding(vocab_len, emb_dim, trainable=False)
    ### END CODE HERE ###

    # Build the embedding layer, it is required before setting the weights of the embedding layer. Do not modify the "None".
    embedding_layer.build((None,))

    # Set the weights of the embedding layer to the embedding matrix. Your layer is now pretrained.
    embedding_layer.set_weights([emb_matrix])

    return embedding_layer

def Emojify_V2(input_shape, word_to_vec_map, word_to_index):
    """
    Function creating the Emojify-v2 model's graph.

    Arguments:
    input_shape -- shape of the input, usually (max_len,)
    word_to_vec_map -- dictionary mapping every word in a vocabulary into its 50-dimensional vector representation
    word_to_index -- dictionary mapping from words to their indices in the vocabulary (400,001 words)

    Returns:
    model -- a model instance in Keras
    """

    ### START CODE HERE ###
    # Define sentence_indices as the input of the graph, it should be of shape input_shape and dtype 'int32' (as it contains indices).
    sentence_indices = Input(shape = input_shape, dtype = 'int32')

    # Create the embedding layer pretrained with GloVe Vectors (≈1 line)
    embedding_layer = pretrained_embedding_layer(word_to_vec_map, word_to_index)

    # Propagate sentence_indices through your embedding layer, you get back the embeddings
    embeddings = embedding_layer(sentence_indices) 

    # Propagate the embeddings through an LSTM layer with 128-dimensional hidden state
    # Be careful, the returned output should be a batch of sequences.
    X = LSTM(128, return_sequences=True)(embeddings)
    # Add dropout with a probability of 0.5
    X = Dropout(0.5)(X)
    # Propagate X trough another LSTM layer with 128-dimensional hidden state
    # Be careful, the returned output should be a single hidden state, not a batch of sequences.
    X = LSTM(128)(X)
    # Add dropout with a probability of 0.5
    X = Dropout(0.5)(X)
    # Propagate X through a Dense layer with softmax activation to get back a batch of 5-dimensional vectors.
    X = Dense(5)(X)
    # Add a softmax activation
    X = Activation('softmax')(X)

    # Create Model instance which converts sentence_indices into X.
    model = Model(inputs=sentence_indices, outputs=X)

    ### END CODE HERE ###

    return model

# Build and compile the model; maxLen, word_to_vec_map and word_to_index come from the assignment's earlier data-loading cells.
model = Emojify_V2((maxLen,), word_to_vec_map, word_to_index)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
X_train_indices = sentences_to_indices(X_train, word_to_index, maxLen)
Y_train_oh = convert_to_one_hot(Y_train, C = 5)
model.fit(X_train_indices, Y_train_oh, epochs = 50, batch_size = 32, shuffle=True)
# This code allows you to see the mislabelled examples
C = 5
y_test_oh = np.eye(C)[Y_test.reshape(-1)]
X_test_indices = sentences_to_indices(X_test, word_to_index, maxLen)
pred = model.predict(X_test_indices)
for i in range(len(X_test)):
    num = np.argmax(pred[i])
    if num != Y_test[i]:
        print('Expected emoji: ' + label_to_emoji(Y_test[i]) + ' prediction: ' + X_test[i] + ' ' + label_to_emoji(num).strip())
# Change the sentence below to see your own prediction. Make sure all the words are in the GloVe embeddings.
x_test = np.array(['feel not very good'])
X_test_indices = sentences_to_indices(x_test, word_to_index, maxLen)
print(x_test[0] +' '+  label_to_emoji(np.argmax(model.predict(X_test_indices))))



Reposted from blog.csdn.net/yb564645735/article/details/79319060