Reading only the abridged version of the problems, I felt completely lost and could not understand them at all; only after going back to the original problem statements did I know what was being asked. So below I paste the original problems and work through them one by one.
Problem statement:
(a) Prove a property of the softmax function: adding the same constant offset to every entry of the input does not change the softmax output. In practice this offset is usually taken to be the maximum value of the input.
(b) Given an input matrix with N rows and D columns, compute the softmax of each row. A vectorized implementation is strongly preferred, since it lays a good foundation for the later parts; a non-vectorized implementation will not receive full credit.
Solutions to the two parts:
(a)
Shift invariance follows directly from the properties of the exponential function; a short derivation is given below.
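With $c$ an arbitrary constant added to every component of $x$:
$$\mathrm{softmax}(x + c)_i = \frac{e^{x_i + c}}{\sum_j e^{x_j + c}} = \frac{e^{c}\,e^{x_i}}{e^{c}\sum_j e^{x_j}} = \frac{e^{x_i}}{\sum_j e^{x_j}} = \mathrm{softmax}(x)_i.$$
Choosing $c = -\max_i x_i$ makes every exponent non-positive, which is what prevents overflow in practice.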
(b) Programming implementation
The main points to note in this problem:
1. Vectorization really matters for this kind of problem, so vectorize whenever you can. For beginners this way of thinking can be hard to grasp at first, but you have to keep getting familiar with it and keep applying it. For vectorized code, numpy is the indispensable library.
A numpy tutorial: https://www.jianshu.com/p/358948fbbc6e
2. The trick in this problem is to apply the shift from part (a), taking the offset to be the maximum value; otherwise the test cases provided on the course website overflow.
import numpy as np
def softmax(x):
"""Compute the softmax function for each row of the input x.
It is crucial that this function is optimized for speed because
it will be used frequently in later code. You might find numpy
functions np.exp, np.sum, np.reshape, np.max, and numpy
broadcasting useful for this task.
Numpy broadcasting documentation:
http://docs.scipy.org/doc/numpy/user/basics.broadcasting.html
You should also make sure that your code works for a single
D-dimensional vector (treat the vector as a single row) and
for N x D matrices. This may be useful for testing later. Also,
make sure that the dimensions of the output match the input.
You must implement the optimization in problem 1(a) of the
written assignment!
Arguments:
x -- A D dimensional vector or N x D dimensional numpy matrix.
Return:
x -- You are allowed to modify x in-place
"""
orig_shape = x.shape
if len(x.shape) > 1:
# Matrix
x = x - np.max(x, axis=1, keepdims=True)
x = np.exp(x)/np.sum(np.exp(x), axis=1, keepdims=True)
else:
# Vector
x = x - np.max(x)
x = np.exp(x)/np.sum(np.exp(x))
assert x.shape == orig_shape
return x
def test_softmax_basic():
"""
Some simple tests to get you started.
Warning: these are not exhaustive.
"""
print "Running basic tests..."
test1 = softmax(np.array([1,2]))
print test1
ans1 = np.array([0.26894142, 0.73105858])
assert np.allclose(test1, ans1, rtol=1e-05, atol=1e-06)
test2 = softmax(np.array([[1001,1002],[3,4]]))
print test2
ans2 = np.array([
[0.26894142, 0.73105858],
[0.26894142, 0.73105858]])
assert np.allclose(test2, ans2, rtol=1e-05, atol=1e-06)
test3 = softmax(np.array([[-1001,-1002]]))
print test3
ans3 = np.array([0.73105858, 0.26894142])
assert np.allclose(test3, ans3, rtol=1e-05, atol=1e-06)
print "You should be able to verify these results by hand!\n"
if __name__ == "__main__":
test_softmax_basic()
Test results:
I commented out the last print; screenshot below:
To check the program by hand: the graph of the exponential function looks as follows:
For large inputs the exponential blows up, and for very negative inputs it underflows to essentially zero, so exploiting the shift invariance is a good fix.
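As a quick numerical illustration (a minimal interactive sketch; the values match the [1001, 1002] test case used above):
import numpy as np
x = np.array([1001.0, 1002.0])
print np.exp(x)              # [ inf  inf] -- overflow (with a RuntimeWarning)
print np.exp(x - np.max(x))  # [ 0.36787944  1. ] -- safe after shifting by the max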
In the skeleton code provided by the instructor, the assert statements act as assertions that catch errors, checking whether the computed values deviate from the reference values beyond the given tolerance.
Problem statement:
(a) Derive the gradient of the sigmoid function and show that it can be written in terms of the function value itself. Assume the input x is a scalar.
(b) Derive the gradient with respect to the inputs of a softmax function when cross-entropy loss is used, where the class label y is a one-hot vector (exactly one entry is 1 and the rest are 0).
(c) Derive the gradients with respect to the input x of a neural network with a single hidden layer, using cross-entropy loss, a sigmoid activation in the hidden layer, and a softmax on the output layer, with one-hot labels. (This is the standard backpropagation derivation for such a network.)
(d) How many parameters does the network in the figure have, assuming the input is Dx-dimensional, the output is Dy-dimensional, and there are H hidden units?
(e) Implement the sigmoid activation function and its gradient.
(f) Implement the gradient checker.
(g) Implement the forward and backward passes of a neural network with one sigmoid hidden layer.
Solutions:
(a) Derivative of the sigmoid function
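With $\sigma(x) = 1/(1 + e^{-x})$:
$$\sigma'(x) = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \sigma(x)\bigl(1 - \sigma(x)\bigr),$$
which is why sigmoid_grad in part (e) takes the sigmoid value s rather than the original input x.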
(b) Gradient at the output layer
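With $\hat{y} = \mathrm{softmax}(\theta)$ and a one-hot label $y$, the standard result is
$$CE(y, \hat{y}) = -\sum_i y_i \log \hat{y}_i, \qquad \frac{\partial CE}{\partial \theta} = \hat{y} - y.$$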
(c) This mainly tests applying the chain rule. The key thing to watch is the matrix dimensions, i.e. whether or not to transpose, which has to be decided from how the dimensions change; the equations are summarized below.
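Written to be consistent with the implementation in part (g) below (row-vector convention: $z_1 = xW_1 + b_1$, $a_1 = \sigma(z_1)$, $z_2 = a_1 W_2 + b_2$, $\hat{y} = \mathrm{softmax}(z_2)$):
$$\delta_2 = \hat{y} - y, \qquad \delta_1 = \bigl(\delta_2 W_2^{\top}\bigr) \circ a_1 \circ (1 - a_1), \qquad \frac{\partial CE}{\partial x} = \delta_1 W_1^{\top},$$
and for the weights, $\partial CE/\partial W_2 = a_1^{\top}\delta_2$ and $\partial CE/\partial W_1 = x^{\top}\delta_1$, with the bias gradients obtained by summing $\delta_2$ and $\delta_1$ over the batch.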
(d) Counting the parameters: first layer (hidden) plus second layer (output), i.e. (Dx + 1) * H + (H + 1) * Dy parameters in total, counting weights and biases.
(e) Implement the sigmoid function and its gradient:
#!/usr/bin/env python
import numpy as np
def sigmoid(x):
"""
Compute the sigmoid function for the input here.
Arguments:
x -- A scalar or numpy array.
Return:
s -- sigmoid(x)
"""
s = 1 / (1 + np.exp(-x))
return s
def sigmoid_grad(s):
"""
Compute the gradient for the sigmoid function here. Note that
for this implementation, the input s should be the sigmoid
function value of your original input x.
Arguments:
s -- A scalar or numpy array.
Return:
ds -- Your computed gradient.
"""
ds = s * (1 - s)
return ds
def test_sigmoid_basic():
"""
Some simple tests to get you started.
Warning: these are not exhaustive.
"""
print "Running basic tests..."
x = np.array([[1, 2], [-1, -2]])
f = sigmoid(x)
g = sigmoid_grad(f)
print f
f_ans = np.array([
[0.73105858, 0.88079708],
[0.26894142, 0.11920292]])
assert np.allclose(f, f_ans, rtol=1e-05, atol=1e-06)
print g
g_ans = np.array([
[0.19661193, 0.10499359],
[0.19661193, 0.10499359]])
assert np.allclose(g, g_ans, rtol=1e-05, atol=1e-06)
print "You should verify these results by hand!\n"
if __name__ == "__main__":
test_sigmoid_basic()
Test results:
(f) Gradient checking: using the centered (two-sided) difference gives higher accuracy than a one-sided difference.
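Briefly, from the Taylor expansion,
$$\frac{f(x+h) - f(x-h)}{2h} = f'(x) + O(h^2), \qquad \frac{f(x+h) - f(x)}{h} = f'(x) + O(h),$$
so for the same step h the centered difference has a smaller truncation error.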
#!/usr/bin/env python
import numpy as np
import random
# First implement a gradient checker by filling in the following functions
def gradcheck_naive(f, x):
""" Gradient check for a function f.
Arguments:
f -- a function that takes a single argument and outputs the
cost and its gradients
x -- the point (numpy array) to check the gradient at
"""
rndstate = random.getstate()
random.setstate(rndstate)
fx, grad = f(x) # Evaluate function value at original point
h = 1e-4 # Do not change this!
# Iterate over all indexes ix in x to check the gradient.
it = np.nditer(x, flags=['multi_index'], op_flags=['readwrite'])
while not it.finished:
ix = it.multi_index
# Try modifying x[ix] with h defined above to compute numerical
# gradients (numgrad).
# Use the centered difference of the gradient.
# It has smaller asymptotic error than forward / backward difference
# methods. If you are curious, check out here:
# https://math.stackexchange.com/questions/2326181/when-to-use-forward-or-central-difference-approximations
# Make sure you call random.setstate(rndstate)
# before calling f(x) each time. This will make it possible
# to test cost functions with built in randomness later.
x[ix] += h
random.setstate(rndstate)  # reset the random state so both evaluations see the same randomness
f1 = f(x)[0]
x[ix] -= 2 * h
random.setstate(rndstate)
f2 = f(x)[0]
x[ix] += h
numgrad = (f1-f2)/(2*h)
# Compare gradients
reldiff = abs(numgrad - grad[ix]) / max(1, abs(numgrad), abs(grad[ix]))
if reldiff > 1e-5:
print "Gradient check failed."
print "First gradient error found at index %s" % str(ix)
print "Your gradient: %f \t Numerical gradient: %f" % (
grad[ix], numgrad)
return
it.iternext() # Step to next dimension
print "Gradient check passed!"
def sanity_check():
"""
Some basic sanity checks.
"""
quad = lambda x: (np.sum(x ** 2), x * 2)
print "Running sanity checks..."
gradcheck_naive(quad, np.array(123.456)) # scalar test
gradcheck_naive(quad, np.random.randn(3,)) # 1-D test
gradcheck_naive(quad, np.random.randn(4,5)) # 2-D test
print ""
if __name__ == "__main__":
sanity_check()
Program notes: in gradcheck_naive(f, x), f is a function that takes a single argument and returns a tuple with two items, the first being the value of the cost function and the second the gradient; x is the input at which to check the gradient, and it can be a scalar or a matrix (vector). The random state is saved and restored so that every call to f runs with the same random seed and therefore produces consistent results. np.nditer is a multi-index iterator; at each index we perform the centered-difference gradient check, then step to the next element.
[Only by actually doing it do you discover the places where you were fuzzy; those are guaranteed to trip you up, and the earlier they do, the better.]
I ran into a small bug here: when the input is a single scalar the naive approach happens to work, but for arrays, calling f(x[ix] + h) is wrong, because that passes a single number to f rather than the whole array; you must modify x[ix] in place and pass the full x.
(g) The last part is to implement a two-layer neural network (one hidden layer and one output layer).
#!/usr/bin/env python
import numpy as np
import random
from q1_softmax import softmax
from q2_sigmoid import sigmoid, sigmoid_grad
from q2_gradcheck import gradcheck_naive
def forward_backward_prop(X, labels, params, dimensions):
"""
Forward and backward propagation for a two-layer sigmoidal network
Compute the forward propagation and for the cross entropy cost,
the backward propagation for the gradients for all parameters.
Notice the gradients computed here are different from the gradients in
the assignment sheet: they are w.r.t. weights, not inputs.
Arguments:
X -- M x Dx matrix, where each row is a training example x.
labels -- M x Dy matrix, where each row is a one-hot vector.
params -- Model parameters, these are unpacked for you.
dimensions -- A tuple of input dimension, number of hidden units
and output dimension
"""
### Unpack network parameters (do not modify)
ofs = 0
Dx, H, Dy = (dimensions[0], dimensions[1], dimensions[2])
W1 = np.reshape(params[ofs:ofs + Dx * H], (Dx, H))
ofs += Dx * H
b1 = np.reshape(params[ofs:ofs + H], (1, H))
ofs += H
W2 = np.reshape(params[ofs:ofs + H * Dy], (H, Dy))
ofs += H * Dy
b2 = np.reshape(params[ofs:ofs + Dy], (1, Dy))
# Note: compute cost based on `sum` not `mean`.
z1 = X.dot(W1) + b1
a1 = sigmoid(z1)
z2 = a1.dot(W2) + b2
a2 = softmax(z2)
cost = -np.sum(labels * np.log(a2))
gradz2 = (a2 - labels)
gradW2 = a1.T.dot(gradz2)
gradb2 = np.sum(gradz2, axis=0, keepdims=True)
grada1 = gradz2.dot(W2.T)
gradz1 = grada1*sigmoid_grad(a1)
gradW1 = X.T.dot(gradz1)
gradb1 = np.sum(gradz1, axis=0, keepdims=True)
### Stack gradients (do not modify)
grad = np.concatenate((gradW1.flatten(), gradb1.flatten(), gradW2.flatten(), gradb2.flatten()))
grad.resize((len(grad), 1))
return cost, grad
def sanity_check():
"""
Set up fake data and parameters for the neural network, and test using
gradcheck.
"""
print "Running sanity check..."
N = 20
dimensions = [10, 5, 10]
data = np.random.randn(N, dimensions[0]) # each row will be a datum
labels = np.zeros((N, dimensions[2]))
for i in xrange(N):
labels[i, random.randint(0, dimensions[2]-1)] = 1
params = np.random.randn((dimensions[0] + 1) * dimensions[1] + (
dimensions[1] + 1) * dimensions[2], 1)
gradcheck_naive(lambda params: forward_backward_prop(data, labels, params, dimensions), params)
if __name__ == "__main__":
sanity_check()
While implementing this I also did something dumb: I was fuzzy about the sigmoid derivative, namely which variable it is taken with respect to, so let me pin that down again here.
In this program, calling sigmoid_grad(a1) computes exactly that derivative: since a1 = sigmoid(z1), sigmoid_grad expects the sigmoid value itself and returns a1 * (1 - a1), which equals the derivative of sigmoid evaluated at z1.
Test results:
Problem statement:
(a) The center word has index c, and we predict whether the word with index o lies in the window of the center word; u(w) denotes the "output" word vector of each word w in the vocabulary. In other words, two sets of word vectors are used, which decouples things and simplifies learning. After all that, the problem simply asks for a gradient.
(b) Again, derive a gradient.
(c) Parts (a) and (b) use the traditional, i.e. naive, word2vec formulation, but we know that negative sampling is more efficient to train. This part verifies that conclusion: the speed-up ratio is the running time of the CE (softmax) loss divided by the running time of the negative-sampling loss.
(d) word2vec has two variants, CBOW and skip-gram. With window size m, derive the gradients for both. The problem builds up step by step, from the abstract case to the concrete one.
(e) Fill in the word2vec model and train your own word vectors with stochastic gradient descent. In the file you need to write: a function that normalizes the rows of a matrix, the softmax cost and gradient function, the negative-sampling cost and gradient function, and the skip-gram cost and gradient function. (Essentially, implement skip-gram.)
(f) Implement the stochastic gradient descent optimizer.
(g) Download real-world data and train word vectors with the code you completed. Training uses the Stanford Sentiment Treebank corpus, and the trained vectors will be used for the sentiment analysis task in the next part.
(h) Extra credit: also implement the CBOW model.
Solutions:
(a)(b) (These involve matrix calculus; my derivation here still has some issues, and I will fix this bug once I have properly studied matrix differentiation.)
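For reference, the standard result, which is also what the code in part (e) below implements: with $\hat{y} = \mathrm{softmax}(Uv_c)$, $y$ the one-hot vector for the outside word $o$, and $U$ the matrix whose rows are the output vectors $u_w$,
$$J = -\log \hat{y}_o, \qquad \frac{\partial J}{\partial v_c} = U^{\top}(\hat{y} - y), \qquad \frac{\partial J}{\partial U} = (\hat{y} - y)\,v_c^{\top}.$$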
(c) With negative sampling, each update only has to evaluate K + 1 output word vectors, whereas the traditional softmax formulation has to touch all W words in the vocabulary, so the speed-up is roughly W / (K + 1).
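The negative-sampling loss being compared against, for one center vector $v_c$, target word $o$, and $K$ sampled negative words, is
$$J_{\text{neg}}(o, v_c, U) = -\log \sigma(u_o^{\top} v_c) - \sum_{k=1}^{K} \log \sigma(-u_k^{\top} v_c),$$
which only touches the $K + 1$ output vectors $u_o, u_1, \dots, u_K$ rather than the whole vocabulary.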
(d) For skip-gram, the derivation is as follows:
We are now considering the concrete case: only v_c is updated; the other input vectors v_j (j != c) receive no gradient, so only the parameters involved in the window are updated.
For CBOW, the derivation is as follows:
Put simply, skip-gram predicts the context words given the center word, whereas CBOW predicts the center word given the context, using the combination of the context word vectors as the prediction vector. Once you see this, it is clear what each loss function depends on and what it does not; the two losses are summarized below.
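With $F$ standing for either the softmax-CE cost or the negative-sampling cost above, window size $m$, and center position $c$:
$$J_{\text{skip-gram}} = \sum_{-m \le j \le m,\, j \ne 0} F(w_{c+j}, v_c), \qquad J_{\text{CBOW}} = F(w_c, \hat{v}), \quad \hat{v} = \sum_{-m \le j \le m,\, j \ne 0} v_{c+j},$$
where $\hat{v}$ is the sum of the context word input vectors (some formulations use the average instead).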
(e) I was stuck on this part for a long time, especially inside the negative-sampling function, really a long time. It was a painful birth, but it did come out in the end.
In this program, the first function, normalizeRows(x), normalizes the input row by row: every entry of a row is divided by the Euclidean norm of that row. (We can infer this from the provided example, test_normalize_rows.)
softmaxCostAndGradient is implemented from the formulas in (a) and (b). Pay close attention to matrix dimensions, especially arrays of shape (3,), which cause all kinds of mysterious problems; it is best to reshape everything into consistent 2-D shapes, which makes the computation much easier.
getNegativeSamples draws K negative-sample word indices at random.
negSamplingCostAndGradient is implemented from the formula in (c); again, watch how the matrix dimensions change.
skipgram (and cbow) use the formulas from (d) to tie each center word to its context words and produce the final result.
word2vec_sgd_wrapper is like a framework that wraps the whole word2vec model, so it can run with either skipgram or cbow.
test_word2vec is the testing entry point.
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import numpy as np
import random
from q1_softmax import softmax
from q2_gradcheck import gradcheck_naive
from q2_sigmoid import sigmoid, sigmoid_grad
def normalizeRows(x):
""" Row normalization function
# Normalize by dividing each row by its Euclidean norm
Implement a function that normalizes each row of a matrix to have
unit length.
"""
x = x / (np.sqrt(np.sum(x*x, axis=1, keepdims=True)))
return x
def test_normalize_rows():
print "Testing normalizeRows..."
x = normalizeRows(np.array([[3.0,4.0],[1, 2]]))
print x
ans = np.array([[0.6,0.8],[0.4472136,0.89442719]])
assert np.allclose(x, ans, rtol=1e-05, atol=1e-06)
print ""
def softmaxCostAndGradient(predicted, target, outputVectors, dataset):
""" Softmax cost function for word2vec models
Implement the cost and gradients for one predicted word vector
and one target word vector as a building block for word2vec
models, assuming the softmax prediction function and cross
entropy loss.
Arguments:
predicted -- numpy ndarray, predicted word vector (\hat{v} in
the written component)
target -- integer, the index of the target word
outputVectors -- "output" vectors (as rows) for all tokens
dataset -- needed for negative sampling, unused here.
Return:
cost -- cross entropy cost for the softmax word prediction
gradPred -- the gradient with respect to the predicted word
vector
grad -- the gradient with respect to all the other word
vectors
We will not provide starter code for this function, but feel
free to reference the code you previously wrote for this
assignment!
"""
# To avoid shape bugs, it is best to reshape the arrays into the shapes we need, since softmax operates on rows
predicted = predicted.reshape([1, predicted.shape[0]])
y_hot = softmax(predicted.dot(outputVectors.T)).reshape([outputVectors.shape[0], 1])
y_real = np.zeros_like(y_hot)
y_real[target] = 1
cost = -np.log(y_hot[target])
gradPred = (y_hot-y_real).T.dot(outputVectors)
grad = (y_hot-y_real).dot(predicted)
return cost, gradPred, grad
def getNegativeSamples(target, dataset, K):
""" Samples K indexes which are not the target
Randomly sample K indices as negative samples
"""
indices = [None] * K
for k in xrange(K):
newidx = dataset.sampleTokenIdx()
while newidx == target:
newidx = dataset.sampleTokenIdx()
indices[k] = newidx
return indices
def negSamplingCostAndGradient(predicted, target, outputVectors, dataset,
K=10):
""" Negative sampling cost function for word2vec models
Implement the cost and gradients for one predicted word vector
and one target word vector as a building block for word2vec
models, using the negative sampling technique. K is the sample
size.
Note: See test_word2vec below for dataset's initialization.
Arguments/Return Specifications: same as softmaxCostAndGradient
"""
# Sampling of indices is done for you. Do not modify this if you
# wish to match the autograder and receive points!
indices = [target]
indices.extend(getNegativeSamples(target, dataset, K))
predicted = predicted.reshape([predicted.shape[0], 1])
gradPred = np.zeros(predicted.shape)
cost = 0
soft_vc = sigmoid(outputVectors[target, :].dot(predicted)) # [1, D]*[D, 1]=[1, 1]
cost -= np.log(soft_vc)
gradPred += (soft_vc-1.0) * outputVectors[target, :].reshape(predicted.shape) # [D,1]
grad_temp = np.zeros([outputVectors.shape[0], 1]) # [M, 1]
grad_temp[target] = soft_vc-1.0
for i in range(1, len(indices)):
soft_vk = sigmoid(-outputVectors[indices[i], :].dot(predicted))
cost -= np.log(soft_vk)
gradPred -= (soft_vk-1.0) * outputVectors[indices[i], :].reshape(predicted.shape)
grad_temp[indices[i]] -= (soft_vk-1.0)
grad = grad_temp.dot(predicted.T) # [M, 1]*[1, D]=[M, D]
return cost, gradPred, grad
def skipgram(currentWord, C, contextWords, tokens, inputVectors, outputVectors,
dataset, word2vecCostAndGradient=softmaxCostAndGradient):
""" Skip-gram model in word2vec
Implement the skip-gram model in this function.
Arguments:
currentWord -- a string of the current center word
C -- integer, context size
contextWords -- list of no more than 2*C strings, the context words
tokens -- a dictionary that maps words to their indices in
the word vector list
inputVectors -- "input" word vectors (as rows) for all tokens
outputVectors -- "output" word vectors (as rows) for all tokens
word2vecCostAndGradient -- the cost and gradient function for
a prediction vector given the target
word vectors, could be one of the two
cost functions you implemented above.
Return:
cost -- the cost function value for the skip-gram model
grad -- the gradient with respect to the word vectors
"""
cost = 0.0
gradIn = np.zeros(inputVectors.shape)
gradOut = np.zeros(outputVectors.shape)
for word in contextWords:
cost_1, gradPred1, grad1 = word2vecCostAndGradient(inputVectors[tokens[currentWord]], tokens[word],
outputVectors, dataset)
cost += cost_1
gradIn[tokens[currentWord], :] += np.squeeze([gradPred1])
gradOut += grad1
return cost, gradIn, gradOut
def cbow(currentWord, C, contextWords, tokens, inputVectors, outputVectors,
dataset, word2vecCostAndGradient=softmaxCostAndGradient):
"""CBOW model in word2vec
Implement the continuous bag-of-words model in this function.
Arguments/Return specifications: same as the skip-gram model
Extra credit: Implementing CBOW is optional, but the gradient
derivations are not. If you decide not to implement CBOW, remove
the NotImplementedError.
"""
cost = 0.0
gradIn = np.zeros(inputVectors.shape)
gradOut = np.zeros(outputVectors.shape)
### YOUR CODE HERE
raise NotImplementedError
### END YOUR CODE
return cost, gradIn, gradOut
#############################################
# Testing functions below. DO NOT MODIFY! #
#############################################
def word2vec_sgd_wrapper(word2vecModel, tokens, wordVectors, dataset, C,
word2vecCostAndGradient=softmaxCostAndGradient):
batchsize = 50
cost = 0.0
grad = np.zeros(wordVectors.shape)
N = wordVectors.shape[0]
inputVectors = wordVectors[:N/2,:]
outputVectors = wordVectors[N/2:,:]
for i in xrange(batchsize):
C1 = random.randint(1,C)
centerword, context = dataset.getRandomContext(C1)
if word2vecModel == skipgram:
denom = 1
else:
denom = 1
c, gin, gout = word2vecModel(
centerword, C1, context, tokens, inputVectors, outputVectors,
dataset, word2vecCostAndGradient)
cost += c / batchsize / denom
grad[:N/2, :] += gin / batchsize / denom
grad[N/2:, :] += gout / batchsize / denom
return cost, grad
def test_word2vec():
""" Interface to the dataset for negative sampling """
dataset = type('dummy', (), {})()
def dummySampleTokenIdx():
return random.randint(0, 4)
def getRandomContext(C):
tokens = ["a", "b", "c", "d", "e"]
return tokens[random.randint(0,4)], \
[tokens[random.randint(0,4)] for i in xrange(2*C)]
dataset.sampleTokenIdx = dummySampleTokenIdx
dataset.getRandomContext = getRandomContext
random.seed(31415)
np.random.seed(9265)
dummy_vectors = normalizeRows(np.random.randn(10,3))
dummy_tokens = dict([("a",0), ("b",1), ("c",2),("d",3),("e",4)])
print "==== Gradient check for skip-gram ===="
gradcheck_naive(lambda vec: word2vec_sgd_wrapper(
skipgram, dummy_tokens, vec, dataset, 5, softmaxCostAndGradient),
dummy_vectors)
gradcheck_naive(lambda vec: word2vec_sgd_wrapper(
skipgram, dummy_tokens, vec, dataset, 5, negSamplingCostAndGradient),
dummy_vectors)
# print "\n==== Gradient check for CBOW ===="
# gradcheck_naive(lambda vec: word2vec_sgd_wrapper(
# cbow, dummy_tokens, vec, dataset, 5, softmaxCostAndGradient),
# dummy_vectors)
# gradcheck_naive(lambda vec: word2vec_sgd_wrapper(
# cbow, dummy_tokens, vec, dataset, 5, negSamplingCostAndGradient),
# dummy_vectors)
print "\n=== Results ==="
print skipgram("c", 3, ["a", "b", "e", "d", "b", "c"],
dummy_tokens, dummy_vectors[:5,:], dummy_vectors[5:,:], dataset)
print skipgram("c", 1, ["a", "b"],
dummy_tokens, dummy_vectors[:5,:], dummy_vectors[5:,:], dataset,
negSamplingCostAndGradient)
# print cbow("a", 2, ["a", "b", "c", "a"],
# dummy_tokens, dummy_vectors[:5,:], dummy_vectors[5:,:], dataset)
# print cbow("a", 2, ["a", "b", "a", "c"],
# dummy_tokens, dummy_vectors[:5,:], dummy_vectors[5:,:], dataset,
# negSamplingCostAndGradient)
if __name__ == "__main__":
test_normalize_rows()
test_word2vec()
(f) This implements the SGD function; the part to fill in is fairly simple.
When handed a skeleton program like this to complete, the first step is to figure out exactly which part you are supposed to fill in; the second is to start reading from if __name__ == '__main__', since that is the program's entry point, and then follow the control flow to understand it.
sanity_check is just an interface that calls sgd.
For sgd, the meaning of each argument is documented below. In each iteration we update the variable using the gradient; in particular, do not forget the postprocessing function: because this is an iterative process, it is not enough to preprocess (normalize) the variable only at initialization, it must also be applied after every update.
save_params saves the parameters every so many iterations as a fail-safe.
load_saved_params loads previously saved parameters.
#!/usr/bin/env python
# Save parameters every a few SGD iterations as fail-safe
SAVE_PARAMS_EVERY = 5000
import glob
import random
import numpy as np
import os.path as op
import cPickle as pickle
def load_saved_params():
"""
A helper function that loads previously saved parameters and resets
iteration start.
"""
st = 0
for f in glob.glob("saved_params_*.npy"):
iter = int(op.splitext(op.basename(f))[0].split("_")[2])
if (iter > st):
st = iter
if st > 0:
with open("saved_params_%d.npy" % st, "r") as f:
params = pickle.load(f)
state = pickle.load(f)
return st, params, state
else:
return st, None, None
def save_params(iter, params):
with open("saved_params_%d.npy" % iter, "w") as f:
pickle.dump(params, f)
pickle.dump(random.getstate(), f)
def sgd(f, x0, step, iterations, postprocessing=None, useSaved=False,
PRINT_EVERY=10):
""" Stochastic Gradient Descent
Implement the stochastic gradient descent method in this function.
Arguments:
f -- the function to optimize, it should take a single
argument and yield two outputs, a cost and the gradient
with respect to the arguments
x0 -- the initial point to start SGD from
step -- the step size for SGD
iterations -- total iterations to run SGD for
postprocessing -- postprocessing function for the parameters
if necessary. In the case of word2vec we will need to
normalize the word vectors to have unit length.
PRINT_EVERY -- specifies how many iterations to output loss
Return:
x -- the parameter value after SGD finishes
"""
# Anneal learning rate every several iterations
ANNEAL_EVERY = 20000
if useSaved:
start_iter, oldx, state = load_saved_params()
if start_iter > 0:
x0 = oldx
step *= 0.5 ** (start_iter / ANNEAL_EVERY)
if state:
random.setstate(state)
else:
start_iter = 0
x = x0
if not postprocessing:
postprocessing = lambda x: x
expcost = None
for iter in xrange(start_iter + 1, iterations + 1):
# Don't forget to apply the postprocessing after every iteration!
# You might want to print the progress every few iterations.
cost = None
### YOUR CODE HERE
cost, grad = f(x)
x -= step * grad
x = postprocessing(x)  # reassign: the postprocessing function returns a new array
### END YOUR CODE
if iter % PRINT_EVERY == 0:
if not expcost:
expcost = cost
else:
expcost = .95 * expcost + .05 * cost
print "iter %d: %f" % (iter, expcost)
if iter % SAVE_PARAMS_EVERY == 0 and useSaved:
save_params(iter, x)
if iter % ANNEAL_EVERY == 0:
step *= 0.5
return x
def sanity_check():
quad = lambda x: (np.sum(x ** 2), x * 2)
print "Running sanity checks..."
t1 = sgd(quad, 0.5, 0.01, 1000, PRINT_EVERY=100)
print "test 1 result:", t1
assert abs(t1) <= 1e-6
t2 = sgd(quad, 0.0, 0.01, 1000, PRINT_EVERY=100)
print "test 2 result:", t2
assert abs(t2) <= 1e-6
t3 = sgd(quad, -1.5, 0.01, 1000, PRINT_EVERY=100)
print "test 3 result:", t3
assert abs(t3) <= 1e-6
print ""
if __name__ == "__main__":
sanity_check()
(g) Train on a real corpus; the code is below.
The dataset is the Stanford Sentiment Treebank. The word vectors are 10-dimensional and the context window size is 5. wordVectors contains two blocks of vectors, namely the u and v vectors we usually talk about. The outer layer is the sgd function, which is applied to word2vec_sgd_wrapper.
40,000 iterations; it really takes a long time.
Then, with the trained word vectors, we project a handful of chosen words down to two dimensions to see how they end up relative to each other.
#!/usr/bin/env python
import random
import numpy as np
from utils.treebank import StanfordSentiment
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
import time
from q3_word2vec import *
from q3_sgd import *
# Reset the random seed to make sure that everyone gets the same results
random.seed(314)
dataset = StanfordSentiment()
tokens = dataset.tokens()
nWords = len(tokens)
# We are going to train 10-dimensional vectors for this assignment
dimVectors = 10
# Context size
C = 5
# Reset the random seed to make sure that everyone gets the same results
random.seed(31415)
np.random.seed(9265)
startTime=time.time()
wordVectors = np.concatenate(
((np.random.rand(nWords, dimVectors) - 0.5) /
dimVectors, np.zeros((nWords, dimVectors))),
axis=0)
wordVectors = sgd(
lambda vec: word2vec_sgd_wrapper(skipgram, tokens, vec, dataset, C,
negSamplingCostAndGradient),
wordVectors, 0.3, 40000, None, True, PRINT_EVERY=10)
# Note that normalization is not called here. This is not a bug,
# normalizing during training loses the notion of length.
print "sanity check: cost at convergence should be around or below 10"
print "training took %d seconds" % (time.time() - startTime)
# concatenate the input and output word vectors
wordVectors = np.concatenate(
(wordVectors[:nWords,:], wordVectors[nWords:,:]),
axis=0)
# wordVectors = wordVectors[:nWords,:] + wordVectors[nWords:,:]
visualizeWords = [
"the", "a", "an", ",", ".", "?", "!", "``", "''", "--",
"good", "great", "cool", "brilliant", "wonderful", "well", "amazing",
"worth", "sweet", "enjoyable", "boring", "bad", "waste", "dumb",
"annoying"]
visualizeIdx = [tokens[word] for word in visualizeWords]
visualizeVecs = wordVectors[visualizeIdx, :]
temp = (visualizeVecs - np.mean(visualizeVecs, axis=0))
covariance = 1.0 / len(visualizeIdx) * temp.T.dot(temp)
U,S,V = np.linalg.svd(covariance)
coord = temp.dot(U[:,0:2])
for i in xrange(len(visualizeWords)):
plt.text(coord[i,0], coord[i,1], visualizeWords[i],
bbox=dict(facecolor='green', alpha=0.1))
plt.xlim((np.min(coord[:,0]), np.max(coord[:,0])))
plt.ylim((np.min(coord[:,1]), np.max(coord[:,1])))
plt.savefig('q3_word_vectors.png')
The resulting plot of the words is shown below:
Problem statement: use the trained word vectors for a sentiment analysis task. For every sentence in the corpus, we use the average of its word vectors as the sentence feature and predict the sentiment of the sentence on a five-level scale by training a softmax classifier:
very negative (0), negative (1), neutral (2), positive (3), very positive (4)
(a) Implement the sentence feature computation, representing each sentence by the average of its word vectors.
(b) Explain why regularization is needed when training the classifier.
(c) Fill in the hyperparameter-selection code to find the best regularization value. You should reach at least 36.5% accuracy on the dev and test sets.
(d) Run the sentiment classifier with your own trained word vectors, then with the pretrained GloVe vectors, and compare the train, dev, and test accuracies. Why do the pretrained GloVe vectors do better? Give at least three distinct reasons.
(e) Plot the train and dev accuracies as a function of the regularization value (on a log scale) using the GloVe vectors, and briefly explain what the curves show.
(f) Analyze where the model makes mistakes and briefly interpret the confusion matrix.
(g) Analyze three misclassified examples, briefly describe what kinds of inputs tend to be misclassified, and do your best to identify the cause of the errors.
Solutions:
(a) Use the average of the word vectors of the words in a sentence as the feature for that sentence. The full program is given later.
My own implementation is not especially efficient, because it does not take full advantage of vectorization; that is something I still need to keep practicing. A slightly more vectorized sketch follows.
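A minimal sketch of a more vectorized version (a hypothetical helper, assuming every word of the sentence appears in tokens), not the version submitted in the program below:
import numpy as np
def getSentenceFeaturesVectorized(tokens, wordVectors, sentence):
    # Look up all word indices at once and average the corresponding rows.
    idx = [tokens[w] for w in sentence]
    return np.mean(wordVectors[idx], axis=0)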
(b) The point of regularization is to prevent overfitting. The common choices are L1 and L2 regularization. L2 regularization shrinks the parameter values, so that when the input shifts a little the output does not change much, i.e. it improves the model's generalization to unseen examples.
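Concretely, for L2 regularization the training objective becomes
$$J_{\text{reg}}(\theta) = CE(y, \hat{y}) + \lambda \sum_i \theta_i^2,$$
where a larger $\lambda$ pushes the weights toward zero; in the sklearn LogisticRegression call used below, this corresponds to $C = 1/\lambda$.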
(c) In getRegularizationValues, assign values a range of floating-point values (here, 100 values log-spaced between 1e-2 and 1e2).
In chooseBestModel, pick the result dictionary with the highest dev-set accuracy (the 'dev' key).
(d) Results:
With --yourvectors:
=== Recap ===
Reg Train Dev Test
1.00E-02 30.946 32.334 29.910
1.10E-02 30.922 32.334 29.955
1.20E-02 30.946 32.334 29.910
1.32E-02 30.922 32.243 29.955
1.45E-02 30.840 32.153 30.000
1.59E-02 30.770 32.153 30.000
1.75E-02 30.735 32.243 29.910
1.92E-02 30.817 31.789 29.955
2.10E-02 30.735 31.698 29.955
2.31E-02 30.770 31.698 29.955
2.54E-02 30.618 31.608 30.000
2.78E-02 30.501 31.608 30.090
3.05E-02 30.524 31.698 29.910
3.35E-02 30.431 31.608 29.955
3.68E-02 30.360 31.698 29.819
4.04E-02 30.325 31.608 29.864
4.43E-02 30.302 31.880 30.045
4.86E-02 30.349 31.880 30.136
5.34E-02 30.384 31.971 29.955
5.86E-02 30.396 32.062 29.955
6.43E-02 30.349 32.153 30.000
7.05E-02 30.372 32.243 30.045
7.74E-02 30.325 32.425 30.045
8.50E-02 30.290 32.062 30.136
9.33E-02 30.302 31.880 29.955
1.02E-01 30.302 31.971 29.910
1.12E-01 30.314 31.789 29.729
1.23E-01 30.279 31.971 29.638
1.35E-01 30.185 31.789 29.774
1.48E-01 30.162 31.880 29.638
1.63E-01 30.044 31.789 29.502
1.79E-01 29.998 32.062 29.367
1.96E-01 29.963 31.971 29.412
2.15E-01 29.740 32.062 29.502
2.36E-01 29.635 31.698 29.321
2.60E-01 29.717 31.971 29.095
2.85E-01 29.494 32.062 29.005
3.13E-01 29.459 31.880 28.824
3.43E-01 29.506 31.789 28.778
3.76E-01 29.459 31.517 28.371
4.13E-01 29.424 31.335 28.326
4.53E-01 29.295 31.335 28.281
4.98E-01 29.260 31.244 28.326
5.46E-01 29.377 30.790 28.100
5.99E-01 29.412 31.244 28.054
6.58E-01 29.436 31.153 28.145
7.22E-01 29.389 30.881 28.145
7.92E-01 29.377 30.336 27.873
8.70E-01 29.190 29.973 27.783
9.55E-01 28.968 29.882 27.240
1.05E+00 28.816 29.609 27.059
1.15E+00 28.862 29.064 26.561
1.26E+00 28.663 28.520 26.335
1.38E+00 28.640 28.065 25.928
1.52E+00 28.546 27.520 25.928
1.67E+00 28.500 27.430 25.701
1.83E+00 28.265 27.339 25.339
2.01E+00 27.926 26.794 25.204
2.21E+00 27.938 26.431 24.887
2.42E+00 27.961 26.703 24.615
2.66E+00 27.891 26.703 24.525
2.92E+00 27.680 26.431 24.208
3.20E+00 27.680 26.158 23.846
3.51E+00 27.575 25.704 23.846
3.85E+00 27.551 25.704 23.665
4.23E+00 27.446 25.522 23.620
4.64E+00 27.353 25.522 23.394
5.09E+00 27.353 25.522 23.213
5.59E+00 27.317 25.704 23.122
6.14E+00 27.306 25.704 23.122
6.73E+00 27.271 25.704 23.122
7.39E+00 27.247 25.522 23.122
8.11E+00 27.247 25.522 23.122
8.90E+00 27.235 25.522 23.122
9.77E+00 27.247 25.522 23.077
1.07E+01 27.235 25.522 23.077
1.18E+01 27.235 25.522 23.077
1.29E+01 27.235 25.522 23.077
1.42E+01 27.235 25.522 23.077
1.56E+01 27.235 25.522 23.032
1.71E+01 27.235 25.522 23.032
1.87E+01 27.235 25.522 23.032
2.06E+01 27.235 25.522 23.032
2.26E+01 27.235 25.522 23.032
2.48E+01 27.235 25.522 23.032
2.72E+01 27.235 25.522 23.032
2.98E+01 27.247 25.522 23.032
3.27E+01 27.247 25.522 23.032
3.59E+01 27.247 25.522 23.032
3.94E+01 27.247 25.522 23.032
4.33E+01 27.247 25.522 23.032
4.75E+01 27.247 25.522 23.032
5.21E+01 27.247 25.522 23.032
5.72E+01 27.247 25.522 23.032
6.28E+01 27.247 25.522 23.032
6.89E+01 27.247 25.522 23.032
7.56E+01 27.247 25.522 23.032
8.30E+01 27.247 25.522 23.032
9.11E+01 27.247 25.522 23.032
1.00E+02 27.247 25.522 23.032
Best regularization value: 7.74E-02
Test accuracy (%): 30.045249
With --pretrained:
=== Recap ===
Reg Train Dev Test
1.00E-02 39.923 36.331 37.195
1.10E-02 39.934 36.331 37.195
1.20E-02 39.911 36.240 37.195
1.32E-02 39.899 36.240 37.195
1.45E-02 39.899 36.421 37.285
1.59E-02 39.888 36.694 37.285
1.75E-02 39.876 36.603 37.240
1.92E-02 39.841 36.603 37.195
2.10E-02 39.853 36.694 37.285
2.31E-02 39.864 36.421 37.240
2.54E-02 39.864 36.421 37.421
2.78E-02 39.853 36.331 37.285
3.05E-02 39.853 36.421 37.376
3.35E-02 39.876 36.421 37.466
3.68E-02 39.841 36.331 37.511
4.04E-02 39.817 36.331 37.511
4.43E-02 39.829 36.331 37.466
4.86E-02 39.923 36.240 37.511
5.34E-02 39.888 36.240 37.466
5.86E-02 39.853 36.331 37.421
6.43E-02 39.876 36.331 37.330
7.05E-02 39.864 36.331 37.195
7.74E-02 39.864 36.421 37.195
8.50E-02 39.864 36.331 37.240
9.33E-02 39.853 36.331 37.195
1.02E-01 39.817 36.240 37.149
1.12E-01 39.735 36.240 37.195
1.23E-01 39.771 36.512 37.285
1.35E-01 39.735 36.512 37.466
1.48E-01 39.794 36.512 37.511
1.63E-01 39.806 36.512 37.466
1.79E-01 39.841 36.421 37.376
1.96E-01 39.747 36.421 37.330
2.15E-01 39.724 36.512 37.240
2.36E-01 39.665 36.512 37.195
2.60E-01 39.654 36.512 37.195
2.85E-01 39.560 36.421 37.330
3.13E-01 39.583 36.240 37.285
3.43E-01 39.642 36.240 37.285
3.76E-01 39.630 36.331 37.285
4.13E-01 39.630 36.331 37.285
4.53E-01 39.607 36.149 37.330
4.98E-01 39.618 36.149 37.330
5.46E-01 39.583 36.149 37.330
5.99E-01 39.583 36.149 37.285
6.58E-01 39.607 36.421 37.240
7.22E-01 39.548 36.512 37.285
7.92E-01 39.537 36.512 37.285
8.70E-01 39.525 36.603 37.376
9.55E-01 39.525 36.603 37.285
1.05E+00 39.478 36.512 37.330
1.15E+00 39.525 36.603 37.285
1.26E+00 39.537 36.512 37.330
1.38E+00 39.548 36.512 37.330
1.52E+00 39.490 36.512 37.285
1.67E+00 39.490 36.603 37.059
1.83E+00 39.466 36.694 37.195
2.01E+00 39.501 36.876 37.240
2.21E+00 39.431 36.694 37.240
2.42E+00 39.408 36.603 37.240
2.66E+00 39.302 36.876 37.195
2.92E+00 39.279 36.876 37.195
3.20E+00 39.197 36.966 37.104
3.51E+00 39.022 36.876 37.240
3.85E+00 39.115 36.694 37.285
4.23E+00 39.092 36.785 37.376
4.64E+00 39.010 36.876 37.285
5.09E+00 38.963 36.694 37.376
5.59E+00 39.010 36.512 37.466
6.14E+00 38.951 36.785 37.466
6.73E+00 38.928 36.876 37.783
7.39E+00 38.893 36.785 37.828
8.11E+00 38.846 36.694 37.828
8.90E+00 38.729 36.694 37.828
9.77E+00 38.647 36.876 37.738
1.07E+01 38.706 36.876 37.466
1.18E+01 38.659 37.239 37.511
1.29E+01 38.542 37.057 37.421
1.42E+01 38.448 36.876 37.330
1.56E+01 38.343 36.694 37.149
1.71E+01 38.319 36.966 37.240
1.87E+01 38.191 36.694 37.149
2.06E+01 38.191 36.603 37.059
2.26E+01 38.003 36.421 36.968
2.48E+01 37.863 36.331 36.833
2.72E+01 37.746 36.694 36.923
2.98E+01 37.617 36.966 36.697
3.27E+01 37.500 36.876 36.471
3.59E+01 37.512 36.785 36.290
3.94E+01 37.477 36.603 36.425
4.33E+01 37.383 36.512 36.244
4.75E+01 37.161 36.512 36.199
5.21E+01 37.125 36.421 36.290
5.72E+01 36.997 36.149 36.063
6.28E+01 36.821 35.786 36.154
6.89E+01 36.809 35.695 36.018
7.56E+01 36.610 35.876 35.882
8.30E+01 36.575 35.332 35.611
9.11E+01 36.482 34.968 35.656
1.00E+02 36.330 35.059 35.701
Best regularization value: 1.18E+01
Test accuracy (%): 37.511312
Reasons why GloVe does better than my own trained vectors:
1. The GloVe vectors have higher dimensionality: 50 dimensions, versus the 10 dimensions of the vectors we trained ourselves.
2. GloVe was trained on a much larger corpus, which gives broader coverage, while our training corpus is not large enough to yield unbiased word vectors.
3. GloVe uses global information, namely word co-occurrence statistics, whereas word2vec only uses local context relationships.
(e) Plot of train and dev accuracy for different regularization values (using the GloVe vectors):
The figure shows how the regularization coefficient affects train and dev accuracy. As the coefficient keeps growing, training accuracy keeps dropping, while dev accuracy rises slightly over a small range, which shows that regularization does help avoid overfitting the training data; when the coefficient becomes too large, both accuracies fall, meaning the model has become too simple and no longer fits the data well.
(f) The confusion matrix plot (using the GloVe vectors):
Reading the confusion matrix: for very negative examples, most are classified as negative and the next largest group as positive; for negative examples, most are classified as negative, but a fair number as positive; for neutral examples, most are classified as negative, then positive; for positive examples, most are classified as positive, with a small portion as negative; for very positive examples, most are classified as positive, then very positive.
In this model, classification works best on the positive class, followed by negative, then very positive, then neutral, and worst on very negative.
Overall, the model does better on the positive side and only so-so on neutral and negative examples.
(g) Analysis of three misclassified examples.
Data:
True Predicted Text
3 4 it 's a lovely film with lovely performances by buy and accorsi .
2 1 no one goes unindicted here , which is probably for the best .
3 1 and if you 're not nearly moved to tears by a couple of scenes , you 've got ice water in your veins .
First example:
A positive sentence predicted as very positive. The boundary between these two is blurry, so it is not a severe error of the kind where a clearly positive sentence is labeled negative.
Second example:
There are probably a few too many negation and hedging words, which pushed the prediction toward the negative side.
Third example:
It relies on a rhetorical, almost ironic construction and contains some negative-leaning words; the model has not learned the intended meaning, so it gets it wrong.
Program walkthrough: the entry point is the main function, whose argument comes from another function, getArguments.
First, getArguments: it uses the argparse package, Python's built-in module for parsing command-line options and arguments. You declare the arguments your program needs, and argparse parses them out of sys.argv and automatically generates help and usage messages. The script therefore has to be run from the command line.
A detailed tutorial: https://www.jianshu.com/p/fef2d215b91d
https://blog.csdn.net/u013177568/article/details/62432761/
I almost died at this point, but in the end I solved it. The problem was simply that I was not familiar with argparse and did not know how to run the script; see the links above for details.
Another issue: since I am on Windows, you have to cd into this folder on the command line before running, otherwise you get a file-not-found error, because it searches from the C: system drive, where of course the files are not.
One more link, https://www.cnblogs.com/wangguoyuan-09/p/6866798.html, on running command-line programs from PyCharm.
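For reference, assuming the script keeps the assignment's file name q4_sentiment.py, the two runs are launched from the assignment directory as:
python q4_sentiment.py --yourvectors
python q4_sentiment.py --pretrained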
Next, the main function: the dataset is the Stanford sentiment analysis dataset; based on the (mutually exclusive) command-line flags it uses either the pretrained vectors or your own vectors; it then loads the train, dev, and test splits, extracts features from the data, trains a classifier for each regularization value, and prints the results. For the pretrained case it also produces plots for error analysis.
getSentenceFeatures is part (a): the average of the word vectors of the words in the sentence is used as the feature of the whole sentence.
getRegularizationValues is part (c): produce a list of regularization values and return them sorted.
chooseBestModel is also part (c): select the best model according to dev-set accuracy.
accuracy computes the classification accuracy in a vectorized way.
plotRegVsAccuracy plots accuracy as a function of the regularization value.
outputConfusionMatrix draws the confusion matrix.
There is quite a lot worth learning here:
1) How to draw a confusion matrix: it is not just about producing the matrix of counts, you can also render it as an image, which looks quite nice.
2) A lot of this is worth borrowing for my own code.
The matplotlib reference, for later use: https://matplotlib.org/api/pyplot_summary.html
outputPredictions writes a txt file containing, for each dev-set sentence, the true label, the predicted label, and the text.
The final program is as follows:
#!/usr/bin/env python
# -*- coding:utf-8 -*-
import argparse
import numpy as np
import matplotlib
matplotlib.use('agg')
import matplotlib.pyplot as plt
import itertools
from utils.treebank import StanfordSentiment
import utils.glove as glove
from q3_sgd import load_saved_params, sgd
# We will use sklearn here because it will run faster than implementing
# ourselves. However, for other parts of this assignment you must implement
# the functions yourself!
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
def getArguments():
parser = argparse.ArgumentParser()
group = parser.add_mutually_exclusive_group(required=True)
group.add_argument("--pretrained", dest="pretrained", action="store_true",
help="Use pretrained GloVe vectors.")
group.add_argument("--yourvectors", dest="yourvectors", action="store_true",
help="Use your vectors from q3.")
return parser.parse_args()
def getSentenceFeatures(tokens, wordVectors, sentence):
"""
Obtain the sentence feature for sentiment analysis by averaging its
word vectors
"""
# Implement computation for the sentence features given a sentence.
# Inputs:
# tokens -- a dictionary that maps words to their indices in
# the word vector list
# wordVectors -- word vectors (each row) for all tokens
# sentence -- a list of words in the sentence of interest
# Output:
# - sentVector: feature vector for the sentence
sentVector = np.zeros((wordVectors.shape[1],))
for word in sentence:
sentVector += wordVectors[tokens[word]]
sentVector *= 1.0/len(sentence)
assert sentVector.shape == (wordVectors.shape[1],)
return sentVector
def getRegularizationValues():
"""Try different regularizations
Return a sorted list of values to try.
"""
# Assign a list of floats in the block below
values = np.logspace(-2, 2, num=100, base=10)
return sorted(values)
def chooseBestModel(results):
"""Choose the best model based on dev set performance.
Arguments:
results -- A list of python dictionaries of the following format:
{
"reg": regularization,
"clf": classifier,
"train": trainAccuracy,
"dev": devAccuracy,
"test": testAccuracy
}
Each dictionary represents the performance of one model.
Returns:
Your chosen result dictionary.
"""
# Select the result with the highest dev-set accuracy (compare on the "dev" key)
bestResult = max(results, key=lambda x: x['dev'])
return bestResult
def accuracy(y, yhat):
""" Precision for classifier """
assert(y.shape == yhat.shape)
return np.sum(y == yhat) * 100.0 / y.size
def plotRegVsAccuracy(regValues, results, filename):
""" Make a plot of regularization vs accuracy """
plt.plot(regValues, [x["train"] for x in results])
plt.plot(regValues, [x["dev"] for x in results])
plt.xscale('log')
plt.xlabel("regularization")
plt.ylabel("accuracy")
plt.legend(['train', 'dev'], loc='upper left')
plt.savefig(filename)
def outputConfusionMatrix(features, labels, clf, filename):
""" Generate a confusion matrix """
pred = clf.predict(features)
cm = confusion_matrix(labels, pred, labels=range(5))
plt.figure()
plt.imshow(cm, interpolation='nearest', cmap=plt.cm.Reds)
plt.colorbar()
classes = ["- -", "-", "neut", "+", "+ +"]
tick_marks = np.arange(len(classes))
plt.xticks(tick_marks, classes)
plt.yticks(tick_marks, classes)
thresh = cm.max() / 2.
for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
plt.text(j, i, cm[i, j],
horizontalalignment="center",
color="white" if cm[i, j] > thresh else "black")
plt.tight_layout()
plt.ylabel('True label')
plt.xlabel('Predicted label')
plt.savefig(filename)
def outputPredictions(dataset, features, labels, clf, filename):
""" Write the predictions to file """
pred = clf.predict(features)
with open(filename, "w") as f:
print >> f, "True\tPredicted\tText"
for i in xrange(len(dataset)):
print >> f, "%d\t%d\t%s" % (
labels[i], pred[i], " ".join(dataset[i][0]))
def main(args):
""" Train a model to do sentiment analyis"""
# Load the dataset
dataset = StanfordSentiment()
tokens = dataset.tokens()
nWords = len(tokens)
if args.yourvectors:
_, wordVectors, _ = load_saved_params()
wordVectors = np.concatenate(
(wordVectors[:nWords,:], wordVectors[nWords:,:]),
axis=1)
elif args.pretrained:
wordVectors = glove.loadWordVectors(tokens)
dimVectors = wordVectors.shape[1]
# Load the train set
trainset = dataset.getTrainSentences()
nTrain = len(trainset)
trainFeatures = np.zeros((nTrain, dimVectors))
trainLabels = np.zeros((nTrain,), dtype=np.int32)
for i in xrange(nTrain):
words, trainLabels[i] = trainset[i]
trainFeatures[i, :] = getSentenceFeatures(tokens, wordVectors, words)
# Prepare dev set features
devset = dataset.getDevSentences()
nDev = len(devset)
devFeatures = np.zeros((nDev, dimVectors))
devLabels = np.zeros((nDev,), dtype=np.int32)
for i in xrange(nDev):
words, devLabels[i] = devset[i]
devFeatures[i, :] = getSentenceFeatures(tokens, wordVectors, words)
# Prepare test set features
testset = dataset.getTestSentences()
nTest = len(testset)
testFeatures = np.zeros((nTest, dimVectors))
testLabels = np.zeros((nTest,), dtype=np.int32)
for i in xrange(nTest):
words, testLabels[i] = testset[i]
testFeatures[i, :] = getSentenceFeatures(tokens, wordVectors, words)
# We will save our results from each run
results = []
regValues = getRegularizationValues()
for reg in regValues:
print "Training for reg=%f" % reg
# Note: add a very small number to regularization to please the library
clf = LogisticRegression(C=1.0/(reg + 1e-12))
clf.fit(trainFeatures, trainLabels)
# Test on train set
pred = clf.predict(trainFeatures)
trainAccuracy = accuracy(trainLabels, pred)
print "Train accuracy (%%): %f" % trainAccuracy
# Test on dev set
pred = clf.predict(devFeatures)
devAccuracy = accuracy(devLabels, pred)
print "Dev accuracy (%%): %f" % devAccuracy
# Test on test set
# Note: always running on test is poor style. Typically, you should
# do this only after validation.
pred = clf.predict(testFeatures)
testAccuracy = accuracy(testLabels, pred)
print "Test accuracy (%%): %f" % testAccuracy
results.append({
"reg": reg,
"clf": clf,
"train": trainAccuracy,
"dev": devAccuracy,
"test": testAccuracy})
# Print the accuracies
print ""
print "=== Recap ==="
print "Reg\t\tTrain\tDev\tTest"
for result in results:
print "%.2E\t%.3f\t%.3f\t%.3f" % (
result["reg"],
result["train"],
result["dev"],
result["test"])
print ""
bestResult = chooseBestModel(results)
print "Best regularization value: %0.2E" % bestResult["reg"]
print "Test accuracy (%%): %f" % bestResult["test"]
# do some error analysis
if args.pretrained:
plotRegVsAccuracy(regValues, results, "q4_reg_v_acc.png")
outputConfusionMatrix(devFeatures, devLabels, bestResult["clf"],
"q4_dev_conf.png")
outputPredictions(devset, devFeatures, devLabels, bestResult["clf"],
"q4_dev_pred.txt")
if __name__ == "__main__":
main(getArguments())