Neural Networks and Deep Learning (Part 1)

Copyright notice: This is an original post by the blogger and may not be reproduced without permission. https://blog.csdn.net/laingliang/article/details/52985527

Over the last two days I went through Chapter 1 of the online book Neural Networks and Deep Learning and Stanford's Machine Learning open course. I learned about the two main kinds of neuron used in neural networks and an important machine-learning algorithm: stochastic gradient descent. Here is a summary.

For a computational model to qualify as a neural network, it usually needs a large number of interconnected nodes (neurons) with two characteristics:

1. Each neuron processes the weighted inputs it receives from neighbouring neurons through some specific output function (also called the activation function).

2. The strength of the connections between neurons is defined by weights, and the algorithm keeps learning on its own, adjusting these weights.

On this foundation, the neural network model is trained with large amounts of data.

A few concepts:

cost function: quantitatively measures how far the output computed for a given input deviates from the correct value

learning algorithm: based on the value of the cost function, corrects the network so as to find the optimal weights between neurons as quickly as possible


Perceptron neurons:

  

Figure 1: A perceptron neuron


Here x1, x2, x3 are the inputs, which must be binary (0 or 1), and the output is also binary. The w are the weights; designing these weights is both the key point and the hard part. The perceptron's output is computed as:

\text{output} = \begin{cases} 0 & \text{if } \sum_j w_j x_j \le \text{threshold} \\ 1 & \text{if } \sum_j w_j x_j > \text{threshold} \end{cases}

Writing w and x for the weight and input vectors, and introducing the bias b = -threshold, this simplifies to:

\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \le 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases}
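
To make the decision rule concrete, here is a minimal numpy sketch of a single perceptron; the inputs, weights, and bias below are made-up values for illustration:

import numpy as np

def perceptron(x, w, b):
    """Output 1 if the weighted sum plus the bias is positive, otherwise 0."""
    return 1 if np.dot(w, x) + b > 0 else 0

# three binary inputs, hand-picked weights, and a bias b = -threshold
x = np.array([1, 0, 1])
w = np.array([0.6, 0.2, 0.3])
print(perceptron(x, w, b=-0.5))   # 0.6 + 0.3 - 0.5 = 0.4 > 0, so the output is 1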



 

Sigmoid neurons:

      
Figure 2: A sigmoid neuron

Comparing perceptron neurons with sigmoid neurons, their structure is the same, but the inputs take different values: a sigmoid neuron's inputs can take any value between 0 and 1, and its output is not 0 or 1 but \sigma(w \cdot x + b), where \sigma is called the sigmoid function and is defined as:

\sigma(z) \equiv \frac{1}{1 + e^{-z}}

So for inputs x1, x2, ..., weights w1, w2, ..., and bias b, the output of the sigmoid neuron is:

\text{output} = \frac{1}{1 + \exp\left(-\sum_j w_j x_j - b\right)}

From this formula we can plot the response curve of the sigmoid function:

[Figure: the sigmoid function, a smooth S-shaped curve rising from 0 to 1, equal to 0.5 at z = 0]
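
A minimal numerical sketch of a sigmoid neuron's output (the sigmoid helper matches the one defined in the network.py listing further below); the sample inputs, weights, and bias are made up:

import numpy as np

def sigmoid(z):
    """The sigmoid function sigma(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([0.35, 0.9])         # inputs may now be any values, not just 0 or 1
w = np.array([0.4, 0.6])
b = -0.5
print(sigmoid(np.dot(w, x) + b))  # a smooth value strictly between 0 and 1
# sigmoid(0) == 0.5; large positive z gives values near 1, large negative z near 0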


The architecture of neural networks




[Figure: a feedforward network with an input layer, a hidden layer, and an output layer]

As the figure above shows, a neural network consists of an input layer, an output layer, and one or more hidden layers. Such a multilayer network is called a multilayer perceptron, or MLP.
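
To get a concrete sense of what the layers amount to numerically, here is a minimal sketch of the weight and bias arrays for a small [2, 3, 1] network, mirroring the random initialization used in the network.py code later in this post:

import numpy as np

sizes = [2, 3, 1]   # 2 input neurons, 3 hidden neurons, 1 output neuron
# no biases for the input layer; one bias column vector per later layer
biases = [np.random.randn(y, 1) for y in sizes[1:]]
# one weight matrix per pair of adjacent layers
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]

print([b.shape for b in biases])    # [(3, 1), (1, 1)]
print([w.shape for w in weights])   # [(3, 2), (1, 3)]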


Gradient descent:

To check whether, for all training inputs x, the weights and biases we have chosen make the network's outputs approximately equal to y(x), we use a cost function (also called a loss or objective function):

C(w, b) = \frac{1}{2n} \sum_x \| y(x) - a \|^2 \qquad (6)

Here w denotes the collection of all weights in the network, b all the biases, n is the total number of training inputs, and a is the network's output vector when x is the input (so a depends on x, w, and b).

If C(w, b) ≈ 0, then for all training inputs x, y(x) is approximately equal to the output a, which is what we want.

If C(w, b) is large, then for many inputs x the output a is far from y(x).

The aim of the training algorithm is to minimize C(w, b); in other words, we want to find a set of weights w and biases b that makes C(w, b) as small as possible.

The algorithm we use for this is gradient descent.
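
As a quick check on formula (6), here is a minimal numpy sketch of the quadratic cost; outputs and targets are hypothetical arrays standing in for the network outputs a and the desired outputs y(x), one column per training input:

import numpy as np

def quadratic_cost(outputs, targets):
    """C = 1/(2n) * sum over training inputs of ||y(x) - a||^2."""
    n = outputs.shape[1]
    return np.sum(np.linalg.norm(targets - outputs, axis=0) ** 2) / (2 * n)

# two training inputs, three output neurons each
a = np.array([[0.1, 0.9],
              [0.8, 0.2],
              [0.0, 0.1]])
y = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [0.0, 0.0]])
print(quadratic_cost(a, y))   # small when the outputs are close to the targets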



[Figure: the cost C plotted against two variables v1 and v2, a bowl-shaped valley with a single minimum]

We want to find the lowest point of the valley in the figure above. The tool is the gradient from multivariable calculus, which tells us the direction of steepest change, that is, along which direction C(w, b) decreases fastest. This is the core idea of gradient descent. (Here we use v1 and v2 to stand for w and b.)

Let ∆v1 and ∆v2 be small changes along the v1 and v2 directions, and ∆C the resulting change in C(v1, v2):

\Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2 \qquad (7)

The idea is to choose ∆v1 and ∆v2 so that ∆C is negative, so that C keeps moving toward smaller values.

Define the gradient vector:

\nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T

Writing \Delta v = (\Delta v_1, \Delta v_2)^T, equation (7) can be rewritten as:

\Delta C \approx \nabla C \cdot \Delta v

Now suppose we choose

\Delta v = -\eta \nabla C, where η is a small positive parameter (known as the learning rate).

Substituting gives:

\Delta C \approx -\eta \nabla C \cdot \nabla C = -\eta \| \nabla C \|^2 \le 0

We can then update v repeatedly:

v \to v' = v - \eta \nabla C

How do we apply gradient descent to a neural network? We use it to repeatedly adjust the weights w and biases b so that the cost in equation (6) is minimized. The update rules are:

w_k \to w_k' = w_k - \eta \frac{\partial C}{\partial w_k}

b_l \to b_l' = b_l - \eta \frac{\partial C}{\partial b_l}
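
A minimal sketch of these update rules in action on a toy cost, using numerically estimated partial derivatives; the function and variable names here are made up and are not part of the book's code:

import numpy as np

def cost(v):
    """A toy bowl-shaped cost: C(v1, v2) = v1^2 + 3*v2^2, minimized at (0, 0)."""
    return v[0] ** 2 + 3 * v[1] ** 2

def numerical_gradient(f, v, h=1e-5):
    """Estimate the gradient of f at v with central differences."""
    grad = np.zeros_like(v)
    for k in range(len(v)):
        step = np.zeros_like(v)
        step[k] = h
        grad[k] = (f(v + step) - f(v - step)) / (2 * h)
    return grad

eta = 0.1                      # learning rate
v = np.array([2.0, -1.5])      # starting point
for _ in range(100):
    v = v - eta * numerical_gradient(cost, v)   # v -> v' = v - eta * grad C
print(v, cost(v))              # v ends up very close to the minimum at (0, 0)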

Stochastic gradient descent (SGD)

To address the problem that gradient descent learns very slowly when the number of training inputs is large, a variant called stochastic gradient descent is used to speed up learning. The idea is to estimate the gradient ∇C by computing the gradient only for a small sample of randomly chosen training inputs:

\nabla C \approx \frac{1}{m} \sum_{j=1}^{m} \nabla C_{X_j}

Here m is the number of randomly chosen training inputs, and the chosen samples X1, X2, ..., Xm are called a mini-batch.

This gives the update rules:

w_k \to w_k' = w_k - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k}

b_l \to b_l' = b_l - \frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l}

         
           

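
As a small illustration of the mini-batch idea (independent of the full network.py listing below), here is a sketch of shuffling the training data and splitting it into mini-batches; training_data is a hypothetical list of (x, y) pairs and make_mini_batches is a made-up helper name:

import random

def make_mini_batches(training_data, mini_batch_size):
    """Shuffle the data and split it into consecutive mini-batches."""
    data = list(training_data)
    random.shuffle(data)
    return [data[k:k + mini_batch_size]
            for k in range(0, len(data), mini_batch_size)]

# ten fake training pairs split into mini-batches of size 3
pairs = [(x, x % 2) for x in range(10)]
for batch in make_mini_batches(pairs, 3):
    print(batch)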

The code below applies a network like the one above to simple handwritten digit recognition.

"""
network.py
~~~~~~~~~~

A module to implement the stochastic gradient descent learning
algorithm for a feedforward neural network.  Gradients are calculated
using backpropagation.  Note that I have focused on making the code
simple, easily readable, and easily modifiable.  It is not optimized,
and omits many desirable features.
"""

#### Libraries
# Standard library
import random

# Third-party libraries
import numpy as np

class Network(object):

    def __init__(self, sizes):
        """The list ``sizes`` contains the number of neurons in the
        respective layers of the network.  For example, if the list
        was [2, 3, 1] then it would be a three-layer network, with the
        first layer containing 2 neurons, the second layer 3 neurons,
        and the third layer 1 neuron.  The biases and weights for the
        network are initialized randomly, using a Gaussian
        distribution with mean 0, and variance 1.  Note that the first
        layer is assumed to be an input layer, and by convention we
        won't set any biases for those neurons, since biases are only
        ever used in computing the outputs from later layers."""
        self.num_layers = len(sizes)
        self.sizes = sizes
        self.biases = [np.random.randn(y, 1) for y in sizes[1:]]
        self.weights = [np.random.randn(y, x)
                        for x, y in zip(sizes[:-1], sizes[1:])]

    def feedforward(self, a):
        """Return the output of the network if ``a`` is input."""
        for b, w in zip(self.biases, self.weights):
            a = sigmoid(np.dot(w, a)+b)
        return a

    def SGD(self, training_data, epochs, mini_batch_size, eta,
            test_data=None):
        """Train the neural network using mini-batch stochastic
        gradient descent.  The ``training_data`` is a list of tuples
        ``(x, y)`` representing the training inputs and the desired
        outputs.  The other non-optional parameters are
        self-explanatory.  If ``test_data`` is provided then the
        network will be evaluated against the test data after each
        epoch, and partial progress printed out.  This is useful for
        tracking progress, but slows things down substantially."""
        if test_data: n_test = len(test_data)
        n = len(training_data)
        for j in range(epochs):
            random.shuffle(training_data)
            mini_batches = [
                training_data[k:k+mini_batch_size]
                for k in range(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta)
            if test_data:
                print("Epoch {0}: {1} / {2}".format(
                    j, self.evaluate(test_data), n_test))
            else:
                print("Epoch {0} complete".format(j))

    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The ``mini_batch`` is a list of tuples ``(x, y)``, and ``eta``
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]

    def backprop(self, x, y):
        """Return a tuple ``(nabla_b, nabla_w)`` representing the
        gradient for the cost function C_x.  ``nabla_b`` and
        ``nabla_w`` are layer-by-layer lists of numpy arrays, similar
        to ``self.biases`` and ``self.weights``."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

    def evaluate(self, test_data):
        """Return the number of test inputs for which the neural
        network outputs the correct result. Note that the neural
        network's output is assumed to be the index of whichever
        neuron in the final layer has the highest activation."""
        test_results = [(np.argmax(self.feedforward(x)), y)
                        for (x, y) in test_data]
        return sum(int(x == y) for (x, y) in test_results)

    def cost_derivative(self, output_activations, y):
        r"""Return the vector of partial derivatives \partial C_x /
        \partial a for the output activations."""
        return (output_activations-y)

#### Miscellaneous functions
def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
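
For completeness, a sketch of how this class is typically driven, assuming the book's companion mnist_loader module is available on the path (it is not included in this post); the list() calls are there in case the loader returns iterators:

import mnist_loader

training_data, validation_data, test_data = mnist_loader.load_data_wrapper()
training_data = list(training_data)
test_data = list(test_data)

# 784 input pixels, 30 hidden neurons, 10 output classes (digits 0-9)
net = Network([784, 30, 10])
# 30 epochs, mini-batches of size 10, learning rate eta = 3.0
net.SGD(training_data, 30, 10, 3.0, test_data=test_data)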


Time to hurry up and recharge: object-oriented programming in C++ / Python, and the overall knowledge framework of neural networks!
       






