Neural Networks and an Implementation of the Optimization Steps

1. Neurons

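The basic building block of a neural network is an artificial neuron called the perceptron, which you may also have met in the context of support vector machines. A perceptron works in a very simple way: it takes several binary inputs $x_1, x_2, \ldots$ and produces a single binary output.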

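The perceptron in the example has three inputs, $x_1, x_2, x_3$. In general it could have more or fewer. Rosenblatt proposed a simple rule for computing the output. He introduced weights $w_1, w_2, \ldots$, real numbers expressing the importance of each input to the output. The neuron's output, 0 or 1, is determined by whether the weighted sum $\sum_j w_j x_j$ is less than or greater than some threshold value. Like the weights, the threshold is a real number and a parameter of the neuron. In more precise algebraic terms:

$$\text{output} = \begin{cases} 0 & \text{if } \sum_j w_j x_j \le \text{threshold} \\ 1 & \text{if } \sum_j w_j x_j > \text{threshold} \end{cases}$$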
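The rule above can be sketched in a few lines of Python. This is only an illustration, not code from the original post: the function name `perceptron_output` and the example weights and threshold are assumptions chosen for the demonstration.

```python
def perceptron_output(inputs, weights, threshold):
    """Return 1 if the weighted sum of the inputs exceeds the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0


# Hypothetical example: three binary inputs, the second one matters most.
weights = [1.0, 3.0, 1.0]
threshold = 2.5

print(perceptron_output([1, 0, 1], weights, threshold))  # weighted sum is 2.0
print(perceptron_output([0, 1, 0], weights, threshold))  # weighted sum is 3.0
```

Raising the threshold makes the neuron more reluctant to output 1, while raising a weight makes the corresponding input more decisive.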

Combining perceptrons into a larger network of perceptrons makes it possible to reach more sophisticated decisions.

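The computational universality of perceptrons is both encouraging and disappointing. It is encouraging because it tells us that networks of perceptrons can be as powerful as any other computing device. It is disappointing because it makes perceptrons look like merely a new type of NAND gate.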
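To make the comparison with NAND gates concrete, here is a small sketch of a two-input perceptron that computes NAND. The weights (-2, -2) and the threshold -3 are one possible choice and are not taken from the text above; the function name `nand_perceptron` is likewise an assumption.

```python
def nand_perceptron(x1, x2):
    """Two-input perceptron with weights (-2, -2) and threshold -3 (assumed values)."""
    weights = (-2, -2)
    threshold = -3
    weighted_sum = weights[0] * x1 + weights[1] * x2
    return 1 if weighted_sum > threshold else 0


# Evaluate the perceptron on all four input combinations.
truth_table = {(a, b): nand_perceptron(a, b) for a in (0, 1) for b in (0, 1)}
print(truth_table)
```

Only the input (1, 1) drives the weighted sum down to -4, below the threshold, so it is the only case that yields 0. Since NAND gates are universal for computation, networks of such perceptrons are too.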

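What sets artificial neurons apart is that we can devise learning algorithms that automatically tune their weights and biases. (The bias is the negated threshold, $b \equiv -\text{threshold}$, so a perceptron outputs 1 exactly when $w \cdot x + b > 0$.) This tuning happens in response to external stimuli, without direct intervention by a programmer. These learning algorithms let us use artificial neurons in a way that is fundamentally different from conventional logic gates.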

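Next we introduce the sigmoid neuron, a unit that is more powerful than the perceptron. We depict sigmoid neurons in the same way we depicted perceptrons.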


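Just like a perceptron, the sigmoid neuron has inputs $x_1, x_2, \ldots$. But instead of being just 0 or 1, these inputs can take any value between 0 and 1. For example, 0.638 is a valid input for a sigmoid neuron. Also like a perceptron, the sigmoid neuron has a weight for each input, $w_1, w_2, \ldots$, and an overall bias, $b$. But the output is not 0 or 1. Instead, it is $\sigma(w \cdot x + b)$, where $\sigma$ is called the sigmoid function and is defined by:

$$\sigma(z) \equiv \frac{1}{1 + e^{-z}}$$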

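To see the similarity to the perceptron model, suppose $z \equiv w \cdot x + b$ is a large positive number. Then $e^{-z} \approx 0$ and so $\sigma(z) \approx 1$. In other words, when $z = w \cdot x + b$ is large and positive, the output of the sigmoid neuron is approximately 1, just as it would be for a perceptron. Conversely, suppose $z = w \cdot x + b$ is very negative. Then $e^{-z}$ is very large and $\sigma(z) \approx 0$, so the sigmoid neuron again behaves much like a perceptron. Only when $w \cdot x + b$ is of modest size is there much deviation from the perceptron model.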
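The behaviour described above can be checked numerically. The following is a minimal sketch using NumPy; the helper names `sigmoid` and `sigmoid_neuron` and the example weights and bias are assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    """The sigmoid function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_neuron(x, w, b):
    """Output of a sigmoid neuron with inputs x, weights w and bias b."""
    return sigmoid(np.dot(w, x) + b)


# Inputs may be any values between 0 and 1 (0.638 is the example from the text).
x = np.array([0.638, 0.2, 0.9])
w = np.array([0.5, -1.0, 2.0])   # hypothetical weights
b = -0.5                         # hypothetical bias
out = sigmoid_neuron(x, w, b)
print(out)

# Large |z| saturates the output towards 0 or 1, like a perceptron.
for z in (-10.0, 0.0, 10.0):
    print(z, sigmoid(z))
```

For $z = \pm 10$ the output is already within about $5 \times 10^{-5}$ of 0 or 1, while $z = 0$ gives exactly 0.5, the point where the sigmoid neuron differs most from a perceptron.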

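If $\sigma$ were in fact a step function, the sigmoid neuron would be a perceptron, since its output would depend only on whether $w \cdot x + b$ is positive or negative. Because $\sigma$ is a smoothed-out step function, the sigmoid neuron is effectively a smoothed perceptron. The main difference between the two is therefore that a sigmoid neuron does not output only 0 or 1: it can output any real number between 0 and 1.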

2. Neural Network Architecture

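For historical reasons, such multilayer networks are sometimes called multilayer perceptrons, or MLPs, even though they are made up of sigmoid neurons rather than perceptrons.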

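In the network shown above, the output of each layer is used as the input to the next layer. Networks of this kind are called feedforward neural networks: there are no loops in the network, information is always fed forward, and results are never fed back to an earlier layer.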

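As the field developed, however, models of artificial neural networks with feedback loops were also proposed. These are called recurrent neural networks (RNNs), and the well-known LSTM is one example. They are networks built from neurons that have a form of memory. In principle they are closer than feedforward networks to how our brains actually work, and recurrent networks can solve some important problems.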

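The input layer of the network shown above contains 6 neurons, so it accepts samples with 6 attributes (features). The second and third layers are hidden layers, and the fourth layer is the output layer. Here the output is one-dimensional, but a network can also have several outputs, and in some cases multiple outputs work better.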
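A feedforward pass through such a four-layer network might look like the sketch below. The text only fixes the 6 inputs and the single output; the hidden-layer sizes (4 and 3), the random initialization, and the names `layer_sizes` and `feed_forward` are assumptions made for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


rng = np.random.default_rng(0)

# 6 inputs, two hidden layers (sizes 4 and 3 are assumed), 1 output.
layer_sizes = [6, 4, 3, 1]
weights = [rng.standard_normal((n_out, n_in))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [rng.standard_normal((n_out, 1)) for n_out in layer_sizes[1:]]


def feed_forward(x):
    """Propagate one sample through every layer; no value is ever fed back."""
    a = x.reshape(-1, 1)          # column vector of input activations
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)    # the output of one layer is the input to the next
    return a


sample = rng.random(6)            # one sample with 6 features
output = feed_forward(sample)
print(output.shape)
```

Each weight matrix has one row per neuron in the receiving layer and one column per neuron in the sending layer, which is the same layout the listing at the end of this post uses.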

3. Learning Algorithm

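Now that we have a design for our neural network, how can it learn to recognize digits? The first thing we need is a data set to learn from, called a training data set. We split the data into a training set and a test set, or alternatively into a training set, a validation set and a test set, so that we can check that the trained model predicts well on data it has not seen.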
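One simple way to produce such a split is sketched below. The 70/15/15 proportions, the function name `split_dataset` and the toy data are assumptions; the text does not prescribe any particular ratio.

```python
import numpy as np


def split_dataset(features, labels, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle the samples and split them into train / validation / test parts."""
    n = len(features)
    order = np.random.default_rng(seed).permutation(n)
    n_train = int(round(train_frac * n))
    n_val = int(round(val_frac * n))
    train_idx = order[:n_train]
    val_idx = order[n_train:n_train + n_val]
    test_idx = order[n_train + n_val:]      # whatever remains is the test set
    return [(features[idx], labels[idx]) for idx in (train_idx, val_idx, test_idx)]


# Toy data: 20 samples with 2 features each, labelled 0..19 so they can be traced.
features = np.arange(40).reshape(20, 2)
labels = np.arange(20)
train, val, test = split_dataset(features, labels)
print(len(train[0]), len(val[0]), len(test[0]))
```

Shuffling before splitting matters when the data file is ordered, for example by class. The validation part is typically used to choose hyperparameters such as the learning rate, and the test part is touched only once at the end.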

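We would like an algorithm that finds weights and biases such that the output of the network approximates the desired output $y(x)$ for every training input $x$. To quantify how well we are achieving this goal, we define a cost function:

$$C(w, b) \equiv \frac{1}{2n} \sum_x \| y(x) - a \|^2$$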

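Here $w$ denotes the collection of all weights in the network, $b$ all the biases, $n$ is the total number of training inputs, $a$ is the vector of outputs from the network when $x$ is the input, and the sum is over all training inputs $x$. $C$ is the quadratic cost function, also known as the mean squared error or MSE. With only two parameters the cost can be plotted as a surface, which makes the learning algorithm easy to visualize. If our learning algorithm can find weights and biases such that $C(w, b) \approx 0$, it has done a good job. By contrast, it is not doing so well when $C(w, b)$ is large: that means $y(x)$ is far from the output $a$ for a large number of inputs. The aim of our training algorithm is therefore to minimize the cost as a function of the weights and biases. Imagine that $C$ is a function of just two variables, $v_1$ and $v_2$.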
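As a small numerical illustration of the quadratic cost, the sketch below evaluates it for one set of outputs that is close to the targets and one that is far off. The function name `quadratic_cost` and the sample numbers are made up for the example.

```python
import numpy as np


def quadratic_cost(outputs, targets):
    """C = 1/(2n) * sum over samples of ||y(x) - a||^2 (one sample per row)."""
    n = outputs.shape[0]
    return np.sum((targets - outputs) ** 2) / (2.0 * n)


targets = np.array([[0.0], [1.0], [1.0], [0.0]])        # desired outputs y(x)
good_outputs = np.array([[0.1], [0.9], [0.8], [0.2]])   # network outputs close to y(x)
bad_outputs = np.array([[0.9], [0.2], [0.1], [0.7]])    # network outputs far from y(x)

print(quadratic_cost(good_outputs, targets))
print(quadratic_cost(bad_outputs, targets))
```

The first cost is much smaller than the second, which is exactly the signal the training algorithm uses: it adjusts the weights and biases to push this number down.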

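What we want is to find the global minimum of $C$. Solving for the minimum analytically with calculus is not practical, because a real network has far too many parameters. The ideas of calculus still help us understand the problem, though. When $v_1$ and $v_2$ change by small amounts $\Delta v_1$ and $\Delta v_2$, $C$ changes as follows:

$$\Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2$$

We are looking for a way of choosing $\Delta v_1$ and $\Delta v_2$ that makes $\Delta C$ negative, that is, a step toward the bottom of the valley in the figure above. We define

$$\Delta v \equiv (\Delta v_1, \Delta v_2)^T, \qquad \nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T$$

which gives

$$\Delta C \approx \nabla C \cdot \Delta v$$

Suppose we choose

$$\Delta v = -\eta \nabla C$$

where $\eta$ is a small positive number. Then $\Delta C \approx -\eta \| \nabla C \|^2 \le 0$, so $C$ does not increase. The parameter update step is therefore:

$$v \rightarrow v' = v - \eta \nabla C$$

This should look familiar: it is the classic gradient descent method, and $\eta$ is the learning rate.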
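The update rule can be tried out on a cost function of two variables. The quadratic bowl $C(v_1, v_2) = v_1^2 + 3 v_2^2$, the starting point and the learning rate below are assumptions picked so that the behaviour is easy to follow; they do not come from the original post.

```python
import numpy as np


def cost(v):
    """Toy cost C(v1, v2) = v1^2 + 3 * v2^2 with its minimum at the origin."""
    return v[0] ** 2 + 3.0 * v[1] ** 2


def grad_cost(v):
    """Gradient of the toy cost: (dC/dv1, dC/dv2)."""
    return np.array([2.0 * v[0], 6.0 * v[1]])


def gradient_descent(v0, eta=0.1, steps=100):
    """Repeatedly apply v -> v - eta * grad C and record the cost after each step."""
    v = np.array(v0, dtype=float)
    history = [cost(v)]
    for _ in range(steps):
        v = v - eta * grad_cost(v)
        history.append(cost(v))
    return v, history


v_final, history = gradient_descent([2.0, -1.5])
print(v_final, history[0], history[-1])
```

With this learning rate every step shrinks $v_1$ by a factor of 0.8 and $v_2$ by a factor of 0.4, so the cost falls monotonically towards 0. A learning rate above 1/3 would make the $v_2$ update overshoot and diverge, which is why $\eta$ has to be small.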

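Optimization: gradient descent has a serious problem. When the number of training samples is very large, learning becomes slow. In practice, to compute the gradient $\nabla C$ we need to compute the gradient $\nabla C_x$ separately for each training input $x$ and then average them: $\nabla C = \frac{1}{n} \sum_x \nabla C_x$.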

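Stochastic gradient descent: the idea is to estimate the gradient from a small sample of randomly chosen training inputs. Suppose each mini-batch contains $m$ samples, and the training set is divided into a number of such mini-batches. A step computed on one mini-batch is clearly much faster than a step computed on the full training set. An extreme version of gradient descent uses a mini-batch size of 1: each step feeds in a single sample and updates the weights and biases. Repeating this procedure is known as online, or incremental, learning.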
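The mini-batch idea can be sketched on a problem that is much simpler than a neural network: fitting a line $y = wx + b$ with a quadratic cost. The data, the function name `sgd_linear_fit` and all hyperparameters below are assumptions chosen for illustration.

```python
import numpy as np


def sgd_linear_fit(x, y, eta=0.1, epochs=200, batch_size=4, seed=0):
    """Fit y = w * x + b by mini-batch stochastic gradient descent."""
    rng = np.random.default_rng(seed)
    w, b = 0.0, 0.0
    n = len(x)
    for _ in range(epochs):
        order = rng.permutation(n)                 # reshuffle once per epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]  # indices of one mini-batch
            err = (w * x[idx] + b) - y[idx]
            grad_w = np.mean(err * x[idx])         # gradient averaged over the batch only
            grad_b = np.mean(err)
            w -= eta * grad_w
            b -= eta * grad_b
    return w, b


x = np.linspace(-1.0, 1.0, 20)
y = 2.0 * x + 0.5                                  # noise-free synthetic data
w_fit, b_fit = sgd_linear_fit(x, y)
print(w_fit, b_fit)
```

Each update looks at only 4 of the 20 samples, yet the parameters still approach the true values 2.0 and 0.5. Calling the function with `batch_size=1` gives the online (incremental) variant described above.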

4. The Backpropagation Algorithm

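In the previous section we described how gradient descent adjusts the weights and biases, namely

$$w_k \rightarrow w_k' = w_k - \eta \frac{\partial C}{\partial w_k}, \qquad b_l \rightarrow b_l' = b_l - \eta \frac{\partial C}{\partial b_l}$$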

But this leaves one question open: we have not discussed how to compute the gradient of the cost function.

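At the heart of backpropagation is an expression for the partial derivative $\partial C / \partial w$ of the cost function $C$ with respect to any weight $w$ (or bias $b$) in the network.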

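We first introduce an intermediate quantity, $\delta^l_j \equiv \partial C / \partial z^l_j$, the error of the $j$-th neuron in layer $l$, where $z^l_j$ is the weighted input to that neuron. In terms of this error, backpropagation rests on four equations. Here $L$ denotes the output layer, $a^l = \sigma(z^l)$ are the activations of layer $l$, and $\odot$ is the elementwise product:

$$\delta^L = \nabla_a C \odot \sigma'(z^L) \tag{BP1}$$

$$\delta^l = \left( (w^{l+1})^T \delta^{l+1} \right) \odot \sigma'(z^l) \tag{BP2}$$

$$\frac{\partial C}{\partial b^l_j} = \delta^l_j \tag{BP3}$$

$$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j \tag{BP4}$$

For the derivation, see Sections 2.4 and 2.5 of "Neural Networks and Deep Learning".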
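A common way to gain confidence in the backpropagation equations (labelled BP1 to BP4 in this post) is to compare the gradient they produce with a finite-difference estimate. The sketch below does this for a tiny network with a quadratic cost on a single sample. The network sizes, the random data and all function names are assumptions made for the check; it is not the author's implementation, which appears at the end of the post.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)


def forward(weights, biases, x):
    """Return the weighted inputs z^l and the activations a^l of every layer."""
    activations, zs = [x], []
    for w, b in zip(weights, biases):
        zs.append(w @ activations[-1] + b)
        activations.append(sigmoid(zs[-1]))
    return zs, activations


def cost(weights, biases, x, y):
    """Quadratic cost for a single sample: 0.5 * ||a^L - y||^2."""
    _, activations = forward(weights, biases, x)
    return 0.5 * np.sum((activations[-1] - y) ** 2)


def backprop(weights, biases, x, y):
    zs, activations = forward(weights, biases, x)
    grad_b = [None] * len(biases)
    grad_w = [None] * len(weights)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])            # BP1: output error
    grad_b[-1] = delta                                               # BP3
    grad_w[-1] = delta @ activations[-2].T                           # BP4
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])  # BP2: propagate back
        grad_b[-l] = delta                                           # BP3
        grad_w[-l] = delta @ activations[-l - 1].T                   # BP4
    return grad_b, grad_w


rng = np.random.default_rng(0)
sizes = [3, 4, 2]                                   # assumed toy architecture
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]
x = rng.random((3, 1))
y = np.array([[1.0], [0.0]])

grad_b, grad_w = backprop(weights, biases, x, y)

# Central finite difference for one arbitrarily chosen weight.
eps = 1e-6
layer, j, k = 0, 1, 2
weights[layer][j, k] += eps
c_plus = cost(weights, biases, x, y)
weights[layer][j, k] -= 2 * eps
c_minus = cost(weights, biases, x, y)
weights[layer][j, k] += eps                         # restore the original weight
numeric = (c_plus - c_minus) / (2 * eps)

print(grad_w[layer][j, k], numeric)
```

The two printed numbers should agree to many decimal places. The finite-difference estimate needs two forward passes per parameter, whereas backpropagation obtains every partial derivative from one forward and one backward pass, which is why it is the method used in practice.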

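We can see that BP3 and BP4 directly determine how fast $w$ and $b$ learn, and that both are governed by the error $\delta^l$ from BP2. By BP1 and BP2, $\delta^L$ contains the factor $\sigma'(z^L)$ and $\delta^l$ contains the factor $\sigma'(z^l)$, so when $\sigma'(z^L)$ and $\sigma'(z^l)$ approach 0, learning becomes very slow.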
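How quickly the derivative of the sigmoid function vanishes can be seen from a few sample values. This is a small illustrative sketch; the chosen values of $z$ are arbitrary.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)


z_values = np.array([0.0, 2.0, 5.0, 10.0])
slopes = sigmoid_prime(z_values)
for z, slope in zip(z_values, slopes):
    print("z = {:5.1f}   sigma'(z) = {:.6f}".format(z, slope))
```

The slope is largest at $z = 0$, where it equals 0.25, and has dropped below $10^{-4}$ by $z = 10$. A neuron whose weighted input is that large is called saturated: its error term is multiplied by an almost-zero slope, so its weights barely move.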

Optimization 1: the cross-entropy cost function
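For a sigmoid output neuron, the cross-entropy cost $C = -\frac{1}{n} \sum_x \left[ y \ln a + (1 - y) \ln (1 - a) \right]$ has the output error $\delta^L = a - y$, without the $\sigma'(z^L)$ factor that slows down learning under the quadratic cost. The sketch below compares the two output errors for a badly wrong, saturated neuron. The function names and the values of `z` and `y` are assumptions chosen for illustration.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def output_delta_quadratic(z, y):
    """Output error under the quadratic cost: (a - y) * sigma'(z)."""
    a = sigmoid(z)
    return (a - y) * a * (1.0 - a)


def output_delta_cross_entropy(z, y):
    """Output error under the cross-entropy cost: a - y."""
    return sigmoid(z) - y


# A saturated neuron that is badly wrong: it outputs almost 1 but the target is 0.
z, y = 6.0, 0.0
delta_quadratic = output_delta_quadratic(z, y)
delta_cross_entropy = output_delta_cross_entropy(z, y)
print(delta_quadratic, delta_cross_entropy)
```

Under the quadratic cost the error signal is tiny even though the output is almost as wrong as it can be, while the cross-entropy error stays close to 1. These two expressions correspond to the `deltaC` methods of `QuadraticCost` and `CrossEntropyCost` in the listing at the end of this post.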

Optimization 2: regularization
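With L2 regularization, a term $\frac{\lambda}{2n} \sum_w w^2$ is added to the cost, and the weight update becomes $w \rightarrow \left(1 - \frac{\eta \lambda}{n}\right) w - \frac{\eta}{m} \sum_x \frac{\partial C_x}{\partial w}$, which is the form used by `update_mini_batch` in the listing at the end of this post. The sketch below isolates that step. The function name `l2_regularized_step` and the numbers are assumptions made for illustration.

```python
import numpy as np


def l2_regularized_step(w, grad_w_sum, eta, lmbda, n, batch_size):
    """One weight update with L2 weight decay.

    grad_w_sum is the gradient summed over the mini-batch, n is the size of
    the whole training set, and lmbda is the regularization parameter.
    """
    return (1.0 - eta * lmbda / n) * w - (eta / batch_size) * grad_w_sum


w = np.array([[1.0, -2.0], [0.5, 4.0]])
zero_grad = np.zeros_like(w)

# With a zero gradient only the decay factor acts on the weights.
w_next = l2_regularized_step(w, zero_grad, eta=0.5, lmbda=0.1, n=10, batch_size=2)
print(w_next)
```

Even when the data gradient is zero, every weight shrinks by the factor $1 - \eta\lambda/n$ (0.995 here), which is why this technique is also called weight decay. The biases are usually left unregularized, as they are in the listing.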

Optimization 3: better weight initialization
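The motivation for scaling the initial weights by $1/\sqrt{n_{in}}$, where $n_{in}$ is the number of inputs to a neuron, can be seen by looking at the spread of the weighted input $z$. The sketch below uses Gaussian weights and an all-ones input, both of which are assumptions made for illustration; the listing that follows uses uniform random weights with the same $1/\sqrt{n_{in}}$ scaling.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in = 1000                     # number of inputs to the neuron (assumed)
x = np.ones(n_in)               # hypothetical input with every feature switched on
trials = 2000                   # number of independent random initializations

# Unscaled standard-normal weights versus weights scaled by 1 / sqrt(n_in).
naive_w = rng.standard_normal((trials, n_in))
scaled_w = rng.standard_normal((trials, n_in)) / np.sqrt(n_in)

z_naive = naive_w @ x           # weighted input for each unscaled initialization
z_scaled = scaled_w @ x         # weighted input for each scaled initialization

print(z_naive.std(), z_scaled.std())
```

With unscaled weights the standard deviation of $z$ grows like $\sqrt{n_{in}}$ (about 32 here), so most neurons start out saturated and learn slowly. With the scaling it stays near 1, which keeps the sigmoid in the region where its slope is not negligible.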

# -*- coding: utf-8 -*-
import numpy as np


# tanh activation, range (-1, 1)
def tanh(x, deriv=False):
    if deriv:
        return 1.0 - np.tanh(x) * np.tanh(x)
    return np.tanh(x)


# sigmoid activation, range (0, 1)
def sigmoid_basic(z):
    return 1.0 / (1 + np.exp(-z))


def sigmoid_derivative(z):
    """Derivative of the sigmoid function."""
    return sigmoid_basic(z) * (1 - sigmoid_basic(z))


def sigmoid(z, deriv=False):
    if deriv:
        return sigmoid_derivative(z)
    return sigmoid_basic(z)

X = np.array([[0, 0, 1, 1, 0],
              [0, 1, 1, 1, 0],
              [1, 0, 1, 0, 1],
              [1, 1, 1, 0, 0],
              [1, 1, 1, 1, 0],
              [1, 1, 1, 1, 1]])

Y = np.array([[0,1,1,0,1,1]]).T

np.random.seed(1)

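# Cross-entropy cost function
class CrossEntropyCost(object):
    def __init__(self, activation):
        self.activation = activation

    def fn(self, a, y):
        # nan_to_num guards against 0 * log(0) when a is exactly 0 or 1
        return np.sum(np.nan_to_num(-y * np.log(a) - (1 - y) * np.log(1 - a)))

    def deltaC(self, z, a, y):
        # Output-layer error; for sigmoid outputs the derivative factor cancels
        return (a - y)


# Quadratic cost function
class QuadraticCost(object):
    def __init__(self, activation):
        self.activation = activation

    def fn(self, a, y):
        return 0.5 * np.linalg.norm(a - y) ** 2

    def deltaC(self, z, a, y):
        # Output-layer error: (a - y) times the derivative of the activation
        return (a - y) * self.activation(z, True)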

class NeuralNetwork:
    def __init__(self, layers, activation='sigmoid', cost='CrossEntropyCost'):
        """
        :param layers: A list containing the number of units in each layer. Should have at least two values.
        :param activation: The activation function to use: "sigmoid" or "tanh"
        :param cost: The cost function to use: "CrossEntropyCost" or "QuadraticCost"
        """
        self.layers = layers
        self.num_layers = len(layers)
        if 'sigmoid' == activation:
            self.activation = sigmoid
        elif 'tanh' == activation:
            self.activation = tanh
        else:
            raise ValueError("unknown activation: {}".format(activation))
        # Alternative initializations that were tried:
        # self.biases = [2*np.random.random((y,1))-1 for y in self.layers[1:]]
        # self.weights = [2*np.random.random((y,x))-1 for x,y in zip(self.layers[:-1],self.layers[1:])]
        # self.biases = [np.random.randn(y,1) for y in self.layers[1:]]
        # self.weights = [np.random.randn(y,x)/np.sqrt(x) -1 for x,y in zip(self.layers[:-1],self.layers[1:])]
        # Worked best: uniform values in [-1, 1), with the weights scaled by 1/sqrt(number of inputs)
        self.biases = [2 * np.random.random((y, 1)) - 1 for y in self.layers[1:]]
        self.weights = [(2 * np.random.random((y, x)) - 1) / np.sqrt(x)
                        for x, y in zip(self.layers[:-1], self.layers[1:])]
        if 'CrossEntropyCost' == cost:
            self.cost = CrossEntropyCost(self.activation)
        elif 'QuadraticCost' == cost:
            self.cost = QuadraticCost(self.activation)
        else:
            raise ValueError("unknown cost: {}".format(cost))
        self.flag = True

    def feed_forward(self, li):
        # li holds one sample per row; transpose so that each column is a sample
        li = li.T
        for b, w in zip(self.biases, self.weights):
            li = self.activation(np.dot(w, li) + b)
        return li

    def fit(self, training_data, epochs, mini_batch_size, eta=1, monitor_training_accuracy=False,
            evaluation_data=None, monitor_evaluation_accuracy=False, lmbda=0.0):
        """
        Train the neural network using mini-batch stochastic gradient descent.
        :param training_data: A list of tuples (x, y)
        :param epochs: Number of passes over the training data
        :param mini_batch_size: Number of samples in each mini-batch
        :param eta: Learning rate
        :param monitor_training_accuracy: If True, report the accuracy on the training data
        :param evaluation_data: Optional list of tuples (x, y); currently unused
        :param monitor_evaluation_accuracy: Currently unused
        :param lmbda: L2 regularization parameter
        :return: None
        """
        # Stack the samples once so the error can be monitored on the whole training set
        # (the original read the module-level X and Y here)
        train_x = np.array([ele[0] for ele in training_data])
        train_y = np.array([ele[1] for ele in training_data])
        if train_x.shape[1] != self.layers[0]:
            print("input data dimension error!")
            self.flag = False
            return
        n = len(training_data)
        for j in range(epochs):
            np.random.shuffle(training_data)
            mini_batches = [training_data[k:k + mini_batch_size] for k in range(0, n, mini_batch_size)]
            for mini_batch in mini_batches:
                self.update_mini_batch(mini_batch, eta, lmbda, n)

            if (j % 10000) == 0:
                li_error = train_y.T - self.feed_forward(train_x)
                print("ErrorValue:" + str(np.mean(np.abs(li_error))))
                if monitor_training_accuracy:
                    accuracy = self.accuracy(training_data, multiOutput=False)
                    print("Accuracy on training data: {} / {}".format(accuracy, n))
        # Feed the training inputs forward through all layers and print the final outputs
        re = self.feed_forward(train_x)
        print(re)

    def update_mini_batch(self, mini_batch, eta, lmbda, n):
        """
        Update the network's weights and biases by applying gradient descent
        using backpropagation to a single mini-batch.
        :param mini_batch: A list of tuples (x, y)
        :param eta: Learning rate
        :param lmbda: L2 regularization parameter
        :param n: Total number of training samples
        :return: None
        """
        x = np.array([ele[0] for ele in mini_batch])
        y = np.array([ele[1] for ele in mini_batch])
        # backprop is vectorized: the gradients it returns are already summed over the mini-batch
        delta_nabla_b, delta_nabla_w = self.backprop(x, y)
        self.biases = [b - eta / len(mini_batch) * nb for b, nb in zip(self.biases, delta_nabla_b)]
        # The factor (1 - eta * lmbda / n) is the L2 weight decay
        self.weights = [(1 - eta * (lmbda / n)) * w - eta / len(mini_batch) * nw
                        for w, nw in zip(self.weights, delta_nabla_w)]

    def backprop(self, x, y):
        """
        Return a tuple (delta_nabla_b, delta_nabla_w) representing the gradient of the cost,
        summed over the samples in x.
        'delta_nabla_b' and 'delta_nabla_w' are layer-by-layer lists of numpy arrays,
        similar to 'self.biases' and 'self.weights'.
        :param x: Inputs, one sample per row
        :param y: Targets, one sample per row
        :return: (delta_nabla_b, delta_nabla_w)
        """
        delta_nabla_b = [np.zeros(b.shape) for b in self.biases]
        delta_nabla_w = [np.zeros(w.shape) for w in self.weights]
        # Forward pass: store every weighted input z and activation, one column per sample
        li = x.T
        lis = [li]
        zs = []
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, li) + b
            zs.append(z)
            li = self.activation(z)
            lis.append(li)
        # Backward pass: start with the error of the output layer
        delta = (self.cost).deltaC(zs[-1], lis[-1], y.T)

        delta_nabla_b[-1] = np.add.reduce(delta, axis=1).reshape(delta_nabla_b[-1].shape)
        delta_nabla_w[-1] = np.dot(delta, lis[-2].transpose())

        # Propagate the error back through the hidden layers
        for l in range(2, self.num_layers):
            z = zs[-l]
            sp = self.activation(z, True)
            delta = np.dot(self.weights[-l + 1].transpose(), delta) * sp
            delta_nabla_b[-l] = np.add.reduce(delta, axis=1).reshape(delta_nabla_b[-l].shape)
            delta_nabla_w[-l] = np.dot(delta, lis[-l - 1].transpose())
        return (delta_nabla_b, delta_nabla_w)

    def accuracy(self, data, multiOutput=False):
        """
        :param data: A list of tuples (x, y)
        :param multiOutput: Set to False if the output layer is a single neuron,
                            and to True if it has several neurons
        :return: Number of samples predicted correctly
        """
        x = np.array([ele[0] for ele in data])
        y = np.array([ele[1] for ele in data])
        error_threshold = 1e-3
        if multiOutput:
            # Per sample, compare the index of the largest output with the index of the largest target
            predicted = np.argmax(self.feed_forward(x), axis=0)
            expected = np.argmax(y, axis=1)
            result = np.sum(predicted == expected)
        else:
            # A sample counts as correct when the output is within error_threshold of the target
            li = self.feed_forward(x)
            result = np.sum(np.abs(y.T - li) < error_threshold)
        return result

    def predict(self, testing_data=None):
        if testing_data is None or not self.flag:
            # Nothing to predict, or training was aborted by a dimension error
            print("no prediction available")
            return
        x = np.array([ele[0] for ele in testing_data])
        y = np.array([ele[1] for ele in testing_data])
        re = self.feed_forward(x)
        print("\nresults:{}".format(re))
        accuracy = self.accuracy(testing_data, multiOutput=False)
        print("Accuracy on testing data:{} / {}".format(accuracy, len(y)))

if __name__ == "__main__":
    layers = [5, 4, 3, 1]
    training_data = [(x, y) for x, y in zip(X, Y)]
    epochs = 60000
    mini_batch_size = 2
    eta = 1
    activation = 'sigmoid'
    cost = 'CrossEntropyCost'
    nn = NeuralNetwork(layers, activation, cost)
    nn.fit(training_data, epochs, mini_batch_size, eta, monitor_training_accuracy=True, lmbda=0.00001)
    tx = np.array([[0, 1, 1, 1, 0],
                   [0, 0, 1, 0, 0]])
    ty = np.array([[1, 0]]).T
    testing_data = [(x, y) for x, y in zip(tx, ty)]
    nn.predict(testing_data)

    print("OK")