[Machine Learning] 4 Neural Network Learning

1 Non-linear Hypothesis

  • Neural networks address a problem that ordinary logistic regression cannot handle well: the case where the number of features is too large

2 Neuron Model

2.1 Logistic Unit

(Figure: logistic unit)

2.2 Neural Network

(Figure: neural network)

  • If a network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, then $\theta^{(j)}$ will be of dimension $s_{j+1}\times(s_j+1)$
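For example, with 3 input features ($s_1=3$) and 5 hidden units ($s_2=5$), $\theta^{(1)}$ has size $5\times 4$; the extra column multiplies the bias unit. A minimal Octave check of this dimension rule (the variable names Theta1 and Theta2 are placeholders, not from the notes):

% Dimension check for a 3-input, 5-hidden-unit, 1-output network.
Theta1 = zeros(5, 3 + 1);   % maps layer 1 (s1 = 3 units + bias) to layer 2 (s2 = 5 units)
Theta2 = zeros(1, 5 + 1);   % maps layer 2 (s2 = 5 units + bias) to layer 3 (s3 = 1 unit)
size(Theta1)                % ans = 5 4, i.e. s2 x (s1 + 1)
size(Theta2)                % ans = 1 6, i.e. s3 x (s2 + 1)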

2.3 Difference Between a Neural Network and Logistic Regression

  • A neural network is just like a logistic regression model, except that rather than using the original features $x_1,x_2,x_3$, it uses the new features $a_1,a_2,a_3$
  • The features $a$ are computed from (depend on) the inputs $x$
  • Because the parameters that produce $a$ are themselves learned by gradient descent, these higher-level features $a$ can become far more expressive than simply raising $x$ to powers, and they predict new data better

3 Forward Propagation

  • $$\begin{aligned} a_1^{(2)}&=g\left(\theta_{10}^{(1)}x_0+\theta_{11}^{(1)}x_1+\theta_{12}^{(1)}x_2+\theta_{13}^{(1)}x_3\right)\\ a_2^{(2)}&=g\left(\theta_{20}^{(1)}x_0+\theta_{21}^{(1)}x_1+\theta_{22}^{(1)}x_2+\theta_{23}^{(1)}x_3\right)\\ a_3^{(2)}&=g\left(\theta_{30}^{(1)}x_0+\theta_{31}^{(1)}x_1+\theta_{32}^{(1)}x_2+\theta_{33}^{(1)}x_3\right) \end{aligned}$$
  • $h_{\theta}(x)=a_1^{(3)}=g\left(\theta_{10}^{(2)}a_0^{(2)}+\theta_{11}^{(2)}a_1^{(2)}+\theta_{12}^{(2)}a_2^{(2)}+\theta_{13}^{(2)}a_3^{(2)}\right)$
  • $\theta\cdot X=a$
  • $z^{(2)}=\theta^{(1)}\times X^T$
  • $a^{(2)}=g\left(z^{(2)}\right)$
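A minimal Octave sketch of the vectorized forward pass above for a single example, assuming x, Theta1 and Theta2 are already defined (the variable names are placeholders of mine):

% Vectorized forward propagation for one example x (column vector of raw features).
sigmoid = @(z) 1 ./ (1 + exp(-z));   % logistic activation g(z)

a1 = [1; x];                 % add the bias unit x0 = 1
z2 = Theta1 * a1;            % Theta1 has size s2 x (s1 + 1)
a2 = [1; sigmoid(z2)];       % add the bias unit a0^(2) = 1
z3 = Theta2 * a2;            % Theta2 has size s3 x (s2 + 1)
h  = sigmoid(z3);            % h_theta(x) = a^(3)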

4 Multi-class Classification

(Figure: multi-class classification)

$\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\cdots,(x^{(m)},y^{(m)})\}$

  • $L=$ total no. of layers in the network
  • $s_l=$ no. of units (not counting the bias unit) in layer $l$
  • $K=$ no. of classes

$K=1$: binary classification
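For $K$ classes, the label $y$ is usually represented as a $K$-dimensional indicator ("one-hot") vector rather than a single integer. A small Octave sketch of that encoding (the variable names are mine):

% Convert integer labels y in {1, ..., K} into K-dimensional indicator vectors.
K = 4;                    % number of classes
y = [3; 1; 4];            % example labels for m = 3 training examples
I = eye(K);
Y = I(y, :)               % m x K matrix; row i is the indicator vector for y(i)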

5 Cost Function

$h_{\theta}(x)\in\mathbb{R}^K$
$\left(h_{\theta}(x)\right)_i=i^{\text{th}}$ output

  • $J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^m\sum_{k=1}^K\left(y_k^{(i)}\log\left(h_{\theta}(x^{(i)})\right)_k+(1-y_k^{(i)})\log\left(1-\left(h_{\theta}(x^{(i)})\right)_k\right)\right)\right]+\frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\theta_{ji}^{(l)}\right)^2$
  • For each training example the hypothesis outputs $K$ predictions; conceptually we can loop over the examples, compute the $K$ outputs for each, pick the most likely one, and compare it with the actual label in $y$
  • The regularization term is simply the sum of the squared entries of every layer's $\theta$ matrix after excluding the bias terms $\theta_0$ of each layer:
    (1) the loop over $j$ runs over the rows, whose number is the number of units $s_{l+1}$ in layer $l+1$
    (2) the loop over $i$ runs over the columns, whose number is the number of units $s_l$ in layer $l$
  • In short, the distance between $h_{\theta}(x)$ and the true value is summed over every example and every output class, and the regularization term sums the squares of all parameters except the bias terms (a code sketch follows this list)
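A minimal, unvectorized Octave sketch of this cost for a 3-layer network, assuming Theta1, Theta2, X, Y (one-hot rows) and lambda are already defined; all names here are placeholders of mine:

% Cost function J(theta) for a 3-layer network, looping over the m examples.
sigmoid = @(z) 1 ./ (1 + exp(-z));
m = size(X, 1);
J = 0;
for i = 1:m
  a1 = [1; X(i, :)'];                   % forward propagation for example i
  a2 = [1; sigmoid(Theta1 * a1)];
  h  = sigmoid(Theta2 * a2);            % K x 1 vector of outputs
  yk = Y(i, :)';                        % K x 1 indicator vector for the true class
  J  = J - (yk' * log(h) + (1 - yk)' * log(1 - h)) / m;
end
% Regularization: squares of all parameters except the bias columns (column 1).
J = J + lambda / (2 * m) * (sum(sum(Theta1(:, 2:end).^2)) + sum(sum(Theta2(:, 2:end).^2)));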

6 Backpropagation

Goal: compute the partial derivatives $\frac{\partial}{\partial\theta_{ij}^{(l)}}J(\theta)$ of $J(\theta)$
$\delta_j^{(l)}=$ "error" of node $j$ in layer $l$

  • $\delta_j^{(4)}=a_j^{(4)}-y_j$
  • $\delta^{(3)}=\left(\theta^{(3)}\right)^T\delta^{(4)}.*g'(z^{(3)})$, where $g'(z^{(3)})=a^{(3)}.*(1-a^{(3)})$
  • $\delta^{(2)}=\left(\theta^{(2)}\right)^T\delta^{(3)}.*g'(z^{(2)})$, where $g'(z^{(2)})=a^{(2)}.*(1-a^{(2)})$
  • When $\lambda=0$: $\frac{\partial}{\partial\theta_{ij}^{(l)}}J(\theta)=a_j^{(l)}\delta_i^{(l+1)}$

$l$: the layer currently being computed
$j$: the index of the activation unit in the current layer, which is also the index of the $j$-th input to the next layer
$i$: the index of the error unit in the next layer that is affected by the $i$-th row of the weight matrix
$\Delta_{ij}^{(l)}$: the matrix accumulating the error attributed to the effect of the $j$-th parameter on the $i$-th activation unit of layer $l$

Training set: $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\cdots,(x^{(m)},y^{(m)})\}$
Set $\Delta_{ij}^{(l)}=0$ (for all $l,i,j$)
For $i=1$ to $m$:
  Set $a^{(1)}=x^{(i)}$
  Perform forward propagation to compute $a^{(l)}$ for $l=2,3,\cdots,L$
  Using $y^{(i)}$, compute $\delta^{(L)}=a^{(L)}-y^{(i)}$
  Compute $\delta^{(L-1)},\delta^{(L-2)},\cdots,\delta^{(2)}$
  $\Delta_{ij}^{(l)}:=\Delta_{ij}^{(l)}+a_j^{(l)}\delta_i^{(l+1)}$

  • $\frac{\partial}{\partial\theta_{ij}^{(l)}}J(\theta)=D_{ij}^{(l)}$, where
    $D_{ij}^{(l)}=\begin{cases}\frac{1}{m}\Delta_{ij}^{(l)}+\lambda\theta_{ij}^{(l)}, & \text{if }j\neq 0\\ \frac{1}{m}\Delta_{ij}^{(l)}, & \text{if }j=0\end{cases}$
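A compact Octave sketch of this accumulation for a 3-layer network, assuming Theta1, Theta2, X, Y (one-hot rows) and m are already defined; the variable names are placeholders of mine, not from the notes:

% Accumulate Delta^(l) over the training set and form the unregularized gradients.
sigmoid = @(z) 1 ./ (1 + exp(-z));
Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));
for i = 1:m
  a1 = [1; X(i, :)'];                        % forward pass
  z2 = Theta1 * a1;  a2 = [1; sigmoid(z2)];
  z3 = Theta2 * a2;  a3 = sigmoid(z3);
  d3 = a3 - Y(i, :)';                        % delta^(3) = a^(3) - y^(i)
  d2 = (Theta2' * d3) .* (a2 .* (1 - a2));   % delta^(2), still including the bias row
  d2 = d2(2:end);                            % drop the bias component
  Delta1 = Delta1 + d2 * a1';                % Delta_ij^(l) += a_j^(l) * delta_i^(l+1)
  Delta2 = Delta2 + d3 * a2';
end
D1 = Delta1 / m;   % for j != 0, add the regularization term from D_ij^(l) above
D2 = Delta2 / m;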

7 Numerical Gradient Checking

  • Idea: estimate the gradient numerically to check whether the derivatives computed by backprop are correct
  • Method: on the cost curve, pick two points very close to $\theta$, namely $\theta-\varepsilon$ and $\theta+\varepsilon$ ($\varepsilon$ is typically about $10^{-4}$), and use the slope of the line through these two points to approximate the derivative at $\theta$
  • Formula: $\frac{\partial}{\partial\theta}J(\theta)\approx\frac{J(\theta+\varepsilon)-J(\theta-\varepsilon)}{2\varepsilon}\quad(\varepsilon=10^{-4})$

$\theta\in\mathbb{R}^n$
$\theta=[\theta_1,\theta_2,\cdots,\theta_n]$

$$\begin{aligned} \frac{\partial}{\partial\theta_1}J(\theta)&\approx\frac{J(\theta_1+\varepsilon,\theta_2,\cdots,\theta_n)-J(\theta_1-\varepsilon,\theta_2,\cdots,\theta_n)}{2\varepsilon}\\ \frac{\partial}{\partial\theta_2}J(\theta)&\approx\frac{J(\theta_1,\theta_2+\varepsilon,\cdots,\theta_n)-J(\theta_1,\theta_2-\varepsilon,\cdots,\theta_n)}{2\varepsilon}\\ &\;\;\vdots\\ \frac{\partial}{\partial\theta_n}J(\theta)&\approx\frac{J(\theta_1,\theta_2,\cdots,\theta_n+\varepsilon)-J(\theta_1,\theta_2,\cdots,\theta_n-\varepsilon)}{2\varepsilon} \end{aligned}$$

% Numerically approximate each partial derivative of J at theta.
for i = 1:n,
	thetaPlus = theta;
	thetaPlus(i) = thetaPlus(i) + EPSILON;
	thetaMinus = theta;
	thetaMinus(i) = thetaMinus(i) - EPSILON;
	gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
end;
% Check that gradApprox ≈ DVec, where DVec is the unrolled gradient from backprop.

Implementation Note

  • Implement backprop to compute DVec (the unrolled $D^{(1)},D^{(2)},D^{(3)}$)
  • Implement a numerical gradient check to compute gradApprox
  • Make sure they give similar values
  • Turn off gradient checking; use the backprop code for learning

Important

  • Be sure to disable your gradient checking code before training your classifier. If you run the numerical gradient computation on every iteration of gradient descent (or in the inner loop of costFunction(…)), your code will be very slow

8 Random Initialization

  • Purpose: symmetry breaking (initializing all weights to the same value, e.g. zero, would make every hidden unit compute the same function)
  • Initialize each $\theta_{ij}^{(l)}$ to a random value in $[-\varepsilon,\varepsilon]$
Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;   % 10 x 11 weights in [-epsilon, epsilon]
Theta2 = rand(1, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;    % 1 x 11 weights in [-epsilon, epsilon]
% rand(i, j) returns a random i x j matrix with entries uniformly distributed in (0, 1)

9 Putting it together

  1. Pick a network architecture (the connectivity pattern between neurons). Decide:
    (1) No. of input units: the dimension of the features $x^{(i)}$
    (2) No. of output units: the number of classes
    Reasonable default: 1 hidden layer; or, if using more than 1 hidden layer, use the same number of hidden units in every layer (usually, the more the better)
  2. Randomly initialize the weights
  3. Implement forward propagation to get $h_{\theta}(x^{(i)})$ for any $x^{(i)}$
  4. Implement code to compute the cost function $J(\theta)$
  5. Implement backprop to compute the partial derivatives
  6. Use gradient checking to compare the partial derivatives computed by backpropagation with a numerical estimate of the gradient of $J(\theta)$
  7. Then disable the gradient checking code
  8. Use gradient descent or an advanced optimization method together with backpropagation to minimize $J(\theta)$ as a function of the parameters $\theta$ (a rough sketch follows this list)
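As an illustration of step 8, a rough Octave sketch using fminunc; nnCostFunction here is a hypothetical helper assumed to return the cost and the unrolled gradient for an unrolled parameter vector (all names are placeholders, not from the notes):

% Unroll the randomly initialized weight matrices into a single parameter vector.
initial_nn_params = [Theta1(:); Theta2(:)];

% nnCostFunction is assumed to return [J, grad] for the unrolled parameters.
costFunction = @(p) nnCostFunction(p, input_layer_size, hidden_layer_size, ...
                                   num_labels, X, y, lambda);

% Minimize J(theta) with an advanced optimizer, supplying the backprop gradient.
options = optimset('GradObj', 'on', 'MaxIter', 100);
[nn_params, cost] = fminunc(costFunction, initial_nn_params, options);

% Reshape the optimized vector back into the weight matrices.
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, input_layer_size + 1);
Theta2 = reshape(nn_params(1 + hidden_layer_size * (input_layer_size + 1):end), ...
                 num_labels, hidden_layer_size + 1);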

10 Reference

Andrew Ng (吴恩达), Machine Learning, Coursera
Huang Haiguang (黄海广), Machine Learning Notes (机器学习笔记)
