[Machine Learning] 4 Neural Network Learning

1 Non-linear Hypothesis

  • Neural networks address a problem that ordinary logistic regression cannot handle well: the case where the number of features is too large

2 Neuron Model

2.1 Logistic Unit

(Figure: logistic unit)

2.2 Neural Network

(Figure: neural network)

  • If a network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, then $\theta^{(j)}$ will be of dimension $s_{j+1}\times(s_j+1)$
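For example, with 3 input features ($s_1=3$) and 5 hidden units ($s_2=5$), $\theta^{(1)}$ has size $5\times 4$; the extra column multiplies the bias unit. A minimal Octave check of this dimension rule (the variable names Theta1 and Theta2 are placeholders, not from the notes):

% Dimension check for a 3-input, 5-hidden-unit, 1-output network.
Theta1 = zeros(5, 3 + 1);   % maps layer 1 (s1 = 3 units + bias) to layer 2 (s2 = 5 units)
Theta2 = zeros(1, 5 + 1);   % maps layer 2 (s2 = 5 units + bias) to layer 3 (s3 = 1 unit)
size(Theta1)                % ans = 5 4, i.e. s2 x (s1 + 1)
size(Theta2)                % ans = 1 6, i.e. s3 x (s2 + 1)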

2.3 Difference Between a Neural Network and Logistic Regression

  • A neural network is just like a logistic regression model, except that rather than using the original features $x_1,x_2,x_3$, it uses the new features $a_1,a_2,a_3$
  • The features $a$ are computed from (depend on) the inputs $x$
  • Because the parameters that produce $a$ are themselves learned by gradient descent, these higher-level features $a$ can become far more expressive than simply raising $x$ to powers, and they predict new data better

3 Forward Propagation

  • $$\begin{aligned} a_1^{(2)}&=g\left(\theta_{10}^{(1)}x_0+\theta_{11}^{(1)}x_1+\theta_{12}^{(1)}x_2+\theta_{13}^{(1)}x_3\right)\\ a_2^{(2)}&=g\left(\theta_{20}^{(1)}x_0+\theta_{21}^{(1)}x_1+\theta_{22}^{(1)}x_2+\theta_{23}^{(1)}x_3\right)\\ a_3^{(2)}&=g\left(\theta_{30}^{(1)}x_0+\theta_{31}^{(1)}x_1+\theta_{32}^{(1)}x_2+\theta_{33}^{(1)}x_3\right) \end{aligned}$$
  • $h_{\theta}(x)=a_1^{(3)}=g\left(\theta_{10}^{(2)}a_0^{(2)}+\theta_{11}^{(2)}a_1^{(2)}+\theta_{12}^{(2)}a_2^{(2)}+\theta_{13}^{(2)}a_3^{(2)}\right)$
  • $\theta\cdot X=a$
  • $z^{(2)}=\theta^{(1)}\times X^T$
  • $a^{(2)}=g\left(z^{(2)}\right)$
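A minimal Octave sketch of the vectorized forward pass above for a single example, assuming x, Theta1 and Theta2 are already defined (the variable names are placeholders of mine):

% Vectorized forward propagation for one example x (column vector of raw features).
sigmoid = @(z) 1 ./ (1 + exp(-z));   % logistic activation g(z)

a1 = [1; x];                 % add the bias unit x0 = 1
z2 = Theta1 * a1;            % Theta1 has size s2 x (s1 + 1)
a2 = [1; sigmoid(z2)];       % add the bias unit a0^(2) = 1
z3 = Theta2 * a2;            % Theta2 has size s3 x (s2 + 1)
h  = sigmoid(z3);            % h_theta(x) = a^(3)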

4 Multi-class Classification

(Figure: multi-class classification)

$\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\cdots,(x^{(m)},y^{(m)})\}$

  • $L=$ total no. of layers in the network
  • $s_l=$ no. of units (not counting the bias unit) in layer $l$
  • $K=$ no. of classes

$K=1$: binary classification
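For $K$ classes, the label $y$ is usually represented as a $K$-dimensional indicator ("one-hot") vector rather than a single integer. A small Octave sketch of that encoding (the variable names are mine):

% Convert integer labels y in {1, ..., K} into K-dimensional indicator vectors.
K = 4;                    % number of classes
y = [3; 1; 4];            % example labels for m = 3 training examples
I = eye(K);
Y = I(y, :)               % m x K matrix; row i is the indicator vector for y(i)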

5 Cost Function

$h_{\theta}(x)\in\mathbb{R}^K$
$\left(h_{\theta}(x)\right)_i=i^{\text{th}}$ output

  • $J(\theta)=-\frac{1}{m}\left[\sum_{i=1}^m\sum_{k=1}^K\left(y_k^{(i)}\log\left(h_{\theta}(x^{(i)})\right)_k+(1-y_k^{(i)})\log\left(1-\left(h_{\theta}(x^{(i)})\right)_k\right)\right)\right]+\frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}\left(\theta_{ji}^{(l)}\right)^2$
  • For each training example the hypothesis outputs $K$ predictions; conceptually we can loop over the examples, compute the $K$ outputs for each, pick the most likely one, and compare it with the actual label in $y$
  • The regularization term is simply the sum of the squared entries of every layer's $\theta$ matrix after excluding the bias terms $\theta_0$ of each layer:
    (1) the loop over $j$ runs over the rows, whose number is the number of units $s_{l+1}$ in layer $l+1$
    (2) the loop over $i$ runs over the columns, whose number is the number of units $s_l$ in layer $l$
  • In short, the distance between $h_{\theta}(x)$ and the true value is summed over every example and every output class, and the regularization term sums the squares of all parameters except the bias terms (a code sketch follows this list)
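A minimal, unvectorized Octave sketch of this cost for a 3-layer network, assuming Theta1, Theta2, X, Y (one-hot rows) and lambda are already defined; all names here are placeholders of mine:

% Cost function J(theta) for a 3-layer network, looping over the m examples.
sigmoid = @(z) 1 ./ (1 + exp(-z));
m = size(X, 1);
J = 0;
for i = 1:m
  a1 = [1; X(i, :)'];                   % forward propagation for example i
  a2 = [1; sigmoid(Theta1 * a1)];
  h  = sigmoid(Theta2 * a2);            % K x 1 vector of outputs
  yk = Y(i, :)';                        % K x 1 indicator vector for the true class
  J  = J - (yk' * log(h) + (1 - yk)' * log(1 - h)) / m;
end
% Regularization: squares of all parameters except the bias columns (column 1).
J = J + lambda / (2 * m) * (sum(sum(Theta1(:, 2:end).^2)) + sum(sum(Theta2(:, 2:end).^2)));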

6 Backpropagation

Goal: compute the partial derivatives $\frac{\partial}{\partial\theta_{ij}^{(l)}}J(\theta)$ of $J(\theta)$
$\delta_j^{(l)}=$ "error" of node $j$ in layer $l$

  • $\delta_j^{(4)}=a_j^{(4)}-y_j$
  • $\delta^{(3)}=\left(\theta^{(3)}\right)^T\delta^{(4)}.*g'(z^{(3)})$, where $g'(z^{(3)})=a^{(3)}.*(1-a^{(3)})$
  • $\delta^{(2)}=\left(\theta^{(2)}\right)^T\delta^{(3)}.*g'(z^{(2)})$, where $g'(z^{(2)})=a^{(2)}.*(1-a^{(2)})$
  • When $\lambda=0$: $\frac{\partial}{\partial\theta_{ij}^{(l)}}J(\theta)=a_j^{(l)}\delta_i^{(l+1)}$

$l$: the layer currently being computed
$j$: the index of the activation unit in the current layer, which is also the index of the $j$-th input to the next layer
$i$: the index of the error unit in the next layer that is affected by the $i$-th row of the weight matrix
$\Delta_{ij}^{(l)}$: the matrix accumulating the error attributed to the effect of the $j$-th parameter on the $i$-th activation unit of layer $l$

Training set: $\{(x^{(1)},y^{(1)}),(x^{(2)},y^{(2)}),\cdots,(x^{(m)},y^{(m)})\}$
Set $\Delta_{ij}^{(l)}=0$ (for all $l,i,j$)
For $i=1$ to $m$:
  Set $a^{(1)}=x^{(i)}$
  Perform forward propagation to compute $a^{(l)}$ for $l=2,3,\cdots,L$
  Using $y^{(i)}$, compute $\delta^{(L)}=a^{(L)}-y^{(i)}$
  Compute $\delta^{(L-1)},\delta^{(L-2)},\cdots,\delta^{(2)}$
  $\Delta_{ij}^{(l)}:=\Delta_{ij}^{(l)}+a_j^{(l)}\delta_i^{(l+1)}$

  • $\frac{\partial}{\partial\theta_{ij}^{(l)}}J(\theta)=D_{ij}^{(l)}$, where
    $D_{ij}^{(l)}=\begin{cases}\frac{1}{m}\Delta_{ij}^{(l)}+\lambda\theta_{ij}^{(l)}, & \text{if }j\neq 0\\ \frac{1}{m}\Delta_{ij}^{(l)}, & \text{if }j=0\end{cases}$
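A compact Octave sketch of this accumulation for a 3-layer network, assuming Theta1, Theta2, X, Y (one-hot rows) and m are already defined; the variable names are placeholders of mine, not from the notes:

% Accumulate Delta^(l) over the training set and form the unregularized gradients.
sigmoid = @(z) 1 ./ (1 + exp(-z));
Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));
for i = 1:m
  a1 = [1; X(i, :)'];                        % forward pass
  z2 = Theta1 * a1;  a2 = [1; sigmoid(z2)];
  z3 = Theta2 * a2;  a3 = sigmoid(z3);
  d3 = a3 - Y(i, :)';                        % delta^(3) = a^(3) - y^(i)
  d2 = (Theta2' * d3) .* (a2 .* (1 - a2));   % delta^(2), still including the bias row
  d2 = d2(2:end);                            % drop the bias component
  Delta1 = Delta1 + d2 * a1';                % Delta_ij^(l) += a_j^(l) * delta_i^(l+1)
  Delta2 = Delta2 + d3 * a2';
end
D1 = Delta1 / m;   % for j != 0, add the regularization term from D_ij^(l) above
D2 = Delta2 / m;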

7 Numerical Gradient Checking

  • Idea: estimate the gradient numerically to check whether the derivatives computed by backprop are correct
  • Method: on the cost curve, pick two points very close to $\theta$, namely $\theta-\varepsilon$ and $\theta+\varepsilon$ ($\varepsilon$ is typically about $10^{-4}$), and use the slope of the line through these two points to approximate the derivative at $\theta$
  • Formula: $\frac{\partial}{\partial\theta}J(\theta)\approx\frac{J(\theta+\varepsilon)-J(\theta-\varepsilon)}{2\varepsilon}\quad(\varepsilon=10^{-4})$

$\theta\in\mathbb{R}^n$
$\theta=[\theta_1,\theta_2,\cdots,\theta_n]$

$$\begin{aligned} \frac{\partial}{\partial\theta_1}J(\theta)&\approx\frac{J(\theta_1+\varepsilon,\theta_2,\cdots,\theta_n)-J(\theta_1-\varepsilon,\theta_2,\cdots,\theta_n)}{2\varepsilon}\\ \frac{\partial}{\partial\theta_2}J(\theta)&\approx\frac{J(\theta_1,\theta_2+\varepsilon,\cdots,\theta_n)-J(\theta_1,\theta_2-\varepsilon,\cdots,\theta_n)}{2\varepsilon}\\ &\;\;\vdots\\ \frac{\partial}{\partial\theta_n}J(\theta)&\approx\frac{J(\theta_1,\theta_2,\cdots,\theta_n+\varepsilon)-J(\theta_1,\theta_2,\cdots,\theta_n-\varepsilon)}{2\varepsilon} \end{aligned}$$

% Numerically approximate each partial derivative of J at theta.
for i = 1:n,
	thetaPlus = theta;
	thetaPlus(i) = thetaPlus(i) + EPSILON;
	thetaMinus = theta;
	thetaMinus(i) = thetaMinus(i) - EPSILON;
	gradApprox(i) = (J(thetaPlus) - J(thetaMinus)) / (2 * EPSILON);
end;
% Check that gradApprox ≈ DVec, where DVec is the unrolled gradient from backprop.

Implementation Note

  • Implement backprop to compute DVec (the unrolled $D^{(1)},D^{(2)},D^{(3)}$)
  • Implement a numerical gradient check to compute gradApprox
  • Make sure they give similar values
  • Turn off gradient checking; use the backprop code for learning

Important

  • Be sure to disable your gradient checking code before training your classifier. If you run the numerical gradient computation on every iteration of gradient descent (or in the inner loop of costFunction(…)), your code will be very slow

8 Random Initialization

  • Purpose: symmetry breaking (initializing all weights to the same value, e.g. zero, would make every hidden unit compute the same function)
  • Initialize each $\theta_{ij}^{(l)}$ to a random value in $[-\varepsilon,\varepsilon]$
Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;   % 10 x 11 weights in [-epsilon, epsilon]
Theta2 = rand(1, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;    % 1 x 11 weights in [-epsilon, epsilon]
% rand(i, j) returns a random i x j matrix with entries uniformly distributed in (0, 1)

9 Putting it together

  1. Pick a network architecture (the connectivity pattern between neurons). Decide:
    (1) No. of input units: the dimension of the features $x^{(i)}$
    (2) No. of output units: the number of classes
    Reasonable default: 1 hidden layer; or, if using more than 1 hidden layer, use the same number of hidden units in every layer (usually, the more the better)
  2. Randomly initialize the weights
  3. Implement forward propagation to get $h_{\theta}(x^{(i)})$ for any $x^{(i)}$
  4. Implement code to compute the cost function $J(\theta)$
  5. Implement backprop to compute the partial derivatives
  6. Use gradient checking to compare the partial derivatives computed by backpropagation with a numerical estimate of the gradient of $J(\theta)$
  7. Then disable the gradient checking code
  8. Use gradient descent or an advanced optimization method together with backpropagation to minimize $J(\theta)$ as a function of the parameters $\theta$ (a rough sketch follows this list)
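As an illustration of step 8, a rough Octave sketch using fminunc; nnCostFunction here is a hypothetical helper assumed to return the cost and the unrolled gradient for an unrolled parameter vector (all names are placeholders, not from the notes):

% Unroll the randomly initialized weight matrices into a single parameter vector.
initial_nn_params = [Theta1(:); Theta2(:)];

% nnCostFunction is assumed to return [J, grad] for the unrolled parameters.
costFunction = @(p) nnCostFunction(p, input_layer_size, hidden_layer_size, ...
                                   num_labels, X, y, lambda);

% Minimize J(theta) with an advanced optimizer, supplying the backprop gradient.
options = optimset('GradObj', 'on', 'MaxIter', 100);
[nn_params, cost] = fminunc(costFunction, initial_nn_params, options);

% Reshape the optimized vector back into the weight matrices.
Theta1 = reshape(nn_params(1:hidden_layer_size * (input_layer_size + 1)), ...
                 hidden_layer_size, input_layer_size + 1);
Theta2 = reshape(nn_params(1 + hidden_layer_size * (input_layer_size + 1):end), ...
                 num_labels, hidden_layer_size + 1);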

10 Reference

Andrew Ng (吴恩达), Machine Learning, Coursera
Huang Haiguang (黄海广), Machine Learning Notes (机器学习笔记)
