1. Vectorizing gradient descent over m samples

Vectorization replaces explicit for loops with matrix operations to speed up computation; the precondition is that no variable in one iteration of the loop depends on the result of a previous iteration. This is precisely why the gradient-descent algorithm of sections 2.10 and 2.11 can be vectorized. The cost function $J$ is the mean cross-entropy over the $m$ samples:
$$J(w,b)=\frac{1}{m}\sum_{i=1}^{m}L\left(a^{(i)},y^{(i)}\right)\tag{1}$$
where $L(a^{(i)},y^{(i)})$ denotes the cross-entropy between the model's prediction for the $i$-th sample, $a^{(i)}=\mathrm{sigmoid}(w^{T}x^{(i)}+b)$, and the true label $y^{(i)}$:
$$L(a^{(i)},y^{(i)})=-y^{(i)}\log(a^{(i)})-(1-y^{(i)})\log(1-a^{(i)})\tag{2}$$
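As a quick illustration, the cost defined by (1) and (2) can be evaluated in NumPy. This is a minimal sketch; the function and variable names are my own, and features are assumed to be stored column-wise (one sample per column), as in the course.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(w, b, X, Y):
    """Mean cross-entropy over m samples.
    X: (n_features, m), Y: (1, m), w: (n_features, 1), b: scalar."""
    A = sigmoid(w.T @ X + b)  # all predictions a^(i) at once, shape (1, m)
    return float(np.mean(-Y * np.log(A) - (1 - Y) * np.log(1 - A)))
```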
```
J = 0; dw1 = 0; dw2 = 0; db = 0              # initialize
for i = 1 to m:
    z(i) = wT x(i) + b                       # ---------- (3)
    a(i) = σ(z(i))
    J += -y(i) log(a(i)) - (1 - y(i)) log(1 - a(i))
    dz(i) = a(i) - y(i)
    dw1 += x1(i) dz(i)                       # here w1 corresponds to the feature x1
    dw2 += x2(i) dz(i)
    db += dz(i)
J /= m
dw1 /= m; dw2 /= m; db /= m
w1 := w1 - α dw1                             # here α is the learning rate
w2 := w2 - α dw2
b := b - α db
```
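The whole loop body above collapses into a handful of matrix operations. Below is a minimal NumPy sketch of one vectorized step (names and data layout are my own assumptions; `X` holds one sample per column, so the per-feature accumulators `dw1`, `dw2` become a single vector `dw`):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(w, b, X, Y, alpha):
    """One vectorized gradient-descent step for logistic regression.
    X: (n, m) features, Y: (1, m) labels, w: (n, 1), b: scalar."""
    m = X.shape[1]
    Z = w.T @ X + b            # all z^(i) at once, shape (1, m)
    A = sigmoid(Z)             # all a^(i)
    dZ = A - Y                 # shape (1, m)
    dw = (X @ dZ.T) / m        # shape (n, 1); replaces the dw1, dw2 loop
    db = float(np.sum(dZ)) / m
    return w - alpha * dw, b - alpha * db
```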
In the loop body above, the $w$ and $b$ in equation (3) stay fixed at their initial values throughout the loop (the updates to $w_1, w_2$ happen only after the loop ends), so the $z^{(i)}$ computed in each iteration depends only on $x^{(i)}$ and not on any other sample. This is why the algorithm can be vectorized. The vectorized gradient descent (for $m$ samples) can be written as:
| Forward propagation | Backward propagation |
| --- | --- |
| $Z^{[1]} = W^{[1]}X + b^{[1]}$ | $dZ^{[2]} = A^{[2]} - Y$ |
| $A^{[1]} = g^{[1]}(Z^{[1]})$ | $dW^{[2]} = \frac{1}{m}\,dZ^{[2]}\cdot A^{[1]T}$ |
| $Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}$ | $db^{[2]} = \frac{1}{m}\,$np.sum($dZ^{[2]}$, axis=1, keepdims=True) |
| $A^{[2]} = g^{[2]}(Z^{[2]})$ | $dZ^{[1]} = W^{[2]T}\cdot dZ^{[2]} * g^{[1]\prime}(Z^{[1]})$ |
| | $dW^{[1]} = \frac{1}{m}\,dZ^{[1]}\cdot X^{T}$ |
| | $db^{[1]} = \frac{1}{m}\,$np.sum($dZ^{[1]}$, axis=1, keepdims=True) |
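The two columns of the table can be transcribed almost line-for-line into NumPy. The sketch below combines one forward and one backward pass, assuming (my choice, for concreteness) that both $g^{[1]}$ and $g^{[2]}$ are sigmoid, so that $g^{[1]\prime}(Z^{[1]}) = A^{[1]}(1-A^{[1]})$:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_backward(X, Y, W1, b1, W2, b2):
    """One forward and one backward pass of the two-layer network.
    X: (n0, m), Y: (1, m), W1: (n1, n0), b1: (n1, 1), W2: (1, n1), b2: (1, 1)."""
    m = X.shape[1]
    # forward propagation
    Z1 = W1 @ X + b1
    A1 = sigmoid(Z1)
    Z2 = W2 @ A1 + b2
    A2 = sigmoid(Z2)
    # backward propagation
    dZ2 = A2 - Y
    dW2 = (dZ2 @ A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (A1 * (1 - A1))  # g[1]'(Z1) for sigmoid
    dW1 = (dZ1 @ X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return dW1, db1, dW2, db2
```

Note that each gradient comes out with the same shape as the parameter it updates, which is the dimension check made explicit below.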
2. Notes on several formulas in gradient descent

The table above lists the key steps of forward propagation and backward propagation. Each of the six backward-propagation steps is a partial derivative with respect to the loss function $L$, but they are written in a shorthand that differs from the usual calculus notation for partial derivatives, which can make them look unintuitive at first.
$$dZ^{[2]}=A^{[2]}-Y\tag{4}$$
Here "$dZ^{[2]}$" actually denotes $\frac{\partial L}{\partial Z^{[2]}}$, so
$$dZ^{[2]}=\frac{\partial L}{\partial A^{[2]}}\cdot\frac{\partial A^{[2]}}{\partial Z^{[2]}}=\left(-\frac{Y}{A^{[2]}}+\frac{1-Y}{1-A^{[2]}}\right)\cdot\left(A^{[2]}\left(1-A^{[2]}\right)\right)=A^{[2]}-Y\tag{5}$$
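Equation (5) can be sanity-checked numerically with a central finite difference on a single scalar sample (a quick sketch, not part of the original derivation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(z, y):
    """Cross-entropy of a single prediction a = sigmoid(z) against label y."""
    a = sigmoid(z)
    return -y * np.log(a) - (1 - y) * np.log(1 - a)

z, y, eps = 0.3, 1.0, 1e-6
numeric = (loss(z + eps, y) - loss(z - eps, y)) / (2 * eps)  # dL/dz, numerically
analytic = sigmoid(z) - y                                    # a - y, from (5)
```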
Similarly, "$dW^{[2]}$" actually denotes $\frac{\partial L}{\partial W^{[2]}}$; applying the chain rule together with the result in (5) gives
$$dW^{[2]}=\frac{\partial L}{\partial W^{[2]}}=\frac{\partial L}{\partial Z^{[2]}}\cdot\frac{\partial Z^{[2]}}{\partial W^{[2]}}=dZ^{[2]}\cdot A^{[1]}\tag{6}$$
Checking the dimensions of the variables in (6): since $W^{[2]}$ has shape $(n^{[2]},n^{[1]})$, $dW^{[2]}$ must also have shape $(n^{[2]},n^{[1]})$, while $dZ^{[2]}$ has shape $(n^{[2]},m)$ and $A^{[1]}$ has shape $(n^{[1]},m)$. Equation (6) must therefore be rewritten as
$$dW^{[2]}=\frac{\partial L}{\partial W^{[2]}}=\frac{\partial L}{\partial Z^{[2]}}\cdot\frac{\partial Z^{[2]}}{\partial W^{[2]}}=\frac{1}{m}\,dZ^{[2]}\cdot A^{[1]T}\tag{7}$$
The factor $\frac{1}{m}$ in front of (7) normalizes over the $m$ samples in the vectorized implementation, since the cost is the mean of the per-sample losses.
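The dimension check can be replayed directly in NumPy, with arbitrary (hypothetical) layer sizes chosen only for illustration:

```python
import numpy as np

n1, n2, m = 4, 3, 5         # hypothetical layer sizes and sample count
A1 = np.ones((n1, m))
dZ2 = np.ones((n2, m))
dW2 = (dZ2 @ A1.T) / m      # (n2, m) times (m, n1) gives (n2, n1), as in (7)
```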
In particular, in

$$dZ^{[1]}=W^{[2]T}\cdot dZ^{[2]} * g^{[1]\prime}(Z^{[1]})$$

"$*$" denotes element-wise multiplication of two matrices, which requires the two matrices to have the same shape, while "$\cdot$" denotes the matrix product, which requires the inner dimensions to match.
The remaining formulas can be derived in the same way; throughout the chain-rule computation, keep in mind that every derivative is taken with respect to the loss function.
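The distinction between the two products is easy to see in NumPy, where `*` is element-wise and `@` (or `np.dot`) is the matrix product:

```python
import numpy as np

A = np.array([[1.0, 2.0], [3.0, 4.0]])
B = np.array([[10.0, 20.0], [30.0, 40.0]])

elementwise = A * B  # "*": same shape required, multiplied entry by entry
matmul = A @ B       # "·": matrix product, inner dimensions must match
```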