Note: for the vectorized representation of linear regression, see: Vectorizing Linear Regression.

In practice it is more convenient to compute with matrices, as is typically done in code (see Programming Assignment (2): Logistic Regression), so we can vectorize the entire model.

For the whole training set:
1. Inputs, outputs, and parameters
As in linear regression, we use a feature matrix $X$ to describe all the features, a parameter vector $\theta$ to describe all the parameters, and an output vector $y$ to represent all the output variables:
$$
X=\begin{bmatrix} x_0^{(1)} & x_1^{(1)} & x_2^{(1)} & \cdots & x_n^{(1)} \\ x_0^{(2)} & x_1^{(2)} & x_2^{(2)} & \cdots & x_n^{(2)} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ x_0^{(m)} & x_1^{(m)} & x_2^{(m)} & \cdots & x_n^{(m)} \end{bmatrix}\ ,\quad
\theta=\begin{bmatrix} \theta_0 \\ \theta_1 \\ \vdots \\ \theta_n \end{bmatrix}\ ,\quad
y=\begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix}
$$
$X$ has dimensions $m \times (n+1)$ with $x_0 = 1$, $\theta$ has dimensions $(n+1) \times 1$, and $y$ has dimensions $m \times 1$ with $y^{(i)} \in \{0, 1\}$.
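As a sketch of how these shapes might be set up in code (NumPy is assumed; the data values below are made up purely for illustration):

```python
import numpy as np

# Hypothetical raw data: m = 4 examples, n = 2 features (values are illustrative).
raw = np.array([[3.0, 5.0],
                [1.0, 2.0],
                [4.0, 1.0],
                [2.0, 2.0]])
m, n = raw.shape

# Prepend the x_0 = 1 column so X has shape m x (n+1).
X = np.hstack([np.ones((m, 1)), raw])

theta = np.zeros((n + 1, 1))        # (n+1) x 1 parameter vector
y = np.array([[1], [0], [1], [0]])  # m x 1 labels, each in {0, 1}

print(X.shape, theta.shape, y.shape)  # (4, 3) (3, 1) (4, 1)
```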
2. Hypothesis function
The hypothesis values for the entire training set can likewise be represented by an $m \times 1$ vector:
$$
h_\theta(x)=g(X\theta)=\begin{bmatrix} g(x_0^{(1)}\theta_0+x_1^{(1)}\theta_1+x_2^{(1)}\theta_2+\cdots+x_n^{(1)}\theta_n)\\ g(x_0^{(2)}\theta_0+x_1^{(2)}\theta_1+x_2^{(2)}\theta_2+\cdots+x_n^{(2)}\theta_n)\\ \vdots \\ g(x_0^{(m)}\theta_0+x_1^{(m)}\theta_1+x_2^{(m)}\theta_2+\cdots+x_n^{(m)}\theta_n) \end{bmatrix}=\begin{bmatrix}h_\theta(x^{(1)})\\ h_\theta(x^{(2)})\\ \vdots \\ h_\theta(x^{(m)}) \end{bmatrix}=\hat{y}=\begin{bmatrix}\hat{y}^{(1)}\\ \hat{y}^{(2)}\\ \vdots \\ \hat{y}^{(m)} \end{bmatrix}
$$
Here a new symbol $\hat{y}$ (read "y hat") is introduced, defined as $\hat{y} = h_\theta(x)$. Some sources use $\hat{y}$ to denote a sample's predicted value; it means the same thing as the hypothesis $h_\theta(x)$.
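A minimal sketch of the vectorized hypothesis, assuming NumPy (the sigmoid $g$ is applied elementwise, and the data below are made up):

```python
import numpy as np

def sigmoid(z):
    """Elementwise logistic function g(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + np.exp(-z))

def hypothesis(X, theta):
    """Vectorized hypothesis h_theta(x) = g(X @ theta); returns an m x 1 vector."""
    return sigmoid(X @ theta)

# Tiny illustrative example: m = 3 examples, n = 1 feature.
X = np.array([[1.0, 2.0],
              [1.0, -1.0],
              [1.0, 0.0]])
theta = np.zeros((2, 1))
print(hypothesis(X, theta))  # all entries 0.5, since X @ 0 = 0 and g(0) = 0.5
```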
3. Cost function
The original formula:
$$
\begin{aligned} J(\theta)&=-\frac{1}{m}\sum_{i=1}^{m} \left[y^{(i)}\log(h_\theta(x^{(i)}))+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right]\\ &=-\frac{1}{m}\sum_{i=1}^{m} \left[y^{(i)}\log(\hat{y}^{(i)})+(1-y^{(i)})\log(1-\hat{y}^{(i)})\right] \end{aligned}
$$

Its vectorized representation is:
$$
\begin{aligned} J(\theta)&=-\frac{1}{m}\,SUM\left[y*\log(h_\theta(x))+(1-y)*\log(1-h_\theta(x))\right]\\ &=-\frac{1}{m}\,SUM\left[y*\log(\hat{y})+(1-y)*\log(1-\hat{y})\right] \end{aligned}
$$
In the expression above, $*$ denotes elementwise multiplication, so the result inside the brackets is still a vector; $SUM$ therefore denotes summing all entries of that vector, which yields a scalar.
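The vectorized cost can be sketched as follows (NumPy assumed; `np.sum` plays the role of $SUM$, and NumPy's `*` is the elementwise product):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    """Vectorized logistic-regression cost J(theta).

    X: m x (n+1), y: m x 1 with entries in {0, 1}, theta: (n+1) x 1.
    """
    m = y.shape[0]
    h = sigmoid(X @ theta)  # m x 1 vector of predictions
    # Elementwise products, then sum over all m entries -> scalar.
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h)) / m

# Illustrative check: with theta = 0 every prediction is 0.5, so J = ln 2.
X = np.array([[1.0, 2.0], [1.0, -1.0]])
y = np.array([[1], [0]])
theta = np.zeros((2, 1))
print(cost(theta, X, y))  # ≈ 0.6931 = ln 2
```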
4. Gradient descent
The original formula is:
$$
\theta_j:=\theta_j-\alpha\frac{1}{m} \sum_{i=1}^{m} \left( h_\theta( x^{(i)} ) - y^{(i)}\right)x_j^{(i)}
$$

We now express the update of all parameters at once with vectors:
$$
\theta=\theta-\alpha\delta
$$

where:
$$
\theta=\begin{bmatrix} \theta_0\\ \theta_1\\ \vdots\\ \theta_n \end{bmatrix}\ ,\quad \delta=\frac{1}{m} \begin{bmatrix} \sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_0^{(i)}\\ \sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_1^{(i)}\\ \vdots\\ \sum_{i=1}^{m}(h_\theta(x^{(i)})-y^{(i)})x_n^{(i)} \end{bmatrix}
$$

And because:
$$
\delta=\frac{1}{m} \begin{bmatrix} x_0^{(1)}&x_0^{(2)}&\cdots&x_0^{(m)}\\ x_1^{(1)}&x_1^{(2)}&\cdots&x_1^{(m)}\\ \vdots&\vdots&\ddots&\vdots\\ x_n^{(1)}&x_n^{(2)}&\cdots&x_n^{(m)} \end{bmatrix} \begin{bmatrix} h_\theta(x^{(1)})-y^{(1)}\\ h_\theta(x^{(2)})-y^{(2)}\\ \vdots\\ h_\theta(x^{(m)})-y^{(m)} \end{bmatrix}=\frac{1}{m}X^T\left[ g(X\theta)-y \right]
$$

(Note the last row of the matrix on the left: it contains $x_n^{(1)}, \dots, x_n^{(m)}$, matching the last entry of $\delta$ above.) Therefore, gradient descent can be expressed as:
$$
\theta=\theta-\alpha\frac{1}{m}X^T\left[ g(X\theta)-y \right]
$$
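The full update loop can be sketched as follows (NumPy is assumed; `alpha`, the iteration count, and the toy data are illustrative choices, not prescribed by the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_descent(X, y, alpha=0.1, iters=1000):
    """Repeat the vectorized update theta := theta - (alpha/m) X^T (g(X theta) - y)."""
    m, n1 = X.shape
    theta = np.zeros((n1, 1))
    for _ in range(iters):
        delta = X.T @ (sigmoid(X @ theta) - y) / m  # (n+1) x 1 gradient vector
        theta = theta - alpha * delta
    return theta

# Toy separable data: the label is 1 exactly when the feature is positive.
X = np.array([[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([[0], [0], [1], [1]])
theta = gradient_descent(X, y)
preds = (sigmoid(X @ theta) >= 0.5).astype(int)
print(preds.ravel())  # [0 0 1 1]
```

Note that no explicit sums or per-parameter loops appear: the matrix product `X.T @ (...)` computes every component of $\delta$ at once, which is the point of the vectorized form.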