Machine Learning - Theory - Generalized Linear Models
Definition of the generalized linear model:
The exponential family:
$$p(y;η) = b(y)\,e^{η^{T}T(y) - a(η)}$$
- η: the natural parameter
- T(y): the sufficient statistic, in most cases simply y
- a(η): the log partition function, a normalization constant that guarantees $\sum_{y} p(y;η) = 1$
The exponential-family form of the Gaussian distribution:
- The Gaussian density:
$$f(x) = \frac{1}{\sqrt{2π}\,σ}\,e^{-\frac{(x-μ)^2}{2σ^2}}$$
- In linear regression, σ has no influence on the choice of the model parameters θ, so for convenience we set σ = 1:
$$p(y;μ) = \frac{1}{\sqrt{2π}}\,e^{-\frac{1}{2}(y-μ)^2}$$
- Separating out the y² term:
$$p(y;μ) = \frac{1}{\sqrt{2π}}\,e^{-\frac{1}{2}y^2} \cdot e^{μy-\frac{1}{2}μ^2}$$
- Matching the exponential-family components:
$$η = μ$$
$$T(y) = y$$
$$a(η) = \frac{μ^2}{2} = \frac{η^2}{2}$$
$$b(y) = \frac{1}{\sqrt{2π}}\,e^{-\frac{y^2}{2}}$$
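The factorization above can be checked numerically. A minimal Python sketch (function names hypothetical) comparing the Gaussian density (σ = 1) against its exponential-family form $b(y)\,e^{ηT(y)-a(η)}$:

```python
import math

# Gaussian pdf with sigma = 1, as in the text
def gaussian_pdf(y, mu):
    return 1.0 / math.sqrt(2 * math.pi) * math.exp(-0.5 * (y - mu) ** 2)

# Exponential-family factorization: p(y; eta) = b(y) * exp(eta * T(y) - a(eta))
def exp_family_gaussian(y, mu):
    eta = mu                                   # natural parameter
    T = y                                      # sufficient statistic
    a = eta ** 2 / 2                           # log partition function
    b = 1.0 / math.sqrt(2 * math.pi) * math.exp(-y ** 2 / 2)
    return b * math.exp(eta * T - a)

# the two forms agree for arbitrary (y, mu)
for y, mu in [(0.3, 1.2), (-1.0, 0.5)]:
    assert abs(gaussian_pdf(y, mu) - exp_family_gaussian(y, mu)) < 1e-12
```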
The exponential-family form of the Bernoulli distribution:
- The Bernoulli pmf, where φ is the probability of the positive outcome:
$$p(y;φ) = φ^{y}(1-φ)^{1-y}$$
- Logistic regression assumes a Bernoulli distribution:
$$p(y=1;φ) = φ$$
$$p(y=0;φ) = 1-φ$$
- Rewriting with base e and taking logs in the exponent:
$$p(y;φ) = e^{y\log φ} \cdot e^{(1-y)\log(1-φ)}$$
$$p(y;φ) = e^{y\log φ + (1-y)\log(1-φ)}$$
$$p(y;φ) = e^{y\log φ - y\log(1-φ) + \log(1-φ)}$$
- Collecting the terms with coefficient y:
$$p(y;φ) = e^{y\log\frac{φ}{1-φ} + \log(1-φ)}$$
- Matching the exponential-family components:
$$η = \log\frac{φ}{1-φ}$$
$$φ = \frac{1}{1+e^{-η}}$$
$$b(y) = 1$$
$$T(y) = y$$
$$a(η) = -\log(1-φ) = \log(1+e^{η})$$
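Again this can be verified numerically. A small sketch (hypothetical names) that checks the Bernoulli pmf against its exponential-family form, and that the inverse link recovering φ from η is the sigmoid:

```python
import math

# Bernoulli pmf: p(y; phi) = phi^y * (1 - phi)^(1 - y)
def bernoulli_pmf(y, phi):
    return phi ** y * (1 - phi) ** (1 - y)

# Exponential-family form: b(y) * exp(eta * T(y) - a(eta))
def exp_family_bernoulli(y, phi):
    eta = math.log(phi / (1 - phi))        # natural parameter (log-odds)
    a = math.log(1 + math.exp(eta))        # log partition, equals -log(1 - phi)
    return math.exp(eta * y - a)           # b(y) = 1, T(y) = y

phi = 0.7
eta = math.log(phi / (1 - phi))
# the inverse link is the sigmoid: phi = 1 / (1 + e^{-eta})
assert abs(1 / (1 + math.exp(-eta)) - phi) < 1e-12
for y in (0, 1):
    assert abs(bernoulli_pmf(y, phi) - exp_family_bernoulli(y, phi)) < 1e-12
```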
The three assumptions of GLM modeling:
- Assumption 1: the conditional distribution of y belongs to the exponential family:
$$y|x;θ \sim \text{ExponentialFamily}(η)$$
- Assumption 2:
  - Given x, the goal of the GLM is to predict $T(y)|x$.
  - Since in most cases $T(y) = y$, the goal becomes predicting $y|x$,
  - so the hypothesis is $h(x) = E[y|x]$.
  - For example, in logistic regression $h_θ(x) = p(y=1|x;θ) = 0 \cdot p(y=0|x;θ) + 1 \cdot p(y=1|x;θ) = E[y|x;θ]$.
- Assumption 3: the natural parameter η is linearly related to x:
$$η = θ^{T}x$$
If η is a vector, then $η_{i} = θ_{i}^{T}x$.
Deriving other models from the GLM
Deriving linear regression
- Linear regression assumes a Gaussian distribution:
$$y|x;θ \sim N(μ, σ^2)$$
- By assumption 2, the hypothesis is $h(x) = E[y|x]$:
$$h(x) = E[y|x;θ] = μ$$
- From the Gaussian exponential-family form derived above:
$$η = μ$$
- Therefore:
$$h(x) = η$$
- By assumption 3:
$$h(x) = θ^{T}x$$
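A minimal sketch of fitting h(x) = θᵀx by ordinary least squares on hypothetical synthetic data (the dataset and the normal-equations solver here are illustrative, not part of the derivation above):

```python
import random

# Tiny synthetic 1-feature dataset with an intercept column; true theta = (2, 3)
random.seed(0)
xs = [[1.0, float(x)] for x in range(10)]          # rows: [bias, feature]
ys = [2.0 + 3.0 * row[1] + random.gauss(0, 0.1) for row in xs]

# Normal equations (X^T X) theta = X^T y, solved by hand for 2 parameters
sxx = [[sum(a[i] * a[j] for a in xs) for j in range(2)] for i in range(2)]
sxy = [sum(a[i] * y for a, y in zip(xs, ys)) for i in range(2)]
det = sxx[0][0] * sxx[1][1] - sxx[0][1] * sxx[1][0]
theta = [
    (sxx[1][1] * sxy[0] - sxx[0][1] * sxy[1]) / det,
    (sxx[0][0] * sxy[1] - sxx[1][0] * sxy[0]) / det,
]
# theta should land close to the true parameters (2, 3)
```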
Deriving logistic regression
- Logistic regression assumes a Bernoulli distribution:
$$y|x;θ \sim \text{Bernoulli}(φ)$$
- By assumption 2, the hypothesis is $h(x) = E[y|x]$:
$$h(x) = E[y|x;θ] = φ$$
- From the Bernoulli exponential-family form derived above:
$$η = \log\frac{φ}{1-φ}$$
$$φ = \frac{1}{1+e^{-η}}$$
- Therefore:
$$h(x) = \frac{1}{1+e^{-η}}$$
- By assumption 3:
$$h(x) = \frac{1}{1+e^{-θ^{T}x}}$$
Deriving the softmax multi-class classifier
- y can take one of k values, each with its own probability:
$$y \in \{1, 2, ..., k\}, \quad p(y=i) = φ_{i}, \quad φ_{k} = 1-\sum_{i=1}^{k-1}φ_{i}$$
- The event {y=i} (classification into class i) can be encoded by an indicator vector T(y):
$$T(i) = \begin{bmatrix} 0 \\ 0 \\ ... \\ 1 \ (\text{position } i) \\ ... \\ 0 \end{bmatrix}$$
- The multinomial pmf, written with indicator exponents $1\{y=i\}$:
$$p(y;φ) = φ_{1}^{1\{y=1\}} \cdot φ_{2}^{1\{y=2\}} \cdot ... \cdot φ_{k}^{1\{y=k\}}$$
$$p(y;φ) = φ_{1}^{T(y)_{1}} \cdot φ_{2}^{T(y)_{2}} \cdot ... \cdot φ_{k}^{1-\sum_{i=1}^{k-1}T(y)_{i}}$$
- Taking base e and logs in the exponent:
$$p(y;φ) = e^{T(y)_{1}\log φ_{1} + T(y)_{2}\log φ_{2} + ... + (1-\sum_{i=1}^{k-1}T(y)_{i})\log φ_{k}}$$
- Distributing $\sum_{i=1}^{k-1}T(y)_{i}$ over the earlier terms:
$$p(y;φ) = e^{T(y)_{1}\log\frac{φ_{1}}{φ_{k}} + T(y)_{2}\log\frac{φ_{2}}{φ_{k}} + ... + T(y)_{k-1}\log\frac{φ_{k-1}}{φ_{k}} + \log φ_{k}}$$
- Which gives:
$$η = \begin{bmatrix} \log\frac{φ_{1}}{φ_{k}} \\ \log\frac{φ_{2}}{φ_{k}} \\ ... \\ \log\frac{φ_{k-1}}{φ_{k}} \end{bmatrix}$$
$$b(y) = 1$$
$$a(η) = -\log φ_{k}$$
- Transforming η further (with the convention $η_{k} = \log\frac{φ_{k}}{φ_{k}} = 0$, so that $e^{η_{k}} = 1$):
$$η_{i} = \log\frac{φ_{i}}{φ_{k}}$$
$$e^{η_{i}} = \frac{φ_{i}}{φ_{k}}$$
$$e^{η_{i}} \cdot φ_{k} = φ_{i}$$
$$φ_{k}\sum_{i=1}^{k}e^{η_{i}} = \sum_{i=1}^{k}φ_{i} = 1$$
$$φ_{k} = \frac{1}{\sum_{i=1}^{k}e^{η_{i}}}$$
- Therefore:
$$φ_{i} = \frac{e^{η_{i}}}{\sum_{j=1}^{k}e^{η_{j}}}$$
$$p(y=i|x;θ) = φ_{i}$$
- By assumption 3:
$$p(y=i|x;θ) = \frac{e^{θ_{i}^{T}x}}{\sum_{j=1}^{k}e^{θ_{j}^{T}x}}$$
- So the hypothesis $h_θ(x)$ is:
$$h_θ(x) = E[T(y)|x;θ]$$
$$h_θ(x) = \begin{bmatrix} φ_{1} \\ φ_{2} \\ ... \\ φ_{k-1} \end{bmatrix}$$
$$h_θ(x) = \begin{bmatrix} \frac{e^{θ_{1}^{T}x}}{\sum_{j=1}^{k}e^{θ_{j}^{T}x}} \\ \frac{e^{θ_{2}^{T}x}}{\sum_{j=1}^{k}e^{θ_{j}^{T}x}} \\ ... \\ \frac{e^{θ_{k-1}^{T}x}}{\sum_{j=1}^{k}e^{θ_{j}^{T}x}} \end{bmatrix}$$
- Maximum-likelihood estimation gives:
$$l(θ) = \sum_{i=1}^{m}\log p(y^{(i)}|x^{(i)};θ)$$
$$l(θ) = \sum_{i=1}^{m}\log\prod_{l=1}^{k}\left(\frac{e^{θ_{l}^{T}x^{(i)}}}{\sum_{j=1}^{k}e^{θ_{j}^{T}x^{(i)}}}\right)^{1\{y^{(i)}=l\}}$$
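The derivation above produces the familiar softmax function. A minimal numeric sketch (values hypothetical), using the convention $η_{k} = 0$ and the standard max-subtraction trick for numerical stability:

```python
import math

# Softmax: phi_i = exp(eta_i) / sum_j exp(eta_j)
def softmax(etas):
    m = max(etas)                                # subtract max for stability
    exps = [math.exp(e - m) for e in etas]
    s = sum(exps)
    return [e / s for e in exps]

# eta holds the k-1 free components; eta_k = log(phi_k / phi_k) = 0 is appended
etas = [1.0, 2.0, 0.5]
phis = softmax(etas + [0.0])
assert abs(sum(phis) - 1.0) < 1e-12              # probabilities sum to 1
assert all(p > 0 for p in phis)                  # and are strictly positive
```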
Closing notes:
- The LaTeX derivations in this post were written by hand; familiarity with LinearRegression, LogisticRegression, and softmax is assumed.
- For more background, see Andrew Ng's classic machine learning course: Stanford CS229.