Machine Learning - Theory - Deriving Generalized Linear Models by Hand in LaTeX


Definition of the generalized linear model:

The exponential family:

$p(y;η) = b(y)\,e^{η^{T}T(y) - a(η)}$

  • η: the natural parameter
  • T(y): the sufficient statistic, which is usually just y
  • a(η): the log partition function, a normalizing constant that guarantees $\sum_{y} p(y;η) = 1$ (a small evaluation sketch follows this list)
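
To make the definition concrete, here is a minimal Python sketch (an illustration added here, not part of the original post) that evaluates a density written in this form; `b`, `T`, and `a` are placeholder callables for whichever distribution is being matched:

```python
import numpy as np

def exp_family_pdf(y, eta, b, T, a):
    """Evaluate p(y; eta) = b(y) * exp(eta * T(y) - a(eta)) for scalar eta."""
    return b(y) * np.exp(eta * T(y) - a(eta))

# Example with the Bernoulli pieces derived later in this post:
# eta = 0 corresponds to phi = 0.5, so the result is 0.5.
p = exp_family_pdf(1, 0.0,
                   b=lambda y: 1.0,
                   T=lambda y: y,
                   a=lambda eta: np.log(1 + np.exp(eta)))
```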

Exponential-family form of the Gaussian distribution:

  • The Gaussian density:
    $f(x) = \frac{1}{\sqrt{2π}\,σ}\,e^{-\frac{(x-μ)^{2}}{2σ^{2}}}$
  • In linear regression, σ has no effect on the choice of the model parameters θ, so for convenience we set it to 1:
    $p(y;μ) = \frac{1}{\sqrt{2π}}\,e^{-\frac{1}{2}(y-μ)^{2}}$
  • Separating out the $y^{2}$ term gives:
    $p(y;μ) = \frac{1}{\sqrt{2π}}\,e^{-\frac{1}{2}y^{2}} \cdot e^{μy-\frac{1}{2}μ^{2}}$
  • Matching the exponential-family components (a numerical check follows this list):
    $η = μ$
    $T(y) = y$
    $a(η) = \frac{μ^{2}}{2} = \frac{η^{2}}{2}$
    $b(y) = \frac{1}{\sqrt{2π}}\,e^{-\frac{y^{2}}{2}}$
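
As a sanity check (my addition, assuming NumPy), the sketch below confirms that these components reproduce the N(μ, 1) density:

```python
import numpy as np

mu = 0.7                      # example mean; for the Gaussian, eta = mu
y = np.linspace(-3.0, 3.0, 7)

# Exponential-family pieces derived above (sigma fixed to 1)
eta = mu
b = np.exp(-0.5 * y**2) / np.sqrt(2 * np.pi)       # b(y)
p_exp_family = b * np.exp(eta * y - 0.5 * eta**2)  # b(y) * e^{eta*y - a(eta)}

# Direct N(mu, 1) density for comparison
p_gaussian = np.exp(-0.5 * (y - mu)**2) / np.sqrt(2 * np.pi)

assert np.allclose(p_exp_family, p_gaussian)
```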

Exponential-family form of the Bernoulli distribution:

  • The Bernoulli pmf, where φ is the probability of the positive event:
    $p(y;φ) = φ^{y}(1-φ)^{1-y}$
  • Logistic regression follows the Bernoulli distribution:
    $p(y=1;φ) = φ$
    $p(y=0;φ) = 1-φ$
  • Rewriting with base e and moving everything into the exponent via logs:
    $p(y;φ) = e^{y\log φ} \cdot e^{(1-y)\log(1-φ)}$
    $p(y;φ) = e^{y\log φ + (1-y)\log(1-φ)}$
    $p(y;φ) = e^{y\log φ - y\log(1-φ) + \log(1-φ)}$
  • Collecting the terms whose coefficient is y:
    $p(y;φ) = e^{y\log\frac{φ}{1-φ} + \log(1-φ)}$
  • Matching the exponential-family form:
    $η = \log\frac{φ}{1-φ}$
    $φ = \frac{1}{1+e^{-η}}$

    $b(y) = 1$
    $T(y) = y$
    $a(η) = -\log(1-φ) = \log(1+e^{η})$
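
Likewise, a quick check (my addition) that $e^{yη - \log(1+e^{η})}$ reproduces the Bernoulli pmf, and that inverting η recovers φ through the sigmoid:

```python
import numpy as np

phi = 0.3                          # example success probability
eta = np.log(phi / (1 - phi))      # natural parameter: the log-odds

for y in (0, 1):
    p_bernoulli = phi**y * (1 - phi)**(1 - y)
    # b(y) = 1, T(y) = y, a(eta) = log(1 + e^eta)
    p_exp_family = np.exp(y * eta - np.log(1 + np.exp(eta)))
    assert np.isclose(p_bernoulli, p_exp_family)

# Inverting eta recovers phi via the sigmoid
assert np.isclose(1 / (1 + np.exp(-eta)), phi)
```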

The three assumptions of GLM modeling:

  • Assumption 1: the conditional distribution of y belongs to the exponential family:
    $y|x;θ \sim ExponentialFamily(η)$
  • Assumption 2:
  1. Given x, the goal of the generalized linear model is to predict $T(y)|x$.
  2. Since in most cases $T(y) = y$, the goal reduces to predicting $y|x$.
  3. In other words, the hypothesis is $h(x) = E[y|x]$.
  4. For example, in logistic regression: $h_θ(x) = p(y=1|x;θ) = 0 \cdot p(y=0|x;θ) + 1 \cdot p(y=1|x;θ) = E[y|x;θ]$.
  • Assumption 3: the natural parameter η depends linearly on x:
    $η = θ^{T}x$
  • If η is a vector, then $η_{i} = θ_{i}^{T}x$ (a sketch assembling these assumptions follows this list).
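
Here is a minimal sketch of how the three assumptions assemble into a hypothesis (my own illustration; `response_fn` is a hypothetical helper, not standard API): assumption 3 gives η = θᵀx, and the distribution's exponential-family form determines how η maps back to E[y|x]:

```python
import numpy as np

def glm_hypothesis(theta, x, response_fn):
    """h(x) = E[y|x] = response_fn(theta^T x), combining assumptions 2 and 3."""
    eta = theta @ x              # assumption 3: eta = theta^T x
    return response_fn(eta)      # map the natural parameter back to the mean

# Gaussian family -> identity response (linear regression, derived next)
h_linear = lambda theta, x: glm_hypothesis(theta, x, lambda eta: eta)

# Bernoulli family -> sigmoid response (logistic regression, derived next)
h_logistic = lambda theta, x: glm_hypothesis(
    theta, x, lambda eta: 1 / (1 + np.exp(-eta)))
```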

Deriving other models from the generalized linear model

Deriving the linear regression equation

  • Linear regression assumes the Gaussian conditional distribution:
    $y|x;θ \sim N(μ, σ^{2})$
  • By assumption 2, the hypothesis is $h(x) = E[y|x]$:
    $h(x) = E[y|x;θ] = μ$
  • From the Gaussian's exponential-family form derived above:
    $η = μ$
  • Therefore:
    $h(x) = η$
  • By assumption 3 (a fitting sketch follows this list):
    $h(x) = θ^{T}x$
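
Since maximizing the Gaussian likelihood with fixed σ is equivalent to least squares, θ has a closed-form solution. A minimal fitting sketch on synthetic data (my own example, not from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(100), rng.normal(size=100)])  # intercept + 1 feature
true_theta = np.array([2.0, -3.0])
y = X @ true_theta + rng.normal(size=100)                  # Gaussian noise, sigma = 1

# Normal equation: theta = (X^T X)^{-1} X^T y (the Gaussian MLE for theta)
theta = np.linalg.solve(X.T @ X, X.T @ y)
h = X @ theta                                              # h(x) = theta^T x
```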

Deriving logistic regression

  • Logistic regression assumes the Bernoulli conditional distribution:
    $y|x;θ \sim Bernoulli(φ)$
  • By assumption 2, the hypothesis is $h(x) = E[y|x]$:
    $h(x) = E[y|x;θ] = φ$
  • From the Bernoulli's exponential-family form derived above:
    $η = \log\frac{φ}{1-φ}$
    $φ = \frac{1}{1+e^{-η}}$
  • Therefore:
    $h(x) = \frac{1}{1+e^{-η}}$
  • By assumption 3 (a training sketch follows this list):
    $h(x) = \frac{1}{1+e^{-θ^{T}x}}$
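
There is no closed form here, but the Bernoulli log-likelihood is concave and can be climbed by gradient ascent. A minimal training sketch (my own example; the learning rate and iteration count are arbitrary choices):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
true_theta = np.array([-1.0, 2.0])
y = (rng.random(200) < sigmoid(X @ true_theta)).astype(float)  # Bernoulli labels

theta = np.zeros(2)
for _ in range(5000):
    h = sigmoid(X @ theta)           # h(x) = 1 / (1 + e^{-theta^T x})
    theta += 0.01 * X.T @ (y - h)    # gradient of the Bernoulli log-likelihood
```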

Deriving the softmax multi-class algorithm

  • y can take k different values, each with its own probability; the last probability is determined by the first k−1:
    $\begin{bmatrix} y_{1} \\ y_{2} \\ \vdots \\ y_{k} \end{bmatrix} \cdot \begin{bmatrix} φ_{1} \\ φ_{2} \\ \vdots \\ 1-\sum_{i=1}^{k-1}φ_{i} \end{bmatrix}$
  • $1\{y=i\}$ indicates that y is ultimately classified into class i; this can be expressed with the vector T(y):
    $T(i) = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \ (\text{i-th position}) \\ \vdots \\ 0 \end{bmatrix}$
  • The multinomial distribution as an exponential-family member:
    $p(y;φ) = φ_{1}^{1\{y=1\}} \cdot φ_{2}^{1\{y=2\}} \cdots φ_{k}^{1\{y=k\}}$
    $p(y;φ) = φ_{1}^{T(y)_{1}} \cdot φ_{2}^{T(y)_{2}} \cdots φ_{k}^{T(y)_{k}}$
  • Rewriting with base e, taking the log in the exponent:
    $p(y;φ) = e^{T(y)_{1}\log φ_{1} + T(y)_{2}\log φ_{2} + \cdots + (1-\sum_{i=1}^{k-1}T(y)_{i})\log φ_{k}}$
  • Expanding $\sum_{i=1}^{k-1}T(y)_{i}$ and folding each term into the preceding ones:
    $p(y;φ) = e^{T(y)_{1}\log\frac{φ_{1}}{φ_{k}} + T(y)_{2}\log\frac{φ_{2}}{φ_{k}} + \cdots + T(y)_{k-1}\log\frac{φ_{k-1}}{φ_{k}} + \log φ_{k}}$
  • Matching the exponential-family form finally gives:
    $η = \begin{bmatrix} \log\frac{φ_{1}}{φ_{k}} \\ \log\frac{φ_{2}}{φ_{k}} \\ \vdots \\ \log\frac{φ_{k-1}}{φ_{k}} \end{bmatrix}$
    $b(y) = 1$
    $a(η) = -\log φ_{k}$

  • Rearranging η to solve for the φ's (using the convention $η_{k} = \log\frac{φ_{k}}{φ_{k}} = 0$):
    $η_{i} = \log\frac{φ_{i}}{φ_{k}}$
    $e^{η_{i}} = \frac{φ_{i}}{φ_{k}}$
    $φ_{k} \cdot e^{η_{i}} = φ_{i}$
    $φ_{k}\sum_{i=1}^{k}e^{η_{i}} = \sum_{i=1}^{k}φ_{i} = 1$
    $φ_{k} = \frac{1}{\sum_{i=1}^{k}e^{η_{i}}}$
  • Therefore (this is exactly the softmax function):
    $φ_{i} = \frac{e^{η_{i}}}{\sum_{j=1}^{k}e^{η_{j}}}$
    $p(y=i|x;θ) = φ_{i}$
  • By assumption 3:
    $p(y=i|x;θ) = \frac{e^{θ_{i}^{T}x}}{\sum_{j=1}^{k}e^{θ_{j}^{T}x}}$
  • So $h_θ(x)$ is:
    $h_θ(x) = E[T(y)|x;θ]$
    $h_θ(x) = \begin{bmatrix} φ_{1} \\ φ_{2} \\ \vdots \\ φ_{k-1} \end{bmatrix}$
    $h_θ(x) = \begin{bmatrix} \frac{e^{θ_{1}^{T}x}}{\sum_{j=1}^{k}e^{θ_{j}^{T}x}} \\ \frac{e^{θ_{2}^{T}x}}{\sum_{j=1}^{k}e^{θ_{j}^{T}x}} \\ \vdots \\ \frac{e^{θ_{k-1}^{T}x}}{\sum_{j=1}^{k}e^{θ_{j}^{T}x}} \end{bmatrix}$
  • Maximum likelihood estimation then gives (a code sketch follows this list):
    $l(θ) = \sum_{i=1}^{m}\log p(y^{(i)}|x^{(i)};θ)$
    $l(θ) = \sum_{i=1}^{m}\log\prod_{l=1}^{k}\left(\frac{e^{θ_{l}^{T}x^{(i)}}}{\sum_{j=1}^{k}e^{θ_{j}^{T}x^{(i)}}}\right)^{1\{y^{(i)}=l\}}$
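
A minimal sketch (my own example) of these softmax probabilities and the log-likelihood; subtracting the row-wise max is a standard numerical-stability trick and does not change the result, since softmax is shift-invariant:

```python
import numpy as np

def softmax_probs(Theta, X):
    """phi_i = e^{theta_i^T x} / sum_j e^{theta_j^T x}, one row per example."""
    logits = X @ Theta.T                          # shape (m, k)
    logits -= logits.max(axis=1, keepdims=True)   # stability trick
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def log_likelihood(Theta, X, y):
    """l(theta) = sum_i log p(y^(i) | x^(i); theta)."""
    phi = softmax_probs(Theta, X)
    return np.log(phi[np.arange(len(y)), y]).sum()

# Tiny usage example: m=3 examples, n=2 features, k=3 classes
X = np.array([[1.0, 0.5], [0.2, -1.0], [-0.7, 0.3]])
y = np.array([0, 2, 1])
Theta = np.zeros((3, 2))
print(log_likelihood(Theta, X, y))    # 3 * log(1/3) at Theta = 0
```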

Closing notes:

  • This post is the author's hand-written LaTeX derivation; it assumes basic familiarity with LinearRegression, LogisticRegression, and softmax.
  • For more background, see Andrew Ng's classic machine learning course: stanford - cs229.


Reprinted from blog.csdn.net/buptsd/article/details/129556767