Generalized Linear Models

Copyright notice: This is an original post by the author, licensed under "Attribution-NonCommercial-NoDerivatives 2.5 China Mainland". Reposting is welcome, but please credit the author and link the source: https://blog.csdn.net/njit_77/article/details/84452142

These are notes I took while following the Stanford open course on machine learning (taught by Andrew Ng), written up for future reference. If you spot any mistakes, please let me know.
Other notes in this series:
Linear Regression
Classification and Logistic Regression
Generalized Linear Models

Generalized Linear Models

So far we have seen two different algorithms that model $p(y \mid x;\theta)$:

$$y \in \mathbb{R},\ \text{Gaussian distribution} \;\rightarrow\; \text{least squares / linear regression} \\ y \in \{0, 1\},\ \text{Bernoulli distribution} \;\rightarrow\; \text{logistic regression}$$

1 The exponential family

A distribution in the exponential family can be written in the form:

$$p(y;\eta) = b(y)\exp\big(\eta^{T}T(y) - a(\eta)\big)$$

where $\eta$ is the natural parameter of the distribution, $T(y)$ is the sufficient statistic (usually $T(y) = y$), and $a(\eta)$ is the log partition function.
For the Bernoulli distribution:

$$\mathrm{Ber}(\phi) = \begin{cases} p(y = 1 \mid \phi) = \phi \\ p(y = 0 \mid \phi) = 1 - \phi \end{cases}$$

$$\begin{aligned} p(y \mid \phi) &= \phi^{y}(1-\phi)^{1-y} \\ &= \exp\big(\log(\phi^{y}(1-\phi)^{1-y})\big) \\ &= \exp\big(\log(\phi^{y}) + \log((1-\phi)^{1-y})\big) \\ &= \exp\big(y\log\phi + (1-y)\log(1-\phi)\big) \\ &= \exp\Big(y\log\frac{\phi}{1-\phi} + \log(1-\phi)\Big) \end{aligned}$$

Taking $T(y) = y$, $b(y) = 1$, and $\eta = \log\frac{\phi}{1-\phi}$, we get $\phi = \frac{1}{1 + e^{-\eta}}$ and $a(\eta) = -\log(1 - \phi) = \log(1+e^{\eta})$.
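As a sanity check, the two forms agree numerically. A minimal sketch in Python/NumPy (the function names are mine, for illustration):

```python
import numpy as np

def bernoulli_pmf(y, phi):
    # Direct form: phi^y * (1 - phi)^(1 - y)
    return phi**y * (1 - phi)**(1 - y)

def bernoulli_expfam(y, phi):
    # Exponential-family form: b(y) * exp(eta * T(y) - a(eta))
    eta = np.log(phi / (1 - phi))   # natural parameter (log-odds)
    a = np.log(1 + np.exp(eta))     # a(eta) = -log(1 - phi)
    return np.exp(eta * y - a)      # b(y) = 1, T(y) = y

for y in (0, 1):
    assert np.isclose(bernoulli_pmf(y, 0.3), bernoulli_expfam(y, 0.3))
```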

For the Gaussian distribution:

$$\begin{aligned} p(y \mid \mu; \sigma^2) &= \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{(y - \mu)^2}{2\sigma^2}\right) \\ &= \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{y^2 - 2y\mu + \mu^2}{2\sigma^2}\right) \\ &= \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{y^2}{2\sigma^2}\right)\exp\left(\frac{2y\mu - \mu^2}{2\sigma^2}\right) \end{aligned}$$
Taking $T(y) = y$, $b(y) = \frac{1}{\sqrt{2\pi}\sigma}\exp\left(-\frac{y^2}{2\sigma^2}\right)$, and $\eta = \frac{\mu}{\sigma^2}$, we get $\mu = \eta\sigma^2$ and $a(\eta) = \frac{\mu^2}{2\sigma^2} = \frac{\eta^2\sigma^2}{2}$.
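The same kind of numerical check works here (again a minimal sketch with illustrative names):

```python
import numpy as np

def gaussian_pdf(y, mu, sigma2):
    # Direct form of the N(mu, sigma^2) density
    return np.exp(-(y - mu)**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

def gaussian_expfam(y, mu, sigma2):
    # Exponential-family form with eta = mu / sigma^2
    eta = mu / sigma2
    b = np.exp(-y**2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
    a = eta**2 * sigma2 / 2          # a(eta) = mu^2 / (2 sigma^2)
    return b * np.exp(eta * y - a)   # T(y) = y

assert np.isclose(gaussian_pdf(1.5, 0.5, 2.0), gaussian_expfam(1.5, 0.5, 2.0))
```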

2 Constructing GLMs

Constructing a generalized linear model rests on three assumptions:

1. $y \mid x;\theta \sim \text{ExponentialFamily}(\eta)$: given input $x$ and parameters $\theta$, the distribution of $y$ belongs to the exponential family with natural parameter $\eta$;

2. Given input $x$, our goal is to predict the expected value $E[T(y) \mid x]$, i.e. the hypothesis should satisfy $h(x) = E[T(y) \mid x]$;

3. $\eta = \theta^{T}x$.

For the Bernoulli distribution:

$$\begin{aligned} h_\theta(x) &= E[y \mid x;\theta] = p(y=1 \mid x;\theta) \\ &= \phi \\ &= \frac{1}{1 + e^{-\eta}} \\ &= \frac{1}{1 + e^{-\theta^{T}x}} \end{aligned}$$
For the Gaussian distribution:

$$\begin{aligned} h_\theta(x) &= E[y \mid x;\theta] \\ &= \mu \\ &= \eta\sigma^2 \\ &= \sigma^2\,\theta^{T}x \end{aligned}$$

With $\sigma^2 = 1$ (as assumed in the CS229 notes) this reduces to the familiar linear-regression hypothesis $h_\theta(x) = \theta^{T}x$.
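The three GLM assumptions thus recover logistic regression and linear regression as special cases. A minimal sketch of the two hypotheses (NumPy; `theta` and `x` are illustrative 1-D arrays):

```python
import numpy as np

def h_bernoulli(theta, x):
    # Bernoulli GLM: h(x) = phi = 1 / (1 + exp(-theta^T x)) -> logistic regression
    return 1.0 / (1.0 + np.exp(-theta @ x))

def h_gaussian(theta, x, sigma2=1.0):
    # Gaussian GLM: h(x) = mu = sigma^2 * theta^T x; with sigma2 = 1
    # this is the ordinary linear-regression hypothesis theta^T x
    return sigma2 * (theta @ x)
```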
Softmax Regression

When the response can take more than two classes, we model it with the multinomial distribution.

Let $y \in \{1,2,\dots,k\}$ with parameters $\phi_1,\phi_2,\dots,\phi_k$ and $p(y = i;\phi) = \phi_i$. Since $\sum_{i=1}^{k}\phi_i = 1$, we have $\phi_k = 1 - \sum_{i=1}^{k-1}\phi_i$, so only $k-1$ of the parameters are free.

To express the multinomial as an exponential family distribution, we define $T(y) \in \mathbb{R}^{k-1}$ as follows:

$$T(1) = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix},\quad T(2) = \begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix},\quad \dots,\quad T(k-1) = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix},\quad T(k) = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \end{bmatrix}$$

Unlike before, $T(y)$ is no longer equal to $y$; it is now a $(k-1)$-dimensional vector. We write $(T(y))_i$ for the $i$-th element of $T(y)$. Using the indicator notation $1\{\text{True}\} = 1$, $1\{\text{False}\} = 0$, we have $(T(y))_i = 1\{y = i\}$. A small helper for this encoding is sketched below.
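In code, $T(y)$ is just a one-hot vector truncated to its first $k-1$ components. An illustrative helper (NumPy; labels follow the 1-based convention of the notes):

```python
import numpy as np

def T(y, k):
    # (T(y))_i = 1{y = i} for i = 1, ..., k-1; T(k) is the zero vector
    t = np.zeros(k - 1)
    if y < k:
        t[y - 1] = 1.0
    return t

assert list(T(2, 4)) == [0.0, 1.0, 0.0]
assert list(T(4, 4)) == [0.0, 0.0, 0.0]
```

With this notation the multinomial pmf can be put into exponential-family form: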
$$\begin{aligned} p(y;\phi) &= \phi_{1}^{1\{y=1\}}\phi_{2}^{1\{y=2\}}\cdots\phi_{k}^{1\{y=k\}} \\ &= \phi_{1}^{1\{y=1\}}\phi_{2}^{1\{y=2\}}\cdots\phi_{k}^{1-\sum_{i=1}^{k-1}1\{y=i\}} \\ &= \phi_{1}^{(T(y))_{1}}\phi_{2}^{(T(y))_{2}}\cdots\phi_{k}^{1-\sum_{i=1}^{k-1}(T(y))_{i}} \\ &= \exp\log\left(\phi_{1}^{(T(y))_{1}}\phi_{2}^{(T(y))_{2}}\cdots\phi_{k}^{1-\sum_{i=1}^{k-1}(T(y))_{i}}\right) \\ &= \exp\Big((T(y))_{1}\log\phi_1 + (T(y))_{2}\log\phi_2 + \dots + \big(1 - \textstyle\sum_{i=1}^{k-1}(T(y))_{i}\big)\log\phi_k\Big) \\ &= \exp\Big((T(y))_{1}\log\frac{\phi_1}{\phi_k} + (T(y))_{2}\log\frac{\phi_2}{\phi_k} + \dots + \log\phi_k\Big) \end{aligned}$$
This is in exponential-family form with $b(y) = 1$,

$$\eta = \begin{bmatrix} \log(\phi_1/\phi_k) \\ \log(\phi_2/\phi_k) \\ \vdots \\ \log(\phi_{k-1}/\phi_k) \end{bmatrix} \in \mathbb{R}^{k-1},$$

and $a(\eta) = -\log(\phi_k)$.
Adopting the convention $\eta_k = \log(\phi_k/\phi_k) = 0$ (equivalently $\theta_k = 0$), so that sums may run over all $k$ classes, the map from $\eta$ back to $\phi$ can be inverted:

$$\eta_i = \log\frac{\phi_i}{\phi_k},\quad i = 1,2,\dots,k \\ \Rightarrow \phi_i = \phi_k e^{\eta_i} \\ \Rightarrow \sum_{i=1}^{k}\phi_i = \phi_k\sum_{i=1}^{k}e^{\eta_i} = 1 \\ \Rightarrow \phi_k = \frac{1}{\sum_{i=1}^{k}e^{\eta_i}} \\ \Rightarrow \phi_i = \frac{e^{\eta_i}}{\sum_{j=1}^{k}e^{\eta_j}} \\ \Rightarrow \phi_i = \frac{e^{\theta_i^{T}x}}{\sum_{j=1}^{k}e^{\theta_j^{T}x}}$$
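The last line is exactly the softmax function, which maps the natural parameters $(\eta_1,\dots,\eta_k)$ to the class probabilities. A numerically stable sketch (subtracting the maximum before exponentiating avoids overflow and does not change the result):

```python
import numpy as np

def softmax(eta):
    # phi_i = exp(eta_i) / sum_j exp(eta_j)
    z = eta - np.max(eta)   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()
```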
The hypothesis function is then:

$$\begin{aligned} h_\theta(x) &= E[T(y) \mid x;\theta] \\ &= E\left[\begin{array}{c} 1\{y=1\} \\ 1\{y=2\} \\ \vdots \\ 1\{y=k-1\} \end{array} \,\middle|\, x;\theta \right] \\ &= \begin{bmatrix} \phi_1 \\ \phi_2 \\ \vdots \\ \phi_{k-1} \end{bmatrix} \\ &= \begin{bmatrix} \frac{\exp(\theta_{1}^{T}x)}{\sum_{j=1}^{k}\exp(\theta_{j}^{T}x)} \\ \frac{\exp(\theta_{2}^{T}x)}{\sum_{j=1}^{k}\exp(\theta_{j}^{T}x)} \\ \vdots \\ \frac{\exp(\theta_{k-1}^{T}x)}{\sum_{j=1}^{k}\exp(\theta_{j}^{T}x)} \end{bmatrix} \end{aligned}$$

This model is called Softmax regression; it generalizes logistic regression to multi-class classification.
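Putting the pieces together, the hypothesis applies softmax to the $k$ scores $\theta_i^{T}x$ and reads off the first $k-1$ probabilities. A sketch reusing the `softmax` helper above (`Theta` stacks the $\theta_i$ as rows; by convention its last row, $\theta_k$, is zero):

```python
def h_softmax(Theta, x):
    # Theta: (k, n) matrix whose i-th row is theta_i (theta_k fixed at 0)
    # returns E[T(y) | x; theta] = (phi_1, ..., phi_{k-1})
    phi = softmax(Theta @ x)
    return phi[:-1]
```

The parameters are fit by maximum likelihood over a training set of $m$ examples: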
$$\begin{aligned} L(\theta) &= p(\vec{y} \mid X;\theta) = \prod_{i=1}^{m} p(y^{(i)} \mid x^{(i)};\theta) \\ &= \prod_{i=1}^{m}\phi_{1}^{1\{y^{(i)}=1\}}\phi_{2}^{1\{y^{(i)}=2\}}\cdots\phi_{k}^{1\{y^{(i)}=k\}} \\ &= \prod_{i=1}^{m}\prod_{l=1}^{k}\phi_{l}^{1\{y^{(i)}=l\}} \\ &= \prod_{i=1}^{m}\prod_{l=1}^{k}\left(\frac{\exp(\theta_{l}^{T}x^{(i)})}{\sum_{j=1}^{k}\exp(\theta_{j}^{T}x^{(i)})}\right)^{1\{y^{(i)}=l\}} \end{aligned}$$

Taking the logarithm gives the log-likelihood:

$$\begin{aligned} \ell(\theta) &= \log L(\theta) \\ &= \sum_{i=1}^{m}\log\prod_{l=1}^{k}\left(\frac{\exp(\theta_{l}^{T}x^{(i)})}{\sum_{j=1}^{k}\exp(\theta_{j}^{T}x^{(i)})}\right)^{1\{y^{(i)}=l\}} \\ &= \sum_{i=1}^{m}\sum_{l=1}^{k}1\{y^{(i)}=l\}\log\frac{\exp(\theta_{l}^{T}x^{(i)})}{\sum_{j=1}^{k}\exp(\theta_{j}^{T}x^{(i)})} \end{aligned}$$

Differentiating with respect to $\theta_p$ gives

$$\frac{\partial}{\partial\theta_p}\ell(\theta) = \frac{\partial}{\partial\theta_p}\sum_{i=1}^{m}\sum_{l=1}^{k}1\{y^{(i)}=l\}\log\frac{\exp(\theta_{l}^{T}x^{(i)})}{\sum_{j=1}^{k}\exp(\theta_{j}^{T}x^{(i)})} = \sum_{i=1}^{m}\left(1\{y^{(i)}=p\} - \frac{\exp(\theta_{p}^{T}x^{(i)})}{\sum_{j=1}^{k}\exp(\theta_{j}^{T}x^{(i)})}\right)x^{(i)},$$

so $\ell(\theta)$ can be maximized by gradient ascent or Newton's method.
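This gradient plugs directly into batch gradient ascent. A minimal sketch (NumPy; `X`, `y`, and the hyperparameters `lr`/`iters` are illustrative, and labels are 0-based as is usual in Python):

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)    # stability shift per row
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def fit_softmax(X, y, k, lr=0.1, iters=1000):
    # X: (m, n) design matrix; y: (m,) integer labels in {0, ..., k-1}
    m, n = X.shape
    Theta = np.zeros((k, n))
    Y = np.eye(k)[y]                        # one-hot labels, shape (m, k)
    for _ in range(iters):
        Phi = softmax_rows(X @ Theta.T)     # (m, k) class probabilities
        grad = (Y - Phi).T @ X              # row p: sum_i (1{y_i=p} - phi_p) x_i
        Theta += (lr / m) * grad            # gradient ascent on l(theta)
    return Theta
```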
