References:
- "Statistical Learning Methods" Li Hang
- https://www.zhihu.com/question/33959624
Step 1. The likelihood function
In the Naive Bayes model, the parameters we need to determine from the training set are $\theta_k = P(y=c_k)$ and $\mu_{jlk} = P(x^{(j)} = a_{jl} \mid y = c_k)$.
Likelihood function:

$$
\begin{aligned}
L(\theta,\mu) &= \prod_{i=1}^{N} P(x_i, y_i) \\
&= \prod_{i=1}^{N} P(y_i)\,P(x_i \mid y_i) && \text{(multiplication rule)} \\
&= \prod_{i=1}^{N} \Big( P(y_i) \prod_{j=1}^{n} P(x_i^{(j)} \mid y_i) \Big) && \text{(conditional independence assumption)} \\
&= \prod_{i=1}^{N} \prod_{k=1}^{K} \Big( P(y=c_k) \prod_{j=1}^{n} P(x_i^{(j)} \mid y_i = c_k) \Big)^{I(y_i=c_k)} \\
&= \prod_{i=1}^{N} \prod_{k=1}^{K} \Big( \theta_k \prod_{j=1}^{n} \prod_{l=1}^{L_j} P(x^{(j)}=a_{jl} \mid y_i=c_k)^{I(x_i^{(j)}=a_{jl})} \Big)^{I(y_i=c_k)} \\
&= \prod_{i=1}^{N} \prod_{k=1}^{K} \Big( \theta_k \prod_{j=1}^{n} \prod_{l=1}^{L_j} \mu_{jlk}^{I(x_i^{(j)}=a_{jl})} \Big)^{I(y_i=c_k)}
\end{aligned}
$$
Here $N$ is the number of samples, $n$ is the dimension of $x$, $L_j$ is the number of possible values of $x^{(j)}$, and $K$ is the number of possible values of $y$.
Taking logarithms gives the log-likelihood:
$$
l(\theta,\mu) = \sum_{i=1}^{N} \sum_{k=1}^{K} I(y_i=c_k) \Big( \log\theta_k + \sum_{j=1}^{n} \sum_{l=1}^{L_j} I(x_i^{(j)}=a_{jl}) \log\mu_{jlk} \Big)
$$
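Note that the indicator sums simply pick out each sample's own class and feature values, so the log-likelihood collapses to a direct lookup per sample. A minimal sketch in Python (the function name `log_likelihood` and the `(feature index, value, class)` dictionary keys are illustrative choices, not from the source):

```python
import math

def log_likelihood(X, y, theta, mu):
    """Evaluate l(theta, mu): the indicator sums collapse to looking up
    each sample's own class prior and conditional probabilities."""
    ll = 0.0
    for xi, yi in zip(X, y):
        ll += math.log(theta[yi])           # I(y_i = c_k) selects theta_{y_i}
        for j, v in enumerate(xi):
            ll += math.log(mu[(j, v, yi)])  # I(x_i^(j) = a_jl) selects mu_{j, v, y_i}
    return ll

# Toy example: one feature with values 'a'/'b', two classes 0/1.
theta = {0: 0.5, 1: 0.5}
mu = {(0, 'a', 0): 0.8, (0, 'b', 0): 0.2,
      (0, 'a', 1): 0.3, (0, 'b', 1): 0.7}
ll = log_likelihood([('a',), ('b',)], [0, 1], theta, mu)
# ll = log(0.5 * 0.8 * 0.5 * 0.7) = log(0.14)
```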
Step 2. Solve for $\theta_k$
Use the Lagrange multiplier method to introduce the constraint $\sum\limits_{k=1}^{K}\theta_k = 1$, giving:

$$
F(\theta,\mu,\lambda) = \sum_{i=1}^{N}\sum_{k=1}^{K} I(y_i=c_k)\Big(\log\theta_k + \sum_{j=1}^{n}\sum_{l=1}^{L_j} I(x_i^{(j)}=a_{jl})\log\mu_{jlk}\Big) + \lambda\Big(\sum_{k=1}^{K}\theta_k - 1\Big)
$$
Taking the partial derivative of $F$ with respect to $\theta_k$ and setting it to $0$ gives:

$$
\begin{aligned}
\theta_k &= -\frac{\sum\limits_{i=1}^{N} I(y_i=c_k)}{\lambda} \\
\sum_{k=1}^{K}\theta_k &= -\frac{N}{\lambda} = 1
\end{aligned}
$$
Here $N_k = \sum\limits_{i=1}^{N} I(y_i=c_k)$ is the number of samples with $y = c_k$. Combining the two equations above gives:

$$
\theta_k = \frac{\sum\limits_{i=1}^{N} I(y_i=c_k)}{N}
$$
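So the MLE of the prior is simply the class frequency $N_k / N$. A quick sketch (the helper name `estimate_theta` is hypothetical):

```python
from collections import Counter

def estimate_theta(y):
    """MLE of the class priors: theta_k = N_k / N,
    the fraction of training samples belonging to class c_k."""
    N = len(y)
    return {c_k: N_k / N for c_k, N_k in Counter(y).items()}

# Example: 3 positive and 2 negative labels.
theta = estimate_theta([1, 1, 1, -1, -1])
# theta == {1: 0.6, -1: 0.4}
```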
Step 3. Solve for $\mu_{jlk}$
Use the Lagrange multiplier method to introduce the constraint $\sum\limits_{l=1}^{L_j}\mu_{jlk} = 1$, giving:

$$
F(\theta,\mu,\lambda) = \sum_{i=1}^{N}\sum_{k=1}^{K} I(y_i=c_k)\Big(\log\theta_k + \sum_{j=1}^{n}\sum_{l=1}^{L_j} I(x_i^{(j)}=a_{jl})\log\mu_{jlk}\Big) + \lambda\Big(\sum_{l=1}^{L_j}\mu_{jlk} - 1\Big)
$$
Taking the partial derivative of $F$ with respect to $\mu_{jlk}$ and setting it to $0$ gives:

$$
\begin{aligned}
\mu_{jlk} &= -\frac{\sum\limits_{i=1}^{N} I(y_i=c_k,\, x_i^{(j)}=a_{jl})}{\lambda} \\
\sum_{l=1}^{L_j}\mu_{jlk} &= -\frac{\sum\limits_{i=1}^{N} I(y_i=c_k)}{\lambda} = 1
\end{aligned}
$$
Combining the two equations above gives:

$$
\mu_{jlk} = \frac{\sum\limits_{i=1}^{N} I(y_i=c_k,\, x_i^{(j)}=a_{jl})}{\sum\limits_{i=1}^{N} I(y_i=c_k)}
$$
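Putting both estimators together, each conditional probability is a co-occurrence count divided by a class count. A minimal sketch assuming discrete features stored as tuples (the `(j, value, class)` key layout and the name `estimate_mu` are arbitrary illustrative choices):

```python
from collections import Counter, defaultdict

def estimate_mu(X, y):
    """MLE of mu_jlk: count(x^(j) = a_jl and y = c_k) / count(y = c_k)."""
    class_counts = Counter(y)          # N_k for each class c_k
    joint = defaultdict(int)           # (j, a_jl, c_k) -> joint count
    for xi, yi in zip(X, y):
        for j, v in enumerate(xi):
            joint[(j, v, yi)] += 1
    return {key: cnt / class_counts[key[2]] for key, cnt in joint.items()}

# Example: two features, classes -1 and 1.
X = [(1, 'S'), (1, 'M'), (2, 'S'), (2, 'M')]
y = [-1, -1, 1, 1]
mu = estimate_mu(X, y)
# mu[(0, 1, -1)] == 1.0  (both class -1 samples have x^(1) = 1)
# mu[(1, 'S', 1)] == 0.5
```

Note that, as the derivation requires, the estimates for each fixed feature $j$ and class $c_k$ sum to 1 over the values $a_{jl}$.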