假设读者已经了解 LDA 的来龙去脉。
需要明确采样的含义: 随机变量是总体,采样就是按照总体的概率分布(指示了样本出现的概率)抽取样本的过程。样本应该能正确反映总体的情况,即样本与总体应该具有相同的统计性质(均值,方差等)。
一、《LDA数学八卦》中的推导
语料库中的第
i
i
i 个词对应的主题我们记为
z
i
z_i
z i ,其中
i
=
(
m
,
n
)
i=(m,n)
i = ( m , n ) 是一个二维下标,即语料库中第
i
i
i 个词对应于第
m
m
m 篇文档的第
n
n
n 词,我们用
¬
i
\neg i
¬ i 表示去除下标为
i
i
i 的词。 那么按照 Gibbs Sampling 的要求,我们要求得任一坐标轴
i
i
i 对应的条件分布
p
(
z
i
=
k
∣
Z
⃗
¬
i
,
W
⃗
)
p(z_i=k|\vec{Z}_{\neg i},\vec{W})
p ( z i = k ∣ Z
¬ i , W
) ,注意这是一个离散型概率分布的抽样。由贝叶斯法则可得(条件概率正比于联合概率):
p
(
z
i
=
k
∣
Z
⃗
¬
i
,
W
⃗
)
∝
p
(
z
i
=
k
,
w
i
=
t
∣
Z
⃗
¬
i
,
W
⃗
¬
i
)
p(z_i=k|\vec{Z}_{\neg i},\vec{W})\propto p(z_i=k,w_i=t|\vec{Z}_{\neg i},\vec{W}_{\neg i})
p ( z i = k ∣ Z
¬ i , W
) ∝ p ( z i = k , w i = t ∣ Z
¬ i , W
¬ i ) 由于
z
i
=
k
,
w
i
=
t
z_i=k,w_i=t
z i = k , w i = t 只涉及到第
m
m
m 篇文档和第
k
k
k 个主题,所以只会涉及到如下两个 Dirichlet-Multinomial 共轭结构
α
⃗
→
θ
⃗
m
→
z
⃗
m
\vec{\alpha}\rightarrow\vec{\theta}_m\rightarrow\vec{z}_m
α
→ θ
m → z
m
β
⃗
→
φ
⃗
k
→
w
⃗
k
\vec{\beta}\rightarrow\vec{\varphi}_k\rightarrow\vec{w}_k
β
→ φ
k → w
k
去掉了语料中第
i
i
i 个词对应的
(
z
i
,
w
i
)
(z_i,w_i)
( z i , w i ) ,并不会改变上面两个共轭结构,只是会减少对应的计数。所以
θ
⃗
m
,
φ
⃗
k
\vec{\theta}_m,\vec{\varphi}_k
θ
m , φ
k 的后验分布都还是狄利克雷分布:
p
(
θ
⃗
m
∣
Z
⃗
¬
i
,
W
⃗
¬
i
)
=
D
i
r
(
θ
⃗
m
∣
n
⃗
m
,
¬
i
+
α
⃗
)
p
(
φ
⃗
k
∣
Z
⃗
¬
i
,
W
⃗
¬
i
)
=
D
i
r
(
φ
⃗
k
∣
n
⃗
k
,
¬
i
+
β
⃗
)
p(\vec{\theta}_m|\vec{Z}_{\neg i},\vec{W}_{\neg i})=Dir(\vec{\theta}_m|\vec{n}_{m,\neg i}+\vec{\alpha})\\p(\vec{\varphi}_k|\vec{Z}_{\neg i},\vec{W}_{\neg i})=Dir(\vec{\varphi}_k|\vec{n}_{k,\neg i}+\vec{\beta})
p ( θ
m ∣ Z
¬ i , W
¬ i ) = D i r ( θ
m ∣ n
m , ¬ i + α
) p ( φ
k ∣ Z
¬ i , W
¬ i ) = D i r ( φ
k ∣ n
k , ¬ i + β
) 注意,
n
⃗
m
,
¬
i
,
n
⃗
k
,
¬
i
\vec{n}_{m,\neg i},\vec{n}_{k,\neg i}
n
m , ¬ i , n
k , ¬ i 都不是双下标,
n
⃗
m
=
(
n
m
(
1
)
,
⋯
 
,
n
m
(
K
)
)
\vec{n}_{m}=(n_m^{(1)},\cdots,n_m^{(K)})
n
m = ( n m ( 1 ) , ⋯ , n m ( K ) ) ,
n
m
(
k
)
n_m^{(k)}
n m ( k ) 表示第 m 篇文档中第 k 个主题的词的个数,
n
⃗
k
=
(
n
k
(
1
)
,
⋯
 
,
n
k
(
V
)
)
\vec{n}_k=(n_k^{(1)},\cdots,n_k^{(V)})
n
k = ( n k ( 1 ) , ⋯ , n k ( V ) ) ,
n
k
(
t
)
n_k^{(t)}
n k ( t ) 表示第 k 个主题中第 t(词典下标) 个词的个数。
¬
i
\neg i
¬ i 表示去除下标为
i
i
i (语料下标) 的词,所以在统计生成
n
⃗
m
,
n
⃗
k
\vec{n}_{m},\vec{n}_{k}
n
m , n
k 这两个向量时,我们可以当做第 m 篇文档中就没有这个词,也就不统计该词相关的计数,
n
⃗
m
,
¬
i
,
n
⃗
k
,
¬
i
\vec{n}_{m,\neg i},\vec{n}_{k,\neg i}
n
m , ¬ i , n
k , ¬ i 表示的就是这样一种含义。
Gibbs Sampling 公式的推导:
p
(
z
i
=
k
∣
Z
⃗
¬
i
,
W
⃗
)
∝
p
(
z
i
=
k
,
w
i
=
t
∣
Z
⃗
¬
i
,
W
⃗
¬
i
)
=
∫
p
(
z
i
=
k
,
w
i
=
t
,
θ
⃗
m
,
φ
⃗
k
∣
Z
⃗
¬
i
,
W
⃗
¬
i
)
d
θ
⃗
m
d
φ
⃗
k
=
∫
p
(
z
i
=
k
,
θ
⃗
m
∣
Z
⃗
¬
i
,
W
⃗
¬
i
)
⋅
p
(
w
i
=
t
,
φ
⃗
k
∣
Z
⃗
¬
i
,
W
⃗
¬
i
)
d
θ
⃗
m
d
φ
⃗
k
=
∫
p
(
z
i
=
k
∣
θ
⃗
m
)
p
(
θ
⃗
m
∣
Z
⃗
¬
i
,
W
⃗
¬
i
)
⋅
p
(
w
i
=
t
∣
φ
⃗
k
)
p
(
φ
⃗
k
∣
Z
⃗
¬
i
,
W
⃗
¬
i
)
d
θ
⃗
m
d
φ
⃗
k
=
∫
p
(
z
i
=
k
∣
θ
⃗
m
)
D
i
r
(
θ
⃗
m
∣
n
⃗
m
,
¬
i
+
α
⃗
)
d
θ
⃗
m
⋅
∫
p
(
w
i
=
t
∣
φ
⃗
k
)
D
i
r
(
φ
⃗
k
∣
n
⃗
k
,
¬
i
+
β
⃗
)
d
φ
⃗
k
=
∫
θ
m
k
D
i
r
(
θ
⃗
m
∣
n
⃗
m
,
¬
i
+
α
⃗
)
d
θ
⃗
m
⋅
∫
φ
k
t
D
i
r
(
φ
⃗
k
∣
n
⃗
k
,
¬
i
+
β
⃗
)
d
φ
⃗
k
=
E
(
θ
m
k
)
⋅
E
(
φ
k
t
)
=
θ
^
m
k
⋅
φ
^
k
t
\begin{aligned} p(z_i=k|\vec{Z}_{\neg i},\vec{W})\propto & p(z_i=k,w_i=t|\vec{Z}_{\neg i},\vec{W}_{\neg i})\\ =&\int p(z_i=k,w_i=t,\vec{\theta}_m,\vec{\varphi}_k|\vec{Z}_{\neg i},\vec{W}_{\neg i})\mathrm d\vec{\theta}_m\mathrm d\vec{\varphi}_k\\ =&\int p(z_i=k,\vec{\theta}_m|\vec{Z}_{\neg i},\vec{W}_{\neg i})\cdot p(w_i=t,\vec{\varphi}_k|\vec{Z}_{\neg i},\vec{W}_{\neg i})\mathrm d\vec{\theta}_m\mathrm d\vec{\varphi}_k\\ =&\int p(z_i=k|\vec{\theta}_m)p(\vec{\theta}_m|\vec{Z}_{\neg i},\vec{W}_{\neg i})\cdot p(w_i=t|\vec{\varphi}_k)p(\vec{\varphi}_k|\vec{Z}_{\neg i},\vec{W}_{\neg i})\mathrm d\vec{\theta}_m\mathrm d\vec{\varphi}_k\\ =&\int p(z_i=k|\vec{\theta}_m)Dir(\vec{\theta}_m|\vec{n}_{m,\neg i}+\vec{\alpha})\mathrm d\vec{\theta}_m\cdot\int p(w_i=t|\vec{\varphi}_k)Dir(\vec{\varphi}_k|\vec{n}_{k,\neg i}+\vec{\beta})\mathrm d\vec{\varphi}_k\\ =&\int \theta_{mk} Dir(\vec{\theta}_m|\vec{n}_{m,\neg i}+\vec{\alpha})\mathrm d\vec{\theta}_m\cdot\int \varphi_{kt} Dir(\vec{\varphi}_k|\vec{n}_{k,\neg i}+\vec{\beta})\mathrm d\vec{\varphi}_k\\ =&E(\theta_{mk})\cdot E(\varphi_{kt})\\ =&\hat{\theta}_{mk}\cdot\hat{\varphi}_{kt} \end{aligned}
p ( z i = k ∣ Z
¬ i , W
) ∝ = = = = = = = p ( z i = k , w i = t ∣ Z
¬ i , W
¬ i ) ∫ p ( z i = k , w i = t , θ
m , φ
k ∣ Z
¬ i , W
¬ i ) d θ
m d φ
k ∫ p ( z i = k , θ
m ∣ Z
¬ i , W
¬ i ) ⋅ p ( w i = t , φ
k ∣ Z
¬ i , W
¬ i ) d θ
m d φ
k ∫ p ( z i = k ∣ θ
m ) p ( θ
m ∣ Z
¬ i , W
¬ i ) ⋅ p ( w i = t ∣ φ
k ) p ( φ
k ∣ Z
¬ i , W
¬ i ) d θ
m d φ
k ∫ p ( z i = k ∣ θ
m ) D i r ( θ
m ∣ n
m , ¬ i + α
) d θ
m ⋅ ∫ p ( w i = t ∣ φ
k ) D i r ( φ
k ∣ n
k , ¬ i + β
) d φ
k ∫ θ m k D i r ( θ
m ∣ n
m , ¬ i + α
) d θ
m ⋅ ∫ φ k t D i r ( φ
k ∣ n
k , ¬ i + β
) d φ
k E ( θ m k ) ⋅ E ( φ k t ) θ ^ m k ⋅ φ ^ k t
根据狄利克雷分布的期望公式有:
θ
^
m
k
=
n
m
,
¬
i
(
k
)
+
α
k
∑
k
=
1
K
n
m
,
¬
i
(
k
)
+
α
k
φ
^
k
t
=
n
k
,
¬
i
(
t
)
+
β
t
∑
t
=
1
V
n
k
,
¬
i
(
t
)
+
β
t
\hat{\theta}_{mk}=\frac{n_{m,\neg i}^{(k)}+\alpha_k}{\sum_{k=1}^Kn_{m,\neg i}^{(k)}+\alpha_k}\\\hat{\varphi}_{kt}=\frac{n_{k,\neg i}^{(t)}+\beta_t}{\sum_{t=1}^Vn_{k,\neg i}^{(t)}+\beta_t}
θ ^ m k = ∑ k = 1 K n m , ¬ i ( k ) + α k n m , ¬ i ( k ) + α k φ ^ k t = ∑ t = 1 V n k , ¬ i ( t ) + β t n k , ¬ i ( t ) + β t
所以 LDA 模型的采样公式为:
p
(
z
i
=
k
∣
Z
⃗
¬
i
,
W
⃗
)
∝
n
m
,
¬
i
(
k
)
+
α
k
∑
k
=
1
K
n
m
,
¬
i
(
k
)
+
α
k
⋅
n
k
,
¬
i
(
t
)
+
β
t
∑
t
=
1
V
n
k
,
¬
i
(
t
)
+
β
t
p(z_i=k|\vec{Z}_{\neg i},\vec{W})\propto\frac{n_{m,\neg i}^{(k)}+\alpha_k}{\sum_{k=1}^Kn_{m,\neg i}^{(k)}+\alpha_k}\cdot\frac{n_{k,\neg i}^{(t)}+\beta_t}{\sum_{t=1}^Vn_{k,\neg i}^{(t)}+\beta_t}
p ( z i = k ∣ Z
¬ i , W
) ∝ ∑ k = 1 K n m , ¬ i ( k ) + α k n m , ¬ i ( k ) + α k ⋅ ∑ t = 1 V n k , ¬ i ( t ) + β t n k , ¬ i ( t ) + β t
二、《Parameter estimation for text analysis》中的推导
第一类狄利克雷积分,后面直接当作结论使用,不做证明:
Δ
(
α
⃗
)
=
∫
∑
x
i
=
1
∏
i
=
1
N
x
i
α
i
−
1
d
x
⃗
\Delta(\vec{\alpha})=\int_{\sum x_i=1}\prod_{i=1}^Nx_i^{\alpha_i-1}\mathrm d\vec{x}
Δ ( α
) = ∫ ∑ x i = 1 i = 1 ∏ N x i α i − 1 d x
考虑一个有狄利克雷先验的一元模型(文档的生成仅依赖于一个词分布),此时只有一个 Dirichlet-Multinomial 共轭结构,有:
p
(
W
∣
α
⃗
)
=
∫
p
(
W
∣
p
⃗
)
⋅
p
(
p
⃗
∣
α
⃗
)
d
p
⃗
p(W|\vec{\alpha})=\int{p(W|\vec{p})\cdot p(\vec{p}|\vec{\alpha})\mathrm d\vec{p}}
p ( W ∣ α
) = ∫ p ( W ∣ p
) ⋅ p ( p
∣ α
) d p
对比我们在概率论中学过的全概率公式,可以把这个式子理解为连续型变量的全概率公式。 具体的:
p
(
W
∣
α
⃗
)
=
∫
∏
n
=
1
N
M
u
l
t
(
w
=
w
n
∣
p
⃗
,
1
)
⋅
D
i
r
(
p
⃗
∣
α
⃗
)
d
p
⃗
=
∫
∏
v
=
1
V
p
v
n
(
v
)
⋅
1
Δ
(
α
⃗
)
∏
v
=
1
V
p
v
α
v
−
1
d
p
⃗
=
1
Δ
(
α
⃗
)
∫
∏
v
=
1
V
p
v
n
(
v
)
+
α
v
−
1
d
p
⃗
∣
D
i
r
i
c
h
l
e
t
∫
=
Δ
(
n
⃗
+
α
⃗
)
Δ
(
α
⃗
)
,
n
⃗
=
{
n
(
v
)
}
v
=
1
V
\begin{aligned} p(W|\vec{\alpha})=&\int{\prod_{n=1}^N\mathrm{Mult}(w=w_n|\vec{p},1)\cdot \mathrm{Dir}(\vec{p}|\vec{\alpha})\mathrm d\vec{p}}\\ =&\int{\prod_{v=1}^Vp_v^{n^{(v)}}\cdot \frac{1}{\Delta(\vec\alpha)}\prod_{v=1}^Vp_v^{\alpha_v-1}\mathrm d\vec{p}}\\ =&\frac{1}{\Delta(\vec\alpha)}\int{\prod_{v=1}^Vp_v^{n^{(v)}+\alpha_v-1}\mathrm d\vec{p}} & | \mathrm{Dirichlet}\int\\ =&\frac{\Delta(\vec{n}+\vec\alpha)}{\Delta(\vec\alpha)},\vec{n}=\{n^{(v)}\}_{v=1}^V \end{aligned}
p ( W ∣ α
) = = = = ∫ n = 1 ∏ N M u l t ( w = w n ∣ p
, 1 ) ⋅ D i r ( p
∣ α
) d p
∫ v = 1 ∏ V p v n ( v ) ⋅ Δ ( α
) 1 v = 1 ∏ V p v α v − 1 d p
Δ ( α
) 1 ∫ v = 1 ∏ V p v n ( v ) + α v − 1 d p
Δ ( α
) Δ ( n
+ α
) , n
= { n ( v ) } v = 1 V ∣ D i r i c h l e t ∫ 其中,
M
u
l
t
(
w
=
w
n
∣
p
⃗
,
1
)
\mathrm{Mult}(w=w_n|\vec{p},1)
M u l t ( w = w n ∣ p
, 1 ) 表示多项式分布 N 次实验中一次实验的观测结果,最后一步的推导使用了第一类狄利克雷积分。
上面这个推导的意义是,消去了多项式分布的参数
p
⃗
\vec{p}
p
的影响,因为它是未知的,仅使用词计数和狄利克雷超参数(伪计数)来表达观察到的语料的概率。
那么对于 LDA 模型来说,就是 K+M 个 Dirichlet-Multinomial 共轭结构。 首先对于 K 个主题的 Dirichlet-Multinomial 共轭结构有:
p
(
w
⃗
∣
z
⃗
,
β
⃗
)
=
∫
p
(
w
⃗
∣
z
⃗
,
Φ
)
p
(
Φ
∣
β
⃗
)
d
Φ
=
∏
k
=
1
K
∫
1
Δ
(
β
⃗
)
∏
t
=
1
V
φ
k
,
t
n
k
(
t
)
+
β
t
−
1
d
φ
⃗
k
=
∏
k
=
1
K
Δ
(
n
⃗
k
+
β
⃗
)
Δ
(
β
⃗
)
,
n
⃗
k
=
{
n
k
(
t
)
}
t
=
1
V
\begin{aligned} p(\vec w|\vec z,\vec \beta)=&\int p(\vec w|\vec z,\Phi)p(\Phi|\vec \beta)\mathrm d\Phi\\ =&\prod_{k=1}^K\int\frac{1}{\Delta(\vec\beta)}\prod_{t=1}^V\varphi_{k,t}^{n_k^{(t)}+\beta_t-1}\mathrm d\vec\varphi_k\\ =&\prod_{k=1}^K\frac{\Delta(\vec n_k+\vec\beta)}{\Delta(\vec\beta)},\vec n_k=\{n_k^{(t)}\}_{t=1}^V \end{aligned}
p ( w
∣ z
, β
) = = = ∫ p ( w
∣ z
, Φ ) p ( Φ ∣ β
) d Φ k = 1 ∏ K ∫ Δ ( β
) 1 t = 1 ∏ V φ k , t n k ( t ) + β t − 1 d φ
k k = 1 ∏ K Δ ( β
) Δ ( n
k + β
) , n
k = { n k ( t ) } t = 1 V 其次对于 M 个文档的 Dirichlet-Multinomial 共轭结构有:
p
(
z
⃗
∣
α
⃗
)
=
∫
p
(
z
⃗
∣
Θ
)
p
(
Θ
∣
α
⃗
)
d
Θ
=
∏
m
=
1
M
∫
1
Δ
(
α
⃗
)
∏
k
=
1
K
θ
m
,
k
n
m
(
k
)
+
α
k
−
1
d
θ
⃗
m
=
∏
m
=
1
M
Δ
(
n
⃗
m
+
α
⃗
)
Δ
(
α
⃗
)
,
n
⃗
m
=
{
n
m
(
k
)
}
k
=
1
K
\begin{aligned} p(\vec z|\vec \alpha)=&\int p(\vec z|\Theta)p(\Theta|\vec \alpha)\mathrm d\Theta\\ =&\prod_{m=1}^M\int\frac{1}{\Delta(\vec\alpha)}\prod_{k=1}^K\theta_{m,k}^{n_m^{(k)}+\alpha_k-1}\mathrm d\vec\theta_m\\ =&\prod_{m=1}^M\frac{\Delta(\vec n_m+\vec\alpha)}{\Delta(\vec\alpha)},\vec n_m=\{n_m^{(k)}\}_{k=1}^K \end{aligned}
p ( z
∣ α
) = = = ∫ p ( z
∣ Θ ) p ( Θ ∣ α
) d Θ m = 1 ∏ M ∫ Δ ( α
) 1 k = 1 ∏ K θ m , k n m ( k ) + α k − 1 d θ
m m = 1 ∏ M Δ ( α
) Δ ( n
m + α
) , n
m = { n m ( k ) } k = 1 K 所以模型的联合分布为:
p
(
z
⃗
,
w
⃗
∣
α
⃗
,
β
⃗
)
=
∏
k
=
1
K
Δ
(
n
⃗
k
+
β
⃗
)
Δ
(
β
⃗
)
⋅
∏
m
=
1
M
Δ
(
n
⃗
m
+
α
⃗
)
Δ
(
α
⃗
)
p(\vec z,\vec w|\vec\alpha,\vec\beta)=\prod_{k=1}^K\frac{\Delta(\vec n_k+\vec\beta)}{\Delta(\vec\beta)}\cdot\prod_{m=1}^M\frac{\Delta(\vec n_m+\vec\alpha)}{\Delta(\vec\alpha)}
p ( z
, w
∣ α
, β
) = k = 1 ∏ K Δ ( β
) Δ ( n
k + β
) ⋅ m = 1 ∏ M Δ ( α
) Δ ( n
m + α
) 因为 Gibbs Sampling 的要求就是根据条件分布进行采样,所以 LDA 模型的采样公式为:
p
(
z
i
=
k
∣
z
⃗
¬
i
,
w
⃗
)
=
p
(
w
⃗
,
z
⃗
)
p
(
w
⃗
,
z
⃗
¬
i
)
=
p
(
w
⃗
∣
z
⃗
)
p
(
w
⃗
¬
i
∣
z
⃗
¬
i
)
p
(
w
i
)
⋅
p
(
z
⃗
)
p
(
z
⃗
¬
i
)
(
1
)
∝
Δ
(
n
⃗
k
+
β
⃗
)
Δ
(
n
⃗
k
,
¬
i
+
β
⃗
)
⋅
Δ
(
n
⃗
m
+
α
⃗
)
Δ
(
n
⃗
m
,
¬
i
+
α
⃗
)
(
2
)
=
Γ
(
n
k
(
t
)
+
β
t
)
Γ
(
∑
t
=
1
V
n
k
,
¬
i
(
t
)
+
β
t
)
Γ
(
n
k
,
¬
i
(
t
)
+
β
t
)
)
Γ
(
∑
t
=
1
V
n
k
(
t
)
+
β
t
)
⋅
Γ
(
n
m
(
k
)
+
α
k
)
Γ
(
∑
k
=
1
K
n
m
,
¬
i
(
k
)
+
α
k
)
Γ
(
n
m
,
¬
i
(
k
)
+
α
k
)
Γ
(
∑
k
=
1
K
n
m
(
k
)
+
α
k
)
(
3
)
=
n
k
,
¬
i
(
t
)
+
β
t
∑
t
=
1
V
n
k
,
¬
i
(
t
)
+
β
t
⋅
n
m
,
¬
i
(
k
)
+
α
k
[
∑
k
=
1
K
n
m
(
k
)
+
α
k
]
−
1
(
4
)
∝
n
k
,
¬
i
(
t
)
+
β
t
∑
t
=
1
V
n
k
,
¬
i
(
t
)
+
β
t
⋅
(
n
m
,
¬
i
(
k
)
+
α
k
)
(
5
)
\begin{aligned} p(z_i=k|\vec{z}_{\neg i},\vec{w})=& \frac{p(\vec{w},\vec{z})}{p(\vec{w},\vec{z}_{\neg i})}=\frac{p(\vec{w}|\vec{z})}{p(\vec{w}_{\neg i}|\vec{z}_{\neg i})p(w_i)}\cdot\frac{p(\vec{z})}{p(\vec{z}_{\neg i})} & (1)\\ \propto & \frac{\Delta(\vec{n}_k+\vec\beta)}{\Delta(\vec{n}_{k,\neg i}+\vec\beta)}\cdot \frac{\Delta(\vec{n}_m+\vec\alpha)}{\Delta(\vec{n}_{m,\neg i}+\vec\alpha)} &(2)\\ =& \frac{\Gamma(n_k^{(t)}+\beta_t)\Gamma(\sum_{t=1}^Vn_{k,\neg i}^{(t)}+\beta_t)}{\Gamma(n_{k,\neg i}^{(t)}+\beta_t))\Gamma(\sum_{t=1}^Vn_k^{(t)}+\beta_t)}\cdot\frac{\Gamma(n_m^{(k)}+\alpha_k)\Gamma(\sum_{k=1}^Kn_{m,\neg i}^{(k)}+\alpha_k)}{\Gamma(n_{m,\neg i}^{(k)}+\alpha_k)\Gamma(\sum_{k=1}^Kn_m^{(k)}+\alpha_k)}&(3)\\ =&\frac{n_{k,\neg i}^{(t)}+\beta_t}{\sum_{t=1}^Vn_{k,\neg i}^{(t)}+\beta_t}\cdot\frac{n_{m,\neg i}^{(k)}+\alpha_k}{[\sum_{k=1}^Kn_m^{(k)}+\alpha_k]-1}&(4)\\ \propto & \frac{n_{k,\neg i}^{(t)}+\beta_t}{\sum_{t=1}^Vn_{k,\neg i}^{(t)}+\beta_t}\cdot(n_{m,\neg i}^{(k)}+\alpha_k)&(5) \end{aligned}
p ( z i = k ∣ z
¬ i , w
) = ∝ = = ∝ p ( w
, z
¬ i ) p ( w
, z
) = p ( w
¬ i ∣ z
¬ i ) p ( w i ) p ( w
∣ z
) ⋅ p ( z
¬ i ) p ( z
) Δ ( n
k , ¬ i + β
) Δ ( n
k + β
) ⋅ Δ ( n
m , ¬ i + α
) Δ ( n
m + α
) Γ ( n k , ¬ i ( t ) + β t ) ) Γ ( ∑ t = 1 V n k ( t ) + β t ) Γ ( n k ( t ) + β t ) Γ ( ∑ t = 1 V n k , ¬ i ( t ) + β t ) ⋅ Γ ( n m , ¬ i ( k ) + α k ) Γ ( ∑ k = 1 K n m ( k ) + α k ) Γ ( n m ( k ) + α k ) Γ ( ∑ k = 1 K n m , ¬ i ( k ) + α k ) ∑ t = 1 V n k , ¬ i ( t ) + β t n k , ¬ i ( t ) + β t ⋅ [ ∑ k = 1 K n m ( k ) + α k ] − 1 n m , ¬ i ( k ) + α k ∑ t = 1 V n k , ¬ i ( t ) + β t n k , ¬ i ( t ) + β t ⋅ ( n m , ¬ i ( k ) + α k ) ( 1 ) ( 2 ) ( 3 ) ( 4 ) ( 5 ) 上面的公式推导做以下五点解释:
(1) 式和 (2) 式正比,是因为
p
(
w
i
)
p(w_i)
p ( w i ) 是一个常数;由于
¬
i
\neg i
¬ i 只涉及两个共轭结构,所以 (1) 式到 (2) 式约去了相同的
Δ
\Delta
Δ 函数。
(2) 式到 (3) 式,同样是因为 i 只与第 k 个主题和第 m 篇文档有关,将
Δ
\Delta
Δ 函数展开成
Γ
\Gamma
Γ 函数后,约去了相同的
Γ
\Gamma
Γ 函数。
(3) 式到 (4) 式利用了
Γ
\Gamma
Γ 函数的性质:
Γ
(
x
+
1
)
=
x
Γ
(
x
)
\Gamma(x+1)=x\Gamma(x)
Γ ( x + 1 ) = x Γ ( x ) ,进行约去。
(4) 式里有一个减 1,是因为把第 m 篇文档里要去掉的那个计数分离了出来。
因为第 m 篇文档的单词数是已知的,所以 (4) 式和 (5) 式是正比关系。
三、原来对 LDA 的一个疑惑,以及解答
疑问:LDA 并没有做任何词语语义方面的工作,仅仅是做一些数量上的统计,它是怎么挖掘出文档具有语义含义的主题的呢?
LDA 本质上是对词进行了聚类(依据某方面的相似度),聚的这个类就是主题。换句话说就是,按照主题给词进行了聚类。
既然没有做词语义方面的工作,那词之间的相似度是怎么确定的呢? 读完 LDA 著名的科普文章《Parameter estimation for text analysis》,它里面提到潜在主题是高阶共现的结果,即,t1 和 t2 共现,t2 和 t3 共现,表明 t1 和 t3 是一个二阶共现,以此类推。(共现,在不同上下文中共同出现。两个词越经常在不同的上下文共现,就越表示这两个词相似。)