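R implementation: library(tidyverse); library(gtools); library(topicmodels)  # tidyverse for %>%, gtools for rdirichlet, topicmodels for LDA
1. Ndocs = 500; WordsPerDoc = rpois(Ndocs, 100)  # 500 documents, each with a Poisson(100) number of words
2. thetaList = list(c(A=.60, B=.25, C=.15), c(A=.10, B=.10, C=.80))  # two sets of topic proportions over topics A, B, C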
theta_1 = t(replicate(Ndocs/2, thetaList[[1]]))
theta_2 = t(replicate(Ndocs/2, thetaList[[2]]))
theta = rbind(theta_1, theta_2)  # 500 rows, 3 columns; the columns correspond to topics A, B, C
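3. Z: a 3 x 500 matrix whose rows correspond to topics 1, 2, 3; each column marks the single topic chosen for one document by multinomial sampling from that document's row of theta
z: a vector of length 500; each element is the topic index assigned to the corresponding document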
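Step 3 is described only in words, so here is a minimal sketch of one way to draw Z and z in base R. It rebuilds theta from steps 1 and 2 so that it runs on its own; the seed is an assumption added only for reproducibility, and the notes do not say which exact calls were used originally.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible

Ndocs = 500
thetaList = list(c(A=.60, B=.25, C=.15), c(A=.10, B=.10, C=.80))
theta = rbind(t(replicate(Ndocs/2, thetaList[[1]])),
              t(replicate(Ndocs/2, thetaList[[2]])))

# One multinomial draw of size 1 per document gives a one-hot column per document
Z = apply(theta, 1, function(p) rmultinom(n = 1, size = 1, prob = p))

# The position of the 1 in each column is the topic label of that document
z = apply(Z, 2, which.max)

dim(Z)    # topics in rows, documents in columns
table(z)  # how many documents landed in each topic
```

Because theta has two blocks of rows, the first 250 documents should mostly fall in topic 1 and the last 250 mostly in topic 3, which table(z[1:250]) and table(z[251:500]) can confirm.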
4. Nterms = max(WordsPerDoc)  # vocabulary size
breaks = quantile(1:Nterms, c(.4,.6,1)) %>% round()
cuts = list(1:breaks[1], (breaks[1]+1):breaks[2], (breaks[2]+1):Nterms)
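To see what step 4 produces, the sketch below runs the same split on a fixed vocabulary size. Nterms = 100 is an assumed value chosen for illustration (in the notes it depends on the random WordsPerDoc), and round(...) is called directly instead of through %>% so that only base R is needed.

```r
Nterms = 100  # assumed value for illustration only

# Break points at 40%, 60% and 100% of the vocabulary
breaks = round(quantile(1:Nterms, c(.4, .6, 1)))

# Three consecutive, non-overlapping blocks of term indices, one per topic
cuts = list(1:breaks[1], (breaks[1]+1):breaks[2], (breaks[2]+1):Nterms)

lengths(cuts)  # number of terms in each block
```

Each block is the set of terms that step 5 gives a larger Dirichlet weight (10 instead of 1) for one topic, so each topic ends up favouring its own block of the vocabulary.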
5. B_k = matrix(0, ncol=3, nrow=Nterms)  # term-by-topic probability matrix, one column per topic
B_k[,1] = rdirichlet(n=1, alpha=c(rep(10, length(cuts[[1]])),rep(1, length(cuts[[2]])),rep(1, length(cuts[[3]]))))
B_k[,2] = rdirichlet(n=1, alpha=c(rep(1, length(cuts[[1]])),rep(10, length(cuts[[2]])),rep(1, length(cuts[[3]]))))
B_k[,3] = rdirichlet(n=1, alpha=c(rep(1, length(cuts[[1]])), rep(1, length(cuts[[2]])),rep(10, length(cuts[[3]]))))
6. wordlist_1 = sapply(1:Ndocs, function(i) t(rmultinom(1, size=WordsPerDoc[i], prob=B_k[,z[i]])), simplify = FALSE)  # 500 documents, each a 1 x Nterms vector of term counts
wordlist_2 = lapply(wordlist_1, function(wds) rep(1:length(wds), wds))  # expand the counts into term indices, one per word occurrence (bag of words)
dtm_1 = do.call(rbind, wordlist_1)  # document-term matrix
7. Write the documents to a txt file using the cat function inside a for loop.
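The notes do not show the export code, so the following is only a sketch of one possible layout: one document per line, with term indices separated by spaces. The toy list wordlist_toy stands in for wordlist_2, and the output path is a hypothetical temporary file.

```r
# Toy stand-in for wordlist_2: each element holds the term indices of one document
wordlist_toy = list(c(1, 1, 3), c(2, 5), c(4, 4, 4, 6))

# Hypothetical output path; a temporary file is used here
outfile = tempfile(fileext = ".txt")

for (i in seq_along(wordlist_toy)) {
  # Join the term indices of document i with spaces and append them as one line
  cat(paste(wordlist_toy[[i]], collapse = " "), "\n",
      file = outfile, sep = "", append = TRUE)
}

lines = readLines(outfile)
lines
```

With append = TRUE each pass of the loop adds a line instead of overwriting the file, so the file should be removed or recreated before the loop is rerun.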
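8. LDA_1 = LDA(dtm_1, k=3, control=controlSettings)  # controlSettings is a list of control options that must be defined beforehand; it is not shown in these notes
LDA_1Post = posterior(LDA_1)  # returns $terms (topic-term probabilities) and $topics (document-topic probabilities)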