Topic Model Demo

R implementation: library(tidyverse), library(gtools), library(topicmodels)

  1. Ndocs = 500; WordsPerDoc = rpois(Ndocs, 100)        # number of documents; words per document drawn from Poisson(100)

  2. thetaList = list(c(A=.60, B=.25, C=.15), c(A=.10, B=.10, C=.80))     # topics A, B, C
    theta_1 = t(replicate(Ndocs/2, thetaList[[1]]))
    theta_2 = t(replicate(Ndocs/2, thetaList[[2]]))
    theta = rbind(theta_1, theta_2)        # 500 rows x 3 columns, one column per topic A, B, C

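  3. Z: a 3 x 500 matrix whose rows correspond to topics 1, 2, 3; one topic is drawn for each document by multinomial sampling

    z: a vector of length 500; each element is the topic assigned to that document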
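The sampling in step 3 is described but not shown. A minimal base-R sketch of one way to do it, assuming theta is built as in step 2 and using a fixed seed (an assumption added here for reproducibility), might look like this:

```r
set.seed(1)  # assumption: fixed seed so the sketch is reproducible

Ndocs = 500
thetaList = list(c(A=.60, B=.25, C=.15), c(A=.10, B=.10, C=.80))
theta = rbind(t(replicate(Ndocs/2, thetaList[[1]])),
              t(replicate(Ndocs/2, thetaList[[2]])))

# For each document (row of theta) draw one topic from a multinomial.
# apply() stacks the 3 x 1 one-hot draws into a 3 x 500 indicator matrix.
Z = apply(theta, 1, rmultinom, n = 1, size = 1)

# Collapse each one-hot column to the index (1, 2 or 3) of the drawn topic.
z = apply(Z, 2, which.max)

dim(Z)      # 3 500
table(z)    # number of documents per sampled topic
```

Each column of Z contains a single 1, so z carries the same information in a more compact form.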

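  4. Nterms = max(WordsPerDoc)        # vocabulary size

    breaks = quantile(1:Nterms, c(.4,.6,1)) %>% round()        # cut points at 40%, 60% and 100% of the vocabulary

    cuts = list(1:breaks[1], (breaks[1]+1):breaks[2], (breaks[2]+1):Nterms)        # three blocks of term indices, one per topic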
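To make the partition in step 4 concrete, here is a small sketch with an assumed vocabulary size of 120 (in the demo Nterms comes from max(WordsPerDoc), so the real value varies from run to run). It uses base R only, so round() wraps quantile() instead of the pipe:

```r
Nterms = 120   # assumed value for illustration only

# Cut points at 40%, 60% and 100% of the term indices, rounded to integers.
breaks = round(quantile(1:Nterms, c(.4, .6, 1)))

# Three consecutive, non-overlapping blocks of term indices.
cuts = list(1:breaks[1], (breaks[1]+1):breaks[2], (breaks[2]+1):Nterms)

breaks          # 49 72 120
lengths(cuts)   # 49 23 48
```

The blocks cover every term index exactly once, which is what lets step 5 give each topic its own favored region of the vocabulary.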

  5. B_k = matrix(0, ncol=3, nrow=Nterms)        # term-by-topic probability matrix, one column per topic

    B_k[,1] = rdirichlet(n=1, alpha=c(rep(10, length(cuts[[1]])),rep(1, length(cuts[[2]])),rep(1, length(cuts[[3]]))))

    B_k[,2] = rdirichlet(n=1, alpha=c(rep(1, length(cuts[[1]])),rep(10, length(cuts[[2]])),rep(1, length(cuts[[3]]))))

    B_k[,3] = rdirichlet(n=1, alpha=c(rep(1, length(cuts[[1]])), rep(1, length(cuts[[2]])),rep(10, length(cuts[[3]]))))
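The effect of the alpha vectors in step 5 can be checked with a short sketch. To keep it free of package dependencies, it replaces gtools::rdirichlet with a hypothetical helper, rdirichlet_one, that draws from a Dirichlet by normalizing Gamma variates; the vocabulary size and blocks are assumed values, not taken from a real run:

```r
set.seed(1)  # assumption: fixed seed for reproducibility

# Hypothetical helper: one Dirichlet draw via normalized Gamma variates,
# used only as a base-R stand-in for gtools::rdirichlet(n = 1, alpha).
rdirichlet_one = function(alpha) {
  g = rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

Nterms = 120                        # assumed vocabulary size
cuts = list(1:49, 50:72, 73:120)    # assumed term blocks, one per topic

B_k = matrix(0, ncol = 3, nrow = Nterms)
for (k in 1:3) {
  alpha = rep(1, Nterms)
  alpha[cuts[[k]]] = 10             # concentrate probability on topic k's block
  B_k[, k] = rdirichlet_one(alpha)
}

colSums(B_k)                                      # each column sums to 1
sapply(1:3, function(k) sum(B_k[cuts[[k]], k]))   # mass on each topic's favored block
```

Each column is a probability distribution over terms, and most of its mass falls on the block whose alpha entries were set to 10.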

  6. wordlist_1 = sapply(1:Ndocs, function(i) t(rmultinom(1, size=WordsPerDoc[i], prob=B_k[,z[i]])), simplify = FALSE)        # 500 documents; each is an Nterms-length vector of per-term word counts, which can be converted to a bag of words

    wordlist_2 = lapply(wordlist_1, function(wds) rep(1:length(wds), wds))  

    dtm_1 = do.call(rbind, wordlist_1)      # document-term matrix
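The two conversions in step 6 are easier to follow on a toy input. The sketch below uses made-up counts for two documents over a 4-term vocabulary instead of the simulated data:

```r
# Toy counts for two documents over a 4-term vocabulary (made-up numbers).
# Each element is a 1 x 4 row, matching the shape produced by t(rmultinom(...)).
wordlist_1 = list(t(c(2, 0, 1, 0)), t(c(0, 3, 0, 1)))

# Expand counts into term indices: term j is repeated wds[j] times.
wordlist_2 = lapply(wordlist_1, function(wds) rep(1:length(wds), wds))
wordlist_2[[1]]   # 1 1 3
wordlist_2[[2]]   # 2 2 2 4

# Stack the 1 x 4 count rows into a 2 x 4 document-term matrix.
dtm_1 = do.call(rbind, wordlist_1)
dim(dtm_1)        # 2 4
```

wordlist_2 is the bag-of-words view (one entry per word occurrence), while dtm_1 keeps the count view (one column per term).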

  7. Use the cat function (inside a for loop) to write the documents to a txt file
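Step 7 does not give the file layout, so the following is only a sketch under assumptions: one document per line, term indices separated by spaces, written to a temporary file (outfile is a hypothetical path, and the documents are toy data).

```r
wordlist_2 = list(c(1, 1, 3), c(2, 2, 2, 4))   # toy documents as term indices

outfile = tempfile(fileext = ".txt")           # hypothetical output path

for (i in seq_along(wordlist_2)) {
  # Write one document per line; append so earlier lines are kept.
  cat(paste(wordlist_2[[i]], collapse = " "), "\n",
      sep = "", file = outfile, append = TRUE)
}

readLines(outfile)   # "1 1 3" "2 2 2 4"
```

Using paste() before cat() avoids a trailing space at the end of each line.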

  8. LDA_1 = LDA(dtm_1, k=3, control=controlSettings)        # controlSettings must be defined beforehand (not shown here)

    LDA_1Post = posterior(LDA_1)         # gives $terms and $topics

Reposted from www.cnblogs.com/yao1996/p/10221708.html