Implementing LDA with gensim and computing perplexity (gensim Perplexity Estimates in LDA Model)

Two quoted answers on how to interpret the value returned by bound():

Neither. The values coming out of bound() depend on the number of topics (as well as the number of words), so they're not comparable across different num_topics (or different test corpora).

No, the opposite: a smaller bound value implies deterioration. For example, a bound of -6000 is "better" than -7000 (bigger is better).
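
To make this concrete, here is a minimal sketch (toy corpus, hypothetical variable names): two models trained with the same num_topics and scored on the same held-out documents can be compared through bound(), and the larger (less negative) value indicates the better fit.

from gensim.corpora import Dictionary
from gensim.models import LdaModel

docs = [["a", "a", "b"], ["a", "c", "g"], ["c"], ["a", "c", "g"]]
dct = Dictionary(docs)
corpus = [dct.doc2bow(doc) for doc in docs]
c_train, c_test = corpus[:2], corpus[2:]

# Same num_topics, same held-out corpus -> the bounds are directly comparable;
# bound() is a variational lower bound on the held-out log-likelihood, so bigger is better.
m1 = LdaModel(corpus=c_train, num_topics=2, id2word=dct, passes=1)
m2 = LdaModel(corpus=c_train, num_topics=2, id2word=dct, passes=20)
print(m1.bound(c_test), m2.bound(c_test))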
====================================================

You can use the log_perplexity method to evaluate your LdaModel.

A small code example:

from gensim.models import LdaModel
from gensim.corpora import Dictionary
import numpy as np

docs = [["a", "a", "b"], 
        ["a", "c", "g"], 
        ["c"],
        ["a", "c", "g"]]

dct = Dictionary(docs)
corpus = [dct.doc2bow(_) for _ in docs]
c_train, c_test = corpus[:2], corpus[2:]

ldamodel = LdaModel(corpus=c_train, num_topics=2, id2word=dct)
# log_perplexity returns the per-word likelihood bound for the held-out documents
per_word_bound = ldamodel.log_perplexity(c_test)
print(per_word_bound)
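
Note that log_perplexity() returns the per-word likelihood bound (a negative number), not the perplexity itself; by gensim's convention (also written to the INFO log), the perplexity is 2 raised to minus that bound. Continuing the example above:

perplexity = np.exp2(-per_word_bound)   # perplexity = 2 ** (-per-word bound)
print(perplexity)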


I am attempting to estimate an LDA topic model for a corpus of ~59,000 documents and ~500,000 unique tokens. I would prefer to estimate the final model in R so I can use its visualization tools for interpreting my results; however, I first need to select the number of topics for my model. Since I have no intuition about how many topics are in the latent structure, I was going to estimate a series of models with the number of topics k = 20, 25, 30... and use the perplexity of each model to determine the optimal number of topics, as recommended in Blei (2003). The only packages for estimating LDA in R that I am aware of (lda and topicmodels) use batch LDA, and whenever I estimate a model with more than 70 topics I run out of memory (and this is on a supercomputing cluster with up to 96 GB of RAM per processor). I thought I could use gensim to estimate the series of models using online LDA, which is much less memory-intensive, calculate the perplexity on a held-out sample of documents, select the number of topics from these results, and then estimate the final model using batch LDA in R.

The steps I followed are:
Generate the corpus from a series of text files in R, exporting the document-term matrix and dictionary in MM format.
Import the corpus and dictionary in Python.
Split the corpus into training/test datasets.
Estimate the LDA model using the training data.
Calculate bound and per-word perplexity using the test data.
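
A rough sketch of steps 3-5 in gensim (file names and the candidate list of k values are hypothetical; the actual script used is given further below):

import numpy as np
import gensim

# Hypothetical paths: the dictionary and MM corpus exported from R (steps 1-2)
id2word = gensim.corpora.Dictionary.load_from_text('dict.dict')
corpus = list(gensim.corpora.MmCorpus('dtm.mtx'))

# Step 3: hold out 20% of the documents as a test set
split = int(len(corpus) * 0.8)
c_train, c_test = corpus[:split], corpus[split:]
n_test_words = sum(cnt for doc in c_test for _, cnt in doc)

# Steps 4-5: fit one model per candidate number of topics and compare
# the held-out bound and per-word perplexity
for k in (20, 25, 30, 35, 40):
    lda = gensim.models.LdaModel(corpus=c_train, id2word=id2word, num_topics=k,
                                 chunksize=1000, passes=2)
    bound = lda.bound(c_test)
    print(k, bound, np.exp2(-bound / n_test_words))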
My understanding is that perplexity should always decrease as the number of topics increases, so the optimal number of topics should be where the marginal change in perplexity becomes small. However, whenever I estimate the series of models, perplexity in fact increases with the number of topics. The perplexity values for k = 20, 25, 30, 35, 40 are:

Perplexity (20 topics):  -44138604.0036
Per-word Perplexity:  542.513884961
Perplexity (25 topics):  -44834368.1148
Per-word Perplexity:  599.120014719
Perplexity (30 topics):  -45627143.4341
Per-word Perplexity:  670.851965367
Perplexity (35 topics):  -46457210.907
Per-word Perplexity:  755.178877447
Perplexity (40 topics):  -47294658.5467
Per-word Perplexity:  851.001209258
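
The raw value from lda.bound() and the per-word perplexity in each pair above are linked by perplexity = 2 ** (-bound / N), where N is the number of tokens in the held-out sample. A quick check on the reported values (copied verbatim) recovers the same N for every k, roughly 4.9 million test tokens, so the conversion itself is at least consistent:

import numpy as np

bounds = [-44138604.0036, -44834368.1148, -45627143.4341, -46457210.907, -47294658.5467]
perplexities = [542.513884961, 599.120014719, 670.851965367, 755.178877447, 851.001209258]
for b, p in zip(bounds, perplexities):
    print(-b / np.log2(p))   # implied held-out token count; ~4.86e6 in every row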

Potential problems I've already thought of:
Is the model not running long enough to converge properly? I set the chunk size to 1000, so there should be 40-50 passes and by the last chunk I am seeing 980+/1000 documents converging within 50 iterations.
Am I not understanding what the lda.bound function is estimating?
Do I need to trim the dictionary more? I've already removed all tokens below the median TF-IDF score, so I cut the original dictionary in half.
Is my problem that I am using R to build the dictionary and corpus? I compared, in a text editor, the dictionary and MM corpus files generated from R with a smaller test dictionary/corpus built in gensim, and I do not see any differences in how the information is coded. I want to use R to build the corpus so that I am using exactly the same corpus for online LDA as I will use for the final model in R, and I do not know how to convert a gensim corpus into an R document-term matrix object (one possible route is sketched just below).
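
On the last point, one possible route (a sketch, not the method used here; the file name is hypothetical) is to serialize a gensim corpus back to Matrix Market format, which R can read with Matrix::readMM() and then convert to whatever document-term matrix class a package expects:

from gensim.corpora import Dictionary, MmCorpus

docs = [["topic", "model"], ["topic", "perplexity"]]
dct = Dictionary(docs)
corpus = [dct.doc2bow(doc) for doc in docs]

# Any corpus in gensim's bag-of-words format can be written out as a .mm file
MmCorpus.serialize('corpus_for_r.mm', corpus)
# In R:  m <- Matrix::readMM("corpus_for_r.mm")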


The script I use is:

import logging
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

import numpy
import scipy
import gensim

import random
random.seed(11091987)           #set random seed


# load id->word mapping (the dictionary)
id2word =  gensim.corpora.Dictionary.load_from_text('../dict.dict')

# load corpus
## add top line to MM file since R does not automatically add this
## and save new version
with open('../dtm.mtx') as f:
    dtm = f.read()
    dtm = "%%MatrixMarket matrix coordinate real general\n" + dtm

with open('dtm.mtx', 'w+') as f:
    f.write(dtm)


corpus = gensim.corpora.MmCorpus('dtm.mtx')

print id2word
print corpus

# shuffle corpus
cp = list(corpus)
random.shuffle(cp)

# split into 80% training and 20% test sets
p = int(len(cp) * .8)
cp_train = cp[0:p]
cp_test = cp[p:]

import time
start_time = time.time()

lda = gensim.models.ldamodel.LdaModel(corpus=cp_train, id2word=id2word, num_topics=25,
                                      update_every=1, chunksize=1000, passes=2)

elapsed = time.time() - start_time
print('Elapsed time: '),
print elapsed


print lda.show_topics(topics=-1, topn=10, formatted=True)

print('Perplexity: '),
perplex = lda.bound(cp_test)
print perplex

print('Per-word Perplexity: '),
print numpy.exp2(-perplex / sum(cnt for document in cp_test for _, cnt in document))

elapsed = time.time() - start_time
print('Elapsed time: '),
print elapsed
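
As an aside (not part of the original script): the manual bound / token-count calculation above should match what log_perplexity gives directly, since that method returns the per-word bound for the evaluation chunk:

per_word_bound = lda.log_perplexity(cp_test)
print(numpy.exp2(-per_word_bound))   # same per-word perplexity as the manual computation above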

For more details, see the gensim documentation.


Reposted from blog.csdn.net/qq_25073545/article/details/79773807