Running cvb fails with the following error:
org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
Solution:
The new LDA requires SequenceFile<IntWritable, VectorWritable> as input (the same on-disk format as DistributedRowMatrix). You can produce it from SequenceFile<Text, VectorWritable> by running RowIdJob (see "$MAHOUT_HOME/bin/mahout rowid -h" for details) before running CVB.
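The two steps might look like the sketch below. The input path (docsvectors3/tf-vectors), the rowid output directory, the temp directory, the topic count and the iteration count are assumptions for illustration, not values taken from the original run. The script only prints the commands unless RUN is set to an empty string.

```shell
# Dry run by default: commands are printed, not executed.
# Run with RUN= (empty) to actually invoke mahout.
RUN=${RUN-echo}

# Assumed locations; adjust to your cluster.
BASE=hdfs://192.168.122.1:2014/user/zhaohj/mahout/topics
VECTORS=$BASE/docsvectors3/tf-vectors   # assumption: seq2sparse term-frequency vectors
ROWID_OUT=$BASE/matrix                  # hypothetical output directory for rowid
NUM_TOPICS=3                            # hypothetical; choose for your corpus

# 1. Convert Text keys to IntWritable keys. RowIdJob writes two outputs:
#    $ROWID_OUT/matrix   (SequenceFile<IntWritable, VectorWritable>)
#    $ROWID_OUT/docIndex (int id -> original Text key)
$RUN mahout rowid -i "$VECTORS" -o "$ROWID_OUT"

# 2. Run CVB on the converted matrix (20 iterations is a placeholder).
$RUN mahout cvb \
  -i "$ROWID_OUT/matrix" \
  -o "$BASE/lda/topic-term" \
  -dt "$BASE/lda/doc-topic" \
  -dict "$BASE/docsvectors3/dictionary.file-0" \
  -mt "$BASE/lda/tmp" \
  -k "$NUM_TOPICS" -x 20 -ow
```

Note that cvb must be pointed at the matrix subdirectory, not at the rowid output directory itself; docIndex is only needed later to map integer ids back to the original document keys.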
Interpreting the results
doc-topic
mahout vectordump -i hdfs://192.168.122.1:2014/user/zhaohj/mahout/topics/lda/doc-topic -o data/lda/doc-topic -sort true -vs 1 -p true
Note: combined with -sort true, -vs 1 dumps only the highest-probability topic for each document, for example:
#doc-index topic-id:probability
0 {1:0.9999999918613426}
1 {2:0.999999958633294}
2 {0:0.9999999872590848}
3 {0:0.9999999914501596}
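Warning: do not pass the -d option when dumping doc-topic, otherwise you'll get meaningless output. The -d option maps vector indices to dictionary terms, but in doc-topic the indices are topic ids, not term ids.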
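To summarize a doc-topic dump, the dominant topic of each document can be counted with awk. This is a minimal sketch that assumes each line has the form "doc-index {topic-id:probability,...}" with whitespace between the two fields and the most probable topic first (as produced with -sort true). The function name count_docs_per_topic is hypothetical, and the sample lines are the four documents shown in the doc-topic example.

```shell
# Count how many documents have each topic as their top topic.
# Reads the dump from the given file(s), or from stdin if none are given.
count_docs_per_topic() {
  awk '
    /^#/ || NF < 2 { next }        # skip comment lines and blank lines
    {
      entry = $2                   # e.g. {1:0.9999999918613426}
      gsub(/[{}]/, "", entry)      # strip the braces
      split(entry, kv, ":")        # kv[1] is the first (top) topic id
      count[kv[1]]++
    }
    END { for (t in count) print t, count[t] }
  ' "$@" | sort -n
}

# Sample input: the four documents from the dump shown earlier.
sample='0 {1:0.9999999918613426}
1 {2:0.999999958633294}
2 {0:0.9999999872590848}
3 {0:0.9999999914501596}'

printf '%s\n' "$sample" | count_docs_per_topic
# prints:
# 0 2
# 1 1
# 2 1
```

For a real run, pass the local dump file instead, for example count_docs_per_topic data/lda/doc-topic.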
topic-term
mahout vectordump -i hdfs://192.168.122.1:2014/user/zhaohj/mahout/topics/lda/topic-term -o data/lda/topic-term -d hdfs://192.168.122.1:2014/user/zhaohj/mahout/topics/docsvectors3/dictionary.file-0 -dt sequencefile -sort true -vs 5 -p true
References
http://mail-archives.apache.org/mod_mbox/mahout-user/201205.mbox/%3CCAG3i8Se1QobSPpw8ewgNkjVw_Zd_8crb6Z18_7G5Yqew1XRTAw@mail.gmail.com%3E
http://stackoverflow.com/questions/21318459/how-to-run-mahout-cvb-on-reuters-news-on-cloudera-vm-cdh4-5-as-lda-is-not-longer