Mahout: CVB

When running cvb, the following error occurs:

org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable

Solution:

The new LDA implementation requires SequenceFile<IntWritable, VectorWritable> as input
(the same on-disk format as DistributedRowMatrix). You can get this from a
SequenceFile<Text, VectorWritable> by running RowIdJob
(see "$MAHOUT_HOME/bin/mahout rowid -h" for details) before running CVB.
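The two steps can be sketched as a small shell function. This is only a sketch under assumptions: every path, the topic count (-k 3), and the iteration count (-x 20) are placeholders that do not come from this post, and the cvb flags follow the invocation in the Stack Overflow question listed under References, so they may differ between Mahout versions. The run helper and the DRY_RUN switch are hypothetical conveniences for previewing the commands.

```shell
# Sketch: convert Text-keyed vectors to IntWritable-keyed rows, then run CVB.
# All paths and the -k / -x values are placeholders, not values from this post.
MAHOUT="${MAHOUT:-mahout}"

# Hypothetical helper: print the command when DRY_RUN=1, execute it otherwise.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# $1: SequenceFile<Text, VectorWritable> input (for example seq2sparse vectors)
# $2: working directory for the rowid output and the CVB results
# $3: dictionary file
rowid_then_cvb() {
  vectors="$1"; work="$2"; dict="$3"

  # rowid writes $work/rowid/matrix (IntWritable keys) and
  # $work/rowid/docIndex (row id -> original Text key).
  run "$MAHOUT" rowid -i "$vectors" -o "$work/rowid" || return 1

  # CVB reads the IntWritable-keyed matrix, not the original vectors.
  run "$MAHOUT" cvb \
    -i "$work/rowid/matrix" \
    -o "$work/lda/topic-term" \
    -dt "$work/lda/doc-topic" \
    -mt "$work/lda/model-tmp" \
    -dict "$dict" \
    -k 3 -x 20
}
```

Calling it with DRY_RUN=1 prints both commands without touching the cluster, which is a cheap way to check the paths first. The docIndex file is what lets you map the integer row ids in the CVB output back to the original document keys.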

Interpreting the results

doc-topic

mahout vectordump \
  -i hdfs://192.168.122.1:2014/user/zhaohj/mahout/topics/lda/doc-topic \
  -o data/lda/doc-topic \
  -sort true -vs 1 -p true

 Note: -vs 1 dumps only one entry per document; together with -sort true, that entry is the topic with the highest probability for the document. For example:

#doc-index    topic-id:probability
0	      {1:0.9999999918613426}
1	      {2:0.999999958633294}
2	      {0:0.9999999872590848}
3	      {0:0.9999999914501596}

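 Warning: do not pass the -d option when dumping doc-topic, otherwise you will get meaningless output. The dictionary maps term ids to terms, but the indices in a doc-topic vector are topic ids.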
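Once the dump is on local disk, a quick tally of documents per topic is a useful sanity check. The following is a minimal sketch, assuming the dump file has the shape of the sample above (a document index, whitespace, then a single {topic-id:probability} entry); count_docs_per_topic is a hypothetical helper name.

```shell
# Count how many documents were assigned to each topic.
# Expects lines like: 0<TAB>{1:0.9999999918613426}
# $1: local dump file written by vectordump with -vs 1
count_docs_per_topic() {
  awk '
    /^#/ || NF < 2 { next }        # skip header/comment and blank lines
    {
      entry = $2                   # e.g. {1:0.9999999918613426}
      gsub(/[{}]/, "", entry)      # -> 1:0.9999999918613426
      split(entry, kv, ":")        # kv[1] = topic id, kv[2] = probability
      count[kv[1]]++
    }
    END { for (t in count) print t "\t" count[t] }
  ' "$1" | sort -n
}
```

For example, count_docs_per_topic data/lda/doc-topic prints one line per topic, with the topic id followed by the number of documents whose top topic it is.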

topic-term

mahout vectordump \
  -i hdfs://192.168.122.1:2014/user/zhaohj/mahout/topics/lda/topic-term \
  -o data/lda/topic-term \
  -d hdfs://192.168.122.1:2014/user/zhaohj/mahout/topics/docsvectors3/dictionary.file-0 \
  -dt sequencefile \
  -sort true -vs 5 -p true
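To read the topic-term dump at a glance, the terms can be pulled out of each line. This sketch rests on an assumption about the output format: that each line mirrors the doc-topic sample, with dictionary terms in place of topic ids, for instance a made-up line such as 0<TAB>{apple:0.12,banana:0.08}. The helper name list_topic_terms is hypothetical, and terms that themselves contain a comma or colon would not be handled correctly.

```shell
# Print the terms of each topic on one line, dropping the weights.
# Assumed input shape (not verified here): <topic-id><TAB>{term:weight,term:weight,...}
# $1: local dump file written by vectordump with -d and -dt
list_topic_terms() {
  awk '
    /^#/ || NF < 2 { next }                  # skip header/comment and blank lines
    {
      topic = $1
      entry = $0
      sub(/^[^ \t]+[ \t]+/, "", entry)       # drop the leading topic id
      gsub(/[{} ]/, "", entry)               # drop braces and stray spaces
      n = split(entry, pairs, ",")           # one term:weight pair per element
      line = "topic " topic ":"
      for (i = 1; i <= n; i++) {
        split(pairs[i], kv, ":")             # kv[1] = term, kv[2] = weight
        line = line " " kv[1]
      }
      print line
    }
  ' "$1"
}
```

With -vs 5 as in the command above, each topic line would carry at most five terms.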

References

http://mail-archives.apache.org/mod_mbox/mahout-user/201205.mbox/%3CCAG3i8Se1QobSPpw8ewgNkjVw_Zd_8crb6Z18_7G5Yqew1XRTAw@mail.gmail.com%3E 

 http://stackoverflow.com/questions/21318459/how-to-run-mahout-cvb-on-reuters-news-on-cloudera-vm-cdh4-5-as-lda-is-not-longer

Reposted from ylzhj02.iteye.com/blog/2082695