LDA in Practice 2 (NLP)

#!/usr/bin/python
# -*- coding:utf-8 -*-

import numpy as np
from gensim import corpora, models, similarities
from pprint import pprint
import time

# import logging
# logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)


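def load_stopword():
    # Return the stop words as a set so that each membership test is O(1);
    # the with-block closes the file even if reading fails.
    with open('22.stopword.txt') as f_stop:
        return {line.strip() for line in f_stop}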


if __name__ == '__main__':
    print('Loading the stop-word list --')
    t_start = time.time()
    stop_words = load_stopword()

    print('Reading the corpus --')
    # One document per line, tokens separated by whitespace (alternative file: 22.LDA_test.txt)
    with open('22.news.dat', 'r', encoding='UTF-8') as f:
        texts = [[word for word in line.strip().lower().split() if word not in stop_words] for line in f]
        # texts = [line.strip().split() for line in f]
    print('Corpus loaded in %.3f seconds' % (time.time() - t_start))
    M = len(texts)
    print('Number of documents: %d' % M)
    # pprint(texts)

    print('Building the dictionary --')
    dictionary = corpora.Dictionary(texts)
    V = len(dictionary)
    print('Converting documents to bag-of-words vectors --')
    corpus = [dictionary.doc2bow(text) for text in texts]
    print('Computing document TF-IDF --')
    t_start = time.time()
    corpus_tfidf = models.TfidfModel(corpus)[corpus]
    print('TF-IDF built in %.3f seconds' % (time.time() - t_start))
    print('Fitting the LDA model --')
    num_topics = 30
    t_start = time.time()
    lda = models.LdaModel(corpus_tfidf, num_topics=num_topics, id2word=dictionary,
                          alpha=0.01, eta=0.01, minimum_probability=0.001,
                          update_every=1, chunksize=100, passes=1)
    print('LDA model trained in\t%.3f seconds' % (time.time() - t_start))
    # Topics of all documents
    # doc_topic = [a for a in lda[corpus_tfidf]]
    # print('Document-Topic:\n')
    # pprint(doc_topic)

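    # Print the topics of 10 randomly chosen documents
    num_show_topic = 10  # number of top topics to show per document
    print('Topic distribution of 10 documents:')
    doc_topics = lda.get_document_topics(corpus_tfidf)  # topic distribution of every document
    idx = np.arange(M)
    np.random.shuffle(idx)
    idx = idx[:10]
    for i in idx:
        # Each row is (topic_id, probability). Topics below minimum_probability are
        # omitted, so a row's position is not necessarily its topic id.
        topic = np.array(doc_topics[i])
        topic_ids = topic[:, 0].astype(int)
        topic_distribute = topic[:, 1]
        # print(topic_distribute)
        order = topic_distribute.argsort()[:-num_show_topic-1:-1]  # rows by descending probability
        print('Top %d topics of document %d:' % (num_show_topic, i), topic_ids[order])
        print(topic_distribute[order])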
    num_show_term = 7   # number of words to show per topic
    print('Word distribution of each topic:')
    for topic_id in range(num_topics):
        term_distribute_all = lda.get_topic_terms(topicid=topic_id)
        term_distribute = term_distribute_all[:num_show_term]
        term_distribute = np.array(term_distribute)
        term_id = term_distribute[:, 0].astype(int)  # np.int was removed in NumPy 1.24
        # dictionary[t] is used because dictionary.id2token is only filled lazily
        words = ' '.join(dictionary[t] for t in term_id)
        print('Topic #%d:\t%s' % (topic_id, words))
        # print('Probabilities:\t', term_distribute[:, 1])
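
To make the bag-of-words and TF-IDF steps of the script more concrete, here is a minimal pure-Python sketch on a toy corpus. It is a simplified stand-in, not gensim's implementation: the helper names (`build_vocab`, `doc2bow_sketch`, `tfidf_sketch`) and the toy documents are hypothetical, and the weighting used here (raw count times log2(N/df), followed by L2 normalisation) is only assumed to approximate what `models.TfidfModel` does with its default settings.

```python
import math
from collections import Counter


def build_vocab(texts):
    """Assign an integer id to every distinct token, in first-seen order."""
    vocab = {}
    for text in texts:
        for token in text:
            vocab.setdefault(token, len(vocab))
    return vocab


def doc2bow_sketch(text, vocab):
    """Return a sorted list of (token_id, count) pairs for one document."""
    counts = Counter(vocab[t] for t in text if t in vocab)
    return sorted(counts.items())


def tfidf_sketch(corpus):
    """Weight each count by log2(N / df), then L2-normalise every document."""
    n_docs = len(corpus)
    df = Counter(token_id for doc in corpus for token_id, _ in doc)
    weighted = []
    for doc in corpus:
        vec = [(tid, cnt * math.log2(n_docs / df[tid])) for tid, cnt in doc]
        vec = [(tid, w) for tid, w in vec if w > 0.0]  # terms found in every document get weight 0
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(tid, w / norm) for tid, w in vec] if norm else [])
    return weighted


# Hypothetical toy corpus: three already-tokenised documents.
texts = [['earthquake', 'magnitude', 'earthquake'],
         ['election', 'candidate', 'magnitude'],
         ['election', 'earthquake', 'rescue']]
vocab = build_vocab(texts)
corpus = [doc2bow_sketch(t, vocab) for t in texts]
corpus_tfidf = tfidf_sketch(corpus)
for bow, tfidf in zip(corpus, corpus_tfidf):
    print(bow, tfidf)
```

Each document becomes a sparse list of `(token_id, weight)` pairs, which is the same shape of input that the script passes to `models.LdaModel`.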

Output:

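Loading the stop-word list --
Reading the corpus --
Corpus loaded in 27.582 seconds
Number of documents: 2043
Building the dictionary --
Converting documents to bag-of-words vectors --
Computing document TF-IDF --
TF-IDF built in 0.807 seconds
Fitting the LDA model --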

D:\anaconda\lib\site-packages\gensim\models\ldamodel.py:775: RuntimeWarning: divide by zero encountered in log
  diff = np.log(self.expElogbeta)

LDA model trained in	33.010 seconds
Topic distribution of 10 documents:
Top 10 topics of document 328: [24 26 20 22 14 28 23  1 29 21]
[0.32498136 0.13291538 0.09252059 0.08302885 0.06013632 0.05602695
 0.04116545 0.03319487 0.03315827 0.03170978]
Top 10 topics of document 940: [24 26 20 22 28  2 17  1 19 27]
[0.28736445 0.22418046 0.08603641 0.06190666 0.05775709 0.05627507
 0.03853163 0.03174159 0.0307487  0.02022503]
Top 10 topics of document 401: [26 24 20 28 22  4 16 14  1 11]
[0.25461864 0.20349854 0.14095098 0.0621846  0.05669058 0.05252533
 0.04764999 0.02567744 0.02455792 0.02060427]
Top 10 topics of document 1850: [24 26 20 28 21 23 22 14 13 29]
[0.37113318 0.1396127  0.10154593 0.07697127 0.0757141  0.03459946
 0.03311493 0.02867039 0.02717257 0.01931514]
Top 10 topics of document 1372: [24 26 14 20 21 28  2 17 19  3]
[0.40460956 0.12290253 0.07058487 0.0551524  0.04821464 0.03124265
 0.02871498 0.02844684 0.0271777  0.02449976]
Top 10 topics of document 1044: [24 20 26 14 28  2 29 16 22  1]
[0.35088214 0.12215162 0.11493756 0.10383987 0.04632907 0.04241408
 0.04138268 0.01992126 0.01953204 0.01868028]
Top 10 topics of document 1602: [24 26  2 14  1 22 20 29  5 13]
[0.23583829 0.20576693 0.14781311 0.10545766 0.04691855 0.0448094
 0.0306744  0.02779169 0.0263801  0.0243732 ]
Top 10 topics of document 1236: [10 26 24 22 28 20 14 16  5  4]
[0.27102286 0.26928478 0.16590761 0.09468424 0.03979244 0.03639244
 0.02641495 0.02093051 0.01275704 0.01242494]
Top 10 topics of document 1461: [24 26 22 20  1 14  4 16 29 13]
[0.47487974 0.1270204  0.07585566 0.0679674  0.04869295 0.03829511
 0.03614176 0.03028388 0.019985   0.01273066]
Top 10 topics of document 1124: [24 26 22 20  1 12 28 14 11 29]
[0.46652177 0.16891517 0.06392604 0.03915148 0.03790305 0.03466016
 0.02816303 0.02462978 0.02444719 0.02186517]
Word distribution of each topic:
Topic #0:	地震 度 北纬 级 东经 震源 千米
Topic #1:	台湾 人民 大陆 蔡 英文 大选 大熊猫
Topic #2:	创新 战略 规划 我国 推进 促进 推动
Topic #3:	习近平 渔民 广大 西 保障 男 虚假
Topic #4:	广岛 陈 中国台湾 原子弹 奥巴马 费 今日
Topic #5:	朝鲜 金正恩 朝鲜劳动党 导弹 全国代表大会 小姐 一边
Topic #6:	安置 持刀 垃圾 抢劫 每日 超市 所幸
Topic #7:	行凶 腹部 捅 女方 测试 媒称 安保
Topic #8:	嫌犯 冯 今天上午 抑郁症 评定 险些 割腕自杀
Topic #9:	开班式 庸政 怠政 怕苦怕累 态度暧昧 心安理得 息事宁人
Topic #10:	叙利亚 土耳其 平壤 劳动党 韩联社 北部 法国
Topic #11:	林 高速 收取 合理 科技 均衡 交通事故
Topic #12:	越南 渔船 被告人 公约 海域 捕鱼 当选
Topic #13:	大同市 司机 吃 驾驶 车辆 裤子 路边
Topic #14:	党 资金 历史 收费 国务院 书记 启动
Topic #15:	密切 世卫 奥运会 管理机构 协作 尼日利亚 畸形
Topic #16:	公民 参议员 航母 结果显示 损失 车主 执法人员
Topic #17:	督察 环保 督察组 河北 男生 中方 污染
Topic #18:	下滑 旗下 国有 神华集团 营业 主办 赵毅波
Topic #19:	增加 海南 革命 户籍 海口市 品牌 资本
Topic #20:	李 女士 回答 广东省 医生 对方 结婚
Topic #21:	雷洋 李某 特朗普 候选人 共和党 陈某 穆斯林
Topic #22:	孩子 菲律宾 特朗普 投资 农村 总理 村
Topic #23:	省 子女 占 特殊 每年 港 中国政府
Topic #24:	警方 阅读 人 不 学生 男子 公司
Topic #25:	客机 追尾 覆盖 波音 垂直 受损 刮
Topic #26:	中国 日本 美国 经济 发展 国家 总统
Topic #27:	镇 小男孩 儿童 口 晚报 穿着 尽量
Topic #28:	拆违 腐败 村民 环球网 民间 明天 江西
Topic #29:	老人 回家 山西省 公共 逃逸 路段 火箭
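
The per-document listing in the output is produced by ranking each document's topics by probability. Since `get_document_topics` returns sparse `(topic_id, probability)` pairs and leaves out topics below `minimum_probability`, the ranking has to carry the topic id along with the probability. A minimal pure-Python sketch of that selection, in which the helper `top_topics` and the sample values are hypothetical and not taken from the run above:

```python
def top_topics(doc_topics, k):
    """Return the k most probable (topic_id, probability) pairs, best first."""
    return sorted(doc_topics, key=lambda pair: pair[1], reverse=True)[:k]


# Hypothetical sparse distribution: topics below the probability threshold are
# absent, so list positions (0, 1, 2, 3) differ from topic ids (1, 14, 24, 26).
sample = [(1, 0.05), (14, 0.10), (24, 0.60), (26, 0.25)]
best = top_topics(sample, k=2)
print([topic_id for topic_id, _ in best])  # topic ids, most probable first
print([prob for _, prob in best])          # matching probabilities
```

Sorting the pairs themselves, rather than only the probability column, keeps every probability attached to the topic it belongs to.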
