[Python] LDA topic extraction for Chinese text | Using the visualization tool pyLDAvis

Implementing the LDA topic model and visualizing it with pyLDAvis

  1. Unsupervised extraction of document topics: the LDA model
    1.1 Preparation
    1.2 Calling the API to implement the model
  2. Visual interactive analysis of LDA: pyLDAvis
    2.1 Installing pyLDAvis
    2.2 Calling the API with gensim to produce the visualization
    PS: Saving the result as a standalone web page
    PPS: Speeding up prepare?
    2.3 How to analyze the visualization results of pyLDAvis
    2.3.1 What does each topic mean?
    2.3.2 How common is each topic?
    2.3.3 What are the connections between the topics?
I won't go into this model's details or application scenarios here. All you need to know is that you feed it a pile of texts and it hands back the specified number of topic clusters.

Because it is an unsupervised algorithm, you don't need to prepare a training set or do the painstaking work of labeling; at least half of the effort in the data preparation phase is saved.

As a mathematics novice, I have always been reluctant to dig into the underlying theory unless absolutely necessary. As long as someone has proven it works, I will use it. Smart heads should be put to their best use (laughs).

But maybe I can't keep dodging it and will have to learn the theory eventually. I'll park the explanation of the theory here for now and come back to it when I need it.

Therefore, this article mainly covers how to get started and how to analyze the results, which is what a data scientist should be doing anyway.

1. Unsupervised extraction of document topics: the LDA model

1.1 Preparation

Back to the main story. Taking this project as an example: I need to analyze the topics of 50 pages of Weibo posts for each of 500 users, roughly 500 × 50 × 10 = 250,000 posts (about 10 posts per page). In other words, I want to see what these people's Weibo posts are mainly about, and which concerns they share.

Programming environment:
python 3.6 + PyCharm 2018;
the package used for the LDA implementation is gensim;
for word segmentation, it is still our old friend jieba.

Data preparation: the crawled raw data is stored in CSV. What I want is to read all the Weibo texts into one txt file, then segment them and remove stopwords, which meets gensim's input format.

Before word segmentation:
[screenshot of the raw text]
After word segmentation:
[screenshot of the segmented text]
[code]
Used to preprocess the raw text:

import re
import jieba as jb


def stopwordslist(filepath):
    # Load the stopword list, one word per line.
    with open(filepath, 'r', encoding='utf-8') as f:
        stopwords = [line.strip() for line in f.readlines()]
    return stopwords

# Segment a sentence into words
def seg_sentence(sentence):
    sentence = re.sub(u'[0-9\.]+', u'', sentence)   # strip numbers
    jb.add_word('光线摄影学院')		# add user-defined words to supplement the jieba dictionary
    jb.add_word('曾兰老师')			# likewise, to drop a particular out-of-vocabulary word, add it here first and then put it in the stopword list
    jb.add_word('网页链接')
    jb.add_word('微博视频')
    jb.add_word('发布了头条文章')
    jb.add_word('青春有你')
    jb.add_word('青你')
    sentence_seged = jb.cut(sentence.strip())
    stopwords = stopwordslist('stopWords/stopwords.txt')  # path to the stopword list
    outstr = ''
    for word in sentence_seged:
        if word not in stopwords and len(word) > 1:
            if word != '\t':
                outstr += word
                outstr += " "
    return outstr


inputs = open('input/like_mi10_user_all_retweet.txt', 'r', encoding='utf-8')

outputs = open('output1/mi10_user_retweet_fc.txt', 'w', encoding='utf-8')
for line in inputs:
    line_seg = seg_sentence(line)  # the return value is a string
    outputs.write(line_seg + '\n')
outputs.close()
inputs.close()

1.2 Calling the API to implement the model

Two hours to prepare the data, three minutes to call the interface, and another two hours to wait for the result.

Gensim is very friendly: the dictionary, the bag-of-words corpus, and the LDA model each take one line.

[code]

import codecs

from gensim import corpora
from gensim.models import LdaModel


train = []

fp = codecs.open('output1/mi10_user_retweet_fc.txt', 'r', encoding='utf8')
for line in fp:
    if line.strip():
        train.append(line.split())
fp.close()

dictionary = corpora.Dictionary(train)

corpus = [dictionary.doc2bow(text) for text in train]

lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=20, passes=60)
# num_topics: the number of topics
# passes: the number of training passes

for topic in lda.print_topics(num_words=20):   # num_words: the number of terms printed per topic
    print(topic[0], ':', sep='')
    listOfTerms = topic[1].split('+')
    for term in listOfTerms:
        listItems = term.split('*')
        print('  ', listItems[1], '(', listItems[0], ')', sep='')

The final output is:

0:
  "北京" (0.024)
  "疫情" ( 0.021)
  "中国联通" ( 0.019)
  "领券" ( 0.019)
  "购物" ( 0.016)
  "新冠" ( 0.016)
  "专享" ( 0.012)
  "元券" ( 0.012)
  "确诊" ( 0.012)
  "上海" ( 0.011)
  "优惠" ( 0.011)
  "肺炎" ( 0.010)
  "新闻" ( 0.010)
  "病例" ( 0.010)
  "汽车"( 0.009)
1:
  "小米" (0.133)
  "Redmi" ( 0.019)
  "新浪" ( 0.019)
  "智慧" ( 0.018)
  "雷军" ( 0.014)
  "众测" ( 0.012)
  "体验" ( 0.012)
  "智能" ( 0.012)
  "MIUI" ( 0.012)
  "电视" ( 0.012)
  "红米" ( 0.011)
  "空调" ( 0.009)
  "产品" ( 0.009)
  "品牌" ( 0.009)
  "价格"( 0.008)
2:
  "抽奖" (0.056)
  "平台" ( 0.032)
  "评测" ( 0.022)
  "生活" ( 0.013)
  "红包" ( 0.013)
  "关注" ( 0.012)
  "这条" ( 0.012)
  "视频" ( 0.012)
  "工具" ( 0.011)
  "获得" ( 0.011)
  "有效" ( 0.011)
  "进行" ( 0.010)
  "恭喜" ( 0.010)
  "用户" ( 0.010)
  "公正"( 0.010)
 .....

And so on. The model simply returns the specified number of topics with their top words. The decimal after each word can be read as the probability that the word belongs to that topic; the probabilities of all words under a topic sum to 1. What each topic actually means is something you define through your own analysis afterwards.
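
Incidentally, if you'd rather not parse those strings yourself, gensim can hand back the (word, probability) pairs directly. A minimal sketch, assuming the lda and dictionary objects from above and that the token '小米' actually exists in the dictionary:

for topic_id in range(lda.num_topics):
    # show_topic returns a list of (word, probability) pairs for one topic
    print(topic_id, lda.show_topic(topic_id, topn=10))

word_id = dictionary.token2id['小米']    # look up the word's integer id
print(lda.get_term_topics(word_id, minimum_probability=0.01))    # the topics this word leans towards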

Then, staring at this pile of cold words and numbers, you can perhaps gauge the relationship between a word and a topic by its probability, but can you see the relationships between different topics? Can you see the relationship between a word and the other topics?

A little difficult. This is where our LDA visual analysis tool comes in.

2. Visual interactive analysis of LDA: pyLDAvis

First, a look at the end result:
[screenshot of the pyLDAvis interactive page]

2.1 Installing pyLDAvis

pip install pyldavis

2.2 Calling the API with gensim to produce the visualization

pyLDAvis supports direct input of LDA models from three packages: sklearn, gensim, and graphlab; it seems you can also compute the required inputs yourself. Here, of course, we continue with the gensim LDA model from above.
pyLDAvis is just as friendly: a single line completes the preparation:

import pyLDAvis
import pyLDAvis.gensim

'''insert the previous code snippets here'''

d = pyLDAvis.gensim.prepare(lda, corpus, dictionary)

'''
lda:        the trained topic model
corpus:     the document-term (bag-of-words) corpus
dictionary: the word space
'''

pyLDAvis.show(d)		# display in the browser
# pyLDAvis.display(d)	# display in a notebook output cell

The data volume is large, so this run takes a while.

PS: Saving the result as a standalone web page

If you want to save this result as a standalone web page, whether to share it or to put it into a web system, you can do this:

d = pyLDAvis.gensim.prepare(lda, corpus, dictionary)

pyLDAvis.save_html(d, 'lda_pass10.html')	# save the result to this html file

That way you don't have to re-run everything and wait around each time.
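
In the same spirit, the trained model itself can be saved, so retraining can be skipped as well. A minimal sketch using gensim's standard save/load (the file names here are made up):

lda.save('output1/lda_20topics.model')
dictionary.save('output1/mi10.dict')
corpora.MmCorpus.serialize('output1/mi10_corpus.mm', corpus)

# later, in another run:
from gensim.models import LdaModel
from gensim.corpora import Dictionary, MmCorpus

lda = LdaModel.load('output1/lda_20topics.model')
dictionary = Dictionary.load('output1/mi10.dict')
corpus = MmCorpus('output1/mi10_corpus.mm')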

PPS: Speeding up prepare?

Yes, this visualization step really is slow. I timed it with time: to save time while testing, gensim trained with just one pass, which took 58 seconds, and then I waited for pyLDAvis's prepare to render. Over an hour, about 4200 s, before it finally came out (saving it as a web page afterwards is quick). The trick I tried is to switch the dimensionality-reduction algorithm via the mds parameter:

d = pyLDAvis.gensim.prepare(lda, corpus, dictionary, mds='mmds')

Like this.
In actual testing, though, I'd say it had no noticeable effect.

This algorithm-selection parameter can also be set to tsne; for the differences between the algorithms, see the documentation.

2.3 How to analyze the visualization results of pyLDAvis

The resulting page looks complicated but isn't really. The bubbles on the left are the different topics; on the right are the top 30 terms for the selected topic. Light blue represents a word's frequency (weight) across the whole corpus; dark red represents its weight within the topic. In the upper right corner there is an adjustable parameter λ; more on that below.
So we can finally answer the three questions the authors of LDAvis set out to solve when they developed the tool:

2.3.1 What does each topic mean?

By hovering the mouse over a bubble on the left, we select a specific topic. Once selected, the panel on the right shows the vocabulary most related to that topic; by summarizing the meaning of those words, we can infer the meaning of the topic.

At the same time, which words carry more weight for a topic? The relevance ranking of words to the topic is adjusted via the λ parameter:

If λ is close to 1, then words that appear more frequently in the topic are ranked as more relevant to it;
if λ is close to 0, then words that are more special and exclusive to the topic are ranked as more relevant (somewhat in the spirit of TF-IDF).

Therefore, by adjusting λ we can change the relevance ranking of words to a topic and probe its meaning from more than one angle.
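
For reference, what the slider computes is the "relevance" score from the LDAvis paper (Sievert & Shirley, 2014). A minimal sketch of the formula in code:

import numpy as np

def relevance(p_w_given_t, p_w, lam):
    # relevance = lambda * log p(w|t) + (1 - lambda) * log(p(w|t) / p(w))
    # lam = 1 ranks by in-topic frequency; lam = 0 ranks by exclusivity (lift)
    return lam * np.log(p_w_given_t) + (1 - lam) * np.log(p_w_given_t / p_w)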

2.3.2 How common is each topic?

After running the topic modeling, we know how prevalent each topic is. The LDAvis authors use the area of each circle to represent this number, and the topics are numbered 1~n in order. So both the size and the number of a bubble indicate how common the topic is.
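
If you want this number outside the visualization, here is one rough way to estimate topic prevalence yourself, assuming the lda and corpus objects from above:

import numpy as np

prevalence = np.zeros(lda.num_topics)
for bow in corpus:
    # get_document_topics returns (topic_id, proportion) pairs for one document
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0):
        prevalence[topic_id] += prob
prevalence /= prevalence.sum()
print(prevalence)    # the fraction of the corpus attributed to each topic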

2.3.3 What are the connections between the topics?

Here the authors use multidimensional scaling, extracting principal components as dimensions and laying the topics out over those two dimensions, so the distance between topics expresses how close they are. The inter-bubble distance uses the JSD (Jensen-Shannon divergence), which can (probably) be read as the degree of difference between topics; overlapping bubbles indicate that the feature words of the two topics overlap.
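
To make that "JSD distance" concrete, here is a minimal sketch of how such a topic-to-topic distance could be computed by hand (not necessarily pyLDAvis's exact internals):

import numpy as np

def jsd(p, q):
    # Jensen-Shannon divergence between two discrete distributions
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

topics = lda.get_topics()    # shape: (num_topics, vocabulary_size)
print(jsd(topics[0], topics[1]))    # dissimilarity of topic 0 vs. topic 1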

Once you know all this, just let the words speak. Look at what the words in each cluster seem to be saying and distill the different topics; that is the output with practical application value. Without this final step, everything you did before is a pile of waste paper.

Let's stop here for now; leaving the theory aside, this should be enough.

Origin: blog.csdn.net/kz_java/article/details/114982528