cntopic library: support for Chinese and English LDA topic analysis

cntopic is a simple, easy-to-use LDA topic modeling library that supports both Chinese and English. It is built on gensim and pyLDAvis, and implements LDA topic modeling and visualization.

A video walkthrough of this article has been uploaded to Bilibili (it should pass review tonight). You can follow the author's Bilibili account: Deng and his python.
Installation


pip install cntopic

Usage
Let me first introduce a scenario. Suppose you have collected news data but forgot to record the category of each article. Labeling them manually would be very laborious. Instead, we can use the LDA topic model to uncover patterns in the data and discover that the news falls into n topic groups; the model then automatically tags the data as topic_1, topic_2, topic_3, ..., topic_n.

The researchers' workload is then limited to interpreting topic_1, topic_2, topic_3, ..., topic_n.

The LDA workflow is roughly divided into:

  1. Read the file
  2. Prepare the data
  3. Train the LDA model
  4. Use the LDA model
  5. Store and reload the LDA model
1. Read the file
Here we use a news dataset with 10 categories and 1,000 articles per category, covering:

'Fashion', 'Finance', 'Technology', 'Education', 'Home Furnishing', 'Sports', 'Current Affairs', 'Games', 'Real Estate', 'Entertainment'


import pandas as pd

df = pd.read_csv('chinese_news.csv')
df.head()

Label distribution


df['label'].value_counts()

Home 1000
Fashion 1000
Real Estate 1000
Current Affairs 1000
Education 1000
Games 1000
Finance 1000
Entertainment 1000
Sports 1000
Technology 1000
Name: label, dtype: int64
2. Prepare the data
Data preparation generally includes:

  1. Word segmentation and data cleaning
  2. Organizing the data into the format the module requires (similar to scikit-learn)

Note:

  • English text does not need word segmentation; just pass it in as-is.
  • Chinese text needs to be segmented first and then arranged like space-separated English, e.g. "我 爱 中国" ("I love China").

import jieba

def text2tokens(raw_text):
    # Segment raw_text into a list of tokens
    tokens = jieba.lcut(raw_text)
    # tokens = raw_text.lower().split(' ')  # for English, splitting on spaces is enough
    tokens = [t for t in tokens if len(t) > 1]  # drop single-character tokens
    return tokens

# Segment every text in the content column
documents = [text2tokens(txt)
             for txt in df['content']]

# Show the first 5 documents
print(documents[:5])

[..., ['Walker', 'Connecticut', 'University', 'Head Coach', 'Jim', 'Cajun', ...], ['Praise', 'One', 'Pure', 'Point Guard', 'And', 'can be', 'us', 'scoring', 'single game', '42', 'have ever', 'single game', '17', 'assist', ...]]

3. Train the LDA model
Now we officially use the cntopic module to start the LDA topic analysis. The steps are:

  1. Build the dictionary
  2. Build the corpus (document-term matrix)
  3. Train the LDA model
Here we train the LDA topic model with n_topics=10. In general, you may need to experiment with several values of n_topics to find the best one; a sketch of one way to do that follows the training output below.


While it runs, an output folder is created in the directory containing your code, holding:

  • dictionary.dict, the dictionary file
  • lda.model.xxx, several LDA model files, where xxx stands for the suffixes gensim appends when saving (e.g. state, id2word, expElogbeta.npy)


The following training code takes a long time to run; please wait patiently for it to finish.

import os
from cntopic import Topic

topic = Topic(cwd=os.getcwd())
topic.create_dictionary(documents=documents)  # build the dictionary from documents
topic.create_corpus(documents=documents)      # build the corpus (document-term matrix)
topic.train_lda_model(n_topics=10)            # train the LDA model with n_topics topics

<gensim.models.ldamulticore.LdaMulticore at 0x158da5090>
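Since the best n_topics usually takes experimentation, here is a minimal sketch of one common approach: train candidate models and score them with gensim's CoherenceModel, keeping the value with the highest coherence. It assumes train_lda_model() returns the underlying gensim model (as the output above suggests); the candidate list and the plain gensim Dictionary built for scoring are our own additions, not part of cntopic.

from gensim.corpora import Dictionary
from gensim.models import CoherenceModel

scoring_dict = Dictionary(documents)  # plain gensim dictionary, used only for scoring

for n in [5, 10, 15, 20]:  # arbitrary candidate values
    t = Topic(cwd=os.getcwd())  # note: each run overwrites the output folder
    t.create_dictionary(documents=documents)
    t.create_corpus(documents=documents)
    lda = t.train_lda_model(n_topics=n)  # appears to return the gensim LDA model
    cm = CoherenceModel(model=lda, texts=documents,
                        dictionary=scoring_dict, coherence='c_v')
    print(n, round(cm.get_coherence(), 3))  # higher coherence is (roughly) better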
4. Use the LDA model
The code above ran for about 5 minutes, and the LDA model is now trained.

Now we can use the LDA model to do several things:

  • prepare a document (4.1)
  • predict the topic of a document (4.2)
  • show each topic's feature words (4.3)
  • look at the topic distribution over the dataset (4.4)
  • visualize the model (4.5)
4.1 Prepare the document
Suppose we have the document "游戏体育真有意思" ("games and sports are really interesting"). Word segmentation gives:


document = jieba.lcut('游戏体育真有意思')
document

['游戏', '体育', '真', '有意思']
4.2 Predict the topic of the document
We use the topic model to see which topic the document belongs to:


topic.get_document_topics(document)

[(0, 0.02501536),
(1, 0.025016038),
(2, 0.28541195),
(3, 0.025018401),
(4, 0.025018891),
(5, 0.025017735),
(6, 0.51443774),
(7, 0.02502284),
(8, 0.025015472),
(9, 0.025025582)]
Our LDA topic model was trained with n_topics=10, so when we predict a document, the result is a list of 10 (topic, probability) tuples.

We can see that topic 6 has the highest probability, 0.51443774.

So we can roughly say that the document belongs to topic 6.
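If you only want that single most likely topic, here is a tiny plain-Python sketch (not a cntopic API): take the pair with the highest probability.

pairs = topic.get_document_topics(document)  # [(topic_id, probability), ...]
best_id, best_prob = max(pairs, key=lambda p: p[1])  # argmax over probability
print(best_id, best_prob)  # e.g. 6 0.51443774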

4.3 Show each topic's feature words
Knowing only that a document is topic n still doesn't tell us what topic n represents, so we need to look at the feature words corresponding to each topic:

topic.show_topics()

[(0,
  '0.042*"fund" + 0.013*"market" + 0.011*"investment" + 0.009*"company" + 0.005*"up" + 0.004*"stock" + 0.004*"real estate" + 0.004*"index" + 0.004*"house price" + 0.004*"2008"'),
 (1,
  '0.010*"China" + 0.007*"immigration" + 0.006*"project" + 0.005*"development" + 0.005*"representation" + 0.005*"economy" + 0.005*"government" + 0.005*"land" + 0.004*"policy" + 0.004*"problem"'),
 (2,
  '0.014*"competition" + 0.009*"them" + 0.008*"team" + 0.007*"rebounds" + 0.006*"us" + 0.005*"player" + 0.005*"playoffs" + 0.005*"time" + 0.005*"Heat" + 0.005*"season"'),
 (3,
  '0.013*"us" + 0.013*"one" + 0.009*"self" + 0.009*"this" + 0.007*"no" + 0.007*"them" + 0.006*"can" + 0.006*"yes" + 0.006*"a lot" + 0.006*"reporter"'),
 (4,
  '0.020*"movie" + 0.010*"director" + 0.009*"weibo" + 0.008*"film" + 0.006*"audience" + 0.006*"one" + 0.005*"self" + 0.005*"box office" + 0.004*"photography" + 0.004*"entertainment"'),
 (5,
  '0.018*"student" + 0.015*"study abroad" + 0.008*"university" + 0.008*"yes" + 0.006*"function" + 0.006*"pixel" + 0.006*"photography" + 0.006*"adopt" + 0.005*"school" + 0.005*"apply"'),
 (6,
  '0.007*"player" + 0.006*"Fengshen" + 0.006*"mobile" + 0.006*"online" + 0.006*"the" + 0.006*"game" + 0.005*"Chen Shui-bian" + 0.005*"activity" + 0.005*"to" + 0.005*"a"'),
 (7,
  '0.009*"information" + 0.009*"exam" + 0.009*"game" + 0.007*"work" + 0.007*"mobile" + 0.006*"CET-4/6" + 0.006*"executive" + 0.005*"development" + 0.004*"yes" + 0.004*"overlord"'),
 (8,
  '0.015*"we" + 0.011*"enterprise" + 0.011*"product" + 0.010*"market" + 0.009*"furniture" + 0.009*"brand" + 0.008*"consumer" + 0.007*"industry" + 0.007*"China" + 0.007*"one"'),
 (9,
  '0.012*"game" + 0.011*"player" + 0.010*"yes" + 0.008*"match" + 0.008*"activity" + 0.006*"fashion" + 0.005*"OL" + 0.004*"acquisition" + 0.004*"task" + 0.004*"mobile"')]
From the topic ids and feature words above, you can roughly interpret what each topic is about.
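Once interpreted, it can help to record your reading as a mapping for later use. The labels below are one hypothetical interpretation of the feature words above, not anything cntopic outputs:

# Hypothetical human-readable labels, based on the feature words above
topic_names = {
    0: 'funds/markets',
    1: 'policy/economy',
    2: 'sports',
    3: 'general/miscellaneous',
    4: 'movies/entertainment',
    5: 'education/study abroad',
    6: 'games (noisy)',
    7: 'exams/mobile',
    8: 'home furnishing/brands',
    9: 'games',
}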

4.4 Topic distribution
Now we want to know how the documents in the dataset are distributed across the topics:


topic.topic_distribution(raw_documents=df['content'])

9 1670
1 1443
0 1318
5 1265
4 1015
2 970
8 911
3 865
7 307
6 236
Name: topic, dtype: int64
Our data has 10 categories with 1,000 articles each. The LDA topic model, relying only on clues in the text, partitions the data under n_topics=10 reasonably well.

Ideally each topic would be close to 1,000 documents; here topic 9 has too many, while topics 6 and 7 have too few.

However, note that some topics intersect and are easy to confuse, for example:

  • finance, real estate, and current affairs
  • sports and entertainment
  • finance and technology
  • and so on

In summary, the current model is not bad and the performance is acceptable.
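Because this dataset happens to come with ground-truth labels, one way to see where those intersections occur is to assign each document its most probable topic and cross-tabulate against the known categories. A minimal sketch with plain pandas, reusing get_document_topics from 4.2:

import pandas as pd

def best_topic(tokens):
    # most probable topic id for one segmented document
    return max(topic.get_document_topics(tokens), key=lambda p: p[1])[0]

df['topic'] = [best_topic(doc) for doc in documents]
print(pd.crosstab(df['label'], df['topic']))  # rows: true labels, columns: predicted topics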

4.5 Visualization (unstable feature)
With only 10 topics we can still inspect the results by eye, but when there are many topics, visualization tools help us judge the training result more scientifically.

For this we use topic.visualize_lda():


topic.visualize_lda()

After it runs, find the vis.html file in the output folder next to your code and open it with a browser (right-click, open with browser).

The visualization feature is unstable and vis.html sometimes cannot be opened; please bear with it. A fallback sketch follows.
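If visualize_lda() fails, one possible fallback (a sketch, not part of cntopic) is to call pyLDAvis directly on the files cntopic saved under output/, assuming they are ordinary gensim saves; the file paths are the ones used in Part 5.

import pyLDAvis
import pyLDAvis.gensim_models as gensimvis  # 'pyLDAvis.gensim' in older pyLDAvis versions
from gensim.corpora import Dictionary
from gensim.models import LdaModel

lda = LdaModel.load('output/model/lda.model')           # reload the trained model
dictionary = Dictionary.load('output/dictionary.dict')  # reload the dictionary
corpus = [dictionary.doc2bow(doc) for doc in documents] # rebuild the bag-of-words corpus

vis = gensimvis.prepare(lda, corpus, dictionary)
pyLDAvis.save_html(vis, 'vis.html')  # then open vis.html in a browser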


The page has two large areas, left and right:

  • Left: the topic distribution. The larger a circle, the more documents its topic covers, and the circles are scattered across four quadrants.
  • Right: the feature words of the selected topic, with weights decreasing from top to bottom.

For the left panel, note:

  • Circles are ideally spread evenly across the four quadrants; if they all cluster in one small area, the model is not well trained.
  • Less overlap between circles is better; too much overlap means n_topics is set too large and should be reduced.
5. Store and reload the LDA model
Training an LDA topic model is particularly slow; not saving the trained model wastes both our time and our computers' compute.

The good news is that cntopic saves the model for you by default, in the output folder. You only need to know how to load it back.

Two things need to be loaded here: the dictionary and the LDA model. Let's try it out; to distinguish it from the earlier object, we name it topic2:


topic2 = Topic(cwd=os.getcwd())
topic2.load_dictionary(dictpath='output/dictionary.dict')  # load the saved dictionary
topic2.create_corpus(documents=documents)                  # rebuild the corpus from documents
topic2.load_lda_model(modelpath='output/model/lda.model')  # load the saved LDA model

Now you can go back and try the LDA model functions from Part 4 on topic2.
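For example, a quick sanity check that the reloaded model predicts the same way as before (same calls as in 4.1 and 4.2):

# Same document and calls as in Part 4, now on the reloaded model
document = jieba.lcut('游戏体育真有意思')
print(topic2.get_document_topics(document))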

Origin: blog.51cto.com/15069487/2578505