Topic mining: a comparison experiment of the LSA and LDA methods

Introduction to the experiment

  This experiment uses code written in PyCharm to mine news topics from a single news dataset, training two different NLP models and then quantifying their results on a test set. The outcomes of the two runs are compared through visualisation. (To preserve the process and visualise the results, the log file generated by the experiment and reference charts from one representative run are retained.)

  The experimental dataset consists of crawled news articles, containing about 22,000 content entries usable for training. Because the data comes from crawler output stored as txt files full of HTML tags, it must first be cleaned to extract the news text. The procedure in brief: after cleaning, the dataset is segmented into words and POS-tagged (a slow step). The segmented data is then used to build a bag-of-words corpus (a dictionary, a TF-IDF matrix, and related structures). With the corpus built, the number of topics to mine is set manually, and the corpus is used to train an LSA model and an LDA model. Both trained models are saved and applied to the test set to observe their predictions. The test set is preprocessed the same way, fed into both models, and the resulting predictions are compared visually, which makes the differences between the models quite apparent.

  The start and end times of every sub-process are recorded and printed to the log file log.txt, so the time consumed by each step can be compared and verified.

Lab environment

Software environment

  Windows 10 operating system, PyCharm IDE.

Hardware environment

  12-thread CPU, 16 GB RAM, 6 GB video memory

Third-party libraries used

  1. jieba (word segmentation, POS tagging and filtering)
  2. bs4 / BeautifulSoup (cleaning HTML tags via the 'lxml' parser)
  3. gensim (bag-of-words corpus preprocessing and model training)
  4. tarfile (decompressing the dataset archive)
  5. os (path handling, file and log operations)
  6. matplotlib (visualising the results for comparison)
  7. time (recording the timestamps of each sub-step)

Experiment goal

  Compare the effectiveness of the LSA and LDA models at mining news topics and evaluate their respective strengths and weaknesses. The evaluation covers three aspects: training time, prediction accuracy, and the correlation between the topics each model produces. The main goal is to present the training and prediction results of both models, compare them, and record and analyse the outcome.

Experiment procedure

  The original dataset is a .tar.gz archive containing many small txt files, so it must first be decompressed; the time spent on decompression is recorded. The decompressed contents are then merged in preparation for cleaning, and the merging time is recorded. Since the data comes from web crawlers, the txt files contain a great deal of redundant information. After decompression and merging, bs4.BeautifulSoup is used to filter out the HTML tags and irrelevant data, extracting only the news text inside the content tags; the cleaning time is recorded as well.

  Once the data is clean, the next step is building the term-document structures. The large blocks of text must be segmented and POS-tagged so that irrelevant connectives and empty content can be filtered out. This step uses the posseg sub-module of the jieba library, and its start and end times are recorded.
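The POS filtering described above keeps only tokens whose tag is on a whitelist. A minimal sketch of the pattern, with hand-made (word, flag) pairs standing in for jieba.posseg output (so no real segmentation happens here; the tags are invented for illustration):

```python
# Toy stand-in for jieba.posseg output: (word, flag) pairs.
# With real data these come from pseg.cut(text).
tagged = [('北京', 'ns'), ('的', 'uj'), ('比赛', 'vn'), ('在', 'p'), ('举行', 'v')]

# The same POS whitelist used by cut_word in the code below
word_type = ['z', 'vn', 'v', 't', 'nz', 'nr', 'ns', 'n', 'l', 'i', 'j', 'an', 'a']

# Same comprehension pattern as cut_word: keep a word only if its flag is whitelisted
kept = [word for word, flag in tagged if flag in word_type]
print(kept)  # the particle '的' (uj) and preposition '在' (p) are dropped
```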

  After tagging, the bag-of-words preprocessing functions are applied to the segmented dataset. First a dictionary is built, with start and end times recorded. Then the corpus is built, giving the models the reference data they train on (and the target of LDA's random sampling); its timing is recorded too. Finally, to incorporate weight information, TF-IDF is used to express each term's overall importance (reflecting both how often a word occurs within a document and what fraction of documents contain it); the time consumed by this step is recorded as well.
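Conceptually, the dictionary maps each token to an integer id, doc2bow counts occurrences, and TF-IDF weighs a term's in-document frequency against how many documents contain it. A minimal pure-Python sketch of the idea on toy documents (gensim's TfidfModel uses a smoothed, normalised variant, so the exact numbers will differ):

```python
import math

# Three toy "documents" as token lists (invented for illustration)
docs = [['sports', 'ball', 'game'],
        ['sports', 'election', 'vote'],
        ['election', 'vote', 'policy']]

# Dictionary: token -> integer id (what corpora.Dictionary builds)
dictionary = {tok: i for i, tok in enumerate(sorted({t for d in docs for t in d}))}

# Bag-of-words: (token_id, count) pairs per document (what doc2bow returns)
bow = [[(dictionary[t], d.count(t)) for t in sorted(set(d))] for d in docs]

# TF-IDF: term frequency times inverse document frequency
def tfidf(term_id, doc):
    tf = dict(doc)[term_id] / sum(c for _, c in doc)    # share of the document
    df = sum(1 for d in bow if term_id in dict(d))      # documents containing the term
    return tf * math.log(len(bow) / df)

# 'ball' appears in only one document, so it outweighs the common word 'sports'
print(tfidf(dictionary['ball'], bow[0]) > tfidf(dictionary['sports'], bow[0]))  # True
```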

  Ten topics were designated manually for mining (a subjective parameter: num_topics = 10). The preprocessing results are used to train the LSA and LDA models, and the training results are printed as key-value pairs of topic ids and topic keywords. Observing these results shows whether either model copes well with polysemy (one word, many meanings) and synonymy (many words, one meaning). The training time of each model is recorded as one of the key comparison parameters.

  Finally, the models are tested on a test sample. The test set is preprocessed into a corpus, which is then fed into both models as a prediction sample; each model returns its prediction as similarity key-value pairs over the 10 topics. After the predictions are normalised, the data is presented for comparison and analysis using matplotlib.
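The post-processing described above can be sketched in isolation: LSA similarities may be negative (here a smaller absolute value is read as closer to the topic), so they are sorted by absolute value, while LDA probabilities are non-negative and are simply sorted largest-first. A toy sketch with invented prediction pairs:

```python
# Invented (topic_id, similarity) pairs standing in for model output
lsa_pred = [(0, -0.08), (1, 0.02), (2, -0.01), (3, 0.05)]
lda_pred = [(0, 0.05), (1, 0.72), (2, 0.13), (3, 0.10)]

# LSA: sort by absolute similarity, smallest deviation first
lsa_sorted = sorted(lsa_pred, key=lambda x: [abs(x[1]), x[0]])
print(lsa_sorted[0])   # (2, -0.01): the topic with the smallest deviation

# LDA: sort by probability, largest first (the trans_list pattern in the code)
lda_sorted = sorted(lda_pred, key=lambda x: [x[1], x[0]], reverse=True)
print(lda_sorted[0])   # (1, 0.72): the most probable topic
```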

Experiment code

Library import part:

# Author:JinyuZ1996
# Creation date:2020/7/25 20:03
# -*- coding: utf-8 -*-
import os
import tarfile
import matplotlib.pyplot as plt
import jieba.posseg as pseg
import time as t
# jieba is a third-party Chinese word-segmentation library; note that what we actually use below is posseg's cut method (a pitfall)
from bs4 import BeautifulSoup
from gensim import corpora, models
# Pitfall: when using LSA you must explicitly import lsimodel from the package; most blog posts omit this or use an outdated style
import gensim.models.lsimodel as lsi

Function definition part: (with detailed comments)

# Function definition part

# Dataset word-segmentation method: segment the input text according to POS tags
# (Parameter: a text string extracted from the training or test set)
def cut_word(text):
    word_type = ['z', 'vn', 'v', 't', 'nz', 'nr', 'ns', 'n', 'l', 'i', 'j', 'an', 'a']      # POS tags to keep
    words = pseg.cut(text)                                                                  # a big pitfall here
    # pseg.cut segments the Chinese text; checking the library shows it uses the accurate mode, which splits the text precisely without redundant tokens
    cut_result = [word_cut.word for word_cut in words if word_cut.flag in word_type]        # idiomatic comprehension; no need to know in advance how many tokens there are
    return cut_result                                                                       # return the tokens whose POS tag matches the rules

# Text-preprocessing method: during training, return the dictionary, the TF-IDF model object and the TF-IDF vector space;
# during prediction, return only the TF-IDF vector space of the new data
# (Parameters: token lists; a TF-IDF model object, default None; a flag marking whether this is training or testing)
def text_pre(words_list, tfidf_object=None, training=True):
    # Convert the token lists into a dictionary
    t_dic_start = t.time()
    dic = corpora.Dictionary(words_list)                                # build a dictionary from the segmented token lists
    t_dic_end = t.time()
    if training:
        print('Training-set dictionary built in ' + str(format(t_dic_end-t_dic_start,'.2f')) + ' s; building the model corpus......')
        # The three lines below were once used to show what the dictionary looks like
        # print('{:-^50}'.format('Dictionary indices and tokens (sample):'))
        # for i, w in list(dic.items())[:20]:                             # read the key and value of the first 20 entries (index and token)
        #     print('index: %s -- token: %s' % (i, w))                    # the dataset is large, so this only demonstrates the processing step
    else:
        print('Test-set dictionary built.')
    # With the dictionary ready, build the corpus; doc2bow is the built-in method for constructing a bag-of-words model
    # The model ignores grammar and word order, treating each document as a mere collection of words whose occurrences are independent (this forms the raw corpus)
    t_corpus_start = t.time()
    corpus = [dic.doc2bow(words) for words in words_list]                   # list storing the corpus
    t_corpus_end = t.time()
    if training:
        print('Training-set corpus built in ' + str(format(t_corpus_end-t_corpus_start,'.2f')) + ' s; building the TF-IDF model......')
        # The two lines below were once used to inspect the bag-of-words corpus
        # print('{:-^50}'.format('First corpus entry (sample):'))
        # print(corpus[0])                                                  # show the first entry of the corpus
    else:
        print('Test-set corpus analysed.')
    # TF-IDF conversion (only during training is the weight matrix built from the corpus)
    if training:
        t_TFIDF_start = t.time()
        tfidf = models.TfidfModel(corpus)                           # build the TF-IDF model object; TfidfModel is provided by gensim
        corpus_tfidf = tfidf[corpus]                                # obtain the sparse TF-IDF vector matrix
        # The two lines below were once used to inspect the TF-IDF model
        # print('{:-^50}'.format('First TF-IDF entry (sample):'))
        # print(list(corpus_tfidf)[0])                              # show the first entry
        t_TFIDF_end = t.time()
        print('TF-IDF construction finished in '+str(format(t_TFIDF_end-t_TFIDF_start,'.2f'))+' s.')
        return dic, corpus_tfidf, tfidf
    else:
        return tfidf_object[corpus]                             # during testing, reuse the trained TF-IDF object passed in as tfidf_object

# Full-width to half-width conversion: used for data cleaning during preprocessing
# (Parameter: the raw text inside a content tag)
def str_convert(content):
    strs = []
    for each_char in content:  # iterate over every character
        code_num = ord(each_char)  # get the character's Unicode code point
        if code_num == 12288:  # a full-width space converts directly
            code_num = 32
        elif 65281 <= code_num <= 65374:  # other full-width characters convert by a fixed offset
            code_num -= 65248
        strs.append(chr(code_num))
    return ''.join(strs)

# Parse file contents: simple cleaning before preprocessing
# (Parameter: raw text read from the dataset files, full of markup tags and dirty data)
def data_parse(data):
    # BeautifulSoup helps us extract the information inside web-page tags
    raw_code = BeautifulSoup(data, 'lxml')          # build the BeautifulSoup object
    content_code = raw_code.find_all('content')     # find the content tags and filter out the news text
    # Extract the text of every content tag, convert it to half-width strings, and return them as a list
    # (empty tags are skipped along the way, which removes dirty records)
    content_list = [str_convert(each_content.text) for each_content in content_code if len(each_content) > 0]
    return content_list
Implementation part: (with detailed comments)

# Implementation part

# Create a log file to record the run and its results
doc = open('log.txt','w')

# Decompression step (a fairly complete recipe for reading .tar.gz archives; the ones I found earlier all failed, so this is worth keeping)
print('Decompression started......')
print('Decompression started......',file=doc)
t_unzip_start = t.time()
if not os.path.exists('./news_data'):                       # if the data directory does not exist, extract the archive first
    with tarfile.open('news_data.tar.gz') as tar:           # open the tar.gz archive (archives are sometimes nested several levels deep)
        names = tar.getnames()                              # get the name of every member in the archive
        for name in names:                                  # iterate over the members
            tar.extract(name, path='./')                    # extract each member to the target directory

# Merge all contents (the archive splits the dataset into many sub-files; a rough estimate is 22,000+ content entries)
t_unzip_end = t.time()
print('Decompression finished in '+str(format(t_unzip_end-t_unzip_start,'.2f'))+' s; data merging started......')
print('Decompression finished in '+str(format(t_unzip_end-t_unzip_start,'.2f'))+' s; data merging started......',file=doc)
t_datamerg_start = t.time()
all_content = []                                            # master list that will hold the text of every file
for root, dirs, files in os.walk('./news_data'):            # os.walk traverses the root directory, sub-directories and file lists
    for file in files:                                      # iterate over the files
        file_name = os.path.join(root, file)                # join directory path and file name into a full path
        with open(file_name, encoding='utf-8') as f:        # open the file read-only (the default mode)
            data = f.read()                                 # read the file contents
        all_content.extend(data_parse(data))                # clean the text and append the result to the master list

# Word segmentation of the dataset begins
t_datamerg_end = t.time()
print('Data merging finished in '+str(format(t_datamerg_end-t_datamerg_start,'.2f'))+' s; word segmentation started......')
print('Data merging finished in '+str(format(t_datamerg_end-t_datamerg_start,'.2f'))+' s; word segmentation started......',file=doc)

# Build the token list storing the segmentation result of every file (obtained from all_content)
t_cutWord_start = t.time()
print("Starting word segmentation and POS tagging of the dataset (this step is slow)......")
print("Starting word segmentation and POS tagging of the dataset (this step is slow)......",file=doc)
words_list = [list(cut_word(each_content)) for each_content in all_content]
t_cutWord_end = t.time()
print("Segmentation finished in "+str(format(t_cutWord_end-t_cutWord_start,'.2f'))+' s; building the dictionary model......')
t_wordwash_start = t.time()
dic, corpus_tfidf, tfidf = text_pre(words_list)                                 # preprocess the training-set text
num_topics = 10                                                                 # number of topics, set subjectively (10 for now)
t_wordWash_end = t.time()
print('Bag-of-words preprocessing finished in '+str(format(t_wordWash_end-t_wordwash_start,'.2f'))+' s; building the LDA topic model......')
print('Bag-of-words preprocessing finished in '+str(format(t_wordWash_end-t_wordwash_start,'.2f'))+' s; building the LDA topic model......',file=doc)

# Train the LDA and LSA models on the dataset (recording each training time for comparison)

t_lda_start = t.time()
lda = models.LdaModel(corpus_tfidf, id2word=dic, num_topics=num_topics)         # topic modelling with LDA
t_lda_end = t.time()
print('LDA model built in '+str(format(t_lda_end-t_lda_start,'.2f'))+' s; building the LSA topic model......')
print('LDA model built in '+str(format(t_lda_end-t_lda_start,'.2f'))+' s; building the LSA topic model......',file=doc)
print('{:-^50}'.format('Trained LDA topics:'))
print('{:-^50}'.format('Trained LDA topics:'),file=doc)
print(lda.print_topics())                                                       # print all LDA topics
print(lda.print_topics(),file=doc)
t_lsa_start = t.time()
lsa = lsi.LsiModel(corpus_tfidf, id2word=dic, num_topics=num_topics)            # topic modelling with LSA
t_lsa_end = t.time()
print('LSA model built in '+str(format(t_lsa_end-t_lsa_start,'.2f'))+' s.')
print('LSA model built in '+str(format(t_lsa_end-t_lsa_start,'.2f'))+' s.',file=doc)
print('{:-^50}'.format('Trained LSA topics:'))
print('{:-^50}'.format('Trained LSA topics:'),file=doc)
print(lsa.print_topics())                                                       # print all LSA topics
print(lsa.print_topics(),file=doc)

# Topic prediction on the new dataset
print('Test-set phase started; opening the test-set file......')
print('Test-set phase started; opening the test-set file......',file=doc)
with open('article.txt', encoding='utf-8') as f:                                # open the test-set text
    text_new = f.read()  # read the text data
text_content = data_parse(text_new)                                             # parse the new text into a cleaned content list
words_list_new = cut_word(''.join(text_content))                                # segment the cleaned text to prepare for preprocessing
corpus_tfidf_new = text_pre([words_list_new], tfidf_object=tfidf, training=False)  # preprocess the new data (note: the training flag is set to False)

# LDA prediction (use the trained LDA model to predict the news topic)

# t_testLda_start = t.time()
corpus_lda_new = lda[corpus_tfidf_new]                                          # get the topic probability distribution of the new document from the trained LDA model
# t_testLda_end = t.time()
print('{:-^50}'.format('LDA topic prediction for the test sample:'))
print('{:-^50}'.format('LDA topic prediction for the test sample:'),file=doc)
pre_list = list(corpus_lda_new)
trans_list = sorted(pre_list[0],key = (lambda x:[x[1],x[0]]),reverse=True)       # improved 2020-08-05: sort so the most probable topic comes first
print(trans_list)                   # print the sorted topic predictions
print(trans_list,file=doc)
# The timing print is commented out because this step turned out to be so fast that '.2f' cannot display it meaningfully
print('LDA prediction on the test set finished.')
print('LDA prediction on the test set finished.',file=doc)

# LSA prediction (use the trained LSA model to predict the news topic)

# t_testLsa_start = t.time()                                # record the start time of LSA prediction
corpus_lsa_new = lsa[corpus_tfidf_new]                      # run the test-set corpus through the trained LSA model
# t_testLsa_end = t.time()                                  # record the end time of LSA prediction
print('{:-^50}'.format('LSA topic prediction for the test sample:'))
print('{:-^50}'.format('LSA topic prediction for the test sample:'),file=doc)
pre_list_lsa = list(corpus_lsa_new)                     # convert the prediction result to a list
trans_list_lsa = sorted(pre_list_lsa[0],key = (lambda x:[abs(x[1]),x[0]]),reverse=False)            # sort by value (use the absolute value when sorting)
print(trans_list_lsa)                                   # print the sorted topic predictions
print(trans_list_lsa,file=doc)
# The timing print is commented out because this step turned out to be so fast that '.2f' cannot display it meaningfully
print('LSA prediction on the test set finished.')
print('LSA prediction on the test set finished.',file=doc)

# Visualise the LSA test result (bar chart)

id_lsa = []             # list of topic ids (must stay aligned with the values)
val_lsa = []            # list of weight values
lsa_Outlist = trans_list_lsa                                                    # personal habit: give it a fresh name
for i in range(0,len(lsa_Outlist)):                                             # move the values above into two new lists for display
    id_lsa.append("tp-"+str(lsa_Outlist[i][0]))                                 # collect the topic ids for printing
    val_lsa.append(float(format((1-10*abs(lsa_Outlist[i][1]))*10,'.3f')))       # collect the weights; take the absolute value and rescale for display
print(id_lsa)           # test-print the id sequence
print(val_lsa)          # test-print the weight sequence

fig = plt.figure(figsize=(10, 5))                                   # set the window size
fig.canvas.set_window_title('Using LSA to predict Testing Set')     # set the window title
plt.title('Using LSA to predict Testing Set')                       # set the chart title
plt.xlabel('Predicted subject sequence number')                     # predicted topic id
plt.ylabel('Weight of prediction possibility')                      # predicted topic weight
# Only nine colours are defined here; if you later increase the number of topics you need to add more colours
plt.bar(range(len(val_lsa)),val_lsa,width=0.5,tick_label=id_lsa,color =['grey','gold','darkviolet','turquoise','red','green','blue','pink','tan'])
plt.show()

# Visualise the LDA test result (bar chart)

id_lda = []                 # list of topic ids (must stay aligned with the values)
val_lda = []                # list of weight values
lda_Outlist = trans_list
for i in range(0,len(lda_Outlist)):                                  # move the values above into two new lists for display
    id_lda.append("tp-"+str(trans_list[i][0]))                       # collect the topic ids for printing
    val_lda.append(float(format((trans_list[i][1])*10,'.3f')))       # collect the weights; the LDA output needs less massaging
print(id_lda)           # test-print the id sequence
print(val_lda)          # test-print the weight sequence

# A pie chart was once tried for LDA but did not look good; the call is kept for future reference.
# Parameters: values, labels, colours, automatic percentage format; shadow= and explode= can also be set
# plt.pie(x=value,labels=id_lda,colors=['C0', 'C1', 'C2', 'C3', 'C4', 'C5', 'C6', 'C7', 'C8', 'C9'],autopct='%1.1f%%')
# plt.axis('equal')       # draw the pie chart as a true circle

fig = plt.figure(figsize=(10, 5))                                   # set the window size
fig.canvas.set_window_title('Using LDA to predict Testing Set')     # set the window title
plt.title('Using LDA to predict Testing Set')                       # set the chart title
plt.xlabel('Predicted subject sequence number')                     # predicted topic id
plt.ylabel('Weight of prediction possibility')                      # predicted topic weight
# Only nine colours are defined here; if you later increase the number of topics you need to add more colours
plt.bar(range(len(val_lda)),val_lda,width=0.5,tick_label=id_lda,color =['grey','gold','darkviolet','turquoise','red','green','blue','pink','tan'])
plt.show()

# Finally, compare the time the two models spent

lda_cost = t_lda_end-t_lda_start
lsa_cost = t_lsa_end-t_lsa_start
if lda_cost>lsa_cost:
    print("LDA took " + str(format(lda_cost, '.2f')) + ' s.\nLSA took ' + str(format(lsa_cost, '.2f')) + ' s.')
    print("LDA took " + str(format(lda_cost, '.2f')) + ' s.\nLSA took ' + str(format(lsa_cost, '.2f')) + ' s.', file=doc)
    print("LDA took " + str(format(lda_cost - lsa_cost, '.2f')) + ' s more than LSA.')
    print("LDA took " + str(format(lda_cost - lsa_cost, '.2f')) + ' s more than LSA.',file=doc)
else:
    print("LDA took " + str(format(lda_cost,'.2f')) + ' s.\nLSA took ' + str(format(lsa_cost,'.2f')) + ' s.')
    print("LDA took " + str(format(lda_cost, '.2f')) + ' s.\nLSA took ' + str(format(lsa_cost, '.2f')) + ' s.', file=doc)
    print("LSA took " + str(format(lsa_cost - lda_cost,'.2f')) + ' s more than LDA.')
    print("LSA took " + str(format(lsa_cost - lda_cost, '.2f')) + ' s more than LDA.',file=doc)

Analysis of results

  First, observations are made from the log.txt of one run and the visualised results (more than 20 runs were carried out in advance; one representative run is analysed here):

  From the training results recorded in log.txt, LSA clearly places some sports-related words in several different topics. This is understandable: as mentioned earlier (in the video) regarding the number of news topics, the best granularity for this dataset lies at roughly 22-30 topics, but to obtain results quickly and amplify the differences we subjectively set num_topics = 10. Even so, the LDA training results show clearly higher discrimination: even with only 10 topics, the correlation between topics is very small, and on the rare occasions when a word is assigned to several topics, its weight in the other topics is tiny. From the training results alone, then, we preliminarily expect LSA to produce inaccurate or fuzzy predictions, while LDA should largely avoid this.

  From the perspective of training time, under the current experimental conditions (this dataset and a topic count of 10), LSA was consistently about 2 seconds slower than LDA across repeated runs.

  Finally, consider the predictions on the test set, which is a piece of sports news. LDA predicts it accurately as topic 9 with a comparatively high weight, and all topic similarities are positive, so the returned values need almost no filtering or normalisation (only some formatting was applied later to ease display). The LSA model, by contrast, returns key-value pairs with large gaps between them, and the similarity is expressed as a degree of deviation from the topic: the values must first be converted to absolute values and sorted in ascending order. Because, as analysed above, the gaps between LSA's topics are small, several topics end up with similarly small deviations after taking absolute values. Compared with LDA, the prediction is fuzzy and the post-processing is more complicated.

  Reading log.txt alone is not very intuitive; after normalising the two lists of returned values, two much clearer charts emerge.

Chart 1: LSA model prediction result
Chart 2: LDA model prediction result

  The charts show that LDA gives much clearer predictive feedback: there is a wide gap between the similarities of different topics, so one topic (topic 3) can be unambiguously identified as the predicted target. The LSA results are vaguer: topics 9, 8, 1 and 5 all have relatively high similarity. Topic 9 can still be picked out as the most similar, but the separation is clearly not as sharp as LDA's.

  One more difference emerged during the experiments: the order of the topics LDA generates differs from run to run. Sometimes topic 3 represents sports, sometimes topic 6 does, yet the prediction accuracy stays the same. The LSA predictions, on the other hand, are almost identical every time, as are the charts. This is because LSA relies on SVD (singular value decomposition), whereas LDA relies on random sampling; with random sampling, a different topic ordering on every run is entirely normal.
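The run-to-run variation in LDA's topic numbering comes from its random sampling. If reproducible topic ids are wanted, the sampler can be seeded; gensim's LdaModel accepts a random_state parameter for this purpose (worth verifying against your gensim version). The effect of seeding is easy to see with plain random draws:

```python
import random

def draw(seed):
    rng = random.Random(seed)                  # a random source seeded with a fixed value
    return [rng.randint(0, 9) for _ in range(5)]

# The same seed always yields the same sequence, so seeded runs are reproducible
print(draw(42) == draw(42))   # True
```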

Conclusion

  Experiments over multiple sets of data show that, with the number of topics fixed, LDA's mining results are more distinct, its prediction accuracy relatively higher, and its time consumption lower, while LSA's predictions are comparatively vague. The reasons follow from what we learned earlier: LSA maps the original high-dimensional, low-similarity matrix into a new low-dimensional semantic space via SVD dimensionality reduction, so each decomposition yields similar results. It can handle synonymy (many words, one meaning) but not polysemy (one word, many meanings), which is one reason similar words appear across several of its topics. LDA, by contrast, uses the Gibbs sampling algorithm (an MCMC method) for random sampling, so the topic ordering differs between runs, yet it maintains comparatively high prediction quality, avoids both the polysemy and synonymy problems, produces unambiguous topic predictions, and lets the similarity of a single topic stand out clearly from the rest.

Shortcomings of the experiment

  The shortcomings identified here could in fact be fixed promptly, but there was not enough time to polish the experiment before the report was written. First, the log could record all repeated runs rather than keeping only the most recent one. Second, the training and testing of LDA and LSA could be wrapped in a dedicated class or functions, which would make the code more readable. Third, from recent material and further study of the library I learned that the corpus can be cached; the experiment did not persist the segmentation results, so every run repeats segmentation and POS tagging, driving up the time cost.

Data set used in the experiment

  https://download.csdn.net/download/qq_39381654/12710127

Origin: blog.csdn.net/qq_39381654/article/details/107981106