Extracting long-text topics with the LDA algorithm in Python and predicting their categories

Introduction

There are already very thorough introductions to the principles of the LDA algorithm on Zhihu, so they are not repeated here.

This article focuses on extracting topics from text and then classifying existing texts against those topics. Because of issues with the data, the results are only average; the algorithm design is offered for reference only.

The code and test data for this article have been uploaded to GitHub: repository link

Data preparation

This stage is mainly about cleaning the data for your specific problem. Chinese text preprocessing typically involves removing spaces, removing punctuation and form symbols, removing stop words, and word segmentation. Preprocess the text according to your own needs, and end up with data in the following format: a list of token lists.

The initial data looks like this: one document per record, separated by blank lines.

在漆黑的夜空里看满天璀璨的星星,非常非常的密集,同时月亮也异常明亮。最主要的是在满天繁星中,靠近月亮的3颗星星排列成“勾”的样子,他们特别的明亮。

......................................................(省略好多数据)

沟里漂着很多死尸大人小孩都有,我顺着消流的方向走,看到臭水带着死尸流进了苹果园,我很怕!

After preliminary processing with the deal_words() method shown below, the following is obtained:

 [['满天璀璨', '星星', '月亮', '明亮', '靠近', '排列'],........., ['沟里', '漂着', '死尸','大人小孩']]

For the specific data-processing steps, refer to: data preprocessing

LDA model implementation

Approach:

  1. Combine the training data and the prediction data, and build a dictionary from the combined texts
  2. Use the dictionary to convert the training data into bag-of-words vectors (see the small doc2bow example after this list)
  3. Train an LDA model with the API provided by gensim and extract the topics
  4. Convert the prediction data into bag-of-words vectors with the same dictionary
  5. Predict the topic distribution of each prediction document
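
Strictly speaking, what doc2bow produces is gensim's bag-of-words encoding rather than a one-hot encoding: each document becomes a list of (token_id, count) pairs, and tokens not in the dictionary are dropped. A minimal sketch using toy documents modeled on the sample data above:

from gensim import corpora

# Toy documents, tokenised the same way as the sample data above
docs = [['满天璀璨', '星星', '月亮', '明亮'],
        ['沟里', '漂着', '死尸', '月亮']]
dictionary = corpora.Dictionary(docs)                # token -> integer id
print(dictionary.doc2bow(['星星', '月亮', '星星']))    # a list of (token_id, count) pairs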

The model is implemented with the API provided by gensim; the code is as follows:

def lda_model(train_data):
    # Read the data to be classified
    test_data = deal_words(read_text_file('./data/set/test_text.txt'))
    # Combine training and test data to build the dictionary
    contents = train_data + test_data
    # Build the dictionary from the combined texts
    dictionary = corpora.Dictionary(contents)
    # Convert the training data into a bag-of-words corpus
    corpus = [dictionary.doc2bow(doc) for doc in train_data]
    # Train the LDA model: 30 latent topics, 2 passes over the corpus
    lda = gensim.models.ldamodel.LdaModel(corpus, num_topics=30, id2word=dictionary,
                                          passes=2)
    # Print a sample of the extracted topics: 3 topics, 5 words each
    data = lda.print_topics(num_topics=3, num_words=5)
    for item in data:
        print(item)
        print("--------------------split line---------------------")
    # Convert the test data with the same dictionary
    test_vec = [dictionary.doc2bow(doc) for doc in test_data]
    # Predict and print the topic distribution of each test record
    for i, item in enumerate(test_vec):
        topic = lda.get_document_topics(item)
        print('第', i + 1, '条记录分类结果:', topic)
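
get_document_topics() returns a probability distribution over the 30 topics rather than a single label. If one category per record is wanted, a simple option (an addition here, not part of the original code) is to keep the most probable topic:

# Hypothetical helper: pick the single most probable topic for one bag-of-words vector
def predict_topic(lda, bow):
    topics = lda.get_document_topics(bow)
    return max(topics, key=lambda t: t[1])   # returns (topic_id, probability)

# usage inside the prediction loop above:
# topic_id, prob = predict_topic(lda, item)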

Full code and results

Below are the LDA model code and the data-processing code. The data-processing module text_deal.py sits in the same directory (lda_demo) as the model code.
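
Judging from the import statement and the relative paths used in the code, the project layout is assumed to look roughly like the following (only text_deal.py and the data files are named in the article; the other names are guesses):

lda_demo/
    lda_model.py              # the model script below (file name assumed)
    text_deal.py              # the preprocessing helpers
data/
    set/
        train_text.txt
        test_text.txt
    word_deal/
        stop_word_cn.txt
        stop_one_mx.txt
        key_words.txt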

from lda_demo import text_deal as td
from gensim import corpora, models
import gensim

"""
Read a text file
input: url -- file path
output: the text data as a list (one document per blank-line-separated block)
"""


def read_text_file(url):
    with open(url, 'r', encoding='utf-8') as dream_text:
        return dream_text.read().split("\n\n")


"""
Stop-word removal / word segmentation
"""


def deal_words(contents):
    # Remove spaces
    contents = td.remove_blank_space(contents)
    # Segment the text into words
    contents = td.cut_words(contents)
    # Remove stop words
    contents = td.drop_stopwords(contents)
    return contents


def lda_model(train_data):
    # Read the data to be classified
    test_data = deal_words(read_text_file('./data/set/test_text.txt'))
    # Combine training and test data to build the dictionary
    contents = train_data + test_data
    # Build the dictionary from the combined texts
    dictionary = corpora.Dictionary(contents)
    # Convert the training data into a bag-of-words corpus
    corpus = [dictionary.doc2bow(doc) for doc in train_data]
    # Train the LDA model: 30 latent topics, 2 passes over the corpus
    lda = gensim.models.ldamodel.LdaModel(corpus, num_topics=30, id2word=dictionary,
                                          passes=2)
    # Print a sample of the extracted topics: 3 topics, 5 words each
    data = lda.print_topics(num_topics=3, num_words=5)
    for item in data:
        print(item)
        print("--------------------split line---------------------")
    # Convert the test data with the same dictionary
    test_vec = [dictionary.doc2bow(doc) for doc in test_data]
    # Predict and print the topic distribution of each test record
    for i, item in enumerate(test_vec):
        topic = lda.get_document_topics(item)
        print('第', i + 1, '条记录分类结果:', topic)


if __name__ == '__main__':
    # Read the dataset
    contents = read_text_file('./data/set/train_text.txt')
    # Preprocess the text
    contents = deal_words(contents)
    # Train the LDA model and predict
    lda_model(contents)
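
The script above retrains the model on every run. If the extracted topics need to be reused on new documents later, both the gensim model and the dictionary can be persisted. A minimal sketch, with assumed file paths and reusing the variable names from the script above:

# Assumed paths; the original project does not include a model directory
lda.save('./data/model/lda_30topics.model')
dictionary.save('./data/model/lda_30topics.dict')

# ...later, in another run (assuming deal_words() is importable):
lda = gensim.models.ldamodel.LdaModel.load('./data/model/lda_30topics.model')
dictionary = corpora.Dictionary.load('./data/model/lda_30topics.dict')
new_doc = deal_words(['在漆黑的夜空里看满天璀璨的星星'])[0]
print(lda.get_document_topics(dictionary.doc2bow(new_doc)))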

The text_deal.py code is as follows:

"""
自然语言处理---文本预处理
"""
import jieba
import pandas as pd

"""
加载初始数据信息
str:文件传输路径
index:所需真实值索引列表
"""


def read_data(str, index):
    dream_data = pd.read_csv(str)
    return dream_data.values[:, index]


"""
去掉文本中的空格
input:our_data为list文本数据
output:去除空格后的文本list
"""


def remove_blank_space(contents):
    contents_new = map(lambda s: s.replace(' ', ''), contents)
    return list(contents_new)


"""
判断单词是否为中文
input:word单个单词
output:是中文True,不是中文False
"""


def is_chinese(word):
    if word >= u'\u4e00' and word <= u'\u9fa5':
        return True
    else:
        return False


"""
判断短句是否为纯中文
input:words短句
output:是中文True,不是中文False
"""


def is_chinese_words(words):
    for word in words:
        if word >= u'\u4e00' and word <= u'\u9fa5':
            continue
        else:
            return False
    return True


"""
将文本数据格式化去除非中文字符
input:contents list结构的文本数据
output:去除非中文字符的数据
"""


def format_contents(contents):
    contents_new = []
    for content in contents:
        content_str = ''
        for i in content:
            if is_chinese(i):
                content_str = content_str + i
        contents_new.append(content_str)
    return contents_new


"""
对文本进行jieba分词
input:contents文本list
output:分词后的文本list
"""


def cut_words(contents):
    cut_contents = map(lambda s: list(jieba.lcut(s)), contents)
    return list(cut_contents)


"""
去除停用词/标点符号
input:contents文本list(list中保存list)
output:去除停用词后的文本list
"""


def drop_stopwords(contents):
    # 初始化获取停用词表
    stop = open('./data/word_deal/stop_word_cn.txt', encoding='utf-8')
    stop_me = open('./data/word_deal/stop_one_mx.txt', encoding='utf-8')
    key_words = open('./data/word_deal/key_words.txt', encoding='utf-8')
    #分割停用词/自定义停用词/关键词
    stop_words = stop.read().split("\n")
    stop_me_words = stop_me.read().split("\n")
    key_words = key_words.read().split("\n")
    #定义返回后的结果
    contents_new = []
    #遍历处理数据
    for line in contents:
        line_clean = []
        for word in line:
            if (word in stop_words or word in stop_me_words) and word not in key_words:
                continue
            if is_chinese_words(word):
                line_clean.append(word)
        contents_new.append(line_clean)
    return contents_new
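
A quick way to sanity-check these helpers is to run a single sample sentence through the same pipeline deal_words() uses (run from the project root so the stop-word files can be found; the exact tokens depend on jieba's dictionary and the stop-word lists, so the output is only indicative):

from lda_demo import text_deal as td

sample = ['在漆黑的夜空里看满天璀璨的星星,同时月亮也异常明亮。']
tokens = td.drop_stopwords(td.cut_words(td.remove_blank_space(sample)))
print(tokens)  # roughly [['满天璀璨', '星星', '月亮', '明亮']]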

Running result

(1, '0.023*"说" + 0.018*"怀孕" + 0.016*"父亲" + 0.014*"里" + 0.011*"岁"')
--------------------split line---------------------
(25, '0.023*"说" + 0.012*"办公室" + 0.010*"是不是" + 0.009*"朋友" + 0.009*"大门"')
--------------------split line---------------------
(20, '0.014*"同学" + 0.010*"跑" + 0.010*"培训" + 0.010*"骑" + 0.009*"机构"')
--------------------split line---------------------
第 1 条记录分类结果: [(4, 0.24392343), (8, 0.1395505), (10, 0.09619252), (18, 0.16527545), (21, 0.17173427), (23, 0.11055296)]
第 2 条记录分类结果: [(5, 0.124014), (13, 0.28862998), (16, 0.099018164), (19, 0.09216843), (24, 0.12537746), (29, 0.22633219)]
第 3 条记录分类结果: [(7, 0.101059936), (10, 0.37497482), (21, 0.15868592), (23, 0.19114888), (29, 0.12510397)]
第 4 条记录分类结果: [(1, 0.082532495), (4, 0.17312291), (14, 0.072532885), (17, 0.38016438), (19, 0.050784156), (21, 0.21228231)]

A few notes:

  1. Text preprocessing must be tailored to your own needs so that irrelevant data is cleaned out; otherwise it will hurt the topic classification.
  2. Results on documents from the training corpus are reasonably good, but results on new, unseen data are only average.
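
To make "the results are only average" measurable, gensim can score a trained model directly; a sketch using topic coherence and perplexity, reusing the variable names from lda_model() above:

from gensim.models import CoherenceModel

# train_data, dictionary, corpus and lda as in lda_model()
cm = CoherenceModel(model=lda, texts=train_data, dictionary=dictionary, coherence='c_v')
print('topic coherence:', cm.get_coherence())

# per-word likelihood bound on the training corpus
print('log perplexity:', lda.log_perplexity(corpus))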

Criticism and corrections are welcome!

Source: blog.csdn.net/m0_47220500/article/details/105765841