Natural language processing: SMS semantic analysis with the LDiA topic model, extracting topic vectors

When you need interpretable topic vectors, use LDiA: when topics are assigned to messages, most topic weights will be zero, so the topics are separated more cleanly.
LDiA assumes that each document is composed of a mixture (linear combination) of some number of topics, a number chosen when the LDiA model is first trained. LDiA also assumes that each topic can be represented by a distribution over words (term frequencies). The probability or weight of each topic within a document, as well as the probability of a word being assigned to a topic, are both assumed at the outset to follow a Dirichlet probability distribution.
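As a minimal sketch of that Dirichlet assumption, you can sample document-topic mixtures directly with NumPy (the 16-topic count matches the model below; the concentration value 0.1 is a hypothetical choice for illustration):

```python
import numpy as np

np.random.seed(42)
# A symmetric Dirichlet prior over 16 topics. A concentration below 1
# pushes the probability mass toward a few topics per document, which
# is why LDiA topic vectors contain many near-zero entries.
alpha = np.full(16, 0.1)
doc_topic_mixtures = np.random.dirichlet(alpha, size=5)  # 5 "documents"
print(doc_topic_mixtures.round(2))
# Each row is a valid topic mixture: non-negative and summing to 1
print(doc_topic_mixtures.sum(axis=1))
```

Raising `alpha` toward (or above) 1 would spread each document's weight more evenly across topics, blurring the clean separation described above.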

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDiA
from nltk.tokenize.casual import casual_tokenize
from nlpia.data.loaders import get_data
import numpy as np
import pandas as pd

# Load the SMS message DataFrame from the nlpia package
pd.options.display.width = 120
sms = get_data('sms-spam')
# Append an exclamation mark to each spam message's index label so spam is easier to spot
index = ['sms{}{}'.format(i, '!'*j) for (i,j) in zip(range(len(sms)), sms.spam)]
sms.index = index
print(sms.head(6))

# Compute the bag-of-words vectors
np.random.seed(42)
counter = CountVectorizer(tokenizer=casual_tokenize)
bow_docs = pd.DataFrame(counter.fit_transform(raw_documents=sms.text).toarray(), index=index)
column_nums, terms = zip(*sorted(zip(counter.vocabulary_.values(), counter.vocabulary_.keys())))
bow_docs.columns = terms
# Look at the first message, labeled 'sms0'
print(sms.loc['sms0'].text)
print(bow_docs.loc['sms0'][bow_docs.loc['sms0'] > 0].head())

ldia = LDiA(n_components=16, learning_method='batch')
ldia = ldia.fit(bow_docs)
# 9,232 words (terms) are allocated across 16 topics (components)
print(ldia.components_.shape)

# Look at the first few words to see how they are allocated to the 16 topics
pd.set_option('display.width', 75)
columns = ['topic{}'.format(i) for i in range(ldia.n_components)]
components = pd.DataFrame(ldia.components_.T, index=terms, columns=columns)
# The exclamation mark (!) is allocated to most topics, but it is an
# especially important part of topic3, where the quote character (") plays almost no role.
print(components.round(2).head(3))
# The top ten terms of this topic look like the kinds of words that might appear in
# an emphatic directive demanding that someone do something or pay for something
print(components.topic3.sort_values(ascending=False)[:10])

# Generate the topic vectors
ldia16_topic_vectors = ldia.transform(bow_docs)
ldia16_topic_vectors = pd.DataFrame(ldia16_topic_vectors, index=index, columns=columns)
# Compared with PCA/SVD, the topics LDiA produces are separated more cleanly
print(ldia16_topic_vectors.round(2).head())
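To see that separation without the nlpia dataset, here is a self-contained sketch on a tiny hypothetical corpus (the documents and topic count are made up for illustration): `fit_transform` returns one probability distribution over topics per document, non-negative and summing to 1, typically dominated by only one or two topics.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation as LDiA

# Hypothetical mini-corpus, just to illustrate the API shape
docs = ["win cash now call now", "meet me for lunch",
        "free prize call today", "see you at lunch tomorrow"]
counter = CountVectorizer()
bow = counter.fit_transform(docs)
ldia = LDiA(n_components=4, learning_method='batch', random_state=42)
topic_vectors = ldia.fit_transform(bow)
# Each row is a probability distribution over the 4 topics:
# non-negative entries that sum to 1, with most weight on few topics
print(topic_vectors.round(2))
```

The many near-zero entries in each row are the property referred to above: unlike the dense, signed loadings from PCA/SVD, LDiA topic weights are sparse probabilities, which makes them easier to interpret.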

Origin blog.csdn.net/fgg1234567890/article/details/112439123