Natural language processing: semantic analysis of SMS messages with truncated SVD (TruncatedSVD) to extract topic vectors

TruncatedSVD can operate directly on sparse matrices, so when working with large-scale data sets we should use TruncatedSVD instead of PCA.
The SVD part of TruncatedSVD decomposes the TF-IDF matrix into 3 matrices, and the truncation part discards the dimensions of the TF-IDF matrix that contain the least information. These discarded dimensions represent the least-varying topics (linear combinations of words) in the document set, and they are likely to contribute little to the overall semantics of the corpus. They tend to contain stop words and other words that are distributed evenly across all documents.
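
To make the truncation concrete, here is a minimal sketch (not part of the original example; it uses a small random matrix in place of the TF-IDF matrix): a full SVD factors the matrix into U, S and V^T, and keeping only the top k singular values discards the directions with the least variance:

import numpy as np

np.random.seed(0)
A = np.random.rand(6, 5)  # stand-in for a tiny TF-IDF matrix (6 documents x 5 terms)
U, S, Vt = np.linalg.svd(A, full_matrices=False)  # A == U @ np.diag(S) @ Vt
k = 2  # keep only the top-2 "topics"
A_k = U[:, :k] @ np.diag(S[:k]) @ Vt[:k, :]  # rank-k approximation of A
print(S.round(3))  # singular values, sorted from largest to smallest
print(np.linalg.norm(A - A_k).round(3))  # error contributed by the discarded dimensions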

Use TruncatedSVD to retain only 16 topics, the ones that account for the largest variance in the TF-IDF vectors:

import pandas as pd
from nlpia.data.loaders import get_data
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize.casual import casual_tokenize
from sklearn.decomposition import TruncatedSVD

# Load the SMS message data as a DataFrame from the nlpia package
pd.options.display.width = 120
sms = get_data('sms-spam')
# Append an exclamation mark to the index label of each spam message so spam is easier to spot
index = ['sms{}{}'.format(i, '!'*j) for (i,j) in zip(range(len(sms)), sms.spam)]
sms.index = index
print(sms.head(6))

# Compute the TF-IDF vector for each message
tfidf = TfidfVectorizer(tokenizer=casual_tokenize)
tfidf_docs = tfidf.fit_transform(raw_documents=sms.text).toarray()
# 9232 distinct 1-gram tokens produced by the tokenizer (casual_tokenize)
print(len(tfidf.vocabulary_))
tfidf_docs = pd.DataFrame(tfidf_docs)
# Center the vectorized documents (bag-of-words vectors) by subtracting the mean
tfidf_docs = tfidf_docs - tfidf_docs.mean()
# 4837 SMS messages
print(tfidf_docs.shape)
# 638 of them (13%) are labeled as spam
print(sms.spam.sum())
print("词汇表:\n", tfidf.vocabulary_)

# Truncated SVD: reduce the TF-IDF vectors to 16 topic dimensions
svd = TruncatedSVD(n_components=16, n_iter=100)
svd_topic_vectors = svd.fit_transform(tfidf_docs.values)
columns = ['topic{}'.format(i) for i in range(svd.n_components)]
svd_topic_vectors = pd.DataFrame(svd_topic_vectors, columns=columns, index=index)
# These topic vectors from TruncatedSVD are exactly the same as the ones PCA produced earlier!
# This is because we deliberately used a large number of iterations (n_iter), and we made sure
# the TF-IDF frequencies for each term (column) were zero-centered (by subtracting each term's mean).
print(svd_topic_vectors.round(3).head(6))
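
As a check on the claim in the comment above, here is a follow-up sketch (assuming it runs in the same session as the listing, after svd, svd_topic_vectors and tfidf_docs exist): explained_variance_ratio_ reports how much variance each of the 16 topics captures, and fitting scikit-learn's PCA on the same zero-centered TF-IDF matrix should give topic vectors that match the TruncatedSVD ones, up to possible sign flips of whole columns:

import numpy as np
from sklearn.decomposition import PCA

# Fraction of variance captured by each of the 16 topics (largest first), and their total
print(svd.explained_variance_ratio_.round(3))
print(svd.explained_variance_ratio_.sum().round(3))

# PCA on the same zero-centered TF-IDF matrix should recover the same topic space
pca = PCA(n_components=16)
pca_topic_vectors = pca.fit_transform(tfidf_docs.values)

# Topic directions are only defined up to sign, so compare absolute values column by column
print(np.allclose(np.abs(svd_topic_vectors.values), np.abs(pca_topic_vectors), atol=1e-3))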

Origin blog.csdn.net/fgg1234567890/article/details/112438678