Natural Language Processing: PCA-Based Semantic Analysis of Short Messages to Extract Topic Vectors

Dimensionality reduction is the primary countermeasure against overfitting. By combining many dimensions (words) into far fewer dimensions (topics), our NLP pipeline becomes more general. Generalizing the pipeline helps ensure that it applies to a wider set of real-world short messages, not just this specific set of messages.
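As a minimal sketch of the idea, on a tiny synthetic document-term matrix rather than the SMS data: when word columns are correlated, PCA collapses them into a handful of topic dimensions that retain nearly all of the variance.

```python
import numpy as np
from sklearn.decomposition import PCA

# Tiny synthetic document-term matrix: 6 documents, 5 "word" dimensions.
# Columns 0-1 and 2-3 are strongly correlated, plus one near-zero noise column,
# so almost all the variance fits in 2 topic dimensions.
rng = np.random.default_rng(0)
base = rng.normal(size=(6, 2))
docs = np.hstack([base[:, :1], base[:, :1] * 0.9,
                  base[:, 1:], base[:, 1:] * 1.1,
                  rng.normal(scale=0.01, size=(6, 1))])

pca = PCA(n_components=2)
topics = pca.fit_transform(docs)  # 5 word dimensions -> 2 topic dimensions
print(topics.shape)               # (6, 2)
print(pca.explained_variance_ratio_.sum())
```

The sum of `explained_variance_ratio_` comes out close to 1.0 here, confirming that two topics capture what five word columns encoded redundantly.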

Using the PCA model in scikit-learn, we convert the dataset's 9232-dimensional TF-IDF vectors into 16-dimensional topic vectors:

import pandas as pd
from nlpia.data.loaders import get_data
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize.casual import casual_tokenize
from sklearn.decomposition import PCA

# Load the short-message dataset as a DataFrame from the nlpia package
pd.options.display.width = 120
sms = get_data('sms-spam')
# Append an exclamation mark to the index label of each spam message, so spam is easier to spot
index = ['sms{}{}'.format(i, '!'*j) for (i,j) in zip(range(len(sms)), sms.spam)]
sms.index = index
print(sms.head(6))

# Compute the TF-IDF vector for each message
tfidf = TfidfVectorizer(tokenizer=casual_tokenize)
tfidf_docs = tfidf.fit_transform(raw_documents=sms.text).toarray()
# 9232 distinct 1-gram tokens from the tokenizer (casual_tokenize)
print(len(tfidf.vocabulary_))
tfidf_docs = pd.DataFrame(tfidf_docs)
# Center the vectorized documents (bag-of-words vectors) by subtracting the mean
tfidf_docs = tfidf_docs - tfidf_docs.mean()
# 4837 short messages
print(tfidf_docs.shape)
# 638 of them (13%) are labeled as spam
print(sms.spam.sum())
print("Vocabulary:\n", tfidf.vocabulary_)
# Sort the terms by their index in the vocabulary, so they line up with matrix columns for later output
column_nums, terms = zip(*sorted(zip(tfidf.vocabulary_.values(), tfidf.vocabulary_.keys())))
print(terms)
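The sorted-zip idiom above pairs each vocabulary index with its term and sorts by index, so the resulting term tuple matches the column order of the TF-IDF matrix. It can be checked on a toy three-term vocabulary (the dict below is made up for illustration):

```python
# Toy vocabulary mapping terms to column indices, in the same shape
# that TfidfVectorizer.vocabulary_ produces
vocab = {'free': 2, 'deal': 0, 'half': 1}

# Sort (index, term) pairs by index so terms line up with matrix columns
column_nums, terms = zip(*sorted(zip(vocab.values(), vocab.keys())))
print(column_nums)  # (0, 1, 2)
print(terms)        # ('deal', 'half', 'free')
```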

# PCA dimensionality reduction: convert the dataset's 9232-dimensional TF-IDF vectors into 16-dimensional topic vectors
pca = PCA(n_components=16)
pca = pca.fit(tfidf_docs)
pca_topic_vectors = pca.transform(tfidf_docs)
columns = ['topic{}'.format(i) for i in range(pca.n_components)]
pca_topic_vectors = pd.DataFrame(pca_topic_vectors, columns=columns, index=index)
print(pca_topic_vectors.round(3).head(6))
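It is worth checking how much of the original variance the 16 topics retain; scikit-learn's PCA exposes this as `explained_variance_ratio_`. A sketch on synthetic stand-in data (the nlpia SMS matrix is assumed unavailable here):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 50))  # stand-in for centered TF-IDF vectors

pca = PCA(n_components=16).fit(X)
ratios = pca.explained_variance_ratio_  # fraction of total variance per topic
print(ratios[:4].round(3))              # leading topics carry the most variance
print('cumulative:', ratios.sum().round(3))
```

The ratios are sorted in decreasing order, so a steep drop-off would suggest that even fewer than 16 topics could suffice.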

# Print the weights (loadings) of the PCA transformation
weights = pd.DataFrame(pca.components_, columns=terms, index=['topic{}'.format(i) for i in range(16)])
pd.options.display.max_columns = 8
print(weights.head(4).round(3))
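A common way to interpret a topic is to rank its terms by the magnitude of their weights. A sketch of that, using the same `weights` DataFrame pattern on a small hand-made loading matrix (the term names and values are invented for illustration):

```python
import numpy as np
import pandas as pd

terms = ['free', 'deal', 'half', 'call', 'home', 'later']
components = np.array([[0.6, 0.5, 0.4, -0.1, 0.0, 0.1],
                       [-0.2, 0.1, 0.0, 0.7, 0.5, 0.4]])
weights = pd.DataFrame(components, columns=terms,
                       index=['topic0', 'topic1'])

# Top 3 terms per topic, ranked by absolute weight
tops = {}
for topic, row in weights.iterrows():
    tops[topic] = row.abs().sort_values(ascending=False).head(3).index.tolist()
    print(topic, tops[topic])
```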

# Inspect the weights for specific tokens
pd.options.display.max_columns = 12
deals = weights['! ;) :) half off free crazy deal only $ 80 %'.split()].round(3) * 100
print(deals)
# Topics 4, 8, and 9 all appear to carry positive "deal" sentiment, while topics 0, 3,
# and 5 appear to be anti-deal topics, i.e. messages that are the opposite of a deal:
# negative deals. So deal-related words can contribute positively to some topics and
# negatively to others.
print(deals.T.sum())
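Topic vectors like these are typically fed into a downstream classifier, for example linear discriminant analysis for spam detection. A hedged sketch on synthetic labels, not the actual SMS pipeline: the point is only the shape of the workflow, PCA followed by LDA.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Synthetic stand-in: 300 "ham" and 60 "spam" points in a 50-d TF-IDF space,
# with the spam cluster shifted so the classes are separable
X = np.vstack([rng.normal(size=(300, 50)),
               rng.normal(loc=1.5, size=(60, 50))])
y = np.array([0] * 300 + [1] * 60)

topic_vectors = PCA(n_components=16).fit_transform(X)  # words -> topics
lda = LinearDiscriminantAnalysis().fit(topic_vectors, y)
print('train accuracy:', round(lda.score(topic_vectors, y), 3))
```

On this easy synthetic split the training accuracy is near perfect; real SMS data is harder, but the same two-stage pattern applies.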

Origin blog.csdn.net/fgg1234567890/article/details/112438512