Natural language processing: using dot products between topic vectors to check short-message (SMS) spam classification based on TruncatedSVD

One way to judge how well the vector space model supports classification is to check whether cosine similarity tracks category membership, that is, whether messages in the same category are more similar to one another than to messages in other categories.
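
Before the full listing, here is a minimal sketch of the relationship it relies on (the vectors a and b are made up for illustration): once vectors are scaled to unit length, their dot product equals their cosine similarity, which is why the listing normalizes the topic vectors before multiplying them.

import numpy as np

# Minimal sketch: for unit-length vectors, the dot product equals the cosine similarity
a = np.array([1.0, 2.0, 0.5])
b = np.array([0.5, 1.5, 1.0])
cos_sim = a.dot(b) / (np.linalg.norm(a) * np.linalg.norm(b))
a_unit = a / np.linalg.norm(a)
b_unit = b / np.linalg.norm(b)
print(np.isclose(cos_sim, a_unit.dot(b_unit)))  # True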

We should see that spam messages (the ones whose index ends in "!", such as "sms2!") have larger positive cosine similarities (dot products) with one another:

import pandas as pd
from nlpia.data.loaders import get_data
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.tokenize.casual import casual_tokenize
from sklearn.decomposition import TruncatedSVD
import numpy as np

# Load the SMS dataset as a DataFrame from the nlpia package
pd.options.display.width = 120
sms = get_data('sms-spam')
# Append an exclamation mark to the index label of each spam message so that spam is easier to spot
index = ['sms{}{}'.format(i, '!'*j) for (i,j) in zip(range(len(sms)), sms.spam)]
sms.index = index
print(sms.head(6))

# Compute the TF-IDF vector for each message
tfidf = TfidfVectorizer(tokenizer=casual_tokenize)
tfidf_docs = tfidf.fit_transform(raw_documents=sms.text).toarray()
# 9232 distinct 1-gram tokens from the tokenizer (casual_tokenize)
print(len(tfidf.vocabulary_))
tfidf_docs = pd.DataFrame(tfidf_docs)
# Center the vectorized documents (bag-of-words vectors) by subtracting the mean of each term
tfidf_docs = tfidf_docs - tfidf_docs.mean()
# 4837 SMS messages
print(tfidf_docs.shape)
# 638 of them (13%) are labeled as spam
print(sms.spam.sum())
print("词汇表:\n", tfidf.vocabulary_)

# Truncated SVD (LSA): reduce the 9232-dimensional TF-IDF vectors to 16 topic dimensions
svd = TruncatedSVD(n_components=16, n_iter=100)
svd_topic_vectors = svd.fit_transform(tfidf_docs.values)
columns = ['topic{}'.format(i) for i in range(svd.n_components)]
svd_topic_vectors = pd.DataFrame(svd_topic_vectors, columns=columns, index=index)
# These TruncatedSVD topic vectors are identical to the topic vectors produced by PCA earlier.
# That is because we deliberately used a large number of iterations (n_iter) and made sure the
# TF-IDF frequencies of every term (column) were zero-centered (by subtracting each term's mean).
print(svd_topic_vectors.round(3).head(6))
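
# Sanity check (a sketch, assuming scikit-learn's PCA is available): because the TF-IDF matrix
# was already zero-centered and n_iter is large, PCA should produce the same topic vectors,
# up to possible sign flips of individual columns.
from sklearn.decomposition import PCA
pca_topic_vectors = PCA(n_components=16).fit_transform(tfidf_docs.values)
print(pd.DataFrame(pca_topic_vectors, columns=columns, index=index).round(3).head(6))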

# Normalize each topic vector to unit length, then compute the dot products among the topic
# vectors of the first 10 messages. We should see that spam messages (indices ending in "!",
# such as "sms2!") have larger positive cosine similarities (dot products) with one another.
svd_topic_vectors = (svd_topic_vectors.T / np.linalg.norm(svd_topic_vectors, axis=1)).T
print(svd_topic_vectors.iloc[:10].dot(svd_topic_vectors.iloc[:10].T).round(1))
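
Instead of only eyeballing the 10 x 10 block above, a short follow-up sketch (the names sims, spam_mask, within_spam, and spam_vs_ham are mine, not from the original listing) averages the cosine similarities within the spam class and between spam and ham over the whole corpus:

# Follow-up sketch: average cosine similarity within spam vs. between spam and ham.
# The topic vectors are already unit-normalized, so dot products are cosine similarities.
sims = svd_topic_vectors.values.dot(svd_topic_vectors.values.T)
spam_mask = sms.spam.astype(bool).values
within_spam = sims[np.ix_(spam_mask, spam_mask)].mean()
spam_vs_ham = sims[np.ix_(spam_mask, ~spam_mask)].mean()
print('mean cosine similarity spam-spam: {:.3f}'.format(within_spam))
print('mean cosine similarity spam-ham:  {:.3f}'.format(spam_vs_ham))

If the topic space separates the classes, the spam-spam average should come out noticeably higher than the spam-ham average.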

