Natural Language Processing: Singular Value Decomposition (Truncated SVD)

Singular Value Decomposition (SVD) is the algorithm behind LSA (Latent Semantic Analysis), the technique used to compute topic vectors.
SVD decomposes any matrix into 3 factor matrices, which can be multiplied together to reconstruct the original matrix. This is analogous to factoring a large integer into exactly 3 integer factors, except that the factors here are not scalars but two-dimensional real matrices with special properties. The 3 factor matrices produced by SVD have useful mathematical properties that can be exploited for dimensionality reduction and LSA.
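The factor-and-reconstruct idea can be sketched in plain NumPy. The small matrix below is made-up illustration data, not the book's corpus:

```python
import numpy as np

# a small example matrix (hypothetical values, for illustration only)
A = np.array([[3.0, 1.0, 0.0],
              [1.0, 2.0, 1.0]])

# decompose into the 3 factors:
# U (left singular vectors), s (singular values), Vt (right singular vectors)
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# rebuild the diagonal matrix of singular values and
# multiply the 3 factors back together
A_reconstructed = U @ np.diag(s) @ Vt

# the product recovers the original matrix (up to floating-point error)
print(np.allclose(A, A_reconstructed))  # True
```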

Use code to implement SVD matrix decomposition:

from nlpia.book.examples.ch04_catdog_lsa_sorted import lsa_models, prettify_tdm
import numpy as np
import pandas as pd

# corpus of 11 documents with a vocabulary of 6 terms
bow_svd, tfidf_svd = lsa_models()
print("Data:\n", prettify_tdm(**bow_svd))
# term-document matrix
tdm = bow_svd['tdm']
print("Term-document matrix tdm:\n", tdm)

# singular value decomposition
U, s, Vt = np.linalg.svd(tdm)
# left singular vectors U: the topic-term matrix
# each element is a weight or score indicating how important each term is to each topic
print("Left singular vectors U:\n", pd.DataFrame(U, index=tdm.index).round(2))

# singular values s:
# each singular value gives the amount of information carried by the
# corresponding dimension of the new semantic (topic) vector space
print(s.round(1))
S = np.zeros((len(U), len(Vt)))
# np.fill_diagonal() writes the singular values onto the diagonal
np.fill_diagonal(S, s)
print("Singular value matrix S:\n", pd.DataFrame(S).round(1))

# right singular vectors Vt: this matrix captures shared meaning between documents,
# because it measures how often documents use the same topics in the new semantic model
print("Right singular vectors Vt:\n", pd.DataFrame(Vt).round(2))

# term-document matrix reconstruction error
# one way to measure the accuracy of LSA is to see how well the
# term-document matrix can be reconstructed from the topic-document matrix
print(tdm.shape)
print(np.prod(tdm.shape))
err = []
# zero out singular values one at a time, smallest first,
# and measure the reconstruction error after each truncation
for numdim in range(len(s), 0, -1):
    S[numdim - 1, numdim - 1] = 0
    reconstructed_tdm = U.dot(S).dot(Vt)
    # root-mean-square error (RMSE)
    err.append(np.sqrt(((reconstructed_tdm - tdm).values.flatten() ** 2).sum() / np.prod(tdm.shape)))

# the more you truncate, the larger the error
print(np.array(err).round(2))
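The loop above truncates by zeroing out singular values in place. The same effect is usually written by keeping only the top-k components, which is the dimensionality-reduction step at the heart of LSA. The sketch below uses random stand-in data, assumed for illustration, in place of the book's tdm:

```python
import numpy as np

# hypothetical 6-term x 11-document count matrix, standing in for tdm above
rng = np.random.default_rng(0)
tdm = rng.integers(0, 3, size=(6, 11)).astype(float)

U, s, Vt = np.linalg.svd(tdm, full_matrices=False)

# truncated SVD: keep only the k largest singular values and
# the corresponding columns of U and rows of Vt
k = 3
tdm_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# topic-document matrix: each document is now a k-dimensional topic vector
topic_doc = np.diag(s[:k]) @ Vt[:k, :]
print(topic_doc.shape)  # (3, 11)

# RMSE of the rank-k reconstruction: grows as k shrinks
rmse = np.sqrt(((tdm_k - tdm) ** 2).mean())
print(rmse.round(2))
```

Because the singular values are sorted in descending order, the discarded dimensions are the ones carrying the least information, so the rank-k matrix is the best possible approximation of that rank in the least-squares sense.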


Origin blog.csdn.net/fgg1234567890/article/details/112437957