Machine Learning and Deep Learning - Document Similarity Calculation Based on Latent Semantic Analysis (LSA)

Latent Semantic Analysis (LSA) is a technique that uses mathematical and statistical methods to analyze text data. It can be used to discover relationships between texts and to provide deeper insight into their meaning.

The following is a detailed description of the LSA model:

Collect a corpus: First, collect a corpus containing a large amount of text data. These texts can be of any type, such as news articles, blog posts, papers, etc.

Build the vocabulary: Then take all the distinct words from the corpus and build a list called the "vocabulary". Each word also needs to be mapped to a unique ID.
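
As a minimal illustration (the toy corpus and variable names here are hypothetical, not part of the original example), a vocabulary with a word-to-ID mapping can be built with a plain dictionary; the full example later does the same thing with gensim's corpora.Dictionary:

# A small sketch of building a vocabulary with word-to-ID mapping (toy corpus for illustration only).
corpus = ["the cat sat on the mat", "the dog sat on the log"]

word2id = {}
for doc in corpus:
    for word in doc.split():
        if word not in word2id:
            word2id[word] = len(word2id)  # assign the next unused ID

print(word2id)  # {'the': 0, 'cat': 1, 'sat': 2, 'on': 3, 'mat': 4, 'dog': 5, 'log': 6}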

Create a document-term frequency matrix: Next, you need to convert the text data into numerical form. To do this, create a matrix called a "document-term frequency" matrix, where each row represents a document and each column represents a word. Each element represents the number of times that word occurs in that document.
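
As a short sketch (the toy documents are hypothetical), scikit-learn's CountVectorizer builds such a matrix directly; the full example later builds the equivalent bag-of-words representation with gensim's doc2bow:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog chased the cat"]  # toy documents for illustration
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)       # document-term frequency matrix (sparse)
print(vectorizer.get_feature_names_out())  # the vocabulary, in column order
print(dtm.toarray())                       # each row: one document; each column: one word's count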

Perform Singular Value Decomposition (SVD): Next, perform SVD on the document-term frequency matrix. SVD is a mathematical technique that decomposes a matrix into the product of three matrices, X ≈ UΣVᵀ: U relates documents to the latent topics, Σ is a diagonal matrix of singular values indicating the strength of each topic, and Vᵀ relates the topics to words.
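
A minimal numpy sketch of truncated SVD on a tiny, made-up count matrix (the values and the number of topics k are hypothetical):

import numpy as np

# Toy document-term frequency matrix: 4 documents x 5 words (values invented for illustration).
X = np.array([
    [2, 1, 0, 0, 1],
    [1, 2, 0, 0, 0],
    [0, 0, 3, 1, 0],
    [0, 0, 1, 2, 1],
], dtype=float)

U, s, Vt = np.linalg.svd(X, full_matrices=False)

k = 2  # number of topics to keep
U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]  # truncated factors
docs_in_topic_space = U_k * s_k              # each row: a document in the k-dimensional semantic space
print(docs_in_topic_space)
print(Vt_k)                                  # each row: a topic expressed as weights over the words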

Select number of topics: Select the number of topics to extract, which is usually determined experimentally and empirically.

Extract topics: Using the factor matrices computed in the previous step (in particular Vᵀ, which relates topics to words), the topics present in the text data can be extracted. Each topic is a weighted set of associated words that can be used to understand the meaning of the text.

Applying the model: Based on the extracted topics, the model can be applied to various tasks such as text clustering, classification and information retrieval.

LSA uses SVD to transform text data into numerical form and provides deeper insight into the similarities and meanings of text data.

Below, we work through a case to further understand LSA.

Purpose

  1. Use the Jieba library to perform Chinese word segmentation on sentences and output the segmentation results (a short jieba sketch follows this list)
  2. Text similarity analysis of 18 docx documents based on Latent Semantic Analysis (LSA)
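
Since the main code below splits text on whitespace rather than calling jieba, here is a minimal, self-contained sketch of jieba segmentation (the sample sentence is invented for illustration):

import jieba

sentence = "机器学习与深度学习是人工智能的重要分支"  # sample sentence for illustration
words = jieba.lcut(sentence)  # segmentation result as a list of words
print("/".join(words))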

The basic idea

1. Read document collection: Read all .docx files from the specified directory and store their contents in the docs list.
2. Preprocess each document (remove stop words, perform word segmentation, stemming, etc.).
3. Create a document-term matrix: convert the document collection into a document-term representation in which each element records how many times a word occurs in the corresponding document (the code below uses the gensim library's Dictionary and doc2bow; scikit-learn's CountVectorizer is the equivalent alternative).
4. Train the LSA model: reduce the document-term matrix to a latent semantic space representation (the code below uses gensim's LsiModel; scikit-learn's TruncatedSVD is the equivalent, and a short sketch of that route follows the main code).
5. Create index and similarity matrix: Use MatrixSimilarity to create a similarity matrix for calculating the similarity between documents.
6. Calculate document similarity using similarity matrix: select a target document and calculate the similarity between this document and other documents.
7. Output the documents most similar to the target document: print the content of the target document, and then print the index, content and similarity score of the top 10 most similar documents in descending order of similarity.

Code

Use an LSA model and a similarity matrix to perform similarity analysis on a given set of 18 docx documents.

import os
import docx
from gensim import corpora, models, similarities

# Define the docx_analysis class: it reads the content of a given docx file and
# counts the words in each sentence together with their frequencies.

class docx_analysis:
    def __init__(self, docx_path):
        self.docx_path = docx_path
        self.wds_freq = {}  # word -> frequency
        
    def local_record_wds(self, words):
        for wd in words:
            if wd in self.wds_freq:
                self.wds_freq[wd] += 1
            else:
                self.wds_freq[wd] = 1

    def analyze(self):
        doc = docx.Document(self.docx_path)
        for para in doc.paragraphs:
            sentences = para.text.split("。")
            for sent in sentences:
                words = sent.strip().split()
                self.local_record_wds(words)
                
# Read the document collection
doc_dir = r'C:\Users\l\Desktop\机器学习与深度学习\LSA\lsa'
docs = []
for file in os.listdir(doc_dir):
    if file.endswith('.docx'):
        with open(os.path.join(doc_dir, file), 'rb') as f:
            doc = docx.Document(f)
            text = '\n'.join([para.text for para in doc.paragraphs])
            docs.append(text)

# Preprocess the text: remove stop words, tokenize, stem, etc.
# (here only a small English stop list and whitespace splitting are used)
stoplist = set('for a of the and to in'.split())
texts = [[word for word in doc.lower().split() if word not in stoplist] for doc in docs]

# Create the dictionary and the document-term (bag-of-words) corpus
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Train the LSA model
lsi = models.LsiModel(corpus, id2word=dictionary, num_topics=10)

# Map the documents into the latent semantic space
corpus_lsi = lsi[corpus]

# Create the index and similarity matrix
index = similarities.MatrixSimilarity(corpus_lsi)

# Use the similarity index to compute document similarities
doc_id = 0  # choose the first docx as the target document
sims = index[lsi[corpus[doc_id]]]
sims = sorted(enumerate(sims), key=lambda item: -item[1])


# Print the documents most similar to the target document
print(f'Selected target document: docx #{doc_id+1}\n{docs[doc_id]}')
# Analyze the target document and count word frequencies
target_doc_path = os.path.join(doc_dir, os.listdir(doc_dir)[doc_id])
da = docx_analysis(target_doc_path)
da.analyze()


# Print the word-frequency statistics of the target document
print('\nWord-frequency statistics of the target document:')
for word, freq in da.wds_freq.items():
    print(f'Word: {word}, Frequency: {freq}')
print('\nSimilar documents:')
for doc_idx, sim_score in sims[1:11]:  # the 10 most similar documents
    print(f'Similar document index {doc_idx}:\n{docs[doc_idx]}\nSimilarity score: {sim_score}')
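
As referenced in the step list above, here is a short sketch of the equivalent scikit-learn route using CountVectorizer, TruncatedSVD and cosine similarity. It assumes the same docs list has already been loaded and that the documents are space-separated token strings (e.g. already segmented with jieba); the parameter values are illustrative:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

# Assumes `docs` is the list of document strings loaded above.
vectorizer = CountVectorizer()
dtm = vectorizer.fit_transform(docs)          # document-term frequency matrix

svd = TruncatedSVD(n_components=10, random_state=42)
lsa_vectors = svd.fit_transform(dtm)          # documents in the 10-dimensional latent space

sim_matrix = cosine_similarity(lsa_vectors)   # pairwise document similarities
target = 0
ranking = sim_matrix[target].argsort()[::-1]  # indices sorted by similarity to the target
print(ranking[1:11])                          # the 10 most similar documents (excluding the target itself)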


Advantages and disadvantages of Latent Semantic Analysis (LSA)
Advantages:

  • LSA assumes that the meaning hidden behind the words (that is, the semantic dimensions of the latent semantic space) characterizes the true meaning of the text better than the surface words themselves.
  • LSA uses the latent semantic structure to represent terms and documents, mapping both into the same k-dimensional semantic space, where each is expressed in terms of k factors. The meaning of the vectors changes substantially: they no longer reflect simple term frequencies and distributions, but strengthened semantic relationships. While preserving most of the original information, this overcomes the problems of polysemy, synonymy and term dependence found in the traditional vector space representation.

Disadvantages:

  • LSA is not a probabilistic model and lacks a rigorous mathematical and statistical foundation.
  • Although LSA alleviates some problems of polysemy and synonymy and can also be used for dimensionality reduction, the lack of a probabilistic foundation remains. A common alternative is Latent Dirichlet Allocation (LDA), a topic model that describes each document in the collection as a probability distribution over topics.
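
For reference, a minimal gensim sketch of the LDA alternative mentioned above, reusing the dictionary and corpus built in the main example (the parameter values are illustrative, not tuned):

from gensim import models

# Assumes `dictionary` and `corpus` are the objects built in the main example above.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=10, passes=10, random_state=42)

# Each document is described as a probability distribution over topics.
for topic_id, prob in lda.get_document_topics(corpus[0]):
    print(f'topic {topic_id}: {prob:.3f}')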
