Latent Semantic Analysis (LSA)

This article follows Chapter 17, Latent Semantic Analysis, of Li Hang's Statistical Learning Methods ~

A core problem in text information processing is how to represent the content of a text numerically and how to compute the semantic similarity between texts.

Traditional methods use word vectors to represent the semantic content of a text, and use a measure in the word vector space (the inner product or the normalized inner product) to represent the semantic similarity between texts.

Latent semantic analysis instead tries to discover latent topics, represents the semantic content of a text by a topic vector, and uses a measure in the topic vector space (the inner product or the normalized inner product) to represent the semantic similarity between texts.

Word vector space

Description:

Given a text, represent its 'semantics' by a vector: each dimension of the vector corresponds to a word, and its value is the frequency or weight of that word in the text (the weight is usually a TF-IDF value). The basic assumption is that the occurrences of words in a text represent its semantic content. Each text in a text collection can then be expressed as such a vector, and a measure in the vector space, such as the inner product or the normalized inner product, indicates the 'semantic similarity' between texts.
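As a minimal illustration of this idea (a toy sketch, not from the book; the word counts below are made up), two texts can be represented as word-frequency vectors and compared with the inner product or the normalized inner product:

```python
import numpy as np

# Toy word-frequency vectors over a fixed vocabulary (hypothetical counts).
# Each component is the frequency of one vocabulary word in the text.
x1 = np.array([2.0, 0.0, 1.0, 3.0])  # text d1
x2 = np.array([1.0, 1.0, 0.0, 2.0])  # text d2

# Inner product as an (unnormalized) similarity measure.
inner = x1 @ x2

# Normalized inner product (cosine similarity).
cosine = inner / (np.linalg.norm(x1) * np.linalg.norm(x2))

print(inner, cosine)
```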

The mathematical definition is given below. Given a collection of $n$ texts $D=\{d_1, d_2, \dots, d_n\}$ and the collection of $m$ words $W=\{w_1, w_2, \dots, w_m\}$ that appear in those texts, the occurrences of words in texts are represented by a word-text matrix, denoted $X$, whose $j$-th column is the word vector corresponding to text $d_j$:

$$X=\begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n}\\ x_{21} & x_{22} & \cdots & x_{2n}\\ \vdots & \vdots & & \vdots\\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix}$$

$X$ is an $m\times n$ matrix; the element $x_{ij}$ is the frequency or weight of word $w_i$ in text $d_j$.

Weights are usually given by term frequency-inverse document frequency (TF-IDF); its definition is easy to find by searching.
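As a rough sketch of building such a weighted word-text matrix, assuming scikit-learn is available (the toy corpus is made up; note that TfidfVectorizer returns a text-word matrix, so it is transposed here to match the word-text convention $X \in \mathbb{R}^{m\times n}$):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: n = 3 texts.
corpus = [
    "apple pie recipe with fresh apple",
    "apple releases a new phone",
    "the airplane and the aircraft landed",
]

vectorizer = TfidfVectorizer()
# fit_transform returns an n x m (text x word) sparse matrix;
# transpose it to get the m x n word-text matrix X.
X = vectorizer.fit_transform(corpus).T.toarray()

print(vectorizer.get_feature_names_out())  # the m words
print(X.shape)                             # (m, n)
```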

Disadvantages:

Since words exhibit polysemy (one word, several meanings) and synonymy (several words, one meaning), similarity computed from word vectors is inaccurate.

Polysemy (one word, several meanings): For example, the word 'apple' has different meanings in different texts. In food texts it means the fruit, and in technology texts it means the company Apple. But in the word vector space it is treated as having a single meaning.

Synonymy (several words, one meaning): For example, the words 'airplane' and 'aircraft' have the same meaning no matter which text they appear in, but they are treated as two independent words in the word vector space.
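A tiny made-up illustration of the synonymy problem: two texts that use 'airplane' and 'aircraft' respectively share no words, so their word vectors are orthogonal and the computed similarity is 0 even though they mean the same thing.

```python
import numpy as np

# Vocabulary: ["airplane", "aircraft"] (hypothetical toy example).
d1 = np.array([1.0, 0.0])  # a text that only says "airplane"
d2 = np.array([0.0, 1.0])  # a text that only says "aircraft"

# The two words mean the same thing, yet in word vector space
# the texts are orthogonal: similarity is 0.
print(d1 @ d2 / (np.linalg.norm(d1) * np.linalg.norm(d2)))  # 0.0
```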

Topic vector space

The semantic similarity of two texts can be reflected in the similarity of their topics. A topic is the content or theme discussed in a text, and a text generally contains several topics. If the topics of two texts are similar, the semantics of the two texts are also similar. For example, 'airplane' and 'aircraft' can represent the same topic, while the two senses of 'apple' can represent different topics.

Topic vector space: Given a text, the text is represented by a vector in the topic space, each component of the vector corresponds to a topic, and its value is the weight of the topic appearing in the text. The number of topics is often much smaller than the number of words.

The mathematical definition is given below:

1. Word-text matrix: the same as the word-text matrix in the word vector space.

Given a collection of $n$ texts $D=\{d_1,\dots,d_n\}$ and the collection of $m$ words $W=\{w_1,\dots,w_m\}$ that appear in those texts, the word-text matrix is again

$$X=\begin{bmatrix} x_{11} & x_{12} & \cdots & x_{1n}\\ x_{21} & x_{22} & \cdots & x_{2n}\\ \vdots & \vdots & & \vdots\\ x_{m1} & x_{m2} & \cdots & x_{mn} \end{bmatrix}$$

an $m\times n$ matrix whose element $x_{ij}$ is the frequency or weight of word $w_i$ in text $d_j$.

2. Word-topic matrix:

Assume that all the texts contain $k$ topics in total, and that each topic is represented by an $m$-dimensional vector defined on the word set $W$ that appears in all texts; this is the topic vector. The $l$-th topic vector can be written as

$$t_l=\begin{bmatrix} t_{1l}\\ t_{2l}\\ \vdots\\ t_{ml} \end{bmatrix},\quad l=1,2,\dots,k$$

where $t_{il}$ is the weight of word $w_i$ in topic $t_l$.

The $k$ topic vectors form the word-topic matrix $T=[t_1\ t_2\ \cdots\ t_k]$, an $m\times k$ matrix, namely:

$$T=\begin{bmatrix} t_{11} & t_{12} & \cdots & t_{1k}\\ t_{21} & t_{22} & \cdots & t_{2k}\\ \vdots & \vdots & & \vdots\\ t_{m1} & t_{m2} & \cdots & t_{mk} \end{bmatrix}$$

3. Topic-text matrix:

Assume that text $d_j$ in the text collection is represented in the topic space by a $k$-dimensional vector $y_j$:

$$y_j=\begin{bmatrix} y_{1j}\\ y_{2j}\\ \vdots\\ y_{kj} \end{bmatrix},\quad j=1,2,\dots,n$$

where $y_{lj}$ is the weight of topic $t_l$ in text $d_j$.

The topic-text matrix is $Y=[y_1\ y_2\ \cdots\ y_n]$, a $k\times n$ matrix:

$$Y=\begin{bmatrix} y_{11} & y_{12} & \cdots & y_{1n}\\ y_{21} & y_{22} & \cdots & y_{2n}\\ \vdots & \vdots & & \vdots\\ y_{k1} & y_{k2} & \cdots & y_{kn} \end{bmatrix}$$

4. The relationship between the word-text matrix, the word-topic matrix, and the topic-text matrix

Any text vector $x_j$ in the word vector space can be approximated by a linear combination of the $k$ topic vectors, that is, a weighted sum of all topic vectors:

$$x_j \approx y_{1j}t_1 + y_{2j}t_2 + \cdots + y_{kj}t_k,\quad j=1,2,\dots,n$$

(the first topic vector times its coefficient + the second topic vector times its coefficient + ... + the $k$-th topic vector times its coefficient)

Expressed as matrices this is:

$$X \approx TY$$

The matrix $X$ is the word-text matrix (the representation of texts in word space).

The matrix $T$ is the word-topic matrix (the topic vector space).

The matrix $Y$ is the topic-text matrix (the representation of texts in topic space).
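A shape-level sketch of this relationship with made-up random matrices (not real data): each column of $X$ is approximated by the weighted sum of the $k$ topic vectors.

```python
import numpy as np

m, n, k = 6, 4, 2        # numbers of words, texts, topics (toy sizes)
rng = np.random.default_rng(0)

T = rng.random((m, k))   # word-topic matrix: columns are topic vectors t_1..t_k
Y = rng.random((k, n))   # topic-text matrix: column j holds the topic weights of text d_j

X_approx = T @ Y         # word-text matrix approximated as X ≈ T Y

# Column j of T @ Y is exactly y_1j * t_1 + ... + y_kj * t_k:
j = 0
weighted_sum = sum(Y[l, j] * T[:, l] for l in range(k))
print(np.allclose(X_approx[:, j], weighted_sum))  # True
```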

Computing the matrices T and Y

The calculation method used is singular value decomposition. For singular value decomposition, please refer to Singular Value Decomposition (SVD) .

Specifically, truncated singular value decomposition is used, with the number of topics $k \le$ the number of texts $n \le$ the number of words $m$:

$$X_{m\times n} \approx U_{m\times k}\, D_{k\times k}\, V_{n\times k}^{T}$$

Here the topic vector space matrix $T$ is $U_{m\times k}$, and the representation of the texts in the topic space, $Y$, is $D_{k\times k}V_{n\times k}^{T}$.
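A minimal numpy sketch of this computation on a small made-up matrix (for large sparse matrices one would more likely use scipy.sparse.linalg.svds or scikit-learn's TruncatedSVD):

```python
import numpy as np

def lsa(X, k):
    """Truncated SVD of the word-text matrix X (m x n), keeping k topics."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)  # X = U diag(s) Vt
    U_k = U[:, :k]            # m x k
    D_k = np.diag(s[:k])      # k x k
    Vt_k = Vt[:k, :]          # k x n
    T = U_k                   # topic vector space (word-topic matrix)
    Y = D_k @ Vt_k            # representation of the texts in topic space
    return T, Y

# Toy word-text matrix (made-up counts), m = 5 words, n = 4 texts.
X = np.array([
    [2., 0., 0., 1.],
    [1., 0., 0., 2.],
    [0., 3., 1., 0.],
    [0., 2., 2., 0.],
    [1., 1., 1., 1.],
])

T, Y = lsa(X, k=2)
print(T.shape, Y.shape)            # (5, 2) (2, 4)
print(np.linalg.norm(X - T @ Y))   # reconstruction error of the rank-2 approximation
```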

That’s all for today’s LSA, welcome everyone to leave a message in the comment area~
