LDA (Linear Discriminant Analysis) Detailed Explanation - MATLAB

Table of contents

Foreword

Main Text

1. The idea of LDA

2. Rayleigh quotient and generalized Rayleigh quotient

3. The principle of two-class LDA

4. Multi-class LDA principle

5. LDA classification

6. LDA algorithm process

Two-class LDA MATLAB example:

1. Read the dataset

2. Separate the dataset

3. Solve for w

4. Output the reduced data set

5. Classification
 


Foreword

        In the earlier posts on principal component analysis and factor analysis, we summarized those dimensionality reduction algorithms. Here we summarize another classic dimensionality reduction method, Linear Discriminant Analysis (LDA). LDA is very widely used in pattern recognition (for example, face recognition and ship recognition), so it is worth understanding the principle behind the algorithm.

    Before learning LDA, it is necessary to distinguish it from the LDA used in natural language processing. In natural language processing, LDA stands for Latent Dirichlet Allocation, a topic model for documents. This article only discusses linear discriminant analysis, so every subsequent mention of LDA refers to linear discriminant analysis.

Before going into details, here are links to some of my previous posts:

Principal Component Analysis - matlab : Portal

Principal Component Analysis - Python : Portal

Factor analysis - matlab: Portal

Factor Analysis - Python: Portal

Main Text

1. The idea of LDA

        Linear Discriminant Analysis (LDA) is a classic linear learning method. Because it was first proposed by Fisher [1936] for the binary classification problem, it is also called "Fisher discriminant analysis". LDA is also a supervised dimensionality reduction technique, which means that every sample in its data set has a class label. This is different from principal component analysis and factor analysis, which are unsupervised dimensionality reduction techniques that do not consider sample classes.

        The idea of LDA is very simple: given a set of training samples, try to project the samples onto a straight line so that the projections of samples of the same class are as close as possible and the projections of samples of different classes are as far apart as possible; when classifying a new sample, project it onto the same line and determine its class according to the position of the projected point. It can be summed up in one sentence: "After projection, the intra-class variance is the smallest and the inter-class variance is the largest."

The figure shows a two-dimensional schematic of LDA: "+" and "-" denote the positive and negative samples respectively, the ellipses are the outer contours of the data clusters, the dashed lines are the projections, and the red solid circle and solid triangle are the centers of the two classes of samples after projection.

2. Rayleigh quotient and generalized Rayleigh quotient 

        Let's first look at the definition of the Rayleigh quotient.

        The Rayleigh quotient is the function R(A,x) defined by:

R(A,x) = \frac{x^HAx}{x^Hx}

        Here x is a non-zero vector and A is an n×n Hermitian matrix. A Hermitian matrix is a matrix that equals its own conjugate transpose, i.e. A^{H}=A. If A is a real matrix, the condition becomes A^{T}=A.

        The Rayleigh quotient R(A,x) has a very important property: its maximum value equals the largest eigenvalue of A and its minimum value equals the smallest eigenvalue of A, that is,

\lambda_{min} \leq \frac{x^HAx}{x^Hx} \leq \lambda_{max}

        We omit the proof here. When the vector x is normalized, i.e. x^{H}x=1, the Rayleigh quotient reduces to R(A,x)=x^{H}Ax; this form appears in both spectral clustering and PCA.

That is all we need about the Rayleigh quotient.
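As a quick sanity check, here is a minimal MATLAB sketch (random test matrix, illustrative variable names) that verifies numerically that the Rayleigh quotient of a real symmetric matrix always lies between its smallest and largest eigenvalues:

% Minimal numerical check of the Rayleigh quotient bounds (illustrative only)
n = 5;
A = randn(n);
A = (A + A') / 2;                % symmetrize: a real symmetric matrix is Hermitian
x = randn(n, 1);                 % any non-zero vector
R = (x' * A * x) / (x' * x);     % Rayleigh quotient R(A, x)
lam = eig(A);                    % eigenvalues of A
fprintf('lambda_min = %.4f  <=  R = %.4f  <=  lambda_max = %.4f\n', ...
        min(lam), R, max(lam));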

        Next, let's introduce the generalized Rayleigh quotient.

        The generalized Rayleigh quotient is the function R(A,B,x) defined by:

R(A,B,x) = \frac{x^HAx}{x^HBx}

        Here x is a non-zero vector, and A, B are n×n Hermitian matrices with B positive definite. What are its maximum and minimum values? We can reduce it to the standard Rayleigh quotient by a change of variables. Let x=B^{-1/2}x'; then the denominator becomes:

x^HBx = x'^H(B^{-1/2})^HBB^{-1/2}x' = x'^HB^{-1/2}BB^{-1/2}x' = x'^Hx'

And the numerator transforms into:

x^HAx = x'^HB^{-1/2}AB^{-1/2}x'

At this point our R(A,B,x) is transformed into R(A,B,x′):

R(A,B,x') = \frac{x'^HB^{-1/2}AB^{-1/2}x'}{x'^Hx'}

        Using the property of the Rayleigh quotient above, we immediately see that the maximum value of R(A,B,x') is the largest eigenvalue of the matrix B^{-1/2}AB^{-1/2}, which is also the largest eigenvalue of B^{-1}A, and the minimum value is the smallest eigenvalue of B^{-1}A.
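Again as an illustrative sketch (random test matrices, using MATLAB's generalized eigensolver eig(A,B)), we can check numerically that the generalized Rayleigh quotient is bounded by the extreme eigenvalues of B^{-1}A:

% Minimal check of the generalized Rayleigh quotient bounds (illustrative only)
n = 4;
A = randn(n); A = (A + A') / 2;          % real symmetric A
B = randn(n); B = B * B' + n * eye(n);   % symmetric positive definite B
x = randn(n, 1);                         % any non-zero vector
R = (x' * A * x) / (x' * B * x);         % generalized Rayleigh quotient
lam = eig(A, B);                         % generalized eigenvalues = eigenvalues of B^{-1}A
fprintf('min = %.4f  <=  R = %.4f  <=  max = %.4f\n', min(lam), R, max(lam));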

3. The principle of two-class LDA

        Let's start with the slightly simpler and easier-to-understand two-class (binary) LDA and study the principle of LDA in more depth.

        First, given a data set D=\{(x_1,y_1),(x_2,y_2),...,(x_m,y_m)\}, where each sample x_i is an n-dimensional vector and y_i ∈ {0,1}. We define N_j (j=0,1) as the number of samples of class j, X_j (j=0,1) as the set of samples of class j, \mu_j (j=0,1) as the mean vector of the samples of class j, and \Sigma_j (j=0,1) as the covariance matrix of the samples of class j (strictly speaking, the covariance matrix without the 1/N_j factor).

where:

        The expression of μj is:

\mu_j = \frac{1}{N_j}\sum\limits_{x \in X_j}x\;\;(j=0,1)

        The expression of Σj is:

\Sigma_j = \sum\limits_{x \in X_j}(x-\mu_j)(x-\mu_j)^T\;\;(j=0,1)

We project the data onto a straight line. Suppose the projection line is the vector w; then for any sample x_i, its projection onto w is w^{T}x_{i}, and the projections of the two class centers \mu_0 and \mu_1 onto w are w^{T}\mu_{0} and w^{T}\mu_{1} respectively.

Since LDA wants the distance between the projected centers of the two classes to be as large as possible, we want to maximize ||w^T\mu_0-w^T\mu_1||_2^2; at the same time we want the projected points of samples of the same class to be as close as possible, i.e. the projected intra-class covariances w^T\Sigma_0w and w^T\Sigma_1w should be as small as possible, so we minimize w^T\Sigma_0w+w^T\Sigma_1w. Putting these together, our optimization objective is:

\underbrace{arg\;max}_w\;\;J(w) = \frac{||w^T\mu_0-w^T\mu_1||_2^2}{w^T\Sigma_0w+w^T\Sigma_1w} = \frac{w^T(\mu_0-\mu_1)(\mu_0-\mu_1)^Tw}{w^T(\Sigma_0+\Sigma_1)w}

Do you have a lot of question marks at this point?

What exactly is w, where does w^T come from, and how do we compute them? The solution process follows, and the answer comes at the end of this section!

We generally define the intra-class scatter matrix Sw as:

S_w = \Sigma_0 + \Sigma_1 = \sum\limits_{x \in X_0}(x-\mu_0)(x-\mu_0)^T + \sum\limits_{x \in X_1}(x-\mu_1)(x-\mu_1)^T

At the same time, the inter-class scatter matrix Sb is defined as:

S_b = (\mu_0-\mu_1)(\mu_0-\mu_1)^T

The global scatter matrix is S_t = S_b + S_w = \sum_{i=1}^{m}(x_i-\mu)(x_i-\mu)^T, where \mu is the mean vector of all samples.

Thus our optimization objective can be rewritten as:

\underbrace{arg\;max}_w\;\;J(w) = \frac{w^TS_bw}{w^TS_ww}

        Take a closer look at the formula above: isn't this exactly our generalized Rayleigh quotient! That makes things simple. Using the property of the generalized Rayleigh quotient from Section 2, we know that the maximum of J(w') is the largest eigenvalue of the matrix S^{-1/2}_wS_bS^{-1/2}_w, and the corresponding w' is the eigenvector of that largest eigenvalue. Moreover, S^{-1}_wS_b and S^{-1/2}_wS_bS^{-1/2}_w have the same eigenvalues, and their eigenvectors w and w' satisfy the relation w=S_{w}^{-1/2}w'.

Here we have obtained w!

        Note that in the two-class case, the direction of S_bw is always parallel to \mu_0-\mu_1. Writing S_bw=\lambda (\mu_0-\mu_1) and substituting it into (S_w^{-1}S_b)w=\lambda w gives w=S_w^{-1}(\mu_0-\mu_1). In other words, we only need the means and the (scatter) covariances of the original two classes of samples to determine the best projection direction w.
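To make the closed-form solution concrete, here is a small self-contained MATLAB sketch on synthetic 2-D data (the toy data and all variable names are made up for illustration; the mean subtraction relies on implicit expansion, i.e. MATLAB R2016b or later):

% Two-class LDA in closed form: w = Sw^{-1}(mu0 - mu1)   (illustrative sketch)
rng(0);
X0 = randn(50, 2);                   % class 0: 50 samples, 2 features (rows are samples)
X1 = randn(50, 2) + [3 2];           % class 1: shifted cluster (implicit expansion)
mu0 = mean(X0)';  mu1 = mean(X1)';   % class mean vectors (columns)
S0 = (X0 - mu0')' * (X0 - mu0');     % scatter of class 0
S1 = (X1 - mu1')' * (X1 - mu1');     % scatter of class 1
Sw = S0 + S1;                        % intra-class scatter matrix
w  = Sw \ (mu0 - mu1);               % best projection direction
z0 = X0 * w;  z1 = X1 * w;           % projections of the two classes onto w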

4. Multi-class LDA principle

        Earlier we introduced two-class LDA; now let's look at multi-class LDA.

        Suppose our data set is D=\{(x_1,y_1),(x_2,y_2),...,(x_m,y_m)\}, where each sample x_i is an n-dimensional vector and y_i \in \{C_1,C_2,...,C_k\}. We define N_j (j=1,2,...,k) as the number of samples of class j, X_j (j=1,2,...,k) as the set of samples of class j, \mu_j (j=1,2,...,k) as the mean vector of the samples of class j, and \Sigma_j (j=1,2,...,k) as the covariance matrix of the samples of class j. The formulas defined for two-class LDA extend easily to the multi-class case.

        Since we are projecting multiple classes to a low-dimensional space, the projected space is no longer a straight line but a subspace. Suppose the dimension of the low-dimensional space we project to is d; the corresponding basis vectors are (w_1,w_2,...,w_d), and the matrix formed by these basis vectors is W, an n×d matrix.

        At this point our optimization objective becomes:

\frac{W^TS_bW}{W^TS_wW}

Here S_b = \sum\limits_{j=1}^{k}N_j(\mu_j-\mu)(\mu_j-\mu)^T, where \mu is the mean vector of all samples, and S_w = \sum\limits_{j=1}^{k}S_{wj} = \sum\limits_{j=1}^{k}\sum\limits_{x \in X_j}(x-\mu_j)(x-\mu_j)^T.

But there is a problem here.

        Both W^TS_bW and W^TS_wW are matrices, not scalars, so the ratio cannot be optimized as a scalar function! In other words, we cannot directly reuse the optimization method of two-class LDA. What should we do?

        In general, we can use some alternative optimization objectives instead.

        For example, a common LDA multi-class optimization objective function is defined as:

\underbrace{arg\;max}_W\;\;J(W) = \frac{\prod\limits_{diag}W^TS_bW}{\prod\limits_{diag}W^TS_wW}

        Among them, \prod\limits_{diag}A  is the product of the main diagonal elements of A, and W is an n×d matrix.

        

        The optimization process of J(W) can be transformed into:

J(W) = \frac{\prod\limits_{i=1}^dw_i^TS_bw_i}{\prod\limits_{i=1}^dw_i^TS_ww_i} = \prod\limits_{i=1}^d\frac{w_i^TS_bw_i}{w_i^TS_ww_i}

Look at the formula above again and you will find that the rightmost expression is exactly the generalized Rayleigh quotient we discussed above! The maximum of each term is the largest eigenvalue of S_w^{-1}S_b, and the maximum of the product of the d terms is the product of the d largest eigenvalues of S_w^{-1}S_b. The corresponding matrix W is then the matrix formed by the eigenvectors corresponding to those d largest eigenvalues.

    Since W is a projection matrix obtained using the class labels of the samples, the maximum dimension d it can reduce to is k-1. Why is the maximum not the number of classes k?

        Because each term (\mu_j-\mu)(\mu_j-\mu)^T in S_b has rank 1, the rank of the sum is at most k (the rank of a sum of matrices is at most the sum of their ranks). But once the overall mean \mu and the first k-1 class means \mu_j are known, the last \mu_k-\mu can be linearly expressed in terms of the other \mu_j-\mu, so the rank of S_b is at most k-1; that is, there are at most k-1 useful eigenvectors.
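As a sketch of the multi-class case (the helper name lda_multiclass is illustrative; X is m-by-n with samples as rows, y holds numeric class labels for k classes, d <= k-1, implicit expansion assumed), the projection matrix W can be built from the eigenvectors of S_w^{-1}S_b as follows:

function W = lda_multiclass(X, y, d)
% Sketch: multi-class LDA projection matrix from the d largest eigenvalues of Sw^{-1}Sb
    classes = unique(y);
    n   = size(X, 2);
    mu  = mean(X)';                               % overall mean vector (column)
    Sw  = zeros(n);  Sb = zeros(n);
    for j = 1:numel(classes)
        Xj  = X(y == classes(j), :);              % samples of class j
        Nj  = size(Xj, 1);
        muj = mean(Xj)';
        Sw  = Sw + (Xj - muj')' * (Xj - muj');    % intra-class scatter
        Sb  = Sb + Nj * (muj - mu) * (muj - mu)'; % inter-class scatter
    end
    [V, D] = eig(Sw \ Sb);                        % eigen-decomposition of Sw^{-1}Sb (assumes Sw invertible)
    [~, order] = sort(real(diag(D)), 'descend');  % sort eigenvalues in descending order
    W = real(V(:, order(1:d)));                   % n-by-d projection matrix
end

Projecting with Z = X * W then gives the reduced d-dimensional data.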

5. LDA classification

So how do we classify samples in this best classification space?

1) For the binary classification problem: since there are only two classes, after the solution above all samples are mapped onto a one-dimensional space. Let the centers of the two classes after mapping be \bar{z_1} and \bar{z_2} respectively, and take their midpoint as the classification point:

\bar{z} = \frac{\bar{z_1}+\bar{z_2}}{2}

Finally, the samples satisfying

w^{T}x>\bar{z}

fall into one class and the others into the other class.

  2) For the multi-class problem: with the LDA method the original data is finally mapped to at most k-1 dimensions, and we now need to separate the k classes in this (k-1)-dimensional space. How should this be done? For now, all I can think of is converting the problem into a series of binary classification problems; in fact, for the multi-class case LDA is mainly used for dimensionality reduction.

  For this kind of problem we mainly convert it into repeated binary splits: first separate class 1 from classes 2~k, then separate class 2 from classes 3~k, and so on until all classes are separated, as sketched below.
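Below is a minimal sketch of that splitting scheme. Everything here is assumed/illustrative: models(j) is taken to hold a two-class LDA already fitted to separate class j from classes j+1..k, with projection models(j).w and threshold models(j).z_bar, and class j is assumed to lie on the "projection > threshold" side.

function labels = lda_cascade_predict(X, models)
% Sketch: peel the classes off one at a time with successive two-class LDA splits
    m = size(X, 1);
    k = numel(models) + 1;                    % number of classes
    labels = k * ones(m, 1);                  % whatever is never split off is class k
    undecided = true(m, 1);
    for j = 1:numel(models)
        idx  = find(undecided);
        proj = X(idx, :) * models(j).w;       % project the still-undecided samples
        isj  = proj > models(j).z_bar;        % side of the threshold assigned to class j
        labels(idx(isj)) = j;
        undecided(idx(isj)) = false;
    end
end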

6. LDA algorithm process

Input: data set D=\{(x_1,y_1), (x_2,y_2), ...,(x_m,y_m)\}, where each sample x_i is an n-dimensional vector and y_i \in \{C_1,C_2,...,C_k\}; the target dimension after reduction is d.

Output: sample set D′ after dimensionality reduction

1) Calculate the intra-class scatter matrix Sw

2) Calculate the inter-class scatter matrix Sb 

3) Compute the matrix S_w^{-1}S_b

4) Compute the d largest eigenvalues of S_w^{-1}S_b and the corresponding d eigenvectors (w_1,w_2,...,w_d) to obtain the projection matrix W

5) For each sample feature x_i in the sample set, convert it into a new sample z_i=W^Tx_i

6) Get the output sample set D'=\{(z_1,y_1), (z_2,y_2), ...,(z_m,y_m)\}
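Tying steps 1) to 6) together with the illustrative lda_multiclass helper sketched in Section 4 (X, y and d are assumed to be your data matrix, label vector and target dimension):

W = lda_multiclass(X, y, d);   % steps 1)-4): scatter matrices and projection matrix W
Z = X * W;                     % step 5): z_i = W^T x_i for every sample (rows of X)
% step 6): the reduced set D' pairs each row of Z with its original label in y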

Two-class LDA MATLAB example:

1. Read the dataset

data = xlsread('file path')    % replace 'file path' with the path to your spreadsheet file

2. Separate the dataset

For example, take the first 20 rows and columns 2 and 3 of the data set:

data1=data(1:20,2:3)

For example, define the first column of the above subset as the feature x and the second column as the class label y; then split x into x0 and x1 by class:

x = data1(:,1)            % feature column
y = data1(:,2)            % class label column (0 or 1)
x0 = x(find(y==0))        % samples of class 0
x1 = x(find(y==1))        % samples of class 1

And so on; please adapt this to your own data. Since I don't have a relevant example dataset at hand, I can only outline the steps here.

3. Solve for w

% Class means (scalars here, since the feature x is one-dimensional)
u0 = mean(x0);
u1 = mean(x1);

% Scatter of each class (the covariance without the 1/N factor)
E0 = (x0-u0)'*(x0-u0);
E1 = (x1-u1)'*(x1-u1);

Sw = E0+E1;              % intra-class scatter matrix Sw
Sb = (u0-u1)*(u0-u1)';   % inter-class scatter matrix Sb
w = (Sw)^(-1)*(u0-u1)    % projection direction w = Sw^(-1)*(u0-u1)

4. Output the reduced data set

predict_y = w'* x        % projected (one-dimensional) data for all samples

5. Classification

u = mean(w'* x);           % threshold: the mean of all projected samples
                           % (with balanced classes this equals the midpoint (z1_bar+z2_bar)/2 from Section 5)
lei = zeros(size(x,1),1);  % class labels ("lei" means "class")
for i = 1:size(x,1)
    h = w' * x(i);         % projection of the i-th sample
    lei(i) = 1*(h<u);      % assign class 1 if the projection falls below the threshold
end
lei

Well, that's all for this post; examples will be added in the future!
