Linear Discriminant Analysis for Machine Learning

1 Introduction to Linear Discriminant Analysis

1.1 What is linear discriminant analysis

Linear Discriminant Analysis (LDA for short) is a classic supervised learning algorithm, also known as "Fisher Discriminant Analysis". LDA is widely used in pattern recognition, for example in face recognition, ship recognition, and other image recognition tasks.

The core idea of LDA is: given a training sample set, project the samples onto a straight line so that the projections of samples from the same class are as close together as possible and the projections of samples from different classes are as far apart as possible. To classify a new sample, project it onto the same line and assign its class according to the position of its projection.

Do not confuse this LDA with the LDA used in natural language processing, where LDA stands for Latent Dirichlet Allocation, a topic model for documents. In this article, LDA always refers to linear discriminant analysis.

1.2 Why learn the LDA algorithm

PCA is an unsupervised data dimensionality reduction method, while LDA is a supervised one. Even if category labels are provided for the training samples, PCA does not use them, whereas LDA uses the category labels when reducing the dimensionality of the data.

From a geometric point of view, both PCA and LDA project data onto new, mutually orthogonal coordinate axes; they differ only in the constraint (or goal) used during the projection. PCA projects the data onto the few mutually orthogonal directions of largest variance, in order to retain as much sample information as possible. The larger the variance of the samples, the greater their diversity. When training a model, we want the data to differ as much as possible: if there are many samples but they are all similar or identical to one another, they provide essentially the same information, so only a few samples contribute useful information, and insufficient information leads to an unsatisfactory model. This is the purpose of PCA dimensionality reduction: project the data onto the few mutually orthogonal directions with the largest variance. This criterion is sometimes useful, as in the following example:

For this sample set, we could project the data onto the x-axis or the y-axis, but neither is the best projection direction, because neither best reflects the distribution of the data. There is clearly an optimal direction describing the trend of the data: the direction of the red straight line in the figure. It is also the direction in which the projected samples have the largest variance, so projecting onto it retains the most information.

However, for data sets with other distributions, PCA's goal of maximizing post-projection variance is not appropriate. For example, consider the data set in the picture below:

For this data set, if we again use PCA and choose the direction of largest variance as the projection direction, the best direction selected by PCA will be the one shown by the red straight line in the figure. The projection does have the largest variance, but there is another problem: after this projection the two classes of samples are mixed together and are no longer linearly separable, or even separable at all. That would be disastrous: samples that were originally linearly separable become inseparable through our own doing. Notice, however, that if we project onto the yellow straight line in the figure, we reduce the dimensionality of the data while keeping the two classes linearly separable. If LDA were used to reduce this data set, the projection direction it finds is exactly the direction of the yellow line.

This is actually the idea of LDA, or the goal of LDA dimensionality reduction: reduce the dimensionality of labeled data and project it into a low-dimensional space while satisfying three conditions:

  • Retain as much information as possible from the data samples (that is, choose the directions given by the eigenvectors corresponding to the largest eigenvalues).
  • Find the projection direction that makes the samples as easy to classify as possible.
  • After projection, the samples of the same type are as close as possible, and the samples of different types are as far away as possible.

1.3 The idea of LDA

The core idea of LDA: small within-class scatter, large between-class scatter.

  • linear classification

Linear classification means that a linear equation can separate the data to be classified, i.e., a hyperplane distinguishes the positive and negative samples; the decision function has the form y = w^{T}x. In the two-dimensional case the hyperplane is simply a straight line (a linear function). A linear classifier is based on a linear prediction function, and its decision boundaries are flat, such as lines and planes. Common examples include the perceptron and least squares.

  • nonlinear classification

Nonlinear classification means that no linear equation can separate the data; the decision boundary is not restricted and can be a curved surface or a combination of several hyperplanes.

LDA is a supervised dimensionality reduction technique: every sample in its data set has a class label. This differs from PCA, an unsupervised technique that ignores class labels. The idea of LDA can be summarized in one sentence: "after projection, the within-class variance is as small as possible and the between-class variance is as large as possible." In other words, when we project the data into a low-dimensional space, we want the projections of samples from the same class to be as close together as possible, and the centers of different classes to be as far apart as possible.

Suppose we have two classes of data, red and blue, as shown in the figure below, with two-dimensional features. We want to project these data onto a one-dimensional straight line so that the projections of each class are as close together as possible, and the distance between the red and blue class centers is as large as possible.

The figure above shows two projection choices. Intuitively, the projection on the right is better than the one on the left: in the right figure the red and blue data are each concentrated and the distance between the classes is obvious, whereas on the left the data are mixed at the boundary. That is the main idea of LDA. In practical applications our data have multiple classes, the original data usually have more than two dimensions, and the projection target is generally not a straight line but a low-dimensional hyperplane.

Zhou Zhihua's "Machine Learning" briefly describes the central idea of linear discriminant analysis, which can be related to the within-group deviation SSE and between-group deviation SSA in analysis of variance (Fisher's linear discriminant analysis and analysis of variance were both invented by R. A. Fisher).

The idea of ​​Fisher's discriminant analysis is very simple: Given a set of training samples, try to project the samples onto a straight line, so that the projection points of the same type of samples are as close as possible, and the projection points of different types of samples are as far away as possible. When classifying a new sample, project it onto the same straight line, and then determine its category according to the position of the projected point of the new sample.

1.4 The optimization objective of LDA algorithm

The principle of LDA: project into a lower-dimensional space so that the projected points are grouped by category, cluster by cluster, and points of the same category are closer to each other in the projected space. The following figure, from Zhou Zhihua's "Machine Learning", gives a two-dimensional schematic diagram:

What is linear discriminant analysis? The so-called linearity means that we want to project the data points onto a straight line (maybe multiple straight lines). The analytical function of the straight line is also called a linear function. Usually, the expression of the straight line is:
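y = w^{T}x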

In fact, x here is the sample vector (column vector). If projected onto a straight line, w is an eigenvector (column vector form) or a matrix composed of multiple eigenvectors. As for why w is an eigenvector, we can deduce it later. y is the projected sample point (column vector). We first illustrate using two-class samples, and then generalize to multi-class problems.

Project the data onto the straight line w. The projections of the two class centers onto the line are W^{T}\mu_{0} and W^{T}\mu_{1}; if all sample points are projected onto the line, the covariances of the two classes of projected samples are W^{T}\Sigma_{0}W and W^{T}\Sigma_{1}W respectively.

The covariance matrix of same-class samples after projection is calculated as follows:
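\sum_{x\in X_{i}}\left( W^{T}x-W^{T}\mu_{i} \right)\left( W^{T}x-W^{T}\mu_{i} \right)^{T}
= \sum_{x\in X_{i}} W^{T}\left( x-\mu_{i} \right)\left( x-\mu_{i} \right)^{T}W
= W^{T}\Sigma_{i}W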

The middle part of the formula above (the expression in the second line) contains the covariance matrix of the same-class samples before projection. This also shows the relationship between the covariance matrices before and after projection: if the covariance matrix before projection is \Sigma, then after projection it is W^{T}\Sigma W.

The derivation above uses the following identity, where a and b are both column vectors:
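\left( W^{T}a \right)\left( W^{T}b \right)^{T} = W^{T}a\,b^{T}W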

To make the projection points of similar samples as close as possible, the covariance matrix of similar sample points can be made as small as possible, that is, the following formula is as small as possible:
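W^{T}\Sigma_{0}W + W^{T}\Sigma_{1}W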

To make the projection points of heterogeneous samples as far away as possible, the distance between the class centers can be made as large as possible, that is, the following formula is as large as possible:
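\left\| W^{T}\mu_{0} - W^{T}\mu_{1} \right\|_{2}^{2}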

Considering both at the same time, we obtain the maximization objective:
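J = \frac{\left\| W^{T}\mu_{0}-W^{T}\mu_{1} \right\|_{2}^{2}}{W^{T}\Sigma_{0}W+W^{T}\Sigma_{1}W}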

 In the formula above, \left\| \cdot \right\|_{2} denotes the Euclidean norm.

Why does making the covariance matrix of same-class sample points as small as possible bring their projection points as close together as possible? When we first encounter covariance, it is introduced as a measure of the correlation between two variables. Let's look at the formulas of covariance and variance:
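Cov(X, Y) = \frac{1}{n}\sum_{i=1}^{n}\left( X_{i}-\bar{X} \right)\left( Y_{i}-\bar{Y} \right), \qquad Var(X) = \frac{1}{n}\sum_{i=1}^{n}\left( X_{i}-\bar{X} \right)^{2}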

The formula for covariance is very similar to that of variance; variance can even be seen as a special case of covariance. Variance measures how dispersed the data are: the larger (X - \bar{X}) is, the farther the data are from the sample center, the more dispersed they are, and the larger the variance. Similarly, the larger (X - \bar{X}) and (Y - \bar{Y}) are in the covariance formula, the farther the data are from the sample center and the more dispersed the distribution, so the larger the covariance; conversely, the smaller they are, the closer the data are to the sample center, the more concentrated the distribution, and the smaller the covariance.

Therefore, covariance not only reflects the correlation between variables but also reflects how dispersed a multidimensional sample distribution is (for one-dimensional samples we use variance). The larger the covariance (in absolute value for negative correlation), the more dispersed the distribution. Hence, to make the projection points of same-class samples as close as possible, we make the covariance matrix of the same-class sample points as small as possible.

First of all, one goal of LDA classification is to make different categories as far apart as possible and samples within the same category as close as possible. The farther apart different categories are, the easier they are to distinguish.

For example, the mean value of each type of sample is:
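\mu_{i} = \frac{1}{N_{i}}\sum_{x\in X_{i}}x, \qquad i = 1, 2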

 And the mean after projection is:
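\tilde{\mu}_{i} = w^{T}\mu_{i}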

The center points of the two types of samples after projection are separated as much as possible:
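J(w) = \left| \tilde{\mu}_{1}-\tilde{\mu}_{2} \right|^{2} = \left| w^{T}\left( \mu_{1}-\mu_{2} \right) \right|^{2}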

Does that mean we can simply maximize J(w)? Of course not; consider the following situation:

We found that for the above figure, suppose we only have two classification directions, one is the X1 direction and the other is the X2 direction.

We see that the direction of X1 can maximize the value of J(w), but it is not well divided, and there is still an intersection between the two data sets of μ1 and μ2. However, in the X2 direction, although the value of J(w) will not be maximized, it is well divided.

To address this problem, we introduce a scatter value (a measure of how spread out the sample points are): the larger the scatter value, the more dispersed the points, and vice versa. We certainly want samples of the same class to be as compact as possible, which is expressed as follows:
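\tilde{s}_{i}^{2} = \sum_{x\in X_{i}}\left( w^{T}x-\tilde{\mu}_{i} \right)^{2}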

So, in summary, we want the centers of different classes to be as far apart as possible, i.e., the larger J(w) the better, and samples within the same class to be as close as possible, i.e., the smaller the sum of the scatter values the better. Since we still want an objective that is maximized, we combine the two and obtain the objective function:
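J(w) = \frac{\left| \tilde{\mu}_{1}-\tilde{\mu}_{2} \right|^{2}}{\tilde{s}_{1}^{2}+\tilde{s}_{2}^{2}}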

The numerator expands as follows (S_{B} is called the between-class scatter matrix):
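\left| \tilde{\mu}_{1}-\tilde{\mu}_{2} \right|^{2} = w^{T}\left( \mu_{1}-\mu_{2} \right)\left( \mu_{1}-\mu_{2} \right)^{T}w = w^{T}S_{B}w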

The scatter value (the denominator) expands as:
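\tilde{s}_{i}^{2} = \sum_{x\in X_{i}}\left( w^{T}x-w^{T}\mu_{i} \right)^{2} = w^{T}\left[ \sum_{x\in X_{i}}\left( x-\mu_{i} \right)\left( x-\mu_{i} \right)^{T} \right]w = w^{T}S_{i}w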

Scatter matrices:
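S_{i} = \sum_{x\in X_{i}}\left( x-\mu_{i} \right)\left( x-\mu_{i} \right)^{T}, \qquad i = 1, 2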

Between-class scatter matrix:
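S_{B} = \left( \mu_{1}-\mu_{2} \right)\left( \mu_{1}-\mu_{2} \right)^{T}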

where S_{W} = S_{1} + S_{2} is called the within-class scatter matrix.

The final objective function is:
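J(w) = \frac{w^{T}S_{B}w}{w^{T}S_{W}w}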

Rayleigh quotient and generalized Rayleigh quotient

The Rayleigh quotient refers to a function R(A, x) like this:
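R(A, x) = \frac{x^{H}Ax}{x^{H}x}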

Here x is a non-zero vector and A is an n×n Hermitian matrix. A Hermitian matrix is one that equals its own conjugate transpose, i.e., A^{H}=A. If A is a real matrix, the condition becomes A^{T}=A, i.e., A is a real symmetric matrix.

The Rayleigh quotient R(A, x) has a very important property, that is, its maximum value is equal to the largest eigenvalue of matrix A, and the minimum value is equal to the smallest eigenvalue of matrix A, that is, it satisfies:
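\lambda_{min} \leq \frac{x^{H}Ax}{x^{H}x} \leq \lambda_{max}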

When the vector x is normalized, i.e., x^{H}x=1, the Rayleigh quotient reduces to R(A,x)=x^{H}Ax. This form appears in both spectral clustering and PCA.

The generalized Rayleigh quotient refers to a function R(A, B, x) like this:
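R(A, B, x) = \frac{x^{H}Ax}{x^{H}Bx}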

Here x is a non-zero vector, and A and B are n×n Hermitian matrices, with B positive definite. What are its maximum and minimum values? We can convert it into an ordinary Rayleigh quotient through a standardization: let x=B^{-1/2}x'; then the denominator becomes:
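x^{H}Bx = x'^{H}B^{-1/2}BB^{-1/2}x' = x'^{H}x'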

And the numerator transforms into:
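x^{H}Ax = x'^{H}B^{-1/2}AB^{-1/2}x'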

At this point our R(A, B, x) is transformed into R(A, B, x'):
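R(A, B, x') = \frac{x'^{H}B^{-1/2}AB^{-1/2}x'}{x'^{H}x'}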

Using the properties of the Rayleigh quotient above, we immediately see that the maximum of R(A, B, x') is the largest eigenvalue of the matrix B^{-1/2}AB^{-1/2}, which is also the largest eigenvalue of B^{-1}A, and its minimum is the smallest eigenvalue of B^{-1}A.

1.5 The principle of two-class LDA

LDA hopes that, after projection, the projections of samples from the same class are as close as possible and the centers of different classes are as far apart as possible, but so far this is only an intuitive criterion. We now start from the relatively simple two-class LDA and analyze the principle of LDA rigorously.

The principle of LDA: project into a lower-dimensional space so that the projected points are grouped by category, cluster by cluster, with points of the same category closer to each other in the projected space.

Suppose our data set is D = {(x1, y1), (x2, y2), ..., (xm, ym)}, where any sample xi is an n-dimensional vector and yi ∈ {0, 1}. We define Nj (j=0,1) as the number of samples of class j, Xj (j=0,1) as the set of samples of class j, μj (j=0,1) as the mean vector of class j, and Σj (j=0,1) as the covariance matrix of the class-j samples (strictly speaking, the covariance matrix without the denominator).

The expression of μj is:
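\mu_{j} = \frac{1}{N_{j}}\sum_{x\in X_{j}}x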

 where j = 0, 1

The expression of Σj is:
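\Sigma_{j} = \sum_{x\in X_{j}}\left( x-\mu_{j} \right)\left( x-\mu_{j} \right)^{T} \qquad (j = 0, 1)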

Since there are two classes of data, we only need to project the data onto one straight line. Suppose the projection line is given by the vector w; then for any sample xi, its projection onto w is w^{T}x_{i}, and the projections of the two class centers \mu_{0} and \mu_{1} onto w are w^{T}\mu_{0} and w^{T}\mu_{1}. LDA wants the distance between the projected centers of the two classes to be as large as possible, i.e., to maximize \left\| w^{T}\mu_{0}-w^{T}\mu_{1} \right\|_{2}^{2}; at the same time, it wants the projections of same-class samples to be as close as possible, i.e., the projected within-class covariances w^{T}\Sigma_{0}w and w^{T}\Sigma_{1}w should be as small as possible, i.e., minimize w^{T}\Sigma_{0}w+w^{T}\Sigma_{1}w. In summary, our optimization objective is:
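\arg\max_{w} J(w) = \frac{\left\| w^{T}\mu_{0}-w^{T}\mu_{1} \right\|_{2}^{2}}{w^{T}\Sigma_{0}w+w^{T}\Sigma_{1}w}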

Intra-class scatter matrix

We generally define the intra-class scatter matrix Sw as:
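S_{W} = \Sigma_{0}+\Sigma_{1} = \sum_{x\in X_{0}}\left( x-\mu_{0} \right)\left( x-\mu_{0} \right)^{T}+\sum_{x\in X_{1}}\left( x-\mu_{1} \right)\left( x-\mu_{1} \right)^{T}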

Between-class scatter matrix

A scatter matrix is simply the covariance matrix multiplied by the number of samples; the scatter matrix and the covariance matrix differ only by a constant factor.

At the same time, the inter-class scatter matrix Sb is defined as:
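S_{b} = \left( \mu_{0}-\mu_{1} \right)\left( \mu_{0}-\mu_{1} \right)^{T}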

After defining the intra-class scatter matrix and the inter-class scatter matrix, we can rewrite the above optimization objective as:
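J(w) = \frac{w^{T}S_{b}w}{w^{T}S_{W}w}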

Using the properties of the generalized Rayleigh quotient mentioned above (with the substitution w=S_{W}^{-1/2}w'), we know that the maximum value of J(w') is the largest eigenvalue of the matrix S_{W}^{-1/2}S_{b}S_{W}^{-1/2}, and the corresponding w' is the eigenvector associated with that largest eigenvalue. The eigenvalues of S_{W}^{-1}S_{b} are the same as those of S_{W}^{-1/2}S_{b}S_{W}^{-1/2}, and the eigenvectors are related by w=S_{W}^{-1/2}w'.

Note that in the two-class case the direction of S_{b}w is always parallel to \mu_{0}-\mu_{1}. Letting S_{b}w=\lambda(\mu_{0}-\mu_{1}) and substituting into (S_{W}^{-1}S_{b})w=\lambda w gives w=S_{W}^{-1}(\mu_{0}-\mu_{1}). That is, we only need the means and covariances of the original two classes of samples to determine the best projection direction w.

Note that the numerator and denominator above are both quadratic forms in w, so the solution does not depend on the length of w, only on its direction. Without loss of generality, we set:
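w^{T}S_{W}w = 1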

(We normalize the denominator: if the numerator and denominator could take arbitrary scale there would be infinitely many solutions, so we constrain the denominator to 1.)

 Then the optimization objective is equivalent to:
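\min_{w}\; -w^{T}S_{b}w \qquad s.t.\; w^{T}S_{W}w = 1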

 Using the Lagrange multiplier method, the above formula is equivalent to:
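S_{b}w = \lambda S_{W}w, \qquad i.e.\; S_{W}^{-1}S_{b}w = \lambda w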

It can be seen that the problem has been transformed into one of solving for eigenvalues and eigenvectors: w is an eigenvector of the matrix S_{W}^{-1}S_{b}, which verifies the formula stated earlier.

Here w is the matrix of eigenvectors. One remaining problem is whether S_{W} is invertible. Unfortunately, in practical applications it often is not, and there are usually two ways to deal with this:

  • Method 1: Replace S_{W} with S_{W} + \gamma I, where \gamma is a very small positive number, so that S_{W} is guaranteed to be invertible (a minimal sketch is given after this list).
  • Method 2: First use PCA to reduce the dimensionality of the data so that S_{W} is invertible on the reduced data, and then apply LDA.
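A minimal numpy sketch of Method 1 (the function name regularize_scatter and the default value of gamma are illustrative):

import numpy as np

def regularize_scatter(S_w, gamma=1e-6):
    # Add a small multiple of the identity so the within-class scatter matrix is invertible
    return S_w + gamma * np.eye(S_w.shape[0])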

1.6 Multi-class LDA

Suppose our data set is D = {(x1, y1), (x2, y2), ..., (xm, ym)}, where any sample xi is an n-dimensional vector and yi ∈ {C1, C2, ..., Ck}. We define Nj (j=1,2,...,k) as the number of samples of class j, Xj as the set of samples of class j, μj as the mean vector of class j, and Σj as the covariance matrix of the class-j samples. The formulas defined for two-class LDA extend naturally to multi-class LDA.

Since we are projecting multiple classes into a low-dimensional space, the target is no longer a straight line but a low-dimensional hyperplane. Suppose the dimension of the space we project to is d, the corresponding basis vectors are (w1, w2, ..., wd), and the matrix formed by these basis vectors is W, an n×d matrix.

At this point our optimization goal becomes:
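J(W) = \frac{W^{T}S_{b}W}{W^{T}S_{W}W}, \qquad S_{b} = \sum_{j=1}^{k}N_{j}\left( \mu_{j}-\mu \right)\left( \mu_{j}-\mu \right)^{T}, \qquad S_{W} = \sum_{j=1}^{k}\sum_{x\in X_{j}}\left( x-\mu_{j} \right)\left( x-\mu_{j} \right)^{T}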

Where: μ is the mean vector of all samples

But there is a problem: W^{T}S_{b}W and W^{T}S_{W}W are both matrices, not scalars, so the ratio cannot be optimized as a scalar function! In other words, we cannot directly reuse the optimization method of two-class LDA. What can we do? In general, we adopt some other alternative optimization objective instead.

A common LDA multi-class optimization objective function is defined as:
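J(W) = \frac{\prod_{diag}W^{T}S_{b}W}{\prod_{diag}W^{T}S_{W}W} = \prod_{i=1}^{d}\frac{w_{i}^{T}S_{b}w_{i}}{w_{i}^{T}S_{W}w_{i}}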

Where \prod_{diag}A denotes the product of the main diagonal elements of matrix A, and W is an n×d matrix.

Looking carefully at the right-hand side of the formula above, each factor is exactly a generalized Rayleigh quotient! The maximum of each factor is the largest eigenvalue of the matrix S_{W}^{-1}S_{b}, and the maximum of the product of the d factors is the product of the d largest eigenvalues of S_{W}^{-1}S_{b}; the corresponding W is then the matrix formed by the eigenvectors associated with those d largest eigenvalues.

Since W is a projection matrix obtained using the class labels of the samples, the maximum reduced dimension d is k−1. Why not k? Each term (\mu_{j}-\mu)(\mu_{j}-\mu)^{T} in S_{b} has rank 1, so after summation the rank of S_{b} is at most k (the rank of a sum of matrices is at most the sum of their ranks); but because the overall mean \mu is a weighted average of the class means, the last difference \mu_{k}-\mu can be linearly represented by the first k−1 differences, so the rank of S_{b} is at most k−1, i.e., there are at most k−1 useful eigenvectors.

1.7 LDA Algorithm Process

  Below we make a summary of the process of LDA dimensionality reduction.

  Input: data set D = {(x1, y1), (x2, y2), ..., (xm, ym)}, where any sample xi is an n-dimensional vector, yi ∈ {C1, C2, ..., Ck}, and the target reduced dimension d.

  Output : sample set D' after dimensionality reduction

  • 1) Calculate the intra-class scatter matrix Sw
  • 2) Calculate the inter-class scatter matrix Sb
  • 3) Compute the matrix S_{W}^{-1}S_{b}
  • 4) Compute the largest d eigenvalues of S_{W}^{-1}S_{b} and the corresponding d eigenvectors (w1, w2, ..., wd) to obtain the projection matrix W
  • 5) Transform each sample feature xi in the sample set into a new sample z_{i}=W^{T}x_{i}
  • 6) Get the output sample set D' = {(z1, y1), (z2, y2), .... (zm, ym)}

  The above is the algorithm flow for dimensionality reduction with LDA. In fact, besides dimensionality reduction, LDA can also be used for classification. A common basic idea is to assume that the samples of each category follow a Gaussian distribution; after projecting with LDA, the mean and variance of each category's projected data can be estimated by maximum likelihood, which gives the Gaussian probability density function of that category. When a new sample arrives, we project it and substitute the projected feature into each category's Gaussian density function to compute the likelihood that it belongs to that category; the category with the largest value is the predicted category.
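Following this idea, a minimal two-class sketch of LDA used as a classifier might look as follows (the helper names fit_gaussian_lda and predict are illustrative, and the projection direction w = S_{W}^{-1}(\mu_{0}-\mu_{1}) derived earlier is assumed):

import numpy as np

def fit_gaussian_lda(X, y):
    # Two-class LDA direction: w = S_W^{-1} (mu_0 - mu_1)
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    S_W = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    w = np.linalg.solve(S_W, mu0 - mu1)

    # Maximum-likelihood estimates of the mean and variance of each class's projected data
    params = {}
    for c, Xc in ((0, X0), (1, X1)):
        z = Xc @ w
        params[c] = (z.mean(), z.var())
    return w, params

def predict(x, w, params):
    # Project the new sample and choose the class whose Gaussian density is largest
    z = x @ w
    def log_density(mean, var):
        return -0.5 * np.log(2 * np.pi * var) - (z - mean) ** 2 / (2 * var)
    return max(params, key=lambda c: log_density(*params[c]))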

1.8 LDA vs PCA

LDA is used for dimensionality reduction. It has many similarities and differences with PCA, so it is worth comparing the similarities and differences of dimensionality reduction between the two. First let's look at the similarities:

  • Both can reduce the dimensionality of the data
  • Both use the idea of matrix eigendecomposition in dimensionality reduction
  • Both assume that the data follow a Gaussian distribution

Differences:

  • LDA is a supervised dimensionality reduction method, while PCA is an unsupervised dimensionality reduction method
  • LDA dimensionality reduction can be reduced to the dimension of the number of categories k-1, while PCA does not have this limitation
  • In addition to being used for dimensionality reduction, LDA can also be used for classification
  • LDA chooses the projection direction with the best classification performance, while PCA chooses the direction with the largest variance of sample point projections

The fourth point can be seen from the figure below: under certain data distributions, LDA reduces dimensionality better than PCA.

Of course, under certain data distributions, PCA is better than LDA for dimensionality reduction, as shown in the following figure:

 

2 Advantages and disadvantages of LDA

LDA is a supervised learning algorithm used to project data points into a low-dimensional space for classification tasks. The advantages and disadvantages of the LDA algorithm are as follows:

2.1 Advantages of LDA

  • Excellent dimensionality reduction effect: LDA can better preserve the differences between categories by projecting data points into a low-dimensional space, thereby achieving effective dimensionality reduction of data. On some data sets with many features, LDA can significantly reduce the computational complexity and improve the classification efficiency.

  • Considering the differences between categories: LDA not only pays attention to the distribution of data points in the feature space, but also considers the differences between different categories. It tries to maximize the distance between classes to better distinguish between different classes.

  • Strong interpretability: The projection vector of LDA has physical meaning and can be interpreted as a linear difference between categories.

  • Suitable for multi-classification problems: LDA performs well when dealing with multi-classification problems, and can effectively distinguish multiple categories.

2.2 Disadvantages of LDA

  • Assuming that the data conforms to the Gaussian distribution: LDA makes strict assumptions about the distribution of the data, that is, the data obeys the Gaussian distribution within each category, and the covariance matrix of each category is equal. In some complex datasets in the real world, these assumptions may not hold, thus affecting the performance of LDA.

  • Sensitive to outliers: Since the assumption of LDA is based on the Gaussian distribution of the data, when there are outliers in the data set, the projection results may be inaccurate.

  • Inability to handle nonlinear relationships: LDA is a linear classifier and may perform poorly on datasets with complex nonlinear relationships. In such cases, non-linear methods such as kernelized LDA can be considered.

  • Suffering from the curse of dimensionality: When the feature dimensionality is very high, the performance of LDA may drop. Although LDA can perform dimension reduction, when the feature dimension is too large, it may lead to inaccurate estimation of the covariance matrix within the category and affect the classification effect.

3 Application Scenarios of LDA

LDA is widely used in many practical problems. Before applying LDA, it is necessary to ensure that the data conforms to the assumption of LDA, that is, the data obeys the Gaussian distribution within each category, and the covariance matrix of each category is equal. For nonlinear problems, you may want to consider using kernelized LDA or other nonlinear methods to deal with it.

The common application scenarios of LDA are as follows:

  • Pattern Recognition and Classification: LDA is a supervised learning algorithm especially suited for solving classification problems. It can be used to project data points into a low-dimensional space and classify different classes based on their differences.

  • Face recognition: In the field of computer vision, LDA is often used for face recognition tasks. By projecting the face image into a low-dimensional space, the classification and recognition of different faces can be achieved.

  • Text classification: In natural language processing, LDA can be used for text classification tasks. Classification and sentiment analysis of text data can be achieved by representing text as vectors and applying LDA for dimensionality reduction and classification.

  • Bioinformatics: LDA can be applied to bioinformatics problems such as gene expression data analysis and protein function classification to help identify the functions and classifications of different genes or proteins.

  • Medical diagnosis: In the medical field, LDA can be used for disease diagnosis and medical image analysis. For example, by reducing the dimensionality of medical image data and classifying it, it can help doctors make more accurate diagnoses.

  • Video Behavior Recognition: LDA is also used in video analysis, which can help identify different behaviors and actions in videos, such as action recognition, motion analysis, etc.

  • Financial analysis: In the financial field, LDA can be used to classify and predict financial data, such as predicting stock market trends, credit ratings, etc.

4 LDA code example: dimensionality reduction on the iris data

4.1 Dataset Introduction

The Iris dataset is a classic dataset often used as an example in both statistics and machine learning. It contains 150 records in 3 classes, 50 per class, and each record has 4 features: sepal length, sepal width, petal length, and petal width. These 4 features can be used to predict which of the three species (iris-setosa, iris-versicolour, iris-virginica) a flower belongs to.

Data content:

     sepal_len  sepal_wid  petal_len  petal_wid  label
0          5.1        3.5        1.4        0.2      0
1          4.9        3.0        1.4        0.2      0
2          4.7        3.2        1.3        0.2      0
3          4.6        3.1        1.5        0.2      0
4          5.0        3.6        1.4        0.2      0
..         ...        ...        ...        ...    ...
145        6.7        3.0        5.2        2.3      2
146        6.3        2.5        5.0        1.9      2
147        6.5        3.0        5.2        2.0      2
148        6.2        3.4        5.4        2.3      2
149        5.9        3.0        5.1        1.8      2

4.2 Implementation from scratch

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler
import matplotlib.pyplot as plt


# A simple LDA implementation for dimensionality reduction
class LDA:
    def __init__(self, n_components=2):
        self.n_components = n_components

    def fit(self, X, y):
        # Compute the within-class and between-class scatter matrices
        X_mean = np.mean(X, axis=0)
        S_W = np.zeros((X.shape[1], X.shape[1]))
        S_B = np.zeros((X.shape[1], X.shape[1]))
        for i in np.unique(y):
            X_class = X[y == i, :]
            X_class_mean = np.mean(X_class, axis=0)
            S_W += np.dot((X_class - X_class_mean).T, (X_class - X_class_mean))
            S_B += len(X_class) * np.dot((X_class_mean - X_mean).reshape(-1, 1),
                                         (X_class_mean - X_mean).reshape(1, -1))

        # Solve the eigenvalue problem for S_W^{-1} S_B
        eig_val, eig_vec = np.linalg.eig(np.dot(np.linalg.inv(S_W), S_B))
        # Sort eigenvectors by eigenvalue in descending order and keep the top n_components
        idx = np.argsort(-eig_val.real)
        self.W = eig_vec[:, idx[:self.n_components]].real
        return self

    def transform(self, X):
        # Project the data onto the discriminant directions
        X_new = np.dot(X, self.W)
        # Rescale to [0, 1] for easier visualization
        scaler = MinMaxScaler()
        X_new = scaler.fit_transform(X_new)
        return X_new


if __name__ == '__main__':
    # Load the dataset
    iris = load_iris()
    X = iris.data
    y = iris.target

    # Fit the model
    lda = LDA(n_components=2)
    lda.fit(X, y)

    # Transform the data
    X_new = lda.transform(X)

    # Visualize the data distribution before dimensionality reduction (first two features)
    plt.scatter(X[:, 0], X[:, 1], c=y)
    plt.show()

    # Visualize the data distribution after LDA dimensionality reduction
    plt.scatter(X_new[:, 0], X_new[:, 1], c=y)
    plt.show()

Data distribution before LDA dimensionality reduction:

 Data distribution diagram after LDA dimensionality reduction:

 

4.3 Implementation based on sklearn code

# Classify the iris dataset with linear LDA followed by logistic regression
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from matplotlib.colors import ListedColormap

# Load the dataset
iris = load_iris()
X = iris.data
y = iris.target

# Use only the first two features (sepal length and sepal width)
X = X[:, :2]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

if __name__ == '__main__':

    sc = StandardScaler()
    X_train = sc.fit_transform(X_train)
    X_test = sc.transform(X_test)

    lda = LDA(n_components=2)
    X_train = lda.fit_transform(X_train, y_train)
    X_test = lda.transform(X_test)

    classifier = LogisticRegression(random_state=0)
    classifier.fit(X_train, y_train)

    y_pred = classifier.predict(X_test)
    cm = confusion_matrix(y_test, y_pred)

    X_set, y_set = X_train, y_train
    X1, X2 = np.meshgrid(np.arange(start=X_set[:, 0].min() - 1, stop=X_set[:, 0].max() + 1, step=0.01),
                         np.arange(start=X_set[:, 1].min() - 1, stop=X_set[:, 1].max() + 1, step=0.01))
    plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
                 alpha=0.75, cmap=ListedColormap(('red', 'green', 'blue')))
    plt.xlim(X1.min(), X1.max())
    plt.ylim(X2.min(), X2.max())
    for i, j in enumerate(np.unique(y_set)):
        plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                    c=ListedColormap(('yellow', 'green', 'blue'))(i), label=j)
    plt.title('Logistic Regression (Training set)')
    plt.xlabel('LD1')
    plt.ylabel('LD2')
    plt.legend()
    plt.show()

The resulting plot:

5 LDA Summary

LDA is a simple and effective classifier and dimensionality reduction method, especially suitable for problems with low feature dimensions and data conforming to Gaussian distribution. But before applying it, it is necessary to carefully consider whether the data conforms to the assumptions of LDA, and whether the characteristics of the problem are suitable for using LDA.
