(1999, Nonlinear Mapping) Fisher's Discriminant Analysis Using Kernels

Fisher discriminant analysis with kernels


Table of contents

0. Summary

1. Discriminant Analysis

2. Fisher's linear discriminant

3. Fisher discriminant in feature space

4. Experiment

5. Discussion and conclusions 

Reference

S. Summary

S.1 Main idea

S.2 Method


0. Summary

A nonlinear classification technique based on Fisher's discriminant is proposed. The main ingredient is the kernel trick, which allows efficient computation of the Fisher discriminant in feature space. A linear classification in feature space corresponds to a (powerful) nonlinear decision function in the input space. Large-scale simulations demonstrate the competitiveness of our method.

1. Discriminant Analysis

In classification and other data analysis tasks it is often necessary to preprocess the data before applying the algorithm at hand, usually by first extracting features suitable for the task to be solved.

Feature extraction for classification is very different from feature extraction for describing data. For example, PCA finds the directions with minimal reconstruction error by describing as much of the variance of the data as possible with m orthogonal directions. These first directions need not (and in practice often do not) reveal the class structure that we need for proper classification. Discriminant analysis addresses the following question: given a dataset with two classes, say, which is the best feature, or set of features (linear or nonlinear), for discriminating the two classes? Classical approaches answer this question by using the (theoretically) optimal Bayesian classifier (assuming normal distributions for the classes), which leads to standard algorithms such as quadratic or linear discriminant analysis, among them the famous Fisher discriminant. Of course, any model other than a Gaussian could be assumed for the class distributions; however, this usually comes at the expense of simple closed-form solutions.

In this work, we propose a nonlinear generalization of Fisher's discriminant using the kernel idea originally applied in support vector machines (SVMs), kernel PCA, and other kernel-based algorithms. Our approach uses kernel feature spaces and yields a highly flexible algorithm that turns out to be competitive with SVMs.

Note that there exists a variety of methods called Kernel Discriminant Analysis. Most of them aim to replace parametric estimates of class conditional distributions with nonparametric kernel estimates. Even though our method may be viewed as such, it is important to note that it goes a step further by interpreting the kernel as a dot product in another space. This allows theoretically plausible interpretations as well as attractive closed-form solutions.

Below we will first review Fisher's discriminant, apply the kernel technique, then report the classification results, and finally give our conclusions. In this article, we will only focus on binary classification problems and linear discriminants in feature space.

2. Fisher's linear discriminant

Let X_1 = {x^1_1, ..., x^1_{ℓ_1}} and X_2 = {x^2_1, ..., x^2_{ℓ_2}} be samples from two different classes, and write X = X_1 ∪ X_2 = {x_1, ..., x_ℓ} for their union (ℓ = ℓ_1 + ℓ_2). Fisher's linear discriminant is given by the vector w that maximizes

J(w) = (w^T S_B w) / (w^T S_W w),        (1)

where

S_B := (m_1 - m_2)(m_1 - m_2)^T,        (2)
S_W := Σ_{i=1,2} Σ_{x ∈ X_i} (x - m_i)(x - m_i)^T,        (3)

and m_i := (1/ℓ_i) Σ_{x ∈ X_i} x is the mean of class X_i.

S_B and S_W are the between-class and within-class scatter matrices, respectively. The intuition behind maximizing J(w) is to find a direction that maximizes the separation of the projected class means (numerator) while minimizing the variance within each class along that direction (denominator). There is also a well-known statistical argument that motivates equation (1):

Relation to the optimal linear Bayesian classifier: An optimal Bayesian classifier compares the posterior probabilities of all classes and assigns a pattern to the class with the highest probability. However, the posterior probabilities are usually unknown and must be estimated from a finite sample. For most class distributions this is a daunting task, and closed-form estimates are often impossible to obtain. By assuming that all classes are normally distributed, however, one can derive quadratic discriminant analysis (which essentially measures the Mahalanobis distance of a pattern to the class centers). Simplifying the problem further and assuming the same covariance structure for all classes, the quadratic discriminant becomes linear. For binary classification problems it is easy to show that the vector w maximizing equation (1) points in the same direction as the discriminant of the corresponding optimal Bayesian classifier. Although it relies on strong assumptions that do not hold in many applications, Fisher's linear discriminant has proven to be very powerful. One reason is of course that linear models are fairly robust to noise and unlikely to overfit. Crucial, however, is the estimation of the scatter matrices, which can be highly biased. When the sample size is small compared to the dimensionality, using the simple "plug-in" estimates of equations (2) and (3) results in high variance. Different ways of dealing with this situation through regularization have been proposed; we return to this topic later.
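To make the linear case concrete, here is a minimal NumPy sketch (my own illustration, not code from the paper; the name fisher_direction is hypothetical) of the well-known closed-form maximizer of equation (1), w ∝ S_W^(-1)(m_1 - m_2), for two classes:

```python
import numpy as np

def fisher_direction(X1, X2):
    """Fisher's linear discriminant direction for two classes.

    X1, X2: arrays of shape (n_samples_i, n_features).
    Returns w proportional to S_W^{-1} (m1 - m2), the maximizer of J(w) in eq. (1).
    """
    m1, m2 = X1.mean(axis=0), X2.mean(axis=0)
    # Within-class scatter S_W as in eq. (3)
    S_W = (X1 - m1).T @ (X1 - m1) + (X2 - m2).T @ (X2 - m2)
    # Solving S_W w = (m1 - m2) gives the direction maximizing eq. (1)
    w = np.linalg.solve(S_W, m1 - m2)
    return w / np.linalg.norm(w)

# Toy usage: two Gaussian blobs in 2D
rng = np.random.default_rng(0)
X1 = rng.normal(loc=[0.0, 0.0], size=(100, 2))
X2 = rng.normal(loc=[2.0, 2.0], size=(100, 2))
print("Fisher direction:", fisher_direction(X1, X2))
```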

3. Fisher discriminant in feature space

Clearly, linear discriminants are not complex enough for most real-world data. To increase the expressiveness of the discriminant, we can try to model an optimal Bayesian classifier with a more complex distribution, or look for non-linear directions (or both). As mentioned earlier, assuming a general distribution can cause trouble. Here we constrain our search for non-linear directions by first nonlinearly mapping the data to some feature space F, and computing Fisher's linear discriminant there, thereby implicitly producing a nonlinear discriminant in the input space. 

Let Φ be a nonlinear mapping into some feature space F. To find the linear discriminant in F we need to maximize

J(ω) = (ω^T S^Φ_B ω) / (ω^T S^Φ_W ω),        (4)

where ω ∈ F, and S^Φ_B and S^Φ_W are the corresponding scatter matrices in F, i.e.

S^Φ_B := (m^Φ_1 - m^Φ_2)(m^Φ_1 - m^Φ_2)^T,
S^Φ_W := Σ_{i=1,2} Σ_{x ∈ X_i} (Φ(x) - m^Φ_i)(Φ(x) - m^Φ_i)^T,

with m^Φ_i := (1/ℓ_i) Σ_{j=1}^{ℓ_i} Φ(x^i_j).

Introducing the kernel function: Obviously, if the dimension of F is very high or even infinite, this problem cannot be solved directly. To overcome this limitation we use the same trick as in kernel PCA and support vector machines. Instead of mapping the data explicitly, we seek a formulation of the algorithm that uses only dot products (Φ(x) · Φ(y)) of the training patterns. Since these dot products can be computed efficiently, the original problem can be solved without ever mapping explicitly into F. This is achieved using Mercer kernels: such a kernel k(x, y) computes a dot product in some feature space F, i.e. k(x, y) = (Φ(x) · Φ(y)). Possible choices for k that have proven useful, for example in support vector machines or kernel PCA, are the Gaussian RBF,

k(x, y) = exp(-||x - y||^2 / c),

or the polynomial kernel, k(x, y) = (x · y)^d, where c and d are positive constants.
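As an illustration (my own sketch, not code from the paper; rbf_kernel and poly_kernel are hypothetical names), both kernels can be evaluated on whole sets of patterns at once, producing the kernel matrices used in the derivation below:

```python
import numpy as np

def rbf_kernel(X, Y, c=1.0):
    """Gaussian RBF kernel matrix: K[i, j] = exp(-||x_i - y_j||^2 / c)."""
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Y**2, axis=1)[None, :]
                - 2.0 * X @ Y.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / c)

def poly_kernel(X, Y, d=2):
    """Polynomial kernel matrix: K[i, j] = (x_i . y_j)^d."""
    return (X @ Y.T) ** d

# Example: 5 x 5 Gram matrix of random 3-dimensional points
X = np.random.default_rng(0).normal(size=(5, 3))
print(rbf_kernel(X, X, c=1.0).shape)  # (5, 5)
```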

To find the Fisher discriminant in the feature space F, we first need to formulate equation (4) in terms of dot products of input patterns only, which we can then replace by a kernel function. From the theory of reproducing kernels we know that any solution ω ∈ F must lie in the span of the training samples mapped into F. We can therefore write an expansion of ω in the form

ω = Σ_{i=1}^{ℓ} α_i Φ(x_i).        (5)

Using the expansion (5) and the definition of m^Φ_i, we have

ω^T m^Φ_i = (1/ℓ_i) Σ_{j=1}^{ℓ} Σ_{k=1}^{ℓ_i} α_j k(x_j, x^i_k) = α^T M_i,        (6)

where (M_i)_j := (1/ℓ_i) Σ_{k=1}^{ℓ_i} k(x_j, x^i_k) and the dot products have been replaced by the kernel function.

Now consider the numerator of equation (4). Using the definition of S^Φ_B and equation (6), it can be rewritten as

ω^T S^Φ_B ω = α^T M α,        (7)

where M := (M_1 - M_2)(M_1 - M_2)^T.

Considering the denominator of equation (4), and again using expansion (5), we find

ω^T S^Φ_W ω = α^T N α,        (8)

where N := Σ_{j=1,2} K_j (I - 1_{ℓ_j}) K_j^T, K_j is the ℓ × ℓ_j kernel matrix with entries (K_j)_{nm} := k(x_n, x^j_m), I is the identity matrix, and 1_{ℓ_j} is the ℓ_j × ℓ_j matrix with all entries equal to 1/ℓ_j.

Combining equations (7) and (8), we can find Fisher's linear discriminant in F by maximizing

J(α) = (α^T M α) / (α^T N α).        (9)

This problem can be solved by finding the leading eigenvector of N^(-1) M (analogous to the algorithm in the input space). We call this approach the (nonlinear) Kernel Fisher Discriminant (KFD). The projection of a new pattern x onto ω is given by

(ω · Φ(x)) = Σ_{i=1}^{ℓ} α_i k(x_i, x).        (10)

Numerical issues and regularization: Clearly, the proposed setting is ill-conditioned: we are estimating an ℓ-dimensional covariance structure from ℓ samples. Besides numerical problems that can cause the matrix N not to be positive definite, we also need a way of controlling capacity in F. To this end, we simply add a multiple of the identity matrix to N, i.e. we replace N by

N_μ := N + μ I.

This can be viewed in different ways: (i) it makes the problem numerically more stable, since N_μ becomes positive definite for sufficiently large μ; (ii) it decreases the bias in the sample-based estimation of eigenvalues; (iii) it imposes a regularization on ||α||^2 (recall that we are maximizing equation (9)), favoring solutions with small expansion coefficients. While the true impact of this regularization in our setting is not yet fully understood, it shows connections to the regularization used in support vector machines. One might also use other regularization terms added to N, e.g. one penalizing ||ω||^2 as in SVMs (by adding the full kernel matrix K_ij := k(x_i, x_j)).
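Putting equations (5)-(10) and the regularization N_μ = N + μI together, a compact NumPy/SciPy sketch of the KFD training and projection steps could look as follows (my own sketch under hypothetical names such as kfd_fit, not the authors' code):

```python
import numpy as np
from scipy.linalg import eigh

def kfd_fit(X1, X2, kernel, mu=1e-3):
    """Kernel Fisher Discriminant for two classes, following eqs. (5)-(9).

    X1, X2 : arrays of shape (l1, d) and (l2, d), one per class.
    kernel : function kernel(A, B) returning the Gram matrix k(a_i, b_j).
    mu     : regularization constant in N_mu = N + mu * I.
    Returns the expansion coefficients alpha of eq. (5) and the stacked
    training patterns needed for later projections.
    """
    X = np.vstack([X1, X2])                    # all l = l1 + l2 training samples
    l = X.shape[0]
    K1, K2 = kernel(X, X1), kernel(X, X2)      # l x l_j kernel matrices

    # (M_i)_j = (1/l_i) sum_k k(x_j, x^i_k), cf. eq. (6)
    M1, M2 = K1.mean(axis=1), K2.mean(axis=1)
    M = np.outer(M1 - M2, M1 - M2)             # eq. (7)

    # N = sum_j K_j (I - 1_{l_j}) K_j^T, eq. (8)
    N = np.zeros((l, l))
    for Kj in (K1, K2):
        lj = Kj.shape[1]
        N += Kj @ (np.eye(lj) - np.full((lj, lj), 1.0 / lj)) @ Kj.T

    N_mu = N + mu * np.eye(l)                  # regularized within-class matrix

    # Maximize J(alpha) of eq. (9): leading eigenvector of the generalized
    # symmetric problem M alpha = lambda N_mu alpha.
    _, eigvecs = eigh(M, N_mu)
    alpha = eigvecs[:, -1]
    return alpha, X

def kfd_project(alpha, X_train, X_new, kernel):
    """Projection of new patterns onto omega, eq. (10)."""
    return kernel(X_new, X_train) @ alpha
```

Because M has rank one, the same α can also be obtained directly as α ∝ N_μ^(-1)(M_1 - M_2), which avoids the eigensolver for larger training sets.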

4. Experiment

Figure 1 shows the feature found by KFD compared to the first and second (nonlinear) features found by kernel PCA on a toy dataset. For both, a polynomial kernel of degree two was used, and for KFD the within-class scatter was regularized with μ = 10^(-3). Depicted are the two classes (crosses and dots), the feature values (shown as gray levels), and contour lines of constant feature value. Each class consists of two noisy parabolic shapes, mirrored at the x- and y-axes respectively. We see that the KFD feature discriminates the two classes in an almost optimal way, whereas the kernel PCA features, while describing interesting properties of the dataset, do not do so well (although higher-order kernel PCA features might also be discriminating).

To evaluate the performance of the new method, we performed extensive comparisons with other state-of-the-art classifiers. We compared the Kernel Fisher Discriminant (with a Gaussian kernel and a regularized within-class scatter) with AdaBoost, regularized AdaBoost, and support vector machines (also with Gaussian kernels). After finding the best direction ω ∈ F, we computed the projections onto it using equation (10). To estimate an optimal threshold on the extracted feature, any classification technique can be used, e.g. something as simple as fitting a sigmoid. Here we used a linear SVM (optimized by gradient descent, since the samples are only one-dimensional). One downside of this is that we have another parameter to control, namely the regularization constant of this SVM.
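The authors fit a linear SVM on the one-dimensional projections to set the threshold; purely as an illustrative stand-in (not the procedure used in the paper, and fit_threshold is a hypothetical name), a threshold can also be chosen by minimizing the training error over the midpoints of the sorted projections:

```python
import numpy as np

def fit_threshold(proj, y):
    """Pick a decision threshold b for the 1-D KFD projections (eq. (10)).

    proj : projected training samples, y : labels in {-1, +1}.
    Simple substitute for the linear SVM used in the paper: choose the
    midpoint between consecutive sorted projections with lowest training error.
    """
    p = np.sort(proj)
    candidates = (p[:-1] + p[1:]) / 2.0
    errors = [np.mean(np.where(proj > b, 1, -1) != y) for b in candidates]
    return candidates[int(np.argmin(errors))]

# A new pattern x is then classified as sign(kfd_project(...) - b),
# possibly after flipping the sign of omega if the classes come out reversed.
```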

We used 13 artificial and real-world datasets; all but the (artificial) banana dataset come from the UCI, DELVE, and STATLOG benchmark repositories. Non-binary problems were converted into two-class problems. For each dataset, 100 partitions into training and test sets (roughly 60% : 40%) were generated, and all classifiers were trained and tested on each partition. The results in Table 1 show the mean test error and standard deviation over these 100 runs. To estimate the necessary parameters, we ran 5-fold cross-validation on the first five realizations of the training sets and took the median of the five estimates as the model parameters.

Furthermore, in a preliminary experiment with KFD on the USPS handwritten digit dataset, we restricted the expansion (5) of ω to run over only the first 3000 training samples. With a Gaussian kernel of width 0.3 · 256 we achieved a 10-class error of 3.7%, slightly better than an SVM with a Gaussian kernel (4.2%).

Experimental results: The experiments show that the Kernel Fisher Discriminant (plus a support vector machine to estimate the threshold) is competitive with, or in some cases even superior to, the other algorithms on almost all datasets (an exception is the image dataset). Interestingly, both SVM and KFD construct a hyperplane in F that is optimal in some sense, and the solution ω found by KFD is often as good as, or better than, the SVM solution.

5. Discussion and conclusions 

Fisher's discriminant is one of the standard linear techniques in statistical data analysis. However, linear methods are often too limited, and several methods have been used in the past to derive more general class separability criteria. Our approach is very much in this spirit, however, since we compute the discriminant function in some feature space F (non-linearly related to the input space), we are still able to find closed-form solutions and preserve the theoretical beauty of Fisher's discriminant analysis. Furthermore, different kernels allow a high degree of flexibility due to the wide range of possible nonlinearities.

Our experiments show that KFD is competitive with other state-of-the-art classification techniques. Furthermore, since linear discriminant analysis is a well-studied field and many ideas previously developed in the input space can be transferred to the feature space, there is still a lot of room for expansion and further theory.

Note that while the complexity of evaluating an SVM solution scales with the number of support vectors, KFD has no concept of support vectors and its complexity scales with the number of training samples. On the other hand, we speculate that part of the performance of KFD relative to SVMs may be related to the fact that KFD uses all training samples in its solution, not just the hard ones, i.e. the support vectors.

Future work will be devoted to finding suitable approximation schemes and numerical algorithms to obtain the principal eigenvectors of large matrices. Further areas of research will include the construction of multi-class discriminants, theoretical analysis of KFD generalization error bounds, and investigation of the link between KFD and support vector machines.

Reference

Mika S., Rätsch G., Weston J., Schölkopf B., Müller K.-R. Fisher discriminant analysis with kernels. In: Neural Networks for Signal Processing IX: Proceedings of the 1999 IEEE Signal Processing Society Workshop. IEEE, 1999: 41-48.

S. Summary

S.1 Main idea

For most real-world data, a linear discriminant is not complex enough. To increase the expressiveness of the discriminant, one can model the optimal Bayesian classifier with more complex class distributions, or look for nonlinear directions (or both).

Linear models are fairly robust to noise and unlikely to overfit, but the crucial estimates of the scatter matrices can be highly biased, especially when the sample size is small compared to the dimensionality.

S.2 Method

The data are first nonlinearly mapped to some feature space F, and Fisher's linear discriminant is computed there to find the nonlinear direction, thus implicitly producing a nonlinear discriminant in the input space.

Instead of explicitly mapping the data, the authors seek an algorithm that uses only dot products (Φ(x) · Φ(y)) of the training patterns, where Φ is a nonlinear mapping into some feature space F. Since these dot products can be computed efficiently, the original problem can be solved without explicit mapping into F. This is achieved using Mercer kernels: such a kernel k(x, y) computes a dot product in some feature space F, i.e. k(x, y) = (Φ(x) · Φ(y)).

The kernel function k can, for example, be the Gaussian RBF used in SVMs or kernel PCA,

k(x, y) = exp(-||x - y||^2 / c),

or the polynomial kernel, k(x, y) = (x · y)^d, where c and d are positive constants.


Origin: blog.csdn.net/qq_44681809/article/details/131273195