Machine learning feature selection and feature extraction

I. The difference between feature extraction and feature selection

Feature selection and dimensionality reduction (feature extraction) are somewhat similar: both try to reduce the number of feature attributes (features) in a dataset. But their methods differ: dimensionality reduction works mainly through the relationships between attributes, for example combining different attributes to obtain new ones, and therefore changes the original feature space; feature selection, in contrast, picks a subset from the original feature set and leaves the original feature space unchanged.

II. What are the common feature selection methods?

Feature selection means selecting an important subset of features from the original feature set.

The well-known feature selection approaches are the filter method (Filter), the wrapper method (Wrapper), and the embedded method (Embedded).

First, some terms related to feature selection:

Feature divergence: if a feature does not diverge, i.e. its variance is close to 0, the samples show essentially no difference on this feature, so it is of almost no use for distinguishing samples.

Correlation with the target: a feature can be correlated with the target value, either positively (as the feature value grows, the target value also grows) or negatively. A strong correlation indicates that the feature carries a lot of information about the target value.

1. Filter method (Filter)

The filter method scores each feature according to its divergence or its correlation with the target, and then completes the selection by setting a threshold on the score or on the number of features to keep.

1) Variance method: compute the mean and variance of each feature and set a base threshold; when a feature's variance is smaller than the base threshold, discard that feature. This method is simple and efficiently filters out some low-variance features, but the threshold has to be set a priori: set too low, it retains too many uninformative features; set too high, it discards too many useful ones.
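A minimal sketch of the variance method with scikit-learn's VarianceThreshold (the toy data and the 0.1 threshold are illustrative assumptions, not values from this article):

import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 2.0, 0.1],
              [0, 1.0, 0.2],
              [0, 3.0, 0.1],
              [0, 2.5, 0.3]])        # the first column never varies

selector = VarianceThreshold(threshold=0.1)   # base threshold: drop features with variance below 0.1
X_reduced = selector.fit_transform(X)
print(selector.variances_)   # per-feature variances
print(X_reduced.shape)       # (4, 1): the two low-variance columns are dropped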

2) Univariate feature selection: a univariate statistical test is run on each feature to measure the relationship between that feature and the response variable, and features with poor scores are discarded. In other words, univariate feature selection measures the relationship between each feature and the response variable independently.

Chi-square test: features can be scored with statistical tests; for classification and regression problems, tests such as the chi-square test (and related statistics) can be used.
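A hedged sketch of univariate selection with a chi-square score on a classification dataset (the iris data and k=2 are illustrative choices):

from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

X, y = load_iris(return_X_y=True)              # non-negative features, as chi2 requires
selector = SelectKBest(score_func=chi2, k=2)   # keep the 2 best-scoring features
X_new = selector.fit_transform(X, y)

print(selector.scores_)   # chi-square score of each feature against the target
print(X_new.shape)        # (150, 2)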

Mutual-information-based feature selection:

What is mutual information?

Mutual information (Mutual Information) is a useful information measure from information theory. It can be viewed as the amount of information about one random variable contained in another random variable, or as the reduction in uncertainty about one random variable once the other random variable is known.
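As a quick sketch, scikit-learn's mutual_info_classif estimates the mutual information between each feature and the class label (the dataset and feature names below are illustrative):

from sklearn.datasets import load_iris
from sklearn.feature_selection import mutual_info_classif

X, y = load_iris(return_X_y=True)
mi = mutual_info_classif(X, y, random_state=0)
for name, score in zip(["sepal_len", "sepal_wid", "petal_len", "petal_wid"], mi):
    print(f"{name}: {score:.3f}")   # larger MI = knowing the feature removes more uncertainty about the label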

2. Wrapper method (Wrapper)

The so-called wrapper method picks a specific algorithm and then selects the feature set according to how well that algorithm performs on it.

It is characterized by repeated heuristic search, which mainly falls into two categories (a minimal wrapper sketch follows the two methods below):

Method one: start from a few selected features and gradually add features, checking whether the algorithm's model accuracy reaches the required standard.

Method two: start from all features and gradually remove features, reducing the feature set as long as the algorithm's accuracy is maintained.
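A minimal wrapper sketch, assuming scikit-learn's RFE as the search procedure and a decision tree as the wrapped algorithm (both are illustrative choices); it follows method two by repeatedly dropping the least useful feature:

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
estimator = DecisionTreeClassifier(random_state=0)
rfe = RFE(estimator, n_features_to_select=10, step=1)   # remove one feature per round
rfe.fit(X, y)

print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # 1 = kept; larger numbers were eliminated earlier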

For a model that provides a measure of feature importance, the corresponding feature selection method can be called directly.

1) Using a linear regression model

This is uncommon, because real data rarely has a good linear relationship; a nonlinear model such as a random forest is usually a better choice, since it achieves higher accuracy and also provides a way to estimate feature importance.

After the linear regression model lr is fitted, lr.coef_ will output something like:

Linear model: -1.291 * X0 + 1.591 * X1 + 2.747 * X2, where the importance of a feature is judged by the magnitude of the weight in front of it.
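A minimal sketch of reading such weights from a fitted model; the synthetic data below (with true weights close to the example above) is purely illustrative:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.normal(size=(200, 3))    # features X0, X1, X2
y = -1.3 * X[:, 0] + 1.6 * X[:, 1] + 2.7 * X[:, 2] + rng.normal(scale=0.1, size=200)

lr = LinearRegression().fit(X, y)
print(lr.coef_)   # roughly [-1.3, 1.6, 2.7]; a larger |weight| means a more influential feature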

2) The basis on which a random forest (RF) selects important features

Mean decrease in impurity (MDI): the average amount by which each feature reduces the impurity (error) across the trees.

Mean decrease in accuracy (MDA): shuffle the values of each feature in turn and measure how the change affects the model's accuracy. For unimportant features, shuffling the order has little effect on the model's accuracy, but for important features, shuffling the order noticeably reduces the model's accuracy.
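A hedged sketch of both measures with scikit-learn: feature_importances_ gives the MDI scores, and permutation_importance approximates MDA by shuffling each feature on held-out data (dataset and hyperparameters are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print(rf.feature_importances_)   # MDI: mean decrease in impurity per feature

perm = permutation_importance(rf, X_te, y_te, n_repeats=10, random_state=0)
print(perm.importances_mean)     # MDA-style: mean drop in accuracy after shuffling each feature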

3) In sklearn, GBDT feature importance is measured by the weighted impurity decrease at non-leaf split nodes: the greater the decrease, the more important the feature.

4) XGBoost provides three measures (via get_score):

weight: the number of times a feature is used as a split point

gain: the average gain obtained when the feature is used for splitting

cover: the average number of samples covered by the nodes in which the feature is used to split
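A sketch of the three importance types exposed by get_score, assuming the xgboost package is available (the data and training parameters are illustrative):

import xgboost as xgb
from sklearn.datasets import load_breast_cancer

X, y = load_breast_cancer(return_X_y=True)
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"objective": "binary:logistic", "max_depth": 3}, dtrain, num_boost_round=50)

for imp_type in ("weight", "gain", "cover"):
    scores = booster.get_score(importance_type=imp_type)
    print(imp_type, sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:3])   # top 3 features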

3. Embedded method (Embedded)

The idea is to use regularization: the weights of some feature attributes are driven to 0, and the corresponding features are discarded. (In practice, a regularization term is added to the loss function; as gradient descent keeps minimizing the loss, the weights of some features are adjusted, and a weight that becomes 0 is equivalent to that feature being discarded, while the features that are not discarded are effectively selected.)

L1 regularization yields sparse solutions and therefore performs feature selection naturally. Note, however, that a feature not selected under L1 is not necessarily unimportant, because of two highly correlated features only one may be retained; if you want to determine which features are important, you should cross-check with the L2 regularization method.
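A minimal embedded-method sketch, assuming an L1-penalized logistic regression with scikit-learn's SelectFromModel keeping the features whose weights stay non-zero (the dataset and the C value are illustrative):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)

l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
selector = SelectFromModel(l1_model, prefit=True)
X_selected = selector.transform(X)

print((l1_model.coef_ == 0).sum(), "weights driven to 0 by the L1 penalty")
print(X_selected.shape)   # only features with non-zero weights survive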

III. What are the commonly used feature extraction methods?

Commonly used methods are principal component analysis (PCA), independent component analysis (ICA), and linear discriminant analysis (LDA). For data with class labels, it is generally best to consider LDA for dimensionality reduction. One can also first apply a small amount of PCA to remove noise and then apply LDA; if the training data has no class labels, prefer PCA.

Feature extraction forms a smaller number of new features from the original input, which destroys the original data distribution. To train a more robust model, feature extraction is generally not used unless the data is large and has many kinds of features.

1. PCA

As an unsupervised dimensionality reduction method, PCA only requires an eigenvalue decomposition to compress and denoise data, so it is used very widely in practice. To overcome some of PCA's shortcomings, many variants have appeared, such as KPCA for nonlinear dimensionality reduction, Incremental PCA for working around memory limitations, and Sparse PCA for reducing the dimensionality of sparse data.

PCA is the most common linear dimensionality reduction method. Its goal is to map high-dimensional data into a low-dimensional space via a linear projection, expecting the variance of the projected data to be as large as possible along each retained dimension (samples as spread out as possible), so that fewer dimensions retain as much information about the original data points as possible.
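A minimal PCA sketch with scikit-learn, projecting 4-dimensional iris data onto the 2 directions of largest variance (n_components=2 is an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, _ = load_iris(return_X_y=True)
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

print(pca.explained_variance_ratio_)   # share of the total variance kept by each component
print(X_2d.shape)                      # (150, 2): each new feature is a linear combination of the originals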

Advantages and disadvantages of PCA:

 

Advantages:

 

First, variance is the only measure of the amount of information needed, so the result is unaffected by factors outside the dataset. Second, the principal components are orthogonal to each other, which eliminates the interactions among the components of the original data. Third, the computation is simple: the main operation is an eigenvalue decomposition, which is easy to implement.

 

Disadvantages:

 

First, the meaning of each extracted dimension is somewhat fuzzy, so its interpretability is weaker than that of the original sample features. Second, PCA discards small-variance components, but these seemingly inert components may also contain important information about sample differences, and dropping them during dimensionality reduction may affect subsequent data processing.

 

2. LDA

LDA is a supervised dimensionality reduction technique, meaning that every sample in its dataset has a class label. The idea of LDA can be summarized in one sentence: "after projection, the within-class variance is minimal and the between-class variance is maximal." What does that mean? We project the data onto a low-dimensional space and hope that, after projection, the points of each class stay as close together as possible while the centers of the different classes are as far apart as possible.
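A hedged LDA sketch with scikit-learn: the projection uses the class labels y, and with 3 classes at most 2 discriminant dimensions exist (the iris data is an illustrative choice):

from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = load_iris(return_X_y=True)
lda = LinearDiscriminantAnalysis(n_components=2)   # n_components <= number of classes - 1
X_lda = lda.fit_transform(X, y)

print(X_lda.shape)                     # (150, 2)
print(lda.explained_variance_ratio_)   # between-class separation captured by each discriminant axis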

Advantages and disadvantages of LDA:

The main advantages of LDA are:

1) It can use prior class-label knowledge during dimensionality reduction, whereas unsupervised methods like PCA cannot use such prior knowledge.

2) When the class information of the samples depends on the mean rather than the variance, LDA is better than algorithms such as PCA.

The main drawbacks of LDA are:

1) LDA is not suitable for reducing the dimensionality of samples that are not Gaussian-distributed; PCA has this problem too.

2) LDA can reduce the dimensionality to at most k-1 dimensions, where k is the number of classes; if the target dimensionality is greater than k-1, LDA cannot be used. Of course, some evolved versions of LDA can get around this problem.

3) When the class information of the samples depends on the variance rather than the mean, LDA's dimensionality reduction works poorly.

4 ) LDA may over-fit the data.

 
