Machine Learning Notes 10---Feature Selection

    We can describe a watermelon with many attributes, such as color, root shape, knocking sound, texture, and so on, but an experienced person often only needs to look at the root and listen to the knocking sound to tell whether it is a good melon. In other words, for a given learning task, some of the available attributes are critical and useful while others are not. We call attributes "features": those useful for the current learning task are called "relevant features", and those that are not are called "irrelevant features". The process of selecting a subset of relevant features from a given feature set is called "feature selection".

    Feature selection is an important "data preprocessing" step. In real machine learning tasks, feature selection is usually performed first, after the data is obtained, and the learner is trained afterwards.

    There are two reasons for feature selection:

    ① Real tasks often run into the curse of dimensionality, which is caused by having too many attributes. If the important features can be selected so that the subsequent learning process only needs to build a model on a fraction of the features, the curse of dimensionality is greatly alleviated.

    ② Removing irrelevant features often reduces the difficulty of the learning task, much like a detective solving a case: once the confounding factors are stripped away and only the key ones remain, the truth is easier to see.

    The main point to note is that the feature selection process must not lose important features; otherwise the subsequent learning process cannot achieve good performance because important information is missing. For a given data set, different learning tasks will likely have different relevant features, so the "irrelevant features" in feature selection are those irrelevant to the current learning task. There is also a class of features called "redundant features", whose information can be deduced from other features. For example, for a cuboid object, if the features "base length" and "base width" are already present, then "base area" is a redundant feature, because it can be obtained from "base length" and "base width". Redundant features are often useless, and removing them lightens the burden of the learning process. Sometimes, however, a redundant feature reduces the difficulty of the learning task: if the learning goal is to estimate the volume of the cuboid, the redundant feature "base area" makes the volume easier to estimate. More precisely, when a redundant feature happens to correspond to an "intermediate concept" needed to complete the learning task, that redundant feature is beneficial.

    To select from the initial feature set a subset that contains all the important information, without any domain knowledge as a prior assumption we would have to traverse all possible subsets; this is computationally infeasible, because it runs into a combinatorial explosion and becomes impossible as soon as the number of features is even moderately large. A feasible approach is to generate a "candidate subset", evaluate how good it is, generate the next candidate subset based on the evaluation result, evaluate it again, and so on, until no better candidate subset can be found. Clearly, two key steps are involved: how to generate the next candidate feature subset based on the evaluation result, and how to evaluate how good a candidate feature subset is.

    ① The first step is the "subset search" problem. Given a feature set {a1, a2, ..., ad}, we can treat each feature as a single-feature candidate subset and evaluate these d candidate subsets. Suppose {a2} is the best; then {a2} becomes the selected set of the first round. Next, we add one more feature to the previous round's selected set to form candidate two-feature subsets; suppose {a2, a4} is the best of these and is better than {a2}, then {a2, a4} becomes the selected set of this round; and so on. If, in the (k+1)-th round, the best candidate (k+1)-feature subset is no better than the previous round's selected set, we stop generating candidate subsets and take the k-feature set selected in the previous round as the result of feature selection. This strategy of gradually adding relevant features is called "forward" search. Similarly, if we start from the full feature set and try to remove one irrelevant feature at a time, this strategy of gradually shrinking the feature set is called "backward" search. The two can also be combined: each round adds relevant features to the selected set (these will not be removed in later rounds) while removing irrelevant features; this strategy is called "bidirectional" search.

    Obviously, all the strategies above are greedy, because they only try to make the selected set optimal in the current round. For example, suppose a5 is better than a6 in the third round, so the selected set becomes {a2, a4, a5}; yet in the fourth round, {a2, a4, a6, a8} may be better than every {a2, a4, a5, ai}. Unfortunately, such problems cannot be avoided without an exhaustive search.

    ② The second step is the "subset evaluation" problem. Given a data set D, suppose the proportion of samples of the i-th class in D is pi (i = 1, 2, ..., |γ|). For ease of discussion, assume all sample attributes are discrete. For an attribute subset A, suppose D is split into V subsets {D1, D2, ..., DV} according to the values taken on A, so that the samples in each subset have the same value on A. We can then compute the information gain of the attribute subset A:

        Gain(A) = Ent(D) - Σ_{v=1}^{V} (|Dv| / |D|) · Ent(Dv)

where the information entropy is defined as:

        Ent(D) = - Σ_{i=1}^{|γ|} pi · log2(pi)

    The larger the information gain Gain(A), the more information helpful for classification the feature subset A contains. Therefore, for each candidate feature subset, we can compute its information gain on the training data set D and use it as the evaluation criterion.
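    As a concrete illustration, the sketch below computes Ent(D) and Gain(A) for a small discrete-valued data set. The function names and the toy watermelon-style data are assumptions made for this example, not code from the book.

```python
# A minimal sketch of subset evaluation via information gain, assuming the data set
# is a list of (feature_dict, label) pairs with discrete attribute values.
from collections import Counter, defaultdict
from math import log2


def entropy(labels):
    """Information entropy Ent(D) of a collection of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * log2(c / total) for c in counts.values())


def information_gain(samples, labels, subset):
    """Gain(A) for feature subset A: Ent(D) minus the weighted entropy of the
    partition of D induced by the joint values of the features in A."""
    partitions = defaultdict(list)
    for sample, label in zip(samples, labels):
        key = tuple(sample[f] for f in subset)   # joint value of A on this sample
        partitions[key].append(label)
    weighted = sum(len(part) / len(labels) * entropy(part) for part in partitions.values())
    return entropy(labels) - weighted


if __name__ == "__main__":
    # Toy watermelon-style data: two discrete features and a binary label.
    samples = [{"root": "curled", "sound": "dull"},
               {"root": "curled", "sound": "crisp"},
               {"root": "stiff",  "sound": "dull"},
               {"root": "stiff",  "sound": "crisp"}]
    labels = ["good", "good", "bad", "bad"]
    print(information_gain(samples, labels, ["root"]))           # 1.0: perfectly informative
    print(information_gain(samples, labels, ["sound"]))          # 0.0: uninformative alone
    print(information_gain(samples, labels, ["root", "sound"]))  # 1.0: no extra gain
```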

    More generally, the feature subset A actually determines a partition of the data set D, where each block of the partition corresponds to one value taken on A, while the label information Y corresponds to the true partition of D. By estimating the difference between these two partitions we can evaluate A: the smaller the difference from the partition corresponding to Y, the better A is. Information entropy is only one way to measure this difference; any other mechanism that can measure the difference between two partitions can be used for feature subset evaluation.

    Combining a feature subset search mechanism with a subset evaluation mechanism yields a feature selection method.
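    Here is a minimal sketch of how the two mechanisms fit together: a greedy forward search that accepts any evaluation function, such as the information_gain() shown above. forward_search is an illustrative name, not a library routine.

```python
# A minimal sketch of greedy forward subset search with a pluggable evaluator.
def forward_search(features, evaluate):
    """In each round, add the single feature whose addition scores best;
    stop when no candidate improves on the previous round's selected set."""
    selected, best_score = [], float("-inf")
    remaining = list(features)
    while remaining:
        # Evaluate every candidate subset formed by adding one remaining feature.
        scored = [(evaluate(selected + [f]), f) for f in remaining]
        score, feature = max(scored)
        if score <= best_score:   # no improvement: keep the previous round's result
            break
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected


# Example usage with the information-gain evaluator (a filter-style method):
# chosen = forward_search(["root", "sound", "color"],
#                         lambda A: information_gain(samples, labels, A))
```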

    Common feature selection methods can be roughly divided into three categories:

    · Filter

    · Wrapper

    · Embedded

1> Filter methods first perform feature selection on the data set and then train the learner; the feature selection process is independent of the subsequent learner. This is equivalent to first "filtering" the initial features with a feature selection process and then training the model on the filtered features.
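A hedged illustration of the filter idea using scikit-learn's SelectKBest with a mutual-information score; the iris data set and k=2 are placeholders, and mutual information is just one common scoring choice, not the specific filter criterion discussed above.

```python
# Filter-style selection: features are scored independently of the downstream learner.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=mutual_info_classif, k=2)  # keep the 2 highest-scoring features
X_filtered = selector.fit_transform(X, y)                    # filtering happens before any learner is trained
print(X_filtered.shape)                                      # (150, 2)
```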

2> Unlike filter feature selection, which ignores the subsequent learner, wrapper feature selection directly uses the performance of the learner that will eventually be used as the evaluation criterion for feature subsets. In other words, the purpose of wrapper feature selection is to select the feature subset that is most beneficial to, and "tailor-made" for, the given learner.
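A sketch of the wrapper idea using scikit-learn's SequentialFeatureSelector, which runs a greedy forward search scored by the cross-validated performance of the chosen learner; the k-NN learner and n_features_to_select=2 are placeholder choices, not the book's specific wrapper method.

```python
# Wrapper-style selection: subsets are judged by the learner that will actually be used.
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
learner = KNeighborsClassifier(n_neighbors=3)
sfs = SequentialFeatureSelector(learner, n_features_to_select=2,
                                direction="forward", cv=5)  # scored by the learner's CV accuracy
X_wrapped = sfs.fit_transform(X, y)
print(X_wrapped.shape)                                      # (150, 2)
```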

3> In filter and wrapper feature selection, the feature selection process is clearly separate from the learner training process. In contrast, embedded feature selection fuses feature selection and learner training into a single optimization process, i.e., feature selection is carried out automatically while the learner is being trained.
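A sketch of the embedded idea under the common assumption that L1 regularization is used: the L1-penalized model drives some coefficients to exactly zero while it trains, so selection and training happen in the same optimization, and SelectFromModel simply keeps the features with non-zero weight. The data set and the value of C are placeholders.

```python
# Embedded-style selection: fitting the L1-regularized model *is* the selection step.
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(l1_model).fit(X, y)   # zero-weight features are discarded
X_embedded = selector.transform(X)
print(X.shape[1], "->", X_embedded.shape[1])     # typically only a subset of the 30 features survives
```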

Refer to Zhou Zhihua's "Machine Learning"


Original article: blog.csdn.net/m0_64007201/article/details/127606778