[Feature Selection] Basic Knowledge

         Data and features determine the upper limit of machine learning; models and algorithms merely approximate that limit. Feature selection is an important part of feature engineering. In real tasks, feature selection is usually performed after the data is obtained, and the learner is then trained on the selected relevant features.

The concept of feature selection

  • Relevant features: features that are relevant to the current learning task
  • Irrelevant features: features that are irrelevant to the current learning task
  • Feature selection: the process of selecting a subset of relevant features from the given feature set without discarding any important features

 

Reasons for Feature Selection

1. Reduce the difficulty of learning

2. Mitigate the curse of dimensionality

3. Reduce computational and storage overhead

4. Improve model interpretability

 

The feature selection process

          Generate a "candidate subset", evaluate its quality, generate the next candidate subset based on the evaluation result, and evaluate again, repeating until no better candidate subset can be found. Feature selection therefore has two key steps: how to generate the next candidate feature subset based on the evaluation results (the subset search problem), and how to evaluate the quality of a candidate feature subset (the subset evaluation problem).

(1) Subset search problem

      The subset search problem is commonly solved with a greedy strategy, of which there are three common variants (a sketch of the forward variant follows this list):

  • Forward search: start from an empty set and gradually add relevant features
  • Backward search: start from the full feature set and gradually remove irrelevant features
  • Bidirectional search: gradually add relevant features while removing irrelevant ones
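
A minimal sketch of greedy forward search (not from the original post), assuming a caller-supplied `evaluate(subset)` scoring function, a hypothetical name, where a higher score means a better subset; it implements the generate-and-evaluate loop described above:

```python
from typing import Callable, Iterable, Set


def forward_search(features: Iterable[str],
                   evaluate: Callable[[Set[str]], float]) -> Set[str]:
    """Greedily grow a feature subset one feature at a time, stopping
    when no single addition improves the evaluation score."""
    remaining = set(features)
    selected: Set[str] = set()
    best_score = float("-inf")
    while remaining:
        best_candidate = None
        for f in remaining:  # evaluate every one-feature extension
            score = evaluate(selected | {f})
            if score > best_score:
                best_candidate, best_score = f, score
        if best_candidate is None:  # no extension improves the score
            break
        selected.add(best_candidate)
        remaining.remove(best_candidate)
    return selected
```

Backward search mirrors this loop, starting from the full set and removing one feature per round; bidirectional search interleaves the two.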

(2) Subset evaluation problem

         The quality of a candidate subset can be evaluated by computing its information gain: the greater the information gain, the more classification-relevant information the candidate subset contains. Information gain is only one criterion for subset evaluation; any other measure of how well the partition induced by a feature subset matches the true class labels can also be used.
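
As a concrete illustration (an assumption, not code from the original post), the information gain of a candidate subset A over a dataset D is Ent(D) − Σ_v (|D^v|/|D|) Ent(D^v), where the D^v are the groups of samples that take identical values on A. A minimal sketch with pandas, assuming a DataFrame whose label column is named `y` (a hypothetical name):

```python
import numpy as np
import pandas as pd


def entropy(labels: pd.Series) -> float:
    """Shannon entropy of a label distribution, in bits."""
    p = labels.value_counts(normalize=True)
    return float(-(p * np.log2(p)).sum())


def information_gain(df: pd.DataFrame, subset: list, label: str = "y") -> float:
    """Ent(D) minus the size-weighted entropy of each group of samples
    that agree on every feature in the candidate subset."""
    groups = df.groupby(subset)[label]
    weighted = sum(len(g) / len(df) * entropy(g) for _, g in groups)
    return entropy(df[label]) - weighted


df = pd.DataFrame({"f1": [0, 0, 1, 1], "f2": [0, 1, 0, 1], "y": [0, 0, 1, 1]})
print(information_gain(df, ["f1"]))  # 1.0: f1 alone determines y
print(information_gain(df, ["f2"]))  # 0.0: f2 carries no class information
```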

Decision trees as a feature selection method

         Decision tree induction performs feature selection implicitly: at each node the algorithm chooses the feature whose split yields the greatest information gain (or Gini decrease), so the set of features actually used for splitting is the selected feature subset.
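As a small illustration (not from the original post), scikit-learn's DecisionTreeClassifier exposes an impurity-based importance for each feature after training; the features with nonzero importance are exactly the ones the tree chose to split on:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

data = load_iris()
tree = DecisionTreeClassifier(random_state=0).fit(data.data, data.target)

# Features the tree actually split on form the selected subset.
selected = [name for name, imp in zip(data.feature_names,
                                      tree.feature_importances_) if imp > 0]
print(selected)
```
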

Feature selection methods: filter, wrapper, embedded

  • Filter methods: score each individual feature by divergence or correlation, then select features by a threshold or a target number. The selection process is independent of the learner: the initial features are filtered first, and the model is then trained on the features that survive (see the combined sketch after this list).


  • Wrapper methods: score candidate feature subsets with the learner itself, using the learner's performance as the evaluation criterion, and select the subset most beneficial to that learner.
  • Embedded methods: feature selection is integrated into the training process of the learner, with the learning algorithm itself scoring the features; selection happens automatically as the model is trained.
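
A hedged scikit-learn sketch (dataset and parameter choices are illustrative, not from the original post) showing one representative of each family:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Filter: score each feature independently (here an ANOVA F-test) and
# keep the top k; no learner is involved in the selection itself.
X_filter = SelectKBest(f_classif, k=10).fit_transform(X, y)

# Wrapper: recursive feature elimination repeatedly retrains the learner
# and drops the weakest features, so selection is driven by the model.
X_wrapper = RFE(LogisticRegression(solver="liblinear"),
                n_features_to_select=10).fit_transform(X, y)

# Embedded: an L1-penalized model zeroes out coefficients during training;
# features keeping a nonzero weight are selected as a side effect.
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
X_embedded = SelectFromModel(lasso).fit_transform(X, y)

print(X_filter.shape, X_wrapper.shape, X_embedded.shape)
```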
