How to select key features for data modeling during data analysis?

1. Why perform key feature selection?
As data volumes grow, we collect more and more data, and the data available for analysis and mining becomes increasingly rich. The goal is to mine the underlying patterns in this data to support decisions in real business.
In real tasks we often run into the curse of dimensionality, caused by having too many attributes. Feature selection reduces the difficulty of the learning task, since irrelevant features are just noise. It shrinks the size and complexity of the dataset, so we spend less time and less computational cost training the machine learning model and performing inference; a simpler model with fewer features is easier to understand and interpret; and it helps avoid overfitting, because the more features there are, the more complex the model becomes and the more the curse of dimensionality bites (error grows with the number of features).
The purpose of feature selection: 1) reduce the number of features and the dimensionality, so that the model generalizes better, trains faster, and overfits less; 2) improve our understanding of the features and their relationship to the target values.
2. What are the common problems in key feature selection?
The problems are: 1) When facing an unfamiliar domain, we rarely have enough knowledge to judge whether the features are related to our target, or whether the features are related to each other. We therefore need mathematical and engineering methods to help us pick out the features we need as well as possible. 2) Features are usually not independent, so feature selection is really a search over subsets of candidate features (the best combination of individual features). 3) Feature distributions often overlap between samples, and selection methods based on within-class and between-class statistics cannot reflect this overlap.
3. Three types of feature selection methods
Which feature selection method should you choose? Build yourself a voting selector.
Implement several of the feature selection methods discussed below. Your choice may depend on factors such as time, computing resources, and the measurement level of your data. Run as many different methods as you reasonably can. Then, for each feature, note the percentage of selection methods that suggest keeping it in the dataset. If more than 50% of the methods vote to keep it, keep that feature; otherwise, drop it.
The idea behind this approach is that while some methods may misjudge some features due to their inherent bias, an ensemble of several methods should still recover a useful set of features.
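A minimal sketch of such a voting selector, assuming each selection method has been wrapped as a function that takes the feature table and target and returns the column names it would keep (the wrappers and the 50% cutoff are illustrative, not a fixed recipe):

```python
import pandas as pd

def vote_select(X: pd.DataFrame, y, selectors, keep_ratio=0.5):
    """Keep a feature if more than `keep_ratio` of the selectors vote for it.

    `selectors` is a list of callables; each takes (X, y) and returns an
    iterable of column names it would keep (hypothetical wrappers around
    variance filtering, ANOVA scoring, model importances, and so on).
    """
    votes = pd.Series(0.0, index=X.columns)
    for select in selectors:
        kept = set(select(X, y))
        votes[votes.index.isin(kept)] += 1      # one vote per method that keeps the feature
    share = votes / len(selectors)
    return share[share > keep_ratio].index.tolist()
```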
1) Statistical methods
►Definition: their biggest advantage is that they do not depend on a model; they mine value from the features alone, ranking and then selecting (or eliminating) them. They are also more general, since they are model-agnostic: they do not overfit to any particular algorithm. And they are easy to interpret: if a feature has no statistical relationship with the target, it is discarded. The core idea is to rank the features; once features are sorted by their score, any proportion or number of them can be selected or removed.
The downside is that they look at each feature individually, evaluating only its relation to the target. This makes it easy for them to drop useful features that are weak predictors of the target on their own but add a lot of value to the model when combined with other features.
►Includes: variance selection, analysis of variance (ANOVA), correlation coefficient
►Applicable scenarios: /
►Advantages/comparison or difference between various methods:
variance selection: compute the variance of each feature, then keep the features whose variance exceeds a threshold. Advantages: the computation is cheap, since it only requires the variance of each feature, and it works well as a first pass to filter features and reduce the cost of later algorithms. Disadvantages: it depends heavily on the choice of threshold; if the threshold is too high, many useful features are filtered out, and if it is too low, too much useless data remains. Some features with a large effect may still have low variance, for example because of data imbalance, and the variance filter will easily delete them. It also only applies directly to discrete data; continuous data should first be binned into intervals (discretized) and then variance-filtered.
►Applicable scenarios: because of these drawbacks, variance filtering is usually applied first, to remove features that barely change or do not change at all and shrink the data, and a model-based method is then used for a second round of screening.
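For illustration, a variance filter with scikit-learn's VarianceThreshold might look like the sketch below; the toy data and the threshold of 0.01 are only examples and should be adapted to your dataset:

```python
import pandas as pd
from sklearn.feature_selection import VarianceThreshold

# X is a DataFrame of numeric features; 0.01 is an illustrative threshold
X = pd.DataFrame({"a": [1, 1, 1, 1], "b": [1, 2, 3, 4], "c": [0, 0, 0, 1]})
selector = VarianceThreshold(threshold=0.01)
selector.fit(X)

kept = X.columns[selector.get_support()]   # features with variance above the threshold
print(list(kept))                          # 'a' has zero variance and is dropped
```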
Analysis of variance (ANOVA) is a hypothesis-testing method. Its goal is to test whether the differences between group means are statistically significant. Advantages: (1) it is not limited by the number of groups, can handle large samples with multiple comparisons, makes full use of the data to estimate the experimental error, and can separate the influence of each factor on the test indicator from that error, making it a quantitative method with strong comparability and high precision; (2) ANOVA can examine interactions between multiple factors. Disadvantages: (1) it involves all the data, so the computation is heavier; (2) its preconditions are fairly strict: the samples must be independent, normally distributed, and have homogeneous variances, so a homogeneity-of-variance test needs to be carried out on the data first.
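A minimal ANOVA-based scoring sketch with scikit-learn, where f_classif computes the ANOVA F-statistic between each feature and a categorical target (the iris data and k=2 are only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)

# Score every feature with the ANOVA F-test and keep the top k
selector = SelectKBest(score_func=f_classif, k=2)
X_selected = selector.fit_transform(X, y)

print(selector.scores_)        # F-statistic per feature
print(selector.get_support())  # boolean mask of the kept features
```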
Correlation coefficient: the main idea is to compute the correlation coefficient between each feature and the target variable and keep the features with the highest correlation. The advantage is that it is the simplest way to understand the relationship between a feature and the response variable; it measures the linear correlation between the variables, is fast and easy to compute, and is often run as soon as the data is available (after cleaning and feature extraction). The flaw is that it assumes both variables are normally distributed and only measures linear correlation: when the relationship is non-linear, Pearson's r will fail to detect it, even if it is really strong.
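A quick sketch of correlation-based ranking with pandas; the toy DataFrame, the column name "target", and the 0.3 cutoff are all illustrative:

```python
import pandas as pd

# df holds the features plus a numeric target column named "target"
df = pd.DataFrame({
    "x1": [1, 2, 3, 4, 5],
    "x2": [2, 1, 4, 3, 5],
    "x3": [5, 4, 3, 2, 1],
    "target": [1.1, 2.0, 2.9, 4.2, 5.1],
})

# Pearson correlation of every feature with the target, ranked by absolute value
corr = df.corr(method="pearson")["target"].drop("target").abs().sort_values(ascending=False)
selected = corr[corr > 0.3].index.tolist()
print(corr)
print(selected)
```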
►Effect: brief description of the operation + final result plot

variance selection
[figure]
The fields whose output is greater than the threshold are called important features.

variance analysis
[figure]

correlation coefficient
[figure]

2) Model-based methods
►Definition: a model is used to score different feature subsets, and the best features are finally selected. Each candidate subset is used to train a model, whose performance is then evaluated on a holdout set; the subset of features that yields the best model performance is selected.
►Includes: logistic regression classification, random forest classification, gradient boosting decision tree classification, ReliefF, RFE
►Applicable scenarios: when we do not understand the business, or have thousands of features, we can let algorithms help us. Or let an algorithm filter the features first, and then use business knowledge to pick a smaller set from the reduced candidates.
►Advantages/comparison or difference between various methods:
Logistic regression, random forest, RFE, and similar methods can help us identify which variables are most useful for classification, which can improve model accuracy. Think of feature selection as a black-box problem: you only need to specify an objective function (generally the evaluation metric of a specific model) and maximize it by some search procedure, regardless of its internal implementation. Concretely, given a selection problem with N features, it can be abstracted as choosing the optimal subset of K features that maximizes the objective function.
The advantage is that they provide the best-performing feature set for a particular type of model. The downsides are that they may overfit to that model type, so the selected subset may not generalize if you try it with a different machine learning model, and that the computation is heavy: they require training a large number of models, which takes time and computing power.
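As a sketch of two common model-based routes in scikit-learn, the snippet below runs recursive feature elimination (RFE) with logistic regression and importance-based selection with SelectFromModel on a random forest; the synthetic data and the numbers of features/estimators are illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE, SelectFromModel
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=20, n_informative=5, random_state=0)

# RFE: repeatedly fit the model and drop the weakest features until 5 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)
print("RFE kept:", rfe.support_.nonzero()[0])

# SelectFromModel: keep features whose random forest importance exceeds the median importance
sfm = SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=0), threshold="median")
sfm.fit(X, y)
print("RF kept:", sfm.get_support().nonzero()[0])
```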
►Effect: brief description of the operation + final result plot
logistic regression
[figure]

random forest
[figure]

Gradient Boosting Decision Tree
[figure]

ReliefF
[figure]

RFE
[figure]

Only important features are shown.

3) Ensemble methods
►Definition: several Python tools for analyzing feature importance.
►Includes: Shap, Permutation Importance, Boruta, Partial Dependence Plots
►Applicable scenarios: /
►Advantages/comparison or difference between various methods:
Shap can be used for feature screening and can improve performance, but its disadvantage is the high time cost: the more parameter combinations there are, or the more precise the selection process, the longer it takes. This is a practical limitation we cannot really get around.
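A minimal SHAP sketch for a tree model, assuming the shap package is installed; ranking features by their mean absolute SHAP value is one common summary (the synthetic regression data is illustrative):

```python
import numpy as np
import shap
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=10, random_state=0)
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)

# TreeExplainer is efficient for tree ensembles; other model types need other explainers
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)        # one SHAP value per sample and feature

# Rank features by mean absolute SHAP value (largest first)
importance = np.abs(shap_values).mean(axis=0)
print(np.argsort(importance)[::-1])
```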
Permutation Importance is suited to tabular data; it judges the importance of a feature by how much the model's performance score drops after that feature is randomly shuffled. The advantages are that it is fast to compute, widely applicable and easy to understand, and consistent with the properties we expect a feature importance measure to have.
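A sketch with scikit-learn's permutation_importance: each feature is shuffled on a held-out set and the drop in score is its importance (the synthetic data and n_repeats are illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_train, y_train)

# Shuffle each feature on the held-out set and measure how much the score drops
result = permutation_importance(model, X_test, y_test, n_repeats=10, random_state=0)
print(result.importances_mean)
```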
Boruta is a simple but statistically elegant algorithm. It uses feature importance measures from random forest models to select the best subset of features, and it does so by introducing two neat ideas. Boruta classifies features (confirm or reject) rather than ranking them, which is in stark contrast to many other feature selection methods.
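A sketch using the third-party boruta package (BorutaPy), assuming it is installed and compatible with your numpy/scikit-learn versions; it expects a tree-based estimator that exposes feature_importances_ (the synthetic data is illustrative):

```python
import numpy as np
from boruta import BorutaPy
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=15, n_informative=5, random_state=0)

# Boruta compares real features against shuffled "shadow" copies using RF importances
rf = RandomForestClassifier(n_jobs=-1, max_depth=5, random_state=0)
boruta = BorutaPy(rf, n_estimators="auto", random_state=0)
boruta.fit(np.asarray(X), np.asarray(y))      # BorutaPy works on numpy arrays

print("confirmed:", np.where(boruta.support_)[0])
print("tentative:", np.where(boruta.support_weak_)[0])
```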
Partial Dependence Plots: like permutation importance, a partial dependence plot can only be computed after the model has been fitted.
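A sketch with recent scikit-learn versions, where PartialDependenceDisplay.from_estimator computes partial dependence from an already-fitted model (the synthetic data and feature indices are illustrative):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

X, y = make_regression(n_samples=300, n_features=6, random_state=0)
model = GradientBoostingRegressor(random_state=0).fit(X, y)

# Partial dependence of the prediction on features 0 and 1, computed from the fitted model
PartialDependenceDisplay.from_estimator(model, X, features=[0, 1])
plt.show()
```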
►Effect: brief description of the operation + final result plot
Permutation Importance
[figure]

Boruta
[figure]

Shap
[figure]

Partial Dependence Plots
[figure]

Larger fluctuations indicate more important features.

Origin blog.csdn.net/qq_42963448/article/details/130109885