[Python] Data mining analysis and cleaning: a summary of feature selection (feature screening) methods


Link to this article: https://blog.csdn.net/weixin_47058355/article/details/130400400?spm=1001.2014.3001.5501

Foreword

After feature construction has produced a sufficiently broad set of features, those features need to be screened. Feature selection serves two main purposes:
Reduce the number of features and the dimensionality, so that the model generalizes better and overfits less.
Improve the understanding of the features and their values.

In general, features are selected based on two considerations:

Whether the feature diverges: if a feature does not diverge, for example its variance is close to 0, the samples are essentially identical on that feature, so the feature is useless for distinguishing them.

Correlation between the feature and the target: this one is fairly obvious; features that are highly correlated with the target should be preferred.

Data Display:
(Figure: a preview of the dataset)

1. Filter methods

As I understand it, filter methods judge and screen the data purely with statistical measures.

1.1 Variance-based

Use the variance to judge whether each feature diverges, and then decide from that result whether the feature should be dropped.

# Sort the features by variance (standard deviation)
data.std().sort_values()    # select_data is final_data with the one-hot encoded features removed
# Drop features with little variation: a feature that is all 1s or all 0s carries no information

(Figure: features sorted by standard deviation)

From the sorted results, drop the columns whose variance is 0, or that fall below some threshold, to complete the selection.
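
A minimal sketch of this step, assuming data is a purely numeric DataFrame and that 0.01 is only an illustrative threshold (scikit-learn's VarianceThreshold gives an equivalent result, but note it works on the variance rather than the standard deviation):

from sklearn.feature_selection import VarianceThreshold

# Manual approach: drop the columns whose standard deviation is below the chosen threshold
threshold = 0.01  # illustrative value, tune for your own data
low_var_cols = data.std()[data.std() < threshold].index
data_reduced = data.drop(columns=low_var_cols)

# Equivalent with scikit-learn (threshold is on the variance, i.e. std squared)
selector = VarianceThreshold(threshold=threshold ** 2)
selector.fit(data)
selected_cols = data.columns[selector.get_support()]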

1.2 Correlation coefficient

The correlation coefficient is a statistic used to measure the closeness of the linear relationship between two variables. Its value ranges from -1 to 1, where 1 indicates a complete positive correlation, -1 indicates a complete negative correlation, and 0 indicates no linear relationship. The symbol r is usually used to represent the correlation coefficient.
The most commonly used are the Pearson and Spearman coefficients.

import matplotlib.pyplot as plt
corr = data.corr('pearson')    # or data.corr('spearman')
# corr holds the feature-to-feature correlations; corr['target'] is then the correlation
# between the target column and every feature
plt.rcParams['font.family'] = ['Microsoft YaHei']
plt.figure(figsize=(5, 5))  # plot the result
corr['是否在当年造假'].sort_values(ascending=False)[1:].plot(kind='bar')  # the label column is normally the one used for the correlation ranking
plt.tight_layout()

(Figure: bar chart of each feature's correlation with the label)

Or visualize it with a heatmap

import seaborn as sns
# Use a heatmap to look at the pairwise relationships
f, ax = plt.subplots(figsize=(10, 10))  # set the figure size
sns.heatmap(corr, annot=True)  # annot controls whether the value is printed in each cell

(Figure: correlation heatmap of the features)

Here is a summary of the correlation coefficients commonly used in statistics:

Pearson Correlation Coefficient: the most commonly used correlation coefficient, measuring the linear relationship between two continuous variables. It is usually the best choice when the data are approximately normally distributed.

Spearman Correlation Coefficient: a non-parametric method for measuring the monotonic relationship between two variables; it does not require the relationship to be linear. It is calculated by converting the raw data into ranks (ordinal data) and then computing the Pearson correlation coefficient on those ranks.


Kendall Correlation Coefficient: also a non-parametric method for measuring the monotonic relationship between two variables, and in some cases more applicable than the Spearman coefficient. It is computed from the numbers of concordant and discordant rank pairs between the two variables.

Chebyshev Correlation Coefficient: used to measure the distance or difference between two variables; it is the maximum absolute difference between them.

Eta Correlation Coefficient: used to measure the relationship between a categorical variable and a continuous variable, and can be seen as an analogue of Pearson's correlation coefficient for that setting.

Mutual Information: It is used to measure the nonlinear relationship between two variables, especially in the presence of multivariate relationships and noise interference.

In practical applications, different correlation coefficients can be selected for calculation according to specific data types and research purposes.

Different correlation coefficients have different calculation methods and application scenarios. The differences and advantages between them are as follows:

Pearson Correlation Coefficient: used to measure the linear relationship between two continuous variables. Its advantages are simple computation, easy interpretation, and good comparability; its drawbacks are that it is sensitive to outliers and insensitive to nonlinear relationships.
Spearman Correlation Coefficient: used to measure the monotonic relationship between two variables. It is not affected by outliers and does not require the data to be normally distributed, which makes it suitable for nonlinear but monotonic relationships; however, it discards the magnitude differences in the data, since only the ranks are used.

Kendall Correlation Coefficient: also used to measure the monotonic relationship between two variables. Compared with the Spearman coefficient it is more robust and handles small samples well, but its computational complexity is higher.

Chebyshev Correlation Coefficient: used to measure the distance or difference between two variables. It has the advantage of being independent of the data distribution and scaling, but may not be robust to extreme outliers.

Eta Correlation Coefficient: used to measure the relationship between a categorical variable and a continuous variable. Because it is an effect-size measure it carries information about the strength of the association, but it can only describe the relationship between two variables at a time.

Mutual Information: used to measure the nonlinear relationship between two variables. It is more flexible than the Pearson correlation coefficient and can handle multivariate relationships and noise, but it is computationally more expensive and needs a large amount of data to estimate reliably.

To sum up, different correlation coefficients are suitable for different data types and research purposes, and more accurate results can be obtained by choosing an appropriate method.
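
As a minimal sketch of how several of these measures can be computed in Python (assuming x and y are numeric pandas Series and that scipy and scikit-learn are available; the toy data here is purely illustrative):

import numpy as np
import pandas as pd
from scipy import stats
from sklearn.feature_selection import mutual_info_regression

# Toy data: y is a noisy monotonic (but nonlinear) function of x
rng = np.random.default_rng(0)
x = pd.Series(rng.normal(size=200))
y = x ** 3 + rng.normal(scale=0.5, size=200)

print(stats.pearsonr(x, y))    # Pearson: strength of the linear relationship
print(stats.spearmanr(x, y))   # Spearman: monotonic relationship, computed on ranks
print(stats.kendalltau(x, y))  # Kendall: based on concordant vs. discordant rank pairs
print(mutual_info_regression(x.to_frame(), y))  # mutual information: captures nonlinear dependence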

2. Wrapper methods

Wrapper methods repeatedly select candidate feature subsets from the initial feature set, train a learner on each subset, and evaluate the subsets by the learner's performance, until the best subset is found. Wrapper feature selection is therefore optimized directly for a given learner.
(Put simply: take the features individually or in groups, train a model on them, and judge each feature's relevance by the model's score.)
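
As a generic illustration of the wrapper idea (not part of the original workflow in this post), scikit-learn's recursive feature elimination repeatedly fits a model and discards the weakest features; the sections below use per-feature scoring and sequential forward selection instead:

from sklearn.feature_selection import RFE
from sklearn.ensemble import RandomForestRegressor

X = data.iloc[:, :-1]
Y = data['是否在当年造假']

# Recursively drop features until only 5 remain (5 is just an illustrative target)
rfe = RFE(RandomForestRegressor(n_estimators=20, max_depth=4), n_features_to_select=5)
rfe.fit(X, Y)
print(list(X.columns[rfe.support_]))  # the features kept by the wrapper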

2.1 Random Forest

Score each feature one by one, using a random forest as the model.

# The usual approach: score each individual feature one at a time
from sklearn.model_selection import cross_val_score, ShuffleSplit
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold
import numpy as np

X = data.iloc[:,:-1]
Y = data['是否在当年造假']
names = X.columns

rf = RandomForestRegressor(n_estimators=20, max_depth=4)
kfold = KFold(n_splits=5, shuffle=True, random_state=7)
scores = []
for column in X.columns:
    print(column)
    tempx = X[column].values.reshape(-1, 1)
    score = cross_val_score(rf, tempx, Y, scoring="r2",error_score='raise',
                              cv=kfold)
    scores.append((round(np.mean(score), 3), column))
print(sorted(scores, reverse=True))

(Output: the per-feature R² scores, sorted from best to worst)

2.2 Feature importance analysis with XGBoost

XGBoost's gradient-boosted trees record how much each feature contributes within every decision tree; these contributions are combined with certain weights across the trees, which gives a built-in measure of feature importance.

# Now run XGBoost as well; XGBoost has its own feature-importance mechanism
from xgboost import XGBRegressor
from xgboost import plot_importance

xgb = XGBRegressor()
xgb.fit(X, Y)

plt.figure(figsize=(20, 10))
plot_importance(xgb)
plt.show()

(Figure: XGBoost feature importance scores)
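
As a small follow-up sketch (assuming the xgb model fitted above), the same importance scores can be read directly from the model and used to keep only the top-k features, where k = 5 is just an illustrative choice:

import pandas as pd

# Rank features by the model's built-in importances and keep the top 5
importances = pd.Series(xgb.feature_importances_, index=X.columns).sort_values(ascending=False)
top_features = list(importances.head(5).index)
print(top_features)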

2.3 SFS: the Sequential Forward Selection algorithm

Sequential forward selection based on a random forest regressor (RandomForestRegressor).

# Rank the features with SFS
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
# sfs = SFS(LinearRegression(), k_features=20, forward=True, floating=False, scoring='r2', cv=0)
sfs = SFS(RandomForestRegressor(n_estimators=10, max_depth=4), k_features=5, forward=True, floating=False, scoring='r2', cv=0)
# RandomForestRegressor(n_estimators=10, max_depth=4): a random forest regressor with 10 trees, each with a maximum depth of 4
# k_features=5: the number of features to end up with
# forward=True: use sequential forward selection
# floating=False: do not use the floating variant of the search
# scoring='r2': evaluate with R² (the coefficient of determination)
# cv=0: number of cross-validation folds; since it is 0, no cross-validation is used and the model is trained and evaluated on the data directly
X = data.iloc[:,:-1]
Y = data['是否在当年造假']

sfs.fit(X, Y)
sfs.k_feature_names_   

(Output: the names of the selected features)

# Plot how the score changes as SFS adds each successive feature
from mlxtend.plotting import plot_sequential_feature_selection as plot_sfs
fig1 = plot_sfs(sfs.get_metric_dict(), kind='std_dev')
plt.grid()
plt.show()

(Figure: SFS performance as features are added, with standard-deviation bands)

3. Embedded methods

Embedded feature screening is a commonly used feature selection approach in machine learning. It considers the importance of the features directly during model training, and performs the selection by embedding the feature weights into the training process itself.

Specifically, in embedded feature screening, the algorithm automatically selects the features most relevant to the target variable, while also penalizing those features that contribute less to the model. This can effectively prevent the overfitting problem and improve the interpretation ability of the model to a certain extent.

In practical applications, embedded feature screening is often used together with various machine learning algorithms, such as linear regression, logistic regression, support vector machines, etc.
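
For example, here is a minimal sketch of embedded selection with L1-regularized (Lasso) linear regression; alpha=0.01 is only an illustrative value, and the column name follows the dataset used throughout this post:

from sklearn.linear_model import Lasso
import pandas as pd

X = data.iloc[:, :-1]
Y = data['是否在当年造假']

lasso = Lasso(alpha=0.01)
lasso.fit(X, Y)
# Features whose coefficients are shrunk to exactly zero are effectively discarded
coefs = pd.Series(lasso.coef_, index=X.columns)
print(list(coefs[coefs != 0].index))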

3.1 SVC

Here is an example using a linear SVC:

from sklearn.svm import LinearSVC
from sklearn.feature_selection import SelectFromModel
import pandas as pd

X = data.iloc[:,:-1]
Y = data['是否在当年造假']

# Note: dual must be set to False here, otherwise an error is raised
model_lsvc = LinearSVC(penalty='l1', C=0.2, dual=False)  # set up the model
model_lsvc.fit(X, Y)
# penalty: the regularization strategy; 'l1' uses L1 regularization, 'l2' uses L2 regularization
# C: the regularization coefficient, controlling model complexity and fit; smaller values mean stronger regularization and a simpler model
# dual: whether to solve the dual or the primal problem; when there are more samples than features, dual=False is usually faster
# Below, the selection result is expressed with the feature names
df_0 = pd.DataFrame(X.columns)
df = pd.DataFrame(list(model_lsvc.coef_))
df2 = pd.DataFrame(df.values.T, index=df.columns, columns=df.index)  # transpose
df3 = pd.concat([df_0, df2], axis=1)
df3.columns = ['特征', '权重']  # feature name, weight
list(df3[df3['权重'] != 0]['特征'])  # keep the features whose weight is non-zero

(Output: the list of features with non-zero weights)
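
SelectFromModel is imported above but never used; as a sketch, it can perform essentially the same filtering directly (assuming the fitted model_lsvc and the X from the block above):

selector = SelectFromModel(model_lsvc, prefit=True)
X_selected = selector.transform(X)               # keep only the columns the model considers important
print(list(X.columns[selector.get_support()]))   # their names, which should roughly match the manual list above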

Summary

Feature selection refers to selecting the most representative and important features from the original data, keeping those features and removing the useless or redundant ones. Its main purposes are:

Improve the accuracy and precision of the model: By filtering and retaining the most important features, the interference of noise or irrelevant features can be eliminated, and the prediction accuracy and precision of the model can be improved.

Reduce the risk of overfitting: when the number of features is large, the model is prone to overfitting, i.e. it performs well on the training set but poorly on the test set. Feature selection cuts the number of features and the model's complexity, thereby lowering the risk of overfitting.

Reduce training time and save computing resources: Feature selection can reduce the amount of data that needs to be processed, thereby reducing training time and saving computing resources.

Improve interpretability and visualization: Feature selection can make the features of the model more intuitive and interpretable, facilitating subsequent visual analysis and interpretation.

In summary, feature selection is very important for building high-quality, efficient and interpretable machine learning models.
Besides feature selection, another way of screening features is dimensionality reduction, which I will cover in another blog post.
If this blog helped you, please like, bookmark, and leave a comment!
