Machine Learning 03: Feature Selection

1. What is feature selection?

Feature selection means picking a subset of the extracted features to serve as the training features. The values of the selected features may or may not be transformed along the way, but the feature dimension after selection must be smaller than before, since only some of the features are kept.
Main methods (the three major weapons):
Filter: VarianceThreshold
Embedded: regularization, decision trees
Wrapper
Here we mainly cover the filter method; the rest will be introduced later.

2. Implementing feature dimensionality reduction

1. Filter: VarianceThreshold

The VarianceThreshold algorithm deletes all low-variance features, on the assumption that low-variance features are not informative. For example, suppose we want to find out what kind of men hide private money, and one of the features in our (all-male) sample is gender (of course, no real survey would do this). The variance of the gender column is then obviously 0, and the feature tells us nothing about the target value.
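To make the gender example concrete, here is a minimal sketch (the data are invented for illustration) that checks the per-column variance with NumPy:

import numpy as np

# Hypothetical all-male sample: column 0 encodes gender (constant),
# column 1 is some other measurement
X = np.array([[1, 30],
              [1, 45],
              [1, 22]])

# Per-column variance: the gender column has variance 0, so a
# variance-based filter would discard it
print(np.var(X, axis=0))  # -> [ 0.    90.889] (approximately)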


sklearn feature selection API:
from sklearn.feature_selection import VarianceThreshold
VarianceThreshold(threshold=0.0)
Removes all low-variance features.
VarianceThreshold.fit_transform(X[, y])
X: data in numpy array format [n_samples, n_features]
Return value: the training set with every feature whose variance is below the threshold removed.
The default is to keep all features with non-zero variance, i.e. remove the features that have the same value in every sample.
The code snippet is as follows:

from sklearn.feature_selection import VarianceThreshold

def var():
    """
    Feature selection: remove low-variance features.
    :return: None
    """
    # threshold=0.0 removes only the features whose variance is exactly 0
    var = VarianceThreshold(threshold=0.0)
    data = var.fit_transform(
        [[0, 2, 0, 3],
         [0, 1, 4, 3],
         [0, 1, 1, 3]]
    )
    print(data)

The result keeps only the two middle columns, since the first and last columns are constant (variance 0) and get dropped:

[[2 0]
 [1 4]
 [1 1]]
The threshold in VarianceThreshold(threshold=0.0) can also be set to a value greater than 0; the algorithm then deletes every feature whose variance is smaller than the value you set.
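For example, with a larger threshold the same data loses every feature whose variance does not exceed it, not just the constant ones:

from sklearn.feature_selection import VarianceThreshold

# threshold=1.0: only features with variance greater than 1.0 are kept
selector = VarianceThreshold(threshold=1.0)
data = selector.fit_transform(
    [[0, 2, 0, 3],
     [0, 1, 4, 3],
     [0, 1, 1, 3]]
)
print(data)  # only the third column (variance ≈ 2.89) survives: [[0] [4] [1]]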

2. PCA (Principal Component Analysis)

What is PCA?
Essence: PCA (Principal Component Analysis) is a technique for analyzing and simplifying data sets. Its purpose is to compress the data, reducing the dimensionality (complexity) of the original data as much as possible while losing as little information as possible.
Function: it can reduce the number of features used in regression analysis or cluster analysis.
For more details, see this article:
https://zhuanlan.zhihu.com/p/77151308
As dedicated library-callers, here we mainly introduce the API:
PCA(n_components=None)
Projects the data into a lower-dimensional space.
PCA.fit_transform(X)
X: data in numpy array format [n_samples, n_features]
Return value: an array of the specified (reduced) dimensionality.

from sklearn.decomposition import PCA

def pca():
    """
    Principal component analysis for feature dimensionality reduction.
    """
    # A float n_components in (0, 1), typically 0.9-0.95, keeps just enough
    # components to explain at least that fraction of the variance
    pca = PCA(n_components=0.9)
    data = pca.fit_transform(
        [[2, 8, 4, 5],
         [6, 3, 0, 8],
         [5, 4, 9, 1]]
    )
    print(data)

The printed array has fewer columns than the original four-feature input: PCA has compressed the data while keeping at least 90% of the variance.
PCA is most useful when there are very many features, on the order of hundreds or thousands.
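To choose n_components in practice, you can fit PCA and inspect its explained_variance_ratio_ attribute. A minimal sketch (random data stands in for a real high-dimensional dataset):

import numpy as np
from sklearn.decomposition import PCA

# Stand-in data: 100 samples with 50 features (random, for illustration only)
rng = np.random.RandomState(0)
X = rng.rand(100, 50)

# Keep just enough components to explain at least 90% of the variance
pca = PCA(n_components=0.9)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                      # (100, number_of_kept_components)
print(pca.explained_variance_ratio_.sum())  # at least 0.9 by construction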
