[Machine Learning] Feature Engineering: Feature Dimensionality Reduction


1. Introduction

Feature dimensionality reduction refers to the process of mapping high-dimensional data to a lower-dimensional subspace, i.e. reducing the number of dimensions in the feature space.

In machine learning and data analysis, feature dimensionality reduction can help reduce the complexity of data, reduce computational costs, improve model performance and interpretability, and solve problems such as the curse of dimensionality. Feature dimensionality reduction is generally divided into two main approaches: feature selection and feature extraction.

  1. Feature Selection : Feature selection refers to selecting a subset of the most representative and important features from the original features while discarding the rest. This reduces the number of features and thus the dimensionality. Feature selection methods can evaluate the importance of features based on indicators such as statistical tests, information gain, and model weights, and then keep the top-ranked features.
  2. Feature Extraction : Feature extraction maps the original features to a new low-dimensional subspace through a mathematical transformation, thereby preserving the key information in the data. Common feature extraction methods include principal component analysis (PCA), linear discriminant analysis (LDA), and independent component analysis (ICA). These methods convert high-dimensional data into low-dimensional representations through linear or nonlinear mappings, making the new features more expressive (a minimal sketch contrasting the two approaches follows this list).
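
As a minimal sketch (not part of the original article), the following scikit-learn snippet contrasts the two approaches on synthetic data: feature selection keeps a subset of the original columns, while feature extraction builds new ones. The feature counts here are arbitrary.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.decomposition import PCA

# Synthetic data: 200 samples, 20 features, 5 of which are informative
X, y = make_classification(n_samples=200, n_features=20, n_informative=5, random_state=0)

# Feature selection: keep 5 of the original columns, scored by an ANOVA F-test
X_selected = SelectKBest(score_func=f_classif, k=5).fit_transform(X, y)

# Feature extraction: build 5 new features as linear combinations of all 20 columns
X_extracted = PCA(n_components=5).fit_transform(X)

print(X_selected.shape, X_extracted.shape)  # (200, 5) (200, 5)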

The advantages of feature dimensionality reduction include:

  • Alleviate the curse of dimensionality : The curse of dimensionality refers to problems such as increasing data sparsity and the breakdown of distance measures in high-dimensional space. Feature dimensionality reduction can alleviate these problems, making the data easier to process and analyze.
  • Reduce computational cost : The computational cost of high-dimensional data is high, and feature dimensionality reduction can reduce computational complexity and improve algorithm efficiency.
  • Improve model performance : In some cases, feature dimensionality reduction can improve model performance, reduce overfitting, and improve generalization.
  • Visualization and interpretability : Mapping data into a low-dimensional space allows for easier visualization and interpretation, helping to understand patterns and relationships in the data.

The choice of feature dimensionality reduction depends on the nature of the data, the needs of the problem, and the requirements of the model. Different dimensionality reduction methods are suitable for different situations and need to be selected and applied according to specific problems.

2. Dimensionality reduction

Dimensionality reduction refers to the process of reducing the number of random variables (features) under certain constraints, in order to obtain a set of "uncorrelated" principal variables.

Reducing the number of random variables matters because models learn from the features: if a feature itself is flawed, or the correlation between features is strong, the algorithm's learning and prediction will be noticeably affected.

Example of correlated features: relative humidity and rainfall are strongly correlated.

There are two main ways of dimensionality reduction: feature selection and principal component analysis (which can be understood as a form of feature extraction).

3. Feature selection

3.1. Brief introduction

Definition: the data contains redundant or irrelevant variables (also called features, attributes, or indicators); feature selection aims to find the main features among the original features.

Feature selection refers to selecting the most representative and important part of the features from the feature set of the original data for use in building models, analyzing data, or solving problems.

The goal of feature selection is to reduce the number of features while preserving the most informative parts of the data, thereby reducing computational costs, improving model performance, speeding up the training process, and improving model interpretability.

The main motivations for feature selection are:

  1. Dimensionality reduction : The number of features in a high-dimensional dataset can be very large, resulting in increased computational and storage overhead and reduced algorithm efficiency.
  2. Reduce overfitting : Too many features can lead to overly complex models that tend to perform well on the training set but perform poorly on new data (overfitting).
  3. Improve model performance : Some features may not contribute to model performance, and may even introduce noise. By selecting important features, the performance of the model can be improved.
  4. Improved interpretability : Using fewer features makes the model easier to understand and explain.

Feature selection methods can be divided into three main categories:

  1. Filter Methods : Features are evaluated and ranked independently of any learning model, and those most relevant to the target variable are selected. Commonly used filter methods include variance thresholding, correlation coefficients, and mutual information.
  2. Wrapper Methods : Feature selection is treated as a search problem, and candidate feature subsets are chosen according to the performance of a model trained on them. Common wrapper methods include recursive feature elimination (RFE) and forward selection (a sketch of the filter and wrapper categories follows this list).
  3. Embedded Methods : Feature selection is performed during model training, and features are selected by optimizing the performance of the model. For example, decision trees and regularized linear models can prune or constrain feature weights during training.
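
A minimal sketch of the first two categories with scikit-learn on synthetic data (the estimators and parameters here are illustrative choices, not prescribed by the original text); an embedded example appears at the end of section 3.2.

from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, n_features=15, n_informative=4, random_state=0)

# Filter: score each feature against the target with an ANOVA F-test, keep the best 4
X_filter = SelectKBest(score_func=f_classif, k=4).fit_transform(X, y)

# Wrapper: repeatedly train a model and drop the weakest features until 4 remain
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=4)
X_wrapper = rfe.fit_transform(X, y)

print(X_filter.shape, X_wrapper.shape)  # (300, 4) (300, 4)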

The choice of feature selection method depends on the nature of the data, the needs of the problem, and the requirements of the model. Different methods are suitable for different situations and need to be selected and applied according to specific problems. Feature selection is an important part of data preprocessing, which can lay the foundation for building more accurate, efficient and interpretable machine learning models.

3.2. Two methods

Filter: mainly examines the characteristics of the features themselves and the relationships between features and between features and the target value

Variance selection method: low-variance feature filtering

Correlation coefficient

Embedded: the algorithm automatically selects features during training (based on the association between features and the target value); a small sketch using L1 regularization follows at the end of this subsection

Decision Tree: Information Entropy, Information Gain

Regularization: L1, L2

Deep Learning: Convolutions and more

Modules to be used: sklearn.feature_selection
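
As a hedged sketch of the embedded approach mentioned above, the following uses L1 regularization (Lasso) with sklearn.feature_selection.SelectFromModel on synthetic data; the dataset and parameters are illustrative.

from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_regression(n_samples=300, n_features=15, n_informative=4, noise=5.0, random_state=0)

# Embedded: the L1 penalty drives the weights of unhelpful features toward exactly zero,
# so feature selection happens as a by-product of fitting the model
lasso = Lasso(alpha=1.0).fit(X, y)
selector = SelectFromModel(lasso, prefit=True)
X_selected = selector.transform(X)

print(X_selected.shape)        # only features with non-negligible coefficients remain
print(selector.get_support())  # boolean mask over the original 15 columns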

3.3. Filter methods

3.3.1. Low variance feature filtering

Low-variance feature filtering removes features whose variance is small; the meaning of variance was discussed earlier. The method judges a feature by the size of its variance:

Small feature variance: most samples have very similar values for that feature

Large feature variance: many samples have different values for that feature

API:

sklearn.feature_selection.VarianceThreshold(threshold = 0.0)

Remove all low variance features

Variance.fit_transform(X)

X: data in numpy array format [n_samples, n_features]

Return value: features whose training-set variance is lower than threshold will be removed. The default (threshold=0.0) keeps all features with non-zero variance, i.e. removes features that have the same value in all samples.
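
As a quick illustration on toy data (not from the article), columns whose values never change are dropped while the others are kept:

import numpy as np
from sklearn.feature_selection import VarianceThreshold

X = np.array([[0, 2, 0, 3],
              [0, 1, 4, 3],
              [0, 1, 1, 3]])  # columns 0 and 3 are constant (zero variance)

selector = VarianceThreshold(threshold=0.0)  # default: drop zero-variance features
X_new = selector.fit_transform(X)

print(X_new)                   # only the two non-constant columns remain
print(selector.get_support())  # [False  True  True False]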

Case Practice:

The data computation is performed below. We apply this filter to some stock index features; the required data is saved in the file factor_returns.csv.

The 'index', 'date', and 'return' columns need to be removed (their types do not match and they are not the required indicators).

So the required features are as follows: pe_ratio, pb_ratio, market_cap, return_on_asset_net_profit, du_return_on_equity, ev, earnings_per_share, revenue, total_expense

The analysis proceeds as follows:

1. Initialize VarianceThreshold and specify the variance threshold

2. Call fit_transform

# -*- coding: utf-8 -*-
# @Author:︶ㄣ释然
# @Time: 2023/8/16 10:01
import pandas as pd
from sklearn.feature_selection import VarianceThreshold  # low-variance feature filtering

'''
sklearn.feature_selection.VarianceThreshold(threshold = 0.0)
    Removes all low-variance features.
    Variance.fit_transform(X)
    X: data in numpy array format [n_samples, n_features]
    Return value: features whose training-set variance is lower than threshold are removed.
    The default keeps all features with non-zero variance, i.e. removes features that have the same value in all samples.
'''
def variance_demo():
    """
    Remove low-variance features -- feature selection
    :return: None
    """
    data = pd.read_csv("data/factor_returns.csv")
    print(data)
    # 1. Instantiate a transformer class
    transfer = VarianceThreshold(threshold=1)
    # 2. Call fit_transform
    data = transfer.fit_transform(data.iloc[:, 1:10])
    print("Result after removing low-variance features:\n", data)
    print("Shape:\n", data.shape)


if __name__ == '__main__':
    # Set pandas display options to show all columns
    pd.set_option('display.max_columns', None)
    variance_demo()

The result is as follows:

3.3.2. Correlation coefficient

The Pearson correlation coefficient is a statistical indicator that reflects how closely two variables are related.

It measures the strength and direction of the linear relationship between two continuous variables.

The value of the correlation coefficient lies between −1 and +1, i.e. −1 ≤ r ≤ +1. Its properties are as follows:

  • When r>0, it means that the two variables are positively correlated, and when r<0, the two variables are negatively correlated
  • When |r|=1, the two variables are perfectly linearly correlated; when r=0, there is no linear correlation between the two variables
  • When 0<|r|<1, it means that there is a certain degree of correlation between the two variables. And the closer |r| is to 1, the closer the linear relationship between the two variables; the closer |r| is to 0, the weaker the linear relationship between the two variables
  • Generally, it can be divided into three levels: |r|<0.4 is low correlation; 0.4≤|r|<0.7 is significant correlation; 0.7≤|r|<1 is high linear correlation

Formula:

r = \frac{n\sum xy - \sum x \sum y}{\sqrt{n\sum x^{2} - (\sum x)^{2}}\;\sqrt{n\sum y^{2} - (\sum y)^{2}}}

The parameters are as follows:

n: number of observations.

∑: summation symbol, which means summing all observations.

x and y: the observed values of the two variables, respectively.
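
As a small check (made-up numbers, not from the article), the formula can be implemented directly with NumPy and compared against scipy.stats.pearsonr:

import numpy as np
from scipy.stats import pearsonr

x = np.array([12.5, 15.3, 23.2, 26.4, 33.5, 34.4, 39.4, 45.2, 55.4, 60.9])
y = np.array([21.2, 23.9, 32.9, 34.1, 42.5, 43.2, 49.0, 52.8, 59.4, 63.5])

n = len(x)
numerator = n * np.sum(x * y) - np.sum(x) * np.sum(y)
denominator = np.sqrt(n * np.sum(x ** 2) - np.sum(x) ** 2) * np.sqrt(n * np.sum(y ** 2) - np.sum(y) ** 2)
r_manual = numerator / denominator

r_scipy, p_value = pearsonr(x, y)  # returns (correlation coefficient, p-value)
print(r_manual, r_scipy)           # the two values agree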

API:

from scipy.stats import pearsonr

x : (N,) array_like

y : (N,) array_like

Returns: (Pearson's correlation coefficient, p-value)

Case: Financial Index Correlation Calculation of Stocks

To compute the correlation between the stock indicators just mentioned, assume we use factor = ['pe_ratio','pb_ratio','market_cap','return_on_asset_net_profit','du_return_on_equity','ev','earnings_per_share','revenue','total_expense'] and calculate the coefficient for every pair of these features; this reveals some feature pairs with high correlation.

Analysis: compute the correlation coefficient between every pair of features

import pandas as pd
from scipy.stats import pearsonr  # Pearson correlation coefficient

'''
from scipy.stats import pearsonr
    x : (N,) array_like
    y : (N,) array_like
    Returns: (Pearson's correlation coefficient, p-value)
'''
def pearsonr_demo():
    """
    Correlation coefficient calculation
    """
    data = pd.read_csv("data/factor_returns.csv")

    factor = ['pe_ratio', 'pb_ratio', 'market_cap', 'return_on_asset_net_profit', 'du_return_on_equity', 'ev',
              'earnings_per_share', 'revenue', 'total_expense']

    for i in range(len(factor)):
        for j in range(i, len(factor) - 1):
            print("The correlation between indicator %s and indicator %s is %f"
                  % (factor[i], factor[j + 1], pearsonr(data[factor[i]], data[factor[j + 1]])[0]))

if __name__ == '__main__':
    # Set pandas display options to show all columns
    pd.set_option('display.max_columns', None)
    pearsonr_demo()

From the output it follows that:

The correlation between the indicator revenue and the indicator total_expense is 0.995845

The correlation between the indicator return_on_asset_net_profit and the indicator du_return_on_equity is 0.818697


The correlation between these two pairs of indicators is relatively large, and subsequent processing can be done, such as synthesizing these two indicators.
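
One possible follow-up, sketched here under the assumption that the same factor_returns.csv file is available, is to compress the two highly correlated columns revenue and total_expense into a single principal component:

import pandas as pd
from sklearn.decomposition import PCA

data = pd.read_csv("data/factor_returns.csv")

# Synthesize the two nearly redundant indicators into one new feature
pca = PCA(n_components=1)
combined = pca.fit_transform(data[['revenue', 'total_expense']])

print(combined.shape)                 # one synthesized column per sample
print(pca.explained_variance_ratio_)  # close to 1.0 when the pair is this strongly correlated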


Origin blog.csdn.net/qq_60735796/article/details/132320783