Feature Correlation Analysis

1. Plotting

You can often judge whether two features are related simply by plotting them, for example with a scatter plot, a scatter plot with a fitted line, or a line chart. A minimal sketch follows.
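
As a minimal sketch of this idea (the DataFrame df and the column names feat_a and feat_b are hypothetical, not from the original post):

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def plot_relationship(df: pd.DataFrame, x: str, y: str) -> None:
    # scatter plot with a fitted regression line: points hugging the line
    # suggest a linear relationship between the two features
    sns.regplot(x=x, y=y, data=df)
    plt.title(f'{x} vs {y}')
    plt.show()

# usage (df, 'feat_a' and 'feat_b' are placeholders for your own data):
# plot_relationship(df, 'feat_a', 'feat_b')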

2. Variance

Calculate the variance of each feature. If the variance is close to 0, the feature takes essentially the same value across all samples, so it is useless for distinguishing samples and can be eliminated.

from sklearn.feature_selection import VarianceThreshold

selector = VarianceThreshold(threshold=0.1)  # default is threshold=0.0
selector.fit_transform(offline_data_shuffle1[numerical_features])

# inspect the variance of each feature
selector.variances_, len(selector.variances_)

# map each feature name to its variance
all_used_features_dict = dict(zip(numerical_features, selector.variances_))
all_used_features_dict

See the scikit-learn documentation of the sklearn.feature_selection.VarianceThreshold class for full usage details.

3. Covariance

Calculate the covariance of two features:

  • If the covariance is positive, the two features are positively correlated; the larger the covariance, the stronger the correlation.
  • If the covariance is negative, the two features are negatively correlated; the more negative the covariance, the stronger the correlation.
  • If the covariance is 0, the two features are uncorrelated (though not necessarily independent).
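
A quick numpy illustration (the arrays here are made up for the example):

import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 8.0])  # roughly 2*x, so expect a positive covariance

# np.cov returns the 2x2 covariance matrix; the off-diagonal entry is cov(x, y)
print(np.cov(x, y)[0, 1])  # positive -> positively correlated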

4. Pearson correlation coefficient

Correlation coefficient: a standardized covariance; dividing by the two standard deviations removes the influence of the two features' scales (dimensions).
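
For reference, the standard definition:

$\rho_{X,Y} = \frac{\mathrm{cov}(X,Y)}{\sigma_X\,\sigma_Y}$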

Conditions for using the Pearson coefficient:

  • There is a linear relationship between the two features, and both are continuous.
  • Both features' populations are normally distributed, or at least nearly normal and unimodal.
  • The observations of the two features are paired, and each pair of observations is independent of the others.

Properties of the Pearson coefficient:

  • It reflects whether two features are positively or negatively correlated;
  • It is a standardized covariance: it eliminates the influence of the two features' scales and simply reflects how similarly the two variables move per unit change.

Rule-of-thumb classification of Pearson coefficient magnitudes:

  • 0.8-1.0: extremely strong correlation;
  • 0.6-0.8: strong correlation;
  • 0.4-0.6: moderate correlation;
  • 0.2-0.4: weak correlation;
  • 0.0-0.2: very weak or no correlation.
  • If the coefficient between two features is above 0.8, they have an obvious linear relationship and only one needs to be kept; generally, keep the one with the larger Pearson coefficient with the label, or the one with the larger lightgbm AUC (a sketch of this pruning rule follows below).
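
A hedged sketch of that pruning rule (the helper drop_correlated, the threshold default, and the column handling are all illustrative, not from the original post):

import pandas as pd

def drop_correlated(df: pd.DataFrame, label: str, thresh: float = 0.8) -> list:
    # hypothetical helper: prune one feature from every highly correlated pair
    feats = [c for c in df.columns if c != label]
    corr = df[feats].corr().abs()                     # pairwise |Pearson| between features
    label_corr = df[feats].corrwith(df[label]).abs()  # |Pearson| of each feature with the label
    to_drop = set()
    for i, f1 in enumerate(feats):
        for f2 in feats[i + 1:]:
            if corr.loc[f1, f2] > thresh:
                # of a highly correlated pair, drop the one less correlated with the label
                to_drop.add(f1 if label_corr[f1] < label_corr[f2] else f2)
    return [f for f in feats if f not in to_drop]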

Limitation of the Pearson coefficient:

  • The coefficient alone cannot be used to predict data: it does not distil the relationship between the variables into a model. To make predictions from the relationship, use regression analysis (a brief sketch follows).
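
A minimal sketch of that follow-up step, assuming scikit-learn and made-up 1-D data:

import numpy as np
from sklearn.linear_model import LinearRegression

np.random.seed(0)
x = np.arange(10, dtype=float).reshape(-1, 1)        # feature column
y = 3.0 * x.ravel() + np.random.normal(0, 0.5, 10)   # noisy linear relationship

# fit a model so the relationship can actually be used for prediction
model = LinearRegression().fit(x, y)
print(model.predict([[12.0]]))  # predicted y for an unseen x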

Pearson coefficient sample code 1:

# Method 1: numpy.corrcoef computes the correlation matrix of several arrays
# (a, b, c, d are assumed to be 1-D arrays of equal length)
import numpy as np
np.corrcoef([a, b, c, d])

# Method 2: compute the pairwise Pearson correlations between features and draw a heatmap
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(25, 25))
corr_values1 = data[all_used_features].corr()  # pandas DataFrame.corr computes pairwise correlations directly
sns.heatmap(corr_values1, annot=True, vmax=1, square=True, cmap="Blues", fmt='.2f')
plt.tight_layout()
# plt.savefig('prepare_data/columns37.png', dpi=600)
plt.show()

# Method 3: scipy.stats.pearsonr computes the correlation coefficient and the p-value at the same time
from scipy.stats import pearsonr

np.random.seed(0)
size = 300
x = np.random.normal(0, 1, size)
print("Lower noise", pearsonr(x, x + np.random.normal(0, 1, size)))
print("Higher noise", pearsonr(x, x + np.random.normal(0, 10, size)))

Pearson coefficient sample code 2:

# Compute each feature's correlation with the label and plot it as a horizontal bar chart
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# all non-object features other than the target column '信用分'
x_cols = [col for col in train_csv.columns if col not in ['信用分'] and train_csv[col].dtype != 'object']
labels = []
values = []
for col in x_cols:
    labels.append(col)
    values.append(np.corrcoef(train_csv[col].values, train_csv['信用分'].values)[0, 1])

corr_df = pd.DataFrame({'col_labels': labels, 'corr_values': values})
corr_df = corr_df.sort_values(by='corr_values')

ind = np.arange(len(labels))
width = 0.5
fig, ax = plt.subplots(figsize=(12, 40))
rects = ax.barh(ind, np.array(corr_df.corr_values.values), color='y')

ax.set_yticks(ind)
ax.set_yticklabels(corr_df.col_labels.values, rotation='horizontal')
ax.set_xlabel('Correlation coefficient')
ax.set_title('Correlation coefficient of the variables')
plt.show()

5. Distance correlation coefficient

Sometimes, even if the Pearson correlation coefficient is 0, it cannot be concluded that the two features are independent, since there may be a nonlinear correlation (see the sketch after the note below). The distance correlation coefficient was introduced to overcome this weakness of the Pearson coefficient: if the distance correlation coefficient is 0, then the two features are independent.

Note:
When the relationship between features is close to linear, the Pearson correlation coefficient is still irreplaceable:

  1. The Pearson correlation coefficient is fast to compute;
  2. The Pearson correlation coefficient ranges over [-1, 1], while MIC and the distance correlation coefficient both range over [0, 1]; thus the Pearson coefficient can express negative correlation, whereas MIC and the distance correlation coefficient cannot.
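
A sketch of the weakness mentioned above (made-up data; x is symmetric around 0, so the Pearson coefficient sees almost nothing even though y is fully determined by x):

import numpy as np
from scipy.stats import pearsonr

np.random.seed(0)
x = np.random.uniform(-1, 1, 1000)
y = x**2  # perfect nonlinear dependence

r, p = pearsonr(x, y)
print(r)  # close to 0: the Pearson coefficient misses the dependence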

How to calculate the distance correlation coefficient?

  1. Compute the norm distance between every pair of rows; given n paired statistical samples (X, Y):
    $a_{jk} = \|X_j - X_k\|, \quad b_{jk} = \|Y_j - Y_k\|$
import numpy as np

# generate a 3x2 array
X = np.random.randint(-100, 100, (3, 2)) * 0.1
# for example:
# array([[ 4.8,  7.7],
#        [-2.6,  6.8],
#        [ 5.9,  9. ]])
Y = X**2
# number of rows in the data set
col = X.shape[0]
# n x n zero matrices to hold the raw and centered distances
a = np.zeros((col, col))
b = np.zeros((col, col))
A = np.zeros((col, col))
B = np.zeros((col, col))
# norm distance between every pair of rows
for j in range(col):
    for k in range(col):
        a[j, k] = np.linalg.norm(X[j] - X[k])
        b[j, k] = np.linalg.norm(Y[j] - Y[k])
        # a and b both end up as symmetric matrices

Partial output:
the norm distance between row 0 and row 0 of X is 0
  distance in a: 0.0
  distance in b: 0.0
the norm distance between row 0 and row 1 of X:
  distance in a: 8.174350127074325
  distance in b: 25.112168365157167
  2. Double-center all pairwise distances:
    $A_{jk} = a_{jk} - \bar{a}_{j\cdot} - \bar{a}_{\cdot k} + \bar{a}_{\cdot\cdot}, \quad B_{jk} = b_{jk} - \bar{b}_{j\cdot} - \bar{b}_{\cdot k} + \bar{b}_{\cdot\cdot}$
    where $\bar{a}_{j\cdot}$ is the mean of row j, $\bar{a}_{\cdot k}$ is the mean of column k, and $\bar{a}_{\cdot\cdot}$ is the mean of the whole distance matrix of the X samples.
for m in range(col):
    for n in range(col):
        # center a and b, storing the results in A and B
        A[m, n] = a[m, n] - a[m].mean() - a[:, n].mean() + a.mean()
        B[m, n] = b[m, n] - b[m].mean() - b[:, n].mean() + b.mean()
  3. Compute the squared sample distance covariance as the arithmetic mean of the products $A_{jk}B_{jk}$:
    $dCov_n^2(X,Y) := \frac{1}{n^2}\sum_{j=1}^{n}\sum_{k=1}^{n} A_{jk} B_{jk}$
cov_xy = np.sqrt((1/(col**2)) * (A*B).sum())
  4. Compute the sample distance variance:
    $dVar_n^2(X) := dCov_n^2(X,X) = \frac{1}{n^2}\sum_{j=1}^{n}\sum_{k=1}^{n} A_{jk}^2$
cov_xx = np.sqrt((1/(col**2)) * (A*A).sum())
cov_yy = np.sqrt((1/(col**2)) * (B*B).sum())
  5. Divide the distance covariance of the two features by the product of their distance standard deviations to get the distance correlation:
    $dCor(X,Y) = \frac{dCov(X,Y)}{\sqrt{dVar(X)\,dVar(Y)}}$
dcor = cov_xy/np.sqrt(cov_xx*cov_yy)

The complete code to calculate the distance correlation coefficient is as follows:

import numpy as np

def dist_corr(x, y):
    # x and y are arrays whose rows are observations; if they are 2-D,
    # the distance matrices are formed from their row vectors
    # number of rows in the data set
    col = x.shape[0]
    # four col x col zero matrices for the raw and centered distances
    a = np.zeros((col, col))
    b = np.zeros((col, col))
    A = np.zeros((col, col))
    B = np.zeros((col, col))
    # norm distance between every pair of rows, via a double loop
    for j in range(col):
        for k in range(col):
            a[j, k] = np.linalg.norm(x[j] - x[k])
            b[j, k] = np.linalg.norm(y[j] - y[k])
    # double-center the distance matrices
    for m in range(col):
        for n in range(col):
            A[m, n] = a[m, n] - a[m].mean() - a[:, n].mean() + a.mean()
            B[m, n] = b[m, n] - b[m].mean() - b[:, n].mean() + b.mean()
    # distance covariance and distance variances
    cov_xy = np.sqrt((1/(col**2)) * (A*B).sum())
    cov_xx = np.sqrt((1/(col**2)) * (A*A).sum())
    cov_yy = np.sqrt((1/(col**2)) * (B*B).sum())
    return cov_xy / np.sqrt(cov_xx * cov_yy)
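
A quick usage check (a sketch with made-up data; since Y is a deterministic nonlinear function of X, the distance correlation should come out well above 0):

import numpy as np

np.random.seed(0)
X = np.random.randint(-100, 100, (30, 2)) * 0.1
Y = X**2  # nonlinear but fully dependent on X

print(dist_corr(X, Y))  # substantially above 0: the nonlinear dependence is detected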

To be continued...

Source: blog.csdn.net/weixin_46838605/article/details/126590215