Feature similarity measure

  When performing feature selection, we need to measure how closely a feature relates to the target. There are many ways to measure this; the methods below can be used with the filter approach to feature selection. In the filter approach, feature selection is not directly tied to the training of the model: features are selected using only information computed from the features (and the target) themselves.

  The referenced article summarizes the feature measurement methods in the following figure:

[Figure FS1: overview of feature measurement methods]

 

1: Correlation coefficient (Pearson coefficient)

  For two variables $\mathbf{x}$ and $\mathbf{y}$ the correlation coefficient is defined as:

  $\rho(\mathbf{x},\mathbf{y}) = \frac{\operatorname{cov}(\mathbf{x},\mathbf{y})}{\sqrt{D(\mathbf{x})D(\mathbf{y})}}$

  Where $\operatorname{cov}(\mathbf{x},\mathbf{y})$ is the covariance of $\mathbf{x}$ and $\mathbf{y}$, and $D(\mathbf{x})$ and $D(\mathbf{y})$ are the variances of $\mathbf{x}$ and $\mathbf{y}$.

  When we compute it from collected sample data, the formula is:

  $\rho(\mathbf{x}, \mathbf{y}) = \frac{ \sum \limits_{i=1}^n (x_i - \bar x)(y_i - \bar y) }{\sqrt{ \sum \limits_{i=1}^n (x_i - \bar x)^2 \sum \limits_{i=1}^n (y_i - \bar y)^2 }}$

  The value of the correlation coefficient lies between -1 and 1. A value of 1 (or -1) means the two variables are perfectly linearly related, and a value of 0 means there is no linear correlation. The correlation coefficient is appropriate when both variables are continuous numeric variables.

  Another interpretation of the correlation coefficient: it is the cosine of the angle between the two data vectors after centering, where centering means subtracting each variable's mean from its data.
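A minimal sketch of this in Python (using NumPy and SciPy, with made-up toy data) computes the Pearson coefficient and checks the "cosine of the centered data" interpretation:

```python
import numpy as np
from scipy.stats import pearsonr

# Toy feature x and target y; the values are invented purely for illustration.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.2, 1.9, 3.2, 3.8, 5.1])

# Sample correlation coefficient (SciPy also returns a two-sided p-value).
r, p = pearsonr(x, y)

# Same quantity seen as the cosine of the angle between the centered vectors.
xc, yc = x - x.mean(), y - y.mean()
cos_angle = xc @ yc / (np.linalg.norm(xc) * np.linalg.norm(yc))

print(r, cos_angle)  # the two values agree
```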

 

2: Analysis of variance (ANOVA)

  ANOVA stands for analysis of variance. The background is as follows: in production and in everyday life, an outcome is usually affected by multiple factors. For example, wheat yield is affected by light, soil acidity and alkalinity, temperature, and so on; in chemical production, the product is affected by raw material composition, raw material dosage, catalyst, reaction temperature, pressure, solution concentration, reaction time, and the machinery and equipment. To determine which factor has the greatest impact on the result, we use analysis of variance. The quantity to be studied is called the test index, and the conditions that affect the test index are called factors.

  One-way (univariate) analysis of variance studies the effect of a single factor on the outcome. We let this factor take different values (levels) and collect a group of data at each level. For example, to study whether the machine affects the thickness of aluminum alloy sheets, we collect a group of thickness measurements for the sheets produced on each machine.

  We turn this into a hypothesis-testing problem: assuming each group of data follows a normal distribution and all groups share the same variance, we test whether the group means are equal. If the means are equal within the acceptable range, the factor has little influence on the test index; if they are not, the factor has a large influence.

  The calculation produces a few key quantities, which are collected in a one-way ANOVA table.

| Source of variance | Sum of squares | Degrees of freedom | Mean square | F ratio |
|---|---|---|---|---|
| Factor A | $S_A$ | $s-1$ | $\bar{S_A} = \frac{S_A}{s-1}$ | $F=\frac{\bar{S_A}}{\bar{S_E}}$ |
| Error | $S_E$ | $n-s$ | $\bar{S_E} = \frac{S_E}{n-s}$ | |
| Total | $S_T$ | $n-1$ | | |

Here $S_A$ is the sum of squares due to factor A (the between-group sum of squares); it reflects the squared differences between the sample mean at each level of factor A and the mean of the overall data.

$S_E$ is the error (within-group) sum of squares: for each level of factor A it sums the squared deviations of the observations from their group mean, and it reflects the size of the random error.

n is the total number of observations, and s is the number of levels that factor A can take.

The statistic $F=\frac{S_A /(s-1)}{S_E/(n-s)}$ is called the ANOVA F statistic, and the test based on it is called the ANOVA F test.

 

A typical example is given below:

Aluminum alloy sheets are produced on three machines, and we want to know whether the machine has a significant effect on the sheet thickness. The data set is as follows:

| Machine 1 | Machine 2 | Machine 3 |
|---|---|---|
| 0.236 | 0.257 | 0.258 |
| 0.238 | 0.253 | 0.264 |
| 0.248 | 0.255 | 0.259 |
| 0.245 | 0.254 | 0.267 |
| 0.243 | 0.261 | 0.262 |

Following the standard calculation (see the book "Probability Theory and Mathematical Statistics" for the details), the resulting ANOVA table is:

| Source of variance | Sum of squares | Degrees of freedom | Mean square | F ratio |
|---|---|---|---|---|
| Factor | 0.00105333 | 2 | 0.00052667 | 32.92 |
| Error | 0.000192 | 12 | 0.000016 | |
| Total | 0.00124533 | 14 | | |

At the significance level $\alpha = 0.05$, $F_{0.05}(2,12) = 3.89 < 32.92$, so the null hypothesis is rejected. We conclude that the sheet thickness differs significantly between machines.
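As a quick cross-check, here is a small sketch (assuming NumPy and SciPy are available) that computes $S_A$, $S_E$ and F for this data directly from the definitions above and compares the result with scipy.stats.f_oneway:

```python
import numpy as np
from scipy.stats import f_oneway

# Sheet thickness measured on each machine (data from the table above).
groups = [
    np.array([0.236, 0.238, 0.248, 0.245, 0.243]),  # machine 1
    np.array([0.257, 0.253, 0.255, 0.254, 0.261]),  # machine 2
    np.array([0.258, 0.264, 0.259, 0.267, 0.262]),  # machine 3
]

n = sum(len(g) for g in groups)            # total number of observations
s = len(groups)                            # number of levels of factor A
grand_mean = np.concatenate(groups).mean()

# Sums of squares as defined in the ANOVA table above.
S_A = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
S_E = sum(((g - g.mean()) ** 2).sum() for g in groups)
F = (S_A / (s - 1)) / (S_E / (n - s))

print(S_A, S_E, F)        # roughly 0.00105, 0.000192, 32.92
print(f_oneway(*groups))  # SciPy returns the same F plus its p-value
```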

SelectKBest in sklearn.feature_selection, combined with f_classif, selects features by their ANOVA F value.
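A minimal usage sketch (here on scikit-learn's built-in iris data, chosen only for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

# Keep the 2 features with the largest ANOVA F values.
X, y = load_iris(return_X_y=True)
selector = SelectKBest(score_func=f_classif, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_)   # F statistic of each feature
print(selector.pvalues_)  # corresponding p-values
print(X_new.shape)        # (150, 2)
```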

 

 

3: $\chi^2$ test (chi-square test)

  The $\chi^2$ test first assumes that the feature has no relationship with the target variable, and then uses the formula

  $\chi ^2 = \sum \frac{(A - T)^2}{T}$

  to calculate the $\chi^2$ value, where A is the observed (actual) value and T is the theoretical value derived from our assumption.

Take a look at this example:

We want to know whether people of different genders differ in their preference for cats and dogs, given the following data.

|  | Cat | Dog |
|---|---|---|
| Male | 207 | 282 |
| Female | 231 | 242 |

First, we add up the rows and columns to get the totals:

|  | Cat | Dog | Total |
|---|---|---|---|
| Male | 207 | 282 | 489 |
| Female | 231 | 242 | 473 |
| Total | 438 | 524 | 962 |

Then we calculate the theoretical (expected) values: the counts we would expect if gender had no effect on pet preference, so that the counts depend only on the marginal totals.

The proportion of cats is $\frac{438}{962}$, and the proportion of dogs is $\frac{524}{962}$, so theoretically:

The number of men who like cats should be $489 \times \frac{438}{962} = 222.64$

The number of men who like dogs should be $489 \times \frac{524}{962} = 266.36$

The number of women who like cats should be $473 \times \frac{438}{962} = 215.36$

The number of women who like dogs should be $473 \times \frac{524}{962} = 257.64$

The expected counts are shown in the table below:

|  | Cat | Dog |
|---|---|---|
| Male | 222.64 | 266.36 |
| Female | 215.36 | 257.64 |

Finally, calculate the value of $\chi ^2$ according to the formula we gave above:

$\chi ^2 = \frac{(207-222.64)^2}{222.64} + \frac{(282-266.36)^2}{266.36} + \frac{(231-215.36)^2}{215.36}+ \frac{(242-257.64)^2}{257.64} = 4.102$

Knowing the degrees of freedom, which is 1 in this problem (df = (rows - 1)(columns - 1)), we can convert the $\chi ^2$ value to a p-value, giving p = 0.04283. Since p is less than 0.05 (a conventional significance level), there is strong evidence for rejecting the hypothesis: people of different genders have different pet preferences, i.e. the two variables are not independent.
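These numbers can be reproduced with SciPy; a small sketch assuming scipy is installed (Yates' continuity correction is turned off so the result matches the hand calculation above):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Observed counts from the table above (rows: male, female; columns: cat, dog).
observed = np.array([[207, 282],
                     [231, 242]])

chi2_val, p, dof, expected = chi2_contingency(observed, correction=False)
print(chi2_val, p, dof)  # roughly 4.102, 0.0428, 1
print(expected)          # the theoretical counts computed above
```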

SelectKBest in sklearn.feature_selection, combined with chi2, selects features using the $\chi^2$ test.
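A minimal usage sketch (again on the iris data; note that sklearn's chi2 requires non-negative feature values such as counts or frequencies):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Keep the 2 features with the largest chi-square scores.
X, y = load_iris(return_X_y=True)
X_new = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)
print(X_new.shape)  # (150, 2)
```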

 

 

A note on hypothesis testing:

  The ANOVA and $\chi^2$ tests introduced above are both hypothesis tests. The testing problem is: given a hypothesis, after observing a sample, do we still believe the hypothesis, or rather, how strongly can we believe it? This can be turned into a p-value problem, that is, the hypothesis question is measured by a probability value.

  In the $\chi ^2$ test, we assume that the feature has nothing to do with the target variable, so the p-value is the probability of observing a result like our sample if the feature really had no relationship with the target. The smaller the p-value, the less we believe this assumption, i.e. the stronger the relationship between the feature and the target. So a smaller p-value means a larger correlation.

  In ANOVA we obtain the F value, and the F value can be converted into a p-value. The ANOVA null hypothesis is that the means at all factor levels are equal, i.e. the factor has no effect on the test index. Under this hypothesis, a smaller p-value means we believe less that the factor has no effect on the test index. So again, the smaller the p-value, the larger the correlation.
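For instance, the F value from the machine example above can be turned into a p-value through the F distribution (a small sketch using SciPy):

```python
from scipy.stats import f

# F = 32.92 with (2, 12) degrees of freedom, from the ANOVA table above.
p_value = f.sf(32.92, dfn=2, dfd=12)  # survival function: P(F >= 32.92)
print(p_value)  # far below 0.05, so we reject "the factor has no effect"
```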

 

4: LDA (Linear Discriminant Analysis)

  LDA, Linear Discriminant Analysis, is a classification algorithm that can also be used for dimensionality reduction and is often mentioned alongside PCA. It deserves a separate blog post of its own.

 

 

References:

  Sheng Zu, "Probability Theory and Mathematical Statistics"

References on the $\chi ^2$ test:

  Feature selection based on the chi-square test

  $\chi^2$ test (chi-square test)

  Chi-square test
