Using the chi-square test for feature selection

Once feature preprocessing is done, the best features can be selected with sklearn's SelectKBest:

from sklearn.feature_selection import SelectKBest
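As a minimal sketch of how SelectKBest can be combined with the chi-square score, the example below keeps the two highest-scoring features of a small made-up dataset. The arrays `X` and `y` and the choice `k=2` are assumptions for illustration only; `chi2` is sklearn's chi-square scoring function and requires non-negative feature values.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Toy data (hypothetical): 6 samples, 3 non-negative count features.
# Features 0 and 2 differ strongly between the classes; feature 1 is constant.
X = np.array([
    [0, 3, 9],
    [1, 3, 8],
    [0, 3, 9],
    [9, 3, 0],
    [8, 3, 1],
    [9, 3, 1],
])
y = np.array([0, 0, 0, 1, 1, 1])

# Keep the k=2 features with the highest chi-square score against y.
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_)        # one chi-square score per feature
print(selector.get_support())  # boolean mask of the kept features
print(X_new.shape)             # (n_samples, k)
```

The constant feature carries no information about the class, so it gets the lowest score and is the one dropped.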

 

 

0 What is the chi-square test
The chi-square test is mainly used to test for independence between categorical variables; in other words, it checks whether two variables are related.
For example: whether education has a significant effect on income;
whether men and women differ in buying fresh food online;
whether different treatments have significantly different effects.

The basic idea is to use sample data to infer whether the population distribution differs significantly from the expected distribution, or whether two categorical variables are related or independent.
The null hypothesis is usually that the observed frequencies do not differ from the expected frequencies, or that the two variables are independent of each other.
In practice, we assume the null hypothesis holds and compute the chi-square value, which measures how far the observed values deviate from the theoretical values.


1 Calculating the chi-square value

χ² = Σ (A - E)² / E

A is the observed value, i.e. the actual count;

E is the theoretical value (expected frequency), i.e. the count expected if the two variables were unrelated.
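A short sketch of the calculation, assuming the standard Pearson form in which each cell contributes (A - E)² / E and the contributions are summed. The function name `chi_square` and the coin-flip numbers are made up for illustration and are not taken from the survey data.

```python
def chi_square(observed, expected):
    """Sum of (A - E)^2 / E over all cells (A = observed, E = expected)."""
    return sum((a - e) ** 2 / e for a, e in zip(observed, expected))

# Toy numbers, not from the article: 100 coin flips give 60 heads and 40 tails,
# while a fair coin would be expected to give 50 and 50.
value = chi_square([60, 40], [50, 50])
print(value)
```

The larger the value, the further the observed counts are from what the null hypothesis predicts.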
For example, suppose a questionnaire yields the following data on whether undergraduates and graduate students earn more than 10,000.

 

First, assume that education and income over 10,000 are unrelated. We start by calculating the overall split between people whose income is over 10,000 and those whose income is not.

Share with income over 10,000 = number with income over 10,000 / (number with income over 10,000 + number with income not over 10,000) = 501/813 = 62%
Then calculate the theoretical number of undergraduates and of graduate students with income over 10,000, i.e. the total number of undergraduates (or graduate students) multiplied by 62%.

Theoretical number of undergraduates with income over 10,000 = 581 * 62% ≈ 360
The remaining cells are calculated in the same way.
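The expected counts for all four cells can be sketched from the totals quoted above (813 respondents, 501 above the income threshold, 581 undergraduates). The other two totals are derived by subtraction, which assumes a 2x2 table with no further categories; the helper `expected_table` is a hypothetical name. Because the code uses the unrounded share 501/813 instead of 62%, the undergraduate cell comes out at about 358 and not 360.

```python
# Totals quoted in the worked example.
total = 813            # all respondents
over_total = 501       # respondents with income above the threshold
undergrad_total = 581  # undergraduates

# Derived by subtraction (assumes a 2x2 table with no other categories).
not_over_total = total - over_total   # 312
grad_total = total - undergrad_total  # 232

def expected_table(row_totals, col_totals):
    """Expected count per cell under independence: row total * column total / n."""
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]

expected = expected_table(
    [undergrad_total, grad_total],
    [over_total, not_over_total],
)
for label, row in zip(["undergraduate", "graduate"], expected):
    print(label, [round(v, 1) for v in row])
```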

 

 

 

 

Chi-square value χ² = 28.797

2 Chi-square test
The chi-square test checks whether two variables are related, in four steps:

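1 Calculate the chi-square value
2 Find the degrees of freedom: (number of rows - 1) * (number of columns - 1)
3 Set the significance level
4 Look up the critical value in a chi-square table using the results above
The significance level is a concept from hypothesis testing: it is the probability, or risk, of rejecting the null hypothesis when it is actually true.
It is a conventionally accepted probability for a small-probability event and must be fixed before each statistical test, usually α = 0.05 or α = 0.01.
This means that when the null hypothesis is true, the test correctly fails to reject it with probability 95% or 99%.
Here the chi-square value we calculated is 28.797,
the degrees of freedom are 1,
and the significance level is 0.05.
The table gives a critical value of 3.841, and 28.797 > 3.841, so the null hypothesis can be rejected at the 0.05 significance level. In other words, the null hypothesis does not hold, and education and income are related.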
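The four steps can be sketched in a few lines using the numbers from this example (chi-square value 28.797, critical value 3.841 at α = 0.05). The 2x2 table shape is an assumption consistent with the stated 1 degree of freedom. The helper `p_value_df1` is a hypothetical name; it relies on the fact that, for 1 degree of freedom only, the upper-tail probability of the chi-square distribution equals erfc(sqrt(x / 2)), so it must not be used for other table sizes.

```python
import math

chi_square_value = 28.797  # step 1: value from the worked example
rows, cols = 2, 2          # assumed table shape: 2 education levels x 2 income groups
df = (rows - 1) * (cols - 1)  # step 2: degrees of freedom
alpha = 0.05               # step 3: significance level
critical_value = 3.841     # step 4: chi-square table entry for df = 1, alpha = 0.05

def p_value_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

# Reject the null hypothesis if the statistic exceeds the critical value,
# or equivalently if the p-value is below alpha.
reject = chi_square_value > critical_value
p = p_value_df1(chi_square_value)
print(df, reject, p < alpha)
```

Comparing the statistic with the critical value and comparing the p-value with α are two views of the same decision, so both checks agree.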

 

 

Original link: https://blog.csdn.net/weixin_39198406/article/details/100553385

 


Origin www.cnblogs.com/wisir/p/12408135.html