Once feature processing is complete, the usual tool for selecting the best features is sklearn's SelectKBest:
from sklearn.feature_selection import SelectKBest
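As a minimal sketch of how SelectKBest is typically combined with the chi-square score explained in the rest of this post: the iris dataset and k=2 are assumptions chosen purely for illustration, not something from the original article. Note that sklearn's chi2 scorer expects non-negative feature values.

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Example data (an assumption for illustration): iris has 4 non-negative features.
X, y = load_iris(return_X_y=True)

# Keep the k=2 features with the highest chi-square score against the labels.
selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print(X.shape, "->", X_new.shape)
print("chi-square scores:", selector.scores_)
print("selected columns:", selector.get_support(indices=True))
```

A higher score means the feature deviates more from what independence with the label would predict, so it is kept first.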
0 What is the chi-square test
The chi-square test is mainly used to test for independence between categorical variables, in other words, whether two variables are related.
For example: whether education has a significant effect on income;
whether men and women differ in buying fresh food online;
whether different treatments differ significantly in their effect.
The basic idea is to use the sample data to infer whether the population distribution differs significantly from the expected distribution, or whether two categorical variables are related or independent.
The null hypothesis is generally stated as: the observed frequencies do not differ from the expected frequencies, or the two variables are independent and unrelated.
In practice, we first assume the null hypothesis holds and compute the chi-square value, which measures how far the observed values deviate from the theoretical values.
The chi-square value is calculated as:
$$\chi^2 = \sum \frac{(A - E)^2}{E}$$
A is the observed value, i.e. the actual count;
E is the theoretical value (expected frequency), i.e. the count expected if the two variables were unrelated.
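The definition above can be sketched in a few lines of Python. The function name chi_square_statistic and the counts below are hypothetical, chosen only to make the arithmetic easy to follow; they are not the survey data.

```python
def chi_square_statistic(observed, expected):
    """Sum of (A - E)^2 / E over all cells; A = observed, E = expected."""
    return sum((a - e) ** 2 / e for a, e in zip(observed, expected))

# Hypothetical counts, not taken from the survey in the text.
observed = [30, 10, 20, 40]
expected = [20, 20, 30, 30]

chi2_value = chi_square_statistic(observed, expected)
print(round(chi2_value, 3))  # 5 + 5 + 3.333 + 3.333 = 16.667
```

The larger each cell's gap between A and E relative to E, the larger the statistic.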
For example, suppose a questionnaire gives us the number of bachelor's and graduate degree holders whose income is over 10,000 and not over 10,000.
First, assume that education and having an income over 10,000 are unrelated. We start by calculating how the respondents split between income over 10,000 and not over 10,000.
Share with income over 10,000 = number with income over 10,000 / (number with income over 10,000 + number with income not over 10,000) = 501/813 ≈ 62%
Then calculate the theoretical values for bachelor's and graduate degree holders with income over 10,000: under the assumption, each is that group's total headcount * 62%.
Theoretical value for bachelor's with income over 10,000 = 581 * 62% ≈ 360
The remaining cells are calculated in the same way.
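The expected-count step can be reproduced from the totals quoted above. This is a sketch under stated assumptions: only 813, 501 and 581 come from the text; the other two margins are derived by subtraction, and the helper name expected_count is hypothetical.

```python
# Marginal totals quoted in the text.
grand_total = 813
over_total = 501        # income over 10,000
bachelor_total = 581

# Remaining margins follow by subtraction (derived, not quoted in the text).
not_over_total = grand_total - over_total        # 312
graduate_total = grand_total - bachelor_total    # 232

def expected_count(row_total, col_total, total):
    """Expected cell count if the row and column variables are independent."""
    return row_total * col_total / total

expected = {
    (degree, income): expected_count(row_total, col_total, grand_total)
    for degree, row_total in [("bachelor", bachelor_total), ("graduate", graduate_total)]
    for income, col_total in [("over", over_total), ("not over", not_over_total)]
}

for cell, value in expected.items():
    print(cell, round(value, 2))
```

Without rounding the share to 62% first, the bachelor's value comes out near 358.03 rather than 360; the gap is purely a rounding effect.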
Chi-square value: $\chi^2 = 28.797$
2 Chi-square test
The chi-square test checks whether two variables are related in four steps:
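1 Compute the chi-square value
2 Compute the degrees of freedom: (number of rows - 1)*(number of columns - 1)
3 Set the significance level
4 Look up the chi-square table using the results above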
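The four steps can also be run with SciPy instead of by hand. This is only a sketch: the 2x2 table is hypothetical (not the survey data), and using chi2.ppf in place of a printed lookup table is a substitution made here, not the article's method.

```python
from scipy.stats import chi2, chi2_contingency

# Hypothetical 2x2 table (rows: groups, columns: outcomes); not the survey data.
table = [[30, 10],
         [20, 40]]

# Steps 1-2: chi-square value and degrees of freedom.
# correction=False turns off Yates' continuity correction so the result
# equals the plain sum of (A - E)^2 / E.
stat, p_value, dof, expected = chi2_contingency(table, correction=False)

# Step 3: significance level.
alpha = 0.05

# Step 4: the critical value stands in for the printed lookup table.
critical = chi2.ppf(1 - alpha, dof)
reject_null = stat > critical

print(f"chi2={stat:.3f}, dof={dof}, critical={critical:.3f}, reject={reject_null}")
```

For a 2x2 table the degrees of freedom are (2 - 1)*(2 - 1) = 1, which is why the critical value at 0.05 is about 3.841.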
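The significance level is a concept from hypothesis testing: it is the probability, or risk, of rejecting the null hypothesis when it is actually true.
It is a conventionally accepted probability for a small-probability event and must be fixed before each statistical test, usually α=0.05 or α=0.01.
In other words, when the null hypothesis is actually true, the test fails to reject it with probability 95% or 99%.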
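Here the chi-square value we computed is 28.797
The degrees of freedom are (2 - 1)*(2 - 1) = 1
The significance level is 0.05
Looking up the table gives 28.797 > 3.841, so the null hypothesis can be rejected at the 0.05 significance level. That is, the null hypothesis does not hold, and education and income are related.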
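The final comparison can be checked in plain Python. The values 28.797, 3.841 and 0.05 come from the text; the helper p_value_df1 is a hypothetical name and relies on an identity that holds only for 1 degree of freedom, so treat it as a sketch rather than a general-purpose routine.

```python
import math

chi2_value = 28.797      # from the text
critical_value = 3.841   # table value for df = 1, alpha = 0.05 (from the text)
alpha = 0.05

def p_value_df1(x):
    """P(chi-square with 1 degree of freedom > x); valid only for df = 1."""
    return math.erfc(math.sqrt(x / 2))

# Decision by critical value, as in the text.
reject_null = chi2_value > critical_value

# Equivalent decision by p-value.
p = p_value_df1(chi2_value)

print("reject by critical value:", reject_null)
print("reject by p-value:", p < alpha)
```

Both routes agree: the statistic is far past the critical value, so the p-value is far below 0.05.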
Original link: https://blog.csdn.net/weixin_39198406/article/details/100553385