Mr. Lu: Summary of Feature Selection Methods
Three methods of feature selection:

- Filter
- Wrapper
- Embedded
Filter method
Chi-square test
Let's look directly at the sklearn code. First, one-hot encode (OHE) the labels:
Y = LabelBinarizer().fit_transform(y)
After binarization, $Y$ has shape $N \times K$ ($N$ samples, $K$ classes).
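A quick sketch of what the binarization produces, using a made-up 3-class label vector (the data here is purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import LabelBinarizer

# Toy labels: N = 6 samples, K = 3 classes -> Y has shape 6 x 3.
y = np.array(["a", "b", "c", "a", "b", "c"])
Y = LabelBinarizer().fit_transform(y)
print(Y.shape)  # (6, 3)
print(Y)        # one-hot rows, one column per class
```

Each row is a one-hot indicator of the sample's class, which is exactly the shape the matrix product below needs.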
observed = safe_sparse_dot(Y.T, X) # n_classes * n_features
This is a $(K \times N) \cdot (N \times M)$ product, yielding a $K \times M$ matrix whose entry $(k, m)$ is the sum of feature $m$ over the samples belonging to class $k$.
observed
Out[6]:
array([[250.3, 171.4, 73.1, 12.3],
[296.8, 138.5, 213. , 66.3],
[329.4, 148.7, 277.6, 101.3]])
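For dense inputs, `safe_sparse_dot(Y.T, X)` reduces to a plain matrix product. A small sketch with made-up data ($N=4$ samples, $M=2$ features, $K=2$ classes):

```python
import numpy as np

# Hypothetical toy data.
X = np.array([[1.0, 0.5],
              [2.0, 1.5],
              [3.0, 0.5],
              [4.0, 2.5]])   # N x M feature matrix
Y = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [0, 1]])       # N x K one-hot labels

observed = Y.T @ X           # K x M: per-class feature sums
print(observed)
# [[3. 2.]    row 0 sums features over class-0 samples
#  [7. 3.]]   row 1 sums features over class-1 samples
```

Row $k$ of `observed` is simply the column sums of $X$ restricted to the samples of class $k$.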
Finally, the code for calculating the chi-square:
import numpy as np
from scipy import special

def _chisquare(f_obs, f_exp):
    """Fast replacement for scipy.stats.chisquare.

    Version from https://github.com/scipy/scipy/pull/2525 with additional
    optimizations.
    """
    f_obs = np.asarray(f_obs, dtype=np.float64)
    k = len(f_obs)
    # Reuse f_obs for chi-squared statistics
    chisq = f_obs
    chisq -= f_exp
    chisq **= 2
    with np.errstate(invalid="ignore"):
        chisq /= f_exp
    chisq = chisq.sum(axis=0)
    return chisq, special.chdtrc(k - 1, chisq)
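To see that this in-place computation agrees with `scipy.stats.chisquare`, here is a standalone sketch (renamed `chisquare_fast`, and working on a copy so the caller's `f_obs` is not mutated; the input counts are made up):

```python
import numpy as np
from scipy import special, stats

def chisquare_fast(f_obs, f_exp):
    # Same arithmetic as sklearn's helper, but on a copy of f_obs.
    f_obs = np.asarray(f_obs, dtype=np.float64).copy()
    k = len(f_obs)
    chisq = f_obs
    chisq -= f_exp
    chisq **= 2
    with np.errstate(invalid="ignore"):
        chisq /= f_exp
    chisq = chisq.sum(axis=0)
    return chisq, special.chdtrc(k - 1, chisq)

f_obs = np.array([16.0, 18.0, 16.0, 14.0, 12.0, 12.0])
f_exp = np.full(6, f_obs.mean())          # uniform expectation
stat, p = chisquare_fast(f_obs, f_exp)
ref_stat, ref_p = stats.chisquare(f_obs, f_exp)
print(stat, p)  # identical to scipy.stats.chisquare
```

`special.chdtrc(k - 1, chisq)` is the survival function of the chi-squared distribution with $k-1$ degrees of freedom, i.e. the p-value.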
Correlation between independent variable and dependent variable
$A$ is the observation and $E$ is the expectation; each has shape $K \times M$.

The independent variable takes $N$ distinct values and the dependent variable takes $M$ distinct values. Considering the difference between the observed sample frequency and its expectation when the independent variable equals $i$ and the dependent variable equals $j$, we construct the statistic

$$\chi^2 = \sum_{i}\sum_{j} \frac{(A_{ij} - E_{ij})^2}{E_{ij}}$$
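Putting the pieces together, here is a sketch (on made-up, non-negative toy data) that rebuilds the observed matrix $A$ and the expected matrix $E$ from the class and feature marginals, and checks the resulting statistic against `sklearn.feature_selection.chi2`:

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.preprocessing import LabelBinarizer

# Hypothetical toy data: chi2 requires non-negative features.
X = np.array([[1.0, 2.0, 0.0],
              [2.0, 4.0, 1.0],
              [0.0, 1.0, 3.0],
              [1.0, 0.0, 4.0]])
y = np.array([0, 0, 1, 1])

Y = LabelBinarizer().fit_transform(y)
if Y.shape[1] == 1:                 # binary case: expand to two columns
    Y = np.hstack([1 - Y, Y])

observed = Y.T @ X                  # A: K x M per-class feature sums
class_prob = Y.mean(axis=0)         # empirical P(class)
feature_count = X.sum(axis=0)       # total count per feature
expected = np.outer(class_prob, feature_count)  # E: K x M

stat = ((observed - expected) ** 2 / expected).sum(axis=0)
ref_stat, ref_p = chi2(X, y)
print(stat)   # one chi-squared score per feature
```

Features whose per-class sums deviate most from the expectation get the largest scores, which is what `SelectKBest(chi2, k=...)` ranks on.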