R does not seem to have a ready-made package for computing the information gain of a continuous feature. For a continuous feature, the information gain is found by trying every candidate split point between adjacent values and keeping the split that maximizes the gain. Doing this with a plain R loop is very slow, so Rcpp is used here to speed up the calculation. The function takes two vectors: the first is the feature column and the second is the label column, which the code expects to contain only 0 and 1. The feature values must be sorted first, with the labels reordered along with them. The full program is given below.
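For reference, the quantity being maximized can be written out as follows. This is a sketch of the standard binary-entropy form that the code appears to implement; the symbols p, S, S_L, S_R and the 1-based cut index i are introduced here for illustration and do not appear in the original.

```latex
% p is the fraction of labels equal to 1 in a set S
H(S) = -p \log_2 p - (1 - p) \log_2 (1 - p), \qquad H(S) = 0 \text{ if } p \in \{0, 1\}

% A cut after the i-th sorted value splits the n samples into
% S_L (the first i samples) and S_R (the remaining n - i samples)
\mathrm{Gain}(i) = H(S) - \frac{i}{n} H(S_L) - \frac{n - i}{n} H(S_R)

% The reported value is the largest gain over all cuts
% that fall between two distinct neighbouring feature values
\mathrm{Gain}^{*} = \max_{i \,:\, x_i \neq x_{i+1}} \mathrm{Gain}(i)
```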
# Rcpp
library(Rcpp)

cppFunction('
double inforGain(NumericVector x, NumericVector y) {
  int n = x.size();
  int num_r = 0;  // positives (y == 1) seen so far, i.e. left of the cut
  int num_a = 0;  // negatives seen so far, i.e. left of the cut
  int all_r = 0;  // total number of positives

  for (int i = 0; i < n; i++) {
    if (y[i] == 1) {
      all_r++;
    }
  }
  double all_a = n - all_r;  // total number of negatives

  double gain = 0.0;  // best gain found so far

  // Entropy of the whole label vector before any split
  double entropyBefore;
  if (all_r == 0 || all_r == n) {
    entropyBefore = 0.0;
  } else {
    entropyBefore = - all_r * 1.0 / n * log2(all_r * 1.0 / n)
                    - (1 - all_r * 1.0 / n) * log2(1 - all_r * 1.0 / n);
  }

  // Try a cut after each position i; x must already be sorted
  for (int i = 0; i < n - 1; i++) {
    if (y[i] == 1) {
      num_r++;
    } else {
      num_a++;
    }

    // A cut is only possible between two different feature values
    if (x[i] != x[i + 1]) {
      // Entropy of the left part (positions 0..i)
      double p1 = num_r * 1.0 / (num_r + num_a);
      double entropy1;
      if (num_r == 0 || num_a == 0) {
        entropy1 = 0.0;
      } else {
        entropy1 = -((p1 * log2(p1)) + (1 - p1) * log2(1 - p1));
      }

      // Entropy of the right part (positions i+1..n-1)
      double entropy2;
      double p2 = (all_r - num_r) * 1.0 / (n - i - 1);
      if (all_r - num_r == 0 || all_a - num_a == 0) {
        entropy2 = 0.0;
      } else {
        entropy2 = -((p2 * log2(p2)) + (1 - p2) * log2(1 - p2));
      }

      // Size-weighted entropy after the split, and the resulting gain
      double entropy = entropy1 * (i + 1) / n + entropy2 * (n - i - 1) / n;
      double gainTemp = entropyBefore - entropy;
      if (gainTemp > gain) {
        gain = gainTemp;
      }
    }
  }
  return gain;
}
')

x <- c(8, 18, 18, 21, 24, 25, 28)
y <- c(0, 1, 0, 0, 0, 0, 0)
inforGain(x, y)
[1] 0.1981174
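The inforGain function returns only the largest gain, not the split point that produces it. The following is a minimal sketch in plain C++ (standard library only, no Rcpp) that performs the same scan and also records the cut position. The names SplitResult, binaryEntropy and bestSplit are hypothetical and not part of the original post, and reporting the threshold as the midpoint between the two neighbouring feature values is an assumed convention.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical result type: the largest gain and the threshold that produced it.
struct SplitResult {
    double gain;       // best information gain found (0 if no cut helps)
    double threshold;  // midpoint between neighbouring x values; NaN if no cut was chosen
};

// Binary entropy (in bits) of a group with `pos` positives among `total` items.
static double binaryEntropy(int pos, int total) {
    if (total == 0 || pos == 0 || pos == total) return 0.0;
    const double p = static_cast<double>(pos) / total;
    return -(p * std::log2(p) + (1.0 - p) * std::log2(1.0 - p));
}

// Same scan as inforGain, but it also remembers where the best cut lies.
// Assumes x is already sorted and y contains only 0/1 labels in matching order.
SplitResult bestSplit(const std::vector<double>& x, const std::vector<int>& y) {
    assert(x.size() == y.size());
    const int n = static_cast<int>(x.size());

    int totalPos = 0;
    for (int label : y) totalPos += (label == 1);
    const double entropyBefore = binaryEntropy(totalPos, n);

    SplitResult best{0.0, std::nan("")};
    int leftPos = 0;  // positives among positions 0..i
    for (int i = 0; i + 1 < n; ++i) {
        leftPos += (y[i] == 1);
        if (x[i] == x[i + 1]) continue;  // no cut between equal feature values

        const int leftN = i + 1;
        const int rightN = n - leftN;
        // Size-weighted entropy of the two parts after the cut
        const double after = binaryEntropy(leftPos, leftN) * leftN / n
                           + binaryEntropy(totalPos - leftPos, rightN) * rightN / n;
        const double gain = entropyBefore - after;
        if (gain > best.gain) {
            best.gain = gain;
            best.threshold = (x[i] + x[i + 1]) / 2.0;
        }
    }
    return best;
}
```

Called with the sample vectors from the post, bestSplit picks the cut between 18 and 21 (threshold 19.5), and its gain agrees with the 0.1981174 printed by R.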