Calculating Information Gain for Continuous Values via Rcpp

R does not seem to have a readily available package for calculating the information gain of a continuous feature. For a continuous feature, the calculation has to try each candidate split point between adjacent distinct values and keep the one that maximizes the information gain, which is very slow with a plain R loop. So here Rcpp is used to speed up the calculation. The first argument is the feature column and the second is the label column, with labels coded as 0/1 (the code counts the entries equal to 1). The values of the feature column must be sorted first, with the labels reordered to match. Below is the procedure.
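For reference, the quantity being maximized can be written as follows. The symbols S, S_L(t), S_R(t), p and t are notation introduced here and are not taken from the original post. The binary form of the entropy matches the 0/1 labels that the code counts, with the usual convention that 0 log 0 = 0.

```latex
H(S) = -p \log_2 p - (1 - p) \log_2 (1 - p),
\qquad p = \frac{|\{ i \in S : y_i = 1 \}|}{|S|}

S_L(t) = \{ i \in S : x_i \le t \}, \qquad S_R(t) = \{ i \in S : x_i > t \}

\mathrm{Gain}(t) = H(S) - \frac{|S_L(t)|}{|S|} H(S_L(t)) - \frac{|S_R(t)|}{|S|} H(S_R(t))

\mathrm{IG} = \max_{t} \mathrm{Gain}(t)
```

Only thresholds t that fall between two adjacent distinct sorted values of x need to be checked, because any other t produces the same partition as one of them.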

 

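# Rcpp
library(Rcpp)

cppFunction('
  // Best information gain over all binary splits of a sorted feature x.
  // y holds 0/1 labels; x must already be sorted in ascending order.
  double inforGain(NumericVector x, NumericVector y) {
    int n = x.size();
    int num_r = 0;  // labels equal to 1 on the left side of the split
    int num_a = 0;  // all other labels on the left side of the split
    int all_r = 0;  // total number of labels equal to 1
    for (int i = 0; i < n; i++) {
      if (y[i] == 1) {
        all_r++;
      }
    }
    int all_a = n - all_r;  // total number of labels not equal to 1

    // Entropy of the whole label column before any split.
    double entropyBefore;
    if (all_r == 0 || all_r == n) {
      entropyBefore = 0.0;
    } else {
      double p0 = all_r * 1.0 / n;
      entropyBefore = -(p0 * log2(p0) + (1 - p0) * log2(1 - p0));
    }

    double gain = 0.0;
    for (int i = 0; i < n - 1; i++) {
      if (y[i] == 1) {
        num_r++;
      } else {
        num_a++;
      }

      // A split is only possible between two different feature values.
      if (x[i] != x[i + 1]) {
        // Entropy of the left part, rows 0..i.
        double p1 = num_r * 1.0 / (num_r + num_a);
        double entropy1;
        if (num_r == 0 || num_a == 0) {
          entropy1 = 0.0;
        } else {
          entropy1 = -(p1 * log2(p1) + (1 - p1) * log2(1 - p1));
        }

        // Entropy of the right part, rows i+1..n-1.
        double p2 = (all_r - num_r) * 1.0 / (n - i - 1);
        double entropy2;
        if (all_r - num_r == 0 || all_a - num_a == 0) {
          entropy2 = 0.0;
        } else {
          entropy2 = -(p2 * log2(p2) + (1 - p2) * log2(1 - p2));
        }

        // Size-weighted entropy after the split, and the resulting gain.
        double entropy = entropy1 * (i + 1) / n + entropy2 * (n - i - 1) / n;
        double gainTemp = entropyBefore - entropy;
        if (gainTemp > gain) {
          gain = gainTemp;
        }
      }
    }
    return gain;
  }
')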
x <- c(8, 18, 18, 21, 24, 25, 28)
y <- c(0, 1, 0, 0, 0, 0, 0)

inforGain(x, y)
[1] 0.1981174
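Because the function above relies on the feature column being sorted beforehand, the same scan can also be written so that it sorts the (feature, label) pairs itself and reports where the best split falls. The following is only a minimal standalone C++ sketch under stated assumptions: `SplitResult`, `binaryEntropy` and `bestSplit` are hypothetical names that do not appear in the original post, labels are assumed to be 0/1, and the threshold is taken as the midpoint between the two neighbouring feature values, which is one common convention rather than something the post specifies.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Entropy in bits of a group with `pos` labels equal to 1 out of `total`.
static double binaryEntropy(std::size_t pos, std::size_t total) {
    if (total == 0 || pos == 0 || pos == total) return 0.0;
    const double p = static_cast<double>(pos) / static_cast<double>(total);
    return -(p * std::log2(p) + (1.0 - p) * std::log2(1.0 - p));
}

// Hypothetical result type: the best gain and the threshold that achieves it.
// threshold stays NaN when no split improves on a gain of 0.
struct SplitResult {
    double gain;
    double threshold;
};

// Hypothetical helper (not from the original post): sorts the rows by feature
// value, then scans every boundary between distinct values, as inforGain does.
SplitResult bestSplit(const std::vector<double>& x, const std::vector<int>& y) {
    assert(x.size() == y.size());
    const std::size_t n = x.size();

    std::vector<std::pair<double, int>> rows;
    rows.reserve(n);
    std::size_t totalPos = 0;
    for (std::size_t i = 0; i < n; ++i) {
        rows.emplace_back(x[i], y[i]);
        if (y[i] == 1) ++totalPos;
    }
    // Pairs sort by feature first; the order among tied features does not
    // matter because splits are only taken between distinct feature values.
    std::sort(rows.begin(), rows.end());

    const double before = binaryEntropy(totalPos, n);
    SplitResult best{0.0, std::nan("")};
    std::size_t leftPos = 0;
    for (std::size_t i = 0; i + 1 < n; ++i) {
        if (rows[i].second == 1) ++leftPos;
        if (rows[i].first == rows[i + 1].first) continue;

        const std::size_t leftN = i + 1;
        const std::size_t rightN = n - leftN;
        const double after =
            binaryEntropy(leftPos, leftN) * leftN / n +
            binaryEntropy(totalPos - leftPos, rightN) * rightN / n;
        if (before - after > best.gain) {
            best.gain = before - after;
            best.threshold = (rows[i].first + rows[i + 1].first) / 2.0;
        }
    }
    return best;
}
```

Called on the same seven rows as above in any order, this sketch should give the same gain as `inforGain`, with the split falling between 18 and 21. To use it from R, the body could be adapted to `NumericVector` arguments inside `cppFunction`, or the data could simply be reordered in R with `order(x)` before calling `inforGain`.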
