Implementation and Application of Random Forest Proximity

Project homepage: randomforest
C++ implementation of random forests classification, regression, proximity and variable importance.
Recommended reading: Random Forests C++ implementation: details, usage and experiments

1 Algorithm

1.1 Introduction to Random Forest Proximity

Random forest proximity is a measure of the similarity between two samples. Its applications include outlier detection, data imputation, visualization, and more. The original proximity computation passes all training samples down each random tree to a leaf node: if two samples reach the same leaf node, their proximity is increased by 1, and the final result is divided by the number of trees to normalize it. This computation is direct, fast, and widely used. Subsequent scholars have proposed several improved methods to increase the accuracy of the proximity estimate or to better suit specific application fields.
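As a reference point, below is a minimal sketch of that original computation. It is my own illustration, not code from the randomforest project; it assumes a precomputed matrix leaf_id[t][i] giving the leaf reached by sample i in tree t.

	#include <vector>
	
	// A minimal sketch of the original (co-occurrence counting) proximity.
	// leaf_id[t][i] is assumed to hold the leaf node reached by sample i in tree t.
	std::vector<std::vector<float>> OriginalProximity(const std::vector<std::vector<int>>& leaf_id)
	{
		const int ntrees = (int)leaf_id.size();
		const int nsamples = (int)leaf_id[0].size();
		std::vector<std::vector<float>> prox(nsamples, std::vector<float>(nsamples, 0.f));
		for (int t = 0; t < ntrees; t++)
			for (int i = 0; i < nsamples; i++)
				for (int j = i + 1; j < nsamples; j++)
					if (leaf_id[t][i] == leaf_id[t][j]) // same leaf -> proximity +1
					{
						prox[i][j] += 1.f;
						prox[j][i] += 1.f;
					}
		for (int i = 0; i < nsamples; i++)      // normalize by the number of trees
			for (int j = 0; j < nsamples; j++)
				prox[i][j] /= ntrees;
		return prox;
	}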

1.2 RF-GAP

In 2023, Jake S. Rhodes et al. published the paper "Geometry- and Accuracy-Preserving Random Forest Proximities" (arxiv) in PAMI. It proposes a new random forest proximity (RF-GAP), theoretically proves that classification/regression predictions based on this proximity are equivalent to the oob classification/regression results, and provides experimental verification. That an article purely about random forests, covering a relatively unpopular aspect of the RF algorithm, could be published in PAMI in 2023 shows how solid this work is.

The following is a brief introduction to the idea of the paper. The proximity (RF-GAP) in the paper is calculated as follows:

$$P(i,j)=\frac{1}{\lvert T_{oob}(i) \rvert} \sum_{t\in T_{oob}(i)} \frac{\lvert\{j \mid inbag_{t}^{leaf}(i)\}\rvert}{\lvert\{inbag_{t}^{leaf}(i)\}\rvert}$$
The formula above is rewritten according to my own understanding; you can also refer to the formulas in the paper. The symbols and expressions are briefly explained as follows:

  1. $\lvert \cdot \rvert$: the number of elements in a set;

  2. $P(i,j)$: the proximity between sample $i$ and sample $j$;

  3. $T_{oob}(i)$: the set of random trees for which sample $i$ is out-of-bag (oob) data;

  4. $\{inbag_{t}^{leaf}(i)\}$: the multiset of all in-bag training samples in the leaf node of tree $t$ into which the oob sample $i$ falls;

  5. $\{j \mid inbag_{t}^{leaf}(i)\}$: the occurrences of sample $j$ in that multiset; the count may be 0 or $\ge 1$, because the training samples of each tree are drawn from the original dataset with replacement (bagging), so sample $j$ may appear more than once;

  6. Note that $P(i,j)\ne P(j,i)$, that $P(i,i)=0$, and that $\sum_{j}P(i,j)=1$.
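To make the formula concrete, here is a tiny worked example with invented numbers. Suppose sample $i$ is oob in exactly two trees, so $\lvert T_{oob}(i)\rvert=2$. In the first tree, $i$ falls into a leaf holding 4 in-bag samples (counted with multiplicity), of which sample $j$ accounts for 2; in the second tree, the leaf holds 5 in-bag samples, of which $j$ accounts for 1. Then

$$P(i,j)=\frac{1}{2}\left(\frac{2}{4}+\frac{1}{5}\right)=0.35$$

Since the per-tree fractions over all $j$ in a leaf sum to 1, this also shows why $\sum_{j}P(i,j)=1$.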

1.3 Implementation code

RF-GAP has been implemented in my randomforest project. The classification and regression codes are similar, so only the RF-GAP calculation code for classification forests is shown below.

int ClassificationForestGAPProximity(LoquatCForest* forest, float** data, const int index_i, float*& proximities)
{
	if (NULL != proximities)
		delete[] proximities;

	proximities = new float[forest->RFinfo.datainfo.samples_num];
	memset(proximities, 0, sizeof(float) * forest->RFinfo.datainfo.samples_num);

	const int ntrees = forest->RFinfo.ntrees;
	int oobtree_num = 0;
	for (int t = 0; t < ntrees; t++)
	{
		// use only the trees where the i-th sample is oob
		const struct LoquatCTreeStruct* tree = forest->loquatTrees[t];
		bool i_oob = false;
		for (int n = 0; n < tree->outofbag_samples_num; n++)
		{
			if (index_i == tree->outofbag_samples_index[n])
			{
				i_oob = true;
				break;
			}
		}

		if (false == i_oob)
			continue;

		oobtree_num++;

		map<int, int> index_multicity;
		const struct LoquatCTreeNode* leaf_i = GetArrivedLeafNode(forest, t, data[index_i]);

		if (leaf_i->samples_index != NULL)
		{
			for (int n = 0; n < leaf_i->arrival_samples_num; n++)
			{
				if (index_multicity.find(leaf_i->samples_index[n]) == index_multicity.end())
					index_multicity.emplace(leaf_i->samples_index[n], 1);
				else
					index_multicity[leaf_i->samples_index[n]]++;
			}
		}
		else
		{
			// if the forest did not store the sample indices arrived at the leaf node,
			// each in-bag sample has to be tested
			for (int n = 0; n < tree->inbag_samples_num; n++)
			{
				const int j = tree->inbag_samples_index[n];
				const struct LoquatCTreeNode* leaf_j = GetArrivedLeafNode(forest, t, data[j]);
				if (leaf_i == leaf_j)
				{
					if (index_multicity.find(j) == index_multicity.end())
						index_multicity.emplace(j, 1);
					else
						index_multicity[j]++;
				}
			}
		}

		int M = 0;
		for (map<int, int>::iterator it = index_multicity.begin(); it != index_multicity.end(); it++)
		{
			M += it->second;
		}

		if (0 == M)
			continue;

		for (map<int, int>::iterator it = index_multicity.begin(); it != index_multicity.end(); it++)
			proximities[it->first] += it->second * 1.0f / M;
	}

	if (0 == oobtree_num)
		return -1;

	for (int j = 0; j < forest->RFinfo.datainfo.samples_num; j++)
		proximities[j] = proximities[j] / oobtree_num;

	return 1;
}
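A minimal usage sketch of this function (assuming a forest trained and data loaded as in the appendix at the end of this post):

	float* proximities = NULL; // allocated inside the function
	const int index_i = 0;     // query sample; arbitrary choice for illustration
	if (ClassificationForestGAPProximity(loquatCForest, data, index_i, proximities) == 1)
	{
		// proximities[j] now holds P(index_i, j) for every training sample j
	}
	delete[] proximities;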

2 Applications

Random forest proximity mines the similarity between pairs of samples, and can be applied to data visualization, outlier detection, data imputation, and other scenarios. Compared with other random forest proximities, according to the paper, RF-GAP shows better results in these applications.

2.1 Outlier detection

2.1.1 Principle and Implementation

For classification problems, an outlier can be defined as a sample whose outlier measure score is significantly greater than that of the other samples in its class. Such samples may be similar to samples from other classes, or dissimilar to samples from every class. The presence of abnormal samples affects the training of classification and regression algorithms. Random forest outlier detection can be divided into feature-based methods, such as Isolation Forest, and proximity-based methods; RF-GAP-based outlier detection belongs to the latter. The procedure is as follows:

  1. For sample $i$, compute the raw outlier measure score $score(i)=\frac{n}{\sum_{j\in class(i)} P^2(i,j)}$, where $n$ is the number of training samples (samples_num in the code below);
  2. Compute the within-class median of the raw scores, $m_c=median\{score(i) \mid class(i)\in c\}$, and the mean absolute deviation from that median, $dev_c=\frac{1}{N_c}\sum_{class(i)\in c}\lvert score(i)-m_c\rvert$;
  3. Compute the normalized raw outlier measure score: $score(i)\gets \max\left(\frac{score(i)-m_c}{dev_c}, 0\right)$.

The code for calculating the outlier measure score in my randomforest project is as follows. The process details follow Leo Breiman's RF implementation (Breiman's code is written in Fortran; the outlier measure is calculated in the function locateout).

int RawOutlierMeasure(LoquatCForest* loquatForest, float** data, int* label, float*& raw_score)
{
	if (NULL != raw_score)
		delete[] raw_score;

	int rv = 1;
	const int samples_num = loquatForest->RFinfo.datainfo.samples_num;
	const int class_num = loquatForest->RFinfo.datainfo.classes_num;

	raw_score = new float[samples_num];
	memset(raw_score, 0, sizeof(float) * samples_num);

	// 1. compute the raw outlier measure score
	float* proximities = NULL;
	for (int i = 0; i < samples_num; i++)
	{
		ClassificationForestGAPProximity(loquatForest, data, i, proximities /*OUT*/);

		float proximity2_sum = 0.f;
		for (int j = 0; j < samples_num; j++)
		{
			if (label[j] != label[i] || j == i)
				continue;

			// within class
			proximity2_sum += proximities[j] * proximities[j];
		}

		raw_score[i] = samples_num / (proximity2_sum + 1e-5);

		delete[] proximities;
		proximities = NULL;
	}

	// 2. compute the within-class median of raw_score, and the mean absolute
	// deviation of the within-class raw scores from that median
	float* dev = new float[class_num];
	float* median = new float[class_num];
	memset(dev, 0, sizeof(float) * class_num);
	memset(median, 0, sizeof(float) * class_num);

	for (int j = 0; j < class_num; j++)
	{
		vector<float> raw_score_class_j;

		for (int i = 0; i < samples_num; i++)
		{
			if (label[i] == j)
				raw_score_class_j.push_back(raw_score[i]);
		}

		std::sort(raw_score_class_j.begin(), raw_score_class_j.end());

		const int sample_num_j = (int)raw_score_class_j.size();
		if (0 == sample_num_j)
		{
			rv = 0;
			dev[j] = 1.f;
			median[j] = 1.f;
			continue;
		}

		if (sample_num_j % 2 == 1)
			median[j] = raw_score_class_j[sample_num_j / 2];
		else
			median[j] = 0.5f * (raw_score_class_j[sample_num_j / 2] + raw_score_class_j[sample_num_j / 2 - 1]);

		for (vector<float>::iterator it = raw_score_class_j.begin(); it != raw_score_class_j.end(); it++)
			dev[j] += fabs(*it - median[j]);
		dev[j] = dev[j] / sample_num_j;
	}

	// 3. compute the normalized raw outlier measure score
	for (int i = 0, lb = 0; i < samples_num; i++)
	{
		lb = label[i];
		raw_score[i] = RF_MAX((raw_score[i] - median[lb]) / (dev[lb] + 1e-5), 0.0);
	}

	delete[] dev;
	delete[] median;

	return rv;
}
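As an illustration of how the normalized scores might be consumed afterwards, here is a small helper I sketched for this post; it is not part of the project code, and the default threshold is only a rule of thumb (Breiman's RF documentation suggests values above roughly 10 deserve inspection):

	#include <vector>
	#include <algorithm>
	
	// Illustration only: collect samples whose normalized outlier score exceeds
	// a threshold, sorted with the most anomalous first. Threshold is an assumption.
	std::vector<int> SelectOutliers(const float* score, int samples_num, float threshold = 10.f)
	{
		std::vector<int> outliers;
		for (int i = 0; i < samples_num; i++)
			if (score[i] > threshold)
				outliers.push_back(i);
		std::sort(outliers.begin(), outliers.end(),
		          [score](int a, int b) { return score[a] > score[b]; });
		return outliers;
	}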

2.1.2 Experimental results

A random forest classifier is trained on the MNIST handwritten digit dataset, with 60,000 samples and 780 features. The RF parameters are: number of trees $T=200$, number of candidate split features $\sqrt{780}$, minimum of 5 samples per node, and maximum tree depth 40.
The core of calculating the outlier measure is calculating the proximities $P(i,j)$. If, during RF training, each leaf node stores the in-bag samples that fall into it (at least their indices in the training set), the complexity of RF-GAP for a single query sample is roughly $O(T \cdot N)$, where $T$ and $N$ are the number of random trees and the number of training samples, respectively. At the end of the RF-GAP paper the authors note that the algorithm is best suited to datasets with samples in the range of several thousand, and that adapting it to larger datasets is the next step.
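As a rough back-of-the-envelope illustration of that limitation (my own estimate, counting only leaf-membership work and ignoring tree traversal): for the MNIST setting above, one query costs on the order of $T \cdot N = 200 \times 60000 = 1.2\times 10^{7}$ operations, while the outlier measure needs proximities from every sample, i.e. on the order of

$$T \cdot N^{2} = 200 \times 60000^{2} = 7.2\times 10^{11}$$

operations, which explains the recommendation of datasets in the thousands.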
The "outlier measure score" is calculated for all samples; the values for several classes are shown in the figure below in ascending order, with samples on the abscissa and their "outlier measure score" on the ordinate. In classes 4 and 9 (digit 4, sub-image 3; digit 9, sub-image 4), a very few samples are clearly anomalous; the corresponding outlier sample images can be viewed further below, which is more intuitive.

[Figure: outlier measure scores of samples in several classes, in ascending order; abscissa: sample, ordinate: outlier measure score]

Outlier samples were detected on the MNIST handwritten digit dataset, and for several classes the image with the largest raw outlier measure score was selected; these are indeed difficult to recognize from the image alone. For example, the outlier sample of digit 4 looks like a 7, although it was verified to be labeled 4. Its index in the original training set is 59915 (indices start from 0), and its raw score exceeds 200, significantly different from the other samples in the class.
[Figure: images with the largest raw outlier measure score in several classes]

Some sample images with a raw outlier measure score of 0 (samples the RF considers highly similar to their class) were randomly selected. As the examples below show, they are indeed normal, easily recognizable handwritten digits.
[Figure: randomly selected images with a raw outlier measure score of 0]

Remarks: MNIST comes in several training sets, with 784-dimensional features (28x28 images) and 780-dimensional features; this post uses the 780-dimensional version. For display, 4 zeros are prepended to each 780-dimensional vector, expanding it to 784, which shifts the characters slightly to the right.
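A trivial sketch of that padding step (my own illustration; feat780 is a hypothetical pointer to one 780-dimensional sample):

	#include <vector>
	#include <algorithm>
	
	// prepend 4 zeros so a 780-dim feature vector becomes a 784-dim (28x28) image
	std::vector<float> PadTo784(const float* feat780)
	{
		std::vector<float> img(784, 0.f);
		std::copy(feat780, feat780 + 780, img.begin() + 4);
		return img;
	}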

Appendix

The code for RF training plus outlier detection (calculating the raw outlier measure score of every sample) using my randomforest implementation is as follows:

int main()
{
	// read training samples if necessary
	char filename[500] = "/to/direction/dataset/train-data.txt";
	float** data = NULL;
	int* label = NULL;
	Dataset_info_C datainfo;
	int rv = InitalClassificationDataMatrixFormFile2(filename, data/*OUT*/, label/*OUT*/, datainfo/*OUT*/);
	// check the return value
	// 	... ...

	// set random forests parameters
	RandomCForests_info rfinfo;
	rfinfo.datainfo = datainfo;
	rfinfo.maxdepth = 40;
	rfinfo.ntrees = 200;
	rfinfo.mvariables = (int)sqrtf(datainfo.variables_num);
	rfinfo.minsamplessplit = 5;
	rfinfo.randomness = 1;
	// train the forest
	LoquatCForest* loquatCForest = NULL;
	rv = TrainRandomForestClassifier(data, label, rfinfo, loquatCForest /*OUT*/, 20);
	// check the return value
	// 	... ...

	//---------------- outlier measurement ----------------//
	float* raw_score = NULL;
	RawOutlierMeasure(loquatCForest, data, label, raw_score);
	// raw_score -- outlier measurements
	// ... ...
	delete[] raw_score;
	//------------------------------------------------------//

	// clear the memory allocated for the entire forest
	ReleaseClassificationForest(&loquatCForest);
	// release memory: data, label
	for (int i = 0; i < datainfo.samples_num; i++)
		delete[] data[i];
	delete[] data;
	delete[] label;
	return 0;
}
