Machine Learning Knowledge Points: Sampling Methods for Imbalanced Data

Undersampling Methods

ClusterCentroids

https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.ClusterCentroids.html

ClusterCentroids is an undersampling algorithm based on a clustering method, which performs undersampling by generating cluster centers.

The ClusterCentroids algorithm applies K-means to the majority class and replaces each cluster of majority samples with its centroid. To keep N majority class samples, it fits K-means with N clusters on the majority class and uses the coordinates of the N cluster centers as the new majority class samples.
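
A minimal usage sketch (the toy dataset built with make_classification below is illustrative, not from the original post):

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import ClusterCentroids

# Illustrative imbalanced dataset: roughly 90% class 0, 10% class 1.
X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)
print(Counter(y))

# K-means is fit on the majority class; the cluster centers replace
# the original majority samples.
cc = ClusterCentroids(random_state=0)
X_res, y_res = cc.fit_resample(X, y)
print(Counter(y_res))  # classes are now balanced
```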

EditedNearestNeighbours

https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.EditedNearestNeighbours.html

EditedNearestNeighbours (ENN) is an undersampling-based method for dealing with class imbalance. It cleans the dataset by removing samples that lie close to the decision boundary.

The principle of the ENN algorithm is as follows:

  1. For each sample in the class to be undersampled, find its k nearest neighbors.

  2. If the sample's class is inconsistent with that of its neighbors (according to the selection strategy chosen), the sample is removed from the dataset.

Through this process, the ENN algorithm edits the dataset by removing samples that are not sufficiently consistent with their neighborhood. Whether a sample is kept depends on the samples around it. For the selection criterion, ENN offers two strategies (a usage sketch follows the list):

  1. Majority vote (kind_sel='mode'): the current sample is kept if the majority of its nearest neighbors belong to the same class as the sample itself.

  2. All neighbors (kind_sel='all'): the current sample is kept only if all of its nearest neighbors belong to the same class; if even one neighbor has a different class, the sample is removed.
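
A short sketch of the kind_sel strategies on illustrative data (the toy dataset is an assumption):

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import EditedNearestNeighbours

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# kind_sel='all' removes a sample if any of its 3 neighbors disagrees with it;
# kind_sel='mode' only requires agreement with the majority of the neighbors.
enn = EditedNearestNeighbours(n_neighbors=3, kind_sel='all')
X_res, y_res = enn.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))
```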

CondensedNearestNeighbour

https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.CondensedNearestNeighbour.html

CondensedNearestNeighbour (CNN) is an undersampling-based method for dealing with class imbalance. It uses the 1-nearest-neighbor rule to iteratively decide whether a sample should be removed or kept.

The principle of the algorithm is as follows:

  1. Put all minority class samples into set C.

  2. Add a sample from the target class (the class to be undersampled) to set C, and put all other samples of that class into set S.

  3. Go through set S sample by sample, classifying each sample with a 1-nearest-neighbor rule trained on set C.

  4. If the sample is misclassified, add it to the set C; otherwise, do nothing.

  5. Repeat over set S until no more samples need to be added to C.

The CondensedNearestNeighbour algorithm thus produces a smaller sample set that contains the minority class samples plus the majority class samples that were misclassified along the way. This reduced set can be used to address class imbalance, giving minority class samples a larger share of the overall dataset.
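
A quick sketch of typical usage on assumed toy data; note that CNN can be slow on large datasets because of its iterative 1-NN checks:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import CondensedNearestNeighbour

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# The 1-NN rule decides which majority samples get added to the condensed set.
cnn = CondensedNearestNeighbour(n_neighbors=1, random_state=0)
X_res, y_res = cnn.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))
```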

AllKNN

https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.AllKNN.html

The AllKNN algorithm applies the ENN (Edited Nearest Neighbors) algorithm multiple times and changes the number of nearest neighbors at each iteration.

Unlike the RepeatedEditedNearestNeighbours algorithm, which repeats ENN with a fixed neighborhood size, AllKNN increases the number of nearest neighbors used by the inner ENN algorithm at each iteration.

Through this process, the AllKNN algorithm can apply the ENN algorithm multiple times and gradually increase the number of nearest neighbors. This allows for more thorough cleaning of noisy samples located near class boundaries.
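
Roughly how this looks in code (toy data assumed; n_neighbors is the upper bound reached by the inner ENN):

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import AllKNN

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# ENN is applied repeatedly, growing the neighborhood size up to n_neighbors.
aknn = AllKNN(n_neighbors=3, kind_sel='all')
X_res, y_res = aknn.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))
```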

InstanceHardnessThreshold

https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.InstanceHardnessThreshold.html

The principle of the InstanceHardnessThreshold algorithm is to first train a classifier on the data and then remove samples based on their predicted probabilities: samples for which the classifier assigns a low probability to their own class (the "hard" instances) are removed from the dataset.

When using the InstanceHardnessThreshold algorithm, two parameters are particularly important. The first is estimator, which accepts any scikit-learn classifier that implements a predict_proba method. The classifier is trained with cross-validation, and the number of cross-validation folds can be set through the cv parameter.
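
A hedged example of the estimator and cv parameters; LogisticRegression is an arbitrary choice of probabilistic classifier, not something prescribed by the original post:

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from imblearn.under_sampling import InstanceHardnessThreshold

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Any classifier exposing predict_proba works as the estimator;
# cv sets the number of cross-validation folds used to obtain probabilities.
iht = InstanceHardnessThreshold(
    estimator=LogisticRegression(max_iter=1000), cv=5, random_state=0
)
X_res, y_res = iht.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))
```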

NearMiss

https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.NearMiss.html

When using the NearMiss algorithm, we can choose among different heuristic rules by setting the version parameter. For example, version=1 means the first heuristic rule is used.

The heuristic rules of the NearMiss algorithm are based on nearest neighbors, so the number of neighbors can be set through the n_neighbors and n_neighbors_ver3 parameters. The n_neighbors parameter is used to compute the average distance between a sample and its neighbors, while n_neighbors_ver3 is used to pre-select the samples of interest.
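
An illustrative sketch of the version, n_neighbors and n_neighbors_ver3 parameters (values chosen for demonstration only):

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import NearMiss

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# version selects the heuristic rule; n_neighbors controls the distance averaging.
nm1 = NearMiss(version=1, n_neighbors=3)
print(Counter(nm1.fit_resample(X, y)[1]))

# n_neighbors_ver3 is only used by the third heuristic (version=3).
nm3 = NearMiss(version=3, n_neighbors=3, n_neighbors_ver3=3)
print(Counter(nm3.fit_resample(X, y)[1]))
```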

NeighbourhoodCleaningRule

https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.NeighbourhoodCleaningRule.html

The NeighbourhoodCleaningRule algorithm is mainly intended for data cleaning rather than data reduction. It combines the EditedNearestNeighbours algorithm with a k-NN classifier to remove noisy samples from the dataset.

In the NeighbourhoodCleaningRule algorithm, EditedNearestNeighbours is first applied to build a set of samples that should be removed. A 3-nearest-neighbor classifier is then used to classify the dataset, and the samples it misclassifies are merged with the previously built set to obtain the final set of samples to remove.
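
A minimal sketch, assuming the rule's default internal ENN step (the toy data is illustrative):

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import NeighbourhoodCleaningRule

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# ENN plus a 3-NN check; samples flagged by either step are removed.
ncr = NeighbourhoodCleaningRule(n_neighbors=3)
X_res, y_res = ncr.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))
```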

OneSidedSelection

https://imbalanced-learn.org/stable/references/generated/imblearn.under_sampling.OneSidedSelection.html

The OneSidedSelection algorithm applies the 1-nearest-neighbor rule to all samples and adds the misclassified ones to the retained set, and then uses the TomekLinks method to remove noisy samples.
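
A brief sketch on assumed toy data; n_seeds_S controls how many majority samples seed the initial set:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.under_sampling import OneSidedSelection

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# The 1-NN rule builds the retained set; Tomek links then prune noisy samples.
oss = OneSidedSelection(n_neighbors=1, n_seeds_S=1, random_state=0)
X_res, y_res = oss.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))
```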

Oversampling Methods

SMOTE

https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTE.html

SMOTE (Synthetic Minority Over-sampling Technique) is a commonly used oversampling method for addressing class imbalance. It increases the number of minority class samples by generating synthetic samples, balancing the data distribution across classes.

The principle of SMOTE is based on the interpolation of minority class samples. Specifically, it first randomly selects a minority class sample as a starting point, and then randomly selects a sample from the neighbors of this sample as a reference point. SMOTE then increases the sample size of the dataset by generating new synthetic samples on the line segment between these two samples.
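
A minimal sketch of typical usage; the toy data and the explicit k_neighbors value are assumptions for illustration:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# Each synthetic sample is interpolated between a minority sample and one
# of its k_neighbors minority-class neighbors.
smote = SMOTE(k_neighbors=5, random_state=0)
X_res, y_res = smote.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))
```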

SMOTENC

https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTENC.html

SMOTE-NC (SMOTE for Nominal and Continuous features) is an oversampling method for datasets that contain both numerical and categorical features. It extends the traditional SMOTE algorithm to handle this mixed case, but it is not suitable for datasets that contain only categorical features.

The principle of SMOTE-NC is similar to SMOTE, but it is different in generating synthetic samples. Its generation process is as follows:

  1. For the selected starting point and reference point, compute the difference between them, resulting in a gap vector.

  2. Multiply the gap of continuous features (numerical features) by a random number to get the position of the new sample. This step is the same as traditional SMOTE.

  3. For categorical features, the new sample's feature value is taken from the categories of the neighboring minority samples (in imbalanced-learn, the most frequent category among the nearest neighbors is used).

  4. In this way, interpolation is used for the continuous features and category selection for the categorical features to produce the feature values of the new sample.

In this way, SMOTE-NC is able to handle datasets containing both numerical and categorical features, and generate new synthetic samples to increase the number of minority class samples. This keeps the numerical and categorical features consistent while balancing the dataset.
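
A sketch with a small hand-made matrix in which column 1 is treated as categorical; the data and the integer encoding are purely illustrative:

```python
from collections import Counter

import numpy as np
from imblearn.over_sampling import SMOTENC

# Column 0: numerical feature; column 1: categorical feature encoded as 0/1/2.
rng = np.random.RandomState(42)
X = np.column_stack([rng.randn(200), rng.randint(0, 3, 200)])
y = np.array([0] * 180 + [1] * 20)

# categorical_features marks which columns SMOTE-NC must treat as categorical.
smote_nc = SMOTENC(categorical_features=[1], random_state=0)
X_res, y_res = smote_nc.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))
```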

SMOTEN

https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SMOTEN.html

SMOTEN (Synthetic Minority Over-sampling Technique for Nominal) is an oversampling method specifically for categorical features to solve the problem of category imbalance. It is an extension to the SMOTE algorithm and is suitable for datasets containing only categorical features.

SMOTEN works similarly to SMOTE, but differs in generating synthetic samples. Its generation process is as follows:

  1. For the selected starting point and reference point, compute the distance between them (for nominal features this distance is based on category differences rather than numerical gaps).

  2. For each categorical feature, count the frequency of the unique values (categories) of that feature at the starting point and the reference point.

  3. Based on these frequencies, the feature values of the new sample are determined: for each categorical feature, one of the categories observed at the starting point or reference point is selected as the feature value of the newly synthesized sample.

  4. Because SMOTEN assumes that all features are categorical, there is no continuous-feature interpolation step; if the dataset also contains numerical features, SMOTE or SMOTE-NC should be used instead (a usage sketch follows this list).
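
A small sketch on all-categorical toy data, following the pattern shown in the library's documentation:

```python
from collections import Counter

import numpy as np
from imblearn.over_sampling import SMOTEN

# A single categorical feature stored as strings.
X = np.array(["green"] * 5 + ["red"] * 10 + ["blue"] * 7,
             dtype=object).reshape(-1, 1)
y = np.array(["apple"] * 5 + ["not apple"] * 3 + ["apple"] * 7 +
             ["not apple"] * 5 + ["apple"] * 2, dtype=object)
print(Counter(y))

# Each synthetic sample takes, per feature, a category seen among its neighbors.
sampler = SMOTEN(random_state=0)
X_res, y_res = sampler.fit_resample(X, y)
print(Counter(y_res))
```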

ADASYN

https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.ADASYN.html

ADASYN (Adaptive Synthetic) is an oversampling algorithm based on adaptive synthesis. It is similar to the SMOTE method, but generates a different number of samples according to the local distribution estimates of the classes.

ADASYN calculates a density factor for each minority class sample based on its neighborhood. The density factor reflects the density of minority class samples around that sample: a lower value indicates that the region lacks minority class samples, while a higher value indicates that minority class samples are plentiful nearby. ADASYN then generates more synthetic samples in the low-density regions, since those are the harder-to-learn areas.
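
An illustrative sketch; the n_neighbors value is an assumption, and note that ADASYN balances the classes only approximately:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import ADASYN

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# More synthetic samples are generated for minority points whose
# neighborhoods are dominated by the majority class.
ada = ADASYN(n_neighbors=5, random_state=0)
X_res, y_res = ada.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))
```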

BorderlineSMOTE

https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.BorderlineSMOTE.html

BorderlineSMOTE (Borderline SMOTE) is an oversampling algorithm, which is an improvement and extension of the original SMOTE algorithm. It is able to detect and leverage boundary samples to generate new synthetic samples to address class imbalance.

BorderlineSMOTE improves on the SMOTE algorithm by identifying borderline samples so that new synthetic samples can be generated in a more targeted way. Borderline samples are minority class samples that lie close to the majority class, and they are typically difficult to classify. By identifying and oversampling these borderline samples, BorderlineSMOTE can improve the classifier's ability to recognize hard-to-classify samples.
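
A sketch showing the kind parameter, which switches between the two borderline variants defined by the algorithm:

```python
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import BorderlineSMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# 'borderline-1' interpolates only towards minority neighbors;
# 'borderline-2' may also use majority neighbors as reference points.
bsm = BorderlineSMOTE(kind='borderline-1', random_state=0)
X_res, y_res = bsm.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))
```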

KMeansSMOTE

https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.KMeansSMOTE.html

The key to KMeansSMOTE is to first partition the data into clusters with KMeans and then apply SMOTE only inside clusters that contain a sufficient share of minority class samples, giving more weight to the sparser clusters. This can improve the diversity and realism of the synthetic samples, because oversampling is restricted to regions where minority samples actually occur rather than being spread over the entire minority class.
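
A cautious sketch: the clustering settings below are assumptions and may need tuning per dataset, since KMeansSMOTE raises an error when no cluster contains a sufficient share of minority samples:

```python
from collections import Counter

from sklearn.cluster import MiniBatchKMeans
from sklearn.datasets import make_classification
from imblearn.over_sampling import KMeansSMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1],
                           n_clusters_per_class=1, random_state=42)

# The data is clustered first; only clusters with enough minority samples
# are oversampled, with sparser clusters receiving more synthetic points.
ksm = KMeansSMOTE(
    kmeans_estimator=MiniBatchKMeans(n_clusters=10, random_state=0),
    cluster_balance_threshold=0.1,
    random_state=0,
)
X_res, y_res = ksm.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))
```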

SVMSMOTE

https://imbalanced-learn.org/stable/references/generated/imblearn.over_sampling.SVMSMOTE.html

SVMSMOTE is a variant of the SMOTE algorithm that uses a support vector machine (SVM) to decide which samples are used to generate new synthetic samples. By separating the minority class samples into support vectors and non-support vectors, SVMSMOTE can select the samples to synthesize around more precisely. For each minority class support vector, it chooses one of its nearest neighbors as a reference point and generates new synthetic samples by interpolating between the support vector and that reference point.
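
A short sketch; the SVC settings and neighbor counts are illustrative assumptions rather than recommended values:

```python
from collections import Counter

from sklearn.datasets import make_classification
from sklearn.svm import SVC
from imblearn.over_sampling import SVMSMOTE

X, y = make_classification(n_samples=1000, weights=[0.9, 0.1], random_state=42)

# An SVM is fit to find the minority support vectors; new samples are
# synthesized around those borderline support vectors.
svmsm = SVMSMOTE(svm_estimator=SVC(gamma='scale'), k_neighbors=5,
                 m_neighbors=10, random_state=0)
X_res, y_res = svmsm.fit_resample(X, y)
print(Counter(y), '->', Counter(y_res))
```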

Source: blog.csdn.net/qq_34160248/article/details/131350304