Data combination sampling, feature-layer, and algorithm-layer methods for imbalanced data in artificial intelligence

1. Data combination sampling

Undersampling and oversampling each operate on only one class of samples. A third family of methods combines the two, known as combined resampling: the basic idea is to increase the number of minority-class samples in the data set while simultaneously reducing the number of majority-class samples, so that the imbalance is reduced from both sides. Two typical combinations are SMOTE + Tomek Links and SMOTE + ENN, explained separately below.

1: SMOTE + Tomek Link Removal

First, SMOTE is used to generate new minority-class samples, yielding an expanded data set T. Then the Tomek links in T are removed.

Why combine the two? SMOTE on its own can cause regions of the feature space that originally belonged to the majority class to be "invaded" by synthetic minority samples; removing Tomek links afterwards cleans up the resulting noise points and boundary points.
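As a minimal sketch of this pipeline (not from the original post), the example below assumes the imbalanced-learn package, whose SMOTETomek combiner applies SMOTE oversampling followed by Tomek-link removal; the toy dataset and the 95:5 class ratio are illustrative choices.

```python
# Sketch: SMOTE oversampling followed by Tomek-link removal (imbalanced-learn).
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.combine import SMOTETomek

# Illustrative imbalanced toy data: roughly 5% minority class.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=42)
print("before:", Counter(y))

# SMOTETomek = SMOTE on the minority class, then remove Tomek links from the result.
resampler = SMOTETomek(random_state=42)
X_res, y_res = resampler.fit_resample(X, y)
print("after: ", Counter(y_res))
```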

2: SMOTE + ENN

Similar in spirit to the SMOTE + Tomek Links method, it consists of two steps:

1) Use SMOTE to generate new minority-class samples, obtaining the expanded data set T.

2) For each sample in T, predict its label with kNN (k is usually 3); if the prediction does not match the actual class label, remove the sample. This cleaning step is Edited Nearest Neighbours (ENN).
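A minimal sketch of the same two-step idea, again assuming imbalanced-learn: SMOTEENN chains SMOTE with Edited Nearest Neighbours, and passing an explicit EditedNearestNeighbours(n_neighbors=3) makes the k = 3 choice visible; the toy data is the same illustrative setup as above.

```python
# Sketch: SMOTE oversampling followed by ENN cleaning (imbalanced-learn).
from sklearn.datasets import make_classification
from imblearn.combine import SMOTEENN
from imblearn.under_sampling import EditedNearestNeighbours

# Illustrative imbalanced toy data (same setup as the previous sketch).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=42)

# Step 1: SMOTE expands the minority class.
# Step 2: ENN removes samples whose 3 nearest neighbours disagree with their label.
resampler = SMOTEENN(enn=EditedNearestNeighbours(n_neighbors=3), random_state=42)
X_res, y_res = resampler.fit_resample(X, y)
```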

2. Imbalanced data classification at the feature layer

In network security, data for some categories of traffic is difficult to obtain, which leads to class imbalance: the majority class is usually normal behaviour, while the minority classes are attack behaviours. Although the samples are distributed unevenly across classes, this imbalance does not appear in every feature.

The feature-layer approach to imbalanced classification is to first select the most suitable feature representation space, and then perform classification in that space.

"Most suitable" refers to improving the classification accuracy of the minority class and the whole. Projecting the data samples into this "most suitable" subspace, most classes may be clustered together or overlapped together, so it is beneficial to reduce the non-identity of the data. balance

According to feature theory in machine learning, there are two kinds of methods for constructing such a feature space: feature selection and feature extraction.
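As an illustrative sketch (not from the original post), the snippet below shows one representative of each family using scikit-learn: mutual-information-based feature selection and PCA-based feature extraction; the scorer, k, and n_components values are assumptions.

```python
# Sketch: feature selection vs. feature extraction before classification.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# Illustrative imbalanced toy data.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=42)

# Feature selection: keep the k original features most informative about the label.
X_sel = SelectKBest(score_func=mutual_info_classif, k=5).fit_transform(X, y)

# Feature extraction: build new features as combinations of the originals (PCA here).
X_ext = PCA(n_components=5).fit_transform(X)
```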

3. Imbalanced data classification at the algorithm layer

1: Cost-Sensitive Approach

Cost-sensitive learning: weight the loss function so that the cost of misclassifying a minority-class sample is greater than the cost of misclassifying a majority-class sample.

With the lowest total misclassification cost as the optimization goal, the learner pays more attention to samples whose errors are expensive, which makes the resulting classification behaviour more reasonable on imbalanced data.

Implementation:

One is to change the original data distribution so as to obtain a cost-sensitive model;

The second is to adjust the classification results so that the expected loss is minimized;

The third is to directly construct a cost-sensitive learning model.

Under this objective, the optimal Bayes prediction assigns x to the class k that minimizes the conditional risk R(i|x), namely k = argmin_i R(i|x), i = 1, 2, …, N, where R(i|x) is the expected cost (classification risk) of assigning sample x to class i.
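As a small worked sketch of this decision rule (the cost matrix and posterior probabilities are made-up values, not from the original post), the conditional risk can be written as R(i|x) = Σ_j C[i, j] · P(j|x) and minimized over i:

```python
# Sketch: cost-sensitive Bayes decision with an assumed 2-class cost matrix.
import numpy as np

# C[i, j] = cost of predicting class i when the true class is j (assumed values):
# missing the minority class (true class 1) is ten times more expensive.
C = np.array([[0.0, 10.0],
              [1.0,  0.0]])

# Assumed posterior probabilities P(j | x) for a single sample x.
posterior = np.array([0.8, 0.2])

# Conditional risk R(i | x) = sum_j C[i, j] * P(j | x), minimized over i.
risk = C @ posterior          # -> [2.0, 0.8]
k = int(np.argmin(risk))      # -> 1: predict the minority class even though P(0|x) > P(1|x)
print(risk, k)
```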

For a given training dataset ((x1, y1), …, (xn, yn)) with yi ∈ {−1, +1}, a standard non-cost-sensitive SVM learns a decision boundary by minimizing the regularized hinge loss

Ordinary SVM: min_{w,b} (1/2)||w||^2 + C Σ_i max(0, 1 − yi(w·xi + b))

Biased-Penalty SVM (BP-SVM) keeps the ordinary hinge loss but penalizes the slack variables of the two classes with different weights, for example C+ for minority (positive) errors and C− for majority (negative) errors:

min_{w,b,ξ} (1/2)||w||^2 + C ( C+ Σ_{i: yi=+1} ξi + C− Σ_{i: yi=−1} ξi )

Cost-Sensitive Hinge Loss SVM (CSHL-SVM) goes one step further and builds the misclassification costs into the hinge loss itself, so that minimizing the training objective approximates the cost-sensitive Bayes decision rule.
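In practice, the biased-penalty idea corresponds to per-class weighting of the penalty C, which scikit-learn exposes through the class_weight argument of SVC. A minimal sketch (the 10:1 weighting and the toy data are assumed, not values from the original post):

```python
# Sketch: biased-penalty (cost-sensitive) SVM via per-class weighting of C.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Illustrative imbalanced toy data: class 1 is the minority class.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=42)

# class_weight scales C per class, so slack on the minority class costs 10x more.
clf = SVC(kernel="rbf", C=1.0, class_weight={0: 1.0, 1: 10.0})
clf.fit(X, y)
y_pred = clf.predict(X)
```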

2: One-Class Classifier Methods

One-class classifier methods: train the model on samples from a single class only (for example, only the minority class), e.g. using a one-class SVM.

The main families are density-estimation methods, cluster-based methods, and support-domain-based methods.

Representative support-domain methods include the One-Class Support Vector Machine (OneClassSVM) and Support Vector Data Description (SVDD).

When the majority class has an obvious cluster structure, using a clustering method to recover that structure helps describe the contour of the majority class more accurately.
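A minimal sketch of the support-domain idea, assuming scikit-learn's OneClassSVM trained on one class only (here the majority "normal" class, matching the network-security framing above); the nu value and toy data are illustrative assumptions.

```python
# Sketch: one-class SVM trained on a single class; everything outside its
# learned support region is flagged as an outlier (e.g. a potential attack).
from sklearn.datasets import make_classification
from sklearn.svm import OneClassSVM

# Illustrative imbalanced toy data: class 0 plays the role of "normal" traffic.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=42)
X_normal = X[y == 0]

# nu bounds the fraction of training points treated as outliers (assumed 5%).
ocsvm = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
ocsvm.fit(X_normal)

pred = ocsvm.predict(X)   # +1 = inside the support region, -1 = outside
```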

 

3: Ensemble Learning

Typical ensemble learning methods include Bagging, Boosting, and Stacking. Common imbalance-aware variants are listed below (a code sketch follows the list).

OverBagging: apply random oversampling to the minority class at each bagging iteration.

UnderBagging: apply random undersampling to the majority class at each bagging iteration.

SMOTEBagging: combine SMOTE with Bagging; first use SMOTE to generate richer minority-class data, then apply Bagging.

Asymmetric Bagging: at each iteration, keep all minority-class samples and draw from the majority class a subset of the same size as the minority class.

SMOTEBoost: integrate SMOTE into Boosting, generating synthetic minority samples instead of simply increasing the weights of minority observations.

BalanceCascade: a typical double-ensemble algorithm that uses Bagging as the basic ensemble method and AdaBoost as the classifier trained on each bootstrap sample.
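As a minimal sketch of the UnderBagging idea (not from the original post), the example below assumes imbalanced-learn's BalancedBaggingClassifier, which resamples each bootstrap sample to a balanced class ratio before fitting the base learner (a decision tree by default); the number of estimators and the toy data are illustrative.

```python
# Sketch: UnderBagging-style ensemble with imbalanced-learn's BalancedBaggingClassifier.
from sklearn.datasets import make_classification
from imblearn.ensemble import BalancedBaggingClassifier

# Illustrative imbalanced toy data.
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.95, 0.05], random_state=42)

# Each of the 50 bootstrap samples is undersampled to a balanced class ratio
# before a base decision tree is trained on it.
ensemble = BalancedBaggingClassifier(n_estimators=50,
                                     sampling_strategy="auto",
                                     random_state=42)
ensemble.fit(X, y)
y_pred = ensemble.predict(X)
```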


Source: blog.csdn.net/jiebaoshayebuhui/article/details/130434305