Paper Reading-13-MESA: Boost Ensemble Imbalanced Learning with META-sampler

Paper translation

0. Abstract

Imbalanced learning (IL), i.e., learning unbiased models from class-imbalanced data, is a challenging problem. Typical IL methods, including resampling and reweighting, are designed based on heuristic assumptions. They often suffer from unstable performance, poor applicability, and high computational cost in complex tasks where their assumptions do not hold. This paper introduces MESA, a new ensemble IL framework. It adaptively resamples the training set in iterations to obtain multiple classifiers and forms a cascade ensemble model. MESA learns its sampling strategy directly from data to optimize the final metric, rather than following random heuristics. Furthermore, unlike popular meta-learning based IL solutions, we decouple model training and meta-training in MESA by independently training the meta-sampler on task-agnostic meta-data. This makes MESA generally applicable to most existing learning models, and the meta-sampler can be efficiently applied to new tasks. Extensive experiments on synthetic and real-world tasks demonstrate the effectiveness, robustness, and transferability of MESA.

1. Introduction

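Due to naturally skewed class distributions, class imbalance is widely observed in many real-world applications, such as click prediction, fraud detection, and medical diagnosis [13, 15, 21]. When dealing with class imbalance, canonical classification algorithms usually induce bias, i.e., they perform well in terms of global accuracy but poorly on the minority class. However, from both a learning and a practical perspective, the minority class is usually of greater interest [18, 19]. Typical imbalanced learning (IL) algorithms attempt to eliminate this bias through data resampling [6, 16, 17, 26, 35] or reweighting [30, 33, 40] in the learning process. Recently, ensemble learning has been incorporated to reduce the variance introduced by resampling or reweighting and has achieved satisfactory performance [23]. In practice, however, all these methods have been observed to suffer from three major limitations: (I) unstable performance due to sensitivity to outliers, (II) poor applicability because of the prerequisite that domain experts hand-craft the cost matrix, and (III) the high cost of computing distances between instances.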

Regardless of computational issues, we attribute the unsatisfactory performance of traditional IL methods to the validity of the heuristic assumptions made on the training data. For example, some methods [7, 12, 32, 39] assume that instances with higher training errors are more informative for learning. However, misclassification may be caused by outliers, in which case this assumption leads to false reinforcement of errors. Another widely used assumption is that generating synthetic samples around minority instances helps learning [7, 8, 46]. This assumption only holds when the minority data is well clustered and sufficiently discriminative. If the training data is extremely imbalanced or has many corrupted labels, the minority class will be poorly represented and lack a clear structure. In this case, working under this assumption can severely impact performance.

Therefore, it is desirable to develop an adaptive IL framework that can handle complex real-world tasks without intuitive assumptions. Inspired by recent developments in meta-learning [25], we propose to realize the meta-learning mechanism within the ensemble imbalanced learning (EIL) framework. In fact, some preliminary efforts [37, 38, 41] have investigated the potential of applying meta-learning to IL problems. However, these works have limited generalization capability because of their model-dependent optimization processes. Their meta-learners are confined to being co-optimized with a single DNN, which greatly limits their application to other learning models (e.g., tree-based models) as well as their deployment in more powerful EIL frameworks.

In this paper, we propose MESA, a generic EIL framework that automatically learns its strategy, i.e., the meta-sampler, from data to optimize imbalanced classification. The main idea is to model a meta-sampler that serves as an adaptive undersampling solution embedded in the iterative ensemble training process. In each iteration, it takes the current state of ensemble training (i.e., the classification error distribution on both the training and validation sets) as its input. Based on this, the meta-sampler selects a subset to train a new base classifier, which is then added to the ensemble, yielding a new state. We expect the meta-sampler to maximize the final generalization performance by learning from such interactions. To this end, we use reinforcement learning (RL) to solve the non-differentiable optimization problem of the meta-sampler. To sum up, this paper makes the following contributions. (I) We propose MESA, a generic EIL framework that demonstrates superior performance by automatically learning an adaptive undersampling strategy from data. (II) We carry out a preliminary exploration of extracting and using cross-task meta-information in EIL systems. The use of such meta-information gives the meta-sampler cross-task transferability. A pre-trained meta-sampler can be directly applied to new tasks, thereby greatly reducing the computational cost of meta-training. (III) Unlike popular approaches whose meta-learners are designed to be co-optimized with a specific learning model (i.e., a DNN) during training, we decouple the model-training and meta-training processes in MESA. This makes our framework generally applicable to most statistical and non-statistical learning models (e.g., decision trees, naive Bayes, k-nearest neighbor classifiers).

2. Related Work

Fernandez et al. [1], Guo et al. [15], and He et al. [18,19] provide systematic reviews of algorithms and applications of imbalanced learning. This paper focuses on the problem of binary imbalanced classification, which is one of the most widely studied problems [15,23] in imbalanced learning. This problem widely exists in practical applications, such as fraud detection (fraud vs. normal), medical diagnosis (sick vs. healthy), and network security (intrusion vs. user connection). We mainly review existing work on this problem as follows.
Resampling. Resampling methods focus on modifying the training set to balance the class distribution (i.e., oversampling/undersampling [6, 16, 17, 35, 42]) or to filter noise (i.e., cleaning resampling [26, 45]). Random resampling usually leads to severe information loss or overfitting, so many advanced methods exploit distance information to guide their sampling process [15]. However, computing distances between instances is computationally expensive on large-scale datasets, and such strategies may even fail when the data does not fit their assumptions.

Reweighting. Reweighting methods assign different weights to different instances to alleviate the classifier's bias towards the majority class (e.g., [5, 12, 31, 33]). Many recent reweighting methods, such as focal loss [30] and GHM [28], are specifically designed for DNN loss function engineering. Class-level reweighting, such as cost-sensitive learning [33], is more versatile but requires a cost matrix given by domain experts beforehand, which is usually infeasible in practice.

Ensemble methods. Ensemble imbalanced learning (EIL) is known to effectively improve typical IL solutions by combining the outputs of multiple classifiers (e.g., [7, 32, 34, 39, 46]). These EIL approaches have proven to be highly competitive [23] and have thus gained increasing popularity in IL [15]. However, most of them are straightforward combinations of a resampling/reweighting solution and an ensemble learning framework, e.g., SMOTE [6] + ADABOOST [12] = SMOTEBOOST [7]. Consequently, although EIL techniques effectively lower the variance introduced by resampling/reweighting, these methods still suffer from unsatisfactory performance due to their heuristic-based designs.

Meta-learning methods. Inspired by recent meta-learning developments [11, 25], some studies adapt meta-learning to solve the IL problem. Typical methods include Learning to Teach [48], which learns a dynamic loss function; MentorNet [22], which learns a mini-batch curriculum; and L2RW [38]/Meta-Weight-Net [41], which learn an implicit/explicit data weighting function. Nonetheless, all these methods are confined to being co-optimized with a DNN by gradient descent. Since the success of deep learning relies on massive training data, mainly from well-structured data domains such as computer vision and natural language processing, the application of these methods to other learning models (e.g., tree-based models and their ensemble variants such as gradient boosting machines) in traditional classification tasks (e.g., small/unstructured/tabular data) is highly constrained.

We present a comprehensive comparison of existing IL solutions for binary imbalanced classification with our MESA in Table 1. Compared with other methods, MESA aims to learn a resampling strategy directly from data. Since no distance computation, domain knowledge, or related heuristics are involved in the resampling process, it can perform fast and adaptive resampling.

Table 1: Comparison of MESA with existing imbalanced learning methods. Note that $|\mathcal N| \gg |\mathcal P|$.


Figure 1: Overview of the MESA framework

3. The Proposed MESA framework

To take advantage of both ensemble learning and meta-learning, we propose a new EIL framework, MESA, that works with a meta-sampler. As shown in Figure 1, MESA consists of three parts: meta-sampling and ensemble training, which build the ensemble classifier, and meta-training, which optimizes the meta-sampler. We describe them in turn in this section.

Specifically, MESA is designed to: (I) perform resampling based on meta-information to further boost the performance of ensemble classifiers; (II) decouple model training and meta-training for general applicability to different classifiers; (III) train the meta-sampler on task-agnostic meta-data to obtain cross-task transferability and reduce the meta-training cost on new tasks.


Figure 2: Some examples of different meta-states ($s = [\hat E_{\mathcal D_\tau} : \hat E_{\mathcal D_v}]$) and their corresponding ensemble training states. The meta-state reflects how well the current classifier fits the training set and how well it generalizes to unseen validation data. Note that this representation is independent of task-specific properties (e.g., dataset size, feature space) and can therefore support the meta-sampler in performing adaptive resampling across different tasks.

Notation. Let $\mathcal X: \mathbb R^d$ be the input feature space and $\mathcal Y: \{0, 1\}$ be the label space. An instance is represented by $(x, y)$, where $x \in \mathcal X$ and $y \in \mathcal Y$. Without loss of generality, we always assume that the minority class is positive. Given an imbalanced dataset $\mathcal D: \{(x_1, y_1), (x_2, y_2), \cdots, (x_n, y_n)\}$, the minority set is $\mathcal P: \{(x, y) \mid y = 1, (x, y) \in \mathcal D\}$ and the majority set is $\mathcal N: \{(x, y) \mid y = 0, (x, y) \in \mathcal D\}$. For highly imbalanced data, we have $|\mathcal N| \gg |\mathcal P|$. We use $f: x \to [0, 1]$ to denote a single classifier and $F_k: x \to [0, 1]$ to denote an ensemble classifier formed by $k$ base classifiers. We use $\mathcal D_\tau$ and $\mathcal D_v$ to represent the training set and the validation set, respectively.

Meta-state. As mentioned before, we expect to find a task-agnostic representation that can provide the meta-sampler with information about the ensemble training process. Motivated by the concept of "gradient/hardness distribution" [28, 34], we introduce the histogram distribution of the training and validation errors as the meta-state of the ensemble training system.

Formally, given a data instance $(x, y)$ and an ensemble classifier $F_t(\cdot)$, the classification error $e$ is defined as the absolute difference between the predicted probability of $x$ being positive and the ground-truth label $y$, i.e., $e = |F_t(x) - y|$. Suppose the error distribution on dataset $\mathcal D$ is $E_{\mathcal D}$; then the error distribution approximated by a histogram is given by a vector $\hat E_{\mathcal D} \in \mathbb R^b$, where $b$ is the number of bins in the histogram. Specifically, the $i$-th component of $\hat E_{\mathcal D}$ can be computed as follows:

$$\hat E_{\mathcal D}^i = \frac{|\{(x, y) \mid \frac{i-1}{b} \le \operatorname{abs}(F_t(x) - y) < \frac{i}{b},\ (x, y) \in \mathcal D\}|}{|\mathcal D|}, \quad 1 \le i \le b \tag{1}$$

After concatenating the error distribution vectors on the training set and the validation set, we have the meta-state:

$$s = [\hat E_{\mathcal D_\tau} : \hat E_{\mathcal D_v}] \in \mathbb R^{2b} \tag{2}$$

Intuitively, the histogram error distribution $\hat E_{\mathcal D}$ shows how well the given classifier fits the dataset $\mathcal D$. When $b = 2$, it reports the accuracy score in $\hat E_{\mathcal D}^1$ and the misclassification rate in $\hat E_{\mathcal D}^2$ (with a classification threshold of 0.5). When $b > 2$, it shows the distribution of "easy" samples (with errors close to 0) and "hard" samples (with errors close to 1) at a finer granularity, and thus contains more information to guide the resampling process. Moreover, since we consider both the training set and the validation set, the meta-state also provides the meta-sampler with information about the bias/variance of the current ensemble model, thus supporting its decision-making. We give some examples in Figure 2.
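The meta-state of Eqs. 1 and 2 can be sketched in a few lines of plain Python. This is only an illustration, not the authors' implementation: the function names and the toy predictions are made up, and an error of exactly 1.0 is clamped into the last bin (an assumption, since Eq. 1 uses a strict upper bound).

```python
def error_histogram(probs, labels, b):
    """Eq. 1: fraction of instances whose error |F(x) - y| falls in each of b bins."""
    counts = [0] * b
    for p, y in zip(probs, labels):
        err = abs(p - y)
        # Bin i covers [(i-1)/b, i/b); assumption: err == 1.0 goes to the last bin.
        counts[min(int(err * b), b - 1)] += 1
    return [c / len(labels) for c in counts]


def meta_state(train_probs, train_labels, valid_probs, valid_labels, b):
    """Eq. 2: concatenate the training and validation histograms into a 2b-vector."""
    return (error_histogram(train_probs, train_labels, b)
            + error_histogram(valid_probs, valid_labels, b))


# Toy predicted probabilities and labels (invented numbers, for illustration only).
train_probs, train_labels = [0.1, 0.2, 0.9, 0.6], [0, 0, 1, 0]
valid_probs, valid_labels = [0.3, 0.8, 0.4, 0.7], [0, 1, 1, 1]

state = meta_state(train_probs, train_labels, valid_probs, valid_labels, b=2)
print(state)  # [0.75, 0.25, 0.75, 0.25]
```

With b = 2 each half of the state reads as [accuracy, misclassification rate]: three of the four toy instances in each set have an error below 0.5.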

Meta-sampling. Making instance-level decisions with a complex meta-sampler (e.g., one with a large output layer or a recurrent neural network) is extremely time-consuming, because the complexity of a single update $C_u$ is $\mathcal O(|\mathcal D|)$. In addition, a complex model architecture brings extra memory cost and makes optimization harder. To make MESA more concise and efficient, we use a Gaussian function trick to simplify the meta-sampling process and the sampler itself, reducing $C_u$ from $\mathcal O(|\mathcal D|)$ to $\mathcal O(1)$.

Specifically, let $\eth$ denote the meta-sampler, which outputs a scalar $\mu \in [0, 1]$ based on the input meta-state $s$, i.e., $\mu \sim \eth(\mu \mid s)$. We then apply a Gaussian function $g_{\mu,\sigma}(x)$ to each instance's classification error to decide its (unnormalized) sampling weight, where $g_{\mu,\sigma}(x)$ is defined as:

$$g_{\mu,\sigma}(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} \tag{3}$$

Note that in Eq. 3, $e$ is Euler's number, $\mu \in [0, 1]$ is given by the meta-sampler, and $\sigma$ is a hyperparameter. See Section C.2 for a discussion and guidance on our hyperparameter settings. The above meta-sampling procedure $\text{Sample}(\cdot\,; F, \mu, \sigma)$ is summarized in Algorithm 1.

Algorithm 1  Sample($\mathcal D_\tau; F, \mu, \sigma$)
Require: $\mathcal D_\tau, F, \mu, \sigma$
1: Initialization: derive the majority set $\mathcal N_\tau$ and the minority set $\mathcal P_\tau$ from $\mathcal D_\tau$
2: Assign each $(x_i, y_i)$ in $\mathcal N_\tau$ the weight:
$$w_i = \frac{g_{\mu,\sigma}(|F(x_i) - y_i|)}{\sum_{(x_j, y_j) \in \mathcal N_\tau} g_{\mu,\sigma}(|F(x_j) - y_j|)}$$
3: Sample a majority subset $\mathcal N_\tau'$ from $\mathcal N_\tau$ with respect to the sampling weights $w$, where $|\mathcal N_\tau'| = |\mathcal P_\tau|$
4: Return the balanced subset $\mathcal D_\tau' = \mathcal N_\tau' \cup \mathcal P_\tau$
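Algorithm 1 can be sketched with the standard library alone. The names `gaussian` and `meta_sample`, the toy data, and the stand-in ensemble `F` are all hypothetical. One loud assumption: the sketch samples with replacement, because `random.choices` is the standard library's weighted sampler; the text above does not say which variant the authors use.

```python
import math
import random


def gaussian(x, mu, sigma):
    """Eq. 3: Gaussian density with mean mu and standard deviation sigma at x."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))


def meta_sample(data, predict_proba, mu, sigma, rng):
    """Sketch of Algorithm 1: return a balanced subset of `data`.

    data: list of (x, y) pairs with y in {0, 1}; 1 is the minority class.
    predict_proba: callable x -> probability of the positive class (the ensemble F).
    """
    minority = [(x, y) for x, y in data if y == 1]
    majority = [(x, y) for x, y in data if y == 0]
    # Step 2: unnormalized weight of each majority instance from its error.
    weights = [gaussian(abs(predict_proba(x) - y), mu, sigma) for x, y in majority]
    # Step 3 (assumption: WITH replacement); random.choices normalizes the weights.
    sampled = rng.choices(majority, weights=weights, k=len(minority))
    # Step 4: balanced subset.
    return sampled + minority


rng = random.Random(0)
# Toy 1-D data: 20 majority points and 2 minority points (invented values).
data = [((i / 20,), 0) for i in range(20)] + [((0.9,), 1), ((0.95,), 1)]
F = lambda x: x[0]  # hypothetical ensemble: the score equals the single feature
subset = meta_sample(data, F, mu=0.8, sigma=0.2, rng=rng)
print(len(subset), sum(y for _, y in subset))  # 4 2
```

A large `mu` concentrates the sampling weight on majority instances with large errors (hard samples), while a small `mu` favors easy ones; `sigma` controls how sharply the weight falls off around `mu`.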

Ensemble training. Given a meta-sampler $\eth: \mathbb R^{2b} \to [0, 1]$ and the meta-sampling strategy, we can iteratively train new base classifiers using the datasets sampled by the sampler. At the $t$-th iteration, with the current ensemble $F_t(\cdot)$, we can obtain $\hat E_{\mathcal D_\tau}$, $\hat E_{\mathcal D_v}$ and the meta-state $s_t$ by applying Eqs. 1 and 2. A new base classifier $f_{t+1}(\cdot)$ is then trained on the subset $\mathcal D_{t+1,\tau}' = \text{Sample}(\mathcal D_\tau; F_t, \mu_t, \sigma)$, where $\mu_t \sim \eth(\mu_t \mid s_t)$ and $\mathcal D_\tau$ is the original training set. Note that $f_1(\cdot)$ is trained on a random balanced subset, because no trained classifier is available in the first iteration. See Algorithm 2 for more details.

Algorithm 2  Ensemble training in MESA
Require: $\mathcal D_\tau, \mathcal D_v, \eth, \sigma, f, b, k$
1: Train $f_1(x)$ using a random balanced subset
2: for $t = 1$ to $k-1$ do
3:   $F_t(x) = \frac{1}{t}\sum_{i=1}^{t} f_i(x)$
4:   Compute $\hat E_{\mathcal D_\tau}$ and $\hat E_{\mathcal D_v}$ by Eq. 1
5:   $s_t = [\hat E_{\mathcal D_\tau} : \hat E_{\mathcal D_v}]$
6:   $\mu_t \sim \eth(\mu_t \mid s_t)$
7:   $\mathcal D_{t+1,\tau}' = \text{Sample}(\mathcal D_\tau; F_t, \mu_t, \sigma)$
8:   Train a new classifier $f_{t+1}(x)$ with $\mathcal D_{t+1,\tau}'$
9: return $F_k(x) = \frac{1}{k}\sum_{i=1}^{k} f_i(x)$
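The loop of Algorithm 2 can be sketched as follows, with comments mapping to its line numbers. Everything here is illustrative: `train_stump` is a hypothetical 1-D base learner that assumes positives have larger feature values, `fixed_policy` is a placeholder for a trained meta-sampler (it returns a constant instead of sampling $\mu_t$ from a learned policy), sampling is done with replacement, and the data are synthetic.

```python
import math
import random


def histogram(errors, b):
    """Eq. 1: share of errors falling into each of b equal-width bins on [0, 1]."""
    counts = [0] * b
    for e in errors:
        counts[min(int(e * b), b - 1)] += 1  # clamp e == 1.0 into the last bin
    return [c / len(errors) for c in counts]


def sample_subset(train, F, mu, sigma, rng):
    """Algorithm 1: Gaussian-weighted undersampling of the majority class."""
    pos = [d for d in train if d[1] == 1]
    neg = [d for d in train if d[1] == 0]
    # The constant 1/(sigma*sqrt(2*pi)) cancels once the weights are normalized.
    w = [math.exp(-0.5 * ((abs(F(x) - y) - mu) / sigma) ** 2) for x, y in neg]
    return rng.choices(neg, weights=w, k=len(pos)) + pos  # with replacement


def train_stump(subset):
    """Hypothetical base learner for 1-D inputs: soft threshold between class means."""
    pos = [x for x, y in subset if y == 1]
    neg = [x for x, y in subset if y == 0]
    thr = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1.0 / (1.0 + math.exp(-10.0 * (x - thr)))


def mesa_ensemble(train, valid, meta_sampler, sigma, train_fn, b, k, rng):
    pos = [d for d in train if d[1] == 1]
    neg = [d for d in train if d[1] == 0]
    base = [train_fn(rng.sample(neg, len(pos)) + pos)]               # line 1
    for _ in range(1, k):                                            # line 2
        members = tuple(base)                                        # snapshot of F_t
        F = lambda x, m=members: sum(f(x) for f in m) / len(m)       # line 3
        state = (histogram([abs(F(x) - y) for x, y in train], b)
                 + histogram([abs(F(x) - y) for x, y in valid], b))  # lines 4-5
        mu = meta_sampler(state)                                     # line 6
        base.append(train_fn(sample_subset(train, F, mu, sigma, rng)))  # lines 7-8
    final = tuple(base)
    return lambda x: sum(f(x) for f in final) / len(final)           # line 9


rng = random.Random(0)
make = lambda n, lo, hi, y: [(rng.uniform(lo, hi), y) for _ in range(n)]
train = make(200, 0.0, 0.6, 0) + make(20, 0.5, 1.0, 1)   # imbalance ratio 10
valid = make(50, 0.0, 0.6, 0) + make(5, 0.5, 1.0, 1)
fixed_policy = lambda state: 0.5   # placeholder for a trained meta-sampler
F_k = mesa_ensemble(train, valid, fixed_policy, sigma=0.2,
                    train_fn=train_stump, b=5, k=5, rng=rng)
print(F_k(0.1) < F_k(0.9))  # True
```

Passing the base learner as `train_fn` mirrors the decoupling described above: the loop only needs a function that maps a subset to a probabilistic classifier, so any learning model can be plugged in.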

Meta-training. As described above, our meta-sampler $\eth$ is trained to optimize the generalization performance of the ensemble classifier by iteratively selecting its training data. It takes the current state $s$ of the training system as input and outputs the parameter $\mu$ of a Gaussian function to decide each instance's sampling probability. The meta-sampler is expected to learn and adapt its strategy from such state ($s$) - action ($\mu$) - state (new $s$) interactions. The non-differentiable optimization problem of training $\eth$ can thus be naturally approached via reinforcement learning (RL).

Algorithm 3  Meta-training in MESA
1: Initialization: replay memory $\mathcal M$ with capacity $N$, network parameters $\psi$, $\bar\psi$, $\theta$, and $\varphi$
2: for episode = 1 to $M$ do
3:   for each environment step $t$ do
4:     Observe $s_t$ from ENV    (lines 3-5 in Algorithm 2)
5:     Take action $\mu_t \sim \eth_\varphi(\mu_t \mid s_t)$    (lines 6-8 in Algorithm 2)
6:     Observe reward $r_t = P(F_{t+1}, \mathcal D_v) - P(F_t, \mathcal D_v)$ and $s_{t+1}$
7:     Store the transition: $\mathcal M = \mathcal M \cup \{(s_t, \mu_t, r_t, s_{t+1})\}$
8:   for each gradient step do
9:     Update $\psi$, $\bar\psi$, $\theta$, and $\varphi$ according to [14]
10: return the meta-sampler $\eth$ with parameters $\varphi$

We take the ensemble training system as the environment (ENV) in the RL setting. The corresponding Markov decision process (MDP) is defined by the tuple $(\mathcal S, \mathcal A, p, r)$, where the state space $\mathcal S: \mathbb R^{2b}$ and the action space $\mathcal A: [0, 1]$ are continuous, and the unknown state transition probability $p: \mathcal S \times \mathcal S \times \mathcal A \to [0, \infty)$ represents the probability density of the next state $s_{t+1} \in \mathcal S$ given the current state $s_t \in \mathcal S$ and action $a_t \in \mathcal A$. More specifically, in each episode we iteratively train $k$ base classifiers $f(\cdot)$ and form a cascade ensemble classifier $F_k(\cdot)$. At each environment step, ENV provides the meta-state $s_t = [\hat E_{\mathcal D_\tau} : \hat E_{\mathcal D_v}]$, and the action $a_t$ is then selected by $a_t \sim \eth(\mu_t \mid s_t)$, i.e., $a_t \Leftrightarrow \mu_t$. A new base classifier $f_{t+1}(\cdot)$ is trained using the subset $\mathcal D_{t+1,\tau}' = \text{Sample}(\mathcal D_\tau; F_t, \mu_t, \sigma)$. After adding $f_{t+1}(\cdot)$ to the ensemble, the new state $s_{t+1}$ is sampled, i.e., $s_{t+1} \sim p(s_{t+1}; s_t, a_t)$. Given a performance metric function $P(F, \mathcal D) \to \mathbb R$, the reward $r$ is set to the difference in the generalization performance of $F$ before and after an update (using the held-out validation set for an unbiased estimate), i.e., $r_t = P(F_{t+1}, \mathcal D_v) - P(F_t, \mathcal D_v)$. The optimization goal of the meta-sampler (i.e., the cumulative reward) is thus the generalization performance of the ensemble classifier.
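The link between the cumulative reward and the generalization performance follows from a telescoping sum. As a short check, assuming an undiscounted return over the $k-1$ environment steps of one episode (the absence of a discount factor is an assumption made here for illustration):

$$\sum_{t=1}^{k-1} r_t = \sum_{t=1}^{k-1} \big[ P(F_{t+1}, \mathcal D_v) - P(F_t, \mathcal D_v) \big] = P(F_k, \mathcal D_v) - P(F_1, \mathcal D_v)$$

All intermediate terms cancel. Since $F_1$ is trained on a random balanced subset and does not depend on the meta-sampler's actions, maximizing the return amounts to maximizing the validation performance $P(F_k, \mathcal D_v)$ of the final ensemble.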

We take advantage of Soft Actor-Critic [14] (SAC), an off-policy actor-critic deep RL algorithm based on the maximum entropy RL framework, to optimize our meta-sampler $\eth$. In our case, we consider a parameterized state value function $V_\psi(s_t)$ and its corresponding target network $V_{\bar\psi}(s_t)$, a soft Q-function $Q_\theta(s_t, a_t)$, and a tractable policy (the meta-sampler) $\eth_\varphi(a_t \mid s_t)$. The parameters of these networks are $\psi$, $\bar\psi$, $\theta$, and $\varphi$. The rules for updating these parameters are given in the SAC paper [14]. The meta-training process of $\eth_\varphi$ is summarized in Algorithm 3.

Complexity analysis. The complexity analysis of MESA is detailed in Section C.1, and the related validation experiments are shown in Figure 7.

4. Experiments

To thoroughly evaluate the effectiveness of MESA, two series of experiments are conducted: one on controlled synthetic toy datasets for visualization, and the other on real-world imbalanced datasets to verify the performance of MESA in practical applications. We also carry out extended experiments on real-world datasets to verify the robustness and cross-task transferability of MESA.

4.1 Experiment on Synthetic Datasets


Figure 3: Comparison of MESA with 4 representative traditional EIL methods (SMOTEBOOST [7], SMOTEBAGGING [46], RUSBOOST [39] and UNDERBAGGING [2]) on 3 toy datasets with different levels of underlying class distribution overlap (less/mid/highly overlapped in rows 1/2/3). The number in the lower right corner of each subfigure is the AUCPRC score of the corresponding classifier.

Setup details. We build a series of imbalanced toy datasets corresponding to different levels of underlying class distribution overlap, as shown in Figure 3. All the datasets have the same imbalance ratio ($|\mathcal N|/|\mathcal P| = 2000/200 = 10$). This experiment compares MESA with four representative algorithms from the four main branches of EIL (parallel/iterative ensemble + undersampling/oversampling), namely SMOTEBOOST [7], SMOTEBAGGING [46], RUSBOOST [39] and UNDERBAGGING [2]. All EIL methods use decision trees as base classifiers with an ensemble size of 5.

Visualization and analysis. We plot the input datasets and the decision boundaries learned by the different EIL algorithms (Figure 3), which show that MESA achieves the best performance in the different situations. We can observe that all tested methods perform well on the less-overlapped dataset (1st row). Note that random undersampling discards some important majority samples (for example, data points at the right end of the "∩"-shaped distribution) and causes information loss. This makes RUSBOOST and UNDERBAGGING perform slightly worse than their competitors. As the overlap intensifies (2nd row), an increasing amount of noise gains high sample weights during the training of the boosting-based methods, i.e., SMOTEBOOST and RUSBOOST, resulting in poor classification performance. Bagging-based methods, i.e., SMOTEBAGGING and UNDERBAGGING, are less affected by noise but still underperform MESA. Even on the extremely overlapped dataset (3rd row), MESA still gives a stable and reasonable decision boundary that fits the underlying distribution. This is because the meta-sampler can adaptively select informative training subsets towards good prediction performance while being robust to noise/outliers. The experimental results show that MESA is better than the other traditional EIL baselines at handling distribution overlap, noise, and poor minority class representation.

4.2 Experiment on Real-world Datasets

Setup details. To verify the effectiveness of MESA in practical applications, we extend the experiments to real-world imbalanced classification tasks from the UCI repository [10] and KDD CUP 2004. To ensure a thorough assessment, these datasets vary widely in their properties, with imbalance ratios (IR) ranging from 9.1:1 to 111:1, dataset sizes ranging from 531 to 145,751, and numbers of features ranging from 6 to 617 (see Table 7 in Section B for details). For each dataset, we hold out a 20% validation set and report the results of 4-fold stratified cross-validation (i.e., a 60%/20%/20% training/validation/test split). Performance is evaluated using the area under the precision-recall curve (AUCPRC), which is an unbiased and more comprehensive metric for class-imbalanced tasks compared with other metrics such as F-score, ROC, and accuracy [9].
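One common way to estimate the area under the precision-recall curve is average precision, i.e., the step-wise sum of precision weighted by the increase in recall. The sketch below shows that estimator in plain Python; the text above does not say which estimator the authors use, so this is an assumption, and the function name and toy scores are invented. It also assumes there are no tied scores.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision).

    labels: 0/1 ground truth, 1 is the positive (minority) class.
    scores: predicted probability of the positive class; assumed free of ties.
    """
    # Rank instances by decreasing score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    n_pos = sum(labels)
    true_pos = 0
    ap = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            true_pos += 1
            # Each positive raises recall by 1/n_pos; weight it by precision@rank.
            ap += (true_pos / rank) / n_pos
    return ap


perfect = average_precision([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
mixed = average_precision([1, 0, 1, 0, 0], [0.9, 0.8, 0.7, 0.3, 0.1])
print(perfect)          # 1.0
print(round(mixed, 4))  # 0.8333
```

Unlike accuracy, this score only rewards ranking minority instances ahead of majority ones, so a classifier that labels everything as negative gains nothing from a large majority class.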

Table 2: Comparison of MESA with other representative resampling methods.


Comparison with resampling IL methods. We first compare MESA with resampling techniques, which are widely used in practice for preprocessing imbalanced data [15]. We select 12 representative methods from the 4 major branches of resampling-based IL, i.e., undersampling, oversampling, cleaning-sampling, and oversampling with cleaning-sampling post-processing. We test all methods on the challenging, highly imbalanced (IR=111) human proteins task to check their efficiency and effectiveness. Five different classifiers, i.e., K-nearest neighbor (KNN), Gaussian naive Bayes (GNB), decision tree (DT), adaptive boosting (AdaBoost), and gradient boosting machine (GBM), are used in collaboration with the different resampling approaches. We also record the number of samples used for model training and the time used to perform resampling.

Table 2 details the experimental results. By learning an adaptive resampling strategy, MESA significantly outperforms the other traditional data resampling methods while using only a small number of training instances. In such a highly imbalanced dataset, the minority class is poorly represented and lacks a clear structure. Therefore, oversampling methods such as SMOTE, which rely on relations between minority instances, may worsen classification performance, even though they generate and use a huge number of synthetic samples for training. On the other hand, undersampling methods discard most of the samples according to their rules, resulting in significant information loss and poor performance. Cleaning-sampling methods aim to remove noise from the dataset, but their resampling time is considerably high and the improvement is negligible.

Table 3: Comparison of MESA with other representative undersampling-based EIL methods.


Figure 4: Comparison of MESA with other representative oversampling EIL methods.

Comparison with ensemble imbalanced learning methods. We further compare MESA with 7 representative EIL methods on 4 real-world imbalanced classification tasks. The baselines include 4 undersampling-based EIL methods, i.e., RUSBOOST [39], UNDERBAGGING [2], SPE [34], and CASCADE [32], and 3 oversampling-based EIL methods, i.e., SMOTEBOOST [7], SMOTEBAGGING [46] and RAMOBOOST [8]. Following the setting of most previous work [15], we use the decision tree as the base learner for all EIL methods.

We report the AUCPRC scores of the various undersampling-based EIL (USB-EIL) methods with different ensemble sizes (k = 5, 10, 20) in Table 3. The experimental results show that MESA achieves good performance on various real-world tasks. For the baseline methods, we can observe that RUSBOOST and UNDERBAGGING suffer from information loss, because random undersampling may discard samples that carry important information, and this effect is more apparent on highly imbalanced tasks. In comparison, the improved sampling strategies of SPE and CASCADE enable them to achieve relatively better performance, but they are still inferior to MESA. Furthermore, because MESA provides an adaptive resampler that makes ensemble training converge faster and better, its advantage is particularly evident when using small ensembles on highly imbalanced tasks. On the mammography dataset (IR=42), compared with the second-best score, MESA achieves 24.70%/12.00%/5.22% performance gains when k = 5/10/20, respectively.

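We further compare MESA with 3 oversampling-based EIL (OSB-EIL) methods. As summarized in Table 1, OSB-EIL methods typically use much more data (1-2 × IR times as much) to train each base learner than their undersampling-based competitors, including MESA. It is therefore unfair to directly compare MESA with oversampling-based baselines at the same ensemble size. Instead, we plot the performance curves with respect to the number of instances used in ensemble training, as shown in Figure 4.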

It can be observed that MESA consistently outperforms the oversampling-based methods, especially on highly imbalanced/high-dimensional tasks (e.g., ISOLET with 617 features, mammography with IR=42). MESA also shows high sampling efficiency and fast convergence: compared with the baselines, it needs only a small number of training instances to converge to a strong ensemble classifier. MESA also has a more stable training process. Oversampling-based EIL (OSB-EIL) methods perform resampling by analyzing and reinforcing the structure of the minority class data. When the dataset is small or highly imbalanced, the minority class is usually under-represented and lacks a clear structure. The performance of these OSB-EIL methods therefore becomes unstable in such cases.

Table 4: Cross-task transferability of metasamplers.


Cross-task transferability of the meta-sampler. An important feature of MESA is its cross-task transferability. Since the meta-sampler is trained on task-agnostic meta-data, it is not task-bounded and can be directly applied to new tasks. This gives MESA better scalability, as a pre-trained meta-sampler can be used directly on new tasks, greatly reducing the meta-training cost. To validate this, we use mammography and human proteins as two larger and highly imbalanced meta-test tasks, and then consider five meta-training tasks, including the original task (baseline), two sub-tasks with 50%/10% of the original training set, and two small tasks, optical digits and spectrometer.

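Table 4 reports the detailed results. It can be seen that the transferred meta-samplers generalize well on the meta-test tasks. Scaling down the number of meta-training instances has only a minor effect on the obtained meta-sampler, especially when the original task has a sufficient number of training samples (e.g., for human proteins, reducing the meta-training set to the 10% subset only results in a Δ of -0.10%/-0.34% when k = 10/20). Furthermore, meta-samplers trained on small tasks also show satisfactory performance (superior to other baselines) on new, larger, and even heterogeneous tasks, which validates the generality of the proposed MESA framework. Comprehensive cross-task/sub-task transferability tests and other additional experimental results are presented in Section A.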

5. Conclusion

We propose MESA, a new imbalanced learning framework. It contains a meta-sampler that adaptively selects training data to learn effective cascade ensemble classifiers from imbalanced data. Rather than following random heuristics, MESA directly optimizes its sampling strategy for better generalization performance. Compared with popular meta-learning IL solutions, MESA is a generic framework capable of working with various learning models. Our meta-sampler is trained on task-agnostic meta-data and can therefore be transferred to new tasks, which greatly reduces the meta-training cost. Experimental results show that MESA achieves good performance on a variety of tasks with high sampling efficiency. In future work, we plan to explore the potential of meta-knowledge-driven ensemble learning in long-tail multi-class classification problems.

6. Statement of the Potential Broader Impact

This paper studies the imbalanced learning (IL) problem, a common problem in machine learning and data mining. The problem widely exists in many real-world application fields, such as finance, security, biomedical engineering, industrial manufacturing, and information technology [15]. IL methods, including the MESA framework proposed in this paper, aim to fix the bias of learning models introduced by a skewed training class distribution. We believe that the proper use of these techniques will lead us to a better society. For example, better IL techniques can detect phishing websites/fraudulent transactions to protect people's property, and help doctors diagnose rare diseases/develop new drugs to save people's lives. That said, we are also aware that inappropriate use of these techniques can lead to negative consequences, as misclassification is inevitable in most learning systems. In particular, we note that when deploying IL systems in medical-related fields, misclassification (e.g., failure to identify a patient) may lead to medical errors. In such fields, these techniques should be used as auxiliary systems. For example, when performing diagnosis, we can adjust the classification threshold to achieve a higher recall rate and use the predicted probability as a reference for the doctor's diagnosis. Although there are some risks associated with IL research, as mentioned above, we believe that with proper use and monitoring, the negative impact of misclassification can be minimized and IL techniques can help people live a better life.

A. Additional Results

B. Implementation Details

C. Discussion

D. Visualization

Paper summary

Summary

This paper proposes MESA, a generic EIL framework that automatically learns its strategy, i.e., the meta-sampler, from data to optimize imbalanced classification. The main idea is to model a meta-sampler as an adaptive undersampling solution embedded in the iterative ensemble training process. In each iteration, it takes the current state of ensemble training (i.e., the distribution of classification errors on the training and validation sets) as its input. Based on this, the meta-sampler selects a subset to train a new base classifier, which is then added to the ensemble, yielding a new state. The meta-sampler is expected to maximize the final generalization performance by learning from such interactions. To this end, reinforcement learning (RL) is used to solve the non-differentiable optimization problem of the meta-sampler.
The paper makes the following contributions. (I) It proposes MESA, a generic EIL framework that demonstrates superior performance by automatically learning an adaptive undersampling strategy from data. (II) It carries out a preliminary exploration of extracting and using cross-task meta-information in EIL systems. The use of such meta-information gives the meta-sampler cross-task transferability. A pre-trained meta-sampler can be directly applied to new tasks, thereby greatly reducing the computational cost of meta-training. (III) Unlike popular approaches whose meta-learners are designed to be co-optimized with a specific learning model (i.e., a DNN) during training, MESA decouples the model-training and meta-training processes. This makes the framework generally applicable to most statistical and non-statistical learning models (e.g., decision trees, naive Bayes, k-nearest neighbor classifiers).

Inspiration

Questions

Reinforcement learning

There are two basic concepts in reinforcement learning: the environment and the agent.
The environment is the external world the agent acts in; in a game, it is the game environment.
The agent is the learner, i.e., the algorithm you write; in a game, it is the player.
The agent outputs an action to the environment according to its policy, and the environment returns an observation (the state value) and a reward to the agent, then moves to the next state. By repeating this loop, the agent eventually finds an optimal policy that collects as much reward from the environment as possible. The entire process is shown below:
[Figure: agent-environment interaction loop; image not available]
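The same loop can also be written as a few lines of code. The sketch below is a toy illustration only: `CoinEnv` and `Agent` are hypothetical classes invented for this example (a biased coin and an epsilon-greedy learner), not part of MESA or of any RL library.

```python
import random


class CoinEnv:
    """Hypothetical toy environment: the agent guesses the outcome of a biased coin."""

    def __init__(self, rng):
        self.rng = rng

    def reset(self):
        return 0  # a single dummy observation

    def step(self, action):
        outcome = 1 if self.rng.random() < 0.8 else 0  # the coin shows 1 with p = 0.8
        reward = 1.0 if action == outcome else 0.0
        return 0, reward  # next observation, reward


class Agent:
    """Keeps a running mean reward per action and follows an epsilon-greedy policy."""

    def __init__(self, rng, epsilon=0.1):
        self.rng, self.epsilon = rng, epsilon
        self.value = [0.0, 0.0]
        self.count = [0, 0]

    def act(self, observation):
        if self.rng.random() < self.epsilon:       # explore
            return self.rng.randrange(2)
        return 0 if self.value[0] > self.value[1] else 1  # exploit

    def learn(self, action, reward):
        self.count[action] += 1
        self.value[action] += (reward - self.value[action]) / self.count[action]


rng = random.Random(0)
env, agent = CoinEnv(rng), Agent(rng)
obs = env.reset()
for _ in range(2000):
    action = agent.act(obs)          # agent -> environment: action
    obs, reward = env.step(action)   # environment -> agent: observation and reward
    agent.learn(action, reward)      # the agent updates its policy
print(agent.value)
```

In MESA the roles map as follows: the environment is the ensemble training system, the observation is the meta-state $s_t$, the action is $\mu_t$, and the reward is the change in validation performance.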

Reference
Formula 3 Gaussian Model

Origin blog.csdn.net/deer2019530/article/details/128781232