Paper Reading Notes: "Learning from Imbalanced Data"

Title: Learning from Imbalanced Data (2009)

Topic: research on imbalanced data

Abstract

  1. (the imbalanced learning problem) is a relatively new challenge.
    Imbalanced learning is a relatively new challenge.

  2. The imbalanced learning problem is concerned with the performance of learning algorithms in the presence of underrepresented data and severe class distribution skews.
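    The imbalanced learning problem concerns how learning algorithms perform when some classes are underrepresented and the class distribution is severely skewed.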

  3. Due to the inherent complex characteristics of imbalanced data sets, learning from such data requires new understandings, principles, algorithms, and tools to transform vast amounts of raw data efficiently into information and knowledge representation.
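    Because imbalanced data sets are inherently complex, learning from them calls for new understandings, principles, algorithms, and tools.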

1 INTRODUCTION

  1. The fundamental issue with the imbalanced learning problem is the ability of imbalanced data to significantly compromise the performance of most standard learning algorithms.
    The fundamental issue is that imbalanced data can significantly degrade the performance of most standard learning algorithms.

  2. Most standard algorithms assume or expect balanced class distributions or equal misclassification costs.
    Most standard algorithms assume or expect either a balanced class distribution or equal misclassification costs.

  3. Workshop on Learning from Imbalanced Data Sets (AAAI ’00)
    The International Conference on Machine Learning workshop on Learning from Imbalanced Data Sets (ICML ’03)
    These are two workshops dedicated to learning from imbalanced data sets.

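  4. Section 1: a general introduction to the long-term development of research on the imbalanced learning problem;
    Section 2: the nature of the imbalanced learning problem;
    Section 3: a critical review of innovative research on the imbalanced learning problem, covering sampling methods, cost-sensitive learning, kernel-based learning methods, and active learning methods;
    Section 4: assessment metrics suggested for evaluating performance on the imbalanced learning problem;
    Section 5: opportunities and challenges for research in the field;
    Section 6: conclusion.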

2 NATURE OF THE PROBLEM

  1. Technically speaking, any data set that exhibits an unequal distribution between its classes can be considered imbalanced.
    Strictly speaking, any data set with an unequal distribution across its classes can be considered imbalanced.

  2. However, the common understanding in the community is that imbalanced data correspond to data sets exhibiting significant, and in some cases extreme, imbalances.
    In common usage, however, "imbalanced data" means data sets with significant, and in some cases extreme, imbalance.

  3. Although this description would seem to imply that all between-class imbalances are innately binary (or two-class), we note that there are multiclass data in which imbalances exist between the various classes.
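    Although this description seems to imply that between-class imbalance is inherently binary (two-class), imbalance also exists between the classes of multiclass data.
    ps: imbalance usually refers to two classes, but it also occurs among multiple classes.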

  4. We present an example from biomedical applications. Analyzing the images in a binary sense, the natural classes (labels) that arise are “Positive” or “Negative” for an image representative of a “cancerous” or “healthy” patient, respectively.
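    An example from biomedical applications: when the images are analyzed as a binary problem, the natural labels are "Positive" for an image from a cancerous patient and "Negative" for an image from a healthy patient.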

    Class    Positive    Negative
    Count    260         10,923
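
To make the scale of this imbalance concrete, the short Python sketch below works out the class ratio from the two counts in the table. The all-"Negative" baseline is a hypothetical classifier introduced here purely for illustration (it is not part of the notes); it shows how a high overall accuracy can coexist with zero recall on the minority class.

```python
# Class counts taken from the table above.
n_positive = 260      # minority class ("cancerous")
n_negative = 10_923   # majority class ("healthy")
n_total = n_positive + n_negative

# Imbalance ratio: majority samples per minority sample.
imbalance_ratio = n_negative / n_positive

# Share of the data set that belongs to the minority class.
minority_share = n_positive / n_total

# Hypothetical baseline (an assumption for illustration): a classifier
# that labels every image "Negative" is right on every negative sample
# and wrong on every positive sample.
baseline_accuracy = n_negative / n_total
baseline_minority_recall = 0 / n_positive

print(f"imbalance ratio          : {imbalance_ratio:.1f} : 1")
print(f"minority share           : {minority_share:.2%}")
print(f"all-negative accuracy    : {baseline_accuracy:.2%}")
print(f"all-negative pos. recall : {baseline_minority_recall:.2%}")
```

With these counts the ratio is roughly 42 to 1, and the do-nothing baseline scores close to 98% accuracy while detecting no cancerous case at all.
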
  1. In the medical industry, the ramifications of such a consequence can be overwhelmingly costly, more so than classifying a noncancerous patient as cancerous.
    Misclassifying a patient with cancer as healthy is far more costly than misclassifying a healthy patient as having cancer.

  2. We require a classifier that will provide high accuracy for the minority class without severely jeopardizing the accuracy of the majority class.
    We therefore need a classifier that is accurate on the minority class without severely harming accuracy on the majority class.

  3. This also suggests that the conventional evaluation practice of using singular assessment criteria, such as the overall accuracy or error rate, does not provide adequate information in the case of imbalanced learning.
    A single conventional criterion, such as overall accuracy or error rate, does not give enough information under imbalanced learning.

  4. Therefore, more informative assessment metrics, such as the receiver operating characteristics curves, precision-recall curves, and cost curves, are necessary for conclusive evaluations of performance in the presence of imbalanced data.
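    Evaluation on imbalanced data therefore needs more informative metrics, such as receiver operating characteristic (ROC) curves, precision-recall (PR) curves, and cost curves.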

  5. Imbalances of this form are commonly referred to as intrinsic, i.e., the imbalance is a direct result of the nature of the dataspace.
    An imbalance that results directly from the nature of the dataspace is called intrinsic.

  6. Variable factors such as time and storage also give rise to data sets that are imbalanced. Imbalances of this type are considered extrinsic, i.e., the imbalance is not directly related to the nature of the dataspace.
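    An imbalance caused by factors such as time and storage is called extrinsic; it is not directly related to the nature of the dataspace.
    ps: imbalance is either intrinsic or extrinsic.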

  7. For instance, suppose a data set is procured from a continuous data stream of balanced data over a specific interval of time, and if during this interval, the transmission has sporadic interruptions where data are not transmitted, then it is possible that the acquired data set can be imbalanced in which case the data set would be an extrinsic imbalanced data set attained from a balanced dataspace.
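    Suppose a data set is collected over a specific time interval from a continuous stream of balanced data. If transmission is sporadically interrupted during that interval, the collected data set may be imbalanced; it would then be an extrinsically imbalanced data set obtained from a balanced dataspace.
    ps: how data are acquired or stored can itself cause imbalance.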

  8. Relative imbalances arise frequently in real-world applications and are often the focus of many knowledge discovery and data engineering research efforts.
    Relative imbalance is common in real-world applications and is a focus of much research.
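
Items 3 and 4 of the list above argue that overall accuracy says little under imbalance and that precision and recall are more informative. The Python sketch below illustrates this with a made-up confusion matrix: the class totals (260 positive, 10,923 negative) come from the table earlier in these notes, but the split into true and false predictions, the `ConfusionCounts` type, and the helper functions are all assumptions chosen for illustration, not results from the paper.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts for a binary classifier; 'positive' is the minority class."""
    tp: int  # positives predicted positive
    fn: int  # positives predicted negative
    fp: int  # negatives predicted positive
    tn: int  # negatives predicted negative


def accuracy(c: ConfusionCounts) -> float:
    # Fraction of all samples that were classified correctly.
    return (c.tp + c.tn) / (c.tp + c.fn + c.fp + c.tn)


def recall(c: ConfusionCounts) -> float:
    # Fraction of actual positives that were found.
    return c.tp / (c.tp + c.fn)


def precision(c: ConfusionCounts) -> float:
    # Fraction of predicted positives that are actually positive.
    return c.tp / (c.tp + c.fp)


# Hypothetical outcome: only half of the 260 positives are detected,
# and 100 of the 10,923 negatives raise a false alarm.
counts = ConfusionCounts(tp=130, fn=130, fp=100, tn=10_823)

print(f"accuracy : {accuracy(counts):.3f}")
print(f"recall   : {recall(counts):.3f}")
print(f"precision: {precision(counts):.3f}")
```

In this invented scenario accuracy stays near 0.98 even though half of the cancerous cases are missed, which is exactly the gap that recall and precision expose.
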


If you spot any problems, corrections are welcome. These notes are still being updated…


Reposted from blog.csdn.net/Naruto_8/article/details/120799662