Latest research from Hinton's team: label smoothing can significantly improve model accuracy, but how exactly should it be used?

Disclaimer: This is an original post by the blogger, released under the CC 4.0 BY-SA license. Please include the original source link and this statement when reproducing it.
Original post: https://blog.csdn.net/dQCFKyQDXYm3F8rB0/article/details/94926752


 

Author | Rafael Müller, Simon Kornblith, Geoffrey Hinton

Translator | Rachel

Editor | Jane

Produced by | AI Tech Base Camp (ID: rgznai100)

 

Overview: The loss function has a significant impact on how a neural network trains, and many researchers have searched for loss functions that make models perform better. Szegedy et al. later proposed label smoothing, which computes the cross-entropy against a weighted average of the dataset's hard targets and the uniform distribution, effectively improving model accuracy. Recently, Hinton's team analyzed how label smoothing affects neural networks in a new paper, "When Does Label Smoothing Help?", and characterized the behavior of the networks involved.

 

Before diving into the paper, let's quickly go over the key concepts:

 

  • What is a soft target, and how is it computed?

 

Using soft targets can often greatly improve the generalization ability and learning speed of multi-class neural networks. The soft targets in this paper are obtained by taking a weighted average of the hard targets and the uniform distribution over labels; this step is called label smoothing.
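As a concrete illustration, here is a minimal NumPy sketch of that weighted average (the class count K = 5 and smoothing weight alpha = 0.1 are placeholder values, not settings from the paper):

```python
import numpy as np

K, alpha = 5, 0.1                       # number of classes, smoothing weight
hard = np.array([0., 0., 1., 0., 0.])   # hard target: one-hot label for class 2
uniform = np.full(K, 1.0 / K)           # uniform distribution over the K labels

# Label smoothing: weighted average of the hard target and the uniform distribution.
soft = (1 - alpha) * hard + alpha * uniform
print(soft)                             # [0.02 0.02 0.92 0.02 0.02]
```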

 

  • What does label smoothing do?

 

Label smoothing effectively prevents model overfitting, and it has been applied in many state-of-the-art models for tasks such as image classification, machine translation, and speech recognition.

 

  • What problem does Hinton's study address?

 

The paper shows experimentally that label smoothing not only improves a model's generalization ability but also improves its calibration, which in turn improves beam search. However, the experiments also found that when a teacher model is trained with label smoothing, distilling its knowledge into a student model becomes less effective.

 

  • How does the paper explain these findings?

 

To explain this phenomenon, the paper visualizes how label smoothing affects the network's penultimate-layer representations, and finds that under label smoothing, the representations of training examples from the same class gather into tight clusters. As a result, information about how similar instances of different classes are to one another is lost, although the impact on the model's generalization and calibration is not obvious.

 


 

 

1 Introduction

 

The loss function has a significant impact on neural network training. After Rumelhart et al. applied backpropagation with a quadratic loss, many researchers showed that minimizing the cross-entropy with gradient descent yields better classification results. Yet the debate over loss functions has never stopped, and many believe other functions could replace the cross-entropy to achieve better results. Subsequently, Szegedy et al. proposed label smoothing, which computes the cross-entropy against a weighted average of the dataset's hard targets and the uniform distribution, effectively improving model accuracy.

 

Label smoothing has achieved good results in deep learning applications such as image classification, speech recognition, and machine translation, as shown in Table 1. In image classification, label smoothing was first used to improve Inception-v2 on the ImageNet dataset and has since been applied in many recent studies. In speech recognition, researchers have used label smoothing to reduce the word error rate on the WSJ dataset. In machine translation, label smoothing gives a small boost to BLEU scores.

 

[Table 1: Applications of label smoothing across three supervised learning tasks]

 

Although label smoothing has been used to good effect, few existing studies discuss the principles behind it or the scenarios in which it applies.

 

In this paper, Hinton et al. analyze how label smoothing affects neural networks and characterize the behavior of the networks involved. The paper's contributions are as follows:

 

  • It proposes a new visualization method based on a linear projection of the network's penultimate-layer activations;

  • It explains how label smoothing affects model calibration, showing that it aligns the confidence of the network's predictions more closely with their accuracy;

  • It demonstrates the effect of label smoothing on distillation, showing that this effect causes part of the information to be lost.

 

 

1.1 Preliminaries

 

This section gives a mathematical description of label smoothing. Write the network's prediction as a function of the penultimate-layer activations:

 

$$p_k = \frac{e^{x^\top w_k}}{\sum_{l=1}^{K} e^{x^\top w_l}}$$

 

where p_k is the probability the model assigns to class k, w_k denotes the weights and bias of the network's last layer for class k, and x is the vector containing the penultimate-layer activations. When training the network with hard targets, we minimize the cross-entropy between the true labels y_k and the network's outputs p_k:

 

$$H(y, p) = -\sum_{k=1}^{K} y_k \log p_k$$

 

where y_k is 1 for the correct class and 0 otherwise. For a network trained with label smoothing parameter α, we instead compute and minimize the cross-entropy between the modified labels y_k^{LS} and the network's outputs p_k, where

 

$$y_k^{LS} = y_k(1 - \alpha) + \frac{\alpha}{K}$$
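Putting the two formulas together, a minimal NumPy sketch of the smoothed cross-entropy loss might look like this (the function and variable names are ours, not from the paper):

```python
import numpy as np

def smoothed_cross_entropy(logits, labels, alpha=0.1):
    """Cross-entropy against labels smoothed with parameter alpha.

    logits: (n, K) array of raw class scores; labels: (n,) integer class indices.
    """
    n, K = logits.shape
    # Numerically stable log-softmax.
    z = logits - logits.max(axis=1, keepdims=True)
    log_p = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    # Smoothed targets: y_k^LS = y_k * (1 - alpha) + alpha / K.
    y = np.full((n, K), alpha / K)
    y[np.arange(n), labels] += 1.0 - alpha
    return -(y * log_p).sum(axis=1).mean()
```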

 

 

 

2 Penultimate-Layer Representations

 

For a network trained with label smoothing parameter α, the difference between the logit of the correct class and the logits of the incorrect classes is driven toward a constant that depends on α. When training with hard labels, by contrast, the correct-class logit is pushed to be much larger than the incorrect ones, and the logits of different incorrect classes can also differ substantially from one another. In general, the logit for class k can be viewed in terms of the squared Euclidean distance between the penultimate-layer activation x and a class template w_k:

 

$$\|x - w_k\|^2 = \|x\|^2 - 2\,x^\top w_k + \|w_k\|^2$$

 

Since ‖x‖² is shared by all classes and ‖w_k‖² is typically roughly constant across classes, a larger logit x⊤w_k corresponds to a smaller squared distance. Label smoothing therefore encourages each penultimate-layer activation to be close to the template of its correct class and equally distant from the templates of the incorrect classes. To observe this property, the paper proposes a new visualization scheme: (1) pick three classes; (2) find an orthonormal basis of the plane passing through the templates of those three classes; (3) project the penultimate-layer activations of the examples onto that plane.
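A rough NumPy sketch of this projection, assuming the three class templates w1, w2, w3 (rows of the final layer's weight matrix) and the activations have already been extracted; the Gram-Schmidt construction below is our reading of step (2), not code from the paper:

```python
import numpy as np

def plane_basis(w1, w2, w3):
    """Orthonormal basis of the plane through three class templates."""
    u = w2 - w1
    u = u / np.linalg.norm(u)
    v = w3 - w1
    v = v - (v @ u) * u              # Gram-Schmidt: remove the component along u
    v = v / np.linalg.norm(v)
    return u, v

def project_activations(acts, w1, u, v):
    """Project penultimate-layer activations (n, d) onto the 2-D plane."""
    centered = acts - w1             # measure positions relative to one template
    return np.stack([centered @ u, centered @ v], axis=1)
```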

 

Figure 1 shows the penultimate-layer activations for image classification on the CIFAR-10, CIFAR-100, and ImageNet datasets, trained with the AlexNet, ResNet-56, and Inception-v4 architectures respectively. The first two columns show models trained without label smoothing; the last two columns show models trained with it. Table 2 shows the effect of label smoothing on model accuracy.

 

[Figure 1: Visualizations of penultimate-layer activations for the image classification tasks]

 

[Table 2: Top accuracies of models trained with and without label smoothing]

 

The first row visualizes CIFAR-10 with smoothing parameter 0.1, for the classes "airplane", "automobile", and "bird". The accuracies of these models are essentially the same. Notice that in the network trained with label smoothing, the clusters are much tighter.

 

The second row visualizes CIFAR-100 with a ResNet-56 model, for the classes "beaver", "dolphin", and "otter". In this experiment, the network trained with label smoothing achieved higher accuracy.

 

Finally, the paper ran experiments with Inception-v4 on ImageNet, using both semantically similar and semantically dissimilar classes. The third row uses semantically dissimilar classes: "tench", "meerkat", and "cleaver". The fourth row uses two semantically similar classes, "toy poodle" and "miniature poodle", together with a different class, "tench" (shown in blue). Semantically similar classes are hard to separate even on the training set, but label smoothing handles this case noticeably well.

 

These experimental results suggest that the effect of label smoothing on a model's representations is independent of architecture, dataset, and accuracy.

 

 

3 Implicit Model Calibration

 

Label smoothing effectively prevents model overfitting. In this section, the paper explores whether the technique can also improve a model's calibration, i.e., how well the confidence of its predictions matches their accuracy. To measure calibration, the paper computes the expected calibration error (ECE). It finds that label smoothing effectively reduces ECE and can be used to calibrate models.
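For reference, ECE is commonly estimated by binning predictions by confidence and taking a weighted average of the gap between accuracy and mean confidence in each bin. A minimal sketch follows (the 15 equal-width bins are a common convention, not necessarily the paper's exact setup):

```python
import numpy as np

def expected_calibration_error(confidences, is_correct, n_bins=15):
    """Weighted average of |accuracy - confidence| over equal-width bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            acc = is_correct[mask].mean()    # empirical accuracy in the bin
            conf = confidences[mask].mean()  # mean predicted confidence
            ece += (mask.sum() / n) * abs(acc - conf)
    return ece
```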

 

 

Image Classification

 

The left side of Figure 2 shows a reliability diagram for ResNet-56 trained on CIFAR-100, where the dashed line represents perfect calibration. The model trained with hard labels is clearly overconfident. To calibrate it, one can either raise the softmax temperature to 1.9 or apply label smoothing. As the green line shows, label smoothing with α = 0.05 yields a similar calibration effect. Both methods effectively reduce the ECE.

 

The paper also ran experiments on ImageNet, shown on the right side of Figure 2. The model trained with hard labels is again overconfident, with an ECE as high as 0.071. Temperature scaling with T = 1.4 reduces the ECE to 0.022, as shown by the blue line, while label smoothing with α = 0.1 reduces the ECE to 0.035.
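Temperature scaling itself is a simple post-hoc adjustment: divide the logits by a scalar T before the softmax, where T > 1 softens the output distribution without changing the predicted class. A minimal sketch (the function name is ours):

```python
import numpy as np

def softmax_with_temperature(logits, T=1.0):
    """Softmax over logits scaled by 1/T; T > 1 softens the distribution."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

For the ImageNet setting above, this would be called with T = 1.4.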

 

[Figure 2: Reliability diagrams]

 

 

Machine Translation

 

This section experiments with calibrating a Transformer-based network on an English-to-German translation task. Unlike image classification, in machine translation the network's output serves as the input to a beam search algorithm, which means the model's calibration directly affects accuracy.

 

The paper first compares the calibration of a model trained with hard labels against one trained with label smoothing (α = 0.1), as shown in Figure 3. The network trained with label smoothing is better calibrated than the one trained with hard labels.

 

[Figure 3: Reliability diagram for a Transformer trained on English-to-German translation]

 

Although label smoothing yields better calibration and higher BLEU scores, it worsens the negative log-likelihood (NLL). Figure 4 shows the effect of label smoothing on BLEU and NLL, with blue curves for BLEU and red curves for NLL. The leftmost plot shows the model trained with hard labels, the middle plot shows the model trained with label smoothing, and the rightmost plot shows how the NLL changes for both models. While label smoothing raises the BLEU score, it also degrades the NLL.

 

[Figure 4: Effect of Transformer calibration on BLEU and NLL]

 

 

4 Knowledge Distillation

 

This section studies the effect of label smoothing on distilling knowledge from a teacher model into a student model. The paper finds that although label smoothing improves the teacher's accuracy, a teacher trained with label smoothing produces a worse student than a teacher trained without it.
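For context, in knowledge distillation the student is trained to match the teacher's temperature-softened output distribution. Below is a minimal sketch of the distillation term alone, omitting the usual mixing with the hard-label loss; the default T = 2.0 is an arbitrary placeholder, not a value from the paper:

```python
import numpy as np

def _log_softmax(z):
    z = z - z.max(axis=-1, keepdims=True)          # numerical stability
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """Cross-entropy between temperature-softened teacher and student outputs."""
    p_teacher = np.exp(_log_softmax(teacher_logits / T))   # teacher's soft targets
    log_p_student = _log_softmax(student_logits / T)
    return -(p_teacher * log_p_student).sum(axis=-1).mean()
```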

 

The experiments were run on CIFAR-10. The authors trained a ResNet-56 teacher model and distilled its knowledge into a student model with an AlexNet architecture. They focused on four quantities:

 

  • the teacher model's accuracy

  • the student model's baseline accuracy

  • the student model's accuracy after distillation from a teacher trained with hard labels, where the labels used for distillation are adjusted via temperature scaling

  • the student model's accuracy after distillation at a fixed temperature T = 1.0 from a teacher trained with label smoothing

 

Figure 5 shows the results of these experiments. The authors first compared the teacher and student models without distillation: increasing α raises the teacher's accuracy but slightly lowers the student's performance.

 

[Figure 5: Distillation from ResNet-56 to AlexNet on CIFAR-10]

 

Next, the authors trained the teacher with hard labels and distilled at a range of temperatures, plotting the student's resulting accuracy at each temperature as the red dashed line. All the models distilled without label smoothing outperformed those distilled with it. Finally, the authors distilled the more accurate teacher trained with label smoothing into the student, shown as the blue dashed line. The student's performance did not improve noticeably and even degraded.

 

 

5 Conclusions and Future Work

 

Although many state-of-the-art models use label smoothing, the principles behind the method and the conditions under which it should be used have not been fully explored. This paper summarizes and explains the behavior of label smoothing across multiple applications, including how it makes the network's penultimate-layer activations cluster more tightly. To explore this question, the paper proposes a new low-dimensional visualization method.

 

While label smoothing improves model performance, it can have a negative impact on knowledge distillation. The paper argues that the reason is that label smoothing causes part of the information to be lost, a phenomenon that can be observed by computing the mutual information between the model's inputs and outputs. Based on this, the paper proposes a new research direction: the relationship between label smoothing and the information bottleneck.

 

Finally, the paper's experiments on how label smoothing affects model calibration help improve the interpretability of models.

 

Paper link: https://arxiv.org/pdf/1906.02629.pdf
