Paper Deep Dive | Equiangular Basis Vectors (EBVs) on MindSpore Significantly Outperform Traditional Classifiers

**Author:** Li Ruifeng

**Paper title:** Equiangular Basis Vectors

**Paper source:** CVPR 2023

**Paper link:** https://arxiv.org/abs/2303.11637

**Code link:** https://github.com/msfuxian/EBV

As an open-source AI framework, MindSpore brings industry, academia, and developers device-edge-cloud full-scenario collaboration, minimalist development, ultimate performance, ultra-large-scale AI pre-training, and a safe and trustworthy experience. Since being open-sourced on March 28, 2020, it has exceeded 5 million downloads. MindSpore has supported hundreds of papers at top AI conferences, entered the curricula of more than 100 universities, and gone commercial in more than 5,000 apps through HMS. It has a large developer base and is gradually seeing wide adoption across device, edge, cloud, and vehicle scenarios, including AI computing centers, finance, intelligent manufacturing, cloud, wireless, data communications, energy, the consumer "1+8+N" ecosystem, and intelligent vehicles, and it is the open-source software with the highest Gitee index. Everyone is welcome to participate in open-source contributions, toolkits, model crowdsourcing, industry innovation and applications, algorithm innovation, academic cooperation, and AI book collaboration, and to contribute application cases on the cloud, device, and edge sides and in the security domain.

With broad support for MindSpore from the technology, academic, and industrial communities, AI papers based on MindSpore accounted for 7% of those across all AI frameworks in 2023, ranking second in the world for the second consecutive year. We thank CAAI and the many university teachers for their support, and we will keep working together on AI research innovation. The MindSpore community supports research for top-conference papers and continues to build original AI achievements. From time to time I will select outstanding papers to share and interpret, in the hope that more experts from industry, academia, and research will collaborate with MindSpore to advance original AI research; the MindSpore community will continue to support AI innovation and AI applications. This article is the 17th in the MindSpore top AI conference paper series. I selected a paper from Professor Wei Xiushen's team at the School of Computer Science and Engineering, Nanjing University of Science and Technology, for interpretation, and I thank the experts, professors, and students for their submissions.

MindSpore aims at three goals: easy development, efficient execution, and full-scenario coverage. In actual use, this deep learning framework is developing rapidly, and the design of its APIs keeps being optimized toward being more reasonable, more complete, and more powerful. In addition, the emerging development tools around it help the ecosystem provide more convenient and powerful development workflows. For example, MindSpore Insight can present the model architecture as a graph and dynamically monitor the changes of metrics and parameters while the model runs, making development more convenient.

The problem this article studies is classification with a very large number of categories, e.g., 100,000 or 1,000,000 classes. For a network like ResNet-50, the final linear layer for such a problem requires 2048×100,000 or 2048×1,000,000 parameters, which makes the fully connected (fc) layer larger than the entire feature-extraction backbone that precedes it.
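To make the scale concrete, here is a back-of-the-envelope comparison (a quick sketch; the ~25.6M backbone figure is the commonly quoted ResNet-50 parameter count, used here as an assumption):

```python
# Rough parameter-count comparison for a ResNet-50-style classifier.
feature_dim = 2048            # output dimension of the ResNet-50 backbone
backbone_params = 25_600_000  # commonly quoted ResNet-50 size (assumption)

for num_classes in (1_000, 100_000, 1_000_000):
    fc_params = feature_dim * num_classes  # weight matrix alone, bias ignored
    ratio = fc_params / backbone_params
    print(f"N={num_classes:>9,}: fc = {fc_params/1e6:7.1f}M params ({ratio:5.1f}x backbone)")
```

At N = 100,000 the classifier alone is roughly 205M parameters, already several times the backbone; at one million classes it exceeds 2B.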

On the other hand, general classification problems choose one-hot vectors as labels, which can be understood as an orthogonal basis in which the angle between any two vectors is 90 degrees. At the end of 2021, a paper in the Annals of Mathematics showed that, for a fixed given angle, as the dimension d tends to infinity the maximum number of equiangular lines at that angle is linear in d (see "Equiangular lines with a fixed angle").
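For reference, the easiest-to-state special case of that theorem is the following (paraphrased; see the original paper for the general spectral-radius condition): for a common angle $\arccos\frac{1}{2k-1}$ with integer $k \ge 2$ and all sufficiently large dimensions $d$, the maximum number of equiangular lines in $\mathbb{R}^d$ is

$$N_{\frac{1}{2k-1}}(d) = \left\lfloor \frac{k(d-1)}{k-1} \right\rfloor.$$

For example, $k = 2$ (angle $\arccos\frac{1}{3} \approx 70.5^{\circ}$) gives $2(d-1)$ lines, linear in $d$ as stated above.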

Therefore, if the angles must be exactly equal, d has to be very large once the number of categories grows. The starting idea of this paper is to relax the angle constraint: when the angles are constrained to roughly 83-97 degrees (symmetric about 90 degrees), a 5000-dimensional space can already hold basis vectors for 100,000 categories without noticeably hurting classification performance, and the corresponding dataset has been open-sourced. In addition, when the required angle is 0, infinitely many such basis vectors exist in the space (they all point in the same direction), so a solution trivially exists; but for a general α, the exact relationship between α, the space dimension, and the number of such vectors has no closed-form mathematical solution, with solutions known only in special cases; see the book "Sparse and Redundant Representations: From Theory to Applications in Signal and Image Processing". As for the classification code, it follows the sample provided in the official MindSpore documentation and only the dataset needs to be changed, which is very convenient.

01

Research Background

The field of pattern classification aims to assign input signals to two or more categories. In recent years, deep learning models have brought breakthroughs in processing images, videos, audio, text, and other data. Aided by rapid improvements in hardware, today's deep learning methods can easily fit a million images, overcoming the earlier hurdle of poor handcrafted feature quality in pattern classification tasks. Many deep learning based methods have sprung up to solve classification problems in various scenarios and settings, such as remote sensing, few-shot learning, and long-tailed problems.

Figure 1 illustrates some typical classification task paradigms. Currently, a large number of deep learning methods use trainable fully connected layers combined with softmax as classifiers. However, since the number of categories is fixed, such a classifier scales poorly, and its number of trainable parameters grows as the number of categories increases. For example, the memory consumption of the fully connected layer increases linearly with the number of categories N, as does the cost of the matrix multiplication between the fully connected layer and the d-dimensional features. Methods based on classical metric learning must consider all training samples to design positive/negative sample pairs and then optimize a class center for each category, which requires substantial extra computation on large-scale datasets, especially for pre-training tasks.


Figure 1 Comparison between typical classification paradigms and EBVs

1. Classifier ending with a k-way fully connected layer and softmax. As more categories are added, the trainable parameters of the classifier grow linearly.

2. Taking "Triplet embedding" as an example of classic metric learning methods: when M images are given, its complexity is O(M²); when a new category with m samples is added, the complexity increases to O((M+m)²).

3. Our proposed EBVs. EBVs pre-define fixed normalized embeddings for different categories. The trainable parameters of the network do not change as the number of categories increases, while the computational complexity only grows from O(M) to O(M+m).

02

Team Introduction

The Visual Intelligence & Perception (VIP) Group is headed by Professor Wei Xiushen. The team has published more than fifty papers in top international journals in related fields, such as IEEE TPAMI, IEEE TIP, IEEE TNNLS, IEEE TKDE, Machine Learning, and Science China Information Sciences, and at top international conferences such as NeurIPS, CVPR, ICCV, ECCV, IJCAI, and AAAI. Related work has won a total of 7 world championships in authoritative international competitions in the field of computer vision, including DIGIX 2023, SnakeCLEF 2022, iWildCam 2020, iNaturalist 2019, and Apparent Personality Analysis 2016.

03

Introduction to the paper

In this paper, we propose equiangular basis vectors (EBVs) to replace the commonly used classifiers in deep neural network classification tasks. EBVs pre-define a fixed normalized basis vector for each category. The angles between these basis vectors are equal and constrained to be as close to mutually orthogonal as possible. Specifically, on a d-dimensional unit hypersphere, EBVs define one d-dimensional normalized embedding on the surface of the hypersphere for each category in the classification task; we call these embeddings basis vectors. The spherical distance of every basis-vector pair satisfies a predefined rule that makes the relationship between any two basis vectors as close to orthogonal as possible, with similar angles. To keep the trainable parameters of the deep neural network constant as the number of categories increases, we then give the definition of EBVs based on two mathematical problems: the Tammes problem and equiangular lines.

First, we give the specific definition of EBVs. We know that $d$ mutually orthogonal basis vectors span the $d$-dimensional Euclidean space $\mathbb{R}^d$, and if two vectors are orthogonal, mathematically we consider them uncorrelated. However, such a $d$-dimensional space can accommodate at most $d$ orthogonal basis vectors, i.e. at most $N = d$ categories, which cannot meet the requirement of reducing memory for large-scale classification. Therefore, we relax the angular relationship between different basis vectors. Assume the basis vectors lie on the unit hypersphere $S^{d-1} \subset \mathbb{R}^d$, take $\alpha \in [0, 1)$, and constrain the angle between any two basis vectors to lie in $[\arccos\alpha,\ \pi - \arccos\alpha]$, i.e. the absolute cosine is at most $\alpha$. For a given number of categories $N$, find the minimum $\alpha$ that satisfies the conditions; or, for an acceptable $\alpha$, find the range of category numbers $N$ the space can hold. This completes the definition of EBVs. Mathematically, it amounts to finding a set of equiangular basis vectors $\mathcal{W}$ satisfying:

$$\mathcal{W} = \left\{\, \mathbf{w}_1, \ldots, \mathbf{w}_N \;\middle|\; \mathbf{w}_i \in \mathbb{R}^d,\ \|\mathbf{w}_i\|_2 = 1,\ \left|\mathbf{w}_i^{\top}\mathbf{w}_j\right| \le \alpha,\ \forall\, i \ne j \,\right\}$$

Here $\mathbf{w}_i$ and $\mathbf{w}_j$ denote two different basis vectors in $\mathcal{W}$ ($i, j \in [N]$, $i \ne j$), and $\|\cdot\|_2$ denotes the Euclidean norm. Then, assuming $d_s(\cdot,\cdot)$ is a metric function of spherical distance on the unit hypersphere, for any query feature vector $\mathbf{u}$, its correlation with the basis set $\mathcal{W}$ can be expressed as:

$$i^{*} = \arg\min_{i \in [N]} d_s(\mathbf{u}, \mathbf{w}_i)$$

Here $\mathbf{w}_i$ denotes the $i$-th of the $N$ basis vectors in the set $\mathcal{W}$, $i$ ranges over the subscripts of all basis vectors to be compared, and $[N] = \{1, 2, \ldots, N\}$.
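As a concrete illustration of this query rule, here is a minimal sketch, assuming PyTorch for brevity (the same logic ports directly to MindSpore). Since $\arccos$ is monotonically decreasing, minimizing the spherical distance is equivalent to maximizing the cosine similarity:

```python
import torch
import torch.nn.functional as F

def ebv_predict(features: torch.Tensor, basis: torch.Tensor) -> torch.Tensor:
    """Return, for each feature vector, the index of the basis vector with
    the smallest spherical distance (= largest cosine similarity).

    features: (B, d) raw feature vectors; basis: (N, d) unit-normalized EBVs.
    """
    u = F.normalize(features, dim=1)      # project features onto the unit hypersphere
    return (u @ basis.t()).argmax(dim=1)  # nearest basis vector per feature
```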

Next we give the generation method of EBVs. We randomly initialize a matrix $W \in \mathbb{R}^{N \times d}$ to represent the equiangular basis set $\mathcal{W}$, where $d$ is the dimension of each basis vector and $N$ is the number of required basis vectors. We then $\ell_2$-normalize each $d$-dimensional basis vector in $W$, so that any two basis vectors $\mathbf{w}_i$ and $\mathbf{w}_j$ lie on the unit hypersphere, i.e. $\|\mathbf{w}_i\|_2 = \|\mathbf{w}_j\|_2 = 1$. In this way, the spherical distance between $\mathbf{w}_i$ and $\mathbf{w}_j$ can be replaced by the cosine similarity $\mathbf{w}_i^{\top}\mathbf{w}_j$. During stochastic gradient descent, the gradient of any basis-vector pair that already satisfies the angle constraint is cut off through gradient clipping, while the remaining pairs continue to be optimized. The overall optimization objective can be expressed as:

$$\min_{\mathcal{W}} \sum_{i \ne j} \max\left(\left|\mathbf{w}_i^{\top}\mathbf{w}_j\right| - \alpha,\ 0\right)$$

That is, once a pair satisfies $\left|\mathbf{w}_i^{\top}\mathbf{w}_j\right| \le \alpha$, its gradient is truncated and it is no longer optimized.
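A minimal sketch of this generation procedure, again assuming PyTorch (the learning rate, step count, and default α are illustrative choices, not the paper's settings):

```python
import torch
import torch.nn.functional as F

def generate_ebvs(n: int, d: int, alpha: float = 0.12,
                  lr: float = 1e-2, steps: int = 100_000) -> torch.Tensor:
    """Optimize n unit vectors in R^d so that |w_i . w_j| <= alpha for i != j."""
    w = torch.randn(n, d, requires_grad=True)
    opt = torch.optim.SGD([w], lr=lr)
    off_diag = 1.0 - torch.eye(n)           # mask that zeroes the diagonal
    for _ in range(steps):
        wn = F.normalize(w, dim=1)          # rows live on the unit hypersphere
        cos = (wn @ wn.t()) * off_diag      # pairwise cosine similarities, i != j
        # Pairs already satisfying |cos| <= alpha contribute zero loss, so
        # their gradient is cut off exactly as described above.
        loss = F.relu(cos.abs() - alpha).sum()
        if loss.item() == 0.0:              # all constraints met: done
            break
        opt.zero_grad()
        loss.backward()
        opt.step()
    return F.normalize(w.detach(), dim=1)
```

With $d = 5000$, choosing $\alpha \approx \cos 83^{\circ} \approx 0.12$ matches the 83-97 degree window mentioned earlier. Note the cosine matrix is N×N, so generating 100,000 vectors this way is memory-hungry; the official repository is the reference for large N.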

Finally, we give the optimization method when EBVs are used in classification tasks. Assume the $N$ categories contain a total of $M$ data samples $\{x_i\}_{i=1}^{M}$ with corresponding labels $\{y_i\}_{i=1}^{M}$, where $x_i$ denotes a data sample and $y_i$ its label. The feature vector corresponding to $x_i$ can be expressed as $\mathbf{u}_i = \phi(x_i; \theta)$, where $\phi$ denotes a feature extractor, usually understood as the deep neural network to be optimized, and $\theta$ denotes the parameters of the feature extractor to be optimized. The probability that the feature vector $\mathbf{u}_i$ corresponding to data $x_i$ is assigned to category $j$ can then be expressed as:

$$P(y_i = j \mid \mathbf{u}_i) = \frac{\exp\left(\mathbf{w}_j^{\top}\mathbf{u}_i\right)}{\sum_{k=1}^{N}\exp\left(\mathbf{w}_k^{\top}\mathbf{u}_i\right)}$$

Here $\mathbf{w}_j^{\top}$ represents the transpose of the $j$-th category weight $\mathbf{w}_j$. Since each basis vector in the set $\mathcal{W}$ has already been $\ell_2$-normalized during the generation of EBVs, it directly replaces the trainable category weight in the softmax above. Finally, the objective function of EBVs can be obtained:

$$P(y_i = j \mid \mathbf{u}_i) = \frac{\exp\left(\tau\,\mathbf{w}_j^{\top}\bar{\mathbf{u}}_i\right)}{\sum_{k=1}^{N}\exp\left(\tau\,\mathbf{w}_k^{\top}\bar{\mathbf{u}}_i\right)}, \qquad \bar{\mathbf{u}}_i = \frac{\mathbf{u}_i}{\|\mathbf{u}_i\|_2}$$

Here $\bar{\mathbf{u}}_i$ denotes the $\ell_2$-normalization of the corresponding feature vector $\mathbf{u}_i$, and $\tau$ is a hyperparameter used to reduce the difficulty of optimization. The optimization goal then becomes maximizing the joint probability $\prod_{i=1}^{M} P(y_i \mid x_i)$, i.e. the probability that each feature vector produced by the feature extractor is assigned to its ground-truth category, which can finally be rewritten as minimizing the following negative log-likelihood:

$$\min_{\theta}\ -\sum_{i=1}^{M}\log\frac{\exp\left(\tau\,\mathbf{w}_{y_i}^{\top}\bar{\mathbf{u}}_i\right)}{\sum_{k=1}^{N}\exp\left(\tau\,\mathbf{w}_k^{\top}\bar{\mathbf{u}}_i\right)}$$
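Putting the pieces together, here is a minimal sketch of the resulting classification head (assuming PyTorch; the fixed basis comes from the generation step above, and τ = 25 is an illustrative value for the scale hyperparameter, not the paper's):

```python
import torch
import torch.nn.functional as F

class EBVHead(torch.nn.Module):
    """Fixed-EBV classification head: no trainable parameters of its own."""

    def __init__(self, basis: torch.Tensor, tau: float = 25.0):
        super().__init__()
        # Registered as a buffer: saved with the model but never updated,
        # so the head's size is independent of the number of categories N.
        self.register_buffer("basis", F.normalize(basis, dim=1))
        self.tau = tau  # scale hyperparameter easing optimization (illustrative)

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        u = F.normalize(features, dim=1)        # \bar{u}_i in the formula above
        return self.tau * (u @ self.basis.t())  # scaled cosine logits

# Training minimizes the negative log-likelihood above via standard
# cross-entropy on these logits:
#   logits = head(backbone(images))
#   loss = F.cross_entropy(logits, labels)
```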

04

Experimental results

We have conducted comparative experiments on the classification task on the ImageNet-1K dataset, instance segmentation and object detection on the MS COCO dataset, semantic segmentation on the ADE20K dataset, and a large number of downstream classification tasks. Here we only take the classification results on ImageNet-1K as an example to illustrate the effectiveness of the method. To prove the effectiveness of the proposed EBVs, our baseline follows the state-of-the-art training recipes provided by TorchVision. We use three different training settings:

1. Setting A0 represents the training settings in the original ResNet paper;

2. Setting A1 uses a cosine-decay learning-rate scheduler with a warmup training strategy, together with weight decay and the TrivialAugment augmentation strategy;

3. Setting A2 adds label smoothing, cutmix, and mixup on top of A1.

As shown in Table 1, the experimental results show that EBVs achieve a clear improvement over traditional classifiers under the same experimental settings.

Table 1 Comparison results on the ImageNet-1K validation set


05

Summary and Outlook

This paper proposes a new paradigm for classification tasks: equiangular basis vectors (EBVs). In deep neural networks, models usually handle classification tasks with a k-way fully connected layer plus softmax, and the learning goal of such methods can be summarized as mapping the learned feature representations to the label space of the samples. In metric learning methods, the learning goal can be summarized as learning a mapping function that maps the training data points from the original space to a new space in which samples of the same class are closer and samples of different classes are farther apart. Different from the above methods, EBVs pre-define a fixed normalized basis vector for each category; in this pre-definition process, the angles between the basis vectors are equal and constrained to be as close to mutually orthogonal as possible. In the training stage, these basis vectors directly serve as fixed mapping targets for the samples of the different categories, and the learning goal of EBVs changes to minimizing the spherical distance between the image feature embedding and the predefined basis vector of its category. In the inference stage, since each category is bound to a fixed basis vector, the label of an image can be determined by the minimum spherical distance between the image's feature embedding and all basis vectors. Since this is a classification problem, training can be completed very quickly following MindSpore's official sample code.
