CVPR'2023 | PA&DA: Supernet Consistent NAS with Jointly Optimized Path and Data Sampling

This article was first published on the WeChat public account CVHub and may not be reproduced on other platforms in any form. It is intended for learning and communication only; violators will be held accountable!

Title: PA&DA: Jointly Sampling PAth and DAta for Consistent NAS

Paper: https://arxiv.org/pdf/2302.14772.pdf

Code: https://github.com/ShunLu91/PA-DA

Introduction

Based on the weight-sharing mechanism, one-shot NAS methods train a supernet and then evaluate and rank sub-networks by letting them inherit the supernet's weights, which greatly reduces the search cost. However, previous studies have pointed out that the shared weights receive conflicting gradient descent directions during training. This paper further finds that a large gradient variance arises during supernet training, which degrades the ranking consistency of the supernet. To alleviate this problem, the paper explicitly minimizes the gradient variance of supernet training by jointly optimizing the sampling distributions of paths and data (PA&DA). The paper theoretically derives the relationship between the gradient variance and the two sampling distributions, showing that the optimal sampling probability is proportional to the normalized gradient norm of the paths and the training data.

Optimizing the two sampling distributions incurs only negligible computational overhead, yet it yields lower gradient variance during supernet training, which gives the supernet better generalization and thus more consistent NAS. The paper provides a comprehensive comparison with other improved methods on various search spaces. The results show that the proposed method outperforms the others with more reliable ranking performance and higher accuracy of the searched architectures, demonstrating its effectiveness.

Contributions

The trend of KT and GV (KT: Kendall's Tau, GV: Gradient Variance)

The paper conducts experiments on NAS-Bench-201 with CIFAR-10, trains the supernet with the SPOS algorithm, and gradually increases the number of candidate operations on each edge of the supernet. It records the average gradient variance of all candidate operation parameters during training, and evaluates the sub-network ranking consistency of the supernet by ranking the same 64 sub-networks.

As shown in the figure above, the more sub-models the supernet contains, the larger the gradient variance and the worse the ranking consistency. These results suggest that a large gradient variance during training harms the supernet's ranking consistency. By using the normalized gradient norm as an importance metric and applying importance sampling to paths and data during supernet training, the gradient variance of the supernet can be reduced and its ranking consistency improved.

The main contributions of the paper are as follows:

  • The paper verifies that the weight-sharing mechanism of supernet training leads to a large gradient variance, which damages the performance of the supernet and deteriorates its ranking consistency.

  • By deriving the relationship between the supernet gradient variance and the sampling distributions, the paper proposes to explicitly minimize the gradient variance during supernet training by jointly optimizing the path and data sampling distributions. It finds that the optimal sampling probability is proportional to the normalized gradient norm of the paths and data, and applies importance sampling to both during supernet training.

  • The method requires only negligible computation to perform path and data importance sampling, and needs no tedious hyperparameter tuning. It achieves the highest Kendall's Tau of 0.713 on NAS-Bench-201 and superior performance on the DARTS and ProxylessNAS search spaces.

Method

Our supernet training framework

Sampling-based One-Shot NAS

Sampling-based one-shot NAS is generally divided into two stages, supernet training and sub-network search:

Stage 1 (Training stage): Build a supernet $\mathcal{N}$ with weights $\mathcal{W}$. During training, a sub-network $\alpha$ is sampled from the discrete distribution $p(\mathcal{A})$ and inherits its weights from the supernet, so each step only updates the weights $\mathcal{W}_{\alpha}$ contained in the sampled sub-network $\alpha$.

The goal of this stage is to obtain the optimal supernet weights $\mathcal{W}^{*}$ by continuously and iteratively training the sampled sub-models.

Stage 2 (Searching stage): Continuously sample sub-networks from the trained supernet and evaluate them on the validation set. A heuristic search algorithm can be used here to find the optimal sub-model $\alpha^{*}$.
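In the notation above, these two stages can be written compactly as follows (a standard formulation for sampling-based one-shot NAS; the paper's exact constraints may differ in detail):

$$\mathcal{W}^{*}=\underset{\mathcal{W}}{\arg \min }\ \mathbb{E}_{\alpha \sim p(\mathcal{A})}\left[\mathcal{L}_{\text{train}}\left(\mathcal{N}\left(\alpha, \mathcal{W}_{\alpha}\right)\right)\right], \qquad \alpha^{*}=\underset{\alpha \in \mathcal{A}}{\arg \max }\ \operatorname{Acc}_{\text{val}}\left(\mathcal{N}\left(\alpha, \mathcal{W}_{\alpha}^{*}\right)\right).$$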

The paper attempts to reduce the gradient variance of the supernet during training in order to improve its convergence and ranking consistency. To this end, it proposes to jointly optimize the path sampling distribution $p(\mathcal{A})$ and the training data sampling distribution $q(\mathbb{D}_T)$ during supernet training.
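Up to the paper's exact notation, this joint optimization can be sketched as the constrained problem

$$\min _{\mathbf{p},\, \mathbf{q}}\ d(\mathbf{p})+d(\mathbf{q}) \quad \text{s.t.} \quad \sum_{i} p_{i}=1, \ \sum_{j} q_{j}=1, \ p_{i} \geq 0, \ q_{j} \geq 0,$$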

where $d(\mathbf{p})$ and $d(\mathbf{q})$ denote the gradient variance as a function of the path and the data sampling distribution, respectively. Below, the paper derives the relationship between the gradient variance and these two distributions and optimizes them alternately.

Path Importance Sampling

At the $i$-th training step, a sub-model $\alpha_i$ is sampled from the path sampling distribution $p(\mathcal{A})$ with probability $p_i$, which yields a stochastic gradient for the shared weights.
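In the notation used below, where $N$ denotes the number of sampled terms and $(x_i, y_i)$ the training sample drawn at step $i$, this gradient with the standard importance-sampling weight reads

$$d=\frac{1}{N p_{i}} \nabla_{\mathcal{W}} \mathcal{L}\left(\mathcal{N}\left(x_{i}, \alpha_{i} ; \mathcal{W}_{\alpha_{i}}\right), y_{i}\right),$$

where the factor $1/(N p_{i})$ keeps the gradient estimate unbiased (a form consistent with the expression for $\sqrt{\lambda}$ given further below).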

The paper seeks to minimize the variance of this stochastic gradient by optimizing the sampling distribution $\mathbf{p}$. Since $\operatorname{Var}[d]=\mathbb{E}\left[\|d\|^{2}\right]-\|\mathbb{E}[d]\|^{2}$ and $\mathbb{E}[d]$ does not depend on the path sampling distribution $\mathbf{p}$, the problem can be reformulated as a constrained minimization of $\mathbb{E}\left[\|d\|^{2}\right]$ alone.
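In the same notation, and up to the paper's exact formulation, the problem to be solved is then

$$\min _{\mathbf{p}} \sum_{i=1}^{N} \frac{\left\|\nabla_{\mathcal{W}} \mathcal{L}\left(\mathcal{N}\left(x_{i}, \alpha_{i} ; \mathcal{W}_{\alpha_{i}}\right), y_{i}\right)\right\|^{2}}{N^{2} p_{i}} \quad \text{s.t.} \quad \sum_{i=1}^{N} p_{i}=1, \ p_{i} \geq 0 .$$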

To solve this constrained optimization problem, the paper uses a Lagrange multiplier and converts it into an unconstrained problem.
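With a multiplier $\lambda$ attached to the normalization constraint, the Lagrangian takes the form (again up to notation)

$$\Psi(\mathbf{p}, \lambda)=\sum_{i=1}^{N} \frac{\left\|\nabla_{\mathcal{W}} \mathcal{L}\left(\mathcal{N}\left(x_{i}, \alpha_{i} ; \mathcal{W}_{\alpha_{i}}\right), y_{i}\right)\right\|^{2}}{N^{2} p_{i}}+\lambda\left(\sum_{i=1}^{N} p_{i}-1\right).$$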

Setting $\frac{\partial \Psi(\mathbf{p}, \lambda)}{\partial p_i}=0$ gives $p_i=\left\|\nabla_{\mathcal{W}} \mathcal{L}\left(\mathcal{N}\left(x_{i}, \alpha_{i} ; \mathcal{W}_{\alpha_{i}}\right), y_{i}\right)\right\| /(N \sqrt{\lambda})$; substituting this into the constraint $\sum_{i=1}^{N} p_{i}=1$ yields:

$$\sqrt{\lambda}=\sum_{i=1}^{N} \frac{\left\|\nabla_{\mathcal{W}} \mathcal{L}\left(\mathcal{N}\left(x_{i}, \alpha_{i} ; \mathcal{W}_{\alpha_{i}}\right), y_{i}\right)\right\|}{N}$$

Substituting $\sqrt{\lambda}$ back then gives the optimal sampling distribution $\mathbf{p}^{*}$.
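Up to notation, each path's optimal probability is its gradient norm normalized over all sampled paths:

$$p_{i}^{*}=\frac{\left\|\nabla_{\mathcal{W}} \mathcal{L}\left(\mathcal{N}\left(x_{i}, \alpha_{i} ; \mathcal{W}_{\alpha_{i}}\right), y_{i}\right)\right\|}{\sum_{j=1}^{N}\left\|\nabla_{\mathcal{W}} \mathcal{L}\left(\mathcal{N}\left(x_{j}, \alpha_{j} ; \mathcal{W}_{\alpha_{j}}\right), y_{j}\right)\right\|} .$$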

That is, the optimal path sampling probability $p_i^{*}$ is proportional to the normalized gradient norm of sub-model $\alpha_i$; sampling sub-models with larger gradient norms reduces the gradient variance of supernet training.

In practice, the paper measures the gradient norm of a sub-model $\alpha_i$ as the sum of the gradient norms of the candidate operations it contains, and takes the normalized gradient norm of each candidate operation as its sampling probability.

The gradient norms are collected during the regular backward pass, and the sampling probabilities of the candidate operations are updated after each epoch. Optimizing the path sampling distribution $\mathbf{p}$ therefore requires only trivial extra computation and is particularly efficient.
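As a concrete illustration, here is a minimal PyTorch-style sketch of this bookkeeping. It assumes a supernet that exposes the parameters of each candidate operation; all names (e.g. `candidate_params`) are illustrative rather than taken from the authors' code:

```python
import torch

# Sketch: accumulate per-operation gradient norms during the regular backward pass
# and renormalize them into sampling probabilities once per epoch.
class PathImportanceSampler:
    def __init__(self, num_edges: int, num_ops: int):
        self.grad_norm_sum = torch.zeros(num_edges, num_ops)          # accumulated gradient norms
        self.probs = torch.full((num_edges, num_ops), 1.0 / num_ops)  # start from uniform sampling

    def accumulate(self, supernet, sampled_ops):
        # Call after loss.backward(): add the gradient norm of every sampled candidate operation.
        for edge, op in enumerate(sampled_ops):
            params = supernet.candidate_params(edge, op)  # hypothetical accessor to that op's parameters
            sq = sum((p.grad.norm() ** 2).item() for p in params if p.grad is not None)
            self.grad_norm_sum[edge, op] += sq ** 0.5

    def update_probs(self, eps: float = 1e-12):
        # Normalized gradient norm of each candidate operation -> its sampling probability.
        # (The paper additionally smooths these probabilities; see the ablation on smoothing schedules.)
        scores = self.grad_norm_sum + eps
        self.probs = scores / scores.sum(dim=1, keepdim=True)
        self.grad_norm_sum.zero_()

    def sample_path(self):
        # Draw one candidate operation per edge according to the current probabilities.
        return [int(torch.multinomial(self.probs[e], 1)) for e in range(self.probs.size(0))]
```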

Data Importance Sampling

For the training data, sampling each example according to its normalized gradient norm likewise helps reduce the gradient variance of deep model training, and the optimal data sampling distribution takes the same form as the optimal path distribution.
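Writing $M$ for the number of training examples and $q_j$ for the probability of drawing example $(x_j, y_j)$, this reads (up to the paper's exact notation):

$$q_{j}^{*}=\frac{\left\|\nabla_{\mathcal{W}} \mathcal{L}\left(\mathcal{N}\left(x_{j}, \alpha ; \mathcal{W}_{\alpha}\right), y_{j}\right)\right\|}{\sum_{k=1}^{M}\left\|\nabla_{\mathcal{W}} \mathcal{L}\left(\mathcal{N}\left(x_{k}, \alpha ; \mathcal{W}_{\alpha}\right), y_{k}\right)\right\|} .$$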

Computing this per-example gradient norm exactly would be costly, so the paper approximates it by an upper bound based on the gradient $\nabla_{z} \mathcal{L}$ of the loss with respect to the pre-activation outputs of the last layer.
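A common form of such a bound, as used in loss-based importance sampling, is

$$\left\|\nabla_{\mathcal{W}} \mathcal{L}\left(\mathcal{N}\left(x_{j}, \alpha ; \mathcal{W}_{\alpha}\right), y_{j}\right)\right\| \leq C \cdot\left\|\nabla_{z} \mathcal{L}\left(z_{j}, y_{j}\right)\right\|,$$

where $z_j$ denotes the pre-activation output of the last layer for example $x_j$ and $C$ is a constant that cancels once the probabilities are normalized.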

In this way, the importance of each training example can be measured cheaply through this upper bound. For example, in image classification the last layer is usually a softmax; when the cross-entropy loss is used, the expression for $\nabla_{z} \mathcal{L}$ can be derived in advance and computed conveniently during training.
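For logits $z$ and a one-hot label $y$, this closed form is simply

$$\nabla_{z} \mathcal{L}=\operatorname{softmax}(z)-y,$$

so each example can be scored by $\|\operatorname{softmax}(z)-y\|$ without any extra backward pass.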

Importance Sampling NAS

The method aims to improve the ranking consistency of the supernet by reducing the gradient variance during training, and to this end it combines path importance sampling and data importance sampling into a joint optimization during supernet training.

The above calculation requires only an additional line of code and can be performed efficiently in a mini-batch fashion. The paper therefore uses this approximation to estimate the importance of the training data, and uses the normalized result to update the sampling distribution $\mathbf{q}$ after each epoch.
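A minimal sketch of this estimate and of the per-epoch update of $\mathbf{q}$, assuming a PyTorch training pipeline (names are illustrative, not the paper's code):

```python
import torch
import torch.nn.functional as F
from torch.utils.data import WeightedRandomSampler

# Per-example score from the closed-form cross-entropy gradient: ||softmax(z) - y||.
def batch_importance(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    onehot = F.one_hot(labels, num_classes=logits.size(1)).float()
    return (logits.softmax(dim=1) - onehot).norm(dim=1)   # essentially the "one extra line"

def build_data_sampler(scores: torch.Tensor) -> WeightedRandomSampler:
    # Normalize the accumulated per-example scores into the sampling distribution q
    # and use it to draw the next epoch's training data.
    q = scores / scores.sum()
    return WeightedRandomSampler(q, num_samples=len(q), replacement=True)

if __name__ == "__main__":
    logits = torch.randn(4, 10)               # toy mini-batch: 4 examples, 10 classes
    labels = torch.tensor([1, 3, 3, 7])
    print(batch_importance(logits, labels))   # larger score -> example sampled more often next epoch
```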

Experiments

Evaluation of Supernet Ranking Consistency

Ranking results on NAS-Bench-201

As shown above, PA&DA needs only 0.2 more GPU hours than SPOS, yet achieves the highest KT and P@Top5% among the compared methods, which shows that the proposed training scheme is effective and improves the ranking consistency of the supernet.

Search Performance on CIFAR-10

Comparison with other state-of-the-art methods on the CIFAR-10 dataset using DARTS search space

Our best searched cells in the DARTS search space.

As shown above, the method achieves the highest average test accuracy of 97.52 ± 0.07%, surpassing the original DARTS and its advanced variants. Compared with other improved one-shot NAS methods such as NSAS, Few-Shot NAS, GM, and CLOSE, it consistently performs better with the smallest search cost.

Search Performance on ImageNet

Comparison with other state-of-the-art methods on the ImageNet dataset using the ProxylessNAS search space

As shown above, PA&DA surpasses DA-NAS, FairNAS-A, and SUMNAS-M with only slightly more FLOPs. Compared with SPOS, ProxylessNAS, MAGIC-AT, Few-Shot NAS, and GM, the architecture searched by PA&DA is smaller and obtains the highest top-1 accuracy of 77.3%, which further demonstrates the effectiveness of the method.

Ablation experiment

Effect of batch size

Effect of various batch sizes and trainability comparison

Larger batch sizes generally stabilize the training of deep models by lowering the gradient variance. The figure above shows that as the batch size increases, GV decreases and KT increases monotonically; a batch size of 512 gives the best KT of 0.670 ± 0.029.

Effect of schedules for smoothing parameters

Ranking performance w.r.t. the smoothing parameters and update schedules for DA and PA

Updating the sampling probabilities of DA after each epoch, using a sample-level distribution, and linearly increasing τ yields the best results.

Effect of DA and PA

Ablation study for PA and DA

The best results are obtained when the two modules are used together, and PA contributes more to the performance improvement than DA.

Summary

The paper reduces the gradient variance of supernet training by jointly optimizing the path and data sampling distributions, thereby improving the ranking consistency of the supernet. It derives the relationship between the gradient variance and the sampling distributions, and uses the normalized gradient norm to update these two distributions. Extensive experiments demonstrate the effectiveness of the method. In the future, the authors plan to explore more effective ways to reduce the gradient variance of supernet training.

Closing remarks

If you are also interested in the full-stack field of artificial intelligence and computer vision, we strongly recommend following the informative, interesting, and passionate public account "CVHub", which brings you high-quality, original, in-depth interpretations of cutting-edge papers across many fields, as well as mature industrial solutions, every day! You are welcome to add the editor's WeChat account cv_huber to discuss more interesting topics together!
