This article was first published on the WeChat public account CVHub, and may not be reproduced to other platforms in any form. It is only for learning and communication, and offenders will be held accountable!
Title: PA&DA: Jointly Sampling PAth and DAta for Consistent NAS
Paper: https://arxiv.org/pdf/2302.14772.pdf
Code: https://github.com/ShunLu91/PA-DA
Overview
Building on the weight-sharing mechanism, one-shot NAS methods train a supernet and then evaluate and rank sub-networks by letting them inherit the supernet's weights, which greatly reduces the search cost. However, some studies point out that shared weights receive different gradient descent directions during training. The paper further finds that a large gradient variance appears during supernet training, which degrades the supernet's ranking consistency. To alleviate this problem, the paper explicitly minimizes the gradient variance of supernet training by jointly optimizing the sampling distributions of PAth and DAta (PA&DA). It theoretically derives the relationship between the gradient variance and the two sampling distributions, revealing that the optimal sampling probability is proportional to the normalized gradient norm of the path and of the training data.
The method adds only negligible computational cost for optimizing the path and data sampling distributions, yet achieves lower gradient variance during supernet training and better supernet generalization, which leads to more consistent NAS. The paper provides a comprehensive comparison with other improved methods across several search spaces. The results show that PA&DA delivers more reliable ranking performance and higher accuracy of the searched architectures than the other methods, demonstrating its effectiveness.
Contributions
The paper runs experiments on NAS-Bench-201 with CIFAR-10, trains the supernet with the SPOS algorithm, and gradually increases the number of candidate operations on each edge of the supernet. It records the average gradient variance of all candidate operation parameters during training, and evaluates the supernet's sub-network ranking consistency by ranking the same 64 sub-networks.
As shown in the figure above, the more sub-models the supernet contains, the larger the gradient variance and the worse the ranking consistency. These results suggest that large gradient variance during training harms the supernet's ranking consistency. Using the normalized gradient norm as the importance metric and applying importance sampling to both paths and data during supernet training can reduce the supernet's gradient variance and improve its ranking consistency.
The main contributions of the paper are as follows:
- The paper verifies that the weight-sharing mechanism of supernet training leads to large gradient variance, which hurts supernet performance and worsens its ranking consistency.
- By deriving the relationship between the supernet gradient variance and the sampling distributions, the paper proposes to explicitly minimize the gradient variance during supernet training by jointly optimizing the path and data sampling distributions. It finds that the optimal sampling probability is proportional to the normalized gradient norm of the path and of the data, and applies importance sampling accordingly during supernet training.
- The method needs only negligible extra computation to perform path and data importance sampling and does not require tedious hyperparameter tuning. It achieves the highest Kendall's Tau (0.713) on NAS-Bench-201 and superior performance on the DARTS and ProxylessNAS search spaces.
Method
Sampling-based One-Shot NAS
Sampling-based one-shot NAS is generally divided into two stages, supernet training and sub-network search:
Stage 1 (training stage): Build a supernet $\mathcal{N}$ with weights $\mathcal{W}$. During training, a sub-network $\alpha$ is sampled according to the discrete distribution $p(A)$ and inherits the supernet's weights, so each step trains only the weights $\mathcal{W}_\alpha$ contained in that sub-network.
The training objective is to obtain the optimal supernet weights $\mathcal{W}^{*}$ by iteratively training the sampled sub-models.
Stage 2 (search stage): Sub-networks are repeatedly sampled from the trained supernet and evaluated on the validation set. A heuristic search algorithm can be used here to find the optimal sub-model $\alpha^{*}$.
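The two stages can be written compactly as below. This is a hedged reconstruction in the notation used above: $\mathcal{L}$ (the training loss) and $\mathrm{Acc}_{\mathrm{val}}$ (validation accuracy) are assumed symbols, and the paper's exact formulation may differ.

```latex
\begin{align}
\mathcal{W}^{*} &= \arg\min_{\mathcal{W}} \; \mathbb{E}_{\alpha \sim p(A)}
  \big[ \mathcal{L}\big(\mathcal{N}(\alpha; \mathcal{W}_{\alpha})\big) \big]
  && \text{(Stage 1: supernet training)} \\
\alpha^{*} &= \arg\max_{\alpha \in A} \; \mathrm{Acc}_{\mathrm{val}}
  \big(\mathcal{N}(\alpha; \mathcal{W}^{*}_{\alpha})\big)
  && \text{(Stage 2: sub-network search)}
\end{align}
```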
The paper aims to reduce the gradient variance of the supernet during training in order to improve its convergence and ranking consistency. To this end, it jointly optimizes the path sampling distribution $p(A)$ and the training data sampling distribution $\mathbf{q}\left(\mathbb{D}_T\right)$ during supernet training.
In the paper's formulation, $d(p)$ and $d(q)$ are the gradient terms whose variance depends on the path and data sampling distributions, respectively. The sections below show how the relationship between the gradient variance and each distribution is derived, and how the two sampling distributions are optimized alternately.
Path Importance Sampling
At training step $i$, a sub-model $\alpha_i$ is sampled from the path sampling distribution $p(A)$ with probability $p_i$, which yields a stochastic gradient $d$ computed from the loss $\mathcal{L}\left(\mathcal{N}\left(x_i, \alpha_i ; \mathcal{W}_{\alpha_i}\right), y_i\right)$.
The paper minimizes the variance of this stochastic gradient by optimizing the sampling distribution $p$.
Because $\mathbb{E}[d]$ is independent of the path sampling distribution $p$, minimizing the variance reduces to minimizing the second moment of the stochastic gradient, subject to the probabilities $p_i$ summing to 1.
To solve this constrained optimization problem, the method of Lagrange multipliers is used to convert it into an unconstrained extremum problem with Lagrangian $\Psi(\mathbf{p}, \lambda)$.
Setting $\frac{\partial \Psi(\mathbf{p}, \lambda)}{\partial p_i} = 0$ gives:
$$\sqrt{\lambda} = \sum_{i=1}^{N} \frac{\left\|\nabla_{\mathcal{W}} \mathcal{L}\left(\mathcal{N}\left(x_i, \alpha_i ; \mathcal{W}_{\alpha_i}\right), y_i\right)\right\|}{N}$$
From this, the optimal sampling distribution $p^{*}$ follows. The optimal path sampling probability $p^{*}_i$ is proportional to the normalized gradient norm of sub-model $\alpha_i$; that is, sampling sub-models with larger gradient norms reduces the gradient variance of supernet training.
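The steps of this derivation can be written out as follows. This is a reconstruction, not a quotation of the paper: it assumes the importance-weighted estimator $d_i = \frac{1}{N p_i} g_i$, where $g_i$ is shorthand introduced here for the per-step gradient. The assumption is chosen because it reproduces the expression for $\sqrt{\lambda}$ given in the text, and the paper's exact notation may differ.

```latex
\begin{align}
d_i &= \frac{1}{N p_i}\, g_i,
  \qquad g_i = \nabla_{\mathcal{W}} \mathcal{L}\big(\mathcal{N}(x_i, \alpha_i; \mathcal{W}_{\alpha_i}), y_i\big) \\
\mathbb{V}[d] &= \mathbb{E}[d^{\top} d] - \mathbb{E}[d]^{\top}\mathbb{E}[d],
  \qquad \mathbb{E}[d] = \frac{1}{N}\sum_{i=1}^{N} g_i \;\; \text{(independent of } \mathbf{p}\text{)} \\
\mathbb{E}[d^{\top} d] &= \sum_{i=1}^{N} p_i \, \frac{\|g_i\|^2}{N^2 p_i^2}
  = \sum_{i=1}^{N} \frac{\|g_i\|^2}{N^2 p_i} \\
\Psi(\mathbf{p}, \lambda) &= \sum_{i=1}^{N} \frac{\|g_i\|^2}{N^2 p_i}
  + \lambda \Big( \sum_{i=1}^{N} p_i - 1 \Big) \\
\frac{\partial \Psi(\mathbf{p}, \lambda)}{\partial p_i} &= -\frac{\|g_i\|^2}{N^2 p_i^2} + \lambda = 0
  \;\Rightarrow\; p_i = \frac{\|g_i\|}{N \sqrt{\lambda}} \\
\sum_{i=1}^{N} p_i = 1 &\;\Rightarrow\; \sqrt{\lambda} = \sum_{i=1}^{N} \frac{\|g_i\|}{N}
  \;\Rightarrow\; p_i^{*} = \frac{\|g_i\|}{\sum_{j=1}^{N} \|g_j\|}
\end{align}
```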
In practice, the paper measures the gradient norm of sub-model $\alpha_i$ as the sum of the gradient norms of the candidate operations it contains, and takes the normalized gradient norm of each candidate operation as that operation's sampling probability.
The gradient norms are computed during each regular backward pass, and the sampling probabilities of the candidate operations are updated after each epoch. Optimizing the path sampling distribution $p$ therefore adds only trivial computation and is particularly efficient.
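The bookkeeping described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class `PathSampler`, its method names, and the choice to accumulate (rather than average) gradient norms over an epoch are all assumptions made for this example, and no smoothing is applied to the probabilities.

```python
import numpy as np


class PathSampler:
    """Toy per-edge sampler: one categorical distribution per supernet edge (hypothetical helper)."""

    def __init__(self, num_edges, num_ops, seed=0):
        # Start from a uniform distribution over candidate operations on every edge.
        self.probs = np.full((num_edges, num_ops), 1.0 / num_ops)
        self.grad_norm_sum = np.zeros((num_edges, num_ops))
        self.rng = np.random.default_rng(seed)

    def sample_path(self):
        # Pick one candidate operation index per edge according to the current probabilities.
        return [int(self.rng.choice(len(p), p=p)) for p in self.probs]

    def record(self, path, op_grad_norms):
        # op_grad_norms[e] is the gradient norm of the operation chosen on edge e,
        # assumed to be read off after the regular backward pass.
        for edge, (op, norm) in enumerate(zip(path, op_grad_norms)):
            self.grad_norm_sum[edge, op] += norm

    def end_epoch(self, eps=1e-12):
        # Normalize the accumulated norms per edge to obtain new sampling probabilities.
        # eps keeps every row a valid distribution even if nothing was recorded for it.
        num_ops = self.probs.shape[1]
        totals = self.grad_norm_sum.sum(axis=1, keepdims=True)
        self.probs = (self.grad_norm_sum + eps) / (totals + eps * num_ops)
        self.grad_norm_sum[:] = 0.0
```

In a training loop, `sample_path()` would be called once per step, `record()` after the backward pass, and `end_epoch()` once per epoch. Without smoothing, an operation that is never sampled in an epoch ends up with a near-zero probability, so a practical version would need some form of smoothing.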
Data Importance Sampling
Sampling the training data according to the normalized gradient norm helps reduce the gradient variance of deep model training. Formally, the optimal sampling probability of each training sample is proportional to its gradient norm.
Because computing the full per-sample gradient norm is expensive, the paper uses the gradient of the loss with respect to the pre-activation output of the last layer, $\nabla_L$, to approximate an upper bound on the gradient norm of each training sample.
This makes it easy to measure the importance of each training sample through its upper bound. For example, in image classification the last layer is usually a softmax; with the cross-entropy loss, the expression for $\nabla_L$ can be derived in advance and computed cheaply during training.
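For the softmax plus cross-entropy case, the gradient of the loss with respect to the logits has the well-known closed form softmax(logits) minus the one-hot label. The sketch below computes its per-sample norm with NumPy; the function name `last_layer_grad_norm` is hypothetical, and using the L2 norm of this quantity as the importance score is an assumption made for illustration.

```python
import numpy as np


def last_layer_grad_norm(logits, labels):
    """Per-sample L2 norm of d(cross-entropy)/d(logits) = softmax(logits) - one_hot(labels)."""
    logits = np.asarray(logits, dtype=np.float64)
    labels = np.asarray(labels)
    # Subtract the row-wise maximum before exponentiating for numerical stability.
    shifted = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=1, keepdims=True)
    # Subtracting 1 at the true-class index gives softmax - one_hot.
    grad = probs.copy()
    grad[np.arange(len(labels)), labels] -= 1.0
    return np.linalg.norm(grad, axis=1)
```

Samples that the model already classifies confidently and correctly get a score close to zero, while confidently misclassified samples get a score close to $\sqrt{2}$, so the score can be computed for a whole mini-batch from quantities the forward pass already produces.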
Importance Sampling NAS
The paper's method aims to improve the consistency of supernet rankings by reducing the gradient variance during training. It proposes a joint optimization that combines path importance sampling and data importance sampling, updating the two sampling distributions alternately.
The gradient-norm approximation for the training data requires only one additional line of code and can be computed efficiently in a mini-batch fashion. The paper therefore uses this approximation to estimate the importance of the training data, and uses the normalized result to update the sampling distribution $q$ after each epoch.
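The per-epoch update of the data sampling distribution can be sketched as below. The helper names `update_data_distribution` and `sample_batch` are hypothetical, and sampling with replacement is an assumption; the article does not specify these details.

```python
import numpy as np


def update_data_distribution(scores, eps=1e-12):
    """Turn per-sample importance scores into a sampling distribution q (sums to 1)."""
    scores = np.asarray(scores, dtype=np.float64) + eps  # eps avoids an all-zero vector
    return scores / scores.sum()


def sample_batch(q, batch_size, rng):
    """Draw training-sample indices according to q (with replacement, an assumption)."""
    return rng.choice(len(q), size=batch_size, replace=True, p=q)
```

A usage sketch: after an epoch, collect one importance score per training sample, call `update_data_distribution(scores)` to get `q`, and then draw each mini-batch of the next epoch with `sample_batch(q, batch_size, np.random.default_rng(0))`.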
Experiments
Evaluation of Supernet Ranking Consistency
As shown in the figure above, PA&DA needs only 0.2 more GPU hours than SPOS and achieves the highest KT (Kendall's Tau) and P@Top5% among the compared methods, which shows that the paper's training scheme is effective and helps improve supernet ranking consistency.
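For readers unfamiliar with the KT metric, the sketch below computes Kendall's Tau between the accuracies predicted by a supernet and the ground-truth accuracies of the same sub-networks. It implements the simple tau-a variant, in which tied pairs count as neither concordant nor discordant; the function name is hypothetical, and the paper's exact evaluation protocol is not described in this article.

```python
from itertools import combinations


def kendall_tau(predicted, ground_truth):
    """Kendall's tau-a: (concordant pairs - discordant pairs) / total pairs."""
    n = len(predicted)
    if n != len(ground_truth) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        # A pair is concordant when both rankings order it the same way.
        sign = (predicted[i] - predicted[j]) * (ground_truth[i] - ground_truth[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) // 2)
```

A value of 1 means the supernet ranks every pair of sub-networks in the same order as their stand-alone accuracies, 0 means no correlation, and -1 means the ranking is fully reversed.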
Search Performance on CIFAR-10
As shown in the figure above, the paper's method achieves the highest average test accuracy of 97.52 ± 0.07, surpassing the original DARTS and its advanced variants. Compared with other improved one-shot NAS methods such as NSAS, Few-Shot-NAS, GM and CLOSE, it consistently outperforms them with the smallest search cost.
Search Performance on ImageNet
As shown in the figure above, PA&DA surpasses DA-NAS, FairNAS-A and SUMNAS-M with slightly more FLOPs. Compared with SPOS, ProxylessNAS, MAGIC-AT, Few-Shot NAS and GM, the architecture found by PA&DA is smaller and obtains the highest top-1 accuracy of 77.3, which supports the effectiveness of the paper's method.
Ablation Experiments
Effect of batch size
Larger batch sizes generally stabilize the training of deep models by lowering the gradient variance. The figure above shows that as the batch size increases, the gradient variance (GV) decreases and KT increases monotonically. A batch size of 512 gives the best KT of 0.670 ± 0.029.
Effect of schedules for smoothing parameters
Updating the sampling probabilities of the DA after each epoch, using a sample-level distribution and linearly increasing τ yields the best results.
Effect of DA and PA
The best results are obtained when these two modules are used together. Furthermore, PA contributes more performance improvement than DA.
Summary
The paper reduces the gradient variance of supernet training by jointly optimizing the path and data sampling distributions, which improves the supernet's ranking consistency. It derives the relationship between the gradient variance and the sampling distributions, and uses the normalized gradient norm to update both distributions. Extensive experiments demonstrate the effectiveness of the method. In future work, the researchers plan to explore more effective ways to reduce the gradient variance of supernet training.