"ENAS: Efficient Neural Architecture Search via Parameter Sharing" paper notes

Reference code: enas

1. Overview

Introduction: This article proposes adding weight sharing to NAS, which avoids training every architecture sampled by the controller from scratch and thereby compresses the overall search time; the resulting algorithm is ENAS. In the original NAS, the controller first samples a network structure, trains it to convergence, and then uses the performance of the sampled network as the controller's reward so that the next generated network becomes better; however, the weights trained with so much effort are then discarded, and the next network structure is trained from the beginning. This article shares the network parameters instead: each sample drawn from the overall search space is a sub-network whose parameters are shared, so it does not need to be trained to convergence. This greatly reduces the search time and allows the search to finish on a single 1080Ti graphics card in less than 16 hours. ENAS achieves an error rate of 2.89% on the CIFAR-10 dataset (compared to 2.65% for NAS), so the accuracy is quite close while the time cost is greatly reduced.

Design of the search space:
In this article, in order to realize parameter sharing, the search space is designed as a super network: each node is a local computation unit that holds its own trainable parameters, and whenever the controller samples that node, the parameters stored in it are reused, which is what realizes the sharing; each edge represents the direction in which the data flows. The constructed graph is shown in the following figure:
[Figure: the directed graph of the super network that defines the search space]
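As a minimal sketch (PyTorch, with made-up names) of this idea: each node of the super network owns one set of parameters per candidate operation, and every sampled sub-network that routes through the node reuses exactly those parameters instead of training new ones.

```python
# Minimal sketch of a shared-parameter node; names and op set are illustrative only.
import torch.nn as nn

class SharedNode(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # Each candidate operation keeps its own persistent parameters.
        self.ops = nn.ModuleDict({
            "conv3x3": nn.Conv2d(channels, channels, 3, padding=1),
            "conv5x5": nn.Conv2d(channels, channels, 5, padding=2),
            "maxpool3x3": nn.MaxPool2d(3, stride=1, padding=1),
        })

    def forward(self, x, op_name):
        # The controller decides which op is active; whatever it picks,
        # the weights of that op are shared across all sub-networks.
        return self.ops[op_name](x)
```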
Design of the search controller:
The controller in ENAS is an RNN; for each node of the sub-network, its main tasks are as follows:

  • 1) Decide which edge should be activated, that is, select the predecessor node of the current node;
  • 2) Select the operation type of the current node, such as the type of convolution, etc.

Each node has its own independent parameters, and these are reused throughout the training process (which is another form of parameter sharing). The following figure shows an example with 4 nodes: the right part is the controller, while the left and middle parts show the sub-network structure it has selected.
[Figure: sampled sub-network (left, middle) and the 4-node controller (right)]
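A rough sketch of such a controller, assuming an LSTM cell with one linear head per decision type; the reference implementation additionally feeds its previous decisions back into the LSTM and uses an attention-style mechanism for the predecessor choice, so this is only an illustration of the two sampling steps.

```python
# Sketch of a controller that samples (predecessor, operation) for each node.
import torch
import torch.nn as nn

class Controller(nn.Module):
    def __init__(self, num_ops, max_nodes=12, hidden=64):
        super().__init__()
        self.lstm = nn.LSTMCell(hidden, hidden)
        self.start = nn.Parameter(torch.zeros(1, hidden))  # constant input token
        self.prev_head = nn.Linear(hidden, max_nodes)      # logits over predecessor nodes
        self.op_head = nn.Linear(hidden, num_ops)          # logits over operation types
        self.hidden = hidden

    def sample(self, num_nodes):
        h = torch.zeros(1, self.hidden)
        c = torch.zeros(1, self.hidden)
        arch, log_probs = [], []
        for node in range(num_nodes):
            # Decision 1: which predecessor feeds this node (only earlier nodes are valid).
            h, c = self.lstm(self.start, (h, c))
            prev_dist = torch.distributions.Categorical(
                logits=self.prev_head(h)[:, :node + 1])
            prev = prev_dist.sample()
            # Decision 2: which operation this node applies.
            h, c = self.lstm(self.start, (h, c))
            op_dist = torch.distributions.Categorical(logits=self.op_head(h))
            op = op_dist.sample()
            arch.append((prev.item(), op.item()))
            log_probs.append(prev_dist.log_prob(prev) + op_dist.log_prob(op))
        return arch, torch.stack(log_probs).sum()
```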

2. Method design

2.1 ENAS training and final network generation

The whole method designed in the article involves training two sets of parameters:

  • 1) The parameters $\theta$ of the controller RNN;
  • 2) The parameters $w$ of the sub-networks sampled by the controller.

From the algorithm flow of ENAS we can see that these two sets of parameters are trained alternately: first the sub-network parameters $w$ receive their initial training, then the RNN controller is trained, and after that the two are updated alternately until convergence.

Training of the sub-network parameters $w$:
When training the sub-network parameters, the controller parameters are fixed first and a sub-network $m_i \sim \pi(m;\theta)$ is sampled from it; training that sub-network is then the standard CNN training procedure. The gradient for this part can be described as:
$$\nabla_w E_{m\sim\pi(m;\theta)}[L(m;w)] \approx \frac{1}{M}\sum_{i=1}^{M}\nabla_w L(m_i, w)$$
where $M$ is the number of sub-networks sampled per update. Although this sampling-based estimate has a larger variance for a fixed number of samples, the article points out that it already works well with $M = 1$.
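A hedged sketch of this update, with the controller frozen: sample $M$ sub-networks that all share the same weights, average their training losses on one batch, and take a single optimizer step on $w$. `shared_model`, `controller.sample`, and the argument layout are stand-ins for illustration, not the reference implementation's API.

```python
# One w-update with shared weights; the paper reports M = 1 already works well.
import torch

def train_shared_step(shared_model, controller, images, labels, optimizer, M=1):
    criterion = torch.nn.CrossEntropyLoss()
    optimizer.zero_grad()
    loss_total = 0.0
    for _ in range(M):
        with torch.no_grad():                # controller parameters stay fixed here
            arch, _ = controller.sample(num_nodes=4)
        logits = shared_model(images, arch)  # forward pass through the sampled sub-network
        loss_total = loss_total + criterion(logits, labels)
    (loss_total / M).backward()              # Monte Carlo estimate of the gradient above
    optimizer.step()
```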

Training of the controller parameters $\theta$:
When training the controller, the sub-network parameters $w$ need to be fixed. Because the architectures are sampled in a discrete manner, the gradient for updating the controller parameters is obtained with the policy gradient. The performance of the sampled sub-network on the validation set is used as the reward $R(m;w)$, and the goal is to maximize the expected reward:
$$E_{m\sim\pi(m;\theta)}[R(m;w)]$$
Policy gradient explanation: [CS285 Lecture 5] Policy gradient
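A minimal REINFORCE sketch for this update: the validation accuracy of the sampled sub-network serves as the reward $R(m;w)$, and a moving-average baseline (a standard variance-reduction trick) is subtracted; the exact function layout and the decay value are assumptions of this sketch.

```python
# One controller update via the policy gradient; shared weights w stay frozen.
import torch

def train_controller_step(controller, shared_model, val_images, val_labels,
                          optimizer, baseline, decay=0.99):
    arch, log_prob = controller.sample(num_nodes=4)
    with torch.no_grad():
        logits = shared_model(val_images, arch)
        reward = (logits.argmax(dim=1) == val_labels).float().mean()  # accuracy as R(m; w)
    baseline = decay * baseline + (1 - decay) * reward  # moving-average baseline
    loss = -(reward - baseline) * log_prob              # maximize the expected reward
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return float(baseline)
```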

In addition to the optimization described above, the article also introduces a constraint on skip connections: the reference code adds a KL-divergence term with a prior of 0.4. It is introduced to prevent the controller from generating too many skip connections, which would make the features extracted by the network effectively shallower and reduce its expressive power and generalization ability.
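A small sketch of how such a penalty could look, assuming one Bernoulli probability per candidate skip connection: each probability is pulled toward the prior of 0.4 mentioned above via a KL term added to the controller loss. The weighting factor is an assumption of this sketch, not a value taken from the paper.

```python
# KL(p || prior) for Bernoulli skip probabilities, added to the controller loss.
import torch

def skip_kl_penalty(skip_probs, prior=0.4, weight=1.0):
    p = torch.clamp(skip_probs, 1e-6, 1 - 1e-6)
    kl = p * torch.log(p / prior) + (1 - p) * torch.log((1 - p) / (1 - prior))
    return weight * kl.mean()
```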

Selection of the final network:
Finally, the final sub-network needs to be generated. In general, several sub-networks are sampled in one pass from $\pi(m;\theta)$, their performance on the validation set is compared, the best-performing one is selected, and that network is trained from scratch.
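A hedged sketch of this selection step: draw several architectures from the trained controller, score each one with the shared weights on a validation batch, and keep the best for retraining from scratch. The number of samples and the helper layout are placeholders.

```python
# Pick the best of several sampled architectures by validation accuracy.
import torch

def select_final_architecture(controller, shared_model, val_images, val_labels,
                              num_samples=10):
    best_arch, best_acc = None, -1.0
    with torch.no_grad():
        for _ in range(num_samples):
            arch, _ = controller.sample(num_nodes=4)
            logits = shared_model(val_images, arch)
            acc = (logits.argmax(dim=1) == val_labels).float().mean().item()
            if acc > best_acc:
                best_arch, best_acc = arch, acc
    return best_arch  # this architecture is then retrained from scratch
```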

2.2 Design of the search space

Traditional layer-by-layer network construction:
Here the number of network layers to be searched is fixed at $L = 12$, and the controller is built on this basis. What the controller does is similar to what was described above; only the set of selectable operations differs. The operation space used here is: regular and depthwise-separable convolutions with kernel sizes $3\times 3$ and $5\times 5$, and average and max pooling with kernel size $3\times 3$. Although the overall arrangement allows many variations, many hyperparameters are still fixed, so there are certain limitations. The search process for this part is shown in the following figure:
[Figure: the layer-by-layer (macro) search process]
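Written out as a PyTorch-style operation table, the six candidate operations above might look as follows; the exact module arguments (padding, the depthwise-plus-pointwise decomposition of the separable convolutions) are assumptions of this sketch.

```python
# The six candidate operations of the layer-by-layer search space (illustrative).
import torch.nn as nn

def sep_conv(c, k):
    # depthwise convolution followed by a pointwise (1x1) convolution
    return nn.Sequential(nn.Conv2d(c, c, k, padding=k // 2, groups=c),
                         nn.Conv2d(c, c, 1))

def make_ops(channels):
    return nn.ModuleDict({
        "conv3x3":     nn.Conv2d(channels, channels, 3, padding=1),
        "conv5x5":     nn.Conv2d(channels, channels, 5, padding=2),
        "sep_conv3x3": sep_conv(channels, 3),
        "sep_conv5x5": sep_conv(channels, 5),
        "maxpool3x3":  nn.MaxPool2d(3, stride=1, padding=1),
        "avgpool3x3":  nn.AvgPool2d(3, stride=1, padding=1),
    })
```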
Building the network from cells:
Searching the network layer by layer as above is inefficient. A natural idea is to combine several operations into a small module (a cell) and build the overall network by stacking such modules, as shown in the following figure:
[Figure: a network built by stacking searched cells]
For this kind of search, the article adapts the controller so that, for each node in a cell, it must:

  • 1) Select two of the preceding nodes as inputs;
  • 2) Select an appropriate operation for each of the two inputs (different convolution kernel sizes and types, etc.); a small sampling sketch follows the flow diagram below.

The following figure shows the flow diagram of this search method:
[Figure: flow diagram of the cell-based search]
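A rough sketch of how one cell could be sampled under the two rules above, with plain random choice standing in for the controller's learned sampling; node indices 0 and 1 denote the two cell inputs, and the operation names are only examples.

```python
# Sample a cell: each node picks two earlier nodes and one operation per input.
import random

def sample_cell(num_nodes=4,
                op_names=("sep_conv3x3", "sep_conv5x5", "avgpool3x3",
                          "maxpool3x3", "identity")):
    cell = []
    for node in range(2, 2 + num_nodes):      # nodes 0 and 1 are the two cell inputs
        in_a = random.randrange(node)         # first input: any earlier node
        in_b = random.randrange(node)         # second input: any earlier node
        op_a = random.choice(op_names)
        op_b = random.choice(op_names)
        cell.append((in_a, op_a, in_b, op_b)) # node output = op_a(in_a) + op_b(in_b)
    return cell
```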

3. Experimental results

Results on the CIFAR-10 dataset:
[Figure: results table on CIFAR-10]
Visualization of the search results:
[Figure: visualization of the searched architecture]

Origin blog.csdn.net/m_buddy/article/details/110428595