[Reading Notes] DARTS: Differentiable Architecture Search

Authors: Hanxiao Liu (CMU), Karen Simonyan (DeepMind), Yiming Yang (CMU)

Published: 24 Jun 2018

I read this paper yesterday and found it quite interesting: it treats the network architecture itself as a parameter to be optimized by gradient descent, giving architecture selection a principled footing. After reading it I do have a few concerns about the method, which I will go into in the afterthoughts at the end.

Abstract

The core idea of this paper is to perform architecture search in a differentiable way.
Unlike conventional approaches that apply evolution or reinforcement learning over a discrete, non-differentiable search space, the method is based on a continuous relaxation of the architecture representation, allowing the architecture to be searched efficiently with gradient descent.
Extensive experiments on CIFAR-10, ImageNet, Penn Treebank and WikiText-2 show that the algorithm excels at discovering high-performance convolutional architectures for image classification and recurrent architectures for language modeling, while being orders of magnitude faster than state-of-the-art non-differentiable search techniques.

1 Introduction

Discovering state-of-the-art neural network architectures requires substantial effort from human experts.
Recently there has been growing interest in automated algorithms for neural architecture design.
Automatically searched architectures have been explored extensively for tasks such as image classification (Zoph and Le, 2016; Zoph et al., 2017; Liu et al., 2017b,a; Real et al., 2018) and object detection (Zoph et al., 2017).

The best existing architecture search algorithms, despite their remarkable performance, are computationally demanding.
For example, obtaining a state-of-the-art architecture for CIFAR-10 and ImageNet required 1800 GPU days of reinforcement learning (RL) (Zoph et al., 2017) or 3150 GPU days of evolution (Real et al., 2018).
Several speed-up approaches have been proposed, such as imposing a particular structure on the search space (Liu et al., 2017b,a), weight or performance prediction for each individual architecture (Brock et al., 2017; Baker et al., 2018), and weight sharing across architectures (Pham et al., 2018b; Cai et al., 2018), but the fundamental challenge of scalability remains.
The inherent cause of the inefficiency of the mainstream approaches, e.g. those based on RL, evolution, MCTS (Negrinho and Gordon, 2017), SMBO (Liu et al., 2017a) or Bayesian optimization (Kandasamy et al., 2018), is that they treat architecture search as a black-box optimization problem over a discrete domain, which leads to a large number of architecture evaluations.

In this work the authors approach the problem from a different angle and propose an efficient architecture search method called DARTS (Differentiable Architecture Search).
Instead of searching over a discrete set of candidate architectures, the search space is relaxed to be continuous, so that the architecture can be optimized with respect to its validation-set performance by gradient descent.
Because it uses gradient-based optimization rather than inefficient black-box search, DARTS achieves competitive performance with orders of magnitude fewer computational resources than the state of the art.
It also outperforms another recent efficient architecture search method, ENAS (Pham et al., 2018b).
Notably, DARTS is simpler than many existing approaches, as it does not involve any controllers (Zoph and Le, 2016; Baker et al., 2016; Zoph et al., 2017; Pham et al., 2018b), hypernetworks (Brock et al., 2017), or performance predictors (Liu et al., 2017a).

The idea of searching architectures within a continuous domain is not new (Saxena and Verbeek, 2016; Ahmed and Torresani, 2017; Shin et al., 2018), but this work differs from prior efforts in several major ways:

  • While previous work tries to fine-tune specific aspects of an architecture, such as the filter shapes or branching patterns in convolutional networks, DARTS is able to discover high-performance architectures with complex graph topologies within a rich search space.
  • Moreover, DARTS is not restricted to any particular architecture family and can search both convolutional and recurrent networks.

The contributions can be summarized as follows:

  • A new algorithm for differentiable network architecture search that applies to both convolutional and recurrent architectures.
  • Experiments showing that the approach is highly competitive.
  • Remarkable architecture search efficiency (with 4 GPUs: 2.83% test error on CIFAR-10 in one day; 56.1 test perplexity on PTB in 6 hours), which is attributed to the use of gradient-based optimization instead of non-differentiable search techniques.
  • The architectures learned by DARTS on CIFAR-10 and PTB are shown to transfer to ImageNet and WikiText-2.

An implementation of DARTS is available at https://github.com/quark0/darts.

2 Differentiable Architecture Search

In Sect. 2.1 the search space is described in a general form, where the computation procedure of an architecture (or a cell within it) is represented as a directed acyclic graph.
A simple continuous relaxation scheme is then introduced for the search space, which leads to a differentiable learning objective for the joint optimization of the architecture and its weights (Sect. 2.2).
Finally, an approximation technique is proposed to make the algorithm computationally feasible and efficient (Sect. 2.3).

2.1 Search Space

Following prior work, we search for a computation cell as the building block of the final architecture.
The learned cell can be stacked to form a convolutional network, or recursively connected to form a recurrent network.

A cell is a directed acyclic graph consisting of an ordered sequence of N nodes.
Each node $x^{(i)}$ is a latent representation (e.g. a feature map in a convolutional network), and each directed edge $(i, j)$ is associated with some operation $o^{(i,j)}$ applied to $x^{(i)}$.
We assume each cell has two input nodes and a single output node.
For convolutional cells, the input nodes are defined as the cell outputs of the previous two layers (Zoph et al., 2017).
For recurrent cells, they are defined as the input at the current step and the state carried over from the previous step.
The output of the cell is obtained by applying a reduction operation (e.g. concatenation) to all of the intermediate nodes.

Each intermediate node is computed from all of its predecessors:

$$x^{(j)} = \sum_{i < j} o^{(i,j)}\big(x^{(i)}\big)$$

A special zero operation is also included to indicate the absence of a connection between two nodes.
The task of learning the cell therefore reduces to learning the operations on its edges.
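
To make the cell definition concrete, here is a minimal Python sketch of the DAG computation above, written under the convention that edge $(i, j)$ applies $o^{(i,j)}$ to node $i$'s state when computing node $j$; the names (`compute_cell`, `ops`, `zero`) are illustrative rather than taken from the official implementation.

```python
# A minimal sketch of a cell as a DAG. `ops[(i, j)]` is the operation o^(i,j)
# applied to node i's state when computing node j; all names are illustrative.

def zero(x):
    """The special 'zero' op: indicates the absence of a connection."""
    return 0.0 * x

def compute_cell(inputs, ops, num_nodes):
    """inputs: the two input node states; ops: dict mapping edge (i, j) -> callable."""
    states = list(inputs)  # nodes 0 and 1 are the cell's input nodes
    for j in range(len(inputs), num_nodes):
        # each intermediate node sums the transformed states of all its predecessors
        states.append(sum(ops[(i, j)](states[i]) for i in range(j)))
    # the cell output applies a reduction over the intermediate nodes
    # (a convolutional cell would concatenate them; here we just return them)
    return states[len(inputs):]
```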

2.2 Continuous Relaxation and Optimization

Let $\mathcal{O}$ be a set of candidate operations (e.g., convolution, max pooling, zero), where each operation represents some function $o(\cdot)$ to be applied to $x^{(i)}$.
To make the search space continuous, we relax the categorical choice of a particular operation to a softmax over all possible operations (i.e., each edge computes a softmax-weighted mixture of all candidate operations):

$$\bar{o}^{(i,j)}(x) = \sum_{o \in \mathcal{O}} \frac{\exp\big(\alpha_o^{(i,j)}\big)}{\sum_{o' \in \mathcal{O}} \exp\big(\alpha_{o'}^{(i,j)}\big)}\, o(x)$$

where the operation mixing weights for a pair of nodes $(i, j)$ are parameterized by a vector $\alpha^{(i,j)}$ of dimension $|\mathcal{O}|$.
After the relaxation, the task of architecture search reduces to learning a set of continuous variables $\{\alpha^{(i,j)}\}$.
At the end of search, a discrete architecture is obtained by replacing each mixed operation $\bar{o}^{(i,j)}(x)$ with the most likely operation, i.e., $o^{(i,j)} = \arg\max_{o \in \mathcal{O}} \alpha_o^{(i,j)}$. In the following, we refer to $\alpha$ as the (encoding of the) architecture.
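
The relaxation is easy to express in code. Below is a minimal PyTorch-style sketch of a mixed operation $\bar{o}^{(i,j)}$ with a small illustrative candidate set (the actual DARTS search space contains more operations, e.g. separable and dilated convolutions); the names `MixedOp` and `candidate_ops` are hypothetical.

```python
import torch.nn as nn
import torch.nn.functional as F

def candidate_ops(channels):
    """A small illustrative candidate set O; the real search space is larger."""
    return nn.ModuleList([
        nn.Conv2d(channels, channels, kernel_size=3, padding=1, bias=False),  # 3x3 conv
        nn.MaxPool2d(kernel_size=3, stride=1, padding=1),                     # 3x3 max pooling
        nn.Identity(),                                                        # skip connection
    ])

class MixedOp(nn.Module):
    """Continuous relaxation of one edge: a softmax-weighted sum over all candidate ops."""
    def __init__(self, channels):
        super().__init__()
        self.ops = candidate_ops(channels)

    def forward(self, x, alpha_edge):
        # alpha_edge: architecture logits of size |O| for this edge
        weights = F.softmax(alpha_edge, dim=-1)
        return sum(w * op(x) for w, op in zip(weights, self.ops))
```

In a full search network there would be one such architecture vector per edge (e.g. stored as an `nn.Parameter`), kept separate from the operation weights $w$.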

After relaxation, our goal is to jointly learn the architecture α and the weights w within all the mixed operations (e.g. weights of the convolution filters).
Analogous to architecture search using RL (Zoph and Le, 2016; Zoph et al., 2017; Pham et al., 2018b) or evolution (Liu et al., 2017b; Real et al., 2018), where the validation set performance is treated as the reward or fitness, DARTS aims to optimize the validation loss, but using gradient descent.

Denote by $L_{train}$ and $L_{val}$ the training and the validation loss, respectively.
Both losses are determined not only by the architecture $\alpha$, but also by the weights $w$ in the network.
The goal of architecture search is to find $\alpha$ that minimizes the validation loss $L_{val}(w^*, \alpha)$, where the weights $w^*$ associated with the architecture are obtained by minimizing the training loss, $w^* = \arg\min_w L_{train}(w, \alpha)$.

This implies a bilevel optimization problem (Anandalingam and Friesz, 1992; Colson et al., 2007), with $\alpha$ as the upper-level variable and $w$ as the lower-level variable:

$$\min_{\alpha}\; L_{val}\big(w^*(\alpha), \alpha\big) \quad \text{s.t.} \quad w^*(\alpha) = \arg\min_{w}\, L_{train}(w, \alpha)$$

The nested formulation also arises in gradient-based hyperparameter optimization (Maclaurin et al., 2015; Pedregosa, 2016), which is related in the sense that the continuous architecture $\alpha$ can be viewed as a special type of hyperparameter, although its dimension is substantially higher than that of scalar-valued hyperparameters (such as the learning rate), and it is harder to optimize.

2.3 Approximation

Solving the bilevel optimization exactly is prohibitive, as it would require recomputing $w^*(\alpha)$ by solving the inner problem whenever there is any change in $\alpha$.
We thus propose an approximate iterative optimization procedure where w and α are optimized by alternating between gradient descent steps in the weight and architecture spaces respectively (Alg. 1).
At step $k$, given the current architecture $\alpha_{k-1}$, we obtain $w_k$ by moving $w_{k-1}$ in the direction of minimising the training loss $L_{train}(w_{k-1}, \alpha_{k-1})$.
Then, keeping the weights $w_k$ fixed, we update the architecture so as to minimize the validation loss after a single step of gradient descent w.r.t. the weights:

$$L_{val}\big(w_k - \epsilon \nabla_w L_{train}(w_k, \alpha_{k-1}),\; \alpha_{k-1}\big)$$

where $\epsilon$ is the learning rate for this virtual gradient step.
The motivation behind this is that we would like to find an architecture which has a low validation loss when its weights are optimized by (a single step of) gradient descent, where the one-step unrolled weights serve as the surrogate for $w^*(\alpha)$.
A related approach has been used in meta-learning for model transfer (Finn et al., 2017).
Notably, the dynamics of our iterative algorithm define a Stackelberg game (Von Stackelberg, 1934) between $\alpha$'s optimizer (leader) and $w$'s optimizer (follower), which typically requires the leader to anticipate the follower's next-step move in order to achieve an equilibrium.
While we are not currently aware of convergence guarantees for our optimization algorithm, in practice it is able to converge with a suitable choice of $\epsilon$ (a simple working strategy is to set $\epsilon$ equal to the learning rate for $w$'s optimizer). We also note that when momentum is enabled for weight optimisation, the one-step forward learning objective is modified accordingly and all of our analysis still applies.
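
As an illustration of this one-step unrolled objective, the architecture gradient can be obtained by simply backpropagating through the virtual gradient step. The sketch below assumes a simplified interface where `train_loss` and `val_loss` are functions `(w, alpha) -> scalar torch loss` over single flat parameter tensors; it is a conceptual sketch, not the paper's actual implementation.

```python
import torch

def unrolled_arch_gradient(train_loss, val_loss, w, alpha, epsilon):
    """Gradient of L_val(w - eps * grad_w L_train(w, alpha), alpha) w.r.t. alpha,
    computed by differentiating through the one-step unrolled weights.
    w and alpha are tensors with requires_grad=True (illustrative interface)."""
    # virtual gradient step on the weights; create_graph keeps it differentiable w.r.t. alpha
    g_w = torch.autograd.grad(train_loss(w, alpha), w, create_graph=True)[0]
    w_unrolled = w - epsilon * g_w
    # validation loss at the unrolled weights, then backpropagate to the architecture
    return torch.autograd.grad(val_loss(w_unrolled, alpha), alpha)[0]
```

The finite-difference trick described next avoids explicitly forming the expensive second-order term that this direct unrolling computes.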

The architecture gradient is given by (we omit the step index k for brevity):

$$\nabla_\alpha L_{val}(w', \alpha) - \epsilon \nabla^2_{\alpha, w} L_{train}(w, \alpha)\, \nabla_{w'} L_{val}(w', \alpha)$$

where $w' = w - \epsilon \nabla_w L_{train}(w, \alpha)$ denotes the weights for a one-step forward model.
The gradient contains an expensive matrix-vector product in its second term.
Fortunately, the complexity can be substantially reduced using a finite difference approximation. Let $\delta$ be a small scalar (we found $\delta = 0.01 / \lVert \nabla_{w'} L_{val}(w', \alpha) \rVert_2$ to be sufficiently accurate in all of our experiments), $w^{+} = w + \delta \nabla_{w'} L_{val}(w', \alpha)$ and $w^{-} = w - \delta \nabla_{w'} L_{val}(w', \alpha)$. Then:

$$\nabla^2_{\alpha, w} L_{train}(w, \alpha)\, \nabla_{w'} L_{val}(w', \alpha) \approx \frac{\nabla_\alpha L_{train}(w^{+}, \alpha) - \nabla_\alpha L_{train}(w^{-}, \alpha)}{2\delta}$$

Evaluating the finite difference requires only two forward passes for the weights and two backward passes for $\alpha$, and the complexity is reduced from $O(|\alpha|\,|w|)$ to $O(|\alpha| + |w|)$.
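
The finite-difference approximation itself is only a few lines. Below is a sketch using the same simplified single-tensor interface as above (`train_loss_fn(w, alpha) -> scalar torch loss`); the function name and argument layout are illustrative.

```python
import torch

def hvp_finite_difference(train_loss_fn, w, alpha, vector, r=0.01):
    """Approximate  H = grad^2_{alpha,w} L_train(w, alpha) . grad_{w'} L_val(w', alpha)
    by a central finite difference.  `vector` is grad_{w'} L_val(w', alpha), same shape
    as w; alpha has requires_grad=True. Illustrative sketch only."""
    eps = r / vector.norm()  # delta = 0.01 / ||grad_{w'} L_val(w', alpha)||_2

    def grad_alpha(w_perturbed):
        return torch.autograd.grad(train_loss_fn(w_perturbed, alpha), alpha)[0]

    g_plus = grad_alpha(w + eps * vector)    # grad_alpha L_train(w+, alpha)
    g_minus = grad_alpha(w - eps * vector)   # grad_alpha L_train(w-, alpha)
    return (g_plus - g_minus) / (2 * eps)    # central difference estimate of H
```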

First-order Approximation: When $\epsilon = 0$, the second-order term disappears.
In this case, the architecture gradient is simply $\nabla_\alpha L_{val}(w, \alpha)$, corresponding to the heuristic of optimizing the validation loss under the assumption that $\alpha$ and $w$ are independent of each other.
This leads to some speed-up but empirically worse performance.
In the following, we refer to the case $\epsilon = 0$ as the first-order approximation, and to the gradient formulation with $\epsilon > 0$ as the second-order approximation.

Algorithm 1: DARTS - Differentiable Architecture Search
Create a mixed operation $\bar{o}^{(i,j)}$ parametrized by $\alpha^{(i,j)}$ for each edge $(i, j)$
while not converged do
- 1. Update weights $w$ by descending $\nabla_w L_{train}(w, \alpha)$
- 2. Update architecture $\alpha$ by descending $\nabla_\alpha L_{val}(w - \epsilon \nabla_w L_{train}(w, \alpha), \alpha)$
Replace $\bar{o}^{(i,j)}$ with $o^{(i,j)} = \arg\max_{o \in \mathcal{O}} \alpha_o^{(i,j)}$ for each edge $(i, j)$
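
Algorithm 1 amounts to an alternating gradient loop. The sketch below shows the first-order ($\epsilon = 0$) variant with plain SGD updates, again using the simplified `(w, alpha) -> loss` interface; a second-order step would instead descend the gradient returned by `unrolled_arch_gradient` above. Learning rates and the stopping criterion are placeholders.

```python
import torch

def darts_search(train_loss, val_loss, w, alpha, lr_w, lr_alpha, num_steps):
    """First-order sketch of Alg. 1: alternate SGD steps on w (training loss)
    and alpha (validation loss). w and alpha are tensors with requires_grad=True."""
    for _ in range(num_steps):
        # 1. update weights w by descending grad_w L_train(w, alpha)
        g_w = torch.autograd.grad(train_loss(w, alpha), w)[0]
        with torch.no_grad():
            w -= lr_w * g_w
        # 2. update architecture alpha by descending grad_alpha L_val(w, alpha)
        #    (epsilon = 0 here; the second-order variant uses the unrolled gradient)
        g_alpha = torch.autograd.grad(val_loss(w, alpha), alpha)[0]
        with torch.no_grad():
            alpha -= lr_alpha * g_alpha
    return w, alpha
```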

2.4 Deriving Discrete Architectures

After obtaining the continuous architecture encoding α , the discrete architecture is derived by

  1. Retaining the $k$ strongest predecessors for each intermediate node, where the strength of an edge is defined as $\max_{o \in \mathcal{O},\, o \neq zero} \frac{\exp(\alpha_o^{(i,j)})}{\sum_{o' \in \mathcal{O}} \exp(\alpha_{o'}^{(i,j)})}$.
    To make our derived architecture comparable with those in the existing works, we use k = 2 for convolutional cells (Zoph et al., 2017; Real et al., 2018) and k = 1 for recurrent cells (Pham et al., 2018b).
  2. Replacing every mixed operation with the most likely operation by taking the argmax.
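
A sketch of this derivation step, assuming the learned logits are stored per edge in a dict keyed by `(i, j)` and that the candidate list `op_names` contains the zero op at a known index; the layout and names are hypothetical, chosen only to mirror Sect. 2.4.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def derive_discrete_cell(alpha, op_names, k=2, zero_index=0):
    """alpha: dict (i, j) -> logits over |O| candidate ops for edge i -> j.
    For each intermediate node j, keep its k strongest incoming edges and place the
    most likely non-zero operation on each of them (illustrative layout and names)."""
    def strength(edge):
        # edge strength: largest softmax weight among the non-zero operations
        probs = edge[1].copy()
        probs[zero_index] = -np.inf
        return probs.max()

    genotype = {}
    for j in sorted({dst for (_, dst) in alpha}):
        incoming = [(i, softmax(alpha[(i, jj)])) for (i, jj) in alpha if jj == j]
        for i, probs in sorted(incoming, key=strength, reverse=True)[:k]:
            probs = probs.copy()
            probs[zero_index] = -np.inf           # never select the zero op
            genotype.setdefault(j, []).append((i, op_names[int(np.argmax(probs))]))
    return genotype
```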

3 Experiments and Results

3.2 Architecture Evaluation

3.3 Results Analysis

3.4 Transferability of Learned Architectures

4 Conclusion

We presented DARTS, the first differentiable architecture search algorithm for both convolutional and recurrent networks.
By searching in a continuous space, DARTS is able to match or outperform the state-of-the-art non-differentiable architecture search methods on image classification and language modeling tasks with remarkable efficiency improvement by several orders of magnitude.
In the future, we would like to investigate direct architecture search on larger tasks (e.g. ImageNet) using DARTS.

Afterthoughts

The overall idea is fairly clear: treat the building blocks of the network as edges, relax the originally discrete choices into a continuous state with a softmax, jointly optimize the mixing probabilities and the network weights by solving a bilevel optimization problem, and finally derive the architecture from the learned mixing probabilities.

Reading this paper also deepened my understanding of Inception. Inception essentially lays out several filters in parallel and lets backpropagation learn how to use them, which in the spirit of this paper amounts to learning which filter is the most suitable. The machine-learning principle of training the weights on the training set while validating the architecture on the validation set is also very clearly embodied here.

Finally, a couple of concerns. First, with this many parameters, it feels like a very large dataset would be needed. Second, the reason the softmax values are not kept as the final result, and an argmax is taken instead, is presumably the worry about having too many parameters: it simplifies the model and speeds it up, and also acts as a kind of regularization. But if the softmax values are not that far apart, would keeping only the largest one hurt performance in more general settings? (The paper does show that the learned architectures transfer reasonably well to other datasets, but I am still a bit concerned about the more general case.) I hope to find time to try running the code myself; it feels like I picked up a new way of thinking, which is quite satisfying.

Reposted from blog.csdn.net/SrdLaplace/article/details/80863346