Paper | ROUTING NETWORKS: ADAPTIVE SELECTION OF NON-LINEAR FUNCTIONS FOR MULTI-TASK LEARNING

Summary

Problem: multi-task learning lets tasks benefit from one another, but it can also produce crosstalk (task interference).

Solution: let the network itself decide which structures to share and which to keep separate, in the spirit of alignment approaches such as cross-stitch networks.

Concrete proposal: the routing network paradigm, i.e. a self-organizing network architecture together with a method for training it.

Effect: the training cost stays roughly constant as the number of tasks grows. On CIFAR-100 with 20 tasks, training time is about 85% less than that of cross-stitch networks.

Limitations:

  1. Although the authors provide qualitative results, the routing decisions the network learns remain unexplained.
  2. There are still quite a few hyper-parameters to set, e.g. the number of function blocks. A simple recipe: use as many blocks (per layer) as there are tasks.

Motivation

According to Caruana, the advantage of MTL is that it exploits the commonalities among multiple tasks to improve generalization. The authors put it this way:

This means a model must leverage commonalities in the tasks (positive transfer) while minimizing interference (negative transfer).

The authors therefore propose the routing network for MTL, which consists of two components: a router and a set of function blocks. Over a fixed recursion depth, the router iteratively plans (selects) which blocks participate. A skipped block simply does not participate, and the state is passed through unchanged.

Ideally, positive transfer is achieved by sharing function blocks across tasks, while negative transfer is avoided by giving tasks separate blocks.

Related work

Traditional MTL architectures: classical multi-task deep learning in the style of Caruana requires a carefully hand-designed network structure, e.g. sharing the low-level features. Routing networks go further: the whole network becomes dynamic and assemblable, so its structure can be adjusted independently for each task.

Transfer learning: automated selection mechanisms have been studied extensively in transfer learning, e.g. attention and learned gating mechanisms. However, this paper considers not just two tasks but up to 20, and uses such work as one of its baselines.

Mixtures-of-experts architectures: the input is fed to several expert models and their outputs are combined with learned weights. This soft mixing decision differs from the hard routing decision made here, and it is not meant to model some of the effects the authors care about.

Dynamic representations: some work generates the weights of a network dynamically in order to obtain a network adapted to the input. However, such approaches generally cannot handle large depth and large parameter counts. Routing reduces this burden, since it only selects among existing blocks rather than generating weights.

Minimizing computational cost in the single-task setting: prior work includes REINFORCE, Q-learning, and actor-critic methods. This paper focuses on MTL, and therefore uses a multi-agent reinforcement learning training algorithm and a recursive decision process.

More broadly, this work is a form of automated architecture search; the paper is the first to combine MTL with hard routing decisions.

Routing networks

Routing: the router is applied repeatedly to select a sequence of function blocks, which are assembled into a network that serves the given input.

(Figure: a routing network routing an input through selected function blocks.)

As shown in the figure, the input is a vector v (a sample to be classified) together with its task label out of T tasks (i.e. the task number). The router chooses F13, F21 and F32 (the last one should be a classification layer), so the final prediction is \(\hat{y}\).

The algorithm, shown in the table below, is very simple.

(Table: the routing algorithm.)
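
As a rough illustration of the loop in that table, here is a minimal Python/PyTorch sketch of the routing forward pass. All names here (RoutingNetwork, PASS, the router callable) are mine, not the authors' code.

```python
import torch.nn as nn

PASS = -1  # sentinel action: skip this depth entirely

class RoutingNetwork(nn.Module):
    """Minimal sketch: a router repeatedly picks one function block per depth."""

    def __init__(self, blocks, router, max_depth):
        super().__init__()
        # blocks[d] is the list of candidate function blocks at depth d
        self.blocks = nn.ModuleList(nn.ModuleList(layer) for layer in blocks)
        self.router = router          # callable: (state, task_id, depth) -> block index or PASS
        self.max_depth = max_depth

    def forward(self, x, task_id):
        state = x
        for depth in range(self.max_depth):
            action = self.router(state, task_id, depth)
            if action == PASS:        # skipped block: state passes through unchanged
                continue
            state = self.blocks[depth][action](state)
        return state                  # the last selected block acts as the classifier
```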

Notes:

  • Inputs: the maximum recursion depth \(n\) and the task ID (assumed to be known).
  • The router may choose to skip a given depth.
  • The router takes the current state as input, which is not necessarily the original input x.
  • Here each layer has the same number of blocks. If not, a separate router can be used per layer, so the decision functions are independent.

Any neural network can be converted into a routing network: make multiple copies of each layer and use the copies as that layer's function blocks.
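
A hedged sketch of that conversion, assuming a plain MLP and one routed layer per original layer (the helper name and the block count are illustrative only):

```python
import copy
import torch.nn as nn

def to_function_blocks(layers, num_blocks_per_layer):
    """Duplicate every layer of an existing network into candidate function blocks."""
    return [[copy.deepcopy(layer) for _ in range(num_blocks_per_layer)]
            for layer in layers]

# Example: a 3-layer MLP becomes a depth-3 routing network with, say,
# one block per task (the simple recipe from the Limitations section above).
layers = [nn.Sequential(nn.Linear(784, 128), nn.ReLU()),
          nn.Sequential(nn.Linear(128, 128), nn.ReLU()),
          nn.Linear(128, 10)]
blocks = to_function_blocks(layers, num_blocks_per_layer=20)
```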

Training

The router and the function blocks are trained jointly with collaborative multi-agent reinforcement learning (MARL).

There are two rewards: an immediate action reward given at each layer (each decision), and a final reward.

  1. The final reward encourages a high-performing network. For a classification task, a training sample contributes +1 if it is classified correctly and -1 otherwise.
  2. The immediate reward encourages the network to use fewer blocks. The authors try two strategies: (1) the historical average number of times the block was selected in previous iterations; (2) the historical average selection probability. They found no significant difference between the two and chose (2). This reward is multiplied by a coefficient \(\rho\). If only performance matters, Fig. 12 of the paper suggests the coefficient should be as small as possible (preferably 0). A small sketch of both rewards follows this list.
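
A minimal sketch of the two reward signals as I read them; the function names, and the use of a running-average table for the historical selection probability, are my assumptions.

```python
def final_reward(predicted_label, true_label):
    # final reward: +1 for a correct classification, -1 otherwise
    return 1.0 if predicted_label == true_label else -1.0

def immediate_reward(chosen_block, historical_probs, rho):
    # strategy (2): reward proportional to the historical average probability
    # of selecting this block, scaled by the coefficient rho
    return rho * historical_probs[chosen_block]
```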

There are several possible RL agent configurations (a toy sketch follows the list):

  1. A single agent, i.e. all tasks share one policy.
  2. Multiple agents, one per task, each learning its own policy. This works best in the paper's experiments.
  3. On top of the multi-agent setup, add a dispatching agent dedicated to assigning work to the agents. Tasks then no longer need to correspond one-to-one to agents; the dispatching agent decides.
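
For concreteness, here is a toy illustration of how the three configurations could be wired; the class names and interfaces are entirely my own, not the paper's code.

```python
class SharedAgent:
    """(1) single agent: every task uses the same policy."""
    def __init__(self, policy):
        self.policy = policy
    def select(self, state, task_id, depth):
        return self.policy(state, depth)

class PerTaskAgents:
    """(2) one agent/policy per task (the best-performing variant)."""
    def __init__(self, policies):
        self.policies = policies          # maps task_id -> policy
    def select(self, state, task_id, depth):
        return self.policies[task_id](state, depth)

class DispatchedAgents:
    """(3) a dispatching agent decides which agent handles the sample."""
    def __init__(self, dispatcher, agents):
        self.dispatcher = dispatcher      # maps state -> agent index
        self.agents = agents
    def select(self, state, task_id, depth):
        return self.agents[self.dispatcher(state)](state, depth)
```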

Because the block parameters and the router policies change at the same time, naive training is unstable, and the authors found it ineffective. The Weighted Policy Learner (WPL) addresses the instability of this non-stationary MARL setting: it suppresses oscillation and accelerates convergence of the agents. It does so by scaling the gradient, effectively lowering the learning rate when the policy is moving away from the Nash equilibrium and raising it otherwise.

The WPL update is given in the paper's Table 3.
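
Below is a minimal numpy sketch of a WPL-style update based on my reading of the description above; the exact form in the paper's Table 3 may differ.

```python
import numpy as np

def wpl_update(pi, avg_return, lr):
    """One WPL-style policy update (sketch).

    pi:         current action probabilities of one agent
    avg_return: running average return estimated per action
    lr:         base learning rate
    """
    delta = avg_return - np.dot(pi, avg_return)   # per-action advantage
    scale = np.where(delta < 0, pi, 1.0 - pi)     # damp the gradient to suppress oscillation
    pi = pi + lr * delta * scale
    pi = np.clip(pi, 1e-6, None)                  # crude projection back onto the simplex
    return pi / pi.sum()
```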

Experiments

The convnet from "Optimization as a Model for Few-Shot Learning" is converted into a routed version and evaluated on three image classification datasets. Each label is treated as a task.

The baselines are cross-stitch networks and the joint training strategies with layer sharing described by Caruana.

I will not go over the numerical results here; let us look at a qualitative experiment instead. Visualizing the MNIST experiment yields the following routing:

(Figure: routing decisions visualized for the MNIST experiment.)

In other words, conventional MTL encourages sharing low-level features, but the routing network instead shows a pear-shaped 7-4-5 pattern across the layers. It is unclear why this is the best arrangement for MTL, but the experiments show it beats the static baselines.

Time for a break.

Source: www.cnblogs.com/RyanXing/p/routing_networks.html