Paper | Deep Mutual Learning

[The algorithm and formulas are very simple, even a little naive, but the paper is well written.]

To give a small network the capability of a large one, we typically use distillation. This paper presents a new method: deep mutual learning (DML). Unlike distillation, there are multiple student networks learning together, and the students learn from each other. The authors also report a surprising result: we do not need a powerful pre-trained teacher network; a cohort of simple student networks learning together can already outperform distillation.

1. Introduction: motivation and method

There are many ways to make a model compact: compact network design, model compression, pruning, binarisation, and the most interesting of all, model distillation.

The motivation for distillation is that a small network can have a representation capacity similar to a large network, yet it is harder to train than the large one. In other words, the problem with small networks is not their capacity but their optimisation.

Distillation therefore sets up a teacher model. The smaller student model tries to imitate the teacher's class probabilities or feature representations, rather than learning only from the conventional supervised targets. Since the teacher is trained in advance, distillation is a one-way form of learning.

This paper takes a different approach. It sets up a cohort of student networks that learn together. Each student network is trained with two losses: a conventional supervised learning loss, and a mimicry loss that aligns each student's class posterior with the class probabilities of the other students, so the other students' predictions act as that student's prior.

The significance is threefold: (1) each student network performs better than when trained alone, and also better than with distillation; (2) a strong teacher is no longer needed; (3) letting, say, three large networks learn from each other works better than training a single large network on its own. That is, even if we ignore model size and care only about accuracy, deep mutual learning is still useful.

Is there a theory explaining this? Unfortunately not; the question of exactly which extra information the gain comes from stays open. Intuitively, both mutual learning and distillation provide additional information that guides the student networks, so the students settle into more reasonable (flatter) local optima. [A bit like dropout, except the robustness comes not from perturbing the network structure but from the optimisation strategy.]

The authors ran experiments on person re-identification and image classification, and mutual learning gives better results than distillation. There are also a few further findings:

  • The method is effective for a variety of network architectures, and for cohorts that combine networks of different sizes;

  • As the number of cooperating networks increases, performance keeps improving;

  • It helps semi-supervised learning, because the mimicry loss is effective on labelled data and on unlabelled data alike.

2. Related Work

Compared with distillation, DML simply throws away the notion of a teacher network and lets a cohort of ordinary student networks learn from each other.

Compared with collaborative learning, here the target is the same for every network, whereas existing collaborative learning approaches aim at jointly solving different tasks.

3. Methods

3.1 Formulation

As shown in the figure from the paper, the setup is very clear.

[Figure: two student networks trained on the same inputs, each with its own supervised loss plus a KL mimicry loss towards the other network's predictions.]

Suppose there are \(M\) classes and \(N\) samples \(\{x_i\}_{i=1}^N\), with corresponding labels \(\{y_i\}_{i=1}^N\).

As usual, the supervised loss \(L_C\) is the cross-entropy between the predicted probabilities and the true labels (for one-hot labels this coincides with the KL divergence from the label distribution). The predicted probabilities are obtained by normalising the network's logits \(z\) with a softmax.
\[ p^{m}\left(x_{i}\right) = \frac{\exp\left(z^{m}\right)}{\sum_{m=1}^{M} \exp\left(z^{m}\right)} \]
\[ L_{C} = -\sum_{i=1}^{N} \sum_{m=1}^{M} I\left(y_{i}, m\right) \log\left(p^{m}\left(x_{i}\right)\right) \]
\[ I\left(y_{i}, m\right) = \left\{\begin{array}{ll} 1 & y_{i} = m \\ 0 & y_{i} \neq m \end{array}\right. \]

Further, we introduce a second network with a different random initialisation, and define the mimicry loss for network 1 with network 2's prediction as the reference:
\[ D_{KL}\left(\boldsymbol{p}_{2} \| \boldsymbol{p}_{1}\right) = \sum_{i=1}^{N} \sum_{m=1}^{M} p_{2}^{m}\left(\boldsymbol{x}_{i}\right) \log \frac{p_{2}^{m}\left(\boldsymbol{x}_{i}\right)}{p_{1}^{m}\left(\boldsymbol{x}_{i}\right)} \]

Interpretation: if the two networks predict the same distribution, the loss is zero; otherwise, as soon as the two predictions diverge (e.g. one probability tends to 0 while the other tends to 1), the loss is positive.
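
For a concrete two-class illustration (the numbers here are just made up for this note, not from the paper):
\[ D_{KL}\big((0.9,\,0.1)\,\|\,(0.5,\,0.5)\big) = 0.9\log\frac{0.9}{0.5} + 0.1\log\frac{0.1}{0.5} \approx 0.37 \text{ nats}, \]
\[ D_{KL}\big((0.9,\,0.1)\,\|\,(0.1,\,0.9)\big) = 0.9\log 9 - 0.1\log 9 \approx 1.76 \text{ nats}, \]
so the further network 1's prediction \(\boldsymbol{p}_1\) (the second argument) drifts from network 2's \(\boldsymbol{p}_2\), the larger the mimicry loss.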

Of course, a symmetric KL loss could be used instead, i.e. \(\frac{1}{2}\left(D_{KL}\left(\boldsymbol{p}_{1} \| \boldsymbol{p}_{2}\right)+D_{KL}\left(\boldsymbol{p}_{2} \| \boldsymbol{p}_{1}\right)\right)\). The authors found it makes no difference in practice. [The blogger suspects that formula (7) in the paper is wrong.]

The final loss for each network is simply the sum of its own supervised loss and the mimicry loss above.
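
Written out per network, that means \(L_{\Theta_1} = L_{C_1} + D_{KL}\left(\boldsymbol{p}_{2} \| \boldsymbol{p}_{1}\right)\) and \(L_{\Theta_2} = L_{C_2} + D_{KL}\left(\boldsymbol{p}_{1} \| \boldsymbol{p}_{2}\right)\). A minimal PyTorch sketch of one joint training step might look as follows (net1, net2, opt1, opt2 are placeholder names for the two students and their optimisers, not from the paper; each peer's prediction is detached so it acts as a fixed reference):

```python
import torch
import torch.nn.functional as F

def dml_step(net1, net2, opt1, opt2, x, y):
    """One mutual-learning step for a two-student cohort (sketch)."""
    log_p1 = F.log_softmax(net1(x), dim=1)
    log_p2 = F.log_softmax(net2(x), dim=1)

    # Network 1: supervised cross-entropy + KL(p_2 || p_1), peer detached.
    loss1 = F.nll_loss(log_p1, y) + F.kl_div(
        log_p1, log_p2.detach(), log_target=True, reduction="batchmean")

    # Network 2: supervised cross-entropy + KL(p_1 || p_2), peer detached.
    loss2 = F.nll_loss(log_p2, y) + F.kl_div(
        log_p2, log_p1.detach(), log_target=True, reduction="batchmean")

    opt1.zero_grad(); loss1.backward(); opt1.step()
    opt2.zero_grad(); loss2.backward(); opt2.step()
    return loss1.item(), loss2.item()
```

Detaching the peer means each KL term only back-propagates into the student currently being updated.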

3.2 Optimisation

Each network can be trained on its own separate GPU.

When more networks join the cohort, the mimicry losses against all the other peers are averaged.
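
Concretely, for a cohort of \(K\) students the objective of the \(k\)-th network becomes
\[ L_{\Theta_{k}} = L_{C_{k}} + \frac{1}{K-1} \sum_{l=1, l \neq k}^{K} D_{KL}\left(\boldsymbol{p}_{l} \| \boldsymbol{p}_{k}\right), \]
i.e. the \(K-1\) mimicry terms are averaged so their overall weight relative to the supervised loss stays comparable.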

Another optimisation strategy: average the probabilities of all the other students (in effect merging them into a single teacher), and then compute one KL divergence between this averaged distribution and the current student's own distribution:
\[ L_{\Theta_{k}} = L_{C_{k}} + D_{KL}\left(\boldsymbol{p}_{avg} \| \boldsymbol{p}_{k}\right), \quad \boldsymbol{p}_{avg} = \frac{1}{K-1} \sum_{l=1, l \neq k}^{K} \boldsymbol{p}_{l} \]
Experiments found that this does not work as well. A possible reason: the averaging operation reduces the entropy (randomness) of the teacher signal, so it carries less information.
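
The difference between the two strategies is easy to see in code. A small sketch (log_p is the current student's log-probabilities, peer_probs a hypothetical list of the other students' softmax outputs; neither name comes from the paper):

```python
import torch
import torch.nn.functional as F

def mimicry_avg_of_kls(log_p, peer_probs):
    """Strategy 1: average the K-1 pairwise terms KL(p_l || p_k)."""
    terms = [F.kl_div(log_p, p_l.detach(), reduction="batchmean")
             for p_l in peer_probs]
    return torch.stack(terms).mean()

def mimicry_kl_to_avg(log_p, peer_probs):
    """Strategy 2: merge the peers into one averaged teacher, then one KL term."""
    p_avg = torch.stack([p_l.detach() for p_l in peer_probs]).mean(dim=0)
    return F.kl_div(log_p, p_avg, reduction="batchmean")
```

Either function could replace the single KL term in the two-network step above; the first corresponds to the formulation that the experiments favoured.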

3.3 Weakly supervised learning

The implementation is simple: labelled data are optimised with the supervised loss, while unlabelled data are optimised with the mimicry loss, which needs no labels.
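
A minimal sketch of one way to implement that rule for a single mini-batch (here the mimicry term is kept on every sample, matching the earlier point that it works on labelled and unlabelled data alike; has_label is a hypothetical boolean mask marking which samples carry a label, and peer_log_probs comes from the other student as in Section 3.1):

```python
import torch
import torch.nn.functional as F

def semi_supervised_loss(logits, peer_log_probs, labels, has_label):
    """Supervised CE only where a label exists; mimicry KL on every sample."""
    log_p = F.log_softmax(logits, dim=1)

    # The mimicry term needs no labels, so it also covers unlabelled samples.
    mimicry = F.kl_div(log_p, peer_log_probs.detach(),
                       log_target=True, reduction="batchmean")

    if has_label.any():
        supervised = F.cross_entropy(logits[has_label], labels[has_label])
    else:
        supervised = logits.new_zeros(())  # purely unlabelled batch

    return supervised + mimicry
```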

4. Experiments

4.1 The basic experiment

[Table: accuracies of various network combinations trained independently vs. with DML.]

As the table shows, the authors tried many network architectures, and with mutual learning accuracy improves in every case (the DML − Independent margin is positive). The cohorts also include pairings of different networks. The authors further tried the person re-identification task, where accuracy improves as well.

During training, DML also contributes to faster and better convergence.

The authors tried two update strategies: a sequential one, where one network finishes its update before the next network is updated, and a parallel one, where all networks are updated simultaneously. They found the second works better, and since it is parallel it is also more efficient.

The authors also compared against distillation, whose results are clearly worse than DML's.

The authors also examined how the number of student networks affects the final result. Overall, accuracy keeps growing with cohort size, and the variance across runs also shrinks.

4.2 In-depth experiments

So why is DML effective? The authors carried out some further experiments to probe this.

References [4,10] argue that a network that ends up in a wide valley of the loss landscape usually generalises better than one that ends up in a narrow crevice. Why? Because under perturbation, the predictions of a wide-valley network barely change, while those of a narrow-crevice network change a lot. DML acts as a facilitator that helps the networks get out of narrow crevices.

The authors cannot prove this directly, but they ran an experiment: as in [4,10], they added Gaussian noise to the weights of the trained networks. The training error of the independently trained network then rises sharply, while the training error of the DML-trained network rises only slightly.
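
A sketch of that kind of flatness probe (model, loader and the noise scale sigma are placeholders; the idea is simply to perturb a copy of the trained weights and re-measure the training loss):

```python
import copy
import torch
import torch.nn.functional as F

@torch.no_grad()
def loss_under_weight_noise(model, loader, sigma=0.01, device="cpu"):
    """Add i.i.d. Gaussian noise to every parameter, then re-evaluate the loss."""
    noisy = copy.deepcopy(model).to(device).eval()
    for p in noisy.parameters():
        p.add_(torch.randn_like(p) * sigma)

    total_loss, n = 0.0, 0
    for x, y in loader:
        x, y = x.to(device), y.to(device)
        total_loss += F.cross_entropy(noisy(x), y, reduction="sum").item()
        n += y.numel()
    # A network sitting in a wide valley should show only a small increase here.
    return total_loss / n
```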

Furthermore, does DML amount to averaging over teacher networks, and is that kind of averaging a good thing? The authors find that adding DML makes a network's predictions less confident. This is similar to entropy-regularisation approaches [4,17], which help a network find wider local minima; compared with [4], however, DML works better.
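
That confidence claim is easy to quantify: compare the average entropy of the predicted posteriors with and without DML (a quick sketch; model and loader are placeholders, not from the paper):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def mean_prediction_entropy(model, loader, device="cpu"):
    """Average entropy (in nats) of the model's softmax posteriors."""
    model.eval()
    batch_entropies = []
    for x, _ in loader:
        p = F.softmax(model(x.to(device)), dim=1)
        ent = -(p * p.clamp_min(1e-12).log()).sum(dim=1)
        batch_entropies.append(ent)
    # Higher mean entropy = less confident (more spread-out) predictions.
    return torch.cat(batch_entropies).mean().item()
```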

It was also found that, with or without DML, networks starting from different initialisations learn different features. This diversity arising from randomness is exactly what provides the robustness. Moreover, if we force the features to be similar, the final result drops instead of improving: the authors tried adding an L2 loss on the features, and the effect got worse.


Source: www.cnblogs.com/RyanXing/p/deep_mutual_learning.html