[One Shot] "Matching Networks for One Shot Learning"

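Learning from a small amount of training data is still a major challenge in machine learning research. In vision and text processing, for example, traditional supervised deep learning methods cannot learn new concepts effectively from limited data.
Inspired by work that uses deep neural networks to learn image and text features, and by work that augments neural networks with external memory, the paper proposes a neural network model for one-shot learning: the Matching Net (MN). The model maps a small labelled set and an unlabelled test example to the test example's label, without needing any fine-tuning to adapt to new label classes.
The paper reports that one-shot accuracy improves from 87.6% to 93.2% on ImageNet and from 88.0% to 93.8% on Omniglot.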

1 Motivation

humans learn new concepts with very little supervision (one-shot learning)
data augmentation and regularization alleviate but do not solve over-fitting
non-parametric models allow novel examples to be rapidly assimilated
(in contrast, a parametric model has to slowly learn training examples into its parameters)

2 Innovation

3 Method

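Non-parametric estimation (for example kernel density estimation, KDE) studies the characteristics of a data distribution starting from the data samples themselves. In probability theory it is used to estimate an unknown density function, and it is one of the non-parametric methods.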
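As a small illustration of the non-parametric idea, here is a minimal sketch of a one-dimensional Gaussian KDE. The function name `gaussian_kde`, the toy samples, and the bandwidth value are assumptions made for this example; they do not come from the paper.

```python
import math


def gaussian_kde(samples, x, bandwidth):
    """Estimate the density at x from 1-D samples using a Gaussian kernel."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(
        math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
    )


# Toy data (assumed): one cluster around 1.0 and one around 3.0.
samples = [1.0, 1.2, 0.8, 3.0, 3.1]
density_near_cluster = gaussian_kde(samples, 1.0, bandwidth=0.3)
density_in_gap = gaussian_kde(samples, 2.0, bandwidth=0.3)
```

No parameters are fitted here: adding a new sample changes the estimate immediately, which is the property the notes contrast with slow parametric learning.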

3.1 matching network

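In the prediction $\hat{y} = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i$, a is the attention mechanism.
g(x) and f(x) are explained in 3.1.3 and 3.1.4.
$\theta$ is trained with the objective $\theta = \arg\max_\theta E_{L \sim T}\left[ E_{S \sim L, B \sim L}\left[ \sum_{(x,y) \in B} \log P_\theta(y \mid x, S) \right] \right]$, where: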

T: task, a distribution over possible label sets, typically uniform
L: label set sampled from T, e.g. {dogs, cats, pigs}; each label set is small (e.g. up to 5 classes, with up to 5 examples per class)
S: support set, sampled from the label set L
B: batch, sampled from the label set L
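The episode structure defined by T, L, S and B can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' code: the embeddings f and g are taken to be the identity, the attention a is a softmax over cosine similarities (the form the paper describes), and the data layout, function names and toy vectors are hypothetical.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def sample_episode(data, n_way, k_shot, n_query, rng):
    """data maps label -> list of feature vectors (hypothetical layout)."""
    label_set = rng.sample(sorted(data), n_way)           # L ~ T (uniform)
    support, batch = [], []
    for label in label_set:
        items = rng.sample(data[label], k_shot + n_query)
        support += [(x, label) for x in items[:k_shot]]   # S ~ L
        batch += [(x, label) for x in items[k_shot:]]     # B ~ L
    return label_set, support, batch


def matching_probs(x_hat, support, label_set):
    """P(y | x_hat, S) = sum over i of a(x_hat, x_i) * [y_i == y]."""
    sims = [cosine(x_hat, x) for x, _ in support]
    peak = max(sims)                                      # for numerical stability
    weights = [math.exp(s - peak) for s in sims]
    total = sum(weights)
    probs = {label: 0.0 for label in label_set}
    for w, (_, label) in zip(weights, support):
        probs[label] += w / total
    return probs


def episode_log_likelihood(support, batch, label_set):
    """Inner sum of the objective: sum over (x, y) in B of log P(y | x, S)."""
    return sum(
        math.log(matching_probs(x, support, label_set)[y]) for x, y in batch
    )


rng = random.Random(0)
toy_data = {  # assumed toy features, three well-separated classes
    "dogs": [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0]],
    "cats": [[0.1, 1.0], [0.0, 0.9], [0.2, 1.0]],
    "pigs": [[-1.0, -0.1], [-0.9, 0.0], [-1.0, 0.1]],
}
label_set, support, batch = sample_episode(
    toy_data, n_way=2, k_shot=2, n_query=1, rng=rng
)
objective = episode_log_likelihood(support, batch, label_set)
```

Training would average `objective` over many sampled episodes and maximise it with respect to the parameters of f and g; here there are no parameters, so the sketch only shows how one episode is scored.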

3.1.1 Sequence to sequence with attention

In more detail (bidirectional RNN + attention):

3.1.2 Memory networks

LSTM

The paper uses a bidirectional LSTM; in the figure below, simply replace each box with an LSTM cell.

3.1.3 g(x)

【平价数据】One Shot Learning
(a very good summary)

The h in the formula on the left corresponds to the a in the figure on the right.
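For reference, the support-set embedding can be written out as follows. This is my reading of the paper's appendix: g'(x_i) denotes the context-free embedding of x_i (for example a CNN), and the arrows mark the forward and backward passes of the bidirectional LSTM over the support set.

```latex
\begin{aligned}
g(x_i, S) &= \overrightarrow{h}_i + \overleftarrow{h}_i + g'(x_i) \\
\overrightarrow{h}_i,\ \overrightarrow{c}_i &= \mathrm{LSTM}\big(g'(x_i),\ \overrightarrow{h}_{i-1},\ \overrightarrow{c}_{i-1}\big) \\
\overleftarrow{h}_i,\ \overleftarrow{c}_i &= \mathrm{LSTM}\big(g'(x_i),\ \overleftarrow{h}_{i+1},\ \overleftarrow{c}_{i+1}\big)
\end{aligned}
```

The skip connection adding g'(x_i) means each support example keeps its own features and only gains context from the rest of S.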

3.1.4 f(x)

Attention is added at the [h, r] concatenation.
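The read step behind [h, r] can be sketched as below. As I understand the paper, the readout is r = sum_i a(h, g(x_i)) g(x_i) with a = softmax(h^T g(x_i)), and the concatenation [h, r] is fed to the next LSTM step as its hidden state. The LSTM cell update itself is omitted, and the function names and toy vectors are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def read_support(h, support_embeddings):
    """Return the readout r and the attention weights for hidden state h."""
    scores = [sum(a * b for a, b in zip(h, g)) for g in support_embeddings]
    weights = softmax(scores)
    readout = [
        sum(w * g[d] for w, g in zip(weights, support_embeddings))
        for d in range(len(h))
    ]
    return readout, weights


h = [1.0, 0.0]                      # current hidden state (toy values)
support_embeddings = [[2.0, 0.0],   # g(x_1), aligned with h
                      [0.0, 2.0]]   # g(x_2), orthogonal to h
readout, weights = read_support(h, support_embeddings)
state = h + readout                 # list concatenation, i.e. [h, r]
```

Because the weights depend on h, each processing step can attend to different support examples, which is what makes the embedding of the test example depend on S.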

3.1.5 Pointer networks

Pointer Networks

4 Experiments

4.1 Omniglot

Full Contextual Embedding (FCE) makes little difference on Omniglot
5-shot accuracy > 1-shot accuracy
5-way accuracy > 20-way accuracy (20-way is the harder task)

4.2 MNIST

Transferring directly from Omniglot to MNIST, as was done for Siamese Nets, gives the following accuracy:
- 63% Baseline classifier
- 70% Siamese Nets
- 72% Matching Nets

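The baseline classifier is structured as follows:
4-layer CNN
input: 28x28
each layer consists of:
- 3x3 convolution with 64 filters
- BN
- ReLU
- 2x2 max-pooling

Spatial size: 28 → 14 → 7 → 3 → 1
The final output is 1x1x64.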
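The size progression above can be checked with a few lines of arithmetic. The sketch assumes the 3x3 convolutions use stride 1 with padding 1 (so they preserve the spatial size) and that pooling rounds down; the notes do not state the padding, so treat that as an assumption.

```python
def block_output_size(size, conv_padding=1, kernel=3, pool=2):
    """Spatial size after one conv (stride 1) + 2x2 max-pool (stride 2) block."""
    after_conv = size + 2 * conv_padding - kernel + 1  # unchanged when padding=1
    return after_conv // pool                          # pooling floors odd sizes


size = 28
sizes = [size]
for _ in range(4):            # four identical blocks
    size = block_output_size(size)
    sizes.append(size)

channels = 64
feature_dim = sizes[-1] * sizes[-1] * channels  # flattened length of the 1x1x64 output
```

The floor in the pooling step is what turns 7 into 3 and 3 into 1.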

4.3 miniImageNet

100 classes from ImageNet
80 classes for training
20 classes for testing

4.4 rand and dogs

dogs setup
train: ImageNet classes minus the 118 dog classes
test: the 118 dog classes
rand setup
train: ImageNet classes minus 118 randomly chosen classes
test: the 118 randomly chosen classes

5 Conclusion

one-shot learning is much easier if you train the network to do one-shot learning
non-parametric structures in a neural network make it easier for networks to remember and adapt to new training sets in the same tasks.

Combining these two points yields the matching network.

Drawbacks

as the support set S grows in size, the computation for each gradient update becomes more expensive
when the label distribution has obvious biases (such as being fine grained), our model suffers