Paper Notes: Evolving Losses for Unsupervised Video Representation Learning


Distillation

Knowledge distillation background, summarized from a Zhihu article.

Distill knowledge from the teacher model Net-T into the student model Net-S.

Purpose: to slim down the model for easier deployment.

$$L=\alpha L_{soft}+\beta L_{hard}$$

$$L_{soft}=-\sum_{j}^{N} p_{j}^{T} \log \left(q_{j}^{T}\right), \text { where } p_{i}^{T}=\frac{\exp \left(v_{i} / T\right)}{\sum_{k}^{N} \exp \left(v_{k} / T\right)}, \quad q_{i}^{T}=\frac{\exp \left(z_{i} / T\right)}{\sum_{k}^{N} \exp \left(z_{k} / T\right)}$$

Here $v_i$ are Net-T's logits, $z_i$ are Net-S's logits, and $T$ is the temperature.

$$L_{hard}=-\sum_{j}^{N} c_{j} \log \left(q_{j}^{1}\right), \text { where } q_{i}^{1}=\frac{\exp \left(z_{i}\right)}{\sum_{j}^{N} \exp \left(z_{j}\right)}$$

The first term learns from the teacher model; the second term learns from the ground-truth labels $c_j$.

The temperature changes how much attention Net-S pays to negative labels during training: at low temperature, the negative labels, especially those noticeably below the average, receive less attention; at high temperature, the values of the negative labels are relatively amplified, so Net-S pays relatively more attention to them.
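A minimal NumPy sketch of the two-term loss above; the temperature `T` and the weights `alpha`, `beta` are illustrative values, not taken from the original sources:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; a higher T flattens the distribution."""
    z = logits / T
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(v, z, c, T=4.0, alpha=0.9, beta=0.1):
    """L = alpha * L_soft + beta * L_hard.

    v: teacher (Net-T) logits, z: student (Net-S) logits,
    c: one-hot ground-truth labels. T, alpha, beta are illustrative.
    """
    p_T = softmax(v, T)                  # teacher soft targets p_j^T
    q_T = softmax(z, T)                  # student soft predictions q_j^T
    l_soft = -np.sum(p_T * np.log(q_T))  # cross-entropy against soft targets
    q_1 = softmax(z, T=1.0)              # student predictions at T = 1
    l_hard = -np.sum(c * np.log(q_1))    # standard cross-entropy
    return alpha * l_soft + beta * l_hard
```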

Main idea: Multiple modalities to multiple tasks


Loss Function

$$\mathcal{L}=\sum_{m} \sum_{t} \lambda_{m, t} \mathcal{L}_{m, t}+\sum_{d} \lambda_{d} \mathcal{L}_{d}$$

where

$\lambda$ is a weight,

$\mathcal{L}_{m,t}$ is the loss of modality $m$ on task $t$,

$\mathcal{L}_{d}$ is the $L_2$ distance between a layer $M_i$ of the main network and the corresponding layer $L_i$ of another network:
$$\mathcal{L}_{d}\left(L_{i}, M_{i}\right)=\left\|L_{i}-M_{i}\right\|_{2}$$
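A sketch of how this weighted sum might be assembled in code; the nested-list layout of `task_losses` and `lambda_mt` is a hypothetical arrangement chosen for illustration:

```python
import numpy as np

def l2_distance(L_i, M_i):
    """Distillation term: L2 distance between two layers' activations."""
    return np.linalg.norm(L_i - M_i)

def total_loss(task_losses, distill_losses, lambda_mt, lambda_d):
    """ELo total loss: weighted per-(modality, task) losses plus weighted
    distillation distances. task_losses[m][t] pairs with lambda_mt[m][t];
    distill_losses[d] pairs with lambda_d[d]."""
    loss = sum(lambda_mt[m][t] * task_losses[m][t]
               for m in range(len(task_losses))
               for t in range(len(task_losses[m])))
    loss += sum(w * l for w, l in zip(lambda_d, distill_losses))
    return loss
```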

Evolutionary Algorithm

An evolutionary algorithm is used to determine the weights $\lambda$.

Each $\lambda_{m,t}$ or $\lambda_{d}$ is constrained to lie in $[0, 1]$.

Unsupervised Loss Function

Zipf Distribution Matching (ELo)

Cluster the learned video representations to obtain centroids $\left\{c_{1}, c_{2}, \ldots, c_{k}\right\}$, where $c_{i} \in \mathbb{R}^{D}$.

Naively assuming all clusters have the same variance, and letting $2\sigma^2 = 1$,

we can compute the probability of a feature vector $x \in \mathbb{R}^D$ belonging to cluster $c_i$ as
$$p\left(x \mid c_{i}\right)=\frac{1}{\sqrt{2 \sigma^{2} \pi}} \exp \left(-\frac{\left(x-c_{i}\right)^{2}}{2 \sigma^{2}}\right)$$
By Bayes' rule, with equal priors $p(c_i)$ cancelling:
$$\begin{aligned} p\left(c_{i} \mid x\right) &=\frac{p\left(c_{i}\right) p\left(x \mid c_{i}\right)}{\sum_{j}^{k} p\left(c_{j}\right) p\left(x \mid c_{j}\right)}=\frac{\exp \left(-\frac{\left(x-c_{i}\right)^{2}}{2 \sigma^{2}}\right)}{\sum_{j=1}^{k} \exp \left(-\frac{\left(x-c_{j}\right)^{2}}{2 \sigma^{2}}\right)} \\ &=\frac{\exp \left(-\left(x-c_{i}\right)^{2}\right)}{\sum_{j=1}^{k} \exp \left(-\left(x-c_{j}\right)^{2}\right)} \end{aligned}$$
which is the standard softmax function over negative squared distances to the centroids.
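In code, this posterior is exactly a softmax over negative squared distances; a minimal NumPy sketch:

```python
import numpy as np

def cluster_posteriors(X, C):
    """p(c_i | x) for each row of X, via softmax over -||x - c_i||^2.

    X: (N, D) feature vectors; C: (k, D) cluster centroids.
    Returns an (N, k) matrix whose rows sum to 1.
    """
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)  # (N, k) squared distances
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)
```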

Given the above probability of each video belonging to each cluster, we take the Zipf distribution as the prior probability of each class: $q\left(c_{i}\right)=\frac{1 / i^{s}}{H_{k, s}}$, where $H_{k,s}$ is the $k$-th generalized harmonic number and $s$ is a real constant.

The empirical marginal is $p\left(c_{i}\right)=\frac{1}{N} \sum_{x \in V} p\left(c_{i} \mid x\right)$, the average over all $N$ videos in the set $V$.

KL divergence:
$$KL(p \| q)=\sum_{i=1}^{k} p\left(c_{i}\right) \log \left(\frac{p\left(c_{i}\right)}{q\left(c_{i}\right)}\right)$$
This will be our fitness function.

It imposes a prior constraint on the distribution of (learned) video representations over clusters, encouraging it to follow the Zipf distribution.
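Putting the pieces together, the fitness can be sketched as follows; sorting the empirical cluster marginal by size before comparing it to the Zipf prior is an assumption made here so that cluster rank matches Zipf rank:

```python
import numpy as np

def zipf_prior(k, s=1.0):
    """q(c_i) = (1 / i^s) / H_{k,s}, with H_{k,s} the generalized harmonic number."""
    ranks = np.arange(1, k + 1, dtype=float)
    w = 1.0 / ranks ** s
    return w / w.sum()                      # dividing by w.sum() is the H_{k,s} term

def elo_fitness(posteriors, s=1.0):
    """KL(p || q) between the empirical cluster marginal and the Zipf prior.

    posteriors: (N, k) matrix of p(c_i | x), e.g. from cluster_posteriors above.
    A lower KL means the representation's cluster sizes better match Zipf.
    """
    p = posteriors.mean(axis=0)             # p(c_i): average over all videos
    p = np.sort(p)[::-1]                    # rank clusters by size (an assumption)
    q = zipf_prior(p.shape[0], s)
    return float(np.sum(p * np.log(p / q)))
```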

Loss Evolution

The evolutionary search combines tournament selection with CMA-ES.
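A toy sketch of the tournament-selection half of the search, under the $[0,1]$ constraint on each weight; the population size, tournament size, and mutation scale are illustrative, and the CMA-ES refinement used in the paper is omitted here:

```python
import random

def tournament_select(population, fitness, tsize=4):
    """Return the best (lowest KL fitness) member of a random subset."""
    contenders = random.sample(range(len(population)), tsize)
    return population[min(contenders, key=lambda i: fitness[i])]

def mutate(weights, sigma=0.1):
    """Gaussian perturbation, clipped to keep each lambda in [0, 1]."""
    return [min(1.0, max(0.0, w + random.gauss(0.0, sigma))) for w in weights]

def evolve(population, evaluate, generations=50, tsize=4):
    """Steady-state loop: mutate a tournament winner, replace the worst."""
    fitness = [evaluate(w) for w in population]
    for _ in range(generations):
        child = mutate(tournament_select(population, fitness, tsize))
        worst = max(range(len(population)), key=lambda i: fitness[i])
        population[worst], fitness[worst] = child, evaluate(child)
    return population[min(range(len(population)), key=lambda i: fitness[i])]
```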


Reposted from blog.csdn.net/ArchibaldChain/article/details/107472523