Paper Notes: Evolving Losses for Unsupervised Video Representation Learning


Distillation

Knowledge distillation background, summarized from a Zhihu article.

Distill knowledge from the teacher model Net-T into the student model Net-S.

Purpose: to slim down the model for easier deployment.

$$L=\alpha L_{soft}+\beta L_{hard}$$

$$L_{soft}=-\sum_{j}^{N} p_{j}^{T} \log \left(q_{j}^{T}\right), \text { where } p_{i}^{T}=\frac{\exp \left(v_{i} / T\right)}{\sum_{k}^{N} \exp \left(v_{k} / T\right)}, \quad q_{i}^{T}=\frac{\exp \left(z_{i} / T\right)}{\sum_{k}^{N} \exp \left(z_{k} / T\right)}$$

Here $v_i$ are Net-T's logits, $z_i$ are Net-S's logits, and $T$ is the temperature.

$$L_{hard}=-\sum_{j}^{N} c_{j} \log \left(q_{j}^{1}\right), \text { where } q_{i}^{1}=\frac{\exp \left(z_{i}\right)}{\sum_{j}^{N} \exp \left(z_{j}\right)}$$

The first term learns from the teacher model; the second term learns from the ground-truth labels $c_j$.

The temperature changes how much attention Net-S pays to negative labels during training: at low temperature, the negative labels, especially those noticeably below the average, receive less attention; at high temperature, the values of the negative labels are relatively amplified, so Net-S pays relatively more attention to them.
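A minimal NumPy sketch of the two-term loss above; the temperature `T` and the weights `alpha`, `beta` are illustrative values, not taken from the original sources:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-scaled softmax; a higher T flattens the distribution."""
    z = logits / T
    z = z - z.max()                      # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def distillation_loss(v, z, c, T=4.0, alpha=0.9, beta=0.1):
    """L = alpha * L_soft + beta * L_hard.

    v: teacher (Net-T) logits, z: student (Net-S) logits,
    c: one-hot ground-truth labels. T, alpha, beta are illustrative.
    """
    p_T = softmax(v, T)                  # teacher soft targets p_j^T
    q_T = softmax(z, T)                  # student soft predictions q_j^T
    l_soft = -np.sum(p_T * np.log(q_T))  # cross-entropy against soft targets
    q_1 = softmax(z, T=1.0)              # student predictions at T = 1
    l_hard = -np.sum(c * np.log(q_1))    # standard cross-entropy
    return alpha * l_soft + beta * l_hard
```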

Main idea: Multiple modalities to multiple tasks


Loss Function

$$\mathcal{L}=\sum_{m} \sum_{t} \lambda_{m, t} \mathcal{L}_{m, t}+\sum_{d} \lambda_{d} \mathcal{L}_{d}$$

where

$\lambda$ is a weight,

$\mathcal{L}_{m,t}$ is the loss of modality $m$ on task $t$,

$\mathcal{L}_{d}$ is the $L_2$ distance between a layer $M_i$ of the main network and the corresponding layer $L_i$ of another network:
$$\mathcal{L}_{d}\left(L_{i}, M_{i}\right)=\left\|L_{i}-M_{i}\right\|_{2}$$
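A sketch of how this weighted sum might be assembled in code; the nested-list layout of `task_losses` and `lambda_mt` is a hypothetical arrangement chosen for illustration:

```python
import numpy as np

def l2_distance(L_i, M_i):
    """Distillation term: L2 distance between two layers' activations."""
    return np.linalg.norm(L_i - M_i)

def total_loss(task_losses, distill_losses, lambda_mt, lambda_d):
    """ELo total loss: weighted per-(modality, task) losses plus weighted
    distillation distances. task_losses[m][t] pairs with lambda_mt[m][t];
    distill_losses[d] pairs with lambda_d[d]."""
    loss = sum(lambda_mt[m][t] * task_losses[m][t]
               for m in range(len(task_losses))
               for t in range(len(task_losses[m])))
    loss += sum(w * l for w, l in zip(lambda_d, distill_losses))
    return loss
```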

Evolutionary Algorithm

An evolutionary algorithm is used to determine the weights $\lambda$.

Each $\lambda_{m,t}$ or $\lambda_{d}$ is constrained to lie in $[0, 1]$.

Unsupervised Loss Function

Zipf Distribution Matching (ELo)

Cluster the learned video representations to obtain centroids $\left\{c_{1}, c_{2}, \ldots, c_{k}\right\}$, where $c_{i} \in \mathbb{R}^{D}$.

Naively assuming all clusters have the same variance, and letting $2\sigma^2 = 1$,

we can compute the probability of a feature vector $x \in \mathbb{R}^D$ belonging to cluster $c_i$ as
$$p\left(x \mid c_{i}\right)=\frac{1}{\sqrt{2 \sigma^{2} \pi}} \exp \left(-\frac{\left(x-c_{i}\right)^{2}}{2 \sigma^{2}}\right)$$
By Bayes' rule, with equal priors $p(c_i)$ cancelling:
$$\begin{aligned} p\left(c_{i} \mid x\right) &=\frac{p\left(c_{i}\right) p\left(x \mid c_{i}\right)}{\sum_{j}^{k} p\left(c_{j}\right) p\left(x \mid c_{j}\right)}=\frac{\exp \left(-\frac{\left(x-c_{i}\right)^{2}}{2 \sigma^{2}}\right)}{\sum_{j=1}^{k} \exp \left(-\frac{\left(x-c_{j}\right)^{2}}{2 \sigma^{2}}\right)} \\ &=\frac{\exp \left(-\left(x-c_{i}\right)^{2}\right)}{\sum_{j=1}^{k} \exp \left(-\left(x-c_{j}\right)^{2}\right)} \end{aligned}$$
which is the standard softmax function over negative squared distances to the centroids.
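In code, this posterior is exactly a softmax over negative squared distances; a minimal NumPy sketch:

```python
import numpy as np

def cluster_posteriors(X, C):
    """p(c_i | x) for each row of X, via softmax over -||x - c_i||^2.

    X: (N, D) feature vectors; C: (k, D) cluster centroids.
    Returns an (N, k) matrix whose rows sum to 1.
    """
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)  # (N, k) squared distances
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)          # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)
```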

Given the above probability of each video belonging to each cluster, we take the Zipf distribution as the prior probability of each class: $q\left(c_{i}\right)=\frac{1 / i^{s}}{H_{k, s}}$, where $H_{k,s}$ is the $k$-th generalized harmonic number and $s$ is a real constant.

The empirical marginal is $p\left(c_{i}\right)=\frac{1}{N} \sum_{x \in V} p\left(c_{i} \mid x\right)$, the average over all $N$ videos in the set $V$.

KL divergence:
$$KL(p \| q)=\sum_{i=1}^{k} p\left(c_{i}\right) \log \left(\frac{p\left(c_{i}\right)}{q\left(c_{i}\right)}\right)$$
This will be our fitness function.

It imposes a prior constraint on the distribution of (learned) video representations over clusters, encouraging it to follow the Zipf distribution.
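Putting the pieces together, the fitness can be sketched as follows; sorting the empirical cluster marginal by size before comparing it to the Zipf prior is an assumption made here so that cluster rank matches Zipf rank:

```python
import numpy as np

def zipf_prior(k, s=1.0):
    """q(c_i) = (1 / i^s) / H_{k,s}, with H_{k,s} the generalized harmonic number."""
    ranks = np.arange(1, k + 1, dtype=float)
    w = 1.0 / ranks ** s
    return w / w.sum()                      # dividing by w.sum() is the H_{k,s} term

def elo_fitness(posteriors, s=1.0):
    """KL(p || q) between the empirical cluster marginal and the Zipf prior.

    posteriors: (N, k) matrix of p(c_i | x), e.g. from cluster_posteriors above.
    A lower KL means the representation's cluster sizes better match Zipf.
    """
    p = posteriors.mean(axis=0)             # p(c_i): average over all videos
    p = np.sort(p)[::-1]                    # rank clusters by size (an assumption)
    q = zipf_prior(p.shape[0], s)
    return float(np.sum(p * np.log(p / q)))
```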

Loss Evolution

The evolutionary search combines tournament selection with CMA-ES.
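A toy sketch of the tournament-selection half of the search, under the $[0,1]$ constraint on each weight; the population size, tournament size, and mutation scale are illustrative, and the CMA-ES refinement used in the paper is omitted here:

```python
import random

def tournament_select(population, fitness, tsize=4):
    """Return the best (lowest KL fitness) member of a random subset."""
    contenders = random.sample(range(len(population)), tsize)
    return population[min(contenders, key=lambda i: fitness[i])]

def mutate(weights, sigma=0.1):
    """Gaussian perturbation, clipped to keep each lambda in [0, 1]."""
    return [min(1.0, max(0.0, w + random.gauss(0.0, sigma))) for w in weights]

def evolve(population, evaluate, generations=50, tsize=4):
    """Steady-state loop: mutate a tournament winner, replace the worst."""
    fitness = [evaluate(w) for w in population]
    for _ in range(generations):
        child = mutate(tournament_select(population, fitness, tsize))
        worst = max(range(len(population)), key=lambda i: fitness[i])
        population[worst], fitness[worst] = child, evaluate(child)
    return population[min(range(len(population)), key=lambda i: fitness[i])]
```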


Reposted from blog.csdn.net/ArchibaldChain/article/details/107472523