【Music】Video soundtracks / multimodal retrieval: Content-Based Video–Music Retrieval (CBVMR) Using Soft Intra-Modal Structure Constraint — notes

2018 ICMR
Content-Based Video–Music Retrieval Using Soft Intra-Modal Structure Constraint

Introduction

bidirectional retrieval

Challenges

  • Design a cross-modal model that does not require any metadata
  • Matched video–music pairs are hard to obtain, and the matching criteria between video and music are more ambiguous than in other cross-modal tasks (e.g., image-to-text retrieval)

Contributions

  • Content-based, cross-modal embedding network
    • introduce VM-NET, a two-branch neural network that infers the latent alignment between videos and music tracks using only their contents
    • train the network via inter-modal ranking loss
      such that videos and music with similar semantics end up close together in the embedding space

However, if only the inter-modal ranking constraint for embedding is considered, modality-specific characteristics (e.g., rhythm or tempo for music and texture or color for image) may be lost.

  • devise a novel soft intra-modal structure constraint
    takes advantage of the relative distance relationships of samples within each modality
    does not require ground-truth pair information within an individual modality.

Large-scale video–music pair dataset

  • Hong–Im Music–Video 200K (HIMV-200K)
    composed of 200,500 video–music pairs.

Evaluation

  • Recall@K
  • subjective user evaluation

Related work

A. Video–Music Related Tasks

conventional approaches can be divided into three categories according to the task:

  • generation
  • classification
  • matching

Most existing methods rely on metadata (e.g., keywords, mood tags, and associated descriptions).

B. Two-Branch Neural Networks

Relationships across different modalities
e.g., associating images with text

Emotion tags linking music and video
Tunesensor: A semantic-driven music recommendation service for digital photo albums (ISWC 2011)

Method

A. Music Feature Extraction

  1. decompose an audio signal into harmonic and percussive components

  2. apply log-amplitude scaling to each component
    to avoid numerical underflow

  3. slice the components into shorter segments called local frames (or windowed excerpts) and extract multiple features from each component of each frame.

Frame-level features.

  1. Spectral features
    The first type of audio features is derived from spectral analyses.
  • first apply the fast Fourier transform and the discrete wavelet transform to the windowed signal in each local frame
  • from the magnitude spectral results
    • compute summary features including the spectral centroid, the spectral bandwidth, the spectral rolloff, and the first- and second-order polynomial features of a spectrogram
  2. Mel-scale features
  • compute the Mel-scale spectrogram of each frame as well as the Mel-frequency cepstral coefficients (MFCC)
    to extract more meaningful features

  • use delta-MFCC features (the first- and second-order differences of the MFCC features over time)
    to capture variations of timbre over time

  3. Chroma features
  • use the chroma short-time Fourier transform as well as chroma energy normalized statistics
    While Mel-scaled representations efficiently capture timbre, they provide poor resolution of pitches and pitch classes
  4. Other features
  • use the number of time-domain zero-crossings as an audio feature
    in order to detect the amount of noise in the audio signal
  • use the root-mean-square (RMS) energy of each frame

B. Video Feature Extraction

Frame-level features

  • The HIMV-200K dataset is large, so training a CNN from scratch would take too long;
    therefore, an Inception network pretrained on ImageNet is used to extract frame-level features

  • whitened principal component analysis (WPCA)
    applied so that the normalized features are approximately multivariate Gaussian with zero mean and identity covariance
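A small sketch of whitened PCA: project the features onto the top principal components and rescale each component to unit variance, so the output is roughly zero-mean with identity covariance. The dimensions (2048 → 1024) follow the implementation details later in these notes; the helper names are ours, not the authors' code.

```python
import numpy as np

def wpca_fit(X, out_dim=1024, eps=1e-8):
    # X: (n_samples, in_dim) frame-level features
    mean = X.mean(axis=0)
    Xc = X - mean
    # principal axes via SVD of the centered data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:out_dim]                   # (out_dim, in_dim)
    var = (S[:out_dim] ** 2) / (len(X) - 1)     # variance along each component
    return mean, components, np.sqrt(var) + eps

def wpca_transform(X, mean, components, scale):
    # project and divide by the per-component standard deviation (whitening)
    return (X - mean) @ components.T / scale
```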

Video-level features

concatenation of the frame-level features
a global normalization process (subtracting the mean vector from all the features)
principal component analysis (PCA)
L2 normalization
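A sketch of turning per-frame video features into a single video-level vector, following the pipeline above (aggregate → global normalization → PCA → L2 normalization). The aggregation statistics (mean, standard deviation, top-5 ordinal statistics) come from the implementation-details section of these notes; the output dimension and function names are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def aggregate_frames(frame_feats):
    # frame_feats: (n_frames, dim) WPCA-reduced frame features for one video
    top5 = np.sort(frame_feats, axis=0)[-5:]            # top-5 ordinal statistics
    return np.concatenate([frame_feats.mean(axis=0),
                           frame_feats.std(axis=0),
                           top5.ravel()])

def video_level_features(per_video_frame_feats, out_dim=512):
    V = np.stack([aggregate_frames(f) for f in per_video_frame_feats])
    V = V - V.mean(axis=0)                               # global normalization
    V = PCA(n_components=out_dim).fit_transform(V)       # dimensionality reduction
    return V / np.linalg.norm(V, axis=1, keepdims=True)  # L2 normalization
```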

C. Multimodal Embedding

The final step is to embed the separately extracted features of the heterogeneous music and video modalities into a shared embedding space.

The two-branch neural network

fully connected (FC) layers with ReLU activations

  • the video features are extracted from a pretrained CNN
  • the music features are only a simple concatenation of low-level audio feature statistics

To compensate for the relatively low-level audio features, the audio branch of the network is made deeper than the video branch.

The final outputs of the two branches are L2-normalized so that cosine similarity, which serves as the distance measure in this method, is easy to compute.
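A minimal sketch of such a two-branch embedding network, in PyTorch. The layer counts and widths are assumptions (the paper only states that the audio branch is deeper than the video branch), so this should not be read as the authors' architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchNet(nn.Module):
    def __init__(self, video_dim=1024, music_dim=380, embed_dim=512):
        super().__init__()
        # shallower branch for the already high-level CNN video features
        self.video_branch = nn.Sequential(
            nn.Linear(video_dim, 1024), nn.ReLU(),
            nn.Linear(1024, embed_dim))
        # deeper branch to compensate for the low-level audio statistics
        self.music_branch = nn.Sequential(
            nn.Linear(music_dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, embed_dim))

    def forward(self, video, music):
        v = F.normalize(self.video_branch(video), dim=1)  # L2-normalized outputs,
        m = F.normalize(self.music_branch(music), dim=1)  # so cosine similarity is a dot product
        return v, m
```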

Inter-modal ranking constraint

Inspired by the triplet ranking loss

  • a positive cross-modal sample
    the ground-truth pair item, i.e., the other half separated from the same music video
  • a negative sample
    an item not paired with the anchor

Loss

  • vi (anchor)
    video of the i-th music video
  • mi (positive sample)
    music of the i-th music video
  • mj (negative sample)
    the music feature obtained from the j-th music video
  • d(v,m)
    distance (e.g., Euclidean distance)
  • e
    a margin constant

Video-anchored ranking loss (video input)

Music-anchored ranking loss (music input)

Triplet selection
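With d the distance, e the margin, and the triplets defined above, a hedged reconstruction of the two ranking terms (the paper's exact indexing and weighting may differ) is:

$$L_{\text{inter}} \;=\; \sum_{i,j} \max\bigl(0,\; d(v_i, m_i) - d(v_i, m_j) + e\bigr) \;+\; \sum_{i,j} \max\bigl(0,\; d(m_i, v_i) - d(m_i, v_j) + e\bigr)$$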

During triplet selection, computing the loss over all possible triplets would require a huge amount of computation.

  • top Q most violated cross-modal matches in each mini-batch

i.e., selecting at most Q violating negative matches that lie closer to the anchor than the positive (the ground-truth video–music pair) in the embedding space.
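A sketch of that selection step on a mini-batch, using cosine similarities (consistent with the L2-normalized branch outputs above). Q, the margin value, and the batch layout are assumptions; the symmetric music-anchored term would be computed the same way on the transposed similarity matrix.

```python
import torch

def top_q_violations(v, m, margin=0.2, q=10):
    # v, m: (B, D) L2-normalized embeddings; row i of v pairs with row i of m
    sim = v @ m.t()                                   # (B, B) cosine similarities
    pos = sim.diag().unsqueeze(1)                     # ground-truth pair similarity per anchor
    viol = sim - pos + margin                         # > 0 where a negative violates the margin
    eye = torch.eye(len(v), dtype=torch.bool, device=v.device)
    viol = viol.masked_fill(eye, float('-inf'))       # never pick the positive as a "negative"
    topq, idx = viol.topk(min(q, len(v) - 1), dim=1)  # Q most violated negatives per anchor
    return torch.clamp(topq, min=0).sum(), idx        # hinge loss over the selected triplets
```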

Soft intra-modal structure constraint

If only the inter-modal ranking constraint is used,
the intrinsic characteristics within each modality (i.e., the modality-specific characteristics) may be lost.

the modality- specific characteristics

  • in music
    rhythm, tempo, or timbre
  • in videos
    brightness, color, or texture

To address this collapse of the structure within each modality, a soft intra-modal structure constraint is devised.

Analogous constraints are applied to the video inputs and the music inputs: two music features should end up closer to each other in the multimodal space if they were closer to each other before embedding (and likewise for two video features).

The margin constant is not used for this constraint.
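One way to write such a constraint, using the sign function defined below and writing $\bar{d}$ for distances between the original (pre-embedding) features, is the following sketch; the paper's exact formulation may differ:

$$L_{\text{intra}}^{(v)} \;=\; \sum_{i,j,k} \max\Bigl(0,\; \operatorname{sign}\bigl(\bar{d}(\bar{v}_i,\bar{v}_j) - \bar{d}(\bar{v}_i,\bar{v}_k)\bigr)\,\bigl(d(v_i,v_k) - d(v_i,v_j)\bigr)\Bigr)$$

with an analogous term $L_{\text{intra}}^{(m)}$ over music triplets, and no margin constant $e$.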

Embedding network loss

  • inter-modal ranking constraint
    two types of triplets $(v_i, m_i, m_j)$ and $(m_i, v_i, v_j)$

  • soft intra-modal structure constraint
    two types of triplets $(v_i, v_j, v_k)$ and $(m_i, m_j, m_k)$

$$\operatorname{sign}(x)=\begin{cases}1, & x>0\\ 0, & x=0\\ -1, & x<0\end{cases}$$
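Putting the two constraints together, the overall training objective presumably takes the weighted form below, with λ1 on the inter-modal term and λ2 on the intra-modal term (the weights discussed in the experiments); this is a sketch, not the paper's exact equation:

$$L \;=\; \lambda_1\, L_{\text{inter}} \;+\; \lambda_2\,\bigl(L_{\text{intra}}^{(v)} + L_{\text{intra}}^{(m)}\bigr)$$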

Dataset and implementation details

A. Construction of the Dataset

Our Hong–Im Music–Video 200K (HIMV-200K) benchmark dataset contains 200,500 video–music pairs.
The pairs were obtained from YouTube-8M, a large-scale labeled video dataset containing millions of YouTube video IDs and their associated labels.

All videos in YouTube-8M are annotated with a vocabulary of 4,800 visual entities, covering a wide range of activities (e.g., sports, games, hobbies), objects (e.g., cars, food, products), and events (e.g., concerts, festivals, weddings).

From the videos associated with these thousands of entities, we first downloaded those related to "music video", including official music videos, parody music videos, and user-generated videos with background music.

After downloading all the videos tagged "music video", FFmpeg was used to split each one into a video stream and an audio track. In the end, 205,000 video–music pairs were obtained; the training, validation, and test splits contain 200K, 4K, and 1K pairs, respectively.
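A minimal sketch of that splitting step; the exact FFmpeg invocation used by the authors is not given, so the file names, mono downmix, and 22.05 kHz sample rate here are illustrative assumptions.

```python
import subprocess

def split_music_video(src="clip.mp4"):
    # drop the audio and copy the video stream as-is
    subprocess.run(["ffmpeg", "-y", "-i", src, "-an", "-c:v", "copy", "clip_video.mp4"],
                   check=True)
    # drop the video and decode the audio to mono 22.05 kHz WAV
    subprocess.run(["ffmpeg", "-y", "-i", src, "-vn", "-ac", "1", "-ar", "22050", "clip_audio.wav"],
                   check=True)
```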

To release the HIMV-200K dataset publicly without infringing copyright, we provide only the URLs of the YouTube videos, together with the feature-extraction code for the video and music tracks, in our online repository (https://github.com/csehong/VM-NET).

B. Implementation Details

The audio signals were trimmed to 29.12 s at the center of each song and downsampled from 22.05 kHz to 12 kHz, following [36].

  • The audio signal is decomposed into harmonic and percussive components, and a large set of audio features is extracted frame by frame,
    yielding a 380-dimensional vector per frame.

For the video features used in video–music retrieval, we followed the implementation details in [40].

Each video is first decoded at 1 frame per second, up to the first 360 seconds.

Frame-level features of 2,048 dimensions are extracted with the Inception network [4], and WPCA is applied to reduce the feature dimension to 1,024. Given the frame-level features, we aggregate them with the mean, the standard deviation, and the top-5 ordinal statistics, followed by global normalization.

Experimental results

A. The Recall@K Metric

Evaluation is performed on the 1K test set.

Most previous methods mainly rely on subjective user evaluations.

To address this, we apply Recall@K, a standard protocol for cross-modal retrieval, especially image–text retrieval [30], [33], to the bidirectional CBVMR task.

For a given value of K, it measures the percentage of queries in the test set for which at least one correct ground-truth match is ranked among the top K results. For example, for video queries asking for suitable music, Recall@10 tells us the percentage of video queries whose ground-truth music match appears in the top ten results.
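A short sketch of Recall@K for video-to-music retrieval, assuming the embeddings are L2-normalized and query i's ground-truth match is gallery item i; the variable names and evaluation layout are assumptions.

```python
import numpy as np

def recall_at_k(video_emb, music_emb, k=10):
    sim = video_emb @ music_emb.T            # (N, N) cosine similarities
    ranks = np.argsort(-sim, axis=1)         # gallery indices, best match first
    gt = np.arange(len(video_emb))[:, None]  # ground-truth index per query
    hits = (ranks[:, :k] == gt).any(axis=1)  # ground truth within the top K?
    return hits.mean()                       # fraction of queries with a hit
```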

[30] A. Karpathy and L. Fei-Fei, “Deep visual-semantic alignments for generating image descriptions,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3128–3137.
[33] L. Wang, Y. Li, and S. Lazebnik, “Learning deep structure-preserving image-text embeddings,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 5005–5013.

Giving λ1 relatively more weight than λ2 generally improves performance. However, we empirically confirmed that setting λ1 to 5 or larger does not further improve Recall@K.

B. A Human Preference Test

Conclusion

A two-branch deep network that associates videos with music, taking both inter-modal and intra-modal relationships into account.

The model can learn attributes such as the genre of the music and the gender or nationality of the singer.
