Paper reading--CVPR2018--video object segmentation--2

DeLS-3D: Deep Localization and Segmentation with a 3D Semantic Map

Skim read, motivation

This research focuses on two crucial technologies for autonomous driving: self-localization (camera pose estimation) and scene parsing. The authors design a sensor fusion scheme that integrates camera video, motion sensors (GPS/IMU), and a 3D semantic map. The first step is camera pose estimation, followed by scene parsing.

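Takeaway

For video object segmentation, frames can first be preprocessed with traditional methods (histogram equalization, etc.) to reduce the influence of illumination and color changes on the final result.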
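
The preprocessing idea above can be sketched as follows. This is a minimal NumPy illustration of classic histogram equalization on one 8-bit grayscale frame; it is not taken from the DeLS-3D paper, and the function name `equalize_histogram` and the synthetic test frame are assumptions made for the example.

```python
import numpy as np

def equalize_histogram(gray: np.ndarray) -> np.ndarray:
    """Equalize an 8-bit grayscale frame of shape (H, W) and dtype uint8."""
    hist = np.bincount(gray.ravel(), minlength=256)
    cdf = hist.cumsum()
    cdf_min = cdf[cdf > 0][0]      # CDF value at the darkest intensity present
    total = gray.size
    if total == cdf_min:           # constant frame: nothing to stretch
        return gray.copy()
    # Lookup table that makes the cumulative distribution roughly linear.
    lut = np.round((cdf - cdf_min) / (total - cdf_min) * 255.0)
    lut = np.clip(lut, 0, 255).astype(np.uint8)
    return lut[gray]

# Usage: a synthetic low-contrast frame with intensities squeezed into [100, 139].
rng = np.random.default_rng(0)
frame = rng.integers(100, 140, size=(64, 64), dtype=np.uint8)
equalized = equalize_histogram(frame)
print(frame.min(), frame.max(), "->", equalized.min(), equalized.max())
```

For color frames, one common choice is to equalize only a luminance channel so that hues are not distorted; whether this helps a given segmentation model would need to be checked experimentally.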

Low-latency video semantic segmentation

Research background

Video analysis tasks usually involve large volumes of data and require substantial computing resources. However, many real-world systems that need video segmentation, e.g., autonomous driving, have strict latency requirements.
Video semantic segmentation methods fall into two classes: high-level modeling and feature-level propagation. To reduce latency, this work adopts feature-level propagation, which reuses the features of preceding frames to speed up computation.
When exploiting temporal correlations, existing methods usually treat all locations independently and uniformly, ignoring the different characteristics of smooth regions and boundaries.

Motivation and proposed approach

This work aims to reduce not only the overall cost but also the maximum latency, while maintaining competitive performance. First, the authors propose an adaptive feature propagation component, which applies a spatially variant convolution to combine features from preceding frames. Second, they allocate keyframes adaptively, on demand, based on accuracy prediction, and incorporate a parallel scheme to coordinate keyframe computation and feature propagation (quite technical).
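
To make "spatially variant convolution" concrete, here is a minimal NumPy sketch in which every spatial location has its own k x k kernel, unlike an ordinary convolution that shares one kernel everywhere. How the paper produces these kernels is not covered in these notes, so the kernels are simply passed in; the function name, shapes, and edge padding are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def spatially_variant_conv(features: np.ndarray, kernels: np.ndarray) -> np.ndarray:
    """Apply a different k x k kernel at every location.

    features: (C, H, W) feature map from a preceding frame.
    kernels:  (H, W, k, k), one kernel per location, k odd (assumed given).
    Returns:  (C, H, W) propagated features.
    """
    c, h, w = features.shape
    k = kernels.shape[-1]
    pad = k // 2
    padded = np.pad(features, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    out = np.zeros((c, h, w), dtype=np.float64)
    for dy in range(k):
        for dx in range(k):
            # Neighbor at offset (dy - pad, dx - pad), weighted per location.
            neighbor = padded[:, dy:dy + h, dx:dx + w]
            out += neighbor * kernels[None, :, :, dy, dx]
    return out

# Usage: kernels with all weight on the center copy the features unchanged,
# which is what one would expect in a static region.
rng = np.random.default_rng(0)
prev_features = rng.normal(size=(2, 5, 6))
identity = np.zeros((5, 6, 3, 3))
identity[:, :, 1, 1] = 1.0
propagated = spatially_variant_conv(prev_features, identity)
```

Because the weights differ per location, smooth regions and boundaries can be treated differently, which is the property the background section says uniform methods lack.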

Strengths

When solving video object segmentation, reusing features across adjacent frames can improve efficiency.
The engineering implementation is good.

Potential weaknesses

(Not sure yet)

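Open questions

What exactly is the spatially variant convolution proposed by the authors? How does it achieve adaptive feature propagation?
How are keyframes determined? What is their role? How does allocating keyframes make the parallel scheme efficient?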

Weakly-Supervised Action Segmentation with Iterative Soft Boundary Assignment

Research background

This work aims to assign per-frame semantic labels under weak supervision. For example, with action transcript supervision, the algorithm can access the set of action units in a video, ordered by occurrence, but has no information about their precise temporal boundaries.
Conventional methods rely on RNNs to encode video data, enumerate all possible actions, and search for the one with maximal likelihood using algorithms such as the Hidden Markov Model. However, these methods are computationally expensive.

Motivation and proposed approach

This work proposes an action segmentation approach that adopts a temporal convolutional network to predict frame-wise action labels. To learn under weak supervision, it proposes a novel training strategy, Iterative Soft Boundary Assignment, which aligns action sequences and updates the network in an iterative fashion.
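
The frame-wise prediction step can be illustrated with a single temporal (1D) convolution over per-frame feature vectors. This is only a sketch under simple assumptions: one layer with random weights and zero padding, not the paper's network and not the Iterative Soft Boundary Assignment training loop. The function name and all shapes are hypothetical.

```python
import numpy as np

def temporal_conv(x: np.ndarray, weight: np.ndarray, bias: np.ndarray) -> np.ndarray:
    """1D convolution along the time axis with zero 'same' padding.

    x:      (T, C_in) one feature vector per frame.
    weight: (k, C_in, C_out), k odd.
    bias:   (C_out,)
    Returns (T, C_out).
    """
    t = x.shape[0]
    k = weight.shape[0]
    pad = k // 2
    padded = np.pad(x, ((pad, pad), (0, 0)))
    out = np.tile(bias, (t, 1)).astype(np.float64)
    for i in range(k):
        # Frame at temporal offset (i - pad) contributes through weight[i].
        out += padded[i:i + t] @ weight[i]
    return out

# Usage: 10 frames, 4 feature channels, 3 hypothetical action classes.
rng = np.random.default_rng(0)
features = rng.normal(size=(10, 4))
weight = rng.normal(size=(3, 4, 3))
bias = np.zeros(3)
logits = temporal_conv(features, weight, bias)
labels = logits.argmax(axis=1)   # one action label per frame
```

Each output frame sees only a window of k neighboring frames, so stacking such layers (as an encoder-decoder would) is what enlarges the temporal receptive field.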

Strengths

Temporal convolutions are applied to the input frames to build an encoder-decoder structure with lateral connections; the network architecture is quite elegant.

Weaknesses

The length of the sequence processed each time is fixed.

Inspiration

Can temporal convolution be applied to segmentation problems? How does it differ from 3D convolution? Earlier work has factorized 3D conv into 2D+1D; how well does that work?
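
One part of the question that can be checked by arithmetic is the weight count of each layer type. The sketch below compares a full 3D convolution, a 2D+1D factorization, and a purely temporal 1D convolution, ignoring biases. The channel sizes (64 everywhere) and the choice of intermediate width are assumptions for illustration; the comparison says nothing about accuracy.

```python
def conv3d_params(c_in: int, c_out: int, k_t: int, k_s: int) -> int:
    """Weights of a full 3D kernel of size k_t x k_s x k_s."""
    return k_t * k_s * k_s * c_in * c_out

def conv2plus1d_params(c_in: int, c_mid: int, c_out: int, k_t: int, k_s: int) -> int:
    """Weights of a 2D spatial conv (k_s x k_s) followed by a 1D temporal conv (k_t)."""
    spatial = k_s * k_s * c_in * c_mid
    temporal = k_t * c_mid * c_out
    return spatial + temporal

def temporal_conv1d_params(c_in: int, c_out: int, k_t: int) -> int:
    """Weights of a temporal conv applied to per-frame feature vectors."""
    return k_t * c_in * c_out

full_3d = conv3d_params(64, 64, 3, 3)
factorized = conv2plus1d_params(64, 64, 64, 3, 3)
temporal_only = temporal_conv1d_params(64, 64, 3)
print(full_3d, factorized, temporal_only)  # 110592 49152 12288
```

The temporal-only layer is cheapest because it operates on features that have already been pooled over space, whereas the 3D and 2D+1D layers still mix spatial neighborhoods within each frame.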