[To be continued] Review: Deep Learning for Video Segmentation

A Survey on Deep Learning Technique for Video Segmentation

0. Summary

This paper reviews two basic research lines of video segmentation: video object segmentation and video semantic segmentation. It introduces their respective task settings, background concepts, perceived needs, development history, and main challenges, then provides a detailed overview of representative methods and datasets. It further evaluates these methods on well-known benchmarks, and finally points out open issues and future research directions in these fields.

1. Introduction

Video segmentation (finding key objects with special properties or semantics in videos) is a fundamental and challenging problem in computer vision (CV). It has countless potential applications: autonomous driving, robotics, surveillance, social media, AR, filmmaking, and video conferencing.

This problem has long been addressed with traditional CV and machine learning (ML) techniques, including:

  • hand-crafted features (e.g., histogram statistics, optical flow, etc.)
  • heuristic prior knowledge (e.g., visual attention mechanism, motion boundaries, etc.)
  • low/mid-level visual representations (e.g., super-voxel, trajectory, object proposal, etc.)
  • classical machine learning models (e.g., clustering, graph models, random walks, support vector machines, random decision forests, Markov random fields, conditional random fields, etc.)

Recently, deep neural networks (DNNs), especially fully convolutional networks (FCNs), have brought great progress to video segmentation. Compared with traditional methods, these deep learning (DL)-based video segmentation (VS) algorithms achieve higher accuracy and are sometimes also more efficient.

A fully convolutional network (FCN) uses a convolutional neural network to realize the transformation from image pixels to pixel categories. Unlike a standard convolutional neural network, an FCN uses transposed convolution layers to upsample the height and width of intermediate feature maps back to the size of the input image, so that predictions correspond one-to-one with input pixels in the spatial dimensions (height and width): given a position in the spatial dimensions, the channel dimension at that position holds the category prediction for the corresponding pixel.
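
To make this concrete, here is a minimal, hypothetical FCN-style sketch in PyTorch (layer sizes and names are arbitrary assumptions, not the design of any specific paper): a small backbone downsamples the frame, a 1x1 convolution predicts class scores, and a transposed convolution upsamples the scores back to the input resolution.

```python
import torch
import torch.nn as nn

class TinyFCN(nn.Module):
    def __init__(self, num_classes: int = 2):
        super().__init__()
        # Backbone: two stride-2 convs reduce H x W by a factor of 4.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        # 1x1 conv turns features into per-pixel class scores.
        self.classifier = nn.Conv2d(64, num_classes, kernel_size=1)
        # Transposed conv upsamples scores back to the input size (x4).
        self.upsample = nn.ConvTranspose2d(
            num_classes, num_classes, kernel_size=4, stride=4)

    def forward(self, x):
        # Output: (B, num_classes, H, W); each spatial position holds
        # the class prediction for the corresponding input pixel.
        return self.upsample(self.classifier(self.backbone(x)))

frame = torch.randn(1, 3, 64, 64)          # one RGB video frame
logits = TinyFCN(num_classes=2)(frame)     # (1, 2, 64, 64) fg/bg scores
```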

Most existing surveys take a narrow perspective, e.g., focusing only on foreground/background segmentation of videos. This article systematically introduces the latest progress in VS, spanning from task formulation to taxonomy, from algorithms to datasets, and from unsolved problems to future research directions. The key points covered include:

  • Task category (foreground/background separation, semantic segmentation)
  • Inference models (automatic, semi-automatic, interactive)
  • Learning methods (supervised, unsupervised, weakly supervised)
  • Terminology clarification (background subtraction, motion segmentation)

This article focuses on the latest developments in the two main branches of VS (object segmentation and semantic segmentation), which are further divided into eight subfields. It draws on influential works from prestigious journals and conferences, and also covers non-deep-learning video segmentation models and literature from related fields (e.g., visual tracking).

[Figure omitted] The video segmentation tasks reviewed in this article are:

  • Object-level automatic video object segmentation (object-level AVOS)
  • Instance-level automatic video object segmentation (instance-level AVOS)
  • Semi-Automatic Video Object Segmentation (SVOS)
  • Interactive Video Object Segmentation (IVOS)
  • Language-Guided Video Object Segmentation (LVOS)
  • Video Semantic Segmentation (VSS)
  • Video Instance Segmentation (VIS)
  • Video Panoptic Segmentation (VPS)

[Figure omitted] The structure of this article.

2. Background

2.1 Problem formulation and taxonomy

Let X and Y denote the input space and the output segmentation space, respectively. DL-based VS aims to find an ideal mapping $f^{*}: X \to Y$.
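
In standard supervised terms (a hedged sketch; the loss $\mathcal{L}$ and data distribution $\mathcal{D}$ are generic symbols, not notation from the paper), this mapping would be learned as:

```latex
f^{*} \;=\; \arg\min_{f} \; \mathbb{E}_{(X,\,Y) \sim \mathcal{D}} \left[ \mathcal{L}\bigl(f(X),\, Y\bigr) \right]
```

where $\mathcal{L}$ is, e.g., a per-pixel cross-entropy loss between the predicted and ground-truth segmentation.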

2.1.1 Categories of Video Segmentation (VS)

Based on how to define the output space Y, VS can be roughly divided into two categories: VOS and VSS.

Video Object (Foreground/Background) Segmentation (VOS): Y is a binary foreground/background segmentation space. VOS is used in video analysis and editing scenarios, such as removing objects in movies, content-based video coding, and generating virtual backgrounds in video conferences.

Video Semantic Segmentation (VSS): a direct extension of image semantic segmentation to the spatio-temporal domain. The goal is to extract objects belonging to predefined semantic categories (e.g., cars, buildings, sidewalks, roads) from videos; Y therefore corresponds to a multi-category semantic parsing space. VSS is the perceptual basis for many applications that require a high-level understanding of the environment, such as robotic perception, human-computer interaction, and autonomous driving.

Comments: VOS and VSS share common challenges, such as fast motion and object occlusion, but different application scenarios pose different difficulties. For example, VOS usually focuses on human-created media, which feature large camera motion, deformation, and appearance changes; VSS usually focuses on applications such as autonomous driving, which demand a trade-off between accuracy and latency, accurate detection of small objects, model parallelism, and cross-domain generalization.

2.1.2 Inference Modes of Video Segmentation (VS)

Based on the degree of human involvement in inference, VOS is further divided into three categories: automatic, semi-automatic, and interactive.

Automatic Video Object Segmentation (AVOS): also known as unsupervised VOS or zero-shot VOS. It performs segmentation automatically, without any manual initialization.

Semi-automatic Video Object Segmentation (SVOS): also known as semi-supervised VOS or one-shot VOS; it finds the desired target with limited human supervision, usually provided in the first frame. The typical human input is the object mask in the first frame of the video, in which case SVOS is also called pixel tracking or mask propagation. From this perspective, Language-Guided Video Object Segmentation (LVOS) is a branch of SVOS in which the human input is a verbal description of the desired object. Compared with AVOS, SVOS is more flexible in defining target objects, but requires manual input.

Interactive Video Object Segmentation (IVOS): whereas SVOS runs automatically once the target is specified, IVOS relies on human guidance throughout the segmentation process.

Unlike VOS, VSS almost always operates in automatic mode; only a few early methods used a semi-automatic mode, such as label propagation.

2.1.3 Learning Method for Video Segmentation (VS)

According to the training strategy, DL-based VS can be divided into three categories: supervised, unsupervised, and weakly supervised.

Supervised learning: learning entirely from labeled data, driving the model output to match the labels.

Unsupervised (self-supervised) learning: learning entirely from unlabeled data. It includes fully unsupervised learning (no labels of any kind) and self-supervised learning (no manual labeling; the network is trained with automatically generated pseudo labels). Almost all existing unsupervised VS methods are self-supervised.

Weakly-supervised learning: learning with a limited amount of labeled data, where the labels are easy to produce, such as boundaries.

2.2 History and Terminology

An early attempt at VS was video over-segmentation: grouping pixels based on discontinuities and similarities of pixel intensities within local regions. Typical methods include hierarchical video segmentation, temporal superpixels, and super-voxels. These methods are suitable for video preprocessing, but cannot solve object-level segmentation on their own, because they cannot reduce a hierarchical video decomposition to a binary segmentation.

Binary segmentation: convert the image to grayscale, choose a threshold, and traverse every pixel of the grayscale image; if a pixel's gray value exceeds the threshold, it is set to 255, otherwise it is set to 0.
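
As a minimal illustration (the threshold value 128 and the file name are arbitrary assumptions), with OpenCV and NumPy:

```python
import numpy as np
import cv2

frame = cv2.imread("frame.png")                   # hypothetical input frame
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)    # convert to grayscale
# Pixels above the threshold become 255 (foreground), others 0 (background).
mask = np.where(gray > 128, 255, 0).astype(np.uint8)
```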

In order to extract foreground objects from video sequences, background subtraction appeared in the late 1970s. These methods assume that the background is known a priori and that the camera is stationary or undergoes predictable, parametric 2D or 3D motion with 3D parallax. Such geometry-based methods are well suited to specific application scenarios, such as surveillance systems, but they are sensitive to model selection (2D or 3D) and cannot handle cameras that move non-deterministically.

Parallax: the apparent difference in the position or orientation of an object when viewed from different positions.

Motion segmentation: finding moving objects. Background subtraction can be regarded as a special case of motion segmentation. However, most motion segmentation models are built on motion analysis, factorization, and statistical techniques that model the features of moving scenes when the camera motion pattern is unknown.

Trajectory segmentation: a type of motion segmentation. Trajectories are generated from points tracked over multiple frames; they represent long-term motion patterns and serve as informative cues for segmentation. Motion-based methods rely heavily on the accuracy of optical flow estimation and may fail when different parts of an object exhibit different motion patterns.

Optical flow: an important tool for analyzing moving images, referring to the motion of brightness patterns in a time-varying image; as an object moves, the brightness pattern of its corresponding points on the image moves with it.

Optical flow analysis rests on two key assumptions: (1) the pixel intensity of an object does not change between consecutive frames (brightness constancy); (2) neighboring pixels have similar motion.
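
As a hedged illustration, dense optical flow can be estimated with OpenCV's Farneback method, which builds on exactly these two assumptions (file names and parameter values below are placeholders):

```python
import cv2

prev = cv2.cvtColor(cv2.imread("frame_t.png"), cv2.COLOR_BGR2GRAY)
curr = cv2.cvtColor(cv2.imread("frame_t1.png"), cv2.COLOR_BGR2GRAY)
# flow has shape (H, W, 2): the per-pixel displacement (dx, dy).
flow = cv2.calcOpticalFlowFarneback(
    prev, curr, None,
    pyr_scale=0.5, levels=3, winsize=15,
    iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
```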

AVOS can overcome the limitations mentioned above. Some methods generate a large number of object candidates in each frame and recast video object segmentation as an object-region selection problem; their main drawbacks are heavy computation and complex target inference. Others explore heuristic assumptions such as visual attention and motion boundaries, but fail easily in scenarios where those assumptions do not hold.

Heuristic: an experience-based rule or method for judging things, as opposed to an exhaustive, guaranteed procedure.

Motion boundary: the boundary (contour) information of moving objects, i.e., where the motion field changes abruptly.
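
A small illustrative sketch of this idea (an assumption for illustration, not the paper's method): motion boundaries can be approximated as locations where the optical flow field changes abruptly, i.e., where the spatial gradient of the flow is large.

```python
import numpy as np

def motion_boundary_strength(flow: np.ndarray) -> np.ndarray:
    """flow: (H, W, 2) per-pixel displacements (e.g., from the Farneback sketch above)."""
    dy_u, dx_u = np.gradient(flow[..., 0])   # spatial gradients of x-flow
    dy_v, dx_v = np.gradient(flow[..., 1])   # spatial gradients of y-flow
    # Large gradient magnitude suggests a motion boundary at that pixel.
    return np.sqrt(dx_u**2 + dy_u**2 + dx_v**2 + dy_v**2)
```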

Early SVOS usually relied on optical flow, much like object tracking. IVOS, in turn, accomplishes high-quality video segmentation with extensive human guidance. The flexibility and accuracy of SVOS and IVOS come at a price: human involvement makes them impractical at large scale.

Due to the complexity of the VSS task, there were few related studies before the DL era; existing methods mainly relied on supervised classifiers (e.g., SVMs) and video over-segmentation.

In summary, compared with the previous methods, the DL-based method further improves the performance of VS.

2.3 Related Research Fields

Visual tracking: to infer the position of an object over time, existing trackers usually assume that the object has been delineated in the first frame of the video. Visual tracking and VS share common challenges (e.g., object/camera motion, appearance change, object occlusion), which motivates their joint use.

Image semantic segmentation: the success of end-to-end image semantic segmentation has driven the rapid development of VSS. Recent VSS methods exploit temporal continuity to improve segmentation accuracy and efficiency, rather than applying image semantic segmentation frame by frame.

Video object detection: video object detectors exploit box-level or feature-level temporal cues. Video object detection and (instance-level) video segmentation share many key technical steps and challenges, such as object proposal generation, temporal information aggregation, and inter-frame object association.

The basic idea of object proposals is to find a set of candidate regions likely to contain objects, rather than exhaustively scanning the image; these candidates are then fed into an object recognition model for classification.

3. Deep learning (DL) based video segmentation (VS)

3.1 DL-Based Video Object Segmentation (VOS)

VOS extracts generic foreground objects from video sequences without considering semantic category recognition. Based on human participation, VOS is divided into AVOS, SVOS, and IVOS.

3.1.1 Automatic Video Object Segmentation (AVOS)

Modern AVOS learns generic video object patterns in a data-driven manner.

[Figure omitted] Characteristics of representative AVOS techniques, where Instance indicates instance-level or object-level segmentation.

DL-based approaches:

  • In 2015, Fragkiadaki et al. made an early effort, learning a multilayer perceptron to rank proposal segments and infer foreground objects.
  • In 2016, Tsai et al. proposed a joint optimization framework for AVOS and optical flow estimation that uses deep features from a pretrained classification network.
  • Later methods predict an initial, pixel-wise foreground estimate from frames or optical flow, although some subsequent refinement steps are still required.
  • Basically, these early solutions still relied mainly on traditional AVOS techniques and did not fully exploit the learning capacity of neural networks.

Pixel-instance-embedding-based methods: first generate pixel-level instance embeddings, then select representative embeddings and cluster them as foreground or background; finally, the labels of the sampled embeddings are propagated to the remaining embeddings (a minimal sketch follows below). Clustering and propagation are unsupervised. While requiring fewer annotations, these pipelines are fragmented and complex.
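
Here is a hedged, minimal sketch of the embed-cluster-propagate pipeline (the shapes, the choice of k-means, and nearest-centroid propagation are illustrative assumptions, not any specific paper's design):

```python
import numpy as np
from sklearn.cluster import KMeans

# Suppose a network produced a (H*W, D) matrix of pixel embeddings.
H, W, D = 60, 80, 16
embeddings = np.random.randn(H * W, D).astype(np.float32)

# 1) Cluster a sample of representative embeddings into fg/bg (k=2).
sample_idx = np.random.choice(H * W, size=500, replace=False)
kmeans = KMeans(n_clusters=2, n_init=10).fit(embeddings[sample_idx])

# 2) Propagate the cluster labels to all remaining pixels by nearest centroid.
labels = kmeans.predict(embeddings).reshape(H, W)  # 0/1 foreground map
```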

End-to-end methods based on short-term information encoding:

  • Convolutional recurrent neural networks (CRNNs; a CNN extracts features, an RNN makes predictions from them) are used to learn spatio-temporal visual patterns.
  • Two-stream methods: two parallel streams extract features from images and from optical flow, and the fused two-stream features are used for segmentation prediction (see the sketch after this list). Two-stream methods make full use of appearance and motion information, at the cost of optical flow computation and a large number of parameters to learn.
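
The sketch below illustrates the two-stream idea (a hypothetical toy architecture; fusion by channel concatenation is an assumption, not a specific paper's design):

```python
import torch
import torch.nn as nn

def make_stream(in_ch: int) -> nn.Sequential:
    """A tiny convolutional encoder, instantiated once per stream."""
    return nn.Sequential(
        nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(inplace=True))

class TwoStreamSeg(nn.Module):
    def __init__(self):
        super().__init__()
        self.appearance = make_stream(3)   # RGB frame
        self.motion = make_stream(2)       # optical flow (dx, dy)
        self.head = nn.Conv2d(128, 1, 1)   # fused features -> fg logit

    def forward(self, frame, flow):
        # Fuse appearance and motion features by channel concatenation.
        fused = torch.cat([self.appearance(frame), self.motion(flow)], dim=1)
        return self.head(fused)            # (B, 1, H, W) foreground logits

logits = TwoStreamSeg()(torch.randn(1, 3, 64, 64), torch.randn(1, 2, 64, 64))
```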

These end-to-end methods improve accuracy and demonstrate the advantage of neural networks. However, they only consider local content within a limited time span: they extract appearance and motion information from a few consecutive input frames and ignore relationships between distant frames. Although RNNs are commonly used, their internal hidden memory imposes inherent limitations on modeling long-term dependencies.

End-to-end methods based on long-term context encoding: current leading AVOS methods exploit global context over a long time span.

  • Lu et al. proposed a model based on a Siamese structure: extract features for any pair of frames, then obtain cross-frame context by computing pixel-level feature correlations (see the sketch after this list).
  • Another contemporary approach follows a similar idea, but uses only the first frame as the reference.
  • There are also extended studies that improve the use of information across multiple frames, encode spatial context, and incorporate temporal continuity, thereby improving representational power and computational efficiency.
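
The following sketch shows the core cross-frame correlation operation in such a Siamese setup (the shapes and the softmax-attention readout are illustrative assumptions):

```python
import torch

B, C, H, W = 1, 64, 30, 40
feat_a = torch.randn(B, C, H, W)   # features of frame A (shared encoder)
feat_b = torch.randn(B, C, H, W)   # features of frame B (shared encoder)

a = feat_a.flatten(2)                       # (B, C, H*W)
b = feat_b.flatten(2)                       # (B, C, H*W)
# Pixel-level affinity between every pixel of A and every pixel of B.
affinity = torch.bmm(a.transpose(1, 2), b)  # (B, H*W, H*W)

# Each pixel of A attends to all pixels of B to gather cross-frame context.
context = torch.bmm(b, affinity.softmax(dim=2).transpose(1, 2))
context = context.view(B, C, H, W)          # context-enriched features of A
```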

Un-/weakly-supervised methods: only a few AVOS models are trained with unlabeled or weakly labeled data.

Compared with VS annotations, static image object segmentation data and dynamic gaze data are more accessible, and are used to learn general video object patterns.

Visual patterns are learned by exploring intrinsic properties of videos at multiple granularities, such as intra-frame saliency, short-term visual coherence, long-range semantic correspondence, and video-level discrimination.

By minimizing the mutual information between an object's motion and its context, an adversarial context model is developed to segment moving objects without any manual annotation. The method can be further enhanced by bootstrapping strategies and enforcing temporal continuity.

Motion is specifically studied for detecting moving objects, and Transformer-based models are designed and trained using self-supervised flow reconstruction from unlabeled video data.

Instance-level AVOS methods: also known as multi-object unsupervised video segmentation. This task is more challenging because it must not only separate multiple foreground objects from the background, but also distinguish different object instances. Current solutions are top-down: generate candidate instances in each frame, then associate instances across frames (a matching sketch follows below).
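
As referenced above, here is a hedged sketch of the cross-frame association step (IoU costs with Hungarian assignment are an illustrative choice, not necessarily what any specific method uses):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def mask_iou(a: np.ndarray, b: np.ndarray) -> float:
    """IoU of two boolean instance masks of the same shape."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def associate(prev_masks, curr_masks):
    """Match instances of frame t-1 to frame t by maximizing total IoU."""
    cost = np.array([[1.0 - mask_iou(p, c) for c in curr_masks]
                     for p in prev_masks])
    rows, cols = linear_sum_assignment(cost)   # Hungarian algorithm
    return list(zip(rows, cols))               # (prev_idx, curr_idx) pairs
```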

In summary, current instance-level AVOS follows the classical tracking-by-detection paradigm, and there is still considerable room for improvement in both accuracy and efficiency.

3.1.2 Semi-automatic Video Object Segmentation (SVOS)

DL-based SVOS mainly focuses on propagating the object mask given in the first frame. Techniques can be categorized by how the target mask is used at test time.

Online fine-tuning-based methods: following the one-shot paradigm, a segmentation model is trained online for each given target mask. Fine-tuning essentially exploits the transfer-learning ability of neural networks and usually involves two steps (a sketch of the online step follows the list):

  • Offline pre-training: learn general segmentation features from images and video sequences;
  • Online fine-tuning: learn a target-specific representation via supervised learning on the given mask.
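
A hedged sketch of the online fine-tuning step (the model interface, loss, learning rate, and iteration count are illustrative assumptions):

```python
import torch
import torch.nn as nn

def online_finetune(model: nn.Module, first_frame: torch.Tensor,
                    first_mask: torch.Tensor, steps: int = 100):
    """Adapt a pretrained segmentation model to one specific target."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-5)
    criterion = nn.BCEWithLogitsLoss()
    model.train()
    for _ in range(steps):
        optimizer.zero_grad()
        logits = model(first_frame)           # (1, 1, H, W) fg logits
        loss = criterion(logits, first_mask)  # fit the first-frame mask
        loss.backward()
        optimizer.step()
    return model  # now specialized to the annotated object
```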

However, the fine-tuning approach has several disadvantages:

  • The offline pre-training is fixed and not optimized for the subsequent fine-tuning;
  • The hyperparameters of online fine-tuning are usually highly hand-tuned and therefore generalize poorly;
  • Online fine-tuning has a high runtime (up to 1000 training iterations per segmented target); the root cause is that these methods must encode all object-related information (e.g., appearance, mask) into the network weights.

To perform fine-tuning automatically and efficiently, researchers turned to meta-learning, i.e., optimizing the fine-tuning policy (e.g., a general model initialization and learning rates) or directly predicting the network weights.

Propagation-based approaches:
