Paper Reading Notes (30): Learning to Segment Every Thing

Existing methods for object instance segmentation require all training instances to be labeled with segmentation masks. This requirement makes it expensive to annotate new categories and has restricted instance segmentation models to ∼100 well-annotated classes. The goal of this paper is to propose a new partially supervised training paradigm, together with a novel weight transfer function, that enables training instance segmentation models over a large set of categories for which all have box annotations, but only a small fraction have mask annotations. These contributions allow us to train Mask R-CNN to detect and segment 3000 visual concepts using box annotations from the Visual Genome dataset and mask annotations from the 80 classes in the COCO dataset. We carefully evaluate our proposed approach in a controlled study on the COCO dataset. This work is a first step towards instance segmentation models that have broad comprehension of the visual world.

Existing methods for object instance segmentation require all training instances to be labeled with segmentation masks. This requirement makes annotating new classes expensive and limits instance segmentation models to around 100 well-annotated classes. The goal of this paper is to propose a new partially supervised training paradigm, as well as a novel weight transfer function, that can train instance segmentation models on a large number of categories, all of which have box annotations but only a small subset of which have mask annotations. With these contributions, we can train Mask R-CNN to detect and segment 3000 visual concepts using box annotations from the Visual Genome dataset and mask annotations from the 80 classes of the COCO dataset. We carefully evaluate our proposed method in a controlled study on the COCO dataset. This work is a first step towards instance segmentation models with a broad understanding of the visual world.

Object detectors have become significantly more accurate (e.g., [10, 33]) and gained important new capabilities. One of the most exciting is the ability to predict a foreground segmentation mask for each detected object (e.g., [15]), a task called instance segmentation. In practice, typical instance segmentation systems are restricted to a narrow slice of the vast visual world that includes only around 100 object categories.
A principal reason for this limitation is that state-of-the-art instance segmentation algorithms require strong supervision and such supervision may be limited and expensive to collect for new categories [22]. By comparison, bounding box annotations are more abundant and less expensive [4]. This fact raises a question: Is it possible to train high-quality instance segmentation models without complete instance segmentation annotations for all categories? With this motivation, our paper introduces a new partially supervised instance segmentation task and proposes a novel transfer learning method to address it.

We formulate the partially supervised instance segmentation task as follows: (1) given a set of categories of interest, a small subset has instance mask annotations, while the other categories have only bounding box annotations; (2) the instance segmentation algorithm should utilize this data to fit a model that can segment instances of all object categories in the set of interest. Since the training data is a mixture of strongly annotated examples (those with masks) and more weakly annotated examples (those with only boxes), we refer to the task as partially supervised.
The main benefit of the proposed partially supervised paradigm is that it allows us to build a large-scale instance segmentation model by exploiting both types of existing datasets: those with bounding box annotations over a large number of classes, such as Visual Genome [19], and those with instance mask annotations over a small number of classes, such as COCO [22]. As we will show, this enables us to scale state-of-the-art instance segmentation methods to thousands of categories, a capability that is critical for their deployment in real-world uses.
To address partially supervised instance segmentation, we propose a novel transfer learning approach built on Mask R-CNN [15]. Mask R-CNN is well-suited to our task because it decomposes the instance segmentation problem into the subtasks of bounding box object detection and mask prediction. These subtasks are handled by dedicated network ‘heads’ that are trained jointly. The intuition behind our approach is that once trained, the parameters of the bounding box head encode an embedding of each object category that enables the transfer of visual information for that category to the partially supervised mask head.
We materialize this intuition by designing a parameterized weight transfer function that is trained to predict a category’s instance segmentation parameters as a function of its bounding box detection parameters. The weight transfer function can be trained end-to-end in Mask R-CNN using classes with mask annotations as supervision. At inference time, the weight transfer function is used to predict the instance segmentation parameters for every category, thus enabling the model to segment all object categories, including those without mask annotations at training time.
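The idea of predicting mask weights from box weights can be sketched in a few lines. The following is a minimal illustration only, not the authors' implementation: the two-layer MLP with a leaky ReLU, all dimensions, and the names `transfer`, `theta`, `w_det`, and `w_seg` are assumptions chosen for the example, and the weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_CLASSES = 6   # hypothetical size of the full category set
DET_DIM = 8       # hypothetical length of each class's detection weight vector
SEG_DIM = 4       # hypothetical length of each class's mask weight vector
HIDDEN = 16       # hypothetical hidden width of the transfer function

# Per-class detection weights, standing in for what the box head would learn.
w_det = rng.normal(size=(NUM_CLASSES, DET_DIM))

# Class-agnostic parameters of the transfer function, shared by all classes.
theta = {
    "W1": rng.normal(scale=0.1, size=(DET_DIM, HIDDEN)),
    "W2": rng.normal(scale=0.1, size=(HIDDEN, SEG_DIM)),
}

def leaky_relu(x, slope=0.01):
    return np.where(x > 0, x, slope * x)

def transfer(w_det, theta):
    """Map each row of detection weights to a row of mask weights."""
    return leaky_relu(w_det @ theta["W1"]) @ theta["W2"]

# Mask weights are predicted for every class, with or without mask labels.
w_seg = transfer(w_det, theta)                      # (NUM_CLASSES, SEG_DIM)

# Use each class's predicted weights as a 1x1 convolution over RoI features.
features = rng.normal(size=(SEG_DIM, 28, 28))       # (channels, H, W)
mask_logits = np.einsum("kc,chw->khw", w_seg, features)
```

Because `theta` is shared across classes, gradients from classes with mask labels are enough to fit it, and the same function can then be evaluated on the detection weights of classes that never had masks.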
We evaluate our approach in two settings. First, we use the COCO dataset [22] to simulate the partially supervised instance segmentation task as a means of establishing quantitative results on a dataset with high-quality annotations and evaluation metrics. Specifically, we split the full set of COCO categories into a subset with mask annotations and a complementary subset for which the system has access to only bounding box annotations. Because the COCO dataset involves only a small number (80) of semantically well-separated classes, quantitative evaluation is precise and reliable. Experimental results show that our method improves results over a strong baseline with up to a 40% relative increase in mask AP on categories without training masks.
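The simulated setting described above amounts to partitioning the category list and hiding masks for one part. A rough sketch, assuming a random split and a hypothetical annotation layout with `category`, `bbox`, and `mask` keys (neither the split rule nor the field names come from the paper):

```python
import random

def split_categories(categories, num_with_masks, seed=0):
    """Partition categories into set A (masks kept) and set B (boxes only)."""
    rng = random.Random(seed)
    shuffled = sorted(categories)
    rng.shuffle(shuffled)
    set_a = set(shuffled[:num_with_masks])
    set_b = set(shuffled[num_with_masks:])
    return set_a, set_b

def hide_masks(annotations, set_b):
    """Return copies of the annotations with masks dropped for set B."""
    result = []
    for ann in annotations:
        ann = dict(ann)
        if ann["category"] in set_b:
            ann.pop("mask", None)   # the box annotation is kept
        result.append(ann)
    return result

# Hypothetical toy data, not real COCO annotations.
toy_categories = ["bus", "car", "cat", "dog"]
toy_annotations = [
    {"category": c, "bbox": [0, 0, 10, 10], "mask": [[0, 1], [1, 1]]}
    for c in toy_categories
]
set_a, set_b = split_categories(toy_categories, num_with_masks=2)
train_annotations = hide_masks(toy_annotations, set_b)
```

The held-out masks for set B remain available for evaluation, which is what makes quantitative measurement possible in this setting.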
In our second setting, we train a large-scale instance segmentation model on 3000 categories using the Visual Genome (VG) dataset [19]. VG contains bounding box annotations for a large number of object categories; however, quantitative evaluation is challenging as many categories are semantically overlapping (e.g., near synonyms) and the annotations are not exhaustive, making precision and recall difficult to measure. Moreover, VG is not annotated with instance masks. Instead, we use VG to provide qualitative output of a large-scale instance segmentation model. Output of our model is illustrated in Figures 1 and 5.
2. Related Work
Instance segmentation. Instance segmentation is a highly active research area [12, 13, 5, 31, 32, 6, 14, 20, 18, 2], with Mask R-CNN [15] representing the current state-of-the-art. These methods assume a fully supervised training scenario in which all categories of interest have instance mask annotations during training. Fully supervised training, however, makes it difficult to scale these systems to thousands of categories. The focus of our work is to relax this assumption and enable training models even when masks are available for only a small subset of categories. To do this, we develop a novel transfer learning approach built on Mask R-CNN.
Weight prediction and task transfer learning. Instead of directly learning model parameters, prior work has explored predicting them from other sources (e.g., [11]). In [8], image classifiers are predicted from the natural language description of a zero-shot category. In [26], a small neural network is used to predict the classifier weights of the composition of two concepts from the classifier weights of each individual concept. Here, we design a model that predicts the class-specific instance segmentation weights used in Mask R-CNN, instead of training them directly, which is not possible in our partially supervised training scenario.
Our approach is also a type of transfer learning [27] where knowledge gained from one task helps with another task. Most related to our work, LSDA [17] transforms whole-image classification parameters into object detection parameters through a domain adaptation procedure. LSDA can be seen as transferring knowledge learned on an image classification task to an object detection task, whereas we consider transferring knowledge learned from bounding box detection to instance segmentation.
Weakly supervised semantic segmentation. Prior work trains semantic segmentation models from weak supervision. (Note that semantic segmentation is a pixel-labeling task that is different from instance segmentation, which is an object detection task.) Image-level labels and object size constraints are used in [29], while other methods use boxes as supervision for expectation-maximization [28] or iterating between proposal generation and training [4]. Point supervision and objectness potentials are used in [3]. Most work in this area addresses only semantic segmentation, treats each class independently, and relies on hand-crafted bottom-up proposals that generalize poorly. Our work is complementary to these approaches, as we explore generalizing segmentation models trained from a subset of classes to other classes without relying on bottom-up segmentation.
Visual embeddings. Object categories may be modeled by continuous ‘embedding’ vectors in a visual-semantic space, where nearby vectors are often close in appearance or semantic ontology. Class embedding vectors may be obtained via natural language processing techniques (e.g. word2vec [25] and GloVe [30]), from visual appearance information (e.g. [7]), or both (e.g. [35]). In our work, the parameters of Mask R-CNN’s box head contain class-specific appearance information and can be seen as embedding vectors learned by training for the bounding box object detection task. The class embedding vectors enable transfer learning in our model by sharing appearance information between visually related classes. We also compare with the NLP-based GloVe embeddings [30] in our experiments.

Let C be the set of object categories (i.e., ‘things’ [1]) for which we would like to train an instance segmentation model. Most existing approaches assume that all training examples in C are annotated with instance masks. We relax this requirement and instead assume that C = A ∪ B where examples from the categories in A have masks, while those in B have only bounding boxes. Since the examples of the B categories are weakly labeled w.r.t. the target task (instance segmentation), we refer to training on the combination of strong and weak labels as a partially supervised learning problem. Noting that one can easily convert instance masks to bounding boxes, we assume that bounding box annotations are also available for classes in A.
Given an instance segmentation model like Mask R-CNN that has a bounding box detection component and a mask prediction component, we propose the MaskX R-CNN method that transfers category-specific information from the model’s bounding box detectors to its instance mask predictors.
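One practical consequence of the C = A ∪ B setup is that the mask loss can only be computed for instances whose category is in A, while box supervision is available for every instance. A minimal sketch of that selection, assuming flattened masks, a per-pixel binary cross-entropy, and hypothetical names (`partially_supervised_mask_loss`, the `category`/`pred_mask`/`gt_mask` keys) that are not taken from the paper:

```python
import math

def bce(probs, targets, eps=1e-7):
    """Mean per-pixel binary cross-entropy over a flattened mask."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(probs)

def partially_supervised_mask_loss(instances, classes_with_masks):
    """Average the mask loss over set-A instances; set-B instances are skipped."""
    losses = [
        bce(inst["pred_mask"], inst["gt_mask"])
        for inst in instances
        if inst["category"] in classes_with_masks
    ]
    return sum(losses) / len(losses) if losses else 0.0

# Hypothetical toy batch: "dog" is in A (has a mask), "kite" is in B (box only).
SET_A = {"dog"}
dog = {"category": "dog", "pred_mask": [0.9, 0.1], "gt_mask": [1, 0]}
kite = {"category": "kite", "pred_mask": [0.5, 0.5], "gt_mask": None}

loss = partially_supervised_mask_loss([dog, kite], SET_A)
```

In this sketch the `kite` instance contributes nothing to the mask loss, so its missing ground-truth mask is never read; a box loss (not shown) would still be computed for it.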

This paper addresses the problem of large-scale instance segmentation by formulating a partially supervised learning paradigm in which only a subset of classes have instance masks during training while the rest have box annotations. We propose a novel transfer learning approach, where a learned weight transfer function predicts how each class should be segmented based on parameters learned for detecting bounding boxes. Experimental results on the COCO dataset demonstrate that our method greatly improves the generalization of mask prediction to categories without mask training data. Using our approach, we build a large-scale instance segmentation model over 3000 classes in the Visual Genome dataset. The qualitative results are encouraging and illustrate an exciting new research direction into large-scale instance segmentation. They also reveal that scaling instance segmentation to thousands of categories, without full supervision, is an extremely challenging problem with ample opportunity for improved methods.

Figure 2. Detailed illustration of our MaskX R-CNN method. Instead of directly learning the mask prediction parameters w_seg, MaskX R-CNN predicts a category’s segmentation parameters w_seg from its corresponding detection parameters w_det, using a learned weight transfer function T. For training, T only needs mask data for the classes in set A, yet it can be applied to all classes in set A ∪ B at test time. We also augment the mask head with a complementary fully connected multi-layer perceptron (MLP).
