(Paper Reading 26-27) Object Recognition

26. Literature reading notes

Introduction

topic

Is Object Localization for Free? Weakly-Supervised Learning with Convolutional Neural Networks

author

Maxime Oquab, Léon Bottou, Ivan Laptev, Josef Sivic, CVPR, 2015

Original link

http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Oquab_Is_Object_Localization_2015_CVPR_paper.pdf

Key words

CNN, multi-label classification

research problem

Training image classification with bounding-box annotations has several problems: marking the position and scale of every object with a box is laborious, boxes work poorly for partially occluded or truncated objects, and annotating object parts is even harder. This raises the question of whether localization can be learned from image-level labels alone.

Research methods

A weakly supervised convolutional neural network (CNN) for object classification that relies only on image-level labels and not on object bounding boxes.

Only the list of objects contained in the picture is annotated, not the location of the objects.

Based on AlexNet.

The first five convolutional layers are pre-trained on ImageNet; the subsequent adaptation layers are trained on the PASCAL VOC dataset.

First, the last fully connected layers are recast as convolutional layers to cope with the uncertainty in object localization. As a result, the network can handle images of almost any size as input.

Second, a single global max-pooling layer is added at the output: it hypothesizes the possible location of the object by explicitly searching for the highest-scoring object position in the image.
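
A minimal sketch of these first two points, assuming PyTorch/torchvision (not the authors' code): the pre-trained AlexNet convolutional layers are kept, the fully connected layers are replaced by convolutions ("adaptation layers", with illustrative sizes), and a global max pool keeps one score per class.

```python
# Sketch only: FC layers as convolutions + global max pooling (layer sizes are illustrative).
import torch
import torch.nn as nn
from torchvision import models

K = 20  # number of classes, e.g. PASCAL VOC

# Conv layers pre-trained on ImageNet (the "first five convolutional layers").
backbone = models.alexnet(weights="IMAGENET1K_V1").features

# Adaptation layers: former FC layers expressed as convolutions, so the network
# accepts inputs of almost any size and outputs a spatial score map.
adaptation = nn.Sequential(
    nn.Conv2d(256, 2048, kernel_size=6), nn.ReLU(inplace=True),
    nn.Conv2d(2048, 2048, kernel_size=1), nn.ReLU(inplace=True),
    nn.Conv2d(2048, K, kernel_size=1),  # one score map per class
)

def forward(images):
    score_map = adaptation(backbone(images))  # (B, K, h, w)
    scores = score_map.amax(dim=(2, 3))       # global max pool -> (B, K)
    return scores, score_map
```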

Third, the cost function is modified so that the network can learn from image-level supervision.

The task is treated as an independent binary classification problem for each class k ∈ {1, …, K}, so the loss is the sum of K binary logistic regression losses:

loss(f(x), y) = Σ_{k=1}^{K} log(1 + exp(−y_k · f_k(x)))

where f_k(x) is the classification score for class k and y_k ∈ {−1, +1} is the image-level label indicating whether class k is present. Each class score f_k(x) can be interpreted as a posterior probability of class k being present in image x.
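
A minimal sketch of this loss in PyTorch (the helper name is mine, not the paper's); it simply sums the K binary logistic losses per image:

```python
import torch

def image_level_loss(scores, labels):
    """scores: (B, K) class scores f_k(x); labels: (B, K) with entries in {-1, +1}."""
    # softplus(z) = log(1 + exp(z)), so this computes log(1 + exp(-y_k * f_k(x)))
    per_class = torch.nn.functional.softplus(-labels * scores)
    return per_class.sum(dim=1).mean()  # sum over classes, average over the batch
```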

Handling multiple scales: all training images are rescaled so that their largest side is 500 pixels and zero-padded to 500 × 500 pixels. Each training mini-batch of 16 images is then rescaled by a factor sampled uniformly between 0.7 and 1.4, so the network sees objects at different scales.
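
A minimal sketch of this preprocessing, assuming PyTorch (the padding position and interpolation mode are my assumptions, not stated in the notes):

```python
import random
import torch
import torch.nn.functional as F

def rescale_and_pad(image):
    """image: (3, H, W) tensor -> (3, 500, 500) with the largest side scaled to 500 px."""
    _, h, w = image.shape
    scale = 500.0 / max(h, w)
    new_h, new_w = min(500, round(h * scale)), min(500, round(w * scale))
    image = F.interpolate(image[None], size=(new_h, new_w),
                          mode="bilinear", align_corners=False)[0]
    out = torch.zeros(3, 500, 500)
    out[:, :new_h, :new_w] = image  # zero-padding
    return out

def random_scale_batch(batch):
    """batch: (16, 3, 500, 500); apply one scale factor sampled uniformly in [0.7, 1.4]."""
    s = random.uniform(0.7, 1.4)
    return F.interpolate(batch, scale_factor=s, mode="bilinear", align_corners=False)
```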

As a measure of localization, the authors map the location of the max-pooling output back to the original image and compare it with the ground-truth bounding box. The tolerance is 18 pixels, i.e. the bounding box is extended outward by 18 pixels; if the predicted location falls inside this enlarged box, the localization is counted as correct.
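
A minimal sketch of this localization check (the stride/offset used to map score-map coordinates back to image coordinates are assumptions, not values from the paper):

```python
import torch

def predicted_location(score_map, class_k, stride=32, offset=16):
    """score_map: (K, h, w) score map of one image; returns an (x, y) image coordinate."""
    flat = score_map[class_k].argmax()
    y, x = divmod(flat.item(), score_map.shape[2])
    return x * stride + offset, y * stride + offset

def localization_correct(pred_xy, box, tol=18):
    """box = (xmin, ymin, xmax, ymax); correct if the point lies in the box grown by tol px."""
    x, y = pred_xy
    return (box[0] - tol <= x <= box[2] + tol) and (box[1] - tol <= y <= box[3] + tol)
```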

Analysis conclusion

Can learn from cluttered scenes containing multiple objects.

The modified CNN architecture localizes objects or their distinctive parts in the training images while being trained only with image-level labels.

Weakly supervised networks can predict the approximate location of an object in a scene (in the form of x, y positions), but not the extent of the object (the bounding box).

Searching over only six different scales at test time is enough to achieve good classification performance; adding finer or wider-ranging scale searches brought no additional benefit.

Insufficient innovation

The criterion used to judge localization is defined by the authors themselves and is not a standard, universally used metric.

additional knowledge

none

27. Literature reading notes

Introduction

topic

Deep Filter Banks for Texture Recognition and Segmentation

author

Mircea Cimpoi, Subhransu Maji, Andrea Vedaldi, CVPR, 2015

Original link

http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Cimpoi_Deep_Filter_Banks_2015_CVPR_paper.pdf

Key words

Texture recognition and segmentation, FV-CNN

research problem

Research on texture recognition usually focuses on material recognition under clutter-free conditions, an assumption that rarely holds in real applications.

Disadvantages of CNN in representing texture:

  1. Convolutional layers are similar to nonlinear filter banks, while fully connected layers capture their spatial layout. While this may be useful for representing the shape of an object, it may not be as useful for representing texture.
  2. The input to a CNN must be of fixed size to be compatible with fully connected layers, which requires expensive resizing of the input image, especially when computing features for many different regions.
  3. Deeper layers may be more domain-specific and therefore potentially less portable than shallower layers.

The paper compares an object descriptor, FC-CNN (features from the fully connected layers), with a texture descriptor, FV-CNN (Fisher Vector pooling of convolutional features).

Research methods

A new texture descriptor, FV-CNN, is developed.

Filter banks are designed to capture edges, blobs, and lines at different scales and orientations.

The convolutional layers of a CNN are treated as a filter bank, and the Fisher Vector (FV) is used as the pooling mechanism to build an orderless representation on top of them.
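
A minimal sketch of this FV-CNN idea, assuming PyTorch, torchvision and scikit-learn (a pre-trained VGG-16 stands in for the paper's networks, and multi-scale extraction is omitted); the improved Fisher Vector below follows the standard formulation, not necessarily the authors' exact implementation:

```python
import numpy as np
import torch
from torchvision import models
from sklearn.mixture import GaussianMixture

cnn = models.vgg16(weights="IMAGENET1K_V1").features.eval()

def local_descriptors(image):
    """image: (3, H, W) tensor -> (N, D) array, one conv descriptor per spatial position."""
    with torch.no_grad():
        fmap = cnn(image[None])[0]        # (D, h, w)
    return fmap.flatten(1).T.numpy()      # (h*w, D)

def fisher_vector(desc, gmm):
    """desc: (N, D) local descriptors; gmm: fitted diagonal-covariance GaussianMixture."""
    N, _ = desc.shape
    q = gmm.predict_proba(desc)                                   # (N, K) posteriors
    mu, var, w = gmm.means_, gmm.covariances_, gmm.weights_
    diff = (desc[:, None, :] - mu[None]) / np.sqrt(var)[None]     # (N, K, D)
    g_mu = (q[:, :, None] * diff).sum(0) / (N * np.sqrt(w)[:, None])
    g_var = (q[:, :, None] * (diff ** 2 - 1)).sum(0) / (N * np.sqrt(2 * w)[:, None])
    fv = np.hstack([g_mu.ravel(), g_var.ravel()])
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                        # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                      # L2 normalization

# Usage: fit the GMM on descriptors sampled from training images, then encode any image.
# gmm = GaussianMixture(n_components=64, covariance_type="diag").fit(sampled_descriptors)
# descriptor = fisher_vector(local_descriptors(img), gmm)
```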

Training is performed on images "in the wild", where each image contains many high-quality texture/material segments.

The pooling is orderless and multi-scale, so it is well suited to texture. In addition, the convolutional layers can handle images of any size, which avoids costly resizing operations.

This approach decomposes the problem into generating tentative segmentations at low cost and then validating them using a more powerful (and potentially more expensive) model.
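
A minimal sketch of this decomposition; `propose_regions`, `describe_fv_cnn`, and `classifier` are hypothetical placeholders for a cheap proposal method, the FV-CNN region descriptor sketched above, and e.g. a one-vs-rest linear SVM:

```python
import numpy as np

def segment(image, propose_regions, describe_fv_cnn, classifier):
    """Label each tentative region with its best class and the corresponding score."""
    results = []
    for region in propose_regions(image):        # cheap, possibly noisy region proposals
        fv = describe_fv_cnn(image, region)      # FV pooled only over the region's pixels
        scores = classifier.decision_function(fv[None])[0]
        results.append((region, int(np.argmax(scores)), float(np.max(scores))))
    return results
```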

Analysis conclusion

The authors report that FV-CNN performs very strongly, achieving state-of-the-art results on several texture and material recognition benchmarks.

Insufficient innovation

In my view the writing is quite disorganized, and even after reading the paper I do not fully understand what the method does.

additional knowledge

FV: Fisher Vector (see CSDN blog posts on the Fisher Vector and its basic principles).
