26. Literature reading notes |
Introduction |
topic |
Weakly-supervised learning with convolutional neural networks |
author |
Maxime Oquab, Léon Bottou, Ivan Laptev, Josef Sivic. CVPR, 2015 |
|
Original link |
Key words |
CNN, multi-label classification |
|
research problem |
Training image classifiers from bounding-box annotations has several problems: marking the position and scale of every object with a bounding box is expensive; boxes work poorly for partially occluded or truncated objects; and parts of objects are difficult to annotate. |
|
Research methods |
A weakly supervised convolutional neural network (CNN) for object classification that relies only on image-level labels, not on object bounding boxes: only the list of objects present in an image is annotated, not their locations. The network is based on AlexNet; the first five convolutional layers are pre-trained on ImageNet and the following layers are trained on the Pascal VOC dataset. First, the last fully connected layers are treated as convolutions, making the network fully convolutional, so it can take images of almost any size as input and cope with the uncertainty in object localization. Second, a single global max-pooling layer is added at the output, which explicitly searches for the highest-scoring location of each object in the image. Third, the cost function is modified to learn from image-level supervision: the task is treated as K independent binary classification problems, one per class, so the loss is the sum of K binary logistic regression losses,

ℓ(f(x), y) = Σ_{k=1..K} log(1 + exp(−y_k f_k(x))),

where f_k(x) is the network's score for class k ∈ {1 ... K} and y_k ∈ {−1, +1} is its image-level label. Each score f_k(x) can be interpreted as a posterior probability that class k is present in image x.
Handling multiple scales: all training images are rescaled so that their larger side is 500 pixels and zero-padded to 500 × 500; each training mini-batch of 16 images is then rescaled by a factor sampled uniformly between 0.7 and 1.4, so the network sees objects at different scales. As a localization measure, the authors map the output of the max-pooling layer back to the original image and compare it with the ground-truth bounding boxes: the tolerance is 18 pixels, i.e. each bounding box is expanded outward by 18 pixels, and a prediction falling inside the expanded box counts as a correct localization. |
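The two key pieces described above (global max pooling over per-class score maps, and a sum of K binary logistic losses) can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation; the array shapes and function names are assumptions:

```python
import numpy as np

def global_max_pool(score_maps):
    """Global max pooling over per-class score maps.

    score_maps: (K, H, W) array of per-class scores from the
    fully convolutional network. Returns the pooled score and the
    (row, col) location of the maximum for each class.
    """
    K, H, W = score_maps.shape
    flat = score_maps.reshape(K, -1)
    idx = flat.argmax(axis=1)
    scores = flat[np.arange(K), idx]
    locations = np.stack([idx // W, idx % W], axis=1)  # (y, x) per class
    return scores, locations

def multilabel_logistic_loss(f, y):
    """Sum of K binary logistic losses.

    f: (K,) pooled per-class scores; y: (K,) labels in {-1, +1}.
    """
    return np.sum(np.log1p(np.exp(-y * f)))
```

The pooled location per class is what the paper maps back to the original image to measure localization.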
|
Analysis conclusion |
The network can learn from cluttered scenes containing multiple objects. The modified CNN architecture localizes objects, or their distinctive parts, in the training images while being trained only to output image-level labels. Weakly supervised networks can predict the approximate location of an object in a scene (as an x, y position), but not its extent (a bounding box). Searching only six different scales at test time is enough to achieve good classification performance; adding coarser or finer scales to the search brought no additional benefit. |
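The six-scale test-time search amounts to taking, per class, the best score over rescaled copies of the image. A minimal sketch, where the exact scale values and the `score_fn` interface are hypothetical placeholders (the note only records that six scales were used):

```python
import numpy as np

# Hypothetical set of six test scales; the note does not record the exact values.
TEST_SCALES = (0.5, 0.7, 1.0, 1.4, 2.0, 2.8)

def multiscale_scores(image, score_fn, scales=TEST_SCALES):
    """Run the fully convolutional scorer at several scales and keep,
    for each class, the best score over all scales.

    score_fn(image, s) -> (K,) per-class scores at scale s.
    """
    per_scale = np.stack([score_fn(image, s) for s in scales])  # (S, K)
    return per_scale.max(axis=0)
```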
|
Insufficient innovation |
The criteria for judging positioning are defined by the author and are not universal. |
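For reference, the author-defined criterion in question (a predicted (x, y) location counts as correct if it falls inside the ground-truth box expanded by 18 pixels) amounts to this check; the bbox tuple layout is an assumption:

```python
def localization_correct(pred_xy, bbox, tol=18):
    """Tolerance test from the paper: the predicted point must lie inside
    the ground-truth box grown by `tol` pixels on every side.
    bbox is assumed to be (xmin, ymin, xmax, ymax) in pixels.
    """
    x, y = pred_xy
    xmin, ymin, xmax, ymax = bbox
    return (xmin - tol <= x <= xmax + tol) and (ymin - tol <= y <= ymax + tol)
```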
|
additional knowledge |
none |
27. Literature reading notes |
Introduction |
topic |
Deep Filter Banks for Texture Recognition and Segmentation |
author |
Mircea Cimpoi, Subhransu Maji, Andrea Vedaldi. CVPR, 2015 |
|
Original link |
http://www.cv-foundation.org/openaccess/content_cvpr_2015/papers/Cimpoi_Deep_Filter_Banks_2015_CVPR_paper.pdf |
|
Key words |
Texture recognition and segmentation, FV-CNN |
|
research problem |
Research in texture recognition usually focuses on material recognition under controlled, interference-free conditions, an assumption rarely met in real applications. A drawback of standard CNNs for representing texture is that fully connected features encode overall shape and spatial layout, which suits objects better than orderless texture statistics. The paper therefore compares object (FC-CNN) and texture (FV-CNN) descriptors. |
|
Research methods |
A new texture descriptor, FV-CNN, is developed. Classical filter banks are designed to capture edges, blobs, and bars at different scales and orientations; here the convolutional layers of a CNN are treated as a filter bank, and Fisher Vector (FV) encoding is used as a pooling mechanism to build an orderless representation. Training is performed on images "in the wild", each containing many high-quality texture/material regions. Because the pooling is orderless and multi-scale, it works well for textures; and because convolutional layers can handle images of any size, costly resizing operations are avoided. For segmentation, the approach decomposes the problem into generating tentative segmentations at low cost and then validating them with a more powerful (and potentially more expensive) model. |
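As a rough illustration of the FV pooling step, here is a minimal NumPy sketch of an (improved) Fisher Vector encoder over local descriptors, assuming a diagonal-covariance GMM whose parameters are already given; the real pipeline fits the GMM on CNN convolutional features and uses far larger models, all of which is omitted here:

```python
import numpy as np

def fisher_vector(X, w, mu, sigma):
    """Improved Fisher Vector of local descriptors X (T x D) under a
    diagonal-covariance GMM with weights w (K,), means mu (K x D),
    and standard deviations sigma (K x D). Returns a 2*K*D vector.
    """
    T, D = X.shape
    # Soft-assignment posteriors gamma (T x K), computed in log space.
    log_p = (-0.5 * (((X[:, None, :] - mu[None]) / sigma[None]) ** 2).sum(-1)
             - np.log(sigma).sum(-1)[None] + np.log(w)[None])
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)
    # Gradients w.r.t. GMM means and standard deviations.
    diff = (X[:, None, :] - mu[None]) / sigma[None]                      # (T, K, D)
    g_mu = (gamma[:, :, None] * diff).sum(0) / (T * np.sqrt(w)[:, None])
    g_sig = (gamma[:, :, None] * (diff ** 2 - 1)).sum(0) / (T * np.sqrt(2 * w)[:, None])
    fv = np.concatenate([g_mu.ravel(), g_sig.ravel()])
    # Power- and L2-normalization ("improved" FV).
    fv = np.sign(fv) * np.sqrt(np.abs(fv))
    return fv / (np.linalg.norm(fv) + 1e-12)
```

In FV-CNN, X would be the set of convolutional activations at all spatial positions (and scales) of an image, so the encoding discards their spatial order, which is exactly why it suits texture.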
|
Analysis conclusion |
The authors report that FV-CNN performance is clearly superior to FC-CNN and prior texture descriptors. |
|
Insufficient innovation |
The writing feels disorganized; even after reading the paper, it is hard to pin down exactly what the method does. |
|
additional knowledge |