Classification of skin diseases based on weakly supervised fine-grained methods
Article title
Look Closer to See Better: Recurrent Attention Convolutional Neural Network for Fine-grained Image Recognition
Article Source
CVPR 2017
Author motivation
Region localization and fine-grained feature learning are the two major challenges in fine-grained recognition. Methods before this work mostly tackled the two problems independently and ignored the correlation between them, so the authors propose a new architecture: RA-CNN (Recurrent Attention Convolutional Neural Network).
Author's ideas
The input image is cropped by the Attention Proposal Network (APN), and the cropped region is then enlarged via bilinear interpolation. The effect is equivalent to discarding everything else in the picture and magnifying only the region "we" want to see.
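The crop-then-zoom step can be sketched in a few lines of numpy. This is a minimal illustration, not the paper's implementation: the helper names `bilinear_resize` and `crop_and_zoom` are my own, and the resize uses the align-corners convention for simplicity.

```python
import numpy as np

def bilinear_resize(img, out_h, out_w):
    """Resize a 2-D array with bilinear interpolation (align-corners convention)."""
    in_h, in_w = img.shape
    # Map each output pixel back to a fractional source coordinate.
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, in_h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]
    wx = (xs - x0)[None, :]
    # Blend the four neighbouring source pixels.
    top = img[np.ix_(y0, x0)] * (1 - wx) + img[np.ix_(y0, x1)] * wx
    bot = img[np.ix_(y1, x0)] * (1 - wx) + img[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

def crop_and_zoom(img, tx, ty, tl, out_size):
    """Keep the square centred at (tx, ty) with half side length tl,
    then enlarge it back to out_size x out_size."""
    y0, y1 = max(ty - tl, 0), min(ty + tl, img.shape[0])
    x0, x1 = max(tx - tl, 0), min(tx + tl, img.shape[1])
    return bilinear_resize(img[y0:y1, x0:x1], out_size, out_size)
```

Upsampling a small crop back to the network's input resolution is what lets the next scale "look closer" at the attended region.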
Network Architecture
Rough explanation:
The original image enters the network and serves two tasks. First, it is classified like a conventional image, through convolution, a fully connected layer, and softmax, giving a probability for each category. Second, the feature maps produced by the convolutional layers are passed through the Attention Proposal Network (APN) to obtain an attention region. As shown in the picture above, the attention falls on the bird's head, so the other parts are cropped away, leaving only the head, which is then enlarged by bilinear interpolation, echoing the title of the article: the closer you look, the better you see.
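The two branches can be sketched schematically as follows. Everything here is a placeholder standing in for the paper's learned networks: the "backbone" is just average pooling, and the classification and APN weights are random matrices, only to show the data flow.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def extract_features(img):
    # Stand-in for the convolutional backbone: 4x4 average pooling.
    h, w = img.shape
    return img.reshape(h // 4, 4, w // 4, 4).mean(axis=(1, 3)).ravel()

# Hypothetical sizes and weights; in RA-CNN these are learned end-to-end.
n_classes = 5
img = rng.random((32, 32))
feat = extract_features(img)                        # shared feature maps
W_cls = rng.standard_normal((n_classes, feat.size)) # classification head
W_apn = rng.standard_normal((3, feat.size))         # APN head

# Branch 1: conventional classification (fc + softmax).
probs = softmax(W_cls @ feat)

# Branch 2: APN predicts a square attention region (tx, ty, tl),
# squashed into valid image coordinates with a sigmoid.
tx, ty, tl = 1 / (1 + np.exp(-(W_apn @ feat))) * np.array([32.0, 32.0, 16.0])
```

The key point is that both heads read the same feature maps, which is how the two tasks stay coupled rather than being solved independently.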
Detailed explanation:
For an image X, feature extraction (convolution), a fully connected layer, and softmax yield the category probabilities $p(X) = f(W_c * X)$, where $W_c$ denotes the parameters of this branch. The classification loss at this scale, $L_{cls}^{(1)}$, is the cross-entropy $-\log p_t(X)$, where $t$ is the ground-truth category.
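The per-scale classification loss is plain softmax cross-entropy; a minimal numpy version (helper names are mine, not the paper's):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

def cls_loss(logits, label):
    """Cross-entropy L_cls = -log p_label for one sample."""
    return -np.log(softmax(logits)[label])
```

For example, with uniform logits over C classes the loss is log(C), the value expected of an untrained classifier.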
At the same time, the feature maps from the feature extractor are passed through the Attention Proposal Network (APN), which outputs a square attention block, recorded as $[t_x, t_y, t_l]$: $t_x$ and $t_y$ are the x and y coordinates of the attention centre, and $t_l$ is half the side length of the square. This square is the part of the original image we keep.
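In the paper, the square defined by $(t_x, t_y, t_l)$ is applied as a continuous boxcar mask built from sigmoids rather than a hard crop, so the cropping stays differentiable and the APN can be trained by backpropagation. A numpy sketch of such a mask (the sharpness constant `k` is a free parameter I chose for illustration):

```python
import numpy as np

def box_mask(h, w, tx, ty, tl, k=10.0):
    """Differentiable boxcar mask: close to 1 inside the square centred at
    (tx, ty) with half side length tl, close to 0 outside."""
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    xs = np.arange(w)[None, :]
    ys = np.arange(h)[:, None]
    # Each factor is ~1 between the two edges and ~0 beyond them.
    mx = sig(k * (xs - (tx - tl))) - sig(k * (xs - (tx + tl)))
    my = sig(k * (ys - (ty - tl))) - sig(k * (ys - (ty + tl)))
    return my * mx
```

Multiplying the image by this mask element-wise keeps the attention square and suppresses the rest, approximating the hard crop while remaining smooth in $t_x$, $t_y$, $t_l$.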