Image-text bidirectional retrieval: using a CNN to classify 100,000+ categories of images and text

[Title]:Dual-Path Convolutional Image-Text Embedding

[arXiv]:http://cn.arxiv.org/abs/1711.05535

[Code]:layumi/Image-Text-Embedding

 

[Motivation]:

In this paper we try to train a CNN to classify 113,287 classes of images (the MSCOCO training set).

  • We treat each image in the training set as its own class. (Of course, with only one image per class, a CNN would badly overfit.) We therefore also add each image's 5 sentence-level descriptions (text) to the training, so every class effectively has 6 samples (1 image + 5 sentences).

  • The problem the paper addresses is instance-level retrieval. That is, given a pool of, say, 5,000 images, a query like "a blonde girl in blue is taking a taxi" has only one correct answer. This differs from class-level (category-level) retrieval, where a query like "woman" may have many correct answers. The task is therefore more fine-grained, and more detailed visual and textual features are needed.

  • We also observe that many previous works directly use class-level ImageNet-pretrained networks, but such networks actually discard information (number/color/position). Three different pictures might all carry the ImageNet label "dog", yet natural language can describe each of them far more precisely. That is the problem this paper wants to solve: instance-level image-text retrieval in both directions.

 

 

[Related Work]:

For background, you can read my earlier answer, as well as the senior colleague's answer, to the question "Which is more promising, computer vision or natural language processing, or do they each have their own merits?"

 

[Method]:

  1. For the natural-language descriptions we use a CNN rather than the more common LSTM, which lets us train in parallel and fine-tune the entire network end to end. The structure is shown in the figure and is actually very simple.

For the TextCNN we use residual blocks similar to ResNet. Note that the sentence is one-dimensional, so in practice we use 1×2 convolutions.
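To make the 1×2 convolution idea concrete, here is a minimal PyTorch sketch of one such residual text block. The actual repository (layumi/Image-Text-Embedding) is written in MatConvNet, so the layer layout and dimensions below are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn as nn

class TextResBlock(nn.Module):
    """Residual block over a 1-D word sequence, using 1x2 convolutions.

    The sentence is kept as a 4-D tensor of shape (batch, channels, 1, length),
    so a 1x2 kernel slides only along the word dimension.
    """
    def __init__(self, channels):
        super().__init__()
        # padding (0, 1) then a valid 1x2 conv keeps the sequence length unchanged
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=(1, 2), padding=(0, 1))
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=(1, 2))
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = x
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + residual)

# Example: a 32-word sentence embedded with 300-d word vectors
x = torch.randn(8, 300, 1, 32)
y = TextResBlock(300)(x)   # same shape: (8, 300, 1, 32)
```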

 

2. Instance loss. The ultimate goal is for every image, and likewise every natural-language description, to have discriminative features. So why not simply treat each image as its own class? (Note that this assumption is unsupervised and requires no extra labels.) A minimal sketch of the loss is given at the end of this point.

This kind of few-sample classification has been commonly used before in pedestrian re-identification, but that setting (1,467 categories, about 9.6 images per category, with human-annotated ID labels) is not as extreme as ours:

Flickr30k: 31,783 categories (1 image + 5 descriptions each), of which 29,783 are used for training

MSCOCO: 123,287 categories (1 image + ~5 descriptions each), of which 113,287 are used for training

Note that Flickr30k actually contains quite a lot of similar dog images.

We'll still treat them as distinct classes though, hoping to learn fine-grained differences as well.

(For CUHK-PEDES, descriptions of the same person are similar, so we treat each person as one class, which gives more training images per class. CUHK-PEDES therefore uses its ID annotation, whereas on MSCOCO and Flickr30k we use none.)
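As promised above, here is a minimal sketch of the instance loss, under the assumption that it is an ordinary softmax/cross-entropy classification whose label is simply the image's index in the training set (PyTorch, illustrative names and sizes; the real code is the MatConvNet repository):

```python
import torch
import torch.nn as nn

# Instance loss: every training image defines its own class, so an image
# (and each of its ~5 captions) is labelled with the image's index.
num_instances = 113287     # e.g. MSCOCO: 113,287 training images -> 113,287 classes
feat_dim = 2048            # embedding size; illustrative

classifier = nn.Linear(feat_dim, num_instances)
criterion = nn.CrossEntropyLoss()

def instance_loss(features, image_indices):
    """features: (B, feat_dim) embeddings of images or of their captions.
    image_indices: (B,) long tensor, index of the source image = class label."""
    return criterion(classifier(features), image_indices)
```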

 

3. How to combine text and images for training?

In fact, when trained for classification, the text branch and the image branch can easily each learn their own separate representation, so we need a constraint that maps them into the same high-level semantic space.

We adopt a simple method: the text and image branches share the same weight matrix W in the final classification fc layer, which acts as a soft constraint during the updates (see Section 4.2 of the paper for details). In our experiments, this W soft constraint alone already gives very good results (see the Stage I results in the paper).
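A rough sketch of the shared-W idea, assuming it amounts to both branches passing through one and the same final classification layer (the precise form of the constraint is in Section 4.2 of the paper):

```python
import torch.nn as nn

class DualPathHead(nn.Module):
    """Both modalities are classified with the same weight matrix W,
    which softly ties the two embeddings to a common semantic space."""
    def __init__(self, feat_dim=2048, num_instances=113287):
        super().__init__()
        self.shared_fc = nn.Linear(feat_dim, num_instances)  # the shared W

    def forward(self, img_feat, txt_feat):
        # same W scores both the image and the text embedding
        return self.shared_fc(img_feat), self.shared_fc(txt_feat)
```

Because gradients from both modalities update the same W, image and caption features of the same instance are pulled toward the same classifier direction, which is the soft coupling described above.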

 

4. Does the training converge?

Yes, it converges. You are welcome to look at the code; it is a plain softmax loss with no tricks.

Image classification converges faster; the text side is slower. On Flickr30k the image CNN converges quickly, while the TextCNN is trained from scratch and has 5 text samples per class to fit, so it is relatively slow.

 

5. Is instance loss unsupervised?

The assumption behind instance loss is unsupervised: we use no additional annotation (category labels, etc.). The only "label" used is the assumption that every image is its own class.

 

6. Can other unsupervised methods, such as running k-means clustering first, achieve results similar to instance loss?

We tried extracting pool5 features with a pre-trained ResNet-50 and clustering them into 3,000 and 10,000 classes respectively.

(Clustering is very slow. Even with multi-threading enabled, it took more than an hour to form 10,000 clusters, and I worried about running out of memory and crashing. Be careful.)

On MSCOCO, instance loss gives better results. We think clustering does not actually solve the problem: a black dog, a grey dog, and a picture with both dogs are all just "dog" to the clustering, so image details may still be ignored.
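For reference, a clustering baseline along these lines could be sketched as below (PyTorch + scikit-learn; the clustering settings and batch sizes are my assumptions, not necessarily what was actually used):

```python
import torch
import torchvision.models as models
from sklearn.cluster import MiniBatchKMeans

# Pre-trained ResNet-50 up to global pooling ("pool5") as a fixed feature extractor.
resnet = models.resnet50(pretrained=True)
resnet.fc = torch.nn.Identity()
resnet.eval()

@torch.no_grad()
def pool5_features(batch):          # batch: (B, 3, 224, 224), already normalised
    return resnet(batch)            # (B, 2048)

# Cluster the extracted features into pseudo-classes (3,000 or 10,000 in the post).
# MiniBatchKMeans keeps memory bounded; plain k-means on ~100k x 2048 features is slow.
kmeans = MiniBatchKMeans(n_clusters=3000, batch_size=1024)
# pseudo_labels = kmeans.fit_predict(all_features)  # all_features: (N, 2048) numpy array
```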

 

7. Comparing against previous results is rather difficult, because different works use different networks (which is unfair), and even the train/test splits differ (many earlier papers did not state their split and compared directly).

So when building the results table, we tried to list as many methods as possible, noting the different splits.

Try to compare VGG-19 against VGG-19 and ResNet-152 against ResNet-152. You are welcome to read the paper for the details.

Much of the work most closely related to ours is by Mr. Lu; I strongly recommend reading it.

 

8. Is a deeper TextCNN necessarily better?

This question was raised by a reviewer.

A related paper is "Do Convolutional Networks need to be Deep for Text Classification?"

We found the same in our additional experiments: on the two larger datasets, going from ResNet-50 to ResNet-152 on the text side brings no significant improvement.

 

9. Some tricks (may not work in other tasks)

  • Having seen bidirectional LSTMs, a natural idea is a bidirectional CNN. I tried it myself and found it did not help. As an aside: at ICML I ran into the poster for Facebook's CNN-based translation work and asked about this; they said it could of course be done, but they had not tried it.

  • The Position Shift used in this paper takes the text fed into the CNN and randomly leaves a few empty positions in front of it, similar to image jitter. It still brings a significant improvement; see the paper for details (and the sketch after this list).

  • A more reliable data augmentation might be to replace some words in the sentence with synonyms. I downloaded the LibreOffice thesaurus at the time, but never ended up using it. In the end we use word2vec to initialize the first conv layer of the CNN, which to some extent already captures the synonym effect (similar words have similar word vectors).

  • The samples per class in these datasets are fairly balanced (basically 1 + 5), which may also be one reason for our good results: it is hard to overfit a few "large" classes.
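As referenced in the Position Shift bullet above, a minimal sketch of that augmentation on token ids might look like the following (the shift range, padding id, and sentence length are illustrative assumptions, not the paper's exact settings):

```python
import random
import torch

def position_shift(word_ids, max_shift=4, pad_id=0, max_len=32):
    """Randomly prepend a few padding tokens so the sentence starts at a
    different position each time -- analogous to random image jitter.
    word_ids: list of int token ids for one caption."""
    shift = random.randint(0, max_shift)
    shifted = [pad_id] * shift + word_ids
    shifted = shifted[:max_len]                       # crop to fixed length
    shifted += [pad_id] * (max_len - len(shifted))    # pad at the end
    return torch.tensor(shifted)

# Example: the same caption lands at slightly different positions each call
print(position_shift([11, 42, 7, 99]))
```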

 

[Results]

  • Has the TextCNN learned that different words carry different levels of importance? (see the paper's appendix)

We tried removing words from the sentence to see which ones had the greatest impact on the matching score (a toy version of this probe is sketched at the end of this section).

  • Some image-text retrieval results (see the paper's appendix)

  • Natural-language search results

  • Fine-grained retrieval results

Some details may not come across clearly in the paper; please look at the code or get in touch.
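As mentioned in the first bullet above, the word-importance probe can be approximated by a simple leave-one-word-out loop. `encode_text` and the similarity function below are placeholders for the trained text branch and the matching score, not the authors' actual evaluation code:

```python
import torch.nn.functional as F

def word_importance(words, image_feat, encode_text, score=F.cosine_similarity):
    """Leave-one-word-out probe: how much does the image-text matching score
    drop when each word is removed?  `encode_text` maps a word list to an
    embedding vector; `image_feat` is the matching image embedding."""
    base = score(encode_text(words), image_feat, dim=-1)
    drops = {}
    for i, w in enumerate(words):
        reduced = words[:i] + words[i + 1:]
        drops[w] = (base - score(encode_text(reduced), image_feat, dim=-1)).item()
    # largest drop first = most important word for the match
    return sorted(drops.items(), key=lambda kv: -kv[1])
```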

 

 

Origin: blog.csdn.net/Layumi1993/article/details/91350473