[Object detection] Object detection meets knowledge graphs: interpretation and reproduction of the paper "Object Detection Meets Knowledge Graphs"

Foreword

Conventional object detection relies mainly on image features to capture information about targets. Is there a way to add some prior information to improve detection accuracy?

A feasible idea is to inject the correlation information between targets into the output of the detector, so as to re-weight its predictions.

In August 2017, Yuan Fang and colleagues from Singapore Management University published the paper "Object Detection Meets Knowledge Graphs", which does some work along exactly this line.

Paper address: https://www.ijcai.org/proceedings/2017/0230.pdf

The paper is easy to follow, so this post interprets its ideas and reproduces the code.

Framework structure

The authors propose a general framework for injecting knowledge, so it can be applied on top of any object detection model.

The flow chart of the knowledge-injection process is shown in the figure below:

(Figure: flow chart of the knowledge-aware detection framework)

The original paper uses Faster R-CNN as the detector. The normal detection output is a matrix P (the "Existing model output" in the figure), whose columns correspond to the detected targets and whose rows correspond to the categories.

The matrix in the figure reads as follows: the first detected target has confidence 0.6 for category 1 and 0.4 for category 2; the second detected target has confidence 0.2 for category 1 and 0.8 for category 2.
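Written out as a toy illustration (mine, not the paper's notation), with the orientation described above (rows = categories, columns = detected targets):

P = [[0.6, 0.2],   # category 1: confidence for target 1, target 2
     [0.4, 0.8]]   # category 2: confidence for target 1, target 2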

On the basis of this output, the semantic consistency between categories is extracted from prior knowledge (the "Knowledge" block in the figure) and used to re-weight the output, yielding the final result (the "Knowledge-aware output").

Semantic Consistency Extraction

The key to this architecture is therefore how to extract semantic consistency. The author gives two ideas on this point.

Idea 1: Frequency-based knowledge

Frequency-based knowledge is probably the easiest form of association to think of. If two targets frequently appear together (for example, a keyboard and a mouse), then once one of them is detected it is natural to raise the confidence of the other.

Therefore, the author defines a matrix S as the semantic consistency matrix between target categories, computed from co-occurrence counts as follows (a small sketch of this computation appears after the variable list below):

(Formula image: frequency-based semantic consistency S(l, l'), defined from the counts below)

  • n(l,l'): the number of times that category l and category l' appear together
  • n(l): the number of occurrences of category l
  • n(l'): the number of occurrences of category l'
  • N: total number of occurrences of all categories
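To make this concrete, here is a minimal sketch (my own, not the authors' code) that builds such a matrix from per-image label sets. It assumes a PMI-style score n(l,l')·N / (n(l)·n(l')), which matches the variables above; the paper's exact normalization may differ.

import numpy as np

def frequency_consistency(image_labels, num_classes):
    # image_labels: list of sets of class ids, one set per training image
    co = np.zeros((num_classes, num_classes))   # n(l, l'): co-occurrence counts
    single = np.zeros(num_classes)              # n(l): per-class occurrence counts
    for labels in image_labels:
        labels = list(labels)
        for l in labels:
            single[l] += 1
        for i, l in enumerate(labels):
            for lp in labels[i + 1:]:
                co[l, lp] += 1
                co[lp, l] += 1
    N = single.sum()                            # N: total occurrences of all categories
    denom = np.outer(single, single)
    return np.divide(co * N, denom, out=np.zeros_like(co), where=denom > 0)

# toy usage: keyboard (0) and mouse (1) co-occur, yacht (2) appears alone
S = frequency_consistency([{0, 1}, {0, 1}, {2}], num_classes=3)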

Idea 2: Graph-based knowledge

Idea 1 is intuitive, but its defect is that it cannot represent the relationship between two categories that never appear together. For example, a car and a yacht may never appear in the same scene, yet you cannot simply conclude that the two have nothing to do with each other; a weak weight is still needed to represent their relationship.

Therefore, the author thought of using the knowledge graph to extract semantic consistency.

First, a public large-scale knowledge graph is filtered so that only the categories to be detected, and the relationships among them, are kept (a rough sketch of this step follows the figure below).

(Figure: knowledge-graph relationships among the detection categories)
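As a rough sketch of the filtering (mine; the actual repository parses ConceptNet dumps, whose exact format I do not reproduce here): given an edge list of (concept, concept, weight) triples, keep only the edges whose two endpoints are both detection classes and accumulate them into a class-by-class adjacency matrix.

import numpy as np

def build_class_adjacency(edges, class_names):
    # edges: iterable of (concept_a, concept_b, weight) triples from the knowledge graph
    # class_names: list of detection category names, e.g. the 80 COCO classes
    index = {name: i for i, name in enumerate(class_names)}
    A = np.zeros((len(class_names), len(class_names)))
    for a, b, w in edges:
        if a in index and b in index and a != b:
            A[index[a], index[b]] += w
            A[index[b], index[a]] += w
    return A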

Afterwards, the random walk with restart algorithm is run on this relationship graph until it converges. Random walk with restart is a variant of the classic random walk: it adds a restart probability at each step, and when a restart is triggered the walker jumps back to its starting node.

(Formula image: random walk with restart update rule)
After convergence, a matrix R is obtained, whose entries give the probability of ending up at another category when the walker starts from a given category. Since the semantic consistency matrix should be symmetric, the author multiplies the two directional probabilities and takes the square root:

(Formula: S(l, l') = √( R(l, l') · R(l', l) ))
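Below is a minimal sketch (mine, not the repository code) of both steps: the iterative RWR update on the class adjacency matrix A from above, and the square-root symmetrization. The restart probability of 0.15 is an assumption, not a value taken from the paper.

import numpy as np

def random_walk_with_restart(A, restart_prob=0.15, tol=1e-8, max_iter=1000):
    # A: (L x L) non-negative class adjacency matrix; assumes every class has at least one edge
    P = A / A.sum(axis=1, keepdims=True)        # row-normalized transition matrix
    L = A.shape[0]
    R = np.eye(L)                               # each walk starts at its own category
    E = np.eye(L)                               # restart distribution (one-hot per row)
    for _ in range(max_iter):
        R_next = (1 - restart_prob) * R @ P + restart_prob * E
        if np.abs(R_next - R).max() < tol:
            break
        R = R_next
    return R_next                               # R[i, j]: probability of being at class j when starting from class i

def symmetric_consistency(R):
    # S(l, l') = sqrt(R(l, l') * R(l', l)), the symmetrization described above
    return np.sqrt(R * R.T)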

Re-weighting the detection output

Once the semantic consistency matrix S is available, the output can be re-weighted. The paper itself does not explain exactly how this is done.
Reading the source code shows the idea of the re-weighting: for a given detection, select the 5 nearest detections (boxes), sum the consistency-matrix values of their categories to obtain a correlation feature vector, and then blend this vector with the original detection scores using a weight.

Core code:

# numerator: consistency-weighted sum of the class scores of the nearest boxes
num = torch.sum(torch.mm(S_highest, torch.transpose(p_hat_temp[box_nearest[b]], 0, 1)), 1)
# denominator: row sums of the consistency matrix, scaled by bk
denom = torch.sum(S_highest, dim=1) * bk
# blend the knowledge term with the original prediction p using the weight epsilon
p_hat[b] = (1 - epsilon) * torch.squeeze(torch.div(num, denom)) + epsilon * p

In the reproduction, the weight epsilon is set to 0.75, which means that 75% of the original prediction is retained and 25% comes from the knowledge term.
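As a toy example with made-up numbers: if the original confidence for "keyboard" on some box is p = 0.4 and the knowledge term computed from the neighbouring boxes is 0.8, the re-weighted score is 0.75 × 0.4 + 0.25 × 0.8 = 0.5, so the prior nudges the score upward.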

The latter part is the loss function and the update step.
The following is the loss formula, which is equivalent to incorporating the result of the knowledge embedding into the update of the network.
(Formula image: loss function for the knowledge-aware update)

Experimental results

The author conducted experiments on the COCO and PASCAL VOC datasets. The following table shows the COCO results:

(Table image: COCO results reported in the paper)

  • FRCNN: output of the original detection network
  • KF-500: consistency matrix obtained with idea 1, built from 500 training images
  • KF-All: consistency matrix obtained with idea 1, built from all training images
  • KG-CNet: consistency matrix obtained with idea 2 (knowledge graph)

The data in the table show that this approach does not improve the mAP of the detection output, but it does improve recall, which is equivalent to reducing the number of missed detections.

Results visualization

Finally, the visualization of the results. The author selected one image to demonstrate: the left image is the direct detection result, and the right image is the detection result after adding the knowledge graph.

The purple box represents the detection output of the model, and the red box represents the actual label.
(Figure: qualitative comparison between the FRCNN output and the knowledge-aware output)

As the figure shows, the original FRCNN misses the keyboard; after adding the knowledge graph, the keyboard is successfully detected thanks to its association with the mouse, laptop and other targets.

Experimental reproduction

The original paper is relatively old and its experiments used the Caffe framework, which is no longer usable.
It was later reproduced in PyTorch.
Code address: https://github.com/tue-mps/rescience-ijcai2017-230

Reproduction conclusions

The reproducer mentions:

The authors' claims cannot be substantiated for any of the methods described. The results either show an increase in recall at the expense of a decrease in mAP, or they show no improvement in recall while mAP remains constant. Three different backbone models exhibit similar behavior after re-optimization, concluding that knowledge-aware re-optimization does not benefit object detection algorithms.

It is not clear whether hyperparameters are to blame; in short, the effect reported in the paper could not be reproduced.

Running the code

The code is written quite clearly, and the repository authors provide the processed dataset and the semantic consistency matrices.
After downloading, arrange the paths as follows:
(Figure: expected directory layout after downloading)

During testing I ran into a small bug, fixed by modifying these three lines in Utils/testing.py:

# original
# boxes_temp = prediction[1][0]['boxes']
# labels_temp = prediction[1][0]['labels']
# scores_temp = prediction[1][0]['scores']
# changed to
boxes_temp = prediction[0]['boxes']
labels_temp = prediction[0]['labels']
scores_temp = prediction[0]['scores']
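This is presumably because torchvision's detection models, in eval mode, return a list with one dict per image containing 'boxes', 'labels' and 'scores', so there is no extra tuple level to index into. A quick standalone check (not part of the repository, assuming a reasonably recent torchvision):

import torch, torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights=None)
model.eval()
with torch.no_grad():
    prediction = model([torch.rand(3, 224, 224)])
print(type(prediction), prediction[0].keys())   # list of dicts with 'boxes', 'labels', 'scores'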

Finally, run Results/results_coco.py for a single round of testing.

Reproduction results

Since the original paper used VGG16 as the backbone, I set model_type to coco-FRCNN-vgg16. The following are the results on the COCO dataset on my RTX 2060:

Model              mAP@100   Recall@100 (all classes)
FRCNN              0.247     0.477
KF-All-COCO        0.245     0.432
KG-CNet-55-COCO    0.243     0.436
KG-CNet-57-COCO    0.243     0.437
  • FRCNN: direct detection with Faster R-CNN
  • KF-All-COCO: semantic consistency matrix obtained with idea 1
  • KG-CNet-55-COCO: consistency matrix obtained with idea 2 from the ConceptNet 5.5 common-sense knowledge base (conceptnet-assertions-5.5)
  • KG-CNet-57-COCO: consistency matrix obtained with idea 2 from the ConceptNet 5.7 common-sense knowledge base (conceptnet-assertions-5.7)

Judging from these results, it really is of no use: both mAP and recall decreased.



Source: blog.csdn.net/qq1198768105/article/details/130102866