Foreword
Conventional object detection captures target information purely from image features. Is there a way to add prior information to improve detection accuracy?
One feasible idea is to feed the correlations between object categories back into the detector's output, re-weighting its predictions.
In August 2017, Yuan Fang et al. from Singapore Management University published the paper "Object Detection Meets Knowledge Graphs" (IJCAI 2017), which follows exactly this idea.
Paper address: https://www.ijcai.org/proceedings/2017/0230.pdf
The paper is easy to follow, so this post interprets its ideas and reproduces the code.
Framework structure
The authors propose a general framework for introducing knowledge, so it can be applied on top of any object detection model.
The flow chart of knowledge introduction is shown in the figure below:
The original paper uses the Faster R-CNN algorithm for detection. The normal detection output is a probability matrix P (the "Existing model output" in the figure), whose columns index the detected targets and whose rows index the categories.
In the figure, this matrix means: the first detected target belongs to category 1 with confidence 0.6 and to category 2 with confidence 0.4; the second detected target belongs to category 1 with confidence 0.2 and to category 2 with confidence 0.8.
On top of this output, the semantic consistency between categories is extracted from prior knowledge (Knowledge) and used to adjust the results, producing the final knowledge-aware output.
Semantic Consistency Extraction
The key to this framework is therefore how to extract semantic consistency. The author gives two ideas for this.
Idea 1: Frequency-based knowledge
Frequency is probably the easiest form of knowledge association to think of. If two targets frequently appear at the same time (for example, a keyboard and a mouse often appear together), then when one of them is detected, it is natural to increase the confidence of the other.
Therefore, the author proposes a matrix S as the semantic consistency matrix between target categories, computed as:

S(l, l') = N · n(l, l') / (n(l) · n(l'))

where:
- n(l,l'): the number of times that category l and category l' appear together
- n(l): the number of occurrences of category l
- n(l'): the number of occurrences of category l'
- N: total number of occurrences of all categories
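Under the assumption that the co-occurrence counts are available as a single matrix (diagonal holding n(l), off-diagonal holding n(l, l') — a layout I chose for illustration), the matrix S can be sketched in a few lines of numpy:

```python
import numpy as np

def frequency_consistency(cooccur: np.ndarray) -> np.ndarray:
    """Frequency-based semantic consistency matrix S (sketch).

    cooccur[l, l'] holds n(l, l'), the number of times categories l
    and l' appear together; the diagonal is assumed to hold n(l).
    """
    n = np.diag(cooccur).astype(float)   # n(l): occurrences per category
    N = n.sum()                          # N: total occurrences
    # S(l, l') = N * n(l, l') / (n(l) * n(l'))
    S = N * cooccur / np.outer(n, n)
    np.fill_diagonal(S, 0.0)             # only cross-category consistency is used
    return S
```

A value above 1 means two categories co-occur more often than chance, so detecting one should boost the other.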
Idea 2: Graph-based knowledge
Idea 1 is intuitive, but its defect is that it cannot represent the relationship between two categories that never appear together. For example, a car and a yacht may never appear in the same scene, yet one cannot "roughly" conclude that the two have nothing to do with each other; a weak weight is still needed to represent their relationship.
Therefore, the author thought of using the knowledge graph to extract semantic consistency.
First, public large-scale knowledge graphs are filtered to extract the categories to be detected and the relationships between them.
Afterwards, the converged state of the relation graph is obtained with the random walk with restart (RWR) algorithm. RWR is a variant of the classic random walk that adds a restart probability: when a restart is triggered, the walker returns to its starting node.
After convergence, a matrix R is obtained, where R(l, l') represents the probability of transitioning to category l' when the walker is in category l. Since the semantic consistency matrix must be symmetric, the author symmetrizes it by multiplying the two directions and taking the square root: S(l, l') = sqrt(R(l, l') · R(l', l)).
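A minimal numpy sketch of random walk with restart over a category relation graph, followed by the square-root symmetrization (the adjacency matrix and restart probability here are hypothetical, not the paper's exact setup):

```python
import numpy as np

def rwr_consistency(adj: np.ndarray, restart: float = 0.15,
                    tol: float = 1e-8, max_iter: int = 1000) -> np.ndarray:
    """Random walk with restart, then S(l,l') = sqrt(R(l,l') * R(l',l)).

    adj: non-negative category-to-category adjacency matrix
    extracted from a knowledge graph (hypothetical example input).
    """
    P = adj / adj.sum(axis=1, keepdims=True)  # row-stochastic transition matrix
    L = adj.shape[0]
    R = np.eye(L)                             # each walker starts at its own node
    for _ in range(max_iter):
        # With probability (1 - restart) follow P; otherwise jump back home.
        R_new = (1 - restart) * R @ P + restart * np.eye(L)
        if np.abs(R_new - R).max() < tol:
            break
        R = R_new
    return np.sqrt(R * R.T)                   # symmetrize the converged walk
```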
Interference detection output
With the semantic consistency matrix S in hand, the detection output can be adjusted. The paper itself does not explain how this intervention is performed.
Reading the source code shows the idea: for each detection, select its 5 nearest boxes, sum their scores weighted by the consistency matrix to obtain a correlation vector, and then blend this vector into the original detection result with a weighting factor.
Core code:

```python
# num: knowledge-propagated scores for box b, accumulated over its nearest boxes
num = torch.sum(torch.mm(S_highest, torch.transpose(p_hat_temp[box_nearest[b]], 0, 1)), 1)
# denom: normalizer (row sums of S, scaled by the number of neighbor boxes bk)
denom = torch.sum(S_highest, dim=1) * bk
# blend: keep epsilon of the original scores p, add (1 - epsilon) of the knowledge term
p_hat[b] = (1 - epsilon) * torch.squeeze(torch.div(num, denom)) + epsilon * p
```
Here `epsilon` is the blending weight; the reproduction sets it to 0.75, meaning 75% of the original result is retained and 25% comes from the knowledge intervention.
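Restated as a self-contained numpy sketch (ignoring the nearest-box selection that the torch snippet handles), the blending step looks like this:

```python
import numpy as np

def knowledge_aware_update(p: np.ndarray, S: np.ndarray,
                           epsilon: float = 0.75) -> np.ndarray:
    """Simplified knowledge-aware blending step (illustrative sketch).

    p : (B, L) detection confidences for B boxes over L categories.
    S : (L, L) semantic consistency matrix.
    epsilon = 0.75 keeps 75% of the original scores; the remaining
    25% comes from knowledge propagation, as in the reproduction.
    """
    # Propagate scores across categories and normalize by S's row sums.
    knowledge = (p @ S) / S.sum(axis=1)
    return epsilon * p + (1 - epsilon) * knowledge
```

With a toy S that links two categories, a box confident in category 1 pulls up category 2's score on its neighbors, which is exactly the keyboard-and-mouse effect described above.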
The remaining part is the cost function and the update step.
The following is the cost function, which incorporates the knowledge information into the re-optimization of the output.
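The formula itself was not captured here. Reconstructed from the update rule in the code (this is my own reconstruction; the paper's exact normalization may differ), the re-optimization objective has the standard label-propagation form:

```latex
\hat{p} = \arg\min_{\hat{p}}\;
  \epsilon \sum_{b}\sum_{\ell} \left(\hat{p}_{b,\ell} - p_{b,\ell}\right)^2
  + (1-\epsilon) \sum_{b,b'}\sum_{\ell,\ell'} S_{\ell,\ell'}
    \left(\hat{p}_{b,\ell} - \hat{p}_{b',\ell'}\right)^2
```

The first term keeps the knowledge-aware output close to the original detections; the second encourages semantically consistent categories across boxes to receive similar scores, with epsilon trading the two off.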
Experimental results
The authors conducted experiments on the COCO and PASCAL VOC datasets. The following table shows the COCO results:
- FRCNN: The output of the original detection network
- KF-500: Obtain the consistency matrix through idea 1, and select 500 training set pictures
- KF-All: Obtain the consistency matrix through idea 1, and select all training set pictures
- KG-CNet: Obtain the consistency matrix through the second idea
From the data in the table, this idea does not improve the mAP of the detection output, but it effectively improves recall, i.e. it reduces the number of missed detections.
Results visualization
The last is the visualization of the results. The author selected a picture to demonstrate: the left picture is the direct detection result, and the right picture is the detection result after adding the knowledge graph.
The purple box represents the detection output of the model, and the red box represents the actual label.
As the figure shows, the original FRCNN failed to detect the keyboard. After adding the knowledge graph, the keyboard was successfully detected through its association with the mouse, laptop, and other targets.
Experimental reproduction
The original paper was published back in 2017 and its experiments used the Caffe framework, which is no longer runnable.
Others later reproduced it in PyTorch.
Code address: https://github.com/tue-mps/rescience-ijcai2017-230
Reproduction conclusion
The reproducers state:

> The authors' claims cannot be substantiated for any of the methods described. The results either show an increase in recall at the expense of a decrease in mAP, or they show no improvement in recall while mAP remains constant. Three different backbone models exhibit similar behavior after re-optimization, concluding that knowledge-aware re-optimization does not benefit object detection algorithms.
It is unclear whether hyperparameters are to blame; in short, the paper's reported effect could not be reproduced.
Running the code
The code is clearly written, and the reproducers provide the processed datasets and semantic consistency matrices.
After downloading, place the path as follows:
I hit a small bug during testing; it can be fixed by modifying these three lines in Utils/testing.py:
```python
# Original
# boxes_temp = prediction[1][0]['boxes']
# labels_temp = prediction[1][0]['labels']
# scores_temp = prediction[1][0]['scores']
# Changed to
boxes_temp = prediction[0]['boxes']
labels_temp = prediction[0]['labels']
scores_temp = prediction[0]['scores']
```
Finally, run Results/results_coco.py for a single round of testing.
Experimental results
Since the original paper used VGG16 as the backbone, I set `model_type` to `coco-FRCNN-vgg16` here. The following are the results on the COCO dataset on my RTX 2060:
| Model | mAP @ 100 | Recall @ 100 (all classes) |
|---|---|---|
| FRCNN | 0.247 | 0.477 |
| KF-All-COCO | 0.245 | 0.432 |
| KG-CNet-55-COCO | 0.243 | 0.436 |
| KG-CNet-57-COCO | 0.243 | 0.437 |
- FRCNN: direct Faster R-CNN detection
- KF-All-COCO: semantic consistency matrix obtained via idea 1 (frequency-based)
- KG-CNet-55-COCO: idea 2, consistency matrix extracted from the ConceptNet-assertions 5.5 common-sense knowledge base
- KG-CNet-57-COCO: idea 2, consistency matrix extracted from the ConceptNet-assertions 5.7 common-sense knowledge base
Judging from my results, the method is indeed not helpful: both mAP and recall decreased compared with the plain FRCNN baseline.