[VQA Paper Reading] RUBi: Reducing Unimodal Biases for Visual Question Answering


RUBi re-weights the training samples to reduce the importance of the most biased ones, e.g. samples that can be answered correctly without even looking at the image. Such samples push the model away from relying on both modalities and towards exploiting statistical regularities between questions and answers. A question-only model is used to capture this language bias by identifying the unwanted patterns.

Code: github.com/cdancette/rubi.bootstrap.pytorch

1. Introduction

A biased model links "what color is the banana" to "yellow" without looking at the image. RUBi exploits the fact that a question-only model is, by construction, biased towards the question modality. During training, a question-only branch is added and the loss is dynamically adjusted to compensate for the bias: backpropagation is reduced for the most biased samples and increased for the less biased ones, which amounts to re-weighting the samples.

2. Related work

Assessing unimodal biases in datasets and models

Agrawal et al. (2018) proposed a new dataset, VQA-CP, together with the GVQA model. GVQA explicitly separates the recognition of visual concepts present in the image from the identification of the plausible answer space for a given question, which makes the model more robust and able to generalize across different answer distributions. The task is split into two steps: first, locate and recognize the visual regions needed to answer; second, identify the space of plausible answers using a question-only branch. Balancing the dataset is another way to avoid unimodal biases.

Johnson et al. (2017) and Hudson et al. (2019) build synthetic datasets that minimize question-conditional biases by rejection sampling within groups of related questions. VQA v2 adds a complementary sample for every question of VQA v1: the same question is paired with a similar image that leads to a different answer. Network architectures and training procedures can also reduce unimodal biases.

Ramakrishnan et al. (2018) train a question-only model and use the negated gradient of its loss to prevent the question encoder from capturing unwanted biases that the VQA model could otherwise exploit.
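As a rough illustration of this gradient-negation trick (a generic PyTorch sketch, not the authors' code; the layer would sit between the question encoder and the question-only classifier):

import torch

class GradReverse(torch.autograd.Function):
    """Identity in the forward pass; negates (and scales) the gradient in the backward pass."""

    @staticmethod
    def forward(ctx, x, lambd=1.0):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        # The gradient flowing back into the question encoder is negated, so the
        # encoder is pushed away from features that help the question-only classifier.
        return -ctx.lambd * grad_output, None

# usage (hypothetical names): q_only_logits = q_only_classifier(GradReverse.apply(q_emb))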

3. Reducing unimodal biases


nn_q denotes the neural network of the question-only branch.

Classical learning strategy and its pitfalls

3.1 RUBi learning strategy

The goal is to make the VQA model focus on the samples that cannot be answered using the question modality alone.

The bias is prevented with a prediction mask. The output of the nn_q network is passed through a sigmoid, which maps it to [0, 1]; this vector is called the mask. The mask modifies the predictions of the VQA model by element-wise multiplication, which dynamically changes the loss.

The question-only branch outputs the mask. For a biased sample, the mask increases the score of the correct answer and decreases the scores of the other answers, so the loss becomes lower. As a consequence, the gradients backpropagated for these samples are smaller and their importance is reduced.
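A minimal PyTorch sketch of this masking step (illustrative only; the names nn_q, vqa_logits and q_emb are assumptions, not the authors' exact code):

import torch

def rubi_masked_logits(vqa_logits, q_emb, nn_q):
    # vqa_logits: (batch, n_answers) logits of the base VQA model
    # q_emb:      (batch, dim) question embedding from the question encoder
    # nn_q:       question-only network mapping q_emb to n_answers logits
    q_logits = nn_q(q_emb)              # question-only predictions
    mask = torch.sigmoid(q_logits)      # values in [0, 1]
    return vqa_logits * mask, q_logits  # element-wise (dot) multiplication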

Joint learning process:
L_QM (Equation 4) is used to update θ_QM, which contains the question encoder and nn_q. L_QO (Equation 3) is used to update the parameters θ_QO, i.e. c_q and nn_q; it is not backpropagated to the question encoder.
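A hedged sketch of the joint loss, reusing rubi_masked_logits from above and assuming a cross-entropy criterion and a question-only classifier c_q (names are illustrative):

import torch.nn.functional as F

def rubi_losses(vqa_logits, q_emb, nn_q, c_q, answers):
    masked_logits, _ = rubi_masked_logits(vqa_logits, q_emb, nn_q)
    loss_qm = F.cross_entropy(masked_logits, answers)  # updates the encoder and nn_q
    # Detach the question embedding so that L_QO updates only nn_q and c_q,
    # and is not backpropagated to the question encoder.
    q_only_logits = c_q(nn_q(q_emb.detach()))
    loss_qo = F.cross_entropy(q_only_logits, answers)
    return loss_qm + loss_qo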

3.2 Baseline model structure

The image is processed with Faster R-CNN to obtain a bag of region features. The question is encoded with a GRU. The question representation q is fused with the feature v_i of each image region, and the resulting vector is fed to an MLP classifier that outputs the final prediction.
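A rough PyTorch sketch of such a baseline (dimensions, fusion and pooling choices are assumptions, not the exact architecture of the paper):

import torch
import torch.nn as nn

class BaselineVQA(nn.Module):
    def __init__(self, vocab_size, n_answers, q_dim=2400, v_dim=2048, h_dim=1024):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, 300)
        self.gru = nn.GRU(300, q_dim, batch_first=True)          # question encoder
        self.fusion = nn.Sequential(nn.Linear(q_dim + v_dim, h_dim), nn.ReLU())
        self.classifier = nn.Sequential(nn.Linear(h_dim, h_dim), nn.ReLU(),
                                        nn.Linear(h_dim, n_answers))  # MLP classifier

    def forward(self, question_tokens, region_features):
        # region_features: (batch, n_regions, v_dim) pre-extracted with Faster R-CNN
        _, q = self.gru(self.embedding(question_tokens))
        q = q.squeeze(0)                                          # (batch, q_dim)
        q_exp = q.unsqueeze(1).expand(-1, region_features.size(1), -1)
        fused = self.fusion(torch.cat([q_exp, region_features], dim=-1))  # fuse q with each v_i
        pooled = fused.max(dim=1).values                          # pool over regions
        return {'logits': self.classifier(pooled)}                # dict output expected by RUBi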

Code:

For the concrete implementation, refer to the authors' GitHub repository. RUBi takes another VQA model as input and adds a question-only branch to it. The prediction of this branch is merged with the original prediction, and RUBi outputs a new prediction that is used to train the model.

The baseline model must return its raw predictions (before the softmax) in a dictionary, under the key logits.
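A sketch of a wrapper in that spirit; it assumes, purely for illustration, that the base model also exposes its question embedding in the output dictionary under a q_emb key (this may differ from the actual repository code):

import torch
import torch.nn as nn

class RubiWrapper(nn.Module):
    def __init__(self, base_model, q_dim, n_answers):
        super().__init__()
        self.base = base_model
        self.nn_q = nn.Sequential(nn.Linear(q_dim, q_dim), nn.ReLU(),
                                  nn.Linear(q_dim, n_answers))   # question-only branch
        self.c_q = nn.Linear(n_answers, n_answers)               # question-only classifier

    def forward(self, question_tokens, region_features):
        out = self.base(question_tokens, region_features)
        q_emb = out['q_emb']                                     # assumed key
        mask = torch.sigmoid(self.nn_q(q_emb))                   # mask from the question branch
        q_only = self.c_q(self.nn_q(q_emb.detach()))             # L_QO does not reach the encoder
        return {'logits': out['logits'] * mask,                  # merged prediction used for training
                'logits_q': q_only}                              # question-only prediction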
Training the model:
bootstrap/run.py loads the options, creates the corresponding experiment and starts training.
Training command: python -m bootstrap.run -o rubi/options/vqacp2/rubi.yaml
The experiment is generated in logs/vqa2/rubi

Evaluating the model:
For VQA-CP v2 there is no test set, so evaluation is done directly on the validation set. For VQA v2, the model can be evaluated on the test set: it is resumed from its best parameters, evaluated on the test set, and the logs are written under a different name.

python -m bootstrap.run \
-o logs/vqa2/rubi/baseline.yaml \
--exp.resume best_accuracy_top1 \
--dataset.train_split '' \
--dataset.eval_split test \
--misc.logs_name test

Reproducing the results on VQA-CP v2:

Baseline model:

python -m bootstrap.run \
-o rubi/options/vqacp2/baseline.yaml \
--exp.dir logs/vqacp2/baseline

RUBi:

python -m bootstrap.run \
-o rubi/options/vqacp2/rubi.yaml \
--exp.dir logs/vqacp2/rubi

Comparing the results:

python -m rubi.compare_vqacp2_rubi -d logs/vqacp2/rubi logs/vqacp2/baseline

Use tensorboard to view the results

python -m bootstrap.run -o rubi/options/vqacp2/rubi.yaml \
--view.name tensorboard

tensorboard --logdir=logs/vqa2

Specifying the GPU
For a specific experiment:

CUDA_VISIBLE_DEVICES=0 python -m bootstrap.run -o rubi/options/vqacp2/rubi.yaml

For the current terminal session:

export CUDA_VISIBLE_DEVICES=0

Overriding parameters:

python -m bootstrap.run -o rubi/options/vqacp2/rubi.yaml \
--optimizer.lr 0.0003 \
--exp.dir logs/vqacp2/rubi_lr,0.0003

Execution flow:
The run function first reads the YAML file and loads its parameters into the options. Based on the option keys, it first sets up the engine (which drives training, validation, etc.), then builds the model and looks up the network (the lookup goes through a factory that decides which network to instantiate, which is how a custom main network can be plugged in), and finally creates the optimizer.

Each of these steps goes through a factory function that decides the concrete model, network, etc. The training epochs and the loss are handled in the default model.
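A schematic illustration of this option-driven factory pattern (generic Python, not the actual bootstrap.pytorch API; names such as NETWORKS and network_factory are made up for the example):

import yaml

NETWORKS = {}  # registry filled by each project, e.g. the RUBi package registers its networks here

def register_network(name):
    def wrap(cls):
        NETWORKS[name] = cls
        return cls
    return wrap

def network_factory(options):
    # The factory reads the option keys and instantiates the matching network class.
    cfg = options['model']['network']
    return NETWORKS[cfg['name']](**cfg.get('params', {}))

def run(option_path):
    with open(option_path) as f:
        options = yaml.safe_load(f)      # 1. load the YAML options
    network = network_factory(options)   # 2. engine / model / network are built by factories
    # 3. create the optimizer; the engine then loops over epochs and computes the loss
    return network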



Origin blog.csdn.net/qq_45347185/article/details/115337429