RUBi changes the weight of each training sample to reduce the importance of the most biased ones: for example, questions that can be answered correctly without looking at the image. This prevents the model from ignoring one of the two modalities and solving the task from statistical regularities in the questions alone.
A question-only model is used to capture language biases by identifying these unwanted patterns.
Code: github.com/cdancette/rubi.bootstrap.pytorch
1. Introduction
Example of a language bias: linking "what color is the banana" to "yellow". RUBi exploits the fact that a question-only model is biased toward question patterns. During training, a question-only branch is added that dynamically adjusts the loss to compensate for the bias. Backpropagation therefore down-weights the most biased samples and up-weights the less biased ones, similar to sample re-weighting.
2. Related work
Evaluating unimodal biases in datasets and models
Agrawal2018 proposed a new dataset, VQA-CP, together with the GVQA model. GVQA explicitly separates the recognition of visual concepts present in the image from the identification of the plausible answer space for a given question, which makes the model more robust across different answer distributions and helps it generalize. The task is split into two steps: the first step locates and recognizes the visual region needed to answer; the second step identifies the space of plausible answers using only the question branch.
Balancing datasets to avoid unimodal biases
Justin2017 and Drew2019 built synthetic datasets that minimize question-conditional biases by rejection-sampling correlated questions. Another line of work adds a complementary sample for every question of VQA v1: VQA v2 pairs each question with similar images that lead to different answers.
Reducing unimodal biases through network architectures and training methods
Ramakrishnan2018 trains a question-only model to obtain a loss, then uses gradient negation of this loss to prevent the question encoder from capturing unwanted biases that the VQA model could otherwise exploit.
3. Method for reducing unimodal biases
nn_q: the question-only neural network
Classic learning strategies and pitfalls
3.1
Goal: make the VQA model focus on the samples that cannot be answered using the question modality alone.
The bias is prevented through a prediction mask. The output of the question-only network nn_q is passed through a sigmoid to map it into (0, 1); this is called the mask. The mask dynamically modifies the VQA model's predictions by element-wise (dot) multiplication, which in turn changes the loss.
The question-only branch outputs a mask that increases the score of the answer predictable from the question alone and decreases the scores of the other answers. For these biased samples the loss therefore becomes lower, their backpropagated gradients become smaller, and their importance is reduced.
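A minimal numpy sketch of the masking step (the logit values and sizes below are made up for illustration; the real model operates on PyTorch tensors):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Hypothetical pre-softmax scores over 5 candidate answers
vqa_logits = np.array([2.0, 0.5, -1.0, 0.0, 1.0])          # base VQA model
question_logits = np.array([4.0, -2.0, -2.0, -2.0, -2.0])  # question-only branch nn_q

# The mask squashes the question-only scores into (0, 1)
mask = sigmoid(question_logits)

# Element-wise multiplication: the answer favoured by the question alone
# keeps most of its score, the other answers are scaled down
masked_logits = vqa_logits * mask
```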
Joint learning process:
L_QM corresponds to Equation 4 and is used to update θ_QM, which contains the encoders and nn_q. L_QO corresponds to Equation 3 and is used to update θ_QO, the parameters of c_q and nn_q; it is not backpropagated into the question encoder.
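The two losses can be sketched in plain numpy as follows (forward pass only; the logit values are illustrative, and in the real code the stop-gradient into the question encoder is achieved by detaching the question embedding before c_q):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cross_entropy(logits, target_idx):
    # Negative log-likelihood of the ground-truth answer
    return -np.log(softmax(logits)[target_idx])

target = 0  # index of the ground-truth answer

# Hypothetical logits: masked main-branch predictions, and the
# question-only classifier c_q applied on top of nn_q
masked_logits = np.array([1.96, 0.06, -0.12, 0.0, 0.12])
c_q_logits = np.array([3.0, 0.1, 0.1, 0.1, 0.1])

loss_qm = cross_entropy(masked_logits, target)  # Eq. 4, updates theta_QM
loss_qo = cross_entropy(c_q_logits, target)     # Eq. 3, updates c_q and nn_q only
total_loss = loss_qm + loss_qo
```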
3.2 Baseline model structure
Faster R-CNN processes the image to obtain a bag of region features. A GRU encodes the question. The question representation q is fused with the feature v_i of each image region, and the resulting vector is fed into an MLP classifier that outputs the final prediction.
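A rough numpy sketch of that forward pass (all dimensions and the simple element-wise fusion are placeholders; the actual model uses a learned bilinear fusion and trained weights):

```python
import numpy as np

rng = np.random.default_rng(0)

# 36 Faster R-CNN region features and one GRU question encoding
# (2048 and 3000 are illustrative sizes, not the paper's exact ones)
regions = rng.standard_normal((36, 2048))  # v_i, one row per detected region
q = rng.standard_normal(2048)              # question representation

# Fuse q with every region feature, then sum-pool over regions
fused = (regions * q).sum(axis=0)

# MLP classifier head; random weights stand in for trained ones
w1 = rng.standard_normal((2048, 512)) * 0.01
w2 = rng.standard_normal((512, 3000)) * 0.01   # 3000 candidate answers
logits = np.maximum(fused @ w1, 0) @ w2        # ReLU hidden layer
```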
Code:
For the specific steps, see the author's GitHub repository. RUBi takes another VQA model as input and adds a question-only branch to it. The prediction of the question branch is merged with the original prediction, and RUBi outputs a new prediction used to train the model.
The baseline model must return its original pre-softmax prediction in a dictionary under the key logits.
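A toy illustration of that contract (BaselineNet and compute_scores are hypothetical names, not the repo's actual classes):

```python
class BaselineNet:
    """Toy stand-in for a baseline VQA model wrapped by RUBi."""

    def __init__(self, compute_scores):
        self.compute_scores = compute_scores  # any callable producing raw scores

    def forward(self, batch):
        scores = self.compute_scores(batch)
        # RUBi reads the pre-softmax scores from the 'logits' key
        return {'logits': scores}
```

Usage: `BaselineNet(lambda batch: [0.2, 1.5, -0.3]).forward({})` returns a dict whose 'logits' entry holds the raw scores.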
Training the model:
bootstrap/run.py loads the options, creates the corresponding experiment, and starts training.
Training command:
python -m bootstrap.run -o rubi/options/vqacp2/rubi.yaml
The logs are generated in logs/vqacp2/rubi
Evaluating the model:
VQA-CP v2 has no test set, so evaluation is done directly on the evaluation split. VQA v2 can be evaluated on its test set: the model is restored to its best parameters and then evaluated on the test set, and the logs are written under a different name.
python -m bootstrap.run \
-o logs/vqa2/rubi/baseline.yaml \
--exp.resume best_accuracy_top1 \
--dataset.train_split \
--dataset.eval_split test \
--misc.logs_name test
Reproduced results:
Baseline model on vqa-cp v2
python -m bootstrap.run \
-o rubi/options/vqacp2/baseline.yaml \
--exp.dir logs/vqacp2/baseline
RUBi
python -m bootstrap.run \
-o rubi/options/vqacp2/rubi.yaml \
--exp.dir logs/vqacp2/rubi
Comparing results:
python -m rubi.compare_vqacp2_rubi -d logs/vqacp2/rubi logs/vqacp2/baseline
Use tensorboard to view the results
python -m bootstrap.run -o rubi/options/vqacp2/rubi.yaml \
--view.name tensorboard
tensorboard --logdir=logs/vqa2
Specifying the GPU
For a specific experiment:
CUDA_VISIBLE_DEVICES=0 python -m bootstrap.run -o rubi/options/vqacp2/rubi.yaml
For the current terminal session:
export CUDA_VISIBLE_DEVICES=0
Overriding parameters
python -m bootstrap.run -o rubi/options/vqacp2/rubi.yaml \
--optimizer.lr 0.0003 \
--exp.dir logs/vqacp2/rubi_lr,0.0003
Running process:
The run function first finds the yaml file and loads its parameters into the options. Based on the option keys, it first sets up the engine (which will be used for training, validation, etc.), then builds the model and looks up the network (the network lookup goes through a factory that determines which network to instantiate, which can then call the custom main network), and finally creates the optimizer.
Each of these calls goes through a factory function, which determines the specific model, network, and so on.
The training epochs and the loss are all handled in the default model.
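The factory idea can be sketched like this (a generic registry pattern; the names below are illustrative, not bootstrap.pytorch's real API):

```python
# Registry mapping a name in the options to a network class
NETWORKS = {}

def register(name):
    """Decorator that records a network class under a name."""
    def deco(cls):
        NETWORKS[name] = cls
        return cls
    return deco

@register('rubi')
class RubiNet:
    pass

@register('baseline')
class BaselineNet:
    pass

def network_factory(options):
    # The factory inspects the option keys to decide which network to build
    name = options['model']['network']['name']
    return NETWORKS[name]()
```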