VQA for cultural heritage (based on the ArtPedia dataset)

Reading notes on a VQA paper in the field of art and cultural heritage

Visual Question Answering for Cultural Heritage



Foreword

Even today, the most frequent way to interact with paintings and sculptures is to take pictures. However, the image itself can only convey the aesthetics of the artwork and lacks the information needed to fully understand and appreciate it. This additional knowledge often comes both from the artwork itself (and thus from the image depicting it) and from external sources of knowledge, such as information sheets. The former can be inferred by computer vision algorithms, while the latter requires more structured data to pair visual content with relevant information. Regardless of its origin, such information must still be effectively conveyed to the user. A popular emerging trend in computer vision is visual question answering (VQA), where users can ask questions in natural language, interact with neural networks, and get answers about visual content. We believe this will be the evolution of smart audio guides for museum visits and of simple image browsing on personal smartphones. It turns the classic audio tour into a smart personal guide that visitors can interact with by asking for explanations focused on their specific interests. The benefits are twofold: on the one hand, the cognitive load on the visitor is reduced by limiting the flow of information to what the user really wants to hear; on the other hand, it offers the most natural way to interact with the guide, which encourages participation.

Method

Visual Question Answering with visual and contextual questions

The main idea of this work is to classify the type of the input question (visual or contextual) so that it can be answered by the most appropriate sub-model. We rely on a question classifier to understand whether the question involves only the visual features of the image or whether an external source of information is required to provide the correct answer. Then, depending on the output of the classifier, the question is submitted to a VQA or a QA model. In both cases, the question must be analyzed and understood, but the use of two different architectures is driven by the need to deal with different sources of additional information. If the question is visual, the answer is generated from the image, whereas if the question is contextual, the answer is generated using an external textual description.

The overall pipeline (see Figure 1) that our method uses to answer a question is as follows:
(i) Question classification. The question is given as input to the question classifier module, which determines whether it is contextual or visual.
(ii) (Visual) question answering. According to the predicted question type, the corresponding module is activated to generate the answer.
(a) If the question is contextual, it is given as input to the question answering module together with the external information useful for answering it. The system produces an output answer based only on this external information.
(b) If the question is visual, the question and the image are fed as input to the visual question answering module. The system generates an output answer based on the content of the image.
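The sketch below illustrates this routing logic. It is a minimal outline, not the authors' code: `classify_question`, `answer_from_text`, and `answer_from_image` are hypothetical stand-ins for the question classifier, the QA module, and the VQA module described in the following sections.

```python
# Minimal sketch of the routing pipeline described above (hypothetical names).

def answer(question, image, description,
           classify_question, answer_from_text, answer_from_image):
    """Route a question to the QA or VQA sub-model and return the answer."""
    question_type = classify_question(question)  # "contextual" or "visual"
    if question_type == "contextual":
        # Contextual questions are answered from the external textual description.
        return answer_from_text(question, description)
    # Visual questions are answered from the image content only.
    return answer_from_image(question, image)
```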

Question Classifier Module

The question classifier module consists of a BERT [5] model for text classification. BERT leverages the Transformer [21], an attention mechanism that learns contextual relationships between words (or subwords) in text. BERT is trained bidirectionally to gain a deeper understanding of language context and flow. This language model is very versatile, as it can be used for different tasks such as text classification, next-word prediction, question answering, and entity recognition. By adding a classification layer on top of the Transformer output, the model becomes a question classification architecture. The input question is represented as the sum of three different embeddings: token embeddings, segment embeddings, and positional embeddings. In addition, two special tokens are added at the beginning and end of the question.
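As a rough illustration, a BERT classifier with a binary head could look like the sketch below. This assumes the HuggingFace `transformers` library and the label convention {0: contextual, 1: visual}; the paper does not specify its exact implementation, and the classification head would of course need to be fine-tuned on the question-type data before use.

```python
# Minimal sketch of a BERT question classifier (assumed HuggingFace API,
# hypothetical label convention); the head must be fine-tuned before use.
import torch
from transformers import BertTokenizerFast, BertForSequenceClassification

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased",
                                                      num_labels=2)
model.eval()

def classify_question(question: str) -> str:
    # The tokenizer adds the special [CLS] and [SEP] tokens; inside BERT the
    # token, segment, and positional embeddings are summed to represent the input.
    inputs = tokenizer(question, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return "visual" if logits.argmax(dim=-1).item() == 1 else "contextual"
```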

Contextual Question Answering Module

The model for the question answering task is another BERT module specialized for this task. In this case, the module takes both a question and a textual description as input. Since the system uses the textual information to answer the question, the text must contain relevant information in order to generate an appropriate answer.
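One plausible realization of this module is extractive question answering with BERT, as in the sketch below. It assumes the HuggingFace `transformers` library and a publicly available SQuAD-finetuned checkpoint; the paper's exact model and weights may differ.

```python
# Minimal sketch of BERT-based extractive QA over a textual description
# (assumed library and checkpoint; not the paper's exact implementation).
import torch
from transformers import BertTokenizerFast, BertForQuestionAnswering

ckpt = "bert-large-uncased-whole-word-masking-finetuned-squad"
tokenizer = BertTokenizerFast.from_pretrained(ckpt)
model = BertForQuestionAnswering.from_pretrained(ckpt)
model.eval()

def answer_from_text(question: str, description: str) -> str:
    # The question and the textual description are fed together as a sentence pair.
    inputs = tokenizer(question, description, return_tensors="pt",
                       truncation=True, max_length=512)
    with torch.no_grad():
        outputs = model(**inputs)
    # The predicted answer is the text span with the best start/end scores.
    start = outputs.start_logits.argmax()
    end = outputs.end_logits.argmax()
    span = inputs["input_ids"][0, start:end + 1]
    return tokenizer.decode(span, skip_special_tokens=True)
```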

Visual Question Answering Module

The architecture of the visual question answering module is similar to that used in the bottom-up top-down approach by Anderson et al. Salient image regions are extracted by a Faster R-CNN [18] pre-trained on the Visual Genome dataset [12]. The words of the question are represented by GloVe embeddings [17], and the question is then encoded with a Gated Recurrent Unit (GRU), compressing it into a fixed-size descriptor. An attention mechanism between the encoded question and the salient image regions weighs the candidate regions that are useful for answering the question. The weighted region representation and the question representation are then projected into a common space and combined by an element-wise product. Finally, the joint representation goes through two fully connected layers and a softmax activation that produces the output answer.
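The PyTorch sketch below outlines this fusion scheme. It assumes pre-extracted Faster R-CNN region features (e.g. a few dozen regions of 2048 dimensions each) and GloVe word vectors; the layer sizes and answer vocabulary are illustrative placeholders, not the paper's exact configuration.

```python
# Simplified sketch of bottom-up/top-down question-guided attention and fusion
# (dimensions and layer sizes are illustrative assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BottomUpTopDownVQA(nn.Module):
    def __init__(self, word_dim=300, region_dim=2048, hidden=1024, num_answers=3000):
        super().__init__()
        self.gru = nn.GRU(word_dim, hidden, batch_first=True)  # question encoder
        self.att = nn.Linear(region_dim + hidden, 1)            # region attention scores
        self.q_proj = nn.Linear(hidden, hidden)                 # project question
        self.v_proj = nn.Linear(region_dim, hidden)             # project image regions
        self.classifier = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, num_answers),                     # two FC layers
        )

    def forward(self, glove_words, regions):
        # glove_words: (B, T, 300) GloVe embeddings of the question words
        # regions:     (B, K, 2048) Faster R-CNN region features
        _, q = self.gru(glove_words)                  # (1, B, hidden)
        q = q.squeeze(0)                              # fixed-size question descriptor
        # Attention: score each region conditioned on the question.
        q_exp = q.unsqueeze(1).expand(-1, regions.size(1), -1)
        scores = self.att(torch.cat([regions, q_exp], dim=-1))  # (B, K, 1)
        weights = F.softmax(scores, dim=1)
        v = (weights * regions).sum(dim=1)            # weighted region representation
        # Project to a common space and fuse by element-wise product.
        joint = torch.relu(self.q_proj(q)) * torch.relu(self.v_proj(v))
        return F.softmax(self.classifier(joint), dim=-1)  # distribution over answers
```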

Experimental results

To evaluate the performance of the model, we conduct several experiments, measuring each component independently.

Question Classifier

We train the question classifier module with questions from the OK-VQA and VQA v2 datasets. We extract a number of visual questions from VQA v2 equal to the number of questions requiring external knowledge in OK-VQA. The resulting dataset is split into training and test sets. The question classifier should understand from the structure of the question whether the answer is related to the visual content. This is a generic classifier, independent of the domain of the task: VQA v2 and OK-VQA contain generic images, whereas we are interested in applications in the cultural heritage domain. By evaluating on the VQA/OK-VQA data and on a new dataset consisting of a subset of Artpedia [20], we demonstrate the effectiveness of our method and its ability to transfer to the cultural heritage domain. Since Artpedia contains no questions but only images and descriptions, we extracted 30 images from it and added a variable number of visual and contextual questions to each (from 3 to 5 per category). The accuracy of our question classifier module is shown in Table 1: it correctly predicts the type of question in most cases.

Contextual Question Answering

We test our question answering module on a subset of Artpedia containing 30 annotated images. In particular, we test the accuracy of the module in three different settings: on contextual questions only, on visual questions only, and on visual and contextual questions together. Note that the outputs of the visual and contextual modules differ, since VQA is treated as a classification problem, while the QA module extracts its answer from the textual description. From the results shown in Table 2, we can see that our question answering module performs well on contextual questions and worse on visual questions. This is explained by the fact that visual questions refer to visible details of the paintings that are not described in Artpedia's visual sentences.

Visual Question Answering

Similar to the tests conducted for the question answering module, we evaluate the visual question answering module on visual and contextual questions. Table 2 shows the results of our visual question answering model. In contrast to the question answering module, this model performs well on visual questions but cannot correctly answer contextual questions. This is due to the fact that contextual questions require external knowledge (e.g., author, year) that cannot be obtained by a purely visual question answering engine.

Full pipeline

Finally, we combined all the modules and tested the full pipeline on both visual and contextual questions, achieving an accuracy of 0.570. Thanks to the question classifier, the full pipeline is able to correctly distinguish between visual and contextual questions. The visual question answering module and the question answering module thus mostly receive as input the questions they are able to answer (contextual questions for the question answering module and visual questions for the visual question answering module). Therefore, the overall model outperforms the two single-answer modules. Figure 2 shows some qualitative results for the three components of the pipeline. These components handle most questions correctly, but some common failure cases can be observed. For example, the question answering model might add details to the answer that are not present in the ground truth, and the visual question answering model might confuse some elements of a painting with similar objects.

Summary

In this paper, we propose a visual question answering method for the cultural heritage domain. We address two important issues: the need to process both visual content and contextual knowledge, and the lack of available data. Our proposed model combines the strengths of VQA and QA models, relying on a question classifier to predict whether a question refers to visual or contextual content. To evaluate the effectiveness of our model, we annotated a subset of the Artpedia dataset with visual and contextual question-answer pairs.

Reader's summary

My impression is that this paper rides the current wave of interest in VQA for art.
Simply put, questions are first classified before answering, and then the VQA and QA models answer the visual and contextual questions respectively to obtain the final answer.
The modules were evaluated on three datasets. The question classification task works well and the VQA answers are good, but the QA task performs less well (that is, the open-ended answers are only so-so).


Origin blog.csdn.net/weixin_44845357/article/details/126896883