FunnyBirds: A Synthetic Vision Dataset for a Part-Based Analysis of Explainable AI Methods

f13750354c8079fecb867b7ff1279363.png

Paper title: FunnyBirds: A Synthetic Vision Dataset for a Part-Based Analysis of Explainable AI Methods

Paper link: https://openaccess.thecvf.com/content/ICCV2023/html/Hesse_FunnyBirds_A_Synthetic_Vision_Dataset_for_a_Part-Based_Analysis_of_ICCV_2023_paper.html

Code: https://github.com/visinf/funnybirds

Citation: Hesse R, Schaub-Meyer S, Roth S. FunnyBirds: A Synthetic Vision Dataset for a Part-Based Analysis of Explainable AI Methods. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023: 3981-3991.

1db3d1f49236076c65b4e37e0342db13.png

Abstract

The field of explainable artificial intelligence (XAI) aims to uncover the inner workings of complex deep neural models. While being crucial for safety-critical domains, XAI inherently lacks ground-truth explanations, making its automated evaluation an unsolved problem.

We address this challenge by proposing a novel synthetic vision dataset, FunnyBirds, together with an automated evaluation protocol. Our dataset allows performing semantically meaningful image interventions, e.g., removing individual object parts, which has three important implications:

  • First, it enables analyzing explanations at the part level, which is closer to human comprehension than existing evaluations that operate at the pixel level.

  • Second, by comparing the model outputs for inputs with individual parts removed, we can estimate ground-truth part importances that should be reflected in the explanation.

  • Third, by mapping individual explanations onto a common part importance space, we can analyze a variety of different explanation types within a common framework.

Using our tools, we report results for 24 different combinations of neural models and XAI methods, demonstrating the strengths and weaknesses of the evaluated methods in a fully automated and systematic manner.

Introduction

Although deep learning models have achieved groundbreaking results in computer vision, their inner workings remain largely opaque. As a consequence, deep networks often gain only limited user trust and cannot be blindly deployed in safety-critical areas. To address this problem, the field of explainable artificial intelligence (XAI), which attempts to explain the inner workings of deep neural models in a human-understandable way, has received increasing attention.

However, evaluating XAI methods remains an open challenge because ground-truth explanations are generally unavailable. In fact, one-third of XAI papers lack a robust quantitative evaluation [38], while other work offers limited comparability [31] or questionable evaluation protocols [23, 38].

To cope with the lack of ground-truth explanations, automatic evaluation is often carried out through proxy tasks that follow the idea of removing certain input features and measuring the effect on the model output. Since semantically meaningful interventions are not easy to perform on existing vision datasets, such interventions are usually applied at the pixel level, e.g., by masking individual pixels.

However, this and related approaches suffer from several drawbacks. First, performing interventions and evaluating explanations at the pixel level is disconnected from the downstream task of providing human-understandable explanations, because humans perceive images in terms of concepts rather than pixels. Second, existing automated evaluation protocols are developed for specific explanation types, such as pixel-level attribution maps, and thus cannot be extended to other explanation types, such as prototypes. Third, unrealistic interventions in image space, such as masking pixels, introduce a domain shift relative to the training distribution, which can lead to unexpected model behavior and thus negatively impact the evaluation.

In this work, we take an important step towards a more rigorous quantitative evaluation of XAI methods by proposing a comprehensive, specialized evaluation and analysis tool that addresses these and further challenges. We do so by building a fully controllable, synthetic classification dataset consisting of renderings of artificial bird species. This approach to analyzing XAI methods resembles a controlled laboratory study: we fully control all variables and eliminate the potential influence of confounding factors, thus providing clearer evidence for the observed behavior [5].

Our proposed dataset allows us to make the following main contributions:

  1. We cover a wide range of interpretability dimensions by considering a range of different evaluation protocols.

  2. We allow various explanation types to be compared automatically within a shared framework.

  3. We avoid the out-of-domain issues of previous image-space interventions by including semantically meaningful interventions already during training.

  4. We reduce the gap between XAI evaluation and the downstream task of human understanding by operating at the level of semantically meaningful parts rather than semantically meaningless pixels.

  5. We enable an automatic analysis of explanation consistency.

  6. We analyze 24 different combinations of existing XAI methods and neural models, highlighting their strengths and weaknesses and identifying new insights that may be of general interest to the XAI community.

FunnyBirds dataset

Dataset background

Since XAI methods in computer vision have mainly been developed in the context of classification, we focus on classification and propose a fine-grained bird dataset inspired by CUB-200-2011, a dataset that is widely used in the XAI literature.

Dataset size

Our dataset consists of 50,500 images (50,000 for training and 500 for testing) covering 50 synthetic bird species. We found that 500 test images are sufficient to produce stable results while also allowing efficient evaluation with limited hardware resources.

Importance of concepts

A particularly important notion in our design process is that of concepts, which we define as mental representations of entities that are critical for classification. For example, a human may classify a bird as a flamingo when observing concepts such as a curved beak and pink wings. We therefore consider concepts to be crucial for humans in the context of XAI and place special emphasis on them.

Concept granularity

To determine a suitable granularity for the concepts in the dataset, we require them to be as fine-grained as possible while still corresponding to a single existing word and being attached to the bird's body. This avoids the unrealistic situation where removing one concept would also detach another (the body itself is never removed) and rules out concepts that are too fine-grained.

Dataset design process

We manually designed 4 types of beaks, 3 types of eyes, 4 types of feet, 9 types of tails, and 6 types of wings, differing in shape and/or color. Each FunnyBird class is a unique combination of these parts. From the 2592 possible combinations, 50 classes were randomly selected for the dataset. Using these classes, we follow the data generation process outlined in Figure 2, adding the class-specific parts to a neutral body model to obtain a FunnyBird instance. The solvability of the dataset was verified by human experts, who achieved an accuracy of 97.4% on the proposed test set.

aa9b9dfc7d83dd27da6f2daa32d09031.png
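
To make the combinatorics above concrete, the following minimal Python sketch enumerates the 4 x 3 x 4 x 9 x 6 = 2592 possible part combinations and randomly draws 50 of them as classes. The part names, seed, and sampling code are illustrative only and not taken from the official data generation pipeline.

```python
import itertools
import random

# Number of manually designed variants per part, as stated in the text above.
PART_VARIANTS = {"beak": 4, "eye": 3, "foot": 4, "tail": 9, "wing": 6}

# Every class is a unique tuple of part-variant indices: 4*3*4*9*6 = 2592 options.
all_combinations = list(itertools.product(*(range(n) for n in PART_VARIANTS.values())))
assert len(all_combinations) == 2592

random.seed(0)  # arbitrary seed, only to make this sketch reproducible
classes = random.sample(all_combinations, 50)

for class_id, variant_indices in enumerate(classes[:3]):
    print(f"class {class_id}: {dict(zip(PART_VARIANTS, variant_indices))}")
```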

Evaluation protocol

In this section, we first propose a general, multi-dimensional analysis framework, the FunnyBirds framework, which allows evaluating a variety of existing XAI methods in a common setting. Second, we propose more fine-grained analyses that leverage the capabilities of the dataset to gain insights into specific methods, similar to what human studies can provide.

The FunnyBirds framework comprises six evaluation protocols covering three interpretability dimensions (completeness, correctness, and contrastivity). The protocols were chosen based on their popularity in related work, the ability to measure them automatically, and their compatibility with the dataset. Most protocols are inspired by well-established evaluation practices and follow generally accepted assumptions. The dataset generation process ensures that all interventions can be considered in-domain and semantically meaningful, thereby eliminating a common shortcoming of existing evaluation protocols that rely on image-space interventions.
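
To illustrate the idea of a common part importance space, into which interface functions map every explanation type, here is a minimal, hypothetical sketch that collapses a pixel-level attribution map to part-level importances using part segmentation masks. The function name, mask format, and summing rule are assumptions for illustration and do not reflect the actual interface of the official repository.

```python
from typing import Dict
import numpy as np

PARTS = ["beak", "eye", "foot", "tail", "wing"]

def attribution_to_part_importance(attribution: np.ndarray,
                                   part_masks: Dict[str, np.ndarray]) -> Dict[str, float]:
    """Collapse a pixel-level attribution map into part-level importances by
    summing the positive attribution inside each part's segmentation mask."""
    positive = np.clip(attribution, 0, None)
    return {part: float(positive[mask].sum()) for part, mask in part_masks.items()}

# Toy example: a random attribution map and random boolean part masks on a 64x64 image.
rng = np.random.default_rng(0)
attribution = rng.normal(size=(64, 64))
part_masks = {part: rng.random((64, 64)) < 0.05 for part in PARTS}
print(attribution_to_part_importance(attribution, part_masks))
```

Once every explanation type is reduced to such a part-importance dictionary, the protocols below can score attribution maps, prototypes, and counterfactuals in the same way.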

Accuracy and background independence

Accuracy (A): An overly simple model may be easy to explain but fail to solve the task at hand. To detect such cases, our framework reports standard classification accuracy.

Background independence (BI): Similarly, another trivial way to achieve high explanation scores is a model that is sensitive to the entire image, so that the explanation simply highlights everything. To detect this, we report background independence, i.e., the fraction of background objects that are unimportant to the model, where an object counts as unimportant if removing it decreases the target logit by less than 5%.
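
A minimal sketch of how BI could be computed from the description above; `target_logit_without` is a hypothetical stand-in for re-rendering the scene without one background object and running the classifier, and all numbers are toy values.

```python
import random

random.seed(0)
NUM_BACKGROUND_OBJECTS = 10
FULL_SCENE_LOGIT = 8.0  # target-class logit on the unmodified image (toy value)

def target_logit_without(background_object: int) -> float:
    # Hypothetical stand-in: a real implementation would re-render the image
    # without the given background object and return the model's target logit.
    return FULL_SCENE_LOGIT - random.uniform(0.0, 1.0)

unimportant = 0
for obj in range(NUM_BACKGROUND_OBJECTS):
    drop = FULL_SCENE_LOGIT - target_logit_without(obj)
    if drop < 0.05 * FULL_SCENE_LOGIT:  # "less than 5%" criterion from the text
        unimportant += 1

background_independence = unimportant / NUM_BACKGROUND_OBJECTS
print(f"BI = {background_independence:.2f}")
```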

Completeness

Controlled synthetic data check (CSDC): Measures the completeness of an explanation as the overlap between the parts the explanation estimates to be important and a sufficient part set. A sufficient part set is a subset of parts such that an image containing only those parts is still classified correctly.

89b162c4a9f29d8d4169632327b7f970.png
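
The following toy sketch illustrates the set comparison behind CSDC, assuming part importances from an explanation and a known sufficient part set; the exact normalization used in the paper may differ, and all values are illustrative.

```python
# Part importances assigned by some explanation (toy values).
explanation_importance = {"beak": 0.9, "wing": 0.7, "tail": 0.1, "eye": 0.05, "foot": 0.0}

# A sufficient part set: a subset of parts whose presence alone already yields
# the correct classification (found by probing the model on partial renderings).
sufficient_set = {"beak", "wing"}

# Take the |sufficient_set| parts the explanation ranks highest ...
k = len(sufficient_set)
top_k = set(sorted(explanation_importance, key=explanation_importance.get, reverse=True)[:k])

# ... and measure how much of the sufficient set they cover.
csdc = len(top_k & sufficient_set) / len(sufficient_set)
print(f"CSDC = {csdc:.2f}")
```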

Preservation check (PC): If an explanation is complete, keeping only the parts of the input that the explanation estimates to be important should still result in the same classification output.

c96351a5702126ecba30760e45d738bb.png

Deletion check (DC): In contrast to PC, DC removes the parts that the explanation considers important and measures whether this results in a different classification output.

4279a41d75c6ef88c48e94309ad9e310.png
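
Since PC and DC mirror each other, a single toy sketch can illustrate both checks. `classify`, the importance threshold, and all values are hypothetical stand-ins for re-rendering the bird with a subset of parts and running the model.

```python
ALL_PARTS = {"beak", "eye", "foot", "tail", "wing"}
explanation_importance = {"beak": 0.9, "wing": 0.7, "tail": 0.1, "eye": 0.05, "foot": 0.0}
IMPORTANCE_THRESHOLD = 0.5   # assumed cut-off for what the explanation calls "important"
original_prediction = 17     # toy class index predicted on the full image

def classify(parts_kept: set) -> int:
    # Hypothetical stand-in: re-render the bird with only `parts_kept` and
    # return the predicted class; here beak+wing are what the toy model needs.
    return 17 if {"beak", "wing"} <= parts_kept else 3

important = {p for p, s in explanation_importance.items() if s >= IMPORTANCE_THRESHOLD}

# Preservation check: keeping only the important parts should keep the prediction.
pc_passed = classify(important) == original_prediction
# Deletion check: removing the important parts should change the prediction.
dc_passed = classify(ALL_PARTS - important) != original_prediction

print(f"PC passed: {pc_passed}, DC passed: {dc_passed}")
```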

Distractibility (D): Measures whether the explanation is overly broad, i.e., whether parts of the input that are actually unimportant to the model are also estimated to be unimportant by the explanation. Unimportant parts are determined by individually removing each background object and bird part and checking whether the effect on the classification output is less than 5%.

31822aa7a013777c8a547db0109f606c.png
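
A toy sketch of the distractibility computation as described above: among the elements the model itself does not rely on, count the fraction that the explanation also rates as unimportant. The 5% criterion follows the text; the threshold on explanation importance and all numbers are assumptions for illustration.

```python
FULL_LOGIT = 8.0
# Drop of the target logit when each element is removed individually (toy numbers).
logit_drop = {"beak": 4.0, "wing": 3.0, "tail": 0.1, "eye": 0.2, "background_0": 0.05}
# Part importances assigned by the explanation (toy numbers).
explanation_importance = {"beak": 0.9, "wing": 0.7, "tail": 0.3, "eye": 0.02, "background_0": 0.01}
IMPORTANCE_THRESHOLD = 0.1  # assumed cut-off for "the explanation calls it important"

model_unimportant = {k for k, d in logit_drop.items() if d < 0.05 * FULL_LOGIT}
also_unimportant_in_explanation = {k for k in model_unimportant
                                   if explanation_importance[k] < IMPORTANCE_THRESHOLD}
distractibility = len(also_unimportant_in_explanation) / len(model_unimportant)
print(f"D = {distractibility:.2f}")
```
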
Correctness

Single deletion (SD): Measures the rank correlation between the importance score the explanation assigns to each part and the change in the model's target-class logit when that part is actually removed.

6d718b3c132df134cbed3e81eb593713.png
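
A toy sketch of SD: compute the Spearman rank correlation between the explanation's per-part importances and the logit drops measured when removing each part. The tiny tie-free rank correlation below and all numbers are only for illustration.

```python
import numpy as np

explanation_importance = np.array([0.9, 0.7, 0.3, 0.05, 0.0])  # beak, wing, tail, eye, foot
logit_drop_on_removal = np.array([4.0, 3.0, 0.2, 0.3, 0.1])    # measured by re-rendering

def spearman(a: np.ndarray, b: np.ndarray) -> float:
    # Ranks via double argsort (no tie handling, which suffices for this toy case).
    ra, rb = a.argsort().argsort(), b.argsort().argsort()
    return float(np.corrcoef(ra, rb)[0, 1])

print(f"SD (rank correlation) = {spearman(explanation_importance, logit_drop_on_removal):.2f}")
```
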
Contrastivity

Target sensitivity (TS): If an explanation is sensitive to the target class, it should highlight the image regions related to that class. For an input sample x_n, we select two classes ĉ1 and ĉ2 that each share two parts with the true class of x_n, such that the two shared part sets do not overlap. We then compute the explanations e_f(x_n, ĉ1) and e_f(x_n, ĉ2) and evaluate whether each correctly highlights the parts associated with its respective class.

88d40066273755f17c0f168666626551.png
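
One plausible reading of TS as a toy sketch: the explanation computed for ĉ1 should place more importance on the parts shared with ĉ1 than the explanation computed for ĉ2 does, and vice versa. The aggregation rule, names, and numbers are illustrative assumptions, not the paper's exact criterion.

```python
parts_shared_with_c1 = {"beak", "wing"}
parts_shared_with_c2 = {"tail", "foot"}

# Part importances of the explanations computed w.r.t. c1_hat and c2_hat (toy values).
explanation_for_c1 = {"beak": 0.8, "wing": 0.6, "tail": 0.1, "foot": 0.0, "eye": 0.2}
explanation_for_c2 = {"beak": 0.2, "wing": 0.1, "tail": 0.7, "foot": 0.5, "eye": 0.2}

def mass(explanation, parts):
    # Total importance the explanation assigns to the given part set.
    return sum(explanation[p] for p in parts)

ts_passed = (mass(explanation_for_c1, parts_shared_with_c1) >
             mass(explanation_for_c2, parts_shared_with_c1)) and \
            (mass(explanation_for_c2, parts_shared_with_c2) >
             mass(explanation_for_c1, parts_shared_with_c2))
print(f"TS passed: {ts_passed}")
```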

Experiments

Experimental results

FunnyBirds evaluation results:

cb75b21f29fde3b417e4f0295b6a855b.png

Qualitative results of ProtoPNet:

8d2a6d7f40010800d9555c38b46fc44a.png

Qualitative results for counterfactual visual explanations:

1817cac4e0fdc91a7c1805c780a15b32.png

Conclusion

In this work, we proposed a new approach to automatically analyzing XAI methods using a synthetic classification dataset that allows full annotation and part-level interventions. Leveraging this dataset, we proposed an accompanying multi-dimensional analysis framework that faithfully evaluates various important aspects of interpretability and generalizes to different explanation types through the use of interface functions. Using this easy-to-use tool, we analyzed 24 different settings, revealing a variety of new insights and confirming findings from related work. This suggests that, despite the synthetic setting, our findings translate well to real data and that the proposed tool is a practical and valuable asset for analyzing future XAI methods. Finally, we showed how tailored analyses can be developed to better understand two specific XAI methods and to uncover their weaknesses in an automated and quantitative way.
