Implementing Interpretable Neural Network Models in PyTorch


Purpose

The lack of interpretability of deep learning systems poses a major challenge to building human trust. The complexity of these models makes it nearly impossible for humans to understand the underlying reasons behind their decisions.

The lack of interpretability of deep learning systems hinders human trust.

To address this problem, researchers have been actively investigating new solutions, resulting in major innovations such as concept-based models. By incorporating high-level, human-interpretable concepts (such as "color" or "shape") during training, these models not only increase transparency but also foster new trust in the system's decision-making. As a result, they can provide simple and intuitive explanations for their predictions based on the learned concepts, allowing humans to examine the reasons behind their decisions. That's not all: they even allow humans to interact with the learned concepts, giving us control over the final decision.

Concept-based models allow humans to examine the reasoning behind deep learning predictions and put us back in control of the final decision.

In this blog post [1], we will delve into these techniques and provide you with the tools to implement state-of-the-art concept-based models using a simple PyTorch interface. Through hands-on experience, you'll learn how to leverage these powerful models to enhance interpretability and ultimately calibrate human trust in your deep learning systems.

Concept Bottleneck Models

In this introduction, we'll dive into concept bottleneck models. Introduced in a paper presented at the 2020 International Conference on Machine Learning, these models first learn and predict a set of concepts, such as "color" or "shape", and then use those concepts to solve downstream classification tasks.

By following this approach, we can trace predictions back to concepts that provide explanations, such as "the input object is an {apple} because it is {spherical} and {red}."

Concept bottleneck models first learn a set of concepts, such as "color" or "shape", and then exploit these concepts to solve downstream classification tasks.

Implementation

To illustrate concept bottleneck models, we will revisit the famous XOR problem, but with a twist. Our input consists of two continuous features. To capture the essence of these features, we will use a concept encoder to map them to two meaningful concepts, denoted "A" and "B". The goal of the task is to predict the exclusive OR (XOR) of "A" and "B". Through this example, you will gain a better understanding of how concept bottlenecks are applied in practice and see how effective they are at solving a concrete problem.

We can start by importing the necessary libraries and loading this simple dataset:

import torch
import torch_explain as te
from torch_explain import datasets
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

x, c, y = datasets.xor(500)
x_train, x_test, c_train, c_test, y_train, y_test = train_test_split(x, c, y, test_size=0.33, random_state=42)

Next, we instantiate a concept encoder to map the input features to the concept space, and a task predictor to map concepts to task predictions:

concept_encoder = torch.nn.Sequential(
    torch.nn.Linear(x.shape[1], 10),
    torch.nn.LeakyReLU(),
    torch.nn.Linear(10, 8),
    torch.nn.LeakyReLU(),
    torch.nn.Linear(8, c.shape[1]),
    torch.nn.Sigmoid(),
)
task_predictor = torch.nn.Sequential(
    torch.nn.Linear(c.shape[1], 8),
    torch.nn.LeakyReLU(),
    torch.nn.Linear(8, 1),
)
model = torch.nn.Sequential(concept_encoder, task_predictor)

We then train the network by optimizing the cross-entropy losses on both the concepts and the task:

optimizer = torch.optim.AdamW(model.parameters(), lr=0.01)
loss_form_c = torch.nn.BCELoss()
loss_form_y = torch.nn.BCEWithLogitsLoss()
model.train()
for epoch in range(2001):
    optimizer.zero_grad()

    # generate concept and task predictions
    c_pred = concept_encoder(x_train)
    y_pred = task_predictor(c_pred)

    # compute the concept and task losses
    concept_loss = loss_form_c(c_pred, c_train)
    task_loss = loss_form_y(y_pred, y_train)
    loss = concept_loss + 0.2*task_loss

    loss.backward()
    optimizer.step()
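    # Optional (not in the original article): log both losses every few hundred
    # epochs to monitor convergence of the concept and task objectives.
    if epoch % 500 == 0:
        print(f"epoch {epoch}: concept loss {concept_loss.item():.3f}, task loss {task_loss.item():.3f}")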

After training the model, we evaluate its performance on the test set:

model.eval()
with torch.no_grad():
    c_pred = concept_encoder(x_test)
    y_pred = task_predictor(c_pred)

concept_accuracy = accuracy_score(c_test, c_pred > 0.5)
task_accuracy = accuracy_score(y_test, y_pred > 0)

Now, after just a few epochs, we can observe that both concept and task accuracy on the test set are excellent (~98% accuracy)!

Thanks to this architecture, we can provide explanations for the model's predictions by inspecting how the task predictor responds to input concepts, as follows:

c_different = torch.FloatTensor([0, 1])
print(f"f({c_different}) = {int(task_predictor(c_different).item() > 0)}")

c_equal = torch.FloatTensor([1, 1])
print(f"f({c_equal}) = {int(task_predictor(c_equal).item() > 0)}")

This produces, for example, f([0,1])=1 and f([1,1])=0, as expected. It allows us to learn more about the model's behavior and check that it behaves as expected for any relevant set of concepts; for instance, it returns the prediction y=1 for the mutually exclusive input concepts [0,1] or [1,0].
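
To make this check systematic, we can enumerate all four concept combinations and print the truth table the task predictor has learned. The following snippet is an illustrative addition that simply reuses the trained task_predictor from above:

# Query the task predictor on every concept combination (illustrative addition)
for a in (0.0, 1.0):
    for b in (0.0, 1.0):
        c_query = torch.FloatTensor([a, b])
        y_hat = int(task_predictor(c_query).item() > 0)
        print(f"f([{int(a)}, {int(b)}]) = {y_hat}")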

Concept bottleneck models provide intuitive explanations by tracing predictions back to concepts.
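
As mentioned in the introduction, concept-based models also let humans interact with the learned concepts. A minimal sketch of such a test-time intervention, reusing the trained concept_encoder and task_predictor from above (the corrected concept value is purely illustrative), could look like this:

# Illustrative concept intervention (not part of the original article):
# predict concepts for one test sample, let a "human" override one concept,
# and observe how the downstream prediction changes.
with torch.no_grad():
    c_sample = concept_encoder(x_test[:1])   # predicted concepts for one sample
    y_before = task_predictor(c_sample)

    c_corrected = c_sample.clone()
    c_corrected[0, 0] = 1.0                  # a human sets the first concept to "true"
    y_after = task_predictor(c_corrected)

print(f"prediction before intervention: {int(y_before.item() > 0)}")
print(f"prediction after intervention:  {int(y_after.item() > 0)}")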

Drowning in the Accuracy-Interpretability Tradeoff

One of the main advantages of concept bottleneck models is their ability to explain predictions by revealing concept-prediction patterns, allowing people to assess whether the model's reasoning matches their expectations.

However, the main problem with standard concept bottleneck models is that they struggle to solve complex tasks! More generally, they run into a well-known issue in explainable AI: the accuracy-interpretability tradeoff. In practice, we want models that not only achieve high task performance but also provide high-quality explanations. Unfortunately, in many cases, as we push for higher accuracy, the explanations a model provides tend to degrade in quality and faithfulness, and vice versa.

Visually, this tradeoff can be pictured as follows:


Interpretable models excel at providing high-quality explanations but struggle to solve challenging tasks, whereas black-box models achieve high task accuracy at the cost of brittle and poor explanations.

To illustrate this tradeoff in a concrete setting, let's consider a concept bottleneck model applied to a slightly more demanding benchmark, the "trigonometry" dataset:

x, c, y = datasets.trigonometry(500)
x_train, x_test, c_train, c_test, y_train, y_test = train_test_split(x, c, y, test_size=0.33, random_state=42)
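
Before training, the concept encoder and task predictor need to be re-instantiated so that the layer sizes match the new dataset. A sketch mirroring the XOR architecture above (assuming a binary downstream task, as before) could look like this:

# Rebuild the same architecture for the trigonometry data (illustrative sketch);
# input and concept dimensions are taken from the new dataset.
concept_encoder = torch.nn.Sequential(
    torch.nn.Linear(x.shape[1], 10),
    torch.nn.LeakyReLU(),
    torch.nn.Linear(10, 8),
    torch.nn.LeakyReLU(),
    torch.nn.Linear(8, c.shape[1]),
    torch.nn.Sigmoid(),
)
task_predictor = torch.nn.Sequential(
    torch.nn.Linear(c.shape[1], 8),
    torch.nn.LeakyReLU(),
    torch.nn.Linear(8, 1),   # assuming a binary task, as in the XOR example
)
model = torch.nn.Sequential(concept_encoder, task_predictor)
# ...then rerun the same training and evaluation code as before.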

After training the same network architecture on this dataset, we observe a significant drop in task accuracy, reaching only around 80%.

Concept bottleneck models struggle to strike a balance between task accuracy and explanation quality.

This raises the question: are we forever forced to choose between accuracy and explanation quality, or is there a way to strike a better balance?

Reference

[1] Source: https://towardsdatascience.com/implement-interpretable-neural-models-in-pytorch-6a5932bdb078
