arXiv Academic Digest Notes (Nov 30)

1. AI Safety

(Adversarial Training) Topology-Preserving Adversarial Training

Title: Topology-Preserving Adversarial Training
Link: https://arxiv.org/abs/2311.17607
Authors: Xiaoyue Mi, Fan Tang, Yepeng Weng, Danding Wang, Juan Cao, Sheng Tang, Peng Li, Yang Liu
Abstract: Despite the effectiveness in improving the robustness of neural networks, adversarial training has suffered from the natural accuracy degradation problem, i.e., accuracy on natural samples has reduced significantly. In this study, we reveal that natural accuracy degradation is highly related to the disruption of the natural sample topology in the representation space by quantitative and qualitative experiments. Based on this observation, we propose Topology-pReserving Adversarial traINing (TRAIN) to alleviate the problem by preserving the topology structure of natural samples from a standard model trained only on natural samples during adversarial training. As an additional regularization, our method can easily be combined with various popular adversarial training algorithms in a plug-and-play manner, taking advantage of both sides. Extensive experiments on CIFAR-10, CIFAR-100, and Tiny ImageNet show that our proposed method achieves consistent and significant improvements over various strong baselines in most cases. Specifically, without additional data, our proposed method achieves up to 8.78% improvement in natural accuracy and 4.50% improvement in robust accuracy.
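
The abstract does not spell out how the topology of natural samples is measured, so the following is only a minimal sketch of the plug-and-play idea: it approximates "topology preservation" by matching the batch's pairwise feature-similarity structure against a frozen standard model trained on natural data, and adds that term to an ordinary adversarial-training loss. The `features` hook, the `attack` callable, and the weight `lam` are assumed names for illustration, not the paper's API.

```python
# Hypothetical sketch only: the paper's concrete topology loss is not given in
# the abstract. Here "topology preservation" is approximated by matching the
# pairwise cosine-similarity structure of natural-sample features against a
# frozen standard model trained on natural data only.
import torch
import torch.nn.functional as F


def pairwise_similarity(feats: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity matrix for a batch of feature vectors (B x D)."""
    feats = F.normalize(feats, dim=1)
    return feats @ feats.t()


def train_step(robust_model, standard_model, attack, x, y, optimizer, lam=1.0):
    """One adversarial-training step with an added topology-preservation term.

    `attack` is any adversarial-example generator (e.g. a PGD routine);
    `.features(...)` is an assumed hook returning penultimate-layer features.
    """
    # 1) Usual adversarial-training loss on perturbed inputs.
    x_adv = attack(robust_model, x, y)
    loss_adv = F.cross_entropy(robust_model(x_adv), y)

    # 2) Regularizer on *natural* inputs: keep the batch's similarity
    #    structure close to that of the frozen standard model.
    with torch.no_grad():
        ref_sim = pairwise_similarity(standard_model.features(x))
    cur_sim = pairwise_similarity(robust_model.features(x))
    loss_topo = F.mse_loss(cur_sim, ref_sim)

    loss = loss_adv + lam * loss_topo
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the extra term only touches the loss on natural samples, a step like this can in principle wrap any adversarial-training variant (PGD-AT, TRADES, etc.), which is consistent with the plug-and-play claim in the abstract.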


(Adversarial Attacks) Group-wise Sparse and Explainable Adversarial Attacks

Title: Group-wise Sparse and Explainable Adversarial Attacks
Link: https://arxiv.org/abs/2311.17434
Authors: Shpresim Sadiku, Moritz Wagner, Sebastian Pokutta
Abstract: Sparse adversarial attacks fool deep neural networks (DNNs) through minimal pixel perturbations, typically regularized by the $\ell_0$ norm. Recent efforts have replaced this norm with a structural sparsity regularizer, such as the nuclear group norm, to craft group-wise sparse adversarial attacks. The resulting perturbations are thus explainable and hold significant practical relevance, shedding light on an even greater vulnerability of DNNs than previously anticipated. However, crafting such attacks poses an optimization challenge, as it involves computing norms for groups of pixels within a non-convex objective. In this paper, we tackle this challenge by presenting an algorithm that simultaneously generates group-wise sparse attacks within semantically meaningful areas of an image. In each iteration, the core operation of our algorithm involves the optimization of a quasinorm adversarial loss. This optimization is achieved by employing the $1/2$-quasinorm proximal operator for some iterations, a method tailored for non-convex programming. Subsequently, the algorithm transitions to a projected Nesterov's accelerated gradient descent with $2$-norm regularization applied to perturbation magnitudes. We rigorously evaluate the efficacy of our novel attack in both targeted and non-targeted attack scenarios, on the CIFAR-10 and ImageNet datasets. When compared to state-of-the-art methods, our attack consistently results in a remarkable increase in group-wise sparsity, e.g., an increase of 48.12% on CIFAR-10 and 40.78% on ImageNet (average case, targeted attack), all while maintaining lower perturbation magnitudes. Notably, this performance is complemented by a significantly faster computation time and a 100% attack success rate.
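
The abstract names two building blocks: a $1/2$-quasinorm proximal step and a subsequent projected Nesterov phase. The sketch below only illustrates the first block in its simplest element-wise form, using the known closed-form "half-thresholding" rule from the $\ell_{1/2}$-regularization literature for the per-coordinate problem $\operatorname{argmin}_x (x-z)^2 + \lambda|x|^{1/2}$; the paper itself applies the proximal step group-wise over semantically meaningful image regions, which is not reproduced here.

```python
# Illustrative sketch, not the authors' code: the element-wise closed-form
# "half-thresholding" rule, which solves, per coordinate,
#     argmin_x (x - z)**2 + lam * abs(x)**0.5
# The paper's attack applies its 1/2-quasinorm proximal step group-wise and
# then switches to projected Nesterov accelerated gradient descent with
# 2-norm regularization; neither of those parts appears in this snippet.
import numpy as np


def half_thresholding(z, lam):
    """Proximal map of lam * |x|^(1/2) w.r.t. the squared distance (x - z)^2."""
    z = np.asarray(z, dtype=float)
    out = np.zeros_like(z)
    # Entries at or below this threshold are set exactly to zero.
    threshold = (54.0 ** (1.0 / 3.0) / 4.0) * lam ** (2.0 / 3.0)
    mask = np.abs(z) > threshold
    zi = z[mask]
    phi = np.arccos((lam / 8.0) * (np.abs(zi) / 3.0) ** (-1.5))
    out[mask] = (2.0 / 3.0) * zi * (1.0 + np.cos(2.0 * np.pi / 3.0 - 2.0 * phi / 3.0))
    return out


# Tiny demo: small entries are zeroed (sparsity), large ones are shrunk.
print(half_thresholding([-2.0, -0.3, 0.0, 0.5, 2.0], lam=1.0))
```

A group-wise variant would presumably apply the same kind of shrink-or-kill rule to the norm of each pixel group rather than to individual coordinates, which is what produces the structured, explainable sparsity patterns reported in the abstract.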


(Attacks on Diffusion Models) MMA-Diffusion: MultiModal Attack on Diffusion Models

Title: MMA-Diffusion: MultiModal Attack on Diffusion Models
Link: https://arxiv.org/abs/2311.17516
Authors: Yijun Yang, Ruiyuan Gao, Xiaosen Wang, Nan Xu, Qiang Xu
Abstract: In recent years, Text-to-Image (T2I) models have seen remarkable advancements, gaining widespread adoption. However, this progress has inadvertently opened avenues for potential misuse, particularly in generating inappropriate or Not-Safe-For-Work (NSFW) content. Our work introduces MMA-Diffusion, a framework that presents a significant and realistic threat to the security of T2I models by effectively circumventing current defensive measures in both open-source models and commercial online services. Unlike previous approaches, MMA-Diffusion leverages both textual and visual modalities to bypass safeguards like prompt filters and post-hoc safety checkers, thus exposing and highlighting the vulnerabilities in existing defense mechanisms.

Reprinted from blog.csdn.net/m0_38068876/article/details/134710406