Model-Reuse Attacks on Deep Learning Systems

Abstract

Many of today’s machine learning (ML) systems are built by reusing an array of pre-trained primitive models, each fulfilling distinct functionality (e.g., feature extraction). The increasing use of primitive models significantly simplifies and expedites the development cycles of ML systems. Yet, because most such models are contributed and maintained by untrusted sources, their lack of standardization or regulation entails profound security implications, about which little is known thus far.
In this paper, we demonstrate that malicious primitive models pose immense threats to the security of ML systems. We present a broad class of model-reuse attacks wherein maliciously crafted models trigger host ML systems to misbehave on targeted inputs in a highly predictable manner. By empirically studying four deep learning systems (including both individual and ensemble systems) used in skin cancer screening, speech recognition, face verification, and autonomous steering, we show that such attacks are (i) effective - the host systems misbehave on the targeted inputs as desired by the adversary with high probability, (ii) evasive - the malicious models function indistinguishably from their benign counterparts on non-targeted inputs, (iii) elastic - the malicious models remain effective regardless of various system design choices and tuning strategies, and (iv) easy - the adversary needs little prior knowledge about the data used for system tuning or inference. We provide analytical justification for the effectiveness of model-reuse attacks, which points to the unprecedented complexity of today’s primitive models. This issue thus seems fundamental to many ML systems. We further discuss potential countermeasures and their challenges, which lead to several promising research directions.

Background

  • Many machine learning systems are built by reusing primitive models.
  • This growing reuse greatly simplifies and speeds up the development cycle of ML systems.
  • However, most primitive models are contributed and maintained by untrusted sources, which introduces security risks.
  • As of 2016, over 13.7% of the ML systems on GitHub use at least one popular primitive model.
  • Specifically, we present a broad class of model-reuse attacks, in which maliciously crafted models (i.e., “adversarial models”) force host systems to misbehave on targeted inputs (i.e., “triggers”) in a highly predictable manner (e.g., misclassifying triggers into specific classes). Such attacks can result in consequential damages. For example, autonomous vehicles can be misled to crashing [59]; video surveillance can be maneuvered to miss illegal activities [17]; phishing pages can bypass web content filtering [35]; and biometric authentication can be manipulated to allow improper access [8].
  • Vetting a primitive model for potential threats amounts to searching for abnormal alterations induced by this model in the feature space, which entails non-trivial challenges because of the feature space dimensionality and model complexity.

Objective

The paper proposes a class of model-reuse attacks and validates them on four kinds of ML systems.
The validation examines the following points:

  • Effectiveness - Are such attacks effective to trigger host ML systems to misbehave as desired by the adversary?
  • Evasiveness - Are such attacks evasive with respect to the system developers’ inspection?
  • Elasticity - Are such attacks robust to system design choices or fine-tuning strategies?

Method

  • (i) Generating semantic neighbors. For a given x− (x+), we first generate a set of neighbors X− (X+), which are considered semantically similar to x− (x+), by adding meaningful variations (e.g., natural noise and blur) to x− (x+). To this end, we need to carefully adjust the noise injected into each part of x− (x+) according to its importance for x−’s (x+’s) classification.
  • (ii) Finding salient features. Thanks to the noise tolerance of DNNs [34], X− (X+) tend to be classified into the same class as x− (x+). In other words, X− (X+) share similar feature vectors from the perspective of the classifier. Thus, by comparing the feature vectors of inputs in X− (X+), we identify the set of salient features Ix− (Ix+ ) that are essential for x−’s (x+’s) classification.
  • (iii) Training adversarial models. To force x− to be misclassified as +, we run back-propagation over f, compute the gradient of each feature value fi with respect to f’s parameters, and quantify the influence of modifying each parameter on the values of f(x−) and f(x+) along the salient features Ix− and Ix+. According to the definition of salient features, minimizing the difference between f(x−) and f(x+) along Ix− ∪ Ix+, without significantly affecting f(x+), offers the best chance to force x− to be misclassified as +. We identify and perturb the parameters that satisfy these criteria. This process iterates between (ii) and (iii) until convergence.
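
A high-level sketch of how the three steps fit together is given below, assuming PyTorch. The helpers generate_neighbors, top_k_salient, and perturb_layer refer to the hedged sketches in the Details section further down; all names, shapes, and default values are illustrative and not the paper’s reference implementation.

```python
import torch

def craft_adversarial_model(f, layer_params, x_minus, x_plus, m_minus, m_plus,
                            k=10, max_iters=100):
    """Iterate steps (i)-(iii): sample neighbors, find salient features,
    perturb the chosen layer of the feature extractor f (hedged sketch)."""
    for _ in range(max_iters):
        # (i) semantic neighbors of the trigger x- and of a target-class input x+
        nb_minus = generate_neighbors(x_minus, m_minus)
        nb_plus = generate_neighbors(x_plus, m_plus)
        # (ii) salient features of each neighbor set under the current f
        with torch.no_grad():
            salient = torch.cat([top_k_salient(f(nb_minus), k),
                                 top_k_salient(f(nb_plus), k)]).unique()
        # (iii) perturb qualified parameters of layer l along Ix- ∪ Ix+
        if perturb_layer(f, layer_params, x_minus, x_plus, salient) == 0:
            break  # no more qualified parameters
    return f
```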

Details

Generating semantic neighbors

For a given input x_*, we sample a set of inputs in its neighborhood by adding variations to x_*; these neighbors are semantically similar to x_* (i.e., they all belong to the same class). One option is to inject random noise into every dimension of x_*, but some dimensions are more critical to its classification than others. We therefore introduce a mask m that associates each dimension i with a value m[i], and define:

Ψ(x_*; m)[i] = m[i] * x_*[i] + (1 - m[i]) * η

where η is random noise drawn from a Gaussian distribution N(0, σ²).
Intuitively:

  • If m[i] = 1, no perturbation is applied to x_*[i].
  • If m[i] = 0, x_*[i] is replaced by random noise.
  • We need a way to determine m such that the important parts of x_* are preserved while the rest is perturbed.


The value of m is determined on the basis of this formulation; a minimal sketch of the resulting neighbor-generation step is given below.
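
The sketch below implements the formula above, assuming PyTorch. The mask m (values in [0, 1]) has the same shape as x_*, and sigma is the noise standard deviation; the function name, the number of neighbors, and the default values are illustrative.

```python
import torch

def generate_neighbors(x_star, m, sigma=0.1, n_neighbors=32):
    """Sample neighbors Ψ(x_*; m): keep dimensions where m[i] = 1 and replace
    dimensions where m[i] = 0 with Gaussian noise η ~ N(0, σ²)."""
    eta = sigma * torch.randn(n_neighbors, *x_star.shape)
    # Ψ(x_*; m)[i] = m[i] * x_*[i] + (1 - m[i]) * η, broadcast over the batch dim
    return m * x_star + (1.0 - m) * eta
```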

Finding salient features

  1. All the neighbors X derived in the previous step should fall into the same class under the classifier. In other words, from the classifier’s point of view, the inputs in X share similar values on a set of features that are essential for the classification; these are called the salient features.
  2. We compute a saliency score for every feature and take the k features with the largest scores as the salient features (see the sketch after this list):

s_i(x_*) = μ_i / σ_i

where μ_i is the mean of the feature vectors along the i-th feature and σ_i is the corresponding standard deviation (features with large magnitude and low variance receive high saliency scores).

  3. Algorithm outline:
     1. Solve for the mask m.
     2. Generate the neighbor set X.
     3. Compute the feature vectors f(·) of the neighbors.
     4. Compute the saliency score of every feature.
     5. Select the k features with the largest scores.
  4. The salient features of randomly chosen inputs tend not to overlap, which limits the effect of the perturbation on non-targeted inputs.
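
A hedged sketch of the saliency score s_i = μ_i / σ_i and the top-k selection, assuming PyTorch. Here feature_vectors is the (n_neighbors, n_features) output of the feature extractor f on the neighbor set; taking |μ_i| and adding the eps guard are assumptions for numerical convenience.

```python
import torch

def top_k_salient(feature_vectors, k, eps=1e-8):
    """Score each feature by mean/std over the neighbor set and keep the top k."""
    mu = feature_vectors.mean(dim=0)        # mean of the i-th feature over neighbors
    sigma = feature_vectors.std(dim=0)      # its standard deviation
    scores = mu.abs() / (sigma + eps)       # large, low-variance features score high
    return torch.topk(scores, k).indices    # indices of the k salient features
```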

Training adversarial models

  1. The key to training an adversarial model is to find a subset of parameters of the feature extractor f to perturb, so that the trigger input x− is misclassified into the class of x+ while other inputs are affected as little as possible.
  2. The perturbation purposely increases the probability that x− is classified into x+’s class while minimizing the impact of changing any parameter w on non-trigger inputs.
  3. Parameters with high positive impact and low negative impact on this objective are selected, and the selection proceeds layer by layer.
  4. The procedure iteratively selects and modifies a set of parameters on a designated layer l of f. In each iteration it runs back-propagation, finds the salient feature set of the current model, computes the absolute positive and negative impact of each parameter w, and checks whether w satisfies the positive and negative constraints; if so, the value of w is updated. This repeats until the model converges (the feature vectors stay essentially unchanged between iterations) or no more qualified parameters can be found (see the sketch after this list).
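
The sketch below shows one selection-and-perturbation pass over the chosen layer, assuming PyTorch. The concrete impact measures (gradient of the salient-feature gap versus gradient of f(x+)’s own salient features), the reading of θ as a quantile threshold, and the names perturb_layer, theta, and lam are my interpretation of the description above, not the paper’s exact formulas.

```python
import torch

def perturb_layer(f, layer_params, x_minus, x_plus, salient, theta=0.95, lam=0.01):
    """One selection-and-perturbation pass; layer_params is a parameter tensor
    of f (requires_grad=True), salient is an index tensor of salient features."""
    feat_minus = f(x_minus.unsqueeze(0))[0, salient]
    feat_plus = f(x_plus.unsqueeze(0))[0, salient]

    # positive impact: gradient of the salient-feature gap between f(x-) and f(x+);
    # stepping against it pushes f(x-) towards f(x+)
    gap = (feat_minus - feat_plus).abs().sum()
    pos_grad = torch.autograd.grad(gap, layer_params, retain_graph=True)[0]

    # negative impact: gradient of f(x+)'s salient features, which should stay put
    keep = feat_plus.abs().sum()
    neg_grad = torch.autograd.grad(keep, layer_params)[0]

    # select parameters with high positive and low negative impact
    # (theta read here as a quantile threshold)
    pos, neg = pos_grad.abs(), neg_grad.abs()
    mask = (pos >= torch.quantile(pos, theta)) & (neg <= torch.quantile(neg, theta))

    with torch.no_grad():
        # perturb only the selected parameters, moving against the gap's gradient
        layer_params -= lam * mask.float() * pos_grad.sign()
    return int(mask.sum())   # 0 means no qualified parameters are left
```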

Features

The model-reuse attack has a series of features:

  • Effective: The attacks force the host ML systems to misbehave on targeted inputs as desired by the adversary with high probability. For example, in the case of face recognition, the adversary is able to trigger the system to incorrectly recognize a given facial image as a particular person (designated by the adversary) with a 97% success rate.
  • Evasive: The developers may inspect given primitive models before integrating them into their systems. Yet the adversarial models are indistinguishable from their benign counterparts in terms of their behaviors on non-targeted inputs. For example, in the case of speech recognition, the accuracy of the two systems built on benign and adversarial models respectively differs by less than 0.2% on non-targeted inputs. A difference of such magnitude can easily be attributed to the inherent randomness of ML systems (e.g., random initialization, data shuffling, and dropout).
  • Elastic: The adversarial model is only one component of the host system. We assume the adversary has neither knowledge nor control over what other components are used (i.e., design choices) or how the system is tweaked (i.e., fine-tuning strategies). Yet we show that model-reuse attacks are insensitive to various system design choices and tuning strategies. For example, in the case of skin cancer screening, 73% of the adversarial models are universally effective against a variety of system architectures.
  • Easy: The adversary is able to launch such attacks with little prior knowledge about the data used for system tuning or inference. Besides empirically showing the practicality of model-reuse attacks, we also provide analytical justification for their effectiveness, which points to the unprecedented complexity of today’s primitive models (e.g., millions of parameters in deep neural networks). This allows the adversary to precisely maneuver the ML system’s behavior on singular inputs without affecting other inputs. This analysis also leads to the conclusion that the security risks of adversarial models are likely fundamental to many ML systems.

Moreover, the attack succeeds even though:

  • (i) the compromised model is only one component of the end-to-end ML system;
  • (ii) the adversary has neither knowledge nor control over the system design choices or fine-tuning strategies;
  • and (iii) the adversary has no influence over inputs to the ML system.

Parameters varied in the attack experiments

Parameters. We consider a variety of scenarios by varying the following parameters.

  • (1) θ - the parameter selection threshold,
  • (2) λ - the perturbation magnitude,
  • (3) n_tuning - the number of fine-tuning epochs,
  • (4) partial-system tuning or full-system tuning,
  • (5) n_trigger - the number of embedded triggers,
  • (6) l - the perturbation layer, and
  • (7) g - the classifier (or regressor) design.
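
For reference, these knobs could be collected in a single configuration object as sketched below. Only θ = 0.95 comes from the text (the “proper setting” in the skin cancer case study further down); every other value is a placeholder, not a setting reported in the paper.

```python
# Hypothetical experiment configuration; all values except theta are placeholders.
attack_params = {
    "theta": 0.95,               # (1) parameter selection threshold
    "lambda": 0.01,              # (2) perturbation magnitude (placeholder)
    "n_tuning": 100,             # (3) number of fine-tuning epochs (placeholder)
    "full_system_tuning": True,  # (4) partial- vs. full-system tuning
    "n_trigger": 1,              # (5) number of embedded triggers (placeholder)
    "layer": "penultimate",      # (6) perturbation layer l (placeholder)
    "classifier": "mlp",         # (7) classifier/regressor g design (placeholder)
}
```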

Two metrics for attack effectiveness

  • (i) Attack success rate, which quantifies the likelihood that the host system is triggered to misclassify the targeted input x− to the class “+” designated by the adversary:
  • (ii) Misclassification confidence, which is the probability of the class “+” predicted by the host system with respect to x−. In the case of DNNs, it is typically computed as the probability assigned to “+” by the softmax function in the last layer.

Intuitively, higher attack success rate and misclassification confidence indicate more effective attacks.
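
The two metrics can be computed as below (a minimal sketch; the function names and the assumption that the host system exposes per-class softmax probabilities are illustrative).

```python
import numpy as np

def attack_success_rate(predicted_classes, target_class):
    # fraction of targeted inputs x- that the host system classifies as "+"
    return float(np.mean(np.asarray(predicted_classes) == target_class))

def misclassification_confidence(softmax_probs, target_class):
    # average softmax probability that the host system assigns to "+" on x-
    return float(np.mean(np.asarray(softmax_probs)[:, target_class]))
```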

Results

  • Effectiveness – In all three cases, under proper parameter settings, model-reuse attacks are able to trigger the host ML systems to misclassify the targeted inputs with a success rate above 96% and misclassification confidence above 0.865, even after intensive full-system tuning (e.g., 500 epochs).
  • Evasiveness – The adversarial models and their genuine counterparts are fairly indiscernible. In all the cases, the accuracy of the systems built upon genuine and adversarial models differs by less than 0.2% and 0.6% in the source and target domains respectively. Due to the inherent randomness of DNN training (e.g., random initialization, stochastic descent, and dropout), training or tuning the same DNN model, even on the same training set, may result in slightly different models each time. Thus, a difference of such magnitude could easily be attributed to randomness.
  • Elasticity – Model-reuse attacks are insensitive to various system design choices and fine-tuning strategies. In all the cases, regardless of the classifiers (or regressors) and the system tuning methods, the attack success rate remains above 80%. Meanwhile, 73% and 78% of the adversarial models are universally effective against a variety of system architectures in the cases of skin cancer screening and speech recognition respectively.

Evaluation

  • The four case studies are skin cancer screening, speech recognition, face verification, and autonomous steering.

Skin cancer screening

  • The system is built on the Inception.v3 model as its feature extractor.
  • It attains 72.1% classification accuracy, whereas the two human dermatologists in the study attained 65.56% and 66.0% accuracy respectively.
  • Table 2 summarizes the influence of the parameter selection threshold θ on the attack effectiveness and evasiveness. Observe that under a proper setting (e.g., θ = 0.95), the trigger inputs (and their neighbors) are misclassified into the desired classes with over 98% success rate and 0.883 confidence. However, when θ = 0.99, the attack success rate drops sharply to 76%. This can be explained by the fact that, with an overly large θ, Algorithm 2 cannot find a sufficient number of parameters to perturb. Meanwhile, the attack evasiveness increases monotonically with θ. For instance, on the ISIC dataset, the accuracy gap between the adversarial models and the genuine models shrinks from 1.2% to 0.2% as θ increases from 0.65 to 0.99.

The attack is hard to defend against

  • For primitive models from reputable sources, the first task is to verify their authenticity, e.g., via digital signatures. However, many primitive models contain hundreds of millions of parameters and are gigabytes in size, and storing or transferring the same model on different platforms can produce entirely different files, so platform-specific versions of each primitive model would need to be verified (a minimal digest-check sketch is given after this list).
  • Another option is to vet the integrity of the model itself, but the high dimensionality of the feature space makes an exhaustive search infeasible.
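
As a concrete baseline for the first point, a digest check is sketched below; per the discussion above, it only helps if the provider publishes a digest for each platform-specific serialization of the model. The function name and chunk size are illustrative.

```python
import hashlib

def verify_model_digest(model_path, expected_sha256, chunk_size=1 << 20):
    """Stream the (possibly multi-gigabyte) model file and compare its SHA-256
    digest against the one published by the model's provider."""
    digest = hashlib.sha256()
    with open(model_path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_sha256
```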

Terminology

  • maliciously crafted primitive models (“adversarial models”)


Reposted from blog.csdn.net/AcSuccess/article/details/102907227