[Study notes] Adversarial attacks and evasion attacks


motivation

We train a lot of neural networks and are trying to deploy them in the real world. Are these networks robust to inputs deliberately constructed to fool them? Robustness matters for spam classification, malware detection, network intrusion detection, and so on.

1. Adversarial attack

Setting: attacking an image classifier. Attacks can be non-targeted (the model just has to output any wrong class) or targeted (the model must output a specific chosen class).
Loss functions for the two kinds of attack: a non-targeted attack pushes the output away from the true class, while a targeted attack additionally pulls the output toward the chosen target class.
To measure the distance between the original picture and the attacked picture, two norms of the perturbation are commonly used, and the attack must keep this distance small (ideally imperceptible to a human).
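As a concrete illustration (not from the lecture itself), a minimal NumPy sketch of the two usual choices, the L2 norm and the L-infinity norm of the perturbation:

```python
import numpy as np

def perturbation_distances(x, x_adv):
    """Measure how far an adversarial image x_adv is from the original x."""
    delta = (x_adv - x).ravel()
    l2 = np.sqrt(np.sum(delta ** 2))   # L2 norm: overall energy of the perturbation
    linf = np.max(np.abs(delta))       # L-infinity norm: largest single-pixel change
    return l2, linf

# Example: a tiny 2x2 "image" with a small perturbation on every pixel
x = np.array([[0.1, 0.2], [0.3, 0.4]])
x_adv = x + 0.01
print(perturbation_distances(x, x_adv))   # L2 = 0.02, L-infinity = 0.01
```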

The sign function maps values greater than 0 to 1 and values less than 0 to -1.
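Taking the sign of the gradient is exactly what the one-step Fast Gradient Sign Method (FGSM) does. A minimal PyTorch sketch, assuming a classifier `model`, true labels `y`, and an illustrative budget `epsilon`:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=0.03):
    """One-step FGSM: move every pixel by +/- epsilon in the direction that increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # sign(): +1 where the gradient is positive, -1 where it is negative
    x_adv = x + epsilon * x.grad.sign()
    return x_adv.clamp(0, 1).detach()   # assumes pixel values normalized to [0, 1]
```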
In the previous attack we knew the network parameters; this is called a white-box attack. Most online APIs do not expose model parameters, so are we safe as long as we don't release the model? No, because black-box attacks are still possible: if you have the training data of the target network, you can train a proxy (surrogate) network yourself and use it to generate the adversarial examples. Such black-box attacks succeed surprisingly often.
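A hedged sketch of this transfer idea: craft the example on a proxy model you trained yourself, then simply feed it to the black-box target. `proxy_model`, `target_model`, and the FGSM step are stand-ins for whatever models and attack are actually used:

```python
import torch
import torch.nn.functional as F

def black_box_transfer_attack(proxy_model, target_model, x, y, epsilon=0.03):
    """Craft the adversarial example on the proxy (white-box access), then query the black-box target."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(proxy_model(x), y)   # gradients come from the proxy only
    loss.backward()
    x_adv = (x + epsilon * x.grad.sign()).clamp(0, 1).detach()

    with torch.no_grad():                       # the target is only queried, never differentiated
        pred = target_model(x_adv).argmax(dim=1)
    return x_adv, pred                          # the attack transferred if pred != y
```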
One-pixel attack: changing just a single pixel can be enough to fool the classifier.
Universal adversarial attack: the same perturbation, applied in the same place, can attack many different pictures.
Attacks on speech and NLP:
Real-world attacks:

The attacker needs to find a perturbation that works beyond a single image.
A camera cannot accurately capture extreme differences between adjacent pixels of the perturbation.
It is desirable to restrict the perturbation to colors that a printer can reproduce.
"Parasitic" attack (adversarial reprogramming), where the attacker repurposes the victim model for a different task:
Attacks during training (backdoor attacks via poisoned training data):

defense

Defenses fall into two categories: passive defense and proactive (active) defense.
Passive defense: simply apply slight blurring (smoothing) to the picture before classification.
Other passive defenses: compressing the image, or reconstructing it with a generator.
A fixed blurring step is easy to crack once the attacker knows about it; adding randomization (randomly transforming the input) makes the defense harder to bypass.
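A minimal sketch of such a passive defense, assuming square image batches; the 3x3 averaging kernel and the random resize-and-pad step are illustrative choices, not the lecture's exact recipe:

```python
import random
import torch
import torch.nn.functional as F

def defend_then_classify(model, x, randomize=True):
    """Passive defense: blur (and optionally randomly resize/pad) the input before the model sees it."""
    # Slight blurring with a 3x3 averaging filter, applied per channel (depthwise convolution)
    c = x.shape[1]
    kernel = torch.ones(c, 1, 3, 3) / 9.0
    x = F.conv2d(x, kernel, padding=1, groups=c)

    if randomize:
        # Randomly shrink the (assumed square) image, then pad back to the original size
        h = x.shape[-1]
        new_h = random.randint(int(0.9 * h), h)
        x = F.interpolate(x, size=(new_h, new_h), mode="bilinear", align_corners=False)
        pad = h - new_h
        x = F.pad(x, (0, pad, 0, pad))          # pad right and bottom with zeros
    return model(x)
```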
Proactive defense: adversarial training, i.e., find the model's loopholes and fill them by training on adversarial examples.
Problem: it may not block new (unseen) attack algorithms, and it requires a lot of extra computing resources.
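A hedged sketch of one adversarial-training step, using one-step FGSM to generate the adversarial batch (an assumption; stronger inner attacks such as PGD are also common):

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, epsilon=0.03):
    """One training step that mixes clean and adversarial examples ("find the loophole and fill it")."""
    # 1. Find the loophole: craft adversarial examples against the current model
    x_req = x.clone().detach().requires_grad_(True)
    F.cross_entropy(model(x_req), y).backward()
    x_adv = (x_req + epsilon * x_req.grad.sign()).clamp(0, 1).detach()

    # 2. Fill it: train on both the clean and the adversarial batch
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```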

2. Evasion attacks

What should an effective adversarial example satisfy?
High correlation with the attack target
Overlap between the original and perturbed samples
Syntacticity (grammaticality) of the perturbed samples
Semantic preservation

Fluency is scored by the perplexity of a pre-trained language model (lower PPL means higher fluency).
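One way to compute such a fluency score, assuming Hugging Face's GPT-2 as the pre-trained language model (the lecture may use a different LM):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(sentence):
    """Lower perplexity = the language model considers the sentence more fluent."""
    ids = tokenizer(sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        # With labels=input_ids the model returns the mean token-level cross-entropy
        loss = model(ids, labels=ids).loss
    return torch.exp(loss).item()

print(perplexity("The movie was surprisingly good."))
print(perplexity("The movie was surprisingly skillful."))  # an awkward swap tends to score worse (higher PPL)
```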
Semantic similarity between the perturbed and the original samples
Distance between the swapped word embeddings and the original word embeddings
How should the threshold on this distance be chosen?
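A minimal sketch of the embedding-distance constraint: accept a swap only if the cosine similarity between the original and replacement word vectors exceeds a threshold. The toy vectors and the 0.8 threshold below are illustrative assumptions:

```python
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def swap_is_allowed(original_vec, replacement_vec, threshold=0.8):
    """Constraint: only allow substitutions whose embeddings stay close to the original word."""
    return cosine_similarity(original_vec, replacement_vec) >= threshold

# Toy vectors standing in for real word embeddings (e.g. counter-fitted GloVe)
good = np.array([0.9, 0.1, 0.0])
great = np.array([0.85, 0.15, 0.05])
table = np.array([0.0, 0.2, 0.9])
print(swap_is_allowed(good, great))   # True: semantically close
print(swap_is_allowed(good, table))   # False: unrelated word
```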

3. Search method

Evasion attack search methods: find perturbations that achieve the goal and satisfy the constraints.
Greedy search
Greedy search with word importance ranking (WIR)
Genetic algorithm

greedy search

Score every candidate substitution at every position, then apply replacements in order of decreasing score until the model's prediction flips.
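A hedged sketch of that greedy loop. `candidate_swaps(words, i)`, `score(words)` (probability of the original label), and `predict(words)` are stand-ins for the transformation and victim model actually used:

```python
def greedy_search_attack(words, candidate_swaps, score, predict):
    """Score every substitution, then apply them in order of decreasing score until the prediction flips."""
    original_label = predict(words)
    base = score(words)                      # probability of the original label on the clean input

    # Score every (position, replacement) pair once
    moves = []
    for i in range(len(words)):
        for rep in candidate_swaps(words, i):
            perturbed = words[:i] + [rep] + words[i + 1:]
            moves.append((base - score(perturbed), i, rep))   # how much the original-label prob drops

    # Apply swaps in order of decreasing score drop
    current = list(words)
    for _, i, rep in sorted(moves, reverse=True):
        current[i] = rep
        if predict(current) != original_label:
            return current                   # attack succeeded
    return None                              # no successful adversarial example found
```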

Greedy Search with Word Importance Ranking (WIR)

Step 1: Score the importance of each word. Step 2: Try swaps in order from the most important word to the least important.
Word importance by leave-one-out (LOO): see how much the predicted probability drops when each word is removed from the input.
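A minimal sketch of the LOO importance ranking; `prob_of_label(words)` is a stand-in for the victim model's probability of the currently predicted label:

```python
def word_importance_loo(words, prob_of_label):
    """Rank words by how much the predicted-label probability drops when each one is removed."""
    base = prob_of_label(words)
    importance = [base - prob_of_label(words[:i] + words[i + 1:]) for i in range(len(words))]
    # Attack order: indices of the most important words first
    return sorted(range(len(words)), key=lambda i: importance[i], reverse=True)
```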

genetic algorithm

Genetic Algorithm: Evolution and Selection Based on Fitness
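A hedged sketch of a genetic-algorithm search. Population size, mutation, crossover, and the fitness function (here assumed to be the victim model's probability of a wrong label) are all illustrative choices:

```python
import random

def genetic_attack(words, candidate_swaps, fitness, generations=20, pop_size=30):
    """Evolve a population of perturbed sentences, keeping and recombining the fittest."""
    def mutate(sentence):
        i = random.randrange(len(sentence))
        options = candidate_swaps(sentence, i)
        if not options:
            return list(sentence)
        return sentence[:i] + [random.choice(options)] + sentence[i + 1:]

    def crossover(a, b):
        # Child takes each word from one of the two parents at random
        return [random.choice(pair) for pair in zip(a, b)]

    population = [mutate(list(words)) for _ in range(pop_size)]
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        if fitness(ranked[0]) > 0.5:          # a wrong label is now the most likely one: prediction flipped
            return ranked[0]
        parents = ranked[: pop_size // 2]     # selection: keep the fitter half
        population = [mutate(crossover(random.choice(parents), random.choice(parents)))
                      for _ in range(pop_size)]
    return max(population, key=fitness)       # best attempt even if no flip was found
```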

TextFooler


PWWS


other

TF-Adjusted: a modified version of TextFooler with stronger constraints.
Word replacement by changing the inflected form of verbs, nouns and adjectives

universal trigger

Universal triggers are strings that are unrelated to the task but, when added to the original input, cause a targeted misclassification.
Step 1: Decide how many tokens the trigger needs and initialize them with some placeholder words.
Step 2: Compute the gradient of the loss with respect to the trigger word embeddings, and find the vocabulary token that minimizes the first-order approximation of the loss: $\arg\min_{e'_i \in \mathcal{V}} (e'_i - e_i)^\top \nabla_{e_i} \mathcal{L}$ (a code sketch of this search appears after the example below).
Step 3: Update the trigger with the newly found token and repeat.
Example: attacking a fake-news classifier.
When the trigger “%@” is present in the input, the classifier labels the input as "not fake news".
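A hedged sketch of the Step 2 token search: linearize the loss around the current trigger embedding and pick the vocabulary token that minimizes the first-order approximation. The embedding matrix and gradient below are toy stand-ins for quantities taken from the victim model:

```python
import torch

def best_trigger_token(embedding_matrix, trigger_embedding, grad_at_trigger):
    """
    Pick the vocabulary token t minimizing (e_t - e_trigger)^T * grad, i.e. the
    first-order estimate of the loss if the current trigger token were replaced by t.
    embedding_matrix: (vocab_size, dim); trigger_embedding, grad_at_trigger: (dim,)
    """
    scores = (embedding_matrix - trigger_embedding) @ grad_at_trigger   # (vocab_size,)
    return int(torch.argmin(scores))

# Toy stand-ins: a 5-token vocabulary with 3-dimensional embeddings
vocab_emb = torch.randn(5, 3)
print(best_trigger_token(vocab_emb, vocab_emb[0], torch.randn(3)))
```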

Source: blog.csdn.net/Raphael9900/article/details/128490892