adversarial attack
motivation
We train many neural networks and want to deploy them in the real world. Are these networks robust to inputs deliberately constructed to fool them? Robustness matters for spam classification, malware detection, network intrusion detection, and similar applications.
1. adversarial attack
Example: attacking an image classifier.
Compute the loss for the two kinds of attack: a non-targeted attack pushes the output away from the true label, while a targeted attack additionally pulls it toward a chosen target label.
To measure the distance between the original and the attacked picture, two norms are commonly used: the L2 norm and the L-infinity norm. Either way, the distance must stay small so the perturbation is imperceptible (see the formulas below).
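A minimal formulation of both attack objectives and the distance constraint, in standard notation (here $e$ is the cross-entropy error and $\hat{y} = f(x')$ is the network output):

```latex
% non-targeted attack: push the output away from the true label y
L(x') = -\,e(y, \hat{y})

% targeted attack: also pull the output toward a chosen target label y^{tgt}
L(x') = -\,e(y, \hat{y}) + e(y^{tgt}, \hat{y})

% optimize under an imperceptibility constraint
x^{*} = \arg\min_{d(x, x') \le \varepsilon} L(x'), \qquad
d(x, x') = \|x - x'\|_{2} \ \text{or} \ \|x - x'\|_{\infty}
```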
The perturbation direction comes from the sign function: if a gradient component is greater than 0, it outputs 1; if less than 0, it outputs -1. This is the one-step FGSM update, sketched below.
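A minimal sketch of this sign-of-gradient update (FGSM) in PyTorch; `model` is an assumed image classifier and `epsilon` bounds the L-infinity perturbation:

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=8 / 255):
    """One-step FGSM: move x along sign(gradient) to increase the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)
    loss.backward()
    # sign() is exactly the function described above: +1 or -1 per component
    x_adv = x + epsilon * x.grad.sign()
    # keep the result a valid image; the step size already respects the L-inf ball
    return x_adv.clamp(0, 1).detach()
```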
In the previous attack we knew the network parameters; this is called a white-box attack. Most online APIs do not expose model parameters. Are we safe if we don't release models? No, because black-box attacks are still possible.
If you have the training data of the target network, train a proxy network yourself and use the proxy network to generate the adversarial examples. Black-box attacks succeed surprisingly often.
One-pixel attack: change only a single pixel of the image.
Universal adversarial attack: the same noise in the same place can attack many different pictures.
Attacks on speech and NLP:
real-world attack:
The attacker needs to find perturbations that generalize beyond a single image.
A camera cannot accurately capture extreme differences between adjacent pixels, so the perturbation should avoid them.
It is desirable for the perturbation to consist mainly of colors that a printer can reproduce.
Adversarial reprogramming (a "parasitic" attack): the attacker hijacks the target model to perform a task it was never built for.
Attack during training (backdoor attack): poison the training data so the model misbehaves on inputs that contain the attacker's trigger.
defense
Defenses fall into two families: passive defense and proactive defense.
Passive defense: lightly blur (smooth) the picture before feeding it to the model.
Other passive defenses: image compression, and reconstructing the input with a generator.
A fixed blur is easy to crack once the attacker knows about it. If the preprocessing is randomized (changed arbitrarily each time), it becomes much harder to bypass.
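A minimal sketch of a randomized passive defense (random resizing plus random padding is one common scheme; the sizes and names here are my own assumptions):

```python
import random
import torch.nn.functional as F

def randomized_defense(model, x, out_size=224):
    """Randomly resize, then randomly pad, before classifying."""
    new_size = random.randint(int(out_size * 0.9), out_size)
    x = F.interpolate(x, size=(new_size, new_size),
                      mode="bilinear", align_corners=False)
    pad = out_size - new_size
    left, top = random.randint(0, pad), random.randint(0, pad)
    x = F.pad(x, (left, pad - left, top, pad - top))  # (left, right, top, bottom)
    return model(x)
```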
Proactive defense: find the loopholes and fill them in advance, i.e., adversarial training:
Problem: adversarial training may not block new, unseen attacks, and it consumes a lot of computing resources.
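A minimal adversarial-training sketch that reuses the `fgsm_attack` helper from the attack section; training on the clean batch plus fresh adversarial examples is one common recipe among several:

```python
import torch
import torch.nn.functional as F

def adversarial_training_step(model, optimizer, x, y, epsilon=8 / 255):
    """One step: generate adversarial examples on the fly and train on them too."""
    x_adv = fgsm_attack(model, x, y, epsilon)   # find the loophole...
    batch_x = torch.cat([x, x_adv])             # ...and fill it with training data
    batch_y = torch.cat([y, y])
    optimizer.zero_grad()
    loss = F.cross_entropy(model(batch_x), batch_y)
    loss.backward()
    optimizer.step()
    return loss.item()
```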
2. evasion attacks
What should an effective adversarial example satisfy?
High correlation with the attack target
Overlap between the original and the perturbed samples
Syntacticity (grammatical well-formedness) of the perturbed samples
Semantic preservation
Fluency is scored by the perplexity of a pre-trained language model (lower PPL means more fluent).
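A minimal perplexity sketch with GPT-2 from Hugging Face transformers (the model choice is an assumption; any causal language model works the same way):

```python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def perplexity(text):
    """PPL = exp(mean negative log-likelihood of the tokens)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean token cross-entropy
    return torch.exp(loss).item()

# a fluent perturbed sample should keep PPL close to the original's
print(perplexity("The movie was great."))
```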
Semantic similarity between the perturbed and original samples
Distance between the swapped word embeddings and the original word embeddings (sketched below)
How to choose this threshold?
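A minimal sketch of the embedding-distance check via cosine similarity (the 0.8 threshold is purely illustrative; in practice it must be tuned, which is exactly the question above):

```python
import torch.nn.functional as F

def is_valid_swap(orig_vec, swap_vec, threshold=0.8):
    """Accept a word swap only if the two embeddings are close enough."""
    sim = F.cosine_similarity(orig_vec.unsqueeze(0), swap_vec.unsqueeze(0)).item()
    return sim >= threshold
```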
3. Search method
Evasion attacks: search methods. Find perturbations that achieve the goal and satisfy the constraints. Common search methods:
greedy search
greedy search with word importance ranking (WIR)
genetic algorithm
greedy search
Score each candidate substitution at each position, then apply replacements in order of decreasing score until the prediction flips (sketched below).
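A minimal greedy-search sketch; `predict` (returns class probabilities) and `candidates` (returns synonym replacements) are hypothetical helpers standing in for the victim model and the transformation:

```python
def greedy_attack(words, predict, candidates, true_label):
    """Apply the highest-scoring substitutions until the prediction flips."""
    scored = []
    for i, w in enumerate(words):
        for cand in candidates(w):
            swapped = words[:i] + [cand] + words[i + 1:]
            drop = predict(words)[true_label] - predict(swapped)[true_label]
            scored.append((drop, i, cand))
    used = set()
    for drop, i, cand in sorted(scored, reverse=True):
        if i in used:               # one substitution per position
            continue
        used.add(i)
        words = words[:i] + [cand] + words[i + 1:]
        if predict(words).argmax() != true_label:
            return words            # prediction flipped: attack succeeded
    return None
```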
Greedy Search with Word Importance Ranking (WIR)
Step 1: score the importance of each word. Step 2: swap words from the most to the least important.
Word importance is ranked by leave-one-out (LOO): see how much the predicted probability decreases when each word is removed from the input (sketched below).
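A minimal LOO importance sketch, with the same hypothetical `predict` helper as above:

```python
def loo_importance(words, predict, true_label):
    """Rank positions by how much the probability drops when the word is removed."""
    base = predict(words)[true_label]
    drops = []
    for i in range(len(words)):
        reduced = words[:i] + words[i + 1:]
        drops.append((base - predict(reduced)[true_label], i))
    # largest drop first: these are the most important words to swap
    return [i for _, i in sorted(drops, reverse=True)]
```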
genetic algorithm
Genetic algorithm: evolve a population of perturbed samples and select them based on fitness (sketched below).
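A minimal genetic-algorithm sketch; the mutation, crossover, and fitness choices differ across papers, so everything here (names included) is an illustrative assumption:

```python
import random

def genetic_attack(words, predict, candidates, target_label,
                   pop_size=20, generations=30):
    """Evolve perturbed sentences; fitness = probability of the target class."""
    def mutate(w):
        i = random.randrange(len(w))
        return w[:i] + [random.choice(candidates(w[i]))] + w[i + 1:]

    def crossover(a, b):            # child takes each word from either parent
        return [random.choice(pair) for pair in zip(a, b)]

    population = [mutate(list(words)) for _ in range(pop_size)]
    for _ in range(generations):
        fitness = [predict(p)[target_label] for p in population]
        best = population[fitness.index(max(fitness))]
        if predict(best).argmax() == target_label:
            return best             # targeted attack succeeded
        parents = random.choices(population, weights=fitness, k=2 * pop_size)
        population = [mutate(crossover(parents[2 * i], parents[2 * i + 1]))
                      for i in range(pop_size)]
    return None
```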
TextFooler
PWWS
other
TF-Adjusted: a modified version of TextFooler with stronger constraints
Word replacement by changing the inflected form of verbs, nouns, and adjectives
universal trigger
Trigger strings that are unrelated to the task, but when added to the original input, carry out a targeted attack.
Step 1: determine how many trigger tokens are needed and initialize them with placeholder words.
Step 2: take the gradient of the loss with respect to the trigger word embeddings, and find the vocabulary tokens that minimize the objective
$$\arg\min_{e'_i \in \mathcal{V}} \left(e'_i - e_{adv}\right)^{\top} \nabla_{e_{adv}} \mathcal{L}$$
Step 3: update the trigger with the newly found tokens and repeat (a sketch of the step-2 search follows the example below).
Example: for a fake-news classifier, when the trigger "%@" is in the input, the model classifies the input as "not fake news".
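A minimal sketch of the step-2 token search (a HotFlip-style first-order approximation; the variable names and the top-k shortlist are my own assumptions):

```python
import torch

def best_replacement_tokens(grad_at_trigger, embedding_matrix, k=5):
    """First-order search over the vocabulary for one trigger position.

    grad_at_trigger: gradient of the loss w.r.t. one trigger embedding, shape (dim,)
    embedding_matrix: all vocabulary embeddings, shape (vocab_size, dim)
    """
    # In (e_i' - e_adv)^T grad, the e_adv term is the same for every candidate,
    # so minimizing it reduces to minimizing e_i'^T grad over the vocabulary.
    scores = embedding_matrix @ grad_at_trigger
    return torch.topk(-scores, k).indices  # the k tokens that most decrease the loss
```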