Machine Learning: Adversarial Attacks on Natural Language Processing

Attacks in NLP


Related topics


Introduction

Earlier adversarial-attack work focused on images and speech, and NLP received much less attention. Part of what makes NLP harder is the vocabulary: the input is a sequence of discrete tokens rather than continuous pixel values.
Unlike images, continuous noise cannot be added directly to the discrete text input; perturbations are instead applied to the embedding features (or realized as discrete word/character edits).
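A minimal sketch of this idea (PyTorch; the embedding layer and token ids below are toy placeholders, not from any specific model): the discrete token ids themselves cannot be perturbed continuously, so the noise is added to their embedding vectors.

```python
import torch

vocab_size, embed_dim = 10000, 128
embedding = torch.nn.Embedding(vocab_size, embed_dim)

token_ids = torch.tensor([[12, 845, 77, 3021]])   # a toy "sentence" of discrete token ids
clean_embeds = embedding(token_ids)               # shape (1, 4, 128): continuous vectors

epsilon = 0.01
noise = epsilon * torch.randn_like(clean_embeds)  # continuous perturbation in embedding space
perturbed_embeds = clean_embeds + noise           # this is what the downstream model would see
```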

Evasion Attacks

After changing the word "film" to "films" in a movie review, the predicted sentiment flipped from negative to positive.
In syntactic (structure) analysis, changing a single word can produce a completely different parse.
The model is very fragile, so it is worth asking whether there are ways to make it more robust.

Imitation Attacks

Synonym replacement

Find words whose vectors are close in the embedding space and use them as replacements.

Use k-nearest neighbors (KNN) in the embedding space to narrow down the candidate replacements.
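A sketch of this nearest-neighbour search, assuming a pre-trained embedding matrix `emb` (one row per word) and `word_to_id` / `id_to_word` lookups have already been loaded; the names are illustrative.

```python
import numpy as np

def nearest_words(word, emb, word_to_id, id_to_word, k=5):
    """Return the k vocabulary words closest to `word` by cosine similarity."""
    v = emb[word_to_id[word]]
    sims = emb @ v / (np.linalg.norm(emb, axis=1) * np.linalg.norm(v) + 1e-8)
    best = np.argsort(-sims)[1:k + 1]   # index 0 is the word itself, skip it
    return [id_to_word[i] for i in best]
```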


Use a large pre-trained language model to predict substitute words.
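One common way to do this is masked-word prediction. A sketch, assuming the Hugging Face transformers package and the bert-base-uncased checkpoint are available:

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")

# Mask the word to be replaced and let the language model propose substitutes.
for cand in fill_mask("The acting was [MASK] and the story dragged on."):
    print(cand["token_str"], round(cand["score"], 3))
```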
Use the gradient of the loss with respect to the embeddings to choose word substitutions.
Rank candidate words by how much they increase the loss, then take the top-k that maximize it.
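A sketch of this gradient-guided ranking (a HotFlip-style first-order approximation; the tensor names are illustrative): `grad_at_pos` is the gradient of the loss with respect to the embedding at the position being attacked, and swapping in word w changes the loss by roughly (e_w - e_current) · grad.

```python
import torch

def rank_candidates(grad_at_pos, emb_matrix, current_vec, k=10):
    """Rank vocabulary words by the first-order estimate of the loss increase."""
    scores = (emb_matrix - current_vec) @ grad_at_pos   # one score per vocabulary word
    return torch.topk(scores, k).indices                # top-k words that raise the loss most
```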
Character-level edits: substitution, swapping, deletion, and insertion.
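A toy helper illustrating these four character-level edits (not taken from any particular paper's code):

```python
import random
import string

def char_edit(word, op):
    """Apply one character-level edit at a random position."""
    if len(word) < 2:
        return word
    i = random.randrange(len(word) - 1)
    if op == "substitute":
        return word[:i] + random.choice(string.ascii_lowercase) + word[i + 1:]
    if op == "swap":
        return word[:i] + word[i + 1] + word[i] + word[i + 2:]
    if op == "delete":
        return word[:i] + word[i + 1:]
    if op == "insert":
        return word[:i] + random.choice(string.ascii_lowercase) + word[i:]
    return word
```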


Motivation


Example of Attack

Adding a small amount of noise to the input can make the classifier misclassify it.
By designing the loss function appropriately, both untargeted and targeted attacks can be carried out.
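A sketch of the two loss designs (PyTorch-style; `logits`, `y_true` and `y_target` are placeholders): the attacker minimizes this loss when optimizing the perturbation, so the untargeted version pushes the prediction away from the true class, and the targeted version additionally pulls it toward a chosen class.

```python
import torch.nn.functional as F

def attack_loss(logits, y_true, y_target=None):
    """Loss minimized by the attacker while searching for the perturbation."""
    if y_target is None:
        return -F.cross_entropy(logits, y_true)   # untargeted: move away from the true class
    # targeted: move away from the true class and toward the target class
    return -F.cross_entropy(logits, y_true) + F.cross_entropy(logits, y_target)
```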
Under the L2 norm, changing one pixel by a large amount can give the same distance as changing every pixel by a small amount, even though the latter is far less noticeable to a human; the L-infinity norm, which measures only the largest single change, distinguishes the two cases.
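A quick numerical illustration (NumPy): a perturbation spread thinly over 16 pixels and a perturbation concentrated in a single pixel have the same L2 norm, but the L-infinity norm separates them.

```python
import numpy as np

n = 16
spread = np.full(n, 0.25)      # every pixel changed a little
single = np.zeros(n)
single[0] = 1.0                # one pixel changed a lot

print(np.linalg.norm(spread), np.linalg.norm(single))   # L2 norm: 1.0 and 1.0
print(np.abs(spread).max(), np.abs(single).max())       # L-inf norm: 0.25 vs 1.0
```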

Black-box Attacks

How do you attack when you cannot see the model's parameters? This is a black-box attack: attack a proxy model that you train yourself and hope the adversarial example transfers.
Ensemble attack: craft the adversarial example against several proxy models jointly (the diagonal of the results table is the case where the proxy and the attacked model are the same).
The dark blue region is the range of inputs that are still classified correctly; to attack, simply push the input outside this region.
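A sketch of the transfer idea behind black-box attacks (PyTorch FGSM; `proxy_model` and `target_model` are placeholders for networks you would supply): craft the adversarial example white-box against a proxy you trained yourself, then send it to the unseen target and hope it transfers.

```python
import torch.nn.functional as F

def fgsm_on_proxy(proxy_model, x, y, epsilon=0.03):
    """One-step FGSM against the proxy; inputs assumed to lie in [0, 1]."""
    x = x.clone().requires_grad_(True)
    F.cross_entropy(proxy_model(x), y).backward()
    return (x + epsilon * x.grad.sign()).clamp(0, 1).detach()

# x_adv = fgsm_on_proxy(proxy_model, x, y)
# target_model(x_adv)   # the black-box model we actually want to fool
```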

One pixel attack

Changing the value of a single pixel can be enough to make the classifier misclassify the image.

Universal adversarial attack

A single universal perturbation can be found that makes the classifier misclassify when it is added to many different images.

In addition to images, other fields can also be attacked, such as sound, NLP, etc.
After appending the red trigger text to the end of the passage, the question-answering system gives the same answer regardless of the question.

Attack in the Physical World

Wearing specially printed glasses causes the face-recognition system to identify the man on the left as the woman on the right.
An attack on license-plate and road-sign recognition systems.
By slightly lengthening the middle stroke of the "3", a 35 mph speed-limit sign was read by a Tesla as 85 mph, causing the car to accelerate.

The number of white squares in the injected pattern is made to correspond to different output categories, repurposing the classifier to count squares (adversarial reprogramming).

Open a backdoor in the model:
The attack starts in the training phase: the poisoned training data looks normal to the human eye, yet the trained model misclassifies only one specific target image and behaves normally on everything else.
Be careful with public image training sets, which may contain such poisoned (attack) images.

Defense


Passive defense

Once the model is trained, leave it untouched and place a "shield" (a preprocessing step) in front of it.
For example, slight blurring barely affects a normal image but largely destroys the adversarial perturbation; the side effect is that confidence on clean images drops slightly.

  • Image compression (a minimal preprocessing sketch follows this list)
  • Image generation: use a generative model to reconstruct the input image, filtering out the adversarial perturbation
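A minimal sketch of such a preprocessing shield (Pillow; the file names are illustrative), combining slight blurring with JPEG re-compression before the image reaches the classifier:

```python
from PIL import Image, ImageFilter

def shield(path):
    """Light preprocessing placed in front of the classifier."""
    img = Image.open(path).convert("RGB")
    img = img.filter(ImageFilter.GaussianBlur(radius=1))   # slight smoothing
    img.save("shielded.jpg", quality=75)                   # lossy re-compression
    return Image.open("shielded.jpg")

# img = shield("input.png")   # the classifier then sees the shielded image
```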

If the attacker knows your passive defense, they can adapt their attack and break through it; for example, the blurring step can simply be treated as the first layer of the network and attacked together with it.

When defending, introduce randomness and combine several different defenses so the attacker cannot know which defense is actually in place.

Active defense

Train a robust model that is not easily broken.

Adversarial training: create a new training set by attacking every sample while keeping the correct labels, then train on the original and adversarial data together.
Whenever new adversarial examples are found, add them to the training set and keep training.
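A sketch of one adversarial-training step (PyTorch, with FGSM-generated examples; `model`, `optimizer` and the batch tensors are assumed to exist): each batch is attacked while keeping its correct labels, and the clean and adversarial versions are trained on together.

```python
import torch.nn.functional as F

def adversarial_training_step(model, x, y, optimizer, epsilon=0.03):
    # 1) attack every sample in the batch, keeping the correct labels
    x_adv = x.clone().requires_grad_(True)
    F.cross_entropy(model(x_adv), y).backward()
    x_adv = (x_adv + epsilon * x_adv.grad.sign()).clamp(0, 1).detach()

    # 2) train on the clean and adversarial data together
    optimizer.zero_grad()
    loss = F.cross_entropy(model(x), y) + F.cross_entropy(model(x_adv), y)
    loss.backward()
    optimizer.step()
    return loss.item()
```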
However, it cannot reliably block new kinds of attacks and can still be broken; it also requires repeated retraining and substantial computing resources.

Someone has invented a method that achieves adversarial training "for free", without requiring additional computing resources.

Summary

Both attack and defense methods are evolving.


Origin blog.csdn.net/uncle_ll/article/details/132656667