Fragment Notes | Large Model Attack and Defense Briefing

Foreword : Different from traditional AI attack and defense (backdoor attacks, adversarial samples, poisoning attacks, etc.), today's large model attack and defense involves the following aspects:

Insert image description here


1. Credibility issues of large models

1.1 False content generation

Large models may generate and spread false content. This phenomenon is called the hallucination problem of language models. It refers to the situation where the content generated by the model is inconsistent with the real world or is meaningless. This situation is mainly due to the fact that the language model lacks real-world knowledge and the meaning of language, making it difficult for the model to understand and express real-world concepts and information. This situation is prevalent in modern natural language processing, especially in problems in the open-ended generative domain. The harm is to induce and manipulate users' opinions and behaviors.

Hallucination problems in language models can be divided into two categories: intrinsic hallucination and extrinsic hallucination . Internal illusion refers to the inconsistency between the output content and the source input content, such as incorrect year information, name information, etc.; external illusion refers to information that cannot be judged to be correct through the source information, and there is no way to support or deny it. But external illusions can sometimes be beneficial because they are based on correct external knowledge and can enrich the information that generates results. But in most cases external illusions still need to be treated with caution because they increase information uncertainty from a factual security perspective.

Causes: (1) Irregular training data; (2) Exposure bias problem: the decoding difference between training and inference processes, that is, the decoder is trained based on facts during training, but the decoder can only learn from its own inference history during inference Zhonglai generates further, so as the generation sequence becomes longer, the hallucination becomes more severe.

Solutions:
(1) Select high-quality data sets for training and clean up the noise in the data sets.
(2) By improving the encoder structure, the feature extraction results are optimized and hallucinations are reduced.
(3) Measure credible output of large models. Similar to the confidence of general models, large models can add a feasibility assessment of the output content during the training process, and provide the confidence to users as a reference at the same time.
(4) Use controllable text generation methods to control the degree of illusion to meet the needs of different real-world applications. In conversational and abstract summarization tasks, hallucination questions are not necessarily all negative questions.
(5) Reduce self-contradiction problems in generating long sentences.

It is worth mentioning that in the field of vision-language cross-modal generation (Vision-Language Generation), research on the illusion problem is still in a very early stage. Currently, relevant research is mainly carried out on image description scenes, as shown in the figure below to generate text The object in does not appear in the input image.

Figure 1.1 Illusion problem in image description scene
First, there is a lack of empirical and theoretical analysis on the occurrence of hallucination phenomena in many tasks such as visual storytelling, visual common sense reasoning, and video subtitles. Second, more effective evaluation metrics are needed. Although CHAIR can automatically evaluate the degree of object hallucination in image captions, it requires a predefined list of object categories and does not generalize well. Furthermore, there is currently no automatic measurement method for the problem of hallucinations in other tasks such as open-ended visual question answering. Finally, how to complete the controlled generation of text based on existing content is an important research direction in mitigating visual-linguistic illusions.

references

  1. Survey of Hallucination in Natural Language Generation (ACM Computing Surveys, 2023)

  2. Object Hallucination in Image Captioning (EMNLP, 2018) UC Berkeley & Boston University
    代码:https://github.com/LisaAnne/Hallucination

  3. On Hallucination and Predictive Uncertainty in Conditional Language Generation (EACL, 2021) University of California, Santa Barbara

  4. Let there be a clock on the beach:Reducing Object Hallucination in Image Captioning (WACV, 2022) Computer Vision Center, UAB, Spain
    代码:https://github.com/furkanbiten/object-bias/tree/main

  5. Plausible May Not Be Faithful: Probing Object Hallucination in Vision-Language Pre-training (EACL, 2023) The Hong Kong University of Science and Technology
    代码:https://github.com/wenliangdai/VLP-Object-Hallucination

  6. Deconfounded Image Captioning: A Causal Retrospect (TPAMI, 2021)

1.2 Privacy leakage

Attack : There are two types of privacy leaks caused by large models:

(1) Explicit privacy leakage: Large models use user instructions as training data and inadvertently convert the training data into generated content, and these training data may contain sensitive user information. The large model will store the contents of the dialog box, including but not limited to user personal information such as name, email account, etc.

(2) Implicit privacy leakage: By collecting the contents of dialog boxes, the large model can infer potentially sensitive information such as user preferences, interests, behaviors, etc., and make accurate advertising recommendations based on this.

Defense : Privacy protection for input and output data


2. Security issues of large models

Generative large models such as ChatGPT are essentially large-scale models based on deep learning. They also face many threats to artificial intelligence security, including model theft, and various traditional attacks (adversarial sample attacks, backdoor attacks, prompt attacks, data injection poison, etc.) to cause output errors.

2.1 Model stealing attack

Attack : Model stealing refers to the attacker relying on a limited number of model queries to obtain a local model that has the same functions and effects as the target model. The attacker attempts to restore the model's design and parameters by analyzing the model's input, output, and internal structure. This may lead to the leakage of model intellectual property and pose security risks.

Defense : In order to prevent model theft, the following technologies can be used to protect model parameters:
(1) Model encryption: Encrypt the parameters of the model.
(2) Model watermark: Trace and verify large models to ensure their origin and legality.
(3) Model integration: By integrating multiple models together, the robustness and security of the model can be improved. Ensemble learning techniques can improve model performance and security by combining the predictions of multiple models.
(4) Model distillation: Reduce the size of the model. Small models are more tolerant of noise and disturbances.
(5) Access control: Ensure the security of large models during deployment and use, including access control, identity authentication, rights management, and data protection. This helps prevent unauthorized access and misuse.

2.2 Data theft attack

Attack : Large models usually need to process a large amount of sensitive data. Attackers may try to obtain the distribution of data used in the training process by accessing the model or intercepting the input and output of the model, thereby obtaining sensitive information [1].
Defense : (1) Set up a corresponding mechanism to determine whether the user is conducting a query for the purpose of stealing. (2) Encrypt and upload user sensitive information.

2.3 Prompt word attack

The construction of prompt enables the pre-trained large model to output results that are more consistent with human language and understanding, but different prompt templates may still cause some security issues and privacy issues. Prompt words are constantly mentioned as a medium for human interaction with large language models. Prompt word attack is a new type of attack method, including prompt word injection, prompt word leakage and prompt word jailbreaking. These attack methods may cause the model to generate inappropriate content, leak sensitive information, etc.

  • Prompt word injection : Adding malicious or unintended content to prompts to hijack the output of a language model. Tip leaks and jailbreaks are actually subsets of this attack;
  • Prompt word leakage : Extract sensitive or confidential information from LLM’s response;
  • Tip Word Jailbreaking : Bypassing security and censorship features.

For a detailed introduction to "Prompt word attack", please see the blog: Large model attack and defense | Prompt word attack __Meilinger_'s blog - CSDN blog

2.4 Against sample attacks

By making minor modifications to input samples, attackers are able to trick the model, leading to erroneous predictions. This can have a negative impact on the reliability and security of the model.

2.5 Backdoor attack

Attackers insert backdoors into the model, causing it to produce incorrect output results or leak sensitive information under certain conditions. This can lead to models being misused or controlled by attackers.

2.6 Data Poisoning

……

3. Covert communication based on large models

Due to the large scale of training data, large language models have natural advantages in covert communication - they can more reasonably simulate the real data distribution and improve the statistical imperceptibility of generated cipher text to a certain extent. Attackers use large models to generate fluent cipher text and transmit it in public channels. At present, cross-modal steganography is gradually attracting the attention of researchers, and it is worth trying to combine large models to complete cross-modal steganography.

The development of text steganography is as follows:
Insert image description here
Insert image description here
According to research, there is currently no relevant research work on large model text steganography. In addition, steganalysis algorithms for generative large model steganography also need to be proposed.

References

  1. Paper Study|A review of the development of generative text steganography
  2. Paper Study|A review of the development of generative cross-modal steganography

4. Property rights issues of large models

Problem : The copyright ownership of large-scale model-generated works is currently unclear.

Measures :
(1) In the training process of large models, in addition to the original input itself, data sources and property rights information also need to be used as training data. This will make it possible to accurately query whether certain property rights are involved when using large models for creative tasks, and citations and payments are required. The implementation of this function will greatly increase the value of data, avoid property rights disputes, and also allow ChatGPT to better assist scientific research and creation.
(2) Use blockchain technology to record and protect the copyright of data sources. The use of blockchain technology also facilitates traceability analysis in the subsequent handling of property rights disputes.
(3) Use electronic watermark technology to protect the copyright of data sources and the copyright of practical models.


5. Ethical issues of large models

5.1 Ideology

5.2 Prejudice and discrimination

5.3 Political struggle

5.4 Employment equity

5.5 Information cocoon room

In view of the ethical issues existing in large models, it is necessary to establish a detection mechanism for various types of information, set up a real-time supervision system, and record violations of large models.


Postscript : The above are some common contents of large model attack and defense. Personally, I feel that the main difference between large model attack and defense and traditional AI attack and defense is the difference in degree - because large models are widely used in various scenarios, their impact on human society is naturally greater than ordinary ones. Model, for this reason, the research on attack and defense of large models is quite critical and needs to be carried out urgently.

References

  1. 2023 Generative Large Model Security and Privacy White Paper, Zhijiang Laboratory, 2023.

Guess you like

Origin blog.csdn.net/qq_36332660/article/details/132865932