How can large language models be watermarked to reduce the risks of misuse?

Large language models, such as the recently developed ChatGPT, can compose documents, create executable code, and answer questions, often with human-like capabilities.


As these large models become more widely used, the risk that they will be misused grows. Potential abuses include automated bots running social engineering and election manipulation campaigns on social media platforms, fabricated news and web content, and the use of AI systems to cheat on academic writing and programming assignments, among others.


In addition, the flood of AI-generated data on the Internet complicates the construction of future training datasets: synthetic text is often of lower quality than human-written content, so researchers must detect and filter it out before training new models.


For all these reasons, detecting and auditing AI-generated text is key to reducing the potential harms of large models.


In response to this problem, a paper proposes a method for watermarking the output of large language models: embedding signals in the generated text that are invisible to humans but detectable by algorithms. The watermark can be embedded without retraining the language model, and it can be detected without access to the model's API or parameters.


The paper considers how to determine whether a piece of text was produced by a large model, and argues that watermarking is a promising detection scheme. A watermark here is a hidden pattern in the text that is imperceptible to humans but allows an algorithm to identify the text as machine-generated.


The paper proposes an efficient watermarking technique that can detect machine-generated text from short spans (as few as 25 tokens) with an extremely low probability of false positives (labeling human text as machine-generated).


The watermark detection algorithm can be made public so that third parties (such as social media platforms) can run it themselves, or it can be kept private and offered through an API.


[Figure: schematic of red-list/green-list watermarking, comparing unwatermarked and watermarked text]


For detection, the paper also proposes a statistical test with interpretable p-values, along with an information-theoretic framework for analyzing the watermark's sensitivity. The method is simple and novel, and it is backed by thorough theoretical analysis and solid experiments.


Given the serious challenges of detecting and generating text from large language models (LLMs), this research could have a significant impact on the machine learning community.


The paper proposes embedding the watermark token by token during generation. Given a prompt, when the model decodes the t-th token, the language model first computes the next-token probability distribution p(t) from the prompt and the first t-1 tokens. The watermarking scheme then hashes the (t-1)-th token to obtain a random seed, which is used to randomly partition the vocabulary into two parts: a "green list" and a "red list". The watermarked model samples mostly (or, in the hard variant, only) from the green list and tries not to generate red-list tokens, so the generated text carries a hidden pattern.
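The following is a minimal sketch of this idea, assuming a soft (logit-bias) rule and a simplified hash of the previous token; the function name `watermarked_logits` and the toy hashing scheme are illustrative, not the paper's reference implementation.

```python
import torch

def watermarked_logits(logits, prev_token_id, gamma=0.5, delta=2.0):
    """Bias next-token logits toward a pseudo-random "green list".

    logits: 1-D tensor of shape [vocab_size] for the next token.
    prev_token_id: the (t-1)-th token id, hashed to seed the RNG.
    gamma: fraction of the vocabulary placed in the green list.
    delta: logit bonus added to green-list tokens (soft rule).
    """
    vocab_size = logits.shape[-1]
    # Seed a generator from a simple hash of the previous token so that
    # the green/red partition can be reproduced at detection time.
    seed = int(prev_token_id) * 15485863 % (2**31)
    gen = torch.Generator().manual_seed(seed)
    perm = torch.randperm(vocab_size, generator=gen)
    green_list = perm[: int(gamma * vocab_size)]
    green_mask = torch.zeros(vocab_size, dtype=logits.dtype)
    green_mask[green_list] = 1.0
    # Soft rule: add delta to green tokens; a hard rule would instead
    # set red-list logits to -inf before sampling.
    return logits + delta * green_mask
```

At each decoding step, the biased logits would replace the original ones before softmax and sampling.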

As the schematic above illustrates, a writer who is not applying the watermark (No watermark) does not know which tokens belong to the red list and which to the green list, so the text contains a random mix of green and red tokens. In watermarked text (With watermark), most tokens are green, so a hypothesis test on the number of red-list violations can distinguish whether the text was generated by the model.
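A minimal detection sketch is shown below. It assumes the same simplified hash as the generation sketch above and uses a one-proportion z-test on the green-token count; the function name `detect_watermark` and the parameter defaults are illustrative. Note that detection only needs the tokenizer, the hash scheme, and γ, not the language model itself.

```python
import math
import torch

def detect_watermark(token_ids, vocab_size, gamma=0.5):
    """Return (z_score, p_value) against the null hypothesis 'no watermark'."""
    green_hits = 0
    T = len(token_ids) - 1  # scored tokens (each needs a predecessor)
    for t in range(1, len(token_ids)):
        # Recompute the green list exactly as the generator did.
        seed = int(token_ids[t - 1]) * 15485863 % (2**31)
        gen = torch.Generator().manual_seed(seed)
        perm = torch.randperm(vocab_size, generator=gen)
        green_set = set(perm[: int(gamma * vocab_size)].tolist())
        if int(token_ids[t]) in green_set:
            green_hits += 1
    # Under the null, each token is green with probability gamma,
    # so the green count is approximately Binomial(T, gamma).
    z = (green_hits - gamma * T) / math.sqrt(T * gamma * (1 - gamma))
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # one-sided normal tail
    return z, p_value
```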


Experimental analysis

To simulate a variety of realistic language modeling scenarios, the authors randomly sample text spans from the news-like subset of the C4 dataset. For each sampled passage, a fixed number of tokens is trimmed from the end and treated as the human "baseline" completion, and the remaining tokens serve as the prompt. For experimental runs using multinomial sampling decoding, samples are drawn from the dataset until at least 500 model generations of length T = 200 ± 5 tokens are obtained.


In runs using greedy and beam search decoding, the EOS token is suppressed during generation to counter beam search's tendency to produce short sequences, and all sequences are then truncated to T = 200. A larger "oracle" language model (OPT-2.7B) is used to compute the perplexity (PPL) of both the model generations and the human baselines.
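For reference, scoring a generation with a larger oracle model can be done roughly as sketched below; this is a hedged example using the Hugging Face OPT-2.7B checkpoint, not the authors' evaluation code.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("facebook/opt-2.7b")
oracle = AutoModelForCausalLM.from_pretrained("facebook/opt-2.7b")
oracle.eval()

def oracle_perplexity(text):
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids makes the model return the mean
        # next-token cross-entropy loss over the sequence.
        loss = oracle(ids, labels=ids).loss
    return torch.exp(loss).item()
```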

[Figure: trade-off between watermark strength (z-score) and text quality (perplexity) for various watermark parameter settings]


A very strong watermark on short sequences can be achieved by choosing a small green list size γ and a large green list bias δ. However, a stronger watermark may reduce the quality of the generated text. The figure above shows the trade-off between watermark strength (z-score) and text quality (perplexity) for various combinations of the watermark parameters; for each parameter choice, results are computed over 500 ± 10 sequences of length T = 200 ± 5 tokens.


Interestingly, the authors find that a small green list with γ = 0.1 is Pareto optimal. The paper also examines the trade-off between watermark strength and text quality under beam search, which works synergistically with the soft watermarking rule: with 8 beams in particular, the points on the right side of the figure form nearly vertical lines, achieving strong watermarks with very little loss in perplexity.


Original address of the paper: https://openreview.net/forum?id=aX8ig9X2a7

