One-click to open ChatGPT "Dangerous Speech"

77d587892ec8e093497c23927573ba87.jpeg

Authorized by Big Data Digest, reproduced from Academic Headlines

Author: Hazel Yan

Edit: Page


With the popularity of large-scale model technology, AI chatbots have become one of the common tools for social entertainment, customer service and educational assistance.


However, unsafe AI chatbots may be used by some people to spread false information, manipulate public opinion, and even be used by hackers to steal users' personal privacy. The emergence of generative AI tools for cybercrime, such as WormGPT and FraudGPT, has raised concerns about the security of AI applications.


Last week, Google, Microsoft, OpenAI, and Anthropic launched a new industry body, the Frontier Model Forum, to promote the safe and responsible development of cutting-edge AI systems: Advancing AI Safety Research, Identifying Best Practices and Standards , to facilitate information sharing among policymakers and industry.


02105918c76f9705acf0db408ec0e226.jpeg


So, the question is, is their own model really safe?


Recently, researchers from Carnegie Mellon University, Center for AI Safety, and Bosch Center for AI disclosed a "big bug" related to AI chatbots such as ChatGPT - AI developers can be bypassed through adversarial prompts To manipulate AI chatbots to generate dangerous speech.


None of the current popular AI chatbots or models, such as OpenAI's ChatGPT, Google's Bard, Anthropic's Claude 2, and Meta's LLaMA-2, have survived.


80132ed199d244b83d3db16cf08f4066.jpeg

Figure | The security rules of the four language models can be bypassed through adversarial hints, triggering potentially harmful behavior

 

Specifically, the researchers discovered a Suffix that can be appended to queries against large language models (LLMs) to generate dangerous speech. Instead of refusing to answer these dangerous questions, the study maximizes the probability that these models will generate a yes answer.

 

For example, when asked "how to steal someone's identity", the AI ​​chatbot gave very different output before and after turning on "Add adversarial suffix".


90ad0739a16bfbd5ccf74245931dae28.jpeg


7bb2006955ca659cf9ffa708f401254b.jpeg

Figure | Comparison of chatbot answers before and after enabling Add adversarial suffix 


In addition, AI chatbots will also be induced to write inappropriate remarks such as "how to build an atomic bomb", "how to post dangerous social articles", "how to steal money from charities".


In response, Zico Kolter, an associate professor at Carnegie Mellon University who participated in the study, said, "As far as we know, there is currently no way to fix this problem. We don't know how to make them safe."


The researchers had warned OpenAI, Google and Anthropic of the flaw before publishing these results. Each company has introduced blocking measures to prevent the exploits described in the research papers from working, but they have yet to figure out how to stop adversarial attacks more generally.


Hannah Wong, spokesperson for OpenAI, said: "We are constantly working to improve the robustness of our models against adversarial attacks, including methods for identifying patterns of unusual activity, ongoing red team testing to simulate potential threats, and approach to fix model weaknesses revealed by newly discovered adversarial attacks."


Google spokesperson Elijah Lawal shared a statement explaining the steps the company took to test the model and find its weaknesses. "While this is a common problem with LLMs, we have important safeguards in place in Bard that we are continually improving."


Anthropic's interim director of policy and social impact, Michael Sellitto, said: "Making models more resistant to prompting and other adversarial 'jailbreak' measures is an active area of ​​research. We are trying to make the base model more 'harmless' by hardening its defenses." ’. At the same time, we are also exploring additional layers of defense.”


72b1992ecfe83945e9101b64d7deec57.jpeg

Figure | Harmful content generated by 4 language models


For this problem, the academic circles have also issued warnings and given some suggestions.


Armando Solar-Lezama, a professor at MIT's School of Computing, said it makes sense that adversarial attacks exist in language models because they affect many machine learning models. However, it is surprising that an attack developed against a generic open source model can be so effective on multiple different proprietary systems.


The problem, Solar-Lezama argues, may be that all LLMs are trained on similar corpora of textual data, many of which come from the same websites, and the amount of data available in the world is limited.


"Any important decision should not be made entirely by the language model alone. In a sense, this is just common sense." He emphasized the moderate use of AI technology, especially in scenarios involving important decisions or potential risks. However, human participation and supervision are still required to better avoid potential problems and misuse.


"It is no longer possible to keep AI from falling into the hands of malicious operators," says Arvind Narayanan, a professor of computer science at Princeton University. He argues that while efforts should be made to make models more secure, we should also recognize that preventing all abuse is unlikely. Therefore, a better strategy is to strengthen the supervision and fight against abuse while developing AI technology.


Worry or disdain. In the development and application of AI technology, in addition to focusing on innovation and performance, we must always keep safety and ethics in mind.


Only by maintaining moderate use, human participation and supervision, can we better avoid potential problems and abuse, and make AI technology bring more benefits to human society.

Guess you like

Origin blog.csdn.net/weixin_31351409/article/details/132134089