"Using Chatbots Against Themselves to 'Jailbreak' Each Other"

Nanyang Technological University (NTU) computer scientists have discovered a way to compromise Artificial Intelligence (AI) chatbots by training and using an AI chatbot to generate prompts capable of jailbreaking other chatbots. According to the team, jailbreaking involves computer hackers finding and exploiting flaws in a system's software to force it to do something its developers have purposefully restricted it from doing. The researchers named the method they used to jailbreak Large Language Models (LLMs), Masterkey. They first reverse-engineered how LLMs detect and defend against malicious queries. Based on this information, they taught an LLM to automatically learn and generate prompts that sidestep the defenses of other LLMs. This process can be automated, resulting in a jailbreaking LLM capable of adapting to and creating new jailbreak prompts even after developers patch their LLMs. This article continues to discuss the experiment and the importance of the team's findings.

Nanyang Technological University reports "Using Chatbots Against Themselves to 'Jailbreak' Each Other"

Submitted by grigby1

Submitted by Gregory Rigby on Thu, 12/28/2023 - 16:00