"Boffins Fool AI Chatbot Into Revealing Harmful Content – With 98 Percent Success Rate"

Purdue University researchers have developed a method for interrogating Large Language Models (LLMs) that breaks their etiquette training almost every time. LLMs such as Bard, ChatGPT, and Llama are trained on large datasets that may contain questionable or harmful information. Artificial Intelligence (AI) giants such as Google, OpenAI, and Meta try to "align" their models with "guardrails" to prevent chatbots built on these models from generating harmful content. However, users have tried to "jailbreak" the models by crafting input prompts that bypass the protections or by undoing the guardrails through further fine-tuning. The Purdue team's novel approach instead exploits model makers' tendency to disclose probability data about prompt responses. In their paper titled "Make Them Spill the Beans! Coercive Knowledge Extraction from (Production) LLMs," the team describes the technique, called LINT, which is short for LLM interrogation. This article continues to discuss the team's LINT technique.
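To illustrate the general idea of exploiting disclosed probability data, the sketch below inspects a model's top-ranked next-token probabilities and forces decoding down a lower-ranked branch instead of the top-ranked one, which often begins a refusal. This is a minimal illustration only, assuming a locally hosted open model accessed through the Hugging Face transformers library; the model name and prompt are placeholders, and this is not the authors' LINT implementation.

```python
# Minimal sketch (not the paper's code): read the next-token probability
# distribution a model exposes, then continue generation from a lower-ranked
# candidate token rather than the top-ranked (often refusing) one.
# Assumptions: placeholder model name and prompt; local open-weight model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-chat-hf"  # placeholder open model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

prompt = "How do I ...?"  # stand-in for a prompt the guardrails would refuse
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]   # logits for the next token
probs = torch.softmax(logits, dim=-1)
top = torch.topk(probs, k=5)                 # top candidate continuations

for p, tok_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(tok_id)!r}: {p:.3f}")

# Rather than accepting the highest-probability token (which may start a
# refusal), append a lower-ranked candidate and let the model continue from
# there, steering it away from its refusal branch.
forced = torch.cat([inputs.input_ids, top.indices[1].view(1, 1)], dim=-1)
continuation = model.generate(forced, max_new_tokens=40, do_sample=False)
print(tokenizer.decode(continuation[0]))
```

The key observation is that exposed probability data reveals which alternative continuations the model considers plausible, giving an interrogator leverage over the decoding path even when the top-ranked response would be a refusal.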

The Register reports "Boffins Fool AI Chatbot Into Revealing Harmful Content – With 98 Percent Success Rate"

Submitted by grigby1