Research Team Status

  • Names and positions of researchers
    • Michael W. Mahoney (Research Scientist)
    • N. Benjamin Erichson (Research Scientist)
    • Serge Egelman (Research Scientist)
    • John Cava (PhD student)
    • Zhipeng Wei (incoming Postdoc)

Project Goals

  • Characterizing weaknesses in Judge LLMs. This not only advances our understanding of LLMs, but also motivates the development of new defense methods that mitigate token segmentation bias.
  • This work is aligned with our long-term goal of improving model robustness and developing AI safety metrics.

Accomplishments

  • We developed a new attack that misleads Judge LLMs, which are specifically designed to prevent jailbreaking attacks. This identifies an important vulnerability in current LLM pipelines. Specifically, our attack inserts emojis within tokens to increase the embedding difference between the resulting sub-tokens and the original tokens (a minimal sketch follows this list).
  • Our experiments with six state-of-the-art Judge LLMs show that the emoji attack allows 25% of harmful responses to bypass detection by Llama Guard and Llama Guard 2, and up to 75% by ShieldLM. These results highlight the need for stronger
    Judge LLMs to address this vulnerability. 
  • One potential defense strategy is to design prompts that filter abnormal characters out of the responses of target LLMs. However, this is difficult when the attacker uses a different delimiter for each token (see the second sketch below). We show that our attack can still succeed even when “gpt-3.5-turbo” is used as an LLM filter to remove unnecessary symbols.
  • Moreover, we demonstrate that the emoji attack can be effectively combined with existing jailbreak techniques to evade detection by Judge LLMs.
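
The sketch below illustrates the core mechanism. The tokenizer choice ("gpt2") and the insert_emoji helper are illustrative assumptions rather than the exact setup from the preprint; the point is only that an in-token emoji forces a word to segment into different sub-tokens, which shifts its embedding away from that of the original token.

```python
# Minimal sketch of in-token emoji insertion (illustrative tokenizer and
# helper name; not the exact setup from the preprint).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any subword tokenizer shows the effect

def insert_emoji(word: str, emoji: str = "\U0001F600") -> str:
    # Place the emoji mid-word so the original token is forced to split
    # into different sub-tokens around the emoji bytes.
    mid = len(word) // 2
    return word[:mid] + emoji + word[mid:]

original = "harmful"
attacked = insert_emoji(original)

print(tokenizer.tokenize(original))  # sub-tokens of the unmodified word
print(tokenizer.tokenize(attacked))  # different sub-tokens around the emoji
```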
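
The second sketch shows why a fixed character filter struggles when delimiters vary per token; the DELIMITERS set and the obfuscate helper are hypothetical choices of ours, assuming the attacker samples a fresh separator for each word.

```python
# Sketch of per-token delimiter variation (hypothetical helper and delimiter
# set). A filter that strips one known symbol misses separators it has not seen.
import random

DELIMITERS = ["\U0001F600", "\U0001F47E", "\u200b", "*", "~"]  # emojis, zero-width space, symbols

def obfuscate(sentence: str) -> str:
    out = []
    for word in sentence.split():
        mid = max(1, len(word) // 2)
        sep = random.choice(DELIMITERS)  # a different delimiter per word
        out.append(word[:mid] + sep + word[mid:])
    return " ".join(out)

print(obfuscate("this response would normally be flagged"))
```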

Publications and Presentations

  • Preprint: https://arxiv.org/pdf/2411.01077