Research Team Status

  • Names and positions of researchers
    • Michael W. Mahoney (Research Scientist)
    • N. Benjamin Erichson (Research Scientist)
    • Serge Egelman (Research Scientist)
    • John Cava (PhD student)
    • Zhipeng Wei (incoming Postdoc)

Project Goals

  • Characterizing weaknesses in Judge LLMs. This not only advances our understanding of LLMs, but also motivates the development of new defense methods that mitigate token segmentation bias.
  • This work is aligned with our long-term goal of improving model robustness and developing AI safety metrics.

Accomplishments

  • We developed a new attack that misleads Judge LLMs, which are specifically designed to prevent jailbreaking attacks. This identifies an important vulnerability in current LLM pipelines. Specifically, our attack inserts emojis within tokens to increase the embedding difference between the resulting sub-tokens and the original tokens (a minimal sketch follows this list).
  • Our experiments with six state-of-the-art Judge LLMs show that the emoji attack allows 25% of harmful responses to bypass detection by Llama Guard and Llama Guard 2, and up to 75% by ShieldLM. These results highlight the need for stronger
    Judge LLMs to address this vulnerability. 
  • One potential defense strategy is to design prompts that filter abnormal characters out of the responses of target LLMs. However, this is difficult when the attacker uses a different delimiter for each token (see the second sketch below). We show that our attack can still succeed even when “gpt-3.5-turbo” is used as an LLM filter to remove unnecessary symbols.
  • Moreover, we demonstrate that the emoji attack can be effectively combined with existing jailbreak techniques to evade detection by Judge LLMs.
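
The sketch below illustrates the core mechanism. The tokenizer choice ("gpt2") and the insert_emoji helper are illustrative assumptions rather than the exact setup from the preprint; the point is only that an in-token emoji forces a word to segment into different sub-tokens, which shifts its embedding away from that of the original token.

```python
# Minimal sketch of in-token emoji insertion (illustrative tokenizer and
# helper name; not the exact setup from the preprint).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # any subword tokenizer shows the effect

def insert_emoji(word: str, emoji: str = "\U0001F600") -> str:
    # Place the emoji mid-word so the original token is forced to split
    # into different sub-tokens around the emoji bytes.
    mid = len(word) // 2
    return word[:mid] + emoji + word[mid:]

original = "harmful"
attacked = insert_emoji(original)

print(tokenizer.tokenize(original))  # sub-tokens of the unmodified word
print(tokenizer.tokenize(attacked))  # different sub-tokens around the emoji
```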
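
The second sketch shows why a fixed character filter struggles when delimiters vary per token; the DELIMITERS set and the obfuscate helper are hypothetical choices of ours, assuming the attacker samples a fresh separator for each word.

```python
# Sketch of per-token delimiter variation (hypothetical helper and delimiter
# set). A filter that strips one known symbol misses separators it has not seen.
import random

DELIMITERS = ["\U0001F600", "\U0001F47E", "\u200b", "*", "~"]  # emojis, zero-width space, symbols

def obfuscate(sentence: str) -> str:
    out = []
    for word in sentence.split():
        mid = max(1, len(word) // 2)
        sep = random.choice(DELIMITERS)  # a different delimiter per word
        out.append(word[:mid] + sep + word[mid:])
    return " ".join(out)

print(obfuscate("this response would normally be flagged"))
```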

Publications and Presentations

  • Preprint: https://arxiv.org/pdf/2411.01077