Research Team Status
- Names of researchers and position
- Michael W. Mahoney (Research Scientist)
- N. Benjamin Erichson (Research Scientist)
- Serge Egelman (Research Scientist)
- Zhipeng Wei (incoming Postdoc)
Project Goals
- We extended our work on characterizing weaknesses in Judge LLMs. Specifically, we demonstrated that emojis can be used to strengthen jailbreaks against Judge LLM detection.
- This not only advances our understanding of LLMs but also motivates the development of new defense methods to mitigate token segmentation bias.
- This is aligned with our long-term goal of improving model robustness and developing AI safety metrics.
Accomplishments
- We ran additional experiments to study the semantic ambiguity introduced by emojis, in addition to their intrinsic semantic meaning.
- These experiments show that LLMs are affected by the semantic meaning of emojis, and not just by the token segmentation bias introduced by injecting emojis into the response.
- We evaluated additional LLMs, including Claude and Gemini.
- DeepSeek is surprisingly robust compared to the other models.
- We carefully revised the paper (see https://arxiv.org/pdf/2411.01077).
- We started to investigate attacks on AI agent frameworks.
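As a rough illustration of the token segmentation effect mentioned above, the sketch below inserts an emoji inside a word and shows how a toy tokenizer then splits the word differently. It is a minimal sketch under stated assumptions, not the procedure or tokenizer used in the paper: the vocabulary `TOY_VOCAB` and the helpers `toy_tokenize` and `inject_emoji` are hypothetical names introduced only for this example, and real LLM tokenizers behave differently in detail.

```python
# Illustrative sketch only: a toy greedy longest-match tokenizer,
# NOT a real LLM tokenizer and NOT the method used in the paper.

# Hypothetical vocabulary, chosen only to make the effect visible.
TOY_VOCAB = {"harm", "ful", "harmful"}


def toy_tokenize(text, vocab=TOY_VOCAB):
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest vocabulary entry that starts at position i.
        for j in range(len(text), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            # No vocabulary entry matched: emit one character as its own token.
            tokens.append(text[i])
            i += 1
    return tokens


def inject_emoji(word, position, emoji="😊"):
    """Insert an emoji inside a word at the given character offset."""
    return word[:position] + emoji + word[position:]


if __name__ == "__main__":
    print(toy_tokenize("harmful"))                   # one token for the intact word
    print(toy_tokenize(inject_emoji("harmful", 4)))  # the emoji splits the word
```

In this toy setting the intact word maps to a single token, while the emoji-injected word is split into sub-word pieces around the emoji, so a model sees a different token sequence for text that a human reads as the same word.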
Publications and presentations
- Updated Preprint: https://arxiv.org/pdf/2411.01077
Report Materials
- Report file: Emoji_Attack_v2.pdf (2.97 MB)