2025 Q1 | Science of Security Virtual Organization

2025 Q1

Research Team Status

Names of researchers and position
- Michael W. Mahoney (Research Scientist)
- N. Benjamin Erichson (Research Scientist)
- Serge Egelman (Research Scientist)
- Zhipeng Wei (incoming Postdoc)

Project Goals

We extended our work on characterizing weaknesses in Judge LLMs. Specifically, we demonstrated that emojis can be used to enhance jailbreaks against Judge LLM detection.
This not only advances our understanding of LLMs, but also helps motivate the development of new defense methods to mitigate token segmentation biases.
This is aligned with our long-term goal of improving model robustness and developing AI safety metrics.
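
To make the notion of token segmentation bias concrete, the following is a minimal illustrative sketch, not the method used in the paper. It assumes a toy greedy longest-match tokenizer and a hypothetical helper, `inject_emoji`, that places an emoji at the midpoint of each word; the vocabulary, the emoji, and the insertion position are all assumptions chosen for illustration. Real Judge LLMs use learned subword tokenizers, so the exact splits will differ.

```python
def inject_emoji(text: str, emoji: str = "😊") -> str:
    """Insert `emoji` at the midpoint of every word with at least two characters.

    Hypothetical helper for illustration; the midpoint rule is an assumption.
    """
    words = []
    for word in text.split():
        if len(word) < 2:
            words.append(word)
            continue
        pos = len(word) // 2  # always an interior position for len(word) >= 2
        words.append(word[:pos] + emoji + word[pos:])
    return " ".join(words)


def toy_tokenize(text: str, vocab: set) -> list:
    """Greedy longest-match tokenizer over a toy vocabulary.

    Any character that starts no vocabulary entry becomes a single-character token.
    """
    tokens = []
    for word in text.split():
        i = 0
        while i < len(word):
            # Try the longest candidate first; fall back to one character.
            for j in range(len(word), i, -1):
                if word[i:j] in vocab or j == i + 1:
                    tokens.append(word[i:j])
                    i = j
                    break
    return tokens


vocab = {"harmful", "text"}  # toy vocabulary (assumption)
clean = "harmful text"
perturbed = inject_emoji(clean)

print(toy_tokenize(clean, vocab))      # whole-word tokens
print(toy_tokenize(perturbed, vocab))  # words fragment around the emoji
```

In this toy setting the clean response maps to two whole-word tokens, while the perturbed response fragments into many single-character tokens, so a model reading the tokens no longer sees the original words as units.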

Accomplishments

We ran additional experiments to study the semantic ambiguity of emojis, in addition to their intrinsic semantic meaning.
- Experiments show that LLMs are affected by the semantic meaning of emojis, and not just by the token segmentation bias introduced by injecting emojis into the response.
We evaluated additional LLMs, including Claude and Gemini.
- DeepSeek is surprisingly robust compared with the other models.
We carefully revised the paper (see https://arxiv.org/pdf/2411.01077).
We started to investigate attacks on AI agent frameworks.

Publications and presentations

Lead PI:

Co-Pi(s):
