Research Team Status

  • Names of researchers and positions
    • Michael W. Mahoney (Research Scientist)
    • N. Benjamin Erichson (Research Scientist)
    • Serge Egelman (Research Scientist)
    • Zhipeng Wei (incoming Postdoc)
       

Project Goals

  • We extended our work on characterizing weaknesses in Judge LLMs. Specifically, we demonstrated that emojis can be used to enhance jailbreaks against Judge LLM detection.
     
  • This not only advances our understanding of LLMs but also helps motivate the development of new defense methods that mitigate token segmentation biases.
     
  • This is aligned with our long-term goal of improving model robustness and developing AI safety metrics. 
     

Accomplishments

  • We ran additional experiments to study the semantic ambiguity of emojis, in addition to their intrinsic semantic meaning.
    • The experiments show that LLMs are affected by the semantic meaning of emojis, and not just by the token segmentation bias introduced by injecting emojis into the response.
  • We evaluated additional LLMs, including Claude and Gemini.
    • DeepSeek is surprisingly robust compared to the other models.
  • We carefully revised the paper (see https://arxiv.org/pdf/2411.01077). 
     
  • We started to investigate attacks on AI agent frameworks.
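
To make the idea of emoji injection concrete, the following is a minimal illustrative sketch, not the procedure used in the paper. It assumes a fixed emoji and a fixed character offset (the function name `inject_emoji` and both parameters are hypothetical), and it simply splits each sufficiently long word by placing an emoji inside it. A tokenizer that sees the modified text would likely segment these words differently from the originals, which is the kind of token segmentation effect discussed above.

```python
def inject_emoji(text: str, emoji: str = "😊", position: int = 2) -> str:
    """Insert `emoji` inside every word longer than `position` characters.

    Hypothetical helper for illustration only: the emoji choice and the
    fixed insertion offset are assumptions, not the paper's method.
    """
    modified_words = []
    for word in text.split(" "):
        if len(word) > position:
            # Split the word at `position` and place the emoji in between.
            modified_words.append(word[:position] + emoji + word[position:])
        else:
            # Short words are left unchanged.
            modified_words.append(word)
    return " ".join(modified_words)


original = "harmless example response"
modified = inject_emoji(original)
print(modified)
```

Removing the emoji from the modified text recovers the original string, so the surface content is unchanged for a human reader while the character sequence seen by a tokenizer differs.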

Publications and presentations

  • Updated Preprint: https://arxiv.org/pdf/2411.01077