Can Your LLM Keep a Secret?
Jaiden Fairoze¹,², Sanjam Garg¹,³, and Steve Lu⁴
¹University of California, Berkeley
²FAIR at Meta
³Carabid, Inc.
⁴Stealth Software Technologies, Inc
|
ASBTRACT As large language models (LLMs) transition from novelties to critical infrastructure, preventing First, we ask if typical strategies such as input filtering, the dominant defense against prompt Next, we then study whether secrets can be reliably hidden in prompts despite adversarial interference. We capture this as follows: an attacker injects instructions into a model interaction and, by observing only the model’s output, infers sensitive properties of the hidden input without explicit disclosure. For this setting, we show that such attacks can be launched on state-of-the-art models and agents built on top of these models. Our results suggest that, current LLM architectures fundamentally lack mechanisms to guarantee secrecy under adversarial interaction. |
|
BIO Prof. Sanjam Garg is an Associate Professor at the University of California, Berkeley. His research interests are in cryptography and its applications to security and privacy. He obtained his Ph.D. from the University of California, Los Angeles in 2013 and his undergraduate degree from the Indian Institute of Technology, Delhi in 2008. Prof. Garg is the recipient of various honors such as the ACM Doctoral Dissertation Award, the Sloan Research Fellowship and the IIT Delhi Graduates of the Last Decade Award. Prof. Garg's research has been recognized with a test of time award at FOCS 2023, and best paper awards at EUROCRYPT 2013, CRYPTO 2017, EUROCRYPT 2018 and TCC 2025. Past students and postdoctoral researchers from Prof. Garg's research group are now faculty/researchers at top institutions, such as Columbia University, Brown University, the University of Toronto, Microsoft Research, etc. |