Can Your LLM Keep a Secret?

Jaiden Fairoze¹,², Sanjam Garg¹,³, and Steve Lu⁴

¹University of California, Berkeley 
²FAIR at Meta 
³Carabid, Inc. 
⁴Stealth Software Technologies, Inc

ASBTRACT

As large language models (LLMs) transition from novelties to critical infrastructure, preventing
the leakage of sensitive information has become a central challenge. We study this problem through
a cryptographic lens, formalizing secrecy as a game between a defender who embeds a secret along with any safeguarding instructions in a prompt and an adversary who injects auxiliary prompt
instructions and attempts to recover the secret from the model’s output. In this talk, we show
fundamental limitations of existing defenses.

First, we ask if typical strategies such as input filtering, the dominant defense against prompt
injection, work. Specifically, we demonstrate how adversaries can, in production systems, encode
malicious intent into computationally hard-to-detect structures, exploiting the asymmetry between
lightweight guardrails and the powerful models they protect. These attacks align with emerging
theoretical results suggesting inherent limits to universal input filtering.

Next, we then study whether secrets can be reliably hidden in prompts despite adversarial interference. We capture this as follows: an attacker injects instructions into a model interaction and, by observing only the model’s output, infers sensitive properties of the hidden input without explicit disclosure. For this setting, we show that such attacks can be launched on state-of-the-art models and agents built on top of these models.

Our results suggest that, current LLM architectures fundamentally lack mechanisms to guarantee secrecy under adversarial interaction.

BIO

Prof. Sanjam Garg is an Associate Professor at the University of California, Berkeley. His research interests are in cryptography and its applications to security and privacy. He obtained his Ph.D. from the University of California, Los Angeles in 2013 and his undergraduate degree from the Indian Institute of Technology, Delhi in 2008. Prof. Garg is the recipient of various honors such as the ACM Doctoral Dissertation Award, the Sloan Research Fellowship and the IIT Delhi Graduates of the Last Decade Award. Prof. Garg's research has been recognized with a test of time award at FOCS 2023, and best paper awards at EUROCRYPT 2013, CRYPTO 2017, EUROCRYPT 2018 and TCC 2025. Past students and postdoctoral researchers from Prof. Garg's research group are now faculty/researchers at top institutions, such as Columbia University, Brown University, the University of Toronto, Microsoft Research, etc.

Submitted by Katie Dey on