Can Your LLM Keep a Secret? | Science of Security Virtual Organization

Can Your LLM Keep a Secret?

Slides for this presentation are available to registered HCSS community members. Please log in to access the file.

Once logged in, you may download the slides here: [HCSS 2026 Garg]

If you do not already have an account, you may create one here:
https://sos-vo.org/user/register

Jaiden Fairoze (UC, Berkeley and FAIR at Meta), Sanjam Garg (UC Berkeley and Carabid, Inc.), and Steve Lu (Stealth Software Technologies, Inc.

ASBTRACT

As large language models (LLMs) transition from novelties to critical infrastructure, preventing the leakage of sensitive information has become a central challenge. We study this problem through a cryptographic lens, formalizing secrecy as a game between a defender who embeds a secret along with any safeguarding instructions in a prompt and an adversary who injects auxiliary prompt instructions and attempts to recover the secret from the model’s output. In this talk, we show fundamental limitations of existing defenses.

First, we ask if typical strategies such as input filtering, the dominant defense against prompt injection, work. Specifically, we demonstrate how adversaries can, in production systems, encode malicious intent into computationally hard-to-detect structures, exploiting the asymmetry between lightweight guardrails and the powerful models they protect. These attacks align with emerging theoretical results suggesting inherent limits to universal input filtering.

Next, we then study whether secrets can be reliably hidden in prompts despite adversarial interference. We capture this as fo lows: an attacker injects instructions into a model interaction and, by observing only the model’s output, infers sensitive prope ties of the hidden input without explicit disclosure. For this setting, we show that such attacks can be launched on state-of-the-art models and agents built on top of these models.

Our results suggest that, current LLM architectures fundamentally lack mechanisms to guarantee secrecy under adversarial interaction.

BIO

Prof. Sanjam Garg is an Associate Professor at the University of California, Berkeley. His research interests are in cryptography and its applications to security and privacy. He obtained his Ph.D. from the University of California, Los Angeles in 2013 and his undergraduate degree from the Indian Institute of Technology, Delhi in 2008. Prof. Garg is the recipient of various honors such as the ACM Doctoral Dissertation Award, the Sloan Research Fellowship and the IIT Delhi Graduates of the Last Decade Award. Prof. Garg's research has been recognized with a test of time award at FOCS 2023, and best paper awards at EUROCRYPT 2013, CRYPTO 2017, EUROCRYPT 2018 and TCC 2025. Past students and postdoctoral researchers from Prof. Garg's research group are now faculty/researchers at top institutions, such as Columbia University, Brown University, the University of Toronto, Microsoft Research, etc.