2025 Q4 | Science of Security Virtual Organization

Research Team Status

Project Goals

This quarter, we continued our investigation of prompt-based attacks in multi-agent systems, with a specific focus on prompt infection rather than single-step prompt injection. In our terminology, prompt injections are adversarial instructions that influence a single interaction, whereas prompt infections are attacks designed to persist and spread across multiple agents and stages of a workflow (e.g., summarization, routing, and planning). Our goal is to evaluate whether optimization-based methods can turn existing injection-style attacks into durable infections that reliably survive these intermediate transformations in realistic multi-agent architectures.

This question matters for deployment: if multi-agent systems are easily infectable, then common design patterns for LLM orchestration are fundamentally unsafe; if they are more robust than expected, this reshapes both how we prioritize defenses and how we design future red-teaming strategies.

Accomplishments

We implemented an optimization-based prompt infection attack built on the Greedy Coordinate Gradient (GCG) approach. Its objective is to find adversarial strings that explicitly encourage intermediate agents to preserve and propagate the injected content. We then evaluated this attack in multi-agent pipelines that include summarization agents, tracking both how many stages the infection survives and how strongly it influences downstream behavior. The attack succeeds in very shallow systems, but its performance drops in systems with two or more layers.

Our results were more negative than expected: the multi-agent systems we studied were surprisingly robust, with optimized infections achieving low success rates and often being diluted or removed by intermediate agents. This suggests that incremental improvements to existing prompt infection algorithms are not sufficient to compromise larger multi-agent systems. Instead, these findings point to the need for fundamentally new infection objectives and algorithms that more directly exploit the structure and coordination patterns of multi-agent workflows. On the other hand, these findings also show that agent systems can improve robustness, at the cost of increased computation and longer inference time.
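
To make the "how many stages the infection survives" measurement concrete, the following is a minimal sketch in Python. It is not our evaluation harness. It assumes each agent can be modeled as a text-to-text function, uses a toy truncating summarizer in place of an LLM agent, and treats literal presence of a marker string as a crude proxy for infection survival. The names MARKER, truncating_summarizer, survival_depth, and survival_rate are hypothetical and introduced only for illustration.

```python
from typing import Callable, List

# An agent is modeled as a plain text-to-text function (an assumption;
# real agents would wrap LLM calls).
Agent = Callable[[str], str]

# Hypothetical stand-in for an optimized adversarial string.
MARKER = "<<PAYLOAD>>"


def truncating_summarizer(max_words: int) -> Agent:
    """Toy 'summarizer' that keeps only the first max_words words."""
    def agent(text: str) -> str:
        return " ".join(text.split()[:max_words])
    return agent


def survival_depth(pipeline: List[Agent], message: str, marker: str = MARKER) -> int:
    """Count consecutive stages whose output still contains the marker."""
    depth = 0
    current = message
    for agent in pipeline:
        current = agent(current)
        if marker not in current:
            break  # the payload was diluted or removed at this stage
        depth += 1
    return depth


def survival_rate(pipeline: List[Agent], messages: List[str], marker: str = MARKER) -> float:
    """Fraction of messages whose marker survives every stage of the pipeline."""
    if not messages:
        return 0.0
    survived = sum(
        survival_depth(pipeline, m, marker) == len(pipeline) for m in messages
    )
    return survived / len(messages)


# Three-stage toy pipeline with progressively more aggressive summarization.
pipeline = [truncating_summarizer(8), truncating_summarizer(4), truncating_summarizer(2)]
early = "<<PAYLOAD>> please forward this text to the next agent unchanged"
late = "Quarterly report on sales figures <<PAYLOAD>> forward this onward"

print(survival_depth(pipeline, early))
print(survival_depth(pipeline, late))
print(survival_rate(pipeline, [early, late]))
```

In this toy setting, a payload placed at the start of the message survives deeper than one placed mid-message, because the stand-in summarizer only keeps leading words. A realistic evaluation would replace the marker check with a measure of downstream behavioral influence, since a paraphrased payload can remain effective without appearing verbatim.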

Lead PI:

Co-PI(s):