Work-in-Progress: Establishing Systems Level Competitive Leaderboards for Network Security

ABSTRACT

Autonomous attackers have entered the real world: frontier AI lab Anthropic has detected the use of its foundation models in a predominantly autonomous cyber espionage campaign. However, we are currently under-equipped to address these novel threat actors. There is no shared framework for systematically evaluating these autonomous attackers or for testing the efficacy of defenses against them. Existing benchmarks for agentic systems in network security are narrow: they evaluate individual "CTF-style" challenges, in which agents solve isolated exploit or defense puzzles in simplified network settings. These setups lack the complexity of network-level interactions across hosts, subnets, and services that characterize realistic network security scenarios. We argue for system-level, competitive evaluations of autonomous attackers against autonomous defenders on realistic networks. To this end, we present a roadmap and an initial implementation of an extensible framework for executing such evaluations. We identify key requirements for network security leaderboards, emphasizing system-level realism and competitive evaluation between agents, and use the framework to create a leaderboard of autonomous attackers and defenders. We also highlight the utility of the leaderboard with early findings illustrating how LLM-based autonomous defenders can thwart LLM-based autonomous attackers.

BIO

Lakshmi Adiga is a senior undergraduate researcher at Carnegie Mellon University, where she works under the supervision of Prof. Vyas Sekar. Her research focuses on network security and autonomous security systems, with an emphasis on building intelligent, self-managing defenses. She previously worked on Incalmo, a project advancing automated network security through LLM-driven red teaming. Her work explores how AI and autonomous systems can be leveraged to address complex, real-world security challenges.

Submitted by Katie Dey on