ABSTRACT Autonomous attackers have entered the real world: frontier AI lab Anthropic has detected the use of its foundation models in a predominantly autonomous cyber espionage campaign. However, we are currently under-equipped to address these novel threat actors. There is no shared framework to systematically evaluate autonomous attackers, or to test the efficacy of defenses against them. Existing benchmarks for agentic systems in network security are narrow, evaluating individual "CTF-style" challenges in which agents solve isolated exploit or defense puzzles in simplified network settings. These setups lack the complexity of network-level interactions across hosts, subnets, and services that characterize realistic network security scenarios. We argue for system-level, competitive evaluations of autonomous attackers against autonomous defenders on realistic networks. To this end, we present a roadmap and an initial implementation of an extensible framework for executing system-level competitive evaluations. We identify key requirements for network security leaderboards, emphasizing system-level realism and competitive evaluation between agents, and use the framework to create a leaderboard of autonomous attackers and defenders. We also highlight the utility of the leaderboard with early findings illustrating how LLM-based autonomous defenders can thwart LLM-based autonomous attackers.