Deterministic Simulation Testing for the Assurance of Complex Software Systems
ABSTRACT It is nearly impossible to build truly high confidence in a distributed system using conventional means of testing. Many classes of distributed system failures emerge only in rare situations that are hard to elicit via manually written end-to-end and unit tests unless one already has, for example, detailed post hoc knowledge of their root causes. In addition, the exact circumstances that induce a specific failure are sensitive to non-deterministic factors such as the timing effects of network latency. Even when failures are induced during testing, they are often difficult to reproduce reliably. Deterministic simulation testing (DST) is a testing approach that addresses these challenges by running autonomous tests against an entire software system inside a controlled virtual environment, so that any execution can be reproduced exactly. I will explain the principles and implementation of DST, and how the technique enables the development of high-confidence distributed systems. I will present several examples of DST implementation in complex, correctness-critical systems like databases, and discuss the architecture of a specific DST platform, Antithesis, which can be applied to virtually any system. First, I will introduce general DST design principles. These include: (1) virtualizing all sources of non-determinism (time, network, disk, scheduling) so that the simulator is the sole source of entropy; (2) driving exploration from a seeded pseudo-random source so that the behavior of each run is both diverse and reproducible; (3) integrating systematic fault injection, such as crashes, partitions, delays, and data corruption, into the simulated environment; and (4) checking system-level invariants and correctness properties to detect violations. Second, I will illustrate how organizations developing various systems have implemented these principles, and the engineering practices that these teams have adopted around DST.
I will also discuss the barriers to wider adoption, chief among them the stringent constraints that conventional DST places on the architecture of the system under development, many of which require that DST be incorporated into the system's design from the beginning. Third, I will describe the architecture and design of Antithesis, a DST platform that enables DST to be applied to systems not originally designed with DST in mind, via the use of a deterministic hypervisor that simulates entire software stacks, including virtual CPU, networking, storage, and time. Building on this infrastructure, Antithesis also handles parallelization, exploration, and property checking: developers articulate high-level safety and liveness invariants, then the platform runs many parallel simulations that vary workload, scheduling, and injected faults. Finally, I will describe the use of DST as an assurance mechanism for AI-generated code. The volume of code that AI assistants can generate necessitates scalable, automated mechanisms for validating it. I will show how DST provides a framework for producing safer, more understandable AI-generated software.
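The four design principles listed in the abstract can be illustrated with a minimal Python sketch. It is a toy written for this announcement, not the Antithesis platform or any of the systems discussed in the talk, and every name in it (Simulator, Server, run_once, find_failing_seed) is hypothetical. A single seeded PRNG and a virtual clock are the only sources of non-determinism, the simulated network drops, duplicates, and delays messages, and an invariant is checked after every delivered event.

```python
import heapq
import random


class Simulator:
    """Owns virtual time and the only PRNG, so a seed fully determines a run."""

    def __init__(self, seed):
        self.rng = random.Random(seed)  # principle 2: seeded pseudo-random source
        self.now = 0                    # principle 1: virtual clock, no wall time
        self.queue = []                 # (deliver_time, sequence, callback, payload)
        self.seq = 0
        self.trace = []

    def send(self, callback, payload):
        # Principle 3: fault injection (drop, duplicate, delay) driven by the PRNG.
        if self.rng.random() < 0.2:
            self.trace.append(("drop", self.now, payload))
            return
        copies = 2 if self.rng.random() < 0.2 else 1
        for _ in range(copies):
            delay = self.rng.randint(1, 10)
            self.seq += 1  # unique tie-breaker keeps heap ordering deterministic
            heapq.heappush(self.queue, (self.now + delay, self.seq, callback, payload))

    def run(self, check):
        while self.queue:
            self.now, _, callback, payload = heapq.heappop(self.queue)
            self.trace.append(("deliver", self.now, payload))
            callback(payload)
            check()  # principle 4: check the invariant after every event


class Server:
    """Toy system under test: counts increment operations."""

    def __init__(self, dedupe):
        self.dedupe = dedupe  # dedupe=False models a missing-idempotency bug
        self.seen = set()
        self.total = 0

    def on_increment(self, op_id):
        if self.dedupe and op_id in self.seen:
            return
        self.seen.add(op_id)
        self.total += 1


def run_once(seed, dedupe=True, ops=5):
    sim = Simulator(seed)
    server = Server(dedupe)

    def check():
        # Safety invariant: no operation is ever applied more than once.
        if server.total > ops:
            raise AssertionError(f"seed {seed}: total {server.total} > {ops} ops")

    for op_id in range(ops):
        sim.send(server.on_increment, op_id)  # original request
        sim.send(server.on_increment, op_id)  # client retry
    sim.run(check)
    return sim.trace, server.total


def find_failing_seed(dedupe, seeds=range(200)):
    """Explore many seeds and return the first that violates the invariant."""
    for seed in seeds:
        try:
            run_once(seed, dedupe)
        except AssertionError:
            return seed
    return None
```

Calling find_failing_seed(dedupe=False) searches for a seed that exposes the bug, and passing that seed back to run_once replays the identical sequence of drops, duplicates, and delays. That replay is the reproducibility property the abstract describes, here at the scale of a single process rather than a whole software stack.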
BIO Michael Vaughn is a senior software engineer at Antithesis, working on their fuzzer and deterministic hypervisor. He earned a Ph.D. in computer science from the University of Wisconsin-Madison under the supervision of Tom Reps, working on tools for automated program transformation via partial evaluation. His research interests span programming languages, software engineering, and operating systems, and he has published research in a variety of areas, including file systems and mutation testing.