# Automated Theory Substitution: Toward Proof-Driven Software Development

Formal verification of large, real-world software systems has shown increasing success over the years (seL4, CompCert, s2n) but has failed to attain wide-spread adoption. This is largely due to the esoteric expertise required to write and maintain formal proofs. Experience with interactive theorem provers (ITPs) is not nearly as ubiquitous among software developers as, say, integrating unit tests into CI/CD pipelines. To alleviate this lack of experience and democratize proving, we present AutoMatEd THeorY SubsTitution (AMETHYST), an artificially intelligent proof assistant. AMETHYST is an effort to improve the state of proof repair automation using machine learning. Specifically, we apply a language model and other machine learning techniques to a dataset of hand-written proofs and repairs for the Isabelle/HOL ITP. We demonstrate that AMETHYST can complete truncated proofs which represent once-correct proofs that broke due to code changes.

Test-driven software development has become common practice in the last decade. Automated tests are designed and written before actual software. The software is then written to fulfill the requirements and pass the tests. These tests are executed automatically as part of a continuous integration/continuous deployment (CI/CD) pipeline. If a test fails, it indicates either an expected change in the code functionality or a new bug. The developer must determine which case it is, and then fix the test or the code. The problem with this approach is that no human can imagine, much less enumerate, every case of a program’s behavior. Thus, in all but the simplest of systems, much is left untested before deployment.

In the formal verification paradigm, segments of code are mathematically proven to exhibit a specific behavior. For example, one could guarantee that memory is never accessed beyond the end of a buffer, that a race condition never occurs, or that a loop invariant always holds. In this way, one does not have to imagine every edge case of a program’s behavior; instead, a particular property is proven to hold for entire categories of (sometimes all) inputs or program states. However, a number of challenges stand in the way of wider adoption:

- In many cases, proofs are written inside interactive theorem provers (ITPs), which have a much higher learning curve than typical software development environments.
- There are many ITPs (e.g. Isabelle, Coq, Lean), each implementing its own formal logic language. These languages are quite different from C++, Java, Python, etc.
- New features, even relatively simple changes, added to previously verified code frequently break proofs about the code in ways that are disproportionately difficult to repair.
- Software architects and engineers are not familiar with these languages and tools. It would be cumbersome or impossible to repair broken proofs, much less write new ones from scratch.
- To overcome these challenges, software shops could introduce these tools into their development process. However, this would require architects and engineers to climb the years-long learning curve to proficiency.

To ease the pain of this climb and aid in the creation of assured software now, we introduce AMETHYST, a suite of Transformer neural networks which are trained to write proof tactics for the Isabelle theorem prover. It interfaces with Isabelle and employs a value function and tree search algorithm to explore the proof space. Our contributions are as follows:

- A lightweight socket interface (called Pysa) for interacting with the Isabelle theorem prover.
- A novel value function for predicting the utility of individual subgoals of a proof state.
- A neuro-symbolic method (using a GPT language model) of predicting the next tactic in a proof.
- The ability to explicitly include previously proven lemmas as facts in the current proof.
- An interactive visualization of the search tree for a lemma is integrated into VSCode, offering explainability to the human.
- A curriculum-based teacher-forced reinforcement learning (RL) algorithm which updates, and thus ties together, multiple models at once.

**Mr. John Scebold** is a lead research scientist at Two Six Technologies. He has over 10 years of experience as a software engineer on large mission critical platforms for the IC. He has a Masters in Computer Science and 6 years experience as a data scientist specializing in NLP (old fashioned and neural methods), graph theory and combinatorics. *"I find or impose structure in data and then exploit it."*

**Mr. Jared Ziegler** is an Associate Research Director for High Assurance Solutions at Two Six Technologies. He has 20 years of experience analyzing critical, high assurance systems with a focus on vulnerability research and formal methods, including 16 years at the National Security Agency.