Small, Reliable Ways to Use LLMs: Macro Refolding, Proof Writing, and Protocol Modeling
Walt Woods, Mike Dodds - Galois Inc.
LLMs represent a significant leap forward in applying neurosymbolic ML to software engineering and formal methods problems. Yet their outputs cannot be trusted outright: they tend to hallucinate, proposing obviously wrong solutions or facts in response to user queries. Fortunately, formal methods allow us to identify use cases for LLMs that retain all of their benefits – ease of use, the ability to understand and manipulate code – while avoiding their downsides. We have conducted small experiments on automatically writing memory safety proofs, transforming C code into Rust, and turning prose protocol specifications (such as RFCs) into rigorous models that can be queried and modified through a chatbot interface. We call these LLM use cases checked candidate generation – that is, synthesizing artifacts that can be automatically checked through other means, where such artifacts would normally take significant time and expertise to author. These scenarios represent powerful use cases in which LLMs can greatly enhance both formal verification techniques and traditional software engineering tools.
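To make the pattern concrete, the loop below is a minimal sketch of checked candidate generation in Python. Both `call_llm` and `check_candidate` are hypothetical stubs standing in for, respectively, any LLM API and any trusted checker (SAW, a compiler, a model checker); nothing here is the exact code from our experiments.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM wrapper; substitute any chat-completion API."""
    raise NotImplementedError

def check_candidate(candidate: str) -> tuple[bool, str]:
    """Hypothetical trusted checker (e.g., SAW, or an IR equivalence test);
    returns (passed, diagnostic output)."""
    raise NotImplementedError

def checked_generation(task: str, max_attempts: int = 5) -> str | None:
    """Generate-and-check loop: the LLM proposes, the checker disposes.
    Only candidates the checker accepts are ever returned as trusted."""
    feedback = None
    for _ in range(max_attempts):
        prompt = task if feedback is None else f"{task}\n\nChecker said:\n{feedback}"
        candidate = call_llm(prompt)
        ok, feedback = check_candidate(candidate)
        if ok:
            return candidate  # trusted because the checker, not the LLM, says so
    return None  # give up; hand the task off to a human expert
```

The key design point is that trust flows only from the checker: the LLM can be arbitrarily unreliable without compromising the soundness of the accepted artifacts.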
The first experiment explored the authoring of SAW proofs, documented in a public blog post titled “Applying GPT-4 to SAW Formal Verification”. We show how LLMs can turn a few simple examples of paired code and memory proof specifications into a means of generating new memory proof specifications from code alone, including inferring the necessary pointer array sizes from context cues in a function body.

The second group of experiments supplemented a C-to-Rust translation tool that Galois has been developing, also documented in a public blog post, “Using GPT-4 to Assist C to Rust Translation”. While classical transpilation methods can take us far, they struggle with, for example, macro expansions that leave the resulting code nearly illegible (one 4,000-line C program became a 24,000-line Rust program). By integrating GPT-4, we designed a lightweight process that refolds these macros, checking that the work was done correctly by verifying that the High-Level Intermediate Representation (HIR) of the original and refolded code remained unchanged.

Finally, we explored using LLMs to turn protocol standards written in prose (for example, RFCs) into formal models that a user can interact with and modify through a standard chatbot interface; these results have not previously been made public. We conducted 17 different small experiments, showing that LLMs were effective at interacting with formal methods tools (including ACL2s), both in generating input specifications and in processing the tools’ outputs. However, they were surprisingly bad at detailed logical reasoning – justifying their use as a communications layer over existing formal methods.

Together, these experiments highlight a few truths about the new LLM-based technologies. First, it has never been easier to prototype a small demonstration of how these technologies might synthesize formal models from natural language. Second, checked candidate generation – the automatic checking of synthesized models – represents a large step toward reducing the expertise required to use formal methods for software security and integrity. Last but not least, where an LLM’s output cannot be automatically checked, LLMs still show great promise in reducing the effort required to inspect and modify formal models, making these powerful tools accessible to a wider software engineering audience. Illustrative sketches of the three checking patterns follow.
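For the SAW experiment, the checking step can be sketched as follows. This is not the exact pipeline from the blog post: the few-shot prompt format and the `call_llm` wrapper are illustrative assumptions, and treating a zero exit code from the `saw` command-line tool as a successful proof is a simplification.

```python
import pathlib
import subprocess
import tempfile

def call_llm(prompt: str) -> str:
    """Hypothetical LLM wrapper."""
    raise NotImplementedError

def propose_spec(examples: list[tuple[str, str]], new_function: str) -> str:
    """Few-shot prompt built from (C function, known-good SAW spec) pairs,
    followed by the function we need a new spec for."""
    shots = "\n\n".join(f"C code:\n{c}\nSAW spec:\n{s}" for c, s in examples)
    return call_llm(f"{shots}\n\nC code:\n{new_function}\nSAW spec:")

def saw_accepts(spec_text: str) -> bool:
    """Write the candidate SAWScript to disk and run the `saw` CLI on it;
    a zero exit code is taken here as the proof going through."""
    with tempfile.TemporaryDirectory() as d:
        spec_path = pathlib.Path(d) / "candidate.saw"
        spec_path.write_text(spec_text)
        result = subprocess.run(["saw", str(spec_path)], capture_output=True)
        return result.returncode == 0
```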
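The macro-refolding check from the second experiment can be sketched the same way. The helper below assumes a nightly Rust toolchain, where `rustc -Zunpretty=hir` prints a textual HIR dump; a real pipeline would likely need to normalize the dumps (names, spans) before comparing, and this is not necessarily the mechanism our tool uses.

```python
import subprocess

def hir_of(rust_file: str) -> str:
    """Dump rustc's High-Level IR for a source file. `-Zunpretty=hir`
    requires a nightly toolchain; this is one plausible way to obtain an
    HIR dump, assumed here for illustration."""
    result = subprocess.run(
        ["rustc", "-Zunpretty=hir", rust_file],
        capture_output=True, text=True, check=True)
    return result.stdout

def refold_is_safe(original_rs: str, refolded_rs: str) -> bool:
    """Accept the LLM's macro refolding only if both versions lower to the
    same HIR, i.e., the rewrite is purely syntactic."""
    return hir_of(original_rs) == hir_of(refolded_rs)
```

Because HIR is produced after macro expansion, a correctly refolded macro expands back to the same intermediate representation as the original unfolded code, which is exactly what this equivalence check exploits.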
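Finally, the protocol-modeling experiments suggest using the LLM as a communications layer rather than a reasoner. One chatbot turn in that style might look like the sketch below, with `call_llm` and `run_formal_tool` as hypothetical stubs (the latter standing in for a tool such as ACL2s); the prompts are purely illustrative.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM wrapper."""
    raise NotImplementedError

def run_formal_tool(tool_input: str) -> str:
    """Hypothetical stand-in for invoking a formal methods tool (e.g.,
    ACL2s) on the generated input and capturing its output."""
    raise NotImplementedError

def chat_turn(user_message: str, model_source: str) -> tuple[str, str]:
    """One chatbot turn over a formal model: the LLM translates the user's
    request into tool input, the tool does the logical work, and the LLM
    renders the tool's output back into prose."""
    tool_input = call_llm(
        f"Formal model:\n{model_source}\n\nUser request:\n{user_message}\n"
        "Emit the tool input (query or modified model) implementing this.")
    tool_output = run_formal_tool(tool_input)
    summary = call_llm(f"Explain this tool output to the user:\n{tool_output}")
    return tool_input, summary
```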