Metrics for Large Language Model-Generated Proofs in a High-Assurance Application Domain    

Large Language Model (LLM) Artificial Intelligence (AI) systems have generated significant enthusiasm in the computer science research community for their potential to perform a variety of computer language processing tasks, including source code generation from natural language descriptions and source-to-source translation. We are interested in the use of LLMs for automated theorem proving, particularly proof repair. Proof repair is an especially good use case for LLMs. Although LLMs have a (well-deserved) reputation for producing "hallucinations", textual output that appears convincing at first glance but is in fact nonsensical, submitting LLM-generated candidate proofs to a theorem prover constitutes a foolproof "hallucination detector": if an AI-generated proof script is flawed, the prover will reject it.
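To make the "hallucination detector" concrete, the following is a minimal sketch (not our actual harness) of checking a candidate proof mechanically: write the LLM's output to a file and invoke the coqc compiler, whose exit status is the verdict. It assumes coqc is on the PATH; the helper name check_candidate_proof is illustrative.

```python
import subprocess
import tempfile
from pathlib import Path

def check_candidate_proof(coq_source: str) -> tuple[bool, str]:
    """Compile a candidate Coq file; acceptance by coqc is the verdict."""
    with tempfile.TemporaryDirectory() as tmp:
        vfile = Path(tmp) / "Candidate.v"
        vfile.write_text(coq_source)
        # coqc exits nonzero on any error, including a failed proof script,
        # so a hallucinated tactic sequence cannot slip through.
        result = subprocess.run(["coqc", str(vfile)],
                                capture_output=True, text=True, timeout=60)
        return result.returncode == 0, result.stderr

# A correct proof is accepted; a hallucinated one would be rejected.
ok, err = check_candidate_proof(
    "Lemma add_0_r : forall n : nat, n + 0 = n.\n"
    "Proof. induction n; simpl; try rewrite IHn; reflexivity. Qed.\n")
print("accepted" if ok else "rejected:\n" + err)
```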

We have experienced notable success using LLMs for proof generation over the proofbase of the Copland high-assurance remote attestation system. Copland is a domain-specific language, developed by researchers at the University of Kansas, for describing, analyzing, and executing attestation protocols. Its formal semantics is specified in the Coq theorem prover, and Copland implementation code is automatically generated from the Coq specification via verified synthesis techniques.

As a first step toward employing LLMs for proof repair on large proofbases, we have successfully used the GPT-4 LLM to regenerate Copland proofs that had previously been written by human users. We obtained greater than 60% accuracy with a single GPT-4 "shot", which improves to 90% with one additional "shot". With a maximum of three "shots", the overall proof correctness rate increases to 97%. We were able to generate Coq proofs of up to 50 proof steps with this technique, and our LLM-generated proofbase currently comprises over 1,400 lines of Copland Coq source.
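The multi-shot workflow itself is a simple bounded loop. The sketch below reuses check_candidate_proof from above; the prompt wording and the ask_llm wrapper are illustrative assumptions, not our production code. Each failed shot feeds the prover's error message back to the model for repair.

```python
def prove_with_retries(lemma_statement: str, ask_llm, max_shots: int = 3):
    """ask_llm(prompt) -> str is any chat-completion wrapper (e.g., GPT-4)."""
    prompt = f"Write a Coq proof script (Proof...Qed) for:\n{lemma_statement}"
    for shot in range(1, max_shots + 1):
        candidate = ask_llm(prompt)
        ok, err = check_candidate_proof(lemma_statement + "\n" + candidate)
        if ok:
            return shot, candidate  # accepted by Coq on this shot
        # Next shot: include the prover's error so the model can repair it.
        prompt = (f"The previous attempt failed with:\n{err}\n"
                  f"Repair the Coq proof of:\n{lemma_statement}")
    return None, None  # every shot was rejected by the prover
```

Because the prover is the oracle, accepting any shot is sound; extra shots only cost tokens, which motivates the cost and quality metrics described next.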

Given this initial success, we have undertaken research to reduce the amount of input text that must be submitted to the LLM, decreasing cost while increasing accuracy, and to develop statistical metrics that gauge the quality of LLM-generated candidate proofs. Our research questions are: 1. Can automated classification techniques be used to build a recommendation system that reduces the number of input tokens the LLM needs in order to prove a given lemma? 2. Can statistical methods assess the quality of utterances during an LLM proof-attempt conversation, providing feedback to the user on whether additional LLM "shots" are warranted? We will present initial results indicating that both research questions have affirmative answers; sketches of both ideas follow.
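For the first question, one plausible baseline (a sketch under our own assumptions, not the final recommendation system) is classical information-retrieval ranking: represent each lemma or definition in the proofbase as a TF-IDF vector over Coq identifiers and include in the prompt only the k premises most similar to the goal.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def select_premises(goal: str, premises: list[str], k: int = 5) -> list[str]:
    # Tokenize on Coq-style identifiers rather than English words.
    vec = TfidfVectorizer(token_pattern=r"[A-Za-z_][A-Za-z0-9_']*")
    matrix = vec.fit_transform(premises + [goal])
    # Similarity of the goal (last row) to every premise row.
    scores = cosine_similarity(matrix[-1], matrix[:-1]).ravel()
    ranked = sorted(zip(scores, premises), reverse=True)
    return [premise for _, premise in ranked[:k]]
```

For the second question, one natural per-utterance signal, again an assumption rather than our settled metric, is a statistic over the model's token log-probabilities for a reply (exposed by the GPT-4 API's logprobs option), thresholded to advise whether a further shot is likely to pay off.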


Dr. Amer Tahat is a Senior Engineer at Collins Aerospace. His previous roles include Assistant Research Professor at Pennsylvania State University (PSU) in the College of Information Sciences and Technology and Assistant Research Professor in Electrical and Computer Engineering at Virginia Tech. Dr. Tahat earned his Ph.D. in Computational Science and Engineering from the Computer Science Department at Michigan Technological University, with his doctoral research co-supervised by NASA Langley Research Center and focused on the safety of critical system development. In addition, he has completed postgraduate programs in Advanced Cybersecurity at Stanford University, in Cloud Computing at the University of Texas at Austin, and in Data Science and AI at MIT. Dr. Tahat has worked on projects supported by several agencies, including the Defense Advanced Research Projects Agency (DARPA), the Office of Naval Research (ONR), where he served as a Co-PI, and the National Science Foundation (NSF).
