Fine Tuning Large Language Model for Secure Code Generation
Author
Abstract

AI pair programmers, such as GitHub's Copilot, have shown great success in automatic code generation. However, such large language model-based code generation techniques risk introducing security vulnerabilities into codebases. In this work, we explore the direction of fine-tuning large language models to generate more secure code. We use real-world vulnerability fixes as our fine-tuning dataset and craft a C/C++ code-generation scenario dataset for evaluating and comparing the pre-trained and fine-tuned models. Our experiments on GPT-J show that the fine-tuned GPT-J achieved non-vulnerable code generation ratios of 70.4% for C and 64.5% for C++, a 10% increase for C and a slight increase for C++ compared with the pre-trained large language model.
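The abstract does not specify implementation details. As a rough illustration only, the sketch below shows how a causal language model such as GPT-J could be fine-tuned on vulnerability-fix code using the Hugging Face Trainer API; the file name vulnerability_fixes.jsonl, the single "text" field, and all hyperparameters are assumptions for illustration, not the paper's actual setup.

```python
# Illustrative sketch: fine-tune GPT-J on vulnerability-fix code (assumed data layout).
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_NAME = "EleutherAI/gpt-j-6B"  # pre-trained GPT-J checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # GPT-J defines no pad token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Hypothetical JSON-lines file: one record per fixed (patched) code sample,
# stored in a "text" field.
dataset = load_dataset("json", data_files="vulnerability_fixes.jsonl", split="train")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gptj-secure-code",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        learning_rate=1e-5,
        fp16=True,
    ),
    train_dataset=tokenized,
    # mlm=False gives standard causal-LM (next-token) training labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```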

Year of Publication
2024
Date Published
April
URL
https://ieeexplore.ieee.org/document/10599549
DOI
10.1145/3650105.3652299