K-ASTRO: Structure-Aware Adaptation of LLMs for Code Vulnerability Detection
ABSTRACT
Large Language Models (LLMs) are transforming software engineering tasks, including code vulnerability detection, a critical area of software security. However, existing methods often rely on resource-intensive models or graph-based techniques, limiting their accessibility and practicality. This paper introduces K-ASTRO, a lightweight Transformer model that combines semantic embeddings from LLMs with structural features of Abstract Syntax Trees (ASTs) to improve both efficiency and accuracy in code vulnerability detection. Our approach introduces an AST-based augmentation technique inspired by mutation testing, a structure-aware attention mechanism that incorporates augmented AST features, and a joint adaptation pipeline to unify code semantics and syntax. Experimental results on three large-scale datasets, BigVul, DiverseVul, and PrimeVul, demonstrate state-of-the-art performance while enabling rapid inference on CPUs with minimal training time. By offering a scalable, interpretable, and efficient solution, K-ASTRO bridges the gap between LLM advancements and practical software vulnerability detection, providing open-sourced tools to foster further research.
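The abstract does not specify K-ASTRO's exact featurization, and the benchmark datasets target C/C++ code. Purely as an illustration of the general idea of pairing a code snippet's text (the input to a semantic embedding model) with structural features derived from its AST, here is a minimal sketch using Python's built-in `ast` module; the node-type histogram below is a hypothetical stand-in for the paper's augmented AST features, not the method itself:

```python
import ast
from collections import Counter

def ast_node_type_counts(source: str) -> Counter:
    """Parse source code and count AST node types.

    The resulting histogram is one simple structural feature vector
    that could accompany a semantic embedding of the same snippet.
    (Illustrative only; not K-ASTRO's actual feature extraction.)
    """
    tree = ast.parse(source)
    return Counter(type(node).__name__ for node in ast.walk(tree))

snippet = "def f(x):\n    return x + 1\n"
features = ast_node_type_counts(snippet)
# e.g. features["FunctionDef"] == 1, features["BinOp"] == 1
```

In practice, a structure-aware model would fuse such AST-derived features with token embeddings inside the Transformer, rather than treating them as a separate bag of counts.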
BIO
Yifan Zhang (Chinese: 张一凡; Japanese: チャン・イーファン; Russian: Ифань Чжан) is a Ph.D. student in Computer Science at Vanderbilt University, advised by Prof. Kevin J. Leach and Prof. Yu Huang at the Institute for Software Integrated Systems (VU-ISIS). He is also pursuing the Online Master of Science in Computer Science (OMSCS) at the Georgia Institute of Technology. Before joining Vanderbilt and Georgia Tech, he earned his B.A., B.Eng., and M.Eng. from China University of Petroleum (CUP), Beijing, and worked as a full-time machine learning engineer at the Intelligent Risk Management Lab at JD.COM. During his Ph.D., he has also worked with Google Research, IBM Research, Intel Corporation, and ByteDance as a research intern. His research interests lie in AI for PL/SE, specifically in symbolizing code abstract interpretations for LLM-based neural code analysis and comprehension tasks.