Chinese Generation and Security Index Evaluation Based on Large Language Model
Author
Abstract

This study investigates the performance and security indicators of mainstream large language models on Chinese generation tasks, explores the potential security risks associated with these models, and offers suggestions for improvement. The study uses publicly available datasets to assess Chinese language generation, develops datasets and multidimensional security rating standards for the security evaluations, compares the performance of three models across five Chinese tasks and six security tasks, and conducts a Pearson correlation analysis between GPT-4-based scoring and questionnaire surveys. In addition, the study implements automatic scoring based on GPT-3.5-Turbo. The experimental findings indicate that the models excel in Chinese language generation. ERNIE Bot performs best on the ideology and ethics evaluations, ChatGPT on the rumor-and-falsehood and privacy security assessments, and Claude on the factual fallacy and social prejudice assessments. The fine-tuned model achieves high accuracy on the security tasks, yet all models exhibit security vulnerabilities. Incorporating prompt engineering proves effective in mitigating security risks. It is recommended that both domestic and foreign models adhere to the legal frameworks of each country, reduce AI hallucinations, continuously expand their corpora, and iterate through regular updates.
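The abstract mentions a Pearson correlation analysis between GPT-4-based automatic scoring and questionnaire surveys. The snippet below is only a minimal sketch of how such an agreement check could be run; the item scores are invented for illustration and the paper's actual data and scoring pipeline are not reproduced here.

```python
# Illustrative sketch (not the authors' code): measuring agreement between
# automatic model-based scores and human questionnaire ratings.
from scipy.stats import pearsonr

# Hypothetical per-item security scores produced by a GPT-4-based grader
gpt4_scores = [4.5, 3.0, 4.0, 2.5, 5.0, 3.5, 4.0, 2.0]

# Hypothetical mean ratings for the same items from a questionnaire survey
survey_scores = [4.0, 3.5, 4.5, 2.0, 5.0, 3.0, 4.5, 2.5]

# Pearson correlation coefficient and its two-sided p-value
r, p_value = pearsonr(gpt4_scores, survey_scores)
print(f"Pearson r = {r:.3f}, p = {p_value:.3f}")
```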

Year of Publication
2024
Date Published
August
URL
https://ieeexplore.ieee.org/document/10661189
DOI
10.1109/IALP63756.2024.10661189