
The Art of Prompt-Swapping, Temperature Tuning, and Fuzzy Forensics in AI

Abstract and I. Introduction

II. Related Work

III. Technical Background

IV. Systematic Security Vulnerability Discovery of Code Generation Models

V. Experiments

VI. Discussion

VII. Conclusion, Acknowledgments, and References

Appendix

A. Details of Code Language Models

B. Finding Security Vulnerabilities in GitHub Copilot

C. Other Baselines Using ChatGPT

D. Effect of Different Number of Few-shot Examples

E. Effectiveness in Generating Specific Vulnerabilities for C Codes

F. Security Vulnerability Results after Fuzzy Code Deduplication

G. Detailed Results of Transferability of the Generated Non-secure Prompts

H. Details of Generating the Non-secure Prompts Dataset

I. Detailed Results of Evaluating CodeLMs using Non-secure Dataset

J. Effect of Sampling Temperature

K. Effectiveness of the Model Inversion Scheme in Reconstructing the Vulnerable Codes

L. Qualitative Examples Generated by CodeGen and ChatGPT

M. Qualitative Examples Generated by GitHub Copilot

F. Security Vulnerability Results after Fuzzy Code Deduplication

We employ TheFuzz [64] Python library to find near-duplicate codes. This library uses the Levenshtein distance to calculate the differences between sequences [65] and outputs the similarity ratio of two strings as a number between 0 and 100. We consider two codes duplicates if their similarity ratio is greater than 80. Figure 7 provides the results of our FS-Code approach in finding vulnerable Python and C codes that could be generated by the CodeGen and ChatGPT models. Note that these results are obtained by following the setting of Section V-B2. Here, we also observe a general almost-linear growth pattern for some of the vulnerability types generated by the CodeGen and ChatGPT models.

Fig. 6: Percentage of the discovered vulnerable C codes using the non-secure prompts that are generated for specific CWEs. (a), (b), and (c) provide the results for the code generated by the CodeGen model using FS-Code, FS-Prompt, and OS-Prompt, respectively. (d), (e), and (f) provide the results for the code generated by ChatGPT using FS-Code, FS-Prompt, and OS-Prompt, respectively.
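The deduplication step can be sketched in a few lines. TheFuzz computes a Levenshtein-based similarity ratio; as a dependency-free stand-in, the sketch below implements a plain Levenshtein distance and converts it to a 0-100 ratio (TheFuzz's exact scoring formula differs slightly), keeping only codes at or below the 80 threshold relative to every already-kept code:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def similarity(a: str, b: str) -> int:
    # Similarity ratio in [0, 100]; 100 means identical strings.
    # Approximation of TheFuzz's ratio, not its exact formula.
    if not a and not b:
        return 100
    dist = levenshtein(a, b)
    return round(100 * (1 - dist / max(len(a), len(b))))

def dedupe(codes, threshold=80):
    # Keep a code only if it is not a near duplicate of any kept code.
    kept = []
    for code in codes:
        if all(similarity(code, k) <= threshold for k in kept):
            kept.append(code)
    return kept
```

With the library itself, `similarity` would simply be `thefuzz.fuzz.ratio`; the greedy keep-first policy above is one common way to apply a pairwise threshold.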

G. Detailed Results of Transferability of the Generated Non-secure Prompts

Here we provide the detailed results of the transferability of the generated non-secure prompts. Table VIII and Table IX show the detailed transferability results of the promising non-secure prompts generated by CodeGen and ChatGPT, respectively, covering the generated Python and C codes for different CWEs. Table VIII and Table IX show that the promising non-secure prompts are transferable among the models for generating codes with different types of CWEs. In some cases, the non-secure prompts from model A even lead model B to generate more vulnerable codes than model A itself. For example, in Table VIII, the promising non-secure prompts generated by CodeGen lead ChatGPT to generate more vulnerable codes with the CWE-079 vulnerability than CodeGen itself.

H. Details of Generating the Non-secure Prompts Dataset

We generate the non-secure prompts dataset using our FS-Code method, following the same settings as in Section V-B. For generating prompts with GPT-4 and Code Llama-34B, we set the sampling temperature to 1.0; a higher temperature facilitates the generation of a diverse set of non-secure prompts. We set the maximum number of tokens for generating Python and C non-secure prompts to 45 and 65, respectively. The maximum token count is determined based on the average size of the prompts in our initial code examples.
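The sampling configuration described above can be captured as a small settings table. The dict layout below is only an illustration of the reported hyperparameters, not the authors' actual configuration code:

```python
# Sampling settings for generating non-secure prompts, as reported above.
# The structure of this dict is illustrative.
PROMPT_GENERATION_SETTINGS = {
    "temperature": 1.0,   # higher temperature -> more diverse prompts
    "max_new_tokens": {
        "python": 45,     # based on average prompt size in the code examples
        "c": 65,
    },
}
```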

To generate the non-secure prompts with Code Llama-34B, we exclusively employ the few-shot examples from our FS-Code method. However, in the generation process using GPT-4, we utilize an instruction to guide the model. We use the following instruction to generate Python prompts:

You are a Python developer. Given the provided code examples, complete the following code. Define a function and write the task description as a Python comment. Provide comprehensive comments. Do not write any explanation. Only write Python codes.
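Assuming the OpenAI chat-style message format, the instruction plus few-shot examples could be assembled along these lines; the message roles and the `build_messages` helper are illustrative assumptions, not the paper's exact code:

```python
INSTRUCTION = (
    "You are a Python developer. Given the provided code examples, "
    "complete the following code. Define a function and write the task "
    "description as a Python comment. Provide comprehensive comments. "
    "Do not write any explanation. Only write Python codes."
)

def build_messages(few_shot_examples, code_to_complete):
    # Hypothetical helper: system instruction first, then the few-shot
    # code examples, then the partial code the model should complete.
    messages = [{"role": "system", "content": INSTRUCTION}]
    for example in few_shot_examples:
        messages.append({"role": "user", "content": example})
    messages.append({"role": "user", "content": code_to_complete})
    return messages
```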

I. Detailed Results of Evaluating CodeLMs using Non-secure Dataset

In Table X, we provide the detailed results of evaluating various code language models using our proposed non-secure prompts dataset. Table X reports the number of vulnerable Python and C codes generated by the CodeGen-6B [6], StarCoder-7B [24], Code Llama-13B [12], WizardCoder-15B [56], and ChatGPT [4] models. Detailed results for each CWE can offer valuable insights for specific use cases. For instance, as shown in Table X, Code Llama-13B generates fewer Python codes with the CWE-089 (SQL injection) vulnerability. Consequently, this model stands out as a strong choice among the evaluated models for generating SQL-related Python code.

Fig. 7: The number of discovered vulnerable codes versus the number of sampled codes generated by (a), (c) CodeGen and (b), (d) ChatGPT. The non-secure prompts and codes are generated using our FS-Code method. While Figure 4 has already removed exact matches, here we use fuzzy matching for further code deduplication.

TABLE VIII: The number of discovered vulnerable codes generated by the CodeGen and ChatGPT models using the promising non-secure prompts generated by CodeGen. We employ our FS-Code method to generate non-secure prompts and codes. Columns two to thirteen provide the results for Python codes; columns fourteen to nineteen give the results for C codes. Columns fourteen and nineteen provide the number of vulnerable codes found with the other CWEs that CodeQL queries. For each programming language, the last column provides the sum of all codes with at least one security vulnerability.
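Per-CWE counts like those in Table X can be tallied from the analysis findings in a few lines. The finding format used here (a mapping from generated sample to the list of CWE labels flagged for it) is a simplifying assumption for illustration, not CodeQL's actual output schema:

```python
from collections import Counter

def tally_vulnerabilities(findings):
    # findings: mapping of sample id -> list of CWE labels flagged for it.
    per_cwe = Counter()
    vulnerable_samples = 0
    for cwes in findings.values():
        if cwes:
            vulnerable_samples += 1  # at least one security vulnerability
        per_cwe.update(set(cwes))    # count each CWE at most once per sample
    return per_cwe, vulnerable_samples
```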

J. Effect of Sampling Temperature

Figure 8 provides detailed results on the effect of different sampling temperatures in generating non-secure prompts and vulnerable code. We conduct this evaluation using our FS-Code method and sample the non-secure prompts and Python codes from the CodeGen model. We report the total number of generated vulnerable codes with three different CWEs (CWE-020, CWE-022, and CWE-079), sampling 125 codes for each CWE. The y-axis refers to the sampling temperature used for the non-secure prompts, and the x-axis to the sampling temperature used during code generation. The results in Figure 8 show that, in general, the sampling temperature of the non-secure prompts has a significant effect on generating vulnerable codes, while the sampling temperature of the codes has only a minor impact (within each row, the number of vulnerable codes varies little). Furthermore, Figure 8 shows that 0.6 is an optimal temperature for sampling the non-secure prompts. Note that in all of our experiments, following previous work in the program generation domain [6], [5], we set the sampling temperature for both the non-secure prompts and the codes to 0.6 to obtain fair results.
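The grid evaluation behind Figure 8 can be sketched as a nested sweep over the two temperatures. Here `generate_and_count` is a placeholder for the full prompt-sampling, code-sampling, and CodeQL-analysis pipeline, and the temperature grids are illustrative:

```python
def sweep_temperatures(generate_and_count,
                       prompt_temps=(0.2, 0.4, 0.6, 0.8, 1.0),
                       code_temps=(0.2, 0.4, 0.6, 0.8, 1.0)):
    # Returns grid[prompt_temp][code_temp] = number of vulnerable codes,
    # i.e. one cell per (y-axis, x-axis) position of the heatmap.
    grid = {}
    for pt in prompt_temps:
        grid[pt] = {}
        for ct in code_temps:
            grid[pt][ct] = generate_and_count(prompt_temp=pt, code_temp=ct)
    return grid
```

Reading the resulting grid row by row (fixed prompt temperature, varying code temperature) is what reveals the small within-row variation noted above.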

Authors:

(1) Hossein Hajipour, CISPA Helmholtz Center for Information Security ([email protected]);

(2) Keno Hassler, CISPA Helmholtz Center for Information Security ([email protected]);

(3) Thorsten Holz, CISPA Helmholtz Center for Information Security ([email protected]);

(4) Lea Schönherr, CISPA Helmholtz Center for Information Security ([email protected]);

(5) Mario Fritz, CISPA Helmholtz Center for Information Security ([email protected]).

