The Dark Side Of AI: Reliability, Safety, And Security In Code Generation

Table of Links
Abstract and 1 Introduction
2. Prior conceptualisations of intelligent assistance for programmers
3. A brief overview of large language models for code generation
4. Commercial programming tools that use large language models
5. Reliability, safety, and security implications of code-generating AI models
6. Usability and design studies of AI-assisted programming
7. Experience reports and 7.1. Writing effective prompts is hard
7.2. The activity of programming shifts towards checking and unfamiliar debugging
7.3. These tools are useful for boilerplate and code reuse
8. The inadequacy of existing metaphors for AI-assisted programming
8.1. AI assistance as search
8.2. AI assistance as compilation
8.3. AI assistance as pair programming
8.4. A distinct way of programming
9. Issues with application to end-user programming
9.1. Issue 1: Intent specification, problem decomposition and computational thinking
9.2. Issue 2: Code correctness, quality and (over)confidence
9.3. Issue 3: Code comprehension and maintenance
9.4. Issue 4: Consequences of automation in end-user programming
9.5. Issue 5: No code, and the dilemma of the direct answer
10. Conclusion
A. Experience report sources
References
5. Reliability, safety, and security implications of code-generating AI models
AI models that generate code present significant challenges for reliability, safety, and security. Since the output of a model can be a complex software artifact, determining whether that output is “correct” requires a much more nuanced evaluation than for simple classification tasks. Humans have trouble evaluating the quality of software, and practices such as code review and static and dynamic analysis have proven necessary to ensure good quality in human-written code. Current methods for evaluating the quality of AI-generated code, as embodied in benchmarks such as HumanEval (Chen, Tworek, Jun, Yuan, de Oliveira Pinto, et al., 2021), MBPP (Austin et al., 2021), and CodeContests (Y. Li et al., 2022b), determine the functional correctness of entire functions based on a set of unit tests. Such evaluation approaches fail to consider code readability, completeness, or the presence of the kinds of potential errors that software developers constantly struggle to overcome.
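To make this concrete, the sketch below shows how a HumanEval-style evaluation judges a completion purely by whether it passes a set of unit tests. It is a minimal illustration, not the actual benchmark harness; the sample task, function name, and tests are invented.

```python
# Minimal sketch of unit-test-based functional correctness checking, in the
# spirit of benchmarks such as HumanEval/MBPP. Illustrative only; this is not
# the real harness, and the task below is invented.

def run_candidate(candidate_source: str,
                  test_cases: list[tuple[tuple, object]],
                  entry_point: str) -> bool:
    """Return True if the generated function passes every unit test."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)          # execute the generated code
        fn = namespace[entry_point]                # look up the target function
        return all(fn(*args) == expected for args, expected in test_cases)
    except Exception:                              # any crash counts as a failure
        return False


# Hypothetical model output for the prompt "return the larger of two numbers".
generated = """
def max_of_two(a, b):
    return a if a > b else b
"""

tests = [((1, 2), 2), ((5, 3), 5), ((-1, -7), -1)]
print(run_candidate(generated, tests, "max_of_two"))  # True: the tests pass,
# yet this verdict says nothing about readability, error handling, or security.
```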
Previous work (Chen, Tworek, Jun, Yuan, de Oliveira Pinto, et al., 2021) explores numerous implications of AI models that generate code, including over-reliance, misalignment (the mismatch between what the user prompt requests and what the user really wants), bias, economic impact, and security. While each of these topics is extensive and important, due to space limitations we only mention them briefly here and point to additional related work where possible. Over-reliance occurs when individuals make optimistic assumptions about the correctness of an AI model's output, leading to harm. For code-generating models, users may assume that generated code is correct and free of security vulnerabilities, and those assumptions may lead to lower-quality or insecure code being written and deployed. Existing deployments of AI models for code, such as GitHub Copilot (Ziegler, 2021), have documentation that stresses the need to carefully review, test, and vet generated code just as a developer would vet code from any external source. It remains to be seen whether over-reliance on AI code generation will result in new software quality challenges.
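As a hedged illustration of why such vetting matters, the hypothetical snippet below passes an obvious test yet crashes on an edge case the prompt never mentioned. The function and tests are invented for illustration and are not drawn from any particular model's output.

```python
# Illustration of the over-reliance risk discussed above: a plausible-looking
# generated function that passes a quick glance and the obvious test case but
# fails on an edge case. The function and tests are hypothetical.

def average(values: list[float]) -> float:
    """Hypothetical model suggestion: arithmetic mean of a list of numbers."""
    return sum(values) / len(values)

# The happy path encourages trusting the suggestion as-is...
assert average([1.0, 2.0, 3.0]) == 2.0

# ...but an input the prompt never mentioned crashes it, which a reviewer or a
# unit test for the empty case would have caught before deployment.
try:
    average([])
except ZeroDivisionError:
    print("Unhandled edge case: empty input")
```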
Since code-generating AI models are trained on large public repositories, low-quality training data may lead models to suggest low-quality code or code that contains security vulnerabilities. One early study of GitHub Copilot (Pearce et al., 2021) examines whether code suggestions contain known security vulnerabilities across a range of scenarios and finds cases where insecure code is generated. Beyond carefully screening new code with the existing static and dynamic tools that detect security vulnerabilities in human-written code, there are possible mitigations that reduce the likelihood that a model will make such suggestions. These include improving the overall quality of the training data by removing low-quality repositories, and fine-tuning the large language model specifically to reduce the output of known insecure patterns.
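The sketch below illustrates the kind of injection-prone pattern such studies report, alongside the parameterized alternative that screening tools and fine-tuning aim to encourage. The table, column names, and queries are invented for illustration and are not taken from any model's actual output.

```python
# Sketch of an insecure pattern of the kind reported by Pearce et al. (2021):
# SQL built by string interpolation, which standard static analyzers flag as
# injection-prone. The schema and data are illustrative.

import sqlite3

def find_user_insecure(conn: sqlite3.Connection, username: str):
    # Pattern a model may reproduce from low-quality training data:
    # attacker-controlled input is interpolated directly into the query.
    query = f"SELECT id, email FROM users WHERE name = '{username}'"
    return conn.execute(query).fetchall()

def find_user_parameterized(conn: sqlite3.Connection, username: str):
    # Safer equivalent: the driver binds the value, so input like
    # "x' OR '1'='1" is treated as data rather than SQL.
    return conn.execute(
        "SELECT id, email FROM users WHERE name = ?", (username,)
    ).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'alice', 'a@example.com')")

malicious = "x' OR '1'='1"
print(find_user_parameterized(conn, malicious))  # [] -- input treated as data
print(find_user_insecure(conn, malicious))       # returns every row in the table
```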