
The Uncanny Valley Of Code: Why AI-Generated Code Is So Hard To Debug

Abstract and 1 Introduction

2. Prior conceptualisations of intelligent assistance for programmers

3. A brief overview of large language models for code generation

4. Commercial programming tools that use large language models

5. Reliability, safety, and security implications of code-generating AI models

6. Usability and design studies of AI-assisted programming

7. Experience reports and 7.1. Writing effective prompts is hard

7.2. The activity of programming shifts towards checking and unfamiliar debugging

7.3. These tools are useful for boilerplate and code reuse

8. The inadequacy of existing metaphors for AI-assisted programming

8.1. AI assistance as search

8.2. AI assistance as compilation

8.3. AI assistance as pair programming

8.4. A distinct way of programming

9. Issues with application to end-user programming

9.1. Issue 1: Intent specification, problem decomposition and computational thinking

9.2. Issue 2: Code correctness, quality and (over)confidence

9.3. Issue 3: Code comprehension and maintenance

9.4. Issue 4: Consequences of automation in end-user programming

9.5. Issue 5: No code, and the dilemma of the direct answer

10. Conclusion

A. Experience report sources

References

7.2. The activity of programming shifts towards checking and unfamiliar debugging

When code can be generated quickly, as observed in the studies in Section 6, checking the correctness of the generated code becomes a major bottleneck. This shift, a tradeoff of faster authoring for more time spent checking code, is not without criticism. For some, it is the wrong balance of priorities between system and programmer.

Correspondingly, some users have developed heuristics for when the cost of evaluating the correctness of the code is greater than the time or effort saved by code generation, such as focusing on very short (e.g., single-line) completions and ignoring longer ones.

Furthermore, some users have found that rather than having suggestions show all the time, which can be distracting and time-consuming, they can use Copilot more intentionally by switching off autosuggestion and triggering code completion manually with a keyboard shortcut. However, this requires users to form a mental model of when Copilot is likely to help them in their workflow. This mental model takes time and intentionality to build, and may be incorrect. Moreover, it introduces a new cognitive burden of constantly evaluating whether the current situation would benefit from LLM assistance. Commenters on Hacker News raise these issues:

“I find I spend my time reviewing Copilot suggestions (which are mostly wrong) rather than thinking about code and actually doing the work.”

“[…] It’s much quicker to read code than to write it. In addition, 95% of Copilots suggestions are a single line and they’re almost always right (and also totally optional).[…] I admit that I’m paranoid every time it suggests more than 2 lines so I usually avoid it. […] I’ve run into Copilot induced headaches twice. Once was in the first week or so of using it. I sweared off [sic] of using it for anything more than a line then. Eventually I started to ease up since it was accurate so often and then I learned my second lesson with another mistake. […]”

“[…] writing code is not the bottleneck in need of optimization. Conceiving the solution is. Any time “saved” through Copilot and it’s ilk is immediately nullified by having to check it’s correctness. […]”

“What I want is a copilot that finds errors […] Invert the relationship. I don’t need some boilerplate generator, I need a nitpicker that’s smarter than a linter. I’m the smart thinker with a biological brain that is inattentive at times. Why is the computer trying to code and leaving mistake catching to me? It’s backwards.”

“I turned off auto-suggest and that made a huge difference. Now I’ll use it when I know I’m doing something repetitive that it’ll get easily, or if I’m not 100% sure what I want to do and I’m curious what it suggests. This way I get the help without having it interrupt my thoughts with its suggestions.”

Another frequent experience is that language models can introduce subtle, difficult-to-detect bugs of a kind that a human programmer writing code manually would not introduce. Thus, existing developer intuitions about the sources of errors in programs can be less useful, or even misleading, when checking the correctness of generated code.

One developer reported being offered an incorrect but plausible-sounding field name by Copilot (accessTokenSecret instead of accessSecret), and the consequent wild goose chase of debugging before discovering the problem. As sources of error, these tools are new, and developers need to learn new craft practices for debugging. “There are zero places that can teach you those things. You must experience them and unlock that kind of knowledge,” the developer concludes. “Don’t let code completion AI tools rule your work. […] I don’t blame [Copilot] for this. I blame myself. But whatever. At least I got some experience.” Commenters on Hacker News report similar experiences:

“[…] The biggest problem I’ve had is not that it doesn’t write correctly, it’s that it think it knows how and then produce good looking code at a glance but with wrong logic. […]”

“[…] it has proved to be very good at producing superficially appealing output that can stand up not only to a quick scan, but to a moderately deep reading, but still falls apart on a more careful reading. […] it’s an uncanny valley type effect. […] it’s almost the most dangerous possible iteration of it, where it’s good enough to fool a human functioning at anything other than the highest level of attentiveness but not good enough to be correct all the time. See also, the dangers of almost self-driving cars; either be self-driving or don’t but don’t expect halfway in between to work well.”

“[…] The code it generates looks right but is usually wrong in really difficult to spot ways but things you’d never write yourself.”
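The field-name mix-up described above (accessTokenSecret instead of accessSecret) can be illustrated with a minimal sketch. Only those two field names come from the experience report; the credentials dictionary, its accessToken field, the placeholder values, and both function names are hypothetical and do not reproduce the developer's actual code. The sketch assumes a lookup style that tolerates missing keys, which is one way such a bug can stay silent.

```python
# Hypothetical credentials object. Only the names "accessSecret" and
# "accessTokenSecret" come from the report; everything else is assumed.
credentials = {"accessToken": "token-123", "accessSecret": "secret-456"}


def build_auth_generated(creds: dict) -> dict:
    """A plausible-looking completion that reads a field that does not exist."""
    return {
        "token": creds.get("accessToken"),
        # Wrong field name: dict.get returns None instead of raising,
        # so the mistake only surfaces later, far from its cause.
        "secret": creds.get("accessTokenSecret"),
    }


def build_auth_strict(creds: dict) -> dict:
    """Same intent, but direct indexing makes a wrong field name fail at once."""
    return {
        "token": creds["accessToken"],
        "secret": creds["accessSecret"],
    }


generated = build_auth_generated(credentials)
strict = build_auth_strict(credentials)
print(generated["secret"])  # None: the bug is silent
print(strict["secret"])     # secret-456
```

Both functions survive a quick read, which is the point of the reports above: the incorrect version differs by a single plausible identifier. Preferring lookups that raise on unknown names is one possible way to turn this class of error into an immediate failure; it is offered here as an illustration, not as a practice taken from the sources.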

Many developers reported concerns around such tools repeating private information, or repeating copyrighted code verbatim, which might have implications for the licenses in their own projects. Notions of the dangers of such “stochastic parrots” (Bender et al., 2021) are not new and have been well-explored, and are not as directly connected to the user experience of programming assistance as some of the other concerns we have listed here. As such, we will not enter that discussion in depth here, except to mention that these concerns were present in several blog articles and online discussions.

Thus, in practice, programmers describe the challenges of writing effective prompts, misinterpreted intent, code that includes subtle bugs or poor programming practices, the burden of inspecting and checking that generated code is correct, and worries about private information, plagiarism and copyright.

