The Challenge of Verifying AI-Generated Code
“Namely, programming assistants eliminate effort writing code and replace it with effort reviewing code” — Yan et al. (2024)
In the rapidly evolving landscape of AI-assisted programming, a crucial challenge has emerged: how can developers effectively verify the code generated by AI tools? Recent academic research published at CHI 2024 has shed light on this issue and proposed innovative solutions. Let’s dive into these findings and explore their implications for the future of coding.
The verification challenge, quantified
A group of researchers from MIT and Microsoft delved into how programmers interact with GitHub Copilot, a popular AI-driven code completion tool, in their study titled “Reading Between the Lines: Modeling User Behavior and Costs in AI-Assisted Programming” (ACM DL). The findings are eye-opening: programmers dedicate 22.4% of their coding time to verifying AI suggestions, highlighting the need for tools and techniques to streamline this process.
To quantitatively study this verification challenge, the authors developed a novel taxonomy of common programmer activities when interacting with code-recommendation (CodeRec) systems like GitHub Copilot. This taxonomy, called CodeRec User Programming States (CUPS), includes 12 distinct states that capture the diverse range of actions programmers take when using AI assistance. These states encompass activities such as actively verifying suggestions, thinking about new code to write, waiting for suggestions, crafting prompts, and editing previously accepted suggestions.
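To make the taxonomy concrete, here is a minimal sketch of how CUPS-style labels might be represented in code. Only the states named above come from the paper; the enum below is illustrative and does not reproduce the authors’ full 12-state definition.

```python
from enum import Enum, auto

class CupsState(Enum):
    """Illustrative subset of CUPS states. The paper defines 12 states;
    only the ones named in the text above are sketched here."""
    VERIFYING_SUGGESTION = auto()        # reading a suggestion to decide whether to accept it
    THINKING_ABOUT_NEW_CODE = auto()     # planning what code to write next
    WAITING_FOR_SUGGESTION = auto()      # idle while the assistant generates a suggestion
    PROMPT_CRAFTING = auto()             # writing comments or code to steer the next suggestion
    EDITING_ACCEPTED_SUGGESTION = auto() # fixing up a suggestion after accepting it
```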
Equipped with this taxonomy, the researchers then conducted a user study with 21 developers. Participants were asked to retrospectively label their coding sessions using the CUPS taxonomy while reviewing screen recordings of their interactions with Copilot.
The study revealed that a significant portion of developers’ time was spent on activities directly related to interacting with Copilot, particularly verifying suggestions. On average, participants spent 51.5% of their coding sessions on Copilot-related activities, with verifying suggestions being the most time-consuming state, accounting for 22.4% of the average session duration.
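As a rough illustration of how such per-state shares can be computed from retrospectively labeled sessions, consider the sketch below. The segment format and the numbers are invented for the example; they are not the study’s data.

```python
from collections import defaultdict

def state_time_fractions(segments):
    """segments: (state_label, duration_in_seconds) pairs covering one
    coding session. Returns each state's share of the total session time."""
    totals = defaultdict(float)
    for state, duration in segments:
        totals[state] += duration
    session_length = sum(totals.values())
    return {state: t / session_length for state, t in totals.items()}

# Hypothetical labeled session (invented numbers, not study data).
session = [
    ("verifying_suggestion", 135.0),
    ("writing_new_code", 240.0),
    ("prompt_crafting", 45.0),
    ("waiting_for_suggestion", 30.0),
    ("verifying_suggestion", 90.0),
]

for state, share in sorted(state_time_fractions(session).items(), key=lambda kv: -kv[1]):
    print(f"{state:25s} {share:6.1%}")
```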
Two approaches to tackling the verification challenge
Two research papers published at the same venue propose complementary solutions to this verification challenge.
Lightweight Anchored Explanations of Just-Generated Code
The first approach, developed by Yan et al. (ACM DL), focuses on providing instant, concise explanations alongside AI-generated code within the code editor. These explanations, called “instantly visible in-situ explanations” (Ivie), are designed to be unobtrusive and easily dismissible, offering insights into the purpose and functionality of the code snippets. By providing context and clarification directly within the coding environment, Ivie aims to reduce the cognitive load associated with verification.
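Ivie itself lives inside the editor and uses a language model to generate its explanations, but the core idea of anchoring short explanations to spans of just-generated code can be sketched roughly as follows. The data structures and margin rendering here are simplifying assumptions, not the paper’s API.

```python
from dataclasses import dataclass

@dataclass
class AnchoredExplanation:
    """A short explanation tied to a span of just-generated code."""
    start_line: int   # first line of the generated code the note explains
    end_line: int     # last line (inclusive)
    summary: str      # one-line, glanceable explanation shown next to the code

def render_in_margin(code_lines, explanations):
    """Print code with explanations aligned in a right-hand margin,
    mimicking an in-editor, in-situ presentation."""
    notes = {e.start_line: e.summary for e in explanations}
    for i, line in enumerate(code_lines, start=1):
        margin = f"  # <- {notes[i]}" if i in notes else ""
        print(f"{line:<40}{margin}")

generated = [
    "def moving_average(xs, k):",
    "    window = sum(xs[:k])",
    "    out = [window / k]",
    "    for i in range(k, len(xs)):",
    "        window += xs[i] - xs[i - k]",
    "        out.append(window / k)",
    "    return out",
]
notes = [
    AnchoredExplanation(2, 3, "seed the first window and its average"),
    AnchoredExplanation(4, 6, "slide the window: add the new value, drop the oldest"),
]
render_in_margin(generated, notes)
```

In Ivie, these explanations are generated automatically and rendered directly in the editor next to the code they describe, so the programmer never has to leave the coding context to read them.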
In a lab study comparing Ivie to a chatbot-based code comprehension tool, researchers found that Ivie led to improved comprehension of generated code, decreased perceived task load, and was considered less distracting than the chatbot.
Validating AI-Generated Code with Live Programming
Ferdowsi et al. (ACM DL) proposed leveraging “live programming” (LP), a technique that continuously displays a program’s runtime values as the code changes. The authors integrated LP into a Python editor with an AI assistant, and their user study found that it reduced both over- and under-reliance on AI-generated code and lowered the cognitive load of validation, particularly for tasks well-suited to validation by execution. The continuous feedback lets errors be spotted and corrected quickly, making the verification process more efficient.
LEAP, the live programming environment the authors built to evaluate the technique, stands on the shoulders of Python Tutor, a well-established educational tool that visualizes code execution. LEAP lets users not only see the runtime values of AI-suggested code but also quickly switch between multiple suggestions and observe how each one executes. Being able to compare and contrast the options side by side gives a deeper understanding of each suggestion’s behavior and potential consequences.
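LEAP is built into the editor and reuses Python Tutor’s visualizations, but the kind of information live programming surfaces, namely the values variables take on at each executed line, can be captured in plain Python with a tracing hook. The sketch below is a simplified stand-in for that idea, not LEAP’s implementation.

```python
import sys

def trace_runtime_values(func, *args, **kwargs):
    """Run func and record a snapshot of its local variables at every
    executed line, roughly the data a live programming view displays."""
    snapshots = []

    def tracer(frame, event, arg):
        if event == "line" and frame.f_code is func.__code__:
            snapshots.append((frame.f_lineno, dict(frame.f_locals)))
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args, **kwargs)
    finally:
        sys.settrace(None)
    return result, snapshots

# Example: inspect an AI-suggested implementation on a concrete input.
def suggested_clamp(values, lo, hi):
    out = []
    for v in values:
        out.append(min(max(v, lo), hi))
    return out

result, trace = trace_runtime_values(suggested_clamp, [3, -1, 9], 0, 5)
print("result:", result)
for lineno, local_vars in trace:
    print(f"line {lineno}: {local_vars}")
```

Running a suggestion on a small, representative input and watching its intermediate values is essentially the validation workflow that LEAP automates inside the editor.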
Combining these two approaches
While both approaches aim to enhance users’ ability to verify AI-generated code, they differ in their fundamental strategies. Ivie focuses on providing explanations of code to aid understanding, while live programming emphasizes real-time feedback through visualizing execution traces. These approaches are not mutually exclusive and could potentially be combined for a more comprehensive solution.
Imagine a scenario where Ivie’s explanations are integrated into a live programming environment. As the AI assistant generates code, Ivie would provide instant explanations, and the live programming environment would display the runtime values. This combination could offer a powerful toolset for programmers, allowing them to understand the code’s intent and observe its behavior simultaneously.
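Purely as an illustration of what such a combined view might present (no such tool exists yet, and the content below is invented), this mock-up pairs an anchored explanation and observed runtime values with each generated line:

```python
# Hypothetical combined view: each generated line carries an optional anchored
# explanation plus the runtime values observed for a sample input (all invented).
combined_view = [
    ("window = sum(xs[:k])",        "seed the first window",     {"window": 6}),
    ("out = [window / k]",          None,                        {"out": [2.0]}),
    ("window += xs[i] - xs[i - k]", "slide the window forward",  {"window": 9, "i": 3}),
]

for code, explanation, runtime_values in combined_view:
    note = f"  # {explanation}" if explanation else ""
    print(f"{code:<34}{note}")
    print(f"{'':<34}  # runtime: {runtime_values}")
```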
Final thoughts
Research on supporting verification of AI-generated code is still in its early stages, but the potential for improving developer productivity is immense. As AI plays an ever larger role in software development, it’s crucial to invest in tools and techniques that help programmers trust and effectively use AI-generated code.
The advances described in this blog post suggest a notable advantage for programming languages and frameworks with fast compilers or interpreters in the age of AI-assisted programming. Such languages enable rapid feedback through live programming and similar techniques, allowing developers to spend less time manually verifying AI-generated code and more time on higher-level tasks. This is worth considering for language and framework designers as they evolve their systems for AI-assisted programming.
References
Hussein Mozannar, Gagan Bansal, Adam Fourney, and Eric Horvitz. 2024. Reading Between the Lines: Modeling User Behavior and Costs in AI-Assisted Programming. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI ‘24). Association for Computing Machinery, New York, NY, USA, Article 142, 1–16. https://doi.org/10.1145/3613904.3641936
Litao Yan, Alyssa Hwang, Zhiyuan Wu, and Andrew Head. 2024. Ivie: Lightweight Anchored Explanations of Just-Generated Code. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI ‘24). Association for Computing Machinery, New York, NY, USA, Article 140, 1–15. https://doi.org/10.1145/3613904.3642239
Kasra Ferdowsi, Ruanqianqian (Lisa) Huang, Michael B. James, Nadia Polikarpova, and Sorin Lerner. 2024. Validating AI-Generated Code with Live Programming. In Proceedings of the CHI Conference on Human Factors in Computing Systems (CHI ‘24). Association for Computing Machinery, New York, NY, USA, Article 143, 1–8. https://doi.org/10.1145/3613904.3642495