Smarter API Reviews With Gen AI
With the rise of generative AI, businesses are trying to figure out how they can implement it in their environment to speed up one process or another. Perhaps the best place to start looking are existing pain points in any process, then working through ideas of how AI can apply to the situation.
(This blog post was written by a Real Human 😜)
Dall-E 3 image prompt: A futuristic, intelligent robot with a sleek design sitting at a computer, analyzing code diffs on a screen. The robot has a friendly, approachable appearance, and the screen shows complex code with highlighted sections. The environment suggests a modern, high-tech office space.
One major pain point in The Qt Project has historically been completing API Reviews on-time before a release. Since API additions and changes are significant alterations to how the Qt Framework is used, and such changes to existing APIs are often breaking for users, these changes must be carefully considered before being allowed into the final release. Adding new features to Qt often means adding new APIs, and we want those new APIs to be well designed and stable for users in the future. At times though, existing APIs need to be changed and we need to also ensure that those changes are well reasoned and have no alternatives that avoid breaking compatibility for users.
The way this has been done for years works, but has lead to missed release dates on a number of occasions where a significant change to the API was initially merged, but upon review at release time, needs to be retracted or significantly changed before final release. To combat this delay between implementation of an API change and final preparation for a release, we wanted to move the reviewing of API changes earlier in the cycle. But how?
Initial discussion about this centered around just simply tagging any header file changes for manual review, but the noise created by this was deemed to be excessive. Such a solution would just lead to even more work. But what if we could offload some of initial code analysis to an AI to at the least, determine if a change should be deemed "significant" or not?
What is GPT?
GPT, short for "Generative Pretrained Transformer", is a complex mathematical model that "understands" the relationships between data. In uses like ChatGPT which you may have used, the model has been trained on language data to create a model of how words in a language relate to each other. By studying billions of words in books, documents, blog posts, recording transcripts and so on, the model understands how one word relates to another depending on contexts. In this way, it can also understand new inputs and generate outputs, word-by-word, forming a response similar to how we speak. When it generates a new word for output, it reviews the context and generates the most likely to occur next word in the sentence.
There are now many LLMs (Large Language Model) out there today, and OpenAI's GPT-4 model is one such LLM. There have been four major iterations in GPT thus far, with each leapfrogging the previous model in capability, memory, and comprehension.
While GPT has traditionally been used for human language comprehension, it's just a model and can also be trained to understand relationships in other types of data as well, such as code. This means that we can have a chat with it about some block of code and ask it for analysis or have a back-and-forth conversation to better understand the code ourselves.
Where we stand today
Since mid-December, we've been operating a proof-of-concept bot which monitors changes submitted to codereview.qt-project.org and runs the generated diff through a GPT-4 analysis. Since generative AI does not act on its own, it must be prompted in order to generate an output. A prompt can contain instructions, a request, contextual information, and so on. Here's what we use for our API reviews, along with a raw diff of a given code change:
[abridged] "Task: Classify the change in a public header file as significant to the behavior and usage of the API or not. Additional qualifications: changes to 'private:' sections of public headers are not significant; changes to platform plugins, sometimes identified by the file path, are not significant; whitespace-only changes are not significant; ..."
Along with some backend tricks to shoehorn GPT into providing a relevant response, the behavior has proven generally reliable when using GPT-4. In about one month of operation, over 230 changes have been tagged with the "Needs API-Review" hashtag, each receiving a brief analysis of what changes make it significant to the use and operation of the public API.
Example of an addition of a new API
In this example, new functionality is being added to QRemoteObjectHost.
Git diff: Show more
When prompted, GPT-4 provides the following analysis of the new API additions, summarizing them nicely:
- A new
#include <QtRemoteObjects/qremoteobjectnode.h>directive is added in
qconnectionfactories.h, which could potentially expose additional symbols or types that were not available before in this header. This change can affect users who include this header.
- The addition of the
setError(QRemoteObjectNode::ErrorCode)signal in the
qconnectionfactories.hfile is a significant change. It represents a new API for signaling errors with an error code, which will affect how users of the API can handle error conditions.
- The addition of a new
SocketAccessErrorenum value to the
qremoteobjectnode.his a significant change. It adds a new potential error state that users of the API will need to handle.
- The addition of the
static void setLocalServerOptions(QLocalServer::SocketOptions options)function to the
qremoteobjectnode.his a significant change. It introduces a new API for setting options on the local server, which will influence how users configure their instances of
These changes can affect both the behavior of existing code (if it relies on previous error codes or does not expect the new
setErrorsignal) and the way developers use the API (by providing new methods and configuration options).
In this case, GPT identifies the new behavior, but also makes a note that even though this is an addition, it has the potential to cause issues for existing users if their code is not ready to handle the new signal. Such a warning may be useful to remind a reviewer that additional scrutiny and perhaps a documentation note is needed before merging the change.
Example of a change to an existing API
When a simple change is made to an existing API in qvariantanimation.h, this what the bot sees:
When prompted, GPT-4 provides the following analysis which can be used to prime the reviewer later about what change was made:
"In qvariantanimation.h, the change from a typedef to a using declaration for KeyValue alters the underlying type from QPair to std::pair, potentially impacting how API users interact with KeyValue instances."
Generative AI is a fickle thing. As mentioned earlier, it just guesses the next word, one at a time, over and over until it thinks it should stop. (Check out how LLMs function, it's pretty mind boggling.) While GPT 3.5 is pretty great for having a chat with, it often forgets what you told it, and doesn't follow instructions very well. It also suffers from "recency bias", where the words later in your prompt are considered more important that earlier words. This a bit of a problem since some of the instructions may be ignored, or worse, some of the change's diff being passed are simply not considered when generating output. This resulted in a less than stellar hit rate when using GPT 3.5 for analysis. Even taking steps to request a more consistent and less creative output, it would still inconsistently ignore whole parts of the request, or even just hallucinate entirely what was added or removed in a change.
Dall-E 3 prompt: A confident robot in an office, proudly presenting incorrect information on a whiteboard. The robot has an air of self-assurance, using a marker to underline wrong answers with a smile. The scene is slightly humorous, highlighting the robot's misplaced confidence in its incorrect answers. The environment is a modern office setting, and the whiteboard is filled with equations and text, with some parts clearly marked as incorrect but the robot appears oblivious.
In November, Microsoft introduced a beefed up version of GPT 3.5, called 3.5-instruct, which was supposed to improve its instruction-following ability as the name suggests. While this worked better, the model still suffered from hallucinations of what was actually changed in a diff.
To get around some of these issues, a best-of-three model was tried such that two of the three shots needed to agree on the significance of the change before adopting a result. This at least improved overall accuracy, but outputs were still lacking detail and clearly suffering from not fully understanding the changes made at a software development level.
When the cost of GPT-4 was lowered, even a cursory test showed greatly improved results. GPT-4 has a better memory for what it reads, and is able to more clearly comment on its own reasoning behind a decision. Due to higher accuracy, the bot was moved back to a single-shot configuration and only minor tweaks to the prompt have been needed since.
Next Steps and beyond
AI is not a magic bullet. If it ever is, that's still quite far off from today. Unlike a human, it has no idea what it's actually doing; it simply responds to a contextual prompt by choosing the most likely next word. That means that if it chooses wrongly at any point, the rest of the reply can be sent down the wrong path and producing a confident, but very wrong answer. Some steps can be taken to mitigate this, but not without time and cost. Despite the shortcomings of current generation generative AI, we have a solid start to reducing the effort for our API reviews.
Next, we want to explore larger contexts for evaluating the significance of changes, including multiple files in a single-shot evaluation so the full change is understood, instead of the file-by-file evaluation that occurs today. Further, we hope that future iterations of GPT or related LLM tech can improve the hit rate and instruction-following. Even though GPT-4 is miles ahead of 3.5 for this use case, it still makes mistakes and ignores context on occasion.
By shifting the review to a per-change effort at the time of approval for merging, we can save hours spent on reviews during the crunch time right before release. This new method also gives a bit more context so that discussions on the necessity of a change need to only happen once. In our traditional reviews, a rather complicated script performed a bunch of hard-coded logic to exclude irrelevant lines of code, creating throwaway commits that collected all the API changes together for review. Although the previous method provided a quick overview of the changes, it lacked the necessary context. Additionally, tracking down the source of the change and discussing its importance was a time-consuming process. Given the significant amount of time that could elapse between writing a change and the API review, it could be challenging to remember the reasoning behind each change.
This proof-of-concept API Review Bot is just another tool to help everyone in The Qt Project contribute more easily and get the attention a change deserves sooner. In the end, every change still requires human eyes on it, but we hope that this new bot makes that a little easier.
Subscribe to our newsletter
Try Qt 6.6 Now!
Download the latest release here: www.qt.io/download.
Qt 6.6. is a feature release with focus on improving UX capabilities including responsive UI technology and the Qt Graph module.
Check out all our open positions here and follow us on Instagram to see what it's like to be #QtPeople.