The hype around generative AI and Large Language Models (LLMs) has produced some bold claims: that software developers are becoming obsolete, that AI writes better code than humans, and that the age of the human programmer is drawing to a close. But how much of that is actually backed by empirical evidence? A 2025 study published at the IEEE/ACM International Conference on Mining Software Repositories (MSR), one of the top venues in software engineering research, takes a serious empirical stab at the question.
The paper, by Jamil, Abid, and Shamail, is titled "Can LLMs Generate Higher Quality Code Than Humans? An Empirical Study."1) This article is a summary of what they did, what they found, and more importantly, what their findings don't quite tell us.
Note that this article deliberately takes a deep look at a single study. It does not aim to summarize all publications on the topic (which would inevitably lead to a superficial treatment).
The researchers asked two questions:
They tested two versions of ChatGPT — GPT-3.5-Turbo and GPT-4 — and used three levels of prompt detail:
That gives six combinations in total, all compared against hand-written Python code from the HumanEval benchmark — a dataset of 164 Python functions created by OpenAI to evaluate their LLM’s coding ability.
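To make the setup concrete, each HumanEval task consists of a function signature with a docstring (the prompt handed to the model), a canonical human-written solution, and a set of unit tests. The snippet below is an illustrative task in that style; it is not an actual entry from the dataset.

```python
# Illustrative task in the style of HumanEval (not an actual entry from
# the dataset). The model sees the signature and docstring and must
# produce the body; the canonical solution below stands in for the
# "human-written" side of the comparison.

def running_maximum(numbers: list[int]) -> list[int]:
    """Return a list where element i is the largest value seen in
    numbers[0..i]."""
    best: list[int] = []
    for n in numbers:
        best.append(n if not best else max(best[-1], n))
    return best


# Each task also ships unit tests that define functional correctness.
assert running_maximum([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
assert running_maximum([]) == []
```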
This is where things get interesting — and a little tricky.
The study uses a two-pronged definition of quality: functional correctness (does the code actually work?) and maintainability (is it well-written and easy to work with?).
Correctness is measured by running the test cases that ship with HumanEval. Maintainability is measured through a battery of standard code metrics, computed via tools like Radon, Complexipy, Bandit, and Pylint:
Except for the maintainability index, lower is better.
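As a rough illustration of what collecting such metrics involves, here is a minimal sketch using Radon's Python API to compute line counts, cyclomatic complexity, and the maintainability index for a single function. This is not the authors' pipeline; the study additionally runs Complexipy, Bandit, and Pylint, which are omitted here.

```python
# Minimal sketch of metric collection with Radon's Python API.
# Illustrates the kind of measurement the study performs, not the
# authors' actual tooling.
from radon.complexity import cc_visit
from radon.metrics import mi_visit
from radon.raw import analyze

SOURCE = '''
def running_maximum(numbers):
    best = []
    for n in numbers:
        best.append(n if not best else max(best[-1], n))
    return best
'''

raw = analyze(SOURCE)              # raw line counts (LOC, LLOC, SLOC, ...)
blocks = cc_visit(SOURCE)          # cyclomatic complexity per function
mi = mi_visit(SOURCE, multi=True)  # maintainability index (higher is better)

print("LOC:", raw.loc)
print("Cyclomatic complexity:", max(b.complexity for b in blocks))
print("Maintainability index:", round(mi, 1))
```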
To combine all these metrics into a single verdict, the authors use a decision-making technique called TOPSIS (Technique for Order of Preference by Similarity to Ideal Solution). The idea is to find the solution closest to the theoretical "best" across all metrics simultaneously, and farthest from the theoretical "worst." It's a reasonable approach, but it does require you to define what the ideal best and worst values actually are — which turns out to be non-trivial for metrics like LOC.
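For readers unfamiliar with the technique, the sketch below is a generic, minimal TOPSIS implementation, not the authors' code. It assumes vector normalization, equal weights, and ideal values taken from the observed data, all of which are common defaults that may differ from the study's exact setup.

```python
import numpy as np

def topsis(matrix: np.ndarray, weights: np.ndarray, benefit: np.ndarray) -> np.ndarray:
    """Rank alternatives (rows) over criteria (columns).

    benefit[j] is True if higher is better for criterion j (e.g. the
    maintainability index), False if lower is better (e.g. LOC, complexity).
    Returns a closeness score in [0, 1]; higher means closer to the ideal.
    """
    # Vector-normalize each criterion column, then apply the weights.
    norm = matrix / np.linalg.norm(matrix, axis=0)
    weighted = norm * weights

    # Ideal best/worst per criterion depend on its direction.
    best = np.where(benefit, weighted.max(axis=0), weighted.min(axis=0))
    worst = np.where(benefit, weighted.min(axis=0), weighted.max(axis=0))

    # Euclidean distance to the ideal best and ideal worst.
    d_best = np.linalg.norm(weighted - best, axis=1)
    d_worst = np.linalg.norm(weighted - worst, axis=1)
    return d_worst / (d_best + d_worst)

# Toy example: two code variants scored on [LOC, complexity, MI].
scores = np.array([[12, 3, 72.0],   # variant A
                   [18, 2, 65.0]])  # variant B
weights = np.full(3, 1 / 3)
benefit = np.array([False, False, True])  # only MI is "higher is better"
print(topsis(scores, weights, benefit))
```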
On correctness: The hand-written HumanEval functions pass all their tests by definition — they wouldn't be in the benchmark otherwise. The best-performing LLM combination (GPT-4 with the most detailed prompt) still produced 21 incorrect functions out of 164, a failure rate of roughly 13%. That's not insignificant. The bottom line: there's a high probability that any given piece of LLM-generated code contains bugs. Tests and review are not optional.
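Mechanically, the correctness check is simple: execute the generated completion, then run the task's assertions against it. A stripped-down version of that harness might look like the sketch below; real evaluation harnesses sandbox the execution and enforce timeouts, which this omits.

```python
# Stripped-down illustration of a HumanEval-style correctness check.
# Real harnesses run this in a sandboxed subprocess with a timeout;
# exec'ing untrusted model output directly is unsafe.

GENERATED = '''
def running_maximum(numbers):
    best = []
    for n in numbers:
        best.append(n if not best else max(best[-1], n))
    return best
'''

TESTS = '''
assert running_maximum([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
assert running_maximum([]) == []
'''

def passes(candidate_src: str, test_src: str) -> bool:
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the generated function
        exec(test_src, namespace)       # run the benchmark's assertions
        return True
    except Exception:
        return False

print(passes(GENERATED, TESTS))  # True for this candidate
```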
On maintainability: When the researchers restricted the comparison to only the correctly generated functions, GPT-4 with the detailed prompt came out ahead in a meaningful portion of cases. Specifically, compared to the human-written code, it produced functions with:
The authors describe GPT-4 as demonstrating "superior performance across all evaluated metrics." But there's a subtlety worth paying attention to here.
The study reports the share of cases where generated code beats human-written code — but not the share where it loses. Take LOC as an example: if generated code has fewer lines in 47% of cases, that means it has more lines in 53% of cases (assuming the two are never identical). By that logic, for lines of code, human-written code is actually the winner more often than not.
The authors don't highlight this inversion, which makes the results harder to interpret. It's not that the paper is misleading — but cherry-picking one side of a comparison can create an overly rosy picture.
There's also a deeper methodological question: are these metrics even good proxies for maintainability? Decades of software engineering research haven't conclusively validated most of them as reliable predictors of how hard code is to maintain in practice. Shorter code isn't always better code. Lower complexity doesn't always mean more readable. And for security findings from static analysis tools, both false positives and false negatives are common.
Finally, when you're aggregating multiple metrics, it matters whether they're independent. If several metrics are highly correlated — measuring essentially the same thing — giving each of them equal weight in TOPSIS effectively double- or triple-counts that dimension, potentially skewing the outcome.
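A cheap sanity check before aggregating is to inspect the correlation structure of the metric columns: coefficients near +1 or -1 flag criteria that equal TOPSIS weights would effectively count twice. A sketch, using invented metric values purely for illustration:

```python
import numpy as np

# Hypothetical decision matrix: one row per function, one column per
# metric (e.g. LOC, cyclomatic complexity, maintainability index).
# Values are invented for illustration.
metrics = np.array([
    [10, 2, 78.0],
    [25, 5, 61.0],
    [14, 3, 70.0],
    [40, 9, 48.0],
])

# Pairwise Pearson correlations between metric columns. Near-collinear
# columns indicate redundant criteria that equal weights would double-count.
corr = np.corrcoef(metrics, rowvar=False)
print(np.round(corr, 2))
```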
The study's sample is entirely drawn from HumanEval, which consists of short, stand-alone Python functions — the kind of tasks you'd encounter in a coding interview. There's no guarantee this is representative of real-world codebases, which tend to involve longer, more intertwined, context-heavy code. Selecting HumanEval simply because it is readily available is what researchers call convenience sampling. It's a common pragmatic choice, but it limits how broadly you can generalize the conclusions.
HumanEval is also a benchmark created by OpenAI, the same organization behind GPT. That's worth keeping in mind: LLMs may have been trained in ways that make them particularly well-suited to performing on this specific benchmark. Whether the results hold for other code, other languages, or more complex software systems is an open question.
The study's own summary puts it well: "[While] GPT models can generate code with higher internal quality in some cases, they do not consistently produce correct or trustworthy code."
That's a nuanced and honest conclusion. For small, well-defined Python functions with good prompts, GPT-4 can produce code that is at least competitive with human-written code on several metrics — and sometimes cleaner. But roughly 1 in 8 generated functions will be functionally wrong, and the maintainability advantage is marginal and context-dependent.
In practical terms, this means:
The era of AI replacing developers wholesale hasn't arrived. What has arrived is a genuinely useful tool for generating code — one that can save time and sometimes produce cleaner results, but only when treated as a starting point rather than a finished product.
Sometimes — but not reliably, and not without caveats. Based on this study, GPT-4 with a well-crafted prompt can generate Python code that scores better than human-written code on several maintainability metrics in a meaningful portion of cases. In that narrow sense, yes, LLMs can produce "better" code. But the picture falls apart quickly under scrutiny. Around 13% of the generated functions are outright incorrect, the maintainability gains are modest and metric-dependent, and the entire comparison is limited to short, self-contained coding tasks — a far cry from the complexity of real software projects.
The honest answer is that "better" depends heavily on how you measure it, what kind of code you're writing, and what you do with the output. LLMs are capable of producing clean, concise code for well-defined problems, but they are not consistently more capable than humans, and they cannot yet be trusted without human review.
The question itself may also be the wrong one to ask. Rather than treating LLMs and humans as competitors, the more useful frame is whether LLMs make developers more effective — and the evidence for that, at least anecdotally, is considerably stronger.
1) Jamil, Abid & Shamail, "Can LLMs Generate Higher Quality Code Than Humans? An Empirical Study," MSR 2025. The original paper and data are available on GitHub.