The hype around generative AI and Large Language Models (LLMs) has produced some bold claims: that software developers are becoming obsolete, that AI writes better code than humans, and that the age of the human programmer is drawing to a close. But how much of that is actually backed by empirical evidence? A 2025 study published at the IEEE/ACM International Conference on Mining Software Repositories (MSR), one of the top venues in software engineering research, takes a serious empirical stab at the question.
The paper, by Jamil, Abid, and Shamail, is titled "Can LLMs Generate Higher Quality Code Than Humans? An Empirical Study."1) This article is a summary of what they did, what they found, and more importantly, what their findings don't quite tell us.
Note that this article deliberately takes a deep look at a single study. It does not aim to summarize all publications on the topic (which would inevitably lead to a superficial treatment).
The researchers asked two questions:
They tested two versions of ChatGPT — GPT-3.5-Turbo and GPT-4 — and used three levels of prompt detail:
That gives six combinations in total, all compared against hand-written Python code from the HumanEval benchmark — a dataset of 164 Python functions created by OpenAI to evaluate their LLM’s coding ability.
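To make the setup concrete, each HumanEval task consists of a function signature with a docstring (the prompt handed to the model), a canonical human-written solution, and a set of unit tests. The snippet below is an illustrative task in that style; it is not an actual entry from the dataset.

```python
# Illustrative task in the style of HumanEval (not an actual entry from
# the dataset). The model sees the signature and docstring and must
# produce the body; the canonical solution below stands in for the
# "human-written" side of the comparison.

def running_maximum(numbers: list[int]) -> list[int]:
    """Return a list where element i is the largest value seen in
    numbers[0..i]."""
    best: list[int] = []
    for n in numbers:
        best.append(n if not best else max(best[-1], n))
    return best


# Each task also ships unit tests that define functional correctness.
assert running_maximum([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
assert running_maximum([]) == []
```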
This is where things get interesting — and a little tricky.
The study uses a two-pronged definition of quality: functional correctness (does the code actually work?) and maintainability (is it well-written and easy to work with?).
Correctness is measured by running the test cases that ship with HumanEval. Maintainability is measured through a battery of standard code metrics, computed via tools like Radon, Complexipy, Bandit, and Pylint:
Except for the maintainability index, lower is better.
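As a rough illustration of what collecting such metrics involves, here is a minimal sketch using Radon's Python API to compute line counts, cyclomatic complexity, and the maintainability index for a single function. This is not the authors' pipeline; the study additionally runs Complexipy, Bandit, and Pylint, which are omitted here.

```python
# Minimal sketch of metric collection with Radon's Python API.
# Illustrates the kind of measurement the study performs, not the
# authors' actual tooling.
from radon.complexity import cc_visit
from radon.metrics import mi_visit
from radon.raw import analyze

SOURCE = '''
def running_maximum(numbers):
    best = []
    for n in numbers:
        best.append(n if not best else max(best[-1], n))
    return best
'''

raw = analyze(SOURCE)              # raw line counts (LOC, LLOC, SLOC, ...)
blocks = cc_visit(SOURCE)          # cyclomatic complexity per function
mi = mi_visit(SOURCE, multi=True)  # maintainability index (higher is better)

print("LOC:", raw.loc)
print("Cyclomatic complexity:", max(b.complexity for b in blocks))
print("Maintainability index:", round(mi, 1))
```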
To combine all these metrics into a single verdict, the authors use a decision-making technique called TOPSIS (Technique for Order of Preference by Similarity to Ideal Solution). The idea is to find the solution closest to the theoretical "best" across all metrics simultaneously, and farthest from the theoretical "worst." It's a reasonable approach, but it does require you to define what the ideal best and worst values actually are — which turns out to be non-trivial for metrics like LOC.
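For readers unfamiliar with the technique, the sketch below is a generic, minimal TOPSIS implementation, not the authors' code. It assumes vector normalization, equal weights, and ideal values taken from the observed data, all of which are common defaults that may differ from the study's exact setup.

```python
import numpy as np

def topsis(matrix: np.ndarray, weights: np.ndarray, benefit: np.ndarray) -> np.ndarray:
    """Rank alternatives (rows) over criteria (columns).

    benefit[j] is True if higher is better for criterion j (e.g. the
    maintainability index), False if lower is better (e.g. LOC, complexity).
    Returns a closeness score in [0, 1]; higher means closer to the ideal.
    """
    # Vector-normalize each criterion column, then apply the weights.
    norm = matrix / np.linalg.norm(matrix, axis=0)
    weighted = norm * weights

    # Ideal best/worst per criterion depend on its direction.
    best = np.where(benefit, weighted.max(axis=0), weighted.min(axis=0))
    worst = np.where(benefit, weighted.min(axis=0), weighted.max(axis=0))

    # Euclidean distance to the ideal best and ideal worst.
    d_best = np.linalg.norm(weighted - best, axis=1)
    d_worst = np.linalg.norm(weighted - worst, axis=1)
    return d_worst / (d_best + d_worst)

# Toy example: two code variants scored on [LOC, complexity, MI].
scores = np.array([[12, 3, 72.0],   # variant A
                   [18, 2, 65.0]])  # variant B
weights = np.full(3, 1 / 3)
benefit = np.array([False, False, True])  # only MI is "higher is better"
print(topsis(scores, weights, benefit))
```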
On correctness: The hand-written HumanEval functions pass all their tests by definition — they wouldn't be in the benchmark otherwise. The best-performing LLM combination (GPT-4 with the most detailed prompt) still produced 21 incorrect functions out of 164, a failure rate of roughly 13%. That's not insignificant. The bottom line: there's a high probability that any given piece of LLM-generated code contains bugs. Tests and review are not optional.
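Mechanically, the correctness check is simple: execute the generated completion, then run the task's assertions against it. A stripped-down version of that harness might look like the sketch below; real evaluation harnesses sandbox the execution and enforce timeouts, which this omits.

```python
# Stripped-down illustration of a HumanEval-style correctness check.
# Real harnesses run this in a sandboxed subprocess with a timeout;
# exec'ing untrusted model output directly is unsafe.

GENERATED = '''
def running_maximum(numbers):
    best = []
    for n in numbers:
        best.append(n if not best else max(best[-1], n))
    return best
'''

TESTS = '''
assert running_maximum([3, 1, 4, 1, 5]) == [3, 3, 4, 4, 5]
assert running_maximum([]) == []
'''

def passes(candidate_src: str, test_src: str) -> bool:
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the generated function
        exec(test_src, namespace)       # run the benchmark's assertions
        return True
    except Exception:
        return False

print(passes(GENERATED, TESTS))  # True for this candidate
```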
On maintainability: When the researchers restricted the comparison to only the correctly generated functions, GPT-4 with the detailed prompt came out ahead in a meaningful portion of cases. Specifically, compared to the human-written code, it produced functions with:
The authors describe GPT-4 as demonstrating "superior performance across all evaluated metrics." But there's a subtlety worth paying attention to here.
The study reports the share of cases where generated code beats human-written code — but not the share where it loses. Take LOC as an example: if generated code has fewer lines in 47% of cases, that means it has more lines in 53% of cases (assuming the two are never identical). By that logic, for lines of code, human-written code is actually the winner more often than not.
The authors don't highlight this inversion, which makes the results harder to interpret. It's not that the paper is misleading — but cherry-picking one side of a comparison can create an overly rosy picture.
There's also a deeper methodological question: are these metrics even good proxies for maintainability? Decades of software engineering research haven't conclusively validated most of them as reliable predictors of how hard code is to maintain in practice. Shorter code isn't always better code. Lower complexity doesn't always mean more readable. And for security findings from static analysis tools, both false positives and false negatives are common.
Finally, when you're aggregating multiple metrics, it matters whether they're independent. If several metrics are highly correlated — measuring essentially the same thing — giving each of them equal weight in TOPSIS effectively double- or triple-counts that dimension, potentially skewing the outcome.
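A cheap sanity check before aggregating is to inspect the correlation structure of the metric columns: coefficients near +1 or -1 flag criteria that equal TOPSIS weights would effectively count twice. A sketch, using invented metric values purely for illustration:

```python
import numpy as np

# Hypothetical decision matrix: one row per function, one column per
# metric (e.g. LOC, cyclomatic complexity, maintainability index).
# Values are invented for illustration.
metrics = np.array([
    [10, 2, 78.0],
    [25, 5, 61.0],
    [14, 3, 70.0],
    [40, 9, 48.0],
])

# Pairwise Pearson correlations between metric columns. Near-collinear
# columns indicate redundant criteria that equal weights would double-count.
corr = np.corrcoef(metrics, rowvar=False)
print(np.round(corr, 2))
```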
The study's sample is entirely drawn from HumanEval, which consists of short, stand-alone Python functions — the kind of tasks you'd encounter in a coding interview. There's no guarantee this is representative of real-world codebases, which tend to involve longer, more intertwined, context-heavy code. Selecting HumanEval simply because it is readily available is what researchers call convenience sampling. It's a common pragmatic choice, but it limits how broadly you can generalize the conclusions.
HumanEval is also a benchmark created by OpenAI, the same organization behind GPT. That's worth keeping in mind: LLMs may have been trained in ways that make them particularly well-suited to performing on this specific benchmark. Whether the results hold for other code, other languages, or more complex software systems is an open question.
The study's own summary puts it well: "[While] GPT models can generate code with higher internal quality in some cases, they do not consistently produce correct or trustworthy code."
That's a nuanced and honest conclusion. For small, well-defined Python functions with good prompts, GPT-4 can produce code that is at least competitive with human-written code on several metrics — and sometimes cleaner. But roughly 1 in 8 generated functions will be functionally wrong, and the maintainability advantage is marginal and context-dependent.
In practical terms, this means:
The era of AI replacing developers wholesale hasn't arrived. What has arrived is a genuinely useful tool for generating code — one that can save time and sometimes produce cleaner results, but only when treated as a starting point rather than a finished product.
Sometimes — but not reliably, and not without caveats. Based on this study, GPT-4 with a well-crafted prompt can generate Python code that scores better than human-written code on several maintainability metrics in a meaningful portion of cases. In that narrow sense, yes, LLMs can produce "better" code. But the picture falls apart quickly under scrutiny. Around 13% of the generated functions are outright incorrect, the maintainability gains are modest and metric-dependent, and the entire comparison is limited to short, self-contained coding tasks — a far cry from the complexity of real software projects.
The honest answer is that "better" depends heavily on how you measure it, what kind of code you're writing, and what you do with the output. LLMs are capable of producing clean, concise code for well-defined problems, but they are not consistently more capable than humans, and they cannot yet be trusted without human review.
The question itself may also be the wrong one to ask. Rather than treating LLMs and humans as competitors, the more useful frame is whether LLMs make developers more effective — and the evidence for that, at least anecdotally, is considerably stronger.
1) Jamil, Abid & Shamail, "Can LLMs Generate Higher Quality Code Than Humans? An Empirical Study," MSR 2025. The original paper and data are available on GitHub.