
Safety-Critical CUDA: What Developers Need to Know

Nicholas Wilt is a former NVIDIA CUDA architect who worked on CUDA from its inception, the author of The CUDA Handbook and writer of the Parallel Programmer newsletter, with over 25 years of experience in high-performance computing, graphics, and parallel programming.

This is the first of our joint expert series on Axivion for CUDA.


“As computers have become more pervasive, they increasingly have found their way into safety-critical applications. For software running on those computers, the stakes are higher and the software must be designed and built more carefully.”

A few weeks ago, the folks at Qt Group reached out to see if they could contract with me to write about the new CUDA support in Axivion, their code analyzer tool that is designed to complement (not replace!) seasoned developers’ adherence to coding standards such as MISRA, AUTOSAR, and NVIDIA’s own CUDA security guidance.

Code analyzers have a long history in our industry. In the 1970s, when K&R’s The C Programming Language was new, a program called lint was developed by Stephen C. Johnson, a computer scientist at Bell Labs. lint itself is a historical curiosity, with functionality that has long since been rendered obsolete by language design (e.g., mismatched argument types) or subsumed into compilers (e.g., pedantic warnings about portability issues). But the idea persists that tools which complement the compiler’s core task of translating source code into machine code have a role to play, and in homage to lint, such tools are called linters.


My first experience using a linter in production came at Microsoft in the 1990s, after the company acquired PREfast, a static code analyzer that subjected code to much more rigorous scrutiny than the compiler. Management was concerned about security vulnerabilities, especially ones caused by buffer overflow bugs, and PREfast was designed to detect such bugs through holistic analysis of the source code. A team was assigned to run the entire Windows code base through PREfast, and a bug was opened for every issue it found.

That exercise had a bigger influence on how I write code than any other single event in my career as a software developer—because some of those bugs were damn near impossible to fix without regression risk.

If a function neglected to check the return value from malloc(), then passed the resulting pointer into a call stack, where only some code paths dereferenced the bad pointer… PREfast would catch that bug, and outline the exact set of conditions needed for it to reproduce. But fixing the bug sometimes would require touching code in many places, often in ways that made the fix hard to verify by inspection.

The bugs opened by PREfast generally could be dropped into one of three categories:

  • Most often, PREfast raised legitimate bugs that could be fixed with minimal disruption to the source code.

  • Occasionally, we were able to prove that PREfast had raised a spurious concern.

  • Every so often, we had an unsatisfying resolution: it seemed like there was a bug, but we had no repro case, and when the affected code was scattered throughout the code base, our attempts to ‘fix’ the ‘bug’ seemed just as likely to introduce regressions.

Microsoft trusted its engineers; if the person investigating a PREfast issue attested that it had been fixed (or that it was not a bug), the bug would be closed. For bugs that were deemed spurious, there was a mechanism to mark them so PREfast would not raise those issues again. If memory serves, marking a bug that way required sign-off from multiple people, akin to a code review.

The experience of triaging, fixing, confirming the fixes for these bugs, and figuring out how to improve test coverage of the code, had a profound effect on my approach to software development. The latter two categories, in particular, motivated me to think about how I could write code that 1) does not raise spurious concerns with the static code analyzer and, more importantly, 2) enables bug fixes to be more local, e.g. by handling resource allocation failures at the call site, and propagating errors from unified code paths.
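That “local fix” style can be sketched concretely. The function below is a hypothetical example of my own (not from the Windows code base): the allocation failure is handled at the site where it happens, and the caller sees a single, unified success/error contract rather than a possibly-NULL pointer.

```cpp
#include <cstdlib>
#include <cstring>

// Hypothetical sketch of the style described above: check the
// allocation where it occurs and propagate failure through one
// unified error path. Returns 0 on success, -1 on failure.
static int make_message(const char *msg, char **out) {
    *out = nullptr;
    char *buf = (char *)std::malloc(std::strlen(msg) + 1);
    if (buf == nullptr) {
        return -1;               // failure handled at the call site
    }
    std::strcpy(buf, msg);
    *out = buf;                  // caller receives a known-good pointer
    return 0;
}
```

Because the error is resolved (or reported) at a single point, a static analyzer has nothing ambiguous to flag, and any future fix touches one function instead of every code path that might have received a bad pointer.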


As computers have become more pervasive, they increasingly have found their way into safety-critical applications. For software running on those computers, the stakes are higher and the software must be designed and built more carefully. We can’t go about building software for medical devices, airplanes, weapons systems, or automobiles in the same way that we build software for computer games or e-commerce Web sites. NASA has long adhered to coding standards and software development practices that bias strongly for robustness over speed of development. Prominent failures, such as the notorious Therac-25 radiation therapy machine1, have motivated development of coding standards that are more formal, and have more force in effect, than the advisory coding standards that most professional software engineering organizations have internally negotiated.

In the late 1990s, the automobile industry, recognizing the inevitability of computers continuing to find more applications throughout their products, convened the Motor Industry Software Reliability Association (MISRA) to formalize coding standards. Now in its third edition, MISRA has found its way into other application domains, such as the Joint Strike Fighter (JSF) C++ Coding Standards.2 More recently, AUTOSAR (AUTomotive Open System ARchitecture) was developed to standardize software development for Electronic Control Units (ECUs), and the International Organization for Standardization (ISO) incorporated Product Development At The Software Level as Part 6 of the ISO 26262 standard for Functional Safety of Road Vehicles.

The mere existence of MISRA, or claimed adherence to its standards, won’t protect companies if employees fail to follow best practices. Around 15 years ago, as if to underscore why the automotive industry was leading the field of software engineering in developing such standards, Toyota was investigated for UA (unintended acceleration) events that, as of May 2010, had been linked to an estimated 89 deaths. A related class action lawsuit was settled for $1.6B in December 2012, and in 2014, Toyota was further fined $1.2B for concealing safety defects. Expert witness Michael Barr spent more than 20 months reviewing Toyota’s source code and testified about its quality, even citing an internal Toyota document that characterized it as “spaghetti code.” Barr and a NASA team contracted to do an evaluation both checked the code against MISRA standards and found thousands of violations. For those interested, this article has a (lengthy) summary as well as links to the hundreds of pages of testimony by expert witnesses Koopman and Barr.

“[Static analysis and code verification tools] are intended to complement, not replace, seasoned engineers’ judgment.”

By now, it should be clear why there is an appetite for tools such as Axivion, which analyzes source code and flags possible violations of the various safety standards, citing them by chapter and verse. They are intended to complement, not replace, seasoned engineers’ judgment. Like the leaders at Microsoft who trusted their developers to triage and address the issues raised by PREfast, the architects of these coding standards recognized that it’s more realistic to be pragmatic than didactic. That said, if I were an engineering director at a company building software for safety-critical systems, I’d consider thoughtfully incorporating Axivion into the CI/CD workflow.

And as far as I know, Axivion is the first and only offering that performs this service, not only for C/C++ and C# code, but also for CUDA C++ code.3 Only recently, NVIDIA published CUDA C++ Guidelines for Robust and Safety Critical Programming, and Qt Group has incorporated the guidance from that document into their tool. Such expanded support is needed as CUDA continues to find applications in robotics, autonomous vehicles, and other safety-critical domains.

As a first exercise, the Qt folks ran the source code for The CUDA Handbook through their tool, and sent a report summarizing its findings. The CUDA Handbook isn’t intended for safety-critical applications, per se, but it’s an open source code base we can usefully examine to get a sense of the error reporting and how a developer would triage and address the issues raised by the tool.

In our next article, we’ll put Axivion through its paces and take a look at its findings.

Read more about the CUDA C++ Guidelines


1) Sadly, the first sentence of the linked paper “An Investigation of the Therac-25 Accidents” reads: “Computers are increasingly being introduced into safety-critical systems and, as a consequence, are involved in accidents.” That paper was published in July 1993.

2) From 1987 to 1997, the Pentagon famously imposed a requirement that all software be written in Ada, a programming language that had been developed specifically by the Defense Department to standardize the alphabet soup of programming languages in use in the 1970s. Lockheed Martin reportedly played a role in persuading the Defense Department to reconsider this policy and allow C++ development amid development of the Joint Strike Fighter.

3) For now at least, CUDA Python is not supported by any static code analyzer.
