Qt Weekly #28: Qt and CUDA on the Jetson TK1

NVIDIA's Jetson TK1 is a powerful development board based on the Tegra K1 chip. It comes with a GPU capable of OpenGL 4.4, OpenGL ES 3.1 and CUDA 6.5. From Qt's perspective this is a somewhat unorthodox embedded device, because its customized Linux system is based on Ubuntu 14.04 and runs the regular X11 environment. Therefore the approach typical for low- and medium-end embedded hardware, running OpenGL-accelerated Qt apps directly on the framebuffer using the eglfs platform plugin, is not suitable here.

In addition, the ability to do hardware-accelerated computing using CUDA is very interesting, especially when it comes to interoperating with OpenGL. Let's take a look at how CUDA code can be integrated with a Qt-based application.

[Image: the Jetson TK1 board]

Building Qt

This board is powerful enough to build everything on its own without any cross-compilation. Configuring and building Qt is no different than in any desktop Linux environment. One option that needs special consideration, however, is -opengl es2, because Qt can be built either in a GLX + OpenGL or an EGL + OpenGL ES configuration.

For example, the following configures Qt to use GLX and OpenGL:

configure -release -nomake examples -nomake tests

while adding -opengl es2 requests the usage of EGL and OpenGL ES:

configure -release -opengl es2 -nomake examples -nomake tests

If you are planning to run applications relying on modern, non-ES OpenGL features, or on CUDA, then go for the first option. If, however, you have existing code from the mobile or embedded world that relies on EGL or OpenGL ES, then the second may be the better fit.

The default platform plugin will be xcb, so running Qt apps without specifying the platform plugin will work just fine. This is the exact same plugin that is used on any ordinary X11-based Linux desktop system.

Vsync gotchas

Once the build is done, you will most likely run some OpenGL-based Qt apps. And then comes the first surprise: applications are not synchronized to the vertical refresh rate of the screen.

When running, for instance, the example from qtbase/examples/opengl/qopenglwindow, we expect a nice and smooth 60 FPS animation with the rendering thread throttled appropriately. Unfortunately this is not the case, unless the application is fullscreen. Therefore many apps will want to replace calls like show() or showMaximized() with showFullScreen(). This way the thread is throttled as expected.
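In practice this is a one-line change in main(). A minimal sketch, assuming a QOpenGLWindow subclass named Window like the one shown later in this post:

```cpp
#include <QGuiApplication>

int main(int argc, char **argv)
{
    QGuiApplication app(argc, argv);

    Window w;
    // w.show() or w.showMaximized() would leave the render thread
    // unthrottled on this board; going fullscreen restores vsync.
    w.showFullScreen();

    return app.exec();
}
```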

A further surprise may come in QWidget-based applications when opening a popup or a dialog. Unfortunately this also disables synchronization, even though the main window still covers the entire screen. In general we can conclude that the standard embedded recommendation of sticking to a single fullscreen window is very valid for this board too, even when using xcb, although for completely different reasons.


Using CUDA

After installing CUDA, the first, and in fact the only, challenge is to tackle the integration of nvcc, NVIDIA's compiler driver, with our Qt projects.

Unsurprisingly, this has been tackled by others before. Building on this excellent article, the most basic integration in our .pro file could look like this:

... # QT, SOURCES, HEADERS, the usual stuff

CUDA_SOURCES = cuda_stuff.cu

CUDA_DIR = /usr/local/cuda
CUDA_ARCH = sm_32 # as supported by the Tegra K1

LIBS += -L $$CUDA_DIR/lib -lcudart -lcuda
osx: LIBS += -F/Library/Frameworks -framework CUDA

cuda.commands = $$CUDA_DIR/bin/nvcc -c -arch=$$CUDA_ARCH -o ${QMAKE_FILE_OUT} ${QMAKE_FILE_NAME}
cuda.dependency_type = TYPE_C
cuda.depend_command = $$CUDA_DIR/bin/nvcc -M ${QMAKE_FILE_NAME}
cuda.input = CUDA_SOURCES
cuda.output = ${QMAKE_FILE_BASE}_cuda.o

# Register the extra compiler; without this the rules above are never invoked.
QMAKE_EXTRA_COMPILERS += cuda

In addition to Linux, this also works out of the box on OS X. Adapting it to Windows should be straightforward. For advanced features, like reformatting nvcc's error messages to be more to Qt Creator's liking, see the article mentioned above.

A QOpenGLWindow-based application that updates an image via CUDA on every frame could now look something like the following. The approach is the same regardless of the OpenGL enabler in use: a QOpenGLWidget or a custom Qt Quick item would operate along the same principles: call cudaGLSetGLDevice once the OpenGL context is available, register the OpenGL resources with CUDA, and then, on every frame, map the resources, invoke the CUDA kernels, unmap, and draw.

Note that this example uses a single pixel buffer object. There are other ways to do interop: for example, we could have registered the GL texture, obtained a CUDA array out of it, and bound that either to a CUDA texture or surface.
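For reference, the texture-registration path could look roughly like the following. This is a hedged sketch against the CUDA runtime API, with error handling omitted; `texture` is assumed to be an existing GL texture name:

```cpp
cudaGraphicsResource *res = 0;
// Register the GL texture instead of a buffer; the flag requests
// surface load/store access so a kernel can write to it.
cudaGraphicsGLRegisterImage(&res, texture, GL_TEXTURE_2D,
                            cudaGraphicsRegisterFlagsSurfaceLoadStore);

cudaGraphicsMapResources(1, &res);
cudaArray *arr = 0;
cudaGraphicsSubResourceGetMappedArray(&arr, res, 0, 0);
// arr can now be bound to a CUDA texture or surface (e.g. wrapped in a
// cudaSurfaceObject_t) and read or written from a kernel.
cudaGraphicsUnmapResources(1, &res);
```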

// functions from cuda_stuff.cu
extern void CUDA_init();
extern void *CUDA_registerBuffer(GLuint buf);
extern void CUDA_unregisterBuffer(void *res);
extern void *CUDA_map(void *res);
extern void CUDA_unmap(void *res);
extern void CUDA_do_something(void *devPtr, int w, int h);

class Window : public QOpenGLWindow, protected QOpenGLFunctions
{
public:
    ...
    void initializeGL();
    void paintGL();

private:
    QSize m_imgSize;
    GLuint m_buf;
    GLuint m_texture;
    void *m_cudaBufHandle;
};


void Window::initializeGL()
{
    initializeOpenGLFunctions();

    // Pick the CUDA device matching the OpenGL context.
    CUDA_init();

    QImage img("some_image.png");
    m_imgSize = img.size();
    img = img.convertToFormat(QImage::Format_RGB32); // BGRA on little endian

    glGenBuffers(1, &m_buf);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, m_buf);
    glBufferData(GL_PIXEL_UNPACK_BUFFER, m_imgSize.width() * m_imgSize.height() * 4,
                 img.constBits(), GL_DYNAMIC_COPY);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

    m_cudaBufHandle = CUDA_registerBuffer(m_buf);

    glGenTextures(1, &m_texture);
    glBindTexture(GL_TEXTURE_2D, m_texture);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, m_imgSize.width(), m_imgSize.height(), 0,
                 GL_BGRA, GL_UNSIGNED_BYTE, 0);
}


void Window::paintGL()
{
    glClear(GL_COLOR_BUFFER_BIT);

    void *devPtr = CUDA_map(m_cudaBufHandle);
    CUDA_do_something(devPtr, m_imgSize.width(), m_imgSize.height());
    CUDA_unmap(m_cudaBufHandle);

    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, m_buf);
    glBindTexture(GL_TEXTURE_2D, m_texture);
    // Fast path due to BGRA
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, m_imgSize.width(), m_imgSize.height(),
                    GL_BGRA, GL_UNSIGNED_BYTE, 0);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

    ... // do something with the texture

    update(); // request the next frame
}

...

The corresponding cuda_stuff.cu:

#include <stdio.h>
#include <string.h>

// Qt headers are not included in the .cu file, so use the compiler's own
// platform define instead of Q_OS_MAC.
#ifdef __APPLE__
#include <OpenGL/gl.h>
#else
#include <GL/gl.h>
#endif

#include <cuda.h>
#include <cuda_gl_interop.h>

void CUDA_init()
{
    cudaDeviceProp prop;
    int dev;
    memset(&prop, 0, sizeof(cudaDeviceProp));
    prop.major = 3;
    prop.minor = 2;
    if (cudaChooseDevice(&dev, &prop) != cudaSuccess)
        puts("failed to choose device");
    if (cudaGLSetGLDevice(dev) != cudaSuccess)
        puts("failed to set gl device");
}

void *CUDA_registerBuffer(GLuint buf)
{
    cudaGraphicsResource *res = 0;
    if (cudaGraphicsGLRegisterBuffer(&res, buf, cudaGraphicsRegisterFlagsNone) != cudaSuccess)
        printf("Failed to register buffer %u\n", buf);
    return res;
}

void CUDA_unregisterBuffer(void *res)
{
    if (cudaGraphicsUnregisterResource((cudaGraphicsResource *) res) != cudaSuccess)
        puts("Failed to unregister resource for buffer");
}

void *CUDA_map(void *res)
{
    if (cudaGraphicsMapResources(1, (cudaGraphicsResource **) &res) != cudaSuccess) {
        puts("Failed to map resource");
        return 0;
    }
    void *devPtr = 0;
    size_t size;
    if (cudaGraphicsResourceGetMappedPointer(&devPtr, &size, (cudaGraphicsResource *) res) != cudaSuccess) {
        puts("Failed to get device pointer");
        return 0;
    }
    return devPtr;
}

void CUDA_unmap(void *res)
{
    if (cudaGraphicsUnmapResources(1, (cudaGraphicsResource **) &res) != cudaSuccess)
        puts("Failed to unmap resource");
}

__global__ void run(uchar4 *ptr)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;

    ...
}

void CUDA_do_something(void *devPtr, int w, int h)
{
    const int blockSize = 16; // 256 threads per block
    run<<<dim3(w / blockSize, h / blockSize), dim3(blockSize, blockSize)>>>((uchar4 *) devPtr);
}

This is all that's needed to integrate the power of Qt, OpenGL and CUDA. Happy hacking!
