Qt Weekly #28: Qt and CUDA on the Jetson TK1

NVIDIA's Jetson TK1 is a powerful development board based on the Tegra K1 chip. It comes with a GPU capable of OpenGL 4.4, OpenGL ES 3.1 and CUDA 6.5. From Qt's perspective this is a somewhat unorthodox embedded device, because its customized Linux system is based on Ubuntu 14.04 and runs the regular X11 environment. Therefore the approach typical for low- and medium-end embedded hardware, running OpenGL-accelerated Qt apps directly on the framebuffer using the eglfs platform plugin, is not suitable here.

In addition, the ability to do hardware-accelerated computing using CUDA is very interesting, especially when it comes to interoperating with OpenGL. Let's take a look at how CUDA code can be integrated with a Qt-based application.

[Image: the Jetson TK1 board]

Building Qt

This board is powerful enough to build everything on its own without any cross-compilation. Configuring and building Qt is no different than in any desktop Linux environment. One option that needs special consideration, however, is -opengl es2, because Qt can be built either in a GLX + OpenGL or an EGL + OpenGL ES configuration.

For example, the following configures Qt to use GLX and OpenGL:

configure -release -nomake examples -nomake tests

while adding -opengl es2 requests the usage of EGL and OpenGL ES:

configure -release -opengl es2 -nomake examples -nomake tests

If you are planning to run applications relying on modern, non-ES OpenGL features, or on CUDA, then go for the first option. If, however, you have existing code from the mobile or embedded world that relies on EGL or OpenGL ES, then the second may be the better fit.

The default platform plugin will be xcb, so running Qt apps without specifying the platform plugin will work just fine. This is the exact same plugin that is used on any ordinary X11-based Linux desktop system.

Vsync gotchas

Once the build is done, you will most likely run some OpenGL-based Qt apps. And then comes the first surprise: applications are not synchronized to the vertical refresh rate of the screen.

When running, for instance, the example from qtbase/examples/opengl/qopenglwindow, we expect a nice and smooth 60 FPS animation with the rendering thread throttled appropriately. Unfortunately this is not the case, unless the application is fullscreen. Therefore many apps will want to replace calls like show() or showMaximized() with showFullScreen(). This way the thread is throttled as expected.
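In practice this is a one-line change in main(). A minimal sketch, assuming a QOpenGLWindow subclass named Window like the one shown later in this post:

```cpp
#include <QGuiApplication>

int main(int argc, char **argv)
{
    QGuiApplication app(argc, argv);

    Window w;
    // w.show() or w.showMaximized() would leave the render thread
    // unthrottled on this board; going fullscreen restores vsync.
    w.showFullScreen();

    return app.exec();
}
```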

A further surprise may come in QWidget-based applications when opening a popup or a dialog. Unfortunately this also disables synchronization, even though the main window still covers the entire screen. In general we can conclude that the standard embedded recommendation of sticking to a single fullscreen window is very valid for this board too, even when using xcb, although for completely different reasons.


Using CUDA

After installing CUDA, the first, and in fact the only, challenge is to tackle the integration of nvcc, NVIDIA's compiler driver, with our Qt projects.

Unsurprisingly, this has been tackled by others before. Building on this excellent article, the most basic integration in our .pro file could look like this:

... # QT, SOURCES, HEADERS, the usual stuff

CUDA_SOURCES = cuda_stuff.cu

CUDA_DIR = /usr/local/cuda
CUDA_ARCH = sm_32 # as supported by the Tegra K1

LIBS += -L $$CUDA_DIR/lib -lcudart -lcuda
osx: LIBS += -F/Library/Frameworks -framework CUDA

cuda.commands = $$CUDA_DIR/bin/nvcc -c -arch=$$CUDA_ARCH -o ${QMAKE_FILE_OUT} ${QMAKE_FILE_NAME}
cuda.dependency_type = TYPE_C
cuda.depend_command = $$CUDA_DIR/bin/nvcc -M ${QMAKE_FILE_NAME}
cuda.input = CUDA_SOURCES
cuda.output = ${QMAKE_FILE_BASE}_cuda.o

# Register the extra compiler; without this the rules above are never invoked.
QMAKE_EXTRA_COMPILERS += cuda

In addition to Linux, this also works out of the box on OS X. Adapting it to Windows should be straightforward. For advanced features, like reformatting nvcc's error messages to be more to Qt Creator's liking, see the article mentioned above.

A QOpenGLWindow-based application that updates an image via CUDA on every frame could now look something like the following. The approach is the same regardless of the OpenGL enabler in use: a QOpenGLWidget or a custom Qt Quick item would operate along the same principles: call cudaGLSetGLDevice once the OpenGL context is available, register the OpenGL resources with CUDA, and then, on every frame, map the resources, invoke the CUDA kernels, unmap, and draw.

Note that this example uses a single pixel buffer object. There are other ways to do interop: for example, we could have registered the GL texture, obtained a CUDA array out of it, and bound that either to a CUDA texture or surface.
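For reference, the texture-registration path could look roughly like the following. This is a hedged sketch against the CUDA runtime API, with error handling omitted; `texture` is assumed to be an existing GL texture name:

```cpp
cudaGraphicsResource *res = 0;
// Register the GL texture instead of a buffer; the flag requests
// surface load/store access so a kernel can write to it.
cudaGraphicsGLRegisterImage(&res, texture, GL_TEXTURE_2D,
                            cudaGraphicsRegisterFlagsSurfaceLoadStore);

cudaGraphicsMapResources(1, &res);
cudaArray *arr = 0;
cudaGraphicsSubResourceGetMappedArray(&arr, res, 0, 0);
// arr can now be bound to a CUDA texture or surface (e.g. wrapped in a
// cudaSurfaceObject_t) and read or written from a kernel.
cudaGraphicsUnmapResources(1, &res);
```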

// functions from cuda_stuff.cu
extern void CUDA_init();
extern void *CUDA_registerBuffer(GLuint buf);
extern void CUDA_unregisterBuffer(void *res);
extern void *CUDA_map(void *res);
extern void CUDA_unmap(void *res);
extern void CUDA_do_something(void *devPtr, int w, int h);

class Window : public QOpenGLWindow, protected QOpenGLFunctions
{
public:
    ...
    void initializeGL();
    void paintGL();

private:
    QSize m_imgSize;
    GLuint m_buf;
    GLuint m_texture;
    void *m_cudaBufHandle;
};


void Window::initializeGL()
{
    initializeOpenGLFunctions();

    // Pick the CUDA device matching the OpenGL context.
    CUDA_init();

    QImage img("some_image.png");
    m_imgSize = img.size();
    img = img.convertToFormat(QImage::Format_RGB32); // BGRA on little endian

    glGenBuffers(1, &m_buf);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, m_buf);
    glBufferData(GL_PIXEL_UNPACK_BUFFER, m_imgSize.width() * m_imgSize.height() * 4,
                 img.constBits(), GL_DYNAMIC_COPY);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

    m_cudaBufHandle = CUDA_registerBuffer(m_buf);

    glGenTextures(1, &m_texture);
    glBindTexture(GL_TEXTURE_2D, m_texture);
    glTexImage2D(GL_TEXTURE_2D, 0, GL_RGBA8, m_imgSize.width(), m_imgSize.height(), 0,
                 GL_BGRA, GL_UNSIGNED_BYTE, 0);
}


void Window::paintGL()
{
    glClear(GL_COLOR_BUFFER_BIT);

    void *devPtr = CUDA_map(m_cudaBufHandle);
    CUDA_do_something(devPtr, m_imgSize.width(), m_imgSize.height());
    CUDA_unmap(m_cudaBufHandle);

    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, m_buf);
    glBindTexture(GL_TEXTURE_2D, m_texture);
    // Fast path due to BGRA
    glTexSubImage2D(GL_TEXTURE_2D, 0, 0, 0, m_imgSize.width(), m_imgSize.height(),
                    GL_BGRA, GL_UNSIGNED_BYTE, 0);
    glBindBuffer(GL_PIXEL_UNPACK_BUFFER, 0);

    ... // do something with the texture

    update(); // request the next frame
}

...

The corresponding cuda_stuff.cu:

#include <stdio.h>
#include <string.h>

// Qt headers are not included in the .cu file, so use the compiler's own
// platform define instead of Q_OS_MAC.
#ifdef __APPLE__
#include <OpenGL/gl.h>
#else
#include <GL/gl.h>
#endif

#include <cuda.h>
#include <cuda_gl_interop.h>

void CUDA_init()
{
    cudaDeviceProp prop;
    int dev;
    memset(&prop, 0, sizeof(cudaDeviceProp));
    prop.major = 3;
    prop.minor = 2;
    if (cudaChooseDevice(&dev, &prop) != cudaSuccess)
        puts("failed to choose device");
    if (cudaGLSetGLDevice(dev) != cudaSuccess)
        puts("failed to set gl device");
}

void *CUDA_registerBuffer(GLuint buf)
{
    cudaGraphicsResource *res = 0;
    if (cudaGraphicsGLRegisterBuffer(&res, buf, cudaGraphicsRegisterFlagsNone) != cudaSuccess)
        printf("Failed to register buffer %u\n", buf);
    return res;
}

void CUDA_unregisterBuffer(void *res)
{
    if (cudaGraphicsUnregisterResource((cudaGraphicsResource *) res) != cudaSuccess)
        puts("Failed to unregister resource for buffer");
}

void *CUDA_map(void *res)
{
    if (cudaGraphicsMapResources(1, (cudaGraphicsResource **) &res) != cudaSuccess) {
        puts("Failed to map resource");
        return 0;
    }
    void *devPtr = 0;
    size_t size;
    if (cudaGraphicsResourceGetMappedPointer(&devPtr, &size, (cudaGraphicsResource *) res) != cudaSuccess) {
        puts("Failed to get device pointer");
        return 0;
    }
    return devPtr;
}

void CUDA_unmap(void *res)
{
    if (cudaGraphicsUnmapResources(1, (cudaGraphicsResource **) &res) != cudaSuccess)
        puts("Failed to unmap resource");
}

__global__ void run(uchar4 *ptr)
{
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int offset = x + y * blockDim.x * gridDim.x;

    ...
}

void CUDA_do_something(void *devPtr, int w, int h)
{
    const int blockSize = 16; // 256 threads per block
    run<<<dim3(w / blockSize, h / blockSize), dim3(blockSize, blockSize)>>>((uchar4 *) devPtr);
}

This is all that's needed to integrate the power of Qt, OpenGL and CUDA. Happy hacking!
