OpenCL - the compute intermediate language

We are now fast approaching the yearly Game Developers Conference, and around this time last year my favourite topic of conversation was the need for a "virtual ISA" spanning current and future processor architectures, particularly GPUs. The term "virtual ISA" implies an assembly-like language and toolset that can be used to generate high-performance (most likely data-parallel) code without being tied to a specific architecture, much the same as the virtual ISA that LLVM provides for a wide variety of (primarily) CPU architectures.

Even a year on, it remains such a simple idea that, once stated, it is a wonder it still doesn't exist. The main change in the conversation is that it has become clearer that this is the role OpenCL should attempt to fill.

Why do we need a virtual ISA, and why not another language?

Right now, the reality is that no one knows the best way to program the emerging heterogeneous architectures efficiently. In fact, I don't think we even understand how to program "normal" GPUs yet. C++ will inevitably be crowbarred into some form of working shape, but I'd rather not settle for that as a long-term solution.

A compute software stack sounds like a far more attractive option, and to build a well-functioning stack you need a solid foundation. As an illustration of what happens without one, observe the rather scary rise in the number of compiler-like software projects that generate CUDA C as their output. This is not a use case CUDA C was designed for, and as OpenCL is essentially a clone of CUDA C, you may well think this is a trend future OpenCL revisions should pay attention to.
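To make the trend concrete, here is a minimal sketch (all names hypothetical, not from any real project) of the kind of compiler-like tool described above: with no lower-level portable target available, it falls back to emitting CUDA C source as plain text.

```python
# Hypothetical sketch of a code generator that targets CUDA C source,
# the pattern described above. The template and function names are
# illustrative only.

CUDA_KERNEL_TEMPLATE = """__global__ void {name}(const float* a, const float* b, float* out, int n)
{{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = {expr};
}}
"""

def emit_map_kernel(name, expr):
    """Emit CUDA C for an element-wise kernel computing `expr` over a and b."""
    return CUDA_KERNEL_TEMPLATE.format(name=name, expr=expr)

source = emit_map_kernel("vec_mad", "a[i] * b[i] + a[i]")
print(source)
```

The generated string then has to be handed to a vendor compiler, which is exactly the awkwardness the post is pointing at: the tool is producing a high-level C dialect because no blessed intermediate target exists below it.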

In fact, there are a few other interesting trends to note.

A change in the wind?

A significant strength of the CUDA architecture is PTX, the NVIDIA-specific assembly-like language that CUDA C compiles to. OpenCL does not yet have a parallel of this. Several independent little birdies tell me that the committee felt attempting to define this as part of the initial OpenCL standard could have derailed the process before OpenCL had a chance to find its feet. However, more birdies tell me that this view has now mostly changed, and that defining something akin to a virtual ISA could actually be back on the cards.

PTX as an unofficial standard?

As nice as PTX is, it is not a GPU version of LLVM. Current support is limited to a single PDF reference and an assembler tool. The very impressive GPU ray tracer OptiX uses PTX as its target, but the authors do have the significant advantage of residing inside NVIDIA HQ. The compiler toolkit that makes LLVM so attractive is missing. Not to mention that PTX only targets NVIDIA hardware - although this itself is an interesting area for change, and one where I feel NVIDIA could gain a lot of momentum. One project I will be keeping a close eye on is GPU-Ocelot, which appears to be going a fair way towards addressing the shortcomings of PTX. While PTX may not be an "official" standard, much like Adobe Flash, it could establish itself as one anyway.

LLVM as a data parallel ISA?

Given the close parallels, we should seriously question whether LLVM can support GPU-like architectures. As a reasonably mature project, LLVM already has a lot in its favour and can point to some successes in this area: its use in the Mac OpenGL implementation, the Gallium3D graphics driver project, and an experimental PTX backend. I spoke with a few people who I know have seriously considered this option, and found the jury still out. I haven't used LLVM in sufficient anger to come to a decision myself.
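To make the idea concrete, here is a rough sketch of the kind of textual LLVM IR a data-parallel frontend might hand to a GPU backend: a scalar vector-add body that the backend would be expected to map across thousands of threads. The function and the emitter around it are illustrative assumptions, not output from any real toolchain.

```python
# Illustrative only: emit the textual LLVM IR for a scalar vector-add
# body. A hypothetical GPU backend would instantiate this across many
# threads, one value of %i per thread.

def emit_vec_add_ir():
    return """\
define void @vec_add(float* %a, float* %b, float* %c, i64 %i) {
entry:
  %pa = getelementptr float, float* %a, i64 %i
  %pb = getelementptr float, float* %b, i64 %i
  %pc = getelementptr float, float* %c, i64 %i
  %x = load float, float* %pa
  %y = load float, float* %pb
  %z = fadd float %x, %y
  store float %z, float* %pc
  ret void
}
"""

ir_text = emit_vec_add_ir()
print(ir_text)
```

Note what the IR does not say: nothing about thread counts, scheduling, or memory spaces, which is precisely where the question of LLVM's suitability as a data-parallel ISA gets interesting.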

One obvious obstacle is the question of what LLVM itself would compile to. I would expect a serious cross-vendor virtual ISA to have direct support from all hardware manufacturers, and a level of indirection through LLVM or any other third-party ISA is unlikely to gain sufficient ground if it is not explicitly supported by every vendor through a standard such as OpenCL.

Demand and evolution

With relatively little movement in the industry for a year or more, I do occasionally consider whether I have misread the demand for a virtual ISA. But not for long! Apart from the clear advantages for code generation and domain-specific language implementations (a topic I'm very interested in), a virtual ISA should become the target bytecode for GPU and HPC languages and APIs such as OpenGL, OpenCL, DirectX, DirectCompute, CUDA and their successors. It's a long and growing list.

While uncertainty surrounds the best way to program your GPU, we can expect to see unhealthy but unavoidable amounts of biodiversity. But if we want to prevent our industries from painfully diverging, we should at least agree on a foundational virtual ISA that we can start to unify behind and build on.


Comments

  1. Quote

    In two blog articles from April and July last year, named "Does GPGPU have a bright future?" and "The rise of the GPGPU-compilers", I wrote about the same thing. OpenCL is too low-level and is therefore going to be the hidden power of software, maintained in specialised libraries.

    OpenCL has the most potential to be the hidden layer between software and compute device, but it's not about innovation but about money.

  2. Quote
    Eddie Edwards said 12 February, 2011, 6:27 am:

    Hey Sam,

    Something like LLVM would be cool, but I don't see the necessity. Making a compiler output OpenCL or CUDA or Cg/HLSL is simple enough, just as the first C++ compilers output C. Then each vendor can implement those languages how they like.

    I'm sure LLVM will grow to support such use-cases. If the industry settles on PTX then LLVM may get simpler and smaller, but nothing fundamental will become possible that wasn't before. PTX is nice to have, that's all.

    The focus should not be on the specifics here, but on the more general problems: how can I easily refactor code so that it runs on a different mix of coprocessors? When my console has 4 scalar cores with 4-way SIMD plus 32 cores of Larrabee plus a 2048-unit GPU, how can I retarget code when I find I need to change my occupancy profile?

    The problem is the huge differences in architecture - exemplified by the fact that RenderTris(vertex[], index[], shader()) is implemented in hardware on a GPU but in software on Larrabee. PTX covers the shader() part, but implementations of RenderTris, plus the API calls to actually trigger the call, are completely different and must be selected per architecture. It's there that PTX may break down, as it may lack access to non-generic, per-architecture features (e.g. SPU channel registers or special Larrabee opcodes).

    Still, food for thought!

  3. Quote
    Eddie Edwards said 12 February, 2011, 7:44 am:

    PTX rocks (just researched it! Thanks for the link!)

    But it solves a very specific problem, given its programming model, which is essentially mapping Cg (+ pointers) to a single SIMD way ... similar to what Larrabee's shader compiler does in 16-way SIMD. But as-is it would not be suitable for writing the Larrabee renderer. It covers maybe half of the Larrabee ISA.

    Still it goes a long way!

  4. Quote

    The challenge with your idea is that an ISA that makes sense for a GPU does not support the more general compute requirements of DSP and computer vision very well.

    This is particularly true in respect of the completely different locality requirements: GPUs perform tens of identical operations (shader kernels) on millions of triangles, whereas computer vision algorithms are data-dependent and take tens to hundreds of cycles (operations) per pixel.

    This means that being able to support multiple outstanding requests to external DRAM is not much use in CV applications, whereas it is fundamental to GPU applications.

    In other words:

    GPU: simple, highly parallel HW and SW
    DSP/CV: complex, data-dependent behaviour and SW

    Your idea is therefore most applicable within individual domains and not across domains, IMHO.

    -David

  5. Quote

    Hi David,

    It's true there is a middle ground between the kind of parallelism we see on CPUs (tens of threads) and the typical parallelism we see on GPUs (many thousands of threads). A lot of graphics falls into the upper end of this range, but not all of it, and some of that is due to a chicken-and-egg effect: desktop GPUs need thousands of independent threads, so that's the kind of graphics algorithms we've been writing. I think the middle ground, and richer nested parallelism models, are the more interesting areas now. So I think there's a good bit of common ground between CV and graphics.

    Anyway, I'll drop you a line and we can talk more!

    Cheers,
    Sam
