Bringing Deep Learning to OpenCL

Aug 14, 2017 | By: Choong Ng

I’m excited to announce Vertex.AI’s work to bring deep learning to OpenCL and share a first look at our results so far. This work is intended to make deep learning accessible to more people and speed up progress across the field. Read on for the details and what’s coming next.

Deep learning is a rapidly developing field with major impact on nearly every area of technology and business. One common obstacle to putting it into practice is device compatibility. Popular deep learning frameworks such as Google’s TensorFlow target Linux and NVIDIA GPUs, with minimal support at best for other configurations. User-friendly interfaces like Keras (also from Google) are being adopted rapidly and make this technology easier to use, but they don’t address the underlying compatibility restrictions. In Vertex’s work to support edge computing devices, we’ve solved the problem for ourselves by building a software platform that allows us to deploy nearly anywhere.

Still, platform support is not only an edge computing problem. One of the major bottlenecks to deep learning’s advance is a lack of practitioners. High-quality educational materials, like the courses at and Coursera, are available to anyone with an Internet connection and the right computer setup. The second requirement can be daunting, though: getting the right hardware set up with all the correct drivers, proprietary libraries, and other software properly configured is challenging. One way to reduce complexity and increase compatibility is to support an open standard like OpenCL.

Supporting OpenCL is not a new idea. You can find a long discussion in TensorFlow’s all-time most active ticket, #22 OpenCL support, as well as tickets for other frameworks[1] and a handful of ongoing development efforts.[2] Although it is not strictly OpenCL, I’ll also note that AMD has been working on an AMD-specific set of libraries called MIOpen, built on their ROCm tool chain. Still, none of this work has reached wide adoption.

We can help, and we will. Soon we will make freely available a version of our technology with support for the popular Keras interface and compatibility with a range of OpenCL-capable devices. We’ve started work on this in earnest, and our first goal is to support many of Keras’s built-in neural nets. One of the key challenges in supporting OpenCL well is tuning the code for good throughput. Here is a first look at our current progress, showing our prototype running Keras’s built-in ResNet-50 image classifier:

Keras ResNet-50 Inference Throughput chart

This chart shows inference (classification) throughput on Linux, normalized against an NVIDIA K80 running Keras on TensorFlow with cuDNN. We’ve chosen the K80 as a reference both because it is readily available to rent in the cloud and because we use it frequently in our own work. The blue bars are the exact same Keras code running on our platform with OpenCL. To show portability we include two common AMD GPUs. The RX 480 is a low-cost consumer gaming GPU, while the R9 Nano is a close relative of both the FirePro S9300, which Google has announced will be available to customers soon, and AMD’s MI8 machine learning accelerator.
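
For readers who want to reproduce this kind of comparison, the normalization can be sketched in a few lines of Python. This is only an illustrative sketch, not the harness behind the chart: `measure_throughput`, `normalize`, and the dummy classifier are hypothetical names, and the values passed to `normalize` are placeholders rather than measurements.

```python
import time


def measure_throughput(predict, batch, warmup=1, runs=5):
    """Return images per second for a callable that classifies one batch."""
    # Warm-up calls keep one-time costs (for example kernel compilation
    # or memory allocation) out of the timed region.
    for _ in range(warmup):
        predict(batch)
    start = time.perf_counter()
    for _ in range(runs):
        predict(batch)
    elapsed = time.perf_counter() - start
    return runs * len(batch) / elapsed


def normalize(throughputs, reference):
    """Express each throughput as a fraction of the reference configuration."""
    ref = throughputs[reference]
    return {name: value / ref for name, value in throughputs.items()}


if __name__ == "__main__":
    # Stand-in for a real model (assumption): any callable taking a batch works.
    def dummy_predict(batch):
        return [sum(image) for image in batch]

    batch = [[0.0] * 1000 for _ in range(32)]
    print("dummy images/sec:", measure_throughput(dummy_predict, batch))

    # Placeholder values for illustration only, NOT measurements.
    example = {"reference": 200.0, "candidate": 150.0}
    print(normalize(example, "reference"))
```

With Keras installed, `predict` could be the `predict` method of a model such as `keras.applications.ResNet50(weights=None)`, and `batch` a NumPy array of shape `(N, 224, 224, 3)`; both choices are assumptions about the setup rather than a description of how the chart was produced.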

On the K80 our software reaches 83% of the throughput of the TensorFlow + cuDNN reference. From what we know about the work Google and NVIDIA have put into performance, we feel pretty good about our from-scratch implementation being this close already. There’s a lot of opportunity on our side to find better performance though, both through exploiting the untapped performance on AMD and moving to newer and faster NVIDIA boards. For an idea of how much headroom there is we can use the CLPeak synthetic benchmark to gauge raw performance:

CLPeak Memory Bandwidth and Compute chart

These numbers are also normalized against the K80. I won’t spend much time on the details now, but the compute numbers are very roughly predictive of a chip’s deep learning performance. At a glance you can reasonably expect an R9 Nano or GTX 1070 to eventually train and run neural nets about twice as fast as a K80, while the RX 480 should be about 40% faster than a K80. Raw numbers:

CLPeak Memory Bandwidth and Compute table
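
To make the "very roughly predictive" reading concrete, here is a small Python sketch of the headroom arithmetic. It uses only the approximate ratios quoted above (about 2x the K80 for the R9 Nano and GTX 1070, about 1.4x for the RX 480) and rests on two loud assumptions: that throughput would scale linearly with raw compute, and that the TensorFlow + cuDNN result on the K80 defines 1.0. The `headroom` function and the `EXPECTED_VS_K80` table are hypothetical names for illustration.

```python
# Approximate compute ratios relative to the K80, as quoted in the text.
# These are rounded figures from the prose, not the raw CLPeak numbers.
EXPECTED_VS_K80 = {"K80": 1.0, "RX 480": 1.4, "R9 Nano": 2.0, "GTX 1070": 2.0}


def headroom(device, measured_vs_reference):
    """Factor by which throughput could grow if it tracked raw compute.

    measured_vs_reference is the device's current throughput divided by
    the K80 TensorFlow + cuDNN reference throughput.
    """
    if measured_vs_reference <= 0:
        raise ValueError("measured throughput must be positive")
    return EXPECTED_VS_K80[device] / measured_vs_reference


if __name__ == "__main__":
    # 0.83 is the K80 figure reported above for our OpenCL prototype.
    print(f"K80 headroom: {headroom('K80', 0.83):.2f}x")
```

The same function can be applied to the AMD boards once their measured ratios are read off the throughput chart.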

Our current throughput on AMD is quite usable but there clearly is a lot of room for us to grow. Additionally, newer GPUs including the GTX 1070 look promising. From the standpoint of reaching the most people, Intel’s integrated Processor Graphics is interesting because it is easily the most ubiquitous laptop GPU architecture. In recent generations it may be fast enough to be useful for deep learning so we will study it as well.

This work is high priority for us and we will continue improving throughput, bringing up more Keras nets, testing new hardware, and posting the results regularly. We will also start to publish some of the details on how our technology works and the work we do on edge devices. To get the latest updates:

For any other questions or ideas for collaboration please e-mail and I’ll connect you with the right person here.

  1. Discussions on GitHub:

  2. Links to some of the other efforts to bring OpenCL support to popular deep learning frameworks:
