Open Source Deep Learning on AMD and Beyond

Aug 17, 2017 | By: Choong Ng

Earlier this week, we posted a first look at our work to bring deep learning to more people on more platforms. Today, we’re adding details on our plan to open source our software and an update on our development progress. With our support for the OpenCL open standard, people with a GPU from any manufacturer, including NVIDIA, AMD, and Intel, will soon be able to get started with real datasets in minutes. Users won’t need to sacrifice speed for that freedom: our software is as fast as TensorFlow + cuDNN in some cases, and it will continue to improve.

First, availability. Jeremy Howard of the excellent fast.ai raised the question on Twitter:

Jeremy Howard Tweet

Our core engine, including support for the Keras framework, will be open source and open to outside contributions in the near future. The precise timing will be driven by technical considerations—we are working hard to get that ready, and we’ll provide progress updates next week.

Initially we will support Ubuntu Linux and the desktop hardware we have in our lab, currently a set of common consumer GPUs, and aim to provide good performance on each one. We’ll provide detail next week on that as well. Here are updated numbers for an expanded set of image classification nets:

Throughput Chart

The methodology here is the same as in our last post: we measure inference throughput running a real net on Keras, normalized to the NVIDIA Tesla K80 for each network. From the top, we first include TensorFlow 1.2 + cuDNN 5.1 on K80 (by definition 1.0), and then K80 and other GPUs running our Keras integration. The chart covers four nets from the Keras applications directory, including Google’s recent Xception network. We include CLPeak compute throughput (GFLOPS) as the first data point for each card to give an idea of how fast the underlying chip is. The AMD R9 Nano, for example, is 2.08x faster than the K80 in pure GFLOPS but runs Xception only 1.05x faster, indicating we have work to do there.
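To make the normalization concrete, here is a minimal Python sketch of the idea, not our actual benchmark harness. It times a prediction callable, converts the result to images per second, divides every result by a chosen baseline, and compares a card's speedup on a net with its raw compute advantage. The helper names (`measure_throughput`, `normalize`, `efficiency_vs_compute`) are hypothetical, and the only real figures used are the R9 Nano ratios quoted above.

```python
import time


def measure_throughput(predict, batch, batch_size, iterations=10, warmup=2):
    """Return images per second for a prediction callable (hypothetical helper)."""
    # Warm-up runs keep one-time setup costs out of the timed loop.
    for _ in range(warmup):
        predict(batch)
    start = time.perf_counter()
    for _ in range(iterations):
        predict(batch)
    elapsed = time.perf_counter() - start
    return (iterations * batch_size) / elapsed


def normalize(throughputs, baseline_key):
    """Divide every throughput by the baseline entry, so the baseline is 1.0."""
    base = throughputs[baseline_key]
    return {name: value / base for name, value in throughputs.items()}


def efficiency_vs_compute(net_speedup, gflops_speedup):
    """Fraction of a card's raw compute advantage that shows up on a real net."""
    return net_speedup / gflops_speedup


# R9 Nano versus K80, using the two ratios quoted in the text.
r9_nano_efficiency = efficiency_vs_compute(net_speedup=1.05, gflops_speedup=2.08)
print(f"R9 Nano realizes {r9_nano_efficiency:.2f} of its GFLOPS advantage on Xception")
```

With Keras, `predict` could be a model's `predict` method and `batch` an array of preprocessed images. By this measure the R9 Nano realizes roughly half of its raw compute advantage on Xception.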

Three apparent takeaways: First, on K80, Google’s Xception net runs slightly faster on our implementation than TensorFlow + cuDNN. We’ve scrutinized the numbers and we’re confident that result is correct. It appears the just-released TF 1.3 + cuDNN 6 is faster; we’ll upgrade and post new numbers next week.

Second, we should be much faster on both AMD cards than what we’re seeing currently. We have a strong hypothesis about how to improve that and work is ongoing. This is of keen interest to us because Radeon Vega family cards have the potential to bring 26 TFLOPS, 7x the throughput of a K80, to ordinary desktop PCs for $1000 or less.
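As a quick sanity check on those figures, the arithmetic below works out the K80 baseline they imply. It assumes the 26 TFLOPS figure describes the Vega card and that the 7x ratio is taken against a single K80 baseline. The variable names are illustrative only.

```python
# Both inputs come from the paragraph above; the reading of which card
# the 26 TFLOPS belongs to is an assumption.
vega_tflops = 26.0
speedup_vs_k80 = 7.0

# Baseline throughput implied by the two figures.
implied_k80_tflops = vega_tflops / speedup_vs_k80
print(f"Implied K80 baseline: {implied_k80_tflops:.1f} TFLOPS")
```

This gives a baseline of about 3.7 TFLOPS.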

Third, despite the handicap, the budget RX 480 is still faster on some workloads than the K80. This card is in the same family as the AMD GPUs Apple puts in its Macs—this lays the groundwork for a future version of our software supporting Mac users.

If you’d like to weigh in on how you use deep learning, your hardware wish list, or anything else, find us at:

© 2017 Vertex.AI