Tensor Compilers: Comparing PlaidML, Tensor Comprehensions, and TVM

May 19, 2018 | By: Brian Retford and Jeremy Bruestle


One of the most complex and performance-critical parts of any machine learning framework is its support for device-specific acceleration. Indeed, without efficient GPU acceleration, much of modern ML research and deployment would not be possible. This acceleration support is also a critical bottleneck, both for adding support for a wider range of hardware targets (including mobile) and for writing new research kernels. Much of NVIDIA’s dominance in machine learning can be attributed to its greater level of software support, largely in the form of the cuDNN acceleration library.

We wrote PlaidML to overcome this bottleneck. PlaidML automatically generates efficient GPU kernels for a wide range of hardware, for both existing machine learning operations and new research kernels. Because generating an efficient kernel is a complex process, GPU kernels have typically been written by hand. Along with PlaidML, two additional projects, Tensor Comprehensions and TVM, are attempting to change this paradigm. Tensor Comprehensions makes the case for the importance of these technologies in its very well-written announcement.

In this post, we compare PlaidML, Tensor Comprehensions, and TVM along multiple dimensions, including performance and feature set. We begin with performance.

Performance

To evaluate performance, we created a branch of PlaidBench to produce timings for specific kernels. For now, we’ve focused on dense layers and convolutional layers. We would have preferred to time entire networks, similar to our earlier evaluation of relative performance against TensorFlow and cuDNN (for example, PlaidML is about twice as fast as TF/cuDNN on ResNet at batch size 4; see the appendix). However, it is currently difficult to benchmark full networks using Tensor Comprehensions.

The benchmarking code constructs tensor operations and then runs them in the tightest loop possible (ignoring memory transfers and focusing on computation times). All of our timings were validated by running against CUDA drivers and comparing to the numbers reported by nvprof. More specific details, as well as the information needed to reproduce our results, can be found in the methodology section of this post. Getting straight to it, here are our initial results:

Note: the TVM results have been updated since the original publication; see the methodology section.

Relative performance of PlaidML, Tensor Comprehensions, and TVM

We also examined compilation time as an element of performance. With regard to compilation times, it’s hard to make an apples-to-apples comparison of the three frameworks. PlaidML and Tensor Comprehensions both do fully automatic kernel generation with no user oversight: PlaidML searches over a limited space of possible kernels using a model of the hardware, with optional performance-guided optimization, while Tensor Comprehensions does a larger search using a genetic algorithm. TVM, on the other hand, relies on human intervention to provide a schedule, simplifying the labor-intensive process of handwriting kernels but not fully automating it. It’s unclear how much effort goes into creating these schedules, but you can see an example for batch size 1 convolutions on broadly NVIDIA-like cards. Note that to get decent performance across the board, these schedules must be created for many combinations of memory layout (NCHW vs. NHWC, etc.) as well as batch sizes, input sizes, and output sizes.
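To make the distinction concrete, the sketch below shows what a compute definition plus a hand-written schedule looks like in TVM’s Python API as of early 2018. The operation, split factor, and thread bindings are illustrative choices on our part, not one of TVM’s published schedules:

import tvm

# Declare the computation: B[i] = A[i] * 2 over a symbolic length n.
n = tvm.var("n")
A = tvm.placeholder((n,), name="A")
B = tvm.compute((n,), lambda i: A[i] * 2.0, name="B")

# The schedule is supplied by a human: how to tile the loop and map it onto GPU threads.
s = tvm.create_schedule(B.op)
bx, tx = s[B].split(B.op.axis[0], factor=64)  # illustrative tile size
s[B].bind(bx, tvm.thread_axis("blockIdx.x"))
s[B].bind(tx, tvm.thread_axis("threadIdx.x"))

# Generate a CUDA kernel from the schedule.
fmul = tvm.build(s, [A, B], target="cuda", name="mul2")

The compute definition stays the same across devices; nearly all of the performance comes from the schedule, which is why a schedule tuned for one layout or batch size doesn’t automatically carry over to another.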

For PlaidML, we consider the full compilation process, including any performance guided optimizations. For Tensor Comprehensions, we also consider the full autotuning process. For TVM, we only consider the time required to generate the kernel given the existence of a pre-written schedule.

Compilation times of PlaidML, Tensor Comprehensions, and TVM

Note that the compilation times here are quite variable, so we’ve made the y-axis log-scale to capture the data. Check out the methodology section for more details.

Feature Set

In addition to overall performance, PlaidML, Tensor Comprehensions, and TVM differ in which features they currently support. We’ve made a chart highlighting some of these differences.

Feature matrix of PlaidML, Tensor Comprehensions, and TVM

We’ve already mentioned the different approach each compiler takes toward auto-tuning. Regarding automatic differentiation of tensor operations, please see our blog post on fully-automatic-differentiation. There is also a difference in the set of kernels each compiler can currently generate reasonably well, and in the effect this has on how useful each system is for running real networks. PlaidML is quite mature in this area, running almost any Keras or ONNX network. PlaidML and TVM both support a wide range of drivers, which is critical for portability. PlaidML’s support for complex left-hand expressions enables autodiff, but it also allows kernel authors to write kernels in the most intuitive way for the operation being expressed.
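As an illustration of how little work running a real network takes, this is roughly what using PlaidML as a drop-in Keras backend looks like; the model and input shape are arbitrary choices for the example, and device selection is assumed to have already been done with plaidml-setup:

# Install PlaidML as the Keras backend before importing keras itself.
import plaidml.keras
plaidml.keras.install_backend()

import numpy as np
from keras.applications import resnet50

# Any Keras model then compiles to OpenCL kernels via PlaidML.
model = resnet50.ResNet50(weights=None)
x = np.zeros((1, 224, 224, 3), dtype="float32")
y = model.predict(x, batch_size=1)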

Some key takeaways from these differences:

Performance Testing Methodology

We modified PlaidBench to produce timings for specific kernels.

For now, we’ve focused on dense layers and convolutional layers, though we will add more and we’ll release the new benchmarking module with the next version of PlaidBench.

The code constructs tensor operations and then runs them in the tightest loop possible (ignoring memory transfers and focusing on computation times). All of our timings were validated against CUDA backends using nvprof to ensure accuracy. You should be able to reproduce these timings, as well as examine all the details of our methods, by checking out the branch linked above.
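In outline, the harness does something like the following; time_kernel, backend.run, and backend.sync are placeholder names standing in for the per-framework calls in our PlaidBench branch, not its actual API:

import time

def time_kernel(backend, op, warmups=2, iterations=32):
    # Warm up so compilation and first-run allocations are excluded from the timing.
    for _ in range(warmups):
        backend.run(op)
    backend.sync()  # let the device go idle before starting the clock

    start = time.time()
    for _ in range(iterations):
        backend.run(op)
        backend.sync()  # without this, the GPU may still be executing queued kernels
    return (time.time() - start) / iterations

The per-iteration sync is the detail discussed in the TVM update below.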

Operations

The choice of operations was necessarily somewhat arbitrary. We tried to include both unusual and common shapes. We also made sure to avoid cherry-picking by deciding on the set of operations to test before running any benchmarks. Here are the details of the shapes we used:

# ci, h, w, co, i, j
"conv2d_odd_sml":      lambda p: conv2d.Conv2d(p, 16, 57, 57, 34, 3, 2),
"conv2d_odd_med":      lambda p: conv2d.Conv2d(p, 133, 16, 16, 266, 4, 4),
"conv2d_resnet50_med": lambda p: conv2d.Conv2d(p, 256, 14, 14, 1024, 1, 1),
"conv2d_vgg_lrg":      lambda p: conv2d.Conv2d(p, 128, 122, 122, 128, 3, 3),
# i, j, k
"dense_odd_sml":       lambda p: dense.Dense(p, 122, 98, 179),
"dense_odd_med":       lambda p: dense.Dense(p, 110, 512, 313),
"dense_odd_lrg":       lambda p: dense.Dense(p, 333, 455, 633),

Tensor Comprehensions

Autotune was set to 7 generations, a population size of 13, and a timeout of 30 minutes. Based on our testing on a smaller set of kernels, additional autotuning seemed to provide only marginal additional speedup. We would have liked to run the autotuner for much longer in general, but as kernel sizes increase, that becomes increasingly untenable without a per-kernel timeout.
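For reference, defining and autotuning a kernel with the Tensor Comprehensions Python bindings looks roughly like the sketch below. The matmul definition follows TC’s published examples; the generations and pop_size keyword names are our recollection of the 0.1 bindings and should be treated as approximate:

import torch
import tensor_comprehensions as tc

lang = """
def matmul(float(M, K) A, float(K, N) B) -> (C) {
    C(m, n) +=! A(m, kk) * B(kk, n)
}
"""
matmul = tc.define(lang, name="matmul")

A = torch.randn(128, 64).cuda()
B = torch.randn(64, 256).cuda()

# Genetic-algorithm search over mapping options; keyword names are assumed,
# values match the settings described above.
matmul.autotune(A, B, generations=7, pop_size=13)
out = matmul(A, B)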

TVM

We relied purely on publicly available schedules for TVM kernels. We made sure to include kernels that existing TVM tests and schedules appeared optimized to handle. It’s understood that the published schedules for TVM are optimized for batch size 1, hence the general lack of speedup for larger batches.

A brief update: since our original publication, we discovered that the data used to generate the graphs was out of date with the code. The code now makes explicit calls to ‘sync’ after each batch; this is critical, as otherwise the GPU can schedule some kernels while others are still finishing, and we were in fact not calling sync at all. Fixing this has, in general, severely impacted TVM’s performance for large, dense operations; the rest of the numbers were fairly accurate in the first place and have dropped by perhaps 5-10%. As noted, the code is available and we’d love help ensuring it’s doing the right thing.

Mea culpa: our ‘manually compare against nvprof’ technique broke down here, as we didn’t do it for every kernel, and for the smaller kernels it was accurate. Clearly, we need to automate that process.

Conclusion

PlaidML is a mature, high-performance way to generate GPU-accelerated kernels. TVM and Tensor Comprehensions are both amazing projects that we have a lot of respect for. As Tensor Comprehensions noted in its announcement, these technologies represent an order of magnitude increase in developer productivity.

PlaidML’s specific features represent yet another order of magnitude increase in developer productivity. Full autodiff, instant optimization for any shape, and complete network support mean that a small team of people can now design entirely new network architectures that can be deployed anywhere, immediately.


Appendix

PlaidML vs TVM Whole Network

As noted above, we couldn’t easily include Tensor Comprehensions in this comparison, and it’s unclear whether it’s really suited for use as a portability mechanism. It is fairly easy to run whole networks against TVM thanks to its integration with NNVM (https://github.com/dmlc/nnvm). We forked the nnvm-rocm tool and used it to compare NNVM/TVM against PlaidML.

We intentionally did not enable any of TVM’s fallback optimizations (i.e., using cuDNN and cuBLAS). A hugely important role of these frameworks is to enable portability and make it easy to bring full network support to new hardware quickly. We used OpenCL for these numbers because it’s the most portable platform; we did run against CUDA as well, and in our experience there’s really no difference.
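For reference, compiling and running a network through NNVM for an OpenCL target looked roughly like this at the time; the ONNX file name and input shape are placeholders, and our actual runs used the forked nnvm-rocm scripts rather than this exact snippet:

import numpy as np
import onnx
import nnvm.frontend
import nnvm.compiler
import tvm
from tvm.contrib import graph_runtime

# Convert a trained network into NNVM's graph representation.
onnx_model = onnx.load("resnet50.onnx")    # placeholder model file
sym, params = nnvm.frontend.from_onnx(onnx_model)

# Compile for OpenCL; no cuDNN/cuBLAS fallback libraries are involved.
shape = {"data": (1, 3, 224, 224)}         # placeholder input shape
graph, lib, params = nnvm.compiler.build(sym, target="opencl", shape=shape, params=params)

# Run a single batch-size-1 inference through the graph runtime.
ctx = tvm.cl(0)
module = graph_runtime.create(graph, lib, ctx)
module.set_input("data", tvm.nd.array(np.zeros((1, 3, 224, 224), dtype="float32")))
module.set_input(**params)
module.run()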

Cases where TVM shows ‘0’ are those where the network would not compile and run against the current versions of NNVM and TVM.

PlaidML performance relative to TVM on OpenCL: inferences/second for batch size 1 on an NVIDIA GTX 1070
PlaidML performance relative to TVM on OpenCL: inferences/second for batch size 1 on an AMD R9 Fury

PlaidML vs TF/cuDNN

Chart of relative performance of PlaidML and TensorFlow with cuDNN acceleration on various networks:

Relative performance of PlaidML and TensorFlow/cuDNN