Benchmarking Deep Neural Nets for Real-Time Vision
Recently we posted early results from our work to bring deep learning to more people through OpenCL support, including initial benchmarks on AMD and NVIDIA hardware. As a business we are building on this technology to bring real-time computer vision to every device. In this post we will discuss the key issue of processing speed, open source a tool we use to measure speed on real workloads, and share our performance progress. Through careful optimization our runtime software, code-named Plaid, is now up to 1.4x faster than TensorFlow 1.3 + cuDNN 6 for real-time vision tasks.
Before digging into the details, a note about why we compare our software to TensorFlow on NVIDIA: Deep learning has advanced dramatically in recent years, enabled by NVIDIA graphics processors, their proprietary CUDA and cuDNN software, and open source deep learning frameworks like Google’s TensorFlow. It’s all very high quality work, and the whole field has benefited greatly from it. The purpose of our Plaid software is to provide the same capability on every chip from every vendor, putting it into the hands of all people building the future. Speed is critical yet extremely difficult to obtain. The NVIDIA+TensorFlow stack is popular and performs very well; this makes it the best reference for our software.
Background: Real Time Vision
Deep neural networks are radically expanding the kinds of tasks solvable with computer vision. As recently as five years ago, vision tasks were only reliably solvable in carefully controlled environments, or in narrow use cases such as face detection where elaborate hand-crafted algorithms are sufficient. In recent years, starting with work in the widely cited AlexNet paper, advances in neural networks have made possible much more sophisticated processing of visual data. Now these algorithms are matching and exceeding human performance in a wide range of tasks, from identifying dog breeds in photos to detecting skin cancer, and the field is advancing rapidly. Soon they will be driving our cars.
Combined with inexpensive high-quality cameras and advances in processing power, this technology makes new kinds of automation possible. For camera-driven applications where predictable performance and high reliability are required it is necessary to process data locally on the device or, in venture capitalist parlance, at the “deep edge.” This environment looks different from a typical datacenter training workload—images must be processed as soon as they are captured by the camera and that information acted on immediately. Processing time is critical.
There are a variety of ways to evaluate processing time for deep learning platforms but the method with the most predictive power is to time a workload representative of your application. One common vision task is detecting objects of a certain type, typically termed “object detection.”
In the last few years a family of efficient object detection algorithms has emerged, including “YOLO” (You Only Look Once), “Single Shot MultiBox Detector”, and “Faster R-CNN.” These approaches essentially use neural networks developed for image classification as “feature extractors” and stack additional “region proposal” layers on top. For an example of how this type of algorithm might be applied to a self-driving car, look at “Vehicle Detection using SSD,” which provides video as well as a nice write-up.
Most of the computation in these approaches is in the underlying feature extraction component. Google has written a good paper, “Speed/accuracy trade-offs for modern convolutional object detectors”, describing the relationship between processing requirements and accuracy in object detection. High-quality object classifiers such as Google’s Xception neural network are frequently applied unchanged to vision tasks, and when not used directly still serve as a representative workload. To get an idea of how a given hardware/software combination will perform you can simply measure the processing throughput of one of the available implementations.
Measuring an existing net such as Xception is best done in a way as close as practical to a production workload. At Vertex we use the Keras framework for prototyping and, conveniently, it already includes a working Xception implementation. We actually use Keras applications and examples for much of our benchmarking and compatibility testing; Keras’s integrations for TensorFlow, CNTK, MXNet, and Theano make it possible to compare performance across any platform that any of those frameworks support.
To make it possible for anyone to test their hardware and software choice the same way we’re sharing part of our benchmark suite, called PlaidBench, on GitHub. PlaidBench takes the approach of running those popular vision networks, including Xception, with little to no modification. To measure throughput representative of a real-time vision application PlaidBench runs inference (classification) on a series of images, one at a time, and reports the cumulative time spent waiting for results.
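The one-image-at-a-time measurement described above can be sketched roughly as follows. This is a minimal illustration, not PlaidBench's actual code: the function name `time_single_image_inference`, its parameters, and the `predict_one` callable are all hypothetical. With Keras, `predict_one` might wrap `model.predict` on a batch containing a single image.

```python
import time


def time_single_image_inference(predict_one, images, warmup=1,
                                clock=time.perf_counter):
    """Return cumulative seconds spent waiting for per-image results.

    predict_one: hypothetical callable that runs inference on one image
                 and blocks until the result is available.
    images:      a sequence (e.g. a list) of inputs, processed one at a time.
    warmup:      number of untimed calls made first, so one-time costs such
                 as kernel compilation are kept out of the measurement.
    clock:       timer function, injectable so the loop can be tested.
    """
    # Untimed warmup calls.
    for image in images[:warmup]:
        predict_one(image)

    # Timed loop: only the time spent inside predict_one is accumulated.
    elapsed = 0.0
    for image in images:
        start = clock()
        predict_one(image)
        elapsed += clock() - start
    return elapsed
```

Timing each call separately and summing, rather than timing the whole loop, is one possible design choice: it keeps any per-image work done outside `predict_one` out of the reported figure.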
As an example of this methodology we compare Plaid to TensorFlow running Xception on an NVIDIA GeForce GTX 1070. This combination represents the upper end of available edge compute power combined with a relatively large but accurate network; for most applications both compute and network need to be scaled down. To provide a more detailed picture we include data from 100 runs for each software stack, displayed as box charts (the X axis is scaled to the result data for legibility):
Each run processes 1024 images, one at a time, and reported units are total seconds. A handful of things are visible in this chart. First, the Plaid runtime is generally about 40% faster than TensorFlow. Second, Plaid’s performance variance is lower. Third, on this hardware both configurations would benefit greatly from work to reduce variability. Fourth, in one run TensorFlow is almost 50% faster than its median, suggesting a path to performance close to Plaid (on NVIDIA).
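For readers who collect their own repeated runs, the quantities a box chart displays, and a median-based speedup between two stacks, could be computed along these lines. This is a sketch under stated assumptions: the function names `summarize_runs` and `median_speedup` are hypothetical, the inputs are lists of per-run total seconds that you supply yourself, and the quartile method is one of several common conventions.

```python
import statistics


def summarize_runs(seconds):
    """Five-number summary (plus IQR) of per-run total times, in seconds."""
    # "inclusive" treats the data as the whole population of runs;
    # other quartile conventions give slightly different q1/q3 values.
    q1, median, q3 = statistics.quantiles(seconds, n=4, method="inclusive")
    return {
        "min": min(seconds),
        "q1": q1,
        "median": median,
        "q3": q3,
        "max": max(seconds),
        "iqr": q3 - q1,  # spread of the middle half of the runs
    }


def median_speedup(baseline_seconds, candidate_seconds):
    """How many times faster the candidate is than the baseline, by median."""
    return (statistics.median(baseline_seconds)
            / statistics.median(candidate_seconds))
```

Comparing medians rather than means keeps a single unusually fast or slow run from dominating the result, which matters when run-to-run variability is high.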
To gather results on your hardware, check out PlaidBench from GitHub and give it a try:
choong@ubuntu:~/plaidbench$ python profile.py xception
Using TensorFlow backend.
Loading the data
Upscaling the data
Loading the model
Compiling
Running initial batch
Warmup
Doing the main timing
Example finished, elapsed: 19.63527708054
It’s pretty simple; let us know what you find. PlaidBench should run fine on any framework that supports Keras, although the currently available public frameworks pretty much only run on NVIDIA hardware. We are continuing work to extract and document our core runtime software for open source, after which any OpenCL-supported device will be within reach. If you’d like to track our progress or share your use case, PlaidBench results, or anything else, find us at:
- Hacker News
- Twitter: @vertexai
- Our discussion group about open deep learning: Deep Learning Everywhere