Fully Automatic Differentiation for Tensor Expressions

May 17, 2018 | By: Tim Zerrell



Deep learning advances routinely require the construction of new neural network operations. Adding such operations has proven to be a labor-intensive step that introduces delays on the order of months. PlaidML enables researchers to add ops in hours through sophisticated code generation algorithms utilizing the Tile language. This post will explain a portion of this process, unique to PlaidML, that automatically generates gradient kernels, and will compare it to related work such as TensorComprehensions and NNVM/TVM.

PlaidML, TensorComprehensions, and NNVM all allow researchers to immediately compile kernels for novel tensor operations. By automating kernel construction, these frameworks allow researchers to test experimental kernels without waiting for an engineering team to implement them with GPU-friendly code. However, developing working gradients for these kernels is challenging, and without gradients new operations cannot be trained and their efficacy cannot be tested.

PlaidML automatically differentiates all operations. While almost all machine learning frameworks support autodiff at a network level, they typically require the derivatives of each operation to be written by hand. As far as we know, PlaidML is currently the only framework to automatically produce gradients of even the primitive operations themselves. To add a new operation to PlaidML, it is only necessary to write a forward-pass operation in PlaidML’s Tile language; PlaidML is capable of compiling such code into efficient kernels for both forward-pass and derivative computations. These methods work on arbitrary numbers of dimensions, handle all common padding cases, and can be used with striding, dilation, grouping, and techniques researchers have yet to imagine!

To accomplish this, PlaidML first generates valid but inefficient Tile code for the derivative, and then transforms it to make it efficient. These transformations are applied to all Tile code. Thus, in addition to enabling automatic gradient generation, these transformations empower researchers to focus on correctness and clarity of code, rather than translatability into performant kernels, when writing forward pass operations.

Today we’ll explore how these algorithms perform gradient generation for an operation based on multiply-accumulate. We’ll look at how the Tile code for a strided dilated convolution is processed. If you are unfamiliar with Tile and would like a beginner’s introduction so you can follow the technical details of this post, I recommend our Tile tutorial.

Autodiff of Multiply Accumulate

Let’s consider a 2x dilated, stride 3 convolution with “valid” padding. Note that this padding mode is implicit in the Tile code: it is specified by the sizes of the input and output tensors, together with the constant offsets (in this case all 0) that occur inside the index formulas. Any term whose indices would logically be “out of bounds” on any tensor simply does not contribute to the accumulation on the left-hand side, which is logically equivalent to padding the inputs with 0s. We’ll also use only 2 spatial dimensions to keep the formulas short, but a 7D case would be fully supported. Here’s Tile code for the forward pass of a 2x dilated, stride 3 convolution with “valid” padding (we assume dimensions for I and K appropriate to these criteria):

function (I[N, H, W, CI], K[KH, KW, CI, CO]) -> (O) {
    O[n, y, x, co: N, H/3, W/3, CO] = +(I[n, 3*y + 2*j, 3*x + 2*i, ci] * K[j, i, ci, co]);
}

This corresponds to the following formula for the convolution:

O[n, y, x, c_o] = Σ_{j, i, c_i} I[n, 3y + 2j, 3x + 2i, c_i] · K[j, i, c_i, c_o]
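
If it helps to see this multiply-accumulate spelled out imperatively, here is a minimal NumPy sketch of the same contraction (illustrative only, not PlaidML code). The shapes are the ones used in the test case later in this post: I is 2x9x9x5 and K is 2x2x5x7, which gives O the shape 2x3x3x7.

import numpy as np

# Illustrative NumPy reference for the Tile contraction above (not PlaidML code).
N, H, W, CI, CO, KH, KW = 2, 9, 9, 5, 7, 2, 2
I = np.random.rand(N, H, W, CI)
K = np.random.rand(KH, KW, CI, CO)

O = np.zeros((N, H // 3, W // 3, CO))
for n in range(N):
    for y in range(H // 3):
        for x in range(W // 3):
            for j in range(KH):
                for i in range(KW):
                    for ci in range(CI):
                        for co in range(CO):
                            # O[n, y, x, co] accumulates I[n, 3y + 2j, 3x + 2i, ci] * K[j, i, ci, co]
                            O[n, y, x, co] += I[n, 3*y + 2*j, 3*x + 2*i, ci] * K[j, i, ci, co]

With these shapes every read of I stays in bounds (the largest spatial index is 3*2 + 2*1 = 8 < 9), so no explicit bounds check is needed here; in general, Tile simply skips any term whose indices fall out of bounds.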

For automatic differentiation, we will have the derivative of some ultimate output (say, a loss L) with respect to O, which we will denote DO. Using the product rule, chain rule, and distributive property, it is possible to compute DI, the derivative of L with respect to I, by the formula

DI[n, 3y + 2j, 3x + 2i, c_i] = Σ_{c_o} DO[n, y, x, c_o] · K[j, i, c_i, c_o]

where contributions from every combination of y, x, j, and i that maps to the same location of DI are accumulated.

Note that DI and DO have the same indices as I and O respectively: this formula just flips I and O in the forward formula and switches to derivatives. In general, if the operation is a multiply accumulate, this simple flip always produces a valid derivative. Of particular note is that while the original convolution needed no edge case handling in the resulting kernel, the derivative does. However, since the semantics of Tile specify that whenever an index is out of bounds the accumulation does not occur, there is no need for explicit edge handling.

However, the Tile code for this formula introduces a challenge: the index formulas for DI, the tensor being written, are multivariable polynomials rather than single variables. A kernel implementation over the original indices would therefore need proper atomic updates and concurrency management on its accumulators. But as we will see in more detail later, PlaidML includes code for rewriting such equations, and the corresponding Tile code compiles and runs without issue:

function (DO[N, OH, OW, CO], K[KH, KW, CI, CO]) -> (DI) {
    DI[n, 3*y + 2*j, 3*x + 2*i, ci: N, 3*OH, 3*OW, CI] = +(DO[n, y, x, co] * K[j, i, ci, co]);
}
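
To make the accumulation pattern concrete, here is a NumPy sketch of what this derivative contraction computes (again illustrative only, not the code PlaidML generates), using the same shapes as before:

import numpy as np

# Illustrative NumPy sketch of the derivative contraction above (not PlaidML-generated code).
N, OH, OW, CI, CO, KH, KW = 2, 3, 3, 5, 7, 2, 2
DO = np.random.rand(N, OH, OW, CO)  # derivative of the final output with respect to O
K = np.random.rand(KH, KW, CI, CO)

DI = np.zeros((N, 3 * OH, 3 * OW, CI))
for n in range(N):
    for y in range(OH):
        for x in range(OW):
            for j in range(KH):
                for i in range(KW):
                    h, w = 3*y + 2*j, 3*x + 2*i
                    # Tile skips any accumulation whose indices fall out of bounds;
                    # with these shapes every write lands in bounds, but the check
                    # mirrors that rule.
                    if h >= DI.shape[1] or w >= DI.shape[2]:
                        continue
                    for ci in range(CI):
                        for co in range(CO):
                            DI[n, h, w, ci] += DO[n, y, x, co] * K[j, i, ci, co]

A kernel that iterated over these original indices would have to perform exactly these scatter-style writes, which is why PlaidML rewrites the index formulas, as described below.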

Note that the Tile derivative function above is written only as a demonstration: PlaidML creates it automatically, without any user input beyond a request for the derivative of the forward-pass Tile function. The automatically generated code is somewhat less human-readable; for example, here is the pre-optimization derivative code PlaidML generated when I made test code for this post:

function (
  X_I_0[X_I_0_0, X_I_0_1, X_I_0_2, X_I_0_3],
  X_I_1[X_I_1_0, X_I_1_1, X_I_1_2, X_I_1_3]
) -> (
  X_T15,
  X_T17
) {
  X_T13 = 0.007936507936507936;
  X_T5[x0, x1, x2, x3 : 2, 3, 3, 7] = +(X_T13[]);
  X_T0[n, 2*k0 + 3*x0, 2*k1 + 3*x1, ci : 2, 9, 9, 5] = +(X_T5[n, x0, x1, co] * X_I_0[k0, k1, ci, co]);
  X_T15 = ident(X_T0);
  X_T16[k0, k1, ci, co : 2, 2, 5, 7] = +(X_I_1[n, 2*k0 + 3*x0, 2*k1 + 3*x1, ci] * X_T5[n, x0, x1, co]);
  X_T17 = ident(X_T16);
}

This code includes some non-cosmetic differences: it produces derivatives with respect to both I and K; they are the derivatives of the average of output values of the convolution; and it explicitly includes the dimensions of the input and kernel tensors I used for testing.
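
As a quick sanity check on the constant: the output of this test convolution has shape 2x3x3x7, so the derivative of the mean with respect to each output element is 1/126, which is exactly X_T13 (the broadcast into X_T5 then plays the role of DO in the two contractions):

# X_T13 is d(mean(O)) / d(each element of O) for an output of shape (2, 3, 3, 7).
print(1.0 / (2 * 3 * 3 * 7))  # 0.007936507936507936, matching X_T13 above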

Under the Hood

Two functions are key to Tile’s ability to run code like this efficiently: reduce and defract. The reduce function converts output indices from polynomials to single variables, and defract removes any fractional indices introduced by reduce. Together, these functions reduce the number of constraints that must be checked to keep all indices in bounds, and they decouple the output indices, making parallelization simpler.

Reduce

Tile uses a straightforward affine transformation to convert output indices into single variables. In our example, it uses the transformation

ci = v3
co = v6
i = v5
j = v4
n = v0
x = 1/3*v2 - 2/3*v5
y = 1/3*v1 - 2/3*v4

to produce the reduced Tile code

DI[v0, v1, v2, v3] = +(DO[v0, 1/3*v1 - 2/3*v4, 1/3*v2 - 2/3*v5, v6] * K[v4, v5, v3, v6]);
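
As an illustrative check (plain SymPy, not part of PlaidML), substituting the original output index polynomials, v1 = 3*y + 2*j and v4 = j, into the reduced expressions recovers y (and likewise x):

import sympy as sp

# Illustrative check (not PlaidML code): the reduce transformation is invertible.
y, x, i, j = sp.symbols('y x i j')
v1, v2, v4, v5 = 3*y + 2*j, 3*x + 2*i, j, i
print(sp.simplify(sp.Rational(1, 3)*v1 - sp.Rational(2, 3)*v4))  # prints y
print(sp.simplify(sp.Rational(1, 3)*v2 - sp.Rational(2, 3)*v5))  # prints x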

As mentioned earlier, Tile functions also implicitly include constraints on the index variables (namely, that the resulting indices be integers within bounds for each tensor). These constraints are also rewritten in terms of the new index variables.

Defractionalization

The reduce step often introduces fractional coefficients into one or more index formulas. It would be expensive to check whether such formulas are integral at runtime, and so we call defract to perform a defractionalization step to eliminate these fractions. This algorithm is too complex to discuss in a blog post; if you are curious it is implemented here (to understand this code you will need to be familiar with lattices, including duality and the Hermite Normal Form).

For our example, defract selects the transformation

v0 = v0_0
v1 = v1_0 + 3*v1_1
v2 = v2_0 + 3*v2_1
v3 = v3_0
v4 = 2*v1_0 + 3*v4_0
v5 = 2*v2_0 + 3*v5_0
v6 = v6_0

producing the Tile code

DI[v0_0, v1_0 + 3*v1_1, v2_0 + 3*v2_1, v3_0] = +(DO[v0_0, -v1_0 + v1_1 - 2*v4_0, -v2_0 + v2_1 - 2*v5_0, v6_0] * K[2*v1_0 + 3*v4_0, 2*v2_0 + 3*v5_0, v3_0, v6_0]);
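
Again as an illustrative check (plain SymPy, not PlaidML code), substituting the defract transformation into the fractional DO index 1/3*v1 - 2/3*v4 shows that the fractions cancel:

import sympy as sp

# Illustrative check (not PlaidML code): after the defract substitution,
# the fractional index 1/3*v1 - 2/3*v4 has only integer coefficients.
v1_0, v1_1, v4_0 = sp.symbols('v1_0 v1_1 v4_0')
v1 = v1_0 + 3*v1_1
v4 = 2*v1_0 + 3*v4_0
print(sp.expand(sp.Rational(1, 3)*v1 - sp.Rational(2, 3)*v4))
# prints -v1_0 + v1_1 - 2*v4_0, matching the DO index in the Tile code above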

Defractionalization sometimes reintroduces multivariable formulas into the output indices (as it did in this case). When this happens, additional constraints are introduced so that all valid combinations of index variable values used in the output indices produce distinct output indices. In our example, these constraints are

0 <= v1_0 < 3
0 <= v2_0 < 3

So there is only one way to produce any value a of the second index of DI, namely v1_0 = a % 3 and v1_1 = a / 3 (using integer division). Thus, even when defractionalization introduces multivariable formulas, each location in the output tensor is written in only one way, which is invaluable in generating an efficient kernel.
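
A tiny Python check (illustrative only) confirms this uniqueness for the second output index of DI, whose dimension is 9 in this example:

# Illustrative check: with 0 <= v1_0 < 3, each index value a has exactly one
# decomposition a = v1_0 + 3*v1_1, namely v1_0 = a % 3 and v1_1 = a // 3.
for a in range(9):
    matches = [(v1_0, v1_1) for v1_0 in range(3) for v1_1 in range(3)
               if v1_0 + 3*v1_1 == a]
    assert matches == [(a % 3, a // 3)]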

Impact

By using reduce and defract, PlaidML lets practitioners developing a new operation focus on forward-pass correctness. Once forward-pass code is written in Tile, an efficient gradient kernel and GPU-optimized code are produced automatically without requiring any additional work from the user. This reduces the time required to implement new operations to hours, speeding the research-deployment cycle and creating a way to get new advances deployed sooner.


© 2018 Intel Corporation