Lazy Execution in GPU MATLAB® computing

by sandeep on February 9, 2010

in MATLAB®

Several users have written to us asking for a better explanation of Jacket’s compile-on-the-fly approach to GPU computing. This capability is one of the ways Jacket lets users of the M language run high-performance M code on NVIDIA GPUs without having to program in C++ and CUDA. This article illustrates the advantages of the design and gives a few examples of how to use it efficiently.

Jacket employs a compile-on-the-fly approach, referred to as “lazy execution”, primarily for arithmetic expressions. Each compilation carries an overhead cost, which is why Jacket tries to keep the number of compilations low (the Jacket User Guide provides more background). The gforce statement can be used to bypass lazy execution and force computation at any stage, if applicable.

Consider the following example:

A = cos(a) + sin(b)
B = sin(a) + cos(b)

In this case, lazy execution means that only two kernels are compiled and executed, one each for the results A and B. Without lazy execution, the individual operations (sin(a), cos(b), plus, and so on) would each have to be dispatched for execution: six dispatches in total (two for the additions and one for each trig operation), which would be much costlier inside a large loop.
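The dispatch counts in the paragraph above can be tallied in a few lines of plain MATLAB. This is only a worked count of the argument in the text, not a measurement, and the loop length is borrowed from the timing examples in this article.

```matlab
% Tally of dispatches for the two-result example (a count, not a benchmark).
nIter        = 10000;   % loop length used in the timing examples
trigOps      = 4;       % one dispatch for each trig operation
addOps       = 2;       % one addition each for A and B
eagerPerIter = trigOps + addOps;   % every operation dispatched on its own
lazyPerIter  = 2;                  % one fused kernel each for A and B

fprintf('eager: %d dispatches per iteration, %d in total\n', ...
        eagerPerIter, eagerPerIter * nIter);
fprintf('lazy:  %d dispatches per iteration, %d in total\n', ...
        lazyPerIter, lazyPerIter * nIter);
```

The gap grows linearly with the number of iterations, which is why the difference matters most in long loops.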

To better illustrate the point, consider the following two code samples:

Code A: Jacket code

gcache flush; % Makes a clean slate. Removes old traces
tic
for i = 1:10000
  p = grand(100); q = grand(100);
  gforce(p,q); % Forces inputs to exist before proceeding
  cp = cos(p);
  sp = sin(p);
  cq = cos(q);
  sq = sin(q);
  A = cp + sq;
  B = cq + sp;
  gforce(A,B); % Makes sure A,B are computed
end
toc

This code compiles just one kernel, which computes A and B, on the first iteration of the loop; the compiled kernel is then reused on every subsequent iteration. On an Intel Xeon W5580 CPU @ 3.20GHz with a Tesla C1060, this code ran in 13.67 seconds.
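To see how the cost is split between the compiling iteration and the rest, each iteration can be timed on its own. The sketch below uses only the Jacket calls that already appear in Code A (gcache, grand, gforce) and assumes that gforce returns only after the computation has finished. The shorter loop length and the variable names nIter, iterTime and t0 are choices made for this illustration.

```matlab
% Sketch: per-iteration timing of the loop body from Code A.
nIter    = 1000;              % hypothetical, shorter than Code A's 10000
iterTime = zeros(1, nIter);   % seconds per iteration, stored on the CPU

gcache flush;                 % start with no cached kernels
for i = 1:nIter
  p = grand(100); q = grand(100);
  gforce(p,q);                % inputs exist before the timer starts
  t0 = tic;
  cp = cos(p);
  sp = sin(p);
  cq = cos(q);
  sq = sin(q);
  A = cp + sq;
  B = cq + sp;
  gforce(A,B);                % assumed to block until A,B are computed
  iterTime(i) = toc(t0);
end

fprintf('first iteration: %.3f s\n', iterTime(1));
fprintf('median of the rest: %.3f ms\n', 1000 * median(iterTime(2:end)));
```

If the kernel is compiled once and then reused, the first entry of iterTime should stand out clearly from the others.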

Code B: Precompiled Kernels, No Lazy Execution

% Simulation of the case where computation is needed at every step
% with precompiled kernels

gcache flush; % Makes a clean slate. Removes old traces

% Precompile all kernels that will be used in the loop
x = sin(grand(100)); gforce(x); % sin kernel compiled
y = cos(grand(100)); gforce(y); % cos kernel compiled
z = x + y; gforce(z); % kernel for a+b compiled
tic
for i = 1:10000
  p = grand(100); q = grand(100);
  gforce(p,q); % Forces inputs to exist
  cp = cos(p); gforce(cp);
  sp = sin(p); gforce(sp);
  cq = cos(q); gforce(cq);
  sq = sin(q); gforce(sq);
  A = cp + sq; gforce(A);
  B = cq + sp; gforce(B);
end
toc

The timed code here uses only precompiled kernels, but dispatches each computation separately. In the same environment as Code A, it takes 22.36 seconds to execute.
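The two reported timings can be put side by side with a little arithmetic. The snippet below is plain MATLAB and uses only the numbers quoted in this article; the averages include the one-time compilation in Code A, so they slightly overstate its steady-state cost.

```matlab
% Reported wall-clock times from the two runs above (seconds).
tLazy  = 13.67;   % Code A: lazy execution
tEager = 22.36;   % Code B: precompiled kernels, forced after every step
nIter  = 10000;   % iterations in both loops

msPerIterLazy  = 1000 * tLazy  / nIter;   % average ms per iteration, Code A
msPerIterEager = 1000 * tEager / nIter;   % average ms per iteration, Code B
ratio          = tEager / tLazy;          % how much longer Code B takes

fprintf('Code A: %.2f ms per iteration\n', msPerIterLazy);
fprintf('Code B: %.2f ms per iteration\n', msPerIterEager);
fprintf('Code B / Code A: %.2fx\n', ratio);
```

On these figures Code B takes roughly 1.6 times as long as Code A, even though it never compiles anything inside the timed loop.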

With this simple example out of the way, a few other aspects of the compile-on-the-fly (lazy execution) mechanism are worth highlighting:

  • In an ideal scenario, the first few iterations of a loop generate all the kernels needed to run the loop, so the compilation cost is amortized over the remaining iterations. Most loops (even complex ones!) satisfy this criterion. For example, in Code A above, once a kernel capable of the (cos + sin) computation exists, everything needed to run the loop is in place by the end of the first iteration.
  • In such a scenario, the first few iterations serve as a ‘warm-up’ for the rest of the loop. For example, timing each iteration in Code A shows that the first iteration takes 0.93 seconds (including compilation), while subsequent iterations take about 1.2 milliseconds each.
  • However, if the loop is poorly designed, excessive compilations can hurt performance. For example, a CPU operand that changes on every iteration (as shown below) triggers a compilation on every iteration.
A = grand(n);
for ii = 1:m
  A = exp(ii) * A;  % BAD: this causes recompile every iteration
end
  • The trick to avoid these excessive compilations is to make “ii” a GPU variable (shown below). The same behavior occurs if a changing CPU variable takes the place of the iterator (e.g. A = rand * A), and the same solution applies.
for ii = gsingle(1:m)
  A = exp(ii) * A;  % GOOD: this avoids recompile
end
  • To keep kernels from growing too big (in terms of memory or the amount of work they do), Jacket compiles them even if their results have not yet been requested.
  • In some loops involving many element-wise arithmetic operations, Jacket may not be able to cache the arithmetic operations efficiently, resulting in excessive recompilation of the loop body. If arithmetic loop performance is slow or staggered, try using gforce in the loop to mark an evaluation breakpoint.
a = grand(1024);
gcache flush;
tic
for i = gsingle(1:300)
  a = a + sin(a)/4;
  a = a + log(a/4);
  b = a ./ exp(a);
  b = (a + b).^3 .* log(a);
  c = (a - 5) .* (a - 3).^2;
  a = c ./ a;
  % Try adding gforce() here.
end
toc

This code segment takes 2.59 s to run as given, but adding the gforce statement at the indicated place brings it down to 1.08 s. Note, however, that putting a gforce statement in every loop will not necessarily speed up the code. In some cases it may worsen performance by forcing execution away from more optimal kernels.
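The per-iteration recompilation caused by a changing CPU operand, described in the list above, can be pictured with a toy kernel cache in plain MATLAB. This is not Jacket's implementation. It rests on one assumption made purely for illustration: a CPU scalar ends up in the kernel source as a literal, while a GPU value is referred to by name and passed in as an argument. The names cache, src, compilesCpu and compilesGpu are hypothetical.

```matlab
% Toy model of a kernel cache keyed by kernel source text.
% ASSUMPTION (illustration only): CPU scalars appear as literals in the
% source, GPU values appear by name. Jacket's real mechanism may differ.
m = 5;
cache = containers.Map('KeyType', 'char', 'ValueType', 'logical');

% Case 1: CPU loop variable, so exp(ii) is a new literal every iteration.
compilesCpu = 0;
for ii = 1:m
    src = sprintf('A = %.17g * A', exp(ii));   % text changes each time
    if ~isKey(cache, src)
        cache(src) = true;                     % stands in for a compilation
        compilesCpu = compilesCpu + 1;
    end
end

% Case 2: GPU loop variable, so the source text never changes.
compilesGpu = 0;
for ii = 1:m
    src = 'A = exp(ii) * A';                   % same text each time
    if ~isKey(cache, src)
        cache(src) = true;                     % compiled once, then reused
        compilesGpu = compilesGpu + 1;
    end
end

fprintf('CPU operand: %d compilations\n', compilesCpu);
fprintf('GPU operand: %d compilations\n', compilesGpu);
```

Under this assumption the CPU-operand loop misses the cache on every iteration, while the GPU-operand loop misses only once, which matches the behavior the BAD and GOOD loops are meant to show.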

The AccelerEyes team continues to work on making Jacket's on-the-fly compilation more efficient, both in finding the most optimal kernels and in speed of compilation.

  • http://www.lyrimages.fr Raphael Attie

    This is a great idea to summarize here the improvements and solutions that rose from the questions asked in the forum. I hope this will keep going, it makes communication between developers and users very vivid and exciting.

  • http://www.melonakos.com melonakos

    Thanks for the kind comment Raphael… we’ll keep posting here and hope that the content helps everyone gain more productivity for your apps.

  • http://blog.accelereyes.com/blog/2010/02/09/109/#respond Keith

    Hi

    On page: http://blog.accelereyes.com/blog/2010/02/09/109/#respond

    In this code:

    Code A: Jacket code

    gcache flush; % Makes a clean slate. Removes old traces
    tic
    for i = 1:10000
    p = grand(100); q = grand(100);
    gforce(p,q); % Forces inputs to exist before proceeding
    p = cos(p); <---- cp = …
    sp = sin(p);
    cq = cos(q);
    sq = sin(q);
    A = cp + sq;
    B = cq + sp;
    gforce(A,B); % Makes sure A,B are computed
    end
    toc

    I believe you need "p = cos(p);" to be "cp = cos(p);"

  • Sandeep

    @ Keith .. Thanks for the note. You’re right – that has to be ‘cp’. Apologies for any confusion caused by the typo.
