Speeding up critical code

by James Malcolm on October 5, 2010

in CUDA

With Jacket 1.5, we released a big new feature:  GCOMPILE. This allows you to convert critical sections of your MATLAB code directly into GPU kernels to further increase speed.  In an earlier post we introduced the prototype and have been working with several beta users over the past month to get it ready.  In this post, we’ll give some more details and start to look at the speedups you can quickly and easily achieve.  You can find more information about it on the wiki.

One of the best features of GCOMPILE is that you can now use IF statements, WHILE loops, and FOR loops in your code.  Make sure to check out the wiki pages about these and the other features; we’d love to get feedback.

For the most trivial of examples, GCOMPILE probably won’t offer much speedup.  However, the more complicated the arithmetic, the better the improvement that can be seen.

My personal favorite way to use GCOMPILE is with the verbatim script, so the code I’m working on sits right next to the GCOMPILE statement.  Verbatim is a neat script that returns, as a string, the text of the block comment immediately below the line that calls it.  It is not a creation of AccelerEyes; it comes from the Verbatim page on MathWorks Central.  We’ve also included Verbatim in the Jacket installation for your convenience.

For the timings in this article, I’m using a Core i7 2.67 GHz machine running Fedora (64 bit) with 6GB of RAM.  The GPU is an NVIDIA GTX 260 with 896MB of VRAM. Each compiled kernel we discuss is being called on 1024×1024 matrices of single-precision floats. To get non-trivial times, we repeat this in a for-loop 10,000 times. For example:

A = grand(1024,1024,'single');
fn = gcompile(verbatim);
%{
  ...
%}
tic
for i = 1:10000
  fn(A);
end
gcompile_time = toc
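
For comparison, the native MATLAB numbers quoted in this post can be measured with the same kind of loop on the CPU. The following is only a sketch of the idea, not the exact benchmark script used here: rand stands in for grand, the expression is the trivial kernel discussed next, and n_iter is a hypothetical name set to a small value so the snippet finishes quickly (the timings in this post use 10,000 iterations).

```matlab
% CPU baseline sketch: plain MATLAB, no Jacket required.
n = 1024;
a = rand(n, n, 'single');
b = rand(n, n, 'single');
c = rand(n, n, 'single');
d = rand(n, n, 'single');

n_iter = 100;   % hypothetical; the post uses 10000

tic
for i = 1:n_iter
  y = a + b .* c + d;   % same arithmetic as the trivial kernel
end
cpu_time = toc;

fprintf('CPU: %d iterations in %.2f s\n', n_iter, cpu_time);
```

Scaling cpu_time by 10000 / n_iter gives a rough figure to set against the GPU timings, assuming the per-iteration cost stays constant.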

Let’s look at a trivial kernel:

fn = gcompile(verbatim);
%{
  function y = addup(a, b, c, d)
    y = a + b .* c + d;
  end
%}

As you can see, this kernel doesn’t do much.  Running it 10,000 times with native MATLAB took 116.42 seconds. Running it with Jacket’s JIT yielded a vastly better 2.53 seconds. GCOMPILE got us a run time of 2.25 seconds.  Better, but still not that much to be excited about, huh?

Let’s consider the following function now:

fn2 = gcompile(verbatim);
%{
  function [e, f] = addup(a, b, c, d)
    e = a + b .* c + d;
    f = a .* b - c .* d;
  end
%}

This function calculates two outputs instead of just one.  Let’s compare it to the same two lines run as ordinary Jacket code through the regular JIT:

  e = a + b .* c + d;
  f = a .* b - c .* d;

The MATLAB native version took a total of 240.09 seconds, enough time to go grab a cup of coffee! The Jacket JIT took a much better 4.49 seconds. GCOMPILE beat that though, taking only 2.74 seconds, which makes it about 64% faster than the JIT (4.49 / 2.74 ≈ 1.64). Computing the second variable added only 0.49 seconds with GCOMPILE, whereas with Jacket’s regular JIT it added 1.96 seconds. As another test we computed 3 values instead of 2. With that, we had a MATLAB run time of 348.24 seconds, a JIT run time of 7.02 seconds, and a GCOMPILE time of 3.34 seconds.  GCOMPILE took less than half as much time as the JIT, and was over 100 times faster than native MATLAB.
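
The comparisons above are easy to check from the quoted timings. This small plain-MATLAB sketch does the arithmetic; the variable names are mine, and the numbers are just the ones reported in this post.

```matlab
% Timings quoted in this post, in seconds, for 1, 2 and 3 outputs.
matlab_t   = [116.42 240.09 348.24];
jit_t      = [  2.53   4.49   7.02];
gcompile_t = [  2.25   2.74   3.34];

% Extra time for each additional output.
jit_added      = diff(jit_t);
gcompile_added = diff(gcompile_t);

% How many times faster GCOMPILE is than the JIT and than native MATLAB.
vs_jit    = jit_t ./ gcompile_t;
vs_matlab = matlab_t ./ gcompile_t;

fprintf('outputs  vs JIT  vs MATLAB\n');
fprintf('%7d  %6.2f  %9.1f\n', [1:3; vs_jit; vs_matlab]);
```

The gap over the JIT widens with each added output, which is the trend the rest of this post relies on.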

Why?  When performing a computation on the GPU, there’s a certain amount of overhead for each kernel execution. GCOMPILE can pack multiple computations into a single kernel, but the current version of Jacket’s standard JIT still has to perform these computations as independent kernels; we’re working on this :).  In the example above, the JIT ran one kernel to compute e and another to compute f, so computing the pair took two kernel executions, whereas GCOMPILE needed only one.  The more values you can compute in a single kernel execution, the less of that per-kernel overhead you pay, and the faster your code will run!
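
To put a rough number on that overhead, the timings from the 1, 2 and 3 output tests can be turned into an average cost per extra output per call. This is a back-of-the-envelope sketch in plain MATLAB using only the figures reported in this post. The result is not pure launch overhead: it also includes the arithmetic and memory traffic for the extra output, and how that splits up is not something measured here.

```matlab
n_iter     = 10000;              % iterations used for the timings in this post
jit_t      = [2.53 4.49 7.02];   % seconds for 1, 2, 3 outputs (Jacket JIT)
gcompile_t = [2.25 2.74 3.34];   % seconds for 1, 2, 3 outputs (GCOMPILE)

% Average cost of one extra output per call, in microseconds.
jit_us      = mean(diff(jit_t))      / n_iter * 1e6;
gcompile_us = mean(diff(gcompile_t)) / n_iter * 1e6;

fprintf('JIT:      about %.1f us per extra output per call\n', jit_us);
fprintf('GCOMPILE: about %.1f us per extra output per call\n', gcompile_us);
```

On these numbers an extra output costs roughly 0.2 ms per call as a separate JIT kernel and roughly 0.05 ms when folded into the same compiled kernel.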

Now, let’s look at a slightly more realistic example.  In the demos folder of Jacket, we’ve distributed an example called FDTD_Example.  In there is a script called fdtd_gpu.  This version uses Jacket’s JIT and offers an incredible speed-up over running on the CPU.  How much faster can we make this go using GCOMPILE?

Looking through, the first arithmetic we come across is:

  ey = ga.*(dy - iy - sy);
  iy = iy + gb.*ey;

Putting this in a function ought to be easy!

function [oey, oiy] = main(ga, dy, sy, iy, gb, ey)
  oey = ga .* (dy - iy - sy);
  oiy = iy + gb .* oey;
end

And now we adjust that function so we can GCOMPILE it:

update_ey_iy = gcompile(verbatim);
%{
  function [oey, oiy] = main(ga, dy, sy, iy, gb, ey)
    oey = ga .* (dy - iy - sy);
    oiy = iy + gb .* oey;
  end
%}

Where the code was being called before, we change it to:

[ey, iy] = update_ey_iy(ga, dy, sy, iy, gb, ey);

That wasn’t so bad!  What else can we change?

dy_hat_temp = gi3r.*dy_hat + gi2r.*0.5.*(dyhz - dxhx);
dy = gj3r.*dy + gj2r.*(dy_hat_temp - dy_hat);
dy_hat = dy_hat_temp;

We change this to a function now:

update_d_hat = gcompile(verbatim);
%{
  function [dy_hat_, dy_] = main(gi3r, dy_hat, gi2r, dyhz, dxhx, gj3r, dy, gj2r)
    tmp = gi3r.*dy_hat + gi2r.*0.5.*(dyhz - dxhx);
    dy_ = gj3r.*dy + gj2r.*(tmp - dy_hat);
    dy_hat_ = tmp;
  end
%}

And replace the original code with a call to that function:

[dy_hat, dy] = update_d_hat(gi3r, dy_hat, gi2r, dyhz, dxhx, gj3r, dy, gj2r);

Lastly, we convert this bit to a function:

ihz = ihz + gj1r.*curl_ez_vec;
hz = fi3r.*hz + fi2r.*(0.5.*curl_ez_vec + ihz);

ihx = ihx + gi1r.*curl_ey_vec;
hx = fj3r.*hx + fj2r.*(0.5.*curl_ey_vec + ihx);

It becomes:

update_hz_hx = gcompile(verbatim);
%{
  function [ihz_, hz_, ihx_, hx_] = main(ihz, hz, gj1r, curl_ez_vec, fi3r, fi2r, ...
                                         ihx, hx, gi1r, curl_ey_vec, fj3r, fj2r)
    ihz = ihz + gj1r .* curl_ez_vec;
    hz = fi3r .* hz + fi2r .* (0.5 .* curl_ez_vec + ihz);
    ihz_ = ihz; hz_ = hz;
    ihx = ihx + gi1r .* curl_ey_vec;
    hx = fj3r .* hx + fj2r .* (0.5 .* curl_ey_vec + ihx);
    ihx_ = ihx; hx_ = hx;
  end
%}

And the call becomes:

[ihz, hz, ihx, hx] = update_hz_hx(ihz, hz, gj1r, curl_ez_vec, fi3r, fi2r, ...
                                  ihx, hx, gi1r, curl_ey_vec, fj3r, fj2r);

One thing to be careful about:  We definitely don’t want to call GCOMPILE every iteration!  This would result in a new kernel being compiled every time we execute the loop, and that would DEFINITELY be a performance killer.
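
Here is a minimal sketch of the safe pattern, assuming Jacket 1.5 with gcompile, verbatim and grand available as used throughout this post: compile once before the loop, then only call the returned handle inside it. The array size, the random inputs and the name n_steps are placeholders for illustration, not values taken from fdtd_gpu.

```matlab
% Requires Jacket (gcompile, verbatim, grand); will not run in plain MATLAB.
n  = 1024;                    % placeholder size
ga = grand(n, n, 'single');   % placeholder inputs
dy = grand(n, n, 'single');
sy = grand(n, n, 'single');
iy = grand(n, n, 'single');
gb = grand(n, n, 'single');
ey = grand(n, n, 'single');

% Compile ONCE, outside the loop.
update_ey_iy = gcompile(verbatim);
%{
  function [oey, oiy] = main(ga, dy, sy, iy, gb, ey)
    oey = ga .* (dy - iy - sy);
    oiy = iy + gb .* oey;
  end
%}

n_steps = 1000;   % hypothetical number of time steps
for t = 1:n_steps
  % Only the compiled handle is called here; no gcompile inside the loop.
  [ey, iy] = update_ey_iy(ga, dy, sy, iy, gb, ey);
end
```

The same applies to update_d_hat and update_hz_hx: create all three handles during setup and keep the time-stepping loop free of gcompile calls.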

Now we fire it up.  In the original CPU example, my test machine ran at approximately 28 fps. With Jacket’s JIT version, this improved to 160 fps.  Using GCOMPILE to rewrite those critical sections now pushes it to 220 fps.  That’s an improvement of 38% over the JIT and nearly 8 times the CPU frame rate (a 686% improvement)!

We hope this shows you how powerful GCOMPILE is, and we’re still working to push more and more features into it. Check it out and tell us what you think!
