Active Contours are mathematical models that enable detection of objects within images, and are extensively used in Computer Vision as self-adapting frameworks for the delineation and tracking of objects.
To demonstrate Jacket’s cross-platform versatility, we implemented the Chan Vese contour tracking app on Android. The video can be viewed here.
Today, however, we’d like to use a MATLAB implementation of active contours as an example of how to take a large project, and with minimal changes, achieve speedups with Jacket.
We’ll dangle the proverbial carrot first: the GPU Chan-Vese implementation contains only three kinds of changes overall, and the computational code is exactly the same for both CPU and GPU versions. Plus, take a look at the speed-ups below! How did we accomplish that?
0. Identify the bottleneck
As with any project, the biggest question is where to begin. Fortunately, MATLAB’s Profiler will give you a good sense of where the bottlenecks are:
From the profile, we have our target: the function kappa() that consumes about 60% of the time.
We now formulate a set of objectives to ensure proper Jacketization:
- Change CPU code as necessary to ensure optimized GPU performance. Target bottleneck first.
- Ensure that all large intermediate variables are generated on the GPU
- Minimize operations between mixed CPU and GPU types (scalars and small-sized variables are okay)
- Keep the computational chunk of code exactly the same between CPU and GPU runs.
1. Change CPU code
Some investigation reveals that in kappa.m:
- The same matrix, P, is indexed in many different patterns (an operation called “subsref“)
- Some index patterns repeat themselves.
While straight-up subsrefs are generally fast, the GPU starts to slow-down when it’s asked to make “uncoalesced” memory accesses. This most often happens during an indexing operation like in kappa.m. Other examples include conditional indexing, such as A(A<1).
Jacket has many built-in workarounds for commonly-used subsrefs where the GPU is slow, but a piece of code with lesser indexing operations might win over code that has more.
Here is the original kappa.m:
fy = P(3:end,2:n+1)-P(1:m,2:n+1); fx = P(2:m+1,3:end)-P(2:m+1,1:n); fyy = P(3:end,2:n+1)+P(1:m,2:n+1)-2*I; fxx = P(2:m+1,3:end)+P(2:m+1,1:n)-2*I; fxy = 0.25.*(P(3:end,3:end)-P(1:m,3:end)+P(3:end,1:n)-P(1:m,1:n));
And here’s the modified version:
subP1 = P(3:end,2:n+1); subP2 = P(1:m,2:n+1); subP3 = P(2:m+1,3:end); subP4 = P(2:m+1,1:n); subP5 = P(3:end,3:end); subP6 = P(1:m,3:end); subP7 = P(3:end,1:n); subP8 = P(1:m,1:n); fy = subP1-subP2; fx = subP3-subP4; fyy = subP1+subP2-2*I; fxx = subP3+subP4-2*I; fxy = 0.25.*(subP5-subP6+subP7-subP8);
2. Generate intermediate variables on GPU
Having changed our M code, we now need to ensure that all large helper variables are generated on the GPU. This is where the CLASS construct comes in handy.
When using CLASS, the focus is on keeping the functional part of the code identical for both CPU and GPU. To accomplish that, we need to follow the control flow through the code and change variables as necessary.
At a very high level, the flow proceeds as shown:
- Generate inputs
- Call chenvese() on those inputs
- Generate masks if necessary (maskcircle2)
- Call heaviside(), kappa()
- Plot images
Some careful inspection of the code will show that the helper functions (maskcircle2, heaviside, kappa) all contain helper variables that are hard-coded to be CPU variables. You can make them change their data-types according to the primary input, as follows:
In function heaviside():
is modified to follow the data type of that function’s primary input, z:
3. Minimize operations between mixed types
The original M-code contained lots of hard-coded casts to uint8. For example, in function chenvese():
P = rgb2gray(uint8(I));
Here, I is the input image. You need P to be of type uint8 or guint8, depending on whether I is CPU or GPU respectively. How do you make this statement follow the data-type of I?
This is accomplished using a simple function, smartcast(), with the syntax:
op = smartcast(ip, type)
op = smartcast(ip, 'uint8')
will mean that OP is a guint8 or uint8, depending on whether IP is GPU or CPU, respectively.
You can download smartcast.m here.
And that’s it! Lather, rinse, repeat.
4. Keep CPU and GPU code same
Our fourth objective is automatically achieved if we follow the steps outlined above!
This code was run on two different CPUs with a Tesla C2070 and Jacket 1.8rc (downloadable here). It gave a 12x speedup on a 1500×1200 image.
You can download the Jacketized versions here. Run demochanvesecpu for the CPU code, and demochanvesegpu for the GPU version (they both call the same functions). You can also search the M-files for the word “accelereyes” for the points where the original code was modified.
The original M-code implementation of Chan-Vese active contours was written by Yue Wu at Tufts University. Yue maintains a Google page on Image Processing research, where the original, unmodified code can be downloaded. We would like to thank Yue for his efforts and wish him all the best in his research endeavors.
Chan, T. F., & Vese, L. A. Active contours without edges. IEEE TRANSACTIONS ON IMAGE PROCESSING, VOL. 10, NO. 2, FEBRUARY 2001