<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: A case study in CUDA optimization</title>
	<atom:link href="http://blog.accelereyes.com/blog/2010/02/20/a-case-study-in-cuda-optimization/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.accelereyes.com/blog/2010/02/20/a-case-study-in-cuda-optimization/</link>
	<description>Helpful posts about GPU computing. Discussion of Jacket and ArrayFire. Real speedups on real code!</description>
	<lastBuildDate>Wed, 02 Feb 2011 15:32:00 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.1</generator>
	<item>
		<title>By: GPU MATLAB Computing &#187; Median Filtering: CUDA tips and tricks</title>
		<link>http://blog.accelereyes.com/blog/2010/02/20/a-case-study-in-cuda-optimization/comment-page-1/#comment-158</link>
		<dc:creator>GPU MATLAB Computing &#187; Median Filtering: CUDA tips and tricks</dc:creator>
		<pubDate>Fri, 05 Mar 2010 00:11:00 +0000</pubDate>
		<guid isPermaLink="false">http://www.accelereyes.com/blog/?p=167#comment-158</guid>
		<description>[...] week we posted a video recording from NVIDIA&#8217;s GTC09 conference. In the video, I walked through median filtering, presenting [...]</description>
		<content:encoded><![CDATA[<p>[...] week we posted a video recording from NVIDIA&#8217;s GTC09 conference. In the video, I walked through median filtering, presenting [...]</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: malcolm</title>
		<link>http://blog.accelereyes.com/blog/2010/02/20/a-case-study-in-cuda-optimization/comment-page-1/#comment-157</link>
		<dc:creator>malcolm</dc:creator>
		<pubDate>Fri, 05 Mar 2010 00:06:17 +0000</pubDate>
		<guid isPermaLink="false">http://www.accelereyes.com/blog/?p=167#comment-157</guid>
		<description>It&#039;s easy to force nvcc to use at most 9 registers, but then it&#039;ll just spill more into lmem (local memory). The only way to stop it from spilling is to change the algorithm so that it demands fewer registers.

But I like your thinking, so I put together another set of experiments, some new and some old. I got a little carried away, so I wrote them up in a &lt;a href=&quot;http://www.accelereyes.com/blog/2010/03/04/median-filtering/&quot; rel=&quot;nofollow&quot;&gt;new post&lt;/a&gt;.

Thanks for the suggestion!
  -jm</description>
		<content:encoded><![CDATA[<p>It&#8217;s easy to force nvcc to use at most 9 registers, but then it&#8217;ll just spill more into lmem (local memory). The only way to stop it from spilling is to change the algorithm so that it demands fewer registers.</p>
<p>But I like your thinking, so I put together another set of experiments, some new and some old. I got a little carried away, so I wrote them up in a <a href="http://www.accelereyes.com/blog/2010/03/04/median-filtering/" rel="nofollow">new post</a>.</p>
<p>Thanks for the suggestion!<br />
  -jm</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Rohit</title>
		<link>http://blog.accelereyes.com/blog/2010/02/20/a-case-study-in-cuda-optimization/comment-page-1/#comment-110</link>
		<dc:creator>Rohit</dc:creator>
		<pubDate>Wed, 24 Feb 2010 02:01:56 +0000</pubDate>
		<guid isPermaLink="false">http://www.accelereyes.com/blog/?p=167#comment-110</guid>
		<description>Great tutorial! I wish it had been available 1.5 years ago, when I was evaluating motion estimation algorithms on CUDA.

One thing missing from the evaluation is the effect of forcing nvcc to use 9 registers, so that it cannot spill registers to memory, while still using bubble sort. It would be interesting to see whether that improves or degrades performance.</description>
		<content:encoded><![CDATA[<p>Great tutorial! I wish it had been available 1.5 years ago, when I was evaluating motion estimation algorithms on CUDA.</p>
<p>One thing missing from the evaluation is the effect of forcing nvcc to use 9 registers, so that it cannot spill registers to memory, while still using bubble sort. It would be interesting to see whether that improves or degrades performance.</p>
]]></content:encoded>
	</item>
</channel>
</rss>

