Benchmarking the new Kepler (GTX 680)

by pavan on April 26, 2012

in Benchmarks, CUDA

NVIDIA has launched its next-generation GPU, based on the Kepler architecture, and followed it up with a rather quick update to the CUDA toolkit. Since we have access to three generations of GTX cards (480, 580 and 680), we thought we would showcase how performance has changed over the generations.

Matrix multiplication:

One teraflop!

It can be seen that the GTX 680 breaches the 1 teraflop mark comfortably for single precision, while the GTX 580 barely scratches it. However, the 680's performance peaks around 2048 x 2048 and then declines to match the GTX 580 at larger sizes. The high-end Tesla C2070 finishes last for single precision, behind the third-placed GTX 480.

For double precision, as expected, the C2070 is well ahead of the pack. The most interesting result here is that the GTX 680 finishes dead last, behind its predecessors. At about 1/10th of its single-precision performance, the 680 is about half as fast as the 580, which settles at about 1/5th of its single-precision performance.
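For readers who want to reproduce figures like these, here is a minimal Python sketch of the arithmetic. It assumes the conventional count of 2n³ floating-point operations for an n x n matrix multiply; the function names are hypothetical and are not taken from the benchmark code.

```python
def gemm_gflops(n: int, seconds: float) -> float:
    """Throughput of an n x n by n x n matrix multiply, in GFLOPS.

    Assumption: 2 * n**3 floating-point operations per multiply
    (one multiply and one add per inner-product term).
    """
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return 2.0 * n**3 / seconds / 1e9


def max_seconds_for(n: int, target_gflops: float) -> float:
    """Longest run time that still reaches target_gflops for size n."""
    return 2.0 * n**3 / (target_gflops * 1e9)


def precision_ratio(double_gflops: float, single_gflops: float) -> float:
    """Double-precision throughput as a fraction of single precision."""
    return double_gflops / single_gflops


if __name__ == "__main__":
    # Time budget for a 2048 x 2048 multiply at 1 teraflop (1000 GFLOPS).
    budget = max_seconds_for(2048, 1000.0)
    print(f"2048 x 2048 at 1 TFLOPS: {budget * 1e3:.2f} ms")
```

By this count, a 2048 x 2048 multiply has to finish in roughly 17 ms to reach 1 teraflop.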

Fast Fourier Transform:

Fast fast aww no more memory!

The performance gain moving from the 480 to the 580 is significant (~20%), while the 680 does not show a large win over its immediate predecessor. The Fast Fourier Transform is an interesting benchmark in that these cards run out of memory before peak performance is reached. At 2GB, the 680 can hold two 8192×8192 single-precision complex matrices, but the scratch space required for this algorithm is more than the free space that remains. All the transforms were 2D real-to-complex transforms.
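The memory arithmetic behind that statement can be sketched in a few lines of Python. This assumes 4 bytes per single-precision real value and treats 2GB as 2 GiB; the function name is hypothetical, and the sketch says nothing about how much scratch space the FFT library actually requests.

```python
MIB = 1024**2
GIB = 1024**3


def complex_matrix_bytes(rows: int, cols: int, bytes_per_real: int = 4) -> int:
    """Storage for a dense complex matrix (real + imaginary part per element)."""
    return rows * cols * 2 * bytes_per_real


one_matrix = complex_matrix_bytes(8192, 8192)  # a single 8192 x 8192 matrix
two_matrices = 2 * one_matrix                  # the two matrices from the text
card_memory = 2 * GIB                          # assumption: 2GB means 2 GiB
headroom = card_memory - two_matrices          # upper bound on what is left

if __name__ == "__main__":
    print(f"one matrix:   {one_matrix // MIB} MiB")
    print(f"two matrices: {two_matrices // MIB} MiB")
    print(f"headroom:     {headroom // MIB} MiB")
```

The headroom is an upper bound: anything else resident on the card reduces it further, and the FFT scratch space has to fit in what remains.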

Sort:

Here the GTX 680 starts off strong before losing out to the GTX 580, and eventually to the 480. We use the same radix-sort algorithm for all the benchmarks. It is really astonishing that the 680 is more than 20% slower at peak.
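For readers unfamiliar with radix sort, the following is a minimal CPU-side sketch in Python of the least-significant-digit variant. It only illustrates the general technique (histogram, prefix sum and stable scatter for each digit); it is not the GPU implementation used in the benchmarks, and the function name and parameters are assumptions.

```python
def radix_sort(keys, bits_per_pass=8, key_bits=32):
    """LSD radix sort for unsigned integers smaller than 2**key_bits."""
    radix = 1 << bits_per_pass
    mask = radix - 1
    keys = list(keys)

    for shift in range(0, key_bits, bits_per_pass):
        # Histogram: count how many keys have each value of the current digit.
        counts = [0] * radix
        for k in keys:
            counts[(k >> shift) & mask] += 1

        # Exclusive prefix sum: first output slot for each digit value.
        offsets = [0] * radix
        total = 0
        for d in range(radix):
            offsets[d] = total
            total += counts[d]

        # Stable scatter: keys with equal digits keep their relative order.
        out = [0] * len(keys)
        for k in keys:
            d = (k >> shift) & mask
            out[offsets[d]] = k
            offsets[d] += 1
        keys = out

    return keys


if __name__ == "__main__":
    print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
```

Each pass is a histogram followed by a prefix sum and a scatter, and the scatter writes to locations that depend on the data.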

Resources:

Benchmark Code

Benchmark Results

Related links:

Tom's Hardware: LuxMark Benchmarks

Anandtech: Retaking the performance crown

{ 8 comments }

Stu Blair April 26, 2012 at 3:20 pm

Cool!  Thank you for checking this out.  It’s nice to be reminded occasionally that NVIDIA is first and foremost a company that makes graphics cards for gamers.

Out of curiosity, have you done any benchmarks using the OpenCL versions of your products on the HD 7970?

pavan April 30, 2012 at 2:50 pm

Stu,

We do not have a 7970 yet. However, we do have a 6950, which we benchmarked and showcased in our webinar (http://blog.accelereyes.com/blog/2012/02/17/opencl_vs_cuda_webinar_recap/). The OpenCL libraries from AMD on the 6950 provided about 750 GFLOPS for BLAS, which is about the same performance as a GTX 480 with NVIDIA's cuBLAS library.

MySchizoBuddy May 1, 2012 at 3:42 am

Looking forward to 7970 versus GTX 680 using ArrayFire OpenCL Library.

Torben Larsen May 2, 2012 at 1:07 am

Hi Pavan.

Great comparison. Finally the 1 TFLOPS barrier is broken by the 680. Somewhat disappointing performance as such, though. Low-cost 580 cards with 3 GB of memory seem very attractive.

I assume you do not count the PCI transfers. Could you include those as well? In my apps I sometimes need to consider whether it is worth the effort to move the data, compute on the GPU and return the result to the host. It would be nice to see them compared. The disadvantage is that PCI transfers (latency, rate) vary quite a lot from computer to computer. Interesting it is though.

Torben

robocod May 6, 2012 at 1:13 pm

We are running GTX 680 benchmarks and are disappointed with the performance. However, the great thing about the new Kepler chip is its low power consumption. Being able to put two chips on a single card with the same power draw as a 580 is impressive and very useful. But NVIDIA was positioning the 680 as being 50% faster than the 580, and that part is disappointingly not true.

Pavan Yalamanchili May 9, 2012 at 6:43 am

Torben,

The motherboards the cards were on were wildly different (those for the 680 and the 480 are a bit similar). I think benchmarks across multiple machines would not provide consistent results.

MySchizoBuddy May 10, 2012 at 1:31 pm

What happens when the matrix size doesn't fit in the on-board GDDR5? How much performance hit do you get from the maximum
