NVIDIA has launched its next-generation GPU, based on the Kepler architecture, and followed it up with a rather quick update to the CUDA toolkit. Since we have access to three generations of GTX cards (480, 580, and 680), we thought we would showcase how performance has changed across the generations.

**Matrix Multiplication**:

It can be seen that the GTX 680 comfortably breaks the 1 teraflop mark for single precision, while the GTX 580 barely scratches it. However, the 680's performance peaks around 2048 x 2048 and then falls off to match the GTX 580 at larger sizes. The high-end Tesla C2070 finishes last for single precision, behind the third-placed GTX 480.
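For reference, throughput figures like these are typically derived from the standard operation count for a dense matrix multiply: an n x n GEMM performs 2n³ floating-point operations. A quick sketch of the arithmetic (timings here are illustrative, not our measured numbers):

```python
def gemm_gflops(n, seconds):
    """GFLOPS for an n x n matrix multiply.

    A dense GEMM performs 2 * n**3 floating-point operations
    (n**3 multiplies and n**3 additions).
    """
    return 2.0 * n**3 / seconds / 1e9

# A card sustaining ~1 teraflop finishes a 2048 x 2048
# single-precision multiply in roughly 17 ms:
print(gemm_gflops(2048, 0.0172))  # about 999 GFLOPS
```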

For double precision, the C2070 is, as expected, well ahead of the pack. The most interesting result here is that the GTX 680 finishes *dead last* compared to its predecessors. At about 1/10th of its single-precision performance, *the 680 is about half as fast as the 580*, which settles at ~1/5th of its single-precision performance.

**Fast Fourier Transform**:

The performance gain moving from a 480 to a 580 is significant (~20%), while the 680 does not have huge wins over its immediate predecessor. The Fast Fourier Transform is an interesting benchmark in that these cards run out of memory before peak performance is reached. At 2 GB, the 680 can hold two 8192×8192 single-precision complex matrices, but the scratch space required by the algorithm exceeds the remaining free memory. All transforms were 2D, real-to-complex.
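The memory arithmetic is easy to check: a single-precision complex element takes 8 bytes (two 4-byte floats), so each 8192×8192 matrix occupies 512 MiB, and two of them already consume half of a 2 GB card before any scratch space is allocated:

```python
def complex_matrix_mib(n):
    """Size in MiB of an n x n single-precision complex matrix
    (8 bytes per element: two 4-byte floats).
    """
    return n * n * 8 / 2**20

per_matrix = complex_matrix_mib(8192)
print(per_matrix)      # 512.0 MiB per matrix
print(2 * per_matrix)  # 1024.0 MiB for two matrices, half of a 2 GB card
```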

**SORT**:

Here the GTX 680 starts off strong before losing to the GTX 580, and eventually to the 480. We use the same radix-sort algorithm for all the benchmarks. It is really astonishing that the 680 is more than 20% slower at its peak.
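For readers unfamiliar with the algorithm, a least-significant-digit radix sort makes repeated stable passes over the keys, a few bits at a time. This is a minimal CPU sketch of the idea, not the GPU implementation used in the benchmark:

```python
def radix_sort(keys, bits_per_pass=8, key_bits=32):
    """Least-significant-digit radix sort for unsigned integers.

    Each pass performs a stable bucket sort on one digit of
    bits_per_pass bits, from least to most significant; GPU radix
    sorts follow the same multi-pass structure.
    """
    mask = (1 << bits_per_pass) - 1
    for shift in range(0, key_bits, bits_per_pass):
        buckets = [[] for _ in range(1 << bits_per_pass)]
        for k in keys:
            buckets[(k >> shift) & mask].append(k)
        keys = [k for bucket in buckets for k in bucket]
    return keys

print(radix_sort([170, 45, 75, 90, 802, 24, 2, 66]))
# [2, 24, 45, 66, 75, 90, 170, 802]
```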

Related links:

Tom's Hardware: LuxMark Benchmarks

Anandtech: Retaking the performance crown

{ 8 comments }

Cool! Thank you for checking this out. It’s nice to be reminded occasionally that NVIDIA is first and foremost a company that makes graphics cards for gamers.

Out of curiosity, have you done any benchmarks using the OpenCL versions of your products on the HD 7970?

Stu,

We do not have a 7970 yet. However, we do have a 6950, which we benchmarked and showcased in our webinar (http://blog.accelereyes.com/blog/2012/02/17/opencl_vs_cuda_webinar_recap/). The OpenCL libraries from AMD on the 6950 provided about 750 GFLOPS for BLAS, which is about the same performance as a GTX 480 with NVIDIA's CUBLAS library.

Looking forward to 7970 versus GTX 680 using ArrayFire OpenCL Library.

Hi Pavan.

Great comparison. Finally, the 1 TFLOPS barrier is broken by the 680. Somewhat disappointing performance overall, though. Low-cost 580 cards with 3 GB of memory seem very attractive.

I assume you do not count the PCI transfers. Could you include those as well? In my apps I sometimes need to consider whether it is worth the effort to move the data, compute on the GPU, and return the result to the host. It would be nice to see those compared. The disadvantage is that PCI transfer characteristics (latency, rate) vary quite a lot from computer to computer. Interesting nonetheless. Torben

We are running GTX 680 benchmarks and are disappointed with the performance. However, the great thing about the new Kepler chip is its low power consumption. The ability to put two chips in a single card with the same power draw as a 580 is impressive and very useful. But NVIDIA positioned the 680 as being 50% faster than the 580, and that part is disappointingly not true.

Torben,

The motherboards the cards were on were wildly different (the 680 and the 480 setups are somewhat similar). I think benchmarks across multiple machines would not provide consistent results.

What happens when the matrix size doesn't fit in the on-board GDDR5? How much of a performance hit do you get from the maximum?


