mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing
2010-11-06, 01:22   #1
ixfd64 ("Danny", Dec 2002, California)

Nvidia's next-generation graphics cards

A lot of people were disappointed with Nvidia's GTX 480 video card, one of the reasons being that it has a much lower peak FLOPS number than the HD 5870. Of course, the FLOPS number alone does not determine the performance of a graphics card. However, the GeForce 400 series runs double precision at only 12.5% of the single-precision rate, even though the Fermi architecture was designed to allow 50%. For the purpose of finding prime numbers, this isn't very useful.

Anyways, Nvidia recently released a rough roadmap that mentions two future architectures: Kepler and Maxwell. More information here: http://www.tomshardware.com/news/fer...att,11339.html

Also, the GeForce 500 series was announced about two weeks ago: http://www.tomshardware.com/news/gpu...gtx,11529.html

Some people say that the GeForce 500 series is just a "tweaked" version of GeForce 400, while others are saying that it will be a brand new design that will offer as many as 768 CUDA cores.

What does everyone make of this?
2010-11-06, 05:50   #2
msft (Jul 2009, Tokyo)

Hi ixfd64,

Frmky reported 4.43 ms/iter for a 2M FFT on a GTX 480.
TheJudger reported 4.3 ms/iter for a 2M FFT on a Tesla C2050.

I think LLR performance depends on memory bandwidth.
2010-11-06, 11:00   #3
fivemack (Feb 2006, Cambridge, England)

I expect the graphics cards sold to the gaming market to continue to have double precision crippled: the cost of getting decent CUDA code written is still high enough that people only do it when they have a serious need, and organisations with a serious need are relatively budget-insensitive. AMD offers double precision in nothing but its highest-end graphics cards (the 58xx series), and AMD's attempts to sell something like the Tesla boards have had no perceptible success.

So the cards will continue to get faster, but you'll always have to buy a $300 card to get the double precision, even if the $120 card outperforms the last generation's $300 card in most other metrics.
2010-11-06, 12:46   #4
TheJudger ("Oliver", Mar 2005, Germany)

Quote:
Originally Posted by msft
Hi ixfd64,

Frmky reported 4.43 ms/iter for a 2M FFT on a GTX 480.
TheJudger reported 4.3 ms/iter for a 2M FFT on a Tesla C2050.

I think LLR performance depends on memory bandwidth.
Yep, memory bandwidth is a problem in many cases, not just LLR.

A simplified calculation:
A GTX 480 has 480 cores (enabled), all running at 1401 MHz, so it is capable of 168 GFLOPS in double precision (DP): 480 (cores) × 1401 MHz × 2 / 8 = 168.12 GFLOPS.
Those FLOPS are multiply-adds. Each independent multiply-add needs 3 inputs and produces 1 output, and each DP float is 64 bits / 8 bytes, so each multiply-add needs 3× 8 bytes read and 1× 8 bytes written => 32 bytes of bandwidth per multiply-add.
168.12 GFLOPS is about 84 billion multiply-adds per second; × 32 bytes, that is roughly 2.7 TB/sec of bandwidth needed. A GTX 480 has 177.4 GB/sec to device memory...
A paper by Vasily Volkov mentions that a GTX 480 has ~1.3 TB/sec of bandwidth to on-chip shared memory. The shared memory is ~1 MB in total (split into smaller pieces of L1 cache and shared memory per multiprocessor). Register bandwidth seems to be at least 8 TB/sec, but all registers together are only ~2 MB.
None of the on-chip registers, shared memory, or L2 cache is big enough to hold the whole dataset of an LLR test, so we really do need the bandwidth of the device memory.
Of course this is a very simplified picture. In reality you can hide some of the traffic to the slow device memory... but device memory bandwidth is a limitation.
So now imagine why a Tesla 20x0 doesn't perform much better than a GTX 480: the C2050 has 515 GFLOPS DP but only 144 GB/sec of device memory bandwidth...
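That back-of-the-envelope can be written out as a small C sketch. It uses the figures from the post and states its counting assumptions explicitly: a multiply-add is 2 FLOPs, and each one streams 3 reads + 1 write of 8-byte doubles (32 bytes) from device memory.

```c
/* Peak DP throughput: cores × clock (MHz) × 2 FLOPs per multiply-add,
   divided by the DP rate divisor (8 on GeForce Fermi cards). */
double dp_gflops(double cores, double mhz, double dp_divisor) {
    return cores * mhz * 2.0 / dp_divisor / 1e3;
}

/* Device-memory traffic (GB/sec) needed to feed those FLOPS when each
   multiply-add (2 FLOPs) streams 3 reads + 1 write of 8-byte doubles. */
double needed_gb_per_sec(double gflops) {
    return gflops / 2.0 * 32.0;
}
```

With GTX 480 numbers, dp_gflops(480, 1401, 8) gives 168.12 GFLOPS, and feeding that would take needed_gb_per_sec(168.12) ≈ 2690 GB/sec against the card's 177.4 GB/sec of device memory bandwidth — more than an order of magnitude short however you count it.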


Oliver
2010-11-06, 20:27   #5
ET_ ("Luigi", Aug 2002, Team Italia)

I will repost here an interesting link from this thread...

http://www.gpgpgpu.com/gecco2009/6.pdf

Luigi
2010-11-07, 10:47   #6
msft (Jul 2009, Tokyo)

Hi TheJudger,
Quote:
Originally Posted by TheJudger
Yep, memory bandwidth is a problem in many cases, not just LLR.
Linpack (TOP500) is a gift: matrix multiplication does not depend on memory bandwidth.
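The point can be made quantitative: matrix multiply does O(n³) FLOPs on O(n²) data, so with blocking its FLOPs-per-byte ratio grows with the matrix size, whereas an LLR iteration streams its whole FFT dataset every pass. A small sketch, assuming square n×n double matrices:

```c
/* FLOPs per byte of matrix data for an n×n double matrix multiply:
   2·n³ FLOPs (n³ multiply-adds) over 3·n²·8 bytes (A, B and C),
   which simplifies to n/12 — it grows with the problem size. */
double matmul_flops_per_byte(double n) {
    double flops = 2.0 * n * n * n;     /* n³ multiply-adds, 2 FLOPs each */
    double bytes = 3.0 * n * n * 8.0;   /* three n×n matrices of doubles  */
    return flops / bytes;
}
```

At n = 1200 that is already 100 FLOPs per byte of matrix data touched, so a blocked implementation is limited by the GPU's FLOPS rather than its memory bus; an FFT-based LLR iteration has no such data reuse to exploit.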
2010-11-09, 16:27   #7
Ken_g6 (Jan 2005, Caught in a sieve)

Quote:
Originally Posted by ET_
I will repost here an interesting link from this thread...

http://www.gpgpgpu.com/gecco2009/6.pdf

Luigi
FYI, that looks like it's testing primes < 2^32! There's no FFT involved there, and the FFT is what makes things much harder for LLR.
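At those sizes the whole divisibility test fits in native 64-bit integers, which is exactly why no FFT is needed. A hypothetical sketch (not the paper's code) of the core trial-factoring test — q divides 2^p − 1 iff 2^p ≡ 1 (mod q):

```c
#include <stdint.h>

/* Compute 2^p mod q by left-to-right-agnostic binary exponentiation.
   Safe for q < 2^32, since every intermediate product then stays
   below 2^64 — no multi-precision arithmetic, no FFT. */
uint64_t pow2_mod(uint64_t p, uint64_t q) {
    uint64_t result = 1, base = 2 % q;
    while (p > 0) {
        if (p & 1)
            result = result * base % q;   /* fold in this bit of p */
        base = base * base % q;           /* square for the next bit */
        p >>= 1;
    }
    return result;
}
```

For example, pow2_mod(11, 23) returns 1, matching the classic factor 23 of 2^11 − 1. Once q grows past 2^32, the squarings overflow 64 bits and the arithmetic gets much more expensive — and for LLR the numbers themselves are millions of bits, hence the FFT.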
2010-11-09, 16:59   #8
ET_ ("Luigi", Aug 2002, Team Italia)

Quote:
Originally Posted by Ken_g6
FYI, that looks like it's testing primes < 2^32! There's no FFT involved there, and the FFT is what makes things much harder for LLR.
You are right, but I'm not wrong!

I was considering uses of the GPU for other tasks, apart from LLR, in a thread dedicated to nVidia GPUs...

Luigi
2010-11-09, 18:02   #9
ewmayer (Sep 2002, República de California)

I am preparing to port some Mersenne TF code to a Tesla GPU over the coming months ... a basic question about the FP arithmetic coding and optimization: should one target HLL code to the GPU's FMADD-based ISA, e.g. by refactoring one's C code to provide "hints" to the compiler, or are there more direct ways of doing so, e.g. by way of compiler intrinsics or GPU-specific inline ASM?
(I have an SSE2 inline-ASM version of the key factoring modpow loop, but I expect the GPU analog of same will look rather different.)

Sorry if this has been answered elsewhere - I'm just at the point where I'm starting to gather the various GPU-specific documentation and figuring out which is likely to be the most relevant to my work.

Thanks,
-Ernst
2010-11-09, 19:39   #10
delta_t (Nov 2002, Anchorage, AK)

GTX580 review

Saw this up on AnandTech:
http://www.anandtech.com/show/4008/n...eforce-gtx-580
2010-11-09, 22:17   #11
Karl M Johnson (Mar 2010)

I was very worried that the GTX 580 would turn out to be sm_21 sh!t, but luckily it ain't!
Woohoo!

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.