mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Old 2011-07-26, 03:21   #12
davieddy
"Lucan", Dec 2006, England

Quote:
Originally Posted by Christenson View Post
I'm a programmer, and those "off by one's" are killers....I meant 7.

It works like this:
GPU TF on hot card =~100 X (call it 128X) faster than on CPU.

So that means if GPU does TF only and CPU does LL, GPU should do 7 more bit levels than before.
BS Eric

I said 64x faster.
It could TF 75-76 in the same time as the CPU takes for 69-70.
Equally, it could TF 70-75 in the same time as the CPU takes for 69-70.

"Off by 3" more like.
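The doubling arithmetic behind both claims can be checked in a few lines of Python (a sketch only; it uses the 69-70 bit level as the unit of cost, as the posts do, and the `tf_cost` helper is made up for illustration):

```python
# Trial factoring one bit level deeper doubles the cost, so the cost of
# level b (i.e. TF from bit b to b+1) is 2^(b-69) units, taking the
# 69->70 level as 1 unit.
def tf_cost(lo, hi):
    """Total cost of TF from bit `lo` up to bit `hi`, in 69->70 units."""
    return sum(2 ** (b - 69) for b in range(lo, hi))

print(tf_cost(75, 76))  # 64 -- one level, at 64x the base cost
print(tf_cost(70, 75))  # 62 -- five levels, still ~64x the base in total
```

Both quantities come out at (roughly) 64 units, matching the "64x faster" claim.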

Just admit it: you "thought" (if that word is appropriate here)
that the sqrt of 64 was 8.

David

Edit: If you are suggesting that the GPU is going to spend
as long TFing one exponent as the CPU takes for an LL test (and it
sounds like you are), it could TF from 70-81.
Off by 3 again.

Last fiddled with by davieddy on 2011-07-26 at 03:45
Old 2011-07-26, 06:06   #13
davieddy
"Lucan", Dec 2006, England

OK, no need to rub it in but...

...I'm going to anyway.

Note the first line:
Quote:
Originally Posted by davieddy View Post
Please follow this carefully:

If a CPU were to TF from 70 to 73 bits, it would take 14 times as long
as the TF from 69 to 70 (which, as I said, is ~1.5% of the
time for an LL test). Those 3 extra bits eliminate 4% of LL tests.

The good GPU can do these three bits in 14 × 1.5% / 100 ≈ 0.2% of the time it
takes a CPU to do an LL test.

Say we have 1 GPU for every 500 LL-testing CPUs.
Should we eliminate 4% of LL tests, or use the GPU as 2 more
LL testers (0.4% increase in LL firepower)?

Now will you tell me where you got eight extra bits from???

David
Will you at least concede that TFing 3 more bits with a GPU gains 10 times
more than using it for LL?
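The arithmetic in that quote can be laid out as a short Python sketch (the numbers are the ones asserted in the post, not measurements, and the variable names are invented for illustration):

```python
# davieddy's cost/benefit figures: TF 69->70 costs ~1.5% of a CPU LL
# test, each extra bit level doubles the cost, and the GPU is taken to
# be ~100x faster than the CPU at TF.
LL_FRACTION_69_70 = 0.015   # TF 69->70, as a fraction of one CPU LL test
GPU_SPEEDUP = 100

# TF 70->73 = (2 + 4 + 8) x the 69->70 cost = 14x.
cost_cpu = 14 * LL_FRACTION_69_70      # ~21% of a CPU LL test
cost_gpu = cost_cpu / GPU_SPEEDUP      # ~0.2% of a CPU LL test

# Each bit level k removes roughly 1/k of surviving candidates, so three
# levels near 70 bits eliminate ~4% of LL tests.
eliminated = sum(1 / k for k in range(71, 74))

print(f"GPU cost: {cost_gpu:.2%} of an LL test")
print(f"LL tests avoided: {eliminated:.1%}")
```

The ratio of benefit (≈4%) to cost (≈0.2% per GPU, or ≈0.4% of extra LL throughput) is the "10 times more" in the post.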

Last fiddled with by davieddy on 2011-07-26 at 06:09
Old 2011-07-26, 06:25   #14
davieddy
"Lucan", Dec 2006, England

Quote:
Originally Posted by Rodrigo View Post
Christenson,

The other day I saw a comparative estimate of TF throughput with a more high-end card running in the 75 GHz-days/day range, where the LL production came out to like 11 GHz-days/day. Now for this GT 430 it looks like the respective figures would be 30-40 and 5-6.

This suggests about a 7:1 ratio of GHz-days/day productivity for TF work vs. LL work.

What accounts for the difference? Is it that the GPU code for TF has been honed to a higher degree, or is it something inherent to each type of work that allows for more GIMPS output from TF through a graphics card?

Rodrigo
I think the basic answer is parallelism:
Testing one potential factor is a separate operation
from testing another.

David
Old 2011-07-26, 06:41   #15
Rodrigo
Jun 2010, Pennsylvania

Quote:
Originally Posted by davieddy View Post
I think the basic answer is parallelism:
Testing one potential factor is a separate operation
from testing another.

David


Would you be so kind as to elaborate? I'm a words guy rather than a numbers guy. Imagine you were writing "The Complete Idiot's Guide to GPU Computing."

I'll be more specific. What you said about testing one potential factor vs. testing another sounds self-evident, but I don't see what in particular it has to do with the GHz-day credits one gets for doing TF vs. LL, on a graphics card.

Simply trying to understand this process (GPU computing). Normally I prefer to have a working idea of how a new (to me) process works before I start on it, but it's entirely possible I won't really "get" it till I actually do it.

Rodrigo
Old 2011-07-26, 07:58   #16
davieddy
"Lucan", Dec 2006, England

I am conjecturing in the dark, and would also like elucidation.

This is simply a reasonable guess:
Suppose GPUs have lots of rudimentary processors which can test
lots of factors simultaneously.
LL iterations have to be performed in order (serially), so they can't take
advantage of this.
Up to a point, I think chunks of the FFT can be done simultaneously
(in parallel) giving GPUs some edge over CPUs.

That's enough wild guessing, and I am more than happy to be shot down!

(You were right to direct that at Christenson, but there are many here
who could give a good simple answer)

David
Old 2011-07-26, 15:36   #17
Rodrigo
Jun 2010, Pennsylvania

Ahh, that makes a LOT of sense, thank you.

Let's see if anybody else confirms it.

Rodrigo
Old 2011-07-26, 20:00   #18
kjaget
Jun 2005

Right now there are also other details slowing down CUDALucas GPU LL tests. Unlike CPU tests, the GPU libraries are really slow at non-power-of-2 FFTs. So where a CPU can use, say, a 2560K FFT length for current first-time LL tests, a GPU is most efficient doing the same test with a 4096K FFT length.

I'm not sure if this is an inherent problem with GPUs or just a limitation of the current CUDA libraries. If you could get the same FFT length to run efficiently on both CPU and GPU, and FFT run times scaled similarly on both platforms, you would get a ~1.6x improvement in GPU LL throughput.

Also remember that TF on a GPU currently uses not just the GPU, but the GPU plus around 2 CPU cores. LL testing on a GPU uses basically no CPU time.

Neither of these makes the case for LL testing on a GPU even in theory, but they do move the ratio a bit closer. The latter real-life issue may change the calculation of how deep to factor efficiently using GPUs; or maybe not, since you can't factor to a fraction of a bit level, and the difference may be too small to change the end result.
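The ~1.6x figure follows directly from the two FFT lengths quoted above; a one-line check (a sketch, with the lengths in units of K points as in the post):

```python
# A CPU can run a 2560K-point FFT efficiently; the GPU libraries are
# only fast at the next power of two above it, 4096K, so the GPU does
# 4096/2560 = 1.6x the transform size per iteration.
cpu_fft = 2560   # K points (non-power-of-2, fine on a CPU)
gpu_fft = 4096   # K points (next power of 2 above 2560)

print(gpu_fft / cpu_fft)   # 1.6
```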

Last fiddled with by kjaget on 2011-07-26 at 20:01
Old 2011-07-26, 21:08   #19
Rodrigo
Jun 2010, Pennsylvania

kjaget,

Very informative, thanks!

The ~1.6x improvement jibes with the figures I've seen. But learning that LL testing on a GPU uses little CPU time is intriguing -- supposing that you have a dual-core PC, does that mean that you could do LL on the GPU and still have both CPU cores available for Prime95?

Rodrigo
Old 2011-07-26, 21:44   #20
Christenson
Dec 2010, Monticello

Quote:
Originally Posted by Rodrigo View Post
Ahh, that makes a LOT of sense, thank you.

Let's see if anybody else confirms it.

Rodrigo
I'll concur, as long as Davieddy remembers that the log base 2 of 128 is 7. I hadn't even begun to think in terms of the population of available machines.

OK, the TF algorithm goes like this:
Sieve, on the CPU, for reasonable, probably-prime factor candidates. (I *dream* of having this happen on the GPU instead.)

Take each of those reasonable candidates, hand it to a GPU core, and have it compute result = 2^exponent mod candidate.
If result == 1, report a factor found.

The GPU can handle hundreds of these candidates in parallel; as someone said, "TF is embarrassingly parallel".
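The per-candidate step can be sketched in Python, with the built-in three-argument `pow()` standing in for the modular exponentiation a GPU core would do (the `is_factor` name is made up for illustration):

```python
# A candidate q divides 2^p - 1 exactly when 2^p mod q == 1.
# pow(2, p, q) does the modular exponentiation efficiently.
def is_factor(exponent, candidate):
    return pow(2, exponent, candidate) == 1

# Known small example: 2^11 - 1 = 2047 = 23 * 89
print(is_factor(11, 23))   # True
print(is_factor(11, 29))   # False
```

Each call is independent of every other, which is exactly why TF maps so well onto hundreds of GPU cores at once.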

The LL test goes something like this:

Start with 4
Repeat (exponent - 2) times
{
Square your number (take its FFT, multiply it term-wise by itself, take the inverse FFT)
Subtract 2
Reduce it modulo 2^exponent - 1
}
If the result is zero, then 2^exponent - 1 is prime
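That loop translates almost line-for-line into Python for small exponents (a toy sketch with an invented function name; real implementations do the squaring with the FFT described above):

```python
# Lucas-Lehmer test: s_0 = 4, s_{k+1} = s_k^2 - 2 (mod 2^p - 1);
# M_p = 2^p - 1 is prime iff s_{p-2} == 0.
def lucas_lehmer(p):
    """True if M_p = 2^p - 1 is prime, for an odd prime p."""
    m = (1 << p) - 1
    s = 4
    for _ in range(p - 2):
        s = (s * s - 2) % m
    return s == 0

print([p for p in (3, 5, 7, 11, 13) if lucas_lehmer(p)])  # [3, 5, 7, 13]
```

Note that each iteration needs the previous one's result, which is the serial bottleneck Remark 1 below is about.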

Remark 1: The LL test for a given exponent is inherently a serial process.
Remark 2: The Fourier transform itself can be made fairly block-parallel, so the GPU cores are probably kept busy pretty constantly. Specifically, in the classical power-of-2 FFT, at, say, the 4th level, two adjacent blocks of 16 numbers are combined to make one block of 32 transformed numbers. Such a process shouldn't be difficult to get multiple threads working on together.
Remark 3: There are a *lot* of operations in an LL test. Suppose our exponent is 40M. We have
40M iterations of
2 × 40M × log2(40M) operations in the FFTs

Letting log2(40M) be 25, we have:
N = 40M × 40M × 50 arithmetic ops to do.
With a hot GPU, at 700M ops/sec × 500 cores, this will take
40M × 40M × 50 / (700M × 500) ≈ 4 × 40M / 700 = 160M/700 ≈ 230K secs
≈ 3 days

So this isn't TOO far off....
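Remark 3's back-of-envelope estimate, redone in Python with the same assumed figures (the rates and core count are the post's guesses, not measurements):

```python
# ~p iterations, each costing ~2 * n * log2(n) FFT operations with
# n = p transform points, divided over the assumed GPU throughput.
p = 40_000_000            # exponent, ~40M
ops = p * 2 * p * 25      # log2(40M) taken as ~25, as in the post
gpu_rate = 700e6 * 500    # 700M ops/sec per core x 500 cores (assumed)

seconds = ops / gpu_rate
print(f"{seconds / 86400:.1f} days")   # ~2.6 days
```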
Old 2011-07-26, 21:47   #21
Christenson
Dec 2010, Monticello

Quote:
Originally Posted by Rodrigo View Post
kjaget,

Very informative, thanks!

The ~1.6x improvement jibes with the figures I've seen. But learning that LL testing on a GPU uses little CPU time is intriguing -- supposing that you have a dual-core PC, does that mean that you could do LL on the GPU and still have both CPU cores available for Prime95?

Rodrigo
Indeed, the LL on the GPU leaves the CPU core (or cores) largely available for whatever you want to point them at... thus my *dreaming* in the previous post.
Old 2011-07-26, 23:25   #22
LiquidNitrogen
Jun 2011, Henlopen Acres, Delaware

Quote:
Originally Posted by Christenson View Post
I'm a programmer, and those "off by one's" are killers....
They happen often enough in the programming world that they get their own nickname: fencepost errors.

http://en.wikipedia.org/wiki/Off-by-one_error