Old 2018-12-23, 10:02   #23
preda
"Mihai Preda"
Apr 2015

Quote:
Originally Posted by ewmayer View Post
My main objective is to simply settle whether SP-based FFT-mul can be used at all for moduli of interest to GIMPS:

o If not, I will need to revisit my random-walk ROE heuristic to see where it falls short;

o If yes, and my heuristics re. FFT-length and modulus size are even close to what I observe in actual practice, an SP-based GPU LL test will be of immediate interest.

But for now, I suggest being pessimistic and assuming that Preda - the only person I know who has actually tried SP for such work - correctly concluded nonfeasibility for such an approach. At least that will be my attitude - expect the worst, but hope for a pleasant surprise.
One [additional] loss of precision may be introduced by the "wiggle factors" (twiddles) of the SP FFT, which are also stored in SP. As log2 of the FFT size approaches the number of SP significant bits (24), the wiggles can no longer provide the required precision.

Remembering vaguely, what I was seeing in my experiments was under 2 bits/word, I think, at an 8M SP FFT. But I am interested in what your experiments show. (And thanks for the formula for usable bits at a given FFT size.)

Another aspect is that the memory footprint of the SP FFT should not be much larger than that of the DP FFT, because otherwise any compute advantage will be bottlenecked by memory.
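For anyone wanting a quick feel for the wiggle-factor point above, here is a crude host-side C sketch. It is not Preda's experiment; the sizes, the sampling stride, and the "adjacent twiddles coincide" criterion are all just illustrative. It rounds the exact DP twiddles to SP and counts how many neighbouring roots of unity become indistinguishable as the FFT size grows toward 2^24.

Code:
/* Count adjacent twiddle factors cos(2*pi*k/N) that round to the same
 * SP value, i.e. the SP grid can no longer tell neighbouring roots of
 * unity apart.  Compile with: cc twiddle.c -lm */
#include <math.h>
#include <stdio.h>

int main(void) {
    const double TWO_PI = 6.283185307179586;
    for (int logn = 20; logn <= 24; logn++) {
        long n = 1L << logn;
        long collapsed = 0, sampled = 0;
        for (long k = 0; k + 1 < n / 8; k += 7) {   /* sample the first octant */
            float c0 = (float)cos(TWO_PI * (double)k       / (double)n);
            float c1 = (float)cos(TWO_PI * (double)(k + 1) / (double)n);
            if (c0 == c1) collapsed++;
            sampled++;
        }
        printf("FFT size 2^%d: %ld of %ld sampled adjacent SP twiddles coincide\n",
               logn, collapsed, sampled);
    }
    return 0;
}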
Old 2018-12-23, 14:20   #24
Neutron3529
Dec 2018
China

Quote:
Originally Posted by Nick View Post
I would say "May the 32:1 SP:DP ratio work!"
or "I hope the 32:1 SP:DP ratio works!".
Thank you :)
I finally know why emoticons were invented.
For someone like me, it is quite difficult to separate two different sentences with question marks and exclamation marks.
Old 2018-12-23, 23:19   #25
ewmayer
2ω=0
Sep 2002
República de California

Quote:
Originally Posted by axn View Post
5x the FFT size means 2.5x the memory usage. There are serious indications on non-DP-crippled GPUs that LL tests are severely memory-bottlenecked. Increasing memory usage by 2.5x would just exacerbate the situation, even if the GPU could theoretically finish the computation sequence faster. On the flip side, this means that smaller FFTs (say < 1M) might benefit more from this, which might be useful for things like LLR, where there are a lot of projects (the Top 5000 entry point is around 1.4 Mbits).
A good point - but my immediate interest is to satisfy myself as to the precision-related (in)feasibility of SP for multi-megadigit modmul. If a carefully coded FFT of my own writing shows 'infeasible', full stop. If it proves feasible, then even if GIMPS wavefront/DC work is too memory-bound on current GPUs to support competitive SP-based FFT modmul, one should never assume that what holds today will hold for future architectures - and, as you note, there might still be a niche where things favor SP.
Old 2018-12-24, 01:00   #26
kladner
"Kieren"
Jul 2011
In My Own Galaxy!

Quote:
Originally Posted by Neutron3529 View Post
Thank you :)
I finally know why emoticons were invented.
For someone like me, it is quite difficult to separate two different sentences with question marks and exclamation marks.
You have expressed yourself, and people have understood. Better still, you have stimulated a great deal of informed discussion, and even coding experiments. Your questions have already contributed a great deal to this environment.
Thanks!
Old 2018-12-27, 09:20   #27
LaurV
Romulan Interpreter
Jun 2011
Thailand

What's wrong with a low-level "driver" that stores one DP in 3 SPs? You use it at the high level the same way as you use the DP stuff. Of course, no assembly optimization or fused stuff is possible at the DP level (besides the optimizations you do at the SP level).

But then you need something like 5 SP additions to "add the DP" (because the parts may not align properly, you can't just add them one by one), plus a few tests, and you need roughly 9 SP multiplications (5 or 6 with Karatsuba or Toom-Cook) to multiply a "DP".

Make it twelve, with all the overhead.

I have posted here in the past, ranting on this subject, saying that on any piece of hardware where one DP operation costs more than 10 or 12 SP operations, native DP may be futile compared with a good implementation of DP-by-SP emulation.
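For concreteness, here is a minimal host-side sketch (plain C) of the standard error-free building blocks such a DP-by-SP "driver" would be built from - Knuth's TwoSum and an FMA-based TwoProd - using the two-float ("double-float") flavour for brevity. As noted above, two floats are not enough for a full 53-bit double (roughly 48 bits of significand), so a real driver would carry a third float and extra renormalization; the names, the simplified add, and the toy main() are illustrative only.

Code:
/* Two-float ("double-float") building blocks: error-free transforms
 * (Knuth TwoSum, FMA-based TwoProd) and a simplified add/mul on top.
 * Compile with: cc df.c -lm */
#include <math.h>
#include <stdio.h>

typedef struct { float hi, lo; } df_t;   /* value = hi + lo */

/* Error-free sum of two floats (Knuth's branch-free TwoSum). */
static df_t two_sum(float a, float b) {
    float s  = a + b;
    float bb = s - a;
    float e  = (a - (s - bb)) + (b - bb);
    return (df_t){ s, e };
}

/* Error-free product of two floats via one fused multiply-add. */
static df_t two_prod(float a, float b) {
    float p = a * b;
    float e = fmaf(a, b, -p);
    return (df_t){ p, e };
}

/* Double-float addition (simplified, branch-free variant; it spends more
 * SP additions than a hand-tuned version would). */
static df_t df_add(df_t x, df_t y) {
    df_t s   = two_sum(x.hi, y.hi);
    float lo = s.lo + x.lo + y.lo;
    return two_sum(s.hi, lo);
}

/* Double-float multiplication: one TwoProd plus a handful of SP mul/add. */
static df_t df_mul(df_t x, df_t y) {
    df_t p   = two_prod(x.hi, y.hi);
    float lo = p.lo + x.hi * y.lo + x.lo * y.hi;
    return two_sum(p.hi, lo);
}

int main(void) {
    df_t a = two_sum(1.0f, 1e-9f);   /* 1 + 1e-9 is not representable in one float */
    df_t b = df_mul(a, a);           /* (1 + 1e-9)^2 ~= 1 + 2e-9 */
    df_t c = df_add(a, a);           /* 2 + 2e-9 */
    printf("mul: hi = %.9g, lo = %.9g\n", (double)b.hi, (double)b.lo);
    printf("add: hi = %.9g, lo = %.9g\n", (double)c.hi, (double)c.lo);
    return 0;
}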

I am thinking about it more and more, since the newest cards (like the 2080 etc.), with 4000 GHz-days/day of TF, have invaded the market - and they are as bad as 1:32 DP:SP...

Old 2018-12-27, 11:19   #28
Neutron3529
Dec 2018
China

Quote:
Originally Posted by LaurV View Post
But then you need something like 5 SP additions to "add the DP" (because the parts may not align properly, you can't just add them one by one), plus a few tests, and you need roughly 9 SP multiplications (5 or 6 with Karatsuba or Toom-Cook) to multiply a "DP".

Make it twelve, with all the overhead.
Maybe someone could implement an NTT (number-theoretic transform).
Such an algorithm keeps the multiplication cost at O(m) rather than O(m²) (here we use m integers, each stored in SP, to represent an integer that would otherwise be stored in a DP or even a long double).
Although the CRT (Chinese Remainder Theorem) reconstruction may take much of the time, it only increases the cost from O(n log(n)) to O(m n log(n)),
which might keep performance high even with a 1:32 DP:SP ratio.
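To make the suggestion concrete, here is a toy single-modulus NTT convolution in plain C. The modulus (998244353, primitive root 3), the base-10 digits, and the tiny transform size are illustrative only; the multi-residue scheme sketched above would run several such transforms with SP-sized moduli and recombine the pointwise products via the CRT.

Code:
/* Toy iterative radix-2 NTT and a small base-10 multiplication. */
#include <stdint.h>
#include <stdio.h>

#define MOD 998244353ULL   /* 119*2^23 + 1, a classic NTT prime; primitive root 3 */

static uint64_t pow_mod(uint64_t b, uint64_t e, uint64_t m) {
    uint64_t r = 1;
    b %= m;
    while (e) {
        if (e & 1) r = r * b % m;
        b = b * b % m;
        e >>= 1;
    }
    return r;
}

/* In-place NTT of length n (a power of two); forward if invert == 0,
 * inverse (including the 1/n scaling) otherwise. */
static void ntt(uint64_t *a, int n, int invert) {
    for (int i = 1, j = 0; i < n; i++) {            /* bit-reversal permutation */
        int bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) { uint64_t t = a[i]; a[i] = a[j]; a[j] = t; }
    }
    for (int len = 2; len <= n; len <<= 1) {
        uint64_t w = pow_mod(3, (MOD - 1) / len, MOD);
        if (invert) w = pow_mod(w, MOD - 2, MOD);   /* inverse root of unity */
        for (int i = 0; i < n; i += len) {
            uint64_t wn = 1;
            for (int k = 0; k < len / 2; k++) {
                uint64_t u = a[i + k];
                uint64_t v = a[i + k + len / 2] * wn % MOD;
                a[i + k]           = (u + v) % MOD;
                a[i + k + len / 2] = (u + MOD - v) % MOD;
                wn = wn * w % MOD;
            }
        }
    }
    if (invert) {
        uint64_t n_inv = pow_mod(n, MOD - 2, MOD);
        for (int i = 0; i < n; i++) a[i] = a[i] * n_inv % MOD;
    }
}

int main(void) {
    /* multiply 1234 * 5678 digit-wise (base 10), little-endian digits */
    uint64_t a[8] = {4, 3, 2, 1, 0, 0, 0, 0};
    uint64_t b[8] = {8, 7, 6, 5, 0, 0, 0, 0};
    ntt(a, 8, 0); ntt(b, 8, 0);
    for (int i = 0; i < 8; i++) a[i] = a[i] * b[i] % MOD;   /* pointwise product */
    ntt(a, 8, 1);
    for (int i = 0; i < 7; i++) { a[i + 1] += a[i] / 10; a[i] %= 10; }  /* carries */
    for (int i = 7; i >= 0; i--) printf("%llu", (unsigned long long)a[i]);
    printf("\n");   /* prints 07006652, i.e. 1234*5678 = 7006652 with a leading zero */
    return 0;
}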
Old 2018-12-27, 14:08   #29
LaurV
Romulan Interpreter
Jun 2011
Thailand

Well, for that you need DP, for carry propagation, etc. My suggestion was to leave all the higher-level code as it is, but instead of using "doubles" for it, you emulate doubles with another type that uses 3 floats to store one double (obviously, two alone are not enough). I still believe that such a contraption will be faster than native DP on a card with a 1/16 or 1/32 ratio. But of course, I have no "proof", and I am not clever enough, free enough, or motivated enough to try implementing it myself.
Old 2018-12-27, 14:18   #30
preda
"Mihai Preda"
Apr 2015

Quote:
Originally Posted by LaurV View Post
Well, for that you need DP, for carry propagation, etc. My suggestion was to leave all the higher-level code as it is, but instead of using "doubles" for it, you emulate doubles with another type that uses 3 floats to store one double (obviously, two alone are not enough). I still believe that such a contraption will be faster than native DP on a card with a 1/16 or 1/32 ratio. But of course, I have no "proof", and I am not clever enough, free enough, or motivated enough to try implementing it myself.
Often the 1/32 ratio sounds scarier than it really is. While the DP instructions themselves are indeed slow, they alone do not determine the total time taken. They are interleaved with memory accesses (extremely slow - slower than the 1/32 DP), some "control" instructions (tests, jumps), and plenty of integer instructions (e.g. for memory address computations).

I'd say that if you take 1/32 DP and turn it into 1/2 DP while keeping everything else unchanged, you'd get maybe a 10% speedup (depending on the particular algorithm, but this would be my estimate for our FFTs).
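One simple way to sanity-check an estimate of this kind, with assumed numbers only (these are not Preda's measurements): under an Amdahl-style model where DP instructions take a fraction f of kernel time, making DP 16x faster (1/32 -> 1/2) gives an overall speedup of 1/((1-f) + f/16), so a ~10% overall gain would correspond to DP work being roughly 10% of the total under this simple model.

Code:
/* Amdahl-style sanity check with assumed numbers: overall speedup when the
 * DP fraction f of the run time is made 16x faster (1/32 -> 1/2 rate). */
#include <stdio.h>

int main(void) {
    for (double f = 0.05; f < 0.41; f += 0.05) {
        double speedup = 1.0 / ((1.0 - f) + f / 16.0);
        printf("DP share %2.0f%% -> overall speedup %.2fx\n", 100.0 * f, speedup);
    }
    return 0;
}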
Old 2018-12-27, 15:16   #31
LaurV
Romulan Interpreter
Jun 2011
Thailand

Quote:
Originally Posted by preda View Post
you'd get maybe a 10% speedup
Well... this is hard to believe, but I am sure you know what you're saying, and if that is the score, then all my theory is bullshit...
Old 2018-12-27, 15:28   #32
retina
Undefined
"The unspeakable one"
Jun 2006
My evil lair

CPUs and GPUs keep getting faster and faster internally. But feeding those things the correct data at the correct time at high speed is really hard. So I would not be too surprised to see only a 10% speed boost from improving just the DP throughput circuits. But I would also not be too surprised to see a really clever programmer squeeze out a 20%-30% improvement by optimising the data flows.

As to whether it is worth all the effort for said programmer ... Perhaps yes, if the right motivations are presented. Perhaps no, when one considers a year's worth of work to optimise things, only for a new batch of hardware to make all that work redundant. Add to that the fact that many GPU manufacturers jealously guard their internal secrets and make it as difficult as possible for anyone to learn how to make their hardware work better.
Old 2018-12-27, 16:22   #33
kriesel
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

Quote:
Originally Posted by retina View Post
CPUs and GPUs keep getting faster and faster internally. But feeding those things the correct data at the correct time at high speed is really hard. So I would not be too surprised to see only a 10% speed boost from improving just the DP throughput circuits. But I would also not be too surprised to see a really clever programmer squeeze out a 20%-30% improvement by optimising the data flows.

As to whether it is worth all the effort for said programmer ... Perhaps yes, if the right motivations are presented. Perhaps no, when one considers a year's worth of work to optimise things, only for a new batch of hardware to make all that work redundant. Add to that the fact that many GPU manufacturers jealously guard their internal secrets and make it as difficult as possible for anyone to learn how to make their hardware work better.
Quite so. There's an axiom in engineering: a well-designed system is uniformly weak. Computing devices get designed for a spectrum of typical workloads for a mass market. Crunching 100-million-bit primes is not typical work. The device budget for a given die size, process, yield, etc. is allocated in an attempt to do the most good overall.
You raise a good point with regard to using SP vs. DP. Using more than two SP values to represent one DP value imposes a data storage and transfer penalty. RAM capacity is not an issue for primality testing on any GPU fast enough for the attempt, but it is for P-1 factoring. On the CPU side, George has periodically stated that various chips are memory-bottlenecked in prime95, to the extent that he explores code alternatives that accomplish the same thing in more CPU clock cycles but fewer memory fetches.
The programmer's work to refine the software is unlikely to be wasted if he completes it. A new hardware model takes time to get established in the fleet, and existing hardware lasts for years. clLucas was not a waste, any more than mfakto or gpuowl's early versions were. It enabled LL testing on AMD GPUs and was a step on the way to what we have now: choices.
Hmm, how about some quad precision implementation in the hardware?
