mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing
2018-03-22, 02:38   #12
airsquirrels ("David", Jul 2015, Ohio)

Quote:
Originally Posted by Prime95
A 4M FFT needs ~135 MB of bandwidth per iteration. A 4M FFT is 32 MB of data requiring two passes over the data. Thus, 32 MB read + 32 MB write + 32 MB read + 32 MB write + somewhere between 5 MB and 10 MB of read-only sin/cos/weights data.
You’re right - I was only accounting for one pass. I’m not sure how all the constants translate into integer-based calculations, but constants in hardware are significantly faster than constants in software/CUDA.
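For reference, the arithmetic behind that figure (assuming 8-byte double-precision elements; the 5 MB to 10 MB table estimate is Prime95's):

$$4\mathrm{M} \times 8\ \mathrm{B} = 32\ \mathrm{MB}, \qquad 2 \times (32\ \mathrm{MB\ read} + 32\ \mathrm{MB\ write}) + (5\ldots10)\ \mathrm{MB} \approx 135\ \mathrm{MB\ per\ iteration}.$$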
2018-03-22, 03:15   #13
preda ("Mihai Preda", Apr 2015)

I'm personally very interested in how this turns out; I wish you all good luck.

I did try to implement an FGT (fast Galois transform) integer convolution in OpenCL/GCN, but the performance was disappointing and I didn't investigate further. (I'm not sure how much of the blame goes to the compiler for generating poor integer code.)

Just mentioning: a radically different integer approach to convolution is the "Nussbaumer convolution" described in Richard Crandall's "Topics in Advanced Scientific Computation". It requires mostly integer additions: no multiplications and no constant tables. The drawback is its memory access pattern, which has low locality and is a poor fit for CPU or GPU architectures, but it may well be OK for something different like an FPGA.

Another drawback is that the Nussbaumer convolution naturally computes a negacyclic convolution, rather than the cyclic convolution we need for the Mersenne modulus.
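For reference, the distinction: a length-$n$ cyclic convolution wraps indices modulo $n$,

$$c_k = \sum_{i+j \equiv k \,(\mathrm{mod}\, n)} a_i b_j,$$

i.e. multiplication modulo $x^n - 1$, which (after weighting, as in the IBDWT) matches arithmetic mod $2^p - 1$; a negacyclic convolution negates the wrapped terms,

$$c_k = \sum_{i+j = k} a_i b_j \;-\; \sum_{i+j = k+n} a_i b_j,$$

i.e. multiplication modulo $x^n + 1$.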
2018-03-23, 13:34   #14
bsquared ("Ben", Feb 2007)

Discussion on NN prime finding has been moved here.
2018-06-01, 14:26   #15
tServo ("Marv", May 2009, near the Tannhäuser Gate)

Quote:
Originally Posted by jasonp
I will probably get access to a compute capability 3.0+ Nvidia GPU in the near future, and have been toying with the idea of porting the integer FFT convolution framework I've been building off and on over the last few years to CUDA GPUs.

Do consumer GPUs still have their double-precision throughput sharply limited, i.e. do the cards with 1:32 DP throughput vastly outnumber the server cards that do double precision faster? Using integer arithmetic on these cards has a shot at enabling convolutions that are faster than using cufft in double precision, if consumer GPUs have much more headroom for arithmetic than for memory bandwidth. Are current double-precision LL programs bandwidth-limited on consumer GPUs even with the cap on double-precision floating-point throughput?

The latest source is here; it already covers a lot, including optimizations for a range of x86 instruction sets.

Would there be interest in another LL tester for CUDA? I've never worked with OpenCL, so CUDA is what I know.
Jason,
Just wondering if you have started this yet.
I have an interest in what you propose.
2018-06-02, 18:11   #16
jasonp (Tribal Bullet, Oct 2004)

Very basic stuff only; real life is swamping everything for the foreseeable future.
2018-06-02, 18:39   #17
kriesel ("TF79LL86GIMPS96gpu17", Mar 2017, US midwest)

Quote:
Originally Posted by jasonp
The latest source is here; it already covers a lot, including optimizations for a range of x86 instruction sets.

Would there be interest in another LL tester for CUDA? I've never worked with OpenCL, so CUDA is what I know.
Yes, or PRP; with or without the Jacobi check (for LL) or the Gerbicz check (for PRP), respectively, though with is preferred.
On NVIDIA there are already CUDALucas and gpulucas for LL in CUDA, and nothing via OpenCL; there is nothing yet on NVIDIA for PRP in either CUDA or OpenCL. Neither LL program has the Jacobi check, which would be a nice addition.
Preda's contemplating addressing NVIDIA somehow with gpuOwL or a variant; avoiding duplication might be good.
A summary of currently available software is at http://www.mersenneforum.org/showthread.php?t=22450 and http://www.mersenneforum.org/showthread.php?t=23371
2018-06-04, 17:25   #18
jasonp (Tribal Bullet, Oct 2004)

In my mind, the only reason to attempt something like this would be to speed up enough consumer-grade GPUs to measurably increase the throughput of GIMPS. I've been toying with number-theoretic FFTs off and on for almost 20 years now, and it would be nice to find a real niche for them in a project like this.
2018-06-04, 19:45   #19
Prime95 (P90 years forever!, Aug 2002, Yeehaw, FL)

Quote:
Originally Posted by jasonp
In my mind, the only reason to attempt something like this would be to speed up enough consumer-grade GPUs to measurably increase the throughput of GIMPS. I've been toying with number-theoretic FFTs off and on for almost 20 years now, and it would be nice to find a real niche for them in a project like this.
The last time I looked at CUDA, I think about 4 years ago, the problem was that 32x32 multiplies were limited to the DP throughput rate. Shifts were also severely limited. 32-bit adds ran at full speed (as long as no carry propagation was involved). Can your integer FFT algorithms take advantage of this scenario?
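For reference, the multiplies in question are separate instructions at the PTX level, so a full 32x32 -> 64-bit product costs one low-half plus one high-half multiply. A minimal sketch using the standard __umulhi() intrinsic (the function name here is illustrative):

Code:
// Full 32x32 -> 64-bit product assembled from the two halves
// (mul.lo.u32 and mul.hi.u32 at the PTX level).
__device__ unsigned long long mul_wide_u32(unsigned int a, unsigned int b)
{
    unsigned int lo = a * b;          // low 32 bits of the product
    unsigned int hi = __umulhi(a, b); // high 32 bits of the product
    return ((unsigned long long)hi << 32) | lo;
}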
2018-06-05, 12:54   #20
jasonp (Tribal Bullet, Oct 2004)

There's some interesting reading on the subject here.

It's ironic that FFTs using Winograd's prime-factor algorithm were developed in the 70s and 80s to optimally reduce the number of multiplications in FFTs, and suddenly, after 30 years, that really matters for integer FFTs, where multiplies have lower throughput than anything else. A single modular multiply requires one high-half and two low-half integer multiplications when the prime modulus and the underlying arithmetic are chosen carefully.

I don't know if the overhead of an integer FFT using 30-bit prime moduli would swamp the savings in throughput from avoiding double precision floating point; my gut says the integer option would be faster on lower-end cards.
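For concreteness, one well-known scheme with exactly that operation count (though not necessarily the one jasonp means) is Shoup's modular multiplication by a fixed constant with a precomputed quotient. A minimal CUDA sketch, assuming a prime p < 2^31 and a twiddle factor w < p whose quotient wp = floor(w*2^32/p) is precomputed; all names here are illustrative:

Code:
// Returns a*w mod p using one high-half and two low-half 32-bit multiplies.
// Assumes p < 2^31, w < p, and
// wp = (unsigned int)(((unsigned long long)w << 32) / p).
__device__ unsigned int mulmod_shoup(unsigned int a, unsigned int w,
                                     unsigned int wp, unsigned int p)
{
    unsigned int q = __umulhi(a, wp); // high-half multiply: q is within 1 of a*w/p
    unsigned int r = a * w - q * p;   // two low-half multiplies; 0 <= r < 2p
    return (r >= p) ? r - p : r;      // at most one conditional subtraction
}

Because q undershoots the exact quotient by at most 1, the remainder fits in a word after a single conditional subtraction, and the quotient wp is computed once per twiddle and amortized over the whole transform.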
2018-06-05, 15:06   #21
airsquirrels ("David", Jul 2015, Ohio)

I’m still interested in this for FPGAs.

I happen to have ended up with exceptional access to a very large number of high-end Xilinx FPGAs, and I would be more than happy to provide one or two on loan to anyone who has the time, or has a grad student, to implement some GIMPS factoring or LL algorithms on them.
2018-06-05, 15:54   #22
Mysticial (Sep 2016)

Quote:
Originally Posted by jasonp
There's some interesting reading on the subject here.

It's ironic that FFTs using Winograd's prime-factor algorithm were developed in the 70s and 80s to optimally reduce the number of multiplications in FFTs, and suddenly, after 30 years, that really matters for integer FFTs, where multiplies have lower throughput than anything else. A single modular multiply requires one high-half and two low-half integer multiplications when the prime modulus and the underlying arithmetic are chosen carefully.

I don't know if the overhead of an integer FFT using 30-bit prime moduli would swamp the savings in throughput from avoiding double precision floating point; my gut says the integer option would be faster on lower-end cards.
Unfortunately, 30-bit primes won't be enough for LL unless you go Gaussian (i.e. an FGT over the Gaussian integers).

30-bit primes have enough roots of unity to do convolutions of around a billion bits, but you also need roots of two. And since there's no observable correlation between the two, the longest transform length you can do with 30-bit primes will probably only be around 2^16-ish.
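For context, the standard criterion behind that: for $n \mid p-1$,

$$x^n \equiv 2 \,(\mathrm{mod}\, p)\ \text{is solvable} \iff 2^{(p-1)/n} \equiv 1 \,(\mathrm{mod}\, p) \iff \mathrm{ord}_p(2) \mid (p-1)/n,$$

so the usable power-of-two transform length is capped by the power of two in $(p-1)/\mathrm{ord}_p(2)$, and $\mathrm{ord}_p(2)$ behaves essentially randomly from prime to prime.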

More context here with 64-bit primes: http://www.mersenneforum.org/showthr...t=unity&page=4
