mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Hardware > GPU Computing

Reply
 
Thread Tools
Old 2011-05-11, 20:27   #12
diep
 
diep's Avatar
 
Sep 2006
The Netherlands

3×269 Posts
Default

Quote:
Originally Posted by fivemack View Post
I spent a little while thinking about using CRT in this kind of context; the problem is that the individual primes need to be ==1 modulo the size of the transform, which makes them too big to handle with MMX-like techniques if the transform is of at all reasonable size. And the restriction gets worse if you also require the primes to have an appropriate root of two for doing the DWT.

(with all the 16-bit primes ==1 mod 256 you only get to store about 50k digits)

(with all the 24-bit primes ==1 mod 2^16 you could store 16 million digits, which isn't quite enough)

32-bit primes ==1 mod 2^24 are maybe more interesting, there are 24 of them (so the effective digit size is about 10^220), and probably a mulmodp on such a prime can be done slightly faster than for a general p. Doing lots and lots of parallel FFTs modulo different fixed small primes feels quite fitting for a GPU, but I'm not competent to quantify 'feels quite fitting'.
A message from the helpdesk reached my mailbox. I have explained it as good news. So 5 months after i bought a GPU i can now know that in theory the AMD gpu can multiply 24 x 24 bits == 48 bits within 2 instructions.

The least significant 32 bits are in a single unsigned integer and the top 16 bits idemdito.

That eats 2 cycles in short.

Now i hope they provide, as some AMD engineer proposed, a solution to add this to OpenCL, if that request is still in time for the next APP SDK. That might mean that APP SDK 2.5 also provides access to the top16 bits.

Right now only a slower instruction the 32x32 bits top 32 bits are there.

So it would be possible, garantuee until the door, to study a transform for the AMD gpu in opencl using 24 bits multiplications.

Now i must admit i didn't study this very well yet, i'm sure TheJudger did do that a lot better as he's using this for TF as well and the 72 bits is his fastest code there.

Adding Carry isn't so easy in AMD. Also it seems to me that using multiply-add is rather complicated as the low bits gives a 32 bits result rathre than 24 bits, so it will be impossible probably to get a throughput
of above 880Mhz * 1536 = 1.35 Tflop

With 24 bits multiplications it's possible however to emulate bigger numbers than 64 bits i'd argue. Up to 72 bits maybe is possible. Note that would still be rather slow 72 bits numbers. 48 bits would go a lot faster i'd guess.

This will require puzzling!

Of course in contradiction to 32 bits math where overflow to 33 bits is not possible, with all this 24 bits junk, in reality it's in an integer that has 8 bits left, so overflow is nearly impossible. So the full 48 or 72 bits can get used.

Karatsuba type shifting operations also more easy then.

Or maybe even 96 bits. 4 x 24 bits.

If we multiply 96 x 96 bits using 24 bits at a time, we need 9 multiplications using Karatsuba.

I'll generate a bunch of primes close to those limits.
diep is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
ivy bridge versus haswell diep Hardware 29 2017-12-06 13:43
Freedom of Information versus the Right to Privacy Brian-E Soap Box 10 2014-07-07 17:59
CUDALucas versus Mfaktc/o Brain GPU Computing 26 2011-12-06 08:48
Head versus tail R.D. Silverman Lounge 9 2008-12-16 14:28
Pfactor versus Pminus1 GP2 Marin's Mersenne-aries 4 2003-09-30 02:52

All times are UTC. The time now is 15:15.


Fri Jul 7 15:15:13 UTC 2023 up 323 days, 12:43, 0 users, load averages: 1.45, 1.29, 1.18

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.

≠ ± ∓ ÷ × · − √ ‰ ⊗ ⊕ ⊖ ⊘ ⊙ ≤ ≥ ≦ ≧ ≨ ≩ ≺ ≻ ≼ ≽ ⊏ ⊐ ⊑ ⊒ ² ³ °
∠ ∟ ° ≅ ~ ‖ ⟂ ⫛
≡ ≜ ≈ ∝ ∞ ≪ ≫ ⌊⌋ ⌈⌉ ∘ ∏ ∐ ∑ ∧ ∨ ∩ ∪ ⨀ ⊕ ⊗ 𝖕 𝖖 𝖗 ⊲ ⊳
∅ ∖ ∁ ↦ ↣ ∩ ∪ ⊆ ⊂ ⊄ ⊊ ⊇ ⊃ ⊅ ⊋ ⊖ ∈ ∉ ∋ ∌ ℕ ℤ ℚ ℝ ℂ ℵ ℶ ℷ ℸ 𝓟
¬ ∨ ∧ ⊕ → ← ⇒ ⇐ ⇔ ∀ ∃ ∄ ∴ ∵ ⊤ ⊥ ⊢ ⊨ ⫤ ⊣ … ⋯ ⋮ ⋰ ⋱
∫ ∬ ∭ ∮ ∯ ∰ ∇ ∆ δ ∂ ℱ ℒ ℓ
𝛢𝛼 𝛣𝛽 𝛤𝛾 𝛥𝛿 𝛦𝜀𝜖 𝛧𝜁 𝛨𝜂 𝛩𝜃𝜗 𝛪𝜄 𝛫𝜅 𝛬𝜆 𝛭𝜇 𝛮𝜈 𝛯𝜉 𝛰𝜊 𝛱𝜋 𝛲𝜌 𝛴𝜎𝜍 𝛵𝜏 𝛶𝜐 𝛷𝜙𝜑 𝛸𝜒 𝛹𝜓 𝛺𝜔