#1
diep
Sep 2006
The Netherlands
3×269 Posts
hi,
I have been looking at the transform I implemented some years ago, with the intention of studying how it would do on a GPU. What I had already noticed is that the example calculation of a 42M number is, by accident, very well suited to the Yap transform: comparing with the single LL iteration step times as posted in thall.pdf, at 2048k it is lucky that 42M just fits inside the transform, as it can store roughly 21 bits per limb. So the possible solution times here are not so interesting from an objective viewpoint, as that wins a factor of 2 somewhere.

For a single Lucas-Lehmer squaring and rebalancing at 1024k, the table lists 2.74 ms for CUDALucas and 2.2 ms for gpulucas. This means that in total we have 1024k (= 2^20) * 20 * 2 unit (butterfly) calculations to perform: a length-2^20 transform has 20 levels, done once forward and once inverse.

If I project this onto the quickmul from Yap which I implemented, I have 2 versions. Version 1 is the original version, which does 2 multiplications per unit. In version 2 I managed to get rid of 1 multiplication, resulting in 1 multiplication plus a prefetched table, of the same size as the transform image, that stores the omegas. This table gets precalculated. On a CPU this gives a considerable speedup, as the RAM can keep up with the speed at which the transform gets calculated.

Code:
Version 1 (original):

void qm_DFT_No_Omega_Caching(struct YapDataStruct *qm, unsigned int flaginverse) {
  // if flaginverse is set then we do an inverse DFT here
  // this is the old function without omega caching, kept to benchmark the difference
  unsigned int s;
  for( s = 1; s <= qm->log2nlimbs; s++ ) { // the current level of "imaginary" recursion
    uint64 omega_m, k;
    uint64 m = (uint64)1 << s, jmax = (uint64)1 << (s-1);
    // omega_m = primitive m-th root of unity mod p (the inverse root for the inverse DFT)
    if( flaginverse )
      omega_m = qh_omega_inverse(m);
    else
      omega_m = qh_omega(m);
    for( k = 0; k < qm->nlimbs; k += m ) {
      uint64 omega = 1, j;
      for( j = 0; j < jmax; j++ ) {
        // butterfly ("unit") starts here: 2 multiplications per unit
        uint64 t = mulmod(omega, qm->buf[k+j+jmax]),
               u = qm->buf[k+j];
        qm->buf[k+j]      = plusmod(u, t);
        qm->buf[k+j+jmax] = minusmod(u, t);
        omega = mulmod(omega, omega_m);   // second multiplication: step the twiddle factor
      }
    }
  }
}
Code:
Version 2 (omega caching):

void qm_DFT(struct YapDataStruct *qm, unsigned int flaginverse) {
  // if flaginverse is set then we do an inverse DFT here
  unsigned int s;
  for( s = 1; s <= qm->log2nlimbs; s++ ) { // the current level of "imaginary" recursion
    uint64 *omegastartptr, k;
    uint64 m = (uint64)1 << s, jmax = (uint64)1 << (s-1);
    // precalculated omega table for this level (and this direction)
    omegastartptr = qm->omega_ptr[flaginverse][s-1];
    for( k = 0; k < qm->nlimbs; k += m ) {
      uint64 j, *loop_omega = omegastartptr;
      for( j = 0; j < jmax; j++ ) {
        // butterfly with a single multiplication: omega comes from the prefetched table
        uint64 omega = *loop_omega++,
               t = mulmod(omega, qm->buf[k+j+jmax]),
               u = qm->buf[k+j];
        qm->buf[k+j]      = plusmod(u, t);
        qm->buf[k+j+jmax] = minusmod(u, t);
      }
    }
  }
}
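(The helpers mulmod, plusmod and minusmod are not shown above. A minimal sketch of what they can look like — my reconstruction for illustration, not the original code, assuming the prime p = 2^64 - 2^32 + 1 mentioned below and a compiler with unsigned __int128 — exploits the identities 2^64 == 2^32 - 1 and 2^96 == -1 (mod p), which is why the reduction is cheap in 32-bit math:)

Code:
#include <stdint.h>

#define GOLD_P 0xFFFFFFFF00000001ULL      /* p = 2^64 - 2^32 + 1 */

/* reduce a 128-bit product mod p using 2^64 == 2^32-1 and 2^96 == -1 (mod p) */
static uint64_t reduce128(unsigned __int128 t) {
    uint64_t lo  = (uint64_t)t;           /* bits   0..63  */
    uint64_t hi0 = (uint32_t)(t >> 64);   /* bits  64..95  */
    uint64_t hi1 = (uint64_t)(t >> 96);   /* bits  96..127 */

    uint64_t r = lo - hi1;                /* t == lo + hi0*2^64 + hi1*2^96 */
    if (lo < hi1) r -= 0xFFFFFFFFULL;     /* borrow: the wrap added 2^64 == 2^32-1 */

    uint64_t mid = (hi0 << 32) - hi0;     /* hi0 * (2^32 - 1) */
    r += mid;
    if (r < mid) r += 0xFFFFFFFFULL;      /* carry: the wrap dropped 2^64 == 2^32-1 */

    return r >= GOLD_P ? r - GOLD_P : r;
}

static uint64_t mulmod(uint64_t a, uint64_t b) {
    return reduce128((unsigned __int128)a * b);
}

static uint64_t plusmod(uint64_t a, uint64_t b) {  /* inputs < p */
    uint64_t r = a + b;
    if (r < a || r >= GOLD_P) r -= GOLD_P;         /* r < a detects 64-bit wrap */
    return r;
}

static uint64_t minusmod(uint64_t a, uint64_t b) { /* inputs < p */
    return a >= b ? a - b : a + (GOLD_P - b);
}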
A single GPU (Radeon HD 6970) theoretically delivers 2.7 Tflops single precision, based upon multiply-add. Without the multiply-add it's around 1536 PEs * 0.88 GHz = 1.351 Tops (integer). So, purely theoretically speaking, prefetching 8 bytes of omega costs the equivalent of 8 * 1351 / 170 ≈ 63 instructions (dividing by the roughly 170 GB/s memory bandwidth). Very expensive, in short.

If I calculate with the gpulucas/CUDALucas time of 2.2 ms, then on paper, on a single AMD GPU (note the 6990 is a double-GPU card), one can lose at most about 70 instructions per unit on all this. We didn't yet count the 3n operations (n being 1024k) needed to reorder the bytes in this transform, which is another huge bottleneck that I skip for now (which is not wise, I bet).

The prime I used on the x64 hardware is the one I later saw advised in Crandall & Pomerance, "Prime Numbers": p = 2^64 - 2^32 + 1. Taking the modulo of p is very fast in 32-bit math. AMD is very slow at emulating 64 x 64 bit multiplications delivering 128 bits: it needs 4 PEs for a single 32 x 32 bit calculation, and that times 2 (for the high-order and the low-order bits), so the virtual price is 8 cycles for one 32 x 32 == 64 bits calculation, and we didn't add the carries yet. For all the calculations of a mulmod, a plusmod and a minusmod together, we have just 70 cycles. Very little! The big crunching power of this GPU is in 32-bit code: it can multiply at most 16 x 16 bits == 32 bits within 1 cycle (actually doing a multiply-add as well within 1 cycle).

Maybe there are people in this group who know a way to involve CRT such that with a bunch of small primes we can build a big transform? Using the 64-bit prime is simply very slow, thanks to integer multiplication on the AMD GPU being slow. With lots of toying it might be possible to get the 64-bit code emulated in 32 bits on a GPU at around the same speed as CUDALucas, but that isn't much progress, if I may say so. Wasting a virtual 50 instructions or so just on a single multiplication-with-modulo of 64 x 64 bits == 128 bits is really expensive. Any thoughts?

p.s. Should one get the cycle count under 70, it does mean of course that you have a lossless FFT in integers which stores more bits per limb, namely roughly 21 or so (depending upon transform size); yet there is a 3n you have to add for each Lucas-Lehmer squaring, which is the reordering of the array, and that isn't fast on a GPU either.

Last fiddled with by diep on 2011-05-09 at 11:04
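(To make that emulation cost concrete, here is the shape of the work a 32-bit device has to do for one 64 x 64 bits == 128 bits product: four 32 x 32 == 64 partial products plus the carry bookkeeping. A sketch in plain C, purely illustrative, not GPU code:)

Code:
#include <stdint.h>

/* schoolbook 64x64 -> 128: four 32x32 -> 64 partial products plus carries */
static void mul64to128(uint64_t a, uint64_t b, uint64_t *hi, uint64_t *lo) {
    uint64_t a0 = (uint32_t)a, a1 = a >> 32;
    uint64_t b0 = (uint32_t)b, b1 = b >> 32;

    uint64_t p00 = a0 * b0;                /* low  x low  */
    uint64_t p01 = a0 * b1;                /* low  x high */
    uint64_t p10 = a1 * b0;                /* high x low  */
    uint64_t p11 = a1 * b1;                /* high x high */

    /* middle column: this is where the carries that hurt the GPU come from */
    uint64_t mid = (p00 >> 32) + (uint32_t)p01 + (uint32_t)p10;

    *lo = (mid << 32) | (uint32_t)p00;
    *hi = p11 + (p01 >> 32) + (p10 >> 32) + (mid >> 32);
}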
#2
fivemack
(loop (#_fork))
Feb 2006
Cambridge, England
14466₈ Posts
I spent a little while thinking about using CRT in this kind of context; the problem is that the individual primes need to be ==1 modulo the size of the transform, which makes them too big to handle with MMX-like techniques if the transform is of at all reasonable size. And the restriction gets worse if you also require the primes to have an appropriate root of two for doing the DWT.
(With all the 16-bit primes ==1 mod 256 you only get to store about 50k digits; with all the 24-bit primes ==1 mod 2^16 you could store 16 million digits, which isn't quite enough.) 32-bit primes ==1 mod 2^24 are maybe more interesting: there are 24 of them (so the effective digit size is about 10^220), and probably a mulmod on such a prime can be done slightly faster than for a general p. Doing lots and lots of parallel FFTs modulo different fixed small primes feels quite fitting for a GPU, but I'm not competent to quantify 'feels quite fitting'.
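(That count is easy to verify: every 32-bit prime ==1 mod 2^24 has the form k*2^24 + 1 with 1 <= k <= 255, so a few lines of throwaway C can list them; trial division is plenty at this size:)

Code:
#include <stdint.h>
#include <stdio.h>

static int is_prime(uint64_t n) {          /* n is odd and > 2 here */
    for (uint64_t d = 3; d * d <= n; d += 2)
        if (n % d == 0) return 0;
    return 1;
}

int main(void) {
    int count = 0;
    for (uint64_t k = 1; k <= 255; k++) {  /* k*2^24 + 1 < 2^32 */
        uint64_t p = (k << 24) + 1;
        if (is_prime(p)) { printf("%llu\n", (unsigned long long)p); count++; }
    }
    printf("%d primes == 1 mod 2^24\n", count);
    return 0;
}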
#3
diep
Sep 2006
The Netherlands
3·269 Posts
Quote:
Skip the 24-bit primes and skip the 32-bit primes, as the top-bits calculation eats 4 cycles at AMD; too slow. Of course single precision floating point would be possible, but let's focus upon integers for now.

Say we use a bunch of <= 15-16 bit primes and use CRT. What are the conditions we have to obey in such a case? You say they have to be == 1 modulo the size of the transform. Does this mean the product of the primes must be == 1 modulo the size of the transform, or must each individual prime that we pick obey this rule?

Forget MMX/SSE2, that is the domain of prime95 here. That's more complicated programming than GPUs; on GPUs just a few instructions are fast and the rest of the planet you'll have to ignore there. Note I would have no problem picking specific transform sizes unequal to a power of 2.

In itself, executing 70 instructions is a lot if each instruction is fast. Is there an example of how to do that somewhere in simple pseudocode?

Regards,
Vincent
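(On the condition itself, as stated in the previous post: each individual prime must be == 1 modulo the transform length, because each prime needs its own root of unity of that order; only the product of the primes has to exceed the largest convolution coefficient. A hedged sketch of the recombination, Garner-style for two primes, illustrated here with two well-known 31-bit NTT primes rather than 15-16-bit ones:)

Code:
#include <stdint.h>

#define P1 2013265921ULL   /* 15 * 2^27 + 1, supports length-2^27 transforms */
#define P2  998244353ULL   /* 119 * 2^23 + 1, supports length-2^23 transforms */

/* modular inverse by extended Euclid; a and m must be coprime */
static uint64_t invmod(uint64_t a, uint64_t m) {
    int64_t t = 0, newt = 1, r = (int64_t)m, newr = (int64_t)(a % m);
    while (newr != 0) {
        int64_t q = r / newr, tmp;
        tmp = t - q * newt; t = newt; newt = tmp;
        tmp = r - q * newr; r = newr; newr = tmp;
    }
    return (uint64_t)(t < 0 ? t + (int64_t)m : t);
}

/* given r1 = x mod P1 and r2 = x mod P2, reconstruct the unique x < P1*P2 */
static uint64_t crt2(uint64_t r1, uint64_t r2) {
    uint64_t inv = invmod(P1 % P2, P2);      /* P1^-1 mod P2; precompute in real code */
    uint64_t d   = (r2 + P2 - r1 % P2) % P2; /* (r2 - r1) mod P2 */
    uint64_t k   = (d * inv) % P2;           /* both factors < 2^30, fits in uint64 */
    return r1 + k * P1;                      /* x < P1*P2, about 2^61 */
}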
#4
diep
Sep 2006
The Netherlands
3×269 Posts
Quote:
How many bits do we store then in each unit? If we then use some sort of Karatsuba-type optimized calculation using 150k bits, we have the additional advantage that each transform maybe fits within a single compute unit, so we can run at full GPU speed within each compute unit. A compute unit, in AMD's case, has 32 KB of shared cache and 64 processing elements clocked at say 880 MHz, and we have a write/read speed of 1 TB/s at the shared caches within each compute unit. If we can work from just that 32 KB shared cache, that is a major advantage. Of course we want to be more efficient than Karatsuba (I'm not referring to Toom, actually) to avoid unnecessary multiplications; see the sketch below. Realize that having 24 compute units per GPU is very inefficient (it's not a power of 2). We can then do some slowish manipulation of each compute unit's result to generate the entire transform result. If this GPU can run at full speed with simple instructions, we can squeeze over 1.351 Tops out of each GPU. That is very mighty.

Vincent
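(For reference, the Karatsuba identity at the smallest possible scale: one 32 x 32 == 64 product from three 16 x 16 multiplies instead of four. The same recursion shape applies at 150k bits; this toy version is only meant to show where the saved multiplication comes from:)

Code:
#include <stdint.h>

/* (a1*2^16 + a0)(b1*2^16 + b0) with three multiplies:
   mid = (a1+a0)(b1+b0) - hh - ll  equals  a1*b0 + a0*b1 */
static uint64_t kara32(uint32_t a, uint32_t b) {
    uint32_t a1 = a >> 16, a0 = a & 0xFFFF;
    uint32_t b1 = b >> 16, b0 = b & 0xFFFF;
    uint32_t hh = a1 * b1;
    uint32_t ll = a0 * b0;
    uint64_t mid = (uint64_t)(a1 + a0) * (b1 + b0) - hh - ll;  /* 17x17 bits, widen */
    return ((uint64_t)hh << 32) + (mid << 16) + ll;
}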
#5
diep
Sep 2006
The Netherlands
3×269 Posts
Quote:
There are not too many primes that are ideal to use there. Basically one soon stumbles upon the obvious p = 15 * 2^27 + 1. Multiplication is a lot more efficient than the 64 x 64 bits == 128 bits case, of course; we lose 8 cycles here though. Yet it might be very inefficient in terms of the number of bits that can be stored. I'm looking here at the example code I used from Adam Schuck and Joel Veness, the latter also explaining to me how to do the math regarding this transform; that was around 2005. Taking the modulo of such numbers is not so efficient. What I fear is the number of bits that can be stored per limb here: if that's very low, then we lose a lot compared to the other transforms.
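(A rough bound to quantify that fear: squaring an n-limb number with b bits per limb produces convolution coefficients up to n * (2^b - 1)^2, so a single prime p must satisfy p > n * 2^(2b), i.e. 2b < log2(p) - log2(n). With p = 15 * 2^27 + 1, which is just under 2^31, and n = 2^20, that leaves only about 5 bits per limb — very low indeed. Buying those bits back is exactly what CRT over several such primes does: with three 31-bit primes the bound becomes 2b < 93 - 20, roughly 36 bits per limb in principle.)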
#6
jasonp
Tribal Bullet
Oct 2004
5×23×31 Posts
Quote:
With 24-bit primes, a 2^21-th root of 1 drastically limits the number of primes you can use; you would probably need larger primes to find enough of them that have such a large power-of-two root. You can also find primes with a small-prime-times-large-power-of-two root of 1 and get away with a smaller, non-power-of-two transform length.

Finally, you can instead take a completely different approach and find one or more primes p with a large power-of-two root of -1 mod p, then construct a fast Galois transform modulo p^2. Arithmetic mod p^2 looks like (is isomorphic to) arithmetic on complex integers modulo p, and allows an FFT with many of the same tricks that are possible with floating-point FFTs. I suspect that with a good choice of one or more p, this approach is the best we can do for integer transforms. Nice choices are p = Mersenne_prime :)

Last fiddled with by jasonp on 2011-05-09 at 16:50 Reason: specify answer is computed a chunk at a time
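(To make the "complex integers modulo p" concrete, a minimal sketch of the base arithmetic, assuming p = 2^61 - 1 — a Mersenne prime with p == 3 mod 4, so x^2 + 1 is irreducible and i really behaves like sqrt(-1). Reduction mod a Mersenne prime is just shift-mask-add, since 2^61 == 1 (mod p). Illustrative only, not actual transform code:)

Code:
#include <stdint.h>

#define M61 ((1ULL << 61) - 1)            /* p = 2^61 - 1, p == 3 (mod 4) */

typedef struct { uint64_t re, im; } gf61; /* element of GF(p^2) as a + b*i */

static uint64_t mulmod61(uint64_t a, uint64_t b) {
    unsigned __int128 t = (unsigned __int128)a * b;
    uint64_t r = ((uint64_t)t & M61) + (uint64_t)(t >> 61);  /* fold: 2^61 == 1 */
    return r >= M61 ? r - M61 : r;
}

static uint64_t addmod61(uint64_t a, uint64_t b) {
    uint64_t r = a + b;                   /* inputs < 2^61, no 64-bit wrap */
    return r >= M61 ? r - M61 : r;
}

static uint64_t submod61(uint64_t a, uint64_t b) {
    return a >= b ? a - b : a + M61 - b;
}

/* one complex multiply (a + b*i)(c + d*i) with i^2 = -1, everything mod p */
static gf61 gf61_mul(gf61 x, gf61 y) {
    gf61 r;
    r.re = submod61(mulmod61(x.re, y.re), mulmod61(x.im, y.im));
    r.im = addmod61(mulmod61(x.re, y.im), mulmod61(x.im, y.re));
    return r;
}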
#7
diep
Sep 2006
The Netherlands
3·269 Posts
Quote:
As for that, I've got a question: how about toying with 32-bit floating point? Multiply-add goes at warp speed. The advantage AMD has over Nvidia is 1536 PEs per GPU versus Nvidia's 448 or so (512 max per GPU).

Regards,
Vincent
#8
fivemack
(loop (#_fork))
Feb 2006
Cambridge, England
2·7·461 Posts
I'm really very confident that single precision doesn't work; lots of people much smarter than me have looked at it in the past and failed.

23 bits just get eaten up too fast if you want a transform of remotely plausible length, and whilst I suppose you *could* use FP to implement a mod-p transform for 20-bit p, that's too small to have a big enough root of unity to let you have a reasonable length of FFT. You absolutely can't escape the primes having to be larger than the size of the transform: a length-N transform needs an element of order N mod p, so N must divide p-1, which forces p > N.

Last fiddled with by fivemack on 2011-05-09 at 14:28
#9
diep
Sep 2006
The Netherlands
3·269 Posts
hello,
I've googled around and taken a look in the book. Online I can find a 1999 paper by Crandall regarding the DGT (discrete Galois transform). Of course I then also looked directly in his 2005 book (2nd edition of "Prime Numbers"), which I had bought. His 1999 online paper refers to perfsci, which seems to be Crandall's own company. It doesn't have much on this DGT; basically the name only gets mentioned in the question section. No code that's of any use for building a transform for the uninformed reader who can program, and the net has nothing further on all this.

I assume the important thing is using CRT to get things done; that might be the only way to success for GPUs. And then select a bunch of small primes.

The 2^61-1 prime which gets discussed in the online paper we had already studied back in 2006 to see whether it was of any use. Back then it was not so interesting to deal with, as the CPU has native 64-bit support. Right now it would be better suited than p = 2^64 - 2^32 + 1, because you don't have the carry problems when you do the 64 x 64 bits == 128 bits multiplication with Karatsuba on 32-bit multiplications; the 64-bit p gives carry problems there, and the AMD GPU is bad at add-with-carry. Shifting up to 31 bits goes fast though. Using that Mersenne prime might save a bunch of instructions, which might be worth serious looking at prior to looking at the DGT.

The quickmul I implemented works, however, only for k * 2^n + 1, not for k * 2^n - 1. Does anyone know an example transform of how to use a Mersenne prime? Preferably some actual working code. This as we all appreciate the good intentions of Crandall having a shot at writing pseudocode in his book (and I have to admit it's better than NOT doing it).

Regards,
Vincent