mersenneforum.org  

Old 2010-03-05, 23:10   #177
__HRB__
 
Dec 2008
Boycotting the Soapbox

Item #542: The 2/8 split-radix butterflies for the FGT(2^n-1) require at most 12 multiplies (n -> 2n bits), because the roots of +/-i require only shifts and additions.

Item #543: The cyclic part of a 2/8 split-radix FGT(2^61-1) is positive, and the modulo does not require 64-bit *arithmetic* shifts.

Item #544: For p = 2^113-2^57+1, 4th roots are cheap and we can use 2/4 split-radix. We can also defer carrying, and hold more significant bits than 2*(2^61-1).
Old 2010-03-09, 01:29   #178
__HRB__
 

Damn! Just when I'm in the middle of implementing a floating point FFT...

sqrt(2) mod (2^n + 1) is cheap. If we split our 2^odd bits into two parts, we can do a weighted cyclic convolution to do squaring mod (2^odd - 1). I think this can get rid of 50% zero-padding, at the cost of having to do two negacyclic convolutions (instead of one negacyclic and one cyclic convolution as required for the mod (2^even - 1) case).

I've also read the 1982 paper by Schoenhage. The 'suitable numbers' have a perk: one can use progressively smaller registers and avoid ALL shifts, or reserve progressively fewer carry bits per level.
Old 2010-06-03, 18:21   #179
__HRB__
 
DWTs w/prime GFNs & Nussbaumer Convolution

Using e.g. 1050640^4096+1 as a modulus, we get 2^13 free roots if we do all computations in base 1050640 (~20 bits), and we should be able to do a DWT for any number that has a 2^13th root (mod 1050640^4096+1). If we use Nussbaumer convolution twice (128 words/8 words) and Karatsuba to do the 3 multiplies (2x weighting + 1x pointwise), we should also only be forced to do the carrying 3x, so we can use paddd/psubd for almost everything.

I don't know whether there are appropriate primes that allow weighting with roots of 2, but if I'm not mistaken, it should be possible to check 100M primes of the form k*2^n+/-1 for many k.

Comments?
Old 2010-06-04, 12:55   #180
__HRB__
 

Quote:
Originally Posted by __HRB__ View Post
Using e.g. 10506404096+1 as a modulus, we get 213 free roots if we do all computations in base 1050640 (~20-bits), and we should be able to do a DWT for any number that has a 213 root (mod 10506404096+1). If we use Nussbaumer Convolution twice (128 words/8 words) and Karatsuba to do the 3 multiplies (2x weighting + 1x pointwise) we should also only be forced to do the carrying 3x so we can use paddd/psubd for almost everything.

I don't know whether there are appropriate primes that allow weighting with roots of 2, but if I'm not mistaken, it should be possible to check 100M primes of the form k2n+/-1 for many k.[...]
Hm...formatting came out wrong...it should read
"1050640^4096+1", "2^13", "(mod 1050640^4096+1)" and "k*2^n+/-1"

Also, if we're doing stuff (mod k*2^n+/-1), then we can also use Schoenhage-Strassen, because many k should have 2^Nth roots (mod 2^N+1). IINM, SS should be 1.5x faster than Nussbaumer, because the inverse transform requires half as many words.
Old 2010-08-08, 20:32   #181
diep
 
Sep 2006
The Netherlands

hi,

I'm missing Yap.

In my NTT I used the prime p = 2^64 - 2^32 + 1.
A bunch of multiplies there I could also replace by a simple lookup table.

Didn't implement the CRT. Note I used C, which is not as fast as assembler; the compilers generate very inefficient code.

Could you elaborate on the costs of that?

Then there is this idea: on AMDs, where the multiplication is far faster than on Intels (Intel has a throughput latency of 3.75 cycles versus 2.25 cycles on AMD, not to be confused with total latency), you can use a kind of rotation-transform implementation in assembler, where you do the normal bit-fiddling in SIMD and only the multiplication in the multiplication register. In the first and last cycle it will block all execution units on AMDs, and in the cycles in between you can do all the other stuff in SIMD.

Lots of work to get going, but would be pretty fast.

What I'm not aware of is the throughput latency of the double-precision floating-point multiplication in SIMD; anyone on that?

So no Intel or AMD handbooks please, as reality is going to be a lot worse than the handbooks say (see above for the actual latencies of the 64 x 64 integer multiplication).

Based upon the throughput latencies of those instructions, it is easy to calculate what has any practical future there.

Vincent

p.s. the prime 2^64 - 2^32 + 1 has the advantage that 2^96 == -1 (mod p)

Here is C code I wrote for mulmod (probably not optimal, sorry for that):

FORCEINLINE uint64 mulmod(uint64 x, uint64 y) { // returns x * y (mod P)
    uint64 result, reslo, reshi;
    uint64 hi_least, hi_most, hi_shifted, hi_addtolo;

#if USEINTRINSICSX64
    reslo = _umul128(x, y, &reshi);
#else
    reshi = mul_limbs(&reslo, x, y);   // portable 64x64->128 helper, defined elsewhere
#endif
    /* taking the modulo:
     * P = 2^64 - 2^32 + 1
     * ==> i * (2^64 - 2^32 + 1) == 0 (mod P)
     * <==> 2^64 * i == 2^32 * i - i
     *
     * so if i is the top (most significant) 32 bits of the 128 bits,
     * then: 2^96 * i == 2^64 * i - 2^32 * i
     *
     * because 2^96 == -1 (mod P),
     * reduction of i * 2^96 amounts to
     * subtraction of the least significant dword
     */

    // first a simplistic implementation, to see whether the DFT/NTT with this P works at all
    // split the high quadword into 2 dwords:

    // reduction of bits 96..127 is deferred until the end of the modulo,
    // so we start with the low 96 bits
    hi_least   = reshi & 0x00000000ffffffff;
    hi_shifted = reshi << 32;
    hi_most    = reshi >> 32;

    // now: 2^64 * i == 2^32 * i - i
    hi_addtolo = hi_shifted - hi_least;

    // result is: reslo + hi_addtolo - hi_most
    result = plusmod(reslo, hi_addtolo);   // modular add/sub helpers, defined elsewhere
    result = minusmod(result, hi_most);
    return result;
}


Quote:
Originally Posted by __HRB__ View Post
There are two competing algorithms...

1. Schoenhage-Strassen
Example: for 66M (2^26 - k*2^15) bits, with k>15
1.1. Preparation
Pad the number with zeros to 2^27 bits. Split into 2^15 parts with 2^12 bits. To prepare the transform in the ring mod 2^(2^13)+1, we split the 2^13+1 bits into 512 words of 16 bits and put 4 of these into the 4 dwords of an SSE2 register, using only the lower 16 bits and reserving the upper 16 bits to hold the overflow after additions.
Total clocks: 0 (static structure)
1.2. Outer NTT:
1.2.1 Do 2^7 transforms of length 2^8.
All required roots are shifts larger than 16 bits, and can be performed efficiently by changing the order of loads, so an optimal implementation will do 2 paddd (or psubd) per clock, i.e. 0.125 clocks/word.
Clocks: 0.125*512*2^7*2^8*8 = 16,777,216
1.2.2 Twiddle & transpose the results of the previous step.
Every 2nd twiddle requires one extra addition (sqrt(2)-trick). The shifting and carrying can be combined and requires two shifts, one addition, and one logical operation. An optimal implementation will require the logical operation to go to port #1 on Core2, which cannot be used by any of the other instructions. The two shifts can only go to port #0, so we get 2 cycles/4 words = 0.5 cycles/word.
Clocks: 0.5*512*2^15 = 8,388,608
1.2.3 Do 2^8 transforms of length 2^7, analogously to step 1.2.1.
Clocks: 0.125*512*2^7*2^8*7 = 14,680,064
1.2.4 Perform final carrying:
If done properly, the required shift, add, and logical and will go to three different ports: 1 cycle/4 words
Clocks: 0.25*512*2^7*2^8 = 4,194,304
Total clocks: 44,040,192
1.3. Inner NTT:
1.3.1 Preparation
Group 4 words of the 512 words from step 1 to form moduli and perform negacyclic weighting (multiplying by 256th roots) by shifts, which only takes 0.125 cycles/word as the other half of the words are zero. An immediate tweak would be to replace the following additions with logical or, which has a 1.5x higher throughput.
Clocks: 0.125*512*2^15 = 2,097,152
1.3.2 Transform
Perform an NTT analogously to step 1.2.x, but with a balance of 4/3 instead of 8/7.
Clocks: 8*(0.125*7 + 0.5 + 0.125)*2^7*2^15 = 50,331,648
1.3.3 Harlan's Trick
The result of the previous step requires 136 bits, so the top 8 bits have been wrapped around and have been subtracted from the bottom 8 bits.
Fix: Do another transform with modulus 2^128+1, putting the bottom 8 bits of every 4th word from step 1.2.4 into every second word, do a negacyclic convolution of length 2^5, subtract the bottom 8 bits in every second word from the result of step 1.3.2, and reconstruct the upper bits for the carrying after the convolution.
Harlan's trick only adds 25% to the runtime of the inner NTT, not 100%.
Total Clocks: 50,331,648*1.25 = 62,914,560
1.4 Pointwise multiply
For the sake of simplicity, let's assume that squaring a 128-bit number requires three 64x64->128 multiplies at 4 clocks each. This instruction goes to port #1, so most of the 12 cycles will be hidden, and the total clocks are more likely to be determined by packing and unpacking 16-bit values.
Clocks: 12*1.25*2^7*2^15 = 62,914,560
1.5 Total clocks
2x step 1.2 + 2x step 1.3 + step 1.4 = 232,783,872
Or about 0.116 s on a 2 GHz Core2, which is just as fast as prime95. Things get really interesting if one considers that only a maximum of 24 bits of the 32-bit words were actually used, so there is potential for a 50% speed-up, if someone can figure out how to use 24 bits instead of 16, keeping most roots free, without any significant increase in instruction count.

Powers of 2^96 or 2^192 might be an answer...depending on how much a radix-3 NTT costs...
2. Trade secret
It wouldn't be faster than the above for general multiplication, but it would allow weighted transforms and could potentially be 15% faster than prime95.
There are many possibilities that haven't been examined yet. For instance,
2^52+2^26-1
2^48+2^24-1
are both prime, allow easy mod'ing, fit nicely into two 32-bit words, and are suitable for FGTs, so one can do transforms of length 2^25 and use the CRT!
Unfortunately, it turns out (read: I think) that the required sqrt(2) to do weighted transforms is unusable for power-of-two transforms...

Stay tuned...

Last fiddled with by diep on 2010-08-08 at 20:37
Old 2010-08-08, 22:41   #182
__HRB__
 

The problem is that if you are not using SSE2 you immediately suffer a 100% penalty, because 128-bit registers can simply do twice as much as 64-bit registers. Your best bet is to use 2^96==-1 (mod 2^64-2^32+1), stuff 4x4x24-bit chunks into 4 SSE registers and eliminate redundant instructions.
Old 2010-08-08, 23:08   #183
diep
 

Quote:
Originally Posted by __HRB__ View Post
The problem is that if you are not using SSE2 you immediately suffer a 100% penalty, because 128-bit registers can simply do twice as much as 64-bit registers. Your best bet is to use 2^96==-1 (mod 2^64-2^32+1), stuff 4x4x24-bit chunks into 4 SSE registers and eliminate redundant instructions.
Obviously that is a big problem, because SSE2 can't do 64 x 64 == 128 bits with (unsigned) integers; however, the interesting thing is a correct comparison.

So the question is: how many cycles of throughput latency (not the real latency, as that's less interesting) does the double-precision SIMD multiplication, which computes {a1,a2} * {b1,b2} == {a1*b1,a2*b2}, take on the CPU?

Anyone?

Of course there is only 1 multiplication unit capable of doing this.

For integer transforms the cost is known: a1 * b1 == a1b1 (full 128 bits).
And that's 2.25 cycles on AMD, which is not too bad in fact, yet far from perfect.

So an objective comparison with floating point is crucial.

Basically, the only thing you have to achieve is to get an integer transform into the next SPECint test set. That solves all problems at once, as the next generation of CPUs will then of course have the 64 x 64 == 128-bit integer multiplication in SIMD; at once, integer transforms will be faster than floating point. Yet AFAIK this is not the case currently.

Vincent

Last fiddled with by diep on 2010-08-08 at 23:11
Old 2010-08-08, 23:16   #184
__HRB__
 

Quote:
Originally Posted by diep View Post
Obviously that is a big problem, because SSE2 can't do 64 x 64 == 128 bits with (unsigned) integers [...]
See: http://www.mersenneforum.org/showthread.php?t=13176
Performance for a split-radix real FFT, i.e. ~4/3*n*log2(n) adds and 2/3*n*log2(n) multiplies, is limited on CPUs by the throughput of the floating-point adder, not the multiplier.
Old 2010-08-08, 23:43   #185
Ken_g6
 
Jan 2005
Caught in a sieve

Edit: Let's try this again:

Quote:
Originally Posted by diep View Post
Obviously that is a big problem, because SSE2 can't do 64 x 64 == 128 bits with (unsigned) integers....
That's not entirely true. Here's an excerpt from PPSieve (GPL2), using SSE2 intrinsics:

Code:
        // T * P
        mtemp = _mm_srli_epi64(mp, 32);		// Get the high doubleword.
        mtemp2 = _mm_srli_epi64(mt, 32);	// Get the high doubleword.
        mtemp = _mm_mul_epu32(mtemp, mt);
        mtemp2 = _mm_mul_epu32(mtemp2, mp);
        mt = _mm_mul_epu32(mt, mp);
        mtemp = _mm_add_epi32(mtemp, mtemp2);	// Just need the low doublewords.
        mtemp = _mm_slli_epi64(mtemp, 32);	// Move result to other column (high doublewords).
        mt = _mm_add_epi32(mt, mtemp);		// Add the results; only need high doublewords.
That does a 64x64=64 multiply. It shouldn't take too much more code to get the upper half: one multiply, some shifts, and some adds.

However, it is true that a 64x64=128 multiply in 64-bit x86 mode is faster than this.

Last fiddled with by Ken_g6 on 2010-08-08 at 23:53
Old 2010-08-08, 23:59   #186
diep
 

Quote:
Originally Posted by __HRB__ View Post
See: http://www.mersenneforum.org/showthread.php?t=13176
Performance for a split radix real FFT, i.e ~4/3*n*log2(n) adds and 2/3*n*log2(n) multiplies, on CPUs is limited by the throughput of the floating point adder, not the multiplier.
Yes, I also saw a post here claiming 13% system time or something similar to that. That's amazingly much.

Does it mean that basically every floating-point instruction is really slow, or does some other worst case get hit in the chip?

How many instructions per cycle can effectively be pushed through with that code at a size that fits within the caches? (So when RAM is not the problem, what IPC gets achieved, in terms of instructions rather than GFLOPS, in short.)

As that might reveal some interesting data.

Vincent
Old 2010-08-09, 01:44   #187
__HRB__
 

Quote:
Originally Posted by diep View Post
How many instructions per cycle can effectively be pushed through with that code at a size that fits within the caches? (So when RAM is not the problem, what IPC gets achieved, in terms of instructions rather than GFLOPS, in short.)
Aim for:

~3 instructions/clock for Core2
~2.7 instructions/clock for Athlon II.