#177
Dec 2008
Boycotting the Soapbox
720_10 Posts
Item #542: The 2/8 split-radix butterflies for the FGT(2^n-1) require at most 12 multiplies (n -> 2n bits), because the roots of +/-i require only shifts and additions.
Item #543: The cyclic part of a 2/8 split-radix FGT(2^61-1) is positive, and the modulo does not require 64-bit *arithmetic* shifts.
Item #544: For p = 2^113 - 2^57 + 1, 4th roots are cheap and we can use 2/4 split-radix. We can also defer carrying, and hold more significant bits than 2*(2^61-1).
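Item #543's cheap modulo is the standard Mersenne reduction: since 2^61 == 1 (mod 2^61-1), the high bits fold back in with a shift and an add, and no division or 64-bit arithmetic shift ever appears. A minimal scalar sketch in C (my own illustration, not the poster's FGT code):

```c
#include <stdint.h>

/* Reduce a 64-bit value modulo the Mersenne prime M61 = 2^61 - 1.
   Write x = hi*2^61 + lo; since 2^61 == 1 (mod M61), x == hi + lo. */
static const uint64_t M61 = (1ULL << 61) - 1;

uint64_t reduce61(uint64_t x) {
    uint64_t r = (x & M61) + (x >> 61); /* fold the top 3 bits back in */
    if (r >= M61) r -= M61;             /* at most one conditional subtract */
    return r;
}
```

The same shift-and-add structure is what makes powers of 2 (and hence the roots the item mentions) essentially free in this field.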
#178
Dec 2008
Boycotting the Soapbox
2^4×3^2×5 Posts
Damn! Just when I'm in the middle of implementing a floating point FFT...

sqrt(2) mod (2^n + 1) is cheap. If we split our 2^odd bits into two parts, we can do a weighted cyclic convolution to do squaring mod (2^odd - 1). I think this can get rid of 50% of the zero-padding - at the cost of having to do two negacyclic convolutions (instead of one negacyclic and one cyclic convolution as required for the mod (2^even - 1) case). I've also read the 1982 paper by Schoenhage. The 'suitable numbers' have a perk: one can use progressively smaller registers and avoid ALL shifts, or reserve progressively fewer carry bits per level.
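The cheap sqrt(2) mod (2^n + 1) is the classic Schönhage–Strassen identity: with 2^n == -1, one checks (2^(3n/4) - 2^(n/4))^2 = 2^(3n/2) - 2·2^n + 2^(n/2) == -2^(n/2) + 2 + 2^(n/2) = 2. A small sketch of my own, for n divisible by 4 and small enough for 64/128-bit arithmetic (u128 is the gcc/clang __int128 extension):

```c
#include <stdint.h>

typedef unsigned __int128 u128; /* gcc/clang extension */

/* sqrt(2) modulo F = 2^n + 1, for n divisible by 4 and n <= 60 */
uint64_t sqrt2_mod_fermat(unsigned n) {
    uint64_t F = (1ULL << n) + 1;
    /* add F before subtracting so the value stays non-negative */
    return ((1ULL << (3 * n / 4)) + F - (1ULL << (n / 4))) % F;
}

/* square s modulo F; the product needs up to 122 bits */
uint64_t sqr_mod(uint64_t s, uint64_t F) {
    return (uint64_t)((u128)s * s % F);
}
```

Because sqrt(2) is a difference of two powers of 2, weighting by it costs only shifts and a subtraction, which is what makes the half-length weighted convolution attractive.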
#179
Dec 2008
Boycotting the Soapbox
1011010000_2 Posts
Using e.g. 1050640^4096+1 as a modulus, we get 2^13 free roots if we do all computations in base 1050640 (~20 bits), and we should be able to do a DWT for any number that has a 2^13-th root (mod 1050640^4096+1). If we use Nussbaumer convolution twice (128 words / 8 words) and Karatsuba to do the 3 multiplies (2x weighting + 1x pointwise), we should also only be forced to do the carrying 3x, so we can use paddd/psubd for almost everything.

I don't know whether there are appropriate primes that allow weighting with roots of 2, but if I'm not mistaken, it should be possible to check 100M primes of the form k*2^n+/-1 for many k. Comments?
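On the hunt for appropriate primes: for any prime p = k·2^n + 1 (k odd), raising a quadratic non-residue g to the k-th power yields a primitive 2^n-th root of unity, the basic transform ingredient; whether 2 itself has the needed roots for the weighting then depends on k. A hedged sketch of my own, using the well-known case p = 257 = 1·2^8 + 1 with g = 3 (a primitive root mod 257):

```c
#include <stdint.h>

typedef unsigned __int128 u128; /* gcc/clang extension */

/* b^e mod p by square-and-multiply */
uint64_t powmod(uint64_t b, uint64_t e, uint64_t p) {
    uint64_t r = 1;
    b %= p;
    while (e) {
        if (e & 1) r = (uint64_t)((u128)r * b % p);
        b = (uint64_t)((u128)b * b % p);
        e >>= 1;
    }
    return r;
}

/* For prime p = k*2^n + 1 and a quadratic non-residue g,
   w = g^k has order exactly 2^n: a primitive 2^n-th root of unity. */
uint64_t root_of_unity(uint64_t g, uint64_t k, uint64_t p) {
    return powmod(g, k, p);
}
```

Scanning candidate primes would then amount to testing, for each p = k·2^n±1, whether 2 is a suitably high power in the multiplicative group.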
#180
Dec 2008
Boycotting the Soapbox
2^4·3^2·5 Posts
Quote:
"1050640^4096+1", "2^13", "(mod 1050640^4096+1)" and "k*2^n+/-1"

Also, if we're doing stuff (mod k*2^n+/-1), then we can also use Schoenhage-Strassen, because many k should have 2^Nth roots (mod 2^N+1). IINM, SS should be 1.5x faster than Nussbaumer, because the inverse transform requires half as many words.
#181
Sep 2006
The Netherlands
3^6 Posts
hi,
I'm missing Yap. For my NTT I used the prime p = 2^64 - 2^32 + 1. A bunch of multiplies there I could also replace by a simple lookup table. I didn't implement CRT. Note that I used C, which is not as fast as assembler; the compilers generate very inefficient code. Could you elaborate on the costs of that?

Then there is the idea that on AMDs, where the multiplication is far faster than on Intels (Intel has a throughput latency of 3.75 cycles versus 2.25 cycles on AMD, not to be confused with total latency), you can use a kind of rotation-transform implementation in assembler where you do the normal bit-fiddling in SIMD and just the multiplication in the multiplication register. In the first and last cycle it will block all execution units on AMDs, and in the cycles in between you can do all the other stuff in SIMD. Lots of work to get going, but it would be pretty fast.

What I'm not aware of is the throughput latency of the double precision floating point multiplication in SIMD; anyone on that? Not the Intel or AMD handbooks please, as reality is a lot worse than the handbooks say (see above for the actual latencies of the 64 x 64 integer multiplication). Based on the throughput latencies of those instructions it is easy to calculate what has any practical future there.

Vincent

p.s. the prime 2^64 - 2^32 + 1 has the advantage that 2^96 == -1 (mod p). Here is the C code I wrote for mulmod (probably not optimal - sorry for that):

Code:
FORCEINLINE uint64 mulmod(uint64 x, uint64 y)
{   // returns : x * y (mod P)
    uint64 result, reslo, reshi;
    uint64 hi_least, hi_most, hi_shifted, hi_addtolo;

#if USEINTRINSICSX64
    reslo = _umul128(x, y, &reshi);
#else
    reshi = mul_limbs(&reslo, x, y);
#endif

    /* taking Modulo:
     *   p = 2^64 - 2^32 + 1
     *   ==>  i * (2^64 - 2^32 + 1) = 0
     *   <==> 2^64 * i = 2^32 * i - i
     *
     * so if i is the top (most significant) 32 bits of the 128 bits
     * then : 2^96 * i == 2^64 * i - 2^32 * i
     *
     * because 2^96 == -1 (mod P) ==>
     * reduction of i * 2^96 amounts to
     * subtraction of the least significant dword
     */

    // first a simplistic implementation to see whether the
    // DFT/NTT with this P works at all

    // split the hi quadword into 2 dwords; the reduction of
    // bits 96..127 waits until the end of the modulo, so we
    // start with the low 96 bits
    hi_least   = reshi & 0x00000000ffffffff;
    hi_shifted = reshi << 32;
    hi_most    = reshi >> 32;

    // now: 2^64 * i = 2^32 * i - i
    hi_addtolo = hi_shifted - hi_least;

    // now it is : reslo + hi_addtolo - hi_most
    result = plusmod(reslo, hi_addtolo);
    result = minusmod(result, hi_most);

    return result;
}
Last fiddled with by diep on 2010-08-08 at 20:37 |
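A quick sanity check of the identities the reduction relies on: a slow but obviously-correct reference mulmod via 128-bit division (my own test scaffolding, using the gcc/clang __int128 extension) confirms 2^64 == 2^32 - 1 and 2^96 == -1 (mod p):

```c
#include <stdint.h>

typedef unsigned __int128 u128; /* gcc/clang extension */

static const uint64_t P = 0xFFFFFFFF00000001ULL; /* 2^64 - 2^32 + 1 */

/* reference x*y mod P via 128-bit division: slow, but a good
   oracle for testing a shift-based mulmod like the one above */
uint64_t mulmod_ref(uint64_t x, uint64_t y) {
    return (uint64_t)((u128)x * y % P);
}

uint64_t powmod_ref(uint64_t b, uint64_t e) {
    uint64_t r = 1;
    while (e) {
        if (e & 1) r = mulmod_ref(r, b);
        b = mulmod_ref(b, b);
        e >>= 1;
    }
    return r;
}
```

An oracle like this is handy for fuzzing the fast path: any disagreement between the two mulmods on random inputs pinpoints a reduction bug immediately.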
#182
Dec 2008
Boycotting the Soapbox
2^4×3^2×5 Posts
The problem is that if you are not using SSE2 you immediately suffer a 100% penalty, because 128-bit registers can simply do twice as much as 64-bit registers. Your best bet is to use 2^96==-1 (mod 2^64-2^32+1), stuff 4x4x24-bit chunks into 4 SSE registers and eliminate redundant instructions.
#183
Sep 2006
The Netherlands
729_10 Posts
Quote:
So the question is: how many cycles of throughput latency (not the real latency, as that's less interesting) does the double precision SIMD multiplication instruction, which multiplies {a1,a2} * {b1,b2} == {a1*b1,a2*b2}, take on the cpu? Anyone? Of course there is only 1 multiplication unit capable of doing this.

For integer transforms the cost is known: a1 * b1 == a1b1 (full 128 bits) takes 2.25 cycles at AMD, which is not too bad in fact, yet far from perfect. So an objective comparison with floating point is crucial.

Basically the only thing you have to achieve is to get an integer transform into the next SPECint test set. That solves all problems at once, as next-generation cpus will then of course have the 64 x 64 == 128 bits integer multiplication in SIMD; at that point integer transforms will of course be faster than floating point. Yet AFAIK this is not the case currently.

Vincent

Last fiddled with by diep on 2010-08-08 at 23:11
#184
Dec 2008
Boycotting the Soapbox
1320_8 Posts
Quote:
Performance of a split-radix real FFT, i.e. ~(4/3)*n*log2(n) adds and (2/3)*n*log2(n) multiplies, is limited on CPUs by the throughput of the floating-point adder, not the multiplier.
#185
Jan 2005
Caught in a sieve
5×79 Posts
Edit: Let's try this again:
Quote:
Code:
// T * P
mtemp = _mm_srli_epi64(mp, 32); // Get the high doubleword.
mtemp2 = _mm_srli_epi64(mt, 32); // Get the high doubleword.
mtemp = _mm_mul_epu32(mtemp, mt);
mtemp2 = _mm_mul_epu32(mtemp2, mp);
mt = _mm_mul_epu32(mt, mp);
mtemp = _mm_add_epi32(mtemp, mtemp2); // Just need the low doublewords.
mtemp = _mm_slli_epi64(mtemp, 32); // Move result to other column (high doublewords).
mt = _mm_add_epi32(mt, mtemp); // Add the results; only need high doublewords.
However, it is true that a 64x64=128 multiply in 64-bit x86 mode is faster than this.

Last fiddled with by Ken_g6 on 2010-08-08 at 23:53
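For reference, the doubleword decomposition those intrinsics perform per lane can be written in scalar C (my own sketch): the low 64 bits of a 64x64 product assembled from three 32x32 multiplies, exactly as the SSE2 sequence does.

```c
#include <stdint.h>

/* low 64 bits of t*p from three 32x32->64 multiplies, mirroring the
   SSE2 sequence above: lo*lo plus the two cross terms shifted up 32 */
uint64_t mul_lo64(uint64_t t, uint64_t p) {
    uint64_t cross = (t >> 32) * (uint32_t)p + (p >> 32) * (uint32_t)t;
    uint64_t lolo  = (uint64_t)(uint32_t)t * (uint32_t)p;
    return lolo + (cross << 32);
}
```

The hi*hi partial product never appears because it only affects bits 64 and up, which is also why the SIMD version needs just three _mm_mul_epu32 calls.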
#186
Sep 2006
The Netherlands
3^6 Posts
Quote:
Does that mean basically every floating point instruction is real slow, or does some other worst case get hit in the chip? How many instructions per cycle can effectively be pushed through with that code at a size that fits within the caches (so when RAM is not the problem - what's the IPC achieved in terms of instructions, not so much gflops)? That might reveal some interesting data.

Vincent
#187
Dec 2008
Boycotting the Soapbox
2D0_16 Posts
Quote:
~3 instructions/clock for Core2, ~2.7 instructions/clock for Athlon II.
Similar Threads
| Thread | Thread Starter | Forum | Replies | Last Post |
|--------|----------------|-------|---------|-----------|
| Modifying the Lucas Lehmer Primality Test into a fast test of nothing | Trilo | Miscellaneous Math | 25 | 2018-03-11 23:20 |
| Lucas-Lehmer test | Mathsgirl | Information & Answers | 23 | 2014-12-10 16:25 |
| Question on Lucas Lehmer variant (probably a faster prime test) | MrRepunit | Math | 9 | 2012-05-10 03:50 |
| Sumout Test in Lucas Lehmer? | paramveer | Information & Answers | 8 | 2012-01-30 08:23 |
| Lucas-Lehmer Test | storm5510 | Math | 22 | 2009-09-24 22:32 |