#23
Jul 2021
3·11 Posts
That's why I implemented both FFT and NTT from scratch myself, with the skills that I have.

I implemented both so that I could compare them directly. Both of my variants are optimized to a similar, middling degree, so the comparison is reasonably fair, and their inner loops and formulas look very much alike. My FFT uses AVX1 SIMD to do double-precision multiplies; the NTT uses the CPU's built-in (non-SIMD) 64x64->128 multiply. The reason my FFT is 10x slower than my NTT, as I see it, is word size: the NTT computes on full 64-bit words (around 60 bits in practice), because an NTT has no rounding error at all, while my FFT has to split the input into very small words of 3-4 bits, a 64/3 ≈ 21x difference in word size. Can anyone correct me: did I use the wrong FFT technique by splitting into 3-bit words? Is there another way to solve this? The 3-bit splitting was done to keep the final error within 0.05-0.1; 3 bits is for input numbers of 2^25 bytes. Of course, for smaller sizes like 2^20 I use 7-8 bits. These bit counts were chosen from experimental results on random inputs of different sizes; I built a table to achieve this.

The NTT doesn't suffer from cumulative error. Instead it uses the Chinese Remainder Theorem (CRT) to restore the full precision. In my CRT I used 3 primes, because an NTT with 64-bit words needs around a 64 + 64 + 64 = 192-bit final result, so 3 primes of 64 bits each are needed. This gives a 3x slowdown of the NTT, yet the NTT is still 10x faster than my FFT.

So my conclusion was: if my mid-optimized implementations of FFT and NTT, with almost the same algorithms and loops, differ by 10x, then I can imagine that if the Prime95 gurus implemented an NTT with the same high level of optimization as GWNUM, that hypothetical Prime95 NTT might turn out faster than Prime95's FFT. That is why I created this discussion thread in the first place.

People also say that NTT is much slower because of all the modular divisions. But I have no divisions in my code at all, only multiplications, because I used Barrett and Montgomery reduction, which replaces all divisions with multiplications (plus shifts/adds).

Last fiddled with by moytrage on 2021-07-18 at 05:23
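Since the post leans on Montgomery reduction to eliminate divisions, here is a minimal sketch of a 64-bit Montgomery multiply. This is illustrative only, not the poster's actual code; the struct and member names are invented, and the modulus used in the example is the well-known NTT-friendly prime 2^64 - 2^32 + 1.

```cpp
#include <cstdint>
#include <cassert>

using u64 = uint64_t;
using u128 = unsigned __int128;

// Montgomery arithmetic modulo a 64-bit odd prime P, with R = 2^64.
// Every operation below uses only multiplies, adds, and shifts: no division.
struct Montgomery {
    u64 P;     // odd modulus
    u64 Pinv;  // -P^{-1} mod 2^64
    u64 R2;    // (2^64)^2 mod P, used to enter Montgomery form

    explicit Montgomery(u64 p) : P(p) {
        // Newton iteration for P^{-1} mod 2^64: each step doubles the
        // number of correct low bits (P itself is correct to 3 bits).
        u64 inv = P;
        for (int i = 0; i < 6; ++i) inv *= 2 - P * inv;
        Pinv = 0 - inv;
        // R2 = 2^128 mod P, via one squaring of 2^64 mod P.
        u128 r = (u128(1) << 64) % P;
        R2 = u64((r * r) % P);
    }

    // REDC: for t < P * 2^64, return t * 2^{-64} mod P.
    u64 redc(u128 t) const {
        u64 m = u64(t) * Pinv;
        u128 s = t + u128(m) * P;  // low 64 bits of s are zero by construction
        u64 r = u64(s >> 64);      // r < 2P, so one conditional subtract
        return r >= P ? r - P : r;
    }

    u64 to_mont(u64 x) const { return redc(u128(x) * R2); }
    u64 from_mont(u64 x) const { return redc(x); }
    // Multiply two values already in Montgomery form.
    u64 mul(u64 a, u64 b) const { return redc(u128(a) * b); }
};
```

The one-time setup (computing `Pinv` and `R2`) is the only place any reduction work happens; the hot path `mul` is two 64x64->128 multiplies and a shift, which matches the post's claim of a division-free inner loop.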
#24
Jul 2021
33₁₀ Posts
Basically, this is exactly what is done in my code, for both the FFT and the NTT. I have an input array A and no recursion: the outer loop runs Log(N) times and the inner loop goes through A sequentially from start to finish. To avoid recursion, the input array is permuted up front. So I have Log(N) sequential read/write sweeps over A, and hence I conclude that my algorithm is already quite cache-friendly.
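The loop structure described here, a permutation up front followed by Log(N) sequential passes with no recursion, can be sketched as a textbook iterative radix-2 FFT. This is an illustrative sketch, not the poster's code:

```cpp
#include <complex>
#include <vector>
#include <cmath>
#include <cstddef>
#include <cassert>

using cd = std::complex<double>;

// In-place iterative FFT: bit-reversal permutation, then log2(n) passes,
// each a single sequential sweep over the array (the cache-friendly shape
// described in the post). n must be a power of two.
void fft_iterative(std::vector<cd>& a, bool invert) {
    const std::size_t n = a.size();
    // Bit-reversal permutation so each pass can read/write sequentially.
    for (std::size_t i = 1, j = 0; i < n; ++i) {
        std::size_t bit = n >> 1;
        for (; j & bit; bit >>= 1) j ^= bit;
        j ^= bit;
        if (i < j) std::swap(a[i], a[j]);
    }
    const double PI = std::acos(-1.0);
    // Outer loop: log2(n) passes over doubling butterfly spans.
    for (std::size_t len = 2; len <= n; len <<= 1) {
        double ang = 2 * PI / double(len) * (invert ? -1 : 1);
        cd wlen(std::cos(ang), std::sin(ang));
        for (std::size_t i = 0; i < n; i += len) {
            cd w(1);
            for (std::size_t k = 0; k < len / 2; ++k) {
                cd u = a[i + k], v = a[i + k + len / 2] * w;
                a[i + k] = u + v;              // butterfly add
                a[i + k + len / 2] = u - v;    // butterfly subtract
                w *= wlen;
            }
        }
    }
    if (invert)
        for (cd& x : a) x /= double(n);
}
```

The NTT variant has the identical skeleton; only the twiddle `w` becomes a modular root of unity and the butterfly uses modular add/subtract/multiply, which is why the two implementations compared in the thread share "almost the same code, loops and algorithms".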
#25
Jul 2021
3·11 Posts
I don't understand why an NTT should be much slower computationally. People say it is slow due to the many divisions in modular reduction, but I used Barrett and Montgomery reductions everywhere, which let me use no divisions at all, only multiply/add operations.

So after Barrett + Montgomery reduction I don't see why the FFT should be much faster. Both use SIMD and/or the CPU's built-in multiplications, and that's it. Both algorithms look about the same to me; more than that, they share almost the same code, loops, and algorithms. I can't see a real difference.

Last fiddled with by moytrage on 2021-07-18 at 06:16
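For completeness, here is a Barrett-reduction sketch showing how the modular divide becomes two multiplies and at most one subtract. Illustrative only (invented names, not the poster's code); it is shown for a modulus below 2^32 so that all products fit in 64 bits.

```cpp
#include <cstdint>
#include <cassert>

using u64 = uint64_t;
using u128 = unsigned __int128;

// Barrett reduction for a modulus p < 2^32: the one real division happens
// once at setup; the hot path is multiply/shift/subtract only.
struct Barrett {
    u64 p;  // modulus
    u64 m;  // floor(2^64 / p), precomputed

    explicit Barrett(u64 mod) : p(mod), m(u64((u128(1) << 64) / mod)) {}

    // Reduce x < p^2 modulo p. q approximates floor(x / p) from below by
    // at most 1, so r lands in [0, 2p) and needs one conditional subtract.
    u64 reduce(u64 x) const {
        u64 q = u64((u128(x) * m) >> 64);
        u64 r = x - q * p;
        return r >= p ? r - p : r;
    }

    // Modular multiplication with no division in the hot path.
    u64 mulmod(u64 a, u64 b) const { return reduce(a * b); }
};
```

For full 64-bit moduli the same idea needs a wider precomputed reciprocal (or Montgomery form instead), but the operation count, two multiplies per reduction, is the same, which supports the post's point that reduction cost alone does not make an NTT slow.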
#26
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7,823 Posts
If your NTT (which NTT?) is 9.8 sec/iter and your FFT is 10-15 times slower than your NTT, call it 12x, that's ~118 sec/iter for the moytrage FFT, while you timed prime95 at 0.34 sec. That's a ratio of ~346 between FFT implementations.

IBDWT vs. padded FFT is what, 2:1 based on length for padding? Bits/word in a DP FFT is typically 17-19 in prime95 or other established code (CUDALucas, Mlucas, Gpuowl) depending on FFT length, not 3-4, so that would be ~4-6 fold. Memory or cache efficiency probably gets killed by the low bits/word, and perhaps by the usage patterns Prime95 described. He's done a LOT of optimization work over the decades.

The integer-based code I thoroughly tested for exponent limit due to iteration error, in Gpuowl 1.9, was good for ~20 bits/word in the M61 fast Galois transform at 4M length. It was SP or M31 that was expected to provide ~3-5 bits/word. Author's description at v1.8: https://mersenneforum.org/showpost.p...postcount=224; v1.9 M61 transform exponent limit test results at https://www.mersenneforum.org/showpo...31&postcount=8

I don't know why you're only able to use 3-4 bits/word in your DP-based FFT, and yet see 0.07-0.1 roundoff error at that. Other software goes up to ~0.3-0.4 roundoff error or slightly higher for performance, with many more bits/word for DP-based FFTs. Usable bits/word versus FFT length is given in a column on the right of some of the more recent processor-specific benchmarks in the prime95 reference thread.

If you plan to turn your code into a prime-testing program, reading background on the development of Gpuowl, Mlucas, CUDALucas, and prime95 could be enlightening; there are forum threads for most if not all. If the code is wrong, even rarely, relative speed does not much matter in an application as demanding as primality testing. Every branch, every case, needs well-designed test inputs.
#27
Jul 2021
21₁₆ Posts
Quote:
Code:
{
{6, 22}, {7, 22}, {8, 20}, {9, 19}, {10, 20}, {11, 17}, {12, 16}, {13, 14}, {14, 13}, {15, 12},
{16, 11}, {17, 10}, {18, 9}, {19, 8}, {20, 7}, {21, 5}, {22, 4}, {23, 3}, {24, 3},
}
These bit counts are what I got when I tried to keep the error below 0.1: each entry is the largest word size that still keeps the error below 0.1, with all errors measured experimentally on random inputs. Of course, if I allow an error of 0.4 then I get more bits; I just thought that allowing 0.4 was not very safe. If some extra error checking is done, then 0.4 is also acceptable.

The only reasons I can think of why my error would be higher than Prime95's: maybe the `-ffast-math` compiler switch, which makes math faster but less accurate, and my use of SIMD, which I believe is less accurate than the CPU's built-in multiplication.

Do I understand correctly that Prime95 somehow uses 17-19 bits with an error of 0.4? Does it use 64-bit double computations, not 128- or 256-bit floats, and not 80-bit long doubles? Of course, if I used 18 bits instead of 3 bits I would be 6x faster. How did Prime95 manage to use 18 bits with a 0.4 error?

Actually, I didn't intend to make a primality-proving program out of my code, especially one dozens of times slower than the existing Prime95. I implemented FFT and NTT from scratch purely for educational purposes, because I always wanted to learn what FFT/NTT are. If somebody quite clever has already established that NTT is much slower than IBDWT at the same level of optimization, then I'm quite happy; that was my only concern when starting this thread, the question of whether Prime95 could become faster.

If everybody here says that NTT is inherently slower than FFT, then I can understand why FFT was chosen for Prime95. I just can't understand why: both FFT and NTT use only multiplications (and a few additions), given that all of the NTT's divisions can be replaced by the multiplications of Barrett and Montgomery reductions. Both perform about the same number of operations, O(N * Log(N)), so both should be about the same speed. Integer multiplication is a bit slower than double multiplication, but an NTT can use almost all 64 bits of each computational word, while the FFT uses only 17-19 bits, as in Prime95.

Last fiddled with by moytrage on 2021-07-18 at 07:25
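The CRT reconstruction mentioned in the thread can be illustrated as follows. It is shown with two primes for brevity (the poster uses three to reach ~192 bits); the helper names are invented, and the modular inverse is taken via Fermat's little theorem since the moduli are prime.

```cpp
#include <cstdint>
#include <cassert>

using u64 = uint64_t;
using u128 = unsigned __int128;

// Modular exponentiation, used below for the inverse p1^{-1} mod p2.
u64 powmod(u64 b, u64 e, u64 p) {
    u64 r = 1;
    b %= p;
    while (e) {
        if (e & 1) r = u64(u128(r) * b % p);
        b = u64(u128(b) * b % p);
        e >>= 1;
    }
    return r;
}

// Garner-style CRT: recover x (0 <= x < p1*p2) from its residues modulo
// two distinct primes, as x = r1 + p1 * t with t = (r2 - r1) * p1^{-1} mod p2.
u128 crt2(u64 r1, u64 p1, u64 r2, u64 p2) {
    u64 inv = powmod(p1 % p2, p2 - 2, p2);      // p1^{-1} mod p2 (p2 prime)
    u64 diff = (r2 + p2 - r1 % p2) % p2;        // (r2 - r1) mod p2
    u64 t = u64(u128(diff) * inv % p2);
    return u128(r1) + u128(p1) * t;
}
```

With a third prime the same step is applied once more, lifting the 128-bit partial result modulo p3; that extra lift plus the three transforms is the "3x slowdown" the earlier post accepts in exchange for exact arithmetic.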
#28
Jul 2021
100001₂ Posts
So I reran my bits-evaluation code and found that for an error of 0.4 my code needs 8 bits; as a reminder, for an error of 0.1 it needed 4 bits. Sure enough, 8-bit words will be 2 times faster than 4-bit words.

I still can't see how 18 bits can be used if only 64-bit double operations are done everywhere, not 128- or 256-bit floats. Maybe I made some cruel mistake somewhere, like doing multiplications/additions in a very wrong order that loses a lot of precision. BTW, does anyone know an exact formula for the error? If the two large numbers are pseudo-random, what is the formula for the error in terms of the numbers' size and the number of bits per word?

Also, do I understand correctly that if I have K-bit words and N such words, then I need a float type that can hold at least (K + K + Log2(N)) bits of mantissa? When multiplying two polynomials, each of degree N with K-bit coefficients, each coefficient of the product is at most N * (2^K)^2 after opening the parentheses, i.e. at most (K + K + Log2(N)) bits.

Taking the above into account: if a word is 19 bits and the large numbers are 2^25 bytes (2^28 bits), then there are about 2^28 / 19 ≈ 2^23.75 words. By the formula above, the maximal coefficient of the product could be (19 + 19 + 23.75) = 61.75 bits. But the standard IEEE double-64 type has only a 52-bit mantissa, so it cannot hold 61.75 bits exactly. How, then, can 19 bits be used for the word size?

Last fiddled with by moytrage on 2021-07-18 at 08:04
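The worst-case bound in this post is easy to check numerically, and it does indeed exceed the binary64 mantissa for 19-bit words at large N. The resolution (discussed in the replies that follow) is that established programs do not operate at the worst case: with balanced, signed digits and random-looking data the accumulated roundoff grows far more slowly than the deterministic bound, so they budget statistically and monitor the observed roundoff error instead. The small check below only verifies the arithmetic of the bound itself:

```cpp
#include <cmath>
#include <cassert>

// Worst-case bit requirement for one product coefficient, from the
// paragraph above: 2K + log2(N) bits for K-bit words and N words.
double worst_case_bits(double k_bits, double n_words) {
    return 2.0 * k_bits + std::log2(n_words);
}
```

For K = 19 and N = 2^23.75 this gives 61.75 bits, above the 53-bit binary64 significand, confirming the post's arithmetic; for K = 19 and N = 2^10 it gives 48 bits, comfortably inside it, which is why small transforms are unconditionally safe at that word size.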
#29
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7,823 Posts
There's something very unusual about that bits-vs-length table. It seems to me it gives for DP about what might be expected for SP. Some examples that use a DP FFT:

CUDALucas v2.06, 17.6 bits/word at 18M FFT length, on a GTX1060 GPU:
Code:
|  Date    Time    |  Test Num    Iter       Residue          |   FFT    Error    ms/It    Time   |     ETA       Done  |
| Jul 18  00:17:41 | M332197331  112030000  0x88fc24530de05291 | 18432K  0.33594  36.0679  360.67s | 91:21:46:05  33.72% |

Another run of the same format at 2304K FFT length:
Code:
Starting M43158617 fft length = 2304K
|  Date    Time    |  Test Num   Iter     Residue           |   FFT   Error    ms/It   Time  |    ETA      Done |
| Jul 24  20:20:44 | M43158617  20000  0x6c847491e15e5282 | 2304K  0.28906  4.0205  80.41s | 2:00:10:42  0.04% |

Gpuowl on a Radeon VII:
Code:
2021-07-18 02:02:47 asr2/radeonvii 999299999 OK 103300000 10.34% 0af6d6e23d1b882d 13969 us/it + check 7.43s + save 3.78s; ETA 144d 20:45 17 errors 2

Mlucas V17, 17.6 bits/word at 18M FFT length, one core of a Xeon E5645:
Code:
M332220523: using FFT length 18432K = 18874368 8-byte floats.
this gives an average 17.601676676008438 bits per digit
Using complex FFT radices 36 16 16 32 32
[Jul 22 13:31:39] M332220523 Iter# = 10000 [ 0.00% complete] clocks = 02:16:04.515 [816.4516 msec/iter] Res64: 1A313D709BFA6663. AvgMaxErr = 0.171224865. MaxErr = 0.250000000.
#30
Jul 2021
21₁₆ Posts
Quote:
#31
Jun 2003
1558₁₆ Posts
To the OP: instead of implementing your own FFT (probably too inefficient), or comparing against special-purpose software like P95 (too specialized), try using a high-performance general FFT library like FFTW as your FFT baseline. You could also use OpenPFGW, which supports testing arbitrary numbers using a generic FFT (rather than special-form numbers using IBDWT). It shares the same underlying FFT routines with P95, so it is very high performance.

Having said that, 3-4 bits per FFT limb sounds too conservative; you might be able to get away with more bits. Also, are you using a balanced-digit representation? That will allow you to put many more bits per limb.
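A balanced-digit representation, as suggested here, keeps digits in [-B/2, B/2) instead of [0, B), so coefficients are signed and centred on zero, which roughly halves their magnitude and tames error growth in a DP FFT. A minimal conversion sketch (hypothetical helpers, not taken from any program named in the thread; `from_balanced` assumes the value fits a signed 64-bit accumulator):

```cpp
#include <vector>
#include <cstdint>
#include <cassert>

// Convert x into balanced digits in [-base/2, base/2), least significant
// first. Folding a digit into the negative half pushes a carry upward.
std::vector<int64_t> to_balanced(uint64_t x, int64_t base) {
    std::vector<int64_t> digits;
    int64_t carry = 0;
    while (x != 0 || carry != 0) {
        int64_t d = int64_t(x % uint64_t(base)) + carry;
        x /= uint64_t(base);
        carry = 0;
        if (d >= base / 2) { d -= base; carry = 1; }  // fold into [-B/2, B/2)
        digits.push_back(d);
    }
    return digits;
}

// Evaluate the balanced-digit polynomial at `base` (Horner, most
// significant digit first).
uint64_t from_balanced(const std::vector<int64_t>& digits, int64_t base) {
    int64_t v = 0;  // a real program would guard against overflow here
    for (auto it = digits.rbegin(); it != digits.rend(); ++it)
        v = v * base + *it;
    return uint64_t(v);
}
```

For example, 99 in balanced base 10 becomes digits (-1, 0, 1), i.e. 100 - 1, with every digit's magnitude at most base/2.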
#32
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7,823 Posts
Quote:
See the paragraph starting "Maximum usable" in https://www.mersenneforum.org/showpo...65&postcount=6. IEEE 754 binary64 has a 53-bit significand (1 bit implicit) for normalized values.
#33
Jul 2021
41₈ Posts
I implemented both FFT and NTT as part of a self-education process; I didn't know about them before and had always wanted to learn and implement them. Only as a byproduct did I find that the FFT came out slower, and only because the FFT used 4 bits per word while the NTT used almost 64. So my initial purpose wasn't to invent some cool, fast library faster than Prime95, and I didn't know in advance that the FFT would turn out slower.

Please read my previous post above: do you think my formulas are incorrect? Using those formulas I can't see how 18-19-bit words can be used in an FFT with the 52-bit mantissa of IEEE double-64.

I compared my FFT and NTT because the two are about the same: the same level of optimization, almost the same skeleton of the algorithm, the same loops, almost the same formulas. Both use only multiplications, no divisions (the NTT manages this thanks to Barrett and Montgomery reductions). And I can see clearly that the only reason my FFT is 10x slower than my NTT is that the NTT uses almost-64-bit words while the FFT uses 4-bit words; hence their Fourier table sizes differ 16x, hence the slowdown.

But I have already found one speedup: 8-bit words can be used instead of 4-bit, because 8-bit words give a 0.4 error, while before I was targeting a 0.1 error. If I can target a 0.4 error, then of course I can use 8-bit words and get a 2x speedup.
Similar Threads

| Thread | Thread Starter | Forum | Replies | Last Post |
| PRP on gpu is faster that on cpu | indomit | Information & Answers | 4 | 2020-10-07 10:50 |
| faster than LL? | paulunderwood | Miscellaneous Math | 13 | 2016-08-02 00:05 |
| My CPU is getting faster and faster ;-) | lidocorc | Software | 2 | 2008-11-08 09:26 |
| Faster way to do LLT? | 1260 | Miscellaneous Math | 23 | 2005-09-04 07:12 |
| Faster than LL? | clowns789 | Miscellaneous Math | 3 | 2004-05-27 23:39 |