#12
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7,823 Posts
Quote:
And/or look for InterimResidues and OutputIterations in prime.txt. Please provide the actual specs of your laptop (CPU model number, at minimum), so that timings are meaningful as benchmarks.

Last fiddled with by kriesel on 2021-07-17 at 20:17
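For reference, a minimal prime.txt fragment showing where those options would go (my example; the values are only illustrative, and undoc.txt is the authoritative description of both settings):

```
InterimResidues=500000
OutputIterations=10000
```

As I understand them, InterimResidues controls how often an interim residue is reported and OutputIterations how often a progress line is written, both counted in iterations.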
#13 |
P90 years forever!
Aug 2002
Yeehaw, FL
17×487 Posts
#14 |
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest
7,823 Posts
You tested two custom code paths against a third custom code path? For what inputs and outputs, specifically?

Last fiddled with by kriesel on 2021-07-17 at 21:35
#15 |
Sep 2016
2²×5×19 Posts
I'll mention that ProtoNTT is just a prototype. y-cruncher's actual NTT is completely different implementation-wise (and faster).

Computationally, FFTs are much faster than NTTs, but the NTT wins the memory-bandwidth game. Given that the CPU/memory gap keeps increasing, one can argue that NTTs would win in the long term.

Let's say we give NTTs the performance win now. That's still not enough: LL tests can't easily use an NTT because the IBDWT needs both sufficiently deep roots of unity and roots of two, something that is hard to come by in a modular arithmetic ring.

Last fiddled with by Mysticial on 2021-07-17 at 23:11
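To make that dual requirement concrete, here is a small sketch of how one might check a candidate modulus (my own illustration, not code from y-cruncher or Prime95; the prime and transform length are arbitrary examples):

```python
# Check whether a prime modulus p supports a length-n IBDWT-style transform:
# it needs n-th roots of unity AND an n-th root of 2.
# Criterion used: for prime p and gcd(a, p) = 1, x^n = a (mod p) is solvable
# iff a^((p-1)/gcd(n, p-1)) = 1 (mod p).
from math import gcd

def nth_roots_of_unity_exist(n: int, p: int) -> bool:
    # A primitive n-th root of unity exists in GF(p) iff n divides p - 1.
    return (p - 1) % n == 0

def nth_root_exists(a: int, n: int, p: int) -> bool:
    # Euler-style criterion for a being an n-th power residue mod prime p.
    return pow(a, (p - 1) // gcd(n, p - 1), p) == 1

p = (1 << 64) - (1 << 32) + 1   # example NTT-friendly prime: 2^64 - 2^32 + 1
n = 1 << 24                     # example transform length

print("n-th roots of unity exist:", nth_roots_of_unity_exist(n, p))
print("n-th root of 2 exists:    ", nth_root_exists(2, n, p))
```

Whether both checks pass depends on how the factorization of p - 1 and the multiplicative order of 2 line up with the required transform length, which is exactly the constraint described above.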
#16 |
If I May
"Chris Halsall"
Sep 2002
Barbados
2×11²×47 Posts
#17
Jul 2021
3×11 Posts
Quote:
I also expect y-cruncher's version of the NTT, which I used as well, to be quite well tested. But the point is a different one: not to prove to myself that the algorithm gives correct multiplication results, but just to compare speeds, i.e. the relative speed of FFT against NTT.
#18
Jul 2021
3·11 Posts
Quote:
In the NTT I used unsigned 64-bit words and primes just below 64 bits. These words were multiplied as 64×64-bit to 128-bit products, Montgomery and Barrett reduction were used everywhere to make things much faster, and the twiddle factors (powers of the root of unity W) were precomputed into a table.

If you're asking about the FFT, I did the following (a sketch of this pipeline follows below):

1) I split the input large numbers into N-bit words, where N was chosen so that the final measured maximal error stays around 0.07 across tests with random inputs; a mapping table relates the large-number size to N. For very large numbers, like 2^25 bytes, N was quite small, around 3-4 bits. Probably because of this my FFT was really slow: instead of large input words I used very small ones with only a few bits. But right now I don't know a better FFT approach than splitting into small N-bit words; otherwise a long-float (128-bit or 256-bit) multiplication would be needed, which would probably be much slower.

2) After splitting, I did a forward FFT of the numbers A and B, then a pointwise multiplication, and finally an inverse FFT of the result.

3) I measured the maximal error to check it was within the desired bounds and did the carry propagation over the rounded (double-sized) integer values. After that I joined the words into the final result.

4) The FFT algorithm was also checked against naive multiplication to confirm it gives correct results, for various input sizes, but within an affordable running time (minutes) for the slow naive algorithm.
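As a concrete reference for the pipeline above, here is a minimal sketch in Python (my reconstruction of the described steps, not moytrage's actual code; the word size and test values are purely illustrative):

```python
# Split into small b-bit words, forward FFT, pointwise multiply, inverse FFT,
# round, then propagate carries (the same steps as in the post above).
import numpy as np

def fft_multiply(a: int, b: int, word_bits: int = 8) -> int:
    mask = (1 << word_bits) - 1

    def to_words(x: int) -> list:
        # step 1: split into word_bits-sized digits, least significant first
        w = []
        while x:
            w.append(x & mask)
            x >>= word_bits
        return w or [0]

    wa, wb = to_words(a), to_words(b)
    n = 1
    while n < len(wa) + len(wb):            # transform length: next power of two
        n <<= 1

    # step 2: forward transforms, pointwise product, inverse transform
    fa = np.fft.fft(np.array(wa, dtype=np.float64), n)
    fb = np.fft.fft(np.array(wb, dtype=np.float64), n)
    digits = np.rint(np.fft.ifft(fa * fb).real)   # step 3: round to integers

    # carry propagation and reassembly of the final integer
    result, carry = 0, 0
    for i, d in enumerate(digits):
        carry += int(d)
        result += (carry & mask) << (word_bits * i)
        carry >>= word_bits
    return result + (carry << (word_bits * len(digits)))

# step 4: spot-check against exact big-integer multiplication
x, y = 3 ** 2000, 7 ** 1700
assert fft_multiply(x, y) == x * y
```

The rounding step is where the "maximal error" from step 3 would be measured: it is the largest distance of any convolution output from the nearest integer.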
#19
Jul 2021
3·11 Posts
Quote:
Which means that Prime95's multiplication is much faster than my NTT version: Prime95 takes 0.34 seconds and my NTT takes 9.8 seconds. So I have to take back my initial words that "NTT might be faster".

Right now I can't really understand the reason why Prime95 is so much faster. Maybe it's because Prime95 uses the IBDWT (Irrational Base Discrete Weighted Transform), which I don't know the details of, or because the IBDWT is quite different from a plain FFT. Or I did something wrong in my FFT algorithm that made it slow (I described my FFT approach in the previous post). Or maybe Prime95 uses very aggressive assembly-level optimizations with SIMD instructions, which I didn't do.

PS. I have a mobile (laptop) CPU, an Intel Core i7-2630QM, at a base frequency of around 2 GHz, with 8 hardware threads (4 cores) and first-generation AVX support. (Also, because my CPU usually overheats, it throttles down to about 1 GHz almost all of the time.)

Last fiddled with by moytrage on 2021-07-18 at 04:24
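Since the IBDWT comes up here, a short sketch of the Crandall-Fagin bookkeeping it is based on may help (my illustration of the published technique, not Prime95's code; the exponent and transform length are toy values): for squaring modulo 2^q - 1 with a length-N transform, the digits get a variable number of bits and a fractional power-of-two weight, so the reduction modulo 2^q - 1 falls out of the cyclic convolution for free.

```python
# Crandall-Fagin IBDWT bookkeeping for a toy case: variable digit sizes and
# fractional power-of-two weights for squaring mod 2^q - 1 with an N-point transform.
from math import ceil

q, N = 89, 8      # toy Mersenne exponent and transform length (illustrative only)

bits    = [ceil(q * (j + 1) / N) - ceil(q * j / N) for j in range(N)]
weights = [2.0 ** (ceil(q * j / N) - q * j / N) for j in range(N)]

print("bits per word:", bits, " total =", sum(bits))   # digit sizes sum to exactly q
print("weights:", [round(w, 4) for w in weights])
```

Together with a balanced digit representation and careful error analysis, this is part of why an IBDWT-based double-precision FFT can carry far more bits per word than the fixed 3-4-bit split described earlier in the thread.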
#20 |
Jul 2021
3·11 Posts
As you're the author of ProtoNTT, can you please comment on how to implement fast FFT multiplication? If you've read my last few posts above, you'll know that I split the input large numbers into N-bit words for the FFT input, where N is quite small (around 3-4 bits) for very large numbers (2^25 bytes).

This 3-bit splitting made the FFT really slow: the NTT uses 64-bit words while the FFT uses just 3-bit words, and 3 bits is about 21 times smaller than 64 bits. The 3-bit splitting was necessary to keep the maximal FFT error (around 0.1) from going out of bounds. Can you suggest, in short, how to implement the FFT differently so that it can use full 64-bit input words without losing precision (and exceeding the error bound)? Also, I didn't do long 128-bit or 256-bit float multiplication inside the FFT; I just used regular double multiplication with SIMD instructions like _mm256_mul_pd (doc is here https://software.intel.com/sites/lan...pd&expand=4894).

Last fiddled with by moytrage on 2021-07-18 at 04:35
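For a rough sense of what word sizes plain double-precision arithmetic permits, here is a back-of-envelope sketch (my own estimate, not from anyone in this thread): with L words of b bits each, a convolution digit can reach about L·(2^b - 1)^2, and that worst case has to stay below the 53-bit double mantissa.

```python
# Conservative estimate of the largest word size b for which the worst-case
# convolution digit, about L * (2^b - 1)^2 for L words, still fits exactly in a
# 53-bit double mantissa.  This ignores FFT round-off (the 1-bit headroom is a
# crude allowance for it); balanced digits and probabilistic error analysis let
# real implementations use larger b.
TOTAL_BITS    = 1 << 28        # a 2^25-byte input, as in the post above
MANTISSA_BITS = 53             # IEEE-754 double precision

def max_word_bits(total_bits: int, headroom_bits: int = 1) -> int:
    b = 1
    while True:
        words = -(-total_bits // (b + 1))                       # ceil division
        if words * ((1 << (b + 1)) - 1) ** 2 >= 1 << (MANTISSA_BITS - headroom_bits):
            return b
        b += 1

print("conservative max bits per word:", max_word_bits(TOTAL_BITS))
```

Even this crude bound comes out well above 3-4 bits at that input size, which suggests the small N in the mapping table is costing more than double precision strictly requires.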
#21
∂²ω=0
Sep 2002
República de California
2²·2,939 Posts
#22
P90 years forever!
Aug 2002
Yeehaw, FL
17×487 Posts
Quote:
What you need to do is dig deeper into why your particular NTT or FFT implementations are faster or slower. That is no easy task.

Quote:

I suspect (given the 29x speed difference) the cause is memory management. Fast implementations load up the L2 and L3 caches with data, do as much work as possible, then write that back to slow memory. Assuming an FFT size of 2^24, one possible FFT/NTT implementation is: read all data, butterfly, write all data, repeat 24 times; do the same for the inverse FFT. That would be 48 R/Ws of memory. Prime95 does 2 R/Ws, a 24x difference. If my assumption is correct, memory accesses are swamping the CPU costs. IOW, you are not able to meaningfully compare the two algorithms using your existing code, although you might be able to for much smaller inputs.

BTW, I applaud your work flow. Write code, test ideas, ask questions when faced with results that fly in the face of conventional wisdom, no ego, learn, iterate. You'd be surprised how many visitors we get making grand proclamations and unwilling to listen to reasoned rebuttals.
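To put rough numbers on the memory-traffic argument, here is a small back-of-envelope sketch (mine, with an assumed element size and DRAM bandwidth, not measurements of Prime95 or of moytrage's code):

```python
# Approximate DRAM traffic for a pass-per-butterfly-level transform versus a
# cache-blocked one.  Element size and bandwidth are assumptions for illustration.
FFT_SIZE      = 1 << 24       # transform length from the post above
ELEMENT_BYTES = 8             # e.g. one double per element (assumption)
BANDWIDTH     = 20e9          # ~20 GB/s, a laptop-class figure (assumption)

def traffic_gib(read_write_passes: int) -> float:
    # each pass reads the whole array once and writes it back once
    return read_write_passes * 2 * FFT_SIZE * ELEMENT_BYTES / 2**30

for passes in (48, 2):        # 48 naive passes (forward + inverse) vs 2 blocked passes
    gib = traffic_gib(passes)
    print(f"{passes:2d} passes: {gib:6.2f} GiB moved, "
          f"~{gib * 2**30 / BANDWIDTH:.2f} s at the assumed bandwidth")
```

The absolute numbers are only illustrative; the point is that the pass count multiplies directly into DRAM traffic, and at these sizes that traffic is already significant on laptop memory bandwidth.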