mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Software

Reply
 
Thread Tools
Old 2021-07-17, 19:51   #12
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

5,821 Posts
Default

Quote:
Originally Posted by moytrage View Post
Started your Prime95 speed test, all as you suggested. But already running for 15 minutes and still there is no print-out in worker's window.

Can you tell approximately how often print out to worker window is done for exponent of this size at 1 thread (1 core of 1Ghz)?
RT(Fine)M. (The previously mentioned documentation files that come with the program.)
And/or look for InterimResidues
OutputIterations
in prime.txt

Please provide the actual specs of your laptop (CPU model number, at minimum), so that timings are meaningful as benchmarks.

Last fiddled with by kriesel on 2021-07-17 at 20:17
kriesel is online now   Reply With Quote
Old 2021-07-17, 20:01   #13
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

2×3,821 Posts
Default

Quote:
Originally Posted by kriesel View Post
RTFM.
That's a bit rude.
Prime95 is offline   Reply With Quote
Old 2021-07-17, 21:35   #14
kriesel
 
kriesel's Avatar
 
"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

132758 Posts
Default

Quote:
Originally Posted by moytrage View Post
tested that both algorithms gave correct results, by doing naive multiplication.
You tested two custom code paths, against a third custom code path? For what inputs, outputs, specifically?

Last fiddled with by kriesel on 2021-07-17 at 21:35
kriesel is online now   Reply With Quote
Old 2021-07-17, 23:10   #15
Mysticial
 
Mysticial's Avatar
 
Sep 2016

23·43 Posts
Default

I'll mention that ProtoNTT is just a prototype. y-cruncher's actual NTT is completely different implementation-wise. (and faster)

Computationally, FFTs are much faster than NTT, but NTT wins the memory bandwidth game. Given that the CPU/memory gap keeps increasing, we can argue that NTTs would win in the long term.

Let's say we give NTTs the performance win now. That's not enough. LL tests can't easily use NTT because the IBDWT needs both sufficiently deep roots-of-unity and roots-of-two - something that's hard to come by in the modular arithmetic ring.

Last fiddled with by Mysticial on 2021-07-17 at 23:11
Mysticial is offline   Reply With Quote
Old 2021-07-18, 00:42   #16
chalsall
If I May
 
chalsall's Avatar
 
"Chris Halsall"
Sep 2002
Barbados

132·59 Posts
Default

Quote:
Originally Posted by Prime95 View Post
That's a bit rude.
Empirically, the MF doesn't deal with "Snowflakes" well.

Or, perhaps put another way, Snowflakes need not enter the debate.

I think we've already covered this. Repeatedly, and sometimes with some levity... 9-)
chalsall is online now   Reply With Quote
Old 2021-07-18, 03:54   #17
moytrage
 
Jul 2021

418 Posts
Default

Quote:
Originally Posted by kriesel View Post
You tested two custom code paths, against a third custom code path? For what inputs, outputs, specifically?
I tested both versions of NTT multiplications on different large random numbers. Compared multiplication results against reference implementation of multiplication using simplest naive school-grade algorithm of multiplication. Of cause to test this correctness I used not to very large numbers because school algorithm is not so fast to handle huge numbers.

Also I expect y-cruncher's version of NTT that I also used to be quite well tested.

But the point is in different thing - not to proof for myself that algorithm gives correct multiplication results, but just to compare speeds, relative speed of FFT against NTT.
moytrage is offline   Reply With Quote
Old 2021-07-18, 04:11   #18
moytrage
 
Jul 2021

418 Posts
Default

Quote:
Originally Posted by kriesel View Post
It's not clear to me how you avoid issues with carry propagation or loss, apparently using no guard bits. Perhaps I misunderstood your reference to 225 byte operands (32MB each) in 222 8-byte words each.
Are you referring to FFT or NTT when saying about carry propagation or loss? Because in NTT there is no loss, it is precise algorithm. I used Chinese Reminder Theorem (CRT) at final step to restore full precision values. Due to usage of this CRT one can support arbitrary large precision of multiplied words if needed.

In NTT I used UInt-64 words and also almost 64 bit prime numbers. These words where multiplied as 64x64 bit to 128 bit. Also used Montgomery and Barrett reduction everywhere to make things much faster. And precomputed table of twiddling factors (powers of root of union W).

If you're speaking about FFT then I did following things:

1) I split input large numbers into N-bit words. "N" was chosen such that final measured maximal error is around 0.07 for different tests with random inputs. So mapping table was created to map between large number size and "N". For very large numbers like 2^25 bytes N was quite small, around 3-4 bits. Probably because of this my FFT was really slow, because instead of using large input words I used quite small with few bits. But right now I don't know other good FFT solution besides splitting into small N-bit words. Otherwise long-float (128-bit or 256-bit float) multiplication should be done which will be probably much slower.

2) After splitting I did forward FFT transform of A and B numbers, then did pairwise multiplication, and finally inverse FFT transform of result.

3) Measured maximal error to be withing desired bounds and did carry-over to propagate DOUBLE-sized integers. After that joined words to make final result.

4) FFT algorithm was also tested with naive multiplication to give correct results. Tested for different sizes of input numbers but within affordable time (minutes) of running slow naive algorithm.
moytrage is offline   Reply With Quote
Old 2021-07-18, 04:24   #19
moytrage
 
Jul 2021

3×11 Posts
Default

Quote:
Originally Posted by Viliam Furik View Post
you can use Prime95 with this worktodo line........ Let it run until you get a few print-outs in the worker window, and tell us the average of the numbers it says after "ms/iter:". 1 iteration is a little bit more than 1 multiplication.
If everything I did correctly to test Prime95 multiplication then I got printed-out 1 iteration (1 multiplication) at a speed of around 340 ms/iter.

Which means that Prime95's multiplitcation is much faster compared to my NTT version. So Prime95's is 0.34 seconds and my (NTT) is 9.8 seconds.

So I have to take my initial words back "that NTT might be faster".

Right now I can't understand really the reason why Prime95 is much faster. Maybe because Prime95 uses IBDWT (Irrational Base Discrete Weighted Transform) which I don't know details about. Or mabye because IBDWT is quite different compared to FFT. Or I did something wrong in my FFT algorithm that made it slow (I described my FFT approach in previous post). Or maybe Prime95 used very aggressive ASM level optimizations with SIMD instructions which I didn't do.

PS. I have mobile (laptop) CPU "Intel Core i7 2630QM" at base frequencey of around 2GHz with 8 hardware threads (4 cores) and AVX-1 support. (also because my CPU is usually overheated it slows down to 1GHz almost all of the time).

Last fiddled with by moytrage on 2021-07-18 at 04:24
moytrage is offline   Reply With Quote
Old 2021-07-18, 04:33   #20
moytrage
 
Jul 2021

2116 Posts
Default

Quote:
Originally Posted by Mysticial View Post
Computationally, FFTs are much faster than NTT...
As you're the author of ProtoNTT, can you please comment on how to implement fast FFT multiplication? If you've read my few last posts above then you will know that I used splitting input large numbers into N-bit words for the input of FFT, where N is quite small (around 3-4 bits) for very large (2^25 bytes) numbers.

This 3-bit splitting made FFT really slow, because NTT uses 64-bit words, and FFT just 3-bit words. 3-bit is 21x times smaller than 64-bit. This 3-bit words splitting where necessary by FFT not to make overflow of maximal error (around 0.1).

Can you suggest in short how else differently to implement FFT to use full 64-bit input words and not to loose precision (and have out of bounds error).

Also I didn't do long 128 or 256 bit float multiplication inside FFT. I just used regular DOUBLE multiplication using SIMD instructions like _mm256_mul_pd (doc is here https://software.intel.com/sites/lan...pd&expand=4894).

Last fiddled with by moytrage on 2021-07-18 at 04:35
moytrage is offline   Reply With Quote
Old 2021-07-18, 04:36   #21
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
Rep├║blica de California

2·73·17 Posts
Default

Quote:
Originally Posted by moytrage View Post
If everything I did correctly to test Prime95 multiplication then I got printed-out 1 iteration (1 multiplication) at a speed of around 340 ms/iter.

Which means that Prime95's multiplitcation is much faster compared to my NTT version. So Prime95's is 0.34 seconds and my (NTT) is 9.8 seconds.

So I have to take my initial words back "that NTT might be faster".

Right now I can't understand really the reason why Prime95 is much faster. Maybe because Prime95 uses IBDWT (Irrational Base Discrete Weighted Transform) which I don't know details about.
http://www.faginfamily.net/barry/Pap...Transforms.pdf

Quote:
Or mabye because IBDWT is quite different compared to FFT. Or I did something wrong in my FFT algorithm that made it slow (I described my FFT approach in previous post). Or maybe Prime95 used very aggressive ASM level optimizations with SIMD instructions which I didn't do.
IBDWT leverages the mathematical structure of the FFT (or suitably chosen NTT) to do an "implicit mod" modulo M(p) - that means no more zero-padding the FFT to double the width, That saves 2-3x right there. A huge amount of further deep optimization work, including, yes, aggressive use of SIMD, can easily account for another 10x, so there you are,
ewmayer is offline   Reply With Quote
Old 2021-07-18, 04:50   #22
Prime95
P90 years forever!
 
Prime95's Avatar
 
Aug 2002
Yeehaw, FL

1DDA16 Posts
Default

Quote:
Originally Posted by moytrage View Post
So I have to take my initial words back "that NTT might be faster".
That is jumping too quickly to a conclusion in the other direction.

What you need to do is dig deeper into why your particular NTT or FFT implementations are faster or slower. That is no easy task.

Quote:
Right now I can't understand really the reason why Prime95 is much faster. Maybe because Prime95 uses IBDWT (Irrational Base Discrete Weighted Transform) which I don't know details about. Or mabye because IBDWT is quite different compared to FFT. Or I did something wrong in my FFT algorithm that made it slow
IBDWT allows prime95 to use half-size FFTs compared to general purpose multiplication. So a factor of 2.

I suspect (given the 29x speed difference) the cause is memory management. Fast implementations load up the L2 and L3 caches with data, do as much work as possible, then write that back to slow memory. Assuming an FFT size of 2^24, one possible FFT/NTT implementation is read all data, butterfly, write all data, repeat 24 times. Do the same for the inverse FFT. That would be 48 R/Ws of memory. Prime95 does 2 R/Ws -- a 24x difference.

If my assumption is correct, memory accesses are swamping the CPU costs. IOW, you are not able to meaningfully compare the two algorithms using your existing code. Although you might be able to for much smaller inputs.

BTW, I applaud your work flow. Write code, test ideas, ask questions when faced with results that fly in the face of conventional wisdom, no ego, learn, iterate. You'd be surprised how many visitors we get making grand proclamations and unwilling to listen to reasoned rebuttals.
Prime95 is offline   Reply With Quote
Reply

Thread Tools


Similar Threads
Thread Thread Starter Forum Replies Last Post
PRP on gpu is faster that on cpu indomit Information & Answers 4 2020-10-07 10:50
faster than LL? paulunderwood Miscellaneous Math 13 2016-08-02 00:05
My CPU is getting faster and faster ;-) lidocorc Software 2 2008-11-08 09:26
Faster way to do LLT? 1260 Miscellaneous Math 23 2005-09-04 07:12
Faster than LL? clowns789 Miscellaneous Math 3 2004-05-27 23:39

All times are UTC. The time now is 21:54.


Thu Oct 28 21:54:31 UTC 2021 up 97 days, 16:23, 0 users, load averages: 0.96, 1.25, 1.33

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2021, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.