mersenneforum.org  

Go Back   mersenneforum.org > Great Internet Mersenne Prime Search > Software

Reply
 
Thread Tools
Old 2017-10-09, 21:03   #23
Mysticial
 
Mysticial's Avatar
 
Sep 2016

22×5×19 Posts
Default

Quote:
Originally Posted by preda View Post
I have this question about the NTT performance:

For mersenne multiplication, would the NTT size need to be twice the size of the corresponding floating-point FFT, because the NTT can't make use of the "irrational base" trick (which is based on a pre-weighting with a floating-point weight)?

If there is no equivalent trick to the "irrational base weighting" for the NTT to halve the transform size, then I think it's hard for NTT to compete.

(I'm interested in NTT because it would avoid the double-precision penalty of the GPUs).
The irrational base weighting can be done on NTTs as long as there are sufficient modular roots-of-two. (and I believe you also need a root-of-M for all your non-power-of-two M length factors) But your intuition is right to be suspicious.

In the normal (real number) modular space, you will not find any 32-bit primes that have both roots-of-unity and roots-of-two of sufficient depth. (I've tested every single prime below 2^32 and found none with both roots-of-unity and roots-of-two of useful depth.)

From that experiment, I hypothesize that the greatest depth you can get seems to be about O(sqrt(N)) where N is the size of the ring. This is actually somewhat expected if you assume that there is no correlation between primes with deep roots-of-unit and primes with deep roots-of-two.

In the real number modular space, N = p - 1. This limits your transform length to about 2^16 for 32-bit primes which is useless. I don't remember what the result of my test was, but I think 2^14 was the best you could do.

Alternative 1: 64-bit Primes

Going to 64-bit primes will let you do transform lengths of around 2^32. But I don't know of a systematic way to search for them. The only search I've done to date is to brute-force all primes of the form p = m * 2^32 + 1 and found only 5 that also had roots-of-two of depth 2^28 or more. (This took around 24 hours on a 4-core Skylake using Mathematica.)

Unfortunately, those 5 are not that useful. They're all greater than 2^63 so you can't efficiently use Shoup's mulmod algorithm. If someone were to pursue this step, they'd have to broaden the search to p = m * 2^28 + 1 (or even smaller than 2^28) to hopefully find several of them with similar depth in quadratic residuals.

In any case, using 63-bit primes is going to be slow for reasons mentioned in earlier posts.

Alternative 2: Gaussian Primes

If we go into the Gaussian space (complex mod arithmetic), the ring size increases to N = p^2 - 1. This means that for a Gaussian prime of size about 2^32, we can expect to find some with both roots-of-unity and roots-of-two on the order of 2^32. 2^32 is enough to cover the current range of LL sizes.

It's easy to find a primitive root to get a generator for a deep root-of-unity. But I'm not sure how to find quadratic residuals. Ernst says they exist in sufficient depth for Mersenne primes, but I don't know how to actually find them. (At least I haven't tried very hard and Mathematica doesn't seem to have a quadratic residual function for Gaussian integers.)
Mysticial is offline   Reply With Quote
Old 2017-10-10, 02:22   #24
jasonp
Tribal Bullet
 
jasonp's Avatar
 
Oct 2004

5·23·31 Posts
Default

I'm not completely sure, but the use of Gaussian integers for arithmetic mod p^2 for p=M61 should just be a shorthand for dealing with integers modulo p^2. So maybe you can find square roots modulo p^2 by means of computing the square root mod p and then one step of Hensel lifting, and then make the top 61 bits the real and the bottom 61 bits the imaginary part of a Gaussian integer.
jasonp is offline   Reply With Quote
Old 2017-10-10, 05:57   #25
ATH
Einyen
 
ATH's Avatar
 
Dec 2003
Denmark

22·863 Posts
Default

Quote:
Originally Posted by Mysticial View Post
Alternative 1: 64-bit Primes

Going to 64-bit primes will let you do transform lengths of around 2^32. But I don't know of a systematic way to search for them. The only search I've done to date is to brute-force all primes of the form p = m * 2^32 + 1 and found only 5 that also had roots-of-two of depth 2^28 or more. (This took around 24 hours on a 4-core Skylake using Mathematica.)

Unfortunately, those 5 are not that useful. They're all greater than 2^63 so you can't efficiently use Shoup's mulmod algorithm. If someone were to pursue this step, they'd have to broaden the search to p = m * 2^28 + 1 (or even smaller than 2^28) to hopefully find several of them with similar depth in quadratic residuals.

In any case, using 63-bit primes is going to be slow for reasons mentioned in earlier posts.
I can see what you mean by deep roots-of-two. For example for your p=2411682483 * 2^32 + 1:
Ordp2 = 12862306576 . So 2^12862306576 = 1 (mod p).

But what do you mean with deep roots-of-unity ? I assume you mean:
Root of unity modulo n

So x^k = 1 (mod p) again but this time x!=2. So does "deep" mean k is large again? or is x large?

Last fiddled with by ATH on 2017-10-10 at 05:58
ATH is offline   Reply With Quote
Old 2017-10-10, 06:25   #26
Mysticial
 
Mysticial's Avatar
 
Sep 2016

5748 Posts
Default

Quote:
Originally Posted by ATH View Post
I can see what you mean by deep roots-of-two. For example for your p=2411682483 * 2^32 + 1:
Ordp2 = 12862306576 . So 2^12862306576 = 1 (mod p).

But what do you mean with deep roots-of-unity ? I assume you mean:
Root of unity modulo n

So x^k = 1 (mod p) again but this time x!=2. So does "deep" mean k is large again? or is x large?

The roots-of-unity that are needed for the twiddle factors. Whereas the roots-of-two are needed for the weights for the IBDWT.

So yes, by "depth", I mean a very large k.

Code:
p = 2411682483*2^32 + 1 = 10358097392821075969

//  Root-of-Unity of depth k = 32
6176428888740961571^(2^31) ≡ -1 mod p
6176428888740961571^(2^32) ≡ 1 mod p

//  Root-of-Two of depth k = 28
36411718325718249^(2^28) ≡ 2 mod p
So this prime allows an NTT transform length of 2^32. But the maximum IBDWT transform length is only 2^28.

Last fiddled with by Mysticial on 2017-10-10 at 06:32
Mysticial is offline   Reply With Quote
Old 2017-10-10, 07:30   #27
ewmayer
2ω=0
 
ewmayer's Avatar
 
Sep 2002
República de California

267548 Posts
Default

Just some brief comments, very busy with my own code, but did spend some quality time over the weekend revisiting some basic issues related to finding primitive roots of maximal order (i.e. order q^2-1, where q = 2^p-1) for FGTs modulo various M-primes. For p=31, it appears the primitive root of form (by analogy to the convenient one for p=61) n+I having the smallest n is 12+I. Power-of-2 DFTs can take advantage that roots of power-of-2 orders have convenient special form: square roots = +1,-1, 4th roots = +1,+I,-1,-I, and 8th roots = w^j (j=0,...,7) with w = (1+I)*2^(p-1).

Quote:
Originally Posted by Mysticial View Post
I did some calculations for the DP + M31 method. (I've edited this multiple times to fix several calculation errors.)
...
Code:
Word Size = 64 + 32 = 96 bits = 12 bytes
Usable Bits / word = 53 + 31 = 84 bits
The most bits of data we can fit into each word is about 32 bits.

For transform length 2^24:
Code:
32 * 2 + 24/2 = 76 bit coefficients (8 bits of  safety)
(However, we may lose more bits in order to properly resolve the combining step.)
...
Since word-size is half for the M31 NTT, the amount of work done is doubled. Once we combine that with the cost of the radix 4 DP FFT we get: 170 / 2 + 24 = 109 AVX512 uops.
I;m not sure I follow your useful-bits logic here ... back when I first seriosuly played with the hybrid FFT+FGT(M61) stuff I only did power-of-2 transform lengths and the compiler (this was back in my Fortran-90 days, using the Dec f90 compiler for Alpha and its intrsinsic for inlining the UMULH hardware instruction) was doing some really stupid sh*t in terms of taking my beautifully interleaved double-float/mod-m61 DFT code and actively segregating the 2 kinds of instructions (i.e. collecting all the float-double stuff, then following that with the segregated modular ops, thus defeating the aim of keeping both kinds of functional units busy), but I *did* get a fully-working IBDWT-based hybrid LL tester out of it - it was simply too slow relative to just using a pure-FFT and longer transofrm length to be useful. The resulting permissible input size was more or less exactly what my roundoff-error model predicted, namely that for a hybrid FFT/FGT(M(p)) we gain p/2 bits in maximal allowable input wordsize. Thus if pure-FFT at the given transform length allows B bits per input word, combining that with an FGT over M61 adds 30.5 bits - and indeed, I was able to use right around 50 bits per input word at a 1Melement transform length, compared to just over 19 bits per word for FFT-only at the same length.

For FGT over M31 we gain 15.5 BPW, so again using a !M FFT @19 BPW as the floating-point baseline:

FFT+FGT(M61): (19+30.5) BPW, 2.6x more than FFT alone;
FFT+FGT(M31): (19+15.5) BPW, 1.8x more than FFT alone,

I.e. using M61 gains ~1.5x the BPW of M31, while roughly doubling the modular-arithmetic workload.

Quote:
Originally Posted by Mysticial View Post
The irrational base weighting can be done on NTTs as long as there are sufficient modular roots-of-two. (and I believe you also need a root-of-M for all your non-power-of-two M length factors) But your intuition is right to be suspicious.
This is where the FGT is truly sweet - please (re)visit my original hybrid-transform posts and search for "Roots of 2 turn out to be easy". For (real) transform length N the IBDWT needs fractional powers 2^(j/N) with j=0,...,N-1, which are distinct in terms of the FFT's arithmetic over the reals; in contrast for FGT over M(p) the same fractional-order roots of 2 all collapse to the smaller set of p distinct ones 2^j (j=0,...,p-1) defined over the field, i.e. your (say) 3701/2^20-th floating-point roots of 2 in the floating transform turns into a circular shift with shift count in [0,p) for the FGT-based IBDWT, and the same shift count will be shared by every (p)th IBDWT weight of the full set of N such weights.
ewmayer is offline   Reply With Quote
Old 2017-10-10, 15:45   #28
Mysticial
 
Mysticial's Avatar
 
Sep 2016

22×5×19 Posts
Default

Quote:
Originally Posted by ewmayer View Post
I;m not sure I follow your useful-bits logic here ... back when I first seriosuly played with the hybrid FFT+FGT(M61) stuff I only did power-of-2 transform lengths and the compiler (this was back in my Fortran-90 days, using the Dec f90 compiler for Alpha and its intrsinsic for inlining the UMULH hardware instruction) was doing some really stupid sh*t in terms of taking my beautifully interleaved double-float/mod-m61 DFT code and actively segregating the 2 kinds of instructions (i.e. collecting all the float-double stuff, then following that with the segregated modular ops, thus defeating the aim of keeping both kinds of functional units busy), but I *did* get a fully-working IBDWT-based hybrid LL tester out of it - it was simply too slow relative to just using a pure-FFT and longer transofrm length to be useful. The resulting permissible input size was more or less exactly what my roundoff-error model predicted, namely that for a hybrid FFT/FGT(M(p)) we gain p/2 bits in maximal allowable input wordsize. Thus if pure-FFT at the given transform length allows B bits per input word, combining that with an FGT over M61 adds 30.5 bits - and indeed, I was able to use right around 50 bits per input word at a 1Melement transform length, compared to just over 19 bits per word for FFT-only at the same length.

For FGT over M31 we gain 15.5 BPW, so again using a !M FFT @19 BPW as the floating-point baseline:

FFT+FGT(M61): (19+30.5) BPW, 2.6x more than FFT alone;
FFT+FGT(M31): (19+15.5) BPW, 1.8x more than FFT alone,

I.e. using M61 gains ~1.5x the BPW of M31, while roughly doubling the modular-arithmetic workload.
I don't think we're in disagreement. My (rough) estimate was using a slightly larger safety margin to arrive at 32 BPW (bits per word) for FFT + M31. You're arriving at 34.5 BPW. So they're close.

I'm not sure exactly how much safety is needed since I don't have an implementation to test empirically.

Quote:
This is where the FGT is truly sweet - please (re)visit my original hybrid-transform posts and search for "Roots of 2 turn out to be easy". For (real) transform length N the IBDWT needs fractional powers 2^(j/N) with j=0,...,N-1, which are distinct in terms of the FFT's arithmetic over the reals; in contrast for FGT over M(p) the same fractional-order roots of 2 all collapse to the smaller set of p distinct ones 2^j (j=0,...,p-1) defined over the field, i.e. your (say) 3701/2^20-th floating-point roots of 2 in the floating transform turns into a circular shift with shift count in [0,p) for the FGT-based IBDWT, and the same shift count will be shared by every (p)th IBDWT weight of the full set of N such weights.
Interesting now that I re-read it. I'll need to try it out later. And to see if and how that parameterization can be used to directly find primes with deep quadratic residuals.


-------


I noticed another error in my M31 NTT pseudo-code. I left off the reductions needed for the multiply-mods. Fixing that increases the cost to:
  • DP FFT: 24 / 17 = 1.41 AVX512 uops / bit
  • DD FFT: 226 / 43 = 5.26 AVX512 uops / bit
  • DP+M31: 121 / 32 = 3.78 AVX512 uops / bit

But it's still cheaper than DD FFT at this point.
Mysticial is offline   Reply With Quote
Old 2017-10-11, 02:56   #29
ATH
Einyen
 
ATH's Avatar
 
Dec 2003
Denmark

65748 Posts
Default

Quote:
Originally Posted by Mysticial View Post
The roots-of-unity that are needed for the twiddle factors. Whereas the roots-of-two are needed for the weights for the IBDWT.

So yes, by "depth", I mean a very large k.

Code:
p = 2411682483*2^32 + 1 = 10358097392821075969

//  Root-of-Unity of depth k = 32
6176428888740961571^(2^31) ≡ -1 mod p
6176428888740961571^(2^32) ≡ 1 mod p

//  Root-of-Two of depth k = 28
36411718325718249^(2^28) ≡ 2 mod p
So this prime allows an NTT transform length of 2^32. But the maximum IBDWT transform length is only 2^28.
Thanks, so I completely misunderstood the first part, I haven't really worked with modular square roots, so it is not the first thing that comes to mind.

For each square root there are 2 solutions r1 and r2=p-r1, so after 28 square roots there are possibly 2^28 different deep roots. Will all these different roots always exist to the same "depth" or do you have to search them all?
I assume the depth is the same because I just tested your root-of-two with a recursive function, and that was the case.

Last fiddled with by ATH on 2017-10-11 at 03:07
ATH is offline   Reply With Quote
Old 2017-10-11, 04:17   #30
Mysticial
 
Mysticial's Avatar
 
Sep 2016

22·5·19 Posts
Default

Quote:
Originally Posted by ATH View Post
Thanks, so I completely misunderstood the first part, I haven't really worked with modular square roots, so it is not the first thing that comes to mind.

For each square root there are 2 solutions r1 and r2=p-r1, so after 28 square roots there are possibly 2^28 different deep roots. Will all these different roots always exist to the same "depth" or do you have to search them all?
I assume the depth is the same because I just tested your root-of-two with a recursive function, and that was the case.
Yes, there will be 2^28 different deep roots. Given any root-of-two of depth k, you can generate another one of the same depth by multiplying it by a root-of-unity of the same depth or less.

Which brings up a new point that I hadn't realized before. If there exists a root-of-two of depth k, then there must also exist a root-of-unity of depth k.

This means that for a given prime number p, the depth of the quadratic residuals can never exceed the depth of roots-of-unity. This should also apply to all bases instead of just base 2.

------

EDIT: I'm not so sure actually. I don't see a requirement that a prime with a root-of-two of depth k has 2^k distinct roots-of-two. And if such a prime exists, then its root-of-unity depth will be less than the root-of-two depth.

Last fiddled with by Mysticial on 2017-10-11 at 04:36
Mysticial is offline   Reply With Quote
Old 2017-10-11, 10:04   #31
ATH
Einyen
 
ATH's Avatar
 
Dec 2003
Denmark

22×863 Posts
Default

But the important thing is that the maximum "depth" is the same for all of them? You do not have to search the entire "tree" to find one with highest depth?
ATH is offline   Reply With Quote
Old 2017-10-11, 16:09   #32
Mysticial
 
Mysticial's Avatar
 
Sep 2016

22×5×19 Posts
Default

Quote:
Originally Posted by ATH View Post
But the important thing is that the maximum "depth" is the same for all of them? You do not have to search the entire "tree" to find one with highest depth?
I believe the answer is yes. But I'm becoming less and less sure the more you ask since I'm in no way an expert in the field.

The reason why I think the answer is yes is because once you find any root-of-two (R) of depth (say 2^28), you will automatically be able to generate the remaining (2^28-1) others of the same depth by multiplying them by roots-of-unity. (assuming those exist at the same depth)

These 2^28 roots-of-two must also be unique. Since p is prime and a multiplicative inverse always exists, you will be able to divide them by R to re-obtain the roots-of-unity.

So those 2^28 roots-of-two of depth k=28 will need to be the leaf nodes of the binary "tree" that you're talking about. Therefore, the tree must be completely balanced down to that point. So walking the tree down any path will guarantee that you reach a leaf node holding a root-of-two of depth k=28.
Mysticial is offline   Reply With Quote
Old 2017-10-11, 17:37   #33
R. Gerbicz
 
R. Gerbicz's Avatar
 
"Robert Gerbicz"
Oct 2005
Hungary

66916 Posts
Default

Quote:
Originally Posted by Mysticial View Post
While there exist 64-bit primes with deep roots-of-unity (of depth up to 2^57), the ones that also had deep roots of two were all the down in the 2^28-ish range. Which is about the probabilty of a "random" prime having a deep root-of-two.

My long (unoptimized) search resulted in these primes that could potentially be used to do IBDWT of length 2^28 and a max convolution size of 2^34 bits:
  1. 2411682483 * 2^32 + 1
  2. 3318192165 * 2^32 + 1
  3. 3343034547 * 2^32 + 1
  4. 3348317433 * 2^32 + 1
  5. 3635415415 * 2^32 + 1
Smaller primes with large depth:
Code:
depth=28;x=188393705187326142;p=569381225639182337;
depth=29;x=1393396111665502800;p=4588742434128658433;
in these examples x^(2^depth)==2 mod p, and p-1 has a large 2 factor (2^30 and 2^31).
R. Gerbicz is offline   Reply With Quote
Reply



Similar Threads
Thread Thread Starter Forum Replies Last Post
Double stars fivemack Astronomy 5 2017-05-14 08:50
x.265 half the size, double the computation; so if you double again? 1/4th? jasong jasong 7 2015-08-17 10:56
Double Check Unregistered Information & Answers 3 2011-10-01 04:38
Double the area, Double the volume. Uncwilly Puzzles 8 2006-07-03 16:02
Double Check P-1 PhilF PrimeNet 6 2005-07-03 14:36

All times are UTC. The time now is 13:30.


Fri Jul 7 13:30:41 UTC 2023 up 323 days, 10:59, 0 users, load averages: 1.18, 1.26, 1.21

Powered by vBulletin® Version 3.8.11
Copyright ©2000 - 2023, Jelsoft Enterprises Ltd.

This forum has received and complied with 0 (zero) government requests for information.

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation.
A copy of the license is included in the FAQ.

≠ ± ∓ ÷ × · − √ ‰ ⊗ ⊕ ⊖ ⊘ ⊙ ≤ ≥ ≦ ≧ ≨ ≩ ≺ ≻ ≼ ≽ ⊏ ⊐ ⊑ ⊒ ² ³ °
∠ ∟ ° ≅ ~ ‖ ⟂ ⫛
≡ ≜ ≈ ∝ ∞ ≪ ≫ ⌊⌋ ⌈⌉ ∘ ∏ ∐ ∑ ∧ ∨ ∩ ∪ ⨀ ⊕ ⊗ 𝖕 𝖖 𝖗 ⊲ ⊳
∅ ∖ ∁ ↦ ↣ ∩ ∪ ⊆ ⊂ ⊄ ⊊ ⊇ ⊃ ⊅ ⊋ ⊖ ∈ ∉ ∋ ∌ ℕ ℤ ℚ ℝ ℂ ℵ ℶ ℷ ℸ 𝓟
¬ ∨ ∧ ⊕ → ← ⇒ ⇐ ⇔ ∀ ∃ ∄ ∴ ∵ ⊤ ⊥ ⊢ ⊨ ⫤ ⊣ … ⋯ ⋮ ⋰ ⋱
∫ ∬ ∭ ∮ ∯ ∰ ∇ ∆ δ ∂ ℱ ℒ ℓ
𝛢𝛼 𝛣𝛽 𝛤𝛾 𝛥𝛿 𝛦𝜀𝜖 𝛧𝜁 𝛨𝜂 𝛩𝜃𝜗 𝛪𝜄 𝛫𝜅 𝛬𝜆 𝛭𝜇 𝛮𝜈 𝛯𝜉 𝛰𝜊 𝛱𝜋 𝛲𝜌 𝛴𝜎𝜍 𝛵𝜏 𝛶𝜐 𝛷𝜙𝜑 𝛸𝜒 𝛹𝜓 𝛺𝜔