mersenneforum.org  

Old 2017-10-26, 06:58   #45
preda
 
NTT, my first steps.

Thank you all for the many great contributions in this thread!

I started to experiment with an OpenCL NTT FFT in GpuOwl, using GF(p^2) with p = M31 = 2^31 - 1 -- i.e. "integer complex numbers modulo M31" -- using pointers from Ernst's excellent writeup, among others.

I chose the "complex" NTT structure because:
- weighting and unweighting are done by shifts, without precomputed tables;
- the modulo-M31 reduction is done with bit operations only (no multiplications).
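
To make the second bullet concrete, here is a minimal C sketch of the shift-and-mask reduction mod M31 (illustrative names, not GpuOwl's actual code):

```c
#include <stdint.h>

/* p = 2^31 - 1 (M31). Reduce a product of two 31-bit residues mod p using
 * only shifts, masks and adds -- no division, no precomputed tables. */
static const uint32_t M31 = 0x7fffffffu;

static uint32_t reduce31(uint64_t x) {
  x = (x & M31) + (x >> 31);   /* fold high bits down: 2^31 == 1 (mod p) */
  x = (x & M31) + (x >> 31);   /* second fold brings x to at most p + 1  */
  return x >= M31 ? (uint32_t) (x - M31) : (uint32_t) x;
}

/* Multiplication in Z/M31, built on the reduction; a, b in [0, p). */
static uint32_t mul31(uint32_t a, uint32_t b) {
  return reduce31((uint64_t) a * b);
}
```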

A nice side-benefit of using GF(p^2) is that most of the code is shared between the floating-point FFT and the NTT FFT -- because the FFT abstract structure is common, only the basic operations such as add/sub and mul are different.

At this point I have the basic weighted convolution working. (With unbalanced words, at FFT size 2^23, I observe about 4.5 useful bits/word.)

I'm now looking to combine the double-precision FFT with the GF(M31^2) FFT.

Ernst, I'd like to ask for some information about how you do the splitting and re-combining of the bits between the two parallel FFTs.

In particular, how do you split the global "bits per word" between the two transforms? Do they both do independent carry propagation? Do they have to combine and re-split the bits after each iteration?

Does the NTT in dual mode benefit from using balanced words? (my intuition says no..?)

thanks!

Last fiddled with by preda on 2017-10-26 at 07:01 Reason: spelling
Old 2017-10-26, 17:02   #46
jasonp

Not Ernst, but...

There is no carry propagation with the FGT; you produce convolution results modulo M31. If the inputs start in balanced format, then you will have to reduce the possibly negative numbers mod M31 before doing the FGT. After the DWT convolution, you need to combine the FFT results, whose top bits are correct, with the mod-M31 results to get the correct bottom bits. That leads to full-precision convolution results, which then go through carry propagation and conversion back to balanced form.
Old 2017-10-27, 00:51   #47
ewmayer

@preda: JasonP has it correct; here are my additional notes:

[1] The modular transform uses strictly nonnegative-digit normalization, so 'balanced digit' refers only to the FFT data;

[2] When I played with this stuff nearly 20 years ago, while developing the carry code I started with a run and transform length such that there was no issue of needing the FGT data to error-correct the FFT convolution outputs, i.e. the latter were small enough that each, when 'unbalanced' to a nonnegative-digit normalization, was expected to exactly match the corresponding FGT output.

[3] Once the carry step is in place, one proceeds by gradually cranking up the average number of bits per transform input word, so that one can see a sufficient number of matching bits in the carry-step debug data to ensure the code is working properly.

[4] Once everything is working right, you should be able to use bits-per-input values large enough that there really is very little bit overlap between the two sets of outputs, i.e. it is similar to the number of 'guard bits' one sees at the bottom of one's convolution outputs (carry-step inputs) in an FFT-only implementation. I seem to recall that the allowable error level in the FFT outputs in a side-by-side-transform setting was effectively an NTT-modulus-scaled analog of the error levels one deals with in FFT-only code: For Mersenne prime q = 2^p-1 and FGT(q^2) done next to an FFT, if the FFT-only breakpoints are based on a max. ROE per output word of (say) 0.25, then for the hybrid-transform code one's FFT outputs may have a max ROE of around 0.25*2^p, corresponding to a worst-case overlap of a mere ~2 bits.

Alas, my old FGT-mod Fortran-90 code was a victim of last month's HD failure on my Macbook, and, idiotically, I had not stashed a copy of said code in my ftp-server old-code archive. But I distinctly recall the mixed-FFT-and-FGT carry subroutine being quite ugly, in no small part due to f90's awkward bitwise-ops instruction set. In any case I would have been unlikely to use it as a reference in my own future C-and-SIMD reimplementation of such a hybrid transform: it has been so long since I coded in f90 that it would likely be quicker to just work through the various issues in the carry step from scratch, in C.
Old 2017-10-29, 01:42   #48
preda
 

Thank you for the carry explanation for the dual FFT. I haven't tried it yet, though.

OTOH I did implement the FGT(M61), and my initial measurements on the GPU indicate it's about 80% slower than the double-precision FFT of the same size.

I guess part of the blame is on the compiler, which for some reason does a poor job optimizing the integer code. Part of the blame may be on the expensive multiplication (wide 32x32->64): one 61x61 multiplication requires four 32x32->64 muls, and this seems to be significantly more expensive than one DP mul (which is a bit surprising to me).
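
For reference, here is what a 61x61 mulmod looks like in C when a 64x64->128 multiply is available (a sketch, not GpuOwl's code); on GCN the 128-bit product below is exactly what has to be assembled from four 32x32->64 partial products:

```c
#include <stdint.h>

static const uint64_t M61 = 0x1fffffffffffffffULL;  /* 2^61 - 1 */

/* Multiply two residues mod M61; a, b in [0, M61]. */
static uint64_t mul61(uint64_t a, uint64_t b) {
  unsigned __int128 p = (unsigned __int128) a * b;  /* full 122-bit product */
  uint64_t lo = (uint64_t) p & M61;                 /* bits 0..60            */
  uint64_t hi = (uint64_t) (p >> 61);               /* 2^61 == 1 (mod M61)   */
  uint64_t r  = lo + hi;                            /* may still exceed M61  */
  return r >= M61 ? r - M61 : r;
}
```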

Anyway, the conclusion is that unfortunately I didn't get the hoped-for big gains from FGT(M61).

(BTW, the data memory load of FGT(M61) is a bit less than that of the DP FFT of the same size, because the weighting/unweighting is done without weight tables. BUT this is offset by the much larger code size (also a memory load), which I again blame on the compiler.)
Old 2017-10-29, 13:20   #49
preda
 
preda's Avatar
 
"Mihai Preda"
Apr 2015

101101011002 Posts
Default

My experiments also seem to indicate that FGT(M31) has about the same speed as a DP FFT of the same size (while a priori I would have expected it to be about twice as fast as the DP).

At this point, IMO the only thing that might improve on a pure-DP FFT would be a dual DP + FGT(M61) (both of half size), which would offer about 25% more bits than a single full-size DP.
Old 2017-10-30, 02:28   #50
jasonp

In case it is of use, I released this code for generating complex power-of-2 size FGTs mod M61 way back when.

What about an M61 and M31 FGT? On the one hand, both can use the same abstract code for the core arithmetic as long as both use the same radices, and you would do two approximately half-size transforms. On the other hand, the convolution results require a 2-prime CRT reconstruction, but that's much more straightforward than the DP+FGT equivalent.
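
A sketch of that 2-prime CRT reconstruction in C (illustrative names; the inverse constant is derived in the comments):

```c
#include <stdint.h>

/* Recover x (mod M31*M61) from r31 = x mod M31 and r61 = x mod M61:
 *   x = r61 + M61 * t,  where  t = (r31 - r61) * inv61 mod M31
 * and inv61 = M61^{-1} mod M31. Since 2^31 == 1 (mod M31), we have
 * M61 = 2^61 - 1 == 2^30 - 1 (mod M31), and 2*(2^30 - 1) == -1 (mod M31),
 * hence inv61 = -2 mod M31 = M31 - 2. */
static const uint64_t M31   = 0x7fffffffULL;
static const uint64_t M61   = 0x1fffffffffffffffULL;
static const uint64_t INV61 = 0x7ffffffdULL;          /* M31 - 2 */

static unsigned __int128 crt2(uint64_t r31, uint64_t r61) {
  uint64_t d = (r31 + M31 - r61 % M31) % M31;  /* (r31 - r61) mod M31       */
  uint64_t t = (d * INV61) % M31;              /* fits: d, INV61 < 2^31     */
  return (unsigned __int128) t * M61 + r61;    /* result < M31 * M61        */
}
```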

Last fiddled with by jasonp on 2017-10-30 at 02:40
Old 2017-10-30, 19:16   #51
Mysticial
 

Quote:
Originally Posted by preda
My experiments also seem to indicate that FGT(M31) has about the same speed as a DP FFT of the same size (while a priori I would have expected it to be about twice as fast as the DP).

At this point, IMO the only thing that might improve things compared to a pure-DP FFT would be a dual DP + FGT(M61) (both of half size), which would offer about 25% more bits compared to a single DP of full size.
How are you implementing it? I don't see how 32-bit arithmetic would be anywhere near as slow as 64-bit, especially since there's native support for 32x32 -> 64-bit multiply in SSE2 and SSE4.1.
Old 2021-04-18, 18:12   #52
preda
 

Quote:
Originally Posted by Mysticial
Perhaps the word "full order" isn't the right term to use. Unlike for roots-of-unity, the orders of roots-of-two are not necessarily inclusive of all the lower orders. And it looks like the elements in each order will be mutually exclusive of all other orders iff 2 is not a root-of-two of itself.

This is not the case for Mersenne primes. And for M31: 2, 4, 16, 256, and 65536 together cover all orders to infinity. But the # of elements in each order is the # of distinct roots-of-unity of that order.

So there are 2^32 different roots-of-two of order 2^123456789 for M31 over the complex field.
So, does this mean that p = M31 has all the required roots-of-two for the IBDWT in a Z/pZ NTT? And is NTT(M31) then a viable alternative to the FGT?

Or is the problem with M31 that it doesn't have the required roots-of-unity?

Last fiddled with by preda on 2021-04-18 at 18:46
Old 2021-04-23, 06:51   #53
preda
 

Thank you for the explanation of Shoup's mul-mod.
In GCN (AMD GPUs), there is a 32-bit mul_hi instruction, but there is no 64-bit mul_hi. "Emulating" the 64-bit mul_hi is slow, almost as slow as the 64-bit mul-wide (64x64->128), i.e. about twice the expected cost.

So, on GCN, Shoup's trick with B = 2^64 is not great, because of the lack of a 64-bit mul_hi. I wonder what can be done in that case, when mul_hi has (almost) the same cost as a mul-wide.

(B = 2^32 is too low for our intended use.)

Quote:
Originally Posted by Mysticial
It's the same thing. I think I'm the only person who uses the term "Shoup's algorithm" when in reality it's "just" a method (albeit a very powerful one) to perform a multiply-mod.

To compute: c = a * b mod p

Code:
B = the word size (such as 2^32 or 2^64)

a ∈ [0, B)
b ∈ [0, p)
p ∈ [1, B/2)  //  Can be any number. Doesn't need to be prime.

//  Precompute:
b' = floor(b * B / p)

//  Shoup's multiply-mod
q = floor(a * b' / B);   //  B x B integer multiply keeping the top half.
t0 = (a * b) mod B;      //  B x B integer multiply keeping the bottom half.
t1 = (q * p) mod B;      //  B x B integer multiply keeping the bottom half.
c = (t0 - t1) mod B;     //  Integer subtract ignoring overflow.

//  Output
c ≡ a * b mod p
c ∈ [0, 2p)
So the output is either fully reduced, or reduced + p. To get it back into the range: c ∈ [0, p) just do a compare and conditional-move.

Operation Count:
  • 1 upper-half integer multiply
  • 2 lower-half integer multiplies
  • 1 integer subtract
  • 1 integer compare
  • 1 conditional move

The one catch is that you need to be able to precompute b' to be efficient. This is possible for NTTs since b will be a twiddle factor. Otherwise, b' can be computed on the fly with a variant of this trick. But that makes the entire multiply-mod operation about 3x more expensive.
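
For what it's worth, the quoted pseudocode translates to runnable C with B = 2^64 roughly as follows (a sketch; the 128-bit temporaries stand in for the mul_hi/mul_lo instructions discussed above):

```c
#include <stdint.h>

typedef unsigned __int128 u128;

/* Precompute b' = floor(b * 2^64 / p) for a fixed b (e.g. a twiddle factor).
 * Requires p < 2^63, per the quoted description. */
static uint64_t shoup_precompute(uint64_t b, uint64_t p) {
  return (uint64_t) (((u128) b << 64) / p);
}

/* c = a * b mod p, given the precomputed bp = b'. Output in [0, p). */
static uint64_t shoup_mulmod(uint64_t a, uint64_t b, uint64_t bp, uint64_t p) {
  uint64_t q  = (uint64_t) (((u128) a * bp) >> 64);  /* mul_hi: top of a*b'   */
  uint64_t t0 = a * b;                               /* mul_lo: bottom of a*b */
  uint64_t t1 = q * p;                               /* mul_lo: bottom of q*p */
  uint64_t c  = t0 - t1;                             /* wraps mod 2^64; [0,2p) */
  return c >= p ? c - p : c;                         /* final compare + subtract */
}
```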
