#45

"Mihai Preda"
Apr 2015
Thank you all for the many great contributions in this thread!

I started to experiment with an OpenCL NTT FFT in GpuOwl, using GF(p^2) with p = M(31) = 2^31-1 -- i.e. "integer complex numbers modulo M31" -- following pointers from Ernst's excellent writeup, et al. I chose the "complex" NTT structure because:
- weighting and unweighting are done by shifts, without precomputed tables;
- modulo-M31 reduction is done with bit operations only (without multiplications).

A nice side benefit of using GF(p^2) is that most of the code is shared between the floating-point FFT and the NTT -- the abstract FFT structure is common; only the basic operations such as add/sub and mul differ.

At this point I have the basic weighted convolution working (with unbalanced words; for FFT size 2^23 I observe about 4.5 useful bits/word). I'm now looking to combine the double-precision FFT with the GF(M31^2) FFT.

Ernst, I'd like to ask for some information about how you do the splitting and recombining of the bits between the two parallel FFTs. In particular:
- How is the global "bits per word" split among the two transforms?
- Do they both do independent carry propagation, or do they have to combine and re-split the bits after each iteration?
- Does the NTT in dual mode benefit from using balanced words? (My intuition says no..?)

Thanks!

Last fiddled with by preda on 2017-10-26 at 07:01 Reason: spelling
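For concreteness, here is a minimal sketch of the bit-operations-only mod-M31 reduction (plain C rather than OpenCL; the names are illustrative, not GpuOwl's):

```c
#include <stdint.h>

#define M31 ((1u << 31) - 1)  /* the Mersenne prime 2^31 - 1 */

/* Reduce x < 2^62 (e.g. a 31x31-bit product) modulo M31 using only
   shifts, masks and adds -- no multiplication or division. Each fold
   uses 2^31 == 1 (mod M31) to add the high bits onto the low bits. */
static uint32_t mod_m31(uint64_t x) {
    x = (x & M31) + (x >> 31);  /* now x < 2^32 */
    x = (x & M31) + (x >> 31);  /* now x <= 2^31 */
    x = (x & M31) + (x >> 31);  /* now x <= M31 */
    return x == M31 ? 0 : (uint32_t)x;
}
```

The final compare maps the redundant representative M31 back to 0, so results are always canonical in [0, M31).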
#46

Tribal Bullet
Oct 2004
Not Ernst, but...

There is no carry propagation with the FGT: it produces convolution results modulo M31. If the inputs start in balanced format, then you will have to reduce the possibly negative numbers mod M31 before doing the FGT. After the DWT convolution, you need to combine the FFT results, whose top bits are correct, with the mod-M31 results to get the correct bottom bits. That yields full-precision convolution results, which then go through carry propagation and conversion back to balanced form.
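A hedged sketch of that recombination step (my own naming; it assumes the rounded FFT output's per-word error is well under M31/2, so the mod-M31 residue pins down the exact value):

```c
#include <stdint.h>

#define M31 ((1u << 31) - 1)  /* 2^31 - 1 */

/* x_fft: rounded FFT convolution output (top bits correct, bottom bits
   possibly off by a small error). g: the exact residue mod M31 from the
   FGT. Returns the corrected full-precision convolution output. */
static int64_t combine(int64_t x_fft, uint32_t g) {
    int64_t d = ((int64_t)g - x_fft) % M31;  /* correction, mod M31 */
    /* center the correction into (-M31/2, M31/2] */
    if (d >  (int64_t)M31 / 2) d -= M31;
    if (d < -(int64_t)M31 / 2) d += M31;
    return x_fft + d;  /* now (result mod M31) == g */
}
```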
#47

∂2ω=0
Sep 2002
República de California
@preda: JasonP has it correct; here are my additional notes:

[1] The modular transform uses strictly nonnegative-digit normalization, so 'balanced digit' refers only to the FFT data.

[2] When I played with this stuff nearly 20 years ago, in developing the carry code I started by using a run and transform length such that there was no issue of needing the FGT data to error-correct the FFT convolution outputs, i.e. the latter were small enough that each, when 'unbalanced' to a nonnegative-digit normalization, was expected to exactly match the corresponding FGT output.

[3] Once the carry step is in place, one proceeds by gradually cranking up the average number of bits per transform input word, so that one can see a sufficient number of matching bits in the carry-step debug data to ensure the code is working properly.

[4] Once everything is working right, you should be able to use bits-per-input values large enough that there really is very little bit overlap between the 2 sets of outputs -- the overlap is similar to the number of 'guard bits' one sees at the bottom of one's convolution outputs (carry-step inputs) in an FFT-only implementation.

I seem to recall that the allowable error level in the FFT outputs in a side-by-side-transform setting was effectively an NTT-modulus-scaled analog of the error levels one deals with in FFT-only code: for Mersenne prime q = 2^p-1 and FGT(q^2) done next to an FFT, if the FFT-only breakpoints are based on a max ROE per output word of (say) 0.25, then for the hybrid-transform code one's FFT outputs may have a max ROE of around 0.25*2^p, corresponding to a worst-case overlap of a mere ~2 bits.

Alas, my old FGT-mod Fortran-90 code was a victim of last month's HD failure on my Macbook, and, idiotically, I had not stashed a copy of said code in my ftp-server old-code archive. But I distinctly recall the mixed-FFT-and-FGT carry subroutine being quite ugly, in no small part due to f90's awkward bitwise-ops instruction set, and I would have been unlikely to use it as a reference in my own future C-and-SIMD reimplementation of such a hybrid transform: it has been so long since I coded in f90 that it would likely be quicker to just work through the various issues involved in the carry step from scratch, in C.
#48

"Mihai Preda"
Apr 2015
Thank you for the carry explanation in the dual-FFT setup. I haven't tried it yet, though.

OTOH, I did implement the FGT(M61), and my initial measurements on GPU indicate it's about 80% slower than the double-precision FFT of the same size. I guess part of the blame is on the compiler, which does a poor job optimizing the integer code for some reason. Part of the blame may be on the expensive multiplication (wide 32x32->64): one 61x61 multiplication requires 4 muls of 32x32->64, and this seems to be significantly more expensive than one DP mul (which is a bit surprising to me).

Anyway, the conclusion is that, unfortunately, I didn't get the hoped-for big gains from FGT(M61).

(BTW, the data memory load of FGT(M61) is a bit less than that of the DP FFT of the same size, because the weighting/unweighting is done without weight tables. But this is compensated by the much larger code size (also a memory load), again the compiler's fault.)
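For reference, a sketch of one multiplication mod M61 in plain C (illustrative names; the `__int128` product below is exactly what decomposes into the four 32x32->64 partial products on hardware without a full 64-bit multiplier):

```c
#include <stdint.h>

#define M61 ((1ULL << 61) - 1)  /* the Mersenne prime 2^61 - 1 */

/* Multiply a, b < M61 modulo M61. The 122-bit product is reduced by
   folding the high 61 bits onto the low 61 bits (2^61 == 1 mod M61). */
static uint64_t mul_m61(uint64_t a, uint64_t b) {
    unsigned __int128 p = (unsigned __int128)a * b;  /* up to 122 bits */
    uint64_t lo = (uint64_t)p & M61;   /* low 61 bits */
    uint64_t hi = (uint64_t)(p >> 61); /* high 61 bits */
    uint64_t r = lo + hi;              /* < 2^62 */
    r = (r & M61) + (r >> 61);         /* <= M61 */
    return r == M61 ? 0 : r;
}
```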
#49

"Mihai Preda"
Apr 2015
My experiments also seem to indicate that FGT(M31) has about the same speed as a DP FFT of the same size (while a priori I would have expected it to be about twice as fast as the DP).

At this point, IMO the only thing that might improve on a pure-DP FFT would be a dual DP + FGT(M61) (both of half size), which would offer about 25% more bits compared to a single DP transform of full size.
#50

Tribal Bullet
Oct 2004
In case it is of use, I released this code for generating complex power-of-2 size FGTs mod M61 way back when.

What about an M61 and M31 FGT? On the one hand, both can use the same abstract code to generate the core arithmetic as long as both use the same radices, plus you would do two approximately half-size transforms. On the other hand, the convolution results require 2-prime CRT reconstruction, but that's much more straightforward than the DP+FGT equivalent.

Last fiddled with by jasonp on 2017-10-30 at 02:40
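A sketch of that 2-prime CRT reconstruction (my own naming; it uses the fact that M61 ≡ 2^30-1 (mod M31), whose inverse mod M31 is M31-2, since 2*(2^30-1) ≡ -1 (mod M31)):

```c
#include <stdint.h>

#define M31 ((1u << 31) - 1)    /* 2^31 - 1 */
#define M61 ((1ULL << 61) - 1)  /* 2^61 - 1 */

/* Given r31 = x mod M31 and r61 = x mod M61, recover x mod (M31*M61). */
static unsigned __int128 crt_m31_m61(uint32_t r31, uint64_t r61) {
    /* fold r61 at bit 31 (2^31 == 1 mod M31) to get r61 mod M31 */
    uint32_t r61_mod31 = (uint32_t)(((r61 & M31) + (r61 >> 31)) % M31);
    /* t = (r31 - r61) * inv(M61) mod M31, with inv(M61) = M31 - 2 */
    uint64_t t = (((uint64_t)r31 + M31 - r61_mod31) * (M31 - 2)) % M31;
    return (unsigned __int128)M61 * t + r61;  /* < M31*M61 */
}
```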
#51

Sep 2016

Quote:
#52

"Mihai Preda"
Apr 2015
Quote:

Or is the problem with M31 that it doesn't have the required roots of unity?

Last fiddled with by preda on 2021-04-18 at 18:46
#53

"Mihai Preda"
Apr 2015
Thank you for the explanation of Shoup's mul-mod.

In GCN (AMD GPUs) there is a 32-bit mul_hi instruction, but there is no 64-bit mul_hi. "Emulating" the 64-bit mul_hi is slow -- almost as slow as the 64-bit wide mul (64x64->128), i.e. twice the expected cost. So, in GCN, it's not great to use Shoup's trick with B=2^64, because of the lack of a 64-bit mul_hi. I wonder what can be done in that case -- when mul_hi has (almost) the same cost as a wide mul. (B=2^32 is too low for our intended use.)
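For readers following along, Shoup's mul-mod with B = 2^64 looks roughly like this in plain C (illustrative names; the `>> 64` of the 128-bit product is the 64-bit mul_hi in question; requires w < p and p < 2^63):

```c
#include <stdint.h>

#define M61 ((1ULL << 61) - 1)  /* example modulus, 2^61 - 1 */

/* Precompute wp = floor(w * 2^64 / p) once per fixed multiplier w. */
static uint64_t shoup_precompute(uint64_t w, uint64_t p) {
    return (uint64_t)(((unsigned __int128)w << 64) / p);
}

/* x*w mod p with one mul_hi and two low muls -- no division.
   q underestimates floor(x*w/p) by at most 1, so r = x*w - q*p < 2p. */
static uint64_t shoup_mulmod(uint64_t x, uint64_t w, uint64_t wp, uint64_t p) {
    uint64_t q = (uint64_t)(((unsigned __int128)x * wp) >> 64); /* mul_hi */
    uint64_t r = x * w - q * p;  /* both products taken mod 2^64 */
    return r >= p ? r - p : r;   /* single conditional correction */
}
```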