 mersenneforum.org > Math Musings of someone learning
 Register FAQ Search Today's Posts Mark Forums Read  2016-10-12, 00:31 #1 airsquirrels   "David" Jul 2015 Ohio 11×47 Posts Musings of someone learning I am working through grasping the FFT based multiplication process in enough detail to hopefully contribute some optimizations to i.e. clLucas. I expect to say a lot of incorrect or silly things as I try to wrap my head around this in detail, help is appreciated. For some reason my brain is struggling more with this than many programming optimization problems I've worked on in the past. LL test basics = start with x=4, do x=(x^2 - 2) mod M for p-2 iterations, if x=zero you win As I understand it, there are five steps here for the inner loop b=log2(M) n = FFT size, which is roughly b/(useful bits of a double). 1. Convert our X to a polynomial P. Zero pad so it has 2n coefficients. Set complex parts to zero What is the base of the polynomial? 2. Run an FFT on P. Because we are squaring we only need P because P=Q 3. Square the 2n resulting values. These are complex, so we are squaring the complex number? (a + bi)^2 = a^2 + 2abi + bi^2 ? And we do that for all 2n? 4. Run IFFT on the 2n values and get back coefficients in base B 5. Some values might be bigger than B, do carry, round to integers, and subtract 2 while we are at it. Mod M is easy because of the base? I've read we get the mod for free but I haven't exactly pinpointed where. Is this the step? Rinse wash and repeat. Am I on the right track? Bonus Question: If I wanted to cube a huge number, could I just cube the values in step 3? Last fiddled with by airsquirrels on 2016-10-12 at 00:32   2016-10-12, 01:30 #2 Prime95 P90 years forever!   Aug 2002 Yeehaw, FL 1DFD16 Posts You are on the right track. Very good. As you might expect, there are optimizations available. The biggest is Richard Crandall's Irrational Base Discrete Weighted Transform. This saves us the padding with zeros (halving the size of an FFT) and gives us the modulo 2^P-1 for free. The other biggie is Hermetian symmetry. When doing a forward FFT of n values, you can discard half of the complex values. That is, the end result of the FFT is (mostly) complex values of the form a + bi and a - bi -- no need to remember both complex values. On the bonus question, yes you can cube the transformed values. Alas, the more arithmetic you do on transformed values, the fewer bits you can stuff in each FFT word at the beginning -- no free lunch here.   2016-10-13, 03:09 #3 airsquirrels   "David" Jul 2015 Ohio 11·47 Posts So I've been reading Crandall's paper.... http://www.ams.org/journals/mcom/199...-1185244-1.pdf Working through the example of section six, a simple squaring of 78314567209 mod (2^37-1) using an FFT of size 4.... 1. Choose a digit representation... The example computes bit sizes {10, 9, 9, 9} which leads to digits {553, 93, 381, 291} and further weights {1, 2.75, 2.5, 2.25}. This part is all laid out in the paper and I follow. 2. Compute X = DWT(N, a)x and Y is the same as X because we are squaring. Based on the initial section, DWT(4, a)x is equal to a DFT(4)ax where ax is component wise multiplication with the signal weights a. This leads to {553 + 0i, 2730.75 + 0i, 952.5 + 0i, 654.75 + 0i} as the complex input values to the DFT, which I compute with the FFT. I tried doing this with wolfram alpha, for reasons unknown. The output there is {2445.5+0. i, -199.75+1038. i, -940.+0. i, -199.75-1038. i}. The output from this FFT calculator (http://scistatcalc.blogspot.com/2013...alculator.html) is {4891 + 0i, -399.5 - 2076i, -1880 + 0i, -399.5 + 2076i}. I have no idea why we have the discrepancy. Never the less, I continue... 3. Compute Z=XY (Z=XX in this case). I do this by squaring the complex signal values point-wise, using the second FFT result that gives me: Z={23921881 + 0i, -4150175.75 + 1658724i, 3534400 + 0i, -4150175.75 - 1658724i} 4. I compute the inverse DWT(N, a) of Z, which is a-1DFT-1(4)[Z]. DFT-1[Z] Yields {4788982.375,4267508.25,8939158.125,5926232.25} which I component wise multiply with my inverted weight signal a-1={1/1, 1/2.75, 1/2.5, 1/2.25} yielding z={4788982.375,1551821.1818...,3575663.25,5926232.25,2633881}. At this point my values do not match those shown in 6.14 (z={704383,324600,523365,463577} and so I have failed. 5. Rounding is trivial... 6. Perform carry and make it an integer again. Again trivial... Where did I go wrong?   2016-10-13, 03:58   #4
axn

Jun 2003

120738 Posts Quote:
 Originally Posted by airsquirrels and further weights {1, 2.75, 2.5, 2.25}.
The weights are {1, 23/4, 21/2, 21/4} = {1, 1.682, 1.414, 1.189}   2016-10-13, 08:53 #5 henryzz Just call me Henry   "David" Sep 2007 Cambridge (GMT/BST) 3·1,979 Posts If you haven't already read this thread read it http://mersenneforum.org/showthread.php?t=120   2016-10-13, 12:43   #6
airsquirrels

"David"
Jul 2015
Ohio

11×47 Posts Quote:
 Originally Posted by axn The weights are {1, 23/4, 21/2, 21/4} = {1, 1.682, 1.414, 1.189}
Thanks! I missed the superscript...

So that leave me {553., 156.407, 538.815, 346.059}

Forward DWT:
1594.281000
14.185000 + 189.652000i
589.349000
14.185000 - 189.652000i

Squaring:
2541731.906961
-35766.667 + 5380.4272 i
347332.243801
-35766.667 - 5380.4272 i

IFFT:
704382.704191
545909.702190
740149.371190
551290.129390

Inverse weights:
704382.704191
324599.851
523364.639459
463577.9

Rounding, digit recovery, carry propagation LSB first
704383 +
324600 * 2^10 +
523365 * 2^19 +
463578 * 2^28

That looks better! Thanks for the note on the weights.

Now to write some code....   2016-10-13, 23:29 #7 airsquirrels   "David" Jul 2015 Ohio 11×47 Posts So, now I have code that does work, however I'm confused looking at clFFT / cuFFT code. In all cases they actually initialize FFT sizes that are / 2. Launching clLucas with a 4096K FFT actually uses a 2048K complex FFT internally. The same is true of CUDALucas I found one forum post that alluded to mprime etc. actually counting the number of total double precision values vs. the number of complex elements? Is that accurate information? Timings using a Fury X GPU at stock clocks. Existing clLucas: 3.8-3.9 ms/Iteration using the 2048K Complex Double Precision SixStep FFT (4096K setting) My code currently runs about 3.16 ms/Iteration using the same Complex Interleaved format and baking the complex squaring, rounding, and carry-save into the FFT kernels using clFFTs post and pre callbacks, although I may need to add some more overhead for the carry/base normalizing I don't expect that to add much more than 10% overhead. I used Andrew Thall's strategy regarding keeping the Carry/Save separate - http://www.ece.neu.edu/groups/nucar/...iles/thall.pdf. clFFT also supports using REAL->HERMITIAN_INTERLEAVED transforms and the inverse, which would provide the afore-mentioned benefits of Hermitian symmetry and I believe I can perform the squaring directly in the HERMITIAN_INTERLEAVED space. Doing the transforms in this fashion at the "4096K" FFT size yields a timing of 2.52ms/iteration at 4096k on a Fury X, or about the same performance as a Titan Black using CUDALucas.   2016-10-16, 00:24 #8 airsquirrels   "David" Jul 2015 Ohio 11·47 Posts I am having some trouble understanding how to properly restore the digit representation in step 6 of Crandall's algorithm. (Reminder that the problem is to compute 78314567209^2 modulo 2^37-1 ) For the example problem, I have Real result values (Call them vector X) of: 704383, 324600, 523365, and 463578 These can be computed into a sum based on each digit value 704383 * 2^0 + 324600 * 2^10 + 523365 * 2^19 + 463578 * 2^28 = 124715498905471 78314567209^2 = 6133171437132978049681 78314567209^2 mod 2^37-1 = 58368107274 124715498905471 mod 2^37-1 also = 58368107274 I was led to believe that restoring the digit values would give us the "mod 2^37 -1" for free, however when I compute the digit values of 58368107274 by splitting the bits = 778, 168, 224, 217 I do not see how those values are obtained from X. I attempted to add and carry the MSB from X and computed instead 895, 167, 194, and 217. Since there are no operations performed on the least significant 10 bits I fail to see how I get 778 instead of 895. Taking 124715498905471 & 1023 = 895. Crandall's section 6.14 just gives a vague "Adjustment for proper constraints on digit sizes are to follow..." however I do not find those adjustments in the paper. Any direction is appreciated. Edit: Is this where I've gone wrong? http://mersenneforum.org/showpost.ph...9&postcount=59 So I should take 704383 / (2^10) -> 687 Carry and 704383 % (2^10) = my 895. Add 687 to 324600 and continue with 2^9th. Answering my own questions... Last fiddled with by airsquirrels on 2016-10-16 at 00:40   2016-10-16, 01:26   #9
airsquirrels

"David"
Jul 2015
Ohio

11×47 Posts Quote:
 Originally Posted by airsquirrels Edit: Is this where I've gone wrong? http://mersenneforum.org/showpost.ph...9&postcount=59 So I should take 704383 / (2^10) -> 687 Carry and 704383 % (2^10) = my 895. Add 687 to 324600 and continue with 2^9th. Answering my own questions...
Sorry for so many replies to myself. So my hand-carry method is the same as what ewmayer posted in that previous thread, however I am still getting 895 for the first 10 bit word, instead of the 778 that I am looking for. What am I missing now?   2016-10-16, 16:30 #10 airsquirrels   "David" Jul 2015 Ohio 11×47 Posts I will keep journalling my progress to hopefully help someone else trying to learn and understand this later. For anyone trying to understand LL testing, I highly recommend taking a look at the gpuLucas code. It is very clean and well commented and lacks much of the deep complexity/programmed by a mathematition effects of CUDALucas, and since the FFT is handled by CUDA much of the intense complexity of prime95 is not present. https://github.com/Almajester/gpuLucas What I critically missed in the papers was that the carry needs to be treated cyclically in order to preserve the mod 2^37-1. In this case my values of 895, 167, 194, and 217 from the example problem are correct, but there is a 5th carry value of (905 in the results + 2 from previous carries.) This leads to 895 + 907 = 1802. 1802 mod 1024 = 778 with a carry of 1, bringing my 167 to 168.   2016-10-18, 13:03 #11 airsquirrels   "David" Jul 2015 Ohio 11×47 Posts Can anyone help me understand or point me to a paper on how you would perform IBDWT stuffing your words into both the real and imaginary components of a complex FFT? I've tried searching for this but perhaps I am not using the right vocabulary to describe it. Both CUDALucas and clLucas seem to employee this technique, stemming from their origins as MacLucasFFTW. This is allowing them to use 1/2 the FFT size that is actually specified. Crandall's paper makes a mention of this option without details but recommends the Real->Hemitian solution instead Last fiddled with by airsquirrels on 2016-10-18 at 13:04   Thread Tools Show Printable Version Email this Page Similar Threads Thread Thread Starter Forum Replies Last Post LaurV LaurV 3 2019-11-05 10:25 kladner Lounge 8 2013-04-18 03:08 Jeff Gilchrist Programming 3 2012-01-15 00:29 jasong jasong 6 2007-12-07 14:06 Longshot Hardware 5 2005-05-21 16:40

All times are UTC. The time now is 15:15.

Mon Nov 29 15:15:17 UTC 2021 up 129 days, 9:44, 0 users, load averages: 1.63, 1.55, 1.44 