20161012, 00:31  #1 
"David"
Jul 2015
Ohio
11×47 Posts 
Musings of someone learning
I am working through grasping the FFT based multiplication process in enough detail to hopefully contribute some optimizations to i.e. clLucas. I expect to say a lot of incorrect or silly things as I try to wrap my head around this in detail, help is appreciated. For some reason my brain is struggling more with this than many programming optimization problems I've worked on in the past.
LL test basics = start with x=4, do x=(x^2  2) mod M for p2 iterations, if x=zero you win As I understand it, there are five steps here for the inner loop b=log2(M) n = FFT size, which is roughly b/(useful bits of a double). 1. Convert our X to a polynomial P. Zero pad so it has 2n coefficients. Set complex parts to zero What is the base of the polynomial? 2. Run an FFT on P. Because we are squaring we only need P because P=Q 3. Square the 2n resulting values. These are complex, so we are squaring the complex number? (a + bi)^2 = a^2 + 2abi + bi^2 ? And we do that for all 2n? 4. Run IFFT on the 2n values and get back coefficients in base B 5. Some values might be bigger than B, do carry, round to integers, and subtract 2 while we are at it. Mod M is easy because of the base? I've read we get the mod for free but I haven't exactly pinpointed where. Is this the step? Rinse wash and repeat. Am I on the right track? Bonus Question: If I wanted to cube a huge number, could I just cube the values in step 3? Last fiddled with by airsquirrels on 20161012 at 00:32 
20161012, 01:30  #2 
P90 years forever!
Aug 2002
Yeehaw, FL
2^{2}×43×47 Posts 
You are on the right track. Very good.
As you might expect, there are optimizations available. The biggest is Richard Crandall's Irrational Base Discrete Weighted Transform. This saves us the padding with zeros (halving the size of an FFT) and gives us the modulo 2^P1 for free. The other biggie is Hermetian symmetry. When doing a forward FFT of n values, you can discard half of the complex values. That is, the end result of the FFT is (mostly) complex values of the form a + bi and a  bi  no need to remember both complex values. On the bonus question, yes you can cube the transformed values. Alas, the more arithmetic you do on transformed values, the fewer bits you can stuff in each FFT word at the beginning  no free lunch here. 
20161013, 03:09  #3 
"David"
Jul 2015
Ohio
11×47 Posts 
So I've been reading Crandall's paper.... http://www.ams.org/journals/mcom/199...11852441.pdf
Working through the example of section six, a simple squaring of 78314567209 mod (2^371) using an FFT of size 4.... 1. Choose a digit representation... The example computes bit sizes {10, 9, 9, 9} which leads to digits {553, 93, 381, 291} and further weights {1, 2.75, 2.5, 2.25}. This part is all laid out in the paper and I follow. 2. Compute X = DWT(N, a)x and Y is the same as X because we are squaring. Based on the initial section, DWT(4, a)x is equal to a DFT(4)ax where ax is component wise multiplication with the signal weights a. This leads to {553 + 0i, 2730.75 + 0i, 952.5 + 0i, 654.75 + 0i} as the complex input values to the DFT, which I compute with the FFT. I tried doing this with wolfram alpha, for reasons unknown. The output there is {2445.5+0. i, 199.75+1038. i, 940.+0. i, 199.751038. i}. The output from this FFT calculator (http://scistatcalc.blogspot.com/2013...alculator.html) is {4891 + 0i, 399.5  2076i, 1880 + 0i, 399.5 + 2076i}. I have no idea why we have the discrepancy. Never the less, I continue... 3. Compute Z=XY (Z=XX in this case). I do this by squaring the complex signal values pointwise, using the second FFT result that gives me: Z={23921881 + 0i, 4150175.75 + 1658724i, 3534400 + 0i, 4150175.75  1658724i} 4. I compute the inverse DWT(N, a) of Z, which is a^{1}DFT^{1}(4)[Z]. DFT^{1}[Z] Yields {4788982.375,4267508.25,8939158.125,5926232.25} which I component wise multiply with my inverted weight signal a^{1}={1/1, 1/2.75, 1/2.5, 1/2.25} yielding z={4788982.375,1551821.1818...,3575663.25,5926232.25,2633881}. At this point my values do not match those shown in 6.14 (z={704383,324600,523365,463577} and so I have failed. 5. Rounding is trivial... 6. Perform carry and make it an integer again. Again trivial... Where did I go wrong? 
20161013, 03:58  #4 
Jun 2003
1010100110011_{2} Posts 

20161013, 08:53  #5 
Just call me Henry
"David"
Sep 2007
Liverpool (GMT/BST)
6,011 Posts 
If you haven't already read this thread read it http://mersenneforum.org/showthread.php?t=120

20161013, 12:43  #6  
"David"
Jul 2015
Ohio
11·47 Posts 
Quote:
So that leave me {553., 156.407, 538.815, 346.059} Forward DWT: 1594.281000 14.185000 + 189.652000i 589.349000 14.185000  189.652000i Squaring: 2541731.906961 35766.667 + 5380.4272 i 347332.243801 35766.667  5380.4272 i IFFT: 704382.704191 545909.702190 740149.371190 551290.129390 Inverse weights: 704382.704191 324599.851 523364.639459 463577.9 Rounding, digit recovery, carry propagation LSB first 704383 + 324600 * 2^10 + 523365 * 2^19 + 463578 * 2^28 That looks better! Thanks for the note on the weights. Now to write some code.... 

20161013, 23:29  #7 
"David"
Jul 2015
Ohio
11×47 Posts 
So, now I have code that does work, however I'm confused looking at clFFT / cuFFT code. In all cases they actually initialize FFT sizes that are / 2.
Launching clLucas with a 4096K FFT actually uses a 2048K complex FFT internally. The same is true of CUDALucas I found one forum post that alluded to mprime etc. actually counting the number of total double precision values vs. the number of complex elements? Is that accurate information? Timings using a Fury X GPU at stock clocks. Existing clLucas: 3.83.9 ms/Iteration using the 2048K Complex Double Precision SixStep FFT (4096K setting) My code currently runs about 3.16 ms/Iteration using the same Complex Interleaved format and baking the complex squaring, rounding, and carrysave into the FFT kernels using clFFTs post and pre callbacks, although I may need to add some more overhead for the carry/base normalizing I don't expect that to add much more than 10% overhead. I used Andrew Thall's strategy regarding keeping the Carry/Save separate  http://www.ece.neu.edu/groups/nucar/...iles/thall.pdf. clFFT also supports using REAL>HERMITIAN_INTERLEAVED transforms and the inverse, which would provide the aforementioned benefits of Hermitian symmetry and I believe I can perform the squaring directly in the HERMITIAN_INTERLEAVED space. Doing the transforms in this fashion at the "4096K" FFT size yields a timing of 2.52ms/iteration at 4096k on a Fury X, or about the same performance as a Titan Black using CUDALucas. 
20161016, 00:24  #8 
"David"
Jul 2015
Ohio
517_{10} Posts 
I am having some trouble understanding how to properly restore the digit representation in step 6 of Crandall's algorithm.
(Reminder that the problem is to compute 78314567209^2 modulo 2^371 ) For the example problem, I have Real result values (Call them vector X) of: 704383, 324600, 523365, and 463578 These can be computed into a sum based on each digit value 704383 * 2^0 + 324600 * 2^10 + 523365 * 2^19 + 463578 * 2^28 = 124715498905471 78314567209^2 = 6133171437132978049681 78314567209^2 mod 2^371 = 58368107274 124715498905471 mod 2^371 also = 58368107274 I was led to believe that restoring the digit values would give us the "mod 2^37 1" for free, however when I compute the digit values of 58368107274 by splitting the bits = 778, 168, 224, 217 I do not see how those values are obtained from X. I attempted to add and carry the MSB from X and computed instead 895, 167, 194, and 217. Since there are no operations performed on the least significant 10 bits I fail to see how I get 778 instead of 895. Taking 124715498905471 & 1023 = 895. Crandall's section 6.14 just gives a vague "Adjustment for proper constraints on digit sizes are to follow..." however I do not find those adjustments in the paper. Any direction is appreciated. Edit: Is this where I've gone wrong? http://mersenneforum.org/showpost.ph...9&postcount=59 So I should take 704383 / (2^10) > 687 Carry and 704383 % (2^10) = my 895. Add 687 to 324600 and continue with 2^9th. Answering my own questions... Last fiddled with by airsquirrels on 20161016 at 00:40 
20161016, 01:26  #9  
"David"
Jul 2015
Ohio
517_{10} Posts 
Quote:


20161016, 16:30  #10 
"David"
Jul 2015
Ohio
517_{10} Posts 
I will keep journalling my progress to hopefully help someone else trying to learn and understand this later.
For anyone trying to understand LL testing, I highly recommend taking a look at the gpuLucas code. It is very clean and well commented and lacks much of the deep complexity/programmed by a mathematition effects of CUDALucas, and since the FFT is handled by CUDA much of the intense complexity of prime95 is not present. https://github.com/Almajester/gpuLucas What I critically missed in the papers was that the carry needs to be treated cyclically in order to preserve the mod 2^371. In this case my values of 895, 167, 194, and 217 from the example problem are correct, but there is a 5th carry value of (905 in the results + 2 from previous carries.) This leads to 895 + 907 = 1802. 1802 mod 1024 = 778 with a carry of 1, bringing my 167 to 168. 
20161018, 13:03  #11 
"David"
Jul 2015
Ohio
11·47 Posts 
Can anyone help me understand or point me to a paper on how you would perform IBDWT stuffing your words into both the real and imaginary components of a complex FFT? I've tried searching for this but perhaps I am not using the right vocabulary to describe it.
Both CUDALucas and clLucas seem to employee this technique, stemming from their origins as MacLucasFFTW. This is allowing them to use 1/2 the FFT size that is actually specified. Crandall's paper makes a mention of this option without details but recommends the Real>Hemitian solution instead Last fiddled with by airsquirrels on 20161018 at 13:04 
Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
Learning languages is the new fashion...  LaurV  LaurV  3  20191105 10:25 
Online languagelearning course  kladner  Lounge  8  20130418 03:08 
Learning Python  Course from Google  Jeff Gilchrist  Programming  3  20120115 00:29 
flowcharts, selflearning  jasong  jasong  6  20071207 14:06 
Learning About RAM the Hard Way  Longshot  Hardware  5  20050521 16:40 