#12
Mar 2003
34 Posts
It seems I'm beginning to understand why other freshmen's programs are so slow. 1) The slowest: don't use the FFT at all; use schoolbook multiplication, O(n^2), instead. 2) Slow: use two FFTs every iteration. 3) Fast: use the FFT only once at start-up, then perform all calculations on the FFT images in the complex field. Is the 3rd item correct, and does Prime95 do so?
#13
∂²ω=0
Sep 2002
República de California
2×3²×647 Posts
Quote:
I used the revised version to estimate what FFT length would be needed for a Pépin test (or p-1/ECM factoring work) on F31, before Kruppa & Forbes found the small factor of that number. The error analysis (with the key adjustable asymptotic constant set based on data from runs with GIMPS-sized numbers) predicts that N = 2^27 floating doubles should be more than adequate to test F31-sized numbers - in fact the slightly smaller 15*2^23 should also suffice. Since I didn't have a non-power-of-2 runlength capability in my Fermat code (although one can use such for a Fermat-mod DWT, by combining the standard negacyclic weighting with a Mersenne-like variable-base DWT - the final version of the paper also discusses this), I used my Mersenne code with M(M31) as input for the test. Sure enough, ~100K iterations at lengths 15*2^23 and 2^27 yielded no fractional errors anywhere close to dangerous (or at least what we consider dangerous, based on lots of practical experience with GIMPS), but 14*2^23 quickly yielded fatal roundoff errors. The same error analysis also predicts that 16 bits per input will become too large around F35-F36, at which point being able to do a non-power-of-2-runlength Fermat-mod DWT will become very important - otherwise we'd have to drop down to 8 bits per input, which would more than double our runtime.
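For readers wondering what the "negacyclic weighting" mentioned above does: scaling word j by e^(iπj/n) before an ordinary cyclic FFT convolution turns it into a negacyclic one, which is exactly multiplication modulo base^n + 1 (the Fermat-mod case), with no zero-padding required. Here is a toy numpy sketch of just that weighting, in tiny base 10, with the final reduction done lazily via a Python big integer rather than a proper in-place carry; the function name and structure are my own illustration, not code from any real Fermat-mod program (which would also combine this with the variable-base DWT the post mentions):

```python
import numpy as np

def negacyclic_square(digits, base=10):
    """Square the little-endian digit vector `digits` modulo
    base^n + 1 using the negacyclic ('right-angle') weighting:
    scale word j by e^(i*pi*j/n), do an ordinary cyclic FFT
    convolution, then unweight - no zero-padding is needed."""
    n = len(digits)
    w = np.exp(1j * np.pi * np.arange(n) / n)    # negacyclic weights
    f = np.fft.fft(np.array(digits, dtype=float) * w)
    z = np.real(np.fft.ifft(f * f) / w)          # negacyclic autoconvolution
    # Round off the FFT error and reduce; a real implementation would
    # carry-propagate in place instead of rebuilding a big integer.
    val = sum(int(round(c)) * base**k for k, c in enumerate(z))
    return val % (base**n + 1)
```

For example, `negacyclic_square([7, 9])` squares 97 modulo 10^2 + 1 = 101 and returns 16, matching `pow(97, 2, 101)`.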
#14
∂²ω=0
Sep 2002
República de California
2·3²·647 Posts
Quote:
2) No, all the best programs still need 2 FFTs per iteration - they just do them very efficiently. 3) I'm not sure what you mean here. If you're referring to doing all the iterations entirely in Fourier space, that has been much discussed, but it simply doesn't work - you need to come out of Fourier space to do error removal (in a floating-point-FFT-based scheme), normalization of digits, and carry propagation. Then you need to take the resulting new set of input digits and get back into Fourier space to do the squaring, and so on. There are your two FFTs - no way around that, I'm afraid. So there's no black magic, no incredibly sophisticated mathematical sleight of hand (besides the FFT-based multiply and the DWT, both of which are really nifty but certainly not miraculous by any stretch) - just a small amount of higher mathematics and a hell of a lot of hard work. (Very much in the spirit of Thomas Edison's famous phrase about genius being 10% inspiration and 90% perspiration. Many would argue that the 10% figure is much too high. :))
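The round/normalize/carry steps described above are easy to see in a toy sketch: each "iteration" is forward FFT → pointwise square → inverse FFT, followed by the steps that only make sense on the digit representation. This is a bare-bones illustration (plain numpy, zero-padded real FFT, base 10^4; the function name is mine), not how any production program is organized:

```python
import numpy as np

def fft_square(digits, base=10**4):
    """One 'iteration' of an FFT-based squaring: forward FFT,
    pointwise square, inverse FFT - then the steps that force you
    out of Fourier space: round off the error, renormalize each
    digit, and propagate carries."""
    n = 1
    while n < 2 * len(digits):      # zero-pad so the convolution is acyclic
        n *= 2
    f = np.fft.rfft(np.array(digits, dtype=float), n)   # FFT #1 (forward)
    conv = np.fft.irfft(f * f, n)                       # FFT #2 (inverse)
    out, carry = [], 0
    for v in conv:
        carry += int(round(v))      # error removal: round to nearest integer
        out.append(carry % base)    # digit normalization
        carry //= base              # carry propagation
    while carry:
        out.append(carry % base)
        carry //= base
    return out                      # little-endian base-`base` digits
```

The next iteration must run the forward FFT on `out` all over again - hence the unavoidable two FFTs per iteration.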
#15
∂²ω=0
Sep 2002
República de California
10110101111110₂ Posts
#16
Oct 2002
43 Posts
I feel rather bad, now, about referring to your paper (or rather, the estimate contained in it) so roughly... but my comments were entirely accurate when I wrote them. (In fact, I seem to recall mentioning it to Richard C. at the time, and being told something to the effect of "well, you might be right, but we're not going to change it now" -- but that was a couple years ago, so I might be getting confused with something else.)
#17
∂²ω=0
Sep 2002
República de California
2×3²×647 Posts
#18
Oct 2002
43 Posts
#19
Mar 2003
34 Posts
Quote:
But it's still incredible that Prime95 is 30 times faster than such suboptimal programs (instead of 2 or 4 times). How large can the digits safely and optimally be for 80-bit and 64-bit FFTs? (I believe 16 and 8 bits per "digit".)
#20
∂²ω=0
Sep 2002
República de California
2·3²·647 Posts
Quote:
For a guide to how many bits per digit can safely be used, look at http://www.mersenne.org/bench.htm . For instance, at 1024K (that's the number of *real* elements in the FFT - if you're using a complex-data FFT that means 19 radix-2 passes, not 20) one can use ~19 bits per input digit. Of course, achieving this in practice requires careful coding - you need to use the balanced-digit representation described in the 1994 Crandall/Fagin Math. Comp. paper, use the DWT to halve the vector length relative to a zero-padded implementation, and preferably use a higher-radix FFT to reduce the number of passes through the data (= faster) and reduce the number of complex multiplies by FFT sincos data (= more accurate).
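As a rough sanity check on figures like "~19 bits at 1024K": the naive worst-case requirement is just that the largest convolution term, about N·(2^b)², fit in the 53-bit double mantissa with some slack for accumulated FFT round-off. That worst case yields only ~13-14 bits at N = 2^20; the gap up to the ~19 bits usable in practice is what the balanced-digit representation and the statistical (rms rather than worst-case) behavior of the round-off error buy you. A back-of-envelope helper, where the function name and the 6-bit slack figure are my own assumptions, not anything from the GIMPS codebase:

```python
import math

def worst_case_bits_per_word(n, mantissa_bits=53, slack_bits=6):
    """Worst-case bound: a length-n convolution of b-bit unsigned
    words has terms up to ~ n * (2^b)^2; keep that, plus some slack
    bits for accumulated FFT round-off, below the double mantissa:
        n * 2^(2b + slack) < 2^mantissa
        =>  b < (mantissa - slack - log2(n)) / 2
    """
    return (mantissa_bits - slack_bits - math.log2(n)) / 2
```

For example, `worst_case_bits_per_word(2**20)` gives 13.5 bits, versus the ~19 bits that balanced digits and careful coding allow in practice.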
#21
Mar 2003
34 Posts
*** ewmayer
Would you be so kind as to explain to me how the IBDWT works? Going by its name, I've tried to multiply 79 by 83 (mod 100) using 101^0.5 (i.e. 10.04987562112089) as the base. In this system 79 and 83 will be, for instance, (7, 8.65087065215377) and (8, 2.60099503103288). Two FFTs and one iFFT lead us to (87.41393043446033, 78.50087158036013). Propagating the cyclic carry gives us (4.96504984437232, 7.10186661139301), which in the decimal system is 57.000000000000. It was rare luck that the irrationality was eliminated at the very end - more often it is not! What is the correct algorithm? Thank you in advance!
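The source of the confusion above is that the IBDWT's "irrational base" is never actually computed with: the irrational weights appear only as a diagonal pre/post scaling around an ordinary cyclic convolution, and everything irrational cancels exactly (up to floating-point round-off) when the weights are divided back out. Also, the transform targets moduli of the form 2^p - 1 (or 2^p + 1), not 100. Here is a toy sketch of Crandall/Fagin-style squaring mod 2^p - 1; it uses a plain complex FFT and unbalanced digits, the function name is mine, and real implementations like Prime95 are far more careful:

```python
import math
import numpy as np

def ibdwt_square(x, p, n):
    """Square x mod 2^p - 1 via the irrational-base DWT:
    n variable-size words, weights a_j = 2^(ceil(j*p/n) - j*p/n)."""
    # Bit boundaries ceil(j*p/n); word j holds bits[j] bits.
    pos = [(j * p + n - 1) // n for j in range(n + 1)]
    bits = [pos[j + 1] - pos[j] for j in range(n)]

    # Split x into its variable-base digits (little-endian).
    digits, t = [], x
    for b in bits:
        digits.append(t & ((1 << b) - 1))
        t >>= b

    # The 'irrational' DWT weights, all in [1, 2).
    w = np.array([2.0 ** (pos[j] - j * p / n) for j in range(n)])

    # Weighted cyclic autoconvolution via FFT, then divide the weights
    # back out: the irrational parts cancel, so conv[] holds integers
    # (plus floating-point round-off).
    fa = np.fft.fft(np.array(digits, dtype=float) * w)
    conv = np.real(np.fft.ifft(fa * fa)) / w

    # Round away the error, then variable-base carry propagation;
    # the top carry wraps to word 0 because 2^p == 1 (mod 2^p - 1).
    vals, res, carry = [int(round(v)) for v in conv], [0] * n, 0
    while any(vals) or carry:
        for j in range(n):
            v = vals[j] + res[j] + carry
            vals[j] = 0
            res[j] = v & ((1 << bits[j]) - 1)
            carry = v >> bits[j]
    return sum(res[j] << pos[j] for j in range(n)) % ((1 << p) - 1)
```

For example, `ibdwt_square(123456789, 31, 4)` agrees with `pow(123456789, 2, 2**31 - 1)` while never holding more than ~8 bits per word - no irrational number ever survives into the digit arithmetic.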
#22
Aug 2002
3·83 Posts
Similar Threads
| Thread | Thread Starter | Forum | Replies | Last Post |
| why is http://www.mersenne.org/ so slow | Unregistered | Information & Answers | 19 | 2012-04-17 03:12 |
| Slow Down? | R.D. Silverman | GMP-ECM | 55 | 2011-10-16 17:28 |
| How hot is too hot? Slow is too slow? | petrw1 | Hardware | 13 | 2008-11-10 23:25 |
| Slow computer | Housemouse | Hardware | 7 | 2008-02-15 18:18 |
| Really slow machines? | Doorbasher | Hardware | 5 | 2004-08-23 22:18 |