mersenneforum.org Developer's corner
 Register FAQ Search Today's Posts Mark Forums Read

 2019-01-16, 04:12 #1 kriesel     "TF79LL86GIMPS96gpu17" Mar 2017 US midwest 139C16 Posts Developer's corner This thread is intended for data and other aids for Mersenne prime finder software testing or development. These may also be useful in assessing hardware reliability. This post Mersenne prime exponent LL worktodo lines (gpuowl
 2019-01-16, 04:26 #2 kriesel     "TF79LL86GIMPS96gpu17" Mar 2017 US midwest 22×5×251 Posts Mersenne prime exponent LL test worktodo lines (gpuowl
 2019-01-16, 04:46 #3 kriesel     "TF79LL86GIMPS96gpu17" Mar 2017 US midwest 22·5·251 Posts Mersenne prime exponent PRP and PRP-1 worktodo lines (gpuowl format) Code: #draft qa suite for checking false negatives in PRP, and limits imposed by fft lengths available # or for gpu reliability validation or code modification testing PRP=0,1,2,2,-1,1,1 PRP=0,1,2,3,-1,1,1 PRP=0,1,2,5,-1,2,1 PRP=0,1,2,7,-1,3,1 PRP=0,1,2,13,-1,6,1 PRP=0,1,2,17,-1,8,1 PRP=0,1,2,19,-1,9,1 PRP=0,1,2,31,-1,15,1 PRP=0,1,2,61,-1,30,1 PRP=0,1,2,89,-1,40,1 PRP=0,1,2,107,-1,40,1 PRP=0,1,2,127,-1,40,1 PRP=0,1,2,521,-1,40,1 PRP=0,1,2,607,-1,40,1 PRP=0,1,2,1279,-1,40,1 PRP=0,1,2,2203,-1,40,1 PRP=0,1,2,2281,-1,40,1 PRP=0,1,2,3217,-1,40,1 PRP=0,1,2,4253,-1,40,1 PRP=0,1,2,4423,-1,40,1 PRP=0,1,2,9689,-1,40,1 PRP=0,1,2,9941,-1,40,1 PRP=0,1,2,11213,-1,40,1 PRP=0,1,2,19937,-1,40,1 PRP=0,1,2,21701,-1,40,1 PRP=0,1,2,23209,-1,40,1 PRP=0,1,2,44497,-1,40,1 PRP=0,1,2,86243,-1,40,1 PRP=0,1,2,110503,-1,0,1 PRP=0,1,2,132049,-1,40,1 PRP=0,1,2,216091,-1,40,1 PRP=0,1,2,756839,-1,40,1 PRP=0,1,2,859433,-1,40,1 PRP=0,1,2,1257787,-1,56,1 PRP=0,1,2,1398269,-1,56,1 PRP=0,1,2,2976221,-1,60,1 PRP=0,1,2,3021377,-1,60,1 PRP=0,1,2,6972593,-1,63,1 PRP=0,1,2,13466917,-1,64,1 PRP=0,1,2,20996011,-1,65,1 PRP=0,1,2,24036583,-1,66,1 PRP=0,1,2,25964951,-1,66,1 PRP=0,1,2,30402457,-1,67,1 PRP=0,1,2,32582657,-1,67,1 PRP=0,1,2,37156667,-1,67,1 PRP=0,1,2,42643801,-1,68,1 PRP=0,1,2,43112609,-1,68,1 PRP=0,1,2,57885161,-1,69,1 PRP=0,1,2,74207281,-1,70,1 PRP=0,1,2,77232917,-1,70,1 PRP=0,1,2,82589933,-1,71,1 Code: #draft qa suite for checking false negatives in PRP-1 #bounds listed are greater of primenet or gpu72 indicated at mersenne.ca #tf are gpu72 target indicated at mersenne.ca #following block that are commented out are less than 1.5 bits/word at #gpuowl 128k fft length minimum #B1=1,B2=25;PRP=0,1,2,107,-1,44,2 #B1=1,B2=25;PRP=0,1,2,127,-1,44,2 #B1=6,B2=150;PRP=0,1,2,521,-1,44,2 #B1=7,B2=175;PRP=0,1,2,607,-1,44,2 #B1=14,B2=350;PRP=0,1,2,1279,-1,44,2 #B1=25,B2=625;PRP=0,1,2,2203,-1,44,2 #B1=26,B2=650;PRP=0,1,2,2281,-1,44,2 #B1=26,B2=650;PRP=0,1,2,3217,-1,44,2 #B1=48,B2=1200;PRP=0,1,2,4253,-1,44,2 #B1=49,B2=1225;PRP=0,1,2,4423,-1,44,2 #B1=108,B2=2700;PRP=0,1,2,9689,-1,44,2 #B1=111,B2=2775;PRP=0,1,2,9941,-1,44,2 #B1=125,B2=3125;PRP=0,1,2,11213,-1,44,2 #B1=223,B2=5575;PRP=0,1,2,19937,-1,44,2 #B1=243,B2=6075;PRP=0,1,2,21701,-1,44,2 #B1=260,B2=6500;PRP=0,1,2,23209,-1,44,2 #B1=498,B2=12450;PRP=0,1,2,44497,-1,44,2 #B1=964,B2=24100;PRP=0,1,2,86243,-1,44,2 #B1=10000,B2=60000;PRP=0,1,2,110503,-1,44,2 #B1=10000,B2=70000;PRP=0,1,2,132049,-1,44,2 #following are usable at 128k fft length B1=10000,B2=130000;PRP=0,1,2,216091,-1,44,2 B1=20000,B2=500000;PRP=0,1,2,756839,-1,44,2 B1=20000,B2=580000;PRP=0,1,2,859433,-1,44,2 #following seem suitable for gpuowl v3.x-5.0 supported fft lengths B1=20000,B2=240000;PRP=0,1,2,1257787,-1,60,2 B1=20000,B2=300000;PRP=0,1,2,1398269,-1,60,2 B1=40000,B2=560000;PRP=0,1,2,2976221,-1,64,2 B1=40000,B2=560000;PRP=0,1,2,3021377,-1,64,2 B1=80000,B2=1440000;PRP=0,1,2,6972593,-1,67,2 B1=150000,B2=3150000;PRP=0,1,2,13466917,-1,69,2 B1=260000,B2=6500000;PRP=0,1,2,20996011,-1,69,2 B1=280000,B2=7000000;PRP=0,1,2,24036583,-1,70,2 B1=310000,B2=7750000;PRP=0,1,2,25964951,-1,70,2 B1=350000,B2=8750000;PRP=0,1,2,30402457,-1,71,2 B1=380000,B2=9500000;PRP=0,1,2,32582657,-1,71,2 B1=430000,B2=11610000;PRP=0,1,2,37156667,-1,71,2 B1=480000,B2=12960000;PRP=0,1,2,42643801,-1,72,2 B1=490000,B2=13230000;PRP=0,1,2,43112609,-1,72,2 B1=650000,B2=17550000;PRP=0,1,2,57885161,-1,73,2 B1=810000,B2=13860000;PRP=0,1,2,74207281,-1,74,2 B1=840000,B2=23520000;PRP=0,1,2,77232917,-1,74,2 B1=870000,B2=23490000;PRP=0,1,2,82589933,-1,75,2 Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2019-12-19 at 16:19
2019-01-17, 20:09   #4
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

10011100111002 Posts
Interim 64-bit residues for LL sequences of known Mersenne prime exponents

For comparison to the output of any new or revised software in development, the attachment contains interim res64 values for selected iteration numbers of Lucas-Lehmer (seed 4) sequences for a wide range of known Mersenne primes. These can also be used to test hardware reliability and OS/driver/application/dlls combinations for correct function.

https://www.mersenneforum.org/showpo...&postcount=179
p=82589933 interim residues list every 5M

https://www.mersenneforum.org/showpo...&postcount=192
ATH log files from cudalucas and mlucas verification runs for 82589933

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
 Reference Mp LL res64 milestones.pdf (16.9 KB, 132 views)

Last fiddled with by kriesel on 2019-11-19 at 16:04

2019-01-17, 20:09   #5
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

22×5×251 Posts
Interim 64-bit residues for PRP3 sequences of known Mersenne prime exponents

For comparison to the output of any new or revised software in development, the attachment contains interim res64 values for selected iteration numbers of PRP3 sequences for a wide range of known Mersenne primes. Can also be used to test hardware reliability and OS/driver/application/dlls combinations for correct function.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
 Reference Mp PRP3 res64 milestones.pdf (15.3 KB, 136 views)

Last fiddled with by kriesel on 2019-11-19 at 16:04

 2019-01-20, 21:06 #6 kriesel     "TF79LL86GIMPS96gpu17" Mar 2017 US midwest 22·5·251 Posts Seed value ten LL series final residues The Lucas Lehmer series in GIMPS usage is standardized at seed value (S0) 4. Other values, such as ten, are also usable. There's a reference tabulation for various Mersenne primes of the final iterations' residues for seed 10 at http://www.hoegge.dk/mersenne/penult...sultsS0=10.txt These other seed values are to be avoided, for normal GIMPS use, in my opinion, because the math makes interim residues not comparable. A list of seed values usable in Lucas Lehmer sequences is available at https://oeis.org/A018844 Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2019-11-19 at 16:05
2019-03-13, 18:34   #7
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

22×5×251 Posts
How fast can we multiply two integers, or square an integer?

Central to the work at the current wavefronts of GIMPS, or higher exponent, is fast math with large numbers requiring many words to contain the large number of bits involved.
This applies to primality testing by LL or probable prime test, P-1 factoring, and to a lesser extent, trial factoring.
How fast, or rather, to make it relatively independent of computing hardware speed, how efficiently can we multiply two integers, or square a large integer?

Most of the following addresses the general case of two different operands multiplied. It also mostly omits special cases such as one operand is a small power of two, which can often be accomplished quickly by shifts.
George Woltman gives an overview with computing times for early GIMPS hardware in his Colloquium available on Youtube.

Long multiplication
Beginning with the simplest method, long multiplication, which is most efficient for small multiword integers, such as occur in typical trial factoring, we multiply each digit, and add those partial products together. We select a digit size that fits the computing hardware well.
If programming an online calculator, where computations are short and frequent conversion to and from base 10 would be inefficient, one might select a power of ten as the digit size. Typically for a GIMPS application we will use a power of two, for efficiency, since many computations are done between base conversions. On modern computing hardware we might select 232 or 264 as the digit size. (Because 232 is usually faster on gpus, mfaktc uses it. Mfaktc goes up only to 95-bit factors.) For four-digit width, it might look something like the following.
Let the operands be abcd and efgh, with each letter representing the value of the respective large digit.
Here a is the most significant digit, and d the least.
Code:
         a b c d
x       e f g h
----------------
ah bh ch dh
ag bg cg dg
af bf cf df
ae be ce de
--------------------
Each of those partial products may be two digits wide. There are 16 of them (42).
They are shown in this pattern for correspondence with the operand digits, but the alignment for summing is not correct above.
Digit by digit, with alignment, the partial products are
Code:
                    bhhi bhlo dhhi dhlo
ahhi ahlo chhi chlo
bghi bglo dghi dglo
aghi aglo cghi cglo
bfhi bflo dfhi dflo
afhi aflo cfhi cflo
behi belo dehi delo
aehi aelo cehi celo
---------------------------------------
There are also 24 single-digit additions, with carries.
Generally, the effort goes up as the square of the operand length, for a fixed digit size.

Squaring a number is less effort than general multiplication in this method.
Code:
          a b c d
x         a b c d
-----------------
ac bc cc DC
ab bb CB DB
aa BA CA DA
-----------------
Since ad=da, ab=ba, ac=ca, cb=bc, db=bd, cd=dc, it's not necessary to compute the 6 partial products shown in upper case above.
In general, (n-1)n/2 multiplications can be avoided in long squaring relative to long multiplication. I think of this as folding the square in half diagonally, and needing to compute products for the diagonal terms and only one of the two off-diagonal triangles.
As n becomes large, this savings approaches half the digit multiplications. The additions must still occur.
It may be more efficient to shift the partial product left by one bit and add once with carry than to do two adds.

As the number of digits increases, the time to do multiplication or squaring increases rapidly.
Above a certain point, which is generally hundreds of bits, it is faster to use a different, more complex method.

Karatsuba
The least complex of these alternate methods was discovered by Karatsuba in 1960 and published in 1962. http://www.ccas.ru/personal/karatsuba/divcen.pdf (Before then it was conjectured the lower bound was O(n2).)
It uses an algebraic identity to accomplish a 2n-digit product of n-digit operands in 3n2/4 partial products rather than n2, at the cost of some additional additions and subtractions.
For sufficiently large operands, it is worthwhile to use the Karatsuba method recursively.
Recursion reduces the order of multiplication from O(n2) to O(nlog2(3))= O(n1.585) https://en.wikipedia.org/wiki/Karatsuba_algorithm
A rather readable description of long multiplication and Karatsuba algorithm is http://people.mpi-inf.mpg.de/~mehlho...apter2A-en.pdf
At values corresponding to the current GIMPS first or second primality test wavefront, n~85M or ~47M, or above, the ratio of number of multiplications is considerable (thousands). See the attachment.

Toom-Cook
There is a generalization of the Karatsuba approach, to splitting the operands into k pieces, where k=2 for Karatsuba.
It was discovered by Toom in 1963, and refined by Cook in 1966. So it is now called Toom-Cook.
For k=3, it is O(n1.465). Due to higher overhead, it is efficient beginning at somewhat higher operand lengths than Karatsuba. It's also possible to split the two operands differently, to k and l pieces, as when one is significantly larger.

Schönhage-Strassen (FFT)
An even faster (lower order) method is the Schönhage-Strassen algorithm, developed in 1971.
"The run-time bit complexity is, in Big O notation, O ( n ⋅ log ⁡ n ⋅ log ⁡ log ⁡ n ) for two n-digit numbers.
The algorithm uses recursive Fast Fourier transforms in rings with 2n+1 elements, a specific type of number theoretic transform."
https://en.wikipedia.org/wiki/Sch%C3...ssen_algorithm
Note that this algorithm requires zero-padding the operands. For sufficiently large operands, this is faster than Toom-Cook or other preceding methods of multiplication or squaring.
and an Excel example at http://www.gweep.net/%7Eshifty/portf...eet/index.html

Irrational Base Discrete Weighted Transform (IBDWT)
An improvement on Schönhage-Strassen is Crandall & Fagin's irrational base discrete weighted transform, published 1994.
https://www.ams.org/journals/mcom/19...-1185244-1.pdf
https://en.wikipedia.org/wiki/Irrati...hted_transform
Efficient implementations of this are the basis for the leading current GIMPS primality testing and P-1 factoring software packages.
The computations are performed modulo the Mersenne number being tested for primality, which is very quick in base two and limits doubling in size of the products, at each of the millions of iterations required.
It's common that they are implemented in floating point and sacrifice some provable-correctness of results for increased performance. In practice, the observed computation effort per iteration in the fastest available GIMPS software is O(n~1.1)
Various error checking techniques are employed to obtain acceptable reliability. This is to my knowledge the fastest currently known algorithm for operands of size relevant to primality testing or P-1 factoring progress at the GIMPS wavefront and far above.
Overall error rates per primality check, which are the result of errors due to hardware, software, numerical, and user effects, are discussed at
https://www.mersenneforum.org/showpo...3&postcount=19

Fürer and following work
Schönhage & Strassen conjectured a lower bound for multiplication of O(n log n).
Fürer published in 2007 an algorithm with O(n log n 2O(log* n)) where log* is the iteration logarithm (number of times logarithm is applied to reach a number less than or equal to one).https://en.wikipedia.org/wiki/F%C3%BCrer%27s_algorithm
Subsequently, others have lowered or better defined the order, with modifications to the Fürer algorithm or by finding other algorithms;
2008 De et al (DKSS algorithm) https://dl.acm.org/citation.cfm?id=2...dl=ACM&coll=DL;
2014 Harvey, van der Hoeven and Lecerf (https://arxiv.org/abs/1407.3360);
2015 Covanov and Thome (modified Fürer);
2016 Covanov and Thome O(n log n 22log* n) via a generalization of Fermat primes https://arxiv.org/abs/1502.02800;
2018 Harvey and van der Hoeven O(n log n 22log* n) and does not depend on unproven conjectures https://arxiv.org/abs/1802.07932;

None of these developments from 2007 onward have yet bested the IBDWT Crandall approach, in the range of number sizes of interest to the GIMPS effort, or even considerably larger. Work continues, to improve the proportionality constant or order.

Combining techniques; symmetry
The various techniques can be combined in extended-precision math software libraries, and the technique fastest for a given operand size used.
So far the description here has been in regard to symmetric multiplication (operands with essentially equal length). Long multiplication, Toom-Cook, and perhaps others support asymmetry. Asymmetry is not known to me to be of advantage in GIMPS primality testing, but it may be in P-1 factoring. The switching point between algorithms depends on the processor type and implementation. It can get rather complicated, as shown in GMP's threshold table https://gmplib.org/devel/thres/ and the color diagrams near the bottom of https://gmplib.org/devel/ that are processor-specific. It's worth noting those diagrams cover l<7600 or less, where l is "limbs", or words of size 2k, typically 32 or 64 bits, not the number of bits. https://gmplib.org/manual/Nomenclatu...tml#index-Limb
The diagonals of those diagrams would represent the symmetric operand cases. On the diagonals, we see transitions among 5 algorithms (schoolbook = conventional long multiplication, Toom22 = Karatsuba, Toom33, Toom44, fft mod2k+1) with increasing l, at differing l depending on processor model. In all cases shown, schoolbook appears fastest for sizes involved in TF (which is ~<96 bits or <3 32-bit limbs), and FFT fastest for sizes above 5000 limbs, so presumably throughout GIMPS double checking wavefront and upwards beyond mersenne.org maximum (109 bits) or Mlucas maximum (232 bits).

Practical implementation
Great care is taken in implementations such as GMP, or GWNUM (underlying prime95 and mprime) and other packages to efficiently use real machine resources such as cache levels, register counts, and memory bandwidth. Tuning the code is processor model or model family specific, because the different models' designs present differing constraints and tradeoffs.

Ernst Mayer's article in ODROID magazine https://magazine.odroid.com/article/...tical-history/

Please PM me with any suggestions or corrections for this post, or use the reference discussion thread at https://www.mersenneforum.org/showthread.php?t=23383

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
 karatsuba long mult ratio.pdf (12.6 KB, 143 views)

Last fiddled with by kriesel on 2021-02-15 at 23:06 Reason: added colloquium link sentence

 2019-03-14, 00:45 #8 kriesel     "TF79LL86GIMPS96gpu17" Mar 2017 US midwest 22×5×251 Posts PRP residue types Woltman describes and discusses PRP residue types 1 through 5 in https://www.mersenneforum.org/showpo...&postcount=209 Quoting in part: "Here is my PRP residue proposal: There are (at least) 5 PRP residue types: 1; Fermat PRP. N is PRP if a^(N-1) = 1 mod N 2: SPRP variant, N is PRP if a^((N-1)/2) = +/-1 mod N 3: Fermat PRP variant. N is PRP if a^(N+1) = a^2 mod N 4: SPRP variant. N is PRP if a^((N+1)/2) = +/-a mod N 5: Fermat PRP cofactor variant, N/d is PRP if a^(N-1) = a^d mod N/d I encourage programs to return type 1 residues as that has been the standard for prime95, PFGW, LLR for many years." Note there is also a type 0, associated with gpuowl PRP-1 runs (combined PRP and P-1), involving computing powers of a base other than 3, mod Mp. https://www.mersenneforum.org/showpo...postcount=1005 https://www.mersenneforum.org/showpo...postcount=1006 Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1 Last fiddled with by kriesel on 2019-11-19 at 16:06
2019-05-07, 21:12   #10
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

22×5×251 Posts
The Jacobi symbol check for LL

The Jacobi symbol check for the Lucas-Lehmer sequence was discovered by forum user "error" and posted 2017-08-07 at https://mersenneforum.org/showpost.p...3&postcount=30
Its announcement led to the request in the forum for additional error checks, and rather directly to the Gerbicz check for PRP.

Some programs use the Jacobi test periodically which detects LL computation errors at 50% probability. To my knowlege, only prime95/mprime and some of the early LL-based versions of gpuOwL incorporated the Jacobi test so far.
Quote:
 All valid LL-residues for Mp will have (Res-2/Mp) = -1 If you make a random error at some iteration, you have 50% chance of getting all subsequent residues with (Res-2/Mp) = 1 This condition can be checked at any suitable interval, since it won't change back again, if no further errors occur.
Computing the Jacobi symbol is a lengthy computation, so it is only done rarely (at intervals of 12 hours by default in prime95).

It appears the Jacobi check is applicable also to some portions of the P-1 factoring computations, although its application there is more complex, so it is less likely to be fast enough to be useful. As far as I know, this has never been implemented in production GIMPS P-1 code, until gpuowl V7.

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1

Last fiddled with by kriesel on 2020-10-22 at 22:53 Reason: gpuowl V7 use of Jacobi symbol for P-1

2019-05-09, 00:28   #11
kriesel

"TF79LL86GIMPS96gpu17"
Mar 2017
US midwest

22×5×251 Posts
CUDA Toolkit compatibility vs CUDA level

CUDA toolkit levels and support, requirements for OS, driver, etc grid attached.
Note that a given version of Visual Studio can be configured to generate code for differing OS than the OS the VS runs on at the development system. (For example, XP compatible executables can be produced from a Windows 7 development system.)

Top of reference tree: https://www.mersenneforum.org/showpo...22&postcount=1
Attached Files
 compatibility.pdf (34.6 KB, 137 views)

Last fiddled with by kriesel on 2019-11-19 at 16:07

 Similar Threads Thread Thread Starter Forum Replies Last Post tServo Software 19 2016-04-23 21:30 jasong jasong 6 2013-10-16 20:09 BotXXX Hardware 16 2012-06-21 23:54 Xyzzy Linux 5 2006-06-01 14:56

All times are UTC. The time now is 09:13.

Fri Apr 16 09:13:52 UTC 2021 up 8 days, 3:54, 0 users, load averages: 1.25, 1.40, 1.48