Alternative LLR tester implementation
1 Attachment(s)
Hi people!
I implemeted an endtoend Riesel Prime tester. [URL]https://github.com/dkull/rpt[/URL] Since I wrote the README with you here in mind, I will just copy paste it from the project. I have only built it on 64 bit Linux. Other platforms will probably work. I have attached the compiled Linux executable to this post. Any discussion on the topic of these kinds of tools, methods, algorithms is welcome. [B]RPT  Riesel Prime Tester[/B] [I]Experimental[/I] LucasLehmerRiesel primality tester. [B]Note[/B] This project is in no way 'production ready', it does not check errors and does not save state. I wouldn't recommend swapping out your LLR64's for this (yet). Although I have noticed that if a CPU is stable with LLR (for a while), it should be okay to run it without error checking...am I right about this? Also, it is important to note that this software and Jean PennĂ©'s software (LLR64 [URL]http://jpenne.free.fr/index2.html[/URL]) do (almost) exactly the same thing, using the same libraries. So there is no performance gains to be had  we both rely on the speed of the GWNum library. LLR64 is probably faster in many cases (possibly due to using C/GCC or some other magic), even though it uses error checking. But I have noticed that RPT is a smidge faster when using larger k's in threaded mode. [B]What this project is[/B] This is me being interested in how the LucasLehmerRiesel primality proving works  from end to end. I first ran Jean's LLR64 software in 2007 and found my first primes that got into the TOP5000([URL]https://primes.utm.edu/primes/lists/all.txt[/URL]). I stopped for over a decade, but the topic always lingered in my mind. In 2019 I started sieving/testing again and the curiosity got the best of me and I decided to implement most of what LLR64 does with Riesel Primes. The core LLR loop is actually trivial and can be implemented in no time. Much of the complexity comes from needing to find U0 for k > 1. Eg. for Mersenne primes (k=1) U0==4. For k>1 U0 needs to be calculated, and naive implementations are slow for large k's. I have three different (naive, lessnaive and optimal) implementations in this project. The optimal one is the same one used in LLR64, which runs in O(log(bitlen(k))) time. This project will probably also implement the PRP primality testing used in PFGW([URL]https://sourceforge.net/projects/openpfgw/[/URL]) [B]What this project is not[/B] This is not an attempt to replace LLR64. LLR64 has a lot of years of work behind it, in both features and stability/safety. This project is no match to that. This project does not aim to provide factoring, trivial candidate elimination, resumability, safety, support for different prime formats, etc. [B]Building[/B] Requires the Zig compiler. And GWNum and GMP in the project directory. Both library dependencies need to be built first in their respective ways. They are not complicated to build. The Zig compiler can be downloaded in binary form from [URL]https://ziglang.org/download/[/URL] [B]Running[/B] [C] $ ./rpt <k> <n> [threads] $ ./rpt 39547695 506636 4 === RPT  Riesel Prime Tester v0.0.1 [GWNUM: 29.8 GMP: 6.2.0] === LLR testing: 39547695*2^5066361 [152521 digits] on 4 threads step 1. find U0 ... found V1 [11] using Jacobi Symbols in 0ms found U0 using Lucas Sequence in 66ms step 2. llr test ... 9%.19%.29%.39%.49%.59%.69%.78%.88%.98%. llr took 73827ms #> 39547695*2^5066361 [152521 digits] IS PRIME [/C] [B]Pseudocode of the whole thing[/B] [CODE]const k, n, b, c; # they have to have values const N = k * b ^ n  1; # find V1 # this uses Jacobi Symbols while (v1 := 1) : () : (v1 += 1) { if jacobi(v1  2, N) == 1 && jacobi(v1 + 2, N) == 1 { break } } # fastest method [ O(log(bitlen(k))) ] to find u0 # this calculates the value of the Lucas Sequence x = v1 y = (v1 * v1)  2 k_bitlen = bitlen(k) while (i := k_bitlen  2) : (i > 0) : (i = 1) { if k.bits[i] == 1 { x = (x*y)  v1 mod N y = (y*y)  2 mod N } else { y = (x*y)  v1 mod N x = (x*x)  2 mod N } } x = x * x x = x  v1 u = u0 = x mod N # LucasLehmerRiesel primality test starting from U0 while (i := 1) : (i < n  1) : (i += 1) { u = (u * u)  2 } 'prime!' if u == 0[/CODE] 
There is already much newer information on the Github page  for example detailed build instructions and some limitations and other tweaks to the software.

Looks quite interesting! :smile:
I did some tests. Single thread performance is indeed almost identical to LLR64. However, multithreading doesn't seem to work as expected: singlethreaded: [CODE] time ./rpt 3 234760 1 === RPT  Riesel Prime Tester v0.0.1 [GWNUM: 29.8 GMP: 6.2.0] === LLR testing: 3*2^2347601 [70671 digits] on 1 threads step 1. find U0 ... found V1 [3] using Jacobi Symbols in 0ms found U0 using Lucas Sequence in 0ms step 2. LLR test ... X LLR took 11946ms #> 3*2^2347601 [70671 digits] IS PRIME real 0m12,006s user 0m12,932s sys 0m0,000s [/CODE] 4 threads: [CODE] time ./rpt 3 234760 4 === RPT  Riesel Prime Tester v0.0.1 [GWNUM: 29.8 GMP: 6.2.0] === LLR testing: 3*2^2347601 [70671 digits] on 4 threads step 1. find U0 ... found V1 [3] using Jacobi Symbols in 0ms found U0 using Lucas Sequence in 1ms step 2. LLR test ... X LLR took 26103ms #> 3*2^2347601 [70671 digits] IS PRIME real 0m26,161s user 1m6,958s sys 0m18,038s [/CODE] 
Yes, I also tried it on larger N's today and it seems that the current version lags behind more and more at larger N's compared to LLR64. I'm not sure why, because during development I saw faster speeds than LLR64 in some cases. The weird thing is that the core loop is basically just calling the square function in GWnum  this function call takes 99.99% of the time in the loop. I am looking into it with the goal of at least matching the speed of LLR64, because it should be doable due to the fact that the core loop of LLR64 is exactly the same.
Though you might be surprised to see that LLR64 is also slower with smaller N's in threaded mode(i mean the same smallish N [for example 1*2^8594331] is slower on multithreaded than single thread, both on RPT and LLR64). I just discovered this today when benchmarking the slowness issue. This is probably due to small N's taking too much CPU time context switching. Larger N's should benefit from multithreaded, while small N's don't. PS! N = k*2^n1 (N equals the digits in a number) 
In general, LLR does not benefit from splitting an FFT into threads such that FFT size divided by threadcount is smaller than 128k.
That is, a 256k FFT size will run 2threaded quite nicely. FFTs below 200k in size are not efficient 2threaded. I've run a ~400k FFT 3threaded with reasonable success. 
(running all this on a Skylake i7)
Using zeropadded FMA3 FFT length 384K for both: 4 threads: 39547695*2^36640221 is not prime. LLR Res64: DA225779C3F421FD Time : 1409.882 sec. 3 threads: 39547695*2^36640341 is not prime. LLR Res64: 7F70D51694B8E6AF Time : 1645.871 sec. 384 / 4 = 96 384 / 3 = 128 I would have expected it to work better with 3 threads. Will have to look into optimal settings/recommendations for the user. Also I updated my code so it should be at least as fast as LLR64, most likely faster (due to me not using checks and safety features in the core loop). On my CPU RPT seems to be about 1% faster. 39547695*2^5066361 has 50KB fft size: ./rpt 39547695 506636 1 took 68820ms ./rpt 39547695 506636 2 took 65739ms ./rpt 39547695 506636 3 took 57495ms ./rpt 39547695 506636 4 took 63035ms ./llr64 d q"39547695*2^5066361" t1 > Time : 69.294 sec. ./llr64 d q"39547695*2^5066361" t2 > Time : 66.765 sec. ./llr64 d q"39547695*2^5066361" t3 > Time : 59.057 sec. ./llr64 d q"39547695*2^5066361" t4 > Time : 63.470 sec. And comparing larger n's: 4 threads: 39547695*2^36640221 is not prime. LLR Res64: DA225779C3F421FD Time : 1409.882 sec. 4 threads: rpt 1397.022 seconds I pushed the new binary (0.0.2) to Github releases: [url]https://github.com/dkull/rpt/releases/tag/v0.0.2[/url] 
[URL]https://github.com/dkull/rpt/releases/tag/v0.0.3[/URL]
RPT can now choose the best threadcount for you. Just give 0 as the threadcount. This information should be transferrable to LLR64, eg. you can test the best threadcount and use that in LLR64 with the same k,n Currently the best one has to be at least 5% better than the last best, this ensures we don't waste cores for negligible improvements. Also this minimizes the effects of jitter. [CODE] ./rpt_release_linux64_0_0_3 39547695 3664175 0 === RPT  Riesel Prime Tester v0.0.3 [GWNUM: 29.8 GMP: 6.2.0] === LLR testing: 39547695*2^36641751 [1103035 digits] on 0 threads step #1 find U0 ... found V1 [11] using Jacobi Symbols in 1ms found U0 using Lucas Sequence in 390ms step #1.5 benchmark threadcount ... threads 1 took 1319ms for 1000 iterations threads 2 took 770ms for 1000 iterations threads 3 took 552ms for 1000 iterations threads 4 took 489ms for 1000 iterations threads 5 took 514ms for 1000 iterations threads 6 took 543ms for 1000 iterations threads 7 took 513ms for 1000 iterations threads 8 took 466ms for 1000 iterations using fastest threadcount 4 FFT size 384KB step #2 LLR test ... ./rpt_release_linux64_0_0_3 39547695 506636 0 === RPT  Riesel Prime Tester v0.0.3 [GWNUM: 29.8 GMP: 6.2.0] === LLR testing: 39547695*2^5066361 [152521 digits] on 0 threads step #1 find U0 ... found V1 [11] using Jacobi Symbols in 0ms found U0 using Lucas Sequence in 65ms step #1.5 benchmark threadcount ... threads 1 took 159ms for 1000 iterations threads 2 took 155ms for 1000 iterations threads 3 took 149ms for 1000 iterations threads 4 took 164ms for 1000 iterations threads 5 took 162ms for 1000 iterations threads 6 took 169ms for 1000 iterations threads 7 took 174ms for 1000 iterations threads 8 took 192ms for 1000 iterations using fastest threadcount 3 FFT size 50KB step #2 LLR test ...[/CODE] 
[QUOTE=kuratkull;539074](running all this on a Skylake i7)
Using zeropadded FMA3 FFT length 384K for both: 4 threads: 39547695*2^36640221 is not prime. LLR Res64: DA225779C3F421FD Time : 1409.882 sec. 3 threads: 39547695*2^36640341 is not prime. LLR Res64: 7F70D51694B8E6AF Time : 1645.871 sec. 384 / 4 = 96 384 / 3 = 128 I would have expected it to work better with 3 threads. Will have to look into optimal settings/recommendations for the user. [/QUOTE] 4 * 1409 = 5636 threadsec 3 * 1645 = 4935 threadsec I consider 15% more threadsec 'inefficient'. I certainly didn't make clear what I meant the first time, sorry! If we did a similar comparison for 2 vs 3 threaded, I imagine the 2threaded instance would be ~4800 threadsec, within 34% of the 3threaded instance. That's a tradeoff I'm willing to make for the admin simplicity of fewer clients running, but 15% is not. 
I've never heard of Zig before. It looks pretty cool at first glance. I hope your code is more readable than the actual LLR code. I've actually used the Jacobi symbol code in class as an example of ugly code (namely the result variable being named "resul").

@[URL="https://www.mersenneforum.org/member.php?u=1636"][COLOR=red]VBCurtis[/COLOR][/URL] Aah right, I see what you mean. I personally just try to optimize the timespentoncandidate with little regard for used CPU time/power  unless adding an extra core really gives no benefit (probably around 510% for me).
@[URL="https://www.mersenneforum.org/member.php?u=8911"]Happy5214[/URL] And about Zig, yes it's a pretty neat language. This project is as much about learning/testing Zig as much as it is learning about the LLR test. But Zig hasn't let me down once during this project, and binding to C libraries was as easy as it can probably be. And I am glad to see that I managed to match/exceed(with cheating) the speed of LLR64. I am not one to throw away performance gains lightly, glad that Zig was able to handle that. I am also pretty happy that GMP provides a function for the Jacobi symbols, so I didn't have to implement it myself. 
[url]https://github.com/dkull/rpt/releases/tag/v0.1.0[/url]
0.1.0  * fixed issue with gwstartnextfft falsenegatives * selftest success at least up to n=150000 (for k<300) * print LLR64 compatible residue * automatic niceness 19 (not configurable) * extra logging when errors happen (won't catch all) * result output says 'maybe' if errors were seen * speed still on par with LLR64 So basically it seems to work  though I still wouldn't bet my head on it working in all cases (yet), though I currently do not know a case where it would fail silently. It currently crashes for large n's (>9M). Also the argument format has changed, and setting threads to 0 causes RPT to autobenchmark to find the fastest number of threads. Currently my guarantee is that I selftest 150000, which means all k<300 are tested upto known primes n=150000, and a negative case for each n is generated (n  1 if it is not a known prime). 
All times are UTC. The time now is 15:32. 
Powered by vBulletin® Version 3.8.11
Copyright ©2000  2020, Jelsoft Enterprises Ltd.