mersenneforum.org (https://www.mersenneforum.org/index.php)
-   Riesel Prime Search (https://www.mersenneforum.org/forumdisplay.php?f=59)
-   -   Alternative LLR tester implementation (https://www.mersenneforum.org/showthread.php?t=25343)

kuratkull 2020-03-05 10:48

Alternative LLR tester implementation
 
1 Attachment(s)
Hi people!


I implemented an end-to-end Riesel Prime tester.
[URL]https://github.com/dkull/rpt[/URL]


Since I wrote the README with you here in mind, I will just copy-paste it from the project.


I have only built it on 64-bit Linux. Other platforms will probably work. I have attached the compiled Linux executable to this post.


Any discussion on the topic of these kinds of tools, methods, algorithms is welcome.



[B]RPT - Riesel Prime Tester[/B]

[I]Experimental[/I] Lucas-Lehmer-Riesel primality tester.


[B]Note[/B]

This project is in no way 'production ready': it does not check for errors and does not save state. I wouldn't recommend swapping out your LLR64's for this (yet). Although I have noticed that if a CPU is stable with LLR (for a while), it should be okay to run it without error checking... am I right about this?
Also, it is important to note that this software and Jean Penné's software (LLR64 [URL]http://jpenne.free.fr/index2.html[/URL]) do (almost) exactly the same thing, using the same libraries. So there are no performance gains to be had - we both rely on the speed of the GWNum library. LLR64 is probably faster in many cases (possibly due to using C/GCC or some other magic), even though it does error checking. But I have noticed that RPT is a smidge faster for larger k's in threaded mode.


[B]What this project is[/B]

This is me being interested in how Lucas-Lehmer-Riesel primality proving works - from end to end. I first ran Jean's LLR64 software in 2007 and found my first primes that made it into the TOP5000 ([URL]https://primes.utm.edu/primes/lists/all.txt[/URL]). I stopped for over a decade, but the topic always lingered in my mind. In 2019 I started sieving/testing again, curiosity got the better of me, and I decided to implement most of what LLR64 does with Riesel primes.
The core LLR loop is actually trivial and can be implemented in no time. Much of the complexity comes from needing to find U0 for k > 1. E.g. for Mersenne numbers (k=1), U0 == 4. For k > 1, U0 needs to be calculated, and naive implementations are slow for large k. I have three implementations (naive, less naive, and optimal) in this project. The optimal one is the same one used in LLR64 and needs only O(log(k)) multiplications - roughly two per bit of k.
This project will probably also implement the PRP primality testing used in PFGW ([URL]https://sourceforge.net/projects/openpfgw/[/URL]).
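
To make the point about naive implementations concrete: the naive approach just iterates the Lucas recurrence k-1 times, so it needs O(k) multiplications, which is hopeless once k has more than a handful of bits. An illustrative Python sketch (not the Zig code from the repo; v1 is the starting value found with the Jacobi symbol search shown in the pseudocode further down):

[CODE]# Naive O(k) way to get U0 = V_k(v1) mod N: iterate the Lucas recurrence
# directly. Fine for tiny k, far too slow for the k's used in Riesel searches.
def u0_naive(k, v1, N):
    prev, cur = 2, v1                          # V_0 = 2, V_1 = v1
    for _ in range(k - 1):
        prev, cur = cur, (v1 * cur - prev) % N
    return cur                                 # V_k mod N[/CODE]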


[B]What this project is not[/B]

This is not an attempt to replace LLR64. LLR64 has many years of work behind it, in both features and stability/safety. This project is no match for that. This project does not aim to provide factoring, trivial candidate elimination, resumability, safety, support for different prime formats, etc.


[B]Building[/B]

Requires the Zig compiler, plus GWNum and GMP in the project directory. Both library dependencies need to be built first in their respective ways; neither is complicated to build. The Zig compiler can be downloaded in binary form from [URL]https://ziglang.org/download/[/URL]

[B]Running[/B]

[C]
$ ./rpt <k> <n> [threads]
$ ./rpt 39547695 506636 4
=== RPT - Riesel Prime Tester v0.0.1 [GWNUM: 29.8 GMP: 6.2.0] ===
LLR testing: 39547695*2^506636-1 [152521 digits] on 4 threads
step 1. find U0 ...
found V1 [11] using Jacobi Symbols in 0ms
found U0 using Lucas Sequence in 66ms
step 2. llr test ...
9%.19%.29%.39%.49%.59%.69%.78%.88%.98%.
llr took 73827ms
#> 39547695*2^506636-1 [152521 digits] IS PRIME
[/C]

[B]Pseudocode of the whole thing[/B]

[CODE]const k, n, b, c; # they have to have values
const N = k * b ^ n - 1;

# find V1
# this uses Jacobi Symbols
while (v1 := 1) : () : (v1 += 1) {
if jacobi(v1 - 2, N) == 1 && jacobi(v1 + 2, N) == -1 {
break
}
}

# fastest method [ O(log(bitlen(k))) ] to find u0
# this calculates the value of the Lucas Sequence
x = v1
y = (v1 * v1) - 2
k_bitlen = bitlen(k)
while (i := k_bitlen - 2) : (i > 0) : (i -= 1) {
if k.bits[i] == 1 {
x = (x*y) - v1 mod N
y = (y*y) - 2 mod N
} else {
y = (x*y) - v1 mod N
x = (x*x) - 2 mod N
}
}
x = x * x
x = x - v1
u = u0 = x mod N

# Lucas-Lehmer-Riesel primality test starting from U0
while (i := 1) : (i < n - 1) : (i += 1) {
u = (u * u) - 2
}
'prime!' if u == 0[/CODE]
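
For small k and n the whole procedure can also be written out as plain (and slow) Python - an illustrative sketch only, with Python's big integers standing in for the GWNum FFT squarings and a hand-rolled Jacobi symbol instead of GMP's:

[CODE]# Runnable but unoptimized Python version of the pseudocode above, for small k, n.
# The real RPT/LLR64 use GWNum for the modular squarings; plain ints stand in here.

def jacobi(a, n):
    """Jacobi symbol (a/n) for odd n > 0."""
    a %= n
    result = 1
    while a != 0:
        while a % 2 == 0:
            a //= 2
            if n % 8 in (3, 5):
                result = -result
        a, n = n, a
        if a % 4 == 3 and n % 4 == 3:
            result = -result
        a %= n
    return result if n == 1 else 0

def llr_is_prime(k, n):
    """Lucas-Lehmer-Riesel test for N = k*2^n - 1, with odd 1 < k < 2^n."""
    assert k % 2 == 1 and 1 < k < 2 ** n
    N = k * 2 ** n - 1

    # step 1a: find V1 with jacobi(V1-2, N) == 1 and jacobi(V1+2, N) == -1
    v1 = 1
    while not (jacobi(v1 - 2, N) == 1 and jacobi(v1 + 2, N) == -1):
        v1 += 1

    # step 1b: U0 = V_k(v1) via the binary Lucas sequence ladder
    x, y = v1 % N, (v1 * v1 - 2) % N
    for i in range(k.bit_length() - 2, 0, -1):
        if (k >> i) & 1:
            x, y = (x * y - v1) % N, (y * y - 2) % N
        else:
            x, y = (x * x - 2) % N, (x * y - v1) % N
    u = (x * y - v1) % N                       # lowest bit of k is always 1

    # step 2: n-2 modular squarings
    for _ in range(n - 2):
        u = (u * u - 2) % N
    return u == 0

print(llr_is_prime(3, 6))   # 3*2^6-1 = 191 is prime  -> True
print(llr_is_prime(3, 5))   # 3*2^5-1 = 95 = 5*19     -> False[/CODE]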

kuratkull 2020-03-05 14:41

There is already much newer information on the GitHub page - for example, detailed build instructions, some limitations, and other tweaks to the software.

Thomas11 2020-03-06 16:40

Looks quite interesting! :smile:

I did some tests. Single thread performance is indeed almost identical to LLR64.
However, multithreading doesn't seem to work as expected:

single-threaded:
[CODE]
time ./rpt 3 234760 1
=== RPT - Riesel Prime Tester v0.0.1 [GWNUM: 29.8 GMP: 6.2.0] ===
LLR testing: 3*2^234760-1 [70671 digits] on 1 threads
step 1. find U0 ...
found V1 [3] using Jacobi Symbols in 0ms
found U0 using Lucas Sequence in 0ms
step 2. LLR test ...
X
LLR took 11946ms
#> 3*2^234760-1 [70671 digits] IS PRIME

real 0m12,006s
user 0m12,932s
sys 0m0,000s
[/CODE]

4 threads:
[CODE]
time ./rpt 3 234760 4
=== RPT - Riesel Prime Tester v0.0.1 [GWNUM: 29.8 GMP: 6.2.0] ===
LLR testing: 3*2^234760-1 [70671 digits] on 4 threads
step 1. find U0 ...
found V1 [3] using Jacobi Symbols in 0ms
found U0 using Lucas Sequence in 1ms
step 2. LLR test ...
X
LLR took 26103ms
#> 3*2^234760-1 [70671 digits] IS PRIME

real 0m26,161s
user 1m6,958s
sys 0m18,038s
[/CODE]

kuratkull 2020-03-07 00:39

Yes, I also tried it on larger N's today, and it seems that the current version falls further and further behind LLR64 as N grows. I'm not sure why, because during development I saw faster speeds than LLR64 in some cases. The weird thing is that the core loop is basically just calling the square function in GWNum - this function call takes 99.99% of the time in the loop. I am looking into it with the goal of at least matching the speed of LLR64, which should be doable since the core loop of LLR64 is exactly the same.

Though you might be surprised to see that LLR64 is also slower with smaller N's in threaded mode (I mean the same smallish N [for example 1*2^859433-1] is slower multithreaded than single-threaded, both with RPT and LLR64). I just discovered this today while benchmarking the slowness issue. This is probably due to small N's losing too much CPU time to context switching. Larger N's should benefit from multithreading, while small N's don't.


PS! By N I mean the whole number N = k*2^n-1 (whose size in digits is what matters), not just the exponent n.

VBCurtis 2020-03-07 06:36

In general, LLR does not benefit from splitting an FFT into threads such that FFT size divided by threadcount is smaller than 128k.

That is, a 256k FFT size will run 2-threaded quite nicely. FFTs below 200k in size are not efficient 2-threaded. I've run a ~400k FFT 3-threaded with reasonable success.
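
Expressed as code, that rule of thumb works out to roughly the following (an illustrative Python helper, not anything from LLR itself):

[CODE]# Rough illustration of the rule of thumb above: keep at least ~128K of FFT
# size per thread, otherwise extra threads stop paying off.
def max_useful_threads(fft_size_k, min_fft_per_thread_k=128):
    return max(1, fft_size_k // min_fft_per_thread_k)

# e.g. max_useful_threads(384) == 3, max_useful_threads(256) == 2,
#      and anything below 256K stays single-threaded[/CODE]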

kuratkull 2020-03-07 11:54

(running all this on a Skylake i7)

Using zero-padded FMA3 FFT length 384K for both:
4 threads: 39547695*2^3664022-1 is not prime. LLR Res64: DA225779C3F421FD Time : 1409.882 sec.
3 threads: 39547695*2^3664034-1 is not prime. LLR Res64: 7F70D51694B8E6AF Time : 1645.871 sec.
384 / 4 = 96
384 / 3 = 128
I would have expected it to work better with 3 threads. Will have to look into optimal settings/recommendations for the user.

Also I updated my code so it should be at least as fast as LLR64, most likely faster (due to me not using checks and safety features in the core loop). On my CPU RPT seems to be about 1% faster.

39547695*2^506636-1 has a 50KB FFT size:

./rpt 39547695 506636 1 took 68820ms
./rpt 39547695 506636 2 took 65739ms
./rpt 39547695 506636 3 took 57495ms
./rpt 39547695 506636 4 took 63035ms

./llr64 -d -q"39547695*2^506636-1" -t1 -> Time : 69.294 sec.
./llr64 -d -q"39547695*2^506636-1" -t2 -> Time : 66.765 sec.
./llr64 -d -q"39547695*2^506636-1" -t3 -> Time : 59.057 sec.
./llr64 -d -q"39547695*2^506636-1" -t4 -> Time : 63.470 sec.

And comparing larger n's:
4 threads: 39547695*2^3664022-1 is not prime. LLR Res64: DA225779C3F421FD Time : 1409.882 sec.
4 threads: rpt 1397.022 seconds
I pushed the new binary (0.0.2) to Github releases: [url]https://github.com/dkull/rpt/releases/tag/v0.0.2[/url]

kuratkull 2020-03-07 13:41

[URL]https://github.com/dkull/rpt/releases/tag/v0.0.3[/URL]


RPT can now choose the best threadcount for you. Just give 0 as the threadcount.
This information should be transferable to LLR64, e.g. you can find the best threadcount with RPT and then use it in LLR64 for the same k,n.
Currently a new best has to be at least 5% better than the previous best; this ensures we don't waste cores on negligible improvements and also minimizes the effects of jitter.



[CODE]
./rpt_release_linux64_0_0_3 39547695 3664175 0
=== RPT - Riesel Prime Tester v0.0.3 [GWNUM: 29.8 GMP: 6.2.0] ===
LLR testing: 39547695*2^3664175-1 [1103035 digits] on 0 threads
step #1 find U0 ...
found V1 [11] using Jacobi Symbols in 1ms
found U0 using Lucas Sequence in 390ms
step #1.5 benchmark threadcount ...
threads 1 took 1319ms for 1000 iterations
threads 2 took 770ms for 1000 iterations
threads 3 took 552ms for 1000 iterations
threads 4 took 489ms for 1000 iterations
threads 5 took 514ms for 1000 iterations
threads 6 took 543ms for 1000 iterations
threads 7 took 513ms for 1000 iterations
threads 8 took 466ms for 1000 iterations
using fastest threadcount 4
FFT size 384KB
step #2 LLR test ...

./rpt_release_linux64_0_0_3 39547695 506636 0
=== RPT - Riesel Prime Tester v0.0.3 [GWNUM: 29.8 GMP: 6.2.0] ===
LLR testing: 39547695*2^506636-1 [152521 digits] on 0 threads
step #1 find U0 ...
found V1 [11] using Jacobi Symbols in 0ms
found U0 using Lucas Sequence in 65ms
step #1.5 benchmark threadcount ...
threads 1 took 159ms for 1000 iterations
threads 2 took 155ms for 1000 iterations
threads 3 took 149ms for 1000 iterations
threads 4 took 164ms for 1000 iterations
threads 5 took 162ms for 1000 iterations
threads 6 took 169ms for 1000 iterations
threads 7 took 174ms for 1000 iterations
threads 8 took 192ms for 1000 iterations
using fastest threadcount 3
FFT size 50KB
step #2 LLR test ...[/CODE]
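
The selection logic boils down to roughly this (a Python sketch of the behaviour described above, not the actual Zig code):

[CODE]# Sketch of the auto-benchmark: try increasing threadcounts and only accept a
# higher count if it is at least 5% faster than the current best, so cores are
# not wasted on jitter-level gains.
def pick_threadcount(bench_ms, max_threads=8, min_gain=0.05):
    """bench_ms(t) -> milliseconds for a fixed number of LLR iterations on t threads."""
    best_t, best_ms = 1, bench_ms(1)
    for t in range(2, max_threads + 1):
        ms = bench_ms(t)
        if ms < best_ms * (1.0 - min_gain):
            best_t, best_ms = t, ms
    return best_t[/CODE]

Fed the two benchmark runs above, this picks 4 and 3 threads respectively, matching the printed output.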

VBCurtis 2020-03-07 16:42

[QUOTE=kuratkull;539074](running all this on a Skylake i7)

Using zero-padded FMA3 FFT length 384K for both:
4 threads: 39547695*2^3664022-1 is not prime. LLR Res64: DA225779C3F421FD Time : 1409.882 sec.
3 threads: 39547695*2^3664034-1 is not prime. LLR Res64: 7F70D51694B8E6AF Time : 1645.871 sec.
384 / 4 = 96
384 / 3 = 128
I would have expected it to work better with 3 threads. Will have to look into optimal settings/recommendations for the user.
[/QUOTE]
4 * 1409 = 5636 thread-sec
3 * 1645 = 4935 thread-sec

I consider 15% more thread-sec 'inefficient'. I certainly didn't make clear what I meant the first time, sorry! If we did a similar comparison for 2- vs 3- threaded, I imagine the 2-threaded instance would be ~4800 thread-sec, within 3-4% of the 3-threaded instance. That's a tradeoff I'm willing to make for the admin simplicity of fewer clients running, but 15% is not.

Happy5214 2020-03-08 02:29

I've never heard of Zig before. It looks pretty cool at first glance. I hope your code is more readable than the actual LLR code. I've actually used the Jacobi symbol code in class as an example of ugly code (namely the result variable being named "resul").

kuratkull 2020-03-08 12:46

@[URL="https://www.mersenneforum.org/member.php?u=1636"][COLOR=red]VBCurtis[/COLOR][/URL] Aah right, I see what you mean. I personally just try to optimize the time-spent-on-candidate with little regard for used CPU time/power - unless adding an extra core really gives no benefit (probably around 5-10% for me).


@[URL="https://www.mersenneforum.org/member.php?u=8911"]Happy5214[/URL] And about Zig, yes it's a pretty neat language. This project is as much about learning/testing Zig as much as it is learning about the LLR test. But Zig hasn't let me down once during this project, and binding to C libraries was as easy as it can probably be. And I am glad to see that I managed to match/exceed(with cheating) the speed of LLR64. I am not one to throw away performance gains lightly, glad that Zig was able to handle that.

I am also pretty happy that GMP provides a function for the Jacobi symbols, so I didn't have to implement it myself.

kuratkull 2020-03-12 19:20

[url]https://github.com/dkull/rpt/releases/tag/v0.1.0[/url]


0.1.0
-----

* fixed issue with gwstartnextfft false-negatives
* selftest success at least up to n=150000 (for k<300)
* print LLR64-compatible residue
* automatic niceness 19 (not configurable)
* extra logging when errors happen (won't catch all)
* result output says 'maybe' if errors were seen
* speed still on par with LLR64


So basically it seems to work - I still wouldn't bet my head on it working in all cases (yet), though I currently don't know of a case where it fails silently. It currently crashes for large n's (>9M).



Also the argument format has changed, and setting threads to 0 causes RPT to auto-benchmark to find the fastest number of threads.



Currently my guarantee is that I run --selftest 150000, which means all k < 300 are tested against the known primes up to n = 150000, and a negative case is generated for each n (n - 1, if that is not itself a known prime).
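
In rough code, the selftest idea is something like this (a hypothetical Python sketch; `known_primes` stands for whatever table of known Riesel primes the selftest actually ships with):

[CODE]# Hypothetical sketch of the --selftest idea (not the actual implementation):
# every known Riesel prime k*2^n-1 with k < 300 and n <= limit must test prime,
# and a nearby exponent expected to be composite is tested as a negative case.
def selftest(known_primes, limit, is_prime):
    """known_primes: dict mapping k -> set of n for which k*2^n-1 is a known prime."""
    for k, exponents in known_primes.items():
        for n in sorted(e for e in exponents if e <= limit):
            assert is_prime(k, n), f"{k}*2^{n}-1 should be prime"
            m = n - 1
            if m > 1 and m not in exponents and k < 2 ** m:
                assert not is_prime(k, m), f"{k}*2^{m}-1 should be composite"[/CODE]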

