mersenneforum.org (https://www.mersenneforum.org/index.php)
-   GpuOwl (https://www.mersenneforum.org/forumdisplay.php?f=171)
-   -   GpuOwl 7.x (https://www.mersenneforum.org/showthread.php?t=26007)

 preda 2020-09-26 04:10

GpuOwl 7.x

A heads-up about upcoming changes; each will be discussed in more detail. (GpuOwl 7.x is not yet available.)

- removed LL
- removed standalone P-1
- merged PRP and P-1
- reworked P-1 second-stage
- changed savefile version and naming scheme
- changed savefile cleanup

Planned:
- rework the proof generation
- migrate to proof v2
- rework PRP savefiles towards merging with the proof checkpoints
- some SP experiments

A git branch will be created for the current 6.x for people who feel strongly about the "removed" elements.

 preda 2020-09-26 04:41

Merged PRP and P-1 first-stage

I'll explain briefly why this is a good thing compared to the "classic" P-1 first-stage.

P-1 first stage is conceptually simple:
a) given the bound B1, compute powerSmooth(B1), which is roughly the product of all primes up to B1, with the small primes raised to powers higher than 1 (each prime power up to B1 divides the result)
b) compute X = 3^powerSmooth(B1) mod Mp
c) compute GCD(X - 1, Mp), hoping to turn up a factor
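A minimal Python sketch of these three steps (illustrative only -- this is not how GpuOwl implements it; power_smooth here uses one common definition, taking each prime to its largest power not exceeding B1):

```python
from math import gcd

def small_primes(n):
    """Primes up to n via a simple sieve."""
    sieve = [True] * (n + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(n ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i :: i] = [False] * len(sieve[i * i :: i])
    return [p for p, ok in enumerate(sieve) if ok]

def power_smooth(b1):
    """Product of the largest power of each prime that is <= b1,
    so every b1-powersmooth number divides the result."""
    e = 1
    for p in small_primes(b1):
        q = p
        while q * p <= b1:   # largest power of p not exceeding b1
            q *= p
        e *= q
    return e

p = 47
mp = (1 << p) - 1                   # M47 = 2351 * 4513 * 13264529 (known)
b1 = 50
x = pow(3, power_smooth(b1), mp)    # step b, via modular exponentiation
g = gcd(x - 1, mp)                  # step c
# 2351-1 = 2*5^2*47 and 4513-1 = 2^5*3*47 are both 50-powersmooth,
# so the factors 2351 and 4513 must divide g; the third factor
# 13264529 should not (13264528 has the large prime factor 569)
```

In practice the exponent powerSmooth(B1) is enormous, which is exactly why its bit count drives the cost, as discussed next.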

The costly operation is computing 3^powerSmooth(B1). In the classical setup, its cost is log2(powerSmooth(B1)) "squarings" (i.e. the number of bits in powerSmooth(B1)).

log2(powerSmooth(B1)) ~= 1.442 * B1
(BTW, 1.442 ~= 1/ln(2))

For example, doing first-stage with B1=1M costs about 1.4M "PRP iterations".
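The ~1.442 * B1 estimate can be checked numerically with a standalone sketch (not GpuOwl code; power_smooth as defined above, each prime taken to its largest power not exceeding B1):

```python
def is_prime(n):
    return n >= 2 and all(n % d for d in range(2, int(n ** 0.5) + 1))

def power_smooth(b1):
    # product of the largest power of each prime <= b1
    e = 1
    for p in range(2, b1 + 1):
        if is_prime(p):
            q = p
            while q * p <= b1:
                q *= p
            e *= q
    return e

b1 = 10_000
ratio = power_smooth(b1).bit_length() / b1
# ratio comes out close to 1/ln(2) ~= 1.4427, and converges as B1 grows
```

The convergence is a consequence of the prime number theorem: the log of powerSmooth(B1) is essentially the Chebyshev function psi(B1) ~ B1.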

Merged with PRP

During a normal PRP test we compute 3^(2^k), iterating over k. We can reuse these values (the normal PRP iterations) for a cheaper P-1 first stage -- Robert Gerbicz showed a nice way to aggregate them into 3^powerSmooth(B1). The new algorithm uses a number N of "cache" buffers (thus, a lot of memory). The cost of P-1 first-stage (in iterations) *on top of the PRP* becomes:

B1 * 1.442 / (log2(N) + 2)

where N is the number of buffers. This is a reduction in cost, compared to the "classic" P-1 FS, by a factor of log2(N)+2. For example, with about 256 buffers, that's a 10x reduction in cost.

The catch is that this tiny cost is "on top of" the PRP cost.

Let's make clear what that means:
if somebody wants to run the PRP anyway, then the additional cost of P-1 FS is really tiny and that's great. If somebody does not care for the PRP and wants to do *only* the P-1 FS, then there is no cost reduction at all compared to the "classic" P-1.

classicCost = B1 * 1.442
"number of buffers" N
"reduction factor" F = 1 / (log2(N) + 2)
cost excluding the PRP iterations = classicCost * F
cost including the PRP iterations = classicCost * (1 + F)

When F=1/10 as in the example (N=256), the cost "on top" of PRP is only 0.1 * classic, while the cost including the PRP component is 1.1 * classic (a slight *increase*).
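In code form, the cost model above is trivial (p1_costs is just an illustrative name, not a GpuOwl function):

```python
from math import log2

def p1_costs(b1, n_buffers):
    """Cost model from the post, in 'PRP iterations':
    returns (classic, extra on top of PRP, total including PRP)."""
    classic = b1 * 1.442            # classic P-1 first-stage cost
    f = 1 / (log2(n_buffers) + 2)   # reduction factor F
    return classic, classic * f, classic * (1 + f)

classic, extra, total = p1_costs(1_000_000, 256)
# N=256 -> F=1/10: extra = 0.1 * classic, total = 1.1 * classic
```

So with 256 buffers, a B1=1M first stage costs ~144K iterations on top of the PRP, versus ~1.44M iterations standalone.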

Aside from these cost considerations, one small advantage of the new algorithm is that it is partially protected by the GEC error check ("partially" meaning that only the PRP iterations themselves are protected -- but this is still an improvement over the "classic" P-1 FS, which is not protected at all).

 preda 2020-09-26 05:14

Merged PRP and P-1 FS

Another way to look at running P-1 *only* (without PRP) in the new setup:

Let's say we run to a high bound B1=5M; then
log2(powerSmooth(B1)) ≈ 1.442 * 5M ≈ 7.2M.

So running that much P-1 FS also involves running the first ~7M iterations of the normal PRP test. If, after completing the P-1 FS, the tester decides *not* to continue the PRP, the already-done 7M PRP iterations are lost.

That's why, in the new setup, running "standalone" P-1 is inefficient, and that's why I'm not planning to offer it -- even though I imagine some users will be disappointed that the beloved "standalone P-1" is gone. But hopefully higher bounds and more factors will come out of it :)

 retina 2020-09-26 06:38

[QUOTE=preda;557928]- removed LL[/QUOTE]Won't we need this to verify primes?

 preda 2020-09-26 09:40

[QUOTE=retina;557932]Won't we need this to verify primes?[/QUOTE]

It's enough to check out the "v6" branch [url]https://github.com/preda/gpuowl/tree/v6[/url] ,
build (scons or make),
and we're ready to go for an LL prime verification.

PS: what's missing is the prime to verify...

 firejuggler 2020-09-26 12:06

Does this mean that, say, one could P-1 a 17M exponent to a B1 of about 11M in the same time it takes to run the PRP? What about the buffer size?

 preda 2020-09-26 12:29

[QUOTE=firejuggler;557949]Does this mean that, say, one could P-1 a 17M exponent to a B1 of about 11M in the same time it takes to run the PRP? What about the buffer size?[/QUOTE]

I would not P-1 a 17M exponent at this point, because the wavefront (i.e. exponents that didn't have either P-1 or PRP done already) is at 100M.

For a 100M exponent the workflow is:
- choose a rather large B1. Let's say B1=8M.
- choose a B2. Let's say B2=50M. Note that the ratio B2/B1 is now lower, under 10, because B1 is so large.
- start the PRP.

For the first 8M * 1.442 ≈ 11.5M iterations (call it 12M), the PRP runs in parallel with the P-1 FS. For this it uses as much memory as allowed by -maxAlloc; for example, 256 buffers of 22MB each need about 5.6GB. (The number of buffers need not be a power of two; any number will do.)

The overhead of the P-1 FS is about 1/10 of the PRP iterations it runs in parallel with. So, over 12M iterations, the cost of P-1 FS would be about 1.2M iterations.

Once the PRP reaches the 12M mark, it runs the first-stage GCD, pauses the PRP, and starts the P-1 second stage. The second stage also runs one or more GCDs; if no factor is found in any of them, the PRP resumes from iteration 12M onward.

Let's consider a GPU with only 4GB of RAM, of which say 3GB are allocated for P-1 FS buffers. At 22MB/buffer that accommodates about 128 buffers, so the overhead ratio in this case becomes 1/9 (i.e. 1/(log2(128) + 2)) instead of 1/10, because of the lower number of buffers.
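To make the buffer-count arithmetic concrete, here is a small sketch (p1_overhead is a hypothetical helper, assuming the ~22MB/buffer figure quoted for a 100M exponent; exact integer division gives slightly more buffers than the rounded figures in the post):

```python
from math import log2

def p1_overhead(max_alloc_mb, buf_mb=22):
    """Buffers that fit in the allowed memory, and the resulting
    P-1 first-stage overhead factor 1/(log2(N) + 2)."""
    n = max_alloc_mb // buf_mb
    return n, 1 / (log2(n) + 2)

n, f = p1_overhead(3 * 1024)   # 3GB allowed for buffers
# -> 139 buffers (close to the ~128 above), overhead just under 1/9
```

Because the overhead only shrinks logarithmically in N, doubling the memory past a few GB buys diminishing returns.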

 firejuggler 2020-09-26 16:11

My intent was to run a P-1 with a B1 as close to the exponent tested as possible, and I wondered if the buffer size for the first stage was somehow manageable.
Let's say I want to do a P-1 of a 100M exponent with a B1 of 50M. What would be the size of each buffer?

From what I understand, the PRP will go up to ~80M iterations, finish the first stage of P-1, then do the second stage of P-1, then do the last ~20M iterations of the PRP (or stop if a factor is found).

 firejuggler 2020-09-26 19:41

I mean, since stage 1 is nearly free with enough memory. The problem is how the memory needs scale.

 preda 2020-09-26 20:41

[QUOTE=firejuggler;557961]My intent was to run a P-1 with a B1 as close to the exponent tested as possible, and I wondered if the buffer size for the first stage was somehow manageable.
Let's say I want to do a P-1 of a 100M exponent with a B1 of 50M. What would be the size of each buffer?

From what I understand, the PRP will go up to ~80M iterations, finish the first stage of P-1, then do the second stage of P-1, then do the last ~20M iterations of the PRP (or stop if a factor is found).[/QUOTE]

Stage 1, even if cheaper, is not free. If you run it over the full PRP length (which, for a 100M exponent, would mean running to B1=70M), you still pay 10% of the full PRP cost just for the P-1, which is not nothing. What's more, the PRP savings when a factor is found in P-1 become smaller as you advance towards the end of the PRP.

The memory requirements do not grow with the length of the P-1. They do grow, linearly, with the size of the exponent (because the size of a "buffer" is relative to the exponent).

P-1 second-stage is still more efficient (faster) than first-stage. Thus it's still worth doing second-stage at the end. If you want to cover the range up to 100M, I'd split it in B1,B2 (12M,100M) or (15M,100M). If you want to do B1=100M, you should follow up with B2=400M or such. Use the P-1 probability calculator to check whether such very-high bounds are worth the effort.

 Aramis Wyler 2020-09-28 00:30

Thank you for your work on this new version, Preda - version 6 is running great and I don't think anyone would blame you for letting it rest a bit.

The concept of combining P-1 FS with the PRP is very interesting to me, but your response to Firejuggler makes me think that maybe I understand the costs of FS vs SS P-1 even less than I had realized.

I understand that you said P-1 SS is still more efficient/faster than P-1 FS, but is it more efficient than this new cost, with F=1/10, or even less with, say, the 16GB video cards on Colab? With F=1/12 (or whatever that would be with ~15GiB worth of buffers), would it make sense to run what we'd otherwise consider rather bizarre bounds, like B1=20000000 and B2=40000000?
Maybe not even bother with the second stage?
