20200926, 04:10  #1 
"Mihai Preda"
Apr 2015
2546_{8} Posts 
GpuOwl 7.x
A heads up about upcoming changes, each will be discussed in more detail. (GpuOwl 7.x is not yet available)
 removed LL  removed standalone P1  merged PRP and P1  reworked P1 secondstage  changed savefile version and naming scheme  changed savefile cleanup Planned:  rework the proof generation  migrate to proof v2  rework PRP savefiles towards merging with the proof checkpoints  some SP experiments A git branch will be created for the current 6.x for people who feel strongly about the "removed" elements. Last fiddled with by preda on 20200926 at 04:42 
20200926, 04:41  #2 
"Mihai Preda"
Apr 2015
2×691 Posts 
Merged PRP and P1 firststage
I'll explain briefly why this is a good thing compared to the "classic" P1 firststage.
P1 first stage is conceptually simple: a) given the bound B1, compute powerSmooth(B1) which is roughly the product of all primes under B1. The small primes are taken to a higher power (than 1). b) compute X = 3^powerSmooth(B1) c) compute the GCD(X  1, Mp) hoping to come up with a factor The costly operation is computing 3^powerSmooth(B1). In the classical setup, its cost is log2(powerSmooth(B1)) "squarings". (i.e. the number of bits in powerSmooth(B1)). log2(powerSmooth(B1)) ~= 1.442 * B1 (BTW, 1.442 ~= 1/ln(2)) For example, doing firststage with B1=1M costs about 1.4M "PRP iterations". 2. Merged with PRP During normal PRP we compute 3^(2^k) iterating over k. We can use these values (the normal PRP iterations) for cheaper P1 first stage  Robert Gerbicz showed a nice way to aggregate them for computing the 3^powerSmooth(B1). The new algorithm uses a number N of "cache" buffers (thus, a lot of memory). The cost of P1 firststage (in iterations) *on top of the PRP* becomes: B1 * 1.442 / (log2(N) + 2) Where N is the number of buffers. This represents a reduction of the cost compared to the "classic" P1 FS of a factor of log2(N)+2. For example, with about 256 buffers, that's a 10x reduction in cost. The catch is that this tiny cost is "on top of" the PRP cost. Let's make clear what that means: if somebody wants to run the PRP anyway, then the additional cost of P1 FS is really tiny and that's great. If somebody does not care for the PRP and wants to do *only* the P1 FS, then there is no cost reduction at all compared to the "classic" P1. classicCost = B1 * 1.442 "number of buffers" N "reduction factor" F = 1 / (log2(N) + 2) cost excluding the PRP iterations = classicCost * F cost including the PRP iterations = classicCost * (1 + F) When F=1/10 as in the example (N=256), the cost "on top" of PRP is only 0.1 * classic, while the cost including the PRP component is 1.1 * classic (a slight *increase*). Aside from these cost consideration, one little advantage of the new algorithm is that it is partially protected with the GEC error check. ("partially" means that only the PRP iterations themselves are protected, but this is still an improvement compared to the "classic" P1 FS which is not protected at all). Last fiddled with by preda on 20200926 at 04:47 
20200926, 05:14  #3 
"Mihai Preda"
Apr 2015
2546_{8} Posts 
Merged PRP and P1 FS
Another way to look at running P1 *only* (without PRP) in the new setup:
Let's say running to a high bound B1=5M, log2(powerSmooth(B1))=7M, So runnnig that much P1 FS involves also running the first 7M iterations of the normal PRP test. If after completing the P1 FS the tester desires to *not* continue the PRP, that would mean that the alreadydone 7M PRP iterations are lost. That's why in the new setup running "standalone" P1 is inefficient and that's why I'm not planning to offer it; even though I imagine some users will be dissapointed that the beloved "standalone P1" is gone. But hopefully higher bounds and more factors will come out of it :) 
20200926, 06:38  #4 
Undefined
"The unspeakable one"
Jun 2006
My evil lair
14636_{8} Posts 

20200926, 09:40  #5 
"Mihai Preda"
Apr 2015
2·691 Posts 
It's enough to check out the "v6" branch https://github.com/preda/gpuowl/tree/v6 ,
build (scons or make), and we're ready to go for a LL prime verification. PS: what's missing is the prime to verify.. Last fiddled with by preda on 20200926 at 09:41 
20200926, 12:06  #6 
"Vincent"
Apr 2010
Over the rainbow
101100011101_{2} Posts 
Does this mean that say, one could P1 a 17M exponent to a B1 of about 11M at the same time it take to run the prp? what about the buffer size?

20200926, 12:29  #7  
"Mihai Preda"
Apr 2015
2·691 Posts 
Quote:
For a 100M exponent the workflow is:  choose a rather large B1. Let's say B1=8M.  choose a B2. Let's say B2=50M. Note that the ratio B2/B1 is now lower, under 10, because B1 is so large.  start the PRP. For the first 8M*1.44 iterations, let's say 12M iterations, the PRP runs in parallel with P1 FS. For this it uses as much memory as allowed by maxAlloc. For example, for 256 buffers of 22MB each, about 6GB is needed. (this number of buffers does not need to jump in powers of two, any number will do). The overhead of the P1 FS is about 1/10 of the PRP iterations it runs in parallel with. So, over 12M iterations, the cost of P1 FS would be about 1.2M iterations. Once the PRP reaches the 12M mark, it runs the firststage GCD. It also pauses the PRP and starts the P1 secondstage. The secondstage also runs one or more GCDs, and if no factor is found in any of those GCDs, the PRP continues from iteration 12M up. Let's consider a GPU with only 4GB of RAM, let's say 3GB are allocated for P1 FS buffers, at 22MB/buf that accomodates about 128 buffers, so the overhead ratio in this case becomes 1/9 (i.e. 1/(2 +log2(128))) instead of 1/10 because of the lower number of buffers. Last fiddled with by preda on 20200926 at 12:30 

20200926, 16:11  #8 
"Vincent"
Apr 2010
Over the rainbow
101100011101_{2} Posts 
My intent was to run a P1 with a B1 as close as the exponent tested., and wondered if the buffer size for the first phase was somehow manageable.
Lets say I want to do a PM1 of a 100M exponent with a B1 of 50M. What would be the size of each buffer? From what I understand, the prp will go upto ~80 M iteration, finish the first phase of PM1, then do the second part of PM1 then do the last 20M iteration for the prp. (or stop if a factor is found). Last fiddled with by firejuggler on 20200926 at 16:35 
20200926, 19:41  #9 
"Vincent"
Apr 2010
Over the rainbow
5×569 Posts 
I mean, since the stage1 is free with enough memory. The problem is how the memory need scale.

20200926, 20:41  #10  
"Mihai Preda"
Apr 2015
2×691 Posts 
Quote:
The memory requirements do not grow with the length of the P1. They do grow, linearly, with the size of the exponent (because the size of a "buffer" is relative to the exponent). P1 secondstage is still more efficient (faster) than firststage. Thus it's still worth doing secondstage at the end. If you want to cover the range up to 100M, I'd split it in B1,B2 (12M,100M) or (15M,100M). If you want to do B1=100M, you should follow up with B2=400M or such. Use the P1 probability calculator to check whether such veryhigh bounds are worth the effort. 

20200928, 00:30  #11 
"Bill Staffen"
Jan 2013
Pittsburgh, PA, USA
2^{3}·53 Posts 
Thank you for your work on this new version, Preda  version 6 is running great and I don't think anyone would blame you for letting it rest a bit.
The concept of combining P1 FS with the PRP is very interesting to me, but your response to Firejuggler makes me think that maybe I understand the costs of FS vs SS P1 even less than I had realized. I understand that you said that P1 SS is still more efficient/faster than PP1 FS, but is it more efficient than than this new cost, with F=1/10 or even less with say, the 16GB video cards on colab? With an F=1/12 (or whatever that would be with ~15GiB worth of buffers) Would it make sense to run what we'd otherwise consider rather bizarre bounds, like 20000000 B1 and 40000000 B2? Maybe not even bother with second stage? Last fiddled with by Aramis Wyler on 20200928 at 00:30 
Thread Tools  
Similar Threads  
Thread  Thread Starter  Forum  Replies  Last Post 
gpuowl: runtime error  SELROC  GpuOwl  69  20210929 10:07 
GpuOwl PRPProof changes  preda  GpuOwl  20  20201017 06:51 
gpuOWL for Wagstaff  GP2  GpuOwl  22  20200613 16:57 
gpuowl tuning  M344587487  GpuOwl  14  20181229 08:11 
How to interface gpuOwl with PrimeNet  preda  PrimeNet  2  20171007 21:32 