GpuOwl 7.x
A heads up about upcoming changes, each will be discussed in more detail. (GpuOwl 7.x is not yet available)
 removed LL  removed standalone P1  merged PRP and P1  reworked P1 secondstage  changed savefile version and naming scheme  changed savefile cleanup Planned:  rework the proof generation  migrate to proof v2  rework PRP savefiles towards merging with the proof checkpoints  some SP experiments A git branch will be created for the current 6.x for people who feel strongly about the "removed" elements. 
Merged PRP and P1 firststage
I'll explain briefly why this is a good thing compared to the "classic" P1 firststage.
P1 first stage is conceptually simple: a) given the bound B1, compute powerSmooth(B1) which is roughly the product of all primes under B1. The small primes are taken to a higher power (than 1). b) compute X = 3^powerSmooth(B1) c) compute the GCD(X  1, Mp) hoping to come up with a factor The costly operation is computing 3^powerSmooth(B1). In the classical setup, its cost is log2(powerSmooth(B1)) "squarings". (i.e. the number of bits in powerSmooth(B1)). log2(powerSmooth(B1)) ~= 1.442 * B1 (BTW, 1.442 ~= 1/ln(2)) For example, doing firststage with B1=1M costs about 1.4M "PRP iterations". 2. Merged with PRP During normal PRP we compute 3^(2^k) iterating over k. We can use these values (the normal PRP iterations) for cheaper P1 first stage  Robert Gerbicz showed a nice way to aggregate them for computing the 3^powerSmooth(B1). The new algorithm uses a number N of "cache" buffers (thus, a lot of memory). The cost of P1 firststage (in iterations) *on top of the PRP* becomes: B1 * 1.442 / (log2(N) + 2) Where N is the number of buffers. This represents a reduction of the cost compared to the "classic" P1 FS of a factor of log2(N)+2. For example, with about 256 buffers, that's a 10x reduction in cost. The catch is that this tiny cost is "on top of" the PRP cost. Let's make clear what that means: if somebody wants to run the PRP anyway, then the additional cost of P1 FS is really tiny and that's great. If somebody does not care for the PRP and wants to do *only* the P1 FS, then there is no cost reduction at all compared to the "classic" P1. classicCost = B1 * 1.442 "number of buffers" N "reduction factor" F = 1 / (log2(N) + 2) cost excluding the PRP iterations = classicCost * F cost including the PRP iterations = classicCost * (1 + F) When F=1/10 as in the example (N=256), the cost "on top" of PRP is only 0.1 * classic, while the cost including the PRP component is 1.1 * classic (a slight *increase*). Aside from these cost consideration, one little advantage of the new algorithm is that it is partially protected with the GEC error check. ("partially" means that only the PRP iterations themselves are protected, but this is still an improvement compared to the "classic" P1 FS which is not protected at all). 
Merged PRP and P1 FS
Another way to look at running P1 *only* (without PRP) in the new setup:
Let's say running to a high bound B1=5M, log2(powerSmooth(B1))=7M, So runnnig that much P1 FS involves also running the first 7M iterations of the normal PRP test. If after completing the P1 FS the tester desires to *not* continue the PRP, that would mean that the alreadydone 7M PRP iterations are lost. That's why in the new setup running "standalone" P1 is inefficient and that's why I'm not planning to offer it; even though I imagine some users will be dissapointed that the beloved "standalone P1" is gone. But hopefully higher bounds and more factors will come out of it :) 
[QUOTE=preda;557928] removed LL[/QUOTE]Won't we need this to verify primes?

[QUOTE=retina;557932]Won't we need this to verify primes?[/QUOTE]
It's enough to check out the "v6" branch [url]https://github.com/preda/gpuowl/tree/v6[/url] , build (scons or make), and we're ready to go for a LL prime verification. PS: what's missing is the prime to verify.. 
Does this mean that say, one could P1 a 17M exponent to a B1 of about 11M at the same time it take to run the prp? what about the buffer size?

[QUOTE=firejuggler;557949]Does this mean that say, one could P1 a 17M exponent to a B1 of about 11M at the same time it take to run the prp? what about the buffer size?[/QUOTE]
I would not P1 a 17M exponent at this point, because the wavefront (i.e. exponents that didn't have either P1 or PRP done already) is at 100M. For a 100M exponent the workflow is:  choose a rather large B1. Let's say B1=8M.  choose a B2. Let's say B2=50M. Note that the ratio B2/B1 is now lower, under 10, because B1 is so large.  start the PRP. For the first 8M*1.44 iterations, let's say 12M iterations, the PRP runs in parallel with P1 FS. For this it uses as much memory as allowed by maxAlloc. For example, for 256 buffers of 22MB each, about 6GB is needed. (this number of buffers does not need to jump in powers of two, any number will do). The overhead of the P1 FS is about 1/10 of the PRP iterations it runs in parallel with. So, over 12M iterations, the cost of P1 FS would be about 1.2M iterations. Once the PRP reaches the 12M mark, it runs the firststage GCD. It also pauses the PRP and starts the P1 secondstage. The secondstage also runs one or more GCDs, and if no factor is found in any of those GCDs, the PRP continues from iteration 12M up. Let's consider a GPU with only 4GB of RAM, let's say 3GB are allocated for P1 FS buffers, at 22MB/buf that accomodates about 128 buffers, so the overhead ratio in this case becomes 1/9 (i.e. 1/(2 +log2(128))) instead of 1/10 because of the lower number of buffers. 
My intent was to run a P1 with a B1 as close as the exponent tested., and wondered if the buffer size for the first phase was somehow manageable.
Lets say I want to do a PM1 of a 100M exponent with a B1 of 50M. What would be the size of each buffer? From what I understand, the prp will go upto ~80 M iteration, finish the first phase of PM1, then do the second part of PM1 then do the last 20M iteration for the prp. (or stop if a factor is found). 
I mean, since the stage1 is free with enough memory. The problem is how the memory need scale.

[QUOTE=firejuggler;557961]My intent was to run a P1 with a B1 as close as the exponent tested., and wondered if the buffer size for the first phase was somehow manageable.
Lets say I want to do a PM1 of a 100M exponent with a B1 of 50M. What would be the size of each buffer? From what I understand, the prp will go upto ~80 M iteration, finish the first phase of PM1, then do the second part of PM1 then do the last 20M iteration for the prp. (or stop if a factor is found).[/QUOTE] Stage1, even if cheaper, is not free. If you run it over the full PRP length (which would mean, for a 100M exponent, you'd run to B1=70M), you still pay 10% of the full PRP just for the P1 which is not nothing. What's more, the PRP saving when a factor in found in P1 become smaller as you advance towards the end of the PRP. The memory requirements do not grow with the length of the P1. They do grow, linearly, with the size of the exponent (because the size of a "buffer" is relative to the exponent). P1 secondstage is still more efficient (faster) than firststage. Thus it's still worth doing secondstage at the end. If you want to cover the range up to 100M, I'd split it in B1,B2 (12M,100M) or (15M,100M). If you want to do B1=100M, you should follow up with B2=400M or such. Use the P1 probability calculator to check whether such veryhigh bounds are worth the effort. 
Thank you for your work on this new version, Preda  version 6 is running great and I don't think anyone would blame you for letting it rest a bit.
The concept of combining P1 FS with the PRP is very interesting to me, but your response to Firejuggler makes me think that maybe I understand the costs of FS vs SS P1 even less than I had realized. I understand that you said that P1 SS is still more efficient/faster than PP1 FS, but is it more efficient than than this new cost, with F=1/10 or even less with say, the 16GB video cards on colab? With an F=1/12 (or whatever that would be with ~15GiB worth of buffers) Would it make sense to run what we'd otherwise consider rather bizarre bounds, like 20000000 B1 and 40000000 B2? Maybe not even bother with second stage? 
All times are UTC. The time now is 21:38. 
Powered by vBulletin® Version 3.8.11
Copyright ©2000  2021, Jelsoft Enterprises Ltd.