mersenneforum.org > GPU Computing > GpuOwl
2020-09-28, 01:45   #12
Prime95

P-1 bounds are a balancing act. As first stage processing of B1 becomes more efficient, the optimal B2/B1 ratio decreases. If B1 processing were to drop to zero cost, no stage 2 would make sense. Even with a 10x improvement in stage 1 speed, it still makes sense to run some stage 2. We're likely looking at a change from a B2/B1 ratio in the 20-40 range for gpuowl v6 to a B2/B1 ratio in the 6-12 range for gpuowl v7.

(I ran some estimates with prime95's costing formulas to come up with the above.) A GPU will have different costs. Prime95 v30.4 will be different too, with its improved stage 2 timings.

I'm not rushing into implementing this in prime95, as prime95's default install allows (I think) a mere 250MB max memory, which limits the stage 1 savings.
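For a rough sense of how the memory cap bites, here is a back-of-the-envelope sketch (Python; this is not prime95's costing code -- the packed-residue size of exponent/8 bytes and the "one extra mul every log2(N)+2 iterations" overhead model, described later in this thread for gpuowl, are my assumptions):

Code:
# Back-of-the-envelope only: how many stage 1 residue buffers fit in a RAM cap,
# and the implied extra multiplications per PRP iteration.
# Assumptions: ~exponent/8 bytes per packed residue; one extra mul every
# log2(N)+2 iterations when N buffers are used (the model discussed below).
import math

def stage1_overhead(exponent, ram_bytes):
    residue_bytes = exponent // 8                  # packed residue size (assumption)
    n_buffers = max(1, ram_bytes // residue_bytes)
    return n_buffers, 1.0 / (math.log2(n_buffers) + 2)

for ram_mb in (250, 2_000, 30_000):
    n, extra = stage1_overhead(100_000_000, ram_mb * 1_000_000)
    print(f"{ram_mb:>6} MB: {n:>5} buffers, ~{extra:.3f} extra muls per iteration")

With only ~250MB and a wavefront exponent, a couple dozen buffers fit, so the hidden stage 1 is noticeably more expensive per iteration than it would be with tens of GB available.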

2020-09-28, 02:44   #13
Aramis Wyler


I'd happily run mprime with 30GB of memory if that made a serious difference. Heck, I could go get another 32GB kit and run it against 60GB for not much money compared to a video card. If mprime were running larger than the L3 cache, though, I am not sure it would help at all.
2020-09-28, 03:39   #14
kriesel

Quote:
Originally Posted by Prime95
prime95's default install allows (I think) a mere 250MB max memory, which limits the stage 1 savings.
Thanks for that explanation.
FYI, no P-1 I run has access to less than 1.5GB per prime95 worker, and that's on old 32-bit laptops. Core2 Duos have 4GB; other models have 6, 8, 16, 32, and 48GB respectively. (GPUs range over 1, 2, 3, 4, 8, 11, and 16GB.)
2020-09-28, 04:49   #15
Prime95


Quote:
Originally Posted by Aramis Wyler
I'd happily run mprime with 30GB of memory if that made a serious difference. Heck, I could go get another 32GB kit and run it against 60GB for not much money compared to a video card. If mprime were running larger than the L3 cache, though, I am not sure it would help at all.
The "problem" is prime95 is designed with the average user in mind who wants it running in background without impacting his/her daily activities. What that means is that the default configuration cannot use a large amount of RAM and most casual users do not change the default settings. Thus, implementing this feature will only benefit a small handful of users. I'll probably implement it, but its priority is not particularly high.
2020-09-28, 05:44   #16
axn


Quote:
Originally Posted by Prime95
P-1 bounds are a balancing act.
Do we have some hard numbers comparing these two schemes (i.e. traditional P-1-before-PRP versus P-1-during-PRP), and the expected savings and optimal bounds for a current wavefront PRP? Especially with the assumption of saving one PRP test?

I don't have a good feel for the situation and am not even sure that the new option is, in fact, superior, so I'm just wondering.
2020-09-28, 10:06   #17
R. Gerbicz


If you want a simple formula, counting everything in squaremods:

total cost = Time(p,B1,B2) + (1 - Prob(p,B1,B2)) * (p - B1)

and you should minimize this, where Time(p,B1,B2) is the cost of this new algorithm: doing B1 squarings while computing the P-1 first stage in parallel using an array, and then the classical second stage with the B2 bound [note that you could do the second stage later, but that is suboptimal, because by delaying it you waste time if you find a factor in the second stage]. Prob(p,B1,B2) is the probability that you'll find a factor using the B1,B2 bounds. Of course Time(p,B1,B2) >= B1, and the above assumes there were no detected-and-corrected errors in the run (but that doesn't change the optimal cutoffs much).
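A minimal sketch of that bookkeeping (Python; the ~0.85 muls per stage 2 prime is taken from later in this thread, the parallel-array overhead of stage 1 is ignored, and the probability value is a made-up example -- in practice you would take Prob from a P-1 probability calculator):

Code:
# Sketch of the cost model above, counted in squaremods/mulmods.
import math

def primepi_approx(x):
    return x / (math.log(x) - 1)

def time_cost(p, b1, b2):
    # B1 squarings (shared with the PRP) plus a rough classical stage 2 cost
    # of ~0.85 muls per stage 2 prime; the stage 1 array overhead is ignored.
    n_stage2_primes = primepi_approx(b2) - primepi_approx(b1)
    return b1 + 0.85 * n_stage2_primes

def total_cost(p, b1, b2, prob):
    # prob = chance of finding a factor with bounds (B1, B2); example value only.
    return time_cost(p, b1, b2) + (1 - prob) * (p - b1)

print(f"{total_cost(100_000_000, 8_000_000, 40_000_000, prob=0.05):,.0f}")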

Interestingly, if you have enough memory, then in the optimal setup you could hide the P-1 first stage using O(primepi(B1)) additional squarings, so basically only O(1) mulmods per prime, matching the classical second stage's running time. Hence we are not far from the point where the second stage would just disappear/collapse entirely.
2020-09-28, 14:48   #18
kriesel


Quote:
Originally Posted by Prime95
The "problem" is prime95 is designed with the average user in mind ... Thus, implementing this feature will only benefit a small handful of users. I'll probably implement it, but its priority is not particularly high.
A small fraction of users may be so, but perhaps not as small a fraction of systems/cores/workers. Colab and other cloud computing, for example, typically have ample available RAM, as do top producers that automatically run the occasional P-1 still needed for their assigned primality tests. (Your call, as always.)

2020-09-28, 21:53   #19
preda


Quote:
Originally Posted by axn
Do we have some hard numbers comparing these two schemes (i.e. traditional P-1-before-PRP versus P-1-during-PRP), and the expected savings and optimal bounds for a current wavefront PRP? Especially with the assumption of saving one PRP test?

I don't have a good feel for the situation and am not even sure that the new option is, in fact, superior, so I'm just wondering.
In P-1 FS (First Stage), for a given B1, the number of bits in powerSmooth(B1) is ~1.442*B1 (the constant being 1/ln 2).

The "old" P-1, in the most efficient implementation, was doing a number of squarings equal to that number of bits.

P-1 FS(B1) "old" cost = 1.442 * B1.

In the new setup, the PRP proceeds over this number of iterations (1.442 * B1) as normal, but we don't count these iterations towards the P-1 cost because they are useful for the PRP. In addition to that PRP cost, every log2(N)+2 iterations we do one more multiplication needed only for the "new" P-1 FS ("N" being the number of buffers used by P-1 FS). Let's simplify by assuming that log2(N)+2 == 10 (i.e. N == 256 buffers), so the cost becomes:
P-1 FS(B1) "new" cost = 1.442 * B1 / 10 == "old" cost / 10.

Second Stage:
The P-1 SS(B1, B2) proceeds like this: consider all the primes between B1 and B2; let's call "n" the number of primes to be covered by the second stage.

Some of these primes can be paired ("two with one rock") which reduces the effective number to a fraction of n, empirically about 0.85*n.

Afterwards, the second stage proceeds by doing one mul for each of these primes, plus two additional muls periodically, depending on the "D" value used by the second stage:
cost(P-1 SS) = n*0.85 + 2 * (B2 - B1) * (1/D).
Let's assume D==2310.
So we approximate
cost = n*0.85 + (B2 - B1)/1000.

The number of primes up to x ("primepi") can be approximated as:
primepi(x)~= x / (ln(x) - 1)

and "n" is: primepi(B2)-primepi(B1) ~= (B2 - B1)/(ln(B2) - 1)


Let's look now at the boundary between first-stage and second-stage. This boundary is given by B1. The prime gap (i.e. distance between two successive primes) around B1 is about ln(B1).

The FS "marginal cost" -- adding one more prime (around B1) to FS increases the FS cost by:
ln(B1) * 1.442 / 10.

The SS "marginal cost" at B1 (moving the lower bound B1 one prime down) increases SS cost by 0.85.

These two relations show that SS is more efficient even compared to the "fast" FS, which itself is 10x more efficient than the "old" FS. So no matter what, it's very much worth continuing to do the Second Stage; it's not going away!

For example:
old bounds: (1M, 30M)
new bounds: (8M, 40M)

(anyway double-check with a P-1 calculator to estimate the probabilities of success)
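A quick cost-only comparison of those two bound sets, using the helpers sketched above (probabilities of success are not included -- those still need a P-1 calculator):

Code:
# Rough cost comparison of the example bounds (costs only, no probabilities).
import math

def primepi_approx(x):
    return x / (math.log(x) - 1)

def fs_cost_new(b1):
    return 1.442 * b1 / 10

def ss_cost(b1, b2):
    n = primepi_approx(b2) - primepi_approx(b1)
    return 0.85 * n + (b2 - b1) / 1000

for b1, b2 in [(1_000_000, 30_000_000), (8_000_000, 40_000_000)]:
    fs, ss = fs_cost_new(b1), ss_cost(b1, b2)
    print(f"B1={b1:>10,} B2={b2:>10,}  FS ~{fs:>12,.0f}  SS ~{ss:>12,.0f}")

# Marginal cost per extra prime around B1 = 8M:
print(math.log(8_000_000) * 1.442 / 10)   # FS: ~2.3 muls
print(0.85)                               # SS: 0.85 muls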

PS: the incremental effort of FS is 1.4 * log2(B1), while incremental effort of SS (at B1) is 0.85 * prime-gap(B1). And at the B1 values we work with, the prime-gap is lower than the log2; that's why second-stage is more efficient.

2020-09-28, 22:39   #20
preda

Savefiles


The savefiles are now named like this:

86059247-000014500.prp
86059247-10000-000014500.p1
86059247-10000.p2

Which is
<exponent>-<iteration>.prp
<exp>-<B1>-<iteration>.p1
<exp>-<B1>.p2
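A tiny parsing helper for those names (my own sketch, not code from gpuowl; it only encodes the three patterns listed above):

Code:
# Parse the savefile names listed above (sketch only, not gpuowl code).
import re

PATTERNS = [
    (re.compile(r"^(\d+)-(\d+)\.prp$"),      ("exponent", "iteration")),
    (re.compile(r"^(\d+)-(\d+)-(\d+)\.p1$"), ("exponent", "B1", "iteration")),
    (re.compile(r"^(\d+)-(\d+)\.p2$"),       ("exponent", "B1")),
]

def parse_savefile(name):
    for regex, fields in PATTERNS:
        m = regex.match(name)
        if m:
            return dict(zip(fields, map(int, m.groups())))
    return None

print(parse_savefile("86059247-10000-000014500.p1"))
# {'exponent': 86059247, 'B1': 10000, 'iteration': 14500}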

A number of savefiles is kept -- 10 by default, but this can be set with -save <N>. This allows restarting from points far back in the past.

The overall format of the files is similar -- a one-line text header at the beginning, followed optionally by binary data.

The PRP and P1 savefiles store one residue, protected by a CRC that is in the header.
The P2 savefile is tiny and simply records up to what point of stage 2 a GCD was done (stage 2 can be continued from that point onward).
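Generically, the residue check works like this (a sketch only -- the actual gpuowl header fields and CRC variant are not specified here, so the "crc=" field name and CRC-32 below are assumptions):

Code:
# Generic "residue protected by a CRC in the header" check.
# Illustrative only: gpuowl's real header layout/CRC may differ.
import zlib

def residue_matches_header(header_line: str, residue: bytes) -> bool:
    # Assume the one-line header carries a whitespace-separated "crc=<decimal>" field.
    fields = dict(kv.split("=", 1) for kv in header_line.split() if "=" in kv)
    return int(fields["crc"]) == zlib.crc32(residue)

data = b"\x01\x02\x03"
print(residue_matches_header(f"prp exp=86059247 crc={zlib.crc32(data)}", data))  # True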

2020-09-28, 22:59   #21
preda

Cleanup


There is a new special folder "trashbin" (which is located in the "master directory" when using -pool, i.e. it is global across instances).

When an exponent is finished, the trashbin folder is emptied, and after that the finished exponent's savefiles (including proof residues) are moved to the trashbin. This cleanup is automatic. There will be a way to turn the cleanup action off.

This allows a small time window to "revive" an exponent from trashbin.
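A sketch of that order of operations (hypothetical paths and helper names, just to make the flow concrete -- this is not the gpuowl implementation):

Code:
# Hypothetical cleanup flow: empty the trashbin, then move the finished
# exponent's savefiles into it (so it can still be revived for a while).
import shutil
from pathlib import Path

def clean_up(master_dir: Path, exponent: int) -> None:
    trash = master_dir / "trashbin"
    trash.mkdir(exist_ok=True)

    # 1. Empty the trashbin (whatever the previously finished exponent left behind).
    for old in trash.iterdir():
        if old.is_dir():
            shutil.rmtree(old)
        else:
            old.unlink()

    # 2. Move the finished exponent's savefiles (including proof residues) to the
    #    trashbin; the "<exponent>-*" layout is assumed, per the names listed earlier.
    for f in master_dir.glob(f"{exponent}-*"):
        shutil.move(str(f), str(trash / f.name))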
2020-09-28, 23:10   #22
storm5510


Someone wrote a long time ago, "Many factors can be found below B1, but only one above." This was an ECM reference. Assuming this also applies to P-1, perhaps B2 bounds would not need to be nearly as high if B1 were stretched to cover part of the range B2 would normally handle.

There are many thousands of LL double-checks needing to be run. Is it possible to run these as PRPs?

I think I will keep my 6x until I see how this all shakes out. Thanks for all the info.